Re: Number Of Partitions in RDD
Cluster mode with HDFS, or local mode? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Number-Of-Partitions-in-RDD-tp28730p28737.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Number Of Partitions in RDD
What version of Spark are you using?
Re: How to set hive configs in Spark 2.1?
All you need to do is:

    spark.conf.set("spark.sql.shuffle.partitions", 2000)
    spark.conf.set("spark.sql.orc.filterPushdown", True)
    ...etc
Re: Spark SQL : Join operation failure
It might be a memory issue. Try adding .persist(StorageLevel.MEMORY_AND_DISK) so that if the RDD can't fit into memory, Spark will spill parts of it to disk:

    from pyspark import StorageLevel

    cm_go.registerTempTable("x")
    ko.registerTempTable("y")
    joined_df = sqlCtx.sql("select * from x FULL OUTER JOIN y ON field1=field2")
    joined_df.persist(StorageLevel.MEMORY_AND_DISK)
    joined_df.write.save("/user/data/output")
Re: New Amazon AMIs for EC2 script
You should look into AWS EMR instead, adding pip install steps to the launch process. They have a pretty nice script that sets up Jupyter and lets you choose what packages you want to install - https://aws.amazon.com/blogs/big-data/running-jupyter-notebook-and-jupyterhub-on-amazon-emr/
Re: JavaRDD text matadata(file name) findings
You can use https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaSparkContext.html#wholeTextFiles(java.lang.String) but note it returns an RDD of (filename, content) pairs, one record per file.
Re: Spark Read from Google store and save in AWS s3
Here is how you would read from Google Cloud Storage (note: you need to create a service account key):

    import os
    os.environ['PYSPARK_SUBMIT_ARGS'] = """--jars /home/neil/Downloads/gcs-connector-latest-hadoop2.jar pyspark-shell"""

    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SparkSession, SQLContext

    conf = SparkConf()\
        .setMaster("local[8]")\
        .setAppName("GS")
    sc = SparkContext(conf=conf)
    sc._jsc.hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    sc._jsc.hadoopConfiguration().set("fs.gs.project.id", "PUT UR GOOGLE PROJECT ID HERE")
    sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.email", "testa...@sparkgcs.iam.gserviceaccount.com")
    sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.enable", "true")
    sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.keyfile", "sparkgcs-96bd21691c29.p12")

    spark = SparkSession.builder\
        .config(conf=sc.getConf())\
        .getOrCreate()

    dfTermRaw = spark.read.format("csv")\
        .option("header", "true")\
        .option("delimiter", "\t")\
        .option("inferSchema", "true")\
        .load("gs://bucket_test/sample.tsv")
Re: Setting Spark Properties on Dataframes
This blog post (not mine) has some nice examples - https://hadoopist.wordpress.com/2016/08/19/how-to-create-compressed-output-files-in-spark-2-0/

From the blog:

    df.write.mode("overwrite").format("parquet").option("compression", "none").save("/tmp/file_no_compression_parq")
    df.write.mode("overwrite").format("parquet").option("compression", "gzip").save("/tmp/file_with_gzip_parq")
    df.write.mode("overwrite").format("parquet").option("compression", "snappy").save("/tmp/file_with_snappy_parq")
    //lzo - requires a different method in terms of implementation.
    df.write.mode("overwrite").format("orc").option("compression", "none").save("/tmp/file_no_compression_orc")
    df.write.mode("overwrite").format("orc").option("compression", "snappy").save("/tmp/file_with_snappy_orc")
    df.write.mode("overwrite").format("orc").option("compression", "zlib").save("/tmp/file_with_zlib_orc")
Re: Setting Spark Properties on Dataframes
Can you be more specific on what you would want to change on the DF level?
Re: Spark Python in Jupyter Notebook
Assuming you don't have your environment variables set up in your .bash_profile, you would do it like this:

    import os
    import sys

    spark_home = '/usr/local/spark'
    sys.path.insert(0, spark_home + "/python")
    sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.10.1-src.zip'))
    # os.environ['PYSPARK_SUBMIT_ARGS'] = """--master spark://54.68.147.137:7077 pyspark-shell"""
    # ^ where you can pass the options you would pass when launching pyspark directly from the command line

    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SparkSession

    conf = SparkConf()\
        .setMaster("local[8]")\
        .setAppName("Test")
    sc = SparkContext(conf=conf)
    spark = SparkSession.builder\
        .config(conf=sc.getConf())\
        .enableHiveSupport()\
        .getOrCreate()

Mind you, this is for Spark 2.0 and above.
Re: spark on yarn can't load kafka dependency jar
Don't the jars need to be comma-separated when you pass them? And note the application jar is a separate, positional argument, i.e.:

    --jars "hdfs://zzz:8020/jars/kafka_2.10-0.8.2.2.jar" /opt/bigdevProject/sparkStreaming_jar4/sparkStreaming.jar
Re: Would spark dataframe/rdd read from external source on every action?
Yes, it would. Each action recomputes the full lineage, which for an external source means re-reading it, unless you cache or persist the result after the first read.
Re: Spark SQL join and subquery
What version of Spark are you using? I believe this was fixed in 2.0.
RE: CSV to parquet preserving partitioning
All you need to do is load all the files into one dataframe at once, then save the dataframe using partitionBy:

    df.write.format("parquet").partitionBy("directoryCol").save("hdfs://path")

Then if you look at the new folder it should look like how you want it, i.e.:

    hdfs://path/dir=dir1/part-r-xxx.gz.parquet
    hdfs://path/dir=dir2/part-r-yyy.gz.parquet
    hdfs://path/dir=dir3/part-r-zzz.gz.parquet
Re: CSV to parquet preserving partitioning
Is there anything in the files to let you know which directory they should be in?
Re: Finding a Spark Equivalent for Pandas' get_dummies
You can have a list of all the columns and pass it to a function that loops over them, fitting and applying the transformation for each one.
Re: Will Spark SQL completely replace Apache Impala or Apache Hive?
No. Spark SQL is part of Spark, which is a processing engine. Apache Hive is a data warehouse on top of Hadoop. Apache Impala is both a data warehouse (while utilizing the Hive metastore) and a processing engine.
Re: Spyder and SPARK combination problem...Please help!
Are you using Windows? Switching over to a Linux environment made that error go away for me.
Re: Write to Cassandra table from pyspark fails with scala reflect error
You need to use 2.0.0-M2-s_2.11 since Spark 2.0 is compiled with Scala 2.11 by default.
Re: Spark Java Heap Error
I'm assuming the dataset you're dealing with is big, hence why you wanted to allocate your full 16gb of RAM to it. I suggest launching the Python Spark shell as such: "pyspark --driver-memory 16g". Also, if your data doesn't fully fit in memory, you can do df.persist(StorageLevel.MEMORY_AND_DISK) (note it's persist, not cache, that takes a storage level).
Re: Spark Java Heap Error
Double-check your driver memory in your Spark Web UI; make sure the driver memory is close to half of the 16gb available.
Re: Spark Java Heap Error
If you're in local mode, just allocate all the memory you want to use to your driver (it acts as the executor in local mode); don't even bother changing the executor memory. So your new settings should look like this:

    spark.driver.memory 16g
    spark.driver.maxResultSize 2g
    spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value

You might need to change your spark.driver.maxResultSize setting if you plan on doing a collect on the entire rdd/dataframe.
Re: Is it possible to submit Spark Application remotely?
You need to pass --deploy-mode cluster to spark-submit; this will run the driver on the cluster rather than locally on your computer.
Re: Spark DF CacheTable method. Will it save data to disk?
From the Spark documentation (http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence), yes, you can use persist on a dataframe instead of cache. All cache is, is shorthand for the default persist storage level, "MEMORY_ONLY". If you want to persist the dataframe to disk you should do dataframe.persist(StorageLevel.DISK_ONLY). IMO, if reads are expensive against the DB and you're afraid of failure, why not just save the data as Parquet on your cluster in Hive and read from there?
Re: Writing all values for same key to one file
Why not just create a partition for the key you want to group by and save it in there? Appending to a file already written to HDFS isn't the best idea IMO.