Re: Spark SQL High GC time
Hi Yuming,

I ran into the same issue with larger worker nodes a few weeks ago. The way I got around the high GC time, as others suggested, was to break each worker node up into several smaller workers of around 10G each, dividing the cores accordingly. The other thing that helped was using the right kind of serialisation to keep the number of objects in memory low.

Hope that helps!

- nick

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-High-GC-time-tp23005p23030.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
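On the serialisation point: switching Spark to Kryo is one common way to keep per-object overhead, and hence GC pressure, down. A minimal sketch, assuming a standalone application; the app name and registered classes are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: enable Kryo serialisation so shuffled/cached objects serialise
// smaller, which lowers the number of live objects the JVM must track.
val conf = new SparkConf()
  .setAppName("kryo-example") // illustrative name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes up front avoids Kryo writing full class names.
  .registerKryoClasses(Array(classOf[Array[Byte]], classOf[String]))
val sc = new SparkContext(conf)
```

The same two settings can equally go in spark-defaults.conf rather than code.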
Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables
Could you be more specific about how this is done? The DataFrame class doesn't have that method.

On Sun, May 3, 2015 at 11:07 PM, ayan guha guha.a...@gmail.com wrote:

> You can use a custom partitioner to redistribute the data, using partitionBy.

On 4 May 2015 15:37, Nick Travers n.e.trav...@gmail.com wrote:

> I'm currently trying to join two large tables (on the order of 1B rows each) using Spark SQL (1.3.0) and am running into long GC pauses which bring the job to a halt.
>
> I'm reading in both tables using a HiveContext, with the underlying files stored as Parquet files. I'm setting up the join with something along the lines of:
>
>     HiveContext.sql("SELECT a.col1, b.col2 FROM a JOIN b ON a.col1 = b.col1")
>
> When I execute this (with an action such as .count) I see the first few stages complete, but the job eventually stalls. The GC counts keep increasing for each executor.
>
> I'm running with 6 workers, each with 2T of disk and 100GB of RAM.
>
> Has anyone else run into this issue? I suspect the problem is in the shuffling of the data, but I'm unsure how to get around it. Is there a way to redistribute the rows based on the join key first, and then do the join?
>
> Thanks in advance.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-and-billion-row-tables-tp22750.html
Long GC pauses with Spark SQL 1.3.0 and billion row tables
I'm currently trying to join two large tables (on the order of 1B rows each) using Spark SQL (1.3.0) and am running into long GC pauses which bring the job to a halt.

I'm reading in both tables using a HiveContext, with the underlying files stored as Parquet files. I'm setting up the join with something along the lines of:

    HiveContext.sql("SELECT a.col1, b.col2 FROM a JOIN b ON a.col1 = b.col1")

When I execute this (with an action such as .count) I see the first few stages complete, but the job eventually stalls. The GC counts keep increasing for each executor.

I'm running with 6 workers, each with 2T of disk and 100GB of RAM.

Has anyone else run into this issue? I suspect the problem is in the shuffling of the data, but I'm unsure how to get around it. Is there a way to redistribute the rows based on the join key first, and then do the join?

Thanks in advance.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-and-billion-row-tables-tp22750.html
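The redistribute-by-join-key idea asked about above can be sketched at the RDD level (the 1.3 DataFrame API has no column-based repartition, so this drops down to RDDs; the variable names, key column position, and partition count are illustrative):

```scala
import org.apache.spark.HashPartitioner

// Sketch: key both sides by the join column and hash-partition them with
// the SAME partitioner, so rows with equal keys land on the same node and
// the subsequent join is narrow (no further shuffle of either side).
val partitioner = new HashPartitioner(512) // tune to cluster size

val aByKey = dfA.rdd.map(row => (row.getString(0), row)) // key = col1
  .partitionBy(partitioner)
val bByKey = dfB.rdd.map(row => (row.getString(0), row))
  .partitionBy(partitioner)

val joined = aByKey.join(bByKey) // co-partitioned join
```

Persisting aByKey and bByKey after the partitionBy would let several joins reuse the one shuffle.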
Re: Spark, snappy and HDFS
Thanks all. I was able to get the decompression working by adding the following to my spark-env.sh script:

export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/nickt/lib/hadoop/lib/snappy-java-1.0.4.1.jar

On Thu, Apr 2, 2015 at 12:51 AM, Sean Owen so...@cloudera.com wrote:

> Yes, any Hadoop-related process that asks for Snappy compression or needs to read it will have to have the Snappy libs available on the library path. That's usually set up for you in a distro, or you can do it manually like this. This is not Spark-specific.
>
> The second question also isn't Spark-specific; you do not have a SequenceFile of byte[] / String, but of byte[] / byte[]. Review what you are writing, since it is not BytesWritable / Text.

On Thu, Apr 2, 2015 at 3:40 AM, Nick Travers n.e.trav...@gmail.com wrote:

> I'm actually running this in a separate environment to our HDFS cluster. I think I've been able to sort out the issue by copying /opt/cloudera/parcels/CDH/lib to the machine I'm running this on (I'm just using a one-worker setup at present) and adding the following to spark-env.sh:
>
>     export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
>     export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
>     export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
>     export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/nickt/lib/hadoop/lib/snappy-java-1.0.4.1.jar
>
> I can now get past the previous error. The issue now seems to be with what is being returned:
>
>     import org.apache.hadoop.io._
>     val hdfsPath = "hdfs://nost.name/path/to/folder"
>     val file = sc.sequenceFile[BytesWritable,String](hdfsPath)
>     file.count()
>
> returns the following error:
>
>     java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be cast to org.apache.hadoop.io.Text
>
> On Wed, Apr 1, 2015 at 7:34 PM, Xianjin YE advance...@gmail.com wrote:
>
>> Do you have the same hadoop config for all nodes in your cluster (you run it in a cluster, right?)? Check the node (usually the executor) which gives the java.lang.UnsatisfiedLinkError to see whether libsnappy.so is in the hadoop native lib path.
>
> On Thursday, April 2, 2015 at 10:22 AM, Nick Travers wrote:
>
>> Thanks for the super quick response! I can read the file just fine in hadoop; it's just when I point Spark at this file that it can't seem to read it, due to the missing snappy jars / .so's. I'm playing around with adding some things to the spark-env.sh file, but still nothing.
>
> On Wed, Apr 1, 2015 at 7:19 PM, Xianjin YE advance...@gmail.com wrote:
>
>> Can you read a snappy compressed file in hdfs? It looks like libsnappy.so is not in the hadoop native lib path.
>
> On Thursday, April 2, 2015 at 10:13 AM, Nick Travers wrote:
>
>> Has anyone else encountered the following error when trying to read a snappy compressed sequence file from HDFS?
>>
>>     java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
>>
>> The following works for me when the file is uncompressed:
>>
>>     import org.apache.hadoop.io._
>>     val hdfsPath = "hdfs://nost.name/path/to/folder"
>>     val file = sc.sequenceFile[BytesWritable,String](hdfsPath)
>>     file.count()
>>
>> but fails when the encoding is Snappy.
>>
>> I've seen some suggestions on the web about having to explicitly enable Snappy support in Spark, but it doesn't seem to work for me: http://www.ericlin.me/enabling-snappy-support-for-sharkspark

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-snappy-and-HDFS-tp22349.html
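Following Sean's point that the file is byte[] / byte[], one way to read it is to take both the key and the value as BytesWritable and decode the bytes explicitly. A sketch, with the path and the UTF-8 charset as assumptions:

```scala
import org.apache.hadoop.io.BytesWritable

// Sketch: read both key and value as BytesWritable (the file holds
// byte[]/byte[], not byte[]/String), then decode the bytes ourselves.
val hdfsPath = "hdfs://nost.name/path/to/folder" // illustrative path
val file = sc.sequenceFile[BytesWritable, BytesWritable](hdfsPath)

val decoded = file.map { case (k, v) =>
  // copyBytes() trims the Writable's internal buffer to the real length.
  (new String(k.copyBytes(), "UTF-8"), new String(v.copyBytes(), "UTF-8"))
}
decoded.count()
```

If the values are genuinely not text, the decode step would instead deserialise with whatever format wrote the file.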
Re: Spark, snappy and HDFS
I'm actually running this in a separate environment to our HDFS cluster. I think I've been able to sort out the issue by copying /opt/cloudera/parcels/CDH/lib to the machine I'm running this on (I'm just using a one-worker setup at present) and adding the following to spark-env.sh:

export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/nickt/lib/hadoop/lib/snappy-java-1.0.4.1.jar

I can now get past the previous error. The issue now seems to be with what is being returned:

    import org.apache.hadoop.io._
    val hdfsPath = "hdfs://nost.name/path/to/folder"
    val file = sc.sequenceFile[BytesWritable,String](hdfsPath)
    file.count()

returns the following error:

    java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be cast to org.apache.hadoop.io.Text

On Wed, Apr 1, 2015 at 7:34 PM, Xianjin YE advance...@gmail.com wrote:

> Do you have the same hadoop config for all nodes in your cluster (you run it in a cluster, right?)? Check the node (usually the executor) which gives the java.lang.UnsatisfiedLinkError to see whether libsnappy.so is in the hadoop native lib path.

On Thursday, April 2, 2015 at 10:22 AM, Nick Travers wrote:

> Thanks for the super quick response! I can read the file just fine in hadoop; it's just when I point Spark at this file that it can't seem to read it, due to the missing snappy jars / .so's. I'm playing around with adding some things to the spark-env.sh file, but still nothing.

On Wed, Apr 1, 2015 at 7:19 PM, Xianjin YE advance...@gmail.com wrote:

> Can you read a snappy compressed file in hdfs? It looks like libsnappy.so is not in the hadoop native lib path.

On Thursday, April 2, 2015 at 10:13 AM, Nick Travers wrote:

> Has anyone else encountered the following error when trying to read a snappy compressed sequence file from HDFS?
>
>     java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
>
> The following works for me when the file is uncompressed:
>
>     import org.apache.hadoop.io._
>     val hdfsPath = "hdfs://nost.name/path/to/folder"
>     val file = sc.sequenceFile[BytesWritable,String](hdfsPath)
>     file.count()
>
> but fails when the encoding is Snappy.
>
> I've seen some suggestions on the web about having to explicitly enable Snappy support in Spark, but it doesn't seem to work for me: http://www.ericlin.me/enabling-snappy-support-for-sharkspark

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-snappy-and-HDFS-tp22349.html
Spark, snappy and HDFS
Has anyone else encountered the following error when trying to read a snappy compressed sequence file from HDFS?

    java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z

The following works for me when the file is uncompressed:

    import org.apache.hadoop.io._
    val hdfsPath = "hdfs://nost.name/path/to/folder"
    val file = sc.sequenceFile[BytesWritable,String](hdfsPath)
    file.count()

but fails when the encoding is Snappy.

I've seen some suggestions on the web about having to explicitly enable Snappy support in Spark, but it doesn't seem to work for me: http://www.ericlin.me/enabling-snappy-support-for-sharkspark

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-snappy-and-HDFS-tp22349.html
Re: Spark, snappy and HDFS
Thanks for the super quick response! I can read the file just fine in hadoop; it's just when I point Spark at this file that it can't seem to read it, due to the missing snappy jars / .so's. I'm playing around with adding some things to the spark-env.sh file, but still nothing.

On Wed, Apr 1, 2015 at 7:19 PM, Xianjin YE advance...@gmail.com wrote:

> Can you read a snappy compressed file in hdfs? It looks like libsnappy.so is not in the hadoop native lib path.

On Thursday, April 2, 2015 at 10:13 AM, Nick Travers wrote:

> Has anyone else encountered the following error when trying to read a snappy compressed sequence file from HDFS?
>
>     java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
>
> The following works for me when the file is uncompressed:
>
>     import org.apache.hadoop.io._
>     val hdfsPath = "hdfs://nost.name/path/to/folder"
>     val file = sc.sequenceFile[BytesWritable,String](hdfsPath)
>     file.count()
>
> but fails when the encoding is Snappy.
>
> I've seen some suggestions on the web about having to explicitly enable Snappy support in Spark, but it doesn't seem to work for me: http://www.ericlin.me/enabling-snappy-support-for-sharkspark

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-snappy-and-HDFS-tp22349.html
java.io.FileNotFoundException when using HDFS in cluster mode
Hi List,

I'm following the example here: https://github.com/databricks/learning-spark/tree/master/mini-complete-example with the following:

$SPARK_HOME/bin/spark-submit \
  --deploy-mode cluster \
  --master spark://host.domain.ex:7077 \
  --class com.oreilly.learningsparkexamples.mini.scala.WordCount \
  hdfs://host.domain.ex/user/nickt/learning-spark-mini-example_2.10-0.0.1.jar \
  hdfs://host.domain.ex/user/nickt/linkage hdfs://host.domain.ex/user/nickt/wordcounts

The jar is submitted fine and I can see it appear on the driver node (i.e. it is connecting to and reading from HDFS ok):

-rw-r--r-- 1 nickt nickt 15K  Mar 29 22:05 learning-spark-mini-example_2.10-0.0.1.jar
-rw-r--r-- 1 nickt nickt 9.2K Mar 29 22:05 stderr
-rw-r--r-- 1 nickt nickt 0    Mar 29 22:05 stdout

But it's failing with a java.io.FileNotFoundException saying my input file is missing:

Caused by: java.io.FileNotFoundException: Added file file:/home/nickt/spark-1.3.0/work/driver-20150329220503-0021/hdfs:/host.domain.ex/user/nickt/linkage does not exist.

I'm using sc.addFile("hdfs://path/to/the_file.txt") to propagate the file to all the workers and sc.textFile(SparkFiles.get("the_file.txt")) to get the path to the file on each of the hosts. Has anyone come up against this before when reading from HDFS? No doubt I'm doing something wrong.
Full trace below:

Launch Command: /usr/java/java8/bin/java -cp :/home/nickt/spark-1.3.0/conf:/home/nickt/spark-1.3.0/assembly/target/scala-2.10/spark-assembly-1.3.0-hadoop2.0.0-mr1-cdh4.6.0.jar -Dakka.loglevel=WARNING -Dspark.driver.supervise=false -Dspark.app.name=com.oreilly.learningsparkexamples.mini.scala.WordCount -Dspark.akka.askTimeout=10 -Dspark.jars=hdfs://host.domain.ex/user/nickt/learning-spark-mini-example_2.10-0.0.1.jar -Dspark.master=spark://host.domain.ex:7077 -Xms512M -Xmx512M org.apache.spark.deploy.worker.DriverWrapper akka.tcp://sparkwor...@host5.domain.ex:40830/user/Worker /home/nickt/spark-1.3.0/work/driver-20150329220503-0021/learning-spark-mini-example_2.10-0.0.1.jar com.oreilly.learningsparkexamples.mini.scala.WordCount hdfs://host.domain.ex/user/nickt/linkage hdfs://host.domain.ex/user/nickt/wordcounts

log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/03/29 22:05:05 INFO SecurityManager: Changing view acls to: nickt
15/03/29 22:05:05 INFO SecurityManager: Changing modify acls to: nickt
15/03/29 22:05:05 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(nickt); users with modify permissions: Set(nickt)
15/03/29 22:05:05 INFO Slf4jLogger: Slf4jLogger started
15/03/29 22:05:05 INFO Utils: Successfully started service 'Driver' on port 44201.
15/03/29 22:05:05 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkwor...@host5.domain.ex:40830/user/Worker
15/03/29 22:05:05 INFO SparkContext: Running Spark version 1.3.0
15/03/29 22:05:05 INFO SecurityManager: Changing view acls to: nickt
15/03/29 22:05:05 INFO SecurityManager: Changing modify acls to: nickt
15/03/29 22:05:05 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(nickt); users with modify permissions: Set(nickt)
15/03/29 22:05:05 INFO Slf4jLogger: Slf4jLogger started
15/03/29 22:05:05 INFO Utils: Successfully started service 'sparkDriver' on port 33382.
15/03/29 22:05:05 INFO SparkEnv: Registering MapOutputTracker
15/03/29 22:05:05 INFO SparkEnv: Registering BlockManagerMaster
15/03/29 22:05:05 INFO DiskBlockManager: Created local directory at /tmp/spark-9c52eb1e-92b9-4e3f-b0e9-699a158f8e40/blockmgr-222a2522-a0fc-4535-a939-4c14d92dc666
15/03/29 22:05:05 INFO WorkerWatcher: Successfully connected to akka.tcp://sparkwor...@host5.domain.ex:40830/user/Worker
15/03/29 22:05:05 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
15/03/29 22:05:05 INFO HttpFileServer: HTTP File server directory is /tmp/spark-031afddd-2a75-4232-931a-89e502b0d722/httpd-7e22bb57-3cfe-4c89-aaec-4e6ca1a65f66
15/03/29 22:05:05 INFO HttpServer: Starting HTTP Server
15/03/29 22:05:05 INFO Server: jetty-8.y.z-SNAPSHOT
15/03/29 22:05:05 INFO AbstractConnector: Started SocketConnector@0.0.0.0:42484
15/03/29 22:05:05 INFO Utils: Successfully started service 'HTTP file server' on port 42484.
15/03/29 22:05:05 INFO SparkEnv: Registering OutputCommitCoordinator
15/03/29 22:05:06 INFO Server: jetty-8.y.z-SNAPSHOT
15/03/29 22:05:06 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/03/29 22:05:06 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/03/29 22:05:06 INFO SparkUI: Started SparkUI at http://host5.domain.ex:4040
15/03/29 22:05:06 ERROR SparkContext: Jar not found at
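For what it's worth, the usual fix for the addFile/SparkFiles pattern above is to skip it entirely for HDFS inputs: sc.textFile can read an hdfs:// URI directly on every worker, while SparkFiles is intended for local side files shipped with the job. A sketch of the word count along those lines, with the paths taken from the spark-submit command above:

```scala
// Sketch: read the HDFS input directly rather than via sc.addFile /
// SparkFiles.get — each executor reads its own split straight from HDFS.
val lines = sc.textFile("hdfs://host.domain.ex/user/nickt/linkage")

val counts = lines
  .flatMap(_.split("\\s+"))   // split each line into words
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://host.domain.ex/user/nickt/wordcounts")
```

The FileNotFoundException in the trace follows from addFile treating the hdfs:// URI as a local path inside the driver's work directory, which is why the error shows the two paths concatenated.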