Re: Spark SQL High GC time

2015-05-25 Thread Nick Travers
Hi Yuming - I was running into the same issue with larger worker nodes a few
weeks ago.

The way I managed to get around the high GC time, as per the suggestion of
some others, was to break each worker node up into individual workers of
around 10G in size. Divide your cores accordingly.
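
For concreteness, a rough spark-env.sh sketch of that split; this is only a sketch and the numbers are illustrative, not taken from the thread:

# spark-env.sh on each large worker node: run several ~10G workers instead
# of one big JVM, so that individual GC pauses stay short.
export SPARK_WORKER_INSTANCES=8   # e.g. 8 workers on a 100GB node
export SPARK_WORKER_MEMORY=10g
export SPARK_WORKER_CORES=2       # the node's cores divided across the instances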

The other way I was able to avoid high GC time was to use the right kind of
serialisation to keep the number of objects in memory low.
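
A minimal sketch of the serialisation side, assuming Kryo is what is meant here (the record class is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative stand-in for whatever actually flows through the shuffle.
case class MyRecord(col1: Long, col2: String)

// Kryo keeps shuffled and cached data far more compact than default Java
// serialisation, which lowers GC pressure.
val conf = new SparkConf()
  .setAppName("gc-tuning-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering the classes avoids Kryo writing full class names per record.
  .registerKryoClasses(Array(classOf[MyRecord]))
val sc = new SparkContext(conf)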

Hope that helps!
- nick






Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-04 Thread Nick Travers
Could you be more specific about how this is done?

The DataFrame class doesn't have a partitionBy method.
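
A minimal sketch of one reading of the partitionBy suggestion quoted below: pre-partition by the join key on the RDDs underneath the DataFrames (hc stands for the HiveContext from the original post; table names, the key column and the partition count are illustrative):

import org.apache.spark.HashPartitioner

// Assumed table and column names, for illustration only.
val df1 = hc.table("a")
val df2 = hc.table("b")

// Key both sides by the join column and give them the same partitioner,
// so the subsequent join does not have to re-shuffle either side.
val byKey1 = df1.rdd.keyBy(row => row.getLong(0)).partitionBy(new HashPartitioner(2000))
val byKey2 = df2.rdd.keyBy(row => row.getLong(0)).partitionBy(new HashPartitioner(2000))
val joined = byKey1.join(byKey2)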

On Sun, May 3, 2015 at 11:07 PM, ayan guha guha.a...@gmail.com wrote:

 You can use a custom partitioner to redistribute the data using partitionBy.
 On 4 May 2015 15:37, Nick Travers n.e.trav...@gmail.com wrote:

 I'm currently trying to join two large tables (order 1B rows each) using
 Spark SQL (1.3.0) and am running into long GC pauses which bring the job to
 a halt.

 I'm reading in both tables using a HiveContext, with the underlying files
 stored as Parquet files. I'm using something along the lines of
 HiveContext.sql("SELECT a.col1, b.col2 FROM a JOIN b ON a.col1 = b.col1")
 to set up the join.

 When I execute this (with an action such as .count) I see the first few
 stages complete, but the job eventually stalls. The GC counts keep
 increasing for each executor.

 Running with 6 workers, each with 2T disk and 100GB RAM.

 Has anyone else run into this issue? I'm thinking I might be running into
 issues with the shuffling of the data, but I'm unsure of how to get around
 this. Is there a way to redistribute the rows based on the join key first,
 and then do the join?

 Thanks in advance.







Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-03 Thread Nick Travers
I'm currently trying to join two large tables (order 1B rows each) using
Spark SQL (1.3.0) and am running into long GC pauses which bring the job to
a halt.

I'm reading in both tables using a HiveContext, with the underlying files
stored as Parquet files. I'm using something along the lines of
HiveContext.sql("SELECT a.col1, b.col2 FROM a JOIN b ON a.col1 = b.col1")
to set up the join.

When I execute this (with an action such as .count) I see the first few
stages complete, but the job eventually stalls. The GC counts keep
increasing for each executor.

Running with 6 workers, each with 2T disk and 100GB RAM.
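
As an aside, a minimal sketch (not from the thread) for seeing where the pauses come from; the GC flags are standard HotSpot options passed through Spark's spark.executor.extraJavaOptions setting:

import org.apache.spark.SparkConf

// Turn on per-executor GC logging so the stalls can be attributed to
// full GCs (or not) from the executor stderr logs.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")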

Has anyone else run into this issue? I'm thinking I might be running into
issues with the shuffling of the data, but I'm unsure of how to get around
this. Is there a way to redistribute the rows based on the join key first,
and then do the join?

Thanks in advance.






Re: Spark, snappy and HDFS

2015-04-02 Thread Nick Travers
Thanks all. I was able to get the decompression working by adding the
following to my spark-env.sh script:

export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/nickt/lib/hadoop/lib/snappy-java-1.0.4.1.jar
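
Equivalently, and as an assumption on my part rather than something from the thread, the same paths can be supplied through spark-defaults.conf instead of spark-env.sh; a minimal sketch assuming the same install locations:

# spark-defaults.conf: make the Hadoop native libs and the snappy-java jar
# visible to both the driver and the executors at launch time.
spark.driver.extraLibraryPath     /home/nickt/lib/hadoop/lib/native
spark.executor.extraLibraryPath   /home/nickt/lib/hadoop/lib/native
spark.driver.extraClassPath       /home/nickt/lib/hadoop/lib/snappy-java-1.0.4.1.jar
spark.executor.extraClassPath     /home/nickt/lib/hadoop/lib/snappy-java-1.0.4.1.jar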

On Thu, Apr 2, 2015 at 12:51 AM, Sean Owen so...@cloudera.com wrote:

 Yes, any Hadoop-related process that asks for Snappy compression or
 needs to read it will have to have the Snappy libs available on the
 library path. That's usually set up for you in a distro or you can do
 it manually like this. This is not Spark-specific.

 The second question also isn't Spark-specific; you do not have a
 SequenceFile of byte[] / String, but of byte[] / byte[]. Review what
 you are writing since it is not BytesWritable / Text.
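
A minimal sketch of reading it as byte[] / byte[] and converting afterwards, on the assumption that the values really are UTF-8 text (the path is the same placeholder used elsewhere in the thread):

import org.apache.hadoop.io.BytesWritable

// Read both key and value as BytesWritable, then copy the bytes out of
// Hadoop's reused Writable buffers before doing anything else with them.
val raw = sc.sequenceFile[BytesWritable, BytesWritable]("hdfs://nost.name/path/to/folder")
val pairs = raw.map { case (k, v) =>
  (k.getBytes.take(k.getLength), new String(v.getBytes, 0, v.getLength, "UTF-8"))
}
pairs.count()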

 On Thu, Apr 2, 2015 at 3:40 AM, Nick Travers n.e.trav...@gmail.com wrote:

  I'm actually running this in a separate environment from our HDFS cluster.

  I think I've been able to sort out the issue by copying
  /opt/cloudera/parcels/CDH/lib to the machine I'm running this on (I'm just
  using a one-worker setup at present) and adding the following to
  spark-env.sh:

  export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
  export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
  export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/nickt/lib/hadoop/lib/snappy-java-1.0.4.1.jar

  I can get past the previous error. The issue now seems to be with what is
  being returned.

  import org.apache.hadoop.io._
  val hdfsPath = "hdfs://nost.name/path/to/folder"
  val file = sc.sequenceFile[BytesWritable,String](hdfsPath)
  file.count()

  returns the following error:

  java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be
  cast to org.apache.hadoop.io.Text
 
 
  On Wed, Apr 1, 2015 at 7:34 PM, Xianjin YE advance...@gmail.com wrote:
 
  Do you have the same hadoop config for all nodes in your cluster (you run
  it in a cluster, right?)?
  Check the node (usually the executor) which gives the
  java.lang.UnsatisfiedLinkError to see whether the libsnappy.so is in the
  hadoop native lib path.
 
  On Thursday, April 2, 2015 at 10:22 AM, Nick Travers wrote:
 
  Thanks for the super quick response!
 
  I can read the file just fine in hadoop, it's just when I point Spark at
  this file it can't seem to read it due to the missing snappy jars / so's.
 
  I'm playing around with adding some things to the spark-env.sh file, but
  still nothing.
 
  On Wed, Apr 1, 2015 at 7:19 PM, Xianjin YE advance...@gmail.com
 wrote:
 
  Can you read the snappy compressed file in hdfs? Looks like the
  libsnappy.so is not in the hadoop native lib path.
 
  On Thursday, April 2, 2015 at 10:13 AM, Nick Travers wrote:
 
  Has anyone else encountered the following error when trying to read a
  snappy compressed sequence file from HDFS?

  *java.lang.UnsatisfiedLinkError:
  org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z*

  The following works for me when the file is uncompressed:

  import org.apache.hadoop.io._
  val hdfsPath = "hdfs://nost.name/path/to/folder"
  val file = sc.sequenceFile[BytesWritable,String](hdfsPath)
  file.count()

  but fails when the encoding is Snappy.

  I've seen some stuff floating around on the web about having to explicitly
  enable support for Snappy in Spark, but it doesn't seem to work for me:
  http://www.ericlin.me/enabling-snappy-support-for-sharkspark
 
 
 
 
 
 
 
 



Re: Spark, snappy and HDFS

2015-04-01 Thread Nick Travers
I'm actually running this in a separate environment from our HDFS cluster.

I think I've been able to sort out the issue by copying
/opt/cloudera/parcels/CDH/lib to the machine I'm running this on (I'm just
using a one-worker setup at present) and adding the following to
spark-env.sh:

export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/nickt/lib/hadoop/lib/snappy-java-1.0.4.1.jar

I can get past the previous error. The issue now seems to be with what is
being returned.

import org.apache.hadoop.io._
val hdfsPath = "hdfs://nost.name/path/to/folder"
val file = sc.sequenceFile[BytesWritable,String](hdfsPath)
file.count()

returns the following error:

java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be
cast to org.apache.hadoop.io.Text


On Wed, Apr 1, 2015 at 7:34 PM, Xianjin YE advance...@gmail.com wrote:

 Do you have the same hadoop config for all nodes in your cluster (you run
 it in a cluster, right?)?
 Check the node (usually the executor) which gives the
 java.lang.UnsatisfiedLinkError to see whether the libsnappy.so is in the
 hadoop native lib path.

 On Thursday, April 2, 2015 at 10:22 AM, Nick Travers wrote:

 Thanks for the super quick response!

 I can read the file just fine in hadoop, it's just when I point Spark at
 this file it can't seem to read it due to the missing snappy jars / so's.

 I'm playing around with adding some things to the spark-env.sh file, but
 still nothing.

 On Wed, Apr 1, 2015 at 7:19 PM, Xianjin YE advance...@gmail.com wrote:

 Can you read the snappy compressed file in hdfs? Looks like the libsnappy.so
 is not in the hadoop native lib path.

 On Thursday, April 2, 2015 at 10:13 AM, Nick Travers wrote:

 Has anyone else encountered the following error when trying to read a
 snappy compressed sequence file from HDFS?

 *java.lang.UnsatisfiedLinkError:
 org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z*

 The following works for me when the file is uncompressed:

 import org.apache.hadoop.io._
 val hdfsPath = "hdfs://nost.name/path/to/folder"
 val file = sc.sequenceFile[BytesWritable,String](hdfsPath)
 file.count()

 but fails when the encoding is Snappy.

 I've seen some stuff floating around on the web about having to explicitly
 enable support for Snappy in Spark, but it doesn't seem to work for me:
 http://www.ericlin.me/enabling-snappy-support-for-sharkspark










Spark, snappy and HDFS

2015-04-01 Thread Nick Travers
Has anyone else encountered the following error when trying to read a snappy
compressed sequence file from HDFS?

*java.lang.UnsatisfiedLinkError:
org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z*

The following works for me when the file is uncompressed:

import org.apache.hadoop.io._
val hdfsPath = "hdfs://nost.name/path/to/folder"
val file = sc.sequenceFile[BytesWritable,String](hdfsPath)
file.count()

but fails when the encoding is Snappy.

I've seen some stuff floating around on the web about having to explicitly
enable support for Snappy in Spark, but it doesn't seem to work for me:
http://www.ericlin.me/enabling-snappy-support-for-sharkspark






Re: Spark, snappy and HDFS

2015-04-01 Thread Nick Travers
Thanks for the super quick response!

I can read the file just fine in hadoop, it's just when I point Spark at
this file it can't seem to read it due to the missing snappy jars / so's.

I'm playing around with adding some things to the spark-env.sh file, but
still nothing.

On Wed, Apr 1, 2015 at 7:19 PM, Xianjin YE advance...@gmail.com wrote:

 Can you read the snappy compressed file in hdfs? Looks like the libsnappy.so
 is not in the hadoop native lib path.

 On Thursday, April 2, 2015 at 10:13 AM, Nick Travers wrote:

 Has anyone else encountered the following error when trying to read a
 snappy compressed sequence file from HDFS?

 *java.lang.UnsatisfiedLinkError:
 org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z*

 The following works for me when the file is uncompressed:

 import org.apache.hadoop.io._
 val hdfsPath = "hdfs://nost.name/path/to/folder"
 val file = sc.sequenceFile[BytesWritable,String](hdfsPath)
 file.count()

 but fails when the encoding is Snappy.

 I've seen some stuff floating around on the web about having to explicitly
 enable support for Snappy in Spark, but it doesn't seem to work for me:
 http://www.ericlin.me/enabling-snappy-support-for-sharkspark








java.io.FileNotFoundException when using HDFS in cluster mode

2015-03-29 Thread Nick Travers
Hi List,

I'm following this example:
https://github.com/databricks/learning-spark/tree/master/mini-complete-example
with the following:

$SPARK_HOME/bin/spark-submit \
  --deploy-mode cluster \
  --master spark://host.domain.ex:7077 \
  --class com.oreilly.learningsparkexamples.mini.scala.WordCount \
  hdfs://host.domain.ex/user/nickt/learning-spark-mini-example_2.10-0.0.1.jar \
  hdfs://host.domain.ex/user/nickt/linkage \
  hdfs://host.domain.ex/user/nickt/wordcounts

The jar is submitted fine and I can see it appear on the driver node (i.e.
connecting to and reading from HDFS ok):

-rw-r--r-- 1 nickt nickt  15K Mar 29 22:05 learning-spark-mini-example_2.10-0.0.1.jar
-rw-r--r-- 1 nickt nickt 9.2K Mar 29 22:05 stderr
-rw-r--r-- 1 nickt nickt    0 Mar 29 22:05 stdout

But it's failing due to a java.io.FileNotFoundException saying my input file
is missing:

Caused by: java.io.FileNotFoundException: Added file
file:/home/nickt/spark-1.3.0/work/driver-20150329220503-0021/hdfs:/host.domain.ex/user/nickt/linkage
does not exist.

I'm using sc.addFile("hdfs://path/to/the_file.txt") to propagate the file to
all the workers, and sc.textFile(SparkFiles.get("the_file.txt")) to read it
from its local path on each of the hosts.
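
As an aside, a minimal sketch (an assumption on my part, not from the thread): an HDFS path can be read directly with sc.textFile, without sc.addFile/SparkFiles, which are intended for shipping small side files to each executor's working directory rather than the main input:

// Read the input straight from HDFS on every worker; no addFile needed.
val linkage = sc.textFile("hdfs://host.domain.ex/user/nickt/linkage")
linkage.count()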

Has anyone come up against this before when reading from HDFS? No doubt I'm
doing something wrong.

Full trace below:

Launch Command: /usr/java/java8/bin/java -cp
:/home/nickt/spark-1.3.0/conf:/home/nickt/spark-1.3.0/assembly/target/scala-2.10/spark-assembly-1.3.0-hadoop2.0.0-mr1-cdh4.6.0.jar
-Dakka.loglevel=WARNING -Dspark.driver.supervise=false
-Dspark.app.name=com.oreilly.learningsparkexamples.mini.scala.WordCount
-Dspark.akka.askTimeout=10
-Dspark.jars=hdfs://host.domain.ex/user/nickt/learning-spark-mini-example_2.10-0.0.1.jar
-Dspark.master=spark://host.domain.ex:7077 -Xms512M -Xmx512M
org.apache.spark.deploy.worker.DriverWrapper
akka.tcp://sparkwor...@host5.domain.ex:40830/user/Worker
/home/nickt/spark-1.3.0/work/driver-20150329220503-0021/learning-spark-mini-example_2.10-0.0.1.jar
com.oreilly.learningsparkexamples.mini.scala.WordCount
hdfs://host.domain.ex/user/nickt/linkage
hdfs://host.domain.ex/user/nickt/wordcounts


log4j:WARN No appenders could be found for logger
(org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
more info.
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
15/03/29 22:05:05 INFO SecurityManager: Changing view acls to: nickt
15/03/29 22:05:05 INFO SecurityManager: Changing modify acls to: nickt
15/03/29 22:05:05 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(nickt); users
with modify permissions: Set(nickt)
15/03/29 22:05:05 INFO Slf4jLogger: Slf4jLogger started
15/03/29 22:05:05 INFO Utils: Successfully started service 'Driver' on port
44201.
15/03/29 22:05:05 INFO WorkerWatcher: Connecting to worker
akka.tcp://sparkwor...@host5.domain.ex:40830/user/Worker
15/03/29 22:05:05 INFO SparkContext: Running Spark version 1.3.0
15/03/29 22:05:05 INFO SecurityManager: Changing view acls to: nickt
15/03/29 22:05:05 INFO SecurityManager: Changing modify acls to: nickt
15/03/29 22:05:05 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(nickt); users
with modify permissions: Set(nickt)
15/03/29 22:05:05 INFO Slf4jLogger: Slf4jLogger started
15/03/29 22:05:05 INFO Utils: Successfully started service 'sparkDriver' on
port 33382.
15/03/29 22:05:05 INFO SparkEnv: Registering MapOutputTracker
15/03/29 22:05:05 INFO SparkEnv: Registering BlockManagerMaster
15/03/29 22:05:05 INFO DiskBlockManager: Created local directory at
/tmp/spark-9c52eb1e-92b9-4e3f-b0e9-699a158f8e40/blockmgr-222a2522-a0fc-4535-a939-4c14d92dc666
15/03/29 22:05:05 INFO WorkerWatcher: Successfully connected to
akka.tcp://sparkwor...@host5.domain.ex:40830/user/Worker
15/03/29 22:05:05 INFO MemoryStore: MemoryStore started with capacity 265.1
MB
15/03/29 22:05:05 INFO HttpFileServer: HTTP File server directory is
/tmp/spark-031afddd-2a75-4232-931a-89e502b0d722/httpd-7e22bb57-3cfe-4c89-aaec-4e6ca1a65f66
15/03/29 22:05:05 INFO HttpServer: Starting HTTP Server
15/03/29 22:05:05 INFO Server: jetty-8.y.z-SNAPSHOT
15/03/29 22:05:05 INFO AbstractConnector: Started
SocketConnector@0.0.0.0:42484
15/03/29 22:05:05 INFO Utils: Successfully started service 'HTTP file
server' on port 42484.
15/03/29 22:05:05 INFO SparkEnv: Registering OutputCommitCoordinator
15/03/29 22:05:06 INFO Server: jetty-8.y.z-SNAPSHOT
15/03/29 22:05:06 INFO AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
15/03/29 22:05:06 INFO Utils: Successfully started service 'SparkUI' on port
4040.
15/03/29 22:05:06 INFO SparkUI: Started SparkUI at
http://host5.domain.ex:4040
15/03/29 22:05:06 ERROR SparkContext: Jar not found at