Re: Spark SQL High GC time

2015-05-25 Thread Nick Travers
Hi Yuming - I was running into the same issue with larger worker nodes a few weeks ago. The way I managed to get around the high GC time, as per the suggestion of some others, was to break each worker node up into multiple smaller workers of around 10G each, and divide the cores accordingly. The other ...
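
A minimal sketch of that split in conf/spark-env.sh for a standalone cluster, assuming a worker machine with 60G of RAM and 12 cores (the numbers are illustrative, not from the thread):

    # conf/spark-env.sh on each worker machine (values are illustrative)
    # Run six ~10G worker JVMs instead of one large one, so each heap
    # stays small enough that full GCs finish quickly.
    export SPARK_WORKER_INSTANCES=6
    export SPARK_WORKER_MEMORY=10g
    export SPARK_WORKER_CORES=2   # 12 cores divided across 6 workers

The application then requests at most 10g per executor via spark.executor.memory so that one executor fits inside one worker.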

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-04 Thread Nick Travers
Could you be more specific about how this is done? The DataFrame class doesn't have that method. On Sun, May 3, 2015 at 11:07 PM, ayan guha guha.a...@gmail.com wrote: You can use a custom partitioner to redistribute the data using partitionBy. On 4 May 2015 15:37, Nick Travers n.e.trav...@gmail.com wrote ...
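
For context, in Spark 1.3 partitionBy lives on pair RDDs rather than on DataFrame, which is presumably what the suggestion refers to. A minimal sketch of co-partitioning both sides by key before joining (the key column and partition count are made up for illustration):

    import org.apache.spark.HashPartitioner

    // df1 and df2 are the two DataFrames; assume the join key is the
    // first column and is a Long (hypothetical).
    val partitioner = new HashPartitioner(2048)

    val left  = df1.rdd.map(row => (row.getLong(0), row)).partitionBy(partitioner)
    val right = df2.rdd.map(row => (row.getLong(0), row)).partitionBy(partitioner)

    val joined = left.join(right)   // RDD[(Long, (Row, Row))]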

Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-03 Thread Nick Travers
I'm currently trying to join two large tables (on the order of 1B rows each) using Spark SQL (1.3.0) and am running into long GC pauses which bring the job to a halt. I'm reading in both tables using a HiveContext, with the underlying files stored as Parquet files. I'm using something along the lines of ...
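
The message is cut off here; purely as an illustration (not the poster's actual code, and the paths and column names are invented), a join of two Parquet-backed tables through a HiveContext in 1.3 could look something like:

    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc)

    // Paths and column names are hypothetical.
    val orders    = sqlContext.parquetFile("hdfs:///data/orders.parquet")
    val customers = sqlContext.parquetFile("hdfs:///data/customers.parquet")

    val joined = orders.join(customers, orders("customer_id") === customers("id"))
    joined.count()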

Re: Spark, snappy and HDFS

2015-04-02 Thread Nick Travers
... what you are writing, since it is not BytesWritable / Text. On Thu, Apr 2, 2015 at 3:40 AM, Nick Travers n.e.trav...@gmail.com wrote: I'm actually running this in a separate environment from our HDFS cluster. I think I've been able to sort out the issue by copying /opt/cloudera/parcels/CDH/lib ...
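
If the fix was making the native libraries visible to the executors, another way to express it is through Spark's library-path settings rather than copying files around. A sketch, with the directory name only a guess since the original path is cut off:

    import org.apache.spark.{SparkConf, SparkContext}

    // Point this at wherever libsnappy.so / libhadoop.so actually live
    // on the nodes; the CDH path below is an assumption.
    val nativeLibs = "/opt/cloudera/parcels/CDH/lib/hadoop/lib/native"

    val conf = new SparkConf()
      .setAppName("snappy-sequence-file-read")
      .set("spark.executor.extraLibraryPath", nativeLibs)
      .set("spark.driver.extraLibraryPath", nativeLibs)

    val sc = new SparkContext(conf)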

Re: Spark, snappy and HDFS

2015-04-01 Thread Nick Travers
... the executor) which gives the java.lang.UnsatisfiedLinkError, to see whether libsnappy.so is in the Hadoop native lib path. On Thursday, April 2, 2015 at 10:22 AM, Nick Travers wrote: Thanks for the super quick response! I can read the file just fine in Hadoop, it's just when I point Spark ...

Spark, snappy and HDFS

2015-04-01 Thread Nick Travers
Has anyone else encountered the following error when trying to read a snappy-compressed sequence file from HDFS? *java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z* The following works for me when the file is uncompressed: import ...
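
The working snippet is truncated after the import; a minimal sketch of reading a BytesWritable / Text sequence file (the path and the key/value types are assumptions based on the rest of the thread):

    import org.apache.hadoop.io.{BytesWritable, Text}
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("read-seqfile"))

    // Path is hypothetical. Pointing this at the snappy-compressed copy
    // is what triggers the UnsatisfiedLinkError when the native snappy
    // library cannot be loaded.
    val rdd = sc.sequenceFile("hdfs:///data/events.seq",
                              classOf[BytesWritable], classOf[Text])

    println(rdd.count())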

Re: Spark, snappy and HDFS

2015-04-01 Thread Nick Travers
... PM, Xianjin YE advance...@gmail.com wrote: Can you read a snappy-compressed file in HDFS? It looks like libsnappy.so is not in the Hadoop native lib path. On Thursday, April 2, 2015 at 10:13 AM, Nick Travers wrote: Has anyone else encountered the following error when trying to read a snappy ...
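
A quick way to check that suggestion, assuming a Hadoop 2.x installation on the node, is the checknative command, which reports whether the hadoop, zlib, snappy, lz4 and bzip2 native libraries can be loaded:

    # Run on the node where the executor fails.
    hadoop checknative -a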

java.io.FileNotFoundException when using HDFS in cluster mode

2015-03-29 Thread Nick Travers
Hi List, I'm following this example here https://github.com/databricks/learning-spark/tree/master/mini-complete-example with the following: $SPARK_HOME/bin/spark-submit \ --deploy-mode cluster \ --master spark://host.domain.ex:7077 \ --class ...
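
The command is cut off at --class; purely as an illustration (the class name and jar path are made up), a cluster-mode submit usually needs the application jar and any input paths on HDFS, since the driver runs on an arbitrary worker and a local path there raises exactly this kind of FileNotFoundException:

    # Class name and paths are hypothetical.
    # In cluster deploy mode the driver runs on a worker, so the jar must
    # be reachable from every node (e.g. on HDFS), not only from the
    # machine running spark-submit.
    $SPARK_HOME/bin/spark-submit \
      --deploy-mode cluster \
      --master spark://host.domain.ex:7077 \
      --class com.example.WordCount \
      hdfs:///user/nick/learning-spark-mini-example.jar \
      hdfs:///user/nick/input.txt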