Re: Support for skewed joins in Spark

2015-03-12 Thread Soila Pertet Kavulya
Thanks Shixiong, I'll try out your PR. Do you know what the status of the PR is? Are there any plans to incorporate this change into DataFrames/SchemaRDDs in Spark 1.3? Soila On Thu, Mar 12, 2015 at 7:52 PM, Shixiong Zhu zsxw...@gmail.com wrote: I sent a PR to add skewed join last year

Re: NegativeArraySizeException when doing joins on skewed data

2015-03-12 Thread Soila Pertet Kavulya
Hi Tristan, Did upgrading to Kryo3 help? Thanks, Soila On Sun, Mar 1, 2015 at 2:48 PM, Tristan Blakers tris...@blackfrog.org wrote: Yeah I implemented the same solution. It seems to kick in around the 4B mark, but looking at the log I suspect it’s probably a function of the number of unique
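
For readers landing on this thread: the settings below are not the Kryo 3 upgrade being discussed, just the standard knobs Spark 1.2 exposes for its bundled Kryo serializer (buffer ceiling in MB on 1.2/1.3, plus class registration). MyRecord is a hypothetical stand-in for the joined records.

    import org.apache.spark.SparkConf

    // Hypothetical record type standing in for the data in this thread.
    case class MyRecord(id: Long, payload: String)

    val conf = new SparkConf()
      .setAppName("kryo-settings-example")
      // Switch to Kryo and raise the per-task serialization buffer ceiling
      // (expressed in megabytes on Spark 1.2/1.3).
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.max.mb", "512")
      // Registering the record classes keeps Kryo output compact.
      .registerKryoClasses(Array(classOf[MyRecord]))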

Support for skewed joins in Spark

2015-03-12 Thread Soila Pertet Kavulya
Does Spark support skewed joins similar to Pig, which distributes large keys over multiple partitions? I tried using the RangePartitioner but I am still experiencing failures because some keys are too large to fit in a single partition. I cannot use broadcast variables to work around this because
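
Pig's skewed join samples the key distribution, scatters the hot keys across several reducers, and replicates the matching rows from the other input. Spark 1.x has no built-in equivalent, but a manual "salting" sketch along the same lines looks roughly like this (the names and the fixed salt count are illustrative):

    import org.apache.spark.rdd.RDD
    import scala.util.Random

    // Split every key on the large side into numSalts sub-keys, replicate the
    // small side across all sub-keys, join on the salted key, then drop the salt.
    def saltedJoin[V, W](large: RDD[(String, V)],
                         small: RDD[(String, W)],
                         numSalts: Int): RDD[(String, (V, W))] = {
      val salted = large.map { case (k, v) =>
        ((k, Random.nextInt(numSalts)), v)          // scatter hot keys over numSalts sub-keys
      }
      val replicated = small.flatMap { case (k, w) =>
        (0 until numSalts).map(s => ((k, s), w))    // replicate each row to every salt value
      }
      salted.join(replicated).map { case ((k, _), vw) => (k, vw) }
    }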

NegativeArraySizeException when doing joins on skewed data

2015-02-25 Thread soila
I have been running into NegativeArraySizeExceptions when doing joins on data with very skewed key distributions in Spark 1.2.0. I found a previous post that mentioned that this exception arises when the size of the blocks spilled during the shuffle exceeds 2GB. The post recommended increasing
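
The usual first-line mitigation for the 2GB shuffle-block limit is to raise the number of partitions on the join so each reducer's blocks stay small; a minimal sketch, with placeholder paths and an arbitrary partition count:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("skewed-join-partitions"))

    val left  = sc.textFile("hdfs:///data/left").map(line => (line.split("\t")(0), line))
    val right = sc.textFile("hdfs:///data/right").map(line => (line.split("\t")(0), line))

    // An explicit partition count spreads each key range over more, smaller blocks;
    // a single heavily skewed key can still overflow one task, though.
    val joined = left.join(right, 2000)
    joined.saveAsTextFile("hdfs:///data/joined")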

Size exceeds Integer.MAX_VALUE exception when broadcasting large variable

2015-02-13 Thread soila
I am trying to broadcast a large 5GB variable using Spark 1.2.0. I get the following exception when the size of the broadcast variable exceeds 2GB. Any ideas on how I can resolve this issue? java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at
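
Since a single broadcast block cannot exceed Integer.MAX_VALUE bytes, one workaround is to split the table into pieces that each stay well under 2GB and broadcast them separately. A sketch under that assumption, with made-up names and types:

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    // Split a big lookup table into roughly numChunks smaller maps and broadcast each piece.
    def broadcastInChunks(sc: SparkContext,
                          bigTable: Map[String, Array[Byte]],
                          numChunks: Int): Seq[Broadcast[Map[String, Array[Byte]]]] =
      bigTable.toSeq
        .grouped(math.max(1, bigTable.size / numChunks))
        .map(_.toMap)
        .toSeq
        .map(chunk => sc.broadcast(chunk))        // each piece must stay under ~2GB

    // On the executors, look a key up in whichever piece holds it.
    def lookup(key: String,
               pieces: Seq[Broadcast[Map[String, Array[Byte]]]]): Option[Array[Byte]] =
      pieces.view.flatMap(_.value.get(key)).headOption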

Re: Size exceeds Integer.MAX_VALUE exception when broadcasting large variable

2015-02-13 Thread Soila Pertet Kavulya
% of the data. Do you have any pointers on how to handle skewed key distributions during a join? Soila On Fri, Feb 13, 2015 at 10:49 AM, Imran Rashid iras...@cloudera.com wrote: unfortunately this is a known issue: https://issues.apache.org/jira/browse/SPARK-1476 as Sean suggested, you need to think
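
One pattern for the question above, sketched with hypothetical names: handle the few dominant keys with a small map-side join (broadcasting only their rows from the smaller input) and run an ordinary shuffle join on everything else, then union the two results.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def hotKeyJoin(sc: SparkContext,
                   large: RDD[(String, String)],
                   small: RDD[(String, String)],
                   hotKeys: Set[String]): RDD[(String, (String, String))] = {
      val hotKeysBc  = sc.broadcast(hotKeys)
      // Only the small-side rows for the hot keys are collected and broadcast,
      // so this stays far below the 2GB broadcast/block limit.
      // (Assumes at most one small-side row per hot key.)
      val smallHotBc = sc.broadcast(
        small.filter { case (k, _) => hotKeysBc.value.contains(k) }.collectAsMap())

      // Map-side join for the skewed keys only.
      val hotJoined = large
        .filter { case (k, _) => hotKeysBc.value.contains(k) }
        .flatMap { case (k, v) => smallHotBc.value.get(k).map(w => (k, (v, w))) }

      // Ordinary shuffle join for the remaining, well-behaved keys.
      val restJoined = large.filter { case (k, _) => !hotKeysBc.value.contains(k) }
        .join(small.filter { case (k, _) => !hotKeysBc.value.contains(k) })

      hotJoined.union(restJoined)
    }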

Re: Largest input data set observed for Spark.

2014-03-20 Thread Soila Pertet Kavulya
Hi Reynold, Nice! What Spark configuration parameters did you use to get your job to run successfully on a large dataset? My job is failing on 1TB of input data (uncompressed) on a 4-node cluster (64GB memory per node). No OutOfMemory errors, just lost executors. Thanks, Soila On Mar 20, 2014 11
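
For comparison, a hedged starting point for that kind of 4-node, 64GB-per-node cluster on Spark 1.x is below; the numbers are illustrative, and the memory-overhead setting assumes a YARN deployment, where executors lost without OutOfMemory errors are often containers killed for exceeding their memory limit.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("large-input-job")
      .set("spark.executor.memory", "48g")                // leave headroom for OS and overhead
      .set("spark.yarn.executor.memoryOverhead", "4096")  // MB of off-heap headroom per executor
      .set("spark.storage.memoryFraction", "0.3")         // less caching, more room for shuffles
      .set("spark.shuffle.memoryFraction", "0.5")
      .set("spark.default.parallelism", "512")            // smaller, more numerous tasks
    val sc = new SparkContext(conf)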

saveAsTextFile() failing for large datasets

2014-03-19 Thread Soila Pertet Kavulya
I am testing the performance of Spark to see how it behaves when the dataset size exceeds the amount of memory available. I am running wordcount on a 4-node cluster (Intel Xeon 16 cores (32 threads), 256GB RAM per node). I limited spark.executor.memory to 64g, so I have 256g of memory available in
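
A minimal sketch of the setup described above, with placeholder paths and partition counts: executor memory capped at 64g, a large number of input and reduce partitions so individual tasks stay small, and the shuffle left to spill to disk on its own.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("wordcount-larger-than-memory")
      .set("spark.executor.memory", "64g")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("hdfs:///input/corpus", 2048)   // many input splits
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _, 1024)                              // many reduce partitions

    counts.saveAsTextFile("hdfs:///output/wordcounts")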