Re: Support for skewed joins in Spark

2015-03-12 Thread Soila Pertet Kavulya
9:37 GMT+08:00 Soila Pertet Kavulya skavu...@gmail.com: Does Spark support skewed joins, similar to Pig, which distributes large keys over multiple partitions? I tried using the RangePartitioner, but I am still experiencing failures because some keys are too large to fit in a single partition. I
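
[Editor's note: not from the thread itself. A minimal sketch of the RangePartitioner approach described above, to make the limitation concrete; the input paths, key extraction, and partition count are placeholders, not details from the post.]

    import org.apache.spark.{SparkConf, SparkContext, RangePartitioner}
    import org.apache.spark.SparkContext._   // pair-RDD implicits for pre-1.3 Spark

    object RangePartitionSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("range-partition-sketch"))

        // Hypothetical pair RDDs keyed by the join key; real data sets are not shown in the thread.
        val left  = sc.textFile("hdfs:///data/left").map(line => (line.split("\t")(0), line))
        val right = sc.textFile("hdfs:///data/right").map(line => (line.split("\t")(0), line))

        // RangePartitioner samples the keys and assigns contiguous key ranges to partitions.
        // A single very hot key still lands entirely in one partition, which is exactly
        // the failure mode the thread describes.
        val partitioner = new RangePartitioner(200, left)
        val joined = left.partitionBy(partitioner).join(right)
        joined.saveAsTextFile("hdfs:///data/joined")
        sc.stop()
      }
    }
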

Re: NegativeArraySizeException when doing joins on skewed data

2015-03-12 Thread Soila Pertet Kavulya
objects more than anything. I definitely don’t have more than 2B unique objects. Will try the same test on Kryo3 and see if it goes away. T On 27 February 2015 at 06:21, Soila Pertet Kavulya skavu...@gmail.com wrote: Thanks Tristan, I ran into a similar issue with broadcast variables
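
[Editor's note: not from the thread. For readers hitting the same NegativeArraySizeException with Kryo, a hedged sketch of the serializer settings that are usually adjusted first; the values are illustrative only.]

    import org.apache.spark.SparkConf

    // Sketch only: commonly tuned Kryo settings for very large serialized objects.
    val conf = new SparkConf()
      .setAppName("kryo-tuning-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Allow larger serialized objects than the default buffer permits.
      // (Spark 1.2/1.3-era key; later releases renamed it to spark.kryoserializer.buffer.max.)
      .set("spark.kryoserializer.buffer.max.mb", "512")
      // Reference tracking keeps an internal table of previously seen objects; disabling it
      // keeps that table from growing with the number of distinct objects, at the cost of
      // not handling circular references.
      .set("spark.kryo.referenceTracking", "false")
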

Support for skewed joins in Spark

2015-03-12 Thread Soila Pertet Kavulya
Does Spark support skewed joins, similar to Pig, which distributes large keys over multiple partitions? I tried using the RangePartitioner, but I am still experiencing failures because some keys are too large to fit in a single partition. I cannot use broadcast variables to work around this because
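
[Editor's note: not from the thread. Spark's RDD API had no built-in skew handling at the time; a common manual workaround is to "salt" the hot keys, replicating one side of the join and scattering the other across the replicas. A minimal sketch, assuming String keys/values and a caller-chosen replication factor (both assumptions, not from the post):]

    import scala.util.Random
    import org.apache.spark.rdd.RDD
    import org.apache.spark.SparkContext._   // pair-RDD implicits for pre-1.3 Spark

    // Salted join for skewed data: `skewed` holds the heavily skewed keys,
    // `other` is the side cheap enough to replicate `replication` times per key.
    def saltedJoin(skewed: RDD[(String, String)],
                   other: RDD[(String, String)],
                   replication: Int): RDD[(String, (String, String))] = {
      // Spread each skewed record over `replication` sub-keys at random.
      val saltedLeft = skewed.map { case (k, v) =>
        ((k, Random.nextInt(replication)), v)
      }
      // Replicate the other side once per sub-key so every salted record finds its match.
      val saltedRight = other.flatMap { case (k, v) =>
        (0 until replication).map(i => ((k, i), v))
      }
      // Join on the salted key, then drop the salt.
      saltedLeft.join(saltedRight).map { case ((k, _), vs) => (k, vs) }
    }
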

Re: Size exceeds Integer.MAX_VALUE exception when broadcasting large variable

2015-02-13 Thread Soila Pertet Kavulya
Thanks Sean and Imran, I'll try splitting the broadcast variable into smaller ones. I had tried a regular join but it was failing due to high garbage collection overhead during the shuffle. One of the RDDs is very large and has a skewed distribution where a handful of keys account for 90% of the
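
[Editor's note: not from the thread. A rough illustration of the "split the broadcast variable into smaller ones" idea mentioned above, assuming the large variable is a lookup map; the chunking scheme and types are assumptions for the sketch.]

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    // Split one huge lookup map into several smaller broadcast variables so that
    // no single broadcast block has to hold the whole thing.
    def broadcastInChunks(sc: SparkContext,
                          lookup: Map[String, String],
                          numChunks: Int): Seq[Broadcast[Map[String, String]]] =
      lookup.toSeq
        .grouped(math.max(1, lookup.size / numChunks))
        .map(chunk => sc.broadcast(chunk.toMap))
        .toSeq

    // On the executors, probe each chunk in turn until the key is found.
    def lookupKey(chunks: Seq[Broadcast[Map[String, String]]], key: String): Option[String] =
      chunks.view.flatMap(b => b.value.get(key)).headOption
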

Re: Largest input data set observed for Spark.

2014-03-20 Thread Soila Pertet Kavulya
Hi Reynold, Nice! What Spark configuration parameters did you use to get your job to run successfully on a large dataset? My job is failing on 1TB of input data (uncompressed) on a 4-node cluster (64GB memory per node). No OutOfMemory errors, just lost executors. Thanks, Soila On Mar 20, 2014
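
[Editor's note: not Reynold's actual settings, which are not in this excerpt. In that era of Spark, the knobs that usually matter at this scale are executor memory, parallelism, and the storage/shuffle memory fractions; all values below are placeholders.]

    import org.apache.spark.SparkConf

    // Illustrative settings only.
    val conf = new SparkConf()
      .setAppName("large-input-sketch")
      .set("spark.executor.memory", "48g")
      // More partitions keeps individual tasks and shuffle blocks small.
      .set("spark.default.parallelism", "1024")
      // Pre-1.6 memory model: trade cache space for shuffle space when executors
      // are being lost during the shuffle rather than while caching.
      .set("spark.storage.memoryFraction", "0.3")
      .set("spark.shuffle.memoryFraction", "0.5")
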

saveAsTextFile() failing for large datasets

2014-03-19 Thread Soila Pertet Kavulya
I am testing the performance of Spark to see how it behaves when the dataset size exceeds the amount of memory available. I am running wordcount on a 4-node cluster (Intel Xeon 16 cores (32 threads), 256GB RAM per node). I limited spark.executor.memory to 64g, so I have 256g of memory available in
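
[Editor's note: not from the thread. The test described above is standard word count followed by saveAsTextFile; a minimal sketch of that setup for reference, with placeholder paths and an illustrative partition count.]

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD implicits for pre-1.0 Spark

    object WordCountSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("wordcount-spill-test")
          .set("spark.executor.memory", "64g") // cap per-node executor memory as described
        val sc = new SparkContext(conf)

        val counts = sc.textFile("hdfs:///data/input")   // placeholder input path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))
          // Use many reduce partitions so no single partition has to hold a
          // disproportionate share of the shuffle output.
          .reduceByKey(_ + _, 2048)

        counts.saveAsTextFile("hdfs:///data/wordcounts") // placeholder output path
        sc.stop()
      }
    }
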