9:37 GMT+08:00 Soila Pertet Kavulya skavu...@gmail.com:
objects more than anything. I definitely don’t have more
than 2B unique objects.
Will try the same test on Kryo 3 and see if it goes away.
T
On 27 February 2015 at 06:21, Soila Pertet Kavulya skavu...@gmail.com wrote:
Thanks Tristan,
I ran into a similar issue with broadcast variables.
Does Spark support skewed joins similar to Pig, which distributes large
keys over multiple partitions? I tried using the RangePartitioner, but
I am still experiencing failures because some keys are too large to
fit in a single partition. I cannot use broadcast variables to
work around this because
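Pig's skewed join handles hot keys by spreading each one across several reducers. In Spark, the usual workaround for the same problem is key salting: append a random suffix to each row on the skewed side, and replicate the other side once per suffix so every salted key still finds its match. A minimal plain-Python sketch of the idea (SALTS and the toy data are invented for illustration; this is not a Spark API):

```python
import random
from collections import defaultdict

SALTS = 4  # sub-partitions per hot key (tuning knob)

# Skewed "large" side: one hot key dominates.
large = [("hot", i) for i in range(8)] + [("rare", 99)]
# Smaller side to join against.
small = [("hot", "H"), ("rare", "R")]

# Salt the large side: each row lands in a random salt bucket,
# so a hot key is spread over SALTS partitions instead of one.
salted_large = [((k, random.randrange(SALTS)), v) for k, v in large]
# Replicate the small side into every salt bucket so joins still match.
salted_small = [((k, s), v) for k, v in small for s in range(SALTS)]

# Ordinary hash join on the salted keys.
lookup = defaultdict(list)
for k, v in salted_small:
    lookup[k].append(v)
joined = [(k[0], lv, sv) for k, lv in salted_large for sv in lookup[k]]

assert len(joined) == 9  # same row count as an unsalted join
```

The cost is replicating the smaller side SALTS times, which is why the trick only pays off when one side is much smaller or the skew is severe.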
Thanks Sean and Imran,
I'll try splitting the broadcast variable into smaller ones.
I had tried a regular join but it was failing due to high garbage
collection overhead during the shuffle. One of the RDDs is very large
and has a skewed distribution where a handful of keys account for 90%
of the
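The splitting mentioned above can be sketched as hash-partitioning the lookup table into smaller pieces and broadcasting each piece separately, so no single broadcast exceeds the size limit. A plain-Python sketch (no Spark; `split_for_broadcast` and the chunk count are hypothetical names, not anything from this thread):

```python
def split_for_broadcast(table, chunk_count):
    """Partition a dict into chunk_count smaller dicts by key hash,
    so each piece stays under the broadcast size limit."""
    chunks = [dict() for _ in range(chunk_count)]
    for k, v in table.items():
        chunks[hash(k) % chunk_count][k] = v
    return chunks

table = {i: i * i for i in range(100)}
chunks = split_for_broadcast(table, 4)

# No keys are lost, and a lookup routes to its chunk the same way
# it was assigned, so each task only needs the one relevant piece.
assert sum(len(c) for c in chunks) == 100
assert chunks[hash(7) % 4][7] == 49
```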
Hi Reynold,
Nice! What Spark configuration parameters did you use to get your job to
run successfully on a large dataset? My job is failing on 1TB of input data
(uncompressed) on a 4-node cluster (64GB memory per node). No OutOfMemory
errors, just lost executors.
Thanks,
Soila
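For context, the knobs usually involved in a shuffle-heavy Spark 1.x job look like this (illustrative values only, not the configuration actually used in this thread; `your_job.jar` is a placeholder):

```shell
# Hypothetical Spark 1.x submission for a shuffle-heavy job:
# more, smaller shuffle partitions; less cache, more shuffle headroom.
spark-submit \
  --executor-memory 48g \
  --conf spark.default.parallelism=512 \
  --conf spark.storage.memoryFraction=0.3 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  your_job.jar
```

Lost executors with no OutOfMemoryError often point at the OS or cluster manager killing the JVM for exceeding its memory allocation rather than at heap exhaustion, which is why the executor heap above is set below the per-node total.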
On Mar 20, 2014
I am testing the performance of Spark to see how it behaves when the
dataset size exceeds the amount of memory available. I am running
wordcount on a 4-node cluster (Intel Xeon 16 cores (32 threads), 256GB
RAM per node). I limited spark.executor.memory to 64g, so I have 256g
of memory available in
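For reference, the benchmark workload is plain wordcount; the computation itself, stripped of any Spark machinery, is just:

```python
from collections import Counter

def wordcount(lines):
    """Classic wordcount: tally whitespace-separated tokens per line."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

counts = wordcount(["to be or not to be", "be quick"])
assert counts["be"] == 3
assert counts["to"] == 2
```

What the benchmark actually stresses is the shuffle and spill behavior once the token counts no longer fit in memory, not the counting logic itself.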