Thanks Shixiong,
I'll try out your PR. Do you know what the status of the PR is? Are
there any plans to incorporate this change into the
DataFrames/SchemaRDDs in Spark 1.3?
Soila
On Thu, Mar 12, 2015 at 7:52 PM, Shixiong Zhu zsxw...@gmail.com wrote:
I sent a PR to add skewed join last year
Hi Tristan,
Did upgrading to Kryo3 help?
Thanks,
Soila
On Sun, Mar 1, 2015 at 2:48 PM, Tristan Blakers tris...@blackfrog.org wrote:
Yeah I implemented the same solution. It seems to kick in around the 4B
mark, but looking at the log I suspect it’s probably a function of the
number of unique
Does Spark support skewed joins similar to Pig's, which distributes large
keys across multiple partitions? I tried using the RangePartitioner, but
I am still experiencing failures because some keys are too large to
fit in a single partition. I cannot use broadcast variables to
work around this because
I have been running into NegativeArraySizeExceptions when doing joins on
data with very skewed key distributions in Spark 1.2.0. I found a previous
post that mentioned that this exception arises when the size of the blocks
spilled during the shuffle exceeds 2GB. The post recommended increasing
I am trying to broadcast a large 5GB variable using Spark 1.2.0. I get the
following exception when the size of the broadcast variable exceeds 2GB. Any
ideas on how I can resolve this issue?
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at
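Since JVM arrays are indexed by a 32-bit int, a single broadcast payload cannot exceed Integer.MAX_VALUE bytes. One workaround is to split the variable into chunks that each stay under the limit and broadcast the chunks separately. A minimal sketch of just the chunking logic in plain Python (the tiny `CHUNK` size is purely illustrative; the real limit is ~2 GB):

```python
CHUNK = 4  # stand-in for the ~2 GB per-array cap; tiny here for illustration

def split_chunks(data: bytes, chunk_size: int = CHUNK):
    # Slice a large payload into pieces that each fit under the
    # per-array size limit, so each piece can be broadcast on its own.
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def reassemble(chunks):
    # On the executor side, concatenate the chunks back into the
    # original payload before use.
    return b"".join(chunks)

payload = b"0123456789"
chunks = split_chunks(payload)
restored = reassemble(chunks)
```

The trade-off is extra serialization and bookkeeping per chunk, but it sidesteps the single-array ceiling entirely.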
% of the data. Do you have any pointers on how to handle skewed key
distributions during a join?
Soila
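For readers following the thread: the standard workaround for a skewed join (and roughly what Pig's skewed join does) is to "salt" the hot keys, spreading the large side over several buckets and replicating the small side to match. A minimal sketch in plain Python, with no Spark dependency; `N_SALTS`, the datasets, and the hot-key set are all made-up illustrations:

```python
import random

N_SALTS = 4  # replication factor for hot keys (illustrative)

def salt_left(records, hot_keys):
    # Large, skewed side: append a random salt to each hot key so its
    # rows spread across N_SALTS buckets instead of one partition.
    for key, value in records:
        salt = random.randrange(N_SALTS) if key in hot_keys else 0
        yield ((key, salt), value)

def salt_right(records, hot_keys):
    # Small side: replicate each hot-key row once per salt value so
    # every salted bucket on the left still finds its match.
    for key, value in records:
        salts = range(N_SALTS) if key in hot_keys else [0]
        for s in salts:
            yield ((key, s), value)

def join(left, right):
    # Ordinary hash join on the (key, salt) pairs.
    table = {}
    for k, v in right:
        table.setdefault(k, []).append(v)
    for k, v in left:
        for w in table.get(k, []):
            yield (k[0], (v, w))  # strip the salt on output

left = [("hot", i) for i in range(8)] + [("cold", 0)]
right = [("hot", "x"), ("cold", "y")]
out = sorted(join(list(salt_left(left, {"hot"})),
                  list(salt_right(right, {"hot"}))))
```

The same idea maps onto Spark by salting the keys before the join and stripping the salt afterwards; the cost is replicating the small side's hot rows N_SALTS times.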
On Fri, Feb 13, 2015 at 10:49 AM, Imran Rashid iras...@cloudera.com wrote:
Unfortunately this is a known issue:
https://issues.apache.org/jira/browse/SPARK-1476
as Sean suggested, you need to think
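A practical consequence of the 2 GB block limit in SPARK-1476: the partition count has to be high enough that no single shuffle block approaches Integer.MAX_VALUE bytes. A back-of-the-envelope helper in plain Python, assuming roughly even data distribution (which skewed keys of course violate, hence the safety margin):

```python
INT_MAX = 2**31 - 1  # JVM arrays are int-indexed, hence the ~2 GB cap

def min_partitions(total_bytes: int, safety: float = 0.5) -> int:
    # Smallest partition count that keeps the *average* partition
    # well under the 2 GB limit; `safety` leaves headroom for skew.
    target = int(INT_MAX * safety)
    return -(-total_bytes // target)  # ceiling division

# ~1 TB of shuffle data at 50% headroom:
parts = min_partitions(10**12)
```

With heavy skew the average is misleading, which is why the thread keeps coming back to restructuring the join itself rather than just raising the partition count.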
Hi Reynold,
Nice! What Spark configuration parameters did you use to get your job to
run successfully on a large dataset? My job is failing on 1TB of input data
(uncompressed) on a 4-node cluster (64GB memory per node): no OutOfMemory
errors, just lost executors.
Thanks,
Soila
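Lost executors with no OutOfMemoryError often mean the cluster manager killed the container for exceeding its memory allotment, since the JVM heap is not the whole footprint. A rough sizing sketch in plain Python; the 7% / 384 MB overhead default is an assumption based on Spark 1.x-era YARN behaviour, so treat the numbers as illustrative:

```python
def container_request_mb(heap_mb: int, overhead_frac: float = 0.07,
                         overhead_min_mb: int = 384) -> int:
    # Total memory the cluster manager must grant per executor:
    # JVM heap plus off-heap overhead (assumed 7%, min 384 MB,
    # per Spark 1.x YARN defaults).
    return heap_mb + max(int(heap_mb * overhead_frac), overhead_min_mb)

# A 60 GB heap really requests ~64.2 GB per container, which leaves
# no room on a 64 GB node for the OS or other daemons.
request = container_request_mb(60 * 1024)
```

So on a 64 GB node, asking for most of the node as heap can get the container killed even though the heap itself never fills, which matches the "no OOM, just lost executors" symptom.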
On Mar 20, 2014 11
I am testing the performance of Spark to see how it behaves when the
dataset size exceeds the amount of memory available. I am running
wordcount on a 4-node cluster (Intel Xeon 16 cores (32 threads), 256GB
RAM per node). I limited spark.executor.memory to 64g, so I have 256g
of memory available in
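The benchmark described above is standard wordcount. For reference, a minimal in-memory version of the same computation in plain Python; the Spark version would be the usual flatMap/map/reduceByKey pipeline, and this sketch only illustrates the reduction being measured:

```python
from collections import Counter

def wordcount(lines):
    # Split each line on whitespace and tally occurrences -- the same
    # per-key reduction Spark performs with reduceByKey.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)

counts = wordcount(["to be or not to be", "be"])
```

In the distributed setting the interesting part is not the tally itself but how the shuffle behaves once the counted data no longer fits in the 256 GB of configured executor memory.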