Thanks Shixiong, I'll try out your PR. Do you know its current status? Are there any plans to incorporate this change into DataFrames/SchemaRDDs in Spark 1.3?
Soila

On Thu, Mar 12, 2015 at 7:52 PM, Shixiong Zhu <zsxw...@gmail.com> wrote:
> I sent a PR to add skewed join last year:
> https://github.com/apache/spark/pull/3505
> However, it does not split a key across multiple partitions. Instead, if a key
> has too many values to fit in memory, it stores the values on disk
> temporarily and uses the disk files to do the join.
>
> Best Regards,
>
> Shixiong Zhu
>
> 2015-03-13 9:37 GMT+08:00 Soila Pertet Kavulya <skavu...@gmail.com>:
>>
>> Does Spark support skewed joins similar to Pig, which distributes large
>> keys over multiple partitions? I tried using the RangePartitioner, but
>> I am still experiencing failures because some keys are too large to
>> fit in a single partition. I cannot use broadcast variables to
>> work around this because both RDDs are too large to fit in driver
>> memory.
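
For reference, one common way to approximate a Pig-style skewed join on plain RDDs is key "salting": append a random salt to each key on the skewed side so a hot key is spread over several partitions, and replicate the other side once per salt value so matches are preserved. The sketch below is a hypothetical helper (saltedJoin and saltFactor are illustrative names, not part of Spark or of PR 3505, which spills to disk instead); it assumes the non-skewed side is small enough that replicating it saltFactor times is acceptable.

import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.rdd.RDD

// Salt the skewed side's keys so one hot key maps to saltFactor distinct
// composite keys, then replicate the other (smaller) side once per salt
// so every salted key still finds all of its matches.
def saltedJoin[K: ClassTag, V: ClassTag, W: ClassTag](
    skewed: RDD[(K, V)],
    other: RDD[(K, W)],
    saltFactor: Int): RDD[(K, (V, W))] = {
  // Skewed side: attach a random salt in [0, saltFactor) to each key.
  val saltedLeft = skewed.map { case (k, v) =>
    ((k, Random.nextInt(saltFactor)), v)
  }
  // Other side: emit one copy of each record per salt value.
  val saltedRight = other.flatMap { case (k, w) =>
    (0 until saltFactor).map(salt => ((k, salt), w))
  }
  // Join on (key, salt), then drop the salt from the result.
  saltedLeft.join(saltedRight).map { case ((k, _), vw) => (k, vw) }
}

Note the trade-off: this multiplies the replicated side's shuffle volume by saltFactor for every key, whereas Pig's skewed join first samples the data to find the hot keys and only splits those, so salting is mainly a win when the skewed keys dominate the join.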