I sent a PR to add a skewed join last year: https://github.com/apache/spark/pull/3505 However, it does not split a key across multiple partitions. Instead, if a key has too many values to fit in memory, it will spill the values to disk temporarily and use the disk files to do the join.
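Until something like that is merged, a common application-level workaround is to salt the skewed keys yourself so the values of a hot key spread over several partitions. A minimal sketch in Scala (the data, key names, and salt factor are illustrative, not part of the PR):

import scala.util.Random
import org.apache.spark.{SparkConf, SparkContext}

object SaltedJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("salted-join").setMaster("local[*]"))
    val numSalts = 8 // illustrative fan-out; tune to the observed skew

    // Toy data: the key "hot" is heavily skewed on the left side.
    val left  = sc.parallelize(Seq.fill(100000)(("hot", 1)) ++ Seq(("cold", 2)))
    val right = sc.parallelize(Seq(("hot", "x"), ("cold", "y")))

    // Salt the skewed side: "hot" becomes ("hot", 0) .. ("hot", 7),
    // so its values hash to up to numSalts different partitions.
    val saltedLeft = left.map { case (k, v) =>
      ((k, Random.nextInt(numSalts)), v)
    }

    // Replicate the other side once per salt so every sub-key finds its match.
    val saltedRight = right.flatMap { case (k, v) =>
      (0 until numSalts).map(i => ((k, i), v))
    }

    // Join on the salted key, then drop the salt to recover the original key.
    val joined = saltedLeft.join(saltedRight)
      .map { case ((k, _), pair) => (k, pair) }

    joined.take(5).foreach(println)
    sc.stop()
  }
}

The trade-off is that the replicated side grows by a factor of numSalts, so this only helps when that side has few rows per key relative to the skewed side.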
Best Regards,
Shixiong Zhu

2015-03-13 9:37 GMT+08:00 Soila Pertet Kavulya <skavu...@gmail.com>:
> Does Spark support skewed joins similar to Pig, which distributes large
> keys over multiple partitions? I tried using the RangePartitioner, but
> I am still experiencing failures because some keys are too large to
> fit in a single partition. I cannot use broadcast variables to
> work around this because both RDDs are too large to fit in driver
> memory.