Hello Soila,

Can you share code that shows usage of RangePartitioner? I am facing an issue with .join() where one task runs forever. I tried repartition(100/200/300/1200) and it did not help. I cannot use a map-side join because both datasets are huge and exceed the driver's memory.

Regards,
Deepak
On Fri, Mar 13, 2015 at 9:54 AM, Soila Pertet Kavulya <skavu...@gmail.com> wrote:
> Thanks Shixiong,
>
> I'll try out your PR. Do you know what the status of the PR is? Are
> there any plans to incorporate this change into the
> DataFrames/SchemaRDDs in Spark 1.3?
>
> Soila
>
> On Thu, Mar 12, 2015 at 7:52 PM, Shixiong Zhu <zsxw...@gmail.com> wrote:
> > I sent a PR to add skewed join last year:
> > https://github.com/apache/spark/pull/3505
> > However, it does not split a key across multiple partitions. Instead, if a key
> > has too many values to fit into memory, it stores the
> > values on disk temporarily and uses the disk files to do the join.
> >
> > Best Regards,
> >
> > Shixiong Zhu
> >
> > 2015-03-13 9:37 GMT+08:00 Soila Pertet Kavulya <skavu...@gmail.com>:
> >> Does Spark support skewed joins similar to Pig, which distributes large
> >> keys over multiple partitions? I tried using the RangePartitioner but
> >> I am still experiencing failures because some keys are too large to
> >> fit in a single partition. I cannot use broadcast variables to
> >> work around this because both RDDs are too large to fit in driver
> >> memory.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org

--
Deepak
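[Editor's note: the workaround the thread circles around — splitting one hot key across multiple partitions — is usually done by "salting" the key. The sketch below is not from the thread or from Shixiong's PR; it demonstrates the idea on plain Scala collections, with names and the salt factor invented for illustration. In Spark the same two map/flatMap transformations would be applied to the two RDDs before calling .join(), so no single partition receives all values of a hot key.]

```scala
import scala.util.Random

object SaltedJoin {
  // Sketch of a salted join: each key on the skewed side gets a random
  // suffix in [0, numSalts), turning one hot key into numSalts sub-keys;
  // the other side is replicated once per suffix so every sub-key matches.
  def saltedJoin(left: Seq[(String, Int)],      // skewed side
                 right: Seq[(String, String)],  // replicated side
                 numSalts: Int): Seq[(String, (Int, String))] = {
    val rnd = new Random(42) // fixed seed only to keep the sketch deterministic

    // Salt the skewed side: one random suffix per record.
    val saltedLeft = left.map { case (k, v) =>
      (s"$k#${rnd.nextInt(numSalts)}", v)
    }

    // Replicate the other side once per possible salt value.
    val saltedRight = right.flatMap { case (k, v) =>
      (0 until numSalts).map(i => (s"$k#$i", v))
    }

    // Plain hash join on the salted keys (stand-in for RDD.join),
    // stripping the salt from the output key afterwards.
    val rightMap = saltedRight.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }
    for {
      (k, lv) <- saltedLeft
      rv      <- rightMap.getOrElse(k, Nil)
    } yield (k.takeWhile(_ != '#'), (lv, rv))
  }
}
```

The cost of this approach is replicating the non-skewed side numSalts times, which is still far cheaper than broadcasting it when both datasets exceed driver memory.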