Re: Support for skewed joins in Spark

2015-05-04 Thread ๏̯͡๏
Hello Soila,
Can you share the code that shows usage of RangePartitioner?
I am facing an issue with .join() where one task runs forever. I tried
repartition() with 100/200/300/1200 partitions and it did not help. I cannot
use a map-side join because both datasets are huge and exceed the driver
memory size.
Regards,
Deepak
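[Editor's note: a common workaround for one runaway join task, not mentioned in this thread, is key salting: replicate the smaller side's rows for each hot key N times and append a random suffix to the larger side's hot keys, so one key's work spreads across N partitions. Below is a minimal sketch of the idea on plain Python lists; the data, the hot-key set, and N_SALTS are all hypothetical, and the plain hash join stands in for RDD.join.]

```python
import random

N_SALTS = 4  # number of sub-keys to split each hot key into (assumption)

def salt_big_side(rows, hot_keys):
    """Append a random salt to hot keys on the large table."""
    out = []
    for key, value in rows:
        if key in hot_keys:
            out.append(((key, random.randrange(N_SALTS)), value))
        else:
            out.append(((key, 0), value))
    return out

def replicate_small_side(rows, hot_keys):
    """Replicate hot-key rows once per salt so every salted key finds a match."""
    out = []
    for key, value in rows:
        salts = range(N_SALTS) if key in hot_keys else [0]
        for s in salts:
            out.append(((key, s), value))
    return out

def join(left, right):
    """Plain hash join on the salted keys (stands in for RDD.join)."""
    index = {}
    for k, v in right:
        index.setdefault(k, []).append(v)
    return [(k[0], (lv, rv)) for k, lv in left for rv in index.get(k, [])]

big = [("a", 1), ("a", 2), ("a", 3), ("b", 4)]
small = [("a", "x"), ("b", "y")]
hot = {"a"}
result = join(salt_big_side(big, hot), replicate_small_side(small, hot))
# Every ("a", n) row still joins with ("a", "x"), but the salted keys
# would hash to up to N_SALTS different partitions in a real shuffle.
```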

On Fri, Mar 13, 2015 at 9:54 AM, Soila Pertet Kavulya skavu...@gmail.com
wrote:

 Thanks Shixiong,

 I'll try out your PR. Do you know what its status is? Are
 there any plans to incorporate this change into
 DataFrames/SchemaRDDs in Spark 1.3?

 Soila

 On Thu, Mar 12, 2015 at 7:52 PM, Shixiong Zhu zsxw...@gmail.com wrote:
  I sent a PR to add skewed join last year:
  https://github.com/apache/spark/pull/3505
  However, it does not split a key across multiple partitions. Instead, if a
  key has too many values to fit in memory, it will temporarily spill the
  values to disk and use the disk files to do the join.
 
  Best Regards,
 
  Shixiong Zhu
 
  2015-03-13 9:37 GMT+08:00 Soila Pertet Kavulya skavu...@gmail.com:
 
  Does Spark support skewed joins similar to Pig's, which distribute large
  keys over multiple partitions? I tried using the RangePartitioner, but
  I am still experiencing failures because some keys are too large to
  fit in a single partition. I cannot use broadcast variables to
  work around this because both RDDs are too large to fit in driver
  memory.
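[Editor's note: the broadcast (map-side) join ruled out above only works when one side fits in memory, because that entire side is collected and handed to every task. A toy sketch of the mechanics on plain Python lists, with hypothetical data; a real broadcast also copies the lookup table to every executor:]

```python
def broadcast_join(big_partitions, small):
    """Map-side join: the whole small side is materialized as a dict
    (the step that requires it to fit in driver/executor memory) and
    joined against each partition of the big side with no shuffle."""
    lookup = dict(small)  # entire small side held in memory at once
    return [
        [(k, (v, lookup[k])) for k, v in part if k in lookup]
        for part in big_partitions
    ]

parts = [[("a", 1), ("c", 2)], [("a", 3), ("b", 4)]]  # big side, pre-partitioned
small = [("a", "x"), ("b", "y")]
result = broadcast_join(parts, small)
# Each partition is joined independently; no data moves between partitions.
```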
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 
Deepak


Re: Support for skewed joins in Spark

2015-03-12 Thread Shixiong Zhu
I sent a PR to add skewed join last year:
https://github.com/apache/spark/pull/3505
However, it does not split a key across multiple partitions. Instead, if a
key has too many values to fit in memory, it will temporarily spill the
values to disk and use the disk files to do the join.
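[Editor's note: the spill strategy described above can be sketched in a few lines. This is a toy illustration of the general technique, not the PR's actual code: one side is grouped by key, and any key whose value list grows past a threshold is moved to a temporary file and streamed from disk during the join. The threshold and data are hypothetical.]

```python
import json
import tempfile

SPILL_THRESHOLD = 2  # max in-memory values per key before spilling (toy value)

def build_side(rows):
    """Group values by key, spilling oversized keys to a temp file."""
    in_mem, spilled = {}, {}
    for key, value in rows:
        if key in spilled:
            spilled[key].write(json.dumps(value) + "\n")
        else:
            in_mem.setdefault(key, []).append(value)
            if len(in_mem[key]) > SPILL_THRESHOLD:
                f = tempfile.TemporaryFile(mode="w+")
                for v in in_mem.pop(key):  # move existing values to disk
                    f.write(json.dumps(v) + "\n")
                spilled[key] = f

    return in_mem, spilled

def stream_values(key, in_mem, spilled):
    """Iterate a key's values whether they live in memory or on disk."""
    if key in spilled:
        f = spilled[key]
        f.flush()
        f.seek(0)
        for line in f:
            yield json.loads(line)
    else:
        yield from in_mem.get(key, [])

def skew_tolerant_join(left, right):
    """Join that never holds more than SPILL_THRESHOLD values per key in memory."""
    in_mem, spilled = build_side(right)
    return [(k, (lv, rv)) for k, lv in left
            for rv in stream_values(k, in_mem, spilled)]

big = [("a", i) for i in range(5)] + [("b", 9)]  # "a" is the skewed key
small = [("a", "L"), ("b", "M")]
joined = skew_tolerant_join(small, big)
```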

Best Regards,
Shixiong Zhu

2015-03-13 9:37 GMT+08:00 Soila Pertet Kavulya skavu...@gmail.com:

 Does Spark support skewed joins similar to Pig's, which distribute large
 keys over multiple partitions? I tried using the RangePartitioner, but
 I am still experiencing failures because some keys are too large to
 fit in a single partition. I cannot use broadcast variables to
 work around this because both RDDs are too large to fit in driver
 memory.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Support for skewed joins in Spark

2015-03-12 Thread Soila Pertet Kavulya
Thanks Shixiong,

I'll try out your PR. Do you know what its status is? Are
there any plans to incorporate this change into
DataFrames/SchemaRDDs in Spark 1.3?

Soila

On Thu, Mar 12, 2015 at 7:52 PM, Shixiong Zhu zsxw...@gmail.com wrote:
 I sent a PR to add skewed join last year:
 https://github.com/apache/spark/pull/3505
 However, it does not split a key across multiple partitions. Instead, if a
 key has too many values to fit in memory, it will temporarily spill the
 values to disk and use the disk files to do the join.

 Best Regards,

 Shixiong Zhu

 2015-03-13 9:37 GMT+08:00 Soila Pertet Kavulya skavu...@gmail.com:

 Does Spark support skewed joins similar to Pig's, which distribute large
 keys over multiple partitions? I tried using the RangePartitioner, but
 I am still experiencing failures because some keys are too large to
 fit in a single partition. I cannot use broadcast variables to
 work around this because both RDDs are too large to fit in driver
 memory.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Support for skewed joins in Spark

2015-03-12 Thread Soila Pertet Kavulya
Does Spark support skewed joins similar to Pig's, which distribute large
keys over multiple partitions? I tried using the RangePartitioner, but
I am still experiencing failures because some keys are too large to
fit in a single partition. I cannot use broadcast variables to
work around this because both RDDs are too large to fit in driver
memory.
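[Editor's note: the RangePartitioner failure described above is inherent to range partitioning. A range partitioner compares each record's key against sorted split boundaries, so every record with the same key necessarily lands in the same partition; no boundary placement can split a single hot key. A simplified sketch with hypothetical boundaries:]

```python
import bisect

def make_range_partitioner(boundaries):
    """boundaries: sorted upper-bound keys; yields len(boundaries)+1 partitions."""
    def partition(key):
        # Equal keys always compare the same way against every boundary,
        # so they all receive the same partition index.
        return bisect.bisect_left(boundaries, key)
    return partition

part = make_range_partitioner(["g", "n", "t"])  # 4 partitions (toy boundaries)

keys = ["a", "hot", "hot", "hot", "z"]
assignments = [part(k) for k in keys]
# All copies of "hot" map to one partition, so a single skewed key
# still overloads one task regardless of how boundaries are chosen.
```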

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org