The point is that if you have skewed data, a single reducer will end up
taking a very long time. You do not even need to try this yourself; a quick
search will show that skewed data is a well-known problem for joins, even in
Spark.

Therefore, instead of using a join, if the use case permits, just write a
UDF that works as a lookup. Using a broadcast variable is the Spark way, and
someone here mentioned using Redis, which I remember was the workaround back
in 2011 in the early days of MapReduce.
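
For example, a minimal PySpark sketch of the broadcast-lookup idea, using
the a/b DataFrames from the thread below (this assumes the small side fits
comfortably in driver and executor memory; names and types are taken from
the original question):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Collect the small side (b: id, Number) to the driver and broadcast it.
number_by_id = {row['id']: row['Number'] for row in b.collect()}
number_bc = spark.sparkContext.broadcast(number_by_id)

# UDF that looks up Number for a given id in the broadcast dict.
lookup_number = udf(lambda i: number_bc.value.get(i), IntegerType())

# No shuffle join needed; each executor reads from its local copy.
result = a.withColumn('Number', lookup_number(a['id']))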

Regards,
Gourav

On Thu, Aug 11, 2016 at 9:24 PM, Ben Teeuwen <bteeu...@gmail.com> wrote:

> Hmm, hashing will probably send all of the records with the same key to
> the same partition / machine.
> I’d try it out and hope that, if you have a few super-large keys bigger
> than the RAM available on one node, they spill to disk. Maybe play with
> persist() and a different Storage Level.
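>
> A rough sketch of that (assuming PySpark; MEMORY_AND_DISK is just one
> storage level to try, and the path placeholder is from my earlier mail):
>
> from pyspark import StorageLevel
>
> a = spark.read.parquet('path_to_table_a').repartition('id')
> a.persist(StorageLevel.MEMORY_AND_DISK)
> a.count()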
>
> On Aug 11, 2016, at 9:48 PM, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
> Hi Ben,
>
> and that will take care of skewed data?
>
> Gourav
>
> On Thu, Aug 11, 2016 at 8:41 PM, Ben Teeuwen <bteeu...@gmail.com> wrote:
>
>> When you read both ‘a’ and ‘b’, can you try repartitioning both by column
>> ‘id’?
>> If you then .cache() and .count() to force the shuffle, the records that
>> will be joined end up on the same executors.
>>
>> So:
>> a = spark.read.parquet('path_to_table_a').repartition('id').cache()
>> a.count()
>>
>> b = spark.read.parquet('path_to_table_b').repartition('id').cache()
>> b.count()
>>
>> And then join..
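>>
>> A minimal sketch of that last step (same assumed PySpark style; the
>> right-outer join type is taken from the original question):
>> joined = a.join(b, 'id', 'right_outer')
>> joined.count()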
>>
>>
>> On Aug 8, 2016, at 8:17 PM, Ashic Mahtab <as...@live.com> wrote:
>>
>> Hello,
>> We have two parquet inputs of the following form:
>>
>> a: id:String, Name:String  (1.5TB)
>> b: id:String, Number:Int  (1.3GB)
>>
>> We need to join these two to get (id, Number, Name). We've tried two
>> approaches:
>>
>> a.join(b, Seq("id"), "right_outer")
>>
>> where a and b are DataFrames. We also tried taking the RDDs, mapping them
>> to pair RDDs with id as the key, and then joining. What we're seeing is
>> that temp file usage keeps increasing during the join stage and filling up
>> our disks, causing the job to crash. Is there a way to join these two data
>> sets without, well... crashing?
>>
>> Note, the ids are unique, and there is a one-to-one mapping between the
>> two datasets.
>>
>> Any help would be appreciated.
>>
>> -Ashic.
>>
>>
>>
>
>
