Hi Sonal

Thank you for the response, but since we are joining to reference data,
different partitions of the application data would need to join with the
same reference data, so I am not sure a Spark join would be a good fit
for this.

E.g., our application data has a person with a zip code, and the reference
data has attributes of that zip code (city, state, etc.). Person objects in
different partitions of the Spark cluster may refer to the same zip, so if I
partition our application data by zip there will be a lot of shuffling, and
then later, for our application code, we would have to repartition on
another key, shuffling the whole application dataset again.

I think it will not be a good idea.
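For context, here is a minimal sketch (in Scala, against a local SparkContext) of the co-partitioned join being discussed. The zip/person data and the `ZipJoinSketch` name are made up for illustration. It shows the cost I am describing: `partitionBy` shuffles the whole application dataset once so the join itself is partition-local, and any later repartitioning on a different key would shuffle it all again.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits (needed on Spark 1.x)

object ZipJoinSketch {
  // Returns (zip, (personId, (city, state))) tuples from a co-partitioned join.
  def run(sc: SparkContext): Array[(String, (String, (String, String)))] = {
    // Hypothetical application data: (zipCode, personId).
    val people = sc.parallelize(Seq(
      ("94103", "alice"), ("10001", "bob"), ("94103", "carol")))
    // Hypothetical reference data: (zipCode, (city, state)).
    val zips = sc.parallelize(Seq(
      ("94103", ("San Francisco", "CA")), ("10001", ("New York", "NY"))))

    // Hash-partition both RDDs the same way so the join is partition-local:
    // each zip's people and attributes land in the same partition, and the
    // shuffle happens once, at partitionBy time, not inside the join.
    val partitioner = new HashPartitioner(4)
    val peopleByZip = people.partitionBy(partitioner)
    val zipsByZip   = zips.partitionBy(partitioner)

    peopleByZip.join(zipsByZip).collect()
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("zip-join-sketch").setMaster("local[2]"))
    try run(sc).foreach(println) finally sc.stop()
  }
}
```

Repartitioning the same people RDD afterwards on a different key (say, personId) would trigger a second full shuffle, which is the overhead I want to avoid.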

Thanks
Ankur
On Oct 16, 2014 11:06 PM, "Sonal Goyal" <sonalgoy...@gmail.com> wrote:

> Hi Ankur,
>
> If your RDDs have common keys, you can look at partitioning both your
> datasets using a custom partitioner based on those keys, so that you can
> avoid shuffling and optimize join performance.
>
> HTH
>
> Best Regards,
> Sonal
> Nube Technologies <http://www.nubetech.co>
>
> <http://in.linkedin.com/in/sonalgoyal>
>
>
>
> On Fri, Oct 17, 2014 at 4:27 AM, Ankur Srivastava <
> ankur.srivast...@gmail.com> wrote:
>
>> Hi,
>>
>> I have an RDD which is my application data and is huge. I want to join
>> this with reference data which is also too large to fit in memory, so I do
>> not want to use a broadcast variable.
>>
>> What other options do I have to perform such joins?
>>
>> I am using Cassandra as my data store, so should I just query Cassandra
>> to get the reference data needed?
>>
>> Also, when I join two RDDs, will it result in an RDD scan, or will Spark
>> hash-partition the two RDDs to get the data with the same keys onto the
>> same node?
>>
>> Thanks
>> Ankur
>>
>
>
