Hi Ankur,

If your RDDs have common keys, you can look at partitioning both datasets
with the same key-based partitioner so that records with matching keys land
in the same partitions; the join can then avoid a shuffle and perform much
better.
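
For example, something along these lines (just a minimal sketch; the RDD
names, sample data, and partition count are illustrative, and a plain
HashPartitioner stands in for whatever key-based partitioner suits your
data):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object CopartitionedJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("copartitioned-join"))

    // Illustrative key/value RDDs; in practice you would load these from
    // your data store (e.g. Cassandra) instead of parallelized collections.
    val appData = sc.parallelize(Seq(("id1", "app-record-1"), ("id2", "app-record-2")))
    val refData = sc.parallelize(Seq(("id1", "ref-record-1"), ("id2", "ref-record-2")))

    // Partition both RDDs with the same partitioner and persist them, so the
    // shuffle happens once up front; the subsequent join sees co-partitioned
    // inputs and does not re-shuffle either side.
    val partitioner = new HashPartitioner(100)
    val appByKey = appData.partitionBy(partitioner).persist()
    val refByKey = refData.partitionBy(partitioner).persist()

    val joined = appByKey.join(refByKey)
    joined.take(5).foreach(println)

    sc.stop()
  }
}

Since both sides share the same partitioner, only the upfront partitionBy
moves data over the network; the join itself runs on co-located partitions.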

HTH

Best Regards,
Sonal
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>



On Fri, Oct 17, 2014 at 4:27 AM, Ankur Srivastava <
ankur.srivast...@gmail.com> wrote:

> Hi,
>
> I have an RDD which is my application data and is huge. I want to join this
> with reference data which is also too huge to fit in memory, and thus I do
> not want to use a broadcast variable.
>
> What other options do I have to perform such joins?
>
> I am using Cassandra as my data store, so should I just query Cassandra to
> get the reference data needed?
>
> Also, when I join two RDDs, will it result in a full RDD scan, or will Spark
> hash-partition the two RDDs so that data with the same keys ends up on the
> same node?
>
> Thanks
> Ankur
>
