Hi Ankur,

If your RDDs share common keys, you can partition both datasets with the same custom partitioner (keyed on those fields) so that matching keys land in the same partition on the same node. That lets the join proceed locally, avoiding a shuffle and improving join performance.
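To make the idea concrete, here is a small plain-Python sketch (not Spark API) of why co-partitioning avoids a shuffle: if both datasets are bucketed by the same hash function into the same number of partitions, matching keys are guaranteed to sit in the same bucket, so the join never needs to move data between partitions. In Spark this corresponds to calling `partitionBy(new HashPartitioner(n))` on both RDDs (and persisting them) before the join; the partition count and data below are made up for illustration.

```python
NUM_PARTITIONS = 4  # hypothetical partition count

def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Same idea as Spark's HashPartitioner: partition = hash(key) mod n."""
    return hash(key) % num_partitions

def partition_by_key(pairs, num_partitions=NUM_PARTITIONS):
    """Bucket (key, value) pairs the way a hash partitioner would."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[partition_of(key, num_partitions)].append((key, value))
    return parts

def local_join(parts_a, parts_b):
    """Join partition-by-partition: no data crosses partition boundaries,
    because identical keys were hashed to identical partition indices."""
    joined = []
    for pa, pb in zip(parts_a, parts_b):
        lookup = {}
        for k, v in pb:
            lookup.setdefault(k, []).append(v)
        for k, v in pa:
            for w in lookup.get(k, []):
                joined.append((k, (v, w)))
    return joined

# Illustrative stand-ins for the application data and reference data RDDs.
app_data = [("u1", "eventA"), ("u2", "eventB"), ("u3", "eventC")]
ref_data = [("u1", "profile1"), ("u2", "profile2")]

parts_app = partition_by_key(app_data)
parts_ref = partition_by_key(ref_data)
result = local_join(parts_app, parts_ref)
```

The key point is that `partition_by_key` is applied with the same partitioner to both sides; in Spark, if the two RDDs report the same `partitioner`, the `join` is shuffle-free.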
HTH

Best Regards,
Sonal
Nube Technologies <http://www.nubetech.co>
<http://in.linkedin.com/in/sonalgoyal>

On Fri, Oct 17, 2014 at 4:27 AM, Ankur Srivastava <ankur.srivast...@gmail.com> wrote:

> Hi,
>
> I have an RDD which is my application data and is huge. I want to join this
> with reference data which is also too huge to fit in memory, so I do not
> want to use a broadcast variable.
>
> What other options do I have to perform such joins?
>
> I am using Cassandra as my data store, so should I just query Cassandra to
> get the reference data needed?
>
> Also, when I join two RDDs, will it result in an RDD scan, or will Spark do
> a hash partition on the two RDDs to get the data with the same keys on the
> same node?
>
> Thanks
> Ankur