Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-15 Thread Deepak Jain
The leftOuterJoin and join APIs are super slow in Spark, 100x slower than Hadoop.
On 14-Jul-2015, at 10:59 PM, Wush Wu wush...@gmail.com wrote: I don't understand. By the way, `joinWithCassandraTable` does improve my query time from 40 mins to 3 mins. 2015-07-15

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-15 Thread Wush Wu
Dear Sujit, Thanks for your suggestion. After testing, `joinWithCassandraTable` does the trick, as you mentioned. With it, rdd2 only queries the data that has the same key as a record in rdd1. Best, Wush
2015-07-16 0:00 GMT+08:00 Sujit Pal sujitatgt...@gmail.com: Hi Wush, One option may be to
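
A minimal sketch of the `joinWithCassandraTable` approach described in this message, assuming the DataStax spark-cassandra-connector is on the classpath and a Cassandra node is reachable. The host, the keyspace `my_keyspace`, the table `my_table`, the column name `key`, and the `JoinKey` wrapper are all hypothetical placeholders; none of them appear in the thread.

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical key wrapper: the field name is assumed to match the
// partition key column of the Cassandra table.
case class JoinKey(key: String)

object JoinWithCassandraSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("join-with-cassandra-sketch")
      .setMaster("local[*]")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed host
    val sc = new SparkContext(conf)
    try {
      // Stand-in for the small rdd1: only its keys are needed for the lookup.
      val keys = sc.parallelize(Seq("a", "b")).map(JoinKey(_))

      // The connector queries Cassandra for the given keys instead of
      // scanning the whole table and shuffling it.
      val joined = keys.joinWithCassandraTable("my_keyspace", "my_table")

      // Each element pairs a key with a matching CassandraRow.
      joined.take(10).foreach { case (k, row) => println(s"${k.key} -> $row") }
    } finally {
      sc.stop()
    }
  }
}
```

Only keys that have matching rows produce output, so keys from rdd1 without a match would have to be added back separately to get left-outer-join semantics.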

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-15 Thread Sujit Pal
Hi Wush, One option may be to try a replicated join. Since your rdd1 is small, read it into a collection and broadcast it to the workers, then filter your larger rdd2 against the collection on the workers. -sujit
On Tue, Jul 14, 2015 at 11:33 PM, Deepak Jain deepuj...@gmail.com wrote:
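
A minimal sketch of the replicated (broadcast) join suggested in this message, using Spark's core RDD API in local mode. The object name, the sample data, and the concrete value types (`String` and `Long`) are hypothetical; the thread does not show the real types.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ReplicatedJoinSketch {
  // Joins a small pair RDD against a large one without shuffling the large side.
  // Assumption: the small side fits in driver and executor memory.
  def replicatedJoin(sc: SparkContext,
                     small: RDD[(String, String)],
                     large: RDD[(String, Long)]): RDD[(String, (String, Long))] = {
    // collectAsMap keeps a single value per key, so duplicate keys
    // on the small side are collapsed.
    val lookup = sc.broadcast(small.collectAsMap())
    // Each worker checks its own records against the broadcast map
    // and keeps only the ones whose key is present.
    large.flatMap { case (key, value) =>
      lookup.value.get(key).map(left => (key, (left, value)))
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("replicated-join-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.parallelize(Seq("a" -> "left-a", "b" -> "left-b"))
      val rdd2 = sc.parallelize(Seq("a" -> 1L, "c" -> 3L, "a" -> 2L))
      replicatedJoin(sc, rdd1, rdd2).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

This has the shape of an inner join: keys in rdd1 with no match in rdd2 are dropped. It also still reads every record of the large side, so for a Cassandra table with billions of rows it avoids the shuffle but not the full table scan.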

Efficiency of leftOuterJoin a cassandra rdd

2015-07-14 Thread Wush Wu
Dear all, I am trying to join two RDDs, named rdd1 and rdd2. rdd1 is loaded from a text file with about 33,000 records. rdd2 is loaded from a table in Cassandra which has about 3 billion records. I tried the following code:
```scala
val rdd1: RDD[(String, XXX)] = sc.textFile(...).map(...)
import
```

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-14 Thread Wush Wu
Dear all, I have found a post discussing the same thing: https://groups.google.com/a/lists.datastax.com/forum/#!searchin/spark-connector-user/join/spark-connector-user/q3GotS-n0Wk/g-LPTteCEg0J The solution is to use `joinWithCassandraTable`, and the documentation is here:

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-14 Thread Wush Wu
I don't understand. By the way, `joinWithCassandraTable` does improve my query time from 40 mins to 3 mins.
2015-07-15 13:19 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com: I have explored Spark joins for the last few months (you can search my posts) and it's frustratingly useless. On Tue, Jul

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-14 Thread ๏̯͡๏
I have explored Spark joins for the last few months (you can search my posts) and it's frustratingly useless.
On Tue, Jul 14, 2015 at 9:35 PM, Wush Wu wush...@gmail.com wrote: Dear all, I have found a post discussing the same thing: