I have explored Spark joins for the last few months (you can search my posts), and it's frustratingly useless.
On Tue, Jul 14, 2015 at 9:35 PM, Wush Wu <wush...@gmail.com> wrote:
> Dear all,
>
> I have found a post discussing the same thing:
>
> https://groups.google.com/a/lists.datastax.com/forum/#!searchin/spark-connector-user/join/spark-connector-user/q3GotS-n0Wk/g-LPTteCEg0J
>
> The solution is to use "joinWithCassandraTable"; the documentation is here:
> https://github.com/datastax/spark-cassandra-connector/blob/v1.3.0-M2/doc/2_loading.md
>
> Wush
>
> 2015-07-15 12:15 GMT+08:00 Wush Wu <wush...@gmail.com>:
> > Dear all,
> >
> > I am trying to join two RDDs, named rdd1 and rdd2.
> >
> > rdd1 is loaded from a text file with about 33,000 records.
> >
> > rdd2 is loaded from a table in Cassandra which has about 3 billion records.
> >
> > I tried the following code:
> >
> > ```scala
> > val rdd1: RDD[(String, XXX)] = sc.textFile(...).map(...)
> >
> > import org.apache.spark.sql.cassandra.CassandraSQLContext
> > cc.setKeyspace("xxx")
> > val rdd2: RDD[(String, String)] = cc.sql("SELECT x, y FROM xxx").map(r => ...)
> >
> > val result = rdd1.leftOuterJoin(rdd2)
> > result.take(20)
> > ```
> >
> > However, the log shows that Spark loaded 3 billion records from
> > Cassandra, and only 33,000 records were left at the end.
> >
> > Is there a way to query Cassandra based on the keys in rdd1?
> >
> > Here is some information about our system:
> >
> > - The Spark version is 1.3.1
> > - The Cassandra version is 2.0.14
> > - The join key is the primary key of the Cassandra table.
> >
> > Best,
> > Wush

--
Deepak
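
For anyone finding this thread later, here is a minimal sketch of the joinWithCassandraTable approach described in the linked connector documentation. The keyspace/table name "xxx" and the column names "x" and "y" are the placeholders from Wush's original message, and the input path "keys.txt" is assumed; adapt them to the real schema.

```scala
import com.datastax.spark.connector._ // adds joinWithCassandraTable to RDDs

// Wrap each key from the text file in a Tuple1 so the connector can map it
// onto the table's partition key column.
val keys = sc.textFile("keys.txt").map(line => Tuple1(line.trim))

// Instead of scanning all ~3 billion rows and joining in Spark, the connector
// queries Cassandra only for the keys present in this RDD.
val joined = keys
  .joinWithCassandraTable("xxx", "xxx")
  .select("x", "y")          // pull only the needed columns
  .on(SomeColumns("x"))      // join on the partition key column "x"

joined.take(20).foreach(println)
```

Note that joinWithCassandraTable is an inner join: keys with no matching row in Cassandra are simply dropped, so it is not a drop-in replacement for the leftOuterJoin semantics in the original code.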