Hi,

I am reading data from Cassandra through the DataStax spark-cassandra-connector,
converting it into JSON, and then running Spark SQL on it. Refer to the code
snippet below:

step 1 >>>>> val o_rdd = sc.cassandraTable[CassandraRDDWrapper](
                 "<keyspace>", "<column_family>")
step 2 >>>>> val tempObjectRDD = sc.parallelize(o_rdd.toArray.map(i => i), 100)
step 3 >>>>> val objectRDD = sqlContext.jsonRDD(tempObjectRDD)
step 4 >>>>> objectRDD.registerAsTable("objects")

At step (2) I have to explicitly call "toArray" because jsonRDD takes an
RDD[String]. For me, calling "toArray" on the Cassandra RDD takes forever, as I
have millions of records in Cassandra. Is there a better way of doing this? How
can I optimize it?
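For context, what I would ideally like is to keep the whole pipeline distributed and avoid pulling everything to the driver with "toArray". Something like the sketch below is what I have in mind; note that `toJson` here is a hypothetical placeholder for whatever serializer (json4s, Jackson, etc.) turns a CassandraRDDWrapper into a JSON string:

```scala
import org.apache.spark.rdd.RDD
import com.datastax.spark.connector._

// Hypothetical sketch: serialize each Cassandra row to a JSON string on the
// executors with a map(), so nothing is ever collected to the driver.
// `toJson` is a placeholder serializer, not a real connector API.
val jsonStrings: RDD[String] = sc
  .cassandraTable[CassandraRDDWrapper]("<keyspace>", "<column_family>")
  .map(row => toJson(row)) // stays distributed; replaces toArray + parallelize

val objectRDD = sqlContext.jsonRDD(jsonStrings)
objectRDD.registerAsTable("objects")
```

Is a map() like this the right way to feed jsonRDD, or is there a more idiomatic route?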



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-on-Cassandra-tp13696.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

