Hi, I am reading data from Cassandra through the DataStax spark-cassandra-connector, converting it into JSON, and then running Spark SQL on it. Refer to the code snippet below:
// step 1: load the table as an RDD of wrapper objects
val o_rdd = sc.cassandraTable[CassandraRDDWrapper]("<keyspace>", "<column_family>")

// step 2: collect everything to the driver, then re-parallelize into 100 partitions
val tempObjectRDD = sc.parallelize(o_rdd.toArray.map(i => i), 100)

// step 3: infer a schema and build a SchemaRDD from the JSON
val objectRDD = sqlContext.jsonRDD(tempObjectRDD)

// step 4: register the result as a table for Spark SQL
objectRDD.registerAsTable("objects")

At step (2) I have to call "toArray" explicitly, because jsonRDD takes an RDD[String]. For me, calling "toArray" on the Cassandra RDD takes forever, as I have a million records in Cassandra. Is there a better way of doing this? How can I optimize it?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-on-Cassandra-tp13696.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
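For what it's worth, the shape I was hoping for is something like the sketch below: serialize each row to a JSON string on the executors with a map, so nothing is ever collected to the driver. This is only a sketch -- "rowToJson" is a hypothetical helper (e.g. built on a JSON library like json4s), not something the connector provides:

```scala
// Sketch only: rowToJson is a hypothetical per-row serializer that
// turns one Cassandra row into a JSON string on the executors.
val jsonStringRDD = sc.cassandraTable("<keyspace>", "<column_family>")
  .map(row => rowToJson(row))   // runs distributed, no driver-side toArray

// jsonRDD still gets the RDD[String] it expects, but the conversion
// happened in parallel across the cluster instead of on the driver.
val objectRDD = sqlContext.jsonRDD(jsonStringRDD)
objectRDD.registerAsTable("objects")
```

I don't know whether this is the idiomatic way with the connector, or whether there is a more direct bridge from the Cassandra RDD into Spark SQL that skips the JSON round-trip entirely.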