Hey all, just thought I'd share this with the list in case anyone else would benefit. I'm currently working on a proper integration of PySpark and DataStax's new Cassandra-Spark connector, but that's ongoing.
In the meanwhile, I've updated the cassandra_inputformat.py and cassandra_outputformat.py examples that ship with Spark: https://github.com/Parsely/pyspark-cassandra. The new example shows reading from and writing to Cassandra, including proper handling of CQL 3.1 collections (lists, sets and maps). I think it also clarifies the format RDDs are required to be in to write data to Cassandra <https://github.com/Parsely/pyspark-cassandra/blob/master/src/main/python/pyspark_cassandra_hadoop_example.py#L83-L97> and provides a more general serializer <https://github.com/Parsely/pyspark-cassandra/blob/master/src/main/scala/SparkConverters.scala#L34-L88> for writing Python structs (serialized via Py4J) to Cassandra.

Comments or questions are welcome. I'll update the group again when we have support for the DataStax connector.

--
Mike Sukmanowsky
Aspiring Digital Carpenter

*p*: +1 (416) 953-4248
*e*: mike.sukmanow...@gmail.com

facebook <http://facebook.com/mike.sukmanowsky> | twitter <http://twitter.com/msukmanowsky> | LinkedIn <http://www.linkedin.com/profile/view?id=10897143> | github <https://github.com/msukmanowsky>
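P.S. For anyone who wants the gist without clicking through: here's a minimal sketch of the (key, value) shape CqlOutputFormat wants an RDD to be in before you call saveAsNewAPIHadoopDataset. The table/column names, conf keys, and converter class names below are taken from memory of the stock Spark cassandra_outputformat.py example, not from our repo, so treat them as illustrative and double-check against your Spark version:

```python
# CqlOutputFormat expects each RDD element to be a (key, value) pair where:
#   key   = dict mapping partition/clustering key column names to values
#   value = list of values bound, in order, to the "?" placeholders of the
#           configured UPDATE query
def to_cql_pair(row, key_columns, value_columns):
    """Shape a plain row dict into the (key map, bound-value list) pair
    CqlOutputFormat expects. Helper name and signature are my own."""
    key = {c: row[c] for c in key_columns}
    value = [row[c] for c in value_columns]
    return key, value

# Hypothetical row for an illustrative users table:
row = {"user_id": 42, "fname": "Ada", "lname": "Lovelace"}
pair = to_cql_pair(row, key_columns=["user_id"],
                   value_columns=["fname", "lname"])
# pair is ({"user_id": 42}, ["Ada", "Lovelace"])

# With a live SparkContext `sc`, the save looks roughly like this
# (commented out here since it needs pyspark plus the Cassandra Hadoop
# jars and Python converters on the classpath; conf keys as in the
# stock Spark example, adjust for your cluster):
#
# conf = {
#     "cassandra.output.thrift.address": "localhost",
#     "cassandra.output.thrift.port": "9160",
#     "cassandra.output.keyspace": "test",
#     "cassandra.output.partitioner.class": "Murmur3Partitioner",
#     "cassandra.output.cql": "UPDATE test.users SET fname = ?, lname = ?",
#     "mapreduce.output.basename": "users",
#     "mapreduce.outputformat.class":
#         "org.apache.cassandra.hadoop.cql3.CqlOutputFormat",
#     "mapreduce.job.output.key.class": "java.util.Map",
#     "mapreduce.job.output.value.class": "java.util.List",
# }
# (sc.parallelize([row])
#    .map(lambda r: to_cql_pair(r, ["user_id"], ["fname", "lname"]))
#    .saveAsNewAPIHadoopDataset(conf=conf,
#        keyConverter="...CQLKeyConverter",       # your converter class
#        valueConverter="...CQLValueConverter"))  # your converter class
```

The serializer linked above generalizes the value side of this, so Py4J-serialized Python structs (including collections) map onto CQL types.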