I wanted to share a Python implementation of RDDs: pysparkling. http://trivial.io/post/120179819751/pysparkling-is-a-native-implementation-of-the
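As a rough sketch of the idea (not taken from the linked post), RDD-style code can be written against whichever context object it is handed. The function name `count_long_words` and the sample words are made up for illustration. `Context` is pysparkling's counterpart to PySpark's `SparkContext`, and the example assumes pysparkling has been installed with pip.

```python
def count_long_words(sc, words, min_len=5):
    """Count the words that have at least `min_len` characters.

    `sc` is assumed to be any object exposing the RDD entry point
    `parallelize`, e.g. a pyspark.SparkContext or a pysparkling.Context.
    Only RDD methods (filter, count) are used, so the function itself
    does not care which of the two backends it runs on.
    """
    return (
        sc.parallelize(words)
        .filter(lambda w: len(w) >= min_len)
        .count()
    )


def main():
    # Assumption: pysparkling is installed (pip install pysparkling).
    # No JVM or Hadoop is involved on this code path.
    from pysparkling import Context

    sample = ["spark", "rdd", "python", "jvm"]  # illustrative data only
    print(count_long_words(Context(), sample))


if __name__ == "__main__":
    try:
        main()
    except ImportError:
        print("pysparkling is not installed; skipping the demo")
```

Passing a `pyspark.SparkContext` instead of `Context()` should run the same function on a cluster, since nothing in `count_long_words` is specific to pysparkling.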
The benefit is that the same code you run with PySpark on large datasets can also run with pysparkling on small datasets or single documents. When running with pysparkling, there is no dependency on the Java Virtual Machine or Hadoop.

Sven