Hi Mohit,

This looks pretty interesting, but just a note on the implementation: it might be worthwhile to try doing this on top of Spark SQL SchemaRDDs. The reason is that SchemaRDDs already have an efficient in-memory representation (columnar storage) and can be read from a variety of data sources (JSON, Hive, and soon things like CSV as well). Using the operators in Spark SQL, you also get really efficient code-generated operations on them. I know that things like zipping two data frames might become harder, but the overall performance benefit could be substantial.
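For concreteness, here is a rough sketch of what that route could look like against the Spark 1.1 API (untested, and the file and column names are made up):

    // Untested sketch against the Spark 1.1 SchemaRDD API; "people.json"
    // and the columns 'name / 'age are made up for illustration.
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // sc is the SparkContext in spark-shell
    import sqlContext._                   // implicits for the language-integrated DSL

    // jsonFile infers the schema and returns a SchemaRDD
    val people = sqlContext.jsonFile("people.json")
    people.registerTempTable("people")

    // cacheTable stores the data in the in-memory columnar format
    sqlContext.cacheTable("people")

    // Language-integrated query using Symbol-based column references
    val adults = people.where('age > 21).select('name, 'age)
    adults.collect().foreach(println)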
Matei

On September 4, 2014 at 9:28:12 AM, Mohit Jaggi (mohitja...@gmail.com) wrote:

Folks,

I have been working on a pandas-like dataframe DSL on top of Spark. It is written in Scala and can be used from spark-shell. The API has the look and feel of pandas, which is a wildly popular piece of software among data scientists. The goal is to let people familiar with pandas scale their work to larger datasets by using Spark, without having to climb the steep learning curve for Spark and Scala. It is open sourced under the Apache License and can be found here:

https://github.com/AyasdiOpenSource/df

I welcome your comments, suggestions, and feedback. Any help in developing it further is much appreciated. I have the following items on the roadmap (and am happy to change this based on your comments):

- Python wrappers, most likely in the same way as MLlib
- Sliding window aggregations
- Row indexing
- Graphing/charting
- Efficient row-based operations
- Pretty printing of output in the spark-shell
- Unit test completeness and automated nightly runs

Mohit

P.S.: Thanks to my awesome employer Ayasdi for open sourcing this software.

P.P.S.: I need some design advice on making row operations efficient; I'll start a new thread for that.
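For a sense of the learning curve the DSL is meant to flatten: a filter-and-aggregate that is a couple of lines in pandas looks like this when written directly against the Spark 1.x RDD API (a runnable sketch; the CSV layout of region,amount is made up):

    // What a pandas-like DSL would hide: "filter rows, then sum a column
    // per group" expressed directly against the Spark 1.x RDD API.
    // The CSV layout (region,amount) is made up for illustration.
    val lines = sc.textFile("sales.csv")
    val rows = lines.map(_.split(","))              // Array(region, amount)
    val big = rows.filter(_(1).toDouble > 1000.0)   // keep rows with amount > 1000
    val byRegion = big
      .map(r => (r(0), r(1).toDouble))              // key by region
      .reduceByKey(_ + _)                           // sum amounts per region
    byRegion.collect().foreach(println)

The value of the DSL is wrapping this kind of plumbing behind pandas-style column selection and aggregation calls.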