Hi Mohit,

This looks pretty interesting, but just a note on the implementation: it might be worthwhile to try doing this on top of Spark SQL SchemaRDDs. The reason is that SchemaRDDs already have an efficient in-memory representation (columnar storage) and can be read from a variety of data sources (JSON, Hive, and soon things like CSV as well). Using the operators in Spark SQL, you also get really efficient code-generated operations on them. I know that things like zipping two data frames might become harder, but the overall performance benefit could be substantial.
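For concreteness, here is a rough sketch of what that route could look like against the Spark 1.1 API (untested, and the file and column names are made up):

    // Untested sketch against the Spark 1.1 SchemaRDD API; "people.json"
    // and the columns 'name / 'age are made up for illustration.
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // sc is the SparkContext in spark-shell
    import sqlContext._                   // implicits for the language-integrated DSL

    // jsonFile infers the schema and returns a SchemaRDD
    val people = sqlContext.jsonFile("people.json")
    people.registerTempTable("people")

    // cacheTable stores the data in the in-memory columnar format
    sqlContext.cacheTable("people")

    // Language-integrated query using Symbol-based column references
    val adults = people.where('age > 21).select('name, 'age)
    adults.collect().foreach(println)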
Matei

On September 4, 2014 at 9:28:12 AM, Mohit Jaggi (mohitja...@gmail.com) wrote:

Folks,

I have been working on a pandas-like dataframe DSL on top of Spark. It is written in Scala and can be used from spark-shell. The API has the look and feel of pandas, which is a wildly popular piece of software among data scientists. The goal is to let people familiar with pandas scale their work to larger datasets by using Spark, without having to climb the steep learning curve for Spark and Scala. It is open sourced under the Apache License and can be found here:

https://github.com/AyasdiOpenSource/df

I welcome your comments, suggestions, and feedback. Any help in developing it further is much appreciated. I have the following items on the roadmap (and am happy to change this based on your comments):

- Python wrappers, most likely in the same way as MLlib
- Sliding window aggregations
- Row indexing
- Graphing/charting
- Efficient row-based operations
- Pretty printing of output in the spark-shell
- Unit test completeness and automated nightly runs

Mohit

P.S.: Thanks to my awesome employer Ayasdi for open sourcing this software.

P.P.S.: I need some design advice on making row operations efficient; I'll start a new thread for that.
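For a sense of the learning curve the DSL is meant to flatten: a filter-and-aggregate that is a couple of lines in pandas looks like this when written directly against the Spark 1.x RDD API (a runnable sketch; the CSV layout of region,amount is made up):

    // What a pandas-like DSL would hide: "filter rows, then sum a column
    // per group" expressed directly against the Spark 1.x RDD API.
    // The CSV layout (region,amount) is made up for illustration.
    val lines = sc.textFile("sales.csv")
    val rows = lines.map(_.split(","))              // Array(region, amount)
    val big = rows.filter(_(1).toDouble > 1000.0)   // keep rows with amount > 1000
    val byRegion = big
      .map(r => (r(0), r(1).toDouble))              // key by region
      .reduceByKey(_ + _)                           // sum amounts per region
    byRegion.collect().foreach(println)

The value of the DSL is wrapping this kind of plumbing behind pandas-style column selection and aggregation calls.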