pandas-like dataframe in spark

2014-09-04 Thread Mohit Jaggi
Folks,
I have been working on a pandas-like dataframe DSL on top of Spark. It is
written in Scala and can be used from the spark-shell. The APIs have the look
and feel of pandas, a wildly popular piece of software among data
scientists. The goal is to let people familiar with pandas scale their
work to larger datasets using Spark, without having to climb a
steep learning curve for Spark and Scala.
It is open source under the Apache License and can be found here:
https://github.com/AyasdiOpenSource/df

I welcome your comments, suggestions, and feedback. Any help in developing
it further is much appreciated. I have the following items on the roadmap
(and am happy to change them based on your comments):
- Python wrappers, most likely done the same way as in MLlib
- Sliding-window aggregations
- Row indexing
- Graphing/charting
- Efficient row-based operations
- Pretty-printing of output in the spark-shell
- Unit test completeness and automated nightly runs
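To make the goal concrete, here is a toy, in-memory sketch (plain Scala, no Spark) of the pandas-style look and feel such a DSL aims for. The names `Frame`, `Column`, and `where` are illustrative assumptions, not the actual df API; the real library would back these operations with Spark RDDs rather than local vectors:

```scala
// Toy sketch of a pandas-style column DSL (hypothetical names, not the
// actual df API). Each column is a typed vector; a frame is a named
// collection of columns, filtered by boolean masks as in pandas.
case class Column(values: Vector[Double]) {
  def >(x: Double): Vector[Boolean] = values.map(_ > x) // df("price") > 10
  def mean: Double = values.sum / values.size
}

case class Frame(cols: Map[String, Column]) {
  def apply(name: String): Column = cols(name)          // column selection
  def where(mask: Vector[Boolean]): Frame =             // boolean-mask filter
    Frame(cols.map { case (n, c) =>
      n -> Column(c.values.zip(mask).collect { case (v, true) => v })
    })
}

val df = Frame(Map(
  "price" -> Column(Vector(5.0, 12.0, 20.0)),
  "qty"   -> Column(Vector(1.0, 2.0, 3.0))
))
// pandas-style: keep the rows where price > 10
val expensive = df.where(df("price") > 10.0)
```

The appeal of this style is that a pandas user can read it without knowing anything about RDD transformations.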

Mohit.

P.S.: Thanks to my awesome employer, Ayasdi (http://www.ayasdi.com), for
open-sourcing this software.

P.P.S.: I need some design advice on making row operations efficient;
I'll start a new thread for that.


Re: pandas-like dataframe in spark

2014-09-04 Thread Matei Zaharia
Hi Mohit,

This looks pretty interesting, but just a note on the implementation -- it 
might be worthwhile to try doing this on top of Spark SQL SchemaRDDs. The 
reason is that SchemaRDDs already have an efficient in-memory representation 
(columnar storage), and can be read from a variety of data sources (JSON, Hive, 
soon things like CSV as well). Using the operators in Spark SQL you can also 
get really efficient code-generated operations on them. I know that stuff like 
zipping two data frames might become harder, but the overall benefit in 
performance could be substantial.

Matei


Re: pandas-like dataframe in spark

2014-09-04 Thread Mohit Jaggi
Thanks Matei. I will take a look at SchemaRDDs.

