Re: efficient zipping of lots of RDDs

2014-09-11 Thread Mohit Jaggi
Filed JIRA SPARK-3489: https://issues.apache.org/jira/browse/SPARK-3489


efficient zipping of lots of RDDs

2014-09-04 Thread Mohit Jaggi
Folks,
I sent an email announcing
https://github.com/AyasdiOpenSource/df

This dataframe is basically a map of RDDs of columns (along with DSL
sugar), since column-based operations seem to be the most common. But row
operations are not uncommon either. To get rows out of columns, right now I
zip the column RDDs together: I use RDD.zip and then flatten the tuples I
get. I realize that RDD.zipPartitions might be faster, but I believe an
even better approach should be possible. Surely we can have a zip method
that combines a large, variable number of RDDs? Can that be added to Spark
core? Or is there an alternative that is equally good or better?
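
For concreteness, here is a minimal sketch of the zipPartitions variant in
Scala (a hypothetical helper, not df's actual code). It assumes a non-empty
list of column RDDs that all share identical partitioning and per-partition
element counts, which RDD.zip/zipPartitions already require:

import org.apache.spark.rdd.RDD

// Hypothetical helper: combine N column RDDs into one RDD of rows by
// folding pairwise with zipPartitions, one zip stage per extra column.
def zipColumns(columns: Seq[RDD[Any]]): RDD[Seq[Any]] = {
  // Seed with single-element rows built from the first column...
  val init: RDD[Seq[Any]] = columns.head.map(Seq(_))
  // ...then extend every row by one value per remaining column.
  columns.tail.foldLeft(init) { (rows, col) =>
    rows.zipPartitions(col) { (rowIter, colIter) =>
      rowIter.zip(colIter).map { case (row, v) => row :+ v }
    }
  }
}

Even this still chains N-1 zip stages and rebuilds every row once per
stage; a native N-ary zip in Spark core could walk all N partition
iterators in a single pass.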

Cheers,
Mohit.