Hi,

Is there some way to get R-style data.frame structures into RDDs? I've
been using RDD[Seq[]], but this is getting quite error-prone, and the code
gets pretty hard to read, especially after a few joins, maps, etc.

Rather than access columns by index, I would prefer to access them by name.
For example, instead of writing:
myrdd.map(l => Seq(l(0), l(1), l(4), l(9)))
I would prefer to write:
myrdd.map(l => DataFrame(l.id, l.entryTime, l.exitTime, l.cost))
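To make the by-name access concrete, here is a rough sketch using a plain
Scala case class. The name Trip, the field types, and the assumption that
the four fields sit at positions 0, 1, 4 and 9 are all made up for
illustration; only the field names come from the example above. It runs on
a local collection, and the same function could be passed to map on an
RDD[Seq[Any]].

```scala
// Hypothetical record type: field names are taken from the example above,
// field types and column positions (0, 1, 4, 9) are assumptions.
case class Trip(id: String, entryTime: Long, exitTime: Long, cost: Double)

object Trip {
  // Convert one untyped row into a named record, in a single place,
  // so index arithmetic does not leak into the rest of the code.
  def fromRow(l: Seq[Any]): Trip =
    Trip(
      l(0).toString,
      l(1).toString.toLong,
      l(4).toString.toLong,
      l(9).toString.toDouble
    )
}

// Sample rows with ten columns each (values invented for the sketch).
val rows: Seq[Seq[Any]] = Seq(
  Seq("a", 100L, "x", "y", 160L, 0, 0, 0, 0, 2.5),
  Seq("b", 200L, "x", "y", 290L, 0, 0, 0, 0, 4.0)
)

// With an RDD this would read: myrdd.map(Trip.fromRow)
val trips = rows.map(Trip.fromRow)

// Later stages can refer to columns by name instead of by index.
val durations = trips.map(t => t.exitTime - t.entryTime)
```

The conversion from positions to names happens once, at the edge, and the
compiler then checks every later field access.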

Also, joins are particularly irritating. Currently I have to first construct
a pair:
somePairRdd.join(myrdd.map(l => ((l(1), l(2)), (l(0), l(1), l(2), l(3)))))
Then I have to strip away the join key and remap the values into a Seq.

Instead, I would rather write:
someDataFrame.join(myrdd, l => l.entryTime && l.exitTime)
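For comparison, a sketch of how a join on (entryTime, exitTime) might look
with case classes and the existing keyBy, join and values operations on
RDDs. The record types Visit and Rate, their fields, and the helper
joinOnTimes are hypothetical names invented for this example; it assumes
Spark is on the classpath and runs against a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark releases
import org.apache.spark.rdd.RDD

// Hypothetical record types (names and fields are assumptions).
case class Visit(id: String, entryTime: Long, exitTime: Long, cost: Double)
case class Rate(entryTime: Long, exitTime: Long, zone: String)

object JoinSketch {
  // Key both sides by the same named fields, join, then drop the key.
  def joinOnTimes(rates: RDD[Rate], visits: RDD[Visit]): RDD[(Rate, Visit)] =
    rates.keyBy(r => (r.entryTime, r.exitTime))
      .join(visits.keyBy(v => (v.entryTime, v.exitTime)))
      .values

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("join-sketch")
    val sc = new SparkContext(conf)
    try {
      val visits = sc.parallelize(Seq(
        Visit("a", 100L, 160L, 2.5),
        Visit("b", 200L, 290L, 4.0)
      ))
      val rates = sc.parallelize(Seq(Rate(100L, 160L, "north")))

      // Only visit "a" shares both times with a rate, so one pair comes back.
      joinOnTimes(rates, visits).collect().foreach { case (rate, visit) =>
        println(s"${visit.id} ${rate.zone} ${visit.cost}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

The key is still built by hand, but it is written in terms of field names,
and the joined values keep their types instead of collapsing into a Seq.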


My questions are:
(1) I started writing a DataFrameRDD class that kept track of the column
names and column values, plus some optional attributes common to the entire
data frame. However, I got a little muddled when trying to figure out what
happens when a DataFrameRDD is chained with other operations and gets
transformed into other types of RDDs. The value part of the RDD is obvious,
but I didn't know the best way to pass on the "column and attribute"
portions of the DataFrame class.
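One possible shape for this, sketched as a plain wrapper rather than a real
RDD subclass: the column names and attributes live next to the rows, and
every transformation returns a new wrapper with its metadata stated
explicitly. The name Frame and all of its methods are hypothetical, and the
rows are a local Seq standing in for an RDD[Seq[Any]].

```scala
// Hypothetical wrapper: column names and frame-wide attributes travel
// together with the rows. A local Seq stands in for an RDD here.
case class Frame(
    columns: Seq[String],
    attrs: Map[String, String],
    rows: Seq[Seq[Any]]
) {
  // Column name -> position, computed once per frame.
  private val index: Map[String, Int] = columns.zipWithIndex.toMap

  // Look up one cell by column name.
  def get(row: Seq[Any], name: String): Any = row(index(name))

  // Projection: the output columns are known, so metadata carries over.
  def select(names: String*): Frame =
    Frame(names, attrs, rows.map(r => names.map(n => r(index(n)))))

  // Row filter: the schema is unchanged, so metadata passes through as is.
  def filter(p: Seq[Any] => Boolean): Frame = copy(rows = rows.filter(p))

  // Arbitrary map: the caller has to declare the new column names.
  def mapRows(newColumns: Seq[String])(f: Seq[Any] => Seq[Any]): Frame =
    Frame(newColumns, attrs, rows.map(f))
}

// Sample data (values invented for the sketch).
val frame = Frame(
  Seq("id", "entryTime", "exitTime", "cost"),
  Map("source" -> "example"),
  Seq(Seq("a", 100L, 160L, 2.5), Seq("b", 200L, 290L, 4.0))
)

val projected = frame.select("id", "cost")
val expensive = frame.filter(r => frame.get(r, "cost") == 4.0)
```

The design choice is that schema-preserving operations forward the metadata
untouched, while anything that can change the row shape must say what the
new columns are; an operation that cannot say so falls back to a plain
collection of rows.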

I googled around for some documentation on how to write RDDs, but only
found a pptx slide presentation with very vague info. Is there a better
source of info on how to write RDDs?

(2) Even better than info on how to write RDDs, has anyone written an RDD
that functions as a DataFrame? :-)

tks
shay
