Maybe I'm wrong, but this use case could be a good fit for Shapeless<https://github.com/milessabin/shapeless>' records.
Shapeless' records are like, so to say, lisp's record but typed! In that sense, they're more closer to Haskell's record notation, but imho less powerful, since the access will be based on String (field name) for Shapeless where Haskell will use pure functions! Anyway, this documentation<https://github.com/milessabin/shapeless/wiki/Feature-overview%3a-shapeless-2.0.0#extensible-records> is self-explanatory and straightforward how we (maybe) could use them to simulate an R's frame Thinking out loud: when reading a csv file, for instance, what would be needed are * a Read[T] for each column, * fold'ling the list of columns by "reading" each and prepending the result (combined with the name with ->>) to an HList The gain would be that we should recover one helpful feature of R's frame which is: R :: frame$newCol = frame$post - frame$pre // which adds a column to a frame Shpls :: frame2 = frame + ("newCol" --> (frame("post") - frame("pre"))) // type safe "difference" between ints for instance Of course, we're not recovering R's frame as is, because we're simply dealing with rows on by one, where a frame is dealing with the full table -- but in the case of Spark this would have no sense to mimic that, since we use RDDs for that :-D. I didn't experimented this yet, but It'd be fun to try, don't know if someone is interested in ^^ Cheers andy On Fri, Nov 15, 2013 at 8:49 PM, Christopher Nguyen <c...@adatao.com> wrote: > Sure, Shay. Let's connect offline. > > Sent while mobile. Pls excuse typos etc. > On Nov 16, 2013 2:27 AM, "Shay Seng" <s...@1618labs.com> wrote: > >> Nice, any possibility of sharing this code in advance? >> >> >> On Fri, Nov 15, 2013 at 11:22 AM, Christopher Nguyen <c...@adatao.com>wrote: >> >>> Shay, we've done this at Adatao, specifically a big data frame in RDD >>> representation and subsetting/projections/data mining/machine learning >>> algorithms on that in-memory table structure. >>> >>> We're planning to harmonize that with the MLBase work in the near >>> future. Just a matter of prioritization on limited resources. If there's >>> enough interest we'll accelerate that. >>> >>> Sent while mobile. Pls excuse typos etc. >>> On Nov 16, 2013 1:11 AM, "Shay Seng" <s...@1618labs.com> wrote: >>> >>>> Hi, >>>> >>>> Is there some way to get R-style Data.Frame data structures into RDDs? >>>> I've been using RDD[Seq[]] but this is getting quite error-prone and the >>>> code gets pretty hard to read especially after a few joins, maps etc. >>>> >>>> Rather than access columns by index, I would prefer to access them by >>>> name. >>>> e.g. instead of writing: >>>> myrdd.map(l => Seq(l(0), l(1), l,(4), l(9)) >>>> I would prefer to write >>>> myrdd.map(l => DataFrame(l.id, l.entryTime, l.exitTime, l.cost)) >>>> >>>> Also joins are particularly irritating. Currently I have to first >>>> construct a pair: >>>> somePairRdd.join(myrdd.map(l=> (l(1),l(2)), (l(0),l(1),l(2),l(3))) >>>> Now I have to unzip away the join-key and remap the values into a seq >>>> >>>> instead I would rather write >>>> someDataFrame.join(myrdd , l=> l.entryTime && l.exitTime) >>>> >>>> >>>> The question is this: >>>> (1) I started writing a DataFrameRDD class that kept track of the >>>> column names and column values, and some optional attributes common to the >>>> entire dataframe. However I got a little muddled when trying to figure out >>>> what happens when a dataframRDD is chained with other operations and get >>>> transformed to other types of RDDs. The Value part of the RDD is obvious, >>>> but I didn't know the best way to pass on the "column and attribute" >>>> portions of the DataFrame class. >>>> >>>> I googled around for some documentation on how to write RDDs, but only >>>> found a pptx slide presentation with very vague info. Is there a better >>>> source of info on how to write RDDs? >>>> >>>> (2) Even better than info on how to write RDDs, has anyone written an >>>> RDD that functions as a DataFrame? :-) >>>> >>>> tks >>>> shay >>>> >>> >>