Re: DataFrame RDDs

andy petrella Mon, 18 Nov 2013 15:47:12 -0800

Maybe I'm wrong, but this use case could be a good fit for
Shapeless<https://github.com/milessabin/shapeless> records.


Shapeless' records are, so to speak, like Lisp's records but typed! In that
sense, they're closer to Haskell's record notation, but IMHO less
powerful, since field access in Shapeless is based on a String (the field
name), whereas Haskell uses pure functions!

Anyway, this
documentation<https://github.com/milessabin/shapeless/wiki/Feature-overview%3a-shapeless-2.0.0#extensible-records>
is self-explanatory and shows in a straightforward way how we (maybe) could
use them to simulate R's frame.

Thinking out loud: when reading a CSV file, for instance, what would be
needed is
 * a Read[T] for each column,
 * folding over the list of columns by "reading" each one and prepending the
result (combined with its name via ->>) to an HList
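The two bullets above can be sketched in plain Scala, without Shapeless. This is only an illustration under assumptions: Read, Column and readRow are hypothetical names, the CSV is assumed to have no quoted or escaped commas, and a List of (name, value) pairs stands in for the HList, so the static type of each field is lost here (keeping it is exactly what an HList of ->> fields would add).

```scala
// Hypothetical sketch: none of these names come from Shapeless or Spark.
object CsvRecordSketch {

  /** Type class describing how to parse one CSV cell into a T. */
  trait Read[T] { def read(cell: String): T }

  object Read {
    implicit val readInt: Read[Int] =
      new Read[Int] { def read(cell: String): Int = cell.trim.toInt }
    implicit val readDouble: Read[Double] =
      new Read[Double] { def read(cell: String): Double = cell.trim.toDouble }
    implicit val readString: Read[String] =
      new Read[String] { def read(cell: String): String = cell }
  }

  /** A named column together with the Read instance for its cell type. */
  final case class Column[T](name: String)(implicit val reader: Read[T]) {
    def parse(cell: String): (String, Any) = name -> reader.read(cell)
  }

  /** Folds over the columns from the right, "reading" each cell and
    * prepending the (name -> value) pair to the accumulated row. */
  def readRow(columns: List[Column[_]], line: String): List[(String, Any)] = {
    val cells = line.split(",", -1).toList // assumes no quoted commas
    require(cells.length == columns.length, "column count mismatch")
    columns.zip(cells).foldRight(List.empty[(String, Any)]) {
      case ((column, cell), acc) => column.parse(cell) :: acc
    }
  }
}
```

For example, readRow(List(Column[Int]("pre"), Column[Int]("post")), "3,10") yields the pairs "pre" -> 3 and "post" -> 10, in column order.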

The gain would be that we recover one helpful feature of R's frame,
which is:
  R     :: frame$newCol = frame$post - frame$pre
        // which adds a column to a frame
  Shpls :: frame2 = frame + ("newCol" ->> (frame("post") - frame("pre")))
        // type-safe "difference" between ints, for instance
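To make the "add a derived column" idea concrete without depending on Shapeless, here is a rough stand-in in plain Scala. Col and Row are hypothetical names, and the approach (typed keys over an untyped Map) is weaker than Shapeless records: it only stays safe as long as each column name is always used with the same type, whereas a real record would have the compiler check that.

```scala
// Hypothetical sketch: typed column keys over a Map, not the Shapeless API.
object TypedRowSketch {

  /** A column key that carries the static type of its values. */
  final case class Col[T](name: String)

  /** An immutable row; values are stored untyped but are only
    * written and read through typed keys. */
  final class Row private (private val fields: Map[String, Any]) {
    def apply[T](col: Col[T]): T = fields(col.name).asInstanceOf[T]
    def +[T](field: (Col[T], T)): Row =
      new Row(fields + (field._1.name -> field._2))
    def columns: Set[String] = fields.keySet
  }

  object Row { val empty: Row = new Row(Map.empty) }

  object Example {
    val pre    = Col[Int]("pre")
    val post   = Col[Int]("post")
    val newCol = Col[Int]("newCol")

    val frame: Row = Row.empty + (pre -> 3) + (post -> 10)
    // The subtraction is checked as Int - Int at compile time.
    val frame2: Row = frame + (newCol -> (frame(post) - frame(pre)))
  }
}
```

Here frame is left unchanged and frame2 carries the extra column, which mirrors the frame2 = frame + (...) line above; trying to add newCol -> "x" would not compile because the key is a Col[Int].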

Of course, we're not recovering R's frame as is, because we're simply
dealing with rows one by one, whereas a frame deals with the full table
-- but in the case of Spark it would make no sense to mimic that, since
we use RDDs for that :-D.

I haven't experimented with this yet, but it'd be fun to try. I don't know
if anyone is interested ^^

Cheers

andy


On Fri, Nov 15, 2013 at 8:49 PM, Christopher Nguyen <c...@adatao.com> wrote:

> Sure, Shay. Let's connect offline.
>
> Sent while mobile. Pls excuse typos etc.
> On Nov 16, 2013 2:27 AM, "Shay Seng" <s...@1618labs.com> wrote:
>
>> Nice, any possibility of sharing this code in advance?
>>
>>
>> On Fri, Nov 15, 2013 at 11:22 AM, Christopher Nguyen <c...@adatao.com> wrote:
>>
>>> Shay, we've done this at Adatao, specifically a big data frame in RDD
>>> representation and subsetting/projections/data mining/machine learning
>>> algorithms on that in-memory table structure.
>>>
>>> We're planning to harmonize that with the MLBase work in the near
>>> future. Just a matter of prioritization on limited resources. If there's
>>> enough interest we'll accelerate that.
>>>
>>> Sent while mobile. Pls excuse typos etc.
>>> On Nov 16, 2013 1:11 AM, "Shay Seng" <s...@1618labs.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Is there some way to get R-style Data.Frame data structures into RDDs?
>>>> I've been using RDD[Seq[]] but this is getting quite error-prone and the
>>>> code gets pretty hard to read especially after a few joins, maps etc.
>>>>
>>>> Rather than access columns by index, I would prefer to access them by
>>>> name.
>>>> e.g. instead of writing:
>>>> myrdd.map(l => Seq(l(0), l(1), l(4), l(9)))
>>>> I would prefer to write
>>>> myrdd.map(l => DataFrame(l.id, l.entryTime, l.exitTime, l.cost))
>>>>
>>>> Also joins are particularly irritating. Currently I have to first
>>>> construct a pair:
>>>> somePairRdd.join(myrdd.map(l => ((l(1), l(2)), (l(0), l(1), l(2), l(3)))))
>>>> Now I have to unzip away the join-key and remap the values into a seq
>>>>
>>>> instead I would rather write
>>>> someDataFrame.join(myrdd , l=> l.entryTime && l.exitTime)
>>>>
>>>>
>>>> The question is this:
>>>> (1) I started writing a DataFrameRDD class that kept track of the
>>>> column names and column values, and some optional attributes common to the
>>>> entire dataframe. However I got a little muddled when trying to figure out
>>>> what happens when a DataFrameRDD is chained with other operations and gets
>>>> transformed to other types of RDDs. The Value part of the RDD is obvious,
>>>> but I didn't know the best way to pass on the "column and attribute"
>>>> portions of the DataFrame class.
>>>>
>>>> I googled around for some documentation on how to write RDDs, but only
>>>> found a pptx slide presentation with very vague info. Is there a better
>>>> source of info on how to write RDDs?
>>>>
>>>> (2) Even better than info on how to write RDDs, has anyone written an
>>>> RDD that functions as a DataFrame? :-)
>>>>
>>>> tks
>>>> shay
>>>>
>>>
>>
