Re: DataFrame RDDs

andy petrella Tue, 19 Nov 2013 00:03:10 -0800

Exactly, we could actually use Dynamic with Records.

Still thinking out loud:
The fact is that with Dynamic we would lose the type: the implementation
is up to us, and it uses Any for the parameters, of course.
Maybe we could use shapeless records as the delegated value for the
selectDynamic (and so on) implementation...


However, we might run into problems because of the Any in the
signatures, since the compiler loses the type... most probably we'd need a
ClassTag or something similar... not sure it would work, at least not with
clean code (°_-)
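
To make the Any problem concrete, here is a minimal sketch in plain Scala (no shapeless). DynRow, its get method and DynRowDemo are hypothetical names made up for illustration; they are not part of Spark or shapeless. Field access through selectDynamic compiles for any name but is typed as Any, while the ClassTag-based accessor recovers a typed value through a runtime check.

```scala
import scala.language.dynamics
import scala.reflect.ClassTag

// Hypothetical row type: values are stored as Any, keyed by column name.
final class DynRow(fields: Map[String, Any]) extends Dynamic {

  // row.cost is rewritten by the compiler to row.selectDynamic("cost").
  // The static type of the result is only Any.
  def selectDynamic(name: String): Any = fields(name)

  // Typed accessor: the ClassTag makes the `t: T` pattern checkable at runtime.
  def get[T](name: String)(implicit tag: ClassTag[T]): Option[T] =
    fields.get(name).flatMap {
      case t: T => Some(t)
      case _    => None
    }
}

object DynRowDemo {
  val row = new DynRow(Map("id" -> 1, "cost" -> 12.5))

  val untyped: Any = row.cost             // compiles, but the type is lost
  val typed = row.get[Double]("cost")     // typed, checked at runtime
  val mismatch = row.get[String]("cost")  // wrong type requested, so None
}
```

A misspelled field name still compiles with selectDynamic and only fails at runtime, which is the string-name pitfall mentioned in the quoted message below.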



Andy Petrella
Belgium (Liège)


 IT Consultant for *NextLab <http://nextlab.be/> sprl* (co-founder)
 Engaged Citizen Coder for *WAJUG <http://wajug.be/>* (co-founder)
 Author of *Learning Play! Framework 2* <http://www.packtpub.com/learning-play-framework-2/book>


Mobile: *+32 495 99 11 04*
Mails:

   - andy.petre...@nextlab.be
   - andy.petre...@gmail.com

Socials:

   - Twitter: https://twitter.com/#!/noootsab
   - LinkedIn: http://be.linkedin.com/in/andypetrella
   - Blogger: http://ska-la.blogspot.com/
   - GitHub:  https://github.com/andypetrella
   - Masterbranch: https://masterbranch.com/andy.petrella



On Tue, Nov 19, 2013 at 8:07 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Interesting idea — in Scala you can also use the Dynamic type (
> http://hacking-scala.org/post/49051516694/introduction-to-type-dynamic)
> to allow dynamic properties. It has the same potential pitfalls as string
> names, but with nicer syntax.
>
> Matei
>
> On Nov 18, 2013, at 3:45 PM, andy petrella <andy.petre...@gmail.com>
> wrote:
>
> Maybe I'm wrong, but this use case could be a good fit for Shapeless
> <https://github.com/milessabin/shapeless> records.
>
> Shapeless records are, so to speak, like Lisp records, but typed! In that
> sense they're closer to Haskell's record notation, but IMHO less
> powerful, since field access in Shapeless is based on a String (the field
> name), whereas Haskell uses pure functions!
>
> Anyway, this documentation
> <https://github.com/milessabin/shapeless/wiki/Feature-overview%3a-shapeless-2.0.0#extensible-records>
> is self-explanatory and shows in a straightforward way how we could
> (maybe) use them to simulate an R data frame.
>
> Thinking out loud: when reading a CSV file, for instance, what would be
> needed is
>  * a Read[T] for each column,
>  * folding over the list of columns, "reading" each one and prepending the
> result (combined with its name using ->>) to an HList
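
The two steps above can be sketched in plain Scala. Everything here is an assumption made for illustration: the shape of Read, the Column descriptor, and the Csv helpers are hypothetical names, and a plain List[(String, Any)] stands in for the shapeless HList, so the per-field types that ->> would keep are lost in this version. The CSV splitting is naive and does not handle quoted cells.

```scala
// Hypothetical type class: how to parse one CSV cell into a T.
trait Read[T] { def read(cell: String): T }

object Read {
  implicit val readInt: Read[Int] =
    new Read[Int] { def read(cell: String): Int = cell.trim.toInt }
  implicit val readDouble: Read[Double] =
    new Read[Double] { def read(cell: String): Double = cell.trim.toDouble }
  implicit val readString: Read[String] =
    new Read[String] { def read(cell: String): String = cell }
}

// Hypothetical column descriptor: a name plus the reader for its cells.
final case class Column[T](name: String, reader: Read[T])

object Csv {
  // Picks up the Read[T] instance for the requested cell type.
  def column[T](name: String)(implicit reader: Read[T]): Column[T] =
    Column(name, reader)

  // Folds over the columns, reading each cell and prepending (name, value).
  // foldRight keeps the fields in column order despite the prepending.
  // zip silently drops extra cells or extra columns if the counts differ.
  def readRow(columns: List[Column[_]], line: String): List[(String, Any)] = {
    val cells = line.split(",", -1).toList
    columns.zip(cells).foldRight(List.empty[(String, Any)]) {
      case ((col, cell), acc) =>
        val value: Any = col.reader.read(cell)
        (col.name, value) :: acc
    }
  }
}

object CsvDemo {
  val schema: List[Column[_]] =
    List(Csv.column[Int]("pre"), Csv.column[Int]("post"), Csv.column[String]("label"))

  val row = Csv.readRow(schema, "3,10,a")
}
```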
>
> The gain would be that we recover one helpful feature of R's data frames,
> which is:
>   R         :: frame$newCol = frame$post - frame$pre
>           // which adds a column to a frame
>   Shapeless :: frame2 = frame + ("newCol" ->> (frame("post") - frame("pre")))
>     // type-safe "difference" between ints, for instance
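
As a rough stand-in for the record version, the same derived-column step can be sketched without shapeless by using typed column keys. Col, Row and FrameDemo are hypothetical names invented for this sketch. Unlike shapeless records, this only checks the value type of a column at compile time; it does not check that the column is actually present, and a row built with inconsistent keys for the same name would fail at runtime.

```scala
// Hypothetical typed column key: the type parameter records the cell type.
final case class Col[T](name: String)

// Hypothetical row: cells are stored untyped, but read and written via typed keys.
final class Row(cells: Map[String, Any] = Map.empty) {

  // Typed read; throws NoSuchElementException if the column is missing.
  def apply[T](col: Col[T]): T = cells(col.name).asInstanceOf[T]

  // Returns a new row with the extra (or replaced) column.
  def +[T](entry: (Col[T], T)): Row =
    new Row(cells + (entry._1.name -> entry._2))

  def names: Set[String] = cells.keySet
}

object FrameDemo {
  val pre = Col[Int]("pre")
  val post = Col[Int]("post")
  val newCol = Col[Int]("newCol")

  val frame = new Row() + (pre -> 3) + (post -> 10)

  // frame(post) and frame(pre) are both Int, so the subtraction is type checked.
  val frame2 = frame + (newCol -> (frame(post) - frame(pre)))
}
```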
>
> Of course, we're not recovering R's frame as is, because we're simply
> dealing with rows one by one, whereas a frame deals with the full table
> -- but in the case of Spark it would make no sense to mimic that, since
> we use RDDs for that :-D.
>
> I haven't experimented with this yet, but it'd be fun to try; I don't know
> if anyone is interested ^^
>
> Cheers
>
> andy
>
>
> On Fri, Nov 15, 2013 at 8:49 PM, Christopher Nguyen <c...@adatao.com> wrote:
>
>> Sure, Shay. Let's connect offline.
>>
>> Sent while mobile. Pls excuse typos etc.
>> On Nov 16, 2013 2:27 AM, "Shay Seng" <s...@1618labs.com> wrote:
>>
>>> Nice, any possibility of sharing this code in advance?
>>>
>>>
>>> On Fri, Nov 15, 2013 at 11:22 AM, Christopher Nguyen <c...@adatao.com> wrote:
>>>
>>>> Shay, we've done this at Adatao, specifically a big data frame in RDD
>>>> representation and subsetting/projections/data mining/machine learning
>>>> algorithms on that in-memory table structure.
>>>>
>>>> We're planning to harmonize that with the MLBase work in the near
>>>> future. Just a matter of prioritization on limited resources. If there's
>>>> enough interest we'll accelerate that.
>>>>
>>>> Sent while mobile. Pls excuse typos etc.
>>>>  On Nov 16, 2013 1:11 AM, "Shay Seng" <s...@1618labs.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Is there some way to get R-style Data.Frame data structures into RDDs?
>>>>> I've been using RDD[Seq[]] but this is getting quite error-prone and the
>>>>> code gets pretty hard to read especially after a few joins, maps etc.
>>>>>
>>>>> Rather than access columns by index, I would prefer to access them by
>>>>> name.
>>>>> e.g. instead of writing:
>>>>> myrdd.map(l => Seq(l(0), l(1), l(4), l(9)))
>>>>> I would prefer to write
>>>>> myrdd.map(l => DataFrame(l.id, l.entryTime, l.exitTime, l.cost))
>>>>>
>>>>> Also joins are particularly irritating. Currently I have to first
>>>>> construct a pair:
>>>>> somePairRdd.join(myrdd.map(l => ((l(1), l(2)), (l(0), l(1), l(2), l(3)))))
>>>>> Now I have to unzip away the join-key and remap the values into a seq
>>>>>
>>>>> instead I would rather write
>>>>> someDataFrame.join(myrdd , l=> l.entryTime && l.exitTime)
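
One low-tech way to get named columns and a key-based join is a case class per row plus a key-extractor function. The sketch below uses plain Scala collections as a stand-in for RDDs so that it runs without Spark; Visit, Window, Joins.joinOn and JoinDemo are hypothetical names, and the column types are guesses. On a real RDD the equivalent step would be to map each side to (key, row) pairs and use Spark's pair-RDD join.

```scala
// Hypothetical row types: fields are accessed by name instead of by index.
final case class Visit(id: Int, entryTime: Long, exitTime: Long, cost: Double)
final case class Window(entryTime: Long, exitTime: Long, label: String)

object Joins {
  // Inner join of two collections on keys computed from each row.
  def joinOn[A, B, K](left: Seq[A], right: Seq[B])(
      leftKey: A => K, rightKey: B => K): Seq[(A, B)] = {
    // Index the right side by key once, then look up each left row.
    val rightByKey = right.groupBy(rightKey)
    for {
      a <- left
      b <- rightByKey.getOrElse(leftKey(a), Seq.empty[B])
    } yield (a, b)
  }
}

object JoinDemo {
  val visits = Seq(Visit(1, 100L, 200L, 9.5), Visit(2, 300L, 400L, 4.0))
  val windows = Seq(Window(100L, 200L, "morning"))

  // Projection by name rather than by position.
  val costs = visits.map(v => (v.id, v.cost))

  // Join on the (entryTime, exitTime) pair; no manual unzipping of the key.
  val joined = Joins.joinOn(windows, visits)(
    w => (w.entryTime, w.exitTime),
    v => (v.entryTime, v.exitTime))
}
```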
>>>>>
>>>>>
>>>>> The question is this:
>>>>> (1) I started writing a DataFrameRDD class that kept track of the
>>>>> column names and column values, and some optional attributes common to the
>>>>> entire dataframe. However I got a little muddled when trying to figure out
>>>>> what happens when a DataFrameRDD is chained with other operations and gets
>>>>> transformed to other types of RDDs. The Value part of the RDD is obvious,
>>>>> but I didn't know the best way to pass on the "column and attribute"
>>>>> portions of the DataFrame class.
>>>>>
>>>>> I googled around for some documentation on how to write RDDs, but only
>>>>> found a pptx slide presentation with very vague info. Is there a better
>>>>> source of info on how to write RDDs?
>>>>>
>>>>> (2) Even better than info on how to write RDDs, has anyone written an
>>>>> RDD that functions as a DataFrame? :-)
>>>>>
>>>>> tks
>>>>> shay
>>>>>
>>>>
>>>
>
>
