Interesting idea — in Scala you can also use the Dynamic type 
(http://hacking-scala.org/post/49051516694/introduction-to-type-dynamic) to 
allow dynamic properties. It has the same potential pitfalls as string names, 
but with nicer syntax.
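
A rough sketch of what that could look like, not a tested design: the Row 
class, its Map-backed storage and the updated helper below are made up for 
illustration; only scala.Dynamic and selectDynamic come from the standard 
library.

```scala
import scala.language.dynamics

// Hypothetical row wrapper: column values are stored in a Map keyed by column name.
class Row(fields: Map[String, Any]) extends Dynamic {
  // The compiler rewrites row.cost into row.selectDynamic("cost").
  def selectDynamic(name: String): Any =
    fields.getOrElse(name, throw new NoSuchElementException(s"no column '$name'"))

  // Returns a new Row with one extra (or replaced) column.
  def updated(name: String, value: Any): Row = new Row(fields + (name -> value))
}

object DynamicRowDemo {
  def main(args: Array[String]): Unit = {
    val row = new Row(Map("pre" -> 3, "post" -> 10))
    // Values come back as Any, so the caller has to cast before doing arithmetic.
    val diff = row.post.asInstanceOf[Int] - row.pre.asInstanceOf[Int]
    val row2 = row.updated("newCol", diff)
    println(row2.newCol)
  }
}
```

The pitfall is visible here: a misspelled column such as row.psot still 
compiles and only fails at runtime, exactly as a mistyped string key would.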

Matei

On Nov 18, 2013, at 3:45 PM, andy petrella <andy.petre...@gmail.com> wrote:

> Maybe I'm wrong, but this use case could be a good fit for Shapeless' records.
> 
> Shapeless' records are, so to speak, like Lisp's records but typed! In that 
> sense, they're closer to Haskell's record notation, but imho less 
> powerful, since access is based on a String (field name) in Shapeless, 
> whereas Haskell uses pure functions!
> 
> Anyway, this documentation is self-explanatory and shows in a straightforward 
> way how we (maybe) could use them to simulate an R frame
> 
> Thinking out loud: when reading a csv file, for instance, what would be 
> needed are 
>  * a Read[T] for each column, 
>  * folding the list of columns by "reading" each and prepending the result 
> (combined with the name with ->>) to an HList
> 
> The gain would be that we should recover one helpful feature of R's frame, 
> which is:
>   R     :: frame$newCol = frame$post - frame$pre
>            // which adds a column to a frame
>   Shpls :: frame2 = frame + ("newCol" ->> (frame("post") - frame("pre")))
>            // type safe "difference" between ints for instance
>    
> Of course, we're not recovering R's frame as is, because we're simply dealing 
> with rows one by one, whereas a frame deals with the full table -- but in 
> the case of Spark it would make no sense to mimic that, since we use RDDs 
> for that :-D.
> 
> I haven't experimented with this yet, but it'd be fun to try; don't know if 
> anyone is interested ^^
> 
> Cheers
> 
> andy
> 
> 
> On Fri, Nov 15, 2013 at 8:49 PM, Christopher Nguyen <c...@adatao.com> wrote:
> Sure, Shay. Let's connect offline.
> 
> Sent while mobile. Pls excuse typos etc.
> 
> On Nov 16, 2013 2:27 AM, "Shay Seng" <s...@1618labs.com> wrote:
> Nice, any possibility of sharing this code in advance? 
> 
> 
> On Fri, Nov 15, 2013 at 11:22 AM, Christopher Nguyen <c...@adatao.com> wrote:
> Shay, we've done this at Adatao, specifically a big data frame in RDD 
> representation and subsetting/projections/data mining/machine learning 
> algorithms on that in-memory table structure.
> 
> We're planning to harmonize that with the MLBase work in the near future. 
> Just a matter of prioritization on limited resources. If there's enough 
> interest we'll accelerate that.
> 
> Sent while mobile. Pls excuse typos etc.
> 
> On Nov 16, 2013 1:11 AM, "Shay Seng" <s...@1618labs.com> wrote:
> Hi, 
> 
> Is there some way to get R-style Data.Frame data structures into RDDs? I've 
> been using RDD[Seq[]] but this is getting quite error-prone and the code gets 
> pretty hard to read especially after a few joins, maps etc. 
> 
> Rather than access columns by index, I would prefer to access them by name.
> e.g. instead of writing:
> myrdd.map(l => Seq(l(0), l(1), l(4), l(9)))
> I would prefer to write
> myrdd.map(l => DataFrame(l.id, l.entryTime, l.exitTime, l.cost))
> 
> Also joins are particularly irritating. Currently I have to first construct a 
> pair:
> somePairRdd.join(myrdd.map(l => ((l(1), l(2)), (l(0), l(1), l(2), l(3)))))
> Now I have to unzip away the join-key and remap the values into a seq
> 
> instead I would rather write
> someDataFrame.join(myrdd , l=> l.entryTime && l.exitTime)
> 
> 
> The question is this:
> (1) I started writing a DataFrameRDD class that kept track of the column 
> names and column values, and some optional attributes common to the entire 
> dataframe. However I got a little muddled when trying to figure out what 
> happens when a DataFrameRDD is chained with other operations and gets 
> transformed to other types of RDDs. The Value part of the RDD is obvious, but 
> I didn't know the best way to pass on the "column and attribute" portions of 
> the DataFrame class.
> 
> I googled around for some documentation on how to write RDDs, but only found 
> a pptx slide presentation with very vague info. Is there a better source of 
> info on how to write RDDs? 
> 
> (2) Even better than info on how to write RDDs, has anyone written an RDD 
> that functions as a DataFrame? :-)
> 
> tks
> shay
> 
> 
