On Wed, Apr 30, 2014 at 11:42 AM, Dmitriy Lyubimov <[email protected]>wrote:
> I also would suggest to take some guinea pigs to validate stuff.
>
> E.g. if i may make a suggestion, let's see how we'd do a categorical
> variable vectorization into predictor variables in our would-be language
> here.
>

To be a bit more specific, here's roughly what happens. Assuming we have a
column named "C1":

(1) assess the levels and their number (in the R sense, aka the R "factor"
type).

(2) assume there are n total levels (i.e. distinct categories). Assign each
level, except one, to one of n-1 Bernoulli features named according to a
certain convention, e.g. "C1_<level-name-prefix>".

(3) repeat that for all categorical variables in the data frame.

(4) generate the final data frame by applying the category mappings
established in (2) and (3) (set a predictor to 1 if the current categorical
value matches the predictor's level).

(5) compute the resulting data frame summaries (mean, variance, quartiles).

Seems simple enough, but what would it look like?

>
>
> On Wed, Apr 30, 2014 at 11:40 AM, Dmitriy Lyubimov <[email protected]>wrote:
>
>>
>>
>>
>> On Wed, Apr 30, 2014 at 10:53 AM, Dmitriy Lyubimov <[email protected]>wrote:
>>
>>> +1.
>>>
>>> And the greatest benefit of data frames work is standardization of
>>> feature extraction in Mahout, not necessarily any particular algorithms.
>>> This has been the thorniest issue in the history and nobody does it well
>>> today as it stands.
>>>
>>
>> Correction: nobody does it well in open source and in distributed way,
>> that is.
>>
>>
>>> If we tackle feature prep techniques in engine-agnostic way, this would
>>> be truly unique differentiation factor for Mahout.
>>>
>>>
>>>
>>> On Wed, Apr 30, 2014 at 7:52 AM, Sebastian Schelter <[email protected]>wrote:
>>>
>>>> I think you should concentrate on MAHOUT-1490, that is a highly
>>>> important task that will be the foundation for a lot of stuff to be built
>>>> on top. Let's focus on getting this thing right and then move on to other
>>>> things.
>>>>
>>>> --sebastian
>>>>
>>>>
>>>> On 04/30/2014 04:44 PM, Saikat Kanjilal wrote:
>>>>
>>>>> Sebastien/Dmitry,In looking through the current list of issues I didnt
>>>>> see other algorithms in mahout that are talked about being ported to
>>>>> spark,
>>>>> I was wondering if there's any interest/need in porting or writing things
>>>>> like LR/KMeans/SVM to use spark, I'd like to help out in this area while
>>>>> working on 1490. Also are we planning to port the distributed versions of
>>>>> taste to use spark as well at some point.
>>>>> Thanks in advance.
>>>>>
>>>>>
>>>>
>>>
>>
>
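
For reference, steps (1)-(5) above can be sketched on a single machine in plain Python. This is only an illustration of the logic, not the would-be Mahout data frame language and not distributed. The function names, the list-of-dicts "data frame", the choice of the first sorted level as the dropped reference level, and the use of the full level name (rather than a prefix) in the feature name are all assumptions made for the example.

```python
from statistics import mean, pvariance, quantiles


def assess_levels(rows, column):
    """Step (1): collect the distinct levels of a categorical column, sorted."""
    return sorted({row[column] for row in rows})


def build_mapping(rows, categorical_columns):
    """Steps (2)-(3): for every categorical column, map each level except the
    first sorted one (the assumed reference level) to a feature name of the
    form "<column>_<level>", giving n-1 features for n levels."""
    mapping = {}
    for column in categorical_columns:
        levels = assess_levels(rows, column)
        mapping[column] = {level: f"{column}_{level}" for level in levels[1:]}
    return mapping


def vectorize(rows, mapping):
    """Step (4): replace each categorical value with its 0/1 predictors."""
    out = []
    for row in rows:
        # Keep the non-categorical columns unchanged.
        new_row = {k: v for k, v in row.items() if k not in mapping}
        for column, features in mapping.items():
            for level, feature in features.items():
                new_row[feature] = 1 if row[column] == level else 0
        out.append(new_row)
    return out


def summarize(rows):
    """Step (5): per-column mean, population variance, and quartiles."""
    summary = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        summary[name] = {
            "mean": mean(values),
            "variance": pvariance(values),
            "quartiles": quantiles(values, n=4),
        }
    return summary


# Hypothetical toy data: one categorical column "C1" and one numeric column "x".
rows = [
    {"C1": "red", "x": 1.0},
    {"C1": "green", "x": 2.0},
    {"C1": "blue", "x": 3.0},
    {"C1": "green", "x": 4.0},
]
mapping = build_mapping(rows, ["C1"])
frame = vectorize(rows, mapping)
summary = summarize(frame)
```

With this toy input, "blue" becomes the reference level, so "C1" expands into the two predictors "C1_green" and "C1_red". In a distributed setting, steps (1) and (5) would each need a pass over the data (or a combined aggregation), which is where the engine-agnostic design question comes in.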
