On Wed, Apr 30, 2014 at 11:42 AM, Dmitriy Lyubimov <[email protected]>wrote:

> I would also suggest taking some guinea pigs to validate things.
>
> E.g., if I may make a suggestion, let's see how we'd do categorical
> variable vectorization into predictor variables in our would-be language
> here.
>

To be a bit more specific, here is roughly what happens, assuming we have a
column named "C1":


(1) assess the levels and their number (in the R sense, i.e. R's "factor" type).
(2) assume there are n levels in total (i.e. n distinct categories). Map each
level except one to its own Bernoulli (0/1) feature, giving n-1 features named
according to some convention, e.g. "C1_<level-name-prefix>".
(3) repeat that for all categorical variables in the data frame.
(4) generate the final data frame by applying the category mappings established
in (2) and (3) (set a predictor to 1 if the current categorical value matches
the predictor's level, and to 0 otherwise).
(5) compute summaries of the resulting data frame (mean, variance, quartiles).

Seems simple enough, but what would it look like?
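
For reference, here is a rough single-machine sketch of steps (1)-(5) in plain
Python. It is not the would-be Mahout language, only a baseline to compare a
DSL against. The function names and the sample rows are made up for
illustration. Two assumptions are also mine: the first level in sorted order is
the one left out as the baseline, and the full level name is used in the
feature name rather than a prefix.

```python
from statistics import mean, pvariance, quantiles


def assess_levels(rows, column):
    """Step (1): collect the distinct levels of a categorical column, sorted."""
    return sorted({row[column] for row in rows})


def build_mapping(rows, categorical_columns):
    """Steps (2)-(3): map every level but the baseline to a feature name."""
    mapping = {}
    for column in categorical_columns:
        levels = assess_levels(rows, column)
        # Assumption: drop the first sorted level, leaving n-1 indicator features.
        mapping[column] = {level: f"{column}_{level}" for level in levels[1:]}
    return mapping


def vectorize(rows, mapping):
    """Step (4): emit one dict of 0/1 indicators per input row."""
    encoded_rows = []
    for row in rows:
        encoded = {}
        for column, features in mapping.items():
            for level, name in features.items():
                encoded[name] = 1 if row[column] == level else 0
        encoded_rows.append(encoded)
    return encoded_rows


def summarize(encoded_rows):
    """Step (5): mean, population variance and quartiles for each feature."""
    summary = {}
    for name in encoded_rows[0]:
        values = [r[name] for r in encoded_rows]
        summary[name] = {
            "mean": mean(values),
            "variance": pvariance(values),
            "quartiles": quantiles(values, n=4),
        }
    return summary


# Hypothetical sample data: a single categorical column "C1" with three levels.
rows = [{"C1": "red"}, {"C1": "green"}, {"C1": "blue"}, {"C1": "green"}]
mapping = build_mapping(rows, ["C1"])
encoded = vectorize(rows, mapping)
summary = summarize(encoded)
```

Note that steps (1)-(3) need a full pass over the data before step (4) can
start, and step (5) is another pass. That two-phase shape is probably what the
distributed version has to express as well.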

>
>
> On Wed, Apr 30, 2014 at 11:40 AM, Dmitriy Lyubimov <[email protected]>wrote:
>
>>
>>
>>
>> On Wed, Apr 30, 2014 at 10:53 AM, Dmitriy Lyubimov <[email protected]>wrote:
>>
>>> +1.
>>>
>>> And the greatest benefit of the data frames work is standardization of
>>> feature extraction in Mahout, not necessarily any particular algorithms.
>>> This has historically been the thorniest issue, and nobody does it well
>>> as things stand today.
>>>
>>
>> Correction: nobody does it well in open source and in a distributed way,
>> that is.
>>
>>
>>> If we tackle feature prep techniques in an engine-agnostic way, this would
>>> be a truly unique differentiating factor for Mahout.
>>>
>>>
>>>
>>> On Wed, Apr 30, 2014 at 7:52 AM, Sebastian Schelter <[email protected]>wrote:
>>>
>>>> I think you should concentrate on MAHOUT-1490, that is a highly
>>>> important task that will be the foundation for a lot of stuff to be built
>>>> on top. Let's focus on getting this thing right and then move on to other
>>>> things.
>>>>
>>>> --sebastian
>>>>
>>>>
>>>> On 04/30/2014 04:44 PM, Saikat Kanjilal wrote:
>>>>
>>>>> Sebastian/Dmitriy, in looking through the current list of issues I didn't
>>>>> see other algorithms in Mahout that are being discussed for porting to
>>>>> Spark. I was wondering if there's any interest in or need for porting or
>>>>> writing things like LR/KMeans/SVM to use Spark; I'd like to help out in
>>>>> this area while working on 1490. Also, are we planning to port the
>>>>> distributed versions of Taste to use Spark as well at some point?
>>>>> Thanks in advance.
>>>>>
>>>>>
>>>>
>>>
>>
>
