On Wed, Apr 30, 2014 at 9:24 PM, Dmitriy Lyubimov <[email protected]> wrote:
> On Wed, Apr 30, 2014 at 11:42 AM, Dmitriy Lyubimov <[email protected]>
> wrote:
> > I also would suggest to take some guinea pigs to validate stuff.
> >
> > E.g. if I may make a suggestion, let's see how we'd do a categorical
> > variable vectorization into predictor variables in our would-be language
> > here.
>
> To be a bit more specific, here's roughly what happens, assuming we have a
> column named "C1":
>
> (1) Assess the levels and their number (in the R sense, i.e. R's "factor"
> type).
> (2) Assuming there are n total levels (i.e. distinct categories), assign
> each level, except one, to one of n-1 Bernoulli features named according
> to some convention, e.g. "C1_<level-name-prefix>".
> (3) Repeat that for all categorical variables in the data frame.
> (4) Generate the final data frame by executing the category mappings
> established in (2) and (3) (set a predictor to 1 if the current
> categorical value matches the predictor's level).
> (5) Compute summaries of the resulting data frame (mean, variance,
> quartiles).
>
> Seems simple enough, but what would it look like?

Sounds good. Minor nit: one-of-n coding should be allowed as well, and I
would also expect that we could do random hashed encoding. A similar
problem statement applies to values that are textual, in addition to
categorical. The process is essentially the same in that you make 0 or 1
passes to optionally agree on a dictionary, and then another pass to
encode into n columns.
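For concreteness, here is a plain-Python sketch of steps (1)-(5) above -- this is not the proposed DSL, just a toy illustration of the semantics, with a data frame modeled as a dict of equal-length column lists; the helper names `dummy_code` and `summarize` are made up for this example:

```python
def dummy_code(frame, categorical_cols):
    """Steps (1)-(4): assess levels and encode n-1 dummy predictors."""
    out = {c: list(v) for c, v in frame.items() if c not in categorical_cols}
    for col in categorical_cols:
        # (1)-(2): assess distinct levels; drop the first as the reference level
        levels = sorted(set(frame[col]))
        for level in levels[1:]:
            # (4): Bernoulli feature, 1 iff the current value matches this level
            out[f"{col}_{level}"] = [1 if v == level else 0 for v in frame[col]]
    return out

def summarize(column):
    """Step (5): mean and (sample) variance; quartiles omitted for brevity."""
    n = len(column)
    mean = sum(column) / n
    var = sum((x - mean) ** 2 for x in column) / (n - 1) if n > 1 else 0.0
    return mean, var
```

So for a frame `{"C1": ["a", "b", "a", "c"]}` this produces two Bernoulli columns, `C1_b` and `C1_c`, with `"a"` as the implicit reference level; allowing one-of-n coding, per the nit above, would just mean not dropping that first level.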
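The random hashed encoding mentioned above can be sketched the same way: it skips the dictionary pass entirely by hashing each value straight into one of n columns (at the cost of possible collisions). Again a toy illustration, not Mahout code; `hashed_encode` is a made-up helper name:

```python
import hashlib

def hashed_encode(values, n_buckets=16):
    """Dictionary-free one-pass encoding: hash each value into one of
    n_buckets indicator columns (the 'hashing trick')."""
    cols = [[0] * len(values) for _ in range(n_buckets)]
    for i, v in enumerate(values):
        # Use a stable hash so the encoding is reproducible across runs
        # (Python's built-in hash() is salted per process).
        h = int(hashlib.md5(v.encode("utf-8")).hexdigest(), 16) % n_buckets
        cols[h][i] = 1
    return cols
```

This is what makes the textual case look the same as the categorical one: with hashing there are zero dictionary passes, and with a dictionary there is one, followed in both cases by a single encoding pass into n columns.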
