Re: Handling categorical variables in StreamingLogisticRegressionwithSGD

kundan kumar Wed, 13 Jul 2016 00:52:58 -0700

Hi Sean,

Thanks for the reply!

Is there anything already available in Spark that can fix the number of
levels of a categorical variable? The OneHotEncoder changes the length of
the vector it creates depending on the number of distinct values coming in
the stream.

Is there any parameter available with the StringIndexer so that I can fix
the number of levels of a categorical variable, or will I need to write
some implementation of my own?

Thanks,
Kundan
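
For illustration, a "write my own" encoder of the kind asked about above
might look like the sketch below. It is plain Python, not a Spark API, and
the names make_fixed_encoder and encode are hypothetical. It assumes the
set of expected values is chosen up front and that every unseen value is
sent to one extra "unknown" slot, so the vector length never changes.

```python
def make_fixed_encoder(known_values):
    """Build a one-hot encoder whose output length is fixed up front.

    Hypothetical helper, not part of Spark. Values outside known_values
    all share one extra "unknown" slot at the end of the vector.
    """
    # Drop duplicates while keeping the first-seen order.
    ordered = list(dict.fromkeys(known_values))
    index = {value: i for i, value in enumerate(ordered)}
    unknown_slot = len(ordered)
    size = unknown_slot + 1

    def encode(value):
        vector = [0.0] * size
        vector[index.get(value, unknown_slot)] = 1.0
        return vector

    return encode


encode = make_fixed_encoder(["mobile", "desktop", "tablet"])
seen = encode("desktop")    # a known value gets its own slot
unseen = encode("smart-tv") # a new value falls into the last slot
```

The trade-off is that all new values are indistinguishable from each other
until the vocabulary is rebuilt and the model is retrained.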

On Tue, Jul 12, 2016 at 5:43 PM, Sean Owen <so...@cloudera.com> wrote:

> Yeah, for this to work, you need to know the number of distinct values
> a categorical feature will take on, ever. Sometimes that's known,
> sometimes it's not.
>
> One option is to use an algorithm that can use categorical features
> directly, like decision trees.
>
> If it's not known, you could consider hashing your features. So, you'd
> have maybe 10 indicator columns and you hash the feature into one of
> those 10 columns to figure out which one it corresponds to. Of course,
> when you have an 11th value it collides with one of them and they get
> conflated, but at least you can sort of proceed.
>
> This is more usually done with a large number of feature values, but
> maybe that's what you have. It's more problematic the smaller your
> hash space is.
>
> On Tue, Jul 12, 2016 at 10:21 AM, kundan kumar <iitr.kun...@gmail.com>
> wrote:
> > Hi,
> >
> > I am trying to use StreamingLogisticRegressionWithSGD to build a CTR
> > prediction model.
> >
> > The documentation at
> >
> > http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression
> >
> > mentions that numFeatures should be constant.
> >
> > The problem I am facing is this: since most of my variables are
> > categorical, numFeatures should be the final number of features after
> > encoding the categorical variables and parsing them into labeled point
> > format.
> >
> > Suppose that for a categorical variable x1 I have 10 distinct values in
> > the current window.
> >
> > But in the next window some new values/items get added to x1 and the
> > number of distinct values increases. How should I handle numFeatures
> > in this case, since it will now change?
> >
> > Basically, my question is how should I handle the new values of the
> > categorical variables in streaming model.
> >
> > Thanks,
> > Kundan
> >
> >
>
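
The hashing approach Sean describes in the quoted reply can be sketched in
a few lines. This is plain Python rather than Spark code, and the function
name hash_one_hot, the "name=value" key format, and the use of CRC32 as the
hash are all assumptions made for the example. The point it shows is that
the vector length depends only on num_buckets, never on how many distinct
values have been seen.

```python
import zlib


def hash_one_hot(feature_name, value, num_buckets=10):
    """Map a categorical value to an indicator vector of fixed length.

    Hypothetical helper. CRC32 is used because it gives the same result
    in every process, unlike Python's built-in hash() for strings.
    """
    key = "{}={}".format(feature_name, value).encode("utf-8")
    index = zlib.crc32(key) % num_buckets
    vector = [0.0] * num_buckets
    vector[index] = 1.0
    return vector


# Values from one window and a brand new value from a later window
# all produce vectors of the same length.
window_1 = [hash_one_hot("x1", v) for v in ["red", "green", "blue"]]
window_2 = hash_one_hot("x1", "a-colour-never-seen-before")
```

With 10 buckets, any 11 distinct values must share at least one column, and
collisions can also happen earlier. A larger num_buckets makes collisions
rarer at the cost of a longer, sparser feature vector.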
