Yeah, for this to work, you need to know up front the number of distinct
values a categorical feature will ever take on. Sometimes that's known,
sometimes it's not.

One option is to use an algorithm that can use categorical features
directly, like decision trees.

If it isn't known, you could consider hashing your features. So, you'd
have maybe 10 indicator columns, and you hash each feature value into one
of those 10 columns to figure out which one it corresponds to. Of course,
collisions can happen at any point, and by the time you have an 11th value
one is guaranteed: two values land in the same column and get conflated.
But at least the number of features stays fixed and you can proceed.
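
A minimal sketch of that idea in plain Python, not Spark code. The function
name hashed_indicator, the constant NUM_BUCKETS and the sample values are
all made up for illustration, and crc32 is just one convenient stable hash,
not necessarily what any particular library uses:

```python
import zlib

NUM_BUCKETS = 10  # fixed hash space, like the 10 indicator columns above


def hashed_indicator(value, num_buckets=NUM_BUCKETS):
    """Return a fixed-length 0/1 vector with exactly one slot set for `value`."""
    # crc32 gives the same result on every run and every machine, unlike
    # Python's built-in hash() for strings, which is salted per process.
    index = zlib.crc32(value.encode("utf-8")) % num_buckets
    vector = [0.0] * num_buckets
    vector[index] = 1.0
    return vector


# A value never seen before still maps into the same 10 columns,
# so the feature count does not change between windows.
known = hashed_indicator("campaign_a")
brand_new = hashed_indicator("campaign_never_seen_before")
print(len(known), len(brand_new))
```

The vector length depends only on num_buckets, never on how many distinct
values have shown up so far, which is what keeps numFeatures constant.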

Hashing is more usually done when there is a large number of feature
values, but maybe that's what you have. Collisions become more of a
problem the smaller your hash space is.
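
To get a rough feel for how the hash space size affects collisions, here is
a small sketch in plain Python. The helper count_collided_values, the
made-up values item_0 ... item_999, and the bucket counts tried are all
arbitrary choices for illustration, not anything from Spark:

```python
import zlib


def count_collided_values(values, num_buckets):
    """Count values that share a bucket with at least one other value."""
    buckets = {}
    for value in values:
        index = zlib.crc32(value.encode("utf-8")) % num_buckets
        buckets.setdefault(index, []).append(value)
    # Only buckets holding two or more values contain conflated features.
    return sum(len(members) for members in buckets.values() if len(members) > 1)


values = ["item_%d" % i for i in range(1000)]  # hypothetical feature values
for num_buckets in (10, 1000, 2 ** 20):
    print(num_buckets, count_collided_values(values, num_buckets))
```

Running something like this on a sample of your real values is a cheap way
to pick a hash space size before wiring it into the streaming model.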

On Tue, Jul 12, 2016 at 10:21 AM, kundan kumar <iitr.kun...@gmail.com> wrote:
> Hi ,
>
> I am trying to use StreamingLogisticRegressionWithSGD to build a CTR
> prediction model.
>
> The document :
>
> http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression
>
> mentions that the numFeatures should be constant.
>
> The problem that I am facing is :
> Since most of my variables are categorical, the numFeatures variable should
> be the final set of variables after encoding and parsing the categorical
> variables in labeled point format.
>
> Suppose, for a categorical variable x1 I have 10 distinct values in current
> window.
>
> But in the next window some new values/items get added to x1 and the number
> of distinct values increases. How should I handle the numFeatures variable
> in this case, since it will change?
>
> Basically, my question is how should I handle the new values of the
> categorical variables in streaming model.
>
> Thanks,
> Kundan
>
>

