Hi, I am trying to use StreamingLogisticRegressionWithSGD to build a CTR prediction model.
The documentation at http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression mentions that numFeatures should be *constant*.

The problem I am facing is this: since most of my variables are categorical, numFeatures has to be the size of the final feature set after the categorical variables have been encoded and parsed into labeled point format. Suppose a categorical variable x1 has 10 distinct values in the current window, but in the next window some new values/items are added to x1 and the number of distinct values increases. How should I handle numFeatures in this case, since it will now change?

Basically, my question is: how should I handle new values of categorical variables in a streaming model?

Thanks,
Kundan
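
P.S. To make the x1 example concrete, here is a minimal plain-Python sketch of what I mean (no Spark involved; the item names and the helper one_hot are made up purely for illustration). It only shows how the encoded width grows between windows, not how my actual pipeline is written:

```python
# Hypothetical illustration: one-hot encoding a categorical variable x1
# whose set of distinct values grows from one window to the next.

def one_hot(value, vocabulary):
    """Encode `value` as a 0/1 list with one column per vocabulary entry."""
    return [1.0 if value == v else 0.0 for v in vocabulary]

# Window 1: x1 takes 10 distinct values (names are invented).
window1 = ["item%d" % i for i in range(10)]
# Window 2: the same values plus two that were never seen before.
window2 = window1 + ["item10", "item11"]

vocab1 = sorted(set(window1))
vocab2 = sorted(set(window1) | set(window2))

width1 = len(vocab1)  # feature count contributed by x1 after window 1
width2 = len(vocab2)  # feature count contributed by x1 after window 2

print(width1, width2)  # prints "10 12"

# A value first seen in window 2 has no column in the window-1 encoding,
# so it comes out as an all-zero vector of the old width.
unseen = one_hot("item10", vocab1)
```

So the vector length for x1 alone goes from 10 to 12, which is exactly the change in numFeatures that I do not know how to handle.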