Re: Handling categorical variables in StreamingLogisticRegressionwithSGD

2016-07-13 Thread kundan kumar
Hi Sean, thanks for the reply! Is there anything already available in Spark that can fix the depth of categorical variables? The OneHotEncoder changes the length of the vector it creates depending on the number of distinct values coming in the stream. Is there any parameter available with the

Re: Handling categorical variables in StreamingLogisticRegressionwithSGD

2016-07-12 Thread Sean Owen
Yeah, for this to work, you need to know the number of distinct values a categorical feature will take on, ever. Sometimes that's known, sometimes it's not. One option is to use an algorithm that can use categorical features directly, like decision trees. You could consider hashing your features
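
As a rough illustration of the hashing idea mentioned above, here is a minimal plain-Python sketch (it does not use Spark's own hashing utilities). It maps each (feature, value) pair to a bucket in a vector whose width is fixed up front, so unseen categories never change the vector length. The names NUM_FEATURES, hash_index and hash_features, the width of 16, and the example field names are all assumptions made for this example only.

```python
import hashlib

# Assumed fixed vector width, picked only for illustration.
# A real model would likely use a much larger value to limit collisions.
NUM_FEATURES = 16


def hash_index(feature_name, value, num_features=NUM_FEATURES):
    """Map a (feature, value) pair to a stable bucket in [0, num_features)."""
    key = "{}={}".format(feature_name, value).encode("utf-8")
    # hashlib gives the same result across processes, unlike the built-in
    # hash(), which is salted per interpreter run for strings.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % num_features


def hash_features(row, num_features=NUM_FEATURES):
    """Encode a dict of categorical fields as a fixed-length count vector."""
    vec = [0.0] * num_features
    for name, value in row.items():
        # Colliding pairs share a bucket, so counts are accumulated.
        vec[hash_index(name, str(value), num_features)] += 1.0
    return vec


# Hypothetical click records: the second one contains values never seen before.
seen = hash_features({"site": "a.com", "device": "mobile"})
unseen = hash_features({"site": "brand-new.org", "device": "tablet"})
print(len(seen), len(unseen))
```

Both vectors have the same length regardless of how many distinct values arrive, which is what a constant numFeatures requires. The cost is that distinct values can collide in one bucket. In PySpark, each such vector could presumably be wrapped with Vectors.dense and LabeledPoint before being fed to the streaming model.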

Handling categorical variables in StreamingLogisticRegressionwithSGD

2016-07-12 Thread kundan kumar
Hi, I am trying to use StreamingLogisticRegressionWithSGD to build a CTR prediction model. The documentation at http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression mentions that numFeatures should be *constant*. The problem that I am facing is: Since most