Thank you guys for the prompt help. I ended up building Spark master and verified what DB suggested.
val lr = (new MlLogisticRegression)
  .setFitIntercept(true)
  .setMaxIter(35)

val model = lr.fit(sqlContext.createDataFrame(training))

val scoreAndLabels = model.transform(sqlContext.createDataFrame(test))
  .select("probability", "label")
  .map { case Row(probability: Vector, label: Double) =>
    (probability(1), label)
  }

Without doing much tuning, the above generates:

Weights: [0.0013971323020715888,0.8559779783186241,-0.5052275562089914]
Intercept: -3.3076806966913006
Area under ROC: 0.7033511043412033

I also tried it on a much bigger dataset I have, and its results are close to what I get from statsmodel. Now eagerly waiting for the 1.4 release.

Thanks,
Xin

On Wed, May 20, 2015 at 9:37 PM, Chris Gore <cdg...@cdgore.com> wrote:
> I tried running this data set as described with my own implementation of
> L2-regularized logistic regression using LBFGS to compare:
> https://github.com/cdgore/fitbox
>
> Intercept: -0.886745823033
> Weights (['gre', 'gpa', 'rank']): [ 0.28862268  0.19402388 -0.36637964]
> Area under ROC: 0.724056603774
>
> The difference could be from the feature preprocessing, as mentioned. I
> normalized the features around 0:
>
> binary_train_normalized = (binary_train - binary_train.mean()) / binary_train.std()
> binary_test_normalized = (binary_test - binary_train.mean()) / binary_train.std()
>
> On a data set this small, the difference in models could also be the
> result of how the training/test sets were split.
>
> Have you tried running k-folds cross validation on a larger data set?
>
> Chris
>
> On May 20, 2015, at 6:15 PM, DB Tsai <d...@netflix.com.INVALID> wrote:
>
> Hi Xin,
>
> If you take a look at the model you trained, the intercept from Spark
> is significantly smaller than StatsModel's. The intercept represents a
> prior on the categories in LOR, and regularizing it causes the low
> accuracy of the Spark implementation. In LogisticRegressionWithLBFGS,
> the intercept is regularized due to the implementation of Updater, but
> the intercept should not be regularized.
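DB's point about the intercept penalty can be seen in a toy sketch. This is not Spark's implementation: it is a minimal one-feature logistic regression fit by plain gradient descent on made-up, imbalanced data, run once with the intercept inside the L2 penalty (as in LogisticRegressionWithLBFGS) and once with it excluded (as in the ml version):

```python
import math

def fit(xs, ys, lam, penalize_intercept, steps=5000, lr=0.1):
    """Gradient descent on L2-regularized logistic loss for one feature.
    If penalize_intercept is True, the lam*b term shrinks the intercept
    toward 0 along with the weight."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += (p - y)
        gw = gw / n + lam * w
        gb = gb / n + (lam * b if penalize_intercept else 0.0)
        w -= lr * gw
        b -= lr * gb
    return w, b

# Made-up data with only 25% positives, so the unpenalized
# intercept (the prior on the classes) should be clearly negative.
xs = [0.1, 0.4, 0.35, 0.8, 0.9, 0.05, 0.2, 0.6]
ys = [0,   0,   0,    1,   0,   0,    0,   1  ]

w1, b1 = fit(xs, ys, lam=1.0, penalize_intercept=False)
w2, b2 = fit(xs, ys, lam=1.0, penalize_intercept=True)
print(abs(b2) < abs(b1))  # the penalized intercept is shrunk toward zero
```

The shrunk intercept pulls the model's baseline probability toward 0.5 regardless of the actual class balance, which is the distortion DB describes.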
>
> In the new pipeline APIs, a LOR with elastic net is implemented, and
> the intercept is properly handled:
>
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
>
> As you can see from the tests,
>
> https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
>
> the result is now exactly the same as R's.
>
> BTW, in both versions the feature scaling is done before training:
> we train the model in the scaled space but transform the weights back
> to the original space. The only difference is that in the mllib version,
> LogisticRegressionWithLBFGS regularizes the intercept, while in the ml
> version the intercept is excluded from regularization. As a result, if
> lambda is zero, the models should be the same.
>
> On Wed, May 20, 2015 at 3:42 PM, Xin Liu <liuxin...@gmail.com> wrote:
>
> Hi,
>
> I have tried a few models in MLlib to train a LogisticRegression model.
> However, I consistently get much better results, in terms of AUC, from
> other libraries such as statsmodel (which gives similar results to R).
> For illustration purposes, I used a small data set (I have also tried
> much bigger data): http://www.ats.ucla.edu/stat/data/binary.csv from
> http://www.ats.ucla.edu/stat/r/dae/logit.htm
>
> Here is a snippet of my usage of LogisticRegressionWithLBFGS:
>
> val algorithm = new LogisticRegressionWithLBFGS
> algorithm.setIntercept(true)
> algorithm.optimizer
>   .setNumIterations(100)
>   .setRegParam(0.01)
>   .setConvergenceTol(1e-5)
>
> val model = algorithm.run(training)
> model.clearThreshold()
>
> val scoreAndLabels = test.map { point =>
>   val score = model.predict(point.features)
>   (score, point.label)
> }
>
> val metrics = new BinaryClassificationMetrics(scoreAndLabels)
> val auROC = metrics.areaUnderROC()
>
> I did a (0.6, 0.4) split for training/test. The response is "admit" and
> the features are "GRE score", "GPA", and "college rank".
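The back-transform DB describes (train on standardized features, then map the weights to the original space) follows from simple algebra: if x_s = (x - mu) / sigma and the scaled-space model is (w_s, b_s), the equivalent original-space model is w = w_s / sigma and b = b_s - sum(w_s * mu / sigma). The means, standard deviations, and fitted values below are made-up numbers for illustration, not outputs of either library:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

mu    = [580.0, 3.4, 2.5]    # hypothetical feature means (gre, gpa, rank)
sigma = [115.0, 0.38, 0.95]  # hypothetical feature std devs
w_s   = [0.16, 0.33, -0.48]  # hypothetical weights fit on scaled features
b_s   = -1.1                 # hypothetical scaled-space intercept

# Map the scaled-space model back to original feature units.
w = [ws / s for ws, s in zip(w_s, sigma)]
b = b_s - sum(ws * m / s for ws, m, s in zip(w_s, mu, sigma))

# Both parameterizations give the same probability on any input.
x = [620.0, 3.7, 2.0]
z_scaled = sum(ws * (xi - m) / s for ws, xi, m, s in zip(w_s, x, mu, sigma)) + b_s
z_orig   = sum(wi * xi for wi, xi in zip(w, x)) + b
print(abs(sigmoid(z_scaled) - sigmoid(z_orig)) < 1e-12)
```

Note that the divisions by sigma rescale the weights substantially when features span very different ranges (GRE in the hundreds vs. GPA near 4), which is why reported weights can look so different across libraries even when the fitted models agree.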
>
> Spark:
> Weights (GRE, GPA, Rank):
> [0.0011576276331509304,0.048544858567336854,-0.394202150286076]
> Intercept: -0.6488972641282202
> Area under ROC: 0.6294070512820512
>
> StatsModel:
> Weights: [0.0018, 0.7220, -0.3148]
> Intercept: -3.5913
> Area under ROC: 0.69
>
> The weights from statsmodel seem more reasonable if you consider that
> for a one-unit increase in GPA, the log odds of being admitted to
> graduate school increase by 0.72 in statsmodel versus 0.04 in Spark.
>
> I have seen much bigger differences with other data. So my question is:
> has anyone compared the results with other libraries, and is anything
> wrong with how my code invokes LogisticRegressionWithLBFGS?
>
> The real data I am processing is pretty big, and I really want to use
> Spark for this. Please let me know if you have had a similar experience
> and how you resolved it.
>
> Thanks,
> Xin
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
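The log-odds reading of the weights can be made concrete by exponentiating them: exp(weight) gives the multiplicative change in the odds of admission per one-unit increase in the feature. A quick check with the two GPA weights quoted above:

```python
import math

statsmodel_gpa = 0.7220
spark_gpa      = 0.048544858567336854

# Odds ratios per one additional GPA point under each model.
print(math.exp(statsmodel_gpa))  # roughly 2.06: each GPA point about doubles the odds
print(math.exp(spark_gpa))       # roughly 1.05: almost no effect
```

The gap between "doubles the odds" and "almost no effect" is what makes the statsmodel weights look more plausible for this data set.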