I tried running this data set as described with my own implementation of L2-regularized logistic regression using L-BFGS to compare: https://github.com/cdgore/fitbox
Intercept: -0.886745823033
Weights (['gre', 'gpa', 'rank']): [ 0.28862268  0.19402388 -0.36637964]
Area under ROC: 0.724056603774

The difference could be from the feature preprocessing, as mentioned. I normalized the features around 0:

binary_train_normalized = (binary_train - binary_train.mean()) / binary_train.std()
binary_test_normalized = (binary_test - binary_train.mean()) / binary_train.std()

On a data set this small, the difference in models could also be the result of how the training/test sets were split. Have you tried running k-fold cross-validation on a larger data set?

Chris

> On May 20, 2015, at 6:15 PM, DB Tsai <d...@netflix.com.INVALID> wrote:
>
> Hi Xin,
>
> If you take a look at the model you trained, the intercept from Spark
> is significantly smaller than the one from statsmodels, and the intercept
> represents a prior on the categories in LOR, which causes the low accuracy
> in the Spark implementation. In LogisticRegressionWithLBFGS, the intercept
> is regularized due to the implementation of Updater, but the intercept
> should not be regularized.
>
> In the new pipeline APIs, a LOR with elasticNet is implemented, and
> the intercept is properly handled:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
>
> As you can see in the tests,
> https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
> the result is exactly the same as R now.
>
> BTW, in both versions, feature scaling is done before training, and
> we train the model in the scaled space but transform the model weights
> back to the original space. The only difference is that the mllib version,
> LogisticRegressionWithLBFGS, regularizes the intercept, while in the ml
> version the intercept is excluded from regularization. As a result,
> if lambda is zero, the models should be the same.
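To make DB's point concrete, here is a small sketch on synthetic data. It uses Python with scikit-learn standing in for Spark (so the library, the data, and every number here are illustrative assumptions, not the mllib code path): an L2 penalty that includes the bias term pulls the intercept toward zero, which distorts the model's class prior when the classes are imbalanced, while excluding the intercept from the penalty (as the new ml implementation does) leaves it free to match the base rate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Imbalanced labels so the true intercept is far from zero.
logits = X @ np.array([0.3, 0.2, -0.4]) - 2.0
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# scikit-learn's default: the intercept is excluded from the L2 penalty
# (analogous to the new ml implementation).
unpenalized = LogisticRegression(C=0.01).fit(X, y)

# Simulate regularizing the intercept (analogous to the old mllib Updater):
# append a constant column and fit with no separate intercept, so the bias
# weight is penalized together with the feature weights.
Xb = np.hstack([X, np.ones((X.shape[0], 1))])
penalized = LogisticRegression(C=0.01, fit_intercept=False).fit(Xb, y)

print("intercept, excluded from penalty:", unpenalized.intercept_[0])
print("intercept, included in penalty: ", penalized.coef_[0, -1])
```

With a strong penalty (small C), the second intercept is noticeably shrunk toward zero even though the data's base rate has not changed, which is the same effect that depressed the Spark intercept above.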
>
> On Wed, May 20, 2015 at 3:42 PM, Xin Liu <liuxin...@gmail.com> wrote:
>> Hi,
>>
>> I have tried a few models in MLlib to train a LogisticRegression model.
>> However, I consistently get much better results, in terms of AUC, using
>> other libraries such as statsmodels (which gives similar results to R).
>> For illustration purposes, I used a small data set (I have tried much
>> bigger data): http://www.ats.ucla.edu/stat/data/binary.csv from
>> http://www.ats.ucla.edu/stat/r/dae/logit.htm
>>
>> Here is the snippet of my usage of LogisticRegressionWithLBFGS:
>>
>> val algorithm = new LogisticRegressionWithLBFGS
>> algorithm.setIntercept(true)
>> algorithm.optimizer
>>   .setNumIterations(100)
>>   .setRegParam(0.01)
>>   .setConvergenceTol(1e-5)
>> val model = algorithm.run(training)
>> model.clearThreshold()
>> val scoreAndLabels = test.map { point =>
>>   val score = model.predict(point.features)
>>   (score, point.label)
>> }
>> val metrics = new BinaryClassificationMetrics(scoreAndLabels)
>> val auROC = metrics.areaUnderROC()
>>
>> I did a (0.6, 0.4) split for training/test. The response is "admit" and
>> the features are "GRE score", "GPA", and "college rank".
>>
>> Spark:
>> Weights (GRE, GPA, Rank):
>> [0.0011576276331509304, 0.048544858567336854, -0.394202150286076]
>> Intercept: -0.6488972641282202
>> Area under ROC: 0.6294070512820512
>>
>> statsmodels:
>> Weights: [0.0018, 0.7220, -0.3148]
>> Intercept: -3.5913
>> Area under ROC: 0.69
>>
>> The weights from statsmodels seem more reasonable if you consider that,
>> for a one-unit increase in GPA, the log odds of being admitted to
>> graduate school increase by 0.72 in statsmodels versus 0.04 in Spark.
>>
>> I have seen much bigger differences with other data. So my question is:
>> has anyone compared the results with other libraries, and is anything
>> wrong with my code invoking LogisticRegressionWithLBFGS?
>>
>> The real data I am processing is pretty big, and I really want to use
>> Spark to get this to work.
>> Please let me know if you have had a similar experience and how you
>> resolved it.
>>
>> Thanks,
>> Xin
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
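Chris's two suggestions earlier in the thread (normalize with the training split's statistics, and check k-fold cross-validated AUC on a data set this small) can be sketched together. This is Python with scikit-learn on synthetic data, purely illustrative of the evaluation procedure, not of the Spark API:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(400, 3))
logits = X @ np.array([0.3, 0.2, -0.4]) - 2.0
y = (rng.random(400) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

aucs = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    # Center and scale BOTH splits with the TRAINING split's mean/std,
    # mirroring the binary_train_normalized snippet above, so no
    # test-set statistics leak into preprocessing.
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    model = LogisticRegression().fit((X_train - mu) / sigma, y[train_idx])
    scores = model.predict_proba((X_test - mu) / sigma)[:, 1]
    aucs.append(roc_auc_score(y[test_idx], scores))

print("AUC per fold:", [round(a, 3) for a in aucs])
print("mean AUC:", round(float(np.mean(aucs)), 3))
```

Averaging AUC over the folds gives a much steadier estimate than a single 0.6/0.4 split on 400 rows, where a lucky or unlucky split can easily move AUC by several points.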