I tried running this data set as described with my own implementation of
L2-regularized logistic regression using LBFGS to compare:
https://github.com/cdgore/fitbox

Intercept: -0.886745823033
Weights (['gre', 'gpa', 'rank']):[ 0.28862268  0.19402388 -0.36637964]
Area under ROC: 0.724056603774

The difference could be from the feature preprocessing, as mentioned. I
normalized the features to zero mean and unit standard deviation:

# Standardize both sets using the training set's mean and standard deviation
binary_train_normalized = (binary_train - binary_train.mean()) / binary_train.std()
binary_test_normalized = (binary_test - binary_train.mean()) / binary_train.std()

On a data set this small, the difference in models could also be the result of 
how the training/test sets were split.

Have you tried running k-fold cross-validation on a larger data set?
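
For reference, here is a rough sketch of what I mean, reusing the
LogisticRegressionWithLBFGS setup from your snippet below (the fold count,
the seed, and the crossValidatedAUC helper are just illustrative choices):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

// `data` is assumed to be the full RDD[LabeledPoint], already preprocessed.
def crossValidatedAUC(data: RDD[LabeledPoint], numFolds: Int = 5): Double = {
  val folds = MLUtils.kFold(data, numFolds, 42)  // Array of (training, test) pairs
  val aucs = folds.map { case (training, test) =>
    val algorithm = new LogisticRegressionWithLBFGS
    algorithm.setIntercept(true)
    algorithm.optimizer
      .setNumIterations(100)
      .setRegParam(0.01)
      .setConvergenceTol(1e-5)
    val model = algorithm.run(training)
    model.clearThreshold()
    val scoreAndLabels = test.map(point => (model.predict(point.features), point.label))
    new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()
  }
  aucs.sum / aucs.length  // average AUC across the folds
}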

Chris

> On May 20, 2015, at 6:15 PM, DB Tsai <d...@netflix.com.INVALID> wrote:
> 
> Hi Xin,
> 
> If you take a look at the model you trained, the intercept from Spark
> is significantly smaller than the one from StatsModel, and the intercept
> represents a prior on the categories in LOR, which causes the low
> accuracy of the Spark implementation. In LogisticRegressionWithLBFGS,
> the intercept is regularized due to the implementation of Updater,
> but the intercept should not be regularized.
> 
> In the new pipeline APIs, a LOR with elasticNet is implemented, and
> the intercept is properly handled.
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
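> 
> Roughly, using it looks like the following (a minimal sketch, assuming
> Spark 1.4's ml API and a DataFrame `training` with "label" and "features"
> columns; the parameter values are only examples):
> 
> import org.apache.spark.ml.classification.LogisticRegression
> 
> val lor = new LogisticRegression()
>   .setFitIntercept(true)      // the intercept is not regularized here
>   .setRegParam(0.01)
>   .setElasticNetParam(0.0)    // 0.0 = pure L2, 1.0 = pure L1
>   .setMaxIter(100)
> val lorModel = lor.fit(training)
> println(s"Weights: ${lorModel.weights} Intercept: ${lorModel.intercept}")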
> 
> As you can see in the tests,
> https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
> the result is exactly the same as R now.
> 
> BTW, in both versions, feature scaling is done before training: we train
> the model in the scaled space and then transform the model weights back
> to the original space. The only difference is that the mllib version,
> LogisticRegressionWithLBFGS, regularizes the intercept, while in the ml
> version the intercept is excluded from regularization. As a result, if
> lambda is zero, the models should be the same.
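> 
> To make the back-transformation concrete, here is a sketch of the idea
> (not the actual implementation): assuming the features were only divided
> by their standard deviations (no centering), a weight learned in the
> scaled space maps back to the original space by dividing by the same
> standard deviation, and the intercept is unchanged.
> 
> import org.apache.spark.mllib.linalg.{Vector, Vectors}
> 
> // Sketch only: map weights trained on x / sigma back to the original scale.
> def unscaleWeights(scaledWeights: Array[Double], featureStd: Array[Double]): Vector =
>   Vectors.dense(scaledWeights.zip(featureStd).map { case (w, sigma) => w / sigma })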
> 
> 
> 
> On Wed, May 20, 2015 at 3:42 PM, Xin Liu <liuxin...@gmail.com> wrote:
>> Hi,
>> 
>> I have tried a few models in MLlib to train a LogisticRegression model.
>> However, I consistently get much better results, in terms of AUC, using
>> other libraries such as statsmodel (which gives similar results to R).
>> For illustration purposes, I used a small data set (I have tried much
>> bigger data as well): http://www.ats.ucla.edu/stat/data/binary.csv,
>> described in http://www.ats.ucla.edu/stat/r/dae/logit.htm
>> 
>> Here is the snippet of my usage of LogisticRegressionWithLBFGS.
>> 
>> val algorithm = new LogisticRegressionWithLBFGS
>> algorithm.setIntercept(true)
>> algorithm.optimizer
>>   .setNumIterations(100)
>>   .setRegParam(0.01)
>>   .setConvergenceTol(1e-5)
>> val model = algorithm.run(training)
>> model.clearThreshold()
>> val scoreAndLabels = test.map { point =>
>>   val score = model.predict(point.features)
>>   (score, point.label)
>> }
>> val metrics = new BinaryClassificationMetrics(scoreAndLabels)
>> val auROC = metrics.areaUnderROC()
>> 
>> I did a (0.6, 0.4) split for training/test. The response is "admit" and
>> features are "GRE score", "GPA", and "college Rank".
>> 
>> Spark:
>> Weights (GRE, GPA, Rank):
>> [0.0011576276331509304,0.048544858567336854,-0.394202150286076]
>> Intercept: -0.6488972641282202
>> Area under ROC: 0.6294070512820512
>> 
>> StatsModel:
>> Weights [0.0018, 0.7220, -0.3148]
>> Intercept: -3.5913
>> Area under ROC: 0.69
>> 
>> The weights from statsmodel seem more reasonable if you consider that for
>> a one unit increase in gpa, the log odds of being admitted to graduate
>> school increase by 0.72 according to statsmodel versus only 0.04 in Spark.
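>> (In odds-ratio terms, that is exp(0.722) ≈ 2.06 versus exp(0.0485) ≈ 1.05
>> per additional GPA point, so the two models tell very different stories.)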
>> 
>> I have seen much bigger differences with other data. So my question is: has
>> anyone compared the results with other libraries, and is anything wrong with
>> the way my code invokes LogisticRegressionWithLBFGS?
>> 
>> The real data I am processing is pretty big, and I really want to use Spark
>> to get this to work. Please let me know if you have had a similar experience
>> and how you resolved it.
>> 
>> Thanks,
>> Xin
> 
