Thank you guys for the prompt help.

I ended up building Spark master and verified what DB suggested.

import org.apache.spark.ml.classification.{LogisticRegression => MlLogisticRegression}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row

val lr = new MlLogisticRegression()
  .setFitIntercept(true)
  .setMaxIter(35)

val model = lr.fit(sqlContext.createDataFrame(training))
val scoreAndLabels = model.transform(sqlContext.createDataFrame(test))
  .select("probability", "label")
  .map { case Row(probability: Vector, label: Double) =>
    (probability(1), label)
  }
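
The AUC below can then be computed from scoreAndLabels the same way as
in my original snippet, with mllib's BinaryClassificationMetrics:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// scoreAndLabels is an RDD[(Double, Double)] of (P(label = 1), true label)
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println("Area under ROC: " + metrics.areaUnderROC())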

Without much tuning, the above generates:

Weights: [0.0013971323020715888,0.8559779783186241,-0.5052275562089914]
Intercept: -3.3076806966913006
Area under ROC: 0.7033511043412033

I also tried it on a much bigger dataset I have, and its results are close
to what I get from statsmodels.

Now eagerly waiting for the 1.4 release.

Thanks,
Xin



On Wed, May 20, 2015 at 9:37 PM, Chris Gore <cdg...@cdgore.com> wrote:

> I tried running this data set as described with my own implementation
> of L2-regularized logistic regression using LBFGS to compare:
> https://github.com/cdgore/fitbox
>
> Intercept: -0.886745823033
> Weights (['gre', 'gpa', 'rank']):[ 0.28862268  0.19402388 -0.36637964]
> Area under ROC: 0.724056603774
>
> The difference could be from the feature preprocessing, as mentioned. I
> normalized the features around 0:
>
> binary_train_normalized = (binary_train - binary_train.mean()) / binary_train.std()
> binary_test_normalized = (binary_test - binary_train.mean()) / binary_train.std()
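>
> In Spark, roughly the same thing could be done with mllib's
> StandardScaler (a sketch; withMean = true requires dense feature
> vectors, and the scaler is fit on the training features only, as above):
>
> import org.apache.spark.mllib.feature.StandardScaler
>
> // fit mean/std on the training features, then apply to both sets
> val scaler = new StandardScaler(withMean = true, withStd = true)
>   .fit(training.map(_.features))
> val trainingScaled = training.map(p => p.copy(features = scaler.transform(p.features)))
> val testScaled = test.map(p => p.copy(features = scaler.transform(p.features)))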
>
> On a data set this small, the difference in models could also be the
> result of how the training/test sets were split.
>
> Have you tried running k-folds cross validation on a larger data set?
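>
> (For what it's worth, a minimal k-folds sketch with the new ml pipeline
> API's CrossValidator, assuming an ml LogisticRegression estimator `lr`
> like the one at the top of this thread; the grid values are
> placeholders:)
>
> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
> import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
>
> // areaUnderROC is the evaluator's default metric
> val grid = new ParamGridBuilder()
>   .addGrid(lr.regParam, Array(0.0, 0.01, 0.1))
>   .build()
> val cv = new CrossValidator()
>   .setEstimator(lr)
>   .setEvaluator(new BinaryClassificationEvaluator())
>   .setEstimatorParamMaps(grid)
>   .setNumFolds(5)
> val cvModel = cv.fit(sqlContext.createDataFrame(training))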
>
> Chris
>
> On May 20, 2015, at 6:15 PM, DB Tsai <d...@netflix.com.INVALID> wrote:
>
> Hi Xin,
>
> If you take a look at the model you trained, the intercept from Spark
> is significantly smaller than StatsModel's. In LOR the intercept
> represents a prior on the categories, so regularizing it away causes
> the low accuracy of the Spark implementation. In
> LogisticRegressionWithLBFGS, the intercept is regularized due to the
> implementation of Updater, but the intercept should not be regularized.
>
> In the new pipeline APIs, a LOR with elasticNet is implemented, and
> the intercept is properly handled.
>
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
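>
> (Roughly, the new API is used like this; a sketch with placeholder
> parameter values:)
>
> import org.apache.spark.ml.classification.LogisticRegression
>
> // elasticNetParam mixes the penalties: 0.0 is pure L2, 1.0 is pure L1;
> // the intercept is excluded from regularization either way
> val lor = new LogisticRegression()
>   .setFitIntercept(true)
>   .setElasticNetParam(0.0)
>   .setRegParam(0.01)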
>
> As you can see in the tests,
>
> https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
> the results are now exactly the same as R's.
>
> BTW, in both versions feature scaling is done before training: we train
> the model in the scaled space and then transform the weights back to
> the original space. The only difference is that the mllib version,
> LogisticRegressionWithLBFGS, regularizes the intercept, while in the ml
> version the intercept is excluded from regularization. As a result, if
> lambda is zero, the models should be the same.
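>
> (Concretely: if each feature is scaled as x_j' = x_j / sigma_j and the
> optimizer finds weights w_j' in the scaled space, then the weights
> reported in the original space are w_j = w_j' / sigma_j. This is just a
> sketch of the transformation, ignoring any mean-centering.)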
>
>
>
> On Wed, May 20, 2015 at 3:42 PM, Xin Liu <liuxin...@gmail.com> wrote:
>
> Hi,
>
> I have tried a few models in MLlib to train a LogisticRegression model.
> However, I consistently get much better results, in terms of AUC, from
> other libraries such as statsmodels (which gives similar results to R).
> For illustration purposes, I used a small data set (I have also tried
> much bigger data): http://www.ats.ucla.edu/stat/data/binary.csv from
> http://www.ats.ucla.edu/stat/r/dae/logit.htm
>
> Here is a snippet showing my usage of LogisticRegressionWithLBFGS.
>
> import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
> import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
>
> val algorithm = new LogisticRegressionWithLBFGS
> algorithm.setIntercept(true)
> algorithm.optimizer
>   .setNumIterations(100)
>   .setRegParam(0.01)
>   .setConvergenceTol(1e-5)
> val model = algorithm.run(training)
> model.clearThreshold()
> val scoreAndLabels = test.map { point =>
>   val score = model.predict(point.features)
>   (score, point.label)
> }
> val metrics = new BinaryClassificationMetrics(scoreAndLabels)
> val auROC = metrics.areaUnderROC()
>
> I did a (0.6, 0.4) training/test split. The response is "admit" and the
> features are "GRE score", "GPA", and "college rank".
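>
> (The split was along these lines; a sketch, assuming the parsed data is
> an RDD[LabeledPoint] named `data` and using a made-up seed:)
>
> val Array(training, test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)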
>
> Spark:
> Weights (GRE, GPA, Rank):
> [0.0011576276331509304,0.048544858567336854,-0.394202150286076]
> Intercept: -0.6488972641282202
> Area under ROC: 0.6294070512820512
>
> StatsModel:
> Weights [0.0018, 0.7220, -0.3148]
> Intercept: -3.5913
> Area under ROC: 0.69
>
> The weights from statsmodels seem more reasonable if you consider that,
> for a one-unit increase in GPA, the log odds of being admitted to
> graduate school increase by 0.72 in statsmodels versus 0.04 in Spark.
>
> I have seen much bigger differences with other data. So my question is:
> has anyone compared the results with other libraries, and is anything
> wrong with how my code invokes LogisticRegressionWithLBFGS?
>
> The real data I am processing is pretty big, and I really want to use
> Spark to get this to work. Please let me know if you have had a similar
> experience and how you resolved it.
>
> Thanks,
> Xin