Hi Frank,

Thanks for this question. I have been comparing logistic regression in sklearn with Spark MLlib as well, and your example code gave me a perfect way to compare what is going on in both packages.
I looked at the source code of both. There are quite a few differences in how the model fitting is done. I have a solution for the logistic regression problem; I do not have a solution for the linear regression problem yet. Here are the key differences:

1. In Spark, the L2 regularization of each coefficient is effectively divided by that feature's standard deviation. In sklearn, it is not.

2. In Spark, the features are standardized internally. Because of the regularization, this changes the solution. In sklearn, no standardization is done.

3. Spark trains on the average log loss: the total log loss is divided by the sum of the instance weights, which here equals the number of training instances. sklearn uses the sum of the log loss instead. For the same parameter value, Spark's regularization is therefore much heavier, and you should scale the regularization parameter down by the number of instances.

So, if you do the following, you should be able to match the logistic regression outputs:

1. Standardize the Spark and pandas dataframes in the same fashion. Note that the standardization works a little differently in the two libraries: Spark divides by the sample standard deviation, with sqrt(n-1) in the denominator (the unbiased estimator when the mean is estimated from the data), while sklearn's StandardScaler divides by the population standard deviation, with sqrt(n) in the denominator.

2. Scale down the regularization in Spark by the number of instances: use 0.03 in your example instead of 0.3, given that you have 10 training instances. (See the short sketch below for how the two objectives line up.)

Hope this helps,

-Dhanesh
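To make the loss-scaling point concrete, here is a minimal sketch in plain Python (the helper name equivalent_spark_regparam is mine, just for illustration), assuming sklearn's documented L2 objective sum(logloss) + (1/(2C))*||w||^2 and Spark's (1/n)*sum(logloss) + (regParam/2)*||w||^2 with the default elasticNetParam of 0:

# Multiplying Spark's objective by n shows the two coincide when
# 1/C = n * regParam, i.e. regParam = 1 / (C * n).

def equivalent_spark_regparam(C, n_instances):
    """regParam that makes Spark's average-loss objective match sklearn's C."""
    return 1.0 / (C * n_instances)

C = 1 / 0.3  # sklearn's C for l = 0.3 in your example
n = 10       # number of training instances
print(equivalent_spark_regparam(C, n))  # => 0.03 (up to floating point), i.e. 0.3 / 10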
Spark ml code (I changed it to work with Spark 2.1):
----------------------------------------------------------------

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.feature.StandardScaler

val sparkTrainingData_orig = new SQLContext(sc).
  createDataFrame(Seq(
    (0.0, Vectors.dense(Array(-0.7306653538519616, 0.0))),
    (1.0, Vectors.dense(Array(0.6750417712898752, -0.4232874171873786))),
    (1.0, Vectors.dense(Array(0.1863463229359709, -0.8163423997075965))),
    (0.0, Vectors.dense(Array(-0.6719842051493347, 0.0))),
    (1.0, Vectors.dense(Array(0.9699938346531928, 0.0))),
    (1.0, Vectors.dense(Array(0.22759406190283604, 0.0))),
    (1.0, Vectors.dense(Array(0.9688721028330911, 0.0))),
    (0.0, Vectors.dense(Array(0.5993795346650845, 0.0))),
    (0.0, Vectors.dense(Array(0.9219423508390701, -0.8972778242305388))),
    (0.0, Vectors.dense(Array(0.7006904841584055, -0.5607635619919824))))).
  toDF("label", "features_orig")

// Standardize (with mean centering) so both libraries see the same inputs.
val sparkTrainingData = new StandardScaler().
  setWithMean(true).
  setInputCol("features_orig").
  setOutputCol("features").
  fit(sparkTrainingData_orig).
  transform(sparkTrainingData_orig)

val logisticModel = new LogisticRegression().
  setRegParam(0.03). // 0.3 scaled down by the 10 training instances
  setLabelCol("label").
  setFeaturesCol("features").
  setTol(1e-12).
  setMaxIter(100).
  fit(sparkTrainingData)

println(s"Spark logistic model coefficients: ${logisticModel.coefficients} Intercept: ${logisticModel.intercept}")
// Spark logistic model coefficients: [0.8212244419577079,0.32615245441495727] Intercept: -0.011815325216668142

Sklearn code:
-----------------

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X = np.array([
    [-0.7306653538519616, 0.0],
    [0.6750417712898752, -0.4232874171873786],
    [0.1863463229359709, -0.8163423997075965],
    [-0.6719842051493347, 0.0],
    [0.9699938346531928, 0.0],
    [0.22759406190283604, 0.0],
    [0.9688721028330911, 0.0],
    [0.5993795346650845, 0.0],
    [0.9219423508390701, -0.8972778242305388],
    [0.7006904841584055, -0.5607635619919824]
])

y = np.array([0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0])

# Standardize as Spark does. The factor 3/sqrt(10) = sqrt(n-1)/sqrt(n)
# converts sklearn's sqrt(n)-denominator standard deviation to Spark's
# sqrt(n-1)-denominator one.
Xsc = StandardScaler().fit_transform(X) * 3 / np.sqrt(10)

l = 0.3
e = LogisticRegression(
    fit_intercept=True,
    penalty='l2',
    C=1 / l,
    max_iter=100,
    tol=1e-11,
    solver='lbfgs',
    verbose=1)

e.fit(Xsc, y)

print(e.coef_, e.intercept_)
# => [[ 0.82122437  0.32615256]] [-0.01181534]
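As a quick sanity check on the sqrt(n) vs sqrt(n-1) point, here is a small numpy sketch (variable names are mine) showing that multiplying sklearn's StandardScaler output by sqrt((n-1)/n), which is 3/sqrt(10) for n = 10, reproduces scaling by the ddof=1 sample standard deviation described above:

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(10, 2)  # any small matrix with n = 10 rows
n = X.shape[0]

sklearn_scaled = StandardScaler().fit_transform(X)            # std with ddof=0
sample_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # std with ddof=1

# sqrt((n-1)/n) = 3/sqrt(10) for n = 10; the two scalings then agree.
assert np.allclose(sklearn_scaled * np.sqrt((n - 1.0) / n), sample_scaled)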
Dhanesh
+91-9741125245

On Mon, Mar 13, 2017 at 7:50 AM, Frank Astier <fast...@linkedin.com.invalid> wrote:

> (this was also posted to stackoverflow on 03/10)
>
> I am setting up a very simple logistic regression problem in scikit-learn
> and in spark.ml, and the results diverge: the models they learn are
> different, but I can't figure out why (data is the same, model type is the
> same, regularization is the same...).
>
> No doubt I am missing some setting on one side or the other. Which
> setting? How should I set up either scikit or spark.ml to find the same
> model as its counterpart?
>
> I give the sklearn code and spark.ml code below. Both should be ready to
> cut-and-paste and run.
>
> scikit-learn code:
> ----------------------
>
> import numpy as np
> from sklearn.linear_model import LogisticRegression, Ridge
>
> X = np.array([
>     [-0.7306653538519616, 0.0],
>     [0.6750417712898752, -0.4232874171873786],
>     [0.1863463229359709, -0.8163423997075965],
>     [-0.6719842051493347, 0.0],
>     [0.9699938346531928, 0.0],
>     [0.22759406190283604, 0.0],
>     [0.9688721028330911, 0.0],
>     [0.5993795346650845, 0.0],
>     [0.9219423508390701, -0.8972778242305388],
>     [0.7006904841584055, -0.5607635619919824]
> ])
>
> y = np.array([0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
>
> m, n = X.shape
>
> # Add intercept term to simulate inputs to GameEstimator
> X_with_intercept = np.hstack((X, np.ones(m)[:, np.newaxis]))
>
> l = 0.3
> e = LogisticRegression(
>     fit_intercept=False,
>     penalty='l2',
>     C=1/l,
>     max_iter=100,
>     tol=1e-11)
>
> e.fit(X_with_intercept, y)
>
> print e.coef_
> # => [[ 0.98662189  0.45571052 -0.23467255]]
>
> # Linear regression is called Ridge in sklearn
> e = Ridge(
>     fit_intercept=False,
>     alpha=l,
>     max_iter=100,
>     tol=1e-11)
>
> e.fit(X_with_intercept, y)
>
> print e.coef_
> # => [ 0.32155545  0.17904355  0.41222418]
>
> spark.ml code:
> -------------------
>
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.ml.classification.LogisticRegression
> import org.apache.spark.ml.regression.LinearRegression
> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.mllib.regression.LabeledPoint
> import org.apache.spark.sql.SQLContext
>
> object TestSparkRegression {
>   def main(args: Array[String]): Unit = {
>     import org.apache.log4j.{Level, Logger}
>
>     Logger.getLogger("org").setLevel(Level.OFF)
>     Logger.getLogger("akka").setLevel(Level.OFF)
>
>     val conf = new SparkConf().setAppName("test").setMaster("local")
>     val sc = new SparkContext(conf)
>
>     val sparkTrainingData = new SQLContext(sc)
>       .createDataFrame(Seq(
>         LabeledPoint(0.0, Vectors.dense(-0.7306653538519616, 0.0)),
>         LabeledPoint(1.0, Vectors.dense(0.6750417712898752, -0.4232874171873786)),
>         LabeledPoint(1.0, Vectors.dense(0.1863463229359709, -0.8163423997075965)),
>         LabeledPoint(0.0, Vectors.dense(-0.6719842051493347, 0.0)),
>         LabeledPoint(1.0, Vectors.dense(0.9699938346531928, 0.0)),
>         LabeledPoint(1.0, Vectors.dense(0.22759406190283604, 0.0)),
>         LabeledPoint(1.0, Vectors.dense(0.9688721028330911, 0.0)),
>         LabeledPoint(0.0, Vectors.dense(0.5993795346650845, 0.0)),
>         LabeledPoint(0.0, Vectors.dense(0.9219423508390701, -0.8972778242305388)),
>         LabeledPoint(0.0, Vectors.dense(0.7006904841584055, -0.5607635619919824))))
>       .toDF("label", "features")
>
>     val logisticModel = new LogisticRegression()
>       .setRegParam(0.3)
>       .setLabelCol("label")
>       .setFeaturesCol("features")
>       .fit(sparkTrainingData)
>
>     println(s"Spark logistic model coefficients: ${logisticModel.coefficients} Intercept: ${logisticModel.intercept}")
>     // Spark logistic model coefficients: [0.5451588538376263,0.26740606573584713] Intercept: -0.13897955358689987
>
>     val linearModel = new LinearRegression()
>       .setRegParam(0.3)
>       .setLabelCol("label")
>       .setFeaturesCol("features")
>       .setSolver("l-bfgs")
>       .fit(sparkTrainingData)
>
>     println(s"Spark linear model coefficients: ${linearModel.coefficients} Intercept: ${linearModel.intercept}")
>     // Spark linear model coefficients: [0.19852664861346023,0.11501200541407802] Intercept: 0.45464906876832323
>
>     sc.stop()
>   }
> }
>
> Thanks,
>
> Frank