[ https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290158#comment-17290158 ]

Sean R. Owen commented on SPARK-34448:
--------------------------------------

I crudely ported the test setup to a Scala test and tried a 0 initial 
intercept in the LR implementation. It still gets the -3.5 intercept in the 
case where the 'const_feature' column is added, but -4.0 without it, so I'm 
not sure the initialization is the cause.
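
For context, the default initialization I replaced there is the label log-odds 
heuristic. A minimal sketch of it (the counts below are made up for 
illustration, not taken from the test data):

{code}
// Sketch of the default intercept initialization: coefficients start at zero
// and the intercept starts at the log-odds of the label counts.
// These counts are illustrative only.
val numNegative = 980000.0 // count of label == 0
val numPositive = 20000.0  // count of label == 1
// With all coefficients zero, the loss is minimized by sigmoid(b0) = p, i.e.
val initialIntercept = math.log(numPositive / numNegative) // ~= -3.89
{code}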

Let me ping [~podongfeng] or maybe even [~sethah], who have worked on that code 
a bit and might have more of an idea about why the intercept wouldn't quite fit 
right in this case. I'm wondering if there is some issue in 
LogisticAggregator's treatment of the intercept? No idea; this is outside my 
expertise.

https://github.com/apache/spark/blob/3ce4ab545bfc28db7df2c559726b887b0c8c33b7/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala#L244
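
For reference, my rough reading of the binary update there, as a paraphrased 
sketch (not the actual code): during aggregation each feature is scaled by 
1/std but never mean-centered, and the intercept is just the last slot in the 
same coefficient and gradient arrays.

{code}
// Paraphrased sketch of LogisticAggregator's binary update for one instance.
// Note: features are divided by their std but the mean is never subtracted,
// while the intercept rides along in the same margin and gradient.
def binaryUpdateSketch(
    coefficients: Array[Double], // length numFeatures + 1; last slot = intercept
    gradient: Array[Double],     // same layout as coefficients
    featuresStd: Array[Double],
    features: Array[Double],
    label: Double,
    weight: Double): Unit = {
  val numFeatures = featuresStd.length
  var sum = 0.0
  var i = 0
  while (i < numFeatures) {
    if (featuresStd(i) != 0.0) sum += coefficients(i) * features(i) / featuresStd(i)
    i += 1
  }
  val margin = -(sum + coefficients(numFeatures)) // intercept in the last slot
  // multiplier = weight * (sigmoid(b.x + b0) - y), the logistic-loss gradient
  val multiplier = weight * (1.0 / (1.0 + math.exp(margin)) - label)
  i = 0
  while (i < numFeatures) {
    if (featuresStd(i) != 0.0) gradient(i) += multiplier * features(i) / featuresStd(i)
    i += 1
  }
  gradient(numFeatures) += multiplier // intercept gradient
}
{code}

If the intercept interacts badly with scaled-but-uncentered features, this is 
roughly where it would show up.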

BTW, here's my hacked-up test:

{code}
  // Assumes the usual MLlib test scaffolding: a SparkSession `spark` in scope,
  // plus imports for scala.util.Random,
  // org.apache.spark.ml.feature.VectorAssembler and
  // org.apache.spark.ml.classification.LogisticRegression.
  test("BLR") {
    val centered = false
    val regParam = 1.0e-8
    val num_distribution_samplings = 1000
    val num_rows_per_sampling = 1000
    val theta_1 = 0.3f
    val theta_2 = 0.2f
    val intercept = -4.0f

    val (feature1, feature2, target) = generate_blr_data(theta_1, theta_2, intercept,
      centered, num_distribution_samplings, num_rows_per_sampling)

    val num_rows = num_distribution_samplings * num_rows_per_sampling

    // A nearly constant extra column: 1.0 everywhere, 0.9 in the first tenth of rows.
    val const_feature = Array.fill(num_rows)(1.0f)
    (0 until num_rows / 10).foreach { i => const_feature(i) = 0.9f }

    val data = (0 until num_rows).map { i =>
      (feature1(i), feature2(i), const_feature(i), target(i))
    }

    val spark_df = spark.createDataFrame(data)
      .toDF("feature1", "feature2", "const_feature", "label").cache()

    val vec = new VectorAssembler()
      .setInputCols(Array("feature1", "feature2"))
      .setOutputCol("features")
    val spark_df1 = vec.transform(spark_df).cache()

    val lr = new LogisticRegression()
      .setMaxIter(100)
      .setRegParam(regParam)
      .setElasticNetParam(0.5)
      .setFitIntercept(true)
    val lrModel = lr.fit(spark_df1)
    println("Just the blr data")
    println("Coefficients: " + lrModel.coefficients)
    println("Intercept: " + lrModel.intercept)

    val vec2 = new VectorAssembler()
      .setInputCols(Array("feature1", "feature2", "const_feature"))
      .setOutputCol("features")
    val spark_df2 = vec2.transform(spark_df).cache()

    val lrModel2 = lr.fit(spark_df2)
    println("blr data plus one vector that is filled with 1's and .9's")
    println("Coefficients: " + lrModel2.coefficients)
    println("Intercept: " + lrModel2.intercept)
  }

  def generate_blr_data(theta_1: Float,
                        theta_2: Float,
                        intercept: Float,
                        centered: Boolean,
                        num_distribution_samplings: Int,
                        num_rows_per_sampling: Int): (Array[Float], Array[Float], Array[Int]) = {
    val random = new Random(12345L)
    val uniforms = Array.fill(num_distribution_samplings)(random.nextFloat())
    val uniforms2 = Array.fill(num_distribution_samplings)(random.nextFloat())

    if (centered) {
      // Zero-mean ranges: [-0.5, 0.5) and [-1, 1).
      uniforms.transform(f => f - 0.5f)
      uniforms2.transform(f => 2.0f * f - 1.0f)
    } else {
      // Shift the second feature to [1, 2), well away from zero mean.
      uniforms2.transform(f => f + 1.0f)
    }

    // Per-sampling positive-class probability from the true model.
    val h_theta = uniforms.zip(uniforms2).map { case (a, b) =>
      intercept + theta_1 * a + theta_2 * b
    }
    val prob = h_theta.map(t => 1.0 / (1.0 + math.exp(-t)))

    // Label exactly round(n * p) rows of each sampling as positive.
    val array = Array.ofDim[Int](num_distribution_samplings, num_rows_per_sampling)
    array.indices.foreach { i =>
      (0 until math.round(num_rows_per_sampling * prob(i)).toInt).foreach { j =>
        array(i)(j) = 1
      }
    }

    // Repeat each sampled feature value for all of its rows.
    val feature_1 = uniforms.flatMap(f => Array.fill(num_rows_per_sampling)(f))
    val feature_2 = uniforms2.flatMap(f => Array.fill(num_rows_per_sampling)(f))
    val target = array.flatten

    (feature_1, feature_2, target)
  }
{code}
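
For the record, here is how I read the log-odds claim in the description 
below, as a first-order sketch rather than a proof:

{code}
Logistic loss with y_i in {0,1}, z_i = \beta^\top x_i + \beta_0:

    L(\beta, \beta_0) = -\sum_i [ y_i \log\sigma(z_i) + (1 - y_i)\log(1 - \sigma(z_i)) ]

The intercept's stationarity condition at the optimum is

    \partial L / \partial \beta_0 = \sum_i ( \sigma(z_i) - y_i ) = 0 .

Expanding \sigma(z_i) to first order in \beta around zero,

    \sigma(z_i) \approx \sigma(\beta_0) + \sigma'(\beta_0) \beta^\top x_i ,

the condition becomes

    n \sigma(\beta_0) - \sum_i y_i + \sigma'(\beta_0) \beta^\top \sum_i x_i \approx 0 .

When the features are mean-zero (\sum_i x_i = 0) this reduces to
\sigma(\beta_0) = \bar{y}, i.e. \beta_0 = \log( \bar{y} / (1 - \bar{y}) ),
the label log-odds. Otherwise the optimal intercept is approximately the
log-odds minus \beta^\top \bar{x}, so anything that anchors the intercept at
the log-odds is only correct for centered features.
{code}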

> Binary logistic regression incorrectly computes the intercept and 
> coefficients when data is not centered
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-34448
>                 URL: https://issues.apache.org/jira/browse/SPARK-34448
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.4.5, 3.0.0
>            Reporter: Yakov Kerzhner
>            Priority: Major
>              Labels: correctness
>
> I have written up a fairly detailed gist that includes code to reproduce the 
> bug, as well as the output of the code and some commentary:
> [https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96]
> To summarize: under certain conditions, the minimization that fits a binary 
> logistic regression contains a bug that pulls the intercept value towards the 
> log(odds) of the target data.  This is mathematically only correct when the 
> data comes from distributions with zero means.  In general, this gives 
> incorrect intercept values, and consequently incorrect coefficients as well.
> As I am not so familiar with the Spark code base, I have not been able to 
> find this bug within the Spark code itself.  A hint to this bug is here: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L894-L904]
> Based on the code, I don't believe that the features have zero means at this 
> point, and so this heuristic is incorrect.  But an incorrect starting point 
> does not explain this bug.  The minimizer should drift to the correct place.  
> I was not able to find the code of the actual objective function that is 
> being minimized.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
