[GitHub] spark pull request: [SPARK-12230][ML] WeightedLeastSquares.fit() s...

iyounus Mon, 18 Jan 2016 16:33:02 -0800

Github user iyounus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10274#discussion_r50060828
  
    --- Diff: 
mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala 
---
    @@ -74,6 +89,35 @@ class WeightedLeastSquaresSuite extends SparkFunSuite 
with MLlibTestSparkContext
         }
       }
     
    +  test("WLS against lm when label is constant") {
    +    /*
    +       R code:
    +       # here b is constant
    +       df <- as.data.frame(cbind(A, b))
    +       for (formula in c(b ~ . -1, b ~ .)) {
    +         model <- lm(formula, data=df, weights=w)
    +         print(as.vector(coef(model)))
    +       }
    +
    +      [1] -9.221298  3.394343
    +      [1] 17  0  0
    +    */
    +
    +    val expected = Seq(
    +      Vectors.dense(0.0, -9.221298, 3.394343),
    +      Vectors.dense(17.0, 0.0, 0.0))
    +
    +    var idx = 0
    +    for (fitIntercept <- Seq(false, true)) {
    +      val wls = new WeightedLeastSquares(
    +        fitIntercept, regParam = 0.0, standardizeFeatures = false, 
standardizeLabel = true)
    +        .fit(instancesConstLabel)
    --- End diff --
    
    @dbtsai I've implemented normal equation with regularization in R. Here is 
my code:
    ```
    ridge_regression <- function(A, b, lambda, intercept=TRUE){
        if (intercept) {
            A = cbind(rep(1.0, length(b)), A)
            I = diag(ncol(A))
            I[1,1] = 0.0
        } else {
            I = diag(ncol(A))
        }
        R = chol( t(A) %*% A + lambda*I )
        z = solve(t(R), t(A) %*% b)
        w = solve(R, z)
        return(w)
    }
    ```
    And here are the results I get using this function.
    ```
    A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
    b <- c(17, 19, 23, 29)
    ridge_regression(A, b, 0.1)
               [,1]
    [1,] 12.9048179
    [2,]  2.1151586
    [3,]  0.6580494
    ```
    The problem is that these don't quite match with `glmnet`. The difference 
can be at few percent level:
    ```
    > model <- glmnet(A, b, intercept=TRUE, lambda=0.1, standardize=FALSE,
    + alpha=0, thresh=1E-20)
    > print(as.vector(coef(model)))
    [1] 13.1018870  2.2362361  0.6159732
    ```
    But, my results match exactly with what I get from ridge regression in 
sklearn:
    
    ```
    from sklearn.linear_model import Ridge
    import numpy as np
    
    A = np.array([[0, 1, 2, 3],[5, 7, 11, 13]]).T
    b = np.array([17.0, 19.0, 23.0, 29.0])
    model = Ridge(alpha=0.1, solver='cholesky', fit_intercept=True)
    model.fit(A, b)
    print model.intercept_
    print model.coef_
    
    12.9048178613
    [ 2.11515864  0.65804935]
    ```
    Even if I use other solvers (`svd`, `lsqr`, `sparse_cg`) in 
`sklearn.linear_model.Ridge`, I get exactly the same results.
    
    If I don't use regularization by setting `lambda=0`, then the results from 
`glmnet` are identical to what I get from normal equation and 
`sklearn.linear_model.Ridge`.
    
    Have you observed such differences before? Is `glmnet` making some other 
corrections or its just numerical precision issue?  I can't seem to reproduce 
`glmnet` results.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-12230][ML] WeightedLeastSquares.fit() s...

Reply via email to