Andrew Crosby created SPARK-22555:
-------------------------------------

             Summary: Possibly incorrect scaling of L2 regularization strength in LinearRegression
                 Key: SPARK-22555
                 URL: https://issues.apache.org/jira/browse/SPARK-22555
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.2.0
            Reporter: Andrew Crosby
            Priority: Minor


According to the Spark documentation, the linear regression estimator minimizes 
the regularized sum of squares:

1/N Sum(y - w x)^2^ + λ( (1-α) |w|~2~ + α |w|~1~ )

Under the hood, in order to improve convergence, the optimization algorithms 
actually work in scaled space using the variables y' = y / σ ~y~, 
x' = x / σ ~x~ and w' = w σ ~x~ / σ ~y~ (so that w' x' = w x / σ ~y~). 
In terms of these scaled variables, the above expression becomes:

σ ~y~^2^ ( 1/N Sum(y' - w' x')^2^ + λ( (1-α) / σ ~x~^2^ |w'|~2~ + α / (σ ~x~ σ ~y~) |w'|~1~ ) )
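
For reference, the substitution can be written out term by term. This is only a sketch for a single feature, assuming w = (σ ~y~ / σ ~x~) w' and reading |w|~2~ as the squared L2 norm, as in the expressions above:

```latex
\begin{align*}
y - w x
  &= \sigma_y y' - \frac{\sigma_y}{\sigma_x} w' \, \sigma_x x'
   = \sigma_y \, (y' - w' x') \\
\frac{1}{N} \sum (y - w x)^2
  &= \sigma_y^2 \, \frac{1}{N} \sum (y' - w' x')^2 \\
\lambda (1 - \alpha) \, |w|_2
  &= \lambda (1 - \alpha) \, \frac{\sigma_y^2}{\sigma_x^2} \, |w'|_2
   = \sigma_y^2 \, \frac{\lambda (1 - \alpha)}{\sigma_x^2} \, |w'|_2 \\
\lambda \alpha \, |w|_1
  &= \lambda \alpha \, \frac{\sigma_y}{\sigma_x} \, |w'|_1
   = \sigma_y^2 \, \frac{\lambda \alpha}{\sigma_x \sigma_y} \, |w'|_1
\end{align*}
```

The L2 term picks up two factors of σ ~y~ from w, which cancel against the overall σ ~y~^2^ prefactor, whereas the L1 term picks up only one.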

The solution in scaled space is equivalent to the original problem, provided 
that the regularization strengths are suitably adjusted. The effective L1 
regularization strength should be λ α / (σ ~x~ σ ~y~) and the effective L2 
regularization strength should be λ (1-α) / σ ~x~^2^.
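
The equivalence can also be checked numerically. The following is a minimal, self-contained sketch in plain Scala (no Spark dependency) for a single feature; the object and method names are made up for illustration, and the population standard deviation is an assumption (the identity holds for any positive scale factors):

```scala
object ScaledObjectiveCheck {
  // Population standard deviation (illustrative choice; any positive scale works).
  def std(v: Seq[Double]): Double = {
    val mean = v.sum / v.size
    math.sqrt(v.map(a => (a - mean) * (a - mean)).sum / v.size)
  }

  // Objective in the original space, as written in this issue:
  // 1/N Sum(y - w x)^2 + lambda * ((1 - alpha) * w^2 + alpha * |w|)
  def original(x: Seq[Double], y: Seq[Double], w: Double,
               lambda: Double, alpha: Double): Double = {
    val loss = x.zip(y).map { case (xi, yi) => math.pow(yi - w * xi, 2) }.sum / x.size
    loss + lambda * ((1 - alpha) * w * w + alpha * math.abs(w))
  }

  // The same objective evaluated through the scaled variables, using the
  // effective L1 and L2 strengths derived above.
  def scaled(x: Seq[Double], y: Seq[Double], w: Double,
             lambda: Double, alpha: Double): Double = {
    val sx = std(x)
    val sy = std(y)
    val xs = x.map(_ / sx)
    val ys = y.map(_ / sy)
    val ws = w * sx / sy // scaled coefficient w'
    val loss = xs.zip(ys).map { case (xi, yi) => math.pow(yi - ws * xi, 2) }.sum / x.size
    val l2 = lambda * (1 - alpha) / (sx * sx)
    val l1 = lambda * alpha / (sx * sy)
    sy * sy * (loss + l2 * ws * ws + l1 * math.abs(ws))
  }
}
```

For any data and any w, `original` and `scaled` should agree up to floating-point error, which is what makes the adjusted strengths the "right" ones for the documented objective.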

However, this doesn't quite match the regularization strengths that are 
actually used. While the factors of σ ~x~ are correctly included (or correctly 
omitted if the standardization parameter is set), it appears that the 1 / σ ~y~ 
scaling is applied to both the L1 and L2 regularization parameters instead 
of just to the L1 regularization parameter. Both LinearRegression.scala and 
WeightedLeastSquares.scala contain code along the following lines:

{code}
val effectiveRegParam = $(regParam) / yStd
val effectiveL1RegParam = $(elasticNetParam) * effectiveRegParam
val effectiveL2RegParam = (1.0 - $(elasticNetParam)) * effectiveRegParam
{code}
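
To make the difference concrete, here is a small standalone sketch that puts the two scalings side by side. It is not Spark code: the object, case class and method names are hypothetical, and the σ ~x~ factors are left out on the assumption that they are handled separately, as described above.

```scala
object EffectiveRegParams {
  final case class RegStrengths(l1: Double, l2: Double)

  // Mirrors the snippet quoted above: 1 / yStd is applied to both terms.
  def asImplemented(regParam: Double, elasticNetParam: Double, yStd: Double): RegStrengths = {
    val effectiveRegParam = regParam / yStd
    RegStrengths(
      l1 = elasticNetParam * effectiveRegParam,
      l2 = (1.0 - elasticNetParam) * effectiveRegParam)
  }

  // What the derivation in this issue suggests: 1 / yStd on the L1 term only.
  def asDerived(regParam: Double, elasticNetParam: Double, yStd: Double): RegStrengths =
    RegStrengths(
      l1 = elasticNetParam * regParam / yStd,
      l2 = (1.0 - elasticNetParam) * regParam)
}
```

With these definitions the L1 strengths are identical, while the L2 strengths differ by exactly a factor of yStd, so the two only coincide when the label has unit standard deviation.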

Admittedly, the unit tests confirm that the current behaviour matches that of 
R's glmnet; it just doesn't seem to match the behaviour claimed in the 
documentation.



