Weichen Xu created SPARK-16638:
----------------------------------

             Summary: The L2 regularization of LinearRegression seems wrong when standardization is false
                 Key: SPARK-16638
                 URL: https://issues.apache.org/jira/browse/SPARK-16638
             Project: Spark
          Issue Type: Bug
          Components: ML, Optimizer
            Reporter: Weichen Xu
The standard L2 penalty is 0.5 * effectiveL2regParam * sum_i( w_i^2 ), where w_i are the coefficients being trained. In the LinearRegression code, when standardization == false, the penalty is modified to:

0.5 * effectiveL2regParam * sum_i( ( w_i / featuresStd(i) )^2 )

I think this is wrong. Given the intent the author describes in the code comment, the modified penalty should be:

0.5 * effectiveL2regParam * sum_i( ( w_i * featuresStd(i) )^2 )

That is, w_i should be multiplied by featuresStd(i), not divided by it.

A simple way to see this: suppose the training data has a dimension k with a very large standard deviation, i.e. featuresStd(k) is very large. For numerical stability we want the trained coefficient w(k) to be small, so the L2 penalty on that dimension should be amplified. The penalized term should therefore be w(k) * featuresStd(k), not w(k) / featuresStd(k).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
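To make the argument above concrete, here is a small standalone sketch (not Spark code; function names and the toy numbers are my own) comparing the two penalty variants on a feature dimension with a very large std. Dividing by the std nearly removes the penalty on that dimension, while multiplying amplifies it, which is the behavior the report argues for:

```python
import numpy as np

def l2_penalty_divide(w, std, lam):
    # Penalty as the report says the code currently computes it:
    # 0.5 * lam * sum_i( (w_i / std_i)^2 )
    return 0.5 * lam * np.sum((w / std) ** 2)

def l2_penalty_multiply(w, std, lam):
    # Penalty the report argues was intended:
    # 0.5 * lam * sum_i( (w_i * std_i)^2 )
    return 0.5 * lam * np.sum((w * std) ** 2)

w = np.array([0.5, 0.5])      # identical coefficients in both dimensions
std = np.array([1.0, 100.0])  # dimension 1 has a very large std
lam = 1.0

# Dividing by std makes the high-std dimension almost free of penalty:
print(l2_penalty_divide(w, std, lam))    # 0.5 * (0.25 + 0.000025) = 0.1250125
# Multiplying by std penalizes the high-std dimension heavily,
# pushing its coefficient toward a small value:
print(l2_penalty_multiply(w, std, lam))  # 0.5 * (0.25 + 2500) = 1250.125
```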