Andrew Crosby created SPARK-25544:
-------------------------------------

             Summary: Slow/failed convergence in Spark ML models due to internal predictor scaling
                 Key: SPARK-25544
                 URL: https://issues.apache.org/jira/browse/SPARK-25544
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.3.2
         Environment: Databricks runtime 4.2: Spark 2.3.1, Scala 2.11
            Reporter: Andrew Crosby
The LinearRegression and LogisticRegression estimators in Spark ML can take a large number of iterations to converge, or fail to converge altogether, when trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by default. In SPARK-8522, the option to disable standardization was added. Internally, this is implemented by changing the effective strength of the regularization rather than by disabling the feature scaling. Mathematically, changing the effective regularization strength and disabling feature scaling should give the same solution, but the two can have very different convergence properties.

The usual justification for scaling features is that it makes all features O(1), which should improve numerical convergence, but this argument does not account for the regularization term. This causes no issues when standardization is set to true, since every feature then has an O(1) regularization strength. It does cause issues when standardization is set to false, since the effective regularization strength of feature i becomes O(1/sigma_i^2), where sigma_i is the standard deviation of feature i. Predictors with small standard deviations therefore receive very large effective regularization strengths, which produce very large gradients and consequently poor convergence in the solver.

*Example code to recreate:*

To demonstrate just how bad these convergence issues can be, here is a very simple test case which builds a linear regression model with a categorical feature, a numerical feature and their interaction. When fed the specified training data, this model fails to converge before it hits the maximum iteration limit.
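To make the scaling argument concrete, here is a minimal sketch (not from Spark itself; the function name and values are illustrative only) of the effective penalty a feature receives when standardization is off but the solver still works in the internally scaled space:

{code:java}
// Sketch: effective L2 penalty on feature i when standardization is disabled
// but regularization is rescaled internally.
// lambda = user-supplied regParam; sigma = standard deviation of feature i.
def effectiveRegStrength(lambda: Double, sigma: Double): Double =
  lambda / (sigma * sigma)

// A feature with standard deviation 0.01 is penalized on the order of
// 10,000 times more strongly than the user-specified regParam intended.
val boosted = effectiveRegStrength(1.0, 1e-2)
{code}

This is why a small-variance feature, such as the interaction term in the example below, can dominate the gradient and stall l-bfgs.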
Training data:
||category||numericFeature||label||
|1|1.0|0.5|
|1|0.5|1.0|
|2|0.01|2.0|

{code:java}
val df = Seq(
  ("1", 1.0, 0.5),
  ("1", 0.5, 1.0),
  ("2", 1e-2, 2.0)
).toDF("category", "numericFeature", "label")

val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryEncoded").setDropLast(false)
val interaction = new Interaction().setInputCols(Array("categoryEncoded", "numericFeature")).setOutputCol("interaction")
val assembler = new VectorAssembler().setInputCols(Array("categoryEncoded", "interaction")).setOutputCol("features")
val model = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setStandardization(false)
  .setSolver("l-bfgs")
  .setRegParam(1.0)
  .setMaxIter(100)
val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction, assembler, model))

val pipelineModel = pipeline.fit(df)
val numIterations = pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations
{code}

*Possible fix:*

These convergence issues can be fixed by turning off feature scaling when standardization is set to false, rather than using an effective regularization strength. This can be hacked into LinearRegression.scala by simply replacing line 423

{code:java}
val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
{code}

with

{code:java}
val featuresStd = if ($(standardization)) {
  featuresSummarizer.variance.toArray.map(math.sqrt)
} else {
  featuresSummarizer.variance.toArray.map(_ => 1.0)
}
{code}

Rerunning the above test code with that hack in place leads to convergence after just 4 iterations instead of hitting the maximum iteration limit!
*Impact:*

I can't speak for other people, but I've personally encountered these convergence issues several times when building production-scale Spark ML models, and have resorted to writing my own implementation of LinearRegression with the above hack in place. The issue is made worse by the fact that Spark does not raise an error when the maximum number of iterations is hit, so the first time you encounter the issue it can take a while to figure out what is going on.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)