[ https://issues.apache.org/jira/browse/SPARK-25544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782841#comment-16782841 ]
Sean Owen commented on SPARK-25544:
-----------------------------------

I think this is a reasonable change -- you can test it in a PR if you're up for it.

> Slow/failed convergence in Spark ML models due to internal predictor scaling
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-25544
>                 URL: https://issues.apache.org/jira/browse/SPARK-25544
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.2
>         Environment: Databricks runtime 4.2: Spark 2.3.1, Scala 2.11
>            Reporter: Andrew Crosby
>            Priority: Major
>
> The LinearRegression and LogisticRegression estimators in Spark ML can take a
> large number of iterations to converge, or fail to converge altogether, when
> trained using the l-bfgs method with standardization turned off.
>
> *Details:*
> LinearRegression and LogisticRegression standardize their input features by
> default. In SPARK-8522 the option to disable standardization was added. This
> is implemented internally by changing the effective strength of
> regularization rather than disabling the feature scaling. Mathematically,
> changing the effective regularization strength and disabling feature scaling
> should give the same solution, but they can have very different convergence
> properties.
> The usual justification for scaling features is that it ensures all
> covariates are O(1), which should improve numerical convergence, but this
> argument does not account for the regularization term. This causes no issues
> when standardization is set to true, since all features then have an O(1)
> regularization strength. It does cause issues when standardization is set to
> false, because the effective regularization strength of feature i becomes
> O(1/sigma_i^2), where sigma_i is the standard deviation of the feature. As a
> result, predictors with small standard deviations (which can occur
> legitimately, e.g. via one-hot encoding) receive very large effective
> regularization strengths and consequently produce very large gradients, and
> thus poor convergence in the solver.
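>
> To make the scale mismatch concrete, here is a minimal standalone sketch
> (not Spark internals; the feature names and standard deviations are invented
> for illustration) of the effective per-feature penalty lambda / sigma_i^2
> that results from folding standardization into the regularization term:
> {code:java}
> object EffectiveRegStrength extends App {
>   val regParam = 1.0
>
>   // Illustrative per-feature standard deviations: a reasonably scaled
>   // feature versus the tiny interaction feature produced by
>   // category "2" x numericFeature in the example below.
>   val featureStd = Seq(
>     ("categoryEncoded_1", 0.5),
>     ("interaction_2", 0.005)
>   )
>
>   // Effective strength ~ regParam / sigma^2, so the sigma = 0.005 feature
>   // is penalized (0.5 / 0.005)^2 = 10,000 times more strongly.
>   featureStd.foreach { case (name, sigma) =>
>     val effective = regParam / (sigma * sigma)
>     println(f"$name%-20s sigma=$sigma%8.4f  effective lambda=$effective%12.1f")
>   }
> }
> {code}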
> *Example code to recreate:*
> To demonstrate just how bad these convergence issues can be, here is a very
> simple test case which builds a linear regression model with a categorical
> feature, a numerical feature and their interaction. When fed the training
> data below, this model fails to converge before it hits the maximum
> iteration limit. In this case, it is the interaction between category "2"
> and the numeric feature that produces a feature with a small standard
> deviation.
> Training data:
> ||category||numericFeature||label||
> |1|1.0|0.5|
> |1|0.5|1.0|
> |2|0.01|2.0|
> {code:java}
> val df = Seq(
>   ("1", 1.0, 0.5),
>   ("1", 0.5, 1.0),
>   ("2", 1e-2, 2.0)
> ).toDF("category", "numericFeature", "label")
>
> val indexer = new StringIndexer()
>   .setInputCol("category")
>   .setOutputCol("categoryIndex")
> val encoder = new OneHotEncoder()
>   .setInputCol("categoryIndex")
>   .setOutputCol("categoryEncoded")
>   .setDropLast(false)
> val interaction = new Interaction()
>   .setInputCols(Array("categoryEncoded", "numericFeature"))
>   .setOutputCol("interaction")
> val assembler = new VectorAssembler()
>   .setInputCols(Array("categoryEncoded", "interaction"))
>   .setOutputCol("features")
> val model = new LinearRegression()
>   .setFeaturesCol("features")
>   .setLabelCol("label")
>   .setPredictionCol("prediction")
>   .setStandardization(false)
>   .setSolver("l-bfgs")
>   .setRegParam(1.0)
>   .setMaxIter(100)
> val pipeline = new Pipeline()
>   .setStages(Array(indexer, encoder, interaction, assembler, model))
>
> val pipelineModel = pipeline.fit(df)
> val numIterations = pipelineModel.stages(4)
>   .asInstanceOf[LinearRegressionModel].summary.totalIterations
> {code}
> *Possible fix:*
> These convergence issues can be fixed by turning off feature scaling when
> standardization is set to false, rather than folding it into an effective
> regularization strength. This can be hacked into LinearRegression.scala by
> simply replacing line 423
> {code:java}
> val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
> {code}
> with
> {code:java}
> val featuresStd = if ($(standardization)) {
>   featuresSummarizer.variance.toArray.map(math.sqrt)
> } else {
>   featuresSummarizer.variance.toArray.map(_ => 1.0)
> }
> {code}
> Rerunning the above test code with that hack in place leads to convergence
> after just 4 iterations instead of hitting the max iterations limit!
> *Impact:*
> I can't speak for other people, but I've personally encountered these
> convergence issues several times when building production-scale Spark ML
> models, and have resorted to writing my own implementation of
> LinearRegression with the above hack in place. The issue is made worse by
> the fact that Spark does not raise an error when the maximum number of
> iterations is hit, so the first time you encounter the issue it can take a
> while to figure out what is going on.
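> Until that is addressed, a minimal post-fit check (a sketch; it assumes the
> pipeline layout from the example above, where the model sits at stage 4) can
> at least surface silent non-convergence:
> {code:java}
> // Compare the iterations actually used against the configured maximum.
> // Hitting maxIter is a strong hint that l-bfgs did not converge.
> val fitted = pipelineModel.stages(4).asInstanceOf[LinearRegressionModel]
> if (fitted.summary.totalIterations >= fitted.getMaxIter) {
>   println(s"WARNING: l-bfgs hit maxIter=${fitted.getMaxIter}; " +
>     "the solution may not have converged.")
> }
> {code}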