[ https://issues.apache.org/jira/browse/SPARK-34765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhengruifeng updated SPARK-34765:
---------------------------------
    Issue Type: Umbrella  (was: Improvement)

> Linear Models standardization optimization
> ------------------------------------------
>
>                 Key: SPARK-34765
>                 URL: https://issues.apache.org/jira/browse/SPARK-34765
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML
>    Affects Versions: 3.2.0, 3.1.1
>            Reporter: zhengruifeng
>            Priority: Major
>
> The existing implementation of standardization in linear models does NOT
> center the vectors by removing the means, in order to preserve the sparsity
> of the dataset.
> However, this causes feature values with small variance to be scaled to
> large values, and underlying solvers such as LBFGS cannot handle this case
> efficiently. See SPARK-34448 for details.
> If the internal vectors are centered (as in other well-known
> implementations, e.g. GLMNET/Scikit-Learn), the convergence rate is better.
> In the case in SPARK-34448, the number of iterations to convergence is
> reduced from 93 to 6. Moreover, the final solution is much closer to the one
> from GLMNET.
> Fortunately, we have found a new way to 'virtually' center the vectors
> without densifying the dataset, iff:
> 1, fitIntercept is true;
> 2, there is no penalty on the intercept (this seems to always be true in the
> existing implementations);
> 3, there are no bounds on the intercept.
>
> We will also need to check whether this new method works as expected in all
> other linear models (e.g. mlor/svc/lir/aft, etc.), and introduce it into
> those models if possible.
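
The issue does not spell out how the 'virtual' centering works. One way to get the same effect, sketched below as an assumption and not as the actual Spark implementation, is to fold the mean term into the intercept: since w . ((x - mean) / std) + b = (w / std) . x + (b - sum(w * mean / std)), the margin can be computed from the nonzeros of x alone. This would also explain the three conditions listed in the issue, because the intercept has to be free to absorb the offset. The dict-based sparse vector and the function names are hypothetical, and every std is assumed to be nonzero.

```python
from typing import Dict, List, Tuple

# Hypothetical sparse representation: feature index -> nonzero value.
SparseVector = Dict[int, float]


def dense_centered_margin(x: SparseVector, coef: List[float], intercept: float,
                          mean: List[float], std: List[float]) -> float:
    """Reference version: explicitly center and scale every feature.

    This visits all features, including the zeros, i.e. it densifies x.
    """
    total = intercept
    for j in range(len(coef)):
        total += coef[j] * (x.get(j, 0.0) - mean[j]) / std[j]
    return total


def fold_centering(coef: List[float], intercept: float,
                   mean: List[float], std: List[float]) -> Tuple[List[float], float]:
    """Precompute scaled coefficients and an intercept that absorbs the means."""
    scaled = [c / s for c, s in zip(coef, std)]
    offset = sum(sc * m for sc, m in zip(scaled, mean))
    return scaled, intercept - offset


def sparse_margin(x: SparseVector, scaled: List[float], adjusted_intercept: float) -> float:
    """Same margin as the dense version, touching only the nonzeros of x."""
    return sum(scaled[j] * v for j, v in x.items()) + adjusted_intercept


# Small usage example with made-up numbers.
x = {0: 2.0, 3: -1.0}
coef = [0.5, -1.0, 2.0, 0.25]
mean = [1.0, 0.5, -2.0, 0.0]
std = [2.0, 1.0, 0.5, 4.0]
scaled, adjusted = fold_centering(coef, 0.1, mean, std)
print(dense_centered_margin(x, coef, 0.1, mean, std))
print(sparse_margin(x, scaled, adjusted))
```

In this sketch the folding step costs one pass over the coefficients per solver iteration, while the per-row cost stays proportional to the number of nonzeros.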