[ https://issues.apache.org/jira/browse/SPARK-30641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhengruifeng updated SPARK-30641:
---------------------------------
Description:
We have been refactoring the linear models for a long time, and some work still remains. After some offline discussion, we grouped the related work under a sub-project named Matrix, which includes:
# *Blockification (vectorization of vectors)*
** Vectors are stacked into matrices so that high-level BLAS routines can be used for better performance (about ~3x faster on sparse datasets, up to ~18x faster on dense datasets; see SPARK-31783 for details).
** Since 3.1.1, LoR/SVC/LiR/AFT support blockification; KMeans still needs to be blockified.
# *Standardization (virtual centering)*
** The existing standardization in the linear models does NOT center the vectors by subtracting the means, in order to keep the dataset *sparse*. However, this causes features with small variance to be scaled to large values, which underlying solvers like LBFGS cannot handle efficiently; see SPARK-34448 for details.
** If the internal vectors are centered (as in other well-known implementations such as GLMNET and scikit-learn), convergence improves: in the case in SPARK-34448, the number of iterations to convergence drops from 93 to 6, and the final solution is much closer to the one from GLMNET.
** Fortunately, we found a way to 'virtually' center the vectors without densifying the dataset. Good results have been observed in LoR; we need to apply it to the other linear models as well.
# *Coef Initialization (to be discussed)*

was:
We have been refactoring the linear models for a long time, and some work still remains:
# *Blockification (vectorization of vectors)*
** Vectors are stacked into matrices so that high-level BLAS routines can be used for better performance (about ~3x faster on sparse datasets, up to ~15x faster on dense datasets). Since 3.1.1, LoR/SVC/LiR/AFT support blockification; KMeans still needs to be blockified.
# *Standardization (virtual centering)*
** The existing standardization in the linear models does NOT center the vectors by subtracting the means, in order to keep the dataset sparse. However, this causes features with small variance to be scaled to large values, which underlying solvers like LBFGS cannot handle efficiently; see SPARK-34448 for details.

> Project Matrix: Linear Models revisit and refactor
> --------------------------------------------------
>
>                 Key: SPARK-30641
>                 URL: https://issues.apache.org/jira/browse/SPARK-30641
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, PySpark
>    Affects Versions: 3.1.0, 3.2.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
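The two ideas above can be sketched with plain NumPy (a hedged illustration, not Spark's actual implementation): blockification replaces a loop of per-vector dot products with one BLAS-backed matrix product, and virtual centering relies on the identity ((x - mu) / sigma) . w = x . (w / sigma) - (mu / sigma) . w, so margins over the standardized features can be computed without ever materializing the dense matrix X - mu.

```python
# Hedged sketch of blockification and virtual centering; plain NumPy,
# all names here are illustrative, not Spark ML internals.
import numpy as np

rng = np.random.default_rng(0)

# --- Blockification: stack n instance vectors into one (n, d) block so a
# single level-2 BLAS call replaces n separate vector dot products.
n, d = 8, 5
X = rng.random((n, d))   # a "block" of n stacked instance vectors
w = rng.random(d)        # coefficient vector

dots_loop = np.array([X[i] @ w for i in range(n)])  # old per-vector path
dots_block = X @ w                                  # one BLAS GEMV on the block
assert np.allclose(dots_loop, dots_block)

# --- Virtual centering: margins of the centered-and-scaled features,
# computed without subtracting mu from X (which would densify sparse data).
mu = X.mean(axis=0)
sigma = X.std(axis=0)

explicit = ((X - mu) / sigma) @ w   # densifying reference computation
w_scaled = w / sigma                # fold the scaling into the coefficients
offset = (mu / sigma) @ w           # constant correction term, computed once
virtual = X @ w_scaled - offset     # X itself is never modified
assert np.allclose(explicit, virtual)
```

Because the offset is a scalar computed once per coefficient vector, each margin still costs a single sparse dot product, which is what lets the solver see centered data while the stored dataset stays sparse.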