[ https://issues.apache.org/jira/browse/SPARK-30641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307690#comment-17307690 ]
Weichen Xu commented on SPARK-30641:
------------------------------------

Good work!

> Project Matrix: Linear Models revisit and refactor
> --------------------------------------------------
>
>                 Key: SPARK-30641
>                 URL: https://issues.apache.org/jira/browse/SPARK-30641
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, PySpark
>    Affects Versions: 3.1.0, 3.2.0
>            Reporter: zhengruifeng
>            Priority: Major
>
> We have been refactoring the linear models for a long time, and some work still remains. After discussion among [~huaxingao] [~srowen] [~weichenxu123], we decided to gather the related work under a sub-project, Matrix. It includes:
> # *Blockification (vectorization of vectors)*
> ** Vectors are stacked into matrices so that high-level BLAS routines can be used for better performance (about ~3x faster on sparse datasets and up to ~18x faster on dense datasets; see SPARK-31783 for details, and the blockification sketch below).
> ** Since 3.1.1, LoR/SVC/LiR/AFT support blockification; KMeans still needs to be blockified in the future.
> # *Standardization (virtual centering)*
> ** The existing implementation of standardization in the linear models does NOT center the vectors by removing the means, in order to preserve dataset _*sparsity*_. However, features with small variance are then scaled to large values, which an underlying solver like LBFGS cannot handle efficiently; see SPARK-34448 for details.
> ** If the internal vectors are centered (as in other well-known implementations, e.g. GLMNET/Scikit-Learn), convergence is much better. In the case reported in SPARK-34448, the number of iterations to convergence drops from 93 to 6, and the final solution is much closer to the one from GLMNET.
> ** Luckily, we found a new way to _*virtually*_ center the vectors without densifying the dataset. Good results have been observed in LoR, and we will apply the same trick to the other linear models (see the virtual-centering sketch below).
> # _*Initialization (To be discussed)*_
> ** Initializing the model coefficients from a given model should improve: 1) convergence (fewer iterations should be needed); 2) model stability (the new solution may stay closer to the previous one). See the warm-start sketch below.
> # _*Early Stopping* *(To be discussed)*_
> ** We can compute the test error during training (as the tree models do) and stop the training procedure once the test error begins to increase (see the early-stopping sketch below).
>
> If you want to add other features to these models, please comment on the ticket.
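To make the blockification item concrete, here is a minimal, self-contained Scala sketch (not the actual Spark ML internals; `Instance`, `marginsRowWise` and `marginsBlockified` are hypothetical names). It contrasts the per-instance path, one dot product per row, with a blockified path that stacks a block of rows into a flat row-major matrix and computes all margins of the block in one pass; the real implementation hands that matrix-vector product to a native BLAS routine, which is where the speed-up comes from.

{code:scala}
object BlockifySketch {
  // One training instance: a label and a dense feature vector.
  final case class Instance(label: Double, features: Array[Double])

  // Per-instance path: one dot product per row.
  def marginsRowWise(instances: Array[Instance], coef: Array[Double]): Array[Double] =
    instances.map(inst => inst.features.zip(coef).map { case (x, w) => x * w }.sum)

  // Blockified path: stack `blockSize` rows into a flat row-major matrix and compute
  // the margins of the whole block at once (a stand-in for a native BLAS gemv call).
  def marginsBlockified(instances: Array[Instance], coef: Array[Double], blockSize: Int): Array[Double] = {
    val numFeatures = coef.length
    instances.grouped(blockSize).flatMap { block =>
      // Flatten the block into a (block.length x numFeatures) row-major matrix.
      val matrix = new Array[Double](block.length * numFeatures)
      block.iterator.zipWithIndex.foreach { case (inst, i) =>
        System.arraycopy(inst.features, 0, matrix, i * numFeatures, numFeatures)
      }
      // margins = matrix * coef; the real implementation delegates this product to BLAS.
      Array.tabulate(block.length) { i =>
        var sum = 0.0
        var j = 0
        while (j < numFeatures) { sum += matrix(i * numFeatures + j) * coef(j); j += 1 }
        sum
      }
    }.toArray
  }

  def main(args: Array[String]): Unit = {
    val data = Array(
      Instance(1.0, Array(1.0, 2.0)),
      Instance(0.0, Array(3.0, 4.0)),
      Instance(1.0, Array(5.0, 6.0)))
    val coef = Array(0.5, -0.25)
    println(marginsRowWise(data, coef).mkString(", "))                    // 0.0, 0.5, 1.0
    println(marginsBlockified(data, coef, blockSize = 2).mkString(", "))  // same values
  }
}
{code}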
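The virtual-centering trick can be sketched in the same spirit (again a simplified illustration, not the code from SPARK-34448; `SparseVec` and `centeredMargin` are hypothetical names). The margin on the centered and scaled features decomposes into a sparse dot product over the original non-zero entries plus a constant offset per coefficient vector, so the dataset is never densified:

{code:scala}
object VirtualCenteringSketch {
  // A sparse feature vector stored as parallel (index, value) arrays.
  final case class SparseVec(indices: Array[Int], values: Array[Double])

  // Margin of a linear model on the standardized (centered AND scaled) features,
  // computed without ever materializing the dense centered vector:
  //   sum_j ((x_j - mu_j) / sigma_j) * beta_j
  //     = sum_{j in non-zeros} x_j * beta_j / sigma_j  -  sum_j mu_j * beta_j / sigma_j
  // The second term depends only on (beta, mu, sigma), so the centering stays "virtual".
  def centeredMargin(x: SparseVec, beta: Array[Double], mu: Array[Double], sigma: Array[Double]): Double = {
    // Constant offset; a real solver would precompute it once per iteration.
    var offset = 0.0
    var j = 0
    while (j < beta.length) {
      if (sigma(j) != 0.0) offset += mu(j) * beta(j) / sigma(j)
      j += 1
    }
    // Sparse part: touch only the non-zero entries of x.
    var dot = 0.0
    var k = 0
    while (k < x.indices.length) {
      val col = x.indices(k)
      if (sigma(col) != 0.0) dot += x.values(k) * beta(col) / sigma(col)
      k += 1
    }
    dot - offset
  }

  def main(args: Array[String]): Unit = {
    val x = SparseVec(Array(1, 3), Array(2.0, 4.0))   // implicit zeros at indices 0 and 2
    val beta = Array(0.1, 0.2, 0.3, 0.4)
    val mu = Array(1.0, 1.0, 1.0, 1.0)
    val sigma = Array(2.0, 2.0, 2.0, 2.0)
    println(centeredMargin(x, beta, mu, sigma))       // 0.5, same as densify-then-center
  }
}
{code}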
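For the initialization item, here is a toy warm-start sketch, assuming a plain gradient-descent solver rather than the LBFGS/OWLQN optimizers actually used in Spark ML (`fitLogistic` is a hypothetical name). Starting the second fit from a previous model's coefficients places it near the optimum, so it stops after far fewer iterations:

{code:scala}
object WarmStartSketch {
  // Toy batch gradient descent for L2-regularized logistic regression;
  // the point of interest is only the `init` parameter (the warm start).
  def fitLogistic(
      xs: Array[Array[Double]],
      ys: Array[Double],
      init: Array[Double],
      lr: Double = 1.0,
      reg: Double = 0.1,
      tol: Double = 1e-6,
      maxIter: Int = 1000): (Array[Double], Int) = {
    val beta = init.clone()
    var iter = 0
    var maxStep = Double.MaxValue
    while (iter < maxIter && maxStep > tol) {
      val grad = new Array[Double](beta.length)
      for ((x, y) <- xs.zip(ys)) {
        val margin = x.zip(beta).map { case (xi, bi) => xi * bi }.sum
        val p = 1.0 / (1.0 + math.exp(-margin))
        for (j <- beta.indices) grad(j) += (p - y) * x(j) / xs.length
      }
      maxStep = 0.0
      for (j <- beta.indices) {
        val step = lr * (grad(j) + reg * beta(j))
        beta(j) -= step
        maxStep = math.max(maxStep, math.abs(step))
      }
      iter += 1
    }
    (beta, iter)
  }

  def main(args: Array[String]): Unit = {
    val xs = Array(Array(1.0, 0.0), Array(0.9, 0.1), Array(0.1, 0.9), Array(0.0, 1.0))
    val ys = Array(1.0, 1.0, 0.0, 0.0)
    val (coldCoef, coldIters) = fitLogistic(xs, ys, init = Array(0.0, 0.0))
    // Warm start from the previous solution, e.g. when refitting a similar model:
    val (_, warmIters) = fitLogistic(xs, ys, init = coldCoef)
    println(s"cold start: $coldIters iterations, warm start: $warmIters iteration(s)")
  }
}
{code}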
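For the early-stopping item, here is a generic sketch of the kind of loop that could wrap any iterative trainer (`trainWithEarlyStopping`, `step` and `validationError` are hypothetical names): evaluate a held-out error after each iteration, keep the best model seen so far, and stop once the error has not improved for a few consecutive iterations.

{code:scala}
object EarlyStoppingSketch {
  // `step` advances the model by one training iteration, `validationError` scores it
  // on a held-out set; training stops once the validation error has not improved for
  // `patience` consecutive iterations, and the best model seen so far is returned
  // together with the iteration at which it was found.
  def trainWithEarlyStopping[M](
      initial: M,
      step: M => M,
      validationError: M => Double,
      maxIter: Int,
      patience: Int = 3): (M, Int) = {
    var model = initial
    var best = initial
    var bestErr = validationError(initial)
    var bestIter = 0
    var badRounds = 0
    var iter = 0
    while (iter < maxIter && badRounds < patience) {
      model = step(model)
      iter += 1
      val err = validationError(model)
      if (err < bestErr) {
        bestErr = err; best = model; bestIter = iter; badRounds = 0
      } else {
        badRounds += 1                   // validation error stopped improving
      }
    }
    (best, bestIter)
  }

  def main(args: Array[String]): Unit = {
    // Toy "model": a single coefficient moving toward 2.0; the validation error is
    // minimized near 1.5, so training past that point starts to overfit.
    val (best, it) = trainWithEarlyStopping[Double](
      initial = 0.0,
      step = w => w + 0.1,
      validationError = w => math.abs(w - 1.5),
      maxIter = 100)
    println(s"kept coefficient $best found at iteration $it")
  }
}
{code}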