[ https://issues.apache.org/jira/browse/SPARK-30641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhengruifeng reassigned SPARK-30641:
------------------------------------

    Assignee: zhengruifeng

> Project Matrix: Linear Models revisit and refactor
> --------------------------------------------------
>
>                 Key: SPARK-30641
>                 URL: https://issues.apache.org/jira/browse/SPARK-30641
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, PySpark
>    Affects Versions: 3.1.0, 3.2.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Major
>
> We have been refactoring the linear models for a long time, and some work still remains. After discussions among [~huaxingao] [~srowen] [~weichenxu123] [~mengxr] [~podongfeng], we decided to gather the related work under a sub-project, Matrix. It includes:
> # *Blockification (vectorization of vectors)*
> ** Vectors are stacked into matrices so that high-level BLAS can be used for better performance: roughly 3x faster on sparse datasets and up to 18x faster on dense datasets (see SPARK-31783 for details, and the first sketch appended to this message).
> ** Since 3.1.1, LoR/SVC/LiR/AFT support blockification; KMeans still needs to be blockified.
> # *Standardization (virtual centering)*
> ** The existing standardization in the linear models does NOT center the vectors by subtracting the means, in order to preserve dataset _*sparsity*_. However, features with small variance are then scaled to large values, which underlying solvers such as LBFGS cannot handle efficiently; see SPARK-34448 for details.
> ** If the internal vectors are centered (as in the well-known GLMNET), convergence is better. In the case reported in SPARK-34448, the number of iterations to convergence drops from 93 to 6, and the final solution is much closer to the GLMNET one.
> ** Fortunately, we found a way to _*virtually*_ center the vectors without densifying the dataset (see the second sketch appended to this message). Good results have been observed in LoR, and we will apply the same approach to the other linear models.
> # _*Initialization (To be discussed)*_
> ** Initializing the model coefficients from a given model should benefit: 1) convergence (fewer iterations should be needed); 2) model stability (the new solution should stay closer to the previous one).
> # _*Early Stopping (To be discussed)*_
> ** We can compute the test error during training (as the tree models do) and stop the training procedure once the test error begins to increase.
>
> If you want to add other features to these models, please comment in the ticket.
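
Appended below is a minimal, illustrative Scala sketch of the blockification idea, using only public org.apache.spark.ml.linalg types. The object name, the block-building loop and the coefficient values are hypothetical; this is not the actual aggregator code from SPARK-31783. It only shows how stacking several instance rows into one matrix turns a loop of per-row dot products into a single matrix-vector multiplication that a native BLAS can execute much more efficiently.

{code:scala}
import org.apache.spark.ml.linalg.{DenseMatrix, DenseVector}

// Sketch only: stack a block of instance rows into one matrix so that all
// margins for the block come from a single matrix-vector multiplication.
object BlockificationSketch {
  def main(args: Array[String]): Unit = {
    // Three instances with four features each (written row-major for readability).
    val rows = Array(
      Array(1.0, 0.0, 2.0, 0.0),
      Array(0.0, 3.0, 0.0, 1.0),
      Array(4.0, 0.0, 0.0, 2.0))
    val numRows = rows.length
    val numCols = rows.head.length

    // Repack the rows into one column-major DenseMatrix (the "block").
    val values = Array.ofDim[Double](numRows * numCols)
    for (i <- 0 until numRows; j <- 0 until numCols) {
      values(j * numRows + i) = rows(i)(j)
    }
    val block = new DenseMatrix(numRows, numCols, values)

    // Hypothetical coefficient vector; one gemv-style multiply yields all margins.
    val coefficients = new DenseVector(Array(0.5, -1.0, 0.25, 2.0))
    val margins = block.multiply(coefficients)
    println(s"margins for the whole block: $margins")
  }
}
{code}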
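
And a second sketch of the virtual-centering idea, under an assumed formulation: with standardized, centered features the margin is sum_j beta_j * (x_j - mu_j) / sigma_j, which splits into a scaled term that only touches the non-zero entries of x and a constant offset sum_j beta_j * mu_j / sigma_j that can be precomputed once per coefficient vector, so the data never has to be densified. The helper names and numbers are hypothetical; this is not the exact implementation used in LoR.

{code:scala}
import org.apache.spark.ml.linalg.SparseVector

// Sketch only: compute the margin on centered + scaled features while touching
// only the non-zero entries of the sparse input vector.
object VirtualCenteringSketch {

  // Precompute the constant centering offset: sum_j beta_j * mu_j / sigma_j.
  def centeringOffset(beta: Array[Double], mu: Array[Double], sigma: Array[Double]): Double = {
    var offset = 0.0
    var j = 0
    while (j < beta.length) {
      if (sigma(j) != 0.0) offset += beta(j) * mu(j) / sigma(j)
      j += 1
    }
    offset
  }

  // margin = sum over non-zeros of beta_j / sigma_j * x_j  -  offset
  def margin(x: SparseVector, beta: Array[Double], sigma: Array[Double], offset: Double): Double = {
    var m = -offset
    var k = 0
    while (k < x.indices.length) {
      val j = x.indices(k)
      if (sigma(j) != 0.0) m += beta(j) / sigma(j) * x.values(k)
      k += 1
    }
    m
  }

  def main(args: Array[String]): Unit = {
    val x = new SparseVector(4, Array(1, 3), Array(2.0, 5.0)) // sparse input row
    val beta = Array(0.5, -1.0, 0.25, 2.0)                    // hypothetical coefficients
    val mu = Array(0.1, 1.0, 0.0, 3.0)                        // feature means
    val sigma = Array(1.0, 2.0, 0.5, 4.0)                     // feature standard deviations
    val off = centeringOffset(beta, mu, sigma)
    // Same result as beta dot ((x - mu) / sigma), but without densifying x.
    println(s"margin = ${margin(x, beta, sigma, off)}")
  }
}
{code}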