[ 
https://issues.apache.org/jira/browse/SPARK-34765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-34765:
---------------------------------
    Description: 
The existing implementation of standardization in linear models does *NOT* 
center the vectors by removing the means, in order to preserve the sparsity of 
the dataset.

However, this causes features with small variance to be scaled to large 
values, and underlying solvers such as LBFGS cannot handle this case 
efficiently. See SPARK-34448 for details.
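
As a rough illustration of the scaling problem described above, the following 
Python sketch divides a made-up feature with a large mean and a small variance 
by its standard deviation, with and without centering. The values and the use 
of the sample standard deviation are assumptions chosen for illustration; they 
are not taken from SPARK-34448.

```python
import statistics

# Hypothetical feature column: large mean, small variance.
values = [1000.0, 1000.1, 999.9, 1000.2, 999.8]

mean = statistics.fmean(values)
std = statistics.stdev(values)  # sample standard deviation (an assumption)

# Scaling only: zeros would stay zero, so sparsity is kept,
# but every non-zero value is blown up by the factor 1 / std.
scaled_only = [v / std for v in values]

# Centering and scaling: values end up around zero with unit spread,
# but zeros would become non-zero, which densifies a sparse dataset.
centered_scaled = [(v - mean) / std for v in values]

print("scaled only:      ", [round(v, 1) for v in scaled_only])
print("centered + scaled:", [round(v, 3) for v in centered_scaled])
```

The scaled-only values are in the thousands, while the centered values stay 
within a few units of zero.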

If the internal vectors are centered (as in other well-known implementations, 
e.g. GLMNET/Scikit-Learn), the convergence rate will be better. In the case in 
SPARK-34448, the number of iterations to convergence will be reduced from 93 
to 6. Moreover, the final solution is much closer to the one from GLMNET.

Luckily, we have found a new way to 'virtually' center the vectors without 
densifying the dataset, iff:

1, fitIntercept is true;
2, there is no penalty on the intercept (this seems to always be true in 
existing implementations);
3, there are no bounds on the intercept;
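
One way to read 'virtually' centering is sketched below. It is an assumption 
for illustration, not the actual Spark implementation. With coefficients coef, 
feature means mean and standard deviations std, the margin on centered and 
scaled features equals a dot product over the non-zero entries minus the term 
sum_j coef_j * mean_j / std_j. That term does not depend on the row, so it can 
be computed once per pass, and it is presumably why a free intercept (fitted, 
unpenalized, unbounded) is required. The function names and the dict-based 
sparse vector are hypothetical.

```python
import math

def centered_margin_dense(x_dense, coef, intercept, mean, std):
    # Reference computation: densify the row, center and scale every
    # feature, then take the dot product with the coefficients.
    return intercept + sum(
        c * (x - m) / s for x, c, m, s in zip(x_dense, coef, mean, std)
    )

def centering_offset(coef, mean, std):
    # Row-independent term: sum_j coef_j * mean_j / std_j.
    return sum(c * m / s for c, m, s in zip(coef, mean, std))

def centered_margin_sparse(x_sparse, coef, intercept, std, offset):
    # x_sparse maps feature index -> non-zero value; zeros are never touched.
    sparse_dot = sum(coef[j] * v / std[j] for j, v in x_sparse.items())
    return intercept + sparse_dot - offset

# Made-up example with 4 features, 2 of them non-zero in this row.
coef = [0.5, -1.0, 2.0, 0.25]
mean = [10.0, 0.2, 3.0, 1000.0]
std = [2.0, 0.5, 1.5, 0.1]
intercept = 0.7

x_sparse = {1: 4.0, 3: 1000.3}
x_dense = [x_sparse.get(j, 0.0) for j in range(len(coef))]

offset = centering_offset(coef, mean, std)  # computed once, reused per row
dense = centered_margin_dense(x_dense, coef, intercept, mean, std)
sparse = centered_margin_sparse(x_sparse, coef, intercept, std, offset)

print(math.isclose(dense, sparse, rel_tol=1e-9, abs_tol=1e-9))
```

Both functions give the same margin, but the sparse version only iterates over 
the stored non-zero entries of the row.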

 

We will also need to check whether this new method works as expected in all 
other linear models (e.g. mlor/svc/lir/aft), and introduce it into those 
models if possible.

 

  was:
Existing impl of standardization in linear models do NOT center the vectors by 
removing the means, for the purpose of keep the dataset sparsity.

However, this will cause feature values with small var be scaled to large 
values, and underlying solver like LBFGS can not efficiently handle this case. 
see SPARK-34448 for details.

If internal vectors are centers (like other famous impl, i.e. 
GLMNET/Scikit-Learn), the convergence ratio will be better. In the case in 
SPARK-34448, the number of iteration to convergence will be reduced from 93 to 
6. Moreover, the final solution is much more close to the one in GLMNET.

luckily, we find a new way to 'virtually' center the vectors without densifying 
the dataset, iff:

1, fitIntercept is true;
2, no penalty on the intercept, it seem this is always true in existing impls;
3, no bounds on the intercept;

 

We will also need to check whether this new methods work in all other linear 
models (i.e, mlor/svc/lir/aft, etc.) as we expected , and introduce it into 
those model if possible.

 


> Linear Models standardization optimization
> ------------------------------------------
>
>                 Key: SPARK-34765
>                 URL: https://issues.apache.org/jira/browse/SPARK-34765
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML
>    Affects Versions: 3.2.0, 3.1.1
>            Reporter: zhengruifeng
>            Priority: Major
>
> The existing implementation of standardization in linear models does *NOT* 
> center the vectors by removing the means, in order to preserve the sparsity 
> of the dataset.
> However, this causes features with small variance to be scaled to large 
> values, and underlying solvers such as LBFGS cannot handle this case 
> efficiently. See SPARK-34448 for details.
> If the internal vectors are centered (as in other well-known 
> implementations, e.g. GLMNET/Scikit-Learn), the convergence rate will be 
> better. In the case in SPARK-34448, the number of iterations to convergence 
> will be reduced from 93 to 6. Moreover, the final solution is much closer to 
> the one from GLMNET.
> Luckily, we have found a new way to 'virtually' center the vectors without 
> densifying the dataset, iff:
> 1, fitIntercept is true;
> 2, there is no penalty on the intercept (this seems to always be true in 
> existing implementations);
> 3, there are no bounds on the intercept;
>  
> We will also need to check whether this new method works as expected in all 
> other linear models (e.g. mlor/svc/lir/aft), and introduce it into those 
> models if possible.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
