[ 
https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192217#comment-16192217
 ] 

yuhao yang commented on SPARK-3181:
-----------------------------------

Regarding whether to separate Huber loss into an independent Estimator, I 
don't see a direct conflict.

IMO, LinearRegression should act as an all-in-one Estimator that allows users to 
combine whichever loss function, optimizer and regularization they want. It 
should target flexibility and also provide some fundamental infrastructure for 
regression algorithms.
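
As a rough illustration of that kind of composition (plain Python, not Spark 
code; fit_line, squared_grad and huber_grad are hypothetical names, and the 
optimizer is a deliberately naive gradient descent), the loss can be passed in 
as a plug-in while the optimizer loop and the L2 penalty stay unchanged:

```python
from typing import Callable, List, Tuple

# Hypothetical sketch, not Spark's API: a loss is represented by the
# derivative of the per-sample loss with respect to the residual r.
LossGrad = Callable[[float], float]


def squared_grad(r: float) -> float:
    # d/dr of 0.5 * r^2
    return r


def huber_grad(delta: float) -> LossGrad:
    # d/dr of the Huber loss: equal to r near zero, clipped to +/-delta beyond.
    def grad(r: float) -> float:
        return max(-delta, min(delta, r))
    return grad


def fit_line(data: List[Tuple[float, float]], loss_grad: LossGrad,
             l2: float = 0.0, lr: float = 0.05,
             steps: int = 5000) -> Tuple[float, float]:
    """Fit y ~ w * x + b by gradient descent with a pluggable loss."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in data:
            g = loss_grad(w * x + b - y)  # residual = prediction - label
            gw += g * x
            gb += g
        w -= lr * (gw / n + l2 * w)  # L2 penalty on the slope only
        b -= lr * (gb / n)
    return w, b


# Same optimizer and regularization, two different losses.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 59.0)]
w_squared, b_squared = fit_line(points, squared_grad)
w_huber, b_huber = fit_line(points, huber_grad(1.0))
```

Swapping huber_grad(1.0) for squared_grad changes the loss without touching 
the optimizer loop, which is the separation I have in mind.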

In the meantime, we may also support HuberRegression, RidgeRegression and 
others as independent Estimators, which are more convenient but less flexible 
(and can also expose algorithm-specific parameters). As mentioned by Seth, this 
would require better code abstraction and a plugin interface. Besides 
loss/prediction/optimizer, we also need to provide infrastructure for model 
summary and serialization. This should only happen after we can compose an 
Estimator like HuberRegression without noticeable code duplication.
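
To make the "no code duplication" point concrete, here is a minimal sketch 
(plain Python, not Spark code) in which the convenience estimators are only 
presets over one shared parameter set. Every name in it is hypothetical, 
including the "squaredError"/"huber" strings and the 1.35 default, which is 
just an illustrative value for the Huber threshold:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegressionParams:
    # Hypothetical parameter set for a general, all-in-one estimator.
    loss: str = "squaredError"
    reg_param: float = 0.0
    elastic_net_param: float = 0.0  # 0.0 means a pure L2 penalty
    epsilon: float = 1.35           # only meaningful when loss == "huber"


def huber_regression(epsilon: float = 1.35,
                     reg_param: float = 0.0) -> RegressionParams:
    # Convenience preset: fixes the loss, exposes the Huber-specific knob.
    return RegressionParams(loss="huber", reg_param=reg_param, epsilon=epsilon)


def ridge_regression(reg_param: float) -> RegressionParams:
    # Convenience preset: squared error with a pure L2 penalty.
    return RegressionParams(loss="squaredError", reg_param=reg_param,
                            elastic_net_param=0.0)
```

The presets give up flexibility (the loss is fixed) in exchange for a smaller, 
more specific parameter surface, while all fitting logic would stay in one 
place.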


> Add Robust Regression Algorithm with Huber Estimator
> ----------------------------------------------------
>
>                 Key: SPARK-3181
>                 URL: https://issues.apache.org/jira/browse/SPARK-3181
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Fan Jiang
>            Assignee: Yanbo Liang
>              Labels: features
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Linear least squares estimates assume the errors are normally distributed and 
> can behave badly when the errors are heavy-tailed. In practice we get 
> various types of data. We need to include Robust Regression to employ a 
> fitting criterion that is not as vulnerable as least squares.
> In 1973, Huber introduced M-estimation for regression, where "M" stands for 
> "maximum likelihood type". The method is resistant to outliers in the 
> response variable and has been widely used.
> The new feature for MLlib will contain 3 new files
> /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala
> /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala
> /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala
> and one new class HuberRobustGradient in 
> /main/scala/org/apache/spark/mllib/optimization/Gradient.scala
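
For reference, the fitting criterion mentioned in the description above can be 
sketched as follows. This is plain Python rather than the proposed Scala code, 
the function names are hypothetical, and delta = 1.0 is an arbitrary threshold 
chosen for illustration. The Huber loss is quadratic for small residuals and 
linear for large ones, so a single outlier contributes far less than it does 
under least squares:

```python
def squared_loss(r: float) -> float:
    # Least-squares criterion: grows quadratically with the residual.
    return 0.5 * r * r


def huber_loss(r: float, delta: float = 1.0) -> float:
    # Quadratic within [-delta, delta], linear outside it.
    a = abs(r)
    if a <= delta:
        return 0.5 * r * r
    return delta * (a - 0.5 * delta)


# Small residual: both criteria agree.
print(squared_loss(0.5), huber_loss(0.5))    # 0.125 0.125
# Large residual (an outlier): Huber grows only linearly.
print(squared_loss(10.0), huber_loss(10.0))  # 50.0 9.5
```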



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
