[ 
https://issues.apache.org/jira/browse/SPARK-3507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Egor Pakhomov updated SPARK-3507:
---------------------------------
    Comment: was deleted

(was: https://github.com/apache/spark/pull/2371)

> Create RegressionLearner trait and make some current code implement it
> ----------------------------------------------------------------------
>
>                 Key: SPARK-3507
>                 URL: https://issues.apache.org/jira/browse/SPARK-3507
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Egor Pakhomov
>            Assignee: Egor Pakhomov
>            Priority: Minor
>             Fix For: 1.2.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Here in Yandex, during implementation of gradient boosting in Spark and 
> creating our ML tool for internal use, we found the following serious 
> problems in MLlib:
> There is no Regression/Classification learner model abstraction. We were 
> building abstract data processing pipelines which should work with just some 
> regression, with the exact algorithm specified outside this code. There is 
> no abstraction which would allow us to do that. (This is the main reason 
> for all the further problems.)
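>
> As a minimal sketch (the trait name and method are illustrative, part of 
> this proposal rather than the current MLlib API), the abstraction could 
> look like:
>
>     import org.apache.spark.rdd.RDD
>     import org.apache.spark.mllib.regression.{LabeledPoint, RegressionModel}
>
>     // Hypothetical abstraction: pipeline code depends only on this trait,
>     // and the concrete algorithm is plugged in from outside.
>     trait RegressionLearner {
>       def train(data: RDD[LabeledPoint]): RegressionModel
>     }
>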
> There is no common practice in MLlib for testing algorithms: every model 
> generates its own random test data. There are no easily extractable test 
> cases applicable to another algorithm. There are no benchmarks for comparing 
> algorithms. After implementing a new algorithm, it is very hard to 
> understand how it should be tested.
> Lack of serialization testing: MLlib algorithms don't contain tests which 
> verify that a model still works after serialization.
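>
> A shared round-trip helper would make such tests cheap to add. A sketch 
> using plain Java serialization (the exact mechanism is an assumption):
>
>     import java.io._
>
>     // Serialize a model to bytes and read it back; a test would then assert
>     // that the restored model predicts exactly like the original.
>     def roundTrip[T](model: T): T = {
>       val bos = new ByteArrayOutputStream()
>       val oos = new ObjectOutputStream(bos)
>       oos.writeObject(model)
>       oos.close()
>       val ois = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
>       ois.readObject().asInstanceOf[T]
>     }
>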
> During implementation of a new algorithm it's hard to understand what API 
> you should create and which interface to implement.
> A start on solving all these problems must be made by creating a common 
> interface for the typical algorithms/models: regression, classification, 
> clustering, collaborative filtering.
> All main tests should be written against these interfaces, so when a new 
> algorithm is implemented, all it has to do is pass the already written 
> tests. That would let us maintain manageable quality across the whole 
> library.
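>
> For example, one generic check could be reused by every implementation (an 
> illustrative test body; the data and tolerance come from the concrete 
> suite):
>
>     import org.apache.spark.rdd.RDD
>     import org.apache.spark.mllib.regression.LabeledPoint
>
>     // Shared test: a trained model should roughly reproduce its
>     // training labels, whatever the underlying algorithm is.
>     def checkFitsTrainingData(learner: RegressionLearner,
>                               data: RDD[LabeledPoint],
>                               tolerance: Double): Unit = {
>       val model = learner.train(data)
>       data.collect().foreach { p =>
>         assert(math.abs(model.predict(p.features) - p.label) <= tolerance)
>       }
>     }
>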
> There should be a couple of benchmarks which give a new Spark user a 
> feeling for which algorithm to use.
> The test set against these abstractions should contain a serialization 
> test. In production there is usually no need for a model which can't be 
> stored.
> As the first step of this roadmap I'd like to create a RegressionLearner 
> trait, add methods to the current algorithms to implement this trait, and 
> create some tests against it.
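>
> For instance, the existing linear regression could satisfy the trait 
> through a thin adapter (a sketch; the adapter class name is hypothetical):
>
>     import org.apache.spark.rdd.RDD
>     import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD, RegressionModel}
>
>     // Adapter: delegates to the existing train() entry point so pipeline
>     // code can treat linear regression as just another RegressionLearner.
>     class LinearRegressionLearner(numIterations: Int = 100) extends RegressionLearner {
>       def train(data: RDD[LabeledPoint]): RegressionModel =
>         LinearRegressionWithSGD.train(data, numIterations)
>     }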



