[ https://issues.apache.org/jira/browse/SPARK-3507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Egor Pakhomov updated SPARK-3507:
---------------------------------
Comment: was deleted
(was: https://github.com/apache/spark/pull/2371)

> Create RegressionLearner trait and make some current code implement it
> ----------------------------------------------------------------------
>
>                 Key: SPARK-3507
>                 URL: https://issues.apache.org/jira/browse/SPARK-3507
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Egor Pakhomov
>            Assignee: Egor Pakhomov
>            Priority: Minor
>             Fix For: 1.2.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Here at Yandex, while implementing gradient boosting in Spark and building our ML tool for internal use, we found the following serious problems in MLlib:
> There is no Regression/Classification learner model abstraction. We were building abstract data-processing pipelines that should work with any regression, with the exact algorithm specified outside the pipeline code. There is no abstraction that allows this. (This is the root cause of all the problems below.)
> There is no common testing practice across MLlib: every model generates its own random test data, there are no easily extractable test cases applicable to other algorithms, and there are no benchmarks for comparing algorithms. After implementing a new algorithm it is very hard to understand how it should be tested.
> Lack of serialization testing: MLlib algorithms contain no tests verifying that a model still works after serialization.
> When implementing a new algorithm it is hard to understand what API you should create and which interface to implement.
> Solving all these problems must start with creating common interfaces for the typical algorithms/models: regression, classification, clustering, and collaborative filtering.
> All main tests should be written against these interfaces, so that when a new algorithm is implemented, all it has to do is pass the already written tests.
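For illustration, the common interface proposed above could look roughly like the following. This is a minimal sketch, not the actual MLlib API: the names `RegressionLearner`, `RegressionModel`, `LabeledPoint`, and the trivial `MeanLearner` are all illustrative assumptions.

```scala
// Hypothetical sketch of the proposed abstraction; names are illustrative,
// not the final MLlib API.

// A single labeled training example.
case class LabeledPoint(label: Double, features: Array[Double])

// A trained model: anything that can predict a label from features.
trait RegressionModel extends Serializable {
  def predict(features: Array[Double]): Double
}

// The learner abstraction: pipeline code depends only on this trait,
// while the concrete algorithm (linear regression, boosting, ...) is
// chosen by the caller.
trait RegressionLearner {
  def train(data: Seq[LabeledPoint]): RegressionModel
}

// A trivial implementation for demonstration: always predicts the mean label.
object MeanLearner extends RegressionLearner {
  def train(data: Seq[LabeledPoint]): RegressionModel = {
    val mean = data.map(_.label).sum / data.size
    new RegressionModel {
      def predict(features: Array[Double]): Double = mean
    }
  }
}
```

With such a trait, a generic test suite or benchmark can take any `RegressionLearner` as a parameter, which is exactly what makes "write the tests once, pass them with every new algorithm" possible.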
> This allows us to maintain manageable quality across the whole library.
> There should be a couple of benchmarks that give a new Spark user a feeling for which algorithm to use.
> The test set against these abstractions should contain a serialization test: in production there is usually no use for a model that cannot be stored.
> As the first step of this roadmap I'd like to create a trait RegressionLearner, add methods to current algorithms to implement this trait, and create some tests against it.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
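The serialization test described in the issue could be sketched as a generic round-trip check written once and reused for every model. This is a hypothetical, self-contained sketch using plain Java serialization; `ConstantModel` is an illustrative stand-in for a trained model, not an MLlib class.

```scala
import java.io._

// Serialize an object to bytes and read it back, simulating a store/load cycle.
def roundTrip[T <: Serializable](obj: T): T = {
  val buf = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buf)
  out.writeObject(obj)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
  in.readObject().asInstanceOf[T]
}

// Illustrative stand-in for a trained regression model.
case class ConstantModel(value: Double) extends Serializable {
  def predict(features: Array[Double]): Double = value
}

// The reusable check: predictions must be identical before and after
// serialization, otherwise the model cannot be stored safely.
def survivesSerialization(model: ConstantModel, features: Array[Double]): Boolean = {
  val restored = roundTrip(model)
  model.predict(features) == restored.predict(features)
}
```

Written against a common model trait instead of a concrete class, a check like this would run automatically for every algorithm that implements the interface.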