Some architecture suggestions on this matter - https://github.com/apache/spark/pull/2371
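For concreteness, here is a minimal Scala sketch of the kind of stateless
prediction interface Christoph argues for below: raw scores, calibrated
probabilities, and hard class predictions exposed side by side, with one
score per class so the binomial and multinomial cases share one shape. All
names here are illustrative only and are taken from neither pull request:

    import org.apache.spark.mllib.linalg.Vector

    // Hypothetical interface (not from either PR): no mutable threshold
    // state lives on the model, and all three prediction flavours are
    // available without modifying the model.
    trait ClassificationModelSketch extends Serializable {

      // Raw, uncalibrated scores, one per class.
      def predictRaw(features: Vector): Array[Double]

      // Calibrated scores (probabilities), e.g. a sigmoid/softmax over
      // the raw scores.
      def predictProbabilities(features: Vector): Array[Double]

      // Hard class prediction: the index of the best-scoring class.
      // Stateless and consistent with the methods above by construction.
      def predict(features: Vector): Double = {
        val scores = predictRaw(features)
        scores.indices.maxBy(i => scores(i)).toDouble
      }
    }

Having the trait extend Serializable directly would also bake in the
serializability that the third point below asks for.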
2014-09-12 16:38 GMT+04:00 Egor Pahomov <pahomov.e...@gmail.com>:

> Sorry, I miswrote - I meant the learners part of the framework; the
> models already exist.
>
> 2014-09-12 15:53 GMT+04:00 Christoph Sawade <
> christoph.saw...@googlemail.com>:
>
>> I totally agree, and we also discovered some drawbacks in the
>> classification model implementations that are based on GLMs:
>>
>>    - There is no distinction between predicting scores, classes, and
>>    calibrated scores (probabilities). For these models it is common to
>>    have access to all of them, and the prediction function ``predict``
>>    should be consistent and stateless. Currently, the score is only
>>    available after removing the threshold from the model.
>>    - There is no distinction between multinomial and binomial
>>    classification. For multinomial problems, it is necessary to handle
>>    multiple weight vectors and multiple confidences.
>>    - Models are not serialisable, which makes it hard to use them in
>>    practice.
>>
>> I started a pull request [1] some time ago. I would be happy to
>> continue the discussion and to clarify the interfaces, too!
>>
>> Cheers, Christoph
>>
>> [1] https://github.com/apache/spark/pull/2137/
>>
>> 2014-09-12 11:11 GMT+02:00 Egor Pahomov <pahomov.e...@gmail.com>:
>>
>>> Here at Yandex, while implementing gradient boosting in Spark and
>>> creating our ML tool for internal use, we ran into the following
>>> serious problems in MLlib:
>>>
>>>    - There is no Regression/Classification model abstraction. We were
>>>    building abstract data processing pipelines that should work with
>>>    just some regression, the exact algorithm being specified outside
>>>    this code. There is no abstraction that would allow me to do that.
>>>    *(This is the main reason for all the further problems.)*
>>>    - There is no common practice in MLlib for testing algorithms:
>>>    every model generates its own random test data. There are no
>>>    easily extractable test cases applicable to another algorithm, and
>>>    there are no benchmarks for comparing algorithms. After
>>>    implementing a new algorithm, it's very hard to understand how it
>>>    should be tested.
>>>    - Lack of serialization testing: MLlib algorithms don't contain
>>>    tests verifying that a model still works after serialization.
>>>    - While implementing a new algorithm, it's hard to understand what
>>>    API you should create and which interface to implement.
>>>
>>> Solving all these problems must start with creating common interfaces
>>> for the typical algorithms/models: regression, classification,
>>> clustering, collaborative filtering.
>>>
>>> All the main tests should be written against these interfaces, so
>>> that when a new algorithm is implemented, all it has to do is pass
>>> the already-written tests. That would give us manageable quality
>>> across the whole library.
>>>
>>> There should be a couple of benchmarks that let a new Spark user get
>>> a feeling for which algorithm to use.
>>>
>>> The test set against these abstractions should contain a
>>> serialization test. In production, most of the time there is no need
>>> for a model that can't be stored.
>>>
>>> As the first step of this roadmap I'd like to create a trait
>>> RegressionModel, *add* methods to the current algorithms to implement
>>> this trait, and create some tests against it. I'm planning to do it
>>> next week.
>>>
>>> The purpose of this letter is to collect any objections to this
>>> approach at an early stage: please give any feedback. The second
>>> reason is to put a lock on this activity so we don't do the same
>>> thing twice: I'll create a pull request by the end of next week, and
>>> any parallel development can start from there.
>>>
>>> --
>>>
>>> *Sincerely yours,*
>>> *Egor Pakhomov*
>>> *Scala Developer, Yandex*
>>>
>>
>
> --
>
> *Sincerely yours,*
> *Egor Pakhomov*
> *Scala Developer, Yandex*

--

*Sincerely yours,*
*Egor Pakhomov*
*Scala Developer, Yandex*
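To make the first step of the roadmap above concrete, here is a minimal
sketch, assuming the MLlib 1.x Vector type, of what such a trait and one
shared, algorithm-agnostic serialization test could look like. Names and
signatures are hypothetical, not the contents of the planned pull request:

    import java.io._
    import org.apache.spark.mllib.linalg.Vector

    // Hypothetical common abstraction: every regression algorithm returns
    // a model implementing this trait, so pipelines and tests can be
    // written once against the trait instead of once per algorithm.
    trait RegressionModelSketch extends Serializable {
      def predict(features: Vector): Double
    }

    // A reusable check, tied to no concrete algorithm: round-trip the
    // model through Java serialization and verify predictions unchanged.
    object RegressionModelChecks {
      def checkSerialization(model: RegressionModelSketch,
                             points: Seq[Vector]): Unit = {
        val buffer = new ByteArrayOutputStream()
        val out = new ObjectOutputStream(buffer)
        out.writeObject(model)
        out.close()

        val in = new ObjectInputStream(
          new ByteArrayInputStream(buffer.toByteArray))
        val restored = in.readObject().asInstanceOf[RegressionModelSketch]

        points.foreach { p =>
          assert(model.predict(p) == restored.predict(p),
            s"prediction changed after serialization for $p")
        }
      }
    }

Any new algorithm whose model implements the trait would pass this check
with no extra test code, which is exactly the manageable-quality property
the proposal describes.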