Here at Yandex, while implementing gradient boosting in Spark and building
our internal ML tool, we found the following serious problems in
MLlib:


   - There is no Regression/Classification model abstraction. We were
   building abstract data processing pipelines that should work with any
   regression model, with the exact algorithm specified outside this code.
   No existing abstraction allows us to do that. *(This is the main reason
   for all the problems below.)*
   - There is no common practice in MLlib for testing algorithms: every
   model generates its own random test data. There are no easily extractable
   test cases applicable to another algorithm, and there are no benchmarks
   for comparing algorithms. After implementing a new algorithm, it is very
   hard to understand how it should be tested.
   - Lack of serialization testing: MLlib algorithms have no tests that
   check that a model still works after serialization.
   - When implementing a new algorithm, it is hard to understand what
   API you should create and which interface to implement.

The first step toward solving all these problems is to create a common
interface for typical algorithms/models: regression, classification,
clustering, collaborative filtering.

All main tests should be written against these interfaces, so that when a
new algorithm is implemented, all it has to do is pass the already written
tests. This would let us maintain manageable quality across the whole library.
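
To make the idea concrete, here is a minimal sketch of a check written once
against the interface and reused for every model. It uses plain Scala
collections instead of RDDs to stay self-contained. The `predict` signature,
the two toy models and the `SharedRegressionChecks` helper are illustrative
assumptions, not existing MLlib API.

```scala
// Hypothetical common interface; the predict signature is an assumption.
trait RegressionModel extends java.io.Serializable {
  def predict(features: Array[Double]): Double
}

// Two toy models standing in for real algorithms.
final class ConstantModel(value: Double) extends RegressionModel {
  def predict(features: Array[Double]): Double = value
}

final class LinearModel(weights: Array[Double], intercept: Double)
    extends RegressionModel {
  def predict(features: Array[Double]): Double = {
    require(features.length == weights.length, "feature dimension mismatch")
    weights.zip(features).map { case (w, x) => w * x }.sum + intercept
  }
}

object SharedRegressionChecks {
  // Shared test data as (label, features) pairs generated from y = 2 * x + 1.
  val data: Seq[(Double, Array[Double])] =
    (0 until 10).map(i => (2.0 * i + 1.0, Array(i.toDouble)))

  // Mean squared error of any model that implements the trait.
  def mse(model: RegressionModel, points: Seq[(Double, Array[Double])]): Double =
    points.map { case (y, x) => math.pow(model.predict(x) - y, 2) }.sum / points.size

  // The check itself: written once, applied to every implementation.
  def passes(model: RegressionModel,
             points: Seq[(Double, Array[Double])],
             maxMse: Double): Boolean =
    mse(model, points) <= maxMse
}
```

A new algorithm would only need to implement the trait to be run through
`SharedRegressionChecks.passes` on the same data as every other model.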

There should be a couple of benchmarks that give a new Spark user a feeling
for which algorithm to use.

The test set for these abstractions should include a serialization test. In
production there is rarely any need for a model that can't be stored.
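
One way such a serialization test could look is a round trip through plain
Java serialization, followed by a comparison of predictions before and after.
This is only a sketch: the nested `RegressionModel` trait, the toy
`LinearModel` and the helper names are hypothetical, and a real test might
need to cover other serializers as well.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.io.{ObjectInputStream, ObjectOutputStream}

object SerializationCheck {
  // Hypothetical interface, repeated here so the example is self-contained.
  trait RegressionModel extends java.io.Serializable {
    def predict(features: Array[Double]): Double
  }

  // Toy model standing in for a real algorithm.
  final class LinearModel(weights: Array[Double], intercept: Double)
      extends RegressionModel {
    def predict(features: Array[Double]): Double =
      weights.zip(features).map { case (w, x) => w * x }.sum + intercept
  }

  // Serialize a value to bytes and read it back with Java serialization.
  def roundTrip[T <: java.io.Serializable](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  // A restored model must give the same predictions as the original.
  def predictsSameAfterRoundTrip(model: RegressionModel,
                                 inputs: Seq[Array[Double]]): Boolean = {
    val restored = roundTrip(model)
    inputs.forall(x => restored.predict(x) == model.predict(x))
  }
}
```

Because the check takes any `RegressionModel`, it could be part of the shared
test set and run against every implementation.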

As the first step of this roadmap, I'd like to create a trait RegressionModel,
*ADD* methods to the current algorithms so that they implement this trait, and
create some tests against it. I plan to do this next week.

The purpose of this letter is to collect any objections to this approach at
an early stage: please give any feedback. The second reason is to claim this
activity so that we don't do the same thing twice: I'll create a pull request
by the end of next week, and any parallel development can start
from there.



-- 



*Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*
