Some architecture suggestions on this matter:
https://github.com/apache/spark/pull/2371

2014-09-12 16:38 GMT+04:00 Egor Pahomov <pahomov.e...@gmail.com>:

> Sorry, I miswrote - I meant the learners part of the framework; the models
> already exist.
>
> 2014-09-12 15:53 GMT+04:00 Christoph Sawade <
> christoph.saw...@googlemail.com>:
>
>> I totally agree, and we also discovered some drawbacks with the
>> implementation of the classification models that are based on GLMs:
>>
>> - There is no distinction between predicting scores, classes, and
>> calibrated scores (probabilities). For these models it is common to have
>> access to all of them, and the prediction function ``predict`` should be
>> consistent and stateless. Currently, the score is only available after
>> removing the threshold from the model (a rough interface sketch follows
>> below).
>> - There is no distinction between multinomial and binomial
>> classification. For multinomial problems, it is necessary to handle
>> multiple weight vectors and multiple confidences.
>> - Models are not serialisable, which makes it hard to use them in
>> practice.
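>>
>> As a rough illustration of the first two points (a sketch only; the trait
>> and method names are hypothetical, not an existing MLlib API), something
>> along these lines would make the three kinds of predictions explicit and
>> keep the model stateless:
>>
>> import org.apache.spark.mllib.linalg.Vector
>>
>> // Sketch: a classification model exposing raw scores, calibrated
>> // probabilities, and predicted classes as separate, stateless methods.
>> trait ClassificationModel extends Serializable {
>>   def numClasses: Int
>>   // Raw, uncalibrated score per class (length numClasses); one entry
>>   // per weight vector, which also covers the multinomial case.
>>   def predictRaw(features: Vector): Array[Double]
>>   // Calibrated class probabilities, e.g. via sigmoid / softmax.
>>   def predictProbabilities(features: Vector): Array[Double]
>>   // Predicted class label; no threshold is stored in the model.
>>   def predict(features: Vector): Double =
>>     predictProbabilities(features).zipWithIndex.maxBy(_._1)._2.toDouble
>> }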
>>
>> I started a pull request [1] some time ago. I would be happy to continue
>> the discussion and clarify the interfaces, too!
>>
>> Cheers, Christoph
>>
>> [1] https://github.com/apache/spark/pull/2137/
>>
>> 2014-09-12 11:11 GMT+02:00 Egor Pahomov <pahomov.e...@gmail.com>:
>>
>>> Here at Yandex, while implementing gradient boosting in Spark and
>>> creating our ML tool for internal use, we found the following serious
>>> problems in MLlib:
>>>
>>>
>>>    - There is no Regression/Classification model abstraction. We were
>>>    building abstract data processing pipelines which should work with any
>>>    regression, with the exact algorithm specified outside this code. There
>>>    is no abstraction which would allow me to do that. *(This is the main
>>>    reason for all the further problems.)*
>>>    - There is no common practice in MLlib for testing algorithms: every
>>>    model generates its own random test data. There are no easily
>>>    extractable test cases applicable to other algorithms, and there are no
>>>    benchmarks for comparing algorithms. After implementing a new algorithm
>>>    it is very hard to understand how it should be tested.
>>>    - Lack of serialization testing: MLlib algorithms don't contain tests
>>>    which verify that a model still works after serialization.
>>>    - During implementation of a new algorithm it's hard to understand what
>>>    API you should create and which interface to implement.
>>>
>>> The starting point for solving all these problems must be the creation of
>>> a common interface for the typical algorithms/models: regression,
>>> classification, clustering, collaborative filtering.
>>>
>>> All main tests should be written against these interfaces, so that when a
>>> new algorithm is implemented, all it has to do is pass the already written
>>> tests. This would allow us to maintain manageable quality across the whole
>>> library.
>>>
>>> There should be a couple of benchmarks which allow a new Spark user to
>>> get a feeling for which algorithm to use.
>>>
>>> The test set against these abstractions should contain a serialization
>>> test. In production, most of the time there is no use for a model which
>>> can't be stored.
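>>>
>>> A rough sketch of what such a shared serialization test could look like
>>> (the helper name is hypothetical), written once against the common trait
>>> and reused by every implementation:
>>>
>>> import java.io._
>>>
>>> object ModelSerializationTest {
>>>   // Generic Java-serialization round trip for any model that
>>>   // implements the common trait and extends Serializable.
>>>   def roundTrip[M <: Serializable](model: M): M = {
>>>     val bytes = new ByteArrayOutputStream()
>>>     val out = new ObjectOutputStream(bytes)
>>>     out.writeObject(model)
>>>     out.close()
>>>     val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
>>>     in.readObject().asInstanceOf[M]
>>>   }
>>> }
>>>
>>> // In the shared test: predictions must not change after the round trip.
>>> // assert(ModelSerializationTest.roundTrip(model).predict(point) == model.predict(point))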
>>>
>>> As the first step of this roadmap I'd like to create a trait
>>> RegressionModel, *ADD* methods to the current algorithms to implement this
>>> trait, and create some tests against it. I'm planning to do it next week.
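>>>
>>> Roughly, I have something like this in mind (just a sketch to make the
>>> intent concrete; the names are not final):
>>>
>>> import org.apache.spark.mllib.linalg.Vector
>>> import org.apache.spark.rdd.RDD
>>>
>>> // Sketch: a minimal common trait that the existing regressors
>>> // (e.g. LinearRegressionModel, RidgeRegressionModel) could implement,
>>> // so that pipelines and shared tests can be written against it.
>>> trait RegressionModel extends Serializable {
>>>   // Predict the target for a single feature vector.
>>>   def predict(features: Vector): Double
>>>   // Predict targets for an RDD of feature vectors.
>>>   def predict(data: RDD[Vector]): RDD[Double] = data.map(v => predict(v))
>>> }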
>>>
>>> The purpose of this letter is to collect any objections to this approach
>>> at an early stage: please give any feedback. The second reason is to put a
>>> lock on this activity so we don't do the same thing twice: I'll create a
>>> pull request by the end of next week, and any parallel development can
>>> start from there.
>>>
>>>
>>>
>>> --
>>>
>>>
>>>
>>> *Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*
>>>
>>
>>
>
>
> --
>
>
>
> *Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*
>



-- 



*Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*
