Re: Adding abstraction in MLlib

Erik Erlandson Fri, 12 Sep 2014 12:11:20 -0700

Are interface designs being captured anywhere as documents that the community 
can follow along with as the proposals evolve?


I've worked on other open source projects where design docs were published as
"living documents" (e.g. on Google Docs or Etherpad, though the particular
mechanism isn't crucial). FWIW, I found that to be a good way to work in a
community environment.


----- Original Message -----
> Hi Egor,
> 
> Thanks for the feedback! We are aware of some of the issues you
> mentioned and there are JIRAs created for them. Specifically, I'm
> pushing out the design on pipeline features and algorithm/model
> parameters this week. We can move our discussion to
> https://issues.apache.org/jira/browse/SPARK-1856 .
> 
> It would be nice to write tests against interfaces, but that definitely
> needs more discussion before making PRs. For example, we discussed the
> learning interfaces in Christoph's PR
> (https://github.com/apache/spark/pull/2137/), but it takes time to
> reach a consensus, especially on interfaces. Hopefully all of us can
> benefit from the discussion. The best practice is to break the
> proposal down into small independent pieces and discuss them on the JIRA
> before submitting PRs.
> 
> For performance tests, there is a spark-perf package
> (https://github.com/databricks/spark-perf) and we added performance
> tests for MLlib in v1.1. But definitely more work needs to be done.
> 
> The dev-list may not be a good place for discussion on the design.
> Could you create JIRAs for each of the issues you pointed out, so that we
> can track the discussion on JIRA? Thanks!
> 
> Best,
> Xiangrui
> 
> On Fri, Sep 12, 2014 at 10:45 AM, Reynold Xin <r...@databricks.com> wrote:
> > Xiangrui can comment more, but I believe he and Joseph are actually
> > working on a standardized interface and pipeline feature for the 1.2
> > release.
> >
> > On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov <pahomov.e...@gmail.com>
> > wrote:
> >
> >> Some architecture suggestions on this matter -
> >> https://github.com/apache/spark/pull/2371
> >>
> >> 2014-09-12 16:38 GMT+04:00 Egor Pahomov <pahomov.e...@gmail.com>:
> >>
> >> > Sorry, I miswrote - I meant the learners part of the framework - models
> >> > already exist.
> >> >
> >> > 2014-09-12 15:53 GMT+04:00 Christoph Sawade <
> >> > christoph.saw...@googlemail.com>:
> >> >
> >> >> I totally agree, and we also discovered some drawbacks in the
> >> >> implementation of the classification models that are based on GLMs:
> >> >>
> >> >> - There is no distinction between predicting scores, classes, and
> >> >> calibrated scores (probabilities). For these models it is common to
> >> >> have access to all of them, and the prediction function ``predict``
> >> >> should be consistent and stateless. Currently, the score is only
> >> >> available after removing the threshold from the model.
> >> >> - There is no distinction between multinomial and binomial
> >> >> classification. For multinomial problems, it is necessary to handle
> >> >> multiple weight vectors and multiple confidences.
> >> >> - Models are not serialisable, which makes it hard to use them in
> >> >> practice.
> >> >>
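To make the first point above concrete, here is a minimal sketch of an interface that keeps raw scores, calibrated probabilities, and class labels separate, with the threshold passed as an argument rather than stored in the model. It is only an illustration under assumptions: `BinaryClassificationModel`, `LogisticModel`, and all method names are hypothetical and are not MLlib's API.

```scala
// Hypothetical interface, not MLlib's API: one method per kind of prediction.
trait BinaryClassificationModel extends Serializable {
  /** Raw, uncalibrated score (the margin, for a linear model). */
  def predictScore(features: Array[Double]): Double

  /** Calibrated score in [0, 1]. */
  def predictProbability(features: Array[Double]): Double

  /** Class label; the threshold is an argument, not mutable model state. */
  def predictClass(features: Array[Double], threshold: Double = 0.5): Int =
    if (predictProbability(features) >= threshold) 1 else 0
}

// Hypothetical stand-in for a GLM-based classifier.
final class LogisticModel(weights: Array[Double], intercept: Double)
    extends BinaryClassificationModel {

  def predictScore(features: Array[Double]): Double = {
    require(features.length == weights.length, "dimension mismatch")
    weights.zip(features).map { case (w, x) => w * x }.sum + intercept
  }

  // The logistic function maps the raw score to a probability.
  def predictProbability(features: Array[Double]): Double =
    1.0 / (1.0 + math.exp(-predictScore(features)))
}
```

With this shape, a caller can ask for the score without first clearing a threshold on the model, and two callers can use different thresholds on the same model instance.
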
> >> >> I started a pull request [1] some time ago. I would be happy to
> >> >> continue the discussion and clarify the interfaces, too!
> >> >>
> >> >> Cheers, Christoph
> >> >>
> >> >> [1] https://github.com/apache/spark/pull/2137/
> >> >>
> >> >> 2014-09-12 11:11 GMT+02:00 Egor Pahomov <pahomov.e...@gmail.com>:
> >> >>
> >> >>> Here at Yandex, while implementing gradient boosting in Spark and
> >> >>> building our ML tool for internal use, we found the following serious
> >> >>> problems in MLlib:
> >> >>>
> >> >>>
> >> >>>    - There is no Regression/Classification model abstraction. We were
> >> >>>    building abstract data processing pipelines that should work with
> >> >>>    any regression, with the exact algorithm specified outside this
> >> >>>    code. There is no abstraction that allows me to do that. *(This is
> >> >>>    the main reason for all the further problems.)*
> >> >>>    - There is no common practice across MLlib for testing algorithms:
> >> >>>    every model generates its own random test data. There are no easily
> >> >>>    extractable test cases applicable to another algorithm. There are no
> >> >>>    benchmarks for comparing algorithms. After implementing a new
> >> >>>    algorithm, it is very hard to understand how it should be tested.
> >> >>>    - Lack of serialization testing: MLlib algorithms don't contain
> >> >>>    tests that check that a model works after serialization.
> >> >>>    - While implementing a new algorithm, it is hard to understand what
> >> >>>    API you should create and which interface to implement.
> >> >>>
> >> >>> The first step toward solving all these problems is to create common
> >> >>> interfaces for typical algorithms/models - regression, classification,
> >> >>> clustering, collaborative filtering.
> >> >>>
> >> >>> All main tests should be written against these interfaces, so that
> >> >>> when a new algorithm is implemented, all it has to do is pass the
> >> >>> already written tests. This would allow us to maintain manageable
> >> >>> quality across the whole library.
> >> >>>
> >> >>> There should be a couple of benchmarks that give a new Spark user a
> >> >>> feeling for which algorithm to use.
> >> >>>
> >> >>> The test set against these abstractions should contain a serialization
> >> >>> test. In production, there is rarely a need for a model that can't be
> >> >>> stored.
> >> >>>
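A serialization test of this kind could be sketched as a round trip through Java serialization, followed by a check that the restored model predicts the same values. This is a sketch under assumptions: `SerializationCheck`, `roundTrip`, and `LinearModel` are hypothetical names used only for illustration, and Java serialization is just one possible storage mechanism.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object SerializationCheck {
  /** Serializes `obj` to bytes with Java serialization and reads it back. */
  def roundTrip[T <: java.io.Serializable](obj: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(obj) finally out.close()

    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}

// Hypothetical stand-in for a trained model; case classes are serializable.
final case class LinearModel(weights: Vector[Double], intercept: Double) {
  def predict(features: Vector[Double]): Double =
    weights.zip(features).map { case (w, x) => w * x }.sum + intercept
}
```

A test would then build a model, call `SerializationCheck.roundTrip(model)`, and compare predictions from the original and the restored copy on a few inputs.
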
> >> >>> As the first step of this roadmap I'd like to create a trait
> >> >>> RegressionModel, *ADD* methods to the current algorithms to implement
> >> >>> this trait, and create some tests against it. I am planning to do this
> >> >>> next week.
> >> >>>
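For readers following along, the idea of a shared trait plus tests written once against it might look roughly like the sketch below. Only the name `RegressionModel` comes from the proposal; the `predict` signature, the two toy models, and `RegressionModelContract` are assumptions made for illustration, and plain arrays stand in for MLlib's vector and RDD types.

```scala
// Hypothetical interface: the name is from the proposal, the method is assumed.
trait RegressionModel extends Serializable {
  def predict(features: Array[Double]): Double
}

// Two toy implementations standing in for real algorithms.
final class MeanModel(mean: Double) extends RegressionModel {
  def predict(features: Array[Double]): Double = mean
}

final class LinearRegressionModel(weights: Array[Double], intercept: Double)
    extends RegressionModel {
  def predict(features: Array[Double]): Double =
    weights.zip(features).map { case (w, x) => w * x }.sum + intercept
}

object RegressionModelContract {
  /** Mean squared error of `model` on labeled (features, label) points. */
  def meanSquaredError(model: RegressionModel, data: Seq[(Array[Double], Double)]): Double = {
    require(data.nonEmpty, "need at least one labeled point")
    data.map { case (x, y) => math.pow(model.predict(x) - y, 2) }.sum / data.size
  }

  /** Checks written once against the trait and reusable for any implementation. */
  def check(model: RegressionModel, data: Seq[(Array[Double], Double)], maxMse: Double): Boolean = {
    val deterministic = data.forall { case (x, _) => model.predict(x) == model.predict(x) }
    val finite = data.forall { case (x, _) =>
      val p = model.predict(x)
      !p.isNaN && !p.isInfinite
    }
    deterministic && finite && meanSquaredError(model, data) <= maxMse
  }
}
```

A new algorithm would only need to implement `RegressionModel` and be passed to `RegressionModelContract.check` with a shared data set, instead of generating its own random test data.
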
> >> >>> The purpose of this letter is to collect any objections to this
> >> >>> approach at an early stage: please give any feedback. The second
> >> >>> reason is to set a lock on this activity so that we don't do the same
> >> >>> thing twice: I'll create a pull request by the end of next week, and
> >> >>> any parallel development can start from there.
> >> >>>
> >> >>>
> >> >>>
> >> >>> --
> >> >>>
> >> >>>
> >> >>>
> >> >>> *Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*
> >> >>>
> >> >>
> >> >>
> >> >
> >> >
> >> >
> >>
> >>
> >>
> >>
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 
> 

