Re: Adding abstraction in MLlib

2014-09-16 Thread Xiangrui Meng
Hi Egor,

I posted the design doc for pipelines and parameters on the JIRA, and now
I'm working out some details of ML datasets, which I will post later this
week. Your feedback is welcome!
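
To give a flavor of the direction, below is a minimal, self-contained
sketch of the kind of pipeline abstraction under discussion. All names are
hypothetical illustrations, not the proposed API; the actual proposal is
the one in the design doc.

    // A dataset is simplified here to a sequence of labeled points.
    case class LabeledPoint(label: Double, features: Array[Double])

    // A stage that maps one dataset to another, e.g. a feature transformer
    // or a fitted model.
    trait Transformer {
      def transform(data: Seq[LabeledPoint]): Seq[LabeledPoint]
    }

    // A stage that is fit on a dataset and produces a Transformer.
    trait Estimator {
      def fit(data: Seq[LabeledPoint]): Transformer
    }

    // A pipeline fits each stage on the output of the stages before it and
    // composes the fitted transformers in order.
    class Pipeline(stages: Seq[Estimator]) extends Estimator {
      def fit(data: Seq[LabeledPoint]): Transformer = {
        var current = data
        val fitted = stages.map { stage =>
          val model = stage.fit(current)
          current = model.transform(current)
          model
        }
        new Transformer {
          def transform(d: Seq[LabeledPoint]): Seq[LabeledPoint] =
            fitted.foldLeft(d)((acc, m) => m.transform(acc))
        }
      }
    }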

Best,
Xiangrui

On Mon, Sep 15, 2014 at 12:44 AM, Reynold Xin  wrote:
> Hi Egor,
>
> Thanks for the suggestion. It is definitely our intention and practice to
> post design docs as soon as they are ready and to keep iteration cycles
> short. As a matter of fact, we encourage posting design docs for major
> features before implementation starts, and, for large features, WIP pull
> requests before they are fully baked.
>
> That said, no, not 100% of a committer's time is on a specific ticket. There
> are lots of tickets that are open for a long time before somebody starts
> actively working on them. So no, it is not true that "all this time was active
> development". Xiangrui should post the design doc as soon as it is ready for
> feedback.
>
>
>
> On Sun, Sep 14, 2014 at 11:26 PM, Egor Pahomov 
> wrote:
>>
>> It's good that Databricks is working on this issue! However, the current
>> process of working on it is not very clear to outsiders.
>>
>> The last update on this ticket was August 5. If all this time was active
>> development, I'm concerned that development without community feedback for
>> such a long time can go in the wrong direction.
>> Even if it ends up being one great big patch, introducing the new
>> interfaces to the community as soon as possible would let us start working
>> on our pipeline code. It would let us write algorithms in the new paradigm
>> instead of the lack of any paradigm we had before, and it would let us
>> help you move old code to the new paradigm.
>>
>> My main point: shorter iterations with more transparency.
>>
>> I think it would be a good idea to create a pull request with the code you
>> have so far, even if it doesn't pass tests, just so we can comment on it
>> before it is formulated in a design doc.
>>
>>
>> 2014-09-13 0:00 GMT+04:00 Patrick Wendell :
>>>
>>> We typically post design docs on JIRAs before major work starts. For
>>> instance, I'm pretty sure SPARK-1856 will have a design doc posted
>>> shortly.
>>>
>>> On Fri, Sep 12, 2014 at 12:10 PM, Erik Erlandson  wrote:
>>> >
>>> > Are interface designs being captured anywhere as documents that the
>>> > community can follow along with as the proposals evolve?
>>> >
>>> > I've worked on other open source projects where design docs were
>>> > published as "living documents" (e.g. on google docs, or etherpad, but the
>>> > particular mechanism isn't crucial).   FWIW, I found that to be a good way
>>> > to work in a community environment.
>>> >
>>> >
>>> > - Original Message -
>>> >> Hi Egor,
>>> >>
>>> >> Thanks for the feedback! We are aware of some of the issues you
>>> >> mentioned and there are JIRAs created for them. Specifically, I'm
>>> >> pushing out the design on pipeline features and algorithm/model
>>> >> parameters this week. We can move our discussion to
>>> >> https://issues.apache.org/jira/browse/SPARK-1856 .
>>> >>
>>> >> It would be nice to make tests against interfaces. But it definitely
>>> >> needs more discussion before making PRs. For example, we discussed the
>>> >> learning interfaces in Christoph's PR
>>> >> (https://github.com/apache/spark/pull/2137/) but it takes time to
>>> >> reach a consensus, especially on interfaces. Hopefully all of us could
>>> >> benefit from the discussion. The best practice is to break down the
>>> >> proposal into small independent pieces and discuss them on the JIRA
>>> >> before submitting PRs.
>>> >>
>>> >> For performance tests, there is a spark-perf package
>>> >> (https://github.com/databricks/spark-perf) and we added performance
>>> >> tests for MLlib in v1.1. But definitely more work needs to be done.
>>> >>
>>> >> The dev list may not be a good place for discussion on the design;
>>> >> could you create JIRAs for each of the issues you pointed out, so we can
>>> >> track the discussion on JIRA? Thanks!
>>> >>
>>> >> Best,
>>> >> Xiangrui
>>> >>
>>> >> On Fri, Sep 12, 2014 at 10:45 AM, Reynold Xin 
>>> >> wrote:
>>> >> > Xiangrui can comment more, but I believe Joseph and he are actually
>>> >> > working on standardizing the interfaces and the pipeline feature for
>>> >> > the 1.2 release.
>>> >> >
>>> >> > On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov
>>> >> > 
>>> >> > wrote:
>>> >> >
>>> >> >> Some architecture suggestions on this matter -
>>> >> >> https://github.com/apache/spark/pull/2371
>>> >> >>
>>> >> >> 2014-09-12 16:38 GMT+04:00 Egor Pahomov :
>>> >> >>
>>> >> >> > Sorry, I miswrote - I meant the learners part of the framework -
>>> >> >> > models already exist.
>>> >> >> >
>>> >> >> > 2014-09-12 15:53 GMT+04:00 Christoph Sawade <
>>> >> >> > christoph.saw...@googlemail.com>:
>>> >> >> >
>>> >> >> I totally agree, and we also discovered some drawbacks with the
>>> >> >> implementation of the classification models that are based on GLMs:
>>> >> >> >>
>>> >> >> >> - There is no distinction between 

Re: Adding abstraction in MLlib

2014-09-15 Thread Reynold Xin
Hi Egor,

Thanks for the suggestion. It is definitely our intention and practice to
post design docs as soon as they are ready and to keep iteration cycles
short. As a matter of fact, we encourage posting design docs for major
features before implementation starts, and, for large features, WIP pull
requests before they are fully baked.

That said, no, not 100% of a committer's time is on a specific ticket.
There are lots of tickets that are open for a long time before somebody
starts actively working on them. So no, it is not true that "all this time
was active development". Xiangrui should post the design doc as soon as it
is ready for feedback.



On Sun, Sep 14, 2014 at 11:26 PM, Egor Pahomov 
wrote:

> It's good that Databricks is working on this issue! However, the current
> process of working on it is not very clear to outsiders.
>
>    - The last update on this ticket was August 5. If all this time was
>    active development, I'm concerned that development without community
>    feedback for such a long time can go in the wrong direction.
>    - Even if it ends up being one great big patch, introducing the new
>    interfaces to the community as soon as possible would let us start
>    working on our pipeline code. It would let us write algorithms in the
>    new paradigm instead of the lack of any paradigm we had before, and it
>    would let us help you move old code to the new paradigm.
>
> My main point: shorter iterations with more transparency.
>
> I think it would be a good idea to create a pull request with the code you
> have so far, even if it doesn't pass tests, just so we can comment on it
> before it is formulated in a design doc.
>
>
> 2014-09-13 0:00 GMT+04:00 Patrick Wendell :
>
>> We typically post design docs on JIRAs before major work starts. For
>> instance, I'm pretty sure SPARK-1856 will have a design doc posted
>> shortly.
>>
>> On Fri, Sep 12, 2014 at 12:10 PM, Erik Erlandson  wrote:
>> >
>> > Are interface designs being captured anywhere as documents that the
>> community can follow along with as the proposals evolve?
>> >
>> > I've worked on other open source projects where design docs were
>> published as "living documents" (e.g. on google docs, or etherpad, but the
>> particular mechanism isn't crucial).   FWIW, I found that to be a good way
>> to work in a community environment.
>> >
>> >
>> > - Original Message -
>> >> Hi Egor,
>> >>
>> >> Thanks for the feedback! We are aware of some of the issues you
>> >> mentioned and there are JIRAs created for them. Specifically, I'm
>> >> pushing out the design on pipeline features and algorithm/model
>> >> parameters this week. We can move our discussion to
>> >> https://issues.apache.org/jira/browse/SPARK-1856 .
>> >>
>> >> It would be nice to make tests against interfaces. But it definitely
>> >> needs more discussion before making PRs. For example, we discussed the
>> >> learning interfaces in Christoph's PR
>> >> (https://github.com/apache/spark/pull/2137/) but it takes time to
>> >> reach a consensus, especially on interfaces. Hopefully all of us could
>> >> benefit from the discussion. The best practice is to break down the
>> >> proposal into small independent pieces and discuss them on the JIRA
>> >> before submitting PRs.
>> >>
>> >> For performance tests, there is a spark-perf package
>> >> (https://github.com/databricks/spark-perf) and we added performance
>> >> tests for MLlib in v1.1. But definitely more work needs to be done.
>> >>
>> >> The dev list may not be a good place for discussion on the design;
>> >> could you create JIRAs for each of the issues you pointed out, so we can
>> >> track the discussion on JIRA? Thanks!
>> >>
>> >> Best,
>> >> Xiangrui
>> >>
>> >> On Fri, Sep 12, 2014 at 10:45 AM, Reynold Xin 
>> wrote:
>> >> > Xiangrui can comment more, but I believe Joseph and he are actually
>> >> > working on standardizing the interfaces and the pipeline feature for
>> >> > the 1.2 release.
>> >> >
>> >> > On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov <
>> pahomov.e...@gmail.com>
>> >> > wrote:
>> >> >
>> >> >> Some architecture suggestions on this matter -
>> >> >> https://github.com/apache/spark/pull/2371
>> >> >>
>> >> >> 2014-09-12 16:38 GMT+04:00 Egor Pahomov :
>> >> >>
>> >> >> > Sorry, I miswrote - I meant the learners part of the framework -
>> >> >> > models already exist.
>> >> >> >
>> >> >> > 2014-09-12 15:53 GMT+04:00 Christoph Sawade <
>> >> >> > christoph.saw...@googlemail.com>:
>> >> >> >
>> >> >> I totally agree, and we also discovered some drawbacks with the
>> >> >> implementation of the classification models that are based on GLMs:
>> >> >> >>
>> >> >> >> - There is no distinction between predicting scores, classes, and
>> >> >> >> calibrated scores (probabilities). For these models it is common
>> to
>> >> >> >> have
>> >> >> >> access to all of them and the prediction function
>> ``predict`` should be
>> >> >> >> consistent and stateless. Currently, the score is only available
>> after
>> >> >> >> removing the threshold from the model.
>> 

Re: Adding abstraction in MLlib

2014-09-14 Thread Egor Pahomov
It's good that Databricks is working on this issue! However, the current
process of working on it is not very clear to outsiders.

   - The last update on this ticket was August 5. If all this time was
   active development, I'm concerned that development without community
   feedback for such a long time can go in the wrong direction.
   - Even if it ends up being one great big patch, introducing the new
   interfaces to the community as soon as possible would let us start
   working on our pipeline code. It would let us write algorithms in the
   new paradigm instead of the lack of any paradigm we had before, and it
   would let us help you move old code to the new paradigm.

My main point: shorter iterations with more transparency.

I think it would be a good idea to create a pull request with the code you
have so far, even if it doesn't pass tests, just so we can comment on it
before it is formulated in a design doc.


2014-09-13 0:00 GMT+04:00 Patrick Wendell :

> We typically post design docs on JIRAs before major work starts. For
> instance, I'm pretty sure SPARK-1856 will have a design doc posted
> shortly.
>
> On Fri, Sep 12, 2014 at 12:10 PM, Erik Erlandson  wrote:
> >
> > Are interface designs being captured anywhere as documents that the
> community can follow along with as the proposals evolve?
> >
> > I've worked on other open source projects where design docs were
> published as "living documents" (e.g. on google docs, or etherpad, but the
> particular mechanism isn't crucial).   FWIW, I found that to be a good way
> to work in a community environment.
> >
> >
> > - Original Message -
> >> Hi Egor,
> >>
> >> Thanks for the feedback! We are aware of some of the issues you
> >> mentioned and there are JIRAs created for them. Specifically, I'm
> >> pushing out the design on pipeline features and algorithm/model
> >> parameters this week. We can move our discussion to
> >> https://issues.apache.org/jira/browse/SPARK-1856 .
> >>
> >> It would be nice to make tests against interfaces. But it definitely
> >> needs more discussion before making PRs. For example, we discussed the
> >> learning interfaces in Christoph's PR
> >> (https://github.com/apache/spark/pull/2137/) but it takes time to
> >> reach a consensus, especially on interfaces. Hopefully all of us could
> >> benefit from the discussion. The best practice is to break down the
> >> proposal into small independent pieces and discuss them on the JIRA
> >> before submitting PRs.
> >>
> >> For performance tests, there is a spark-perf package
> >> (https://github.com/databricks/spark-perf) and we added performance
> >> tests for MLlib in v1.1. But definitely more work needs to be done.
> >>
> >> The dev list may not be a good place for discussion on the design;
> >> could you create JIRAs for each of the issues you pointed out, so we can
> >> track the discussion on JIRA? Thanks!
> >>
> >> Best,
> >> Xiangrui
> >>
> >> On Fri, Sep 12, 2014 at 10:45 AM, Reynold Xin 
> wrote:
> >> > Xiangrui can comment more, but I believe Joseph and he are actually
> >> > working on standardizing the interfaces and the pipeline feature for
> >> > the 1.2 release.
> >> >
> >> > On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov  >
> >> > wrote:
> >> >
> >> >> Some architecture suggestions on this matter -
> >> >> https://github.com/apache/spark/pull/2371
> >> >>
> >> >> 2014-09-12 16:38 GMT+04:00 Egor Pahomov :
> >> >>
> >> >> > Sorry, I miswrote - I meant the learners part of the framework -
> >> >> > models already exist.
> >> >> >
> >> >> > 2014-09-12 15:53 GMT+04:00 Christoph Sawade <
> >> >> > christoph.saw...@googlemail.com>:
> >> >> >
> >> >> >> I totally agree, and we also discovered some drawbacks with the
> >> >> >> implementation of the classification models that are based on GLMs:
> >> >> >>
> >> >> >> - There is no distinction between predicting scores, classes, and
> >> >> >> calibrated scores (probabilities). For these models it is common
> to
> >> >> >> have
> >> >> >> access to all of them and the prediction function
> ``predict`` should be
> >> >> >> consistent and stateless. Currently, the score is only available
> after
> >> >> >> removing the threshold from the model.
> >> >> >> - There is no distinction between multinomial and binomial
> >> >> >> classification. For multinomial problems, it is necessary to
> handle
> >> >> >> multiple weight vectors and multiple confidences.
> >> >> >> - Models are not serialisable, which makes it hard to use them in
> >> >> >> practice.
> >> >> >>
> >> >> >> I started a pull request [1] some time ago. I would be happy to
> >> >> >> continue
> >> >> >> the discussion and clarify the interfaces, too!
> >> >> >>
> >> >> >> Cheers, Christoph
> >> >> >>
> >> >> >> [1] https://github.com/apache/spark/pull/2137/
> >> >> >>
> >> >> >> 2014-09-12 11:11 GMT+02:00 Egor Pahomov :
> >> >> >>
> >> >>> Here at Yandex, while implementing gradient boosting in Spark and
> >> >>> creating our ML tool for internal use, we found the following serious
> >> >>> problems
> >> 

Re: Adding abstraction in MLlib

2014-09-12 Thread Patrick Wendell
We typically post design docs on JIRAs before major work starts. For
instance, I'm pretty sure SPARK-1856 will have a design doc posted
shortly.

On Fri, Sep 12, 2014 at 12:10 PM, Erik Erlandson  wrote:
>
> Are interface designs being captured anywhere as documents that the community 
> can follow along with as the proposals evolve?
>
> I've worked on other open source projects where design docs were published as 
> "living documents" (e.g. on google docs, or etherpad, but the particular 
> mechanism isn't crucial).   FWIW, I found that to be a good way to work in a 
> community environment.
>
>
> - Original Message -
>> Hi Egor,
>>
>> Thanks for the feedback! We are aware of some of the issues you
>> mentioned and there are JIRAs created for them. Specifically, I'm
>> pushing out the design on pipeline features and algorithm/model
>> parameters this week. We can move our discussion to
>> https://issues.apache.org/jira/browse/SPARK-1856 .
>>
>> It would be nice to make tests against interfaces. But it definitely
>> needs more discussion before making PRs. For example, we discussed the
>> learning interfaces in Christoph's PR
>> (https://github.com/apache/spark/pull/2137/) but it takes time to
>> reach a consensus, especially on interfaces. Hopefully all of us could
>> benefit from the discussion. The best practice is to break down the
>> proposal into small independent pieces and discuss them on the JIRA
>> before submitting PRs.
>>
>> For performance tests, there is a spark-perf package
>> (https://github.com/databricks/spark-perf) and we added performance
>> tests for MLlib in v1.1. But definitely more work needs to be done.
>>
>> The dev list may not be a good place for discussion on the design;
>> could you create JIRAs for each of the issues you pointed out, so we can
>> track the discussion on JIRA? Thanks!
>>
>> Best,
>> Xiangrui
>>
>> On Fri, Sep 12, 2014 at 10:45 AM, Reynold Xin  wrote:
>> > Xiangrui can comment more, but I believe Joseph and he are actually
>> > working on standardizing the interfaces and the pipeline feature for
>> > the 1.2 release.
>> >
>> > On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov 
>> > wrote:
>> >
>> >> Some architecture suggestions on this matter -
>> >> https://github.com/apache/spark/pull/2371
>> >>
>> >> 2014-09-12 16:38 GMT+04:00 Egor Pahomov :
>> >>
>> >> > Sorry, I miswrote - I meant the learners part of the framework -
>> >> > models already exist.
>> >> >
>> >> > 2014-09-12 15:53 GMT+04:00 Christoph Sawade <
>> >> > christoph.saw...@googlemail.com>:
>> >> >
>> >> >> I totally agree, and we also discovered some drawbacks with the
>> >> >> implementation of the classification models that are based on GLMs:
>> >> >>
>> >> >> - There is no distinction between predicting scores, classes, and
>> >> >> calibrated scores (probabilities). For these models it is common to
>> >> >> have
>> >> >> access to all of them and the prediction function ``predict`` should be
>> >> >> consistent and stateless. Currently, the score is only available after
>> >> >> removing the threshold from the model.
>> >> >> - There is no distinction between multinomial and binomial
>> >> >> classification. For multinomial problems, it is necessary to handle
>> >> >> multiple weight vectors and multiple confidences.
>> >> >> - Models are not serialisable, which makes it hard to use them in
>> >> >> practice.
>> >> >>
>> >> >> I started a pull request [1] some time ago. I would be happy to
>> >> >> continue
>> >> >> the discussion and clarify the interfaces, too!
>> >> >>
>> >> >> Cheers, Christoph
>> >> >>
>> >> >> [1] https://github.com/apache/spark/pull/2137/
>> >> >>
>> >> >> 2014-09-12 11:11 GMT+02:00 Egor Pahomov :
>> >> >>
>> >> >>> Here at Yandex, while implementing gradient boosting in Spark and
>> >> >>> creating our ML tool for internal use, we found the following serious
>> >> >>> problems in MLlib:
>> >> >>>
>> >> >>>
>> >> >>>    - There is no Regression/Classification model abstraction. We were
>> >> >>>    building abstract data processing pipelines, which should work with
>> >> >>>    just some regression, with the exact algorithm specified outside
>> >> >>>    this code. There is no abstraction that allows me to do that.
>> >> >>>    *(It's the main reason for all further problems.)*
>> >> >>>    - There is no common practice in MLlib for testing algorithms:
>> >> >>>    every model generates its own random test data. There are no
>> >> >>>    easily extractable test cases applicable to another algorithm.
>> >> >>>    There are no benchmarks for comparing algorithms. After
>> >> >>>    implementing a new algorithm, it's very hard to understand how it
>> >> >>>    should be tested.
>> >> >>>    - Lack of serialization testing: MLlib algorithms don't contain
>> >> >>>    tests which check that a model works after serialization.
>> >> >>>    - While implementing a new algorithm, it's hard to understand what
>> >> >>> 

Re: Adding abstraction in MLlib

2014-09-12 Thread Erik Erlandson

Are interface designs being captured anywhere as documents that the community 
can follow along with as the proposals evolve?

I've worked on other open source projects where design docs were published as 
"living documents" (e.g. on google docs, or etherpad, but the particular 
mechanism isn't crucial).   FWIW, I found that to be a good way to work in a 
community environment.


- Original Message -
> Hi Egor,
> 
> Thanks for the feedback! We are aware of some of the issues you
> mentioned and there are JIRAs created for them. Specifically, I'm
> pushing out the design on pipeline features and algorithm/model
> parameters this week. We can move our discussion to
> https://issues.apache.org/jira/browse/SPARK-1856 .
> 
> It would be nice to make tests against interfaces. But it definitely
> needs more discussion before making PRs. For example, we discussed the
> learning interfaces in Christoph's PR
> (https://github.com/apache/spark/pull/2137/) but it takes time to
> reach a consensus, especially on interfaces. Hopefully all of us could
> benefit from the discussion. The best practice is to break down the
> proposal into small independent pieces and discuss them on the JIRA
> before submitting PRs.
> 
> For performance tests, there is a spark-perf package
> (https://github.com/databricks/spark-perf) and we added performance
> tests for MLlib in v1.1. But definitely more work needs to be done.
> 
> The dev list may not be a good place for discussion on the design;
> could you create JIRAs for each of the issues you pointed out, so we can
> track the discussion on JIRA? Thanks!
> 
> Best,
> Xiangrui
> 
> On Fri, Sep 12, 2014 at 10:45 AM, Reynold Xin  wrote:
> > Xiangrui can comment more, but I believe Joseph and he are actually
> > working on standardizing the interfaces and the pipeline feature for
> > the 1.2 release.
> >
> > On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov 
> > wrote:
> >
> >> Some architecture suggestions on this matter -
> >> https://github.com/apache/spark/pull/2371
> >>
> >> 2014-09-12 16:38 GMT+04:00 Egor Pahomov :
> >>
> >> > Sorry, I miswrote - I meant the learners part of the framework -
> >> > models already exist.
> >> >
> >> > 2014-09-12 15:53 GMT+04:00 Christoph Sawade <
> >> > christoph.saw...@googlemail.com>:
> >> >
> >> >> I totally agree, and we also discovered some drawbacks with the
> >> >> implementation of the classification models that are based on GLMs:
> >> >>
> >> >> - There is no distinction between predicting scores, classes, and
> >> >> calibrated scores (probabilities). For these models it is common to
> >> >> have
> >> >> access to all of them and the prediction function ``predict`` should be
> >> >> consistent and stateless. Currently, the score is only available after
> >> >> removing the threshold from the model.
> >> >> - There is no distinction between multinomial and binomial
> >> >> classification. For multinomial problems, it is necessary to handle
> >> >> multiple weight vectors and multiple confidences.
> >> >> - Models are not serialisable, which makes it hard to use them in
> >> >> practice.
> >> >>
> >> >> I started a pull request [1] some time ago. I would be happy to
> >> >> continue
> >> >> the discussion and clarify the interfaces, too!
> >> >>
> >> >> Cheers, Christoph
> >> >>
> >> >> [1] https://github.com/apache/spark/pull/2137/
> >> >>
> >> >> 2014-09-12 11:11 GMT+02:00 Egor Pahomov :
> >> >>
> >> >>> Here at Yandex, while implementing gradient boosting in Spark and
> >> >>> creating our ML tool for internal use, we found the following serious
> >> >>> problems in MLlib:
> >> >>>
> >> >>>
> >> >>>    - There is no Regression/Classification model abstraction. We were
> >> >>>    building abstract data processing pipelines, which should work with
> >> >>>    just some regression, with the exact algorithm specified outside
> >> >>>    this code. There is no abstraction that allows me to do that. *(It's
> >> >>>    the main reason for all further problems.)*
> >> >>>    - There is no common practice in MLlib for testing algorithms: every
> >> >>>    model generates its own random test data. There are no easily
> >> >>>    extractable test cases applicable to another algorithm. There are no
> >> >>>    benchmarks for comparing algorithms. After implementing a new
> >> >>>    algorithm, it's very hard to understand how it should be tested.
> >> >>>    - Lack of serialization testing: MLlib algorithms don't contain
> >> >>>    tests which check that a model works after serialization.
> >> >>>    - While implementing a new algorithm, it's hard to understand what
> >> >>>    API you should create and which interface to implement.
> >> >>>
> >> >>> Solving all these problems must start with creating common interfaces
> >> >>> for typical algorithms/models - regression, classification, clustering,
> >> >>> collaborative filtering.
> >> >>>
> >> >>> All main tests should be wr

Re: Adding abstraction in MLlib

2014-09-12 Thread Xiangrui Meng
Hi Egor,

Thanks for the feedback! We are aware of some of the issues you
mentioned and there are JIRAs created for them. Specifically, I'm
pushing out the design on pipeline features and algorithm/model
parameters this week. We can move our discussion to
https://issues.apache.org/jira/browse/SPARK-1856 .

It would be nice to make tests against interfaces. But it definitely
needs more discussion before making PRs. For example, we discussed the
learning interfaces in Christoph's PR
(https://github.com/apache/spark/pull/2137/) but it takes time to
reach a consensus, especially on interfaces. Hopefully all of us could
benefit from the discussion. The best practice is to break down the
proposal into small independent pieces and discuss them on the JIRA
before submitting PRs.

For performance tests, there is a spark-perf package
(https://github.com/databricks/spark-perf) and we added performance
tests for MLlib in v1.1. But definitely more work needs to be done.
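
Purely as an illustration (this is not spark-perf's actual harness), the
kind of timing helper such performance tests are built on can be very
small; the `model.predict(testData)` call in the usage comment is
hypothetical:

    // Times a block over several runs and reports the median run time in
    // milliseconds, which keeps JIT warm-up and GC noise out of the result.
    object Bench {
      def medianTimeMs[A](runs: Int)(body: => A): Double = {
        val times = (1 to runs).map { _ =>
          val start = System.nanoTime()
          body
          (System.nanoTime() - start) / 1e6
        }.sorted
        times(times.length / 2)
      }
    }

    // Usage (hypothetical): val ms = Bench.medianTimeMs(5) { model.predict(testData) }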

The dev list may not be a good place for discussion on the design;
could you create JIRAs for each of the issues you pointed out, so we can
track the discussion on JIRA? Thanks!

Best,
Xiangrui

On Fri, Sep 12, 2014 at 10:45 AM, Reynold Xin  wrote:
> Xiangrui can comment more, but I believe Joseph and he are actually
> working on standardizing the interfaces and the pipeline feature for
> the 1.2 release.
>
> On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov 
> wrote:
>
>> Some architecture suggestions on this matter -
>> https://github.com/apache/spark/pull/2371
>>
>> 2014-09-12 16:38 GMT+04:00 Egor Pahomov :
>>
>> > Sorry, I miswrote - I meant the learners part of the framework -
>> > models already exist.
>> >
>> > 2014-09-12 15:53 GMT+04:00 Christoph Sawade <
>> > christoph.saw...@googlemail.com>:
>> >
>> >> I totally agree, and we also discovered some drawbacks with the
>> >> implementation of the classification models that are based on GLMs:
>> >>
>> >> - There is no distinction between predicting scores, classes, and
>> >> calibrated scores (probabilities). For these models it is common to have
>> >> access to all of them and the prediction function ``predict`` should be
>> >> consistent and stateless. Currently, the score is only available after
>> >> removing the threshold from the model.
>> >> - There is no distinction between multinomial and binomial
>> >> classification. For multinomial problems, it is necessary to handle
>> >> multiple weight vectors and multiple confidences.
>> >> - Models are not serialisable, which makes it hard to use them in
>> >> practice.
>> >>
>> >> I started a pull request [1] some time ago. I would be happy to continue
>> >> the discussion and clarify the interfaces, too!
>> >>
>> >> Cheers, Christoph
>> >>
>> >> [1] https://github.com/apache/spark/pull/2137/
>> >>
>> >> 2014-09-12 11:11 GMT+02:00 Egor Pahomov :
>> >>
>> >>> Here at Yandex, while implementing gradient boosting in Spark and
>> >>> creating our ML tool for internal use, we found the following serious
>> >>> problems in MLlib:
>> >>>
>> >>>
>> >>>    - There is no Regression/Classification model abstraction. We were
>> >>>    building abstract data processing pipelines, which should work with
>> >>>    just some regression, with the exact algorithm specified outside this
>> >>>    code. There is no abstraction that allows me to do that. *(It's the
>> >>>    main reason for all further problems.)*
>> >>>    - There is no common practice in MLlib for testing algorithms: every
>> >>>    model generates its own random test data. There are no easily
>> >>>    extractable test cases applicable to another algorithm. There are no
>> >>>    benchmarks for comparing algorithms. After implementing a new
>> >>>    algorithm, it's very hard to understand how it should be tested.
>> >>>    - Lack of serialization testing: MLlib algorithms don't contain tests
>> >>>    which check that a model works after serialization.
>> >>>    - While implementing a new algorithm, it's hard to understand what
>> >>>    API you should create and which interface to implement.
>> >>>
>> >>> Solving all these problems must start with creating common interfaces
>> >>> for typical algorithms/models - regression, classification, clustering,
>> >>> collaborative filtering.
>> >>>
>> >>> All main tests should be written against these interfaces, so when a new
>> >>> algorithm is implemented, all it has to do is pass the already-written
>> >>> tests. That would let us keep manageable quality across the whole
>> >>> library.
>> >>>
>> >>> There should be a couple of benchmarks which allow a new Spark user to
>> >>> get a feeling for which algorithm to use.
>> >>>
>> >>> The test set against these abstractions should contain a serialization
>> >>> test. In production there is usually no use for a model that can't be
>> >>> stored.
>> >>>
>> >>> As the first step of this roadmap, I'd like to create a trait
>> >>> RegressionModel, *ADD* methods to the current algorithms to implement
>> >>> this trait, and create some

Re: Adding abstraction in MLlib

2014-09-12 Thread Reynold Xin
Xiangrui can comment more, but I believe Joseph and he are actually
working on standardizing the interfaces and the pipeline feature for the
1.2 release.

On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov 
wrote:

> Some architecture suggestions on this matter -
> https://github.com/apache/spark/pull/2371
>
> 2014-09-12 16:38 GMT+04:00 Egor Pahomov :
>
> > Sorry, I miswrote - I meant the learners part of the framework -
> > models already exist.
> >
> > 2014-09-12 15:53 GMT+04:00 Christoph Sawade <
> > christoph.saw...@googlemail.com>:
> >
> >> I totally agree, and we also discovered some drawbacks with the
> >> implementation of the classification models that are based on GLMs:
> >>
> >> - There is no distinction between predicting scores, classes, and
> >> calibrated scores (probabilities). For these models it is common to have
> >> access to all of them and the prediction function ``predict`` should be
> >> consistent and stateless. Currently, the score is only available after
> >> removing the threshold from the model.
> >> - There is no distinction between multinomial and binomial
> >> classification. For multinomial problems, it is necessary to handle
> >> multiple weight vectors and multiple confidences.
> >> - Models are not serialisable, which makes it hard to use them in
> >> practice.
> >>
> >> I started a pull request [1] some time ago. I would be happy to continue
> >> the discussion and clarify the interfaces, too!
> >>
> >> Cheers, Christoph
> >>
> >> [1] https://github.com/apache/spark/pull/2137/
> >>
> >> 2014-09-12 11:11 GMT+02:00 Egor Pahomov :
> >>
> >>> Here at Yandex, while implementing gradient boosting in Spark and
> >>> creating our ML tool for internal use, we found the following serious
> >>> problems in MLlib:
> >>>
> >>>
> >>>    - There is no Regression/Classification model abstraction. We were
> >>>    building abstract data processing pipelines, which should work with
> >>>    just some regression, with the exact algorithm specified outside this
> >>>    code. There is no abstraction that allows me to do that. *(It's the
> >>>    main reason for all further problems.)*
> >>>    - There is no common practice in MLlib for testing algorithms: every
> >>>    model generates its own random test data. There are no easily
> >>>    extractable test cases applicable to another algorithm. There are no
> >>>    benchmarks for comparing algorithms. After implementing a new
> >>>    algorithm, it's very hard to understand how it should be tested.
> >>>    - Lack of serialization testing: MLlib algorithms don't contain tests
> >>>    which check that a model works after serialization.
> >>>    - While implementing a new algorithm, it's hard to understand what
> >>>    API you should create and which interface to implement.
> >>>
> >>> Solving all these problems must start with creating common interfaces
> >>> for typical algorithms/models - regression, classification, clustering,
> >>> collaborative filtering.
> >>>
> >>> All main tests should be written against these interfaces, so when a new
> >>> algorithm is implemented, all it has to do is pass the already-written
> >>> tests. That would let us keep manageable quality across the whole
> >>> library.
> >>>
> >>> There should be a couple of benchmarks which allow a new Spark user to
> >>> get a feeling for which algorithm to use.
> >>>
> >>> The test set against these abstractions should contain a serialization
> >>> test. In production there is usually no use for a model that can't be
> >>> stored.
> >>>
> >>> As the first step of this roadmap, I'd like to create a trait
> >>> RegressionModel, *ADD* methods to the current algorithms to implement
> >>> this trait, and create some tests against it. I'm planning to do it next
> >>> week.
> >>>
> >>> The purpose of this letter is to collect any objections to this approach
> >>> at an early stage: please give any feedback. The second reason is to put
> >>> a lock on this activity so we don't do the same thing twice: I'll create
> >>> a pull request by the end of next week, and we can start any parallel
> >>> development from there.
> >>>
> >>>
> >>>
> >>> --
> >>>
> >>>
> >>>
> >>> *Sincerely yours*
> >>> *Egor Pakhomov*
> >>> *Scala Developer, Yandex*
> >>>
> >>
> >>
> >
> >
> > --
> >
> >
> >
> > *Sincerely yours*
> > *Egor Pakhomov*
> > *Scala Developer, Yandex*
> >
>
>
>
> --
>
>
>
> *Sincerely yours*
> *Egor Pakhomov*
> *Scala Developer, Yandex*
>


Re: Adding abstraction in MLlib

2014-09-12 Thread Egor Pahomov
Some architecture suggestions on this matter -
https://github.com/apache/spark/pull/2371

2014-09-12 16:38 GMT+04:00 Egor Pahomov :

> Sorry, I miswrote - I meant the learners part of the framework - models
> already exist.
>
> 2014-09-12 15:53 GMT+04:00 Christoph Sawade <
> christoph.saw...@googlemail.com>:
>
>> I totally agree, and we also discovered some drawbacks with the
>> implementation of the classification models that are based on GLMs:
>>
>> - There is no distinction between predicting scores, classes, and
>> calibrated scores (probabilities). For these models it is common to have
>> access to all of them and the prediction function ``predict`` should be
>> consistent and stateless. Currently, the score is only available after
>> removing the threshold from the model.
>> - There is no distinction between multinomial and binomial
>> classification. For multinomial problems, it is necessary to handle
>> multiple weight vectors and multiple confidences.
>> - Models are not serialisable, which makes it hard to use them in
>> practice.
>>
>> I started a pull request [1] some time ago. I would be happy to continue
>> the discussion and clarify the interfaces, too!
>>
>> Cheers, Christoph
>>
>> [1] https://github.com/apache/spark/pull/2137/
>>
>> 2014-09-12 11:11 GMT+02:00 Egor Pahomov :
>>
>>> Here at Yandex, while implementing gradient boosting in Spark and
>>> creating our ML tool for internal use, we found the following serious
>>> problems in MLlib:
>>>
>>>
>>>    - There is no Regression/Classification model abstraction. We were
>>>    building abstract data processing pipelines, which should work with
>>>    just some regression, with the exact algorithm specified outside this
>>>    code. There is no abstraction that allows me to do that. *(It's the
>>>    main reason for all further problems.)*
>>>    - There is no common practice in MLlib for testing algorithms: every
>>>    model generates its own random test data. There are no easily
>>>    extractable test cases applicable to another algorithm. There are no
>>>    benchmarks for comparing algorithms. After implementing a new
>>>    algorithm, it's very hard to understand how it should be tested.
>>>    - Lack of serialization testing: MLlib algorithms don't contain tests
>>>    which check that a model works after serialization.
>>>    - While implementing a new algorithm, it's hard to understand what
>>>    API you should create and which interface to implement.
>>>
>>> Solving all these problems must start with creating common interfaces
>>> for typical algorithms/models - regression, classification, clustering,
>>> collaborative filtering.
>>>
>>> All main tests should be written against these interfaces, so when a new
>>> algorithm is implemented, all it has to do is pass the already-written
>>> tests. That would let us keep manageable quality across the whole
>>> library.
>>>
>>> There should be a couple of benchmarks which allow a new Spark user to
>>> get a feeling for which algorithm to use.
>>>
>>> The test set against these abstractions should contain a serialization
>>> test. In production there is usually no use for a model that can't be
>>> stored.
>>>
>>> As the first step of this roadmap, I'd like to create a trait
>>> RegressionModel, *ADD* methods to the current algorithms to implement
>>> this trait, and create some tests against it. I'm planning to do it next
>>> week.
>>>
>>> The purpose of this letter is to collect any objections to this approach
>>> at an early stage: please give any feedback. The second reason is to put
>>> a lock on this activity so we don't do the same thing twice: I'll create
>>> a pull request by the end of next week, and we can start any parallel
>>> development from there.
>>>
>>>
>>>
>>> --
>>>
>>>
>>>
>>> *Sincerely yours*
>>> *Egor Pakhomov*
>>> *Scala Developer, Yandex*
>>>
>>
>>
>
>
> --
>
>
>
> *Sincerely yours*
> *Egor Pakhomov*
> *Scala Developer, Yandex*
>



-- 



*Sincerely yours*
*Egor Pakhomov*
*Scala Developer, Yandex*


Re: Adding abstraction in MLlib

2014-09-12 Thread Egor Pahomov
Sorry, I miswrote - I meant the learners part of the framework - models
already exist.

2014-09-12 15:53 GMT+04:00 Christoph Sawade :

> I totally agree, and we also discovered some drawbacks with the
> implementation of the classification models that are based on GLMs:
>
> - There is no distinction between predicting scores, classes, and
> calibrated scores (probabilities). For these models it is common to have
> access to all of them and the prediction function ``predict`` should be
> consistent and stateless. Currently, the score is only available after
> removing the threshold from the model.
> - There is no distinction between multinomial and binomial classification.
> For multinomial problems, it is necessary to handle multiple weight vectors
> and multiple confidences.
> - Models are not serialisable, which makes it hard to use them in practice.
>
> I started a pull request [1] some time ago. I would be happy to continue
> the discussion and clarify the interfaces, too!
>
> Cheers, Christoph
>
> [1] https://github.com/apache/spark/pull/2137/
>
> 2014-09-12 11:11 GMT+02:00 Egor Pahomov :
>
>> Here at Yandex, while implementing gradient boosting in Spark and
>> creating our ML tool for internal use, we found the following serious
>> problems in MLlib:
>>
>>
>>    - There is no Regression/Classification model abstraction. We were
>>    building abstract data processing pipelines, which should work with
>>    just some regression, with the exact algorithm specified outside this
>>    code. There is no abstraction that allows me to do that. *(It's the
>>    main reason for all further problems.)*
>>    - There is no common practice in MLlib for testing algorithms: every
>>    model generates its own random test data. There are no easily
>>    extractable test cases applicable to another algorithm. There are no
>>    benchmarks for comparing algorithms. After implementing a new
>>    algorithm, it's very hard to understand how it should be tested.
>>    - Lack of serialization testing: MLlib algorithms don't contain tests
>>    which check that a model works after serialization.
>>    - While implementing a new algorithm, it's hard to understand what
>>    API you should create and which interface to implement.
>>
>> Solving all these problems must start with creating common interfaces
>> for typical algorithms/models - regression, classification, clustering,
>> collaborative filtering.
>>
>> All main tests should be written against these interfaces, so when a new
>> algorithm is implemented, all it has to do is pass the already-written
>> tests. That would let us keep manageable quality across the whole library.
>>
>> There should be a couple of benchmarks which allow a new Spark user to
>> get a feeling for which algorithm to use.
>>
>> The test set against these abstractions should contain a serialization
>> test. In production there is usually no use for a model that can't be
>> stored.
>>
>> As the first step of this roadmap, I'd like to create a trait
>> RegressionModel, *ADD* methods to the current algorithms to implement
>> this trait, and create some tests against it. I'm planning to do it next
>> week.
>>
>> The purpose of this letter is to collect any objections to this approach
>> at an early stage: please give any feedback. The second reason is to put
>> a lock on this activity so we don't do the same thing twice: I'll create
>> a pull request by the end of next week, and we can start any parallel
>> development from there.
>>
>>
>>
>> --
>>
>>
>>
>> *Sincerely yours*
>> *Egor Pakhomov*
>> *Scala Developer, Yandex*
>>
>
>


-- 



*Sincerely yours*
*Egor Pakhomov*
*Scala Developer, Yandex*


Re: Adding abstraction in MLlib

2014-09-12 Thread Christoph Sawade
I totally agree, and we also discovered some drawbacks with the
implementation of the classification models that are based on GLMs:

- There is no distinction between predicting scores, classes, and
calibrated scores (probabilities). For these models it is common to have
access to all of them and the prediction function ``predict`` should be
consistent and stateless. Currently, the score is only available after
removing the threshold from the model.
- There is no distinction between multinomial and binomial classification.
For multinomial problems, it is necessary to handle multiple weight vectors
and multiple confidences.
- Models are not serialisable, which makes it hard to use them in practice.

I started a pull request [1] some time ago. I would be happy to continue
the discussion and clarify the interfaces, too!

Cheers, Christoph

[1] https://github.com/apache/spark/pull/2137/
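
To make the first and third points concrete, here is a sketch of an
interface that keeps raw scores, calibrated probabilities, and hard class
predictions separate and stateless. The names are hypothetical and this is
not the interface proposed in the PR:

    // A stateless classification model: no mutable threshold stored on the
    // model; scores, probabilities, and classes are separate methods.
    trait ClassificationModel extends Serializable {
      def numClasses: Int

      // Raw margin per class; in the multinomial case, one score per
      // weight vector.
      def scores(features: Array[Double]): Array[Double]

      // Calibrated per-class probabilities; should sum to 1.
      def probabilities(features: Array[Double]): Array[Double]

      // Hard prediction derived from the scores, with no hidden state.
      def predict(features: Array[Double]): Int =
        scores(features).zipWithIndex.maxBy(_._1)._2
    }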

2014-09-12 11:11 GMT+02:00 Egor Pahomov :

> Here at Yandex, while implementing gradient boosting in Spark and
> creating our ML tool for internal use, we found the following serious
> problems in MLlib:
>
>
>    - There is no Regression/Classification model abstraction. We were
>    building abstract data processing pipelines, which should work with
>    just some regression, with the exact algorithm specified outside this
>    code. There is no abstraction that allows me to do that. *(It's the
>    main reason for all further problems.)*
>    - There is no common practice in MLlib for testing algorithms: every
>    model generates its own random test data. There are no easily
>    extractable test cases applicable to another algorithm. There are no
>    benchmarks for comparing algorithms. After implementing a new
>    algorithm, it's very hard to understand how it should be tested.
>    - Lack of serialization testing: MLlib algorithms don't contain tests
>    which check that a model works after serialization.
>    - While implementing a new algorithm, it's hard to understand what API
>    you should create and which interface to implement.
>
> Solving all these problems must start with creating common interfaces for
> typical algorithms/models - regression, classification, clustering,
> collaborative filtering.
>
> All main tests should be written against these interfaces, so when a new
> algorithm is implemented, all it has to do is pass the already-written
> tests. That would let us keep manageable quality across the whole library.
>
> There should be a couple of benchmarks which allow a new Spark user to get
> a feeling for which algorithm to use.
>
> The test set against these abstractions should contain a serialization
> test. In production there is usually no use for a model that can't be
> stored.
>
> As the first step of this roadmap, I'd like to create a trait
> RegressionModel, *ADD* methods to the current algorithms to implement this
> trait, and create some tests against it. I'm planning to do it next week.
>
> The purpose of this letter is to collect any objections to this approach
> at an early stage: please give any feedback. The second reason is to put a
> lock on this activity so we don't do the same thing twice: I'll create a
> pull request by the end of next week, and we can start any parallel
> development from there.
>
>
>
> --
>
>
>
> *Sincerely yours*
> *Egor Pakhomov*
> *Scala Developer, Yandex*
>
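
As a concrete illustration of the proposed first step - a common
RegressionModel trait plus tests written once against the interface,
including the serialization round trip asked for above - a minimal sketch
might look like this (all names are hypothetical, not the actual MLlib
API):

    import java.io._

    // Any regression model only needs to map a feature vector to a
    // real-valued prediction; Serializable so that models can be stored.
    trait RegressionModel extends Serializable {
      def predict(features: Array[Double]): Double
    }

    // Interface-level checks that any implementation could reuse.
    object RegressionModelChecks {
      // Pushes the model through a Java serialization round trip.
      def roundTrip(model: RegressionModel): RegressionModel = {
        val buffer = new ByteArrayOutputStream()
        val out = new ObjectOutputStream(buffer)
        out.writeObject(model)
        out.close()
        val in = new ObjectInputStream(
          new ByteArrayInputStream(buffer.toByteArray))
        in.readObject().asInstanceOf[RegressionModel]
      }

      // Predictions must be finite and unchanged after the round trip.
      def check(model: RegressionModel, data: Seq[Array[Double]]): Unit = {
        val restored = roundTrip(model)
        for (x <- data) {
          require(!model.predict(x).isNaN, "prediction is NaN")
          require(model.predict(x) == restored.predict(x),
            "prediction changed after serialization round trip")
        }
      }
    }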