Re: Fwd: machine learning API, common models

2016-05-30 Thread Simone Robutti
APIs.
> > > >
> > > > - Henry
> > > >

Re: Fwd: machine learning API, common models

2016-05-27 Thread Kam Kasravi

Re: Fwd: machine learning API, common models

2016-05-19 Thread Jianfeng Qian
Hi,
I am quite interested in this proposal.
It is great that the proposal considers so many machine learning projects.
Currently, most algorithms in Spark MLlib are batch-oriented, while
Oryx 2 and streamDM focus on real-time machine learning.
Flink is also working with the SAMOA team to integrate stream mining algorithms.
So I wonder whether it is possible to design a flexible SDK that allows users
to call different third-party packages or their own algorithms?
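
To make the idea of a flexible SDK a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: the names Model, register_backend, create_model and MeanPredictor are hypothetical and do not come from Beam, Flink, Spark or any other project mentioned in this thread. It only shows one way a registry could let users plug in third-party packages or their own algorithms behind one small contract.

```python
from typing import Callable, Dict, List, Protocol, Sequence


class Model(Protocol):
    """Hypothetical minimal contract every plugged-in algorithm satisfies."""

    def fit(self, rows: Sequence[Sequence[float]],
            labels: Sequence[float]) -> "Model": ...

    def predict(self, rows: Sequence[Sequence[float]]) -> List[float]: ...


# Registry mapping a backend name to a factory that builds a model.
_BACKENDS: Dict[str, Callable[[], Model]] = {}


def register_backend(name: str, factory: Callable[[], Model]) -> None:
    """Make a third-party or user-defined algorithm available by name."""
    _BACKENDS[name] = factory


def create_model(name: str) -> Model:
    """Look up a registered backend and build a fresh model from it."""
    if name not in _BACKENDS:
        raise KeyError(f"unknown backend: {name!r}")
    return _BACKENDS[name]()


class MeanPredictor:
    """Stand-in for a user's own algorithm: always predicts the label mean."""

    def __init__(self) -> None:
        self.mean = 0.0

    def fit(self, rows, labels):
        self.mean = sum(labels) / len(labels)
        return self

    def predict(self, rows):
        return [self.mean for _ in rows]


register_backend("my-mean", MeanPredictor)
model = create_model("my-mean").fit([[1.0], [2.0]], [2.0, 4.0])
print(model.predict([[3.0]]))  # [3.0]
```

A wrapper around an existing library would register its own factory in the same way, so calling code only ever depends on the small Model contract.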

Best,
Jianfeng


Fwd: machine learning API, common models

2016-05-17 Thread Suneel Marthi
Thanks Simone for pointing this out.

On the Apache Mahout project we have distributed linear algebra with R-like
semantics that can be executed on Spark/Flink/H2O.
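
As a rough illustration of what operator-style ("R-like") semantics mean, here is a toy, in-memory sketch in Python. It is not Samsara's actual Scala DSL and it is not distributed; the Matrix class and its t property are hypothetical names chosen only to show how a transpose and a matrix product can read like the math.

```python
from typing import List


class Matrix:
    """Toy dense matrix with operator-style syntax (illustration only)."""

    def __init__(self, rows: List[List[float]]) -> None:
        self.rows = [list(r) for r in rows]

    @property
    def t(self) -> "Matrix":
        # Transpose: columns of the original become rows of the result.
        return Matrix([list(col) for col in zip(*self.rows)])

    def __matmul__(self, other: "Matrix") -> "Matrix":
        # Matrix product: dot product of each row with each column.
        cols = list(zip(*other.rows))
        return Matrix([[sum(a * b for a, b in zip(row, col)) for col in cols]
                       for row in self.rows])


a = Matrix([[1.0, 2.0], [3.0, 4.0]])
gram = a.t @ a  # reads like the expression A'A
print(gram.rows)  # [[10.0, 14.0], [14.0, 20.0]]
```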

@Kam: the document you point to is old and outdated; the most up-to-date
reference for the Samsara API is the book "Apache Mahout: Beyond
MapReduce" (shameless marketing here on behalf of fellow committers :) ).

We added Flink DataSet API support in the recent Mahout 0.12.0 release
(April 11, 2016), and it was called out in my talk at ApacheBigData in
Vancouver last week.

The Mahout community would definitely be interested in being involved with
this and sharing notes.

IMHO, the focus should first be on building a good linalg foundation
before embarking on building algos and pipelines. Adding @dlyubimov to this.



-- Forwarded message --
From: Simone Robutti <simone.robu...@radicalbit.io>
Date: Tue, May 17, 2016 at 9:48 AM
Subject: Fwd: machine learning API, common models
To: Suneel Marthi <smar...@apache.org>



-- Forwarded message --
From: Kavulya, Soila P <soila.p.kavu...@intel.com>
Date: 2016-05-17 1:53 GMT+02:00
Subject: RE: machine learning API, common models
To: "dev@beam.incubator.apache.org" <dev@beam.incubator.apache.org>


Thanks Simone,

You have raised a valid concern about how different frameworks will have
different implementations and parameter semantics for the same algorithm. I
agree that it is important to keep this in mind. Hopefully, through this
exercise, we will identify a good set of common ML abstractions across
different frameworks.
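
One possible shape for such a common abstraction, sketched in Python under stated assumptions: the two backends and all names here (CommonParams, to_backend_a, to_backend_b, and the keys they emit) are hypothetical and do not describe any real framework. The sketch only illustrates the concern above, that the same number can mean different things in different implementations, by keeping one neutral definition and translating it explicitly per backend.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CommonParams:
    """Framework-neutral parameters (hypothetical common abstraction)."""
    reg_strength: float  # larger value means stronger regularization
    max_iter: int


def to_backend_a(params: CommonParams) -> Dict[str, float]:
    # Hypothetical backend A takes the regularization strength directly.
    return {"strength": params.reg_strength, "iterations": params.max_iter}


def to_backend_b(params: CommonParams) -> Dict[str, float]:
    # Hypothetical backend B takes the inverse of the strength, so passing
    # the raw value through unchanged would silently change its meaning.
    if params.reg_strength <= 0:
        raise ValueError("backend B needs a strictly positive strength")
    return {"inverse_strength": 1.0 / params.reg_strength,
            "iterations": params.max_iter}


params = CommonParams(reg_strength=0.5, max_iter=100)
print(to_backend_a(params))  # {'strength': 0.5, 'iterations': 100}
print(to_backend_b(params))  # {'inverse_strength': 2.0, 'iterations': 100}
```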

Feel free to edit the document. We had limited the first pass of the
comparison matrix to the machine learning pipeline APIs, but we can extend
it to include other ML building blocks like linear algebra operations, and
APIs for optimizers like gradient descent.
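
For reference, an optimizer building block of the kind mentioned above could look roughly like the following Python sketch. It is plain, single-machine gradient descent with a fixed step size; the function name gradient_descent, its parameters, and the example objective are assumptions made for illustration, not an API from the document or from any of the frameworks compared in it.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def gradient_descent(grad: Callable[[Sequence[float]], Vector],
                     start: Sequence[float],
                     step_size: float = 0.1,
                     iterations: int = 100) -> Vector:
    """Repeatedly step against the gradient, starting from `start`."""
    w = list(start)
    for _ in range(iterations):
        g = grad(w)
        w = [wi - step_size * gi for wi, gi in zip(w, g)]
    return w


def quadratic_grad(w: Sequence[float]) -> Vector:
    # Gradient of f(w) = (w0 - 3)^2 + (w1 + 1)^2, minimized at (3, -1).
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


solution = gradient_descent(quadratic_grad, [0.0, 0.0])
print([round(x, 6) for x in solution])
```

Keeping the optimizer separate from the objective's gradient is what would let the same building block be reused by different algorithms.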

Soila

-Original Message-
From: Kam Kasravi [mailto:kamkasr...@gmail.com]
Sent: Monday, May 16, 2016 8:22 AM
To: dev@beam.incubator.apache.org
Subject: Re: machine learning API, common models

Thanks Simone - yes I had read your concerns on dev and I think they're
well founded.
Thanks for the Samsara reference - I've been looking at the spark/scala
bindings: http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf

I think we should expand the document to include linear algebraic ops, or
at least pay due diligence to them. If you're doing anything on the Flink side
in this regard, let us know or feel free to suggest edits/updates to the document.

Thanks
Kam

On Mon, May 16, 2016 at 6:05 AM, Simone Robutti <
simone.robu...@radicalbit.io> wrote:

> Hello,
>
> I'm Simone and I just began contributing to Flink ML (specifically to the
> distributed linalg part). I have already expressed my concerns about the
> idea of a high-level API relying on specific frameworks' implementations:
> different implementations produce different results and may vary in
> quality. Also, the semantics of parameters may change from one
> implementation to another. This could hinder portability and
> transparency. I believe these problems could be handled by paying due
> attention to the details of every single implementation, but I invite
> you not to underestimate them.
>
> On the other hand the API in itself looks good to me. From my side, I
> hope to fill some of the gaps in Flink you underlined in the comparison
> matrix.
>
> Talking about matrices, proper matrices this time, I believe it would
> be useful to include in this API support for linear algebra operations.
> Something similar is already present in Mahout's Samsara and it looks
> really good but clearly a similar implementation on Beam would be way
> more interesting and powerful.
>
> My 2 cents,
>
> Simone
>
>
> 2016-05-14 4:53 GMT+02:00 Kavulya, Soila P <soila.p.kavu...@intel.com>:
>
> > Hi Tyler,
> >
> > Thank you so much for your feedback. I agree that starting with the
> > high-level API is a good direction. We are interested in Python
> > because it is the language that our data scientists are most familiar
> > with. I think starting with Java would be the best approach, because
> > the Python API can be a thin wrapper for the Java API.
> >
> > In Spark, the Scala, Java and Python APIs are identical. Flink does
> > not have a Python API for ML pipelines at present.
> >
> > Could you point me to the updated runner API?
> >
> > Soila
> >
> > -Original Message-
> > From: Tyler Akidau [mailto:taki...@google.com.INVALID]
> > Sent: Friday, May 13, 2016 6:34 PM
> > To: dev@beam.incubator.apache.org
> > Subject: Re: machine learning API, common models
> >
> > Hi Kam & Soila,
> >
> > Thanks a lot for writing this up. I ran the doc past some of the
>