Re: Fwd: machine learning API, common models

Frances Perry Fri, 20 May 2016 15:39:54 -0700

We could have a module with a library of PTransforms (similar to the join
library in extensions) -- so it wouldn't be part of the core / required SDK.
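
A rough illustration of that layering, not Beam code: the sketch below uses
hypothetical stand-ins (Transform, Map, apply, MinMaxScale are all made-up
names, assumed only for this example) to show a core with a few primitives and
an optional module whose composite transforms are built purely on top of them.

```python
# Hypothetical stand-ins for illustration only; these are NOT Beam classes.
from typing import Callable, Iterable


class Transform:
    """Core abstraction (stand-in for a PTransform): collection in, collection out."""

    def expand(self, items: Iterable) -> list:
        raise NotImplementedError


class Map(Transform):
    """Core primitive: apply a function to every element."""

    def __init__(self, fn: Callable):
        self.fn = fn

    def expand(self, items: Iterable) -> list:
        return [self.fn(x) for x in items]


def apply(items: Iterable, *transforms: Transform) -> list:
    """Run the given transforms in order over the input collection."""
    for t in transforms:
        items = t.expand(items)
    return list(items)


# --- everything below would live in an optional extension module ---

class MinMaxScale(Transform):
    """Composite transform: rescales values into [0, 1] using only core pieces."""

    def expand(self, items: Iterable) -> list:
        values = list(items)
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0  # avoid division by zero on constant input
        return Map(lambda x: (x - lo) / span).expand(values)


scaled = apply([2.0, 4.0, 6.0], MinMaxScale())
print(scaled)
```

The point of the split is that nothing in the core section has to know about
MinMaxScale; users who never need ML transforms never depend on that module.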




On Fri, May 20, 2016 at 3:35 PM, Henry Saputra <henry.sapu...@gmail.com>
wrote:

> I am a bit concerned about adding ML model APIs to Beam because of the
> fluctuating nature of the ML landscape, and also because, in reality, most
> data scientists tend to use Python and R for most of their work with
> existing model definitions.
>
> Even though you could say something like Spark ML is popular, that is merely
> because it is part of Apache Spark rather than because of the quality of
> the ML module itself.
>
> Its pipeline and most of its tooling are inspired by scikit-learn, and
> hence it relies on familiarity with that library to attract developers.
>
> My question is whether fully end-to-end ML APIs are needed as part of the
> core Beam APIs.
>
> - Henry
>
> On Thu, May 19, 2016 at 5:46 AM, Jianfeng Qian <qianjianf...@outlook.com>
> wrote:
>
> > Hi,
> > I am quite interested in this proposal.
> > It is great that it considers a lot of machine learning projects.
> > Currently, most algorithms in Spark MLlib are batch-oriented, while
> > Oryx 2 and streamDM focus on real-time machine learning.
> > Flink is also working with the SAMOA team to integrate stream mining
> > algorithms.
> > So I wonder whether it is possible to design a flexible SDK that allows
> > users to call different third-party packages or their own algorithms.
> >
> > Best,
> > Jianfeng
> >
> > On 2016-05-17 22:01, Suneel Marthi wrote:
> > > Thanks Simone for pointing this out.
> > >
> > > On the Apache Mahout project we have distributed linear algebra with
> > > R-like semantics that can be executed on Spark/Flink/H2O.
> > >
> > > @Kam: the document you pointed out is old and outdated; the most
> > > up-to-date reference to the Samsara API is the book "Apache Mahout:
> > > Beyond MapReduce". (Shameless marketing here on behalf of fellow
> > > committers :) )
> > >
> > > We added the Flink DataSet API in the recent Mahout 0.12.0 release
> > > (April 11, 2016), and it was called out in my talk at ApacheBigData in
> > > Vancouver last week.
> > >
> > > The Mahout community would definitely be interested in being involved
> > > with this and sharing notes.
> > >
> > > IMHO, the focus should first be on building a good linalg foundation
> > > before embarking on building algos and pipelines. Adding @dlyubimov to
> > > this.
> > >
> > >
> > >
> > > ---------- Forwarded message ----------
> > > From: Simone Robutti <simone.robu...@radicalbit.io>
> > > Date: Tue, May 17, 2016 at 9:48 AM
> > > Subject: Fwd: machine learning API, common models
> > > To: Suneel Marthi <smar...@apache.org>
> > >
> > >
> > >
> > > ---------- Forwarded message ----------
> > > From: Kavulya, Soila P <soila.p.kavu...@intel.com>
> > > Date: 2016-05-17 1:53 GMT+02:00
> > > Subject: RE: machine learning API, common models
> > > To: "dev@beam.incubator.apache.org" <dev@beam.incubator.apache.org>
> > >
> > >
> > > Thanks Simone,
> > >
> > > You have raised a valid concern about how different frameworks will
> > > have different implementations and parameter semantics for the same
> > > algorithm. I agree that it is important to keep this in mind. Hopefully,
> > > through this exercise, we will identify a good set of common ML
> > > abstractions across different frameworks.
> > >
> > > Feel free to edit the document. We had limited the first pass of the
> > > comparison matrix to the machine learning pipeline APIs, but we can
> > > extend it to include other ML building blocks like linear algebra
> > > operations, and APIs for optimizers like gradient descent.
> > >
> > > Soila
> > >
> > > -----Original Message-----
> > > From: Kam Kasravi [mailto:kamkasr...@gmail.com]
> > > Sent: Monday, May 16, 2016 8:22 AM
> > > To: dev@beam.incubator.apache.org
> > > Subject: Re: machine learning API, common models
> > >
> > > Thanks Simone - yes I had read your concerns on dev and I think they're
> > > well founded.
> > > Thanks for the Samsara reference - I've been looking at the Spark/Scala
> > > bindings:
> > > http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf
> > >
> > > I think we should expand the document to include linear algebraic ops,
> > > or at least pay due diligence to them. If you're doing anything on the
> > > Flink side in this regard, let us know, or feel free to suggest
> > > edits/updates to the document.
> > >
> > > Thanks
> > > Kam
> > >
> > > On Mon, May 16, 2016 at 6:05 AM, Simone Robutti <
> > > simone.robu...@radicalbit.io> wrote:
> > >
> > >> Hello,
> > >>
> > >> I'm Simone and I just began contributing to Flink ML (actually on the
> > >> distributed linalg part). I already expressed my concerns about the
> > >> idea of a high-level API relying on specific frameworks'
> > >> implementations: different implementations produce different results
> > >> and may vary in quality. Also, the semantics of parameters may change
> > >> from one implementation to another. This could hinder portability and
> > >> transparency. I believe these problems could be handled by paying due
> > >> attention to the details of every single implementation, but I invite
> > >> you not to underestimate them.
> > >>
> > >> On the other hand, the API in itself looks good to me. From my side, I
> > >> hope to fill some of the gaps in Flink you highlighted in the
> > >> comparison matrix.
> > >> Talking about matrices, proper matrices this time, I believe it would
> > >> be useful to include support for linear algebra operations in this
> > >> API. Something similar is already present in Mahout's Samsara and it
> > >> looks really good, but clearly a similar implementation on Beam would
> > >> be way more interesting and powerful.
> > >>
> > >> My 2 cents,
> > >>
> > >> Simone
> > >>
> > >>
> > >> 2016-05-14 4:53 GMT+02:00 Kavulya, Soila P <soila.p.kavu...@intel.com>:
> > >>
> > >>> Hi Tyler,
> > >>>
> > >>> Thank you so much for your feedback. I agree that starting with the
> > >>> high-level API is a good direction. We are interested in Python
> > >>> because it is the language that our data scientists are most familiar
> > >>> with. I think starting with Java would be the best approach, because
> > >>> the Python API can be a thin wrapper for the Java API.
> > >>>
> > >>> In Spark, the Scala, Java and Python APIs are identical. Flink does
> > >>> not have a Python API for ML pipelines at present.
> > >>>
> > >>> Could you point me to the updated runner API?
> > >>>
> > >>> Soila
> > >>>
> > >>> -----Original Message-----
> > >>> From: Tyler Akidau [mailto:taki...@google.com.INVALID]
> > >>> Sent: Friday, May 13, 2016 6:34 PM
> > >>> To: dev@beam.incubator.apache.org
> > >>> Subject: Re: machine learning API, common models
> > >>>
> > >>> Hi Kam & Soila,
> > >>>
> > >>> Thanks a lot for writing this up. I ran the doc past some of the
> > >>> folks who've been doing ML work here at Google, and they were
> > >>> generally happy with the distillation of common methods in the doc.
> > >>> I'd be curious to hear what folks on the Flink- and Spark-runner
> > >>> sides think.
> > >>>
> > >>> To me, this seems like a good direction for a high-level API.
> > >>> Presumably, once a high-level API is in place, we could begin
> > >>> looking at what it would take to add lower-level ML algorithm
> > >>> support (e.g. iterative) to the Beam Model. Is this essentially what
> > >>> you're thinking?
> > >>>
> > >>> Some more specific questions/comments:
> > >>>
> > >>>     - Presumably you'd want to tackle this in Java first, since that's
> > >>>     the only language we currently support? Given that half of your
> > >>>     examples are in Python, I'm also assuming Python will be
> > >>>     interesting once it's available.
> > >>>
> > >>>     - Along those lines, what languages are represented in the
> > >>>     capability matrix? E.g. is Spark ML support as detailed there
> > >>>     identical across Java/Scala and Python?
> > >>>
> > >>>     - Have you thought about how this would tie in at the runner
> > >>>     level, particularly given the updated Runner API changes that are
> > >>>     coming? I'm assuming they'd be provided as composite transforms
> > >>>     that (for now) would have no default implementation, given the
> > >>>     lack of low-level primitives for ML algorithms, but am curious
> > >>>     what your thoughts are there.
> > >>>
> > >>>     - I still don't fully understand how incremental updates due to
> > >>>     model drift would tie in at the API level. There's a comment
> > >>>     thread in the doc still open tracking this, so no need to comment
> > >>>     here additionally. Just pointing it out as one of the things that
> > >>>     stands out as potentially having API-level impacts to me that
> > >>>     doesn't seem 100% fleshed out in the doc yet (though that
> > >>>     admittedly may just be my limited understanding at this point
> > >>>     :-).
> > >>>
> > >>> -Tyler
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Fri, May 13, 2016 at 10:48 AM Kam Kasravi <kamkasr...@gmail.com>
> > >>> wrote:
> > >>>> Hi Tyler - my bad. Comments should be enabled now.
> > >>>>
> > >>>> On Fri, May 13, 2016 at 10:45 AM, Tyler Akidau
> > >>>> <taki...@google.com.invalid> wrote:
> > >>>>
> > >>>>> Thanks a lot, Kam. Can you please enable comment access on the doc?
> > >>>>> I seem to have view access only.
> > >>>>>
> > >>>>> -Tyler
> > >>>>>
> > >>>>> On Fri, May 13, 2016 at 9:54 AM Kam Kasravi
> > >>>>> <kamkasr...@gmail.com> wrote:
> > >>>>>> Hi
> > >>>>>>
> > >>>>>> A number of readers have made comments on this topic recently.
> > >>>>>> We have created a document that does some analysis of common
> > >>>>>> ML models and related APIs. We hope this can drive an approach
> > >>>>>> that will result in an API, a compatibility matrix, and
> > >>>>>> involvement from the same groups that are implementing
> > >>>>>> transformation runners (Spark, Flink, etc.).
> > >>>>>> We welcome comments here or in the document itself.
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>> https://docs.google.com/document/d/17cRZk_yqHm3C0fljivjN66MbLkeKS1yjo4PBECHb-xA/edit?usp=sharing
> >
> >
>
