We could have a module with a library of PTransforms (similar to the join library in extensions) -- so it wouldn't be part of the core / required SDK.
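A rough sketch of what I mean, in plain Python so it runs without the Beam SDK. Everything here is made up for illustration: `Transform`, `MapValues`, `MinMaxScale` and the `ml_extensions` module name are stand-ins, not Beam APIs. The point is only that an ML-style composite can be assembled from core-style primitives and shipped outside the core SDK.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterable, List


class Transform(ABC):
    """Hypothetical stand-in for a PTransform: collection in, collection out."""

    @abstractmethod
    def expand(self, records: Iterable[float]) -> List[float]:
        ...


class MapValues(Transform):
    """Core-style primitive: apply a function to every element."""

    def __init__(self, fn: Callable[[float], float]) -> None:
        self.fn = fn

    def expand(self, records: Iterable[float]) -> List[float]:
        return [self.fn(r) for r in records]


# --- everything below would live in a separate, optional module,
# --- e.g. "ml_extensions" (hypothetical name), not in the core SDK ---

class MinMaxScale(Transform):
    """Composite built only from the core-style primitive above."""

    def expand(self, records: Iterable[float]) -> List[float]:
        values = list(records)
        lo, hi = min(values), max(values)
        # Avoid division by zero when all values are equal.
        span = (hi - lo) or 1.0
        return MapValues(lambda v: (v - lo) / span).expand(values)


if __name__ == "__main__":
    print(MinMaxScale().expand([2.0, 4.0, 6.0]))
```

Users who never touch ML would depend only on the core part; the composite is just another transform they can opt into.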
On Fri, May 20, 2016 at 3:35 PM, Henry Saputra <henry.sapu...@gmail.com> wrote:
> I am a bit concerned about adding ML model APIs to Beam because of the
> fluctuating nature of the ML landscape, and also because in reality most
> data scientists tend to use Python and R for most of their work with
> existing model definitions.
>
> Even though you could say something like Spark ML is popular, that is
> merely because it involves Apache Spark rather than because of the
> quality of the ML module itself.
>
> The pipeline and most of the tooling are inspired by scikit-learn, and
> hence it relies on familiarity with that library to attract developers.
>
> My question is whether fully end-to-end ML APIs are needed as part of
> the core Beam APIs.
>
> - Henry
>
> On Thu, May 19, 2016 at 5:46 AM, Jianfeng Qian <qianjianf...@outlook.com>
> wrote:
>
> > Hi,
> > I am quite interested in this proposal.
> > It is great to consider a lot of machine learning projects.
> > Currently, most algorithms in Spark MLlib are batch processing, while
> > oryx2 and streamDM focus on real-time machine learning.
> > And Flink works with the SAMOA team to integrate stream mining
> > algorithms, too.
> > So I wonder whether it is possible to design a flexible SDK which
> > allows users to call different third-party packages or their own
> > algorithms?
> >
> > Best,
> > Jianfeng
> >
> > On May 17, 2016 at 22:01, Suneel Marthi wrote:
> > > Thanks Simone for pointing this out.
> > >
> > > On the Apache Mahout project we have distributed linear algebra with
> > > R-like semantics that can be executed on Spark/Flink/H2O.
> > >
> > > @Kam: the document you point out is old and outdated; the most
> > > up-to-date reference to the Samsara API is the book 'Apache Mahout:
> > > Beyond MapReduce'. (Shameless marketing here on behalf of fellow
> > > committers :) )
> > >
> > > We added the Flink DataSet API in the recent Mahout 0.12.0 release
> > > (April 11, 2016), and it was called out in my talk at ApacheBigData
> > > in Vancouver last week.
> > >
> > > The Mahout community would definitely be interested in being involved
> > > with this and sharing notes.
> > >
> > > IMHO, the focus should be first on building a good linalg foundation
> > > before embarking on building algos and pipelines. Adding @dlyubimov
> > > to this.
> > >
> > >
> > > ---------- Forwarded message ----------
> > > From: Simone Robutti <simone.robu...@radicalbit.io>
> > > Date: Tue, May 17, 2016 at 9:48 AM
> > > Subject: Fwd: machine learning API, common models
> > > To: Suneel Marthi <smar...@apache.org>
> > >
> > >
> > > ---------- Forwarded message ----------
> > > From: Kavulya, Soila P <soila.p.kavu...@intel.com>
> > > Date: 2016-05-17 1:53 GMT+02:00
> > > Subject: RE: machine learning API, common models
> > > To: "dev@beam.incubator.apache.org" <dev@beam.incubator.apache.org>
> > >
> > >
> > > Thanks Simone,
> > >
> > > You have raised a valid concern about how different frameworks will
> > > have different implementations and parameter semantics for the same
> > > algorithm. I agree that it is important to keep this in mind.
> > > Hopefully, through this exercise, we will identify a good set of
> > > common ML abstractions across different frameworks.
> > >
> > > Feel free to edit the document. We had limited the first pass of the
> > > comparison matrix to the machine learning pipeline APIs, but we can
> > > extend it to include other ML building blocks like linear algebra
> > > operations, and APIs for optimizers like gradient descent.
> > >
> > > Soila
> > >
> > > -----Original Message-----
> > > From: Kam Kasravi [mailto:kamkasr...@gmail.com]
> > > Sent: Monday, May 16, 2016 8:22 AM
> > > To: dev@beam.incubator.apache.org
> > > Subject: Re: machine learning API, common models
> > >
> > > Thanks Simone - yes I had read your concerns on dev and I think
> > > they're well founded.
> > > Thanks for the Samsara reference - I've been looking at the
> > > spark/scala bindings
> > > http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf .
> > >
> > > I think we should expand the document to include linear algebraic ops
> > > or at least pay due diligence to it. If you're doing anything on the
> > > flink side in this regard let us know, or feel free to suggest
> > > edits/updates to the document.
> > >
> > > Thanks
> > > Kam
> > >
> > > On Mon, May 16, 2016 at 6:05 AM, Simone Robutti
> > > <simone.robu...@radicalbit.io> wrote:
> > >
> > >> Hello,
> > >>
> > >> I'm Simone and I just began contributing to Flink ML (actually on
> > >> the distributed linalg part). I already expressed my concerns about
> > >> the idea of a high-level API relying on specific frameworks'
> > >> implementations: different implementations produce different
> > >> results and may vary in quality. Also the semantics of parameters
> > >> may change from one implementation to the other. This could hinder
> > >> portability and transparency. I believe these problems could be
> > >> handled by paying due attention to the details of every single
> > >> implementation, but I invite you not to underestimate these
> > >> problems.
> > >>
> > >> On the other hand the API in itself looks good to me. From my side,
> > >> I hope to fill some of the gaps in Flink you underlined in the
> > >> comparison matrix.
> > >> Talking about matrices, proper matrices this time, I believe it
> > >> would be useful to include in this API support for linear algebra
> > >> operations. Something similar is already present in Mahout's
> > >> Samsara and it looks really good, but clearly a similar
> > >> implementation on Beam would be way more interesting and powerful.
> > >>
> > >> My 2 cents,
> > >>
> > >> Simone
> > >>
> > >>
> > >> 2016-05-14 4:53 GMT+02:00 Kavulya, Soila P
> > >> <soila.p.kavu...@intel.com>:
> > >>
> > >>> Hi Tyler,
> > >>>
> > >>> Thank you so much for your feedback.
> > >>> I agree that starting with the high-level API is a good
> > >>> direction. We are interested in Python because it is the language
> > >>> that our data scientists are most familiar with. I think starting
> > >>> with Java would be the best approach, because the Python API can
> > >>> be a thin wrapper for the Java API.
> > >>>
> > >>> In Spark, the Scala, Java and Python APIs are identical. Flink
> > >>> does not have a Python API for ML pipelines at present.
> > >>>
> > >>> Could you point me to the updated runner API?
> > >>>
> > >>> Soila
> > >>>
> > >>> -----Original Message-----
> > >>> From: Tyler Akidau [mailto:taki...@google.com.INVALID]
> > >>> Sent: Friday, May 13, 2016 6:34 PM
> > >>> To: dev@beam.incubator.apache.org
> > >>> Subject: Re: machine learning API, common models
> > >>>
> > >>> Hi Kam & Soila,
> > >>>
> > >>> Thanks a lot for writing this up. I ran the doc past some of the
> > >>> folks who've been doing ML work here at Google, and they were
> > >>> generally happy with the distillation of common methods in the
> > >>> doc. I'd be curious to hear what folks on the Flink- and Spark-
> > >>> runner sides think.
> > >>>
> > >>> To me, this seems like a good direction for a high-level API.
> > >>> Presumably, once a high-level API is in place, we could begin
> > >>> looking at what it would take to add lower-level ML algorithm
> > >>> support (e.g. iterative) to the Beam Model. Is this essentially
> > >>> what you're thinking?
> > >>>
> > >>> Some more specific questions/comments:
> > >>>
> > >>>    - Presumably you'd want to tackle this in Java first, since
> > >>>    that's the only language we currently support? Given that half
> > >>>    of your examples are in Python, I'm also assuming Python will
> > >>>    be interesting once it's available.
> > >>>
> > >>>    - Along those lines, what languages are represented in the
> > >>>    capability matrix? E.g.
> > >>>    is Spark ML support as detailed there identical across
> > >>>    Java/Scala and Python?
> > >>>
> > >>>    - Have you thought about how this would tie in at the runner
> > >>>    level, particularly given the updated Runner API changes that
> > >>>    are coming? I'm assuming they'd be provided as composite
> > >>>    transforms that (for now) would have no default implementation,
> > >>>    given the lack of low-level primitives for ML algorithms, but
> > >>>    am curious what your thoughts are there.
> > >>>
> > >>>    - I still don't fully understand how incremental updates due to
> > >>>    model drift would tie in at the API level. There's a comment
> > >>>    thread in the doc still open tracking this, so no need to
> > >>>    comment here additionally. Just pointing it out as one of the
> > >>>    things that stands out as potentially having API-level impacts
> > >>>    to me that doesn't seem 100% fleshed out in the doc yet (though
> > >>>    that admittedly may just be my limited understanding at this
> > >>>    point :-).
> > >>>
> > >>> -Tyler
> > >>>
> > >>>
> > >>> On Fri, May 13, 2016 at 10:48 AM Kam Kasravi
> > >>> <kamkasr...@gmail.com> wrote:
> > >>>> Hi Tyler - my bad. Comments should be enabled now.
> > >>>>
> > >>>> On Fri, May 13, 2016 at 10:45 AM, Tyler Akidau
> > >>>> <taki...@google.com.invalid> wrote:
> > >>>>
> > >>>>> Thanks a lot, Kam. Can you please enable comment access on the
> > >>>>> doc? I seem to have view access only.
> > >>>>>
> > >>>>> -Tyler
> > >>>>>
> > >>>>> On Fri, May 13, 2016 at 9:54 AM Kam Kasravi
> > >>>>> <kamkasr...@gmail.com> wrote:
> > >>>>>> Hi
> > >>>>>>
> > >>>>>> A number of readers have made comments on this topic recently.
> > >>>>>> We have created a document that does some analysis of common
> > >>>>>> ML models and related APIs.
> > >>>>>> We hope this can drive an approach that will result in an
> > >>>>>> API, compatibility matrix and involvement from the same groups
> > >>>>>> that are implementing transformation runners (spark, flink,
> > >>>>>> etc).
> > >>>>>> We welcome comments here or in the document itself.
> > >>>>>>
> > >>>>>> https://docs.google.com/document/d/17cRZk_yqHm3C0fljivjN66MbLkeKS1yjo4PBECHb-xA/edit?usp=sharing