Hi, I am quite interested in this proposal. It is great that it considers so many machine learning projects. Currently, most Spark MLlib algorithms are batch-oriented, while oryx2 and streamDM focus on real-time machine learning. Flink is also working with the SAMOA team to integrate stream mining algorithms. So I wonder: is it possible to design a flexible SDK that allows users to call different third-party packages or their own algorithms?
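To make the question concrete, here is a minimal sketch of what such a pluggable SDK could look like. Everything in it is hypothetical: the `Estimator` and `Model` interfaces, the `register`/`create` registry, and the `user.mean` name are illustrative assumptions and do not come from Beam, Spark MLlib, Flink ML, or any other project mentioned in this thread. A wrapper around a third-party package would implement the same interface as the user-defined algorithm shown here.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Sequence


class Model(ABC):
    """Hypothetical trained model: maps one feature vector to a prediction."""

    @abstractmethod
    def predict(self, features: Sequence[float]) -> float:
        ...


class Estimator(ABC):
    """Hypothetical algorithm interface that a backend or user code implements."""

    @abstractmethod
    def fit(self, rows: Sequence[Sequence[float]], labels: Sequence[float]) -> Model:
        ...


# Hypothetical registry: algorithm name -> zero-argument factory.
_REGISTRY: Dict[str, Callable[[], Estimator]] = {}


def register(name: str, factory: Callable[[], Estimator]) -> None:
    """Make an algorithm available under a name, whatever package provides it."""
    _REGISTRY[name] = factory


def create(name: str) -> Estimator:
    """Instantiate a registered algorithm, failing loudly for unknown names."""
    if name not in _REGISTRY:
        raise KeyError(f"no estimator registered under {name!r}")
    return _REGISTRY[name]()


class MeanModel(Model):
    """Toy user-defined model that always predicts the mean training label."""

    def __init__(self, mean: float) -> None:
        self.mean = mean

    def predict(self, features: Sequence[float]) -> float:
        return self.mean


class MeanEstimator(Estimator):
    """Toy user-defined algorithm; it ignores the features entirely."""

    def fit(self, rows: Sequence[Sequence[float]], labels: Sequence[float]) -> Model:
        return MeanModel(sum(labels) / len(labels))


register("user.mean", MeanEstimator)
model = create("user.mean").fit([[1.0], [2.0], [3.0]], [2.0, 4.0, 6.0])
print(model.predict([10.0]))  # 4.0
```

The calling code only depends on the name passed to `create`, so swapping a batch implementation for a streaming one would not change the pipeline itself. It would not, by itself, address the concern raised below that the same parameters can mean different things in different implementations.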
Best,
Jianfeng

On 2016-05-17 22:01, Suneel Marthi wrote:
> Thanks Simone for pointing this out.
>
> On the Apache Mahout project we have distributed linear algebra with R-like
> semantics that can be executed on Spark/Flink/H2O.
>
> @Kam: the document you point out is old and outdated, the most up-to-date
> reference to the Samsara API is the book 'Apache Mahout: Beyond
> MapReduce'. (shameless marketing here on behalf of fellow committers :) )
>
> We added the Flink DataSet API in the recent Mahout 0.12.0 release (April 11,
> 2016) and it was called out in my talk at ApacheBigData in Vancouver last
> week.
>
> The Mahout community would definitely be interested in being involved with
> this and sharing notes.
>
> IMHO, the focus should be first on building a good linalg foundation
> before embarking on building algos and pipelines. Adding @dlyubimov to this.
>
>
> ---------- Forwarded message ----------
> From: Simone Robutti <simone.robu...@radicalbit.io>
> Date: Tue, May 17, 2016 at 9:48 AM
> Subject: Fwd: machine learning API, common models
> To: Suneel Marthi <smar...@apache.org>
>
>
> ---------- Forwarded message ----------
> From: Kavulya, Soila P <soila.p.kavu...@intel.com>
> Date: 2016-05-17 1:53 GMT+02:00
> Subject: RE: machine learning API, common models
> To: "dev@beam.incubator.apache.org" <dev@beam.incubator.apache.org>
>
>
> Thanks Simone,
>
> You have raised a valid concern about how different frameworks will have
> different implementations and parameter semantics for the same algorithm. I
> agree that it is important to keep this in mind. Hopefully, through this
> exercise, we will identify a good set of common ML abstractions across
> different frameworks.
>
> Feel free to edit the document. We had limited the first pass of the
> comparison matrix to the machine learning pipeline APIs, but we can extend
> it to include other ML building blocks like linear algebra operations, and
> APIs for optimizers like gradient descent.
>
> Soila
>
> -----Original Message-----
> From: Kam Kasravi [mailto:kamkasr...@gmail.com]
> Sent: Monday, May 16, 2016 8:22 AM
> To: dev@beam.incubator.apache.org
> Subject: Re: machine learning API, common models
>
> Thanks Simone - yes I had read your concerns on dev and I think they're
> well founded.
> Thanks for the Samsara reference - I've been looking at the spark/scala
> bindings:
> http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf
>
> I think we should expand the document to include linear algebraic ops or
> at least pay due diligence to it. If you're doing anything on the flink side
> in this regard let us know or feel free to suggest edits/updates to the
> document.
>
> Thanks
> Kam
>
> On Mon, May 16, 2016 at 6:05 AM, Simone Robutti
> <simone.robu...@radicalbit.io> wrote:
>
>> Hello,
>>
>> I'm Simone and I just began contributing to Flink ML (actually on the
>> distributed linalg part). I already expressed my concerns about the
>> idea of a high-level API relying on specific frameworks' implementations:
>> different implementations produce different results and may vary in
>> quality. Also the semantics of parameters may change from one
>> implementation to the other. This could hinder portability and
>> transparency. I believe these problems could be handled by paying due
>> attention to the details of every single implementation but I invite
>> you not to underestimate these problems.
>>
>> On the other hand the API in itself looks good to me. From my side, I
>> hope to fill some of the gaps in Flink you underlined in the comparison
>> matrix.
>> Talking about matrices, proper matrices this time, I believe it would
>> be useful to include in this API support for linear algebra operations.
>> Something similar is already present in Mahout's Samsara and it looks
>> really good but clearly a similar implementation on Beam would be way
>> more interesting and powerful.
>>
>> My 2 cents,
>>
>> Simone
>>
>>
>> 2016-05-14 4:53 GMT+02:00 Kavulya, Soila P <soila.p.kavu...@intel.com>:
>>
>>> Hi Tyler,
>>>
>>> Thank you so much for your feedback. I agree that starting with the
>>> high-level API is a good direction. We are interested in Python
>>> because it is the language that our data scientists are most familiar
>>> with. I think starting with Java would be the best approach, because the
>>> Python API can be a thin wrapper for the Java API.
>>>
>>> In Spark, the Scala, Java and Python APIs are identical. Flink does
>>> not have a Python API for ML pipelines at present.
>>>
>>> Could you point me to the updated runner API?
>>>
>>> Soila
>>>
>>> -----Original Message-----
>>> From: Tyler Akidau [mailto:taki...@google.com.INVALID]
>>> Sent: Friday, May 13, 2016 6:34 PM
>>> To: dev@beam.incubator.apache.org
>>> Subject: Re: machine learning API, common models
>>>
>>> Hi Kam & Soila,
>>>
>>> Thanks a lot for writing this up. I ran the doc past some of the
>>> folks who've been doing ML work here at Google, and they were
>>> generally happy with the distillation of common methods in the doc.
>>> I'd be curious to hear what folks on the Flink- and Spark- runner sides
>>> think.
>>>
>>> To me, this seems like a good direction for a high-level API.
>>> Presumably, once a high-level API is in place, we could begin
>>> looking at what it would take to add lower-level ML algorithm support
>>> (e.g. iterative) to the Beam Model. Is this essentially what you're
>>> thinking?
>>>
>>> Some more specific questions/comments:
>>>
>>> - Presumably you'd want to tackle this in Java first, since that's the
>>>   only language we currently support? Given that half of your
>>>   examples are in Python, I'm also assuming Python will be interesting
>>>   once it's available.
>>>
>>> - Along those lines, what languages are represented in the capability
>>>   matrix? E.g. is Spark ML support as detailed there identical across
>>>   Java/Scala and Python?
>>>
>>> - Have you thought about how this would tie in at the runner level,
>>>   particularly given the updated Runner API changes that are coming? I'm
>>>   assuming they'd be provided as composite transforms that (for now)
>>>   would have no default implementation, given the lack of low-level
>>>   primitives for ML algorithms, but am curious what your thoughts are
>>>   there.
>>>
>>> - I still don't fully understand how incremental updates due to model
>>>   drift would tie in at the API level. There's a comment thread in the
>>>   doc still open tracking this, so no need to comment here additionally.
>>>   Just pointing it out as one of the things that stands out as
>>>   potentially having API-level impacts to me that doesn't seem 100%
>>>   fleshed out in the doc yet (though that admittedly may just be my
>>>   limited understanding at this point :-).
>>>
>>> -Tyler
>>>
>>>
>>> On Fri, May 13, 2016 at 10:48 AM Kam Kasravi <kamkasr...@gmail.com> wrote:
>>>> Hi Tyler - my bad. Comments should be enabled now.
>>>>
>>>> On Fri, May 13, 2016 at 10:45 AM, Tyler Akidau
>>>> <taki...@google.com.invalid> wrote:
>>>>
>>>>> Thanks a lot, Kam. Can you please enable comment access on the doc?
>>>>> I seem to have view access only.
>>>>>
>>>>> -Tyler
>>>>>
>>>>> On Fri, May 13, 2016 at 9:54 AM Kam Kasravi <kamkasr...@gmail.com> wrote:
>>>>>> Hi
>>>>>>
>>>>>> A number of readers have made comments on this topic recently.
>>>>>> We have created a document that does some analysis of common
>>>>>> ML models and related APIs. We hope this can drive an approach that
>>>>>> will result in an API, compatibility matrix and involvement from the
>>>>>> same groups that are implementing transformation runners (spark,
>>>>>> flink, etc).
>>>>>> We welcome comments here or in the document itself.
>>>>>>
>>>>>> https://docs.google.com/document/d/17cRZk_yqHm3C0fljivjN66MbLkeKS1yjo4PBECHb-xA/edit?usp=sharing