Hi Beam ML community

Based on comments from a number of you and some discussion we've had here,
we thought we would suggest the following direction:

   - Begin with primitive operations common and critical to almost all ML
   algorithms. These primitive operators would include:
      - linear algebra operations - borrowing from established libraries
      like Mahout Samsara.
      - iterative processing - also central to ML, where replay of datasets
      is easy to specify, as are thresholds or halting criteria. This
      coordinates well with FlinkML's current approach and base APIs.
      - possibly new broadcast mechanisms not normally available within BSP
      frameworks such as Beam.
   - Normalize the dataset and parameter definitions that differ across the
   current major ML libraries that offer the same types of models.
   - Favor a native ML implementation rather than a thin wrapper in order
   to provide consistency across runners. This will also allow Beam ML to
   maximize quality and minimize consistency issues across runners.
   - Support the languages also supported in the Beam runners (Java,
   Python, Scala).
   - Implement several common ML algorithms using the low-level primitives
   on one or more available runners to validate both the low-level APIs and
   possible improvements to the high-level API.
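
To make the iterative-processing primitive a bit more concrete, here is a
minimal sketch in plain Python. It is not a Beam API: the names
iterate_until_converged and mean_step, the tolerance and the learning rate
are all assumptions made up for illustration. It only shows the shape we
have in mind: the same dataset is replayed on every pass, and the loop
halts on a threshold or an iteration cap.

```python
from typing import Callable, List, Tuple

# Hypothetical helper (not part of Beam): applies `step` repeatedly until the
# reported change drops below `tolerance` or `max_iterations` is reached.
def iterate_until_converged(
    step: Callable[[float], Tuple[float, float]],
    state: float,
    tolerance: float = 1e-6,
    max_iterations: int = 100,
) -> Tuple[float, int]:
    for iteration in range(1, max_iterations + 1):
        state, change = step(state)
        if change < tolerance:  # halting criterion: threshold on the update size
            return state, iteration
    return state, max_iterations  # halting criterion: iteration cap


# Hypothetical step: one gradient-descent pass that replays the whole dataset
# to minimize the mean squared distance between `w` and the data points.
def mean_step(data: List[float], learning_rate: float = 0.25):
    def step(w: float) -> Tuple[float, float]:
        gradient = sum(2.0 * (w - x) for x in data) / len(data)
        new_w = w - learning_rate * gradient
        return new_w, abs(new_w - w)
    return step


data = [1.0, 2.0, 3.0, 4.0]
estimate, iterations = iterate_until_converged(mean_step(data), state=0.0)
```

With this toy dataset the estimate approaches the mean (2.5). In a real
runner the replay of the dataset and the convergence check would be the
parts the primitive has to provide.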

Scikit-learn pipelines and existing portable libraries like xgboost4j will
be valuable for modeling the high-level APIs - for example, how xgboost4j
currently integrates with Spark and Flink.
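
As a rough illustration of the scikit-learn-style pipeline shape mentioned
above, here is a small sketch in plain Python. It does not use scikit-learn
or Beam; the classes MeanCenter, MaxAbsScale and Pipeline are hypothetical
names invented for this example. It only shows the fit/transform contract
such a high-level API could be modeled on.

```python
from typing import List, Sequence


class MeanCenter:
    """Hypothetical transformer: subtracts the mean learned during fit."""

    def fit(self, values: Sequence[float]) -> "MeanCenter":
        self.mean_ = sum(values) / len(values)
        return self

    def transform(self, values: Sequence[float]) -> List[float]:
        return [v - self.mean_ for v in values]


class MaxAbsScale:
    """Hypothetical transformer: divides by the largest absolute value seen in fit."""

    def fit(self, values: Sequence[float]) -> "MaxAbsScale":
        # Fall back to 1.0 so an all-zero input does not divide by zero.
        self.scale_ = max(abs(v) for v in values) or 1.0
        return self

    def transform(self, values: Sequence[float]) -> List[float]:
        return [v / self.scale_ for v in values]


class Pipeline:
    """Hypothetical pipeline: fits each stage on the output of the previous one."""

    def __init__(self, stages):
        self.stages = list(stages)

    def fit(self, values: Sequence[float]) -> "Pipeline":
        for stage in self.stages:
            values = stage.fit(values).transform(values)
        return self

    def transform(self, values: Sequence[float]) -> List[float]:
        for stage in self.stages:
            values = stage.transform(values)
        return list(values)


pipeline = Pipeline([MeanCenter(), MaxAbsScale()]).fit([1.0, 2.0, 3.0])
scaled = pipeline.transform([1.0, 2.0, 3.0])
```

The point of the sketch is that parameters are learned once in fit and then
reused unchanged in transform, which is where consistent semantics across
runners would matter.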

We welcome further comments and refinements to this approach.

On Sun, May 22, 2016 at 7:43 PM, Henry Saputra <henry.sapu...@gmail.com>
wrote:

> @Frances:
>
> that would be probably the way to go IF we decide to have ML in Beam.
>
> @Simone:
>
> I would definitely love to see Beam introduce ML model APIs to abstract and
> unify all "dataflow" runner frameworks, such as with Flink ML and Spark
> ML.
>
> However, as you mentioned before, the target audience would be
> distributed-systems or ML engineers.
> But I could see that we would then have to provide some out-of-the-box ML
> algorithms (model training and fine-tuning) in addition to testing the
> model and APIs.
>
> The expectation would be that these models are "production" ready, and in
> most cases they will be used by Data Scientists via some configuration,
> since they won't, and most can't, use the Java language.
>
> I would love to see instead more on integration with existing ML frameworks
> like XGBoost [1], Mahout Samsara [2], or DL4J [3] for ML APIs and models in
> Beam.
>
> Thoughts and comments are definitely welcomed =)
>
> - Henry
>
> [1] https://github.com/dmlc/xgboost
> [2] https://mahout.apache.org/users/environment/out-of-core-reference.html
> [3] http://deeplearning4j.org
> <http://deeplearning4j.org/image-data-pipeline.html#record>
>
>
> On Sat, May 21, 2016 at 2:01 AM, Simone Robutti <
> simone.robu...@radicalbit.io> wrote:
>
> > I think these APIs won't be used by Data Scientists (R, Python) but by
> > Machine Learning Engineers (Scala, Java or C++ in different environments)
> > and as a ML Engineer it makes a lot of sense to me to have such an API if
> > I'm using Beam. It would make a lot more sense to implement algorithms
> > directly in Beam but that will come in the future, I hope.
> >
> > 2016-05-21 0:35 GMT+02:00 Henry Saputra <henry.sapu...@gmail.com>:
> >
> > > I am a bit concerned about adding ML model APIs to Beam because of the
> > > fluctuating nature of the ML landscape and also because, in reality,
> > > most data scientists tend to use Python and R for most of their work
> > > with existing model definitions.
> > >
> > > Even though you could say something like Spark ML is popular, it is
> > > merely because it involves Apache Spark rather than because of the
> > > quality of the ML module itself.
> > >
> > > The pipeline and most of the tooling are inspired by scikit-learn, and
> > > hence it relies on familiarity with that library to attract
> > > developers.
> > >
> > > My question is whether a fully end-to-end ML API is needed as part of
> > > the core Beam APIs.
> > >
> > > - Henry
> > >
> > > On Thu, May 19, 2016 at 5:46 AM, Jianfeng Qian <qianjianf...@outlook.com>
> > > wrote:
> > >
> > > > Hi,
> > > > I am quite interested in this proposal.
> > > > It is great to consider a lot of machine learning projects.
> > > > Currently, most algorithms in Spark MLlib are batch processing, while
> > > > oryx2 and streamDM focus on real-time machine learning.
> > > > Flink is also working with the SAMOA team to integrate stream mining
> > > > algorithms.
> > > > So I wonder whether it is possible to design a flexible SDK which
> > > > allows users to call different third-party packages or their own
> > > > algorithms?
> > > >
> > > > Best,
> > > > Jianfeng
> > > >
> > > > On 2016-05-17 22:01, Suneel Marthi wrote:
> > > > > Thanks Simone for pointing this out.
> > > > >
> > > > > On the Apache Mahout project we have distributed linear algebra with
> > > > > R-like semantics that can be executed on Spark/Flink/H2O.
> > > > >
> > > > > @Kam: the document you point to is old and outdated; the most
> > > > > up-to-date reference for the Samsara API is the book "Apache Mahout:
> > > > > Beyond MapReduce". (Shameless marketing here on behalf of fellow
> > > > > committers :) )
> > > > >
> > > > > We added Flink DataSet API support in the recent Mahout 0.12.0 release
> > > > > (April 11, 2016), and it was called out in my talk at ApacheBigData in
> > > > > Vancouver last week.
> > > > >
> > > > > The Mahout community would definitely be interested in being involved
> > > > > with this and sharing notes.
> > > > >
> > > > > IMHO, the focus should first be on building a good linalg foundation
> > > > > before embarking on building algos and pipelines. Adding @dlyubimov to
> > > > > this.
> > > > >
> > > > >
> > > > >
> > > > > ---------- Forwarded message ----------
> > > > > From: Simone Robutti <simone.robu...@radicalbit.io>
> > > > > Date: Tue, May 17, 2016 at 9:48 AM
> > > > > Subject: Fwd: machine learning API, common models
> > > > > To: Suneel Marthi <smar...@apache.org>
> > > > >
> > > > >
> > > > >
> > > > > ---------- Forwarded message ----------
> > > > > From: Kavulya, Soila P <soila.p.kavu...@intel.com>
> > > > > Date: 2016-05-17 1:53 GMT+02:00
> > > > > Subject: RE: machine learning API, common models
> > > > > To: "dev@beam.incubator.apache.org" <dev@beam.incubator.apache.org>
> > > > >
> > > > >
> > > > > Thanks Simone,
> > > > >
> > > > > You have raised a valid concern about how different frameworks will
> > > > > have different implementations and parameter semantics for the same
> > > > > algorithm. I agree that it is important to keep this in mind.
> > > > > Hopefully, through this exercise, we will identify a good set of
> > > > > common ML abstractions across different frameworks.
> > > > >
> > > > > Feel free to edit the document. We had limited the first pass of the
> > > > > comparison matrix to the machine learning pipeline APIs, but we can
> > > > > extend it to include other ML building blocks like linear algebra
> > > > > operations, and APIs for optimizers like gradient descent.
> > > > >
> > > > > Soila
> > > > >
> > > > > -----Original Message-----
> > > > > From: Kam Kasravi [mailto:kamkasr...@gmail.com]
> > > > > Sent: Monday, May 16, 2016 8:22 AM
> > > > > To: dev@beam.incubator.apache.org
> > > > > Subject: Re: machine learning API, common models
> > > > >
> > > > > Thanks Simone - yes I had read your concerns on dev and I think
> > > > > they're well founded.
> > > > > Thanks for the Samsara reference - I've been looking at the
> > > > > Spark/Scala bindings:
> > > > > http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf
> > > > >
> > > > > I think we should expand the document to include linear algebraic ops,
> > > > > or at least pay due diligence to them. If you're doing anything on the
> > > > > Flink side in this regard, let us know, or feel free to suggest
> > > > > edits/updates to the document.
> > > > >
> > > > > Thanks
> > > > > Kam
> > > > >
> > > > > On Mon, May 16, 2016 at 6:05 AM, Simone Robutti <
> > > > > simone.robu...@radicalbit.io> wrote:
> > > > >
> > > > >> Hello,
> > > > >>
> > > > >> I'm Simone and I just began contributing to Flink ML (actually on the
> > > > >> distributed linalg part). I already expressed my concerns about the
> > > > >> idea of a high-level API relying on specific frameworks'
> > > > >> implementations: different implementations produce different results
> > > > >> and may vary in quality. Also, the semantics of parameters may change
> > > > >> from one implementation to the other. This could hinder portability
> > > > >> and transparency. I believe these problems could be handled by paying
> > > > >> due attention to the details of every single implementation, but I
> > > > >> invite you not to underestimate these problems.
> > > > >>
> > > > >> On the other hand, the API in itself looks good to me. From my side, I
> > > > >> hope to fill some of the gaps in Flink you underlined in the
> > > > >> comparison matrix.
> > > > >> Talking about matrices, proper matrices this time, I believe it would
> > > > >> be useful to include in this API support for linear algebra
> > > > >> operations. Something similar is already present in Mahout's Samsara
> > > > >> and it looks really good, but clearly a similar implementation on Beam
> > > > >> would be way more interesting and powerful.
> > > > >>
> > > > >> My 2 cents,
> > > > >>
> > > > >> Simone
> > > > >>
> > > > >>
> > > > >> 2016-05-14 4:53 GMT+02:00 Kavulya, Soila P <soila.p.kavu...@intel.com>:
> > > > >>
> > > > >>> Hi Tyler,
> > > > >>>
> > > > >>> Thank you so much for your feedback. I agree that starting with the
> > > > >>> high-level API is a good direction. We are interested in Python
> > > > >>> because it is the language that our data scientists are most familiar
> > > > >>> with. I think starting with Java would be the best approach, because
> > > > >>> the Python API can be a thin wrapper for the Java API.
> > > > >>>
> > > > >>> In Spark, the Scala, Java and Python APIs are identical. Flink does
> > > > >>> not have a Python API for ML pipelines at present.
> > > > >>>
> > > > >>> Could you point me to the updated runner API?
> > > > >>>
> > > > >>> Soila
> > > > >>>
> > > > >>> -----Original Message-----
> > > > >>> From: Tyler Akidau [mailto:taki...@google.com.INVALID]
> > > > >>> Sent: Friday, May 13, 2016 6:34 PM
> > > > >>> To: dev@beam.incubator.apache.org
> > > > >>> Subject: Re: machine learning API, common models
> > > > >>>
> > > > >>> Hi Kam & Soila,
> > > > >>>
> > > > >>> Thanks a lot for writing this up. I ran the doc past some of the
> > > > >>> folks who've been doing ML work here at Google, and they were
> > > > >>> generally happy with the distillation of common methods in the doc.
> > > > >>> I'd be curious to hear what folks on the Flink- and Spark-runner
> > > > >>> sides think.
> > > > >>>
> > > > >>> To me, this seems like a good direction for a high-level API.
> > > > >>> Presumably, once a high-level API is in place, we could begin
> > > > >>> looking at what it would take to add lower-level ML algorithm
> > > > >>> support (e.g. iterative) to the Beam Model. Is this essentially what
> > > > >>> you're thinking?
> > > > >>>
> > > > >>> Some more specific questions/comments:
> > > > >>>
> > > > >>>     - Presumably you'd want to tackle this in Java first, since that's
> > > > >>>     the only language we currently support? Given that half of your
> > > > >>>     examples are in Python, I'm also assuming Python will be
> > > > >>>     interesting once it's available.
> > > > >>>
> > > > >>>     - Along those lines, what languages are represented in the
> > > > >>>     capability matrix? E.g. is Spark ML support as detailed there
> > > > >>>     identical across Java/Scala and Python?
> > > > >>>
> > > > >>>     - Have you thought about how this would tie in at the runner level,
> > > > >>>     particularly given the updated Runner API changes that are coming?
> > > > >>>     I'm assuming they'd be provided as composite transforms that (for
> > > > >>>     now) would have no default implementation, given the lack of
> > > > >>>     low-level primitives for ML algorithms, but am curious what your
> > > > >>>     thoughts are there.
> > > > >>>
> > > > >>>     - I still don't fully understand how incremental updates due to
> > > > >>>     model drift would tie in at the API level. There's a comment thread
> > > > >>>     in the doc still open tracking this, so no need to comment here
> > > > >>>     additionally. Just pointing it out as one of the things that stands
> > > > >>>     out as potentially having API-level impacts to me that doesn't seem
> > > > >>>     100% fleshed out in the doc yet (though that admittedly may just be
> > > > >>>     my limited understanding at this point :-).
> > > > >>>
> > > > >>> -Tyler
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> On Fri, May 13, 2016 at 10:48 AM Kam Kasravi <kamkasr...@gmail.com>
> > > > >>> wrote:
> > > > >>>> Hi Tyler - my bad. Comments should be enabled now.
> > > > >>>>
> > > > >>>> On Fri, May 13, 2016 at 10:45 AM, Tyler Akidau
> > > > >>>> <taki...@google.com.invalid
> > > > >>>> wrote:
> > > > >>>>
> > > > >>>>> Thanks a lot, Kam. Can you please enable comment access on the doc?
> > > > >>>>> I seem to have view access only.
> > > > >>>>>
> > > > >>>>> -Tyler
> > > > >>>>>
> > > > >>>>> On Fri, May 13, 2016 at 9:54 AM Kam Kasravi
> > > > >>>>> <kamkasr...@gmail.com>
> > > > >>>> wrote:
> > > > >>>>>> Hi
> > > > >>>>>>
> > > > >>>>>> A number of readers have made comments on this topic recently.
> > > > >>>>>> We have created a document that does some analysis of common
> > > > >>>>>> ML models and related APIs. We hope this can drive an approach
> > > > >>>>>> that will result in an API, compatibility matrix and involvement
> > > > >>>>>> from the same groups that are implementing transformation runners
> > > > >>>>>> (Spark, Flink, etc).
> > > > >>>>>> We welcome comments here or in the document itself.
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> https://docs.google.com/document/d/17cRZk_yqHm3C0fljivjN66MbLkeKS1yjo4PBECHb-xA/edit?usp=sharing
> > > >
> > > >
> > >
> >
>
