+1
2016-05-27 17:18 GMT+02:00 Kam Kasravi <kamkasr...@gmail.com>:
> Hi Beam ML community
>
> Based on comments from a number of you and some discussion we've had here
> we thought we would suggest the following direction:
>
>- Begin with primitive operations common and critical to most all ML
>algorithms. These primitive operators would include:
> - linear algebra operations - borrowing from established libraries
> like samsara.
> - iterative processing - also central to ML where replay of datasets
> is easy to specific as well as thresholds or halting criteria. This
> coordinates well with FlinkML's current approach and base API's.
> - possibly new broadcast mechanisms not normally available within BSP
> frameworks such as Beam.
>- Normalize dataset and parameters that differ across current major ML
>libraries that offer the same types of models.
>- Favor a native ML implementation rather than a thin wrapper in order
>to provide consistency across runners. This will also allow the Beam ML
> to
>maximize quality and consistency issues across runners.
>- Support for languages also supported in the Beam runners (java,
>python, scala).
>- Implement several common ML algorithms using the low level primitives
>on one of more available Runners to validate both the low level API's
> and
>possible improvements on the high level API.
>
> Skikit-learn pipelines and existing portable libraries like xgboost4j will
> be valuable to model the high-level APIs - for example how xgboost4j
> currently integrates with spark and flink.
>
> We welcome further comments and further refinements in approach.
>
> On Sun, May 22, 2016 at 7:43 PM, Henry Saputra <henry.sapu...@gmail.com>
> wrote:
>
> > @Frances:
> >
> > that would be probably the way to go IF we decide to have ML in Beam.
> >
> > @Simone:
> >
> > I am definitely love to see Beam introduce ML model APIs to abstract and
> > unifiy all "dataflow" runner frameworks, such as with Flink ML and Spark
> > ML.
> >
> > However, as you mentioned before, the target audience would be focus on
> > distributed or ML engineers as you have mentioned.
> > But I could see we have to then make some out of box ML algorithms (model
> > train and fine tune) in addition to test the model and APIs.
> >
> > The expectation would be that these models to be "production" ready, in
> > which most cases will be used by Data Scientists via some configurations,
> > since they won't and most can't use Java language.
> >
> > I would love to see instead more on integration with existing ML
> frameworks
> > like XGBoost [1], Mahout Samsara [2], or DL4J [3] for ML APIs and models
> in
> > Beam.
> >
> > Thoughts and comments are definitely welcomed =)
> >
> > - Henry
> >
> > [1] https://github.com/dmlc/xgboost
> > [2]
> https://mahout.apache.org/users/environment/out-of-core-reference.html
> > [3] http://deeplearning4j.org
> > <http://deeplearning4j.org/image-data-pipeline.html#record>
> >
> >
> > On Sat, May 21, 2016 at 2:01 AM, Simone Robutti <
> > simone.robu...@radicalbit.io> wrote:
> >
> > > I think these APIs won't be used by Data Scientists (R, Python) but by
> > > Machine Learning Engineers (Scala, Java or C++ in different
> environments)
> > > and as a ML Engineer it makes a lot of sense to me to have such an API
> if
> > > I'm using Beam. It would make a lot more sense to implement algorithms
> > > directly in Beam but that will come in the future, I hope.
> > >
> > > 2016-05-21 0:35 GMT+02:00 Henry Saputra <henry.sapu...@gmail.com>:
> > >
> > > > I am a bit concern about adding ML model APIs to Beam because the
> > > fluctuate
> > > > nature of ML landscape and also in reality, most data scientists tend
> > to
> > > > use Python and R most the work with existing model definition.
> > > >
> > > > Even though you could say something like Spark ML is popular, it is
> > > merely
> > > > because it is involving Apache Spark rather than quality of the ML
> > module
> > > > itself.
> > > >
> > > > The pipeline and most of the tooling are inspired by scikit-learn,
> and
> > > > hence it is relying on familiarity of the library to attract
> > developers.
> > > >
> > > > My question is whether fully end to end ML APIs is needed as part of
> > core
> > > > Beam