RE: machine learning API, common models

2016-05-17 Thread Kavulya, Soila P
Hi Suneel,

The document is a work in progress intended to solicit feedback on the API, so
feel free to add to it.

Based on discussions with Tyler, the plan was to start by defining the types of 
data structures and transforms that we need to support in the high-level API 
(without a default implementation). Once that is done, we would then add 
lower-level ML algorithm support (e.g. iterative).
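
As a strawman, here is a minimal sketch (in Scala, against the incubating
Beam Java SDK of that era, where the expansion hook was still called apply
rather than expand) of how such a high-level transform could be declared
with no default implementation. The names TrainClassifier, LabeledVector and
Model are hypothetical, not from the proposal doc:

    import org.apache.beam.sdk.transforms.PTransform
    import org.apache.beam.sdk.values.PCollection

    // Hypothetical element types, just to make the sketch self-contained.
    case class LabeledVector(label: Double, features: Array[Double])
    case class Model(weights: Array[Double])

    // A composite ML transform with no default expansion: a runner that
    // recognizes it substitutes its own implementation; any other runner
    // fails fast.
    class TrainClassifier
        extends PTransform[PCollection[LabeledVector], PCollection[Model]] {
      override def apply(input: PCollection[LabeledVector]): PCollection[Model] =
        throw new UnsupportedOperationException(
          "No default implementation; the runner must override this transform")
    }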

Soila

-Original Message-
From: Suneel Marthi [mailto:smar...@apache.org] 
Sent: Tuesday, May 17, 2016 7:26 AM
To: mahout ; dev@beam.incubator.apache.org
Subject: Re: machine learning API, common models

I am curious as to why Oryx 2.0 and Mahout have been excluded from this doc. 
Any reasons?
Both projects have a good customer base and are being used in production.

On Tue, May 17, 2016 at 10:01 AM, Suneel Marthi  wrote:

> Thanks Simone for pointing this out.
>
> On the Apache Mahout project we have distributed linear algebra with 
> R-like semantics that can be executed on Spark/Flink/H2O.
>
> @Kam: the document you point out is old and outdated; the most
> up-to-date reference to the Samsara API is the book "Apache Mahout:
> Beyond MapReduce". (shameless marketing here on behalf of fellow
> committers :) )
>
> We added the Flink DataSet API in the recent Mahout 0.12.0 release
> (April 11, 2016); this was called out in my talk at ApacheBigData in
> Vancouver last week.
>
> The Mahout community would definitely be interested in being involved 
> with this and sharing notes.
>
> IMHO, the focus should first be on building a good linalg foundation
> before embarking on building algos and pipelines. Adding @dlyubimov to this.
>
>
>
> -- Forwarded message --
> From: Simone Robutti 
> Date: Tue, May 17, 2016 at 9:48 AM
> Subject: Fwd: machine learning API, common models
> To: Suneel Marthi 
>
>
>
> -- Forwarded message --
> From: Kavulya, Soila P 
> Date: 2016-05-17 1:53 GMT+02:00
> Subject: RE: machine learning API, common models
> To: "dev@beam.incubator.apache.org" 
>
>
> Thanks Simone,
>
> You have raised a valid concern about how different frameworks will 
> have different implementations and parameter semantics for the same 
> algorithm. I agree that it is important to keep this in mind. 
> Hopefully, through this exercise, we will identify a good set of 
> common ML abstractions across different frameworks.
>
> Feel free to edit the document. We had limited the first pass of the 
> comparison matrix to the machine learning pipeline APIs, but we can 
> extend it to include other ML building blocks like linear algebra 
> operations, and APIs for optimizers like gradient descent.
>
> Soila
>
> -Original Message-
> From: Kam Kasravi [mailto:kamkasr...@gmail.com]
> Sent: Monday, May 16, 2016 8:22 AM
> To: dev@beam.incubator.apache.org
> Subject: Re: machine learning API, common models
>
> Thanks Simone - yes I had read your concerns on dev and I think 
> they're well founded.
> Thanks for the Samsara reference - I've been looking at the
> Spark/Scala bindings:
> http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf
>
> I think we should expand the document to include linear algebraic ops,
> or at least pay due diligence to them. If you're doing anything on the
> Flink side in this regard, let us know, or feel free to suggest
> edits/updates to the document.
>
> Thanks
> Kam
>
> On Mon, May 16, 2016 at 6:05 AM, Simone Robutti < 
> simone.robu...@radicalbit.io> wrote:
>
> > Hello,
> >
> > I'm Simone and I just began contributing to Flink ML (actually on
> > the distributed linalg part). I already expressed my concerns about
> > the idea of a high-level API relying on specific frameworks'
> > implementations: different implementations produce different results
> > and may vary in quality. Also, the semantics of parameters may change
> > from one implementation to the other. This could hinder portability
> > and transparency. I believe these problems could be handled by paying
> > due attention to the details of every single implementation, but I
> > invite you not to underestimate them.
> >
> > On the other hand, the API itself looks good to me. From my side,
> > I hope to fill some of the gaps in Flink you underlined in the
> > comparison matrix.
> >
> > Talking about matrices, proper matrices this time, I believe it
> > would be useful to include support for linear algebra operations in
> > this API. Something similar is already present in Mahout's Samsara
> > and it looks really good, but clearly a similar implementation on
> > Beam would be way more interesting and powerful.
> >
> > My 2 cents,
> >
> > Simone
> >
> >
> > 2016-05-14 4:53 GMT+02:00 Kavulya, Soila P :
> >
> > > Hi Tyler,
> > >
> > > Thank you so much for your feedback. I agree that 

Re: machine learning API, common models

2016-05-17 Thread Tyler Akidau
On Sat, May 14, 2016 at 4:53 AM Kavulya, Soila P wrote:

> Hi Tyler,
>
> Thank you so much for your feedback. I agree that starting with the
> high-level API is a good direction. We are interested in Python because it
> is the language that our data scientists are most familiar with. I think
> starting with Java would be the best approach, because the Python API can
> be a thin wrapper around the Java API.
>
> In Spark, the Scala, Java and Python APIs are identical. Flink does not
> have a Python API for ML pipelines at present.
>
> Could you point me to the updated runner API?
>

Sorry for the delay; I've been traveling. The runner API proposal is here:
https://docs.google.com/document/d/1bao-5B6uBuf-kwH1meenAuXXS0c9cBQ1B2J59I3FiyI/edit

-Tyler


>
> Soila
>
> -Original Message-
> From: Tyler Akidau [mailto:taki...@google.com.INVALID]
> Sent: Friday, May 13, 2016 6:34 PM
> To: dev@beam.incubator.apache.org
> Subject: Re: machine learning API, common models
>
> Hi Kam & Soila,
>
> Thanks a lot for writing this up. I ran the doc past some of the folks
> who've been doing ML work here at Google, and they were generally happy
> with the distillation of common methods in the doc. I'd be curious to hear
> what folks on the Flink and Spark runner sides think.
>
> To me, this seems like a good direction for a high-level API. Presumably,
> once a high-level API is in place, we could begin looking at what it would
> take to add lower-level ML algorithm support (e.g. iterative) to the Beam
> Model. Is this essentially what you're thinking?
>
> Some more specific questions/comments:
>
>    - Presumably you'd want to tackle this in Java first, since that's
>    the only language we currently support? Given that half of your
>    examples are in Python, I'm also assuming Python will be interesting
>    once it's available.
>
>    - Along those lines, what languages are represented in the capability
>    matrix? E.g. is Spark ML support as detailed there identical across
>    Java/Scala and Python?
>
>    - Have you thought about how this would tie in at the runner level,
>    particularly given the updated Runner API changes that are coming? I'm
>    assuming they'd be provided as composite transforms that (for now)
>    would have no default implementation, given the lack of low-level
>    primitives for ML algorithms, but am curious what your thoughts are
>    there.
>
>    - I still don't fully understand how incremental updates due to model
>    drift would tie in at the API level. There's a comment thread in the
>    doc still open tracking this, so no need to comment here additionally.
>    Just pointing it out as one of the things that stands out as
>    potentially having API-level impacts to me that doesn't seem 100%
>    fleshed out in the doc yet (though that admittedly may just be my
>    limited understanding at this point :-).
>
> -Tyler
>
>
>
>
> On Fri, May 13, 2016 at 10:48 AM Kam Kasravi  wrote:
>
> > Hi Tyler - my bad. Comments should be enabled now.
> >
> > On Fri, May 13, 2016 at 10:45 AM, Tyler Akidau wrote:
> >
> > > Thanks a lot, Kam. Can you please enable comment access on the doc?
> > > I seem to have view access only.
> > >
> > > -Tyler
> > >
> > > On Fri, May 13, 2016 at 9:54 AM Kam Kasravi wrote:
> > >
> > > > Hi
> > > >
> > > > A number of readers have made comments on this topic recently. We
> > > > have created a document that does some analysis of common ML
> > > > models and related APIs. We hope this can drive an approach that
> > > > will result in an API, compatibility matrix and involvement from
> > > > the same groups that are implementing transformation runners
> > > > (Spark, Flink, etc). We welcome comments here or in the document
> > > > itself.
> > > >
> > > >
> > > >
> > >
> > https://docs.google.com/document/d/17cRZk_yqHm3C0fljivjN66MbLkeKS1yjo4PBECHb-xA/edit?usp=sharing
> > > >
> > >
> >
>


Re: Apache Zeppelin Beam Integration

2016-05-17 Thread Neville Li
The RDD API is only tied to the SparkInterpreter.

Scio will probably have its own interpreter, but with a configurable runner
so users can also leverage Dataflow (very attractive to us) or Flink.
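
For what it's worth, a hedged sketch of what "configurable runner" could
look like in such an interpreter (names approximate for the incubating-era
SDKs; FlinkPipelineRunner was the Flink runner's name at the time):

    import com.spotify.scio.ScioContext
    import org.apache.beam.sdk.options.PipelineOptionsFactory

    // The runner is chosen via pipeline options, so the same interpreter
    // code can target the Dataflow service, Flink, or the direct runner.
    val opts = PipelineOptionsFactory
      .fromArgs("--runner=FlinkPipelineRunner")
      .create()
    val sc = ScioContext(opts)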

On Tue, May 17, 2016 at 11:12 AM Jean-Baptiste Onofré wrote:

> Hi Ismael,
>
> Even if 1. is probably the easiest way, I think it would require some
> changes on the Zeppelin side anyway. AFAIK, Zeppelin directly leverages
> RDDs, so it's tied to the Spark API.
>
> So, maybe we will need to change the Zeppelin backend a bit to abstract
> the current RDD usage to PCollection.
>
> My $0.01
>
> Regards
> JB
>
> On 05/17/2016 03:03 PM, Ismaël Mejía wrote:
> > Last week during the Apache Big Data / ApacheCon conference I attended
> > some presentations, and one aspect that surprised me is how Apache
> > Zeppelin was used by many presenters to show their data processing code
> > (mostly in Python/Scala).
> >
> > I consider that even if this integration is not critical for Apache
> > Beam, it is important to support it, and I intend to collaborate on
> > this task. I just created an issue on JIRA for the people interested:
> > https://issues.apache.org/jira/browse/BEAM-290
> >
> > I briefly discussed an initial plan to support Beam in three phases
> > with Alexander Bezzubov from Zeppelin:
> >
> > 1. Support the Scala SDK (scio) + Scala runners (Spark).
> >
> > This is first since most of the pieces exist already; we just need to
> > put the things together.
> >
> > 2. Integrate the Java SDK.
> >
> > The big issue here is that there is not (yet) a decent Java REPL tool,
> > and support for such a REPL in Zeppelin is ongoing work.
> >
> > 3. Integrate the Python SDK.
> >
> > This one depends on the release of the Python SDK in the upcoming
> > weeks, and its priority can change if integration turns out to be
> > easier than the other two tasks.
> >
> > Of course, this message is a call to other interested parties to
> > contribute, e.g. with ideas, an agenda to prioritize certain runners,
> > or other complementary tasks that help achieve the goals, like
> > integrating scio or supporting the Google Storage backend for the
> > notebooks (to make for a nicer integration for users of the runner on
> > Google Cloud), etc.
> >
> > Ismaël Mejía
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Apache Zeppelin Beam Integration

2016-05-17 Thread Jean-Baptiste Onofré

Hi Ismael,

Even if 1. is probably the easiest way, I think it would require some
changes on the Zeppelin side anyway. AFAIK, Zeppelin directly leverages
RDDs, so it's tied to the Spark API.

So, maybe we will need to change the Zeppelin backend a bit to abstract
the current RDD usage to PCollection.
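
A hypothetical sketch (all names invented for illustration) of the kind of
abstraction layer this implies - Zeppelin's display code would program
against a small trait instead of RDDs, with one adapter per engine:

    // What Zeppelin needs from a result in order to render a table.
    trait NotebookData[T] {
      def take(n: Int): Seq[T]
    }

    // The Spark adapter is trivial because RDDs have actions.
    class RddData[T](rdd: org.apache.spark.rdd.RDD[T]) extends NotebookData[T] {
      def take(n: Int): Seq[T] = rdd.take(n).toSeq
    }

    // A PCollection adapter is harder: Beam has no actions, so a transform
    // would have to write results out and the adapter read them back after
    // the pipeline run finishes.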


My $0.01

Regards
JB

On 05/17/2016 03:03 PM, Ismaël Mejía wrote:

Last week during the Apache Big Data / ApacheCon conference I attended some
presentations, and one aspect that surprised me is how Apache Zeppelin was
used by many presenters to show their data processing code (mostly in
Python/Scala).

I consider that even if this integration is not critical for Apache Beam, it
is important to support it, and I intend to collaborate on this task. I
just created an issue on JIRA for the people interested:
https://issues.apache.org/jira/browse/BEAM-290

I briefly discussed an initial plan to support Beam in three phases with
Alexander Bezzubov from Zeppelin:

1. Support the Scala SDK (scio) + Scala runners (Spark).

This is first since most of the pieces exist already; we just need to put
the things together.

2. Integrate the Java SDK.

The big issue here is that there is not (yet) a decent Java REPL tool, and
support for such a REPL in Zeppelin is ongoing work.

3. Integrate the Python SDK.

This one depends on the release of the Python SDK in the upcoming weeks,
and its priority can change if integration turns out to be easier than the
other two tasks.

Of course, this message is a call to other interested parties to contribute,
e.g. with ideas, an agenda to prioritize certain runners, or other
complementary tasks that help achieve the goals, like integrating scio or
supporting the Google Storage backend for the notebooks (to make for a
nicer integration for users of the runner on Google Cloud), etc.

Ismaël Mejía



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Fwd: machine learning API, common models

2016-05-17 Thread Suneel Marthi
Thanks Simone for pointing this out.

On the Apache Mahout project we have distributed linear algebra with R-like
semantics that can be executed on Spark/Flink/H2O.
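
For readers who have not seen it, a short sketch of the Samsara R-like DSL
(Mahout 0.10+); the operators shown are the real DSL as I recall it, though
exact imports may vary by version:

    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    val a = dense((1, 2, 3), (3, 4, 5))  // in-core matrix literal
    // Distributed row matrix; the engine (Spark/Flink/H2O) is supplied by
    // an implicit distributed context already in scope.
    val drmA = drmParallelize(a)
    // A'A written algebraically; the plan is optimized and run on the
    // cluster, and only the small result is collected to the driver.
    val ata = (drmA.t %*% drmA).collect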

@Kam: the document you point out is old and outdated; the most up-to-date
reference to the Samsara API is the book "Apache Mahout: Beyond
MapReduce". (shameless marketing here on behalf of fellow committers :) )

We added the Flink DataSet API in the recent Mahout 0.12.0 release (April 11,
2016); this was called out in my talk at ApacheBigData in Vancouver last
week.

The Mahout community would definitely be interested in being involved with
this and sharing notes.

IMHO, the focus should first be on building a good linalg foundation
before embarking on building algos and pipelines. Adding @dlyubimov to this.



-- Forwarded message --
From: Simone Robutti 
Date: Tue, May 17, 2016 at 9:48 AM
Subject: Fwd: machine learning API, common models
To: Suneel Marthi 



-- Forwarded message --
From: Kavulya, Soila P 
Date: 2016-05-17 1:53 GMT+02:00
Subject: RE: machine learning API, common models
To: "dev@beam.incubator.apache.org" 


Thanks Simone,

You have raised a valid concern about how different frameworks will have
different implementations and parameter semantics for the same algorithm. I
agree that it is important to keep this in mind. Hopefully, through this
exercise, we will identify a good set of common ML abstractions across
different frameworks.

Feel free to edit the document. We had limited the first pass of the
comparison matrix to the machine learning pipeline APIs, but we can extend
it to include other ML building blocks like linear algebra operations, and
APIs for optimizers like gradient descent.
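
As a strawman for the optimizer building block, a hypothetical sketch (none
of these names come from the document):

    // A common optimizer abstraction: one update step over the weights.
    trait Optimizer {
      def step(weights: Vector[Double], gradient: Vector[Double]): Vector[Double]
    }

    // Plain gradient descent: w := w - eta * grad
    final class GradientDescent(eta: Double) extends Optimizer {
      def step(w: Vector[Double], g: Vector[Double]): Vector[Double] =
        (w, g).zipped.map((wi, gi) => wi - eta * gi)
    }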

Soila

-Original Message-
From: Kam Kasravi [mailto:kamkasr...@gmail.com]
Sent: Monday, May 16, 2016 8:22 AM
To: dev@beam.incubator.apache.org
Subject: Re: machine learning API, common models

Thanks Simone - yes I had read your concerns on dev and I think they're
well founded.
Thanks for the Samsara reference - I've been looking at the Spark/Scala
bindings: http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf

I think we should expand the document to include linear algebraic ops, or
at least pay due diligence to them. If you're doing anything on the Flink
side in this regard, let us know, or feel free to suggest edits/updates to
the document.

Thanks
Kam

On Mon, May 16, 2016 at 6:05 AM, Simone Robutti <
simone.robu...@radicalbit.io> wrote:

> Hello,
>
> I'm Simone and I just began contributing to Flink ML (actually on the
> distributed linalg part). I already expressed my concerns about the
> idea of a high-level API relying on specific frameworks' implementations:
> different implementations produce different results and may vary in
> quality. Also, the semantics of parameters may change from one
> implementation to the other. This could hinder portability and
> transparency. I believe these problems could be handled by paying due
> attention to the details of every single implementation, but I invite
> you not to underestimate them.
>
> On the other hand, the API itself looks good to me. From my side, I
> hope to fill some of the gaps in Flink you underlined in the comparison
> matrix.
>
> Talking about matrices, proper matrices this time, I believe it would
> be useful to include support for linear algebra operations in this API.
> Something similar is already present in Mahout's Samsara and it looks
> really good, but clearly a similar implementation on Beam would be way
> more interesting and powerful.
>
> My 2 cents,
>
> Simone
>
>
> 2016-05-14 4:53 GMT+02:00 Kavulya, Soila P :
>
> > Hi Tyler,
> >
> > Thank you so much for your feedback. I agree that starting with the
> > high-level API is a good direction. We are interested in Python
> > because it is the language that our data scientists are most familiar
> > with. I think starting with Java would be the best approach, because
> > the Python API can be a thin wrapper around the Java API.
> >
> > In Spark, the Scala, Java and Python APIs are identical. Flink does
> > not have a Python API for ML pipelines at present.
> >
> > Could you point me to the updated runner API?
> >
> > Soila
> >
> > -Original Message-
> > From: Tyler Akidau [mailto:taki...@google.com.INVALID]
> > Sent: Friday, May 13, 2016 6:34 PM
> > To: dev@beam.incubator.apache.org
> > Subject: Re: machine learning API, common models
> >
> > Hi Kam & Soila,
> >
> > Thanks a lot for writing this up. I ran the doc past some of the
> > folks who've been doing ML work here at Google, and they were
> > generally happy with the distillation of common methods in the doc.
> > I'd be curious to hear what folks on the Flink and Spark runner
> > sides think.
> >
> > To me, this seems like a good direction for a high-level API.
> > 

Re: Apache Zeppelin Beam Integration

2016-05-17 Thread Ismaël Mejía
You are right Neville, my idea is to use the scio-repl to offer that kind of
semi-interactiveness, at least to be able to show/execute code snippets, but
I agree that the experience probably won't be exactly the same as the Spark
interpreter; however, the support for different runners is, for me, worth
the effort.

Another aspect that could be interesting too, and that I have not explored
at all, is to look at how we can integrate real-time/unbounded data
pipelines, which I imagine displayed as 'dashboards'.

Anyway, I am probably missing some technical details (no doubt), so any
feedback you or the others can give me is more than welcome.

On Tue, May 17, 2016 at 3:16 PM, Neville Li  wrote:

> The biggest appeal of Spark in Zeppelin is its interactiveness, i.e. the
> ability to pull data from RDDs to the driver/web UI via actions (take,
> collect, top).
> There is no equivalent of actions in Beam/Dataflow, only transformations
> (apply(transform)). How's that going to work with Spark?
>
> In scio-repl we have semi-interactiveness, i.e. each context corresponds to
> a Dataflow job but you have to close the context before collecting data
> back to the REPL with Future.
>
> On Tue, May 17, 2016 at 9:03 AM Ismaël Mejía  wrote:
>
> > Last week during the Apache Big Data / ApacheCon conference I attended
> > some presentations, and one aspect that surprised me is how Apache
> > Zeppelin was used by many presenters to show their data processing code
> > (mostly in Python/Scala).
> >
> > I consider that even if this integration is not critical for Apache
> > Beam, it is important to support it, and I intend to collaborate on
> > this task. I just created an issue on JIRA for the people interested:
> > https://issues.apache.org/jira/browse/BEAM-290
> >
> > I briefly discussed an initial plan to support Beam in three phases
> > with Alexander Bezzubov from Zeppelin:
> >
> > 1. Support the Scala SDK (scio) + Scala runners (Spark).
> >
> > This is first since most of the pieces exist already; we just need to
> > put the things together.
> >
> > 2. Integrate the Java SDK.
> >
> > The big issue here is that there is not (yet) a decent Java REPL tool,
> > and support for such a REPL in Zeppelin is ongoing work.
> >
> > 3. Integrate the Python SDK.
> >
> > This one depends on the release of the Python SDK in the upcoming
> > weeks, and its priority can change if integration turns out to be
> > easier than the other two tasks.
> >
> > Of course, this message is a call to other interested parties to
> > contribute, e.g. with ideas, an agenda to prioritize certain runners,
> > or other complementary tasks that help achieve the goals, like
> > integrating scio or supporting the Google Storage backend for the
> > notebooks (to make for a nicer integration for users of the runner on
> > Google Cloud), etc.
> >
> > Ismaël Mejía
> >
>


Re: Apache Zeppelin Beam Integration

2016-05-17 Thread Neville Li
The biggest appeal of Spark in Zeppelin is its interactiveness, i.e. the
ability to pull data from RDDs to the driver/web UI via actions (take,
collect, top).
There is no equivalent of actions in Beam/Dataflow, only transformations
(apply(transform)). How's that going to work with Spark?

In scio-repl we have semi-interactiveness, i.e. each context corresponds to
a Dataflow job but you have to close the context before collecting data
back to the REPL with Future.
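
For the curious, a hedged sketch of that pattern (scio of that era; method
names are from memory and approximate):

    import com.spotify.scio._
    import scala.concurrent.Await
    import scala.concurrent.duration.Duration

    val sc = ScioContext()                    // one context == one job
    val f = sc.textFile("gs://bucket/input")  // SCollection[String]
      .map(_.length)
      .materialize                            // Future[Tap[Int]]
    sc.close()                                // submit and run the job
    // Only after close() has run the job can the REPL pull data back:
    val lengths = Await.result(f, Duration.Inf).value  // Iterator[Int]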

On Tue, May 17, 2016 at 9:03 AM Ismaël Mejía  wrote:

> Last week during the Apache Big Data / ApacheCon conference I attended
> some presentations, and one aspect that surprised me is how Apache
> Zeppelin was used by many presenters to show their data processing code
> (mostly in Python/Scala).
>
> I consider that even if this integration is not critical for Apache Beam,
> it is important to support it, and I intend to collaborate on this task.
> I just created an issue on JIRA for the people interested:
> https://issues.apache.org/jira/browse/BEAM-290
>
> I briefly discussed an initial plan to support Beam in three phases with
> Alexander Bezzubov from Zeppelin:
>
> 1. Support the Scala SDK (scio) + Scala runners (Spark).
>
> This is first since most of the pieces exist already; we just need to put
> the things together.
>
> 2. Integrate the Java SDK.
>
> The big issue here is that there is not (yet) a decent Java REPL tool,
> and support for such a REPL in Zeppelin is ongoing work.
>
> 3. Integrate the Python SDK.
>
> This one depends on the release of the Python SDK in the upcoming weeks,
> and its priority can change if integration turns out to be easier than
> the other two tasks.
>
> Of course, this message is a call to other interested parties to
> contribute, e.g. with ideas, an agenda to prioritize certain runners, or
> other complementary tasks that help achieve the goals, like integrating
> scio or supporting the Google Storage backend for the notebooks (to make
> for a nicer integration for users of the runner on Google Cloud), etc.
>
> Ismaël Mejía
>