Re: [DISCUSS] Developing new components -- branches, maturity, and committers

2016-05-19 Thread Jean-Baptiste Onofré

Fully agree with Davor regarding feature idea implementation.

Regards
JB

On 05/19/2016 08:59 PM, Davor Bonaci wrote:

If anybody wants to experiment a little with a feature idea -- absolutely,
individual forked repositories are certainly an awesome place for such
attempts.

However, for something that is a significant undertaking, like a new runner
or new SDK, I think feature branches in the main repository make total
sense. We'd avoid the important disadvantages of forks: lower visibility,
making it harder for others to jump in, comment, and learn, and harder
testing, since Apache Jenkins wouldn't be able to run it automatically.

In summary, I think there's a spectrum of feature complexities and
longevity considerations. As such, I'd support being flexible as
appropriate, but have a default answer of starting with a feature branch in
the main repository for new major components.

On Thu, May 19, 2016 at 3:09 AM, Ismaël Mejía  wrote:


I agree with Aljoscha about not putting the feature branches in the main
repo; however, how can we make people aware of the new developments?

-Ismaël

On Thu, May 19, 2016 at 11:56 AM, Aljoscha Krettek 
wrote:


+1

When we say feature branch, are we talking about a branch in the main

repo?

I would propose that feature branches live in the repos of the committers
who are working on a feature.

On Thu, 19 May 2016 at 11:54 Jean-Baptiste Onofré 

wrote:



+1

it looks good to me.

Regards
JB

On 05/19/2016 07:01 AM, Frances Perry wrote:

Hi Beamers --

I’m thrilled by the recent energy and activity on writing new Beam runners!
But that also means it’s probably time for us to figure out how, as a
community, we want to support this process. ;-)

Back near the beginning, we had a thread [1] discussing that feature
branches are the preferred way of doing development of features or
components that may take a while to reach maturity. I think new components
like runners and SDKs meet the bar to be started from a feature branch.
(Other features, like an IO connector or library of PTransforms, might
also qualify depending on complexity.)

We should also lay out what it takes to be considered mature enough to be
merged into master, since once that happens the component gets released to
users and failing tests become blocking issues. Here are some initial
thoughts to kick off the discussion...

In order to be merged into master, new components / major features should:

- have at least 2 contributors interested in maintaining it, and 1
  committer interested in supporting it
- provide both end-user and developer-facing documentation
- have at least a basic level of unit test coverage
- run all existing applicable integration tests with other Beam components
  and create additional tests as appropriate


In addition...

A runner should:

- be able to handle a subset of the model that addresses a significant set
  of use cases (e.g., ‘traditional batch’ or ‘processing time streaming’)
- update the capability matrix with the current status


An SDK* should:

- provide the ability to construct graphs with all the basic building
  blocks of the model (ParDo, GroupByKey, Window, Trigger, etc.)
- begin fleshing out the common composite transforms (Count, Join, etc.)
  and IO connectors (Text, Kafka, etc.)
- have at least one runner that can execute the complete model (may be a
  direct runner)
- provide integration tests for executing against current and future
  runners
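To make the "basic building blocks" concrete, here is a toy Python sketch of the model's semantics -- not the Beam SDK itself, and all names are illustrative -- showing ParDo, GroupByKey, and fixed windowing over a tiny word-count example:

```python
from collections import defaultdict

def par_do(pcollection, fn):
    # ParDo: apply fn to each element; fn may emit zero or more outputs.
    return [out for elem in pcollection for out in fn(elem)]

def group_by_key(pcollection):
    # GroupByKey: gather all values that share a key.
    grouped = defaultdict(list)
    for key, value in pcollection:
        grouped[key].append(value)
    return sorted(grouped.items())

def fixed_windows(timestamped, size):
    # Window: assign each (timestamp, element) pair to a fixed-size window,
    # identified here by its start timestamp.
    return [((ts // size) * size, elem) for ts, elem in timestamped]

# Windowed word count over (timestamp-in-seconds, line) events.
events = [(1, "beam beam"), (2, "runner"), (61, "beam")]
words = par_do(events, lambda e: [(e[0], w) for w in e[1].split()])
keyed = [((win, word), 1) for win, word in fixed_windows(words, 60)]
counts = [(key, sum(ones)) for key, ones in group_by_key(keyed)]
# counts: [((0, 'beam'), 2), ((0, 'runner'), 1), ((60, 'beam'), 1)]
```

A real runner, of course, must implement these primitives for distributed, possibly unbounded data; this sketch only shows the shape of the model an SDK has to express.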


* A note on DSLs: I think it’s important to separate out an SDK from a
DSL, because in my mind the former is by definition equivalent to the Beam
model, while the latter may select portions of the model or change the
user-visible abstractions in order to provide a domain-specific experience.
We may want to encourage some DSLs to live separately from Beam because
they may look completely non-Beam-like to their end users. But we can
probably punt this decision until we have concrete examples to discuss.


Another fun part of this growth is that we’ll likely grow new committers.
And given the breadth of Beam, I think it would be useful to annotate our
committers [2] page with which components folks are the most knowledgeable
about.

Looking forward to your thoughts.

[1]
http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201602.mbox/%3CCAAzyFAymVNpjQgZdz2BoMknnE3H9rYRbdnUemamt9Pavw8ugsw%40mail.gmail.com%3E


[2] http://beam.incubator.apache.org/team/



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com



Re: [DISCUSS] Developing new components -- branches, maturity, and committers

2016-05-19 Thread Amit Sela
+1 for Davor's comment on major features being developed in the main
repository.

@Frances: I think the prerequisites you describe for new components are
definitely something to aim for, but I guess the project still has some
maturing to do until we're there. We'll probably have to start by helping
new developers and new features get there (and maybe document a few
example-worthy experiences) - and I'm sure we'll do it well ;)


Re: [DISCUSS] Developing new components -- branches, maturity, and committers

2016-05-19 Thread Aljoscha Krettek
+1 I see that for such things it would make sense.


Re: [PROPOSAL] IRC or slack channel for Apache Beam

2016-05-19 Thread James Malone
Hi all,

It sounds like Slack is the clear winner here. So, I am happy to say that
we now have our own Slack Team, open to all!

http://apachebeam.slack.com

Once I created the Slack team, it rejected the large blanket list of
"acceptable email domains" I wanted to use (so that signup would be
painless). Instead, it looks like we'll have to use an invite system. I've
already modified the team so anyone can invite anyone else (to make it easy
to grow the Beam community). But we will need to manually invite some
people to get this process started.

If you'd like an invite today, please email me at jamesmal...@apache.org
and I will invite you ASAP.

Best,

James

On Thu, May 19, 2016 at 9:36 AM, Milindu Sanoj Kumarage <
agentmili...@gmail.com> wrote:

> +1 for Slack.
> On 19 May 2016 5:43 p.m., "GANESH RAJU"  wrote:
>
> > +1 on slack
> >
> > Ganesh Raju
> >
> > Sent from my iPhone
> >
> > > On May 18, 2016, at 3:41 AM, Jean-Baptiste Onofré 
> > wrote:
> > >
> > > Good point Robert.
> > >
> > > I will be on the channel for sure (I'm already on bunch of Apache IRC
> > channels ;)).
> > >
> > > Regards
> > > JB
> > >
> > >> On 05/18/2016 10:26 AM, Robert Bradshaw wrote:
> > >> The value in such a channel is highly dependent on people regularly
> > >> being there--do we have a critical mass of developers that would hang
> > >> out there? If so, I'd say go for it.
> > >>
> > >>> On Wed, May 18, 2016 at 12:51 AM, Amit Sela 
> > wrote:
> > >>> +1 for Slack
> > >>>
> > >>> On Wed, May 18, 2016 at 10:47 AM Jean-Baptiste Onofré <
> j...@nanthrax.net
> > >
> > >>> wrote:
> > >>>
> >  Hi all,
> > 
> >  What do you think about creating a #apache-beam IRC channel on
> > freenode
> >  ? Or if it's more convenient a channel on Slack ?
> > 
> >  Regards
> >  JB
> >  --
> >  Jean-Baptiste Onofré
> >  jbono...@apache.org
> >  http://blog.nanthrax.net
> >  Talend - http://www.talend.com
> > >
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> >
>


Re: [DISCUSS] Developing new components -- branches, maturity, and committers

2016-05-19 Thread Davor Bonaci
If anybody wants to experiment a little with a feature idea -- absolutely,
individual forked repositories are certainly an awesome place for such
attempts.

However, for something that is a significant undertaking, like a new runner
or new SDK, I think feature branches in the main repository make total
sense. We'd avoid the important disadvantages of forks: lower visibility,
making it harder for others to jump in, comment, and learn, and harder
testing, since Apache Jenkins wouldn't be able to run it automatically.

In summary, I think there's a spectrum of feature complexities and
longevity considerations. As such, I'd support being flexible as
appropriate, but have a default answer of starting with a feature branch in
the main repository for new major components.
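Mechanically, starting such a feature branch is simple. A minimal sketch, run against a throwaway local repository rather than the real one (the branch name `feature/new-runner` is purely illustrative):

```shell
set -e
# Throwaway repository standing in for the main repo.
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=dev -c user.email=dev@example.com \
    commit -q --allow-empty -m "initial commit"
# Start a long-lived feature branch for a hypothetical new runner...
git checkout -q -b feature/new-runner
# ...and confirm we are on it.
git rev-parse --abbrev-ref HEAD
```

In the main repository, committers would push such a branch to origin so that others can collaborate on it and Jenkins can build it.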


Re: Failing Jenkins Runs

2016-05-19 Thread Davor Bonaci
This is a wider problem, not specific to our project, tracked by
INFRA-11878 [1]. Nothing we can do right now.

[1] https://issues.apache.org/jira/browse/INFRA-11878

On Thu, May 19, 2016 at 2:21 AM, Aljoscha Krettek 
wrote:

> Hi,
> on all of the recent PRs Jenkins fails with this message:
> https://builds.apache.org/job/beam_PreCommit_MavenVerify/1213/console
>
> Does anyone have an idea what might be going on? Also, where is Jenkins
> configured? With this I could take a look myself.
>
> -Aljoscha
>


Re: [PROPOSAL] IRC or slack channel for Apache Beam

2016-05-19 Thread Milindu Sanoj Kumarage
+1 for Slack.


Jenkins build is back to normal : beam_Release_NightlySnapshot #46

2016-05-19 Thread Apache Jenkins Server
See 



Re: Fwd: machine learning API, common models

2016-05-19 Thread Jianfeng Qian
Hi,

I am quite interested in this proposal; it is great that it considers a lot
of machine learning projects. Currently, most algorithms in Spark MLlib are
batch processing, while Oryx2 and StreamDM focus on real-time machine
learning. And Flink works with the SAMOA team to integrate stream mining
algorithms, too. So I wonder: is it possible to design a flexible SDK which
allows users to call different third-party packages or their own
algorithms?

Best,
Jianfeng
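One way such a "call different third-party packages or your own algorithms" design could look is a registry of backend implementations behind a common estimator interface. A minimal Python sketch -- every class and method name here is hypothetical, not a proposed Beam API:

```python
from abc import ABC, abstractmethod

class Estimator(ABC):
    """Common interface; third-party packages or user code implement fit()."""
    @abstractmethod
    def fit(self, data):
        ...

class AlgorithmRegistry:
    """Maps algorithm names to implementations, so callers pick a backend."""
    def __init__(self):
        self._impls = {}

    def register(self, name, impl_cls):
        self._impls[name] = impl_cls

    def create(self, name, **params):
        return self._impls[name](**params)

class MeanModel(Estimator):
    """A user-supplied 'algorithm': just remembers the mean of the data."""
    def fit(self, data):
        self.mean = sum(data) / len(data)
        return self

registry = AlgorithmRegistry()
registry.register("mean", MeanModel)  # could equally wrap a third-party impl
model = registry.create("mean").fit([1.0, 2.0, 3.0])
```

The concern raised later in the thread still applies: two backends registered under the same name may differ in semantics and quality, so the interface alone does not guarantee portable results.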

On 2016-05-17 22:01, Suneel Marthi wrote:
> Thanks Simone for pointing this out.
>
> On the Apache Mahout project we have distributed linear algebra with R-like
> semantics that can be executed on Spark/Flink/H2O.
>
> @Kam: the document you point out is old and outdated; the most up-to-date
> reference to the Samsara API is the book "Apache Mahout: Beyond
> MapReduce". (shameless marketing here on behalf of fellow committers :) )
>
> We added Flink DataSet API in the recent Mahout 0.12.0 release (April 11,
> 2016) and has been called out in my talk at ApacheBigData in Vancouver last
> week.
>
> The Mahout community would definitely be interested in being involved with
> this and sharing notes.
>
> IMHO, the focus should be first on building a good linalg foundations
> before embarking on building algos and pipelines. Adding @dlyubimov to this.
>
>
>
> -- Forwarded message --
> From: Simone Robutti 
> Date: Tue, May 17, 2016 at 9:48 AM
> Subject: Fwd: machine learning API, common models
> To: Suneel Marthi 
>
>
>
> -- Forwarded message --
> From: Kavulya, Soila P 
> Date: 2016-05-17 1:53 GMT+02:00
> Subject: RE: machine learning API, common models
> To: "dev@beam.incubator.apache.org" 
>
>
> Thanks Simone,
>
> You have raised a valid concern about how different frameworks will have
> different implementations and parameter semantics for the same algorithm. I
> agree that it is important to keep this in mind. Hopefully, through this
> exercise, we will identify a good set of common ML abstractions across
> different frameworks.
>
> Feel free to edit the document. We had limited the first pass of the
> comparison matrix to the machine learning pipeline APIs, but we can extend
> it to include other ML building blocks like linear algebra operations, and
> APIs for optimizers like gradient descent.
>
> Soila
>
> -Original Message-
> From: Kam Kasravi [mailto:kamkasr...@gmail.com]
> Sent: Monday, May 16, 2016 8:22 AM
> To: dev@beam.incubator.apache.org
> Subject: Re: machine learning API, common models
>
> Thanks Simone - yes I had read your concerns on dev and I think they're
> well founded.
> Thanks for the Samsara reference - I've been looking at the Spark/Scala
> bindings: http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf
>
> I think we should expand the document to include linear algebraic ops, or
> at least pay due diligence to them. If you're doing anything on the Flink
> side in this regard, let us know, or feel free to suggest edits/updates to
> the document.
>
> Thanks
> Kam
>
> On Mon, May 16, 2016 at 6:05 AM, Simone Robutti <
> simone.robu...@radicalbit.io> wrote:
>
>> Hello,
>>
>> I'm Simone and I just began contributing to Flink ML (actually on the
>> distributed linalg part). I already expressed my concerns about the
>> idea of a high-level API relying on specific frameworks' implementations:
>> different implementations produce different results and may vary in
>> quality. Also the semantics of parameters may change from one
>> implementation to the other. This could hinder portability and
>> transparency. I believe these problems could be handled paying the due
>> attention to the details of every single implementation but I invite
>> you not to underestimate these problems.
>>
>> On the other hand the API in itself looks good to me. From my side, I
>> hope to fill some of the gaps in Flink you underlined in the comparison
> matrix.
>> Talking about matrices, proper matrices this time, I believe it would
>> be useful to include in this API support for linear algebra operations.
>> Something similar is already present in Mahout's Samsara and it looks
>> really good but clearly a similar implementation on Beam would be way
>> more interesting and powerful.
>>
>> My 2 cents,
>>
>> Simone
>>
>>
>> 2016-05-14 4:53 GMT+02:00 Kavulya, Soila P :
>>
>>> Hi Tyler,
>>>
>>> Thank you so much for your feedback. I agree that starting with the
>>> high-level API is a good direction. We are interested in Python
>>> because
>> it
>>> is the language that our data scientists are most familiar with. I
>>> think starting with Java would be the best approach, because the
>>> Python API can be a thin wrapper for Java API.
>>>
>>> In Spark, the Scala, Java and Python APIs are identical. Flink does
>>> not have a Python API for ML pipelines at present.
>>>
>>> Could you point me to the updated runner API?
>>>
>>> Soila
>>>
>>> -Original Message-
>>> From: Tyler Akidau [mailto:taki...@google.com.INVALID]
>>> Se
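The thread above lists "APIs for optimizers like gradient descent" among the common ML building blocks being compared. As a bare-bones, purely illustrative sketch of that primitive in Python (not any framework's actual API):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Minimize a function given its gradient by repeated fixed-size steps."""
    x = x0
    for _ in range(steps):
        # Step against the gradient of the objective.
        x = x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

A framework-level API would generalize this over vectors/matrices and distributed data, which is exactly where the shared linear algebra foundation discussed above comes in.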

Re: [PROPOSAL] IRC or slack channel for Apache Beam

2016-05-19 Thread GANESH RAJU
+1 on slack

Ganesh Raju

Sent from my iPhone



Re: [DISCUSS] Developing new components -- branches, maturity, and committers

2016-05-19 Thread Ismaël Mejía
I agree with Aljoscha about not putting the feature branches in the main
repo; however, how can we make people aware of the new developments?

-Ismaël

On Thu, May 19, 2016 at 11:56 AM, Aljoscha Krettek 
wrote:

> +1
>
> When we say feature branch, are we talking about a branch in the main repo?
> I would propose that feature branches live in the repos of the committers
> who are working on a feature.
>
> On Thu, 19 May 2016 at 11:54 Jean-Baptiste Onofré  wrote:
>
> > +1
> >
> > it looks good to me.
> >
> > Regards
> > JB
> >
> > On 05/19/2016 07:01 AM, Frances Perry wrote:
> > > Hi Beamers --
> > >
> > > I’m thrilled by the recent energy and activity on writing new Beam
> > > runners! But that also means it’s probably time for us to figure out
> > > how, as a community, we want to support this process. ;-)
> > >
> > > Back near the beginning, we had a thread [1] discussing that feature
> > > branches are the preferred way of doing development of features or
> > > components that may take a while to reach maturity. I think new
> > > components like runners and SDKs meet the bar to be started from a
> > > feature branch. (Other features, like an IO connector or library of
> > > PTransforms, might also qualify depending on complexity.)
> > >
> > > We should also lay out what it takes to be considered mature enough to
> > > be merged into master, since once that happens the component gets
> > > released to users and failing tests become blocking issues. Here are
> > > some initial thoughts to kick off the discussion...
> > >
> > > In order to be merged into master, new components / major features
> > > should:
> > >
> > > - have at least 2 contributors interested in maintaining it, and 1
> > >   committer interested in supporting it
> > > - provide both end-user and developer-facing documentation
> > > - have at least a basic level of unit test coverage
> > > - run all existing applicable integration tests with other Beam
> > >   components and create additional tests as appropriate
> > >
> > > In addition...
> > >
> > > A runner should:
> > >
> > > - be able to handle a subset of the model that addresses a significant
> > >   set of use cases (e.g. ‘traditional batch’ or ‘processing time
> > >   streaming’)
> > > - update the capability matrix with the current status
> > >
> > > An SDK* should:
> > >
> > > - provide the ability to construct graphs with all the basic building
> > >   blocks of the model (ParDo, GroupByKey, Window, Trigger, etc.)
> > > - begin fleshing out the common composite transforms (Count, Join,
> > >   etc.) and IO connectors (Text, Kafka, etc.)
> > > - have at least one runner that can execute the complete model (may be
> > >   a direct runner)
> > > - provide integration tests for executing against current and future
> > >   runners
> > >
> > > * A note on DSLs: I think it’s important to separate out an SDK from a
> > > DSL, because in my mind the former is by definition equivalent to the
> > > Beam model, while the latter may select portions of the model or change
> > > the user-visible abstractions in order to provide a domain-specific
> > > experience. We may want to encourage some DSLs to live separately from
> > > Beam because they may look completely non-Beam-like to their end users.
> > > But we can probably punt this decision until we have concrete examples
> > > to discuss.
> > >
> > > Another fun part of this growth is that we’ll likely grow new
> > > committers. And given the breadth of Beam, I think it would be useful
> > > to annotate our committers [2] page with which components folks are the
> > > most knowledgeable about.
> > >
> > > Looking forward to your thoughts.
> > >
> > > [1]
> > > http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201602.mbox/%3CCAAzyFAymVNpjQgZdz2BoMknnE3H9rYRbdnUemamt9Pavw8ugsw%40mail.gmail.com%3E
> > >
> > > [2] http://beam.incubator.apache.org/team/
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Re: [DISCUSS] Developing new components -- branches, maturity, and committers

2016-05-19 Thread Aljoscha Krettek
+1

When we say feature branch, are we talking about a branch in the main repo?
I would propose that feature branches live in the repos of the committers
who are working on a feature.

On Thu, 19 May 2016 at 11:54 Jean-Baptiste Onofré  wrote:

> +1
>
> it looks good to me.
>
> Regards
> JB
>
> On 05/19/2016 07:01 AM, Frances Perry wrote:
> > Hi Beamers --
> >
> > I’m thrilled by the recent energy and activity on writing new Beam
> > runners! But that also means it’s probably time for us to figure out
> > how, as a community, we want to support this process. ;-)
> >
> > Back near the beginning, we had a thread [1] discussing that feature
> > branches are the preferred way of doing development of features or
> > components that may take a while to reach maturity. I think new
> > components like runners and SDKs meet the bar to be started from a
> > feature branch. (Other features, like an IO connector or library of
> > PTransforms, might also qualify depending on complexity.)
> >
> > We should also lay out what it takes to be considered mature enough to
> > be merged into master, since once that happens the component gets
> > released to users and failing tests become blocking issues. Here are
> > some initial thoughts to kick off the discussion...
> >
> > In order to be merged into master, new components / major features
> > should:
> >
> > - have at least 2 contributors interested in maintaining it, and 1
> >   committer interested in supporting it
> > - provide both end-user and developer-facing documentation
> > - have at least a basic level of unit test coverage
> > - run all existing applicable integration tests with other Beam
> >   components and create additional tests as appropriate
> >
> > In addition...
> >
> > A runner should:
> >
> > - be able to handle a subset of the model that addresses a significant
> >   set of use cases (e.g. ‘traditional batch’ or ‘processing time
> >   streaming’)
> > - update the capability matrix with the current status
> >
> > An SDK* should:
> >
> > - provide the ability to construct graphs with all the basic building
> >   blocks of the model (ParDo, GroupByKey, Window, Trigger, etc.)
> > - begin fleshing out the common composite transforms (Count, Join,
> >   etc.) and IO connectors (Text, Kafka, etc.)
> > - have at least one runner that can execute the complete model (may be
> >   a direct runner)
> > - provide integration tests for executing against current and future
> >   runners
> >
> > * A note on DSLs: I think it’s important to separate out an SDK from a
> > DSL, because in my mind the former is by definition equivalent to the
> > Beam model, while the latter may select portions of the model or change
> > the user-visible abstractions in order to provide a domain-specific
> > experience. We may want to encourage some DSLs to live separately from
> > Beam because they may look completely non-Beam-like to their end users.
> > But we can probably punt this decision until we have concrete examples
> > to discuss.
> >
> > Another fun part of this growth is that we’ll likely grow new
> > committers. And given the breadth of Beam, I think it would be useful
> > to annotate our committers [2] page with which components folks are the
> > most knowledgeable about.
> >
> > Looking forward to your thoughts.
> >
> > [1]
> > http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201602.mbox/%3CCAAzyFAymVNpjQgZdz2BoMknnE3H9rYRbdnUemamt9Pavw8ugsw%40mail.gmail.com%3E
> >
> > [2] http://beam.incubator.apache.org/team/
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Dynamic work rebalancing for Beam

2016-05-19 Thread Jean-Baptiste Onofré

Hi Dan,

very very interesting ! Thanks for sharing.

Regards
JB

On 05/19/2016 07:09 AM, Dan Halperin wrote:

Hey folks,

This morning, my colleagues Eugene & Malo posted *No shard left behind:
dynamic work rebalancing in Google Cloud Dataflow
<https://cloud.google.com/blog/big-data/2016/05/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow>*.
This article discusses Cloud Dataflow’s solution to the well-known
straggler problem.

In a large batch processing job with many tasks executing in parallel, some
of the tasks – the stragglers – can take a much longer time to complete
than others, perhaps due to imperfect splitting of the work into parallel
chunks when issuing the job. Typically, waiting for stragglers means that
the overall job completes later than it should, and may also reserve too
many machines that may be underutilized at the end. Cloud Dataflow’s
dynamic work rebalancing can mitigate stragglers in most cases.

What I’d like to highlight for the Apache Beam (incubating) community is
that Cloud Dataflow’s dynamic work rebalancing is implemented using
*runner-specific* control logic on top of Beam’s *runner-independent*
BoundedSource API
<https://github.com/apache/incubator-beam/blob/9fa97fb2491bc784df53fb0f044409dbbc2af3d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedSource.java>.
Specifically, to steal work from a straggler, a runner need only call the
reader’s splitAtFraction method. This will generate a new source containing
leftover work, and then the runner can pass that source off to another idle
worker. As Beam matures, I hope that other runners are interested in
figuring out whether these APIs can help them improve performance,
implementing dynamic work rebalancing, and collaborating on API changes
that will help solve other pain points.

Dan

(Also posted on Beam blog:
http://beam.incubator.apache.org/blog/2016/05/18/splitAtFraction-method.html
)



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
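The split mechanism described in the message above can be illustrated with a small, dependency-free sketch. This is not Beam's actual BoundedSource API (that lives in org.apache.beam.sdk.io.BoundedSource); the class names and method shapes below are simplified assumptions for illustration only: a reader over an offset range that lets a runner carve off the unread tail of a straggler's work and hand it to an idle worker.

```java
/**
 * A dependency-free sketch of dynamic splitting via splitAtFraction.
 * Simplified for illustration; not Beam's real BoundedSource API.
 */
public class SplitSketch {

    /** A source over the half-open offset range [start, end). */
    static class OffsetRangeSource {
        final long start, end;
        OffsetRangeSource(long start, long end) { this.start = start; this.end = end; }
    }

    /** A reader that tracks its position and supports dynamic splitting. */
    static class OffsetRangeReader {
        private final long start;
        private long end;       // exclusive; shrinks when a split succeeds
        private long position;  // next offset to be read

        OffsetRangeReader(OffsetRangeSource source) {
            this.start = source.start;
            this.end = source.end;
            this.position = source.start;
        }

        /** Consume one offset; returns false when the (possibly shrunk) range is done. */
        boolean advance() {
            if (position >= end) return false;
            position++;
            return true;
        }

        /**
         * Try to split off the trailing part of the remaining range at the
         * given fraction. Returns the residual source (for an idle worker),
         * or null if the reader has already passed the proposed split point.
         */
        OffsetRangeSource splitAtFraction(double fraction) {
            long splitOffset = start + (long) (fraction * (end - start));
            if (splitOffset <= position || splitOffset >= end) {
                return null;  // too late (or nothing left) to split here
            }
            OffsetRangeSource residual = new OffsetRangeSource(splitOffset, end);
            end = splitOffset;  // primary keeps [start, splitOffset)
            return residual;
        }
    }

    public static void main(String[] args) {
        OffsetRangeReader reader = new OffsetRangeReader(new OffsetRangeSource(0, 100));
        for (int i = 0; i < 30; i++) reader.advance();  // straggler has read 30 of 100

        // Runner steals the second half of the original range.
        OffsetRangeSource residual = reader.splitAtFraction(0.5);
        assert residual != null && residual.start == 50 && residual.end == 100;
        assert reader.end == 50;  // primary now only owns [0, 50)

        // Splitting behind the current position fails gracefully.
        assert reader.splitAtFraction(0.1) == null;
        System.out.println("primary=[0," + reader.end + ") residual=["
            + residual.start + "," + residual.end + ")");
    }
}
```

The key property, as in the blog post, is that the split is cheap and safe: the primary reader only ever gives up work it has not yet started, so no element is read twice or dropped.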


Re: [DISCUSS] Developing new components -- branches, maturity, and committers

2016-05-19 Thread Jean-Baptiste Onofré

+1

it looks good to me.

Regards
JB

On 05/19/2016 07:01 AM, Frances Perry wrote:

Hi Beamers --

I’m thrilled by the recent energy and activity on writing new Beam runners!
But that also means it’s probably time for us to figure out how, as a
community, we want to support this process. ;-)

Back near the beginning, we had a thread [1] discussing that feature
branches are the preferred way of doing development of features or
components that may take a while to reach maturity. I think new components
like runners and SDKs meet the bar to be started from a feature branch.
(Other features, like an IO connector or library of PTransforms, might also
qualify depending on complexity.)

We should also lay out what it takes to be considered mature enough to be
merged into master, since once that happens the component gets released to
users and failing tests become blocking issues. Here are some initial
thoughts to kick off the discussion...

In order to be merged into master, new components / major features should:

- have at least 2 contributors interested in maintaining it, and 1
  committer interested in supporting it
- provide both end-user and developer-facing documentation
- have at least a basic level of unit test coverage
- run all existing applicable integration tests with other Beam components
  and create additional tests as appropriate


In addition...

A runner should:

- be able to handle a subset of the model that addresses a significant set
  of use cases (e.g. ‘traditional batch’ or ‘processing time streaming’)
- update the capability matrix with the current status


An SDK* should:

- provide the ability to construct graphs with all the basic building
  blocks of the model (ParDo, GroupByKey, Window, Trigger, etc.)
- begin fleshing out the common composite transforms (Count, Join, etc.)
  and IO connectors (Text, Kafka, etc.)
- have at least one runner that can execute the complete model (may be a
  direct runner)
- provide integration tests for executing against current and future
  runners
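As a rough illustration of what "the basic building blocks" means in practice, here is a toy, dependency-free sketch of two of the primitives (ParDo and GroupByKey) composed into a word-count-style computation. The names and signatures are hypothetical simplifications for this sketch, not Beam's real API (which lives in org.apache.beam.sdk.transforms and operates on deferred PCollections rather than in-memory lists).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Toy sketch of two model primitives an SDK must expose; illustrative only. */
public class PrimitivesSketch {

    /** ParDo: apply a user function to each element; each element may emit 0..n outputs. */
    static <I, O> List<O> parDo(List<I> input, Function<I, List<O>> fn) {
        return input.stream()
            .flatMap(e -> fn.apply(e).stream())
            .collect(Collectors.toList());
    }

    /** GroupByKey: collect all values sharing a key, preserving first-seen key order. */
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> input) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> kv : input) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Word count as a composition of the two primitives.
        List<String> lines = Arrays.asList("to be", "or not to be");
        List<Map.Entry<String, Integer>> ones = parDo(lines,
            line -> Arrays.stream(line.split(" "))
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList()));
        Map<String, List<Integer>> grouped = groupByKey(ones);
        assert grouped.get("to").size() == 2;
        assert grouped.get("or").size() == 1;
        System.out.println(grouped);
    }
}
```

An SDK meeting the bar above must let users express exactly this kind of graph (plus Window and Trigger), while leaving execution to a runner.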


* A note on DSLs: I think it’s important to separate out an SDK from a
DSL, because in my mind the former is by definition equivalent to the Beam
model, while the latter may select portions of the model or change the
user-visible abstractions in order to provide a domain-specific experience.
We may want to encourage some DSLs to live separately from Beam because
they may look completely non-Beam-like to their end users. But we can
probably punt this decision until we have concrete examples to discuss.

Another fun part of this growth is that we’ll likely grow new committers.
And given the breadth of Beam, I think it would be useful to annotate our
committers [2] page with which components folks are the most knowledgeable
about.

Looking forward to your thoughts.

[1]
http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201602.mbox/%3CCAAzyFAymVNpjQgZdz2BoMknnE3H9rYRbdnUemamt9Pavw8ugsw%40mail.gmail.com%3E

[2] http://beam.incubator.apache.org/team/



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Dynamic work rebalancing for Beam

2016-05-19 Thread Aljoscha Krettek
Interesting read, thanks for the link!

On Thu, 19 May 2016 at 07:09 Dan Halperin 
wrote:

> Hey folks,
>
> This morning, my colleagues Eugene & Malo posted *No shard left behind:
> dynamic work rebalancing in Google Cloud Dataflow
> <
> https://cloud.google.com/blog/big-data/2016/05/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow
> >*.
> This article discusses Cloud Dataflow’s solution to the well-known
> straggler problem.
>
> In a large batch processing job with many tasks executing in parallel, some
> of the tasks – the stragglers – can take a much longer time to complete
> than others, perhaps due to imperfect splitting of the work into parallel
> chunks when issuing the job. Typically, waiting for stragglers means that
> the overall job completes later than it should, and may also reserve too
> many machines that may be underutilized at the end. Cloud Dataflow’s
> dynamic work rebalancing can mitigate stragglers in most cases.
>
> What I’d like to highlight for the Apache Beam (incubating) community is
> that Cloud Dataflow’s dynamic work rebalancing is implemented using
> *runner-specific* control logic on top of Beam’s *runner-independent*
> BoundedSource
> API
> <
> https://github.com/apache/incubator-beam/blob/9fa97fb2491bc784df53fb0f044409dbbc2af3d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedSource.java
> >.
> Specifically, to steal work from a straggler, a runner need only call the
> reader’s splitAtFraction method. This will generate a new source containing
> leftover work, and then the runner can pass that source off to another idle
> worker. As Beam matures, I hope that other runners are interested in
> figuring out whether these APIs can help them improve performance,
> implementing dynamic work rebalancing, and collaborating on API changes
> that will help solve other pain points.
>
> Dan
>
> (Also posted on Beam blog:
>
> http://beam.incubator.apache.org/blog/2016/05/18/splitAtFraction-method.html
> )
>


Failing Jenkins Runs

2016-05-19 Thread Aljoscha Krettek
Hi,
on all of the recent PRs Jenkins fails with this message:
https://builds.apache.org/job/beam_PreCommit_MavenVerify/1213/console

Does anyone have an idea what might be going on? Also, where is Jenkins
configured? Knowing that, I could take a look myself.

-Aljoscha