Re: [VOTE] groupId/artifactId naming & layout

2016-06-03 Thread Jean-Baptiste Onofré

Actually, the purpose of the vote is to reach a consensus.

We have two options expressed on the mailing list: the current "layout" 
is good IMHO, but not everyone agrees. So, let's put things on the table and 
move forward. The vote is a way of discussing. It's not a vote for the 
release; it's a vote/discussion for the layout and Maven coordinates (so 
not a formal vote).


Just a reminder: everything should be discussed and announced on the mailing list.

Regards
JB

On 06/03/2016 06:50 PM, Davor Bonaci wrote:

This is not a great vote proposal for several reasons:
* "Use the current layout" is ambiguous, because it is inconsistent (it is
now partly flat and partly hierarchical).
* Getting the outcome won't move us much closer to the resolution, given
that there are several sub-variants in each option.
* We have not laid out advantages, disadvantages, and consequences of each
option for everyone to make an informed decision.
* It is premature: we haven't tried to reach a consensus or explored
alternatives. 3 hours and just a few emails is way too short from an issue
being raised to a vote call.

I'd suggest trying to find a consensus on the original thread first, and
call for a vote if/when needed.

On Fri, Jun 3, 2016 at 5:15 AM, Amit Sela  wrote:


+1 for Option2

On Fri, Jun 3, 2016 at 2:09 PM Jean-Baptiste Onofré 
wrote:


As mentioned in my previous e-mail, I just proposed PR #416.

Let's start a vote for groupId and artifactId naming.

[ ] Option1: use the current layout (multiple groupId, artifactId
relative to groupId)
[ ] Option2: use unique org.apache.beam groupId and rename artifactId
with a prefix (beam-parent/apache-beam, flink-runner, spark-runner, etc)
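
For concreteness, a dependency on the Flink runner would look roughly like
this under each option (coordinates are illustrative sketches only, not
final values):

```
<!-- Option1 (sketch): hierarchical groupIds, short artifactIds -->
<dependency>
  <groupId>org.apache.beam.runners</groupId>
  <artifactId>flink</artifactId>
</dependency>

<!-- Option2 (sketch): single groupId, prefixed artifactIds -->
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>flink-runner</artifactId>
</dependency>
```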

Regards
JB

On 06/03/2016 01:03 PM, Jean-Baptiste Onofré wrote:

Hi Max,

I discussed with Davor yesterday. Basically, I proposed:

1. Rename all parents with a prefix (beam-parent, flink-runner-parent,
spark-runner-parent, etc).
2. For the groupId, I prefer to use different groupIds; it's clearer to
me, and it's exactly what groupIds are for. Some projects use a
single groupId (spark, hadoop, etc), others use multiple (camel, karaf,
activemq, etc). I prefer different groupIds, but I'm OK going back to a
single one.

Anyway, I'm preparing a PR to introduce a new Maven module:
"distribution". The purpose is to address both BEAM-319 (first) and
BEAM-320 (later). It's where we will be able to define the different
distributions we plan to publish (source and binaries).

Regards
JB

On 06/03/2016 11:02 AM, Maximilian Michels wrote:

Thanks for getting us ready for the first release, Davor! We would
like to fix BEAM-315 next week. Is there already a timeline for the
first release? If so, we could also address this in a minor release.
Releasing often will give us some experience with our release process
:)

I would like everyone to think about the artifact names and group ids
again. "parent" and "flink" are not very suitable names for the Beam
parent or the Flink Runner artifact (same goes for the Spark Runner).
I'd prefer "beam-parent", "flink-runner", and "spark-runner" as
artifact ids.

One might think of Maven GroupIds as a sort of hierarchy but they're
not. They're just an identifier. Renaming the parent pom to
"apache-beam" or "beam-parent" would give us the old naming scheme
which used flat group ids (before [1]).

In the end, I guess it doesn't matter too much if we document the
naming schemes accordingly. What matters is that we use a consistent
naming scheme.

Cheers,
Max

[1] https://issues.apache.org/jira/browse/BEAM-287


On Thu, Jun 2, 2016 at 4:00 PM, Jean-Baptiste Onofré 

Re: Apache Beam for Python

2016-06-03 Thread Amit Sela
Welcome Python people ;)

I know a few people who've been waiting for this one!

On Fri, Jun 3, 2016, 19:53 Davor Bonaci  wrote:

> Welcome Python SDK, as well as Silviu, Charles, Ahmet and Chamikara!
>
> On Fri, Jun 3, 2016 at 7:07 AM, Jean-Baptiste Onofré 
> wrote:
>
> > Absolutely ;)
> >
> >
> > On 06/03/2016 03:51 PM, James Malone wrote:
> >
> >> Hey Silviu!
> >>
> >> I think JB is proposing we create a python directory in the sdks
> directory
> >> in the root repository (and modify the configuration files accordingly):
> >>
> >> https://github.com/apache/incubator-beam/tree/master/sdks
> >>
> >> This Beam document here titled "Apache Beam (Incubating): Repository
> >> Structure" details the proposed repository structure and may be useful:
> >>
> >>
> >>
> >>
> https://drive.google.com/a/google.com/folderview?id=0B-IhJZh9Ab52OFBVZHpsNjc4eXc
> >>
> >> Best,
> >>
> >> James
> >>
> >>
> >>
> >> On Fri, Jun 3, 2016 at 6:34 AM, Silviu Calinoiu
> >> 
> >> wrote:
> >>
> >> Hi JB,
> >>> Thanks for the welcome! I come from Python land, so I am not quite
> >>> familiar with Maven. What do you mean by a Maven module? You mean an
> >>> artifact so you can install things? In Python, people are used to
> >>> packages downloaded from PyPI (pypi.python.org -- which is sort of
> >>> Maven for Python). Whatever the standard way of doing things in Apache
> >>> is, we'll do it. Just asking for clarifications.
> >>>
> >>> By the way this discussion is very useful since we will have to iron
> out
> >>> several details like this.
> >>> Thanks,
> >>> Silviu
> >>>
> >>> On Fri, Jun 3, 2016 at 6:19 AM, Jean-Baptiste Onofré 
> >>> wrote:
> >>>
> >>> Hi Silviu,
> 
>  thanks for the detailed update and great work!
> 
>  I would advise creating a:
> 
>  sdks/python
> 
>  Maven module to store the Python SDK.
> 
>  WDYT ?
> 
>  By the way, welcome aboard and great to have you all in the team!
> 
>  Regards
>  JB
> 
>  On 06/03/2016 03:13 PM, Silviu Calinoiu wrote:
> 
> > Hi all,
> >
> > My name is Silviu Calinoiu and I am a member of the Cloud Dataflow team
> > working on the Python SDK. As the original Beam proposal (
> > https://wiki.apache.org/incubator/BeamProposal) mentioned, we have been
> > planning to merge the Python SDK into Beam. The Python SDK is in an early
> > stage of development (alpha milestone) and so this is a good time to move
> > the code without causing too much disruption to our customers.
> > Additionally, this enables the Beam community to contribute as soon as
> > possible.
> >
> > The current state of the SDK is as follows:
> >
> >  - Open-sourced at https://github.com/GoogleCloudPlatform/DataflowPythonSDK/
> >  - Model: All main concepts are present.
> >  - I/O: SDK supports text (Google Cloud Storage) and BigQuery connectors
> >    and has a framework for adding additional sources and sinks.
> >  - Runners: SDK has two pipeline runners: direct runner (in process, local
> >    execution) and Cloud Dataflow runner for batch pipelines (submit job to
> >    Google Dataflow service). The current direct runner is bounded only
> >    (batch execution) but there is work in progress to support unbounded
> >    (as in Java).
> >  - Testing: The code base has unit test coverage for all the modules and
> >    several integration and end to end tests (similar in coverage to the
> >    Java SDK). Streaming is not well tested end to end yet since Cloud
> >    Dataflow focused first on batch.
> >  - Docs: We have matching Python documentation for the features currently
> >    supported by Cloud Dataflow. The docs are on cloud.google.com (access
> >    only by whitelist due to the alpha stage of the project). Devin is
> >    working on the transition of all docs to Apache.
> >
> > In the next days/weeks we would like to prepare and start migrating the
> > code and you should start seeing some pull requests. We also hope that
> > the Beam community will shape the SDK going forward. In particular, all
> > the model improvements implemented for Java (Runner API, etc.) will have
> > equivalents in Python once they stabilize. If you have any advice before
> > we start the journey please let us know.
> >
> > The team that will 

Re: 0.1.0-incubating release

2016-06-03 Thread Thomas Weise
Another consideration for potential future packaging/distribution solutions
is how the artifacts line up as files in a flat directory. For that it may
be good to have artifactIds that are unique and share a common prefix.

The name for the source archive (when relying on ASF parent POM) can also
be controlled without expanding the artifactId:

  <build>
    <plugins>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <finalName>apache-beam</finalName>
        </configuration>
      </plugin>
    </plugins>
  </build>

Thanks,
Thomas

On Fri, Jun 3, 2016 at 9:39 AM, Davor Bonaci 
wrote:

> BEAM-315 is definitely important. Normally, I'd always advocate for holding
> the release to pick that fix. For the very first release, however, I'd
> prefer to proceed to get something out there and test the process. As you
> said, we can address this rather quickly once we have the fix merged in.
>
> In terms of Maven coordinates, there are two basic approaches:
> * flat structure, where artifacts live under "org.apache.beam" group and
> are differentiated by their artifact id.
> * hierarchical structure, where we use different groups for different types
> of artifacts (org.apache.beam.sdks; org.apache.beam.runners).
>
> There are pros and cons on both sides of the argument. Different
> projects made different choices. Flat structure is easier to find and
> navigate, but often breaks down with too many artifacts. Hierarchical
> structure is just the opposite.
>
> On my end, the only important thing is consistency. We used to have it, and
> it got broken by PR #365. This part should be fixed -- we should either
> finish the vision of the hierarchical structure, or roll back that PR to get
> back to a fully flat structure.
>
> My general biases tend to be:
> * hierarchical structure, since we have many artifacts already.
> * short identifiers; no need to repeat a part of the group id in the
> artifact id.
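
To make the second bias concrete, hypothetical coordinates:

```
<!-- short artifactId under a hierarchical groupId (hypothetical): -->
<groupId>org.apache.beam.runners</groupId>
<artifactId>spark</artifactId>

<!-- rather than repeating part of the groupId in the artifactId: -->
<groupId>org.apache.beam.runners</groupId>
<artifactId>spark-runner</artifactId>
```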
>
> On Fri, Jun 3, 2016 at 4:03 AM, Jean-Baptiste Onofré 
> wrote:
>
> > Hi Max,
> >
> > I discussed with Davor yesterday. Basically, I proposed:
> >
> > 1. Rename all parents with a prefix (beam-parent, flink-runner-parent,
> > spark-runner-parent, etc).
> > 2. For the groupId, I prefer to use different groupIds; it's clearer to
> > me, and it's exactly what groupIds are for. Some projects use a single
> > groupId (spark, hadoop, etc), others use multiple (camel, karaf, activemq,
> > etc). I prefer different groupIds, but I'm OK going back to a single one.
> >
> > Anyway, I'm preparing a PR to introduce a new Maven module:
> > "distribution". The purpose is to address both BEAM-319 (first) and
> > BEAM-320 (later). It's where we will be able to define the different
> > distributions we plan to publish (source and binaries).
> >
> > Regards
> > JB
> >
> >
> > On 06/03/2016 11:02 AM, Maximilian Michels wrote:
> >
> >> Thanks for getting us ready for the first release, Davor! We would
> >> like to fix BEAM-315 next week. Is there already a timeline for the
> >> first release? If so, we could also address this in a minor release.
> >> Releasing often will give us some experience with our release process
> >> :)
> >>
> >> I would like everyone to think about the artifact names and group ids
> >> again. "parent" and "flink" are not very suitable names for the Beam
> >> parent or the Flink Runner artifact (same goes for the Spark Runner).
> >> I'd prefer "beam-parent", "flink-runner", and "spark-runner" as
> >> artifact ids.
> >>
> >> One might think of Maven GroupIds as a sort of hierarchy but they're
> >> not. They're just an identifier. Renaming the parent pom to
> >> "apache-beam" or "beam-parent" would give us the old naming scheme
> >> which used flat group ids (before [1]).
> >>
> >> In the end, I guess it doesn't matter too much if we document the
> >> naming schemes accordingly. What matters is that we use a consistent
> >> naming scheme.
> >>
> >> Cheers,
> >> Max
> >>
> >> [1] https://issues.apache.org/jira/browse/BEAM-287
> >>
> >>
> >> On Thu, Jun 2, 2016 at 4:00 PM, Jean-Baptiste Onofré 
> >> wrote:
> >>
> >>> Actually, I think we can fix both issues in one commit.
> >>>
> >>> What do you think about renaming the main parent POM with:
> >>> groupId: org.apache.beam
> >>> artifactId: apache-beam
> >>>
> >>> ?
> >>>
> >>> Thanks to that, the source distribution will be named
> >>> apache-beam-xxx-sources.zip and it would be clearer to developers.
> >>>
> >>> Thoughts ?
> >>>
> >>> Regards
> >>> JB
> >>>
> >>>
> >>> On 06/02/2016 03:10 PM, Jean-Baptiste Onofré wrote:
> >>>
> 
>  Another annoying thing is the main parent POM artifactId.
> 
>  Now, it's just "parent". What do you think about renaming to
>  "beam-parent" ?
> 
>  Regarding the source distribution name, I would cancel this staging to
>  fix that (I will have a PR ready soon).
> 
>  Thoughts ?
> 
>  Regards
>  JB
> 
>  On 06/02/2016 03:46 AM, Davor Bonaci wrote:
> 
> 

Re: [VOTE] groupId/artifactId naming & layout

2016-06-03 Thread Davor Bonaci
This is not a great vote proposal for several reasons:
* "Use the current layout" is ambiguous, because it is inconsistent (it is
now partly flat and partly hierarchical).
* Getting the outcome won't move us much closer to the resolution, given
that there are several sub-variants in each option.
* We have not laid out advantages, disadvantages, and consequences of each
option for everyone to make an informed decision.
* It is premature: we haven't tried to reach a consensus or explored
alternatives. 3 hours and just a few emails is way too short from an issue
being raised to a vote call.

I'd suggest trying to find a consensus on the original thread first, and
call for a vote if/when needed.

On Fri, Jun 3, 2016 at 5:15 AM, Amit Sela  wrote:

> +1 for Option2
>
> On Fri, Jun 3, 2016 at 2:09 PM Jean-Baptiste Onofré 
> wrote:
>
> > As mentioned in my previous e-mail, I just proposed PR #416.
> >
> > Let's start a vote for groupId and artifactId naming.
> >
> > [ ] Option1: use the current layout (multiple groupId, artifactId
> > relative to groupId)
> > [ ] Option2: use unique org.apache.beam groupId and rename artifactId
> > with a prefix (beam-parent/apache-beam, flink-runner, spark-runner, etc)
> >
> > Regards
> > JB
> >
> > On 06/03/2016 01:03 PM, Jean-Baptiste Onofré wrote:
> > > Hi Max,
> > >
> > > I discussed with Davor yesterday. Basically, I proposed:
> > >
> > > 1. Rename all parents with a prefix (beam-parent, flink-runner-parent,
> > > spark-runner-parent, etc).
> > > 2. For the groupId, I prefer to use different groupIds; it's clearer to
> > > me, and it's exactly what groupIds are for. Some projects use a
> > > single groupId (spark, hadoop, etc), others use multiple (camel, karaf,
> > > activemq, etc). I prefer different groupIds, but I'm OK going back to a
> > > single one.
> > >
> > > Anyway, I'm preparing a PR to introduce a new Maven module:
> > > "distribution". The purpose is to address both BEAM-319 (first) and
> > > BEAM-320 (later). It's where we will be able to define the different
> > > distributions we plan to publish (source and binaries).
> > >
> > > Regards
> > > JB
> > >
> > > On 06/03/2016 11:02 AM, Maximilian Michels wrote:
> > >> Thanks for getting us ready for the first release, Davor! We would
> > >> like to fix BEAM-315 next week. Is there already a timeline for the
> > >> first release? If so, we could also address this in a minor release.
> > >> Releasing often will give us some experience with our release process
> > >> :)
> > >>
> > >> I would like everyone to think about the artifact names and group ids
> > >> again. "parent" and "flink" are not very suitable names for the Beam
> > >> parent or the Flink Runner artifact (same goes for the Spark Runner).
> > >> I'd prefer "beam-parent", "flink-runner", and "spark-runner" as
> > >> artifact ids.
> > >>
> > >> One might think of Maven GroupIds as a sort of hierarchy but they're
> > >> not. They're just an identifier. Renaming the parent pom to
> > >> "apache-beam" or "beam-parent" would give us the old naming scheme
> > >> which used flat group ids (before [1]).
> > >>
> > >> In the end, I guess it doesn't matter too much if we document the
> > >> naming schemes accordingly. What matters is that we use a consistent
> > >> naming scheme.
> > >>
> > >> Cheers,
> > >> Max
> > >>
> > >> [1] https://issues.apache.org/jira/browse/BEAM-287
> > >>
> > >>
> > >> On Thu, Jun 2, 2016 at 4:00 PM, Jean-Baptiste Onofré  >
> > >> wrote:
> > >>> Actually, I think we can fix both issues in one commit.
> > >>>
> > >>> What do you think about renaming the main parent POM with:
> > >>> groupId: org.apache.beam
> > >>> artifactId: apache-beam
> > >>>
> > >>> ?
> > >>>
> > >>> Thanks to that, the source distribution will be named
> > >>> apache-beam-xxx-sources.zip and it would be clearer to developers.
> > >>>
> > >>> Thoughts ?
> > >>>
> > >>> Regards
> > >>> JB
> > >>>
> > >>>
> > >>> On 06/02/2016 03:10 PM, Jean-Baptiste Onofré wrote:
> > 
> >  Another annoying thing is the main parent POM artifactId.
> > 
> >  Now, it's just "parent". What do you think about renaming to
> >  "beam-parent" ?
> > 
> >  Regarding the source distribution name, I would cancel this staging
> to
> >  fix that (I will have a PR ready soon).
> > 
> >  Thoughts ?
> > 
> >  Regards
> >  JB
> > 
> >  On 06/02/2016 03:46 AM, Davor Bonaci wrote:
> > >
> > > Hi everyone!
> > > We've started the release process for our first release,
> > > 0.1.0-incubating.
> > >
> > > To recap previous discussions, we don't have particular functional
> > > goals
> > > for this release. Instead, we'd like to make available what's
> > > currently in
> > > the repository, as well as work through the release process.
> > >
> > > With this in mind, we've:
> > > * branched off the release branch [1] at master's commit 8485272,
> > > * 

Re: Apache Beam for Python

2016-06-03 Thread Jean-Baptiste Onofré

Absolutely ;)

On 06/03/2016 03:51 PM, James Malone wrote:

Hey Silviu!

I think JB is proposing we create a python directory in the sdks directory
in the root repository (and modify the configuration files accordingly):

https://github.com/apache/incubator-beam/tree/master/sdks

This Beam document here titled "Apache Beam (Incubating): Repository
Structure" details the proposed repository structure and may be useful:


https://drive.google.com/a/google.com/folderview?id=0B-IhJZh9Ab52OFBVZHpsNjc4eXc

Best,

James



On Fri, Jun 3, 2016 at 6:34 AM, Silviu Calinoiu 
wrote:


Hi JB,
Thanks for the welcome! I come from Python land, so I am not quite
familiar with Maven. What do you mean by a Maven module? You mean an
artifact so you can install things? In Python, people are used to packages
downloaded from PyPI (pypi.python.org -- which is sort of Maven for
Python). Whatever the standard way of doing things in Apache is, we'll do
it. Just asking for clarifications.

By the way this discussion is very useful since we will have to iron out
several details like this.
Thanks,
Silviu

On Fri, Jun 3, 2016 at 6:19 AM, Jean-Baptiste Onofré 
wrote:


Hi Silviu,

thanks for the detailed update and great work!

I would advise creating a:

sdks/python

Maven module to store the Python SDK.

WDYT ?

By the way, welcome aboard and great to have you all in the team!

Regards
JB

On 06/03/2016 03:13 PM, Silviu Calinoiu wrote:


Hi all,

My name is Silviu Calinoiu and I am a member of the Cloud Dataflow team
working on the Python SDK.  As the original Beam proposal (
https://wiki.apache.org/incubator/BeamProposal) mentioned, we have been
planning to merge the Python SDK into Beam. The Python SDK is in an early
stage of development (alpha milestone) and so this is a good time to move
the code without causing too much disruption to our customers.
Additionally, this enables the Beam community to contribute as soon as
possible.

The current state of the SDK is as follows:

 - Open-sourced at https://github.com/GoogleCloudPlatform/DataflowPythonSDK/
 - Model: All main concepts are present.
 - I/O: SDK supports text (Google Cloud Storage) and BigQuery connectors
   and has a framework for adding additional sources and sinks.
 - Runners: SDK has two pipeline runners: direct runner (in process, local
   execution) and Cloud Dataflow runner for batch pipelines (submit job to
   Google Dataflow service). The current direct runner is bounded only
   (batch execution) but there is work in progress to support unbounded
   (as in Java).
 - Testing: The code base has unit test coverage for all the modules and
   several integration and end to end tests (similar in coverage to the
   Java SDK). Streaming is not well tested end to end yet since Cloud
   Dataflow focused first on batch.
 - Docs: We have matching Python documentation for the features currently
   supported by Cloud Dataflow. The docs are on cloud.google.com (access
   only by whitelist due to the alpha stage of the project). Devin is
   working on the transition of all docs to Apache.


In the next days/weeks we would like to prepare and start migrating the
code and you should start seeing some pull requests. We also hope that the
Beam community will shape the SDK going forward. In particular, all the
model improvements implemented for Java (Runner API, etc.) will have
equivalents in Python once they stabilize. If you have any advice before
we start the journey please let us know.

The team that will join the Beam effort consists of me (Silviu Calinoiu),
Charles Chen, Ahmet Altay, Chamikara Jayalath, and last but not least
Robert Bradshaw (who is already an Apache Beam committer).

So let us know what you think!

Best regards,

Silviu



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com







--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Apache Beam for Python

2016-06-03 Thread Jean-Baptiste Onofré
I'm really proposing just a folder containing the Python SDK, not 
necessarily part of the Maven reactor.
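
Concretely, the root pom.xml would simply not list that directory; a sketch
(module names assumed, not the actual list):

```
<modules>
  <module>sdks/java/core</module>
  <module>runners/flink</module>
  <module>runners/spark</module>
  <!-- sdks/python lives in the repository but is not listed here,
       so it stays outside the Maven reactor -->
</modules>
```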


Regards
JB

On 06/03/2016 03:34 PM, Silviu Calinoiu wrote:

Hi JB,
Thanks for the welcome! I come from Python land, so I am not quite
familiar with Maven. What do you mean by a Maven module? You mean an
artifact so you can install things? In Python, people are used to packages
downloaded from PyPI (pypi.python.org -- which is sort of Maven for
Python). Whatever the standard way of doing things in Apache is, we'll do
it. Just asking for clarifications.

By the way this discussion is very useful since we will have to iron out
several details like this.
Thanks,
Silviu

On Fri, Jun 3, 2016 at 6:19 AM, Jean-Baptiste Onofré 
wrote:


Hi Silviu,

thanks for the detailed update and great work!

I would advise creating a:

sdks/python

Maven module to store the Python SDK.

WDYT ?

By the way, welcome aboard and great to have you all in the team!

Regards
JB

On 06/03/2016 03:13 PM, Silviu Calinoiu wrote:


Hi all,

My name is Silviu Calinoiu and I am a member of the Cloud Dataflow team
working on the Python SDK.  As the original Beam proposal (
https://wiki.apache.org/incubator/BeamProposal) mentioned, we have been
planning to merge the Python SDK into Beam. The Python SDK is in an early
stage of development (alpha milestone) and so this is a good time to move
the code without causing too much disruption to our customers.
Additionally, this enables the Beam community to contribute as soon as
possible.

The current state of the SDK is as follows:

 - Open-sourced at https://github.com/GoogleCloudPlatform/DataflowPythonSDK/
 - Model: All main concepts are present.
 - I/O: SDK supports text (Google Cloud Storage) and BigQuery connectors
   and has a framework for adding additional sources and sinks.
 - Runners: SDK has two pipeline runners: direct runner (in process, local
   execution) and Cloud Dataflow runner for batch pipelines (submit job to
   Google Dataflow service). The current direct runner is bounded only
   (batch execution) but there is work in progress to support unbounded
   (as in Java).
 - Testing: The code base has unit test coverage for all the modules and
   several integration and end to end tests (similar in coverage to the
   Java SDK). Streaming is not well tested end to end yet since Cloud
   Dataflow focused first on batch.
 - Docs: We have matching Python documentation for the features currently
   supported by Cloud Dataflow. The docs are on cloud.google.com (access
   only by whitelist due to the alpha stage of the project). Devin is
   working on the transition of all docs to Apache.


In the next days/weeks we would like to prepare and start migrating the
code and you should start seeing some pull requests. We also hope that the
Beam community will shape the SDK going forward. In particular, all the
model improvements implemented for Java (Runner API, etc.) will have
equivalents in Python once they stabilize. If you have any advice before
we start the journey please let us know.

The team that will join the Beam effort consists of me (Silviu Calinoiu),
Charles Chen, Ahmet Altay, Chamikara Jayalath, and last but not least
Robert Bradshaw (who is already an Apache Beam committer).

So let us know what you think!

Best regards,

Silviu



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Apache Beam for Python

2016-06-03 Thread James Malone
Hey Silviu!

I think JB is proposing we create a python directory in the sdks directory
in the root repository (and modify the configuration files accordingly):

   https://github.com/apache/incubator-beam/tree/master/sdks

This Beam document here titled "Apache Beam (Incubating): Repository
Structure" details the proposed repository structure and may be useful:


https://drive.google.com/a/google.com/folderview?id=0B-IhJZh9Ab52OFBVZHpsNjc4eXc

Best,

James



On Fri, Jun 3, 2016 at 6:34 AM, Silviu Calinoiu 
wrote:

> Hi JB,
> Thanks for the welcome! I come from Python land, so I am not quite
> familiar with Maven. What do you mean by a Maven module? You mean an
> artifact so you can install things? In Python, people are used to packages
> downloaded from PyPI (pypi.python.org -- which is sort of Maven for
> Python). Whatever the standard way of doing things in Apache is, we'll do
> it. Just asking for clarifications.
>
> By the way this discussion is very useful since we will have to iron out
> several details like this.
> Thanks,
> Silviu
>
> On Fri, Jun 3, 2016 at 6:19 AM, Jean-Baptiste Onofré 
> wrote:
>
> > Hi Silviu,
> >
> > thanks for the detailed update and great work!
> >
> > I would advise creating a:
> >
> > sdks/python
> >
> > Maven module to store the Python SDK.
> >
> > WDYT ?
> >
> > By the way, welcome aboard and great to have you all in the team!
> >
> > Regards
> > JB
> >
> > On 06/03/2016 03:13 PM, Silviu Calinoiu wrote:
> >
> >> Hi all,
> >>
> >> My name is Silviu Calinoiu and I am a member of the Cloud Dataflow team
> >> working on the Python SDK.  As the original Beam proposal (
> >> https://wiki.apache.org/incubator/BeamProposal) mentioned, we have been
> >> planning to merge the Python SDK into Beam. The Python SDK is in an early
> >> stage of development (alpha milestone) and so this is a good time to move
> >> the code without causing too much disruption to our customers.
> >> Additionally, this enables the Beam community to contribute as soon as
> >> possible.
> >>
> >> The current state of the SDK is as follows:
> >>
> >> - Open-sourced at https://github.com/GoogleCloudPlatform/DataflowPythonSDK/
> >> - Model: All main concepts are present.
> >> - I/O: SDK supports text (Google Cloud Storage) and BigQuery connectors
> >>   and has a framework for adding additional sources and sinks.
> >> - Runners: SDK has two pipeline runners: direct runner (in process, local
> >>   execution) and Cloud Dataflow runner for batch pipelines (submit job to
> >>   Google Dataflow service). The current direct runner is bounded only
> >>   (batch execution) but there is work in progress to support unbounded
> >>   (as in Java).
> >> - Testing: The code base has unit test coverage for all the modules and
> >>   several integration and end to end tests (similar in coverage to the
> >>   Java SDK). Streaming is not well tested end to end yet since Cloud
> >>   Dataflow focused first on batch.
> >> - Docs: We have matching Python documentation for the features currently
> >>   supported by Cloud Dataflow. The docs are on cloud.google.com (access
> >>   only by whitelist due to the alpha stage of the project). Devin is
> >>   working on the transition of all docs to Apache.
> >>
> >> In the next days/weeks we would like to prepare and start migrating the
> >> code and you should start seeing some pull requests. We also hope that the
> >> Beam community will shape the SDK going forward. In particular, all the
> >> model improvements implemented for Java (Runner API, etc.) will have
> >> equivalents in Python once they stabilize. If you have any advice before
> >> we start the journey please let us know.
> >>
> >> The team that will join the Beam effort consists of me (Silviu Calinoiu),
> >> Charles Chen, Ahmet Altay, Chamikara Jayalath, and last but not least
> >> Robert Bradshaw (who is already an Apache Beam committer).
> >>
> >> So let us know what you think!
> >>
> >> Best regards,
> >>
> >> Silviu
> >>
> >>
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Re: [VOTE] groupId/artifactId naming & layout

2016-06-03 Thread Amit Sela
+1 for Option2

On Fri, Jun 3, 2016 at 2:09 PM Jean-Baptiste Onofré  wrote:

> As mentioned in my previous e-mail, I just proposed PR #416.
>
> Let's start a vote for groupId and artifactId naming.
>
> [ ] Option1: use the current layout (multiple groupId, artifactId
> relative to groupId)
> [ ] Option2: use unique org.apache.beam groupId and rename artifactId
> with a prefix (beam-parent/apache-beam, flink-runner, spark-runner, etc)
>
> Regards
> JB
>
> On 06/03/2016 01:03 PM, Jean-Baptiste Onofré wrote:
> > Hi Max,
> >
> > I discussed with Davor yesterday. Basically, I proposed:
> >
> > 1. Rename all parents with a prefix (beam-parent, flink-runner-parent,
> > spark-runner-parent, etc).
> > 2. For the groupId, I prefer to use different groupIds; it's clearer to
> > me, and it's exactly what groupIds are for. Some projects use a
> > single groupId (spark, hadoop, etc), others use multiple (camel, karaf,
> > activemq, etc). I prefer different groupIds, but I'm OK going back to a
> > single one.
> >
> > Anyway, I'm preparing a PR to introduce a new Maven module:
> > "distribution". The purpose is to address both BEAM-319 (first) and
> > BEAM-320 (later). It's where we will be able to define the different
> > distributions we plan to publish (source and binaries).
> >
> > Regards
> > JB
> >
> > On 06/03/2016 11:02 AM, Maximilian Michels wrote:
> >> Thanks for getting us ready for the first release, Davor! We would
> >> like to fix BEAM-315 next week. Is there already a timeline for the
> >> first release? If so, we could also address this in a minor release.
> >> Releasing often will give us some experience with our release process
> >> :)
> >>
> >> I would like everyone to think about the artifact names and group ids
> >> again. "parent" and "flink" are not very suitable names for the Beam
> >> parent or the Flink Runner artifact (same goes for the Spark Runner).
> >> I'd prefer "beam-parent", "flink-runner", and "spark-runner" as
> >> artifact ids.
> >>
> >> One might think of Maven GroupIds as a sort of hierarchy but they're
> >> not. They're just an identifier. Renaming the parent pom to
> >> "apache-beam" or "beam-parent" would give us the old naming scheme
> >> which used flat group ids (before [1]).
> >>
> >> In the end, I guess it doesn't matter too much if we document the
> >> naming schemes accordingly. What matters is that we use a consistent
> >> naming scheme.
> >>
> >> Cheers,
> >> Max
> >>
> >> [1] https://issues.apache.org/jira/browse/BEAM-287
> >>
> >>
> >> On Thu, Jun 2, 2016 at 4:00 PM, Jean-Baptiste Onofré 
> >> wrote:
> >>> Actually, I think we can fix both issues in one commit.
> >>>
> >>> What do you think about renaming the main parent POM with:
> >>> groupId: org.apache.beam
> >>> artifactId: apache-beam
> >>>
> >>> ?
> >>>
> >>> Thanks to that, the source distribution will be named
> >>> apache-beam-xxx-sources.zip and it would be clearer to developers.
> >>>
> >>> Thoughts ?
> >>>
> >>> Regards
> >>> JB
> >>>
> >>>
> >>> On 06/02/2016 03:10 PM, Jean-Baptiste Onofré wrote:
> 
>  Another annoying thing is the main parent POM artifactId.
> 
>  Now, it's just "parent". What do you think about renaming to
>  "beam-parent" ?
> 
>  Regarding the source distribution name, I would cancel this staging to
>  fix that (I will have a PR ready soon).
> 
>  Thoughts ?
> 
>  Regards
>  JB
> 
>  On 06/02/2016 03:46 AM, Davor Bonaci wrote:
> >
> > Hi everyone!
> > We've started the release process for our first release,
> > 0.1.0-incubating.
> >
> > To recap previous discussions, we don't have particular functional
> > goals
> > for this release. Instead, we'd like to make available what's
> > currently in
> > the repository, as well as work through the release process.
> >
> > With this in mind, we've:
> > * branched off the release branch [1] at master's commit 8485272,
> > * updated master to prepare for the second release, 0.2.0-incubating,
> > * built the first release candidate, RC1, and deployed it to a
> staging
> > repository [2].
> >
> > We are not ready to start a vote just yet -- we've already identified
> > a few
> > issues worth fixing. That said, I'd like to invite everybody to take
> a
> > peek
> > and comment. I'm hoping we can address as many issues as possible
> > before we
> > start the voting process.
> >
> > Please let us know if you see any issues.
> >
> > Thanks,
> > Davor
> >
> > [1]
> >
> https://github.com/apache/incubator-beam/tree/release-0.1.0-incubating
> > [2]
> >
> https://repository.apache.org/content/repositories/orgapachebeam-1000/
> >
> 
> >>>
> >>> --
> >>> Jean-Baptiste Onofré
> >>> jbono...@apache.org
> >>> http://blog.nanthrax.net
> >>> Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> 

Re: 0.1.0-incubating release

2016-06-03 Thread Jean-Baptiste Onofré

Hi Max,

I discussed with Davor yesterday. Basically, I proposed:

1. Rename all parents with a prefix (beam-parent, flink-runner-parent, 
spark-runner-parent, etc).
2. For the groupId, I prefer to use different groupIds; it's clearer to 
me, and it's exactly what groupIds are for. Some projects use a 
single groupId (spark, hadoop, etc), others use multiple (camel, karaf, 
activemq, etc). I prefer different groupIds, but I'm OK going back to a single one.


Anyway, I'm preparing a PR to introduce a new Maven module: 
"distribution". The purpose is to address both BEAM-319 (first) and 
BEAM-320 (later). It's where we will be able to define the different 
distributions we plan to publish (source and binaries).


Regards
JB

On 06/03/2016 11:02 AM, Maximilian Michels wrote:

Thanks for getting us ready for the first release, Davor! We would
like to fix BEAM-315 next week. Is there already a timeline for the
first release? If so, we could also address this in a minor release.
Releasing often will give us some experience with our release process
:)

I would like everyone to think about the artifact names and group ids
again. "parent" and "flink" are not very suitable names for the Beam
parent or the Flink Runner artifact (same goes for the Spark Runner).
I'd prefer "beam-parent", "flink-runner", and "spark-runner" as
artifact ids.

One might think of Maven GroupIds as a sort of hierarchy but they're
not. They're just an identifier. Renaming the parent pom to
"apache-beam" or "beam-parent" would give us the old naming scheme
which used flat group ids (before [1]).

In the end, I guess it doesn't matter too much if we document the
naming schemes accordingly. What matters is that we use a consistent
naming scheme.

Cheers,
Max

[1] https://issues.apache.org/jira/browse/BEAM-287


On Thu, Jun 2, 2016 at 4:00 PM, Jean-Baptiste Onofré  wrote:

Actually, I think we can fix both issues in one commit.

What do you think about renaming the main parent POM with:
groupId: org.apache.beam
artifactId: apache-beam

?

Thanks to that, the source distribution will be named
apache-beam-xxx-sources.zip and it would be clearer to developers.
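
In POM terms, the proposal amounts to something like this for the root POM
(version is just a placeholder):

```
<groupId>org.apache.beam</groupId>
<artifactId>apache-beam</artifactId>
<version>x.y.z-incubating</version>
<packaging>pom</packaging>
```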

Thoughts ?

Regards
JB


On 06/02/2016 03:10 PM, Jean-Baptiste Onofré wrote:


Another annoying thing is the main parent POM artifactId.

Now, it's just "parent". What do you think about renaming to
"beam-parent" ?

Regarding the source distribution name, I would cancel this staging to
fix that (I will have a PR ready soon).

Thoughts ?

Regards
JB

On 06/02/2016 03:46 AM, Davor Bonaci wrote:


Hi everyone!
We've started the release process for our first release,
0.1.0-incubating.

To recap previous discussions, we don't have particular functional goals
for this release. Instead, we'd like to make available what's currently in
the repository, as well as work through the release process.

With this in mind, we've:
* branched off the release branch [1] at master's commit 8485272,
* updated master to prepare for the second release, 0.2.0-incubating,
* built the first release candidate, RC1, and deployed it to a staging
repository [2].

We are not ready to start a vote just yet -- we've already identified a few
issues worth fixing. That said, I'd like to invite everybody to take a peek
and comment. I'm hoping we can address as many issues as possible before we
start the voting process.

Please let us know if you see any issues.

Thanks,
Davor

[1]
https://github.com/apache/incubator-beam/tree/release-0.1.0-incubating
[2]
https://repository.apache.org/content/repositories/orgapachebeam-1000/





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Serialization for org.apache.beam.sdk.util.WindowedValue$*

2016-06-03 Thread Thomas Weise
Amit,

Thanks for this pointer as well, CoderHelpers helps indeed!

Thomas
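
The CoderHelpers pattern referenced above boils down to encoding with the
element's Coder instead of Java/Kryo serialization; a minimal sketch
(helper shape assumed, not copied from the Beam source):

```
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.beam.sdk.coders.Coder;

class CoderHelpersSketch {
  // Turn an element into bytes using its Beam Coder (sketch).
  static <T> byte[] toByteArray(T value, Coder<T> coder) {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try {
      coder.encode(value, baos, Coder.Context.OUTER);
    } catch (IOException e) {
      throw new IllegalStateException("Error encoding value: " + value, e);
    }
    return baos.toByteArray();
  }
}
```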

On Thu, Jun 2, 2016 at 12:51 PM, Amit Sela  wrote:

> Oh sorry, of course I meant Thomas Groh in my previous email... But @Thomas
> Weise, this example
> <
> https://github.com/apache/incubator-beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/EvaluationContext.java#L108
> >
> might help; this is how the Spark runner uses Coders, as Thomas Groh
> described.
>
> And I agree that we should consider making PipelineOptions Serializable or
> providing a generic solution for runners.
>
> Hope this helps,
> Amit
>
> On Thu, Jun 2, 2016 at 10:35 PM Amit Sela  wrote:
>
> > Thomas is right, though in my case, I encountered this issue when using
> > Spark's new API that uses Encoders
> > <
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoder.scala>
> > not just for serialization but also for "translating" the object into a
> > schema for optimized execution with Tungsten
> > <
> https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
> >.
> >
> > In this case I'm using Kryo, and I've solved this by registering (in
> > Spark, not Beam) custom serializers from
> > https://github.com/magro/kryo-serializers
> > I would consider (in the future) implementing Encoders with the help of
> > Coders, but I still haven't wrapped my mind around this.
> >
> > On Thu, Jun 2, 2016 at 9:59 PM Thomas Groh 
> > wrote:
> >
> >> The Beam Model ensures that all PCollections have a Coder; the
> >> PCollection Coder is the standard way to materialize the elements of a
> >> PCollection[1][2]. Most SDK-provided classes that will need to be
> >> transferred across the wire have an associated coder, and some
> >> additional default datatypes have coders associated with them (in the
> >> CoderRegistry[3]).
> >>
> >> FullWindowedValueCoder[4] is capable of encoding and decoding the
> >> entirety of a WindowedValue, and is constructed from a ValueCoder
> >> (obtained from the PCollection) and a WindowCoder (obtained from the
> >> WindowFn of the WindowingStrategy of the PCollection). Given an input
> >> PCollection `pc`, you can construct the FullWindowedValueCoder with the
> >> following code snippet
> >>
> >> ```
> >> FullWindowedValueCoder.of(pc.getCoder(),
> >> pc.getWindowingStrategy().getWindowFn().windowCoder())
> >> ```
> >>
> >> [1]
> >>
> >>
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java
> >> [2]
> >>
> >>
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/PCollection.java#L130
> >> [3]
> >>
> >>
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/CoderRegistry.java#L94
> >> [4]
> >>
> >>
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/WindowedValue.java#L515
> >>
> >> On Thu, Jun 2, 2016 at 10:41 AM, Thomas Weise 
> >> wrote:
> >>
> >> > Hi Amit,
> >> >
> >> > Thanks for the help. I implemented the same serialization workaround
> >> > for the PipelineOptions. Since every distributed runner will have to
> >> > solve this, would it make sense to provide the serialization support
> >> > along with the interface proxy?
> >> >
> >> > Here is the exception I get with WindowedValue:
> >> >
> >> > com.esotericsoftware.kryo.KryoException: Class cannot be created
> >> (missing
> >> > no-arg constructor):
> >> > org.apache.beam.sdk.util.WindowedValue$TimestampedValueInGlobalWindow
> >> > at
> >> >
> >> >
> >>
> com.esotericsoftware.kryo.Kryo$DefaultInstantiatorStrategy.newInstantiatorOf(Kryo.java:1228)
> >> > at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1049)
> >> > at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1058)
> >> > at
> >> >
> >> >
> >>
> com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:547)
> >> > at
> >> >
> >> >
> >>
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:523)
> >> > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
> >> >
> >> > Thanks,
> >> > Thomas
> >> >
> >> >
> >> > On Wed, Jun 1, 2016 at 12:45 AM, Amit Sela 
> >> wrote:
> >> >
> >> > > Hi Thomas,
> >> > >
> >> > > Spark and the Spark runner are using kryo for serialization and it
> >> > > seems to work just fine. What is your exact problem ? stack
> >> > > trace/message ?
> >> > > I've hit an issue with Guava's ImmutableList/Map etc. and used
> >> > > https://github.com/magro/kryo-serializers for that.
> >> > >
> >> > > For PipelineOptions you can take a look at the Spark runner code
> here:
> >> > >
> >> > >
> >> >
> >>
> 

Re: Serialization for org.apache.beam.sdk.util.WindowedValue$*

2016-06-03 Thread Thomas Weise
Thanks, works like a charm! For such hidden gems there should be a Beam
runner newbie guide ;-)

Thomas


On Thu, Jun 2, 2016 at 11:59 AM, Thomas Groh 
wrote:

> The Beam Model ensures that all PCollections have a Coder; the PCollection
> Coder is the standard way to materialize the elements of a
> PCollection[1][2]. Most SDK-provided classes that will need to be
> transferred across the wire have an associated coder, and some additional
> default datatypes have coders associated with them (in the CoderRegistry[3]).
>
> FullWindowedValueCoder[4] is capable of encoding and decoding the entirety
> of a WindowedValue, and is constructed from a ValueCoder (obtained from the
> PCollection) and a WindowCoder (obtained from the WindowFn of the
> WindowingStrategy of the PCollection). Given an input PCollection `pc`, you
> can construct the FullWindowedValueCoder with the following code snippet
>
> ```
> FullWindowedValueCoder.of(pc.getCoder(),
> pc.getWindowingStrategy().getWindowFn().windowCoder())
> ```
>
> [1]
>
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java
> [2]
>
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/PCollection.java#L130
> [3]
>
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/CoderRegistry.java#L94
> [4]
>
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/WindowedValue.java#L515
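
Combined with CoderUtils (org.apache.beam.sdk.util), that snippet gives a
full byte-level round trip; a sketch assuming a PCollection<String> `pc`
and a WindowedValue<String> `value`:

```
// Sketch: round-trip a WindowedValue through its coder instead of Kryo.
// Assumes org.apache.beam.sdk.coders.Coder, org.apache.beam.sdk.util.CoderUtils,
// and org.apache.beam.sdk.util.WindowedValue are imported.
Coder<WindowedValue<String>> coder =
    WindowedValue.FullWindowedValueCoder.of(
        pc.getCoder(),
        pc.getWindowingStrategy().getWindowFn().windowCoder());
byte[] bytes = CoderUtils.encodeToByteArray(coder, value);     // throws CoderException
WindowedValue<String> copy = CoderUtils.decodeFromByteArray(coder, bytes);
```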
>
> On Thu, Jun 2, 2016 at 10:41 AM, Thomas Weise 
> wrote:
>
> > Hi Amit,
> >
> > Thanks for the help. I implemented the same serialization workaround for
> > the PipelineOptions. Since every distributed runner will have to solve
> > this, would it make sense to provide the serialization support along with
> > the interface proxy?
> >
> > Here is the exception I get with WindowedValue:
> >
> > com.esotericsoftware.kryo.KryoException: Class cannot be created (missing
> > no-arg constructor):
> > org.apache.beam.sdk.util.WindowedValue$TimestampedValueInGlobalWindow
> > at
> >
> >
> com.esotericsoftware.kryo.Kryo$DefaultInstantiatorStrategy.newInstantiatorOf(Kryo.java:1228)
> > at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1049)
> > at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1058)
> > at
> >
> >
> com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:547)
> > at
> >
> >
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:523)
> > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
> >
> > Thanks,
> > Thomas
> >
> >
> > On Wed, Jun 1, 2016 at 12:45 AM, Amit Sela  wrote:
> >
> > > Hi Thomas,
> > >
> > > Spark and the Spark runner are using kryo for serialization and it
> seems
> > to
> > > work just fine. What is your exact problem ? stack trace/message ?
> > > I've hit an issue with Guava's ImmutableList/Map etc. and used
> > > https://github.com/magro/kryo-serializers for that.
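
Wiring that library in is a one-time registration on the Kryo instance; a
sketch using its static helpers (class names per that project's README,
treated here as assumptions):

```
import com.esotericsoftware.kryo.Kryo;
import de.javakaffee.kryoserializers.guava.ImmutableListSerializer;
import de.javakaffee.kryoserializers.guava.ImmutableMapSerializer;

class KryoSetup {
  static Kryo newKryo() {
    Kryo kryo = new Kryo();
    // Register serializers for Guava immutable collections, which lack
    // the no-arg constructors Kryo's field serializer expects.
    ImmutableListSerializer.registerSerializers(kryo);
    ImmutableMapSerializer.registerSerializers(kryo);
    return kryo;
  }
}
```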
> > >
> > > For PipelineOptions you can take a look at the Spark runner code here:
> > >
> > >
> >
> https://github.com/apache/incubator-beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkRuntimeContext.java#L73
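
The essence of that workaround: hold the options as a JSON string (which is
Serializable) and rebuild the proxy lazily; a sketch (class shape assumed,
relying on PipelineOptions being Jackson-serializable):

```
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.io.Serializable;
import org.apache.beam.sdk.options.PipelineOptions;

class SerializedPipelineOptions implements Serializable {
  private final String json;
  private transient PipelineOptions options;

  SerializedPipelineOptions(PipelineOptions options) throws IOException {
    // PipelineOptions proxies serialize to/from JSON via Jackson.
    this.json = new ObjectMapper().writeValueAsString(options);
  }

  synchronized PipelineOptions get() throws IOException {
    if (options == null) {
      options = new ObjectMapper().readValue(json, PipelineOptions.class);
    }
    return options;
  }
}
```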
> > >
> > > I'd be happy to assist with Kryo.
> > >
> > > Thanks,
> > > Amit
> > >
> > > On Wed, Jun 1, 2016 at 7:10 AM Thomas Weise  wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm working on putting together a basic runner for Apache Apex.
> > > >
> > > > Hitting a couple of serialization-related issues when running tests.
> > > > Apex is using Kryo for serialization by default (and Kryo can delegate
> > > > to other serialization frameworks).
> > > >
> > > > The inner classes of WindowedValue are private and have no default
> > > > constructor, which the Kryo field serializer does not like. Also,
> > > > these classes are not Java serializable, so that's not a fallback
> > > > option (not that it would be efficient anyways).
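
A generic Kryo-side workaround for missing no-arg constructors is an
Objenesis fallback (sketch below; the coder-based approach discussed
elsewhere in this thread is the better fit for WindowedValue):

```
import com.esotericsoftware.kryo.Kryo;
import org.objenesis.strategy.StdInstantiatorStrategy;

// Sketch: let Kryo instantiate classes without a no-arg constructor
// by falling back to Objenesis.
Kryo kryo = new Kryo();
kryo.setInstantiatorStrategy(
    new Kryo.DefaultInstantiatorStrategy(new StdInstantiatorStrategy()));
```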
> > > >
> > > > What's the recommended technique to move the WindowedValues over the
> > > > wire?
> > > >
> > > > Also, PipelineOptions aren't serializable, while most other classes
> > > > are. They are needed for example with DoFnRunnerBase, so what's the
> > > > recommended way to distribute them? Disassemble/reassemble? :)
> > > >
> > > > Thanks,
> > > > Thomas
> > > >
> > >
> >
>