Re: [VOTE] groupId/artifactId naming & layout
The purpose of the vote is to get a consensus, actually. We have two options expressed on the mailing list: the current "layout" is good IMHO, but not everyone agrees. So, let's put things on the table and move forward. The vote is a way of discussing. It's not a vote for the release; it's a vote/discussion for the layout and Maven coordinates (so not a formal vote). Just a reminder: everything should be discussed and announced on the mailing list.

Regards
JB

On 06/03/2016 06:50 PM, Davor Bonaci wrote:
This is not a great vote proposal for several reasons:
* "Use the current layout" is ambiguous, because it is inconsistent (it is now partly flat and partly hierarchical).
* Getting the outcome won't move us much closer to the resolution, given that there are several sub-variants in each option.
* We have not laid out advantages, disadvantages, and consequences of each option for everyone to make an informed decision.
* It is premature: we haven't tried to reach a consensus or explored alternatives. 3 hours and just a few emails is way too short from an issue being raised to a vote call.
I'd suggest trying to find a consensus on the original thread first, and calling for a vote if/when needed.

On Fri, Jun 3, 2016 at 5:15 AM, Amit Sela wrote:
+1 for Option2

On Fri, Jun 3, 2016 at 2:09 PM Jean-Baptiste Onofré wrote:
As said in my previous e-mail, I just proposed PR #416.
Let's start a vote for groupId and artifactId naming.
[ ] Option1: use the current layout (multiple groupIds, artifactId relative to groupId)
[ ] Option2: use a unique org.apache.beam groupId and rename artifactIds with a prefix (beam-parent/apache-beam, flink-runner, spark-runner, etc)
Regards
JB

On 06/03/2016 01:03 PM, Jean-Baptiste Onofré wrote:
Hi Max,
I discussed with Davor yesterday. Basically, I proposed:
1. To rename all parents with a prefix (beam-parent, flink-runner-parent, spark-runner-parent, etc).
2. For the groupId, I prefer to use different groupIds; it's clearer to me, and it's exactly the usage of the groupId.
Some projects use a single groupId (spark, hadoop, etc), others use multiple (camel, karaf, activemq, etc). I prefer different groupIds but am OK to go back to a single one.

Anyway, I'm preparing a PR to introduce a new Maven module: "distribution". The purpose is to address both BEAM-319 (first) and BEAM-320 (later). It's where we will be able to define the different distributions we plan to publish (source and binaries).

Regards
JB

On 06/03/2016 11:02 AM, Maximilian Michels wrote:
Thanks for getting us ready for the first release, Davor! We would like to fix BEAM-315 next week. Is there already a timeline for the first release? If so, we could also address this in a minor release. Releasing often will give us some experience with our release process :)

I would like everyone to think about the artifact names and group ids again. "parent" and "flink" are not very suitable names for the Beam parent or the Flink Runner artifact (same goes for the Spark Runner). I'd prefer "beam-parent", "flink-runner", and "spark-runner" as artifact ids.

One might think of Maven groupIds as a sort of hierarchy, but they're not; they're just an identifier. Renaming the parent pom to "apache-beam" or "beam-parent" would give us the old naming scheme which used flat group ids (before [1]).

In the end, I guess it doesn't matter too much if we document the naming schemes accordingly. What matters is that we use a consistent naming scheme.

Cheers,
Max

[1] https://issues.apache.org/jira/browse/BEAM-287

On Thu, Jun 2, 2016 at 4:00 PM, Jean-Baptiste Onofré
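To make the two layouts under discussion concrete, the difference shows up in how a downstream user would declare a dependency. A hypothetical sketch only — the coordinates below are illustrative examples drawn from the option descriptions in this thread, not final Beam coordinates:

```xml
<!-- Option1 (hierarchical): the groupId carries the category,
     the artifactId stays short (illustrative coordinates) -->
<dependency>
  <groupId>org.apache.beam.runners</groupId>
  <artifactId>flink</artifactId>
  <version>0.1.0-incubating</version>
</dependency>

<!-- Option2 (flat): a single org.apache.beam groupId,
     with a prefixed artifactId (illustrative coordinates) -->
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>flink-runner</artifactId>
  <version>0.1.0-incubating</version>
</dependency>
```

Either form resolves fine in Maven; the trade-off discussed here is purely about discoverability and how artifact files sort next to each other.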
Re: Apache Beam for Python
Welcome Python people ;) I know a few people who've been waiting for this one! On Fri, Jun 3, 2016, 19:53 Davor Bonaci wrote: > Welcome Python SDK, as well as Silviu, Charles, Ahmet and Chamikara! > > On Fri, Jun 3, 2016 at 7:07 AM, Jean-Baptiste Onofré > wrote: > > > Absolutely ;) > > > > > > On 06/03/2016 03:51 PM, James Malone wrote: > > > >> Hey Silviu! > >> > >> I think JB is proposing we create a python directory in the sdks > directory > >> in the root repository (and modify the configuration files accordingly): > >> > >> https://github.com/apache/incubator-beam/tree/master/sdks > >> > >> This Beam document here titled "Apache Beam (Incubating): Repository > >> Structure" details the proposed repository structure and may be useful: > >> > >> > >> > >> > https://drive.google.com/a/google.com/folderview?id=0B-IhJZh9Ab52OFBVZHpsNjc4eXc > >> > >> Best, > >> > >> James > >> > >> > >> > >> On Fri, Jun 3, 2016 at 6:34 AM, Silviu Calinoiu > >> > >> wrote: > >> > >> Hi JB, > >>> Thanks for the welcome! I come from the Python land so I am not quite > >>> familiar with Maven. What do you mean by a Maven module? You mean an > >>> artifact so you can install things? In Python, people are used to > >>> packages > >>> downloaded from PyPI (pypi.python.org -- which is sort of Maven for > >>> Python). Whatever is the standard way of doing things in Apache we'll > do > >>> it. Just asking for clarifications. > >>> > >>> By the way this discussion is very useful since we will have to iron > out > >>> several details like this. > >>> Thanks, > >>> Silviu > >>> > >>> On Fri, Jun 3, 2016 at 6:19 AM, Jean-Baptiste Onofré > >>> wrote: > >>> > >>> Hi Silviu, > > thanks for the detailed update and great work ! > > I would advise creating a: > > sdks/python > > Maven module to store the Python SDK. > > WDYT ? > > By the way, welcome aboard and great to have you all guys in the team > !
> > Regards > JB > > On 06/03/2016 03:13 PM, Silviu Calinoiu wrote: > > Hi all, > > > > My name is Silviu Calinoiu and I am a member of the Cloud Dataflow > team > > working on the Python SDK. As the original Beam proposal ( > > https://wiki.apache.org/incubator/BeamProposal) mentioned, we have > > been > > planning to merge the Python SDK into Beam. The Python SDK is in an > > > early > >>> > stage of development (alpha milestone) and so this is a good time to > > > move > >>> > the code without causing too much disruption to our customers. > > Additionally, this enables the Beam community to contribute as soon > as > > possible. > > > > The current state of the SDK is as follows: > > > > - > > > > Open-sourced at > > https://github.com/GoogleCloudPlatform/DataflowPythonSDK/ > > > > > > - > > > > Model: All main concepts are present. > > - > > > > I/O: SDK supports text (Google Cloud Storage) and BigQuery > > > connectors > >>> > and has a framework for adding additional sources and sinks. > > - > > > > Runners: SDK has two pipeline runners: direct runner (in > process, > > local > > execution) and Cloud Dataflow runner for batch pipelines (submit > > job > > to > > Google Dataflow service). The current direct runner is bounded > > only > > (batch > > execution) but there is work in progress to support unbounded > (as > > in > > Java). > > - > > > > Testing: The code base has unit test coverage for all the > modules > > > and > >>> > several integration and end to end tests (similar in coverage to > > the > > Java > > SDK). Streaming is not well tested end to end yet since Cloud > > > Dataflow > >>> > focused first on batch. > > - > > > > Docs: We have matching Python documentation for the features > > > currently > >>> > supported by Cloud Dataflow. The docs are on cloud.google.com > > > (access > >>> > only by whitelist due to the alpha stage of the project). Devin > is > > working > > on the transition of all docs to Apache. 
> > > > > > In the next days/weeks we would like to prepare and start migrating > the > > code and you should start seeing some pull requests. We also hope > that > > > the > >>> > Beam community will shape the SDK going forward. In particular, all > the > > model improvements implemented for Java (Runner API, etc.) will have > > equivalents in Python once they stabilize. If you have any advice > > before > > we > > start the journey please let us know. > > > > The team that will
Re: 0.1.0-incubating release
Another consideration for potential future packaging/distribution solutions is how the artifacts line up as files in a flat directory. For that it may be good to have a common prefix in the artifactId and unique artifactIds. The name for the source archive (when relying on the ASF parent POM) can also be controlled without expanding the artifactId, e.g. by configuring maven-assembly-plugin to produce an apache-beam archive name. Thanks, Thomas On Fri, Jun 3, 2016 at 9:39 AM, Davor Bonaci wrote: > BEAM-315 is definitely important. Normally, I'd always advocate for holding > the release to pick that fix. For the very first release, however, I'd > prefer to proceed to get something out there and test the process. As you > said, we can address this rather quickly once we have the fix merged in. > > In terms of Maven coordinates, there are two basic approaches: > * flat structure, where artifacts live under "org.apache.beam" group and > are differentiated by their artifact id. > * hierarchical structure, where we use different groups for different types > of artifacts (org.apache.beam.sdks; org.apache.beam.runners). > > There are pros and cons on both sides of the argument. Different > projects made different choices. Flat structure is easier to find and > navigate, but often breaks down with too many artifacts. Hierarchical > structure is just the opposite. > > On my end, the only important thing is consistency. We used to have it, and > it got broken by PR #365. This part should be fixed -- we should either > finish the vision of the hierarchical structure, or roll back that PR to get > back to a fully flat structure. > > My general biases tend to be: > * hierarchical structure, since we have many artifacts already. > * short identifiers; no need to repeat a part of the group id in the > artifact id. > > On Fri, Jun 3, 2016 at 4:03 AM, Jean-Baptiste Onofré > wrote: > > > Hi Max, > > > > I discussed with Davor yesterday. Basically, I proposed: > > > > 1.
To rename all parents with a prefix (beam-parent, flink-runner-parent, > > spark-runner-parent, etc). > > 2. For the groupId, I prefer to use different groupIds, it's clearer to > me, > > and it's exactly the usage of the groupId. Some projects use a single > > groupId (spark, hadoop, etc), others use multiple (camel, karaf, > activemq, > > etc). I prefer different groupIds but OK to go back to a single one. > > > > Anyway, I'm preparing a PR to introduce a new Maven module: > > "distribution". The purpose is to address both BEAM-319 (first) and > > BEAM-320 (later). It's where we will be able to define the different > > distributions we plan to publish (source and binaries). > > > > Regards > > JB > > > > > > On 06/03/2016 11:02 AM, Maximilian Michels wrote: > > > >> Thanks for getting us ready for the first release, Davor! We would > >> like to fix BEAM-315 next week. Is there already a timeline for the > >> first release? If so, we could also address this in a minor release. > >> Releasing often will give us some experience with our release process > >> :) > >> > >> I would like everyone to think about the artifact names and group ids > >> again. "parent" and "flink" are not very suitable names for the Beam > >> parent or the Flink Runner artifact (same goes for the Spark Runner). > >> I'd prefer "beam-parent", "flink-runner", and "spark-runner" as > >> artifact ids. > >> > >> One might think of Maven GroupIds as a sort of hierarchy but they're > >> not. They're just an identifier. Renaming the parent pom to > >> "apache-beam" or "beam-parent" would give us the old naming scheme > >> which used flat group ids (before [1]). > >> > >> In the end, I guess it doesn't matter too much if we document the > >> naming schemes accordingly. What matters is that we use a consistent > >> naming scheme.
> >> > >> Cheers, > >> Max > >> > >> [1] https://issues.apache.org/jira/browse/BEAM-287 > >> > >> > >> On Thu, Jun 2, 2016 at 4:00 PM, Jean-Baptiste Onofré > >> wrote: > >> > >>> Actually, I think we can fix both issues in one commit. > >>> > >>> What do you think about renaming the main parent POM with: > >>> groupId: org.apache.beam > >>> artifactId: apache-beam > >>> > >>> ? > >>> > >>> Thanks to that, the source distribution will be named > >>> apache-beam-xxx-sources.zip and it would be clearer to devs. > >>> > >>> Thoughts ? > >>> > >>> Regards > >>> JB > >>> > >>> > >>> On 06/02/2016 03:10 PM, Jean-Baptiste Onofré wrote: > > Another annoying thing is the main parent POM artifactId. > > Now, it's just "parent". What do you think about renaming to > "beam-parent" ? > > Regarding the source distribution name, I would cancel this staging to > fix that (I will have a PR ready soon). > > Thoughts ? > > Regards > JB > > On 06/02/2016 03:46 AM, Davor Bonaci wrote: > >
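The source-archive renaming Thomas mentions can be sketched roughly as an override of the assembly configuration inherited from the ASF parent POM. This is an assumption-laden illustration, not the actual Beam POM; the finalName value in particular is hypothetical:

```xml
<!-- Sketch: force the source archive to a fixed apache-beam-* name,
     independent of the parent artifactId (value is illustrative) -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <finalName>apache-beam-${project.version}</finalName>
  </configuration>
</plugin>
```

With something like this in the root POM, the parent artifactId could stay short (e.g. "parent" or "beam-parent") while the published archive still carries the apache-beam prefix.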
Re: [VOTE] groupId/artifactId naming & layout
This is not a great vote proposal for several reasons: * "Use the current layout" is ambiguous, because it is inconsistent (it is now partly flat and partly hierarchical). * Getting the outcome won't move us much closer to the resolution, given that there are several sub-variants in each option. * We have not laid out advantages, disadvantages, and consequences of each option for everyone to make an informed decision. * It is premature: we haven't tried to reach a consensus or explored alternatives. 3 hours and just a few emails is way too short from an issue being raised to a vote call. I'd suggest trying to find a consensus on the original thread first, and calling for a vote if/when needed. On Fri, Jun 3, 2016 at 5:15 AM, Amit Sela wrote: > +1 for Option2 > > On Fri, Jun 3, 2016 at 2:09 PM Jean-Baptiste Onofré > wrote: > > > As said in my previous e-mail, just proposed PR #416. > > > > Let's start a vote for groupId and artifactId naming. > > > > [ ] Option1: use the current layout (multiple groupId, artifactId > > relative to groupId) > > [ ] Option2: use unique org.apache.beam groupId and rename artifactId > > with a prefix (beam-parent/apache-beam, flink-runner, spark-runner, etc) > > > > Regards > > JB > > > > On 06/03/2016 01:03 PM, Jean-Baptiste Onofré wrote: > > > Hi Max, > > > > > > I discussed with Davor yesterday. Basically, I proposed: > > > > > > 1. To rename all parents with a prefix (beam-parent, > flink-runner-parent, > > > spark-runner-parent, etc). > > > 2. For the groupId, I prefer to use different groupIds, it's clearer to > > > me, and it's exactly the usage of the groupId. Some projects use a > > > single groupId (spark, hadoop, etc), others use multiple (camel, karaf, > > > activemq, etc). I prefer different groupIds but OK to go back to a single > > > one. > > > > > > Anyway, I'm preparing a PR to introduce a new Maven module: > > > "distribution". The purpose is to address both BEAM-319 (first) and > > > BEAM-320 (later).
It's where we will be able to define the different > > > distributions we plan to publish (source and binaries). > > > > > > Regards > > > JB > > > > > > On 06/03/2016 11:02 AM, Maximilian Michels wrote: > > >> Thanks for getting us ready for the first release, Davor! We would > > >> like to fix BEAM-315 next week. Is there already a timeline for the > > >> first release? If so, we could also address this in a minor release. > > >> Releasing often will give us some experience with our release process > > >> :) > > >> > > >> I would like everyone to think about the artifact names and group ids > > >> again. "parent" and "flink" are not very suitable names for the Beam > > >> parent or the Flink Runner artifact (same goes for the Spark Runner). > > >> I'd prefer "beam-parent", "flink-runner", and "spark-runner" as > > >> artifact ids. > > >> > > >> One might think of Maven GroupIds as a sort of hierarchy but they're > > >> not. They're just an identifier. Renaming the parent pom to > > >> "apache-beam" or "beam-parent" would give us the old naming scheme > > >> which used flat group ids (before [1]). > > >> > > >> In the end, I guess it doesn't matter too much if we document the > > >> naming schemes accordingly. What matters is that we use a consistent > > >> naming scheme. > > >> > > >> Cheers, > > >> Max > > >> > > >> [1] https://issues.apache.org/jira/browse/BEAM-287 > > >> > > >> > > >> On Thu, Jun 2, 2016 at 4:00 PM, Jean-Baptiste Onofré > > > >> wrote: > > >>> Actually, I think we can fix both issues in one commit. > > >>> > > >>> What do you think about renaming the main parent POM with: > > >>> groupId: org.apache.beam > > >>> artifactId: apache-beam > > >>> > > >>> ? > > >>> > > >>> Thanks to that, the source distribution will be named > > >>> apache-beam-xxx-sources.zip and it would be clearer to devs. > > >>> > > >>> Thoughts ?
> > >>> > > >>> Regards > > >>> JB > > >>> > > >>> > > >>> On 06/02/2016 03:10 PM, Jean-Baptiste Onofré wrote: > > > > Another annoying thing is the main parent POM artifactId. > > > > Now, it's just "parent". What do you think about renaming to > > "beam-parent" ? > > > > Regarding the source distribution name, I would cancel this staging > to > > fix that (I will have a PR ready soon). > > > > Thoughts ? > > > > Regards > > JB > > > > On 06/02/2016 03:46 AM, Davor Bonaci wrote: > > > > > > Hi everyone! > > > We've started the release process for our first release, > > > 0.1.0-incubating. > > > > > > To recap previous discussions, we don't have particular functional > > > goals > > > for this release. Instead, we'd like to make available what's > > > currently in > > > the repository, as well as work through the release process. > > > > > > With this in mind, we've: > > > * branched off the release branch [1] at master's commit 8485272, > > > *
Re: Apache Beam for Python
Absolutely ;) On 06/03/2016 03:51 PM, James Malone wrote: Hey Silviu! I think JB is proposing we create a python directory in the sdks directory in the root repository (and modify the configuration files accordingly): https://github.com/apache/incubator-beam/tree/master/sdks This Beam document here titled "Apache Beam (Incubating): Repository Structure" details the proposed repository structure and may be useful: https://drive.google.com/a/google.com/folderview?id=0B-IhJZh9Ab52OFBVZHpsNjc4eXc Best, James On Fri, Jun 3, 2016 at 6:34 AM, Silviu Calinoiu wrote: Hi JB, Thanks for the welcome! I come from the Python land so I am not quite familiar with Maven. What do you mean by a Maven module? You mean an artifact so you can install things? In Python, people are used to packages downloaded from PyPI (pypi.python.org -- which is sort of Maven for Python). Whatever is the standard way of doing things in Apache we'll do it. Just asking for clarifications. By the way this discussion is very useful since we will have to iron out several details like this. Thanks, Silviu On Fri, Jun 3, 2016 at 6:19 AM, Jean-Baptiste Onofré wrote: Hi Silviu, thanks for the detailed update and great work ! I would advise creating a: sdks/python Maven module to store the Python SDK. WDYT ? By the way, welcome aboard and great to have you all guys in the team ! Regards JB On 06/03/2016 03:13 PM, Silviu Calinoiu wrote: Hi all, My name is Silviu Calinoiu and I am a member of the Cloud Dataflow team working on the Python SDK. As the original Beam proposal ( https://wiki.apache.org/incubator/BeamProposal) mentioned, we have been planning to merge the Python SDK into Beam. The Python SDK is in an early stage of development (alpha milestone) and so this is a good time to move the code without causing too much disruption to our customers. Additionally, this enables the Beam community to contribute as soon as possible.
The current state of the SDK is as follows:
- Open-sourced at https://github.com/GoogleCloudPlatform/DataflowPythonSDK/
- Model: All main concepts are present.
- I/O: SDK supports text (Google Cloud Storage) and BigQuery connectors and has a framework for adding additional sources and sinks.
- Runners: SDK has two pipeline runners: direct runner (in process, local execution) and Cloud Dataflow runner for batch pipelines (submit job to Google Dataflow service). The current direct runner is bounded only (batch execution) but there is work in progress to support unbounded (as in Java).
- Testing: The code base has unit test coverage for all the modules and several integration and end to end tests (similar in coverage to the Java SDK). Streaming is not well tested end to end yet since Cloud Dataflow focused first on batch.
- Docs: We have matching Python documentation for the features currently supported by Cloud Dataflow. The docs are on cloud.google.com (access only by whitelist due to the alpha stage of the project). Devin is working on the transition of all docs to Apache.

In the next days/weeks we would like to prepare and start migrating the code and you should start seeing some pull requests. We also hope that the Beam community will shape the SDK going forward. In particular, all the model improvements implemented for Java (Runner API, etc.) will have equivalents in Python once they stabilize. If you have any advice before we start the journey please let us know.

The team that will join the Beam effort consists of me (Silviu Calinoiu), Charles Chen, Ahmet Altay, Chamikara Jayalath, and last but not least Robert Bradshaw (who is already an Apache Beam committer).

So let us know what you think!

Best regards,
Silviu

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
Re: Apache Beam for Python
I'm more proposing just a folder containing the Python SDK, not necessarily part of the Maven reactor. Regards JB On 06/03/2016 03:34 PM, Silviu Calinoiu wrote: Hi JB, Thanks for the welcome! I come from the Python land so I am not quite familiar with Maven. What do you mean by a Maven module? You mean an artifact so you can install things? In Python, people are used to packages downloaded from PyPI (pypi.python.org -- which is sort of Maven for Python). Whatever is the standard way of doing things in Apache we'll do it. Just asking for clarifications. By the way this discussion is very useful since we will have to iron out several details like this. Thanks, Silviu On Fri, Jun 3, 2016 at 6:19 AM, Jean-Baptiste Onofré wrote: Hi Silviu, thanks for the detailed update and great work ! I would advise creating a: sdks/python Maven module to store the Python SDK. WDYT ? By the way, welcome aboard and great to have you all guys in the team ! Regards JB On 06/03/2016 03:13 PM, Silviu Calinoiu wrote: Hi all, My name is Silviu Calinoiu and I am a member of the Cloud Dataflow team working on the Python SDK. As the original Beam proposal ( https://wiki.apache.org/incubator/BeamProposal) mentioned, we have been planning to merge the Python SDK into Beam. The Python SDK is in an early stage of development (alpha milestone) and so this is a good time to move the code without causing too much disruption to our customers. Additionally, this enables the Beam community to contribute as soon as possible. The current state of the SDK is as follows: - Open-sourced at https://github.com/GoogleCloudPlatform/DataflowPythonSDK/ - Model: All main concepts are present. - I/O: SDK supports text (Google Cloud Storage) and BigQuery connectors and has a framework for adding additional sources and sinks. - Runners: SDK has two pipeline runners: direct runner (in process, local execution) and Cloud Dataflow runner for batch pipelines (submit job to Google Dataflow service).
The current direct runner is bounded only (batch execution) but there is work in progress to support unbounded (as in Java).
- Testing: The code base has unit test coverage for all the modules and several integration and end to end tests (similar in coverage to the Java SDK). Streaming is not well tested end to end yet since Cloud Dataflow focused first on batch.
- Docs: We have matching Python documentation for the features currently supported by Cloud Dataflow. The docs are on cloud.google.com (access only by whitelist due to the alpha stage of the project). Devin is working on the transition of all docs to Apache.

In the next days/weeks we would like to prepare and start migrating the code and you should start seeing some pull requests. We also hope that the Beam community will shape the SDK going forward. In particular, all the model improvements implemented for Java (Runner API, etc.) will have equivalents in Python once they stabilize. If you have any advice before we start the journey please let us know.

The team that will join the Beam effort consists of me (Silviu Calinoiu), Charles Chen, Ahmet Altay, Chamikara Jayalath, and last but not least Robert Bradshaw (who is already an Apache Beam committer).

So let us know what you think!

Best regards,
Silviu

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
Re: Apache Beam for Python
Hey Silviu! I think JB is proposing we create a python directory in the sdks directory in the root repository (and modify the configuration files accordingly): https://github.com/apache/incubator-beam/tree/master/sdks This Beam document here titled "Apache Beam (Incubating): Repository Structure" details the proposed repository structure and may be useful: https://drive.google.com/a/google.com/folderview?id=0B-IhJZh9Ab52OFBVZHpsNjc4eXc Best, James On Fri, Jun 3, 2016 at 6:34 AM, Silviu Calinoiu wrote: > Hi JB, > Thanks for the welcome! I come from the Python land so I am not quite > familiar with Maven. What do you mean by a Maven module? You mean an > artifact so you can install things? In Python, people are used to packages > downloaded from PyPI (pypi.python.org -- which is sort of Maven for > Python). Whatever is the standard way of doing things in Apache we'll do > it. Just asking for clarifications. > > By the way this discussion is very useful since we will have to iron out > several details like this. > Thanks, > Silviu > > On Fri, Jun 3, 2016 at 6:19 AM, Jean-Baptiste Onofré > wrote: > > > Hi Silviu, > > > > thanks for the detailed update and great work ! > > > > I would advise creating a: > > > > sdks/python > > > > Maven module to store the Python SDK. > > > > WDYT ? > > > > By the way, welcome aboard and great to have you all guys in the team ! > > > > Regards > > JB > > > > On 06/03/2016 03:13 PM, Silviu Calinoiu wrote: > > > >> Hi all, > >> > >> My name is Silviu Calinoiu and I am a member of the Cloud Dataflow team > >> working on the Python SDK. As the original Beam proposal ( > >> https://wiki.apache.org/incubator/BeamProposal) mentioned, we have been > >> planning to merge the Python SDK into Beam. The Python SDK is in an > early > >> stage of development (alpha milestone) and so this is a good time to > move > >> the code without causing too much disruption to our customers.
> >> Additionally, this enables the Beam community to contribute as soon as > >> possible. > >> > >> The current state of the SDK is as follows: > >> > >> - > >> > >> Open-sourced at > >> https://github.com/GoogleCloudPlatform/DataflowPythonSDK/ > >> > >> > >> - > >> > >> Model: All main concepts are present. > >> - > >> > >> I/O: SDK supports text (Google Cloud Storage) and BigQuery > connectors > >> and has a framework for adding additional sources and sinks. > >> - > >> > >> Runners: SDK has two pipeline runners: direct runner (in process, > >> local > >> execution) and Cloud Dataflow runner for batch pipelines (submit job > >> to > >> Google Dataflow service). The current direct runner is bounded only > >> (batch > >> execution) but there is work in progress to support unbounded (as in > >> Java). > >> - > >> > >> Testing: The code base has unit test coverage for all the modules > and > >> several integration and end to end tests (similar in coverage to the > >> Java > >> SDK). Streaming is not well tested end to end yet since Cloud > Dataflow > >> focused first on batch. > >> - > >> > >> Docs: We have matching Python documentation for the features > currently > >> supported by Cloud Dataflow. The docs are on cloud.google.com > (access > >> only by whitelist due to the alpha stage of the project). Devin is > >> working > >> on the transition of all docs to Apache. > >> > >> > >> In the next days/weeks we would like to prepare and start migrating the > >> code and you should start seeing some pull requests. We also hope that > the > >> Beam community will shape the SDK going forward. In particular, all the > >> model improvements implemented for Java (Runner API, etc.) will have > >> equivalents in Python once they stabilize. If you have any advice before > >> we > >> start the journey please let us know. 
> >> > >> The team that will join the Beam effort consists of me (Silviu > Calinoiu), > >> Charles Chen, Ahmet Altay, Chamikara Jayalath, and last but not least > >> Robert Bradshaw (who is already an Apache Beam committer). > >> > >> So let us know what you think! > >> > >> Best regards, > >> > >> Silviu > >> > >> > > -- > > Jean-Baptiste Onofré > > jbono...@apache.org > > http://blog.nanthrax.net > > Talend - http://www.talend.com > > >
Re: [VOTE] groupId/artifactId naming & layout
+1 for Option2 On Fri, Jun 3, 2016 at 2:09 PM Jean-Baptiste Onofré wrote: > As said in my previous e-mail, just proposed PR #416. > > Let's start a vote for groupId and artifactId naming. > > [ ] Option1: use the current layout (multiple groupId, artifactId > relative to groupId) > [ ] Option2: use unique org.apache.beam groupId and rename artifactId > with a prefix (beam-parent/apache-beam, flink-runner, spark-runner, etc) > > Regards > JB > > On 06/03/2016 01:03 PM, Jean-Baptiste Onofré wrote: > > Hi Max, > > > > I discussed with Davor yesterday. Basically, I proposed: > > > > 1. To rename all parents with a prefix (beam-parent, flink-runner-parent, > > spark-runner-parent, etc). > > 2. For the groupId, I prefer to use different groupIds, it's clearer to > > me, and it's exactly the usage of the groupId. Some projects use a > > single groupId (spark, hadoop, etc), others use multiple (camel, karaf, > > activemq, etc). I prefer different groupIds but OK to go back to a single > > one. > > > > Anyway, I'm preparing a PR to introduce a new Maven module: > > "distribution". The purpose is to address both BEAM-319 (first) and > > BEAM-320 (later). It's where we will be able to define the different > > distributions we plan to publish (source and binaries). > > > > Regards > > JB > > > > On 06/03/2016 11:02 AM, Maximilian Michels wrote: > >> Thanks for getting us ready for the first release, Davor! We would > >> like to fix BEAM-315 next week. Is there already a timeline for the > >> first release? If so, we could also address this in a minor release. > >> Releasing often will give us some experience with our release process > >> :) > >> > >> I would like everyone to think about the artifact names and group ids > >> again. "parent" and "flink" are not very suitable names for the Beam > >> parent or the Flink Runner artifact (same goes for the Spark Runner). > >> I'd prefer "beam-parent", "flink-runner", and "spark-runner" as > >> artifact ids.
> >> > >> One might think of Maven GroupIds as a sort of hierarchy but they're > >> not. They're just an identifier. Renaming the parent pom to > >> "apache-beam" or "beam-parent" would give us the old naming scheme > >> which used flat group ids (before [1]). > >> > >> In the end, I guess it doesn't matter too much if we document the > >> naming schemes accordingly. What matters is that we use a consistent > >> naming scheme. > >> > >> Cheers, > >> Max > >> > >> [1] https://issues.apache.org/jira/browse/BEAM-287 > >> > >> > >> On Thu, Jun 2, 2016 at 4:00 PM, Jean-Baptiste Onofré > >> wrote: > >>> Actually, I think we can fix both issues in one commit. > >>> > >>> What do you think about renaming the main parent POM with: > >>> groupId: org.apache.beam > >>> artifactId: apache-beam > >>> > >>> ? > >>> > >>> Thanks to that, the source distribution will be named > >>> apache-beam-xxx-sources.zip and it would be clearer to devs. > >>> > >>> Thoughts ? > >>> > >>> Regards > >>> JB > >>> > >>> > >>> On 06/02/2016 03:10 PM, Jean-Baptiste Onofré wrote: > > Another annoying thing is the main parent POM artifactId. > > Now, it's just "parent". What do you think about renaming to > "beam-parent" ? > > Regarding the source distribution name, I would cancel this staging to > fix that (I will have a PR ready soon). > > Thoughts ? > > Regards > JB > > On 06/02/2016 03:46 AM, Davor Bonaci wrote: > > > > Hi everyone! > > We've started the release process for our first release, > > 0.1.0-incubating. > > > > To recap previous discussions, we don't have particular functional > > goals > > for this release. Instead, we'd like to make available what's > > currently in > > the repository, as well as work through the release process.
> >
> > With this in mind, we've:
> > * branched off the release branch [1] at master's commit 8485272,
> > * updated master to prepare for the second release, 0.2.0-incubating,
> > * built the first release candidate, RC1, and deployed it to a staging
> >   repository [2].
> >
> > We are not ready to start a vote just yet -- we've already identified
> > a few issues worth fixing. That said, I'd like to invite everybody to
> > take a peek and comment. I'm hoping we can address as many issues as
> > possible before we start the voting process.
> >
> > Please let us know if you see any issues.
> >
> > Thanks,
> > Davor
> >
> > [1] https://github.com/apache/incubator-beam/tree/release-0.1.0-incubating
> > [2] https://repository.apache.org/content/repositories/orgapachebeam-1000/
>
> >>>
> >>> --
> >>> Jean-Baptiste Onofré
> >>> jbono...@apache.org
> >>> http://blog.nanthrax.net
> >>> Talend - http://www.talend.com
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
>
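To make the two voted options concrete for downstream users, the Maven coordinates would differ roughly as follows. This is a hypothetical sketch only: the Option 1 groupId, module names, and version shown here are illustrative, not coordinates anyone has decided on.

```xml
<!-- Option 1 (current layout): per-area groupIds, short artifactIds -->
<dependency>
  <groupId>org.apache.beam.runners</groupId>
  <artifactId>flink</artifactId>
  <version>0.1.0-incubating</version>
</dependency>

<!-- Option 2: single org.apache.beam groupId, prefixed artifactIds -->
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>flink-runner</artifactId>
  <version>0.1.0-incubating</version>
</dependency>
```

Under Option 2, the artifactId alone identifies the module, which is also what ends up in jar file names; under Option 1 the groupId carries part of that information.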
Re: 0.1.0-incubating release
Hi Max,

I discussed with Davor yesterday. Basically, I proposed:

1. To rename all parents with a prefix (beam-parent, flink-runner-parent, spark-runner-parent, etc).
2. For the groupId, I prefer to use different groupIds; it's clearer to me, and it's exactly the intended usage of the groupId. Some projects use a single groupId (spark, hadoop, etc), others use multiple (camel, karaf, activemq, etc). I prefer different groupIds but OK to go back to a single one.

Anyway, I'm preparing a PR to introduce a new Maven module: "distribution". The purpose is to address both BEAM-319 (first) and BEAM-320 (later). It's where we will be able to define the different distributions we plan to publish (source and binaries).

Regards
JB

On 06/03/2016 11:02 AM, Maximilian Michels wrote:
Thanks for getting us ready for the first release, Davor! We would like to fix BEAM-315 next week. Is there already a timeline for the first release? If so, we could also address this in a minor release. Releasing often will give us some experience with our release process :)

I would like everyone to think about the artifact names and group ids again. "parent" and "flink" are not very suitable names for the Beam parent or the Flink Runner artifact (same goes for the Spark Runner). I'd prefer "beam-parent", "flink-runner", and "spark-runner" as artifact ids.

One might think of Maven groupIds as a sort of hierarchy, but they're not. They're just an identifier. Renaming the parent pom to "apache-beam" or "beam-parent" would give us the old naming scheme which used flat group ids (before [1]).

In the end, I guess it doesn't matter too much if we document the naming schemes accordingly. What matters is that we use a consistent naming scheme.

Cheers,
Max

[1] https://issues.apache.org/jira/browse/BEAM-287

On Thu, Jun 2, 2016 at 4:00 PM, Jean-Baptiste Onofré wrote:
Actually, I think we can fix both issues in one commit.
What do you think about renaming the main parent POM with:
groupId: org.apache.beam
artifactId: apache-beam
?

Thanks to that, the source distribution will be named apache-beam-xxx-sources.zip and it would be clearer to devs.

Thoughts ?

Regards
JB

On 06/02/2016 03:10 PM, Jean-Baptiste Onofré wrote:
Another annoying thing is the main parent POM artifactId.

Now, it's just "parent". What do you think about renaming to "beam-parent" ?

Regarding the source distribution name, I would cancel this staging to fix that (I will have a PR ready soon).

Thoughts ?

Regards
JB

On 06/02/2016 03:46 AM, Davor Bonaci wrote:
Hi everyone!

We've started the release process for our first release, 0.1.0-incubating.

To recap previous discussions, we don't have particular functional goals for this release. Instead, we'd like to make available what's currently in the repository, as well as work through the release process.

With this in mind, we've:
* branched off the release branch [1] at master's commit 8485272,
* updated master to prepare for the second release, 0.2.0-incubating,
* built the first release candidate, RC1, and deployed it to a staging repository [2].

We are not ready to start a vote just yet -- we've already identified a few issues worth fixing. That said, I'd like to invite everybody to take a peek and comment. I'm hoping we can address as many issues as possible before we start the voting process.

Please let us know if you see any issues.

Thanks,
Davor

[1] https://github.com/apache/incubator-beam/tree/release-0.1.0-incubating
[2] https://repository.apache.org/content/repositories/orgapachebeam-1000/

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
Re: Serialization for org.apache.beam.sdk.util.WindowedValue$*
Amit,

Thanks for this pointer as well, CoderHelpers helps indeed!

Thomas

On Thu, Jun 2, 2016 at 12:51 PM, Amit Sela wrote:
> Oh sorry, of course I meant Thomas Groh in my previous email.. But @Thomas
> Weise this example
> <https://github.com/apache/incubator-beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/EvaluationContext.java#L108>
> might help, this is how the Spark runner uses Coders like Thomas Groh described.
>
> And I agree that we should consider making PipelineOptions Serializable or
> provide a generic solution for Runners.
>
> Hope this helps,
> Amit
>
> On Thu, Jun 2, 2016 at 10:35 PM Amit Sela wrote:
>
> > Thomas is right, though in my case, I encountered this issue when using
> > Spark's new API that uses Encoders
> > <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoder.scala>
> > not just for serialization but also for "translating" the object into a
> > schema for optimized execution with Tungsten
> > <https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html>.
> >
> > In this case I'm using Kryo and I've solved this by registering (in Spark,
> > not Beam) custom serializers from
> > https://github.com/magro/kryo-serializers
> > I would consider (in the future) implementing Encoders with the help of
> > Coders, but I still haven't wrapped my mind around this.
> >
> > On Thu, Jun 2, 2016 at 9:59 PM Thomas Groh wrote:
> >
> >> The Beam Model ensures that all PCollections have a Coder; the PCollection
> >> Coder is the standard way to materialize the elements of a
> >> PCollection [1][2]. Most SDK-provided classes that will need to be
> >> transferred across the wire have an associated coder, and some additional
> >> default datatypes have coders associated with them (in the CoderRegistry [3]).
> >>
> >> FullWindowedValueCoder [4] is capable of encoding and decoding the entirety
> >> of a WindowedValue, and is constructed from a ValueCoder (obtained from the
> >> PCollection) and a WindowCoder (obtained from the WindowFn of the
> >> WindowingStrategy of the PCollection). Given an input PCollection `pc`, you
> >> can construct the FullWindowedValueCoder with the following code snippet:
> >>
> >> ```
> >> FullWindowedValueCoder.of(pc.getCoder(),
> >>     pc.getWindowingStrategy().getWindowFn().windowCoder())
> >> ```
> >>
> >> [1] https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java
> >> [2] https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/PCollection.java#L130
> >> [3] https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/CoderRegistry.java#L94
> >> [4] https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/WindowedValue.java#L515
> >>
> >> On Thu, Jun 2, 2016 at 10:41 AM, Thomas Weise wrote:
> >>
> >> > Hi Amit,
> >> >
> >> > Thanks for the help. I implemented the same serialization workaround for
> >> > the PipelineOptions. Since every distributed runner will have to solve
> >> > this, would it make sense to provide the serialization support along with
> >> > the interface proxy?
> >> >
> >> > Here is the exception I get with WindowedValue:
> >> >
> >> > com.esotericsoftware.kryo.KryoException: Class cannot be created (missing
> >> > no-arg constructor):
> >> > org.apache.beam.sdk.util.WindowedValue$TimestampedValueInGlobalWindow
> >> >     at com.esotericsoftware.kryo.Kryo$DefaultInstantiatorStrategy.newInstantiatorOf(Kryo.java:1228)
> >> >     at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1049)
> >> >     at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1058)
> >> >     at com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:547)
> >> >     at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:523)
> >> >     at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
> >> >
> >> > Thanks,
> >> > Thomas
> >> >
> >> > On Wed, Jun 1, 2016 at 12:45 AM, Amit Sela wrote:
> >> >
> >> > > Hi Thomas,
> >> > >
> >> > > Spark and the Spark runner are using Kryo for serialization and it
> >> > > seems to work just fine. What is your exact problem? Stack trace/message?
> >> > > I've hit an issue with Guava's ImmutableList/Map etc. and used
> >> > > https://github.com/magro/kryo-serializers for that.
> >> > >
> >> > > For PipelineOptions you can take a look at the Spark runner code here:
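The workaround the thread converges on can be sketched as a custom Kryo serializer that delegates to the PCollection's coder, so Kryo never needs to reflect over WindowedValue's private inner classes. This is a minimal sketch, not the Spark runner's actual code: it assumes the incubator-era Beam SDK (CoderUtils in org.apache.beam.sdk.util) and Kryo on the classpath; the class name CoderBackedSerializer is made up.

```java
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import org.apache.beam.sdk.coders.CoderException;
import org.apache.beam.sdk.util.CoderUtils;
import org.apache.beam.sdk.util.WindowedValue;

/**
 * Serializes WindowedValue through its Beam coder instead of Kryo's
 * FieldSerializer, avoiding the "missing no-arg constructor" failure
 * on WindowedValue's private inner classes.
 */
public class CoderBackedSerializer<T> extends Serializer<WindowedValue<T>> {

  private final WindowedValue.FullWindowedValueCoder<T> coder;

  public CoderBackedSerializer(WindowedValue.FullWindowedValueCoder<T> coder) {
    this.coder = coder;
  }

  @Override
  public void write(Kryo kryo, Output output, WindowedValue<T> value) {
    try {
      byte[] bytes = CoderUtils.encodeToByteArray(coder, value);
      output.writeInt(bytes.length);  // length prefix so read() knows how much to consume
      output.writeBytes(bytes);
    } catch (CoderException e) {
      throw new RuntimeException("Failed to encode WindowedValue", e);
    }
  }

  @Override
  public WindowedValue<T> read(Kryo kryo, Input input, Class<WindowedValue<T>> type) {
    try {
      byte[] bytes = input.readBytes(input.readInt());
      return CoderUtils.decodeFromByteArray(coder, bytes);
    } catch (CoderException e) {
      throw new RuntimeException("Failed to decode WindowedValue", e);
    }
  }
}
```

One wrinkle: since the concrete WindowedValue subclasses are private, registration has to go through an instance, e.g. `kryo.register(WindowedValue.valueInGlobalWindow("x").getClass(), serializer)`, or a default serializer for the WindowedValue hierarchy.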
Re: Serialization for org.apache.beam.sdk.util.WindowedValue$*
Thanks, works like a charm! For such hidden gems there should be a Beam runner newbie guide ;-)

Thomas

On Thu, Jun 2, 2016 at 11:59 AM, Thomas Groh wrote:
> The Beam Model ensures that all PCollections have a Coder; the PCollection
> Coder is the standard way to materialize the elements of a
> PCollection [1][2]. Most SDK-provided classes that will need to be
> transferred across the wire have an associated coder, and some additional
> default datatypes have coders associated with them (in the CoderRegistry [3]).
>
> FullWindowedValueCoder [4] is capable of encoding and decoding the entirety
> of a WindowedValue, and is constructed from a ValueCoder (obtained from the
> PCollection) and a WindowCoder (obtained from the WindowFn of the
> WindowingStrategy of the PCollection). Given an input PCollection `pc`, you
> can construct the FullWindowedValueCoder with the following code snippet:
>
> ```
> FullWindowedValueCoder.of(pc.getCoder(),
>     pc.getWindowingStrategy().getWindowFn().windowCoder())
> ```
>
> [1] https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java
> [2] https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/PCollection.java#L130
> [3] https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/CoderRegistry.java#L94
> [4] https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/WindowedValue.java#L515
>
> On Thu, Jun 2, 2016 at 10:41 AM, Thomas Weise wrote:
>
> > Hi Amit,
> >
> > Thanks for the help. I implemented the same serialization workaround for
> > the PipelineOptions. Since every distributed runner will have to solve
> > this, would it make sense to provide the serialization support along with
> > the interface proxy?
> >
> > Here is the exception I get with WindowedValue:
> >
> > com.esotericsoftware.kryo.KryoException: Class cannot be created (missing
> > no-arg constructor):
> > org.apache.beam.sdk.util.WindowedValue$TimestampedValueInGlobalWindow
> >     at com.esotericsoftware.kryo.Kryo$DefaultInstantiatorStrategy.newInstantiatorOf(Kryo.java:1228)
> >     at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1049)
> >     at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1058)
> >     at com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:547)
> >     at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:523)
> >     at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
> >
> > Thanks,
> > Thomas
> >
> > On Wed, Jun 1, 2016 at 12:45 AM, Amit Sela wrote:
> >
> > > Hi Thomas,
> > >
> > > Spark and the Spark runner are using Kryo for serialization and it seems
> > > to work just fine. What is your exact problem? Stack trace/message?
> > > I've hit an issue with Guava's ImmutableList/Map etc. and used
> > > https://github.com/magro/kryo-serializers for that.
> > >
> > > For PipelineOptions you can take a look at the Spark runner code here:
> > >
> > > https://github.com/apache/incubator-beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkRuntimeContext.java#L73
> > >
> > > I'd be happy to assist with Kryo.
> > >
> > > Thanks,
> > > Amit
> > >
> > > On Wed, Jun 1, 2016 at 7:10 AM Thomas Weise wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm working on putting together a basic runner for Apache Apex.
> > > >
> > > > Hitting a couple of serialization related issues with running tests.
> > > > Apex is using Kryo for serialization by default (and Kryo can delegate
> > > > to other serialization frameworks).
> > > >
> > > > The inner classes of WindowedValue are private and have no default
> > > > constructor, which the Kryo field serializer does not like. Also these
> > > > classes are not Java serializable, so that's not a fallback option (not
> > > > that it would be efficient anyway).
> > > >
> > > > What's the recommended technique to move the WindowedValues over the
> > > > wire?
> > > >
> > > > Also, PipelineOptions aren't serializable, while most other classes
> > > > are. They are needed for example with DoFnRunnerBase, so what's the
> > > > recommended way to distribute them? Disassemble/reassemble? :)
> > > >
> > > > Thanks,
> > > > Thomas
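Putting the thread's answer together, the coder-based round trip looks roughly like the sketch below. It assumes the incubator-era Beam SDK (CoderUtils and WindowedValue in org.apache.beam.sdk.util); it hard-codes a StringUtf8Coder and the global window where a real runner would pull the element coder from `pc.getCoder()` and the window coder from `pc.getWindowingStrategy().getWindowFn().windowCoder()`.

```java
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.transforms.windowing.GlobalWindow;
import org.apache.beam.sdk.util.CoderUtils;
import org.apache.beam.sdk.util.WindowedValue;

public class WindowedValueRoundTrip {
  public static void main(String[] args) throws Exception {
    // In a runner these two coders would come from the PCollection,
    // not be hard-coded as they are in this sketch.
    Coder<String> elementCoder = StringUtf8Coder.of();
    WindowedValue.FullWindowedValueCoder<String> wireCoder =
        WindowedValue.FullWindowedValueCoder.of(elementCoder, GlobalWindow.Coder.INSTANCE);

    WindowedValue<String> original = WindowedValue.valueInGlobalWindow("hello");

    // Encode to bytes for the wire, then decode on the "other side" --
    // no reflection over WindowedValue's private inner classes involved.
    byte[] wire = CoderUtils.encodeToByteArray(wireCoder, original);
    WindowedValue<String> restored = CoderUtils.decodeFromByteArray(wireCoder, wire);
  }
}
```

This sidesteps both Kryo complaints at once: the bytes on the wire are produced by the coder, and the receiving side reconstructs the WindowedValue through the coder's decode path rather than a no-arg constructor.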