[jira] [Commented] (BEAM-117) Implement the API for Static Display Metadata
[ https://issues.apache.org/jira/browse/BEAM-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297137#comment-15297137 ] ASF GitHub Bot commented on BEAM-117: - GitHub user swegner opened a pull request: https://github.com/apache/incubator-beam/pull/375 [BEAM-117] Evaluate display data from InProcessPipelineRunner Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- Display data can be added to any PTransform to be used for display from any runner. Runners are not required to consume display data, and currently many don't. This changes InProcessRunner to consumer display data (and then discard it) in order to validate that display data is properly implemented on transforms within a pipeline. Exceptions thrown within HasDisplayData implementations will cause Pipeline.run() to fail. You can merge this pull request into a Git repository by running: $ git pull https://github.com/swegner/incubator-beam displaydata-directrunner Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/375.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #375 commit 698a0af1553279ba48ad2f4768556fbd4810208a Author: Scott WegnerDate: 2016-05-23T21:29:33Z Evaluate display data from InProcessPipelineRunner Display data can be added to any PTransform to be used for display from any runner. Runners are not required to consume display data, and currently many don't. This changes InProcessRunner to consumer display data (and then discard it) in order to validate that display data is properly implemented on transforms within a pipeline. Exceptions thrown within HasDisplayData implementations will cause Pipeline.run() to fail. > Implement the API for Static Display Metadata > - > > Key: BEAM-117 > URL: https://issues.apache.org/jira/browse/BEAM-117 > Project: Beam > Issue Type: New Feature > Components: sdk-java-core >Reporter: Ben Chambers >Assignee: Scott Wegner > > As described in the following doc, we would like the SDK to allow associating > display metadata with PTransforms. > https://docs.google.com/document/d/11enEB9JwVp6vO0uOYYTMYTGkr3TdNfELwWqoiUg5ZxM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-53) PubSubIO: reimplement in Java
[ https://issues.apache.org/jira/browse/BEAM-53?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297103#comment-15297103 ] ASF GitHub Bot commented on BEAM-53: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/371 > PubSubIO: reimplement in Java > - > > Key: BEAM-53 > URL: https://issues.apache.org/jira/browse/BEAM-53 > Project: Beam > Issue Type: New Feature > Components: runner-core >Reporter: Daniel Halperin >Assignee: Mark Shields > > PubSubIO is currently only partially implemented in Java: the > DirectPipelineRunner uses a non-scalable API in a single-threaded manner. > In contrast, the DataflowPipelineRunner uses an entirely different code path > implemented in the Google Cloud Dataflow service. > We need to reimplement PubSubIO in Java in order to support other runners in > a scalable way. > Additionally, we can take this opportunity to add new features: > * getting timestamp from an arbitrary lambda in arbitrary formats rather than > from a message attribute in only 2 formats. > * exposing metadata and attributes in the elements produced by PubSubIO.Read > * setting metadata and attributes in the messages written by PubSubIO.Write -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-117) Implement the API for Static Display Metadata
[ https://issues.apache.org/jira/browse/BEAM-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15291647#comment-15291647 ] ASF GitHub Bot commented on BEAM-117: - GitHub user swegner opened a pull request: https://github.com/apache/incubator-beam/pull/355 [BEAM-117] Fix bug in PipelineOptions DisplayData serialization Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- In the code for populating display data for `PipelineOptions` in the serialized JSON, we had a bug in `ProxyInvocationHandler.Serializer` where it was not checking for null values from previously-deserialized JSON values. As a result, a null value in the JSON would cause us to throw and block pipelines from running. This PR fixes a number of issues: * Properly handle null JSON values in `ProxyInvocatinHandler.Serializer` * Establish a common pattern for handling errors during display data collection, via `DisplayData.from(Throwable)` * Implement resiliency pattern in `ProxyInvocationHandler.Serializer` so bugs in `PipelineOptions` serialization do not prevent running pipelines. You can merge this pull request into a Git repository by running: $ git pull https://github.com/swegner/incubator-beam displaydata-resiliency Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/355.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #355 commit 8ffeb3da894ee5e25a54f7faa8faeff8fe1ae98f Author: Scott WegnerDate: 2016-05-19T16:17:37Z Fix bug in PipelineOptions DisplayData serialization > Implement the API for Static Display Metadata > - > > Key: BEAM-117 > URL: https://issues.apache.org/jira/browse/BEAM-117 > Project: Beam > Issue Type: New Feature > Components: sdk-java-core >Reporter: Ben Chambers >Assignee: Scott Wegner > > As described in the following doc, we would like the SDK to allow associating > display metadata with PTransforms. > https://docs.google.com/document/d/11enEB9JwVp6vO0uOYYTMYTGkr3TdNfELwWqoiUg5ZxM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-242) Enable Checkstyle check for the Flink Runner
[ https://issues.apache.org/jira/browse/BEAM-242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15295674#comment-15295674 ] ASF GitHub Bot commented on BEAM-242: - GitHub user jbonofre opened a pull request: https://github.com/apache/incubator-beam/pull/372 [BEAM-242] Enable and fix checkstyle on Flink runner Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [X] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [X] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [X] Replace `` in the title with the actual Jira issue number, if there is one. - [X] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- Enable checkstyle plugin in Flink runner and fix checkstyle issues. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jbonofre/incubator-beam BEAM-242-FLINK-CHECKSTYLE Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/372.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #372 commit 01039c296f90be1544385454e1efb0d2398aaa35 Author: Jean-Baptiste OnofréDate: 2016-05-22T18:43:05Z [BEAM-242] Enable and fix checkstyle on Flink runner > Enable Checkstyle check for the Flink Runner > - > > Key: BEAM-242 > URL: https://issues.apache.org/jira/browse/BEAM-242 > Project: Beam > Issue Type: Improvement > Components: runner-flink >Reporter: Maximilian Michels >Assignee: Jean-Baptiste Onofré >Priority: Minor > > We don't have a Checkstyle check in place for the Flink Runner. I would like > to use the SDK's checkstyle rules. > We could also think about a unified Checkstyle for all Runners. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-297) version typo at README.md of flink runner
[ https://issues.apache.org/jira/browse/BEAM-297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294900#comment-15294900 ] ASF GitHub Bot commented on BEAM-297: - GitHub user JianfengQian opened a pull request: https://github.com/apache/incubator-beam/pull/370 [BEAM-297] update flink README.md at line 145 Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- [BEAM-297] update flink README.md at line 145, version of org.apache.beam should be 0.1.0-incubating-SNAPSHOT now. You can merge this pull request into a Git repository by running: $ git pull https://github.com/JianfengQian/incubator-beam patch-1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/370.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #370 commit 671ac2e321ec07e511281af28138861dc56b9b22 Author: JianfengQianDate: 2016-05-21T10:32:29Z [BEAM-297] update flink README.md at line 145,version of org.apache.beam should be 0.1.0-incubating-SNAPSHOT now. > version typo at README.md of flink runner > - > > Key: BEAM-297 > URL: https://issues.apache.org/jira/browse/BEAM-297 > Project: Beam > Issue Type: Bug > Components: runner-flink >Affects Versions: 0.1.0-incubating >Reporter: Jianfeng Qian >Priority: Trivial > Labels: easyfix > Fix For: 0.1.0-incubating > > > version typo at README.md of flink runner > at line 145, the version should be "0.1.0-incubating-SNAPSHOT" instead of > "0.4-SNAPSHOT" -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-103) Make UnboundedSourceWrapper parallel
[ https://issues.apache.org/jira/browse/BEAM-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15298188#comment-15298188 ] ASF GitHub Bot commented on BEAM-103: - GitHub user mxm opened a pull request: https://github.com/apache/incubator-beam-site/pull/19 [BEAM-103] update capability matrix This reflects the changes of BEAM-103 in the Capability Matrix. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mxm/incubator-beam-site asf-site Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam-site/pull/19.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19 commit 0ee1aa0f08a941abd21a27ed3ab87d0bde1f6f14 Author: Maximilian MichelsDate: 2016-05-24T13:41:21Z [BEAM-103] update capability matrix This reflects the changes of BEAM-103 in the Capability Matrix. commit 07f63a1c88b94e9395faa8744ba171b7f5bcfc38 Author: Maximilian Michels Date: 2016-05-24T13:43:26Z rebuild Beam web site > Make UnboundedSourceWrapper parallel > > > Key: BEAM-103 > URL: https://issues.apache.org/jira/browse/BEAM-103 > Project: Beam > Issue Type: Improvement > Components: runner-flink >Reporter: Maximilian Michels >Assignee: Aljoscha Krettek > Fix For: 0.1.0-incubating > > > As of now {{UnboundedSource}} s are executed with a parallelism of 1 > regardless of the splits which the source returns. The corresponding > {{UnboundedSourceWrapper}} should implement {{RichParallelSourceFunction}} > and deal with splits correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-242) Enable Checkstyle check and Javadoc build for the Flink Runner
[ https://issues.apache.org/jira/browse/BEAM-242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15298525#comment-15298525 ] ASF GitHub Bot commented on BEAM-242: - GitHub user jbonofre opened a pull request: https://github.com/apache/incubator-beam/pull/382 [BEAM-242] Fix javadoc on Flink runner. Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [X] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [X] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [X] Replace `` in the title with the actual Jira issue number, if there is one. - [X] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- Fixing Javadoc issue on Flink runner (core and examples) allowing to run a mvn clean install -Prelease. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jbonofre/incubator-beam BEAM-242-JAVADOC Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/382.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #382 commit ecc96837495bbdc17f6b77899ac4d16c2d7ba839 Author: Jean-Baptiste OnofréDate: 2016-05-23T06:48:34Z [BEAM-242] Fix javadoc on Flink runner. > Enable Checkstyle check and Javadoc build for the Flink Runner > --- > > Key: BEAM-242 > URL: https://issues.apache.org/jira/browse/BEAM-242 > Project: Beam > Issue Type: Improvement > Components: runner-flink >Reporter: Maximilian Michels >Assignee: Jean-Baptiste Onofré >Priority: Minor > > We don't have a Checkstyle check in place for the Flink Runner. I would like > to use the SDK's checkstyle rules. > We could also think about a unified Checkstyle for all Runners. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-297) version typo at README.md of flink runner
[ https://issues.apache.org/jira/browse/BEAM-297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294886#comment-15294886 ] ASF GitHub Bot commented on BEAM-297: - Github user JianfengQian closed the pull request at: https://github.com/apache/incubator-beam/pull/354 > version typo at README.md of flink runner > - > > Key: BEAM-297 > URL: https://issues.apache.org/jira/browse/BEAM-297 > Project: Beam > Issue Type: Bug > Components: runner-flink >Affects Versions: 0.1.0-incubating >Reporter: Jianfeng Qian >Priority: Trivial > Labels: easyfix > Fix For: 0.1.0-incubating > > > version typo at README.md of flink runner > at line 145, the version should be "0.1.0-incubating-SNAPSHOT" instead of > "0.4-SNAPSHOT" -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-53) PubSubIO: reimplement in Java
[ https://issues.apache.org/jira/browse/BEAM-53?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15295352#comment-15295352 ] ASF GitHub Bot commented on BEAM-53: GitHub user mshields822 opened a pull request: https://github.com/apache/incubator-beam/pull/371 [BEAM-53] Mirror some DataflowJavaSDK pubsub fixes Forward port from https://github.com/GoogleCloudPlatform/DataflowJavaSDK/pull/281 You can merge this pull request into a Git repository by running: $ git pull https://github.com/mshields822/incubator-beam pubsub-fiddles Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/371.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #371 commit c87a19b38e51b2e15689b154089ec93da1a8c7ae Author: Mark ShieldsDate: 2016-05-21T04:22:08Z s/Apiary/Json/g commit 0a48cfa5479310fbdb212dce8c84a140517a7f66 Author: Mark Shields Date: 2016-05-21T04:23:02Z wibble commit 178f5dcef4bb22def2b818a1dd21e5952700c0e4 Author: Mark Shields Date: 2016-05-21T05:14:54Z Mark as bounded commit da3bfe464f0b2b69c826d1a7d72868211b448371 Author: Mark Shields Date: 2016-05-22T01:44:47Z Fix warnings commit 710fd2db6100d071392dd5fba28114ebe8548065 Author: Mark Shields Date: 2016-05-22T01:47:17Z yawn > PubSubIO: reimplement in Java > - > > Key: BEAM-53 > URL: https://issues.apache.org/jira/browse/BEAM-53 > Project: Beam > Issue Type: New Feature > Components: runner-core >Reporter: Daniel Halperin >Assignee: Mark Shields > > PubSubIO is currently only partially implemented in Java: the > DirectPipelineRunner uses a non-scalable API in a single-threaded manner. > In contrast, the DataflowPipelineRunner uses an entirely different code path > implemented in the Google Cloud Dataflow service. > We need to reimplement PubSubIO in Java in order to support other runners in > a scalable way. > Additionally, we can take this opportunity to add new features: > * getting timestamp from an arbitrary lambda in arbitrary formats rather than > from a message attribute in only 2 formats. > * exposing metadata and attributes in the elements produced by PubSubIO.Read > * setting metadata and attributes in the messages written by PubSubIO.Write -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-270) Support Timestamps/Windows in Flink Batch
[ https://issues.apache.org/jira/browse/BEAM-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292003#comment-15292003 ] ASF GitHub Bot commented on BEAM-270: - Github user aljoscha closed the pull request at: https://github.com/apache/incubator-beam/pull/328 > Support Timestamps/Windows in Flink Batch > - > > Key: BEAM-270 > URL: https://issues.apache.org/jira/browse/BEAM-270 > Project: Beam > Issue Type: Sub-task > Components: runner-flink >Reporter: Aljoscha Krettek >Assignee: Aljoscha Krettek > > Right now, Flink Batch execution does not use {{WindowedValue}} internally, > this means that all programs that interact with timestamps/windows will not > work. We should just internally wrap everything in {{WindowedValue}} as we do > in Flink Streaming. This also makes it very straightforward to add support > for windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-115) Beam Runner API
[ https://issues.apache.org/jira/browse/BEAM-115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15291952#comment-15291952 ] ASF GitHub Bot commented on BEAM-115: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/268 > Beam Runner API > --- > > Key: BEAM-115 > URL: https://issues.apache.org/jira/browse/BEAM-115 > Project: Beam > Issue Type: Improvement > Components: runner-core >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > > The PipelineRunner API from the SDK is not ideal for the Beam technical > vision. > It has technical limitations: > - The user's DAG (even including library expansions) is never explicitly > represented, so it cannot be analyzed except incrementally, and cannot > necessarily be reconstructed (for example, to display it!). > - The flattened DAG of just primitive transforms isn't well-suited for > display or transform override. > - The TransformHierarchy isn't well-suited for optimizations. > - The user must realistically pre-commit to a runner, and its configuration > (batch vs streaming) prior to graph construction, since the runner will be > modifying the graph as it is built. > - It is fairly language- and SDK-specific. > It has usability issues (these are not from intuition, but derived from > actual cases of failure to use according to the design) > - The interleaving of apply() methods in PTransform/Pipeline/PipelineRunner > is confusing. > - The TransformHierarchy, accessible only via visitor traversals, is > cumbersome. > - The staging of construction-time vs run-time is not always obvious. > These are just examples. This ticket tracks designing, coming to consensus, > and building an API that more simply and directly supports the technical > vision. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-272) Flink Runner depends on Dataflow Runner
[ https://issues.apache.org/jira/browse/BEAM-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15279893#comment-15279893 ] ASF GitHub Bot commented on BEAM-272: - GitHub user mxm opened a pull request: https://github.com/apache/incubator-beam/pull/324 [BEAM-272][flink] remove dependency on Dataflow Runner Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [X] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [X] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [X] Replace `` in the title with the actual Jira issue number, if there is one. - [X] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- You can merge this pull request into a Git repository by running: $ git pull https://github.com/mxm/incubator-beam BEAM-272 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/324.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #324 commit abfe73a962624001022a83ce4dd5c1697037bdbf Author: Maximilian MichelsDate: 2016-05-11T09:57:44Z [BEAM-272][flink] remove dependency on Dataflow Runner > Flink Runner depends on Dataflow Runner > --- > > Key: BEAM-272 > URL: https://issues.apache.org/jira/browse/BEAM-272 > Project: Beam > Issue Type: Bug > Components: runner-flink >Reporter: Maximilian Michels >Assignee: Maximilian Michels > > During restructuring of the modules, we have introduced a dependency of the > Flink Runner on the Dataflow Runner. The {{PipelineOptionsFactory}} used to > be part of the SDK core but moved to the Dataflow Runner. We should get rid > of this dependency to avoid classpath related problems. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15278646#comment-15278646 ] ASF GitHub Bot commented on BEAM-22: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/312 > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-117) Implement the API for Static Display Metadata
[ https://issues.apache.org/jira/browse/BEAM-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15278633#comment-15278633 ] ASF GitHub Bot commented on BEAM-117: - GitHub user swegner opened a pull request: https://github.com/apache/incubator-beam/pull/315 [BEAM-117] Runners should be resilient to DisplayData failure Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- Display data is collected from PTransforms at Pipeline construction time. Collecting display data runs user code from provided transforms and fn's. These components should be designed not to throw during pipeline construction, however we also shouldn't fail a pipeline if this code does fail. This PR adds resiliency to the DataflowPipelineTranslator, where we collect display data for the Dataflow runner, and also a RunnableOnService test to verify that all runners are resilient to display data failures. Other runners are not yet using display data, but will get this validation for free when they do. You can merge this pull request into a Git repository by running: $ git pull https://github.com/swegner/incubator-beam displaydata-safe Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/315.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #315 commit d6c5025937cbd0016d0a83d08e53902ae4d4519b Author: Scott WegnerDate: 2016-05-10T18:19:14Z Runners should be resilient to DisplayData failure Display data is collected from PTransforms at Pipeline construction time. Collecting display data runs user code from provided transforms and fn's. These components should be designed not to throw during pipeline construction, however we also shouldn't fail a pipeline if this code does fail. This PR adds resiliency to the DataflowPipelineTranslator, where we collect display data for the Dataflow runner, and also a RunnableOnService test to verify that all runners are resilient to display data failures. Other runners are not yet using display data, but will get this validation for free when they do. > Implement the API for Static Display Metadata > - > > Key: BEAM-117 > URL: https://issues.apache.org/jira/browse/BEAM-117 > Project: Beam > Issue Type: New Feature > Components: sdk-java-core >Reporter: Ben Chambers >Assignee: Scott Wegner > > As described in the following doc, we would like the SDK to allow associating > display metadata with PTransforms. > https://docs.google.com/document/d/11enEB9JwVp6vO0uOYYTMYTGkr3TdNfELwWqoiUg5ZxM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-103) Make UnboundedSourceWrapper parallel
[ https://issues.apache.org/jira/browse/BEAM-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15278810#comment-15278810 ] ASF GitHub Bot commented on BEAM-103: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/274 > Make UnboundedSourceWrapper parallel > > > Key: BEAM-103 > URL: https://issues.apache.org/jira/browse/BEAM-103 > Project: Beam > Issue Type: Improvement > Components: runner-flink >Reporter: Maximilian Michels >Assignee: Aljoscha Krettek > > As of now {{UnboundedSource}} s are executed with a parallelism of 1 > regardless of the splits which the source returns. The corresponding > {{UnboundedSourceWrapper}} should implement {{RichParallelSourceFunction}} > and deal with splits correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15278813#comment-15278813 ] ASF GitHub Bot commented on BEAM-22: GitHub user tgroh opened a pull request: https://github.com/apache/incubator-beam/pull/318 [BEAM-22] Use an AtomicReference in InProcessSideInputContainer Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- This fixes a TOCTOU race in the contents updating logic, where the determination that the current pane should replace the contents of the side input and the replacement is not a single atomic operation. Using AtomicReference allows the use of compareAndSet to ensure that the replacement can only occur on the pane that the decision to replace was made with. Fixes a race where a pane could be the latest, and replace a pane, but would be lost due to an earlier pane being written between the invalidation and loading of contents. Fixes a race where a reader can incorrectly read an empty iterable as the contents of a PCollectionView, due to occuring between the invalidate and reload steps. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tgroh/incubator-beam atomic_reference_side_input_container Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/318.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #318 > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15279008#comment-15279008 ] ASF GitHub Bot commented on BEAM-22: GitHub user tgroh opened a pull request: https://github.com/apache/incubator-beam/pull/320 [BEAM-22] Reuse DoFns in ParDoEvaluators Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- This allows the runner to avoid cloning DoFns for every input bundle. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tgroh/incubator-beam reuse_dofns_pardoevaluators Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/320.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #320 commit f018c75ef3ca1b60fc990bd88031d5419c571b87 Author: Thomas GrohDate: 2016-05-10T21:06:27Z Reuse DoFns in ParDoEvaluators This allows the runner to avoid cloning DoFns for every input bundle. > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-271) Option to configure remote Dataflow windmill service endpoint
[ https://issues.apache.org/jira/browse/BEAM-271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15278839#comment-15278839 ] ASF GitHub Bot commented on BEAM-271: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/314 > Option to configure remote Dataflow windmill service endpoint > - > > Key: BEAM-271 > URL: https://issues.apache.org/jira/browse/BEAM-271 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Raghu Angadi >Assignee: Davor Bonaci >Priority: Minor > > Add two options to DataflowPipelineDebugOptions to configure Dataflow remove > windmill service. This lets Dataflow users to configure the streaming > pipelines to point to remote windmill service. > https://github.com/apache/incubator-beam/pull/314 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15278844#comment-15278844 ] ASF GitHub Bot commented on BEAM-22: GitHub user tgroh opened a pull request: https://github.com/apache/incubator-beam/pull/319 [BEAM-22] Enable RunnableOnService Tests Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- Not ready for review. Publishing PR to hook into Travis and Jenkins. Update runners/direct-java/pom.xml to enable the RunnableOnService tests phase. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tgroh/incubator-beam enable_ros_tests Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/319.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #319 commit d1796ba6fecb8423e563fcdf66946beda79e52c6 Author: Thomas GrohDate: 2016-05-09T22:47:27Z Minor checkArgument style fix commit f0e38fd170949f27d4794113e4bcb2077ffe88a6 Author: Thomas Groh Date: 2016-05-10T18:27:37Z Use an AtomicReference in InProcessSideInputContainer This fixes a TOCTOU race in the contents updating logic, where the determination that the current pane should replace the contents of the side input and the replacement is not a single atomic operation. Using AtomicReference allows the use of compareAndSet to ensure that the replacement can only occur on the pane that the decision to replace was made with. Fixes a race where a pane could be the latest, and replace a pane, but would be lost due to an earlier pane being written between the invalidation and loading of contents. Fixes a race where a reader can incorrectly read an empty iterable as the contents of a PCollectionView, due to occuring between the invalidate and reload steps. commit e06e449e3762a48404d0407babaff440ebfa416e Author: Thomas Groh Date: 2016-05-10T20:22:20Z Cache read SideInput Contents in the InProcessSideInputContainer This ensures that while processing a bundle all elements see the same contents for any SideInput Window. commit 8ff1d79474f3d114381b924fa61aa46bd7b935db Author: Thomas Groh Date: 2016-05-10T20:36:21Z Enable RunnableOnService tests for the Direct Runner > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15279326#comment-15279326 ] ASF GitHub Bot commented on BEAM-22: GitHub user tgroh opened a pull request: https://github.com/apache/incubator-beam/pull/322 [BEAM-22] More regularly schedule additional roots Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- This ensures that even when elements are pushed back into the Pipeline Runner, roots are scheduled if necessary. As elements may be rescheduled indefinitely, this is required to ensure that unbounded roots are scheduled during pipeline execution when existing elements are blocked on side inputs. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tgroh/incubator-beam more_aggressively_add_roots Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/322.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #322 commit 1f5e399630e3ba71e2c025e7ab3f32c6c3ba9518 Author: Thomas GrohDate: 2016-05-11T00:11:23Z More regularly schedule additional roots This ensures that even when elements are pushed back into the Pipeline Runner, roots are scheduled if necessary. As elements may be rescheduled indefinitely, this is required to ensure that unbounded roots are scheduled during pipeline execution when existing elements are blocked on side inputs. > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-225) Create Class for Common TypeDescriptors
[ https://issues.apache.org/jira/browse/BEAM-225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280522#comment-15280522 ] ASF GitHub Bot commented on BEAM-225: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/275 > Create Class for Common TypeDescriptors > --- > > Key: BEAM-225 > URL: https://issues.apache.org/jira/browse/BEAM-225 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Jesse Anderson >Assignee: Jesse Anderson >Priority: Trivial > Labels: starter > > There should be a built-in class for common types like String, Float, etc. > Right now, all types have to create an inline TypeDescriptor: > {code:java} > PCollection words = suits.apply( > FlatMapElements.via( > (String line) -> Arrays.asList(line.split(" ")) > ).withOutputType(new TypeDescriptor() {})); > {code} > The should be a built-in class with common types like String so you don't > have to create a TypeDescriptor each time like: > {code:java} > PCollection words = suits.apply( > FlatMapElements.via( > (String line) -> Arrays.asList(line.split(" ")) > ).withOutputType(TypeDescriptors.STRINGS)); > {code} > Another possibility is to make it a static method: > {code:java} > PCollection words = suits.apply( > FlatMapElements.via( > (String line) -> Arrays.asList(line.split(" ")) > ).withOutputType(TypeDescriptors.strings())); > {code} > An example of this is Apache Crunch's Writables class > https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/types/writable/Writables.html. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-270) Support Timestamps/Windows in Flink Batch
[ https://issues.apache.org/jira/browse/BEAM-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15286181#comment-15286181 ] ASF GitHub Bot commented on BEAM-270: - GitHub user aljoscha opened a pull request: https://github.com/apache/incubator-beam/pull/343 [BEAM-270] Support Timestamps/Windows in Flink Batch This is a cleanup version of #328, this time for real. The interesting things are in `FlinkPartialReduceFunction`/`FlinkReduceFunction`, `FlinkMergingPartialReduceFunction`/`FlinkMergingReduceFunction` and `FlinkMergingNonShuffleReduceFunction`. All of these implement special cases of windowing. The first two are for general, non-merging windows, the second set is for doing a `GroupByKey`, the last one is for merging windows. In the last case we cannot do a pre-shuffle combine step. R: @kennknowles and @mxm for review You can merge this pull request into a Git repository by running: $ git pull https://github.com/aljoscha/incubator-beam flink-windowed-value-batch-cleanup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/343.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #343 commit 93c3f99a6be44b7aad7859927c69974d368f9903 Author: Kenneth KnowlesDate: 2016-05-02T20:11:12Z Add TestFlinkPipelineRunner to FlinkRunnerRegistrar This makes the runner available for selection by integration tests. commit c48e1eaea4805359fdfc326d70b5d3c9964fe37f Author: Kenneth Knowles Date: 2016-05-02T21:04:20Z Configure RunnableOnService tests for Flink in batch mode Today Flink batch supports only global windows. This is a situation we intend our build to allow, eventually via JUnit category filtering. For now all the test classes that use non-global windows are excluded entirely via maven configuration. In the future, it should be on a per-test-method basis. commit 4cc1acc8630a2e436acd75f5aeb4ee6b01a38dc5 Author: Aljoscha Krettek Date: 2016-05-06T06:26:50Z Fix Dangling Flink DataSets commit 508eebafee0a762a59d5a21a07c26f43981c304f Author: Aljoscha Krettek Date: 2016-05-06T07:38:55Z Add hamcrest dependency to Flink Runner Without it the RunnableOnService tests seem to not work commit 3b1f064ca1f2985b6898d527f6174cc9055a1e4a Author: Kenneth Knowles Date: 2016-05-06T17:54:41Z Remove unused threadCount from integration tests commit 99df86fc057d49fcf4e305d3523864d68cf5abd1 Author: Kenneth Knowles Date: 2016-05-06T17:55:16Z Disable Flink streaming integration tests for now commit 4b2eb1151e4cd7ef140a9e6e0eab251452ef7070 Author: Kenneth Knowles Date: 2016-05-06T19:49:55Z Special casing job exec AssertionError in TestFlinkPipelineRunner commit c45651fa434d91064a16f54d53b65f40eadad108 Author: Aljoscha Krettek Date: 2016-05-10T11:53:03Z [BEAM-270] Support Timestamps/Windows in Flink Batch With this change we always use WindowedValue for the underlying Flink DataSets instead of just T. This allows us to support windowing as well. This changes also a lot of other stuff enabled by the above: - Use WindowedValue throughout - Add proper translation for Window.into() - Make side inputs window aware - Make GroupByKey and Combine transformations window aware, this includes support for merging windows. GroupByKey is implemented as a Combine with a concatenating CombineFn, for simplicity This removes Flink specific transformations for things that are handled by builtin sources/sinks, among other things this: - Removes special translation for AvroIO.Read/Write and TextIO.Read/Write - Removes special support for Write.Bound, this was not working properly and is now handled by the Beam machinery that uses DoFns for this - Removes special translation for binary Co-Group, the code was still in there but was never used With this change all RunnableOnService tests run on Flink Batch. commit 863aa2cb2a207449e9a711c4a9e248ed134939d4 Author: Aljoscha Krettek Date: 2016-05-13T12:17:50Z Fix faulty Flink Flatten when PCollectionList is empty commit 5c58830c2da4c0b86d80f93251b001f96edeef35 Author: Aljoscha Krettek Date: 2016-05-13T12:41:20Z Remove superfluous Flink Tests, Fix those that stay in All of the stuff in the removed ITCases is covered (in more detail) by the RunnableOnService tests. commit 5e6be8c757f89d933a4e6818cf7ef6316b7195d6 Author: Aljoscha Krettek
[jira] [Commented] (BEAM-53) PubSubIO: reimplement in Java
[ https://issues.apache.org/jira/browse/BEAM-53?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15287148#comment-15287148 ] ASF GitHub Bot commented on BEAM-53: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/332 > PubSubIO: reimplement in Java > - > > Key: BEAM-53 > URL: https://issues.apache.org/jira/browse/BEAM-53 > Project: Beam > Issue Type: New Feature > Components: runner-core >Reporter: Daniel Halperin >Assignee: Mark Shields > > PubSubIO is currently only partially implemented in Java: the > DirectPipelineRunner uses a non-scalable API in a single-threaded manner. > In contrast, the DataflowPipelineRunner uses an entirely different code path > implemented in the Google Cloud Dataflow service. > We need to reimplement PubSubIO in Java in order to support other runners in > a scalable way. > Additionally, we can take this opportunity to add new features: > * getting timestamp from an arbitrary lambda in arbitrary formats rather than > from a message attribute in only 2 formats. > * exposing metadata and attributes in the elements produced by PubSubIO.Read > * setting metadata and attributes in the messages written by PubSubIO.Write -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-53) PubSubIO: reimplement in Java
[ https://issues.apache.org/jira/browse/BEAM-53?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15287905#comment-15287905 ] ASF GitHub Bot commented on BEAM-53: GitHub user mshields822 opened a pull request: https://github.com/apache/incubator-beam/pull/346 [BEAM-53] Wire PubsubUnbounded{Source,Sink} into PubsubIO This also refines the handling of record ids in the sink to be random-but-reused-on-failure, using the same trick as we do for the BigQuery sink. Still need to re-do the load tests I did a few weeks back with the actual change. Note that last time I tested the DataflowPipelineTranslator does not kick in and replace the new transforms with the correct native transforms. Need to dig deeper. R: @dhalperi You can merge this pull request into a Git repository by running: $ git pull https://github.com/mshields822/incubator-beam pubsub-runner Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/346.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #346 commit ce655a6a7480147f7527fa23818e1d546abaa599 Author: Mark ShieldsDate: 2016-04-12T00:36:27Z Make java unbounded pub/sub source the default. commit aafcf5f9d0286ec5a6ed0d634df8ff0902897cdc Author: Mark Shields Date: 2016-05-17T23:44:30Z Refine record id calculation. Prepare for supporting unit tests with re-using record ids. > PubSubIO: reimplement in Java > - > > Key: BEAM-53 > URL: https://issues.apache.org/jira/browse/BEAM-53 > Project: Beam > Issue Type: New Feature > Components: runner-core >Reporter: Daniel Halperin >Assignee: Mark Shields > > PubSubIO is currently only partially implemented in Java: the > DirectPipelineRunner uses a non-scalable API in a single-threaded manner. > In contrast, the DataflowPipelineRunner uses an entirely different code path > implemented in the Google Cloud Dataflow service. > We need to reimplement PubSubIO in Java in order to support other runners in > a scalable way. > Additionally, we can take this opportunity to add new features: > * getting timestamp from an arbitrary lambda in arbitrary formats rather than > from a message attribute in only 2 formats. > * exposing metadata and attributes in the elements produced by PubSubIO.Read > * setting metadata and attributes in the messages written by PubSubIO.Write -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-117) Implement the API for Static Display Metadata
[ https://issues.apache.org/jira/browse/BEAM-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285185#comment-15285185 ] ASF GitHub Bot commented on BEAM-117: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/338 > Implement the API for Static Display Metadata > - > > Key: BEAM-117 > URL: https://issues.apache.org/jira/browse/BEAM-117 > Project: Beam > Issue Type: New Feature > Components: sdk-java-core >Reporter: Ben Chambers >Assignee: Scott Wegner > > As described in the following doc, we would like the SDK to allow associating > display metadata with PTransforms. > https://docs.google.com/document/d/11enEB9JwVp6vO0uOYYTMYTGkr3TdNfELwWqoiUg5ZxM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-289) Examples Use TypeDescriptors
[ https://issues.apache.org/jira/browse/BEAM-289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285528#comment-15285528 ] ASF GitHub Bot commented on BEAM-289: - GitHub user eljefe6a opened a pull request: https://github.com/apache/incubator-beam/pull/342 [BEAM-289] Examples Use TypeDescriptors Jira issue BEAM-289 Examples Use TypeDescriptors. Changed example code to use `TypeDescriptors`. These could have used static imports to make the lines shorter, but we opted for the more understandable syntax. You can merge this pull request into a Git repository by running: $ git pull https://github.com/eljefe6a/incubator-beam TypeDescriptorsExamples Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/342.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #342 commit 72ec9b82fb21ee9040b1bee6615fecfb9916e470 Author: Jesse AndersonDate: 2016-05-02T18:39:26Z Make Regex Transform commit e6f8c958a2bb88d8b2582cb9d4391922c15b7141 Author: Jesse Anderson Date: 2016-05-02T22:34:08Z Merge remote-tracking branch 'upstream/master' commit 587eaaec106829002df5df1b38753f811649aa51 Author: Jesse Anderson Date: 2016-05-03T01:08:13Z Fixing checkstyle issues. Added missing Apache license. commit df3045f62c939ef3a777ffbf658088f193144983 Author: Jesse Anderson Date: 2016-05-05T15:46:14Z Added distributed replacement functions. Add replaceAll and replaceFirst. Fixed some JavaDocs. commit 793d22667f485a5cdd49a7d36553c96e6898391c Author: Jesse Anderson Date: 2016-05-05T15:55:58Z Whitespace fixes for check style. commit 9e5a9971131721c988242400643712f5c9671b9e Author: Jesse Anderson Date: 2016-05-09T15:47:56Z Merge remote-tracking branch 'upstream/master' commit 425a4d89692f33f99a68aed511270de0ff9db4ac Author: Jesse Anderson Date: 2016-05-11T19:51:57Z Merge remote-tracking branch 'upstream/master' commit 225b2d0ac2d8808da7756f915cbeb7684e4951a4 Author: Jesse Anderson Date: 2016-05-14T01:20:12Z Merge remote-tracking branch 'upstream/master' commit c1a3bc55b47e5dbd11c2bcb360a2a7f2c4aacfb9 Author: Jesse Anderson Date: 2016-05-16T20:58:08Z Changed Word Counts to use TypeDescriptors. commit adfeb01bafb1c17c6c65b30c45ea3e42473a80e6 Author: Jesse Anderson Date: 2016-05-16T21:09:18Z Updated complete examples to use TypeDescriptors. commit 2a455f3ae029e30644cc7b6c96105f761f903e74 Author: Jesse Anderson Date: 2016-05-16T22:11:57Z Removing Regex transforms from this branch. > Examples Use TypeDescriptors > > > Key: BEAM-289 > URL: https://issues.apache.org/jira/browse/BEAM-289 > Project: Beam > Issue Type: Improvement > Components: examples-java >Reporter: Jesse Anderson >Assignee: Frances Perry > > Change the Java and Java 8 examples to use TypeDescriptors instead of inline > TypeDescriptor creation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-295) Flink Create Functions call Collector.close()
[ https://issues.apache.org/jira/browse/BEAM-295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15289185#comment-15289185 ] ASF GitHub Bot commented on BEAM-295: - GitHub user aljoscha opened a pull request: https://github.com/apache/incubator-beam/pull/347 [BEAM-295] Remove erroneous close() calls in Flink Create Sources Collector.close() should only be called by internal Flink components, not by user functions. You can merge this pull request into a Git repository by running: $ git pull https://github.com/aljoscha/incubator-beam remove-collector-close Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/347.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #347 commit bd658bfb3d36e047eacecc91146b051b91eebf1b Author: Aljoscha KrettekDate: 2016-05-18T15:46:34Z [BEAM-295] Remove erroneous close() calls in Flink Create Sources Collector.close() should only be called by internal Flink components, not by user functions. > Flink Create Functions call Collector.close() > - > > Key: BEAM-295 > URL: https://issues.apache.org/jira/browse/BEAM-295 > Project: Beam > Issue Type: Bug > Components: runner-flink >Reporter: Aljoscha Krettek >Assignee: Aljoscha Krettek > > {{Collector.close()}} should only be called internally, by Flink. Calling > close() in the user function, as we do in {{FlinkCreateFunction}} and > {{FlinkStreamingCreateFunction}} will lead to downstream operations being > closed twice, which can lead to faulty behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-286) Reorganize flink runner directories
[ https://issues.apache.org/jira/browse/BEAM-286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15289295#comment-15289295 ] ASF GitHub Bot commented on BEAM-286: - GitHub user jbonofre opened a pull request: https://github.com/apache/incubator-beam/pull/348 [BEAM-286] Reorganize flink runner module to follow other runners str… Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [X] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [X] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [X] Replace `` in the title with the actual Jira issue number, if there is one. - [X] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- Reorganize flink runner module to follow other runners structure. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jbonofre/incubator-beam BEAM-286-REORG-FLINK Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/348.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #348 commit 5200ebf3586a0424093f160c287394c85f466d73 Author: Jean-Baptiste OnofréDate: 2016-05-18T16:46:34Z [BEAM-286] Reorganize flink runner module to follow other runners structure > Reorganize flink runner directories > --- > > Key: BEAM-286 > URL: https://issues.apache.org/jira/browse/BEAM-286 > Project: Beam > Issue Type: Task > Components: runner-flink >Reporter: Jean-Baptiste Onofré >Assignee: Jean-Baptiste Onofré > Fix For: 0.1.0-incubating > > > The flink runner Maven module uses two sub-modules: runner and examples. It's > the only one which use such layout (compare to spark, dataflow or > inprocess/direct runners). > I will propose a PR to align flink runner module with the other, keeping the > examples in a sub-directory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-228) Create a merge bot for Beam
[ https://issues.apache.org/jira/browse/BEAM-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15289345#comment-15289345 ] ASF GitHub Bot commented on BEAM-228: - Github user jasonkuster closed the pull request at: https://github.com/apache/incubator-beam/pull/225 > Create a merge bot for Beam > --- > > Key: BEAM-228 > URL: https://issues.apache.org/jira/browse/BEAM-228 > Project: Beam > Issue Type: New Feature > Components: project-management >Reporter: Jason Kuster >Assignee: Jason Kuster > > This issue tracks the creation of a merge bot for Beam. This merge bot should > watch the Beam github repository and queue and merge pull requests which are > marked LGTM and good for merge by an approved Beam committer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-79) Gearpump runner
[ https://issues.apache.org/jira/browse/BEAM-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15279595#comment-15279595 ] ASF GitHub Bot commented on BEAM-79: GitHub user manuzhang opened a pull request: https://github.com/apache/incubator-beam/pull/323 [BEAM-79] add Gearpump runner Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- This PR adds Gearpump runner to Beam meeting the goals of phase 1 in the [design document](https://docs.google.com/document/d/1nw64QUWVfT8L7FUprPGLEeNjSBpDMkn1otfLt2rHM5g/edit). The Gearpump runner supports the following functionalities, * Transformations: ParDo, GroupByKey, Flatten * Windows: using Beam's window logic and code * side outputs * serialization/deserialization: using Gearpump's Kryo serializer * sources: Beam's UnboundedSource * message delivery guarantee: at-most-once * tests: integration test for various translators Here's a snapshot of running the following Beam example on Gearpump cluster ```java PCollection> wordCounts = p.apply(Read.from(new UnboundedTextSource()).named("WordStream")) .apply(ParDo.of(new ExtractWordsFn())) .apply(Window.into(FixedWindows.of(Duration.standardSeconds(10 .apply(Count.perElement()); wordCounts.apply(ParDo.of(new FormatAsStringFn())); ``` ![snip20160511_4](https://cloud.githubusercontent.com/assets/1191767/15171197/fd6ffba0-177e-11e6-99a1-30c7c2597244.png) Note that the Gearpump runner is still in early stage and lacking capabilities like trigger, side inputs, aggregator. However, I'd like to have the community to get a feel of what Gearpump is like, whether Beam and Gearpump go well, and gather ideas for improvements. You can merge this pull request into a Git repository by running: $ git pull https://github.com/manuzhang/incubator-beam gearpump_runner Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/323.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #323 commit 73e5978f599bcf32ed8c2f1d54b6bd3bd8350092 Author: manuzhang Date: 2016-03-15T08:15:16Z [BEAM-79] add Gearpump runner > Gearpump runner > --- > > Key: BEAM-79 > URL: https://issues.apache.org/jira/browse/BEAM-79 > Project: Beam > Issue Type: New Feature > Components: runner-ideas >Reporter: Tyler Akidau >Assignee: Manu Zhang > > Intel is submitting Gearpump (http://www.gearpump.io) to ASF > (https://wiki.apache.org/incubator/GearpumpProposal). Appears to be a mix of > low-level primitives a la MillWheel, with some higher level primitives like > non-merging windowing mixed in. Seems like it would make a nice Beam runner. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280718#comment-15280718 ] ASF GitHub Bot commented on BEAM-22: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/322 > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-77) Reorganize Directory structure
[ https://issues.apache.org/jira/browse/BEAM-77?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264846#comment-15264846 ] ASF GitHub Bot commented on BEAM-77: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/256 > Reorganize Directory structure > -- > > Key: BEAM-77 > URL: https://issues.apache.org/jira/browse/BEAM-77 > Project: Beam > Issue Type: Task > Components: project-management >Reporter: Frances Perry >Assignee: Jean-Baptiste Onofré > > Now that we've done the initial Dataflow code drop, we will restructure > directories to provide space for additional SDKs and Runners. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-255) Understand and improve performance of Write/FileBasedSink
[ https://issues.apache.org/jira/browse/BEAM-255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15269481#comment-15269481 ] ASF GitHub Bot commented on BEAM-255: - GitHub user dhalperi opened a pull request: https://github.com/apache/incubator-beam/pull/279 [BEAM-255] Write: add limited logging This will help, for all sinks, users and developers gain insight into where time is spent. (Enabling DEBUG level will provide more insight.) R: @lukecwik You can merge this pull request into a Git repository by running: $ git pull https://github.com/dhalperi/incubator-beam write-logging Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/279.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #279 commit d892f259cb9dd33b762aca416d9b616423c5fbae Author: Dan HalperinDate: 2016-05-03T20:07:04Z [BEAM-255] Write: add limited logging This will help, for all sinks, users and developers gain insight into where time is spent. (Enabling DEBUG level will provide more insight.) > Understand and improve performance of Write/FileBasedSink > - > > Key: BEAM-255 > URL: https://issues.apache.org/jira/browse/BEAM-255 > Project: Beam > Issue Type: Improvement > Components: sdk-java-gcp >Reporter: Daniel Halperin >Assignee: Daniel Halperin > > The work by [~lcwik] to move TextIO from a Dataflow-specific implementation > to a FileBasedSink-based implementation may have caused performance > regressions -- which really means it has exposed opportunity for improvement > in the Beam combination of Write/FileBasedSink. > This is a general tracking bug for SDK-side improvements (likely GCP-related). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-168) IntervalBoundedExponentialBackOff change broke Beam-on-Dataflow
[ https://issues.apache.org/jira/browse/BEAM-168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15269369#comment-15269369 ] ASF GitHub Bot commented on BEAM-168: - GitHub user dhalperi opened a pull request: https://github.com/apache/incubator-beam/pull/278 [BEAM-168] IntervalBEB: remove deprecated function The pre-commit wordcount test will confirm that this does not break the Cloud Dataflow worker. This resolves BEAM-168. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dhalperi/incubator-beam remove-deprecated Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/278.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #278 commit fa2b18a92f98950ec401863610c2e126e1c62e5c Author: Dan HalperinDate: 2016-05-03T19:04:13Z [BEAM-168] IntervalBEB: remove deprecated function The pre-commit wordcount test will confirm that this does not break the Cloud Dataflow worker. > IntervalBoundedExponentialBackOff change broke Beam-on-Dataflow > --- > > Key: BEAM-168 > URL: https://issues.apache.org/jira/browse/BEAM-168 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Reporter: Daniel Halperin >Assignee: Daniel Halperin > > Changing the `int` to a `long` breaks ABI compatibility, which Dataflow > service uses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-225) Create Class for Common TypeDescriptors
[ https://issues.apache.org/jira/browse/BEAM-225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15269249#comment-15269249 ] ASF GitHub Bot commented on BEAM-225: - GitHub user eljefe6a opened a pull request: https://github.com/apache/incubator-beam/pull/275 [BEAM-225] Create Class for Common TypeDescriptors Jira issue BEAM-225 Create Class for Common TypeDescriptors. Adding KVTypeDescriptors and TypeDescriptors to make primitives easier. You can merge this pull request into a Git repository by running: $ git pull https://github.com/eljefe6a/incubator-beam TypeDescriptors Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/275.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #275 commit 8e89c7a9e390ec8026ee46d68d3a422f8917ef71 Author: Jesse AndersonDate: 2016-05-03T18:09:08Z Adding KVTypeDescriptors and TypeDescriptors to make primitives easier. > Create Class for Common TypeDescriptors > --- > > Key: BEAM-225 > URL: https://issues.apache.org/jira/browse/BEAM-225 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Jesse Anderson >Priority: Trivial > Labels: starter > > There should be a built-in class for common types like String, Float, etc. > Right now, all types have to create an inline TypeDescriptor: > {code:java} > PCollection words = suits.apply( > FlatMapElements.via( > (String line) -> Arrays.asList(line.split(" ")) > ).withOutputType(new TypeDescriptor() {})); > {code} > The should be a built-in class with common types like String so you don't > have to create a TypeDescriptor each time like: > {code:java} > PCollection words = suits.apply( > FlatMapElements.via( > (String line) -> Arrays.asList(line.split(" ")) > ).withOutputType(TypeDescriptors.STRINGS)); > {code} > Another possibility is to make it a static method: > {code:java} > PCollection words = suits.apply( > FlatMapElements.via( > (String line) -> Arrays.asList(line.split(" ")) > ).withOutputType(TypeDescriptors.strings())); > {code} > An example of this is Apache Crunch's Writables class > https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/types/writable/Writables.html. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15267997#comment-15267997 ] ASF GitHub Bot commented on BEAM-22: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/265 > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-154) Provide Maven BOM
[ https://issues.apache.org/jira/browse/BEAM-154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15269550#comment-15269550 ] ASF GitHub Bot commented on BEAM-154: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/267 > Provide Maven BOM > - > > Key: BEAM-154 > URL: https://issues.apache.org/jira/browse/BEAM-154 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Jean-Baptiste Onofré >Assignee: Jean-Baptiste Onofré > > When using the Java SDK (for instance to develop IO), the developer has to > add dependencies in his pom.xml (like junit, hamcrest, slf4j, ...). > To simplify the way to define the dependencies, each Beam SDK could provide a > Maven BoM (Bill of Material) describing these dependencies. Then the > developer could simply define this BoM as pom.xml dependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-168) IntervalBoundedExponentialBackOff change broke Beam-on-Dataflow
[ https://issues.apache.org/jira/browse/BEAM-168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15269558#comment-15269558 ] ASF GitHub Bot commented on BEAM-168: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/278 > IntervalBoundedExponentialBackOff change broke Beam-on-Dataflow > --- > > Key: BEAM-168 > URL: https://issues.apache.org/jira/browse/BEAM-168 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Reporter: Daniel Halperin >Assignee: Daniel Halperin > > Changing the `int` to a `long` breaks ABI compatibility, which Dataflow > service uses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-248) Register DisplayData from anonymous implementation PTransforms
[ https://issues.apache.org/jira/browse/BEAM-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15269544#comment-15269544 ] ASF GitHub Bot commented on BEAM-248: - GitHub user swegner opened a pull request: https://github.com/apache/incubator-beam/pull/280 [BEAM-248] Add display data to additional PTransforms Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- You can merge this pull request into a Git repository by running: $ git pull https://github.com/swegner/incubator-beam displaydata-leaves Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/280.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #280 commit 676f6fe8715444a8bf1c614de7322557a6f9de72 Author: Scott WegnerDate: 2016-05-03T16:28:21Z Test utility for display data in a pipeline runner DisplayDataEvaluator is useful for validating how PTransform display data is surfaced in the context of a Pipeline and runner. commit f5e167bbea1ffced370dffab8af2697fad157728 Author: Scott Wegner Date: 2016-05-03T16:33:55Z Fix Combine transform primitive display data commit 978068c2fb754986cb86f412c1e85eaf5ee30f7b Author: Scott Wegner Date: 2016-05-03T18:16:51Z Add display data for MapElements transform > Register DisplayData from anonymous implementation PTransforms > -- > > Key: BEAM-248 > URL: https://issues.apache.org/jira/browse/BEAM-248 > Project: Beam > Issue Type: Sub-task > Components: sdk-java-core >Reporter: Scott Wegner >Assignee: Scott Wegner > > Most SDK PTransforms are implemented in terms of lower-level PTransforms, > often with anonymous user-fn implementations at the leaf-level. Currently > display data is only being registered on the composite node and not within > the anonymous implementation. As a result, the details are lost. > We should register display data both in the composite and internal leaf > nodes, particularly when the implementation is anonymous. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-255) Understand and improve performance of Write/FileBasedSink
[ https://issues.apache.org/jira/browse/BEAM-255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15269549#comment-15269549 ] ASF GitHub Bot commented on BEAM-255: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/279 > Understand and improve performance of Write/FileBasedSink > - > > Key: BEAM-255 > URL: https://issues.apache.org/jira/browse/BEAM-255 > Project: Beam > Issue Type: Improvement > Components: sdk-java-gcp >Reporter: Daniel Halperin >Assignee: Daniel Halperin > > The work by [~lcwik] to move TextIO from a Dataflow-specific implementation > to a FileBasedSink-based implementation may have caused performance > regressions for Dataflow -- which really means it has exposed an opportunity > for improvement in the Beam combination of Write/FileBasedSink. > This is a general tracking bug for SDK-side improvements (likely GCP-related). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15269636#comment-15269636 ] ASF GitHub Bot commented on BEAM-22: GitHub user tgroh opened a pull request: https://github.com/apache/incubator-beam/pull/282 [BEAM-22] Refactor CompletionCallbacks Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- The default and timerful completion callbacks are identical, excepting their calls to evaluationContext.commitResult; factor that code into a common location. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tgroh/incubator-beam completion_callback_refactor Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/282.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #282 > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-256) Add lifecycle even verifiers for Beam pipelines.
[ https://issues.apache.org/jira/browse/BEAM-256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15270093#comment-15270093 ] ASF GitHub Bot commented on BEAM-256: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/273 > Add lifecycle even verifiers for Beam pipelines. > > > Key: BEAM-256 > URL: https://issues.apache.org/jira/browse/BEAM-256 > Project: Beam > Issue Type: New Feature > Components: testing >Reporter: Jason Kuster >Assignee: Davor Bonaci > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-259) Execute selected RunnableOnService tests with Spark runner
[ https://issues.apache.org/jira/browse/BEAM-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273223#comment-15273223 ] ASF GitHub Bot commented on BEAM-259: - GitHub user kennknowles opened a pull request: https://github.com/apache/incubator-beam/pull/294 [BEAM-259] Configure RunnableOnService tests for Spark runner, batch mode Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [x] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [x] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [x] Replace `` in the title with the actual Jira issue number, if there is one. - [x] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- This PR demonstrates how to configure the integration tests. It has two categories of issue: 1. Transforms that are not supported. For these we can add surefire exclusions for now. 2. Runtime errors having to do with configuring Spark. I'm hoping someone with expertise in the runner can take quickly recommend the course of action. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kennknowles/incubator-beam spark-integration Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/294.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #294 commit d8a2a34f723ff4ca7fe841c8056706c19d37770d Author: Kenneth KnowlesDate: 2016-05-05T22:11:07Z Configure RunnableOnService tests for Spark runner, batch mode > Execute selected RunnableOnService tests with Spark runner > -- > > Key: BEAM-259 > URL: https://issues.apache.org/jira/browse/BEAM-259 > Project: Beam > Issue Type: Test >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-258) Execute selected RunnableOnService tests with Flink runner
[ https://issues.apache.org/jira/browse/BEAM-258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272882#comment-15272882 ] ASF GitHub Bot commented on BEAM-258: - GitHub user kennknowles opened a pull request: https://github.com/apache/incubator-beam/pull/291 [BEAM-258] Configure RunnableOnService tests for Flink runner Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [x] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [x] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [x] Replace `` in the title with the actual Jira issue number, if there is one. - [x] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- This is a sample configuration for now. There are these kind of failures in the tests right now: 1. Since the batch runner only supports global windows, I've filtered those tests out. I added the `UnsupportedOperationException` to the windowing translator so I could distinguish them. 2. Those tests that have simple & supported pipelines succeed at building the pipeline, but somehow the graph is empty - I have checked and it seemed like translators are not even being invoked. This is beyond my current scope of digging in. 3. Some other tests fail with other misc errors. Perhaps they use state, which results in `NullPointerException`. 4. Pretty much all of the tests require side inputs since that is how `PAssert` works, so they cannot work in streaming mode. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kennknowles/incubator-beam flink-integration Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/291.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #291 commit 52198eb7e9c2df627871e7f96404d188dc4a4ee7 Author: Kenneth KnowlesDate: 2016-05-02T21:29:30Z Add Window.Bound translator to Flink batch This adds a Window.Bound translator that allows only GlobalWindows. It is a temporary measure, but one that brings the Flink batch translator in line with the Beam model - instead of "ignoring" windows, the GBK is a perfectly valid GBK for GlobalWindows. Previously, the SDK's runner test suite would fail due to the lack of a translator - now some of them will fail due to windowing support, but others have a chance. commit 095f9840dd0d8d78f041e494613375664f7d3eaa Author: Kenneth Knowles Date: 2016-05-02T20:11:12Z Add TestFlinkPipelineRunner to FlinkRunnerRegistrar This makes the runner available for selection by integration tests. commit 750a49d286f4d11d6ad63460d8b244a5ebde975e Author: Kenneth Knowles Date: 2016-05-02T21:04:20Z Configure RunnableOnService tests for Flink in batch mode Today Flink batch supports only global windows. This is a situation we intend our build to allow, eventually via JUnit category filtering. For now all the test classes that use non-global windows are excluded entirely via maven configuration. In the future, it should be on a per-test-method basis. > Execute selected RunnableOnService tests with Flink runner > -- > > Key: BEAM-258 > URL: https://issues.apache.org/jira/browse/BEAM-258 > Project: Beam > Issue Type: Test > Components: runner-flink >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-13) Create JMS IO
[ https://issues.apache.org/jira/browse/BEAM-13?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274324#comment-15274324 ] ASF GitHub Bot commented on BEAM-13: GitHub user jbonofre opened a pull request: https://github.com/apache/incubator-beam/pull/299 [BEAM-13] Add JmsIO Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [X] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [X] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [X] Replace `` in the title with the actual Jira issue number, if there is one. - [X] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- JmsIO (unbounded source and sink). You can merge this pull request into a Git repository by running: $ git pull https://github.com/jbonofre/incubator-beam BEAM-13-JMSIO Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/299.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #299 commit 02b5b8327ee6ebe963e6d613fb5cfec6156762d3 Author: Jean-Baptiste OnofréDate: 2016-05-05T17:14:37Z [BEAM-13] Add JmsIO > Create JMS IO > - > > Key: BEAM-13 > URL: https://issues.apache.org/jira/browse/BEAM-13 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Reporter: Jean-Baptiste Onofré >Assignee: Jean-Baptiste Onofré > > Work in progress: https://github.com/jbonofre/DataflowJavaSDK/tree/IO-JMS -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-240) Add display data link URLs for sources / sinks
[ https://issues.apache.org/jira/browse/BEAM-240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274423#comment-15274423 ] ASF GitHub Bot commented on BEAM-240: - GitHub user swegner opened a pull request: https://github.com/apache/incubator-beam/pull/300 [BEAM-240] Display data links for input and output files Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- With display data, SDK authors have the ability to annotate display items with a link URLs. This adds browse URLs for GCS and local files, and attach them to well-known source/sink display data. You can merge this pull request into a Git repository by running: $ git pull https://github.com/swegner/incubator-beam displaydata-urls Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/300.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #300 commit c95a61a0df01f93ebeeb897249535f160afdf9eb Author: Scott WegnerDate: 2016-05-03T23:18:27Z Add browse URL to GcsPath commit 6ff5bae0e5bc1cdc2d10ef9e65b44039c8363407 Author: Scott Wegner Date: 2016-05-03T23:32:21Z Add browse URL to File IO commit ca40e25c87ddde4693b17169e40c00c0b62be170 Author: Scott Wegner Date: 2016-05-04T16:38:18Z Add DisplayDataMatcher for linkUrl commit 28739a73ec2510fc9b78fc878346f121e5bd7636 Author: Scott Wegner Date: 2016-05-04T16:39:04Z Add linkUrl to AvroIO DisplayData commit fefc3ef5b2a3b7c0178a216c7fdaaa4885893144 Author: Scott Wegner Date: 2016-05-04T16:51:20Z Add linkUrl to FileBasedSource DisplayData commit c28c4470c93b77096ea182ddc29813f7fe69db03 Author: Scott Wegner Date: 2016-05-04T17:01:39Z Add linkUrl to FileBasedSink DisplayData commit f369a3324a260346bd45c786bccfa8a2a1c15e64 Author: Scott Wegner Date: 2016-05-04T17:19:15Z Add linkUrl to TextIO DisplayData > Add display data link URLs for sources / sinks > -- > > Key: BEAM-240 > URL: https://issues.apache.org/jira/browse/BEAM-240 > Project: Beam > Issue Type: Sub-task > Components: sdk-java-core >Reporter: Scott Wegner >Assignee: Scott Wegner > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-267) Enable Chekstyle check in Spark runner
[ https://issues.apache.org/jira/browse/BEAM-267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274252#comment-15274252 ] ASF GitHub Bot commented on BEAM-267: - GitHub user jbonofre opened a pull request: https://github.com/apache/incubator-beam/pull/298 [BEAM-267] Enable checkstyle in Spark runner Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [X] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [X] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [X] Replace `` in the title with the actual Jira issue number, if there is one. - [X] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- Enable checkstyle in Spark runner and fix checkstyle errors. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jbonofre/incubator-beam BEAM-267-CHECKSTYLE-SPARK Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/298.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #298 commit 2fb4f96d5d24ac0aeb89e37e6bc55234ce735751 Author: Jean-Baptiste OnofréDate: 2016-05-06T15:47:46Z [BEAM-267] Enable checkstyle in Spark runner > Enable Chekstyle check in Spark runner > -- > > Key: BEAM-267 > URL: https://issues.apache.org/jira/browse/BEAM-267 > Project: Beam > Issue Type: Improvement > Components: runner-spark >Reporter: Jean-Baptiste Onofré >Assignee: Jean-Baptiste Onofré >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-201) Material page
[ https://issues.apache.org/jira/browse/BEAM-201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15267027#comment-15267027 ] ASF GitHub Bot commented on BEAM-201: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam-site/pull/13 > Material page > - > > Key: BEAM-201 > URL: https://issues.apache.org/jira/browse/BEAM-201 > Project: Beam > Issue Type: Improvement > Components: website > Environment: Create a website page with logo and project material > content >Reporter: James Malone >Assignee: James Malone > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15267190#comment-15267190 ] ASF GitHub Bot commented on BEAM-22: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/264 > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-252) Make Regex Transform
[ https://issues.apache.org/jira/browse/BEAM-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15267234#comment-15267234 ] ASF GitHub Bot commented on BEAM-252: - GitHub user eljefe6a opened a pull request: https://github.com/apache/incubator-beam/pull/269 [BEAM-252] Make Regex Transform Jira issue BEAM-252 Make Regex Transform. Adding a pre-built transform for Regex. You can merge this pull request into a Git repository by running: $ git pull https://github.com/eljefe6a/incubator-beam master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/269.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #269 commit 72ec9b82fb21ee9040b1bee6615fecfb9916e470 Author: Jesse AndersonDate: 2016-05-02T18:39:26Z Make Regex Transform > Make Regex Transform > > > Key: BEAM-252 > URL: https://issues.apache.org/jira/browse/BEAM-252 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Jesse Anderson >Assignee: Davor Bonaci > > There needs to be an easier way to run Regular Expressions as part of a > transform. This will make string-based ETL much easier. > The transform should support using the matches and find methods. The > transform should allow you to choose a group in the regex to output. The > transform should allow single strings to be output or KV's of strings. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-231) Remove ClassForDisplay infrastructure class.
[ https://issues.apache.org/jira/browse/BEAM-231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15267431#comment-15267431 ] ASF GitHub Bot commented on BEAM-231: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/259 > Remove ClassForDisplay infrastructure class. > > > Key: BEAM-231 > URL: https://issues.apache.org/jira/browse/BEAM-231 > Project: Beam > Issue Type: Sub-task > Components: sdk-java-core >Reporter: Scott Wegner > > See discussion here: > https://github.com/apache/incubator-beam/pull/247#discussion-diff-61184975 > This class should no longer be needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-77) Reorganize Directory structure
[ https://issues.apache.org/jira/browse/BEAM-77?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271033#comment-15271033 ] ASF GitHub Bot commented on BEAM-77: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/281 > Reorganize Directory structure > -- > > Key: BEAM-77 > URL: https://issues.apache.org/jira/browse/BEAM-77 > Project: Beam > Issue Type: Task > Components: project-management >Reporter: Frances Perry >Assignee: Jean-Baptiste Onofré > > Now that we've done the initial Dataflow code drop, we will restructure > directories to provide space for additional SDKs and Runners. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271078#comment-15271078 ] ASF GitHub Bot commented on BEAM-22: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/258 > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271716#comment-15271716 ] ASF GitHub Bot commented on BEAM-22: GitHub user tgroh opened a pull request: https://github.com/apache/incubator-beam/pull/289 [BEAM-22] Mark CheckpointMark as volatile in UnboundedReadEvaluator Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- The evaluator may be reused in a different thread, and updates to the checkpoint must be visible. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tgroh/incubator-beam unbounded_evaluator_close_before_requeue Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/289.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #289 commit e6ba15d451c9e3574a29ac0d51e6275893c9ee60 Author: Thomas GrohDate: 2016-05-05T00:44:59Z Mark CheckpointMark as volatile in UnboundedReadEvaluator The evaluator may be reused in a different thread, and updates to the checkpoint must be visible. > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272951#comment-15272951 ] ASF GitHub Bot commented on BEAM-22: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/282 > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-115) Beam Runner API
[ https://issues.apache.org/jira/browse/BEAM-115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272941#comment-15272941 ] ASF GitHub Bot commented on BEAM-115: - Github user kennknowles closed the pull request at: https://github.com/apache/incubator-beam/pull/277 > Beam Runner API > --- > > Key: BEAM-115 > URL: https://issues.apache.org/jira/browse/BEAM-115 > Project: Beam > Issue Type: Improvement > Components: runner-core >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > > The PipelineRunner API from the SDK is not ideal for the Beam technical > vision. > It has technical limitations: > - The user's DAG (even including library expansions) is never explicitly > represented, so it cannot be analyzed except incrementally, and cannot > necessarily be reconstructed (for example, to display it!). > - The flattened DAG of just primitive transforms isn't well-suited for > display or transform override. > - The TransformHierarchy isn't well-suited for optimizations. > - The user must realistically pre-commit to a runner, and its configuration > (batch vs streaming) prior to graph construction, since the runner will be > modifying the graph as it is built. > - It is fairly language- and SDK-specific. > It has usability issues (these are not from intuition, but derived from > actual cases of failure to use according to the design) > - The interleaving of apply() methods in PTransform/Pipeline/PipelineRunner > is confusing. > - The TransformHierarchy, accessible only via visitor traversals, is > cumbersome. > - The staging of construction-time vs run-time is not always obvious. > These are just examples. This ticket tracks designing, coming to consensus, > and building an API that more simply and directly supports the technical > vision. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15278513#comment-15278513 ] ASF GitHub Bot commented on BEAM-22: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/309 > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-103) Make UnboundedSourceWrapper parallel
[ https://issues.apache.org/jira/browse/BEAM-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15268736#comment-15268736 ] ASF GitHub Bot commented on BEAM-103: - GitHub user aljoscha opened a pull request: https://github.com/apache/incubator-beam/pull/274 [BEAM-103][BEAM-130] Make Flink Source Parallel and Checkpointed Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [x ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [x ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ x] Replace `` in the title with the actual Jira issue number, if there is one --- CC: @mxm for the Flink parts CC: @tgroh for the tests, I thought you could have something to say about that since you started the doc about better testing I created a new `TestCountingSource` in the `runner-flink` package because I didn't want to depend on the data flow runner. Maybe we should move this to a common package. You can merge this pull request into a Git repository by running: $ git pull https://github.com/aljoscha/incubator-beam flink-parallel-unbounded-source Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/274.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #274 commit 698f50bd7759e3a036e35ca214b1d727f9db4e51 Author: Aljoscha KrettekDate: 2016-05-03T11:35:35Z [BEAM-103][BEAM-130] Make Flink Source Parallel and Checkpointed > Make UnboundedSourceWrapper parallel > > > Key: BEAM-103 > URL: https://issues.apache.org/jira/browse/BEAM-103 > Project: Beam > Issue Type: Improvement > Components: runner-flink >Reporter: Maximilian Michels >Assignee: Aljoscha Krettek > > As of now {{UnboundedSource}} s are executed with a parallelism of 1 > regardless of the splits which the source returns. The corresponding > {{UnboundedSourceWrapper}} should implement {{RichParallelSourceFunction}} > and deal with splits correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-145) OutputTimeFn#assignOutputTime overrides WindowFn#getOutputTime in unfortunate ways
[ https://issues.apache.org/jira/browse/BEAM-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276642#comment-15276642 ] ASF GitHub Bot commented on BEAM-145: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/296 > OutputTimeFn#assignOutputTime overrides WindowFn#getOutputTime in unfortunate > ways > -- > > Key: BEAM-145 > URL: https://issues.apache.org/jira/browse/BEAM-145 > Project: Beam > Issue Type: Bug > Components: runner-core >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles >Priority: Minor > Labels: windowing > > Today the {{OutputTimeFn}} includes {{#assignOutputTime}}, {{#combine}}, and > {{#merge}}. Together these express the grouping of timestamps, analogous to > the grouping of values in a GBK / Combine, in a canonical way. > The default {{OutputTimeFn}} is provided by the {{WindowFn}}. In particular, > {{SlidingWindows}} provides an {{OutputTimeFn}} that shifts input timestamps > later to avoid watermark stuckness and then takes the minimum to compute the > output timestamp. > The SDK additionally provides instance for "min", "max" and "end of window" > output timestamps. > Unfortunately, if one overrides the {{OutputTimeFn}} to one of these, the > shifting done by {{SlidingWindows}} is lost. > This is actually only a minor problem for now, since "min" is the default, > "end of window" is unaffected, and "max" has only esoteric uses.The fix is > easy: > This is interrelated with another suggested change: Since there are only > three common {{OutputTimeFn}} instances, and it is a high bandwidth API, it > does not seem worthwhile to leave it in userland. So it is proposed to reduce > it to an enum, which would leave only the {{WindowFn}} as a userland place > for timestamp adjustments. (requiring special casing for end-of-window, since > it cannot be implemented without owning {{#assignOutputTime}}) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-269) Create BigDecimal Coder
[ https://issues.apache.org/jira/browse/BEAM-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276678#comment-15276678 ] ASF GitHub Bot commented on BEAM-269: - GitHub user eljefe6a opened a pull request: https://github.com/apache/incubator-beam/pull/307 [BEAM-269] Create BigDecimal Coder Jira issue BEAM-269 Create BigDecimal Coder. Added coder for BigDecimal. Added unit tests for BigDecimal coder. You can merge this pull request into a Git repository by running: $ git pull https://github.com/eljefe6a/incubator-beam BigDecimalCoder Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/307.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #307 commit 72ec9b82fb21ee9040b1bee6615fecfb9916e470 Author: Jesse AndersonDate: 2016-05-02T18:39:26Z Make Regex Transform commit e6f8c958a2bb88d8b2582cb9d4391922c15b7141 Author: Jesse Anderson Date: 2016-05-02T22:34:08Z Merge remote-tracking branch 'upstream/master' commit 587eaaec106829002df5df1b38753f811649aa51 Author: Jesse Anderson Date: 2016-05-03T01:08:13Z Fixing checkstyle issues. Added missing Apache license. commit df3045f62c939ef3a777ffbf658088f193144983 Author: Jesse Anderson Date: 2016-05-05T15:46:14Z Added distributed replacement functions. Add replaceAll and replaceFirst. Fixed some JavaDocs. commit 793d22667f485a5cdd49a7d36553c96e6898391c Author: Jesse Anderson Date: 2016-05-05T15:55:58Z Whitespace fixes for check style. commit 9e5a9971131721c988242400643712f5c9671b9e Author: Jesse Anderson Date: 2016-05-09T15:47:56Z Merge remote-tracking branch 'upstream/master' commit bdbed177c4fa8afed2a89f9c8c9be87ae93abeff Author: Jesse Anderson Date: 2016-05-09T17:05:15Z Added BigDecimal coder and tests. commit 9615b1bc1eb3095d6469640b9d8e072013cceee5 Author: Jesse Anderson Date: 2016-05-09T17:31:11Z Removing files from another branch. > Create BigDecimal Coder > --- > > Key: BEAM-269 > URL: https://issues.apache.org/jira/browse/BEAM-269 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Jesse Anderson >Assignee: Jesse Anderson > > There isn't a coder for BigDecimal. This class is especially important for > financial companies to represent money. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276802#comment-15276802 ] ASF GitHub Bot commented on BEAM-22: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/283 > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-146) WindowFn.AssingContext leaks implementation details about compressed WindowedValue representation
[ https://issues.apache.org/jira/browse/BEAM-146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276858#comment-15276858 ] ASF GitHub Bot commented on BEAM-146: - GitHub user kennknowles opened a pull request: https://github.com/apache/incubator-beam/pull/308 [BEAM-146] Remove references to multi-window representation from model Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [x] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [x] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [x] Replace `` in the title with the actual Jira issue number, if there is one. - [x] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- Some areas of the Beam model in the SDK allude to the use of a compressed representation of an element along with the set of windows it is assigned to. However, the model itself views elements in different windows as fully independent, so the SDK should not place any obligation on the part of the runner or user to use a particular representation. This change removes those places in the SDK where an element is treated in multiple windows at once. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kennknowles/incubator-beam AssignContext Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/308.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #308 commit 27588f1cccfc4daffc5d594e69cd0f94a225d22a Author: Kenneth KnowlesDate: 2016-05-09T19:17:09Z Remove references to multi-window representation from model Some areas of the Beam model in the SDK allude to the use of a compressed representation of an element along with the set of windows it is assigned to. However, the model itself views elements in different windows as fully independent, so the SDK should not place any obligation on the part of the runner or user to use a particular representation. This change removes those places in the SDK where an element is treated in multiple windows at once. > WindowFn.AssingContext leaks implementation details about compressed > WindowedValue representation > - > > Key: BEAM-146 > URL: https://issues.apache.org/jira/browse/BEAM-146 > Project: Beam > Issue Type: Bug > Components: beam-model >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles >Priority: Minor > > Today, {{WindowFn.AssignContext}} provides simultaneous access to all of the > windows that a value has been placed in. > Providing access to the current window for a value is convenient for, e.g. > converting day windows to hour windows for each hour of the assign day. But > providing access to all the assigned windows allows spooky action across > windows, and is generally not intended to be observable - elements are > semantically considered to be "duplicated" into each of the assigned windows. > This ticket proposes that the {{AssignContext}} should provide only a single > window, and that windows should be "exploded" prior to window re-assignment > so that elements are only observed within one window at a time. This can be > accomplished trivially today via surgical insertion of > {{RequiresWindowAccess}} but the {{AssignContext}} should have its API > adjusted to be explicit about it, too. > This will affect only pipelines for which _all_ of the following hold: > - assigns to sliding windows (or custom {{WindowFn}} that places each > element in multiple windows) > - re-assigns to different windows without a {{GroupByKey}} between. > - the new window assignment actually does depend on the full set of windows > assigned > I hypothesize the number of such pipelines is zero. > I expect to address this during the Beam Runner API design. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276999#comment-15276999 ] ASF GitHub Bot commented on BEAM-22: GitHub user tgroh opened a pull request: https://github.com/apache/incubator-beam/pull/309 [BEAM-22] Use PushbackSideInputDoFnRunner in the InProcessRunner Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- This removes blocking behavior while retrieving Side Inputs in the InProcessRunner. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tgroh/incubator-beam use_pushback_runner Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/309.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #309 commit ffab4b6b22276bf41c9ce4b33d22c44e6d753268 Author: Thomas GrohDate: 2016-04-28T00:27:57Z Use PushbackDoFnRunner in the ParDoInProcessEvaluator This ensures that the evaluator does not block while processing an input bundle. commit b7d09e14ef749fcd8e773c82b3f9e73f1646eff8 Author: Thomas Groh Date: 2016-04-28T17:12:09Z Limit the number of work schedules per MonitorRunnable run This ensures that work readded to the queue will not cause the monitor runnable to run forever before delivering timers > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277124#comment-15277124 ] ASF GitHub Bot commented on BEAM-22: GitHub user tgroh opened a pull request: https://github.com/apache/incubator-beam/pull/312 [BEAM-22] Update Watermarks Outside of handleResult Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- This removes excess mutual exclusion, and reduces the eagerness of updating the value of a watermark. The first commit in this CR reviewed separately in #310 You can merge this pull request into a Git repository by running: $ git pull https://github.com/tgroh/incubator-beam ippr_wm_asynchronous Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/312.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #312 commit 2d94540ccd02b19a73410647e84486c008d9cdc5 Author: Thomas GrohDate: 2016-05-09T21:03:16Z Return null evaluators from Unavailable Reads Null TransformEvaluators for sources represent a source where all splits are currently in use or completed. Update TransformExecutor to handle null evaluators properly. Change TransformExecutor to a Runnable. commit 9ee1502e159cce49a8a4490593d896b34154e006 Author: Thomas Groh Date: 2016-05-03T00:26:47Z Update Watermarks Outside of handleResult Remove excess mutual exclusion > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277134#comment-15277134 ] ASF GitHub Bot commented on BEAM-22: Github user tgroh closed the pull request at: https://github.com/apache/incubator-beam/pull/304 > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277296#comment-15277296 ] ASF GitHub Bot commented on BEAM-22: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/289 > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15241712#comment-15241712 ] ASF GitHub Bot commented on BEAM-22: GitHub user tgroh opened a pull request: https://github.com/apache/incubator-beam/pull/178 [BEAM-22] Switch the Default PipelineRunner Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- Use the InProcessPiplineRunner (pending rename) as the default runner. The InProcessPipelineRunner implements the beam model, including support for Unbounded PCollections. Tests will fail until #167 and #177 are merged, as this change modifies the graph constructed by the Pipeline if the runner is not specified. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tgroh/incubator-beam ippr_switch_default Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/178.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #178 commit 55ac23eb9bc184c5eda7c69ba5b70e436b665e09 Author: Thomas GrohDate: 2016-04-08T17:20:56Z Switch the Default PipelineRunner Use the InProcessPiplineRunner (pending rename) as the default runner. The InProcessPipelineRunner implements the beam model, including support for Unbounded PCollections. > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-189) The Spark runner uses valueInEmptyWindow which causes values to be dropped
[ https://issues.apache.org/jira/browse/BEAM-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15241913#comment-15241913 ] ASF GitHub Bot commented on BEAM-189: - GitHub user amitsela opened a pull request: https://github.com/apache/incubator-beam/pull/179 [BEAM-189] The Spark runner uses valueInEmptyWindow which causes values to be dropped Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- You can merge this pull request into a Git repository by running: $ git pull https://github.com/amitsela/incubator-beam BEAM-189 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/179.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #179 commit d9ce6df3c41dcbc7d884d5b19841475fac330757 Author: SelaDate: 2016-04-14T19:01:15Z Replace valueInEmptyWindows with valueInGlobalWindow commit 2309ad6dbd74333808d6b993b43fe81791a19611 Author: Sela Date: 2016-04-14T19:02:20Z Replace valueInEmptyWindows with valueInGlobalWindow in Spark Function, and add per-value (non-RDD) windowing functions commit 47fdf133da440a9ee8a6a1bc5ea34e59138f7c53 Author: Sela Date: 2016-04-14T19:05:14Z Materialize PCollection/RDD as windowed values with the appropriate windows. commit 06afe790c3e933f7a1a5ed26c3586c22dbaf750f Author: Sela Date: 2016-04-14T20:18:24Z Add unit test for TextIO output to support the mvn exec:exec example we provide in README commit 51089e52474127e4dba21014ca79e1e8879b98cd Author: Sela Date: 2016-04-14T20:58:14Z Satisfy checkstyle > The Spark runner uses valueInEmptyWindow which causes values to be dropped > -- > > Key: BEAM-189 > URL: https://issues.apache.org/jira/browse/BEAM-189 > Project: Beam > Issue Type: Bug > Components: runner-spark >Reporter: Amit Sela >Assignee: Amit Sela > > Values in empty windowed may be dropped at anytime and so the default > windowing should be with GlobalWindow > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-155) Support asserting the contents of windows and panes in PAssert
[ https://issues.apache.org/jira/browse/BEAM-155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15242046#comment-15242046 ] ASF GitHub Bot commented on BEAM-155: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/150 > Support asserting the contents of windows and panes in PAssert > -- > > Key: BEAM-155 > URL: https://issues.apache.org/jira/browse/BEAM-155 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Thomas Groh >Assignee: Thomas Groh > > This consists of reifying the output windows and panes, and running asserts > per-window about the contents of panes. > This includes aggregated matching and final pane matching, e.g. > PAssert.that(output).byOnTimePane().hasOutputElements(foo, bar); > // For discarding mode - could have emitted (say) [spam, eggs], [spam], [], > [sausage], [] > PAssert.that(output).byFinalPane().hasOutputElements(spam, eggs, sausage, > spam); > // For accumulating mode without late data > PAssert.that(output).finalPane().containsInAnyOrder(spam, eggs, sausage, > spam); > // For accumulating mode with late data > PAssert.that(output).finalPane().containsInAnyOrder(foo, > bar).mayAlsoContain(baz, rab); > See also: > https://docs.google.com/document/d/1fZUUbG2LxBtqCVabQshldXIhkMcXepsbv2vuuny8Ix4/edit# -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-121) Publish DisplayData from common PTransforms
[ https://issues.apache.org/jira/browse/BEAM-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243430#comment-15243430 ] ASF GitHub Bot commented on BEAM-121: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/124 > Publish DisplayData from common PTransforms > --- > > Key: BEAM-121 > URL: https://issues.apache.org/jira/browse/BEAM-121 > Project: Beam > Issue Type: Sub-task > Components: sdk-java-core >Reporter: Scott Wegner >Assignee: Scott Wegner > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-77) Reorganize Directory structure
[ https://issues.apache.org/jira/browse/BEAM-77?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243263#comment-15243263 ] ASF GitHub Bot commented on BEAM-77: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/96 > Reorganize Directory structure > -- > > Key: BEAM-77 > URL: https://issues.apache.org/jira/browse/BEAM-77 > Project: Beam > Issue Type: Task > Components: project-management >Reporter: Frances Perry >Assignee: Jean-Baptiste Onofré > > Now that we've done the initial Dataflow code drop, we will restructure > directories to provide space for additional SDKs and Runners. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243267#comment-15243267 ] ASF GitHub Bot commented on BEAM-22: GitHub user tgroh opened a pull request: https://github.com/apache/incubator-beam/pull/188 [BEAM-22] Improve ParDoEvaluator Factoring Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- This moves shared code into a common location. Clone DoFn instances before constructing the DoFnRunner to avoid races. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tgroh/incubator-beam ippr_better_ParDoEvaluator_factoring Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/188.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #188 commit ecc26d51cee4ea1568948d48cd3441594f638e39 Author: Thomas GrohDate: 2016-03-30T00:38:22Z Move Shared construction code to ParDoInProcessEvaluator Remove duplicate code in ParDo(Single/Multi)EvaluatorFactory; instead only extract the appropriate elements and pass them to the ParDoInProcessEvaluator.t log commit e47aba0a7097cee8341369594e47e73b83029a50 Author: Thomas Groh Date: 2016-04-15T17:23:15Z Clone DoFns before constructing a DoFnRunner This ensures that each thread gets an individual copy of a DoFn, so multiple threads do not interact. > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-202) Remove YYYCoderBase
[ https://issues.apache.org/jira/browse/BEAM-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243875#comment-15243875 ] ASF GitHub Bot commented on BEAM-202: - GitHub user lukecwik opened a pull request: https://github.com/apache/incubator-beam/pull/194 [BEAM-202] Clean-up *CoderBase classes Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [x] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [x] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [x] Replace `` in the title with the actual Jira issue number, if there is one. - [x] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- We are on Jackson 2.7.0 now. You can merge this pull request into a Git repository by running: $ git pull https://github.com/lukecwik/incubator-beam remove_coder_base Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/194.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #194 commit 11e8bfed4c73ec21dcf5a671ff0381ec1f39d8d4 Author: Luke CwikDate: 2016-04-15T23:53:23Z [BEAM-202] Clean-up *CoderBase classes since we are on a newer version of Jackson > Remove YYYCoderBase > --- > > Key: BEAM-202 > URL: https://issues.apache.org/jira/browse/BEAM-202 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Luke Cwik >Assignee: Luke Cwik > > Jackson 2.4.5 had a bug where it couldn't deserialize types with multiple > generic parameters. Since we are on Jackson 2.7.0, we can remove KvCoderBase > and MapCoderBase -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-50) BigQueryIO.Write: reimplement in Java
[ https://issues.apache.org/jira/browse/BEAM-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243825#comment-15243825 ] ASF GitHub Bot commented on BEAM-50: GitHub user peihe opened a pull request: https://github.com/apache/incubator-beam/pull/192 [BEAM-50] Fix BigQuery.Write tempFilePrefix concatenation You can merge this pull request into a Git repository by running: $ git pull https://github.com/peihe/incubator-beam fix-path-resolve Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/192.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #192 commit 8024a85f05f568b83745d8f61c443bf8ca179421 Author: Pei HeDate: 2016-04-15T23:15:11Z Fix BigQuery.Write tempFilePrefix concatenation > BigQueryIO.Write: reimplement in Java > - > > Key: BEAM-50 > URL: https://issues.apache.org/jira/browse/BEAM-50 > Project: Beam > Issue Type: New Feature > Components: sdk-java-gcp >Reporter: Daniel Halperin >Priority: Minor > > BigQueryIO.Write is currently implemented in a somewhat hacky way. > Unbounded sink: > * The DirectPipelineRunner and the DataflowPipelineRunner use > StreamingWriteFn and BigQueryTableInserter to insert rows using BigQuery's > streaming writes API. > Bounded sink: > * The DirectPipelineRunner still uses streaming writes. > * The DataflowPipelineRunner uses a different code path in the Google Cloud > Dataflow service that writes to GCS and the initiates a BigQuery load job. > * Per-window table destinations do not work scalably. (See Beam-XXX). > We need to reimplement BigQueryIO.Write fully in Java code in order to > support other runners in a scalable way. > I additionally suggest that we revisit the design of the BigQueryIO sink in > the process. A short list: > * Do not use TableRow as the default value for rows. It could be Map Object> with well-defined types, for example, or an Avro GenericRecord. > Dropping TableRow will get around a variety of issues with types, fields > named 'f', etc., and it will also reduce confusion as we use TableRow objects > differently than usual (for good reason). > * Possibly support not-knowing the schema until pipeline execution time. > * Our builders for BigQueryIO.Write are useful and we should keep them. Where > possible we should also allow users to provide the JSON objects that > configure the underlying table creation, write disposition, etc. This would > let users directly control things like table expiration time, table location, > etc., Would also optimistically let users take advantage of some new BigQuery > features without code changes. > * We could choose between streaming write API and load jobs based on user > preference or dynamic job properties . We could use streaming write in a > batch pipeline if the data is small. We could use load jobs in streaming > pipelines if the windows are large enough to make this practical. > * When issuing BigQuery load jobs, we could leave files in GCS if the import > fails, so that data errors can be debugged. > * We should make per-window table writes scalable in batch. > Caveat, possibly blocker: > * (Beam-XXX): cleanup and temp file management. One advantage of the Google > Cloud Dataflow implementation of BigQueryIO.Write is cleanup: we ensure that > intermediate files are deleted when bundles or jobs fail, etc. Beam does not > currently support this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243888#comment-15243888 ] ASF GitHub Bot commented on BEAM-22: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/178 > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-50) BigQueryIO.Write: reimplement in Java
[ https://issues.apache.org/jira/browse/BEAM-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243900#comment-15243900 ] ASF GitHub Bot commented on BEAM-50: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/193 > BigQueryIO.Write: reimplement in Java > - > > Key: BEAM-50 > URL: https://issues.apache.org/jira/browse/BEAM-50 > Project: Beam > Issue Type: New Feature > Components: sdk-java-gcp >Reporter: Daniel Halperin >Priority: Minor > > BigQueryIO.Write is currently implemented in a somewhat hacky way. > Unbounded sink: > * The DirectPipelineRunner and the DataflowPipelineRunner use > StreamingWriteFn and BigQueryTableInserter to insert rows using BigQuery's > streaming writes API. > Bounded sink: > * The DirectPipelineRunner still uses streaming writes. > * The DataflowPipelineRunner uses a different code path in the Google Cloud > Dataflow service that writes to GCS and the initiates a BigQuery load job. > * Per-window table destinations do not work scalably. (See Beam-XXX). > We need to reimplement BigQueryIO.Write fully in Java code in order to > support other runners in a scalable way. > I additionally suggest that we revisit the design of the BigQueryIO sink in > the process. A short list: > * Do not use TableRow as the default value for rows. It could be MapObject> with well-defined types, for example, or an Avro GenericRecord. > Dropping TableRow will get around a variety of issues with types, fields > named 'f', etc., and it will also reduce confusion as we use TableRow objects > differently than usual (for good reason). > * Possibly support not-knowing the schema until pipeline execution time. > * Our builders for BigQueryIO.Write are useful and we should keep them. Where > possible we should also allow users to provide the JSON objects that > configure the underlying table creation, write disposition, etc. This would > let users directly control things like table expiration time, table location, > etc., Would also optimistically let users take advantage of some new BigQuery > features without code changes. > * We could choose between streaming write API and load jobs based on user > preference or dynamic job properties . We could use streaming write in a > batch pipeline if the data is small. We could use load jobs in streaming > pipelines if the windows are large enough to make this practical. > * When issuing BigQuery load jobs, we could leave files in GCS if the import > fails, so that data errors can be debugged. > * We should make per-window table writes scalable in batch. > Caveat, possibly blocker: > * (Beam-XXX): cleanup and temp file management. One advantage of the Google > Cloud Dataflow implementation of BigQueryIO.Write is cleanup: we ensure that > intermediate files are deleted when bundles or jobs fail, etc. Beam does not > currently support this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-78) Rename Dataflow to Beam
[ https://issues.apache.org/jira/browse/BEAM-78?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243807#comment-15243807 ] ASF GitHub Bot commented on BEAM-78: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/191 > Rename Dataflow to Beam > > > Key: BEAM-78 > URL: https://issues.apache.org/jira/browse/BEAM-78 > Project: Beam > Issue Type: Task > Components: sdk-java-core >Reporter: Frances Perry >Assignee: Jean-Baptiste Onofré >Priority: Blocker > > The initial code drop contains code that uses "Dataflow" to refer to the > SDK/model and Cloud Dataflow service. The first usage needs to be swapped to > Beam. > This includes: > - mentions throughout the javadoc > - packages of classes that belong to the java sdk core > And does not include: > - the DataflowPipelineRunner > We plan to postpone this rename until other code drops have been integrated > into the repository, and we have completed the refactoring that will separate > these two uses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-50) BigQueryIO.Write: reimplement in Java
[ https://issues.apache.org/jira/browse/BEAM-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243836#comment-15243836 ] ASF GitHub Bot commented on BEAM-50: GitHub user peihe opened a pull request: https://github.com/apache/incubator-beam/pull/193 [BEAM-50] Remove BigQueryIO.Write.Bound translator You can merge this pull request into a Git repository by running: $ git pull https://github.com/peihe/incubator-beam remove-bq-write-translator Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/193.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #193 commit 5dd12fe94a5ea2249cf5156c16df1b22aa663baf Author: Pei HeDate: 2016-04-15T23:19:54Z [BEAM-50] Remove BigQueryIO.Write.Bound translator > BigQueryIO.Write: reimplement in Java > - > > Key: BEAM-50 > URL: https://issues.apache.org/jira/browse/BEAM-50 > Project: Beam > Issue Type: New Feature > Components: sdk-java-gcp >Reporter: Daniel Halperin >Priority: Minor > > BigQueryIO.Write is currently implemented in a somewhat hacky way. > Unbounded sink: > * The DirectPipelineRunner and the DataflowPipelineRunner use > StreamingWriteFn and BigQueryTableInserter to insert rows using BigQuery's > streaming writes API. > Bounded sink: > * The DirectPipelineRunner still uses streaming writes. > * The DataflowPipelineRunner uses a different code path in the Google Cloud > Dataflow service that writes to GCS and the initiates a BigQuery load job. > * Per-window table destinations do not work scalably. (See Beam-XXX). > We need to reimplement BigQueryIO.Write fully in Java code in order to > support other runners in a scalable way. > I additionally suggest that we revisit the design of the BigQueryIO sink in > the process. A short list: > * Do not use TableRow as the default value for rows. It could be Map Object> with well-defined types, for example, or an Avro GenericRecord. > Dropping TableRow will get around a variety of issues with types, fields > named 'f', etc., and it will also reduce confusion as we use TableRow objects > differently than usual (for good reason). > * Possibly support not-knowing the schema until pipeline execution time. > * Our builders for BigQueryIO.Write are useful and we should keep them. Where > possible we should also allow users to provide the JSON objects that > configure the underlying table creation, write disposition, etc. This would > let users directly control things like table expiration time, table location, > etc., Would also optimistically let users take advantage of some new BigQuery > features without code changes. > * We could choose between streaming write API and load jobs based on user > preference or dynamic job properties . We could use streaming write in a > batch pipeline if the data is small. We could use load jobs in streaming > pipelines if the windows are large enough to make this practical. > * When issuing BigQuery load jobs, we could leave files in GCS if the import > fails, so that data errors can be debugged. > * We should make per-window table writes scalable in batch. > Caveat, possibly blocker: > * (Beam-XXX): cleanup and temp file management. One advantage of the Google > Cloud Dataflow implementation of BigQueryIO.Write is cleanup: we ensure that > intermediate files are deleted when bundles or jobs fail, etc. Beam does not > currently support this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-121) Publish DisplayData from common PTransforms
[ https://issues.apache.org/jira/browse/BEAM-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243554#comment-15243554 ] ASF GitHub Bot commented on BEAM-121: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/125 > Publish DisplayData from common PTransforms > --- > > Key: BEAM-121 > URL: https://issues.apache.org/jira/browse/BEAM-121 > Project: Beam > Issue Type: Sub-task > Components: sdk-java-core >Reporter: Scott Wegner >Assignee: Scott Wegner > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-121) Publish DisplayData from common PTransforms
[ https://issues.apache.org/jira/browse/BEAM-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243723#comment-15243723 ] ASF GitHub Bot commented on BEAM-121: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/126 > Publish DisplayData from common PTransforms > --- > > Key: BEAM-121 > URL: https://issues.apache.org/jira/browse/BEAM-121 > Project: Beam > Issue Type: Sub-task > Components: sdk-java-core >Reporter: Scott Wegner >Assignee: Scott Wegner > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-78) Rename Dataflow to Beam
[ https://issues.apache.org/jira/browse/BEAM-78?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243743#comment-15243743 ] ASF GitHub Bot commented on BEAM-78: GitHub user lukecwik opened a pull request: https://github.com/apache/incubator-beam/pull/191 [BEAM-78] Expose package private methods that Dataflow worker relies on Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [x] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [x] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [x] Replace `` in the title with the actual Jira issue number, if there is one. - [x] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- The Beam rename caused package structure to change. This broke some of the visiblity requirements inside Dataflow worker. You can merge this pull request into a Git repository by running: $ git pull https://github.com/lukecwik/incubator-beam beam_rename Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/191.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #191 commit 99020bc5a31669af191778c3a58d80a57263 Author: Luke CwikDate: 2016-04-15T22:04:46Z [BEAM-78] Expose package private methods that Dataflow worker relies on The Beam rename caused package structure to change. This broke some of the visiblity requirements inside Dataflow worker. commit 70d7a7cc56fc4c353be2a64c35a38080cf125b69 Author: Luke Cwik Date: 2016-04-15T22:09:46Z [BEAM-78] !fixup Fix package import order > Rename Dataflow to Beam > > > Key: BEAM-78 > URL: https://issues.apache.org/jira/browse/BEAM-78 > Project: Beam > Issue Type: Task > Components: sdk-java-core >Reporter: Frances Perry >Assignee: Jean-Baptiste Onofré >Priority: Blocker > > The initial code drop contains code that uses "Dataflow" to refer to the > SDK/model and Cloud Dataflow service. The first usage needs to be swapped to > Beam. > This includes: > - mentions throughout the javadoc > - packages of classes that belong to the java sdk core > And does not include: > - the DataflowPipelineRunner > We plan to postpone this rename until other code drops have been integrated > into the repository, and we have completed the refactoring that will separate > these two uses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-155) Support asserting the contents of windows and panes in PAssert
[ https://issues.apache.org/jira/browse/BEAM-155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243665#comment-15243665 ] ASF GitHub Bot commented on BEAM-155: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/180 > Support asserting the contents of windows and panes in PAssert > -- > > Key: BEAM-155 > URL: https://issues.apache.org/jira/browse/BEAM-155 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Thomas Groh >Assignee: Thomas Groh > > This consists of reifying the output windows and panes, and running asserts > per-window about the contents of panes. > This includes aggregated matching and final pane matching, e.g. > PAssert.that(output).byOnTimePane().hasOutputElements(foo, bar); > // For discarding mode - could have emitted (say) [spam, eggs], [spam], [], > [sausage], [] > PAssert.that(output).byFinalPane().hasOutputElements(spam, eggs, sausage, > spam); > // For accumulating mode without late data > PAssert.that(output).finalPane().containsInAnyOrder(spam, eggs, sausage, > spam); > // For accumulating mode with late data > PAssert.that(output).finalPane().containsInAnyOrder(foo, > bar).mayAlsoContain(baz, rab); > See also: > https://docs.google.com/document/d/1fZUUbG2LxBtqCVabQshldXIhkMcXepsbv2vuuny8Ix4/edit# -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-201) Material page
[ https://issues.apache.org/jira/browse/BEAM-201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243674#comment-15243674 ] ASF GitHub Bot commented on BEAM-201: - GitHub user evilsoapbox opened a pull request: https://github.com/apache/incubator-beam-site/pull/13 Material page; logo fixes - Added material page with project logos/materials - Navigation fixes - Logo fix for the main logo on all pages In JIRA as [BEAM-201](https://issues.apache.org/jira/browse/BEAM-201) You can merge this pull request into a Git repository by running: $ git pull https://github.com/evilsoapbox/incubator-beam-site logo-files Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam-site/pull/13.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13 commit a7cdd50326c5f14e7b578328001a710ba4672620 Author: James MaloneDate: 2016-04-15T21:13:19Z Addition of material page; nav fixes > Material page > - > > Key: BEAM-201 > URL: https://issues.apache.org/jira/browse/BEAM-201 > Project: Beam > Issue Type: Improvement > Components: website > Environment: Create a website page with logo and project material > content >Reporter: James Malone >Assignee: James Malone > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-177) Integrate code coverage to build and review process
[ https://issues.apache.org/jira/browse/BEAM-177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246085#comment-15246085 ] ASF GitHub Bot commented on BEAM-177: - Github user kennknowles closed the pull request at: https://github.com/apache/incubator-beam/pull/138 > Integrate code coverage to build and review process > --- > > Key: BEAM-177 > URL: https://issues.apache.org/jira/browse/BEAM-177 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > > We cannot use codecov, but we can use coveralls. We have the maven plugin > included in the pom and need to invoke it appropriately in our various > builds, and disseminate knowledge about browser extensions to get it into the > pull request UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-202) Remove YYYCoderBase
[ https://issues.apache.org/jira/browse/BEAM-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245961#comment-15245961 ] ASF GitHub Bot commented on BEAM-202: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/194 > Remove YYYCoderBase > --- > > Key: BEAM-202 > URL: https://issues.apache.org/jira/browse/BEAM-202 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Luke Cwik >Assignee: Luke Cwik > > Jackson 2.4.5 had a bug where it couldn't deserialize types with multiple > generic parameters. Since we are on Jackson 2.7.0, we can remove KvCoderBase > and MapCoderBase -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-196) Pipeline options must be available Context in DoFn.startBundle
[ https://issues.apache.org/jira/browse/BEAM-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245934#comment-15245934 ] ASF GitHub Bot commented on BEAM-196: - GitHub user mxm opened a pull request: https://github.com/apache/incubator-beam/pull/200 [BEAM-196] Pipeline options must be available Context in DoFn.startBundle This gets rid of the custom Java serialization code by defaulting to serialization of the `PipelineOptions` to a byte array. So far, this has been proven the most hassle-free method for the Flink Runner. For code reuse and avoiding multiple deserialization of the byte array, the `SerializedPipelineOptions` class has been introduced. The changes also make the options accessible in the context of the `DoFn` function. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mxm/incubator-beam BEAM-196 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/200.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #200 commit 81577b31c2642522f7dd4ba8eba794df48a0ca56 Author: Maximilian MichelsDate: 2016-04-18T15:40:38Z [BEAM-196] abstraction for PipelineOptions serialization commit 43b5ec743718e63c2d9d9532e3ca55bc87370290 Author: Maximilian Michels Date: 2016-04-18T15:40:50Z [BEAM-196] make use of SerializedPipelineOptions > Pipeline options must be available Context in DoFn.startBundle > -- > > Key: BEAM-196 > URL: https://issues.apache.org/jira/browse/BEAM-196 > Project: Beam > Issue Type: Bug > Components: runner-flink >Reporter: Mark Shields >Assignee: Maximilian Michels > > Our (not yet merged) Java Pubsub implementation has code like this in a DoFn: > @Override > public void startBundle(Context c) throws Exception { > Preconditions.checkState(pubsubClient == null); > pubsubClient = PubsubClient.newClient(transportType, > timestampLabel, idLabel, > c.getPipelineOptions().as(PubsubOptions.class)); > super.startBundle(c); > } > This fails with NPE since the pipeline options are not conveyed via the > context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246471#comment-15246471 ] ASF GitHub Bot commented on BEAM-22: GitHub user tgroh opened a pull request: https://github.com/apache/incubator-beam/pull/201 [BEAM-22] Remove isKeyed property of InProcess Bundles Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- The property of keyedness belongs to a PCollection. A BundleFactory propogates the key as far as possible, but does not track if a bundle is keyed. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tgroh/incubator-beam ippr_remove_bundle_iskeyed Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/201.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #201 commit 9644b7e20955603cb191f79f16b0f1f50b6497db Author: Thomas GrohDate: 2016-04-18T19:59:24Z Remove isKeyed property of InProcess Bundles The property of keyedness belongs to a PCollection. A BundleFactory propogates the key as far as possible, but does not track if a bundle is keyed. > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-208) Flaky test: org.apache.beam.runners.flink.streaming.GroupByNullKeyTest.testJob
[ https://issues.apache.org/jira/browse/BEAM-208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248133#comment-15248133 ] ASF GitHub Bot commented on BEAM-208: - GitHub user tgroh opened a pull request: https://github.com/apache/incubator-beam/pull/210 [BEAM-208] Remove use of System.currentTimeMillis in Flink Test Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- This will stop tests failing if the DoFn executes within ~25 seconds of an hour boundary. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tgroh/incubator-beam flink_deflake_groupByKeyNull Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/210.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #210 commit 416ca1f5e589b19e9e98e68fbb0e35349e19b82d Author: Thomas GrohDate: 2016-04-19T16:39:38Z Remove use of System.currentTimeMillis in Flink Test This will stop tests failing if the DoFn executes within ~25 seconds of an hour boundary. > Flaky test: org.apache.beam.runners.flink.streaming.GroupByNullKeyTest.testJob > -- > > Key: BEAM-208 > URL: https://issues.apache.org/jira/browse/BEAM-208 > Project: Beam > Issue Type: Bug > Components: runner-flink >Reporter: Davor Bonaci >Assignee: Maximilian Michels >Priority: Minor > > org.apache.beam.runners.flink.streaming.GroupByNullKeyTest.testJob sometimes > flakes out. > Error: > Different number of lines in expected and obtained result. expected:<1> but > was:<2> > Here's an example on Jenkins: > https://builds.apache.org/job/beam_PostCommit_MavenVerify/org.apache.beam$flink-runner_2.10/199/testReport/junit/org.apache.beam.runners.flink.streaming/GroupByNullKeyTest/testJob/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246581#comment-15246581 ] ASF GitHub Bot commented on BEAM-22: Github user tgroh closed the pull request at: https://github.com/apache/incubator-beam/pull/202 > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246791#comment-15246791 ] ASF GitHub Bot commented on BEAM-22: Github user tgroh closed the pull request at: https://github.com/apache/incubator-beam/pull/204 > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-50) BigQueryIO.Write: reimplement in Java
[ https://issues.apache.org/jira/browse/BEAM-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246956#comment-15246956 ] ASF GitHub Bot commented on BEAM-50: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/205 > BigQueryIO.Write: reimplement in Java > - > > Key: BEAM-50 > URL: https://issues.apache.org/jira/browse/BEAM-50 > Project: Beam > Issue Type: New Feature > Components: sdk-java-gcp >Reporter: Daniel Halperin >Priority: Minor > > BigQueryIO.Write is currently implemented in a somewhat hacky way. > Unbounded sink: > * The DirectPipelineRunner and the DataflowPipelineRunner use > StreamingWriteFn and BigQueryTableInserter to insert rows using BigQuery's > streaming writes API. > Bounded sink: > * The DirectPipelineRunner still uses streaming writes. > * The DataflowPipelineRunner uses a different code path in the Google Cloud > Dataflow service that writes to GCS and the initiates a BigQuery load job. > * Per-window table destinations do not work scalably. (See Beam-XXX). > We need to reimplement BigQueryIO.Write fully in Java code in order to > support other runners in a scalable way. > I additionally suggest that we revisit the design of the BigQueryIO sink in > the process. A short list: > * Do not use TableRow as the default value for rows. It could be MapObject> with well-defined types, for example, or an Avro GenericRecord. > Dropping TableRow will get around a variety of issues with types, fields > named 'f', etc., and it will also reduce confusion as we use TableRow objects > differently than usual (for good reason). > * Possibly support not-knowing the schema until pipeline execution time. > * Our builders for BigQueryIO.Write are useful and we should keep them. Where > possible we should also allow users to provide the JSON objects that > configure the underlying table creation, write disposition, etc. This would > let users directly control things like table expiration time, table location, > etc., Would also optimistically let users take advantage of some new BigQuery > features without code changes. > * We could choose between streaming write API and load jobs based on user > preference or dynamic job properties . We could use streaming write in a > batch pipeline if the data is small. We could use load jobs in streaming > pipelines if the windows are large enough to make this practical. > * When issuing BigQuery load jobs, we could leave files in GCS if the import > fails, so that data errors can be debugged. > * We should make per-window table writes scalable in batch. > Caveat, possibly blocker: > * (Beam-XXX): cleanup and temp file management. One advantage of the Google > Cloud Dataflow implementation of BigQueryIO.Write is cleanup: we ensure that > intermediate files are deleted when bundles or jobs fail, etc. Beam does not > currently support this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246889#comment-15246889 ] ASF GitHub Bot commented on BEAM-22: GitHub user tgroh opened a pull request: https://github.com/apache/incubator-beam/pull/207 [BEAM-22] Track Pending Elements via Exploded WindowedValues Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- This allows the WindowedValues that are completed to be removed from the set of pending elements, even if the actual object is a different instance, by ensuring that all WindowedValues contain only a single (element, window) pair. Built on top of #206 You can merge this pull request into a Git repository by running: $ git pull https://github.com/tgroh/incubator-beam ippr_exploded_wm_tracking Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/207.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #207 commit ac0696d8867d7a1706f1ff85c4b299c2b1779d02 Author: Thomas GrohDate: 2016-04-18T23:55:57Z Add WindowedValue#explodeWindows This takes an existing WindowedValue and returns a Collection of WindowedValues, each of which is in exactly one window. Use the explode implementation on DoFnRunnerBase commit 725b2ddea58add3f583ed6f7c74f6ab4343cf292 Author: Thomas Groh Date: 2016-04-19T00:28:47Z Track pending elements via exploded WindowedValues This allows the WindowedValues to be partially completed while the still holding the watermark. > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248040#comment-15248040 ] ASF GitHub Bot commented on BEAM-22: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/203 > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-189) The Spark runner uses valueInEmptyWindow which causes values to be dropped
[ https://issues.apache.org/jira/browse/BEAM-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248476#comment-15248476 ] ASF GitHub Bot commented on BEAM-189: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/179 > The Spark runner uses valueInEmptyWindow which causes values to be dropped > -- > > Key: BEAM-189 > URL: https://issues.apache.org/jira/browse/BEAM-189 > Project: Beam > Issue Type: Bug > Components: runner-spark >Reporter: Amit Sela >Assignee: Amit Sela > > Values in empty windowed may be dropped at anytime and so the default > windowing should be with GlobalWindow > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-210) Be consistent with emitting final empty panes
[ https://issues.apache.org/jira/browse/BEAM-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248488#comment-15248488 ] ASF GitHub Bot commented on BEAM-210: - GitHub user bjchambers opened a pull request: https://github.com/apache/incubator-beam/pull/211 [BEAM-210] Test that empty final panes are not produced. Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [x] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [x] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [x] Replace `` in the title with the actual Jira issue number, if there is one. - [x] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- You can merge this pull request into a Git repository by running: $ git pull https://github.com/bjchambers/incubator-beam empty-final-panes Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/211.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #211 commit cae541795c632f5ba4799b24a730018ec75ffb1b Author: bchambersDate: 2016-04-19T18:34:25Z Remove unused generic arguments in ReduceFnRunnerTest. commit 4ab1c175f188e03f0ef2a5b6b2019c1e0ba27260 Author: bchambers Date: 2016-04-19T19:43:52Z Add test for empty ON_TIME and no empty final pane Add a test that we get an empty `ON_TIME` pane, and don't get the empty final pane when using accumulation mode with the only if non-empty `ClosingBehavior`. > Be consistent with emitting final empty panes > - > > Key: BEAM-210 > URL: https://issues.apache.org/jira/browse/BEAM-210 > Project: Beam > Issue Type: Bug > Components: runner-core >Reporter: Mark Shields >Assignee: Mark Shields > > Currently ReduceFnRunner.onTrigger uses shouldEmit to prevent empty final > panes unless the user has requested them. > The same check needs to be done in ReduceFnRunner.onTimer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250092#comment-15250092 ] ASF GitHub Bot commented on BEAM-22: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/207 > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-207) Flink test flake in ReadSourceStreamingITCase
[ https://issues.apache.org/jira/browse/BEAM-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250310#comment-15250310 ] ASF GitHub Bot commented on BEAM-207: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/209 > Flink test flake in ReadSourceStreamingITCase > - > > Key: BEAM-207 > URL: https://issues.apache.org/jira/browse/BEAM-207 > Project: Beam > Issue Type: Bug > Components: runner-flink, testing >Reporter: Daniel Halperin >Assignee: Maximilian Michels > > Log from Travis: > https://s3.amazonaws.com/archive.travis-ci.org/jobs/124066205/log.txt > Snippet: > {noformat} > Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.792 sec - > in org.apache.beam.runners.flink.SideInputITCase > Running org.apache.beam.runners.flink.ReadSourceStreamingITCase > Pipeline execution failed > java.lang.RuntimeException: Pipeline execution failed > at > org.apache.beam.runners.flink.FlinkPipelineRunner.run(FlinkPipelineRunner.java:119) > at > org.apache.beam.runners.flink.FlinkPipelineRunner.run(FlinkPipelineRunner.java:51) > at org.apache.beam.sdk.Pipeline.run(Pipeline.java:182) > at > org.apache.beam.runners.flink.ReadSourceStreamingITCase.runProgram(ReadSourceStreamingITCase.java:70) > at > org.apache.beam.runners.flink.ReadSourceStreamingITCase.testProgram(ReadSourceStreamingITCase.java:53) > at > org.apache.flink.streaming.util.StreamingProgramTestBase.testJob(StreamingProgramTestBase.java:85) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) > at org.junit.runners.ParentRunner.run(ParentRunner.java:309) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124) > at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153) > at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103) > Caused by: org.apache.flink.runtime.client.JobExecutionException: Job > execution failed. > at > org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:714) > at > org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:660) > at > org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:660) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) > at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253) > at >
[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250442#comment-15250442 ] ASF GitHub Bot commented on BEAM-22: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/201 > DirectPipelineRunner: support for unbounded collections > --- > > Key: BEAM-22 > URL: https://issues.apache.org/jira/browse/BEAM-22 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Davor Bonaci >Assignee: Thomas Groh > > DirectPipelineRunner currently runs over bounded PCollections only, and > implements only a portion of the Beam Model. > We should improve it to faithfully implement the full Beam Model, such as add > ability to run over unbounded PCollections, and better resemble execution > model in a distributed system. > This further enables features such as a testing source which may simulate > late data and test triggers in the pipeline. Finally, we may want to expose > an option to select between "debug" (single threaded), "chaos monkey" (test > as many model requirements as possible), and "performance" (multi-threaded). > more testing (chaos monkey) > Once this is done, we should update this StackOverflow question: > http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-188) Merging WindowFn + GBK + Write => InvalidWindows throws UnsupportedOperationException
[ https://issues.apache.org/jira/browse/BEAM-188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15242215#comment-15242215 ] ASF GitHub Bot commented on BEAM-188: - GitHub user dhalperi opened a pull request: https://github.com/apache/incubator-beam/pull/181 [BEAM-188] Write: apply GlobalWindows first And do not supply a timestamp when outputting. Note that this is safe because the functions in the Writer cannot access the window or timestamp. When we add per-Window or similar functions to the sinks, we will likely do so at a higher level. Also testing and improving the existing tests. Note that the session test did fail without the accompanying changes to Write. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dhalperi/incubator-beam write-global-windows Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/181.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #181 commit 303e14d508a45c848bafb227944e8b8a4078efb0 Author: Dan HalperinDate: 2016-04-15T00:23:42Z [BEAM-188] Write: apply GlobalWindows first And do not supply a timestamp when outputting. Note that this is safe because the functions in the Writer cannot access the window or timestamp. When we add per-Window or similar functions to the sinks, we will likely do so at a higher level. > Merging WindowFn + GBK + Write => InvalidWindows throws > UnsupportedOperationException > - > > Key: BEAM-188 > URL: https://issues.apache.org/jira/browse/BEAM-188 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Kenneth Knowles >Assignee: Daniel Halperin > > The {{Write}} transform performs {{outputWithTimestamp(..., Instant.now())}} > in the {{finishBundle}} of one of the encapsulated {{ParDo}} transforms. This > action causes the {{WindowFn}} to be invoked to assign a window to the output > value. But a merging {{WindowFn}} such as {{Sessions}} will be replaced by > {{InvalidWindows}} at the GBK where merging is performed, so this is destined > to crash. > It is almost certain that the window is not relevant, so we can quickly fix > this by just windowing into the global window earlier and using vanilla > {{output(...)}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-188) Merging WindowFn + GBK + Write => InvalidWindows throws UnsupportedOperationException
[ https://issues.apache.org/jira/browse/BEAM-188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15242246#comment-15242246 ] ASF GitHub Bot commented on BEAM-188: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/181 > Merging WindowFn + GBK + Write => InvalidWindows throws > UnsupportedOperationException > - > > Key: BEAM-188 > URL: https://issues.apache.org/jira/browse/BEAM-188 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Kenneth Knowles >Assignee: Daniel Halperin > > The {{Write}} transform performs {{outputWithTimestamp(..., Instant.now())}} > in the {{finishBundle}} of one of the encapsulated {{ParDo}} transforms. This > action causes the {{WindowFn}} to be invoked to assign a window to the output > value. But a merging {{WindowFn}} such as {{Sessions}} will be replaced by > {{InvalidWindows}} at the GBK where merging is performed, so this is destined > to crash. > It is almost certain that the window is not relevant, so we can quickly fix > this by just windowing into the global window earlier and using vanilla > {{output(...)}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)