[jira] [Updated] (BEAM-8568) Local file system does not match relative path with wildcards
[ https://issues.apache.org/jira/browse/BEAM-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Moravek updated BEAM-8568: Fix Version/s: 2.17.0 Affects Version/s: 2.16.0 Description: CWD structure: {code} src/test/resources/input/sometestfile.txt {code} Code: {code:java} input .apply(Create.of("src/test/resources/input/*")) .apply(FileIO.matchAll()) .apply(FileIO.readMatches()) {code} The code above doesn't match any file starting with Beam 2.16.0. The regression was introduced by BEAM-7854. was: Relative path in project: {code:java} src/test/resources/input/sometestfile.txt{code} Code: {code:java} input .apply(Create.of("src/test/resources/input/*")) .apply(FileIO.matchAll()) .apply(FileIO.readMatches()) {code} This code does not match any file. It does in Beam version 2.16.0. The bug was introduced by BEAM-7854. > Local file system does not match relative path with wildcards > - > > Key: BEAM-8568 > URL: https://issues.apache.org/jira/browse/BEAM-8568 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Affects Versions: 2.16.0 >Reporter: Ondrej Cerny >Priority: Major > Fix For: 2.17.0 > > > CWD structure: > {code} > src/test/resources/input/sometestfile.txt > {code} > > Code: > {code:java} > input > .apply(Create.of("src/test/resources/input/*")) > .apply(FileIO.matchAll()) > .apply(FileIO.readMatches()) > {code} > The code above doesn't match any file starting with Beam 2.16.0. The regression > was introduced by BEAM-7854. -- This message was sent by Atlassian Jira (v8.3.4#803005)
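To make the report reproducible end to end, here is a minimal self-contained sketch of the pattern being reported; only the three .apply calls come from the issue, and the wrapping pipeline boilerplate is added for completeness:
{code:java}
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class RelativeWildcardMatch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    PCollection<FileIO.ReadableFile> files =
        p.apply(Create.of("src/test/resources/input/*")) // relative glob against the CWD
            .apply(FileIO.matchAll()) // expands the glob into MatchResult.Metadata
            .apply(FileIO.readMatches()); // opens each matched file for reading
    p.run().waitUntilFinish();
  }
}
{code}
Per the updated description, this glob matched files under the working directory before Beam 2.16.0 and returns no matches from 2.16.0 on.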
[jira] [Comment Edited] (BEAM-8568) Local file system does not match relative path with wildcards
[ https://issues.apache.org/jira/browse/BEAM-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969012#comment-16969012 ] David Moravek edited comment on BEAM-8568 at 11/7/19 7:52 AM: -- LocalFileSystem no longer supports wildcard relative paths. was (Author: davidmoravek): LocalFileSystem no longer support wildcard relative paths. > Local file system does not match relative path with wildcards > - > > Key: BEAM-8568 > URL: https://issues.apache.org/jira/browse/BEAM-8568 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Ondrej Cerny >Priority: Major > > Relative path in project: > {code:java} > src/test/resources/input/sometestfile.txt{code} > > Code: > {code:java} > input > .apply(Create.of("src/test/resources/input/*")) > .apply(FileIO.matchAll()) > .apply(FileIO.readMatches()) > {code} > This code does not match any file. It does in Beam version 2.16.0. > > The bug was introduced by BEAM-7854. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8568) Local file system does not match relative path with wildcards
[ https://issues.apache.org/jira/browse/BEAM-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969012#comment-16969012 ] David Moravek commented on BEAM-8568: - LocalFileSystem no longer support wildcard relative paths. > Local file system does not match relative path with wildcards > - > > Key: BEAM-8568 > URL: https://issues.apache.org/jira/browse/BEAM-8568 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Ondrej Cerny >Priority: Major > > Relative path in project: > {code:java} > src/test/resources/input/sometestfile.txt{code} > > Code: > {code:java} > input > .apply(Create.of("src/test/resources/input/*")) > .apply(FileIO.matchAll()) > .apply(FileIO.readMatches()) > {code} > This code does not match any file. It does in Beam version 2.16.0. > > The bug was introduced by BEAM-7854. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8564) Add LZO compression and decompression support
[ https://issues.apache.org/jira/browse/BEAM-8564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968989#comment-16968989 ] Amogh Tiwari commented on BEAM-8564: [~kenn], please assign this issue to me; I have already started working on it and would love to contribute. You can find a concise plan in the description. > Add LZO compression and decompression support > - > > Key: BEAM-8564 > URL: https://issues.apache.org/jira/browse/BEAM-8564 > Project: Beam > Issue Type: New Feature > Components: sdk-java-core >Reporter: Amogh Tiwari >Priority: Minor > > LZO is a lossless data compression algorithm focused on compression > and decompression speed. > This will enable the Apache Beam SDK to compress/decompress files using the LZO > compression algorithm. > This will include the following functionalities: > # compress() : for compressing files into an LZO archive > # decompress() : for decompressing files archived using LZO compression > Appropriate input and output streams will also be added to enable working with > LZO files. -- This message was sent by Atlassian Jira (v8.3.4#803005)
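A hypothetical sketch of the stream-based shape the description proposes; the class name LzoIO, the Codec seam, and both method signatures are illustrative assumptions, not an existing Beam API, and a real implementation would delegate to an actual LZO library:
{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public final class LzoIO {
  /** Pluggable seam so the sketch stays self-contained; a real version would wrap an LZO codec. */
  public interface Codec {
    OutputStream compressing(OutputStream out) throws IOException;
    InputStream decompressing(InputStream in) throws IOException;
  }

  private final Codec codec;

  public LzoIO(Codec codec) {
    this.codec = codec;
  }

  /** Copies {@code raw} into {@code archive}, compressing on the way through. */
  public void compress(InputStream raw, OutputStream archive) throws IOException {
    try (OutputStream out = codec.compressing(archive)) {
      raw.transferTo(out);
    }
  }

  /** Copies {@code archive} into {@code raw}, decompressing on the way through. */
  public void decompress(InputStream archive, OutputStream raw) throws IOException {
    try (InputStream in = codec.decompressing(archive)) {
      in.transferTo(raw);
    }
  }
}
{code}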
[jira] [Work logged] (BEAM-8539) Clearly define the valid job state transitions
[ https://issues.apache.org/jira/browse/BEAM-8539?focusedWorklogId=339752&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339752 ] ASF GitHub Bot logged work on BEAM-8539: Author: ASF GitHub Bot Created on: 07/Nov/19 05:09 Start Date: 07/Nov/19 05:09 Worklog Time Spent: 10m Work Description: chadrik commented on issue #9965: [BEAM-8539] Make job state transitions in python-based runners consistent with java-based runners URL: https://github.com/apache/beam/pull/9965#issuecomment-550135736 R: @lukecwik R: @mxm R: @robertwb This is a follow-up to https://github.com/apache/beam/pull/9969 that corrects python-based runners to behave the same as Java-based runners wrt STOPPED: it should be the initial state and not a terminal state. Tests pass, but I'm not sure if this could cause problems with Dataflow. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339752) Time Spent: 3h 20m (was: 3h 10m) > Clearly define the valid job state transitions > -- > > Key: BEAM-8539 > URL: https://issues.apache.org/jira/browse/BEAM-8539 > Project: Beam > Issue Type: Improvement > Components: beam-model, runner-core, sdk-java-core, sdk-py-core >Reporter: Chad Dombrova >Priority: Major > Time Spent: 3h 20m > Remaining Estimate: 0h > > The Beam job state transitions are ill-defined, which is a big problem for > anything that relies on the values coming from JobAPI.GetStateStream. > I was hoping to find something like a state transition diagram in the docs so > that I could determine the start state, the terminal states, and the valid > transitions, but I could not find this. The code reveals that the SDKs differ > on the fundamentals: > Java InMemoryJobService: > * start state: *STOPPED* > * run - about to submit to executor: STARTING > * run - actually running on executor: RUNNING > * terminal states: DONE, FAILED, CANCELLED, DRAINED > Python AbstractJobServiceServicer / LocalJobServicer: > * start state: STARTING > * terminal states: DONE, FAILED, CANCELLED, *STOPPED* > I think it would be good to make Python work like Java, so that there is a > difference in state between a job that has been prepared and one that has > additionally been run. > It's hard to tell how far this problem has spread within the various runners. > I think a simple thing that can be done to help standardize behavior is to > implement the terminal states as an enum in the beam_job_api.proto, or create > a utility function in each language for checking if a state is terminal, so > that it's not left up to each runner to reimplement this logic. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
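Since the issue proposes a shared utility for terminal-state checks, here is a minimal sketch of what that could look like in Java; the class and method names are illustrative, and the state list is the one named in this issue rather than the full beam_job_api.proto enum:
{code:java}
import java.util.EnumSet;
import java.util.Set;

public final class JobStates {
  /** Mirrors the states discussed in this issue; the real definition lives in beam_job_api.proto. */
  public enum State { STOPPED, STARTING, RUNNING, DONE, FAILED, CANCELLED, DRAINED }

  // One shared answer to "is this state terminal?", instead of each runner re-deriving it.
  private static final Set<State> TERMINAL =
      EnumSet.of(State.DONE, State.FAILED, State.CANCELLED, State.DRAINED);

  public static boolean isTerminal(State state) {
    return TERMINAL.contains(state);
  }

  private JobStates() {}
}
{code}
Note that whether STOPPED belongs in the terminal set is exactly the Java/Python discrepancy described in the issue.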
[jira] [Work logged] (BEAM-8512) Add integration tests for Python "flink_runner.py"
[ https://issues.apache.org/jira/browse/BEAM-8512?focusedWorklogId=339748&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339748 ] ASF GitHub Bot logged work on BEAM-8512: Author: ASF GitHub Bot Created on: 07/Nov/19 04:42 Start Date: 07/Nov/19 04:42 Worklog Time Spent: 10m Work Description: tweise commented on pull request #9998: [BEAM-8512] Add integration tests for flink_runner.py URL: https://github.com/apache/beam/pull/9998#discussion_r343474271 ## File path: buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy ## @@ -1953,8 +1953,10 @@ class BeamModulePlugin implements Plugin<Project> { } project.ext.addPortableWordCountTasks = { -> -addPortableWordCountTask(false) -addPortableWordCountTask(true) +addPortableWordCountTask(false, "PortableRunner") Review comment: Is there anything that we are testing with PortableRunner that isn't included in FlinkRunner? If not, perhaps just replace those? Side question for follow-up: Since these tasks are Flink-specific, should they be renamed and possibly moved out of BeamModulePlugin? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339748) Time Spent: 1h (was: 50m) > Add integration tests for Python "flink_runner.py" > -- > > Key: BEAM-8512 > URL: https://issues.apache.org/jira/browse/BEAM-8512 > Project: Beam > Issue Type: Test > Components: runner-flink, sdk-py-core >Reporter: Maximilian Michels >Assignee: Kyle Weaver >Priority: Major > Time Spent: 1h > Remaining Estimate: 0h > > There are currently no integration tests for the Python FlinkRunner. We need > a set of tests similar to {{flink_runner_test.py}}, which currently uses the > PortableRunner rather than the FlinkRunner. > CC [~robertwb] [~ibzib] [~thw] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-3288) Guard against unsafe triggers at construction time
[ https://issues.apache.org/jira/browse/BEAM-3288?focusedWorklogId=339746&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339746 ] ASF GitHub Bot logged work on BEAM-3288: Author: ASF GitHub Bot Created on: 07/Nov/19 03:57 Start Date: 07/Nov/19 03:57 Worklog Time Spent: 10m Work Description: kennknowles commented on issue #9960: [BEAM-3288] Guard against unsafe triggers at construction time URL: https://github.com/apache/beam/pull/9960#issuecomment-550675399 Abandoned node enforcement was the culprit, since I was deliberately not running the pipeline. https://scans.gradle.com/s/ie7pt3dazwpbm/tests This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339752) Time Spent: 3h 20m (was: 3h 10m) > Guard against unsafe triggers at construction time > --- > > Key: BEAM-3288 > URL: https://issues.apache.org/jira/browse/BEAM-3288 > Project: Beam > Issue Type: Task > Components: sdk-java-core, sdk-py-core >Reporter: Eugene Kirpichov >Assignee: Kenneth Knowles >Priority: Major > Time Spent: 3h 20m > Remaining Estimate: 0h > > Current Beam trigger semantics are rather confusing and in some cases > extremely unsafe, especially if the pipeline includes multiple chained GBKs. > One example of that is https://issues.apache.org/jira/browse/BEAM-3169 . > There are multiple issues: > The API allows users to specify terminating top-level triggers (e.g. "trigger > a pane after receiving 1 element in the window, and that's it"), but > experience from user support shows that this is nearly always a mistake and > the user did not intend to drop all further data. > In general, triggers are the only place in Beam where data is being dropped > without making a lot of very loud noise about it - a practice for which the > PTransform style guide uses the language: "never, ever, ever do this". > Continuation triggers are still worse. For context: continuation trigger is > the trigger that's set on the output of a GBK and controls further > aggregation of the results of this aggregation by downstream GBKs. The output > shouldn't just use the same trigger as the input, because e.g. if the input > trigger said "wait for an hour before emitting a pane", that doesn't mean > that we should wait for another hour before emitting a result of aggregating > the result of the input trigger. Continuation triggers try to simulate the > behavior "as if a pane of the input propagated through the entire pipeline", > but the implementation of individual continuation triggers doesn't do that. > E.g. the continuation of "first N elements in pane" trigger is "first 1 > element in pane", and if the results of a first GBK are further grouped by a > second GBK onto a coarser key (e.g. if everything is grouped onto the same > key), that effectively means that, of the keys of the first GBK, only one > survives and all others are dropped (what happened in the data loss bug). > The ultimate fix to all of these things is > https://s.apache.org/beam-sink-triggers . However, it is a huge model change, > and meanwhile we have to do something. 
The options are, in order of > increasing backward incompatibility (but incompatibility in a "rejecting > something that previously was accepted but extremely dangerous" kind of way): > - Make the continuation trigger of most triggers be the "always-fire" > trigger. Seems that this should be the case for all triggers except the > watermark trigger. This will definitely increase safety, but lead to more > eager firing of downstream aggregations. It also will violate a user's > expectation that a fire-once trigger fires everything downstream only once, > but that expectation appears impossible to satisfy safely. > - Make the continuation trigger of some triggers be the "invalid" trigger, > i.e. require the user to set it explicitly: there's in general no good and > safe way to infer what a trigger on a second GBK "truly" should be, based on > the trigger of the PCollection input into a first GBK. This is especially > true for terminating triggers. > - Prohibit top-level terminating triggers entirely. This will ensure that the > only data that ever gets dropped is "droppably late" data. > CC: [~bchambers] [~kenn] [~tgroh] -- This message was sent by Atlassian Jira (v8.3.4#803005)
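As a concrete illustration of the terminating-trigger hazard described in this issue, here is a sketch of the trigger shape such a change rejects, next to its safe equivalent (window size, lateness, and element type are illustrative):
{code:java}
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

// Unsafe: AfterPane.elementCountAtLeast(1) finishes after its first firing, so
// any data arriving later in the window is silently dropped by a downstream GBK.
Window<KV<String, Integer>> unsafe =
    Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(1)))
        .triggering(AfterPane.elementCountAtLeast(1))
        .withAllowedLateness(Duration.standardMinutes(10))
        .discardingFiredPanes();

// Safe: wrapping the same condition in Repeatedly.forever(...) never finishes,
// so every pane that arrives is eventually emitted.
Window<KV<String, Integer>> safe =
    Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(1)))
        .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardMinutes(10))
        .discardingFiredPanes();
{code}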
[jira] [Work logged] (BEAM-3288) Guard against unsafe triggers at construction time
[ https://issues.apache.org/jira/browse/BEAM-3288?focusedWorklogId=339743&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339743 ] ASF GitHub Bot logged work on BEAM-3288: Author: ASF GitHub Bot Created on: 07/Nov/19 03:51 Start Date: 07/Nov/19 03:51 Worklog Time Spent: 10m Work Description: kennknowles commented on issue #9960: [BEAM-3288] Guard against unsafe triggers at construction time URL: https://github.com/apache/beam/pull/9960#issuecomment-550667460 Ah, this is of course part of `./gradlew :runners:direct-java:needsRunnerTests` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339743) Time Spent: 3h 10m (was: 3h) > Guard against unsafe triggers at construction time > --- > > Key: BEAM-3288 > URL: https://issues.apache.org/jira/browse/BEAM-3288 > Project: Beam > Issue Type: Task > Components: sdk-java-core, sdk-py-core >Reporter: Eugene Kirpichov >Assignee: Kenneth Knowles >Priority: Major > Time Spent: 3h 10m > Remaining Estimate: 0h > > Current Beam trigger semantics are rather confusing and in some cases > extremely unsafe, especially if the pipeline includes multiple chained GBKs. > One example of that is https://issues.apache.org/jira/browse/BEAM-3169 . > There are multiple issues: > The API allows users to specify terminating top-level triggers (e.g. "trigger > a pane after receiving 1 element in the window, and that's it"), but > experience from user support shows that this is nearly always a mistake and > the user did not intend to drop all further data. > In general, triggers are the only place in Beam where data is being dropped > without making a lot of very loud noise about it - a practice for which the > PTransform style guide uses the language: "never, ever, ever do this". > Continuation triggers are still worse. For context: continuation trigger is > the trigger that's set on the output of a GBK and controls further > aggregation of the results of this aggregation by downstream GBKs. The output > shouldn't just use the same trigger as the input, because e.g. if the input > trigger said "wait for an hour before emitting a pane", that doesn't mean > that we should wait for another hour before emitting a result of aggregating > the result of the input trigger. Continuation triggers try to simulate the > behavior "as if a pane of the input propagated through the entire pipeline", > but the implementation of individual continuation triggers doesn't do that. > E.g. the continuation of "first N elements in pane" trigger is "first 1 > element in pane", and if the results of a first GBK are further grouped by a > second GBK onto a coarser key (e.g. if everything is grouped onto the same > key), that effectively means that, of the keys of the first GBK, only one > survives and all others are dropped (what happened in the data loss bug). > The ultimate fix to all of these things is > https://s.apache.org/beam-sink-triggers . However, it is a huge model change, > and meanwhile we have to do something. The options are, in order of > increasing backward incompatibility (but incompatibility in a "rejecting > something that previously was accepted but extremely dangerous" kind of way): > - Make the continuation trigger of most triggers be the "always-fire" > trigger. 
Seems that this should be the case for all triggers except the > watermark trigger. This will definitely increase safety, but lead to more > eager firing of downstream aggregations. It also will violate a user's > expectation that a fire-once trigger fires everything downstream only once, > but that expectation appears impossible to satisfy safely. > - Make the continuation trigger of some triggers be the "invalid" trigger, > i.e. require the user to set it explicitly: there's in general no good and > safe way to infer what a trigger on a second GBK "truly" should be, based on > the trigger of the PCollection input into a first GBK. This is especially > true for terminating triggers. > - Prohibit top-level terminating triggers entirely. This will ensure that the > only data that ever gets dropped is "droppably late" data. > CC: [~bchambers] [~kenn] [~tgroh] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-3288) Guard against unsafe triggers at construction time
[ https://issues.apache.org/jira/browse/BEAM-3288?focusedWorklogId=339740&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339740 ] ASF GitHub Bot logged work on BEAM-3288: Author: ASF GitHub Bot Created on: 07/Nov/19 03:44 Start Date: 07/Nov/19 03:44 Worklog Time Spent: 10m Work Description: kennknowles commented on pull request #9960: [BEAM-3288] Guard against unsafe triggers at construction time URL: https://github.com/apache/beam/pull/9960#discussion_r343465429 ## File path: sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query10.java ## @@ -314,7 +315,7 @@ public void processElement(ProcessContext c, BoundedWindow window) name + ".WindowLogFiles", Window.>into( FixedWindows.of(Duration.standardSeconds(configuration.windowSizeSec))) -.triggering(AfterWatermark.pastEndOfWindow()) +.triggering(DefaultTrigger.of()) Review comment: (did nothing - the formatting is correct) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339740) Time Spent: 3h (was: 2h 50m) > Guard against unsafe triggers at construction time > --- > > Key: BEAM-3288 > URL: https://issues.apache.org/jira/browse/BEAM-3288 > Project: Beam > Issue Type: Task > Components: sdk-java-core, sdk-py-core >Reporter: Eugene Kirpichov >Assignee: Kenneth Knowles >Priority: Major > Time Spent: 3h > Remaining Estimate: 0h > > Current Beam trigger semantics are rather confusing and in some cases > extremely unsafe, especially if the pipeline includes multiple chained GBKs. > One example of that is https://issues.apache.org/jira/browse/BEAM-3169 . > There are multiple issues: > The API allows users to specify terminating top-level triggers (e.g. "trigger > a pane after receiving 1 element in the window, and that's it"), but > experience from user support shows that this is nearly always a mistake and > the user did not intend to drop all further data. > In general, triggers are the only place in Beam where data is being dropped > without making a lot of very loud noise about it - a practice for which the > PTransform style guide uses the language: "never, ever, ever do this". > Continuation triggers are still worse. For context: continuation trigger is > the trigger that's set on the output of a GBK and controls further > aggregation of the results of this aggregation by downstream GBKs. The output > shouldn't just use the same trigger as the input, because e.g. if the input > trigger said "wait for an hour before emitting a pane", that doesn't mean > that we should wait for another hour before emitting a result of aggregating > the result of the input trigger. Continuation triggers try to simulate the > behavior "as if a pane of the input propagated through the entire pipeline", > but the implementation of individual continuation triggers doesn't do that. > E.g. the continuation of "first N elements in pane" trigger is "first 1 > element in pane", and if the results of a first GBK are further grouped by a > second GBK onto a coarser key (e.g. if everything is grouped onto the same > key), that effectively means that, of the keys of the first GBK, only one > survives and all others are dropped (what happened in the data loss bug). > The ultimate fix to all of these things is > https://s.apache.org/beam-sink-triggers . 
However, it is a huge model change, > and meanwhile we have to do something. The options are, in order of > increasing backward incompatibility (but incompatibility in a "rejecting > something that previously was accepted but extremely dangerous" kind of way): > - Make the continuation trigger of most triggers be the "always-fire" > trigger. Seems that this should be the case for all triggers except the > watermark trigger. This will definitely increase safety, but lead to more > eager firing of downstream aggregations. It also will violate a user's > expectation that a fire-once trigger fires everything downstream only once, > but that expectation appears impossible to satisfy safely. > - Make the continuation trigger of some triggers be the "invalid" trigger, > i.e. require the user to set it explicitly: there's in general no good and > safe way to infer what a trigger on a second GBK "truly" should be, based on > the trigger of the PCollection input into a first GBK. This is especially > true for terminating triggers. > - Prohibit top-level terminating triggers entirely. This will ensure that the > only data that ever gets dropped is "droppably late" data. > CC: [~bchambers] [~kenn] [~tgroh] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-3288) Guard against unsafe triggers at construction time
[ https://issues.apache.org/jira/browse/BEAM-3288?focusedWorklogId=339738&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339738 ] ASF GitHub Bot logged work on BEAM-3288: Author: ASF GitHub Bot Created on: 07/Nov/19 03:41 Start Date: 07/Nov/19 03:41 Worklog Time Spent: 10m Work Description: kennknowles commented on issue #9960: [BEAM-3288] Guard against unsafe triggers at construction time URL: https://github.com/apache/beam/pull/9960#issuecomment-550653454 Curiously, `./gradlew -p sdks/java/core test` does not repro the failure locally. I have noticed some inconsistencies between `./gradlew -p ` and `./gradlew :`. I thought these were the same. So we probably have some issues in our gradle configs. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339738) Time Spent: 2h 50m (was: 2h 40m) > Guard against unsafe triggers at construction time > --- > > Key: BEAM-3288 > URL: https://issues.apache.org/jira/browse/BEAM-3288 > Project: Beam > Issue Type: Task > Components: sdk-java-core, sdk-py-core >Reporter: Eugene Kirpichov >Assignee: Kenneth Knowles >Priority: Major > Time Spent: 2h 50m > Remaining Estimate: 0h > > Current Beam trigger semantics are rather confusing and in some cases > extremely unsafe, especially if the pipeline includes multiple chained GBKs. > One example of that is https://issues.apache.org/jira/browse/BEAM-3169 . > There are multiple issues: > The API allows users to specify terminating top-level triggers (e.g. "trigger > a pane after receiving 1 element in the window, and that's it"), but > experience from user support shows that this is nearly always a mistake and > the user did not intend to drop all further data. > In general, triggers are the only place in Beam where data is being dropped > without making a lot of very loud noise about it - a practice for which the > PTransform style guide uses the language: "never, ever, ever do this". > Continuation triggers are still worse. For context: continuation trigger is > the trigger that's set on the output of a GBK and controls further > aggregation of the results of this aggregation by downstream GBKs. The output > shouldn't just use the same trigger as the input, because e.g. if the input > trigger said "wait for an hour before emitting a pane", that doesn't mean > that we should wait for another hour before emitting a result of aggregating > the result of the input trigger. Continuation triggers try to simulate the > behavior "as if a pane of the input propagated through the entire pipeline", > but the implementation of individual continuation triggers doesn't do that. > E.g. the continuation of "first N elements in pane" trigger is "first 1 > element in pane", and if the results of a first GBK are further grouped by a > second GBK onto a coarser key (e.g. if everything is grouped onto the same > key), that effectively means that, of the keys of the first GBK, only one > survives and all others are dropped (what happened in the data loss bug). > The ultimate fix to all of these things is > https://s.apache.org/beam-sink-triggers . However, it is a huge model change, > and meanwhile we have to do something. 
The options are, in order of > increasing backward incompatibility (but incompatibility in a "rejecting > something that previously was accepted but extremely dangerous" kind of way): > - Make the continuation trigger of most triggers be the "always-fire" > trigger. Seems that this should be the case for all triggers except the > watermark trigger. This will definitely increase safety, but lead to more > eager firing of downstream aggregations. It also will violate a user's > expectation that a fire-once trigger fires everything downstream only once, > but that expectation appears impossible to satisfy safely. > - Make the continuation trigger of some triggers be the "invalid" trigger, > i.e. require the user to set it explicitly: there's in general no good and > safe way to infer what a trigger on a second GBK "truly" should be, based on > the trigger of the PCollection input into a first GBK. This is especially > true for terminating triggers. > - Prohibit top-level terminating triggers entirely. This will ensure that the > only data that ever gets dropped is "droppably late" data. > CC: [~bchambers] [~kenn] [~tgroh] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-3288) Guard against unsafe triggers at construction time
[ https://issues.apache.org/jira/browse/BEAM-3288?focusedWorklogId=339733&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339733 ] ASF GitHub Bot logged work on BEAM-3288: Author: ASF GitHub Bot Created on: 07/Nov/19 03:27 Start Date: 07/Nov/19 03:27 Worklog Time Spent: 10m Work Description: kennknowles commented on pull request #9960: [BEAM-3288] Guard against unsafe triggers at construction time URL: https://github.com/apache/beam/pull/9960#discussion_r343462469 ## File path: sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupByKey.java ## @@ -162,6 +164,45 @@ public static void applicableTo(PCollection<?> input) { throw new IllegalStateException( "GroupByKey must have a valid Window merge function. " + "Invalid because: " + cause); } + +// Validate that the trigger does not finish before garbage collection time +if (!triggerIsSafe(windowingStrategy)) { + throw new IllegalArgumentException( + String.format( + "Unsafe trigger may lose data, see" + + " https://s.apache.org/finishing-triggers-drop-data: %s", + windowingStrategy.getTrigger())); +} + } + + // Note that Never trigger finishes *at* GC time so it is OK, and + // AfterWatermark.fromEndOfWindow() finishes at end-of-window time so it is + // OK if there is no allowed lateness. + private static boolean triggerIsSafe(WindowingStrategy<?, ?> windowingStrategy) { +if (!windowingStrategy.getTrigger().mayFinish()) { Review comment: I agree exactly with all of your analysis. This pull request will forbid triggers of type a). The meaning of `Trigger.mayFinish()` is to answer this question "is this a trigger that might finish?" If the answer to that question is "yes" then it might be a type a) trigger. It depends on allowed lateness. There are a couple of very special triggers where the answer "yes" is OK as long as allowed lateness is zero. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339733) Time Spent: 2.5h (was: 2h 20m) > Guard against unsafe triggers at construction time > --- > > Key: BEAM-3288 > URL: https://issues.apache.org/jira/browse/BEAM-3288 > Project: Beam > Issue Type: Task > Components: sdk-java-core, sdk-py-core >Reporter: Eugene Kirpichov >Assignee: Kenneth Knowles >Priority: Major > Time Spent: 2.5h > Remaining Estimate: 0h > > Current Beam trigger semantics are rather confusing and in some cases > extremely unsafe, especially if the pipeline includes multiple chained GBKs. > One example of that is https://issues.apache.org/jira/browse/BEAM-3169 . > There are multiple issues: > The API allows users to specify terminating top-level triggers (e.g. "trigger > a pane after receiving 1 element in the window, and that's it"), but > experience from user support shows that this is nearly always a mistake and > the user did not intend to drop all further data. > In general, triggers are the only place in Beam where data is being dropped > without making a lot of very loud noise about it - a practice for which the > PTransform style guide uses the language: "never, ever, ever do this". > Continuation triggers are still worse. For context: continuation trigger is > the trigger that's set on the output of a GBK and controls further > aggregation of the results of this aggregation by downstream GBKs. 
The output > shouldn't just use the same trigger as the input, because e.g. if the input > trigger said "wait for an hour before emitting a pane", that doesn't mean > that we should wait for another hour before emitting a result of aggregating > the result of the input trigger. Continuation triggers try to simulate the > behavior "as if a pane of the input propagated through the entire pipeline", > but the implementation of individual continuation triggers doesn't do that. > E.g. the continuation of "first N elements in pane" trigger is "first 1 > element in pane", and if the results of a first GBK are further grouped by a > second GBK onto a coarser key (e.g. if everything is grouped onto the same > key), that effectively means that, of the keys of the first GBK, only one > survives and all others are dropped (what happened in the data loss bug). > The ultimate fix to all of these things is > https://s.apache.org/beam-sink-triggers . However, it is a huge model change, > and meanwhile we have to do something. The options are, in order of > increasing backward incompatibility (but incompatibility in a "rejecting > something that previously was accepted but extremely dangerous" kind of way): > - Make the continuation trigger of most triggers be the "always-fire" > trigger. Seems that this should be the case for all triggers except the > watermark trigger. This will definitely increase safety, but lead to more > eager firing of downstream aggregations. It also will violate a user's > expectation that a fire-once trigger fires everything downstream only once, > but that expectation appears impossible to satisfy safely. > - Make the continuation trigger of some triggers be the "invalid" trigger, > i.e. require the user to set it explicitly: there's in general no good and > safe way to infer what a trigger on a second GBK "truly" should be, based on > the trigger of the PCollection input into a first GBK. This is especially > true for terminating triggers. > - Prohibit top-level terminating triggers entirely. This will ensure that the > only data that ever gets dropped is "droppably late" data. > CC: [~bchambers] [~kenn] [~tgroh] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-3288) Guard against unsafe triggers at construction time
[ https://issues.apache.org/jira/browse/BEAM-3288?focusedWorklogId=339734&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339734 ] ASF GitHub Bot logged work on BEAM-3288: Author: ASF GitHub Bot Created on: 07/Nov/19 03:27 Start Date: 07/Nov/19 03:27 Worklog Time Spent: 10m Work Description: kennknowles commented on pull request #9960: [BEAM-3288] Guard against unsafe triggers at construction time URL: https://github.com/apache/beam/pull/9960#discussion_r343462578 ## File path: sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query10.java ## @@ -314,7 +315,7 @@ public void processElement(ProcessContext c, BoundedWindow window) name + ".WindowLogFiles", Window.>into( FixedWindows.of(Duration.standardSeconds(configuration.windowSizeSec))) -.triggering(AfterWatermark.pastEndOfWindow()) +.triggering(DefaultTrigger.of()) Review comment: I will run spotlessApply. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339734) Time Spent: 2h 40m (was: 2.5h) > Guard against unsafe triggers at construction time > --- > > Key: BEAM-3288 > URL: https://issues.apache.org/jira/browse/BEAM-3288 > Project: Beam > Issue Type: Task > Components: sdk-java-core, sdk-py-core >Reporter: Eugene Kirpichov >Assignee: Kenneth Knowles >Priority: Major > Time Spent: 2h 40m > Remaining Estimate: 0h > > Current Beam trigger semantics are rather confusing and in some cases > extremely unsafe, especially if the pipeline includes multiple chained GBKs. > One example of that is https://issues.apache.org/jira/browse/BEAM-3169 . > There are multiple issues: > The API allows users to specify terminating top-level triggers (e.g. "trigger > a pane after receiving 1 element in the window, and that's it"), but > experience from user support shows that this is nearly always a mistake and > the user did not intend to drop all further data. > In general, triggers are the only place in Beam where data is being dropped > without making a lot of very loud noise about it - a practice for which the > PTransform style guide uses the language: "never, ever, ever do this". > Continuation triggers are still worse. For context: continuation trigger is > the trigger that's set on the output of a GBK and controls further > aggregation of the results of this aggregation by downstream GBKs. The output > shouldn't just use the same trigger as the input, because e.g. if the input > trigger said "wait for an hour before emitting a pane", that doesn't mean > that we should wait for another hour before emitting a result of aggregating > the result of the input trigger. Continuation triggers try to simulate the > behavior "as if a pane of the input propagated through the entire pipeline", > but the implementation of individual continuation triggers doesn't do that. > E.g. the continuation of "first N elements in pane" trigger is "first 1 > element in pane", and if the results of a first GBK are further grouped by a > second GBK onto a coarser key (e.g. if everything is grouped onto the same > key), that effectively means that, of the keys of the first GBK, only one > survives and all others are dropped (what happened in the data loss bug). > The ultimate fix to all of these things is > https://s.apache.org/beam-sink-triggers . 
However, it is a huge model change, > and meanwhile we have to do something. The options are, in order of > increasing backward incompatibility (but incompatibility in a "rejecting > something that previously was accepted but extremely dangerous" kind of way): > - Make the continuation trigger of most triggers be the "always-fire" > trigger. Seems that this should be the case for all triggers except the > watermark trigger. This will definitely increase safety, but lead to more > eager firing of downstream aggregations. It also will violate a user's > expectation that a fire-once trigger fires everything downstream only once, > but that expectation appears impossible to satisfy safely. > - Make the continuation trigger of some triggers be the "invalid" trigger, > i.e. require the user to set it explicitly: there's in general no good and > safe way to infer what a trigger on a second GBK "truly" should be, based on > the trigger of the PCollection input into a first GBK. This is especially > true for terminating triggers. > - Prohibit top-level terminating triggers entirely. This will ensure that the > only data that ever gets dropped is "droppably late" data. > CC: [~bchambers] [~kenn] [~tgroh] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-2879) Implement and use an Avro coder rather than the JSON one for intermediary files to be loaded in BigQuery
[ https://issues.apache.org/jira/browse/BEAM-2879?focusedWorklogId=339712&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339712 ] ASF GitHub Bot logged work on BEAM-2879: Author: ASF GitHub Bot Created on: 07/Nov/19 02:45 Start Date: 07/Nov/19 02:45 Worklog Time Spent: 10m Work Description: steveniemitz commented on issue #9665: [BEAM-2879] Support writing data to BigQuery via avro URL: https://github.com/apache/beam/pull/9665#issuecomment-550595649 > @chamikaramj are there up-to-date docs anywhere on how to run the BigQueryIOIT tests? The example args in the javadoc are super out of date. Never mind, I hacked it up enough to get it to run. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339712) Time Spent: 3h 10m (was: 3h) > Implement and use an Avro coder rather than the JSON one for intermediary > files to be loaded in BigQuery > > > Key: BEAM-2879 > URL: https://issues.apache.org/jira/browse/BEAM-2879 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Black Phoenix >Assignee: Steve Niemitz >Priority: Minor > Labels: starter > Time Spent: 3h 10m > Remaining Estimate: 0h > > Before being loaded in BigQuery, temporary files are created and encoded in > JSON, which is a costly solution compared to an Avro alternative. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8382) Add polling interval to KinesisIO.Read
[ https://issues.apache.org/jira/browse/BEAM-8382?focusedWorklogId=339698&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339698 ] ASF GitHub Bot logged work on BEAM-8382: Author: ASF GitHub Bot Created on: 07/Nov/19 02:14 Start Date: 07/Nov/19 02:14 Worklog Time Spent: 10m Work Description: jfarr commented on pull request #9765: [WIP][BEAM-8382] Add rate limit policy to KinesisIO.Read URL: https://github.com/apache/beam/pull/9765#discussion_r343429384 ## File path: sdks/java/io/kinesis/src/main/java/org/apache/beam/sdk/io/kinesis/ShardReadersPool.java ## @@ -149,10 +157,20 @@ private void readLoop(ShardRecordsIterator shardRecordsIterator) { recordsQueue.put(kinesisRecord); numberOfRecordsInAQueueByShard.get(kinesisRecord.getShardId()).incrementAndGet(); } +rateLimiter.onSuccess(kinesisRecords); + } catch (KinesisClientThrottledException e) { +try { + rateLimiter.onThrottle(e); +} catch (InterruptedException ex) { + LOG.warn("Thread was interrupted, finishing the read loop", ex); + Thread.currentThread().interrupt(); Review comment: Thread.interrupt() does not interrupt the current thread; it sets the thread's interrupt status. The short answer is that catching InterruptedException clears the interrupt status, so in order to handle it properly you should always either rethrow the exception or reset the thread's interrupt status so any code higher up the call stack can still detect the interruption. There is a good article by Brian Goetz that goes into the details: https://www.ibm.com/developerworks/java/library/j-jtp05236/index.html. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339698) Time Spent: 9.5h (was: 9h 20m) > Add polling interval to KinesisIO.Read > -- > > Key: BEAM-8382 > URL: https://issues.apache.org/jira/browse/BEAM-8382 > Project: Beam > Issue Type: Improvement > Components: io-java-kinesis >Affects Versions: 2.13.0, 2.14.0, 2.15.0 >Reporter: Jonothan Farr >Assignee: Jonothan Farr >Priority: Major > Time Spent: 9.5h > Remaining Estimate: 0h > > With the current implementation we are observing Kinesis throttling due to > ReadProvisionedThroughputExceeded on the order of hundreds of times per > second, regardless of the actual Kinesis throughput. This is because the > ShardReadersPool readLoop() method is polling getRecords() as fast as > possible. > From the KDS documentation: > {quote}Each shard can support up to five read transactions per second. > {quote} > and > {quote}For best results, sleep for at least 1 second (1,000 milliseconds) > between calls to getRecords to avoid exceeding the limit on getRecords > frequency. > {quote} > [https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html] > [https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html] -- This message was sent by Atlassian Jira (v8.3.4#803005)
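The pattern described in that review comment, as a small self-contained sketch (doOnePoll is a hypothetical blocking call standing in for the Kinesis read):
{code:java}
// Catching InterruptedException clears the thread's interrupt status, so a
// method that cannot rethrow it should restore the status before returning;
// callers higher up the stack can then still observe the interruption.
Runnable readLoop =
    () -> {
      while (!Thread.currentThread().isInterrupted()) {
        try {
          doOnePoll(); // hypothetical blocking work, e.g. a getRecords() call
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // restore the cleared status
          return; // then finish the loop promptly
        }
      }
    };
{code}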
[jira] [Work logged] (BEAM-2879) Implement and use an Avro coder rather than the JSON one for intermediary files to be loaded in BigQuery
[ https://issues.apache.org/jira/browse/BEAM-2879?focusedWorklogId=339693&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339693 ] ASF GitHub Bot logged work on BEAM-2879: Author: ASF GitHub Bot Created on: 07/Nov/19 02:05 Start Date: 07/Nov/19 02:05 Worklog Time Spent: 10m Work Description: steveniemitz commented on issue #9665: [BEAM-2879] Support writing data to BigQuery via avro URL: https://github.com/apache/beam/pull/9665#issuecomment-550586372 @chamikaramj are there up-to-date docs anywhere on how to run the BigQueryIOIT tests? The example args in the javadoc are super out of date. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339693) Time Spent: 3h (was: 2h 50m) > Implement and use an Avro coder rather than the JSON one for intermediary > files to be loaded in BigQuery > > > Key: BEAM-2879 > URL: https://issues.apache.org/jira/browse/BEAM-2879 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Black Phoenix >Assignee: Steve Niemitz >Priority: Minor > Labels: starter > Time Spent: 3h > Remaining Estimate: 0h > > Before being loaded in BigQuery, temporary files are created and encoded in > JSON, which is a costly solution compared to an Avro alternative. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-3288) Guard against unsafe triggers at construction time
[ https://issues.apache.org/jira/browse/BEAM-3288?focusedWorklogId=339692&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339692 ] ASF GitHub Bot logged work on BEAM-3288: Author: ASF GitHub Bot Created on: 07/Nov/19 02:04 Start Date: 07/Nov/19 02:04 Worklog Time Spent: 10m Work Description: kennknowles commented on pull request #9960: [BEAM-3288] Guard against unsafe triggers at construction time URL: https://github.com/apache/beam/pull/9960#discussion_r343423684 ## File path: sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupByKey.java ## @@ -162,6 +164,45 @@ public static void applicableTo(PCollection<?> input) { throw new IllegalStateException( "GroupByKey must have a valid Window merge function. " + "Invalid because: " + cause); } + +// Validate that the trigger does not finish before garbage collection time +if (!triggerIsSafe(windowingStrategy)) { Review comment: Combine is not a primitive operation. If you look at Combine.java you can see that it expands to include a GroupByKey. When a runner sees it in a pipeline, it will usually replace the operation. But the validation happens on the raw graph. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339692) Time Spent: 2h 20m (was: 2h 10m) > Guard against unsafe triggers at construction time > --- > > Key: BEAM-3288 > URL: https://issues.apache.org/jira/browse/BEAM-3288 > Project: Beam > Issue Type: Task > Components: sdk-java-core, sdk-py-core >Reporter: Eugene Kirpichov >Assignee: Kenneth Knowles >Priority: Major > Time Spent: 2h 20m > Remaining Estimate: 0h > > Current Beam trigger semantics are rather confusing and in some cases > extremely unsafe, especially if the pipeline includes multiple chained GBKs. > One example of that is https://issues.apache.org/jira/browse/BEAM-3169 . > There are multiple issues: > The API allows users to specify terminating top-level triggers (e.g. "trigger > a pane after receiving 1 element in the window, and that's it"), but > experience from user support shows that this is nearly always a mistake and > the user did not intend to drop all further data. > In general, triggers are the only place in Beam where data is being dropped > without making a lot of very loud noise about it - a practice for which the > PTransform style guide uses the language: "never, ever, ever do this". > Continuation triggers are still worse. For context: continuation trigger is > the trigger that's set on the output of a GBK and controls further > aggregation of the results of this aggregation by downstream GBKs. The output > shouldn't just use the same trigger as the input, because e.g. if the input > trigger said "wait for an hour before emitting a pane", that doesn't mean > that we should wait for another hour before emitting a result of aggregating > the result of the input trigger. Continuation triggers try to simulate the > behavior "as if a pane of the input propagated through the entire pipeline", > but the implementation of individual continuation triggers doesn't do that. > E.g. the continuation of "first N elements in pane" trigger is "first 1 > element in pane", and if the results of a first GBK are further grouped by a > second GBK onto a coarser key (e.g. 
if everything is grouped onto the same > key), that effectively means that, of the keys of the first GBK, only one > survives and all others are dropped (what happened in the data loss bug). > The ultimate fix to all of these things is > https://s.apache.org/beam-sink-triggers . However, it is a huge model change, > and meanwhile we have to do something. The options are, in order of > increasing backward incompatibility (but incompatibility in a "rejecting > something that previously was accepted but extremely dangerous" kind of way): > - Make the continuation trigger of most triggers be the "always-fire" > trigger. Seems that this should be the case for all triggers except the > watermark trigger. This will definitely increase safety, but lead to more > eager firing of downstream aggregations. It also will violate a user's > expectation that a fire-once trigger fires everything downstream only once, > but that expectation appears impossible to satisfy safely. > - Make the continuation trigger of some triggers be the "invalid" trigger, > i.e. require the user to set it explicitly: there's in general no good and > safe way to infer what a trigger on a second GBK "truly" should be, based on > the trigger of the PCollection input into a first GBK. This is especially > true for terminating triggers. > - Prohibit top-level terminating triggers entirely. This will ensure that the > only data that ever gets dropped is "droppably late" data. > CC: [~bchambers] [~kenn] [~tgroh] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8382) Add polling interval to KinesisIO.Read
[ https://issues.apache.org/jira/browse/BEAM-8382?focusedWorklogId=339691&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339691 ] ASF GitHub Bot logged work on BEAM-8382: Author: ASF GitHub Bot Created on: 07/Nov/19 02:04 Start Date: 07/Nov/19 02:04 Worklog Time Spent: 10m Work Description: jfarr commented on pull request #9765: [WIP][BEAM-8382] Add rate limit policy to KinesisIO.Read URL: https://github.com/apache/beam/pull/9765#discussion_r343423203 ## File path: sdks/java/io/kinesis/src/main/java/org/apache/beam/sdk/io/kinesis/RateLimitPolicyFactory.java ## @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.beam.sdk.io.kinesis; + +import java.io.IOException; +import java.io.Serializable; +import java.util.List; +import org.apache.beam.sdk.util.BackOff; +import org.apache.beam.sdk.util.FluentBackoff; +import org.joda.time.Duration; + +/** + * Implement this interface to create a {@code RateLimitPolicy}. Used to create a rate limiter for + * each shard. 
+ */ +public interface RateLimitPolicyFactory extends Serializable { + + RateLimitPolicy getRateLimitPolicy(); + + static RateLimitPolicyFactory withoutLimiter() { +return () -> new RateLimitPolicy() {}; + } + + static RateLimitPolicyFactory withBackoff(FluentBackoff fluentBackoff) { +return () -> new BackoffRateLimiter(fluentBackoff); + } + + static RateLimitPolicyFactory withFixedDelay() { +return FixedDelayRateLimiter::new; + } + + static RateLimitPolicyFactory withFixedDelay(Duration delay) { +return () -> new FixedDelayRateLimiter(delay); + } + + class BackoffRateLimiter implements RateLimitPolicy { + +private final BackOff backoff; + +public BackoffRateLimiter(FluentBackoff fluentBackoff) { + this.backoff = + fluentBackoff + // never stop retrying + .withMaxRetries(Integer.MAX_VALUE) + .withMaxCumulativeBackoff(Duration.standardDays(1000)) + .backoff(); +} + +@Override +public void onThrottle(KinesisClientThrottledException t) throws InterruptedException { + try { +long backOffMillis = backoff.nextBackOffMillis(); +if (backOffMillis != BackOff.STOP) { + Thread.sleep(backOffMillis); +} + } catch (IOException e) { +// do nothing + } +} + +@Override +public void onSuccess(List<KinesisRecord> records) { + try { +backoff.reset(); + } catch (IOException e) { +// do nothing + } +} + } + + class FixedDelayRateLimiter implements RateLimitPolicy { + +private static final Duration DEFAULT_DELAY = Duration.standardSeconds(1); + +private final Duration delay; + +public FixedDelayRateLimiter() { + this(DEFAULT_DELAY); +} + +public FixedDelayRateLimiter(Duration delay) { + this.delay = delay; +} + +@Override +public void onSuccess(List<KinesisRecord> records) throws InterruptedException { Review comment: It's hard to anticipate what state a custom rate limit policy will want to keep, but since we have the list of records and it might want them, I just passed them in here. Same with passing the KinesisClientThrottledException in onThrottle(). Currently, no, there are no implementations using these. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339691) Time Spent: 9h 20m (was: 9h 10m) > Add polling interval to KinesisIO.Read > -- > > Key: BEAM-8382 > URL: https://issues.apache.org/jira/browse/BEAM-8382 > Project: Beam > Issue Type: Improvement > Components: io-java-kinesis >Affects Versions: 2.13.0, 2.14.0, 2.15.0 >Reporter: Jonothan Farr >Assignee: Jonothan Farr >Priority: Major > Time Spent: 9h 20m > Remaining Estimate: 0h > > With the current implementation we are observing Kinesis throttling due to > ReadProvisionedThroughputExceeded on the order of hundreds of times per > second, regardless of the actual Kinesis throughput. This is because the > ShardReadersPool readLoop() method is polling getRecords() as fast as > possible. > From the KDS documentation: > {quote}Each shard can support up to five read transactions per second. > {quote} > and > {quote}For best results, sleep for at least 1 second (1,000 milliseconds) > between calls to getRecords to avoid exceeding the limit on getRecords > frequency. > {quote} > [https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html] > [https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html] -- This message was sent by Atlassian Jira (v8.3.4#803005)
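For orientation, a usage sketch of how the proposed factory methods would be wired into a read, assuming the withFixedDelayRateLimitPolicy overload from this PR's diff lands as shown (stream name and delay are placeholders, and credentials setup via withAWSClientsProvider is elided):
{code:java}
PCollection<KinesisRecord> records =
    p.apply(
        KinesisIO.read()
            .withStreamName("my-stream") // placeholder
            .withInitialPositionInStream(InitialPositionInStream.LATEST)
            // proposed in this PR: sleep 1s between getRecords() calls per shard
            .withFixedDelayRateLimitPolicy(Duration.standardSeconds(1)));
{code}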
[jira] [Work logged] (BEAM-8557) Clean up useless null check.
[ https://issues.apache.org/jira/browse/BEAM-8557?focusedWorklogId=339690&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339690 ] ASF GitHub Bot logged work on BEAM-8557: Author: ASF GitHub Bot Created on: 07/Nov/19 02:03 Start Date: 07/Nov/19 02:03 Worklog Time Spent: 10m Work Description: sunjincheng121 commented on pull request #9991: [BEAM-8557] Cleanup the useless null check for ResponseStreamObserver… URL: https://github.com/apache/beam/pull/9991#discussion_r343423145 ## File path: runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/control/FnApiControlClient.java ## @@ -148,16 +148,14 @@ public void onNext(BeamFnApi.InstructionResponse response) { LOG.debug("Received InstructionResponse {}", response); CompletableFuture<BeamFnApi.InstructionResponse> responseFuture = outstandingRequests.remove(response.getInstructionId()); - if (responseFuture != null) { -if (response.getError().isEmpty()) { - responseFuture.complete(response); -} else { - responseFuture.completeExceptionally( - new RuntimeException( - String.format( - "Error received from SDK harness for instruction %s: %s", - response.getInstructionId(), response.getError(; -} + if (response.getError().isEmpty()) { Review comment: Totally agree that this may mean a bug in the SDK harness if it happens. I'm just thinking that a log message may not be enough, as it is easy to ignore. An exception makes sure the problem is discovered in time. But I'm fine if you think a log is enough, as users will investigate the log if there is a serious problem for them after all. So we only need to add a WARNING log for this case, right? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339690) Time Spent: 1h 40m (was: 1.5h) > Clean up useless null check. > > > Key: BEAM-8557 > URL: https://issues.apache.org/jira/browse/BEAM-8557 > Project: Beam > Issue Type: Sub-task > Components: runner-core, sdk-java-harness >Reporter: sunjincheng >Assignee: sunjincheng >Priority: Major > Time Spent: 1h 40m > Remaining Estimate: 0h > > I think we do not need the null check here: > [https://github.com/apache/beam/blob/master/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/control/FnApiControlClient.java#L151] > Before the `onNext` call, the `Future` has already been put into the queue > in the `handle` method. > > I found the test as follows: > {code:java} > @Test > public void testUnknownResponseIgnored() throws Exception{code} > I do not know why we need to test this case. I think it would be better if we > threw an exception for an unknown response; otherwise, this may hide a > potential bug. > Please correct me if there is anything I misunderstand, @kennknowles > -- This message was sent by Atlassian Jira (v8.3.4#803005)
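The warning-log variant being discussed, sketched against the diff above (simplified; the log message wording is an assumption):
{code:java}
CompletableFuture<BeamFnApi.InstructionResponse> responseFuture =
    outstandingRequests.remove(response.getInstructionId());
if (responseFuture == null) {
  // An unknown instruction id likely means a bug in the SDK harness;
  // surface it loudly instead of dropping the response silently.
  LOG.warn("Dropped response for unknown instruction {}", response.getInstructionId());
  return;
}
if (response.getError().isEmpty()) {
  responseFuture.complete(response);
} else {
  responseFuture.completeExceptionally(
      new RuntimeException(
          String.format(
              "Error received from SDK harness for instruction %s: %s",
              response.getInstructionId(), response.getError())));
}
{code}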
[jira] [Work logged] (BEAM-8382) Add polling interval to KinesisIO.Read
[ https://issues.apache.org/jira/browse/BEAM-8382?focusedWorklogId=339689&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339689 ] ASF GitHub Bot logged work on BEAM-8382: Author: ASF GitHub Bot Created on: 07/Nov/19 02:00 Start Date: 07/Nov/19 02:00 Worklog Time Spent: 10m Work Description: jfarr commented on issue #9765: [WIP][BEAM-8382] Add rate limit policy to KinesisIO.Read URL: https://github.com/apache/beam/pull/9765#issuecomment-550585285 > 2\. From a user perspective, what if users set all 3 or 4 withRateLimitXXX methods? Which one will override what, or do we apply all of them? Would it be helpful if we provide an enum? If you call one of these methods again, it will override the previous call. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339689) Time Spent: 9h 10m (was: 9h) > Add polling interval to KinesisIO.Read > -- > > Key: BEAM-8382 > URL: https://issues.apache.org/jira/browse/BEAM-8382 > Project: Beam > Issue Type: Improvement > Components: io-java-kinesis >Affects Versions: 2.13.0, 2.14.0, 2.15.0 >Reporter: Jonothan Farr >Assignee: Jonothan Farr >Priority: Major > Time Spent: 9h 10m > Remaining Estimate: 0h > > With the current implementation we are observing Kinesis throttling due to > ReadProvisionedThroughputExceeded on the order of hundreds of times per > second, regardless of the actual Kinesis throughput. This is because the > ShardReadersPool readLoop() method is polling getRecords() as fast as > possible. > From the KDS documentation: > {quote}Each shard can support up to five read transactions per second. > {quote} > and > {quote}For best results, sleep for at least 1 second (1,000 milliseconds) > between calls to getRecords to avoid exceeding the limit on getRecords > frequency. > {quote} > [https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html] > [https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8382) Add polling interval to KinesisIO.Read
[ https://issues.apache.org/jira/browse/BEAM-8382?focusedWorklogId=339688&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339688 ] ASF GitHub Bot logged work on BEAM-8382: Author: ASF GitHub Bot Created on: 07/Nov/19 01:56 Start Date: 07/Nov/19 01:56 Worklog Time Spent: 10m Work Description: jfarr commented on issue #9765: [WIP][BEAM-8382] Add rate limit policy to KinesisIO.Read URL: https://github.com/apache/beam/pull/9765#issuecomment-550584252 > 1. Have we tried to use Kinesis's RetryPolicy and BackoffStrategy? This looks like we are reengineering a feature that Kinesis's API already supports. Here is an example: https://www.programcreek.com/java-api-examples/?api=com.amazonaws.retry.RetryPolicy. If we already tried it, what were the reasons not to use Kinesis's? I'm not sure whether RetryPolicy and BackoffStrategy apply to LimitExceededException / ProvisionedThroughputExceededException, but I can look into that. If so, I think it makes sense to just configure this in the Kinesis client instead of having a BackoffRateLimitPolicy. What do you think @aromanenko-dev and @lukecwik ? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339688) Time Spent: 9h (was: 8h 50m) > Add polling interval to KinesisIO.Read > -- > > Key: BEAM-8382 > URL: https://issues.apache.org/jira/browse/BEAM-8382 > Project: Beam > Issue Type: Improvement > Components: io-java-kinesis >Affects Versions: 2.13.0, 2.14.0, 2.15.0 >Reporter: Jonothan Farr >Assignee: Jonothan Farr >Priority: Major > Time Spent: 9h > Remaining Estimate: 0h > > With the current implementation we are observing Kinesis throttling due to > ReadProvisionedThroughputExceeded on the order of hundreds of times per > second, regardless of the actual Kinesis throughput. This is because the > ShardReadersPool readLoop() method is polling getRecords() as fast as > possible. > From the KDS documentation: > {quote}Each shard can support up to five read transactions per second. > {quote} > and > {quote}For best results, sleep for at least 1 second (1,000 milliseconds) > between calls to getRecords to avoid exceeding the limit on getRecords > frequency. > {quote} > [https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html] > [https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html] -- This message was sent by Atlassian Jira (v8.3.4#803005)
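For reference, a hedged sketch of what configuring retries on the Kinesis client itself could look like with the AWS SDK v1 ClientConfiguration. Whether this path actually covers ProvisionedThroughputExceededException is exactly the open question in the comment above; the retry count here is illustrative, not a recommendation.

{code:java}
import com.amazonaws.ClientConfiguration;
import com.amazonaws.retry.PredefinedRetryPolicies;
import com.amazonaws.retry.RetryPolicy;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;

// Sketch only: lean on the AWS SDK's own retry policy instead of layering
// a BackoffRateLimitPolicy on top.
public class KinesisClientRetrySketch {
  public static AmazonKinesis buildClient() {
    RetryPolicy retryPolicy =
        new RetryPolicy(
            PredefinedRetryPolicies.DEFAULT_RETRY_CONDITION,
            PredefinedRetryPolicies.DEFAULT_BACKOFF_STRATEGY,
            10, // maxErrorRetry, illustrative
            true); // honor maxErrorRetry set in ClientConfiguration
    ClientConfiguration config = new ClientConfiguration().withRetryPolicy(retryPolicy);
    return AmazonKinesisClientBuilder.standard().withClientConfiguration(config).build();
  }
}
{code}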
[jira] [Work logged] (BEAM-8382) Add polling interval to KinesisIO.Read
[ https://issues.apache.org/jira/browse/BEAM-8382?focusedWorklogId=339687&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339687 ] ASF GitHub Bot logged work on BEAM-8382: Author: ASF GitHub Bot Created on: 07/Nov/19 01:56 Start Date: 07/Nov/19 01:56 Worklog Time Spent: 10m Work Description: jfarr commented on pull request #9765: [WIP][BEAM-8382] Add rate limit policy to KinesisIO.Read URL: https://github.com/apache/beam/pull/9765#discussion_r343418039
## File path: sdks/java/io/kinesis/src/main/java/org/apache/beam/sdk/io/kinesis/KinesisIO.java
## @@ -420,6 +426,47 @@ public Read withCustomWatermarkPolicy(WatermarkPolicyFactory watermarkPolicyFactory) {
   return toBuilder().setWatermarkPolicyFactory(watermarkPolicyFactory).build();
 }
+/**
+ * Specifies the rate limit policy as BackoffRateLimiter.
+ *
+ * @param fluentBackoff The {@code FluentBackoff} used to create the backoff policy.
+ */
+public Read withBackoffRateLimitPolicy(FluentBackoff fluentBackoff) {
+  checkArgument(fluentBackoff != null, "fluentBackoff cannot be null");
+  return toBuilder()
+      .setRateLimitPolicyFactory(RateLimitPolicyFactory.withBackoff(fluentBackoff))
+      .build();
+}
+
+/**
+ * Specifies the rate limit policy as FixedDelayRateLimiter with the default delay of 1 second.
+ */
+public Read withFixedDelayRateLimitPolicy() {
+  return toBuilder().setRateLimitPolicyFactory(RateLimitPolicyFactory.withFixedDelay()).build();
+}
+
+/**
+ * Specifies the rate limit policy as FixedDelayRateLimiter with the given delay.
+ *
+ * @param delay Denotes the fixed delay duration.
+ */
+public Read withFixedDelayRateLimitPolicy(Duration delay) {
Review comment: > 1. Have we tried to use Kinesis's RetryPolicy and BackoffStrategy? This looks like we are reengineering a feature that Kinesis's API already supports. Here is an example: https://www.programcreek.com/java-api-examples/?api=com.amazonaws.retry.RetryPolicy. If we already tried it, what were the reasons not to use Kinesis's? I'm not sure whether RetryPolicy and BackoffStrategy apply to LimitExceededException / ProvisionedThroughputExceededException, but I can look into that. If so, I think it makes sense to just configure this in the Kinesis client instead of having a BackoffRateLimitPolicy. What do you think @aromanenko-dev and @lukecwik ? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339687) Time Spent: 8h 50m (was: 8h 40m) > Add polling interval to KinesisIO.Read > -- > > Key: BEAM-8382 > URL: https://issues.apache.org/jira/browse/BEAM-8382 > Project: Beam > Issue Type: Improvement > Components: io-java-kinesis >Affects Versions: 2.13.0, 2.14.0, 2.15.0 >Reporter: Jonothan Farr >Assignee: Jonothan Farr >Priority: Major > Time Spent: 8h 50m > Remaining Estimate: 0h > > With the current implementation we are observing Kinesis throttling due to > ReadProvisionedThroughputExceeded on the order of hundreds of times per > second, regardless of the actual Kinesis throughput. This is because the > ShardReadersPool readLoop() method is polling getRecords() as fast as > possible. > From the KDS documentation: > {quote}Each shard can support up to five read transactions per second. > {quote} > and > {quote}For best results, sleep for at least 1 second (1,000 milliseconds) > between calls to getRecords to avoid exceeding the limit on getRecords > frequency. > {quote} > [https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html] > [https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html] -- This message was sent by Atlassian Jira (v8.3.4#803005)
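To make the proposed surface concrete, here is a usage sketch of the API shown in the hunk above, assuming the PR lands as written. The withFixedDelayRateLimitPolicy method comes from the diff under review, and "my-stream" is a placeholder stream name.

{code:java}
import org.apache.beam.sdk.io.kinesis.KinesisIO;
import org.joda.time.Duration;

public class KinesisRateLimitUsageSketch {
  public static KinesisIO.Read readWithFixedDelay() {
    return KinesisIO.read()
        .withStreamName("my-stream") // placeholder
        // Sleep a fixed 1 second between getRecords calls, matching the KDS
        // guidance quoted in the issue description.
        .withFixedDelayRateLimitPolicy(Duration.standardSeconds(1));
  }
}
{code}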
[jira] [Work logged] (BEAM-8382) Add polling interval to KinesisIO.Read
[ https://issues.apache.org/jira/browse/BEAM-8382?focusedWorklogId=339679&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339679 ] ASF GitHub Bot logged work on BEAM-8382: Author: ASF GitHub Bot Created on: 07/Nov/19 01:30 Start Date: 07/Nov/19 01:30 Worklog Time Spent: 10m Work Description: jfarr commented on pull request #9765: [WIP][BEAM-8382] Add rate limit policy to KinesisIO.Read URL: https://github.com/apache/beam/pull/9765#discussion_r343412074
## File path: sdks/java/io/kinesis/src/main/java/org/apache/beam/sdk/io/kinesis/KinesisIO.java
## @@ -420,6 +426,47 @@ public Read withCustomWatermarkPolicy(WatermarkPolicyFactory watermarkPolicyFactory) {
   return toBuilder().setWatermarkPolicyFactory(watermarkPolicyFactory).build();
 }
+/**
+ * Specifies the rate limit policy as BackoffRateLimiter.
+ *
+ * @param fluentBackoff The {@code FluentBackoff} used to create the backoff policy.
+ */
+public Read withBackoffRateLimitPolicy(FluentBackoff fluentBackoff) {
Review comment: I was considering exposing a wrapper object like SnsIO does. I didn't realize FluentBackoff was internal; I think that approach makes a lot of sense here too, then. What does it mean to give up retrying with an unbounded source, though? What should we do in that case? Throw an unchecked exception and let the pipeline crash? That's why I explicitly overrode max retries and max cumulative backoff: I didn't think giving up made sense. But I can change that if you want. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339679) Time Spent: 8.5h (was: 8h 20m) > Add polling interval to KinesisIO.Read > -- > > Key: BEAM-8382 > URL: https://issues.apache.org/jira/browse/BEAM-8382 > Project: Beam > Issue Type: Improvement > Components: io-java-kinesis >Affects Versions: 2.13.0, 2.14.0, 2.15.0 >Reporter: Jonothan Farr >Assignee: Jonothan Farr >Priority: Major > Time Spent: 8.5h > Remaining Estimate: 0h > > With the current implementation we are observing Kinesis throttling due to > ReadProvisionedThroughputExceeded on the order of hundreds of times per > second, regardless of the actual Kinesis throughput. This is because the > ShardReadersPool readLoop() method is polling getRecords() as fast as > possible. > From the KDS documentation: > {quote}Each shard can support up to five read transactions per second. > {quote} > and > {quote}For best results, sleep for at least 1 second (1,000 milliseconds) > between calls to getRecords to avoid exceeding the limit on getRecords > frequency. > {quote} > [https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html] > [https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html] -- This message was sent by Atlassian Jira (v8.3.4#803005)
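On the "what does giving up mean for an unbounded source" question above, one option is a backoff that caps the per-attempt delay but never exhausts its retries. A sketch under that assumption follows; note FluentBackoff is a Beam-internal utility (org.apache.beam.sdk.util), which is the reviewer's point about not exposing it directly.

{code:java}
import org.apache.beam.sdk.util.FluentBackoff;
import org.joda.time.Duration;

public class EndlessBackoffSketch {
  // Caps the per-attempt delay at 30s but effectively never gives up,
  // since "giving up" has no sensible meaning for an unbounded source.
  static final FluentBackoff ENDLESS =
      FluentBackoff.DEFAULT
          .withInitialBackoff(Duration.standardSeconds(1))
          .withMaxBackoff(Duration.standardSeconds(30))
          .withMaxRetries(Integer.MAX_VALUE);
}
{code}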
[jira] [Work logged] (BEAM-8382) Add polling interval to KinesisIO.Read
[ https://issues.apache.org/jira/browse/BEAM-8382?focusedWorklogId=339675&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339675 ] ASF GitHub Bot logged work on BEAM-8382: Author: ASF GitHub Bot Created on: 07/Nov/19 01:23 Start Date: 07/Nov/19 01:23 Worklog Time Spent: 10m Work Description: jfarr commented on pull request #9765: [WIP][BEAM-8382] Add rate limit policy to KinesisIO.Read URL: https://github.com/apache/beam/pull/9765#discussion_r343410781
## File path: sdks/java/io/kinesis/src/main/java/org/apache/beam/sdk/io/kinesis/ShardReadersPool.java
## @@ -51,15 +51,13 @@
 /**
  * Executor service for running the threads that read records from shards handled by this pool.
- * Each thread runs the {@link ShardReadersPool#readLoop(ShardRecordsIterator)} method and handles
- * exactly one shard.
+ * Each thread runs the {@link ShardReadersPool#readLoop} method and handles exactly one shard.
Review comment: Sure, I'll change that. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339675) Time Spent: 8h 20m (was: 8h 10m) > Add polling interval to KinesisIO.Read > -- > > Key: BEAM-8382 > URL: https://issues.apache.org/jira/browse/BEAM-8382 > Project: Beam > Issue Type: Improvement > Components: io-java-kinesis >Affects Versions: 2.13.0, 2.14.0, 2.15.0 >Reporter: Jonothan Farr >Assignee: Jonothan Farr >Priority: Major > Time Spent: 8h 20m > Remaining Estimate: 0h > > With the current implementation we are observing Kinesis throttling due to > ReadProvisionedThroughputExceeded on the order of hundreds of times per > second, regardless of the actual Kinesis throughput. This is because the > ShardReadersPool readLoop() method is polling getRecords() as fast as > possible. > From the KDS documentation: > {quote}Each shard can support up to five read transactions per second. > {quote} > and > {quote}For best results, sleep for at least 1 second (1,000 milliseconds) > between calls to getRecords to avoid exceeding the limit on getRecords > frequency. > {quote} > [https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html] > [https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-2879) Implement and use an Avro coder rather than the JSON one for intermediary files to be loaded in BigQuery
[ https://issues.apache.org/jira/browse/BEAM-2879?focusedWorklogId=339673&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339673 ] ASF GitHub Bot logged work on BEAM-2879: Author: ASF GitHub Bot Created on: 07/Nov/19 01:22 Start Date: 07/Nov/19 01:22 Worklog Time Spent: 10m Work Description: steveniemitz commented on issue #9665: [BEAM-2879] Support writing data to BigQuery via avro URL: https://github.com/apache/beam/pull/9665#issuecomment-550576486 > Thanks. Looks great to me. > > To make sure I understood correctly, this will not be enabled for existing users by default, and to enable it users have to specify withAvroFormatFunction(), correct? Correct. With schemas I think we could enable this transparently, but for now it's opt-in only. > Also, can we add a version of BigQueryIOIT so that we can continue to monitor both Avro and JSON based BQ write transforms? Yeah, I can add that in there. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339673) Time Spent: 2h 50m (was: 2h 40m) > Implement and use an Avro coder rather than the JSON one for intermediary > files to be loaded in BigQuery > > > Key: BEAM-2879 > URL: https://issues.apache.org/jira/browse/BEAM-2879 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Black Phoenix >Assignee: Steve Niemitz >Priority: Minor > Labels: starter > Time Spent: 2h 50m > Remaining Estimate: 0h > > Before being loaded in BigQuery, temporary files are created and encoded in > JSON, which is a costly solution compared to an Avro alternative. -- This message was sent by Atlassian Jira (v8.3.4#803005)
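A hedged usage sketch of the opt-in discussed above: existing pipelines keep the JSON path unless they supply an Avro format function. The exact shape of the callback (an AvroWriteRequest carrying the element and its Avro schema) is taken from the PR under review, and MyRow plus the "name" field are placeholders.

{code:java}
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

public class AvroWriteSketch {
  // Hypothetical element type used only for illustration.
  static class MyRow {
    String name;
  }

  static BigQueryIO.Write<MyRow> write() {
    return BigQueryIO.<MyRow>write()
        .to("project:dataset.table") // placeholder table spec
        // Opt in to the Avro-encoded load files; without this call the
        // existing JSON encoding is used.
        .withAvroFormatFunction(
            request ->
                new GenericRecordBuilder(request.getSchema())
                    .set("name", request.getElement().name)
                    .build());
  }
}
{code}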
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=339672&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339672 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 07/Nov/19 01:22 Start Date: 07/Nov/19 01:22 Worklog Time Spent: 10m Work Description: liumomo315 commented on issue #9957: [BEAM-8575] Add validates runner tests for 1. Custom window fn: Test a customized window fn work as expected; 2. Windows idempotency: Applying the same window fn (or window fn + GBK) to the input multiple times will have the same effect as applying it once. URL: https://github.com/apache/beam/pull/9957#issuecomment-550576421 > Welcome to Beam, all non-trivial PRs are meant to be accompanied by a JIRA in the title as per the [contribution guide](https://beam.apache.org/contribute/#share-your-intent). Thanks for the pointer. Added the JIRA :) > Also, please address the lint issue: > > ``` > Task :sdks:python:test-suites:tox:py37:lintPy37 > 15:45:38 * Module apache_beam.transforms.window_test > 15:45:38 apache_beam/transforms/window_test.py:283:48: E0110: Abstract class 'TestCustomWindows' with abstract methods instantiated (abstract-class-instantiated) > ``` I am done fully implementing it, but I have a question in another thread about whether I could make it simpler without changing other files. Also cc'ed you there. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339672) Time Spent: 40m (was: 0.5h) > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=339668&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339668 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 07/Nov/19 01:17 Start Date: 07/Nov/19 01:17 Worklog Time Spent: 10m Work Description: liumomo315 commented on pull request #9957: [BEAM-8575] Add validates runner tests for 1. Custom window fn: Test a customized window fn work as expected; 2. Windows idempotency: Applying the same window fn (or window fn + GBK) to the input multiple times will have the same effect as applying it once. URL: https://github.com/apache/beam/pull/9957#discussion_r343409486
## File path: sdks/python/apache_beam/transforms/window_test.py
## @@ -65,6 +76,44 @@ def process(self, element, window=core.DoFn.WindowParam):
 reify_windows = core.ParDo(ReifyWindowsFn())
+class TestCustomWindows(NonMergingWindowFn):
+  """A custom non merging window fn which assigns elements into interval windows
+  based on the element timestamps.
+  """
+
+  def __init__(self, first_window_end, second_window_end):
+    self.first_window_end = Timestamp.of(first_window_end)
+    self.second_window_end = Timestamp.of(second_window_end)
+
+  def assign(self, context):
+    timestamp = context.timestamp
+    if timestamp < self.first_window_end:
+      return [IntervalWindow(0, self.first_window_end)]
+    elif timestamp < self.second_window_end:
+      return [IntervalWindow(self.first_window_end, self.second_window_end)]
+    else:
+      return [IntervalWindow(self.second_window_end, timestamp)]
+
+  def get_window_coder(self):
+    return IntervalWindowCoder()
+
+  def to_runner_api_parameter(self, context):
Review comment: This custom window fn is added only for a test, to test a non-standard window fn. To fully implement it, I need to also implement to_runner_api_parameter and from_runner_api_parameter. Those require me to also change the standard_window_fns.proto and common_urns.py. Are these changes necessary? If not, is there a better way to do this? Thanks! @lukecwik @robertwb This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339668) Time Spent: 0.5h (was: 20m) > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8561) Add ThriftIO to Support IO for Thrift Files
[ https://issues.apache.org/jira/browse/BEAM-8561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968834#comment-16968834 ] Eugene Kirpichov commented on BEAM-8561: Please above all make sure that this connector provides APIs compatible with FileIO, i.e. ThriftIO.readFiles() and ThriftIO.sink(). Providing ThriftIO.read() and write() for basic use cases also makes sense, but there's no need to make them super customizable or to provide readAll(): all advanced use cases can be handled by the two methods above plus FileIO. > Add ThriftIO to Support IO for Thrift Files > --- > > Key: BEAM-8561 > URL: https://issues.apache.org/jira/browse/BEAM-8561 > Project: Beam > Issue Type: New Feature > Components: io-java-files >Reporter: Chris Larsen >Assignee: Chris Larsen >Priority: Minor > > Similar to AvroIO, it would be very useful to support reading and writing > to/from Thrift files with a native connector. > Functionality would include: > # read() - Reading from one or more Thrift files. > # write() - Writing to one or more Thrift files. -- This message was sent by Atlassian Jira (v8.3.4#803005)
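To illustrate why a readFiles()-style entry point keeps the simple read() small: advanced matching is delegated to FileIO, as in the hypothetical composition below. ThriftIO.readFiles() is the proposed, not-yet-merged API, and MyThriftRecord plus the file pattern are placeholders.

{code:java}
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ThriftReadFilesSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // FileIO handles matching and file access; the proposed ThriftIO.readFiles()
    // only has to decode matched files into Thrift records.
    PCollection<MyThriftRecord> records =
        pipeline
            .apply(FileIO.match().filepattern("gs://bucket/path/*.thrift"))
            .apply(FileIO.readMatches())
            .apply(ThriftIO.readFiles(MyThriftRecord.class)); // hypothetical API
    pipeline.run();
  }
}
{code}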
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=339663&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339663 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 07/Nov/19 01:06 Start Date: 07/Nov/19 01:06 Worklog Time Spent: 10m Work Description: liumomo315 commented on pull request #9957: [BEAM-8575] Add validates runner tests for 1. Custom window fn: Test a customized window fn work as expected; 2. Windows idempotency: Applying the same window fn (or window fn + GBK) to the input multiple times will have the same effect as applying it once. URL: https://github.com/apache/beam/pull/9957#discussion_r343407270
## File path: sdks/python/apache_beam/transforms/window_test.py
## @@ -252,6 +276,50 @@ def test_timestamped_with_combiners(self):
   assert_that(mean_per_window, equal_to([(0, 2.0), (1, 7.0)]),
               label='assert:mean')
+
+  @attr('ValidatesRunner')
+  def test_custom_windows(self):
+    with TestPipeline() as p:
+      result = (p | Create([TimestampedValue(x, x * 100) for x in range(7)])
+                | 'custom window' >> WindowInto(TestCustomWindows(250, 200))
+                | 'insert key' >> Map(lambda v: ('key', v.value))
+                | GroupByKey())
+
+      assert_that(result, equal_to([('key', [0, 1, 2]),
+                                    ('key', [3, 4]),
+                                    ('key', [5]),
+                                    ('key', [6])]))
+
+  @attr('ValidatesRunner')
+  def test_windows_idempotency(self):
+    with TestPipeline() as p:
+      result = (p | Create([(x, x * 2) for x in range(5)])
+                | Map(lambda item: TimestampedValue(item[0], item[1]))
+                | Map(lambda v: ('key', v))
+                | 'window' >> WindowInto(FixedWindows(4))
+                | 'same window' >> WindowInto(FixedWindows(4))
+                | 'same window again' >> WindowInto(FixedWindows(4))
+                | GroupByKey())
+
+      assert_that(result, equal_to([('key', [0, 1]),
+                                    ('key', [2, 3]),
+                                    ('key', [4])]))
+
+  @attr('ValidatesRunner')
+  def test_windows_gbk_idempotency(self):
+    with TestPipeline() as p:
+      result = (p | Create([(x, x * 2) for x in range(5)])
Review comment: Same for this one :) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339663) Time Spent: 20m (was: 10m) > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=339662&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339662 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 07/Nov/19 01:06 Start Date: 07/Nov/19 01:06 Worklog Time Spent: 10m Work Description: liumomo315 commented on pull request #9957: [BEAM-8575] Add validates runner tests for 1. Custom window fn: Test a customized window fn work as expected; 2. Windows idempotency: Applying the same window fn (or window fn + GBK) to the input multiple times will have the same effect as applying it once. URL: https://github.com/apache/beam/pull/9957#discussion_r343407158
## File path: sdks/python/apache_beam/transforms/window_test.py
## @@ -252,6 +276,50 @@ def test_timestamped_with_combiners(self):
   assert_that(mean_per_window, equal_to([(0, 2.0), (1, 7.0)]),
               label='assert:mean')
+
+  @attr('ValidatesRunner')
+  def test_custom_windows(self):
+    with TestPipeline() as p:
+      result = (p | Create([TimestampedValue(x, x * 100) for x in range(7)])
+                | 'custom window' >> WindowInto(TestCustomWindows(250, 200))
+                | 'insert key' >> Map(lambda v: ('key', v.value))
+                | GroupByKey())
+
+      assert_that(result, equal_to([('key', [0, 1, 2]),
+                                    ('key', [3, 4]),
+                                    ('key', [5]),
+                                    ('key', [6])]))
+
+  @attr('ValidatesRunner')
+  def test_windows_idempotency(self):
+    with TestPipeline() as p:
+      result = (p | Create([(x, x * 2) for x in range(5)])
Review comment: I tried that, but it failed. I dug into this more and found it is a known issue with Python timestamped values; see https://github.com/apache/beam/pull/8206. In short, for the downstream ptransforms to treat a timestamped element as an element with a timestamp instead of as a timestamped value object, a DoFn (it could be a simple Map(lambda x: x)) must be applied to the timestamped elements. (This should be fixed, but probably not in this PR.) I found this test class already has a helper function timestamped_key_values, so I reuse it in the new tests. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339662) Remaining Estimate: 0h Time Spent: 10m > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8335) Add streaming support to Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8335?focusedWorklogId=339653&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339653 ] ASF GitHub Bot logged work on BEAM-8335: Author: ASF GitHub Bot Created on: 07/Nov/19 00:55 Start Date: 07/Nov/19 00:55 Worklog Time Spent: 10m Work Description: rohdesamuel commented on issue #9720: [BEAM-8335] Add initial modules for interactive streaming support URL: https://github.com/apache/beam/pull/9720#issuecomment-550570058 Run Java PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339653) Time Spent: 19h 10m (was: 19h) > Add streaming support to Interactive Beam > - > > Key: BEAM-8335 > URL: https://issues.apache.org/jira/browse/BEAM-8335 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Sam Rohde >Assignee: Sam Rohde >Priority: Major > Time Spent: 19h 10m > Remaining Estimate: 0h > > This issue tracks the work items to introduce streaming support to the > Interactive Beam experience. This will allow users to: > * Write and run a streaming job in IPython > * Automatically cache records from unbounded sources > * Add a replay experience that replays all cached records to simulate the > original pipeline execution > * Add controls to play/pause/stop/step individual elements from the cached > records > * Add ability to inspect/visualize unbounded PCollections -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8335) Add streaming support to Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8335?focusedWorklogId=339655&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339655 ] ASF GitHub Bot logged work on BEAM-8335: Author: ASF GitHub Bot Created on: 07/Nov/19 00:55 Start Date: 07/Nov/19 00:55 Worklog Time Spent: 10m Work Description: rohdesamuel commented on issue #9720: [BEAM-8335] Add initial modules for interactive streaming support URL: https://github.com/apache/beam/pull/9720#issuecomment-550570123 Run Java_Examples_Dataflow PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339655) Time Spent: 19.5h (was: 19h 20m) > Add streaming support to Interactive Beam > - > > Key: BEAM-8335 > URL: https://issues.apache.org/jira/browse/BEAM-8335 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Sam Rohde >Assignee: Sam Rohde >Priority: Major > Time Spent: 19.5h > Remaining Estimate: 0h > > This issue tracks the work items to introduce streaming support to the > Interactive Beam experience. This will allow users to: > * Write and run a streaming job in IPython > * Automatically cache records from unbounded sources > * Add a replay experience that replays all cached records to simulate the > original pipeline execution > * Add controls to play/pause/stop/step individual elements from the cached > records > * Add ability to inspect/visualize unbounded PCollections -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8335) Add streaming support to Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8335?focusedWorklogId=339654&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339654 ] ASF GitHub Bot logged work on BEAM-8335: Author: ASF GitHub Bot Created on: 07/Nov/19 00:55 Start Date: 07/Nov/19 00:55 Worklog Time Spent: 10m Work Description: rohdesamuel commented on issue #9720: [BEAM-8335] Add initial modules for interactive streaming support URL: https://github.com/apache/beam/pull/9720#issuecomment-550570078 Run Python PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339654) Time Spent: 19h 20m (was: 19h 10m) > Add streaming support to Interactive Beam > - > > Key: BEAM-8335 > URL: https://issues.apache.org/jira/browse/BEAM-8335 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Sam Rohde >Assignee: Sam Rohde >Priority: Major > Time Spent: 19h 20m > Remaining Estimate: 0h > > This issue tracks the work items to introduce streaming support to the > Interactive Beam experience. This will allow users to: > * Write and run a streaming job in IPython > * Automatically cache records from unbounded sources > * Add a replay experience that replays all cached records to simulate the > original pipeline execution > * Add controls to play/pause/stop/step individual elements from the cached > records > * Add ability to inspect/visualize unbounded PCollections -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8457) Instrument Dataflow jobs that are launched from Notebooks
[ https://issues.apache.org/jira/browse/BEAM-8457?focusedWorklogId=339650&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339650 ] ASF GitHub Bot logged work on BEAM-8457: Author: ASF GitHub Bot Created on: 07/Nov/19 00:44 Start Date: 07/Nov/19 00:44 Worklog Time Spent: 10m Work Description: aaltay commented on pull request #9885: [BEAM-8457] Label Dataflow jobs from Notebook URL: https://github.com/apache/beam/pull/9885#discussion_r343401907
## File path: sdks/python/apache_beam/pipeline.py
## @@ -396,28 +400,57 @@ def replace_all(self, replacements):
     for override in replacements:
       self._check_replacement(override)
-  def run(self, test_runner_api=True):
-    """Runs the pipeline. Returns whatever our runner returns after running."""
-
+  def run(self, test_runner_api=True, runner=None, options=None,
Review comment: > Do you think we should put it into a separate PR, Yes. At least let's have a separate discussion for API changes like this. > or simply not support it at all? Maybe not. This could be just an override for the interactive runner's run() (e.g. run_with(NewRunner, NewOptions)). At least, let's discuss with all stakeholders. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339650) Time Spent: 8h (was: 7h 50m) > Instrument Dataflow jobs that are launched from Notebooks > - > > Key: BEAM-8457 > URL: https://issues.apache.org/jira/browse/BEAM-8457 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Fix For: 2.17.0 > > Time Spent: 8h > Remaining Estimate: 0h > > Dataflow needs the capability to tell how many Dataflow jobs are launched > from the Notebook environment, i.e., the Interactive Runner. > # Change the pipeline.run() API to allow supplying a runner and an options > parameter so that a pipeline initially bundled with an interactive runner can > be directly run by other runners from a notebook. > # Implicitly add the necessary source information through user labels when > the user does p.run(runner=DataflowRunner()). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8457) Instrument Dataflow jobs that are launched from Notebooks
[ https://issues.apache.org/jira/browse/BEAM-8457?focusedWorklogId=339649&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339649 ] ASF GitHub Bot logged work on BEAM-8457: Author: ASF GitHub Bot Created on: 07/Nov/19 00:41 Start Date: 07/Nov/19 00:41 Worklog Time Spent: 10m Work Description: aaltay commented on pull request #9885: [BEAM-8457] Label Dataflow jobs from Notebook URL: https://github.com/apache/beam/pull/9885#discussion_r343401313
## File path: sdks/python/apache_beam/runners/dataflow/dataflow_runner.py
## @@ -360,6 +360,16 @@ def visit_transform(self, transform_node):
 def run_pipeline(self, pipeline, options):
   """Remotely executes entire pipeline or parts reachable from node."""
+  # Label goog-dataflow-notebook if pipeline is initiated from interactive
+  # runner.
+  if pipeline.interactive:
Review comment: Could we move https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/interactive_environment.py#L100:L102 to a common utility function, so that each runner that wants this could call it without worrying about requiring additional imports? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339649) Time Spent: 7h 50m (was: 7h 40m) > Instrument Dataflow jobs that are launched from Notebooks > - > > Key: BEAM-8457 > URL: https://issues.apache.org/jira/browse/BEAM-8457 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Fix For: 2.17.0 > > Time Spent: 7h 50m > Remaining Estimate: 0h > > Dataflow needs the capability to tell how many Dataflow jobs are launched > from the Notebook environment, i.e., the Interactive Runner. > # Change the pipeline.run() API to allow supplying a runner and an options > parameter so that a pipeline initially bundled with an interactive runner can > be directly run by other runners from a notebook. > # Implicitly add the necessary source information through user labels when > the user does p.run(runner=DataflowRunner()). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-8575) Add more Python validates runner tests
wendy liu created BEAM-8575: --- Summary: Add more Python validates runner tests Key: BEAM-8575 URL: https://issues.apache.org/jira/browse/BEAM-8575 Project: Beam Issue Type: Test Components: sdk-py-core, testing Reporter: wendy liu This is the umbrella issue to track the work of adding more Python tests to improve test coverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=339648&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339648 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 07/Nov/19 00:29 Start Date: 07/Nov/19 00:29 Worklog Time Spent: 10m Work Description: robinyqiu commented on issue #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#issuecomment-550564200 Run Java PostCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339648) Time Spent: 37h 10m (was: 37h) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 37h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7886) Make row coder a standard coder and implement in python
[ https://issues.apache.org/jira/browse/BEAM-7886?focusedWorklogId=339641&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339641 ] ASF GitHub Bot logged work on BEAM-7886: Author: ASF GitHub Bot Created on: 06/Nov/19 23:39 Start Date: 06/Nov/19 23:39 Worklog Time Spent: 10m Work Description: reuvenlax commented on pull request #9188: [BEAM-7886] Make row coder a standard coder and implement in Python URL: https://github.com/apache/beam/pull/9188 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339641) Time Spent: 15h 50m (was: 15h 40m) > Make row coder a standard coder and implement in python > --- > > Key: BEAM-7886 > URL: https://issues.apache.org/jira/browse/BEAM-7886 > Project: Beam > Issue Type: Improvement > Components: beam-model, sdk-java-core, sdk-py-core >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > Time Spent: 15h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8508) [SQL] Support predicate push-down without project push-down
[ https://issues.apache.org/jira/browse/BEAM-8508?focusedWorklogId=339640&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339640 ] ASF GitHub Bot logged work on BEAM-8508: Author: ASF GitHub Bot Created on: 06/Nov/19 23:35 Start Date: 06/Nov/19 23:35 Worklog Time Spent: 10m Work Description: 11moon11 commented on issue #9943: [BEAM-8508] [SQL] Standalone filter push down URL: https://github.com/apache/beam/pull/9943#issuecomment-550550057 `PushDownRule` should not be applied to the same BeamIOSourceRel more than once. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339640) Time Spent: 3h 50m (was: 3h 40m) > [SQL] Support predicate push-down without project push-down > --- > > Key: BEAM-8508 > URL: https://issues.apache.org/jira/browse/BEAM-8508 > Project: Beam > Issue Type: Improvement > Components: dsl-sql >Reporter: Kirill Kozlov >Assignee: Kirill Kozlov >Priority: Major > Time Spent: 3h 50m > Remaining Estimate: 0h > > In this PR: [https://github.com/apache/beam/pull/9863], > support for predicate push-down is added, but only for IOs that support > project push-down. > To accomplish that, some checks need to be added so that certain Calc and IO > manipulations are not performed when only filter push-down is needed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8574) [SQL] MongoDb PostCommit_SQL fails
[ https://issues.apache.org/jira/browse/BEAM-8574?focusedWorklogId=339639&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339639 ] ASF GitHub Bot logged work on BEAM-8574: Author: ASF GitHub Bot Created on: 06/Nov/19 23:33 Start Date: 06/Nov/19 23:33 Worklog Time Spent: 10m Work Description: apilloud commented on pull request #10018: [BEAM-8574] Revert "[BEAM-8427] Create a table and a table provider for MongoDB" URL: https://github.com/apache/beam/pull/10018 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339639) Time Spent: 50m (was: 40m) > [SQL] MongoDb PostCommit_SQL fails > -- > > Key: BEAM-8574 > URL: https://issues.apache.org/jira/browse/BEAM-8574 > Project: Beam > Issue Type: Bug > Components: dsl-sql >Reporter: Kirill Kozlov >Assignee: Kirill Kozlov >Priority: Critical > Time Spent: 50m > Remaining Estimate: 0h > > Integration test for Sql MongoDb table read and write fails. > Jenkins: > [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_SQL/3126/] > Cause: [https://github.com/apache/beam/pull/9806] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8570) Use SDK version in default Java container tag
[ https://issues.apache.org/jira/browse/BEAM-8570?focusedWorklogId=339636&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339636 ] ASF GitHub Bot logged work on BEAM-8570: Author: ASF GitHub Bot Created on: 06/Nov/19 23:22 Start Date: 06/Nov/19 23:22 Worklog Time Spent: 10m Work Description: ibzib commented on issue #10017: [BEAM-8570] Use SDK version in default Java container tag URL: https://github.com/apache/beam/pull/10017#issuecomment-550546532 Run XVR_Flink PostCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339636) Time Spent: 0.5h (was: 20m) > Use SDK version in default Java container tag > - > > Key: BEAM-8570 > URL: https://issues.apache.org/jira/browse/BEAM-8570 > Project: Beam > Issue Type: Improvement > Components: sdk-java-harness >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > Currently, the Java SDK uses container `apachebeam/java_sdk:latest` by > default [1]. This causes confusion when using locally built containers [2], > especially since images are automatically pulled, meaning the release image > is used instead of the developer's own image (BEAM-8545). > [[1] > https://github.com/apache/beam/blob/473377ef8f51949983508f70663e75ef0ee24a7f/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/Environments.java#L84-L91|https://github.com/apache/beam/blob/473377ef8f51949983508f70663e75ef0ee24a7f/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/Environments.java#L84-L91] > [[2] > https://lists.apache.org/thread.html/07131e314e229ec60100eaa2c0cf6dfc206bf2b0f78c3cee9ebb0bda@%3Cdev.beam.apache.org%3E|https://lists.apache.org/thread.html/07131e314e229ec60100eaa2c0cf6dfc206bf2b0f78c3cee9ebb0bda@%3Cdev.beam.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
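A sketch of the idea behind this fix: derive the default container tag from the SDK version rather than "latest", so locally built images are not shadowed by automatic pulls of the release image. ReleaseInfo is assumed to be the Beam utility exposing the version string; the image naming below is illustrative, and the real wiring lives in Environments.java.

{code:java}
import org.apache.beam.sdk.util.ReleaseInfo;

public class DefaultContainerTagSketch {
  // Illustrative only; constant names in Environments.java may differ.
  public static String defaultJavaSdkImage() {
    String version = ReleaseInfo.getReleaseInfo().getVersion(); // e.g. "2.17.0"
    return "apachebeam/java_sdk:" + version;
  }
}
{code}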
[jira] [Work logged] (BEAM-8574) [SQL] MongoDb PostCommit_SQL fails
[ https://issues.apache.org/jira/browse/BEAM-8574?focusedWorklogId=339634&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339634 ] ASF GitHub Bot logged work on BEAM-8574: Author: ASF GitHub Bot Created on: 06/Nov/19 23:14 Start Date: 06/Nov/19 23:14 Worklog Time Spent: 10m Work Description: apilloud commented on issue #10018: [BEAM-8574] Revert "[BEAM-8427] Create a table and a table provider for MongoDB" URL: https://github.com/apache/beam/pull/10018#issuecomment-550544180 Run sql postcommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339634) Time Spent: 40m (was: 0.5h) > [SQL] MongoDb PostCommit_SQL fails > -- > > Key: BEAM-8574 > URL: https://issues.apache.org/jira/browse/BEAM-8574 > Project: Beam > Issue Type: Bug > Components: dsl-sql >Reporter: Kirill Kozlov >Assignee: Kirill Kozlov >Priority: Critical > Time Spent: 40m > Remaining Estimate: 0h > > Integration test for Sql MongoDb table read and write fails. > Jenkins: > [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_SQL/3126/] > Cause: [https://github.com/apache/beam/pull/9806] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-1438) The default behavior for the Write transform doesn't work well with the Dataflow streaming runner
[ https://issues.apache.org/jira/browse/BEAM-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968788#comment-16968788 ] Amit Kumar commented on BEAM-1438: -- I have also recently seen failures with withNumShards(0) for an unbounded source. > The default behavior for the Write transform doesn't work well with the > Dataflow streaming runner > - > > Key: BEAM-1438 > URL: https://issues.apache.org/jira/browse/BEAM-1438 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Reporter: Reuven Lax >Assignee: Reuven Lax >Priority: Major > Fix For: 2.5.0 > > > If a Write specifies 0 output shards, that implies the runner should pick an > appropriate sharding. The default behavior is to write one shard per input > bundle. This works well with the Dataflow batch runner, but not with the > streaming runner, which produces large numbers of small bundles. -- This message was sent by Atlassian Jira (v8.3.4#803005)
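A common mitigation for the behavior described in this issue is to pin an explicit shard count instead of asking the runner to choose one. A minimal sketch using TextIO follows; the output prefix and shard count are placeholders, not recommendations.

{code:java}
import org.apache.beam.sdk.io.TextIO;

public class ExplicitShardingSketch {
  // With an unbounded input, use windowed writes and an explicit shard count
  // rather than withNumShards(0), which leaves sharding to the runner and
  // triggers the one-shard-per-bundle behavior described above.
  static TextIO.Write write() {
    return TextIO.write()
        .to("gs://bucket/output/part") // placeholder output prefix
        .withWindowedWrites()
        .withNumShards(10); // 10 is illustrative
  }
}
{code}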
[jira] [Work logged] (BEAM-8512) Add integration tests for Python "flink_runner.py"
[ https://issues.apache.org/jira/browse/BEAM-8512?focusedWorklogId=339633&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339633 ] ASF GitHub Bot logged work on BEAM-8512: Author: ASF GitHub Bot Created on: 06/Nov/19 23:12 Start Date: 06/Nov/19 23:12 Worklog Time Spent: 10m Work Description: ibzib commented on issue #9998: [BEAM-8512] Add integration tests for flink_runner.py URL: https://github.com/apache/beam/pull/9998#issuecomment-550543641 Run Python 3.7 PostCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339633) Time Spent: 50m (was: 40m) > Add integration tests for Python "flink_runner.py" > -- > > Key: BEAM-8512 > URL: https://issues.apache.org/jira/browse/BEAM-8512 > Project: Beam > Issue Type: Test > Components: runner-flink, sdk-py-core >Reporter: Maximilian Michels >Assignee: Kyle Weaver >Priority: Major > Time Spent: 50m > Remaining Estimate: 0h > > There are currently no integration tests for the Python FlinkRunner. We need > a set of tests similar to {{flink_runner_test.py}} which currently use the > PortableRunner and not the FlinkRunner. > CC [~robertwb] [~ibzib] [~thw] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8512) Add integration tests for Python "flink_runner.py"
[ https://issues.apache.org/jira/browse/BEAM-8512?focusedWorklogId=339632&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339632 ] ASF GitHub Bot logged work on BEAM-8512: Author: ASF GitHub Bot Created on: 06/Nov/19 23:11 Start Date: 06/Nov/19 23:11 Worklog Time Spent: 10m Work Description: ibzib commented on issue #9998: [BEAM-8512] Add integration tests for flink_runner.py URL: https://github.com/apache/beam/pull/9998#issuecomment-550543546 Run Python 2 PostCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339632) Time Spent: 40m (was: 0.5h) > Add integration tests for Python "flink_runner.py" > -- > > Key: BEAM-8512 > URL: https://issues.apache.org/jira/browse/BEAM-8512 > Project: Beam > Issue Type: Test > Components: runner-flink, sdk-py-core >Reporter: Maximilian Michels >Assignee: Kyle Weaver >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > There are currently no integration tests for the Python FlinkRunner. We need > a set of tests similar to {{flink_runner_test.py}} which currently use the > PortableRunner and not the FlinkRunner. > CC [~robertwb] [~ibzib] [~thw] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8570) Use SDK version in default Java container tag
[ https://issues.apache.org/jira/browse/BEAM-8570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Weaver updated BEAM-8570: -- Status: Open (was: Triage Needed) > Use SDK version in default Java container tag > - > > Key: BEAM-8570 > URL: https://issues.apache.org/jira/browse/BEAM-8570 > Project: Beam > Issue Type: Improvement > Components: sdk-java-harness >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > Currently, the Java SDK uses container `apachebeam/java_sdk:latest` by > default [1]. This causes confusion when using locally built containers [2], > especially since images are automatically pulled, meaning the release image > is used instead of the developer's own image (BEAM-8545). > [[1] > https://github.com/apache/beam/blob/473377ef8f51949983508f70663e75ef0ee24a7f/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/Environments.java#L84-L91|https://github.com/apache/beam/blob/473377ef8f51949983508f70663e75ef0ee24a7f/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/Environments.java#L84-L91] > [[2] > https://lists.apache.org/thread.html/07131e314e229ec60100eaa2c0cf6dfc206bf2b0f78c3cee9ebb0bda@%3Cdev.beam.apache.org%3E|https://lists.apache.org/thread.html/07131e314e229ec60100eaa2c0cf6dfc206bf2b0f78c3cee9ebb0bda@%3Cdev.beam.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8571) Use SDK version in default Go container tag
[ https://issues.apache.org/jira/browse/BEAM-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Weaver updated BEAM-8571: -- Status: Open (was: Triage Needed) > Use SDK version in default Go container tag > --- > > Key: BEAM-8571 > URL: https://issues.apache.org/jira/browse/BEAM-8571 > Project: Beam > Issue Type: Improvement > Components: sdk-go >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > > Currently, the Go SDK uses container `apachebeam/go_sdk:latest` by default > [1]. This causes confusion when using locally built containers [2], > especially since images are automatically pulled, meaning the release image > is used instead of the developer's own image (BEAM-8545). > [1] > [https://github.com/apache/beam/blob/473377ef8f51949983508f70663e75ef0ee24a7f/sdks/go/pkg/beam/options/jobopts/options.go#L111] > [[2] > https://lists.apache.org/thread.html/07131e314e229ec60100eaa2c0cf6dfc206bf2b0f78c3cee9ebb0bda@%3Cdev.beam.apache.org%3E|https://lists.apache.org/thread.html/07131e314e229ec60100eaa2c0cf6dfc206bf2b0f78c3cee9ebb0bda@%3Cdev.beam.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8574) [SQL] MongoDb PostCommit_SQL fails
[ https://issues.apache.org/jira/browse/BEAM-8574?focusedWorklogId=339630&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339630 ] ASF GitHub Bot logged work on BEAM-8574: Author: ASF GitHub Bot Created on: 06/Nov/19 23:07 Start Date: 06/Nov/19 23:07 Worklog Time Spent: 10m Work Description: 11moon11 commented on issue #10018: [BEAM-8574] Revert "[BEAM-8427] Create a table and a table provider for MongoDB" URL: https://github.com/apache/beam/pull/10018#issuecomment-550542272 R: @apilloud This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339630) Time Spent: 0.5h (was: 20m) > [SQL] MongoDb PostCommit_SQL fails > -- > > Key: BEAM-8574 > URL: https://issues.apache.org/jira/browse/BEAM-8574 > Project: Beam > Issue Type: Bug > Components: dsl-sql >Reporter: Kirill Kozlov >Assignee: Kirill Kozlov >Priority: Critical > Time Spent: 0.5h > Remaining Estimate: 0h > > Integration test for Sql MongoDb table read and write fails. > Jenkins: > [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_SQL/3126/] > Cause: [https://github.com/apache/beam/pull/9806] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8574) [SQL] MongoDb PostCommit_SQL fails
[ https://issues.apache.org/jira/browse/BEAM-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kirill Kozlov updated BEAM-8574: Description: Integration test for Sql MongoDb table read and write fails. Jenkins: [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_SQL/3126/] Cause: [https://github.com/apache/beam/pull/9806] Revert PR: [https://github.com/apache/beam/pull/10018] was: Integration test for Sql MongoDb table read and write fails. Jenkins: [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_SQL/3126/] Cause: [https://github.com/apache/beam/pull/9806] > [SQL] MongoDb PostCommit_SQL fails > -- > > Key: BEAM-8574 > URL: https://issues.apache.org/jira/browse/BEAM-8574 > Project: Beam > Issue Type: Bug > Components: dsl-sql >Reporter: Kirill Kozlov >Assignee: Kirill Kozlov >Priority: Critical > Time Spent: 20m > Remaining Estimate: 0h > > Integration test for Sql MongoDb table read and write fails. > Jenkins: > [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_SQL/3126/] > Cause: [https://github.com/apache/beam/pull/9806] > Revert PR: [https://github.com/apache/beam/pull/10018] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8574) [SQL] MongoDb PostCommit_SQL fails
[ https://issues.apache.org/jira/browse/BEAM-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kirill Kozlov updated BEAM-8574: Description: Integration test for Sql MongoDb table read and write fails. Jenkins: [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_SQL/3126/] Cause: [https://github.com/apache/beam/pull/9806] was: Integration test for Sql MongoDb table read and write fails. Jenkins: [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_SQL/3126/] Cause: [https://github.com/apache/beam/pull/9806] Revert PR: [https://github.com/apache/beam/pull/10018] > [SQL] MongoDb PostCommit_SQL fails > -- > > Key: BEAM-8574 > URL: https://issues.apache.org/jira/browse/BEAM-8574 > Project: Beam > Issue Type: Bug > Components: dsl-sql >Reporter: Kirill Kozlov >Assignee: Kirill Kozlov >Priority: Critical > Time Spent: 20m > Remaining Estimate: 0h > > Integration test for Sql MongoDb table read and write fails. > Jenkins: > [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_SQL/3126/] > Cause: [https://github.com/apache/beam/pull/9806] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8574) [SQL] MongoDb PostCommit_SQL fails
[ https://issues.apache.org/jira/browse/BEAM-8574?focusedWorklogId=339627&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339627 ] ASF GitHub Bot logged work on BEAM-8574: Author: ASF GitHub Bot Created on: 06/Nov/19 22:52 Start Date: 06/Nov/19 22:52 Worklog Time Spent: 10m Work Description: 11moon11 commented on issue #10018: [BEAM-8574] Revert "[BEAM-8427] Create a table and a table provider for MongoDB" URL: https://github.com/apache/beam/pull/10018#issuecomment-550537682 Run JavaPortabilityApi PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339627) Time Spent: 20m (was: 10m) > [SQL] MongoDb PostCommit_SQL fails > -- > > Key: BEAM-8574 > URL: https://issues.apache.org/jira/browse/BEAM-8574 > Project: Beam > Issue Type: Bug > Components: dsl-sql >Reporter: Kirill Kozlov >Assignee: Kirill Kozlov >Priority: Critical > Time Spent: 20m > Remaining Estimate: 0h > > Integration test for Sql MongoDb table read and write fails. > Jenkins: > [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_SQL/3126/] > Cause: [https://github.com/apache/beam/pull/9806] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8574) [SQL] MongoDb PostCommit_SQL fails
[ https://issues.apache.org/jira/browse/BEAM-8574?focusedWorklogId=339626&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339626 ] ASF GitHub Bot logged work on BEAM-8574: Author: ASF GitHub Bot Created on: 06/Nov/19 22:52 Start Date: 06/Nov/19 22:52 Worklog Time Spent: 10m Work Description: 11moon11 commented on issue #10018: [BEAM-8574] Revert "[BEAM-8427] Create a table and a table provider for MongoDB" URL: https://github.com/apache/beam/pull/10018#issuecomment-550537682 Run JavaPortabilityApi PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339626) Remaining Estimate: 0h Time Spent: 10m > [SQL] MongoDb PostCommit_SQL fails > -- > > Key: BEAM-8574 > URL: https://issues.apache.org/jira/browse/BEAM-8574 > Project: Beam > Issue Type: Bug > Components: dsl-sql >Reporter: Kirill Kozlov >Assignee: Kirill Kozlov >Priority: Critical > Time Spent: 10m > Remaining Estimate: 0h > > Integration test for Sql MongoDb table read and write fails. > Jenkins: > [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_SQL/3126/] > Cause: [https://github.com/apache/beam/pull/9806] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8512) Add integration tests for Python "flink_runner.py"
[ https://issues.apache.org/jira/browse/BEAM-8512?focusedWorklogId=339624&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339624 ] ASF GitHub Bot logged work on BEAM-8512: Author: ASF GitHub Bot Created on: 06/Nov/19 22:50 Start Date: 06/Nov/19 22:50 Worklog Time Spent: 10m Work Description: ibzib commented on issue #9998: [BEAM-8512] Add integration tests for flink_runner.py URL: https://github.com/apache/beam/pull/9998#issuecomment-550536929 Run Seed Job This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339624) Time Spent: 0.5h (was: 20m) > Add integration tests for Python "flink_runner.py" > -- > > Key: BEAM-8512 > URL: https://issues.apache.org/jira/browse/BEAM-8512 > Project: Beam > Issue Type: Test > Components: runner-flink, sdk-py-core >Reporter: Maximilian Michels >Assignee: Kyle Weaver >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > There are currently no integration tests for the Python FlinkRunner. We need > a set of tests similar to {{flink_runner_test.py}}, which currently uses the > PortableRunner and not the FlinkRunner. > CC [~robertwb] [~ibzib] [~thw] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8574) [SQL] MongoDb PostCommit_SQL fails
[ https://issues.apache.org/jira/browse/BEAM-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kirill Kozlov updated BEAM-8574: Description: Integration test for Sql MongoDb table read and write fails. Jenkins: [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_SQL/3126/] Cause: [https://github.com/apache/beam/pull/9806] was: Integration test for Sql MongoDb table read and write fails. Jenkins: [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_SQL/3126/] > [SQL] MongoDb PostCommit_SQL fails > -- > > Key: BEAM-8574 > URL: https://issues.apache.org/jira/browse/BEAM-8574 > Project: Beam > Issue Type: Bug > Components: dsl-sql >Reporter: Kirill Kozlov >Assignee: Kirill Kozlov >Priority: Critical > > Integration test for Sql MongoDb table read and write fails. > Jenkins: > [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_SQL/3126/] > Cause: [https://github.com/apache/beam/pull/9806] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8574) [SQL] MongoDb PostCommit_SQL fails
[ https://issues.apache.org/jira/browse/BEAM-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kirill Kozlov updated BEAM-8574: Description: Integration test for Sql MongoDb table read and write fails. Jenkins: [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_SQL/3126/] was: Integration test for Sql MongoDb table read and write fails. [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_SQL/3126/] > [SQL] MongoDb PostCommit_SQL fails > -- > > Key: BEAM-8574 > URL: https://issues.apache.org/jira/browse/BEAM-8574 > Project: Beam > Issue Type: Bug > Components: dsl-sql >Reporter: Kirill Kozlov >Assignee: Kirill Kozlov >Priority: Critical > > Integration test for Sql MongoDb table read and write fails. > Jenkins: > [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_SQL/3126/] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8574) [SQL] MongoDb PostCommit_SQL fails
[ https://issues.apache.org/jira/browse/BEAM-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kirill Kozlov updated BEAM-8574: Description: Integration test for Sql MongoDb table read and write fails. [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_SQL/3126/] was:Integration test for Sql MongoDb table read and write fails. > [SQL] MongoDb PostCommit_SQL fails > -- > > Key: BEAM-8574 > URL: https://issues.apache.org/jira/browse/BEAM-8574 > Project: Beam > Issue Type: Bug > Components: dsl-sql >Reporter: Kirill Kozlov >Assignee: Kirill Kozlov >Priority: Critical > > Integration test for Sql MongoDb table read and write fails. > [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_SQL/3126/] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8402) Create a class hierarchy to represent environments
[ https://issues.apache.org/jira/browse/BEAM-8402?focusedWorklogId=339622&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339622 ] ASF GitHub Bot logged work on BEAM-8402: Author: ASF GitHub Bot Created on: 06/Nov/19 22:43 Start Date: 06/Nov/19 22:43 Worklog Time Spent: 10m Work Description: violalyu commented on issue #9811: [BEAM-8402] Create a class hierarchy to represent environments URL: https://github.com/apache/beam/pull/9811#issuecomment-550534787 Run Python PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339622) Time Spent: 2h 50m (was: 2h 40m) > Create a class hierarchy to represent environments > -- > > Key: BEAM-8402 > URL: https://issues.apache.org/jira/browse/BEAM-8402 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chad Dombrova >Assignee: Chad Dombrova >Priority: Major > Time Spent: 2h 50m > Remaining Estimate: 0h > > As a first step towards making it possible to assign different environments > to sections of a pipeline, we need to expose environment classes to the > pipeline API. Unlike PTransforms, PCollections, Coders, and Windowings, > environments exist solely in the portability framework as protobuf objects. > By creating a hierarchy of "native" classes that represent the various > environment types -- external, docker, process, etc -- users will be able to > instantiate these and assign them to parts of the pipeline. The assignment > portion will be covered in a follow-up issue/PR. -- This message was sent by Atlassian Jira (v8.3.4#803005)
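To make the proposed hierarchy concrete, here is a hedged sketch. The actual work targets the Python SDK, so this Java rendering is purely illustrative; the class shapes are assumptions, while the URN strings are Beam's standard portable environment URNs.

{code:java}
// Illustrative sketch of "native" classes mirroring the portable
// environment protos (docker, process, external).
abstract class Environment {
  abstract String urn();
}

final class DockerEnvironment extends Environment {
  final String containerImage;
  DockerEnvironment(String containerImage) { this.containerImage = containerImage; }
  @Override String urn() { return "beam:env:docker:v1"; }
}

final class ProcessEnvironment extends Environment {
  final String command;
  ProcessEnvironment(String command) { this.command = command; }
  @Override String urn() { return "beam:env:process:v1"; }
}

final class ExternalEnvironment extends Environment {
  final String serviceAddress;
  ExternalEnvironment(String serviceAddress) { this.serviceAddress = serviceAddress; }
  @Override String urn() { return "beam:env:external:v1"; }
}
{code}

Users could then instantiate, say, {{new DockerEnvironment("my-registry/my-sdk-image:tag")}} (a hypothetical image name) and attach it to part of a pipeline once the assignment mechanism lands.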
[jira] [Work logged] (BEAM-8570) Use SDK version in default Java container tag
[ https://issues.apache.org/jira/browse/BEAM-8570?focusedWorklogId=339623&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339623 ] ASF GitHub Bot logged work on BEAM-8570: Author: ASF GitHub Bot Created on: 06/Nov/19 22:43 Start Date: 06/Nov/19 22:43 Worklog Time Spent: 10m Work Description: ibzib commented on issue #10017: [BEAM-8570] Use SDK version in default Java container tag URL: https://github.com/apache/beam/pull/10017#issuecomment-550534969 Run XVR_Flink PostCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339623) Time Spent: 20m (was: 10m) > Use SDK version in default Java container tag > - > > Key: BEAM-8570 > URL: https://issues.apache.org/jira/browse/BEAM-8570 > Project: Beam > Issue Type: Improvement > Components: sdk-java-harness >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > Currently, the Java SDK uses container `apachebeam/java_sdk:latest` by > default [1]. This causes confusion when using locally built containers [2], > especially since images are automatically pulled, meaning the release image > is used instead of the developer's own image (BEAM-8545). > [[1] > https://github.com/apache/beam/blob/473377ef8f51949983508f70663e75ef0ee24a7f/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/Environments.java#L84-L91|https://github.com/apache/beam/blob/473377ef8f51949983508f70663e75ef0ee24a7f/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/Environments.java#L84-L91] > [[2] > https://lists.apache.org/thread.html/07131e314e229ec60100eaa2c0cf6dfc206bf2b0f78c3cee9ebb0bda@%3Cdev.beam.apache.org%3E|https://lists.apache.org/thread.html/07131e314e229ec60100eaa2c0cf6dfc206bf2b0f78c3cee9ebb0bda@%3Cdev.beam.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-8574) [SQL] MongoDb PostCommit_SQL fails
Kirill Kozlov created BEAM-8574: --- Summary: [SQL] MongoDb PostCommit_SQL fails Key: BEAM-8574 URL: https://issues.apache.org/jira/browse/BEAM-8574 Project: Beam Issue Type: Bug Components: dsl-sql Reporter: Kirill Kozlov Assignee: Kirill Kozlov Integration test for Sql MongoDb table read and write fails. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8427) [SQL] Add support for MongoDB source
[ https://issues.apache.org/jira/browse/BEAM-8427?focusedWorklogId=339621&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339621 ] ASF GitHub Bot logged work on BEAM-8427: Author: ASF GitHub Bot Created on: 06/Nov/19 22:37 Start Date: 06/Nov/19 22:37 Worklog Time Spent: 10m Work Description: 11moon11 commented on pull request #10018: Revert "[BEAM-8427] Create a table and a table provider for MongoDB" URL: https://github.com/apache/beam/pull/10018 Reverts apache/beam#9806 Post Commit tests break. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339621) Time Spent: 4.5h (was: 4h 20m) > [SQL] Add support for MongoDB source > > > Key: BEAM-8427 > URL: https://issues.apache.org/jira/browse/BEAM-8427 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Kirill Kozlov >Assignee: Kirill Kozlov >Priority: Major > Time Spent: 4.5h > Remaining Estimate: 0h > > In progress: > * Create a MongoDB table and table provider. > * Implement buildIOReader > * Support primitive types > Still needs to be done: > * Implement buildIOWrite > * improve getTableStatistics -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8570) Use SDK version in default Java container tag
[ https://issues.apache.org/jira/browse/BEAM-8570?focusedWorklogId=339620&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339620 ] ASF GitHub Bot logged work on BEAM-8570: Author: ASF GitHub Bot Created on: 06/Nov/19 22:36 Start Date: 06/Nov/19 22:36 Worklog Time Spent: 10m Work Description: ibzib commented on pull request #10017: [BEAM-8570] Use SDK version in default Java container tag URL: https://github.com/apache/beam/pull/10017 One small concern I have is the provenance of the release version. It ultimately comes from `sdk.properties`, which is generated by [`./gradlew :sdks:java:core:processResources`](https://github.com/ibzib/beam/blob/java-container-version/sdks/java/core/build.gradle#L46-L52), which appears to do nothing unless it's a clean build. [Post-Commit Tests Status badge table omitted: empty build-badge links to builds.apache.org Jenkins jobs for the Go, Java, and Python post-commits; the message is truncated in the archive.]
[jira] [Work logged] (BEAM-8554) Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to be broken up
[ https://issues.apache.org/jira/browse/BEAM-8554?focusedWorklogId=339618&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339618 ] ASF GitHub Bot logged work on BEAM-8554: Author: ASF GitHub Bot Created on: 06/Nov/19 22:32 Start Date: 06/Nov/19 22:32 Worklog Time Spent: 10m Work Description: stevekoonce commented on pull request #10013: [BEAM-8554] Use WorkItemCommitRequest protobuf fields to signal that … URL: https://github.com/apache/beam/pull/10013#discussion_r343363529 ## File path: runners/google-cloud-dataflow-java/worker/src/test/java/org/apache/beam/runners/dataflow/worker/StreamingDataflowWorkerTest.java ## @@ -949,63 +949,14 @@ public void testKeyCommitTooLargeException() throws Exception { assertEquals(2, result.size()); assertEquals(makeExpectedOutput(2, 0, "key", "key").build(), result.get(2L)); assertTrue(result.containsKey(1L)); -assertEquals("large_key", result.get(1L).getKey().toStringUtf8()); -assertTrue(result.get(1L).getSerializedSize() > 1000); -// Spam worker updates a few times. -int maxTries = 10; -while (--maxTries > 0) { - worker.reportPeriodicWorkerUpdates(); - Uninterruptibles.sleepUninterruptibly(1000, TimeUnit.MILLISECONDS); -} +WorkItemCommitRequest largeCommit = result.get(1L); +assertEquals("large_key", largeCommit.getKey().toStringUtf8()); Review comment: Done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339618) Time Spent: 1h 40m (was: 1.5h) > Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to > be broken up > - > > Key: BEAM-8554 > URL: https://issues.apache.org/jira/browse/BEAM-8554 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Steve Koonce >Priority: Minor > Time Spent: 1h 40m > Remaining Estimate: 0h > > +Background:+ > When a WorkItemCommitRequest is generated that's bigger than the permitted > size (> ~180 MB), a KeyCommitTooLargeException is logged (_not thrown_) and > the request is still sent to the service. The service rejects the commit, > but breaks up input messages that were bundled together and adds them to new, > smaller work items that will later be pulled and re-tried - likely without > generating another commit that is too large. > When a WorkItemCommitRequest is generated that's too large to be sent back to > the service (> 2 GB), a KeyCommitTooLargeException is thrown and nothing is > sent back to the service. > > +Proposed Improvement+ > In both cases, prevent the doomed, large commit item from being sent back to > the service. Instead send flags in the commit request signaling that the > current work item led to a commit that is too large and the work item should > be broken up. -- This message was sent by Atlassian Jira (v8.3.4#803005)
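As a hedged sketch of the proposal (an assumed helper, not the actual worker code), the worker would swap an oversized commit for a small stub that keeps only the routing fields and sets the overflow flag. The field and method names follow the {{WorkItemCommitRequest}} proto and test diff quoted in the review comments below; treat the helper itself as an assumption.

{code:java}
// Sketch: instead of sending (or dropping) a doomed oversized commit,
// send a stub flagged so the service breaks the work item up.
static Windmill.WorkItemCommitRequest maybeFlagForBreakup(
    Windmill.WorkItemCommitRequest commit, long maxCommitBytes) {
  if (commit.getSerializedSize() <= maxCommitBytes) {
    return commit;
  }
  // Keep only what the service needs to identify the work item; all payload
  // (output messages, timers, state updates) is intentionally dropped.
  return Windmill.WorkItemCommitRequest.newBuilder()
      .setKey(commit.getKey())
      .setShardingKey(commit.getShardingKey())
      .setWorkToken(commit.getWorkToken())
      .setExceedsMaxWorkItemCommitBytes(true)
      .build();
}
{code}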
[jira] [Work logged] (BEAM-8554) Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to be broken up
[ https://issues.apache.org/jira/browse/BEAM-8554?focusedWorklogId=339617&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339617 ] ASF GitHub Bot logged work on BEAM-8554: Author: ASF GitHub Bot Created on: 06/Nov/19 22:32 Start Date: 06/Nov/19 22:32 Worklog Time Spent: 10m Work Description: stevekoonce commented on pull request #10013: [BEAM-8554] Use WorkItemCommitRequest protobuf fields to signal that … URL: https://github.com/apache/beam/pull/10013#discussion_r343363496 ## File path: runners/google-cloud-dataflow-java/worker/src/test/java/org/apache/beam/runners/dataflow/worker/StreamingDataflowWorkerTest.java ## @@ -949,63 +949,14 @@ public void testKeyCommitTooLargeException() throws Exception { assertEquals(2, result.size()); assertEquals(makeExpectedOutput(2, 0, "key", "key").build(), result.get(2L)); assertTrue(result.containsKey(1L)); -assertEquals("large_key", result.get(1L).getKey().toStringUtf8()); -assertTrue(result.get(1L).getSerializedSize() > 1000); -// Spam worker updates a few times. -int maxTries = 10; -while (--maxTries > 0) { - worker.reportPeriodicWorkerUpdates(); - Uninterruptibles.sleepUninterruptibly(1000, TimeUnit.MILLISECONDS); -} +WorkItemCommitRequest largeCommit = result.get(1L); +assertEquals("large_key", largeCommit.getKey().toStringUtf8()); -// We should see an exception reported for the large commit but not the small one. -ArgumentCaptor workItemStatusCaptor = -ArgumentCaptor.forClass(WorkItemStatus.class); -verify(mockWorkUnitClient, atLeast(2)).reportWorkItemStatus(workItemStatusCaptor.capture()); -List capturedStatuses = workItemStatusCaptor.getAllValues(); -boolean foundErrors = false; -for (WorkItemStatus status : capturedStatuses) { - if (!status.getErrors().isEmpty()) { -assertFalse(foundErrors); -foundErrors = true; -String errorMessage = status.getErrors().get(0).getMessage(); -assertThat(errorMessage, Matchers.containsString("KeyCommitTooLargeException")); - } -} -assertTrue(foundErrors); - } - - @Test - public void testKeyCommitTooLargeException_StreamingEngine() throws Exception { -KvCoder kvCoder = KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()); - -List instructions = -Arrays.asList( -makeSourceInstruction(kvCoder), -makeDoFnInstruction(new LargeCommitFn(), 0, kvCoder), -makeSinkInstruction(kvCoder, 1)); - -FakeWindmillServer server = new FakeWindmillServer(errorCollector); -server.setExpectedExceptionCount(1); - -StreamingDataflowWorkerOptions options = -createTestingPipelineOptions(server, "--experiments=enable_streaming_engine"); -StreamingDataflowWorker worker = makeWorker(instructions, options, true /* publishCounters */); -worker.setMaxWorkItemCommitBytes(1000); -worker.start(); - -server.addWorkToOffer(makeInput(1, 0, "large_key")); -server.addWorkToOffer(makeInput(2, 0, "key")); -server.waitForEmptyWorkQueue(); - -Map result = server.waitForAndGetCommits(1); - -assertEquals(2, result.size()); -assertEquals(makeExpectedOutput(2, 0, "key", "key").build(), result.get(2L)); -assertTrue(result.containsKey(1L)); -assertEquals("large_key", result.get(1L).getKey().toStringUtf8()); -assertTrue(result.get(1L).getSerializedSize() > 1000); +// The large commit should have its flags set marking it for truncation +assertTrue(largeCommit.getExceedsMaxWorkItemCommitBytes()); +assertTrue(largeCommit.getSerializedSize() < 100); Review comment: Done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339617) Time Spent: 1.5h (was: 1h 20m) > Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to > be broken up > - > > Key: BEAM-8554 > URL: https://issues.apache.org/jira/browse/BEAM-8554 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Steve Koonce >Priority: Minor > Time Spent: 1.5h > Remaining Estimate: 0h > > +Background:+ > When a WorkItemCommitRequest is generated that's bigger than the permitted > size (> ~180 MB), a KeyCommitTooLargeException is logged (_not thrown_) and > the request is still sent to the service. The service rejects
[jira] [Work logged] (BEAM-8554) Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to be broken up
[ https://issues.apache.org/jira/browse/BEAM-8554?focusedWorklogId=339614&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339614 ] ASF GitHub Bot logged work on BEAM-8554: Author: ASF GitHub Bot Created on: 06/Nov/19 22:23 Start Date: 06/Nov/19 22:23 Worklog Time Spent: 10m Work Description: stevekoonce commented on pull request #10013: [BEAM-8554] Use WorkItemCommitRequest protobuf fields to signal that … URL: https://github.com/apache/beam/pull/10013#discussion_r343360175 ## File path: runners/google-cloud-dataflow-java/worker/windmill/src/main/proto/windmill.proto ## @@ -290,12 +293,19 @@ message WorkItemCommitRequest { optional SourceState source_state_updates = 12; optional int64 source_watermark = 13 [default=-0x8000]; optional int64 source_backlog_bytes = 17 [default=-1]; + optional int64 source_bytes_processed = 22 [default = 0]; + repeated WatermarkHold watermark_holds = 14; + repeated int64 finalize_ids = 19 [packed = true]; + + optional int64 testonly_fake_clock_time_usec = 23; + // DEPRECATED repeated GlobalDataId global_data_id_requests = 9; reserved 6; + Review comment: Done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339614) Time Spent: 1h 20m (was: 1h 10m) > Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to > be broken up > - > > Key: BEAM-8554 > URL: https://issues.apache.org/jira/browse/BEAM-8554 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Steve Koonce >Priority: Minor > Time Spent: 1h 20m > Remaining Estimate: 0h > > +Background:+ > When a WorkItemCommitRequest is generated that's bigger than the permitted > size (> ~180 MB), a KeyCommitTooLargeException is logged (_not thrown_) and > the request is still sent to the service. The service rejects the commit, > but breaks up input messages that were bundled together and adds them to new, > smaller work items that will later be pulled and re-tried - likely without > generating another commit that is too large. > When a WorkItemCommitRequest is generated that's too large to be sent back to > the service (> 2 GB), a KeyCommitTooLargeException is thrown and nothing is > sent back to the service. > > +Proposed Improvement+ > In both cases, prevent the doomed, large commit item from being sent back to > the service. Instead send flags in the commit request signaling that the > current work item led to a commit that is too large and the work item should > be broken up. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8554) Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to be broken up
[ https://issues.apache.org/jira/browse/BEAM-8554?focusedWorklogId=339612&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339612 ] ASF GitHub Bot logged work on BEAM-8554: Author: ASF GitHub Bot Created on: 06/Nov/19 22:23 Start Date: 06/Nov/19 22:23 Worklog Time Spent: 10m Work Description: stevekoonce commented on pull request #10013: [BEAM-8554] Use WorkItemCommitRequest protobuf fields to signal that … URL: https://github.com/apache/beam/pull/10013#discussion_r343360096 ## File path: runners/google-cloud-dataflow-java/worker/windmill/src/main/proto/windmill.proto ## @@ -290,12 +293,19 @@ message WorkItemCommitRequest { optional SourceState source_state_updates = 12; optional int64 source_watermark = 13 [default=-0x8000]; optional int64 source_backlog_bytes = 17 [default=-1]; + optional int64 source_bytes_processed = 22 [default = 0]; Review comment: Done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339612) Time Spent: 1h (was: 50m) > Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to > be broken up > - > > Key: BEAM-8554 > URL: https://issues.apache.org/jira/browse/BEAM-8554 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Steve Koonce >Priority: Minor > Time Spent: 1h > Remaining Estimate: 0h > > +Background:+ > When a WorkItemCommitRequest is generated that's bigger than the permitted > size (> ~180 MB), a KeyCommitTooLargeException is logged (_not thrown_) and > the request is still sent to the service. The service rejects the commit, > but breaks up input messages that were bundled together and adds them to new, > smaller work items that will later be pulled and re-tried - likely without > generating another commit that is too large. > When a WorkItemCommitRequest is generated that's too large to be sent back to > the service (> 2 GB), a KeyCommitTooLargeException is thrown and nothing is > sent back to the service. > > +Proposed Improvement+ > In both cases, prevent the doomed, large commit item from being sent back to > the service. Instead send flags in the commit request signaling that the > current work item led to a commit that is too large and the work item should > be broken up. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8554) Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to be broken up
[ https://issues.apache.org/jira/browse/BEAM-8554?focusedWorklogId=339613&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339613 ] ASF GitHub Bot logged work on BEAM-8554: Author: ASF GitHub Bot Created on: 06/Nov/19 22:23 Start Date: 06/Nov/19 22:23 Worklog Time Spent: 10m Work Description: stevekoonce commented on pull request #10013: [BEAM-8554] Use WorkItemCommitRequest protobuf fields to signal that … URL: https://github.com/apache/beam/pull/10013#discussion_r343360142 ## File path: runners/google-cloud-dataflow-java/worker/windmill/src/main/proto/windmill.proto ## @@ -290,12 +293,19 @@ message WorkItemCommitRequest { optional SourceState source_state_updates = 12; optional int64 source_watermark = 13 [default=-0x8000]; optional int64 source_backlog_bytes = 17 [default=-1]; + optional int64 source_bytes_processed = 22 [default = 0]; + repeated WatermarkHold watermark_holds = 14; + repeated int64 finalize_ids = 19 [packed = true]; + + optional int64 testonly_fake_clock_time_usec = 23; Review comment: Done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339613) Time Spent: 1h 10m (was: 1h) > Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to > be broken up > - > > Key: BEAM-8554 > URL: https://issues.apache.org/jira/browse/BEAM-8554 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Steve Koonce >Priority: Minor > Time Spent: 1h 10m > Remaining Estimate: 0h > > +Background:+ > When a WorkItemCommitRequest is generated that's bigger than the permitted > size (> ~180 MB), a KeyCommitTooLargeException is logged (_not thrown_) and > the request is still sent to the service. The service rejects the commit, > but breaks up input messages that were bundled together and adds them to new, > smaller work items that will later be pulled and re-tried - likely without > generating another commit that is too large. > When a WorkItemCommitRequest is generated that's too large to be sent back to > the service (> 2 GB), a KeyCommitTooLargeException is thrown and nothing is > sent back to the service. > > +Proposed Improvement+ > In both cases, prevent the doomed, large commit item from being sent back to > the service. Instead send flags in the commit request signaling that the > current work item led to a commit that is too large and the work item should > be broken up. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8573) @SplitRestriction's documented signature is incorrect
[ https://issues.apache.org/jira/browse/BEAM-8573?focusedWorklogId=339603&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339603 ] ASF GitHub Bot logged work on BEAM-8573: Author: ASF GitHub Bot Created on: 06/Nov/19 21:55 Start Date: 06/Nov/19 21:55 Worklog Time Spent: 10m Work Description: jagthebeetle commented on pull request #10016: [BEAM-8573] Updates documented signature for @SplitRestriction URL: https://github.com/apache/beam/pull/10016 Updating the documentation for DoFn.SplitRestriction. The signature does not match the compiler-enforced signature actually required when writing a Splittable DoFn. R: @kennknowles Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [X] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [X] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [X] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier). [Post-Commit Tests Status badge table omitted: empty build-badge links to builds.apache.org Jenkins jobs for the Go, Java, and Python post-commits; the message is truncated in the archive.]
[jira] [Work logged] (BEAM-8382) Add polling interval to KinesisIO.Read
[ https://issues.apache.org/jira/browse/BEAM-8382?focusedWorklogId=339601&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339601 ] ASF GitHub Bot logged work on BEAM-8382: Author: ASF GitHub Bot Created on: 06/Nov/19 21:44 Start Date: 06/Nov/19 21:44 Worklog Time Spent: 10m Work Description: cmachgodaddy commented on issue #9765: [WIP][BEAM-8382] Add rate limit policy to KinesisIO.Read URL: https://github.com/apache/beam/pull/9765#issuecomment-550514005 > @cmachgodaddy > > 1. Good point, but we use `AmazonKinesis` as a client for Kinesis. Can we leverage `RetryPolicy` in this case? > 2. I believe the last one should win but it would make sense to add more checks to avoid an ambiguity. --> #1 Yes, every AWS client (Dynamo, Sns, Sqs, Kinesis, ...) can take in ClientConfiguration as an argument, and we can set any configuration with this object. --> #2 Of course, we can add a ton of checkers, but users will be confused about which one of the withRateLimitXXXs to use. I would rather have just one withRateLimitXXX and let the user pass in an Enum. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339601) Time Spent: 8h (was: 7h 50m) > Add polling interval to KinesisIO.Read > -- > > Key: BEAM-8382 > URL: https://issues.apache.org/jira/browse/BEAM-8382 > Project: Beam > Issue Type: Improvement > Components: io-java-kinesis >Affects Versions: 2.13.0, 2.14.0, 2.15.0 >Reporter: Jonothan Farr >Assignee: Jonothan Farr >Priority: Major > Time Spent: 8h > Remaining Estimate: 0h > > With the current implementation we are observing Kinesis throttling due to > ReadProvisionedThroughputExceeded on the order of hundreds of times per > second, regardless of the actual Kinesis throughput. This is because the > ShardReadersPool readLoop() method is polling getRecords() as fast as > possible. > From the KDS documentation: > {quote}Each shard can support up to five read transactions per second. > {quote} > and > {quote}For best results, sleep for at least 1 second (1,000 milliseconds) > between calls to getRecords to avoid exceeding the limit on getRecords > frequency. > {quote} > [https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html] > [https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8382) Add polling interval to KinesisIO.Read
[ https://issues.apache.org/jira/browse/BEAM-8382?focusedWorklogId=339602&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339602 ] ASF GitHub Bot logged work on BEAM-8382: Author: ASF GitHub Bot Created on: 06/Nov/19 21:44 Start Date: 06/Nov/19 21:44 Worklog Time Spent: 10m Work Description: cmachgodaddy commented on issue #9765: [WIP][BEAM-8382] Add rate limit policy to KinesisIO.Read URL: https://github.com/apache/beam/pull/9765#issuecomment-550514005 > 1. Good point, but we use `AmazonKinesis` as a client for Kinesis. Can we leverage `RetryPolicy` in this case? > 2. I believe the last one should win but it would make sense to add more checks to avoid an ambiguity. --> #1 Yes, every AWS client (Dynamo, Sns, Sqs, Kinesis, ...) can take in ClientConfiguration as an argument, and we can set any configuration with this object. --> #2 Of course, we can add a ton of checkers, but users will be confused about which one of the withRateLimitXXXs to use. I would rather have just one withRateLimitXXX and let the user pass in an Enum. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339602) Time Spent: 8h 10m (was: 8h) > Add polling interval to KinesisIO.Read > -- > > Key: BEAM-8382 > URL: https://issues.apache.org/jira/browse/BEAM-8382 > Project: Beam > Issue Type: Improvement > Components: io-java-kinesis >Affects Versions: 2.13.0, 2.14.0, 2.15.0 >Reporter: Jonothan Farr >Assignee: Jonothan Farr >Priority: Major > Time Spent: 8h 10m > Remaining Estimate: 0h > > With the current implementation we are observing Kinesis throttling due to > ReadProvisionedThroughputExceeded on the order of hundreds of times per > second, regardless of the actual Kinesis throughput. This is because the > ShardReadersPool readLoop() method is polling getRecords() as fast as > possible. > From the KDS documentation: > {quote}Each shard can support up to five read transactions per second. > {quote} > and > {quote}For best results, sleep for at least 1 second (1,000 milliseconds) > between calls to getRecords to avoid exceeding the limit on getRecords > frequency. > {quote} > [https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html] > [https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html] -- This message was sent by Atlassian Jira (v8.3.4#803005)
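As a concrete reading of the KDS guidance quoted in the issue, a fixed-delay policy for the shard read loop might look like the sketch below. The class and hook names are hypothetical; the exact API shape (the withRateLimitXXX methods) is precisely what the PR discussion above is still settling.

{code:java}
import java.util.concurrent.TimeUnit;

// Hedged sketch of a fixed-delay rate limit policy for ShardReadersPool's
// read loop; the names here are assumptions, not KinesisIO's actual API.
class FixedDelayRateLimitPolicy {
  private final long delayMillis;

  FixedDelayRateLimitPolicy(long delayMillis) {
    this.delayMillis = delayMillis;
  }

  // Called by the read loop after each getRecords() call, so each shard is
  // polled at most once per interval (>= 1,000 ms per the AWS guidance).
  void onBatchFetched() throws InterruptedException {
    TimeUnit.MILLISECONDS.sleep(delayMillis);
  }
}
{code}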
[jira] [Assigned] (BEAM-8338) Support ES 7.x for ElasticsearchIO
[ https://issues.apache.org/jira/browse/BEAM-8338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Chen reassigned BEAM-8338: --- Assignee: Jing Chen > Support ES 7.x for ElasticsearchIO > -- > > Key: BEAM-8338 > URL: https://issues.apache.org/jira/browse/BEAM-8338 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch >Reporter: Michal Brunát >Assignee: Jing Chen >Priority: Major > > Elasticsearch has released 7.4 but ElasticsearchIO only supports 2.x, 5.x, and 6.x. > We should support ES 7.x for ElasticsearchIO. > [https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html] > > [https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8452) TriggerLoadJobs.process in bigquery_file_loads schema is type str
[ https://issues.apache.org/jira/browse/BEAM-8452?focusedWorklogId=339594&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339594 ] ASF GitHub Bot logged work on BEAM-8452: Author: ASF GitHub Bot Created on: 06/Nov/19 21:15 Start Date: 06/Nov/19 21:15 Worklog Time Spent: 10m Work Description: noah-goodrich commented on pull request #1: BEAM-8452 - TriggerLoadJobs.process in bigquery_file_loads schema is type str URL: https://github.com/apache/beam/pull/1#discussion_r343330994 ## File path: sdks/python/apache_beam/io/gcp/bigquery_file_loads.py ## @@ -387,6 +387,14 @@ def process(self, element, load_job_name_prefix, *schema_side_inputs): else: schema = self.schema +import unicode +import json + +if isinstance(schema, (str, unicode)): + schema = bigquery_tools.parse_table_schema_from_json(schema) +elif isinstance(schema, dict): Review comment: Apologies if the issue as described was not clear. Basically - I was trying to create a dataflow template, which requires using a ValueProvider argument. I do not believe it is possible to pass a serialized object as the argument in this case. I'm actually borrowing the logic from here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L757 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339594) Time Spent: 1h 20m (was: 1h 10m) > TriggerLoadJobs.process in bigquery_file_loads schema is type str > - > > Key: BEAM-8452 > URL: https://issues.apache.org/jira/browse/BEAM-8452 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.15.0, 2.16.0 >Reporter: Noah Goodrich >Assignee: Noah Goodrich >Priority: Major > Time Spent: 1h 20m > Remaining Estimate: 0h > > I've found a first issue with the BigQueryFileLoads Transform and the type > of the schema parameter. > {code:java} > Triggering job > beam_load_2019_10_11_140829_19_157670e4d458f0ff578fbe971a91b30a_1570802915 to > load data to BigQuery table <TableReference datasetId: 'pyr_monat_dev' > projectId: 'icentris-ml-dev' > tableId: 'tree_user_types'>. Schema: {"fields": [{"name": "id", "type": > "INTEGER", "mode": "required"}, {"name": "description", "type": "STRING", > "mode": "nullable"}]}. 
Additional parameters: {} > Retry with exponential backoff: waiting for 4.875033410381894 seconds before > retrying _insert_load_job because we caught exception: > apitools.base.protorpclite.messages.ValidationError: Expected type <class > 'apache_beam.io.gcp.internal.clients.bigquery.bigquery_v2_messages.TableSchema'> > for field schema, found {"fields": [{"name": "id", "type": "INTEGER", > "mode": "required"}, {"name": "description", "type": "STRING", > "mode": "nullable"}]} (type <class 'str'>) > Traceback for above exception (most recent call last): > File "/opt/conda/lib/python3.7/site-packages/apache_beam/utils/retry.py", > line 206, in wrapper > return fun(*args, **kwargs) > File > "/opt/conda/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery_tools.py", > line 344, in _insert_load_job > **additional_load_parameters > File > "/opt/conda/lib/python3.7/site-packages/apitools/base/protorpclite/messages.py", > line 791, in __init__ > setattr(self, name, value) > File > "/opt/conda/lib/python3.7/site-packages/apitools/base/protorpclite/messages.py", > line 973, in __setattr__ > object.__setattr__(self, name, value) > File > "/opt/conda/lib/python3.7/site-packages/apitools/base/protorpclite/messages.py", > line 1652, in __set__ > super(MessageField, self).__set__(message_instance, value) > File > "/opt/conda/lib/python3.7/site-packages/apitools/base/protorpclite/messages.py", > line 1293, in __set__ > value = self.validate(value) > File > "/opt/conda/lib/python3.7/site-packages/apitools/base/protorpclite/messages.py", > line 1400, in validate > return self.__validate(value, self.validate_element) > File > "/opt/conda/lib/python3.7/site-packages/apitools/base/protorpclite/messages.py", > line 1358, in __validate > return validate_element(value) > File > "/opt/conda/lib/python3.7/site-packages/apitools/base/protorpclite/messages.py", > line 1340, in validate_element > (self.type, name, value, type(value))) > > {code} > > The triggering code looks like this: > > options.view_as(D
[jira] [Assigned] (BEAM-8573) @SplitRestriction's documented signature is incorrect
[ https://issues.apache.org/jira/browse/BEAM-8573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía reassigned BEAM-8573: -- Assignee: Jonathan Alvarez-Gutierrez > @SplitRestriction's documented signature is incorrect > - > > Key: BEAM-8573 > URL: https://issues.apache.org/jira/browse/BEAM-8573 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Affects Versions: 2.15.0 >Reporter: Jonathan Alvarez-Gutierrez >Assignee: Jonathan Alvarez-Gutierrez >Priority: Minor > Original Estimate: 48h > Remaining Estimate: 48h > > The documented signature for DoFn.SplitRestriction is > {code:java} > List<RestrictionT> splitRestriction(InputT element, RestrictionT > restriction);{code} > This fails to compile with, for example: > {{@SplitRestriction split(long, OffsetRange): Must return void}} > It looks like the correct signature is: > {code:java} > void splitRestriction(InputT element, RestrictionT restriction, > OutputReceiver<RestrictionT> restriction);{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8573) @SplitRestriction's documented signature is incorrect
[ https://issues.apache.org/jira/browse/BEAM-8573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-8573: --- Status: Open (was: Triage Needed) > @SplitRestriction's documented signature is incorrect > - > > Key: BEAM-8573 > URL: https://issues.apache.org/jira/browse/BEAM-8573 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Affects Versions: 2.15.0 >Reporter: Jonathan Alvarez-Gutierrez >Priority: Minor > Original Estimate: 48h > Remaining Estimate: 48h > > The documented signature for DoFn.SplitRestriction is > {code:java} > List<RestrictionT> splitRestriction(InputT element, RestrictionT > restriction);{code} > This fails to compile with, for example: > {{@SplitRestriction split(long, OffsetRange): Must return void}} > It looks like the correct signature is: > {code:java} > void splitRestriction(InputT element, RestrictionT restriction, > OutputReceiver<RestrictionT> restriction);{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
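For reference, a minimal example of the compiler-enforced shape described in the issue. Using {{OffsetRange}} as the restriction type is an assumption for illustration, and the split strategy is deliberately naive.

{code:java}
import org.apache.beam.sdk.io.range.OffsetRange;
import org.apache.beam.sdk.transforms.DoFn;

class SplitExample extends DoFn<Long, Long> {
  // Returns void and emits sub-restrictions through the OutputReceiver,
  // matching the signature the annotation checker actually enforces.
  @SplitRestriction
  public void splitRestriction(
      Long element, OffsetRange restriction, OutputReceiver<OffsetRange> receiver) {
    long size = restriction.getTo() - restriction.getFrom();
    if (size < 2) {
      receiver.output(restriction); // too small to split further
      return;
    }
    long mid = restriction.getFrom() + size / 2;
    receiver.output(new OffsetRange(restriction.getFrom(), mid));
    receiver.output(new OffsetRange(mid, restriction.getTo()));
  }
}
{code}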
[jira] [Commented] (BEAM-8572) tox environment: assert on Cython source file presence
[ https://issues.apache.org/jira/browse/BEAM-8572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968716#comment-16968716 ] Udi Meiri commented on BEAM-8572: - Ideas: 1a. Migrate precommit unit testing to pytest, mark Cython-requiring tests as such (pytest.mark.cython), and only run them when 1b. Same migration, but add a flag to conftest.py that makes it verify that cythonize has been run. (utils.check_compiled('apache_beam.coders')?) > tox environment: assert on Cython source file presence > -- > > Key: BEAM-8572 > URL: https://issues.apache.org/jira/browse/BEAM-8572 > Project: Beam > Issue Type: Bug > Components: sdk-py-core, testing >Reporter: Udi Meiri >Assignee: Udi Meiri >Priority: Major > > Add an assertion somewhere that checks if Cythonized files are present in the > sdist tarball in use. That is, for "tox -e py27" assert that these files are > not present, and for "tox -e py27-cython" assert that they are present. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8554) Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to be broken up
[ https://issues.apache.org/jira/browse/BEAM-8554?focusedWorklogId=339593&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339593 ] ASF GitHub Bot logged work on BEAM-8554: Author: ASF GitHub Bot Created on: 06/Nov/19 21:12 Start Date: 06/Nov/19 21:12 Worklog Time Spent: 10m Work Description: scwhittle commented on pull request #10013: [BEAM-8554] Use WorkItemCommitRequest protobuf fields to signal that … URL: https://github.com/apache/beam/pull/10013#discussion_r343328612 ## File path: runners/google-cloud-dataflow-java/worker/src/test/java/org/apache/beam/runners/dataflow/worker/StreamingDataflowWorkerTest.java ## @@ -949,63 +949,14 @@ public void testKeyCommitTooLargeException() throws Exception { assertEquals(2, result.size()); assertEquals(makeExpectedOutput(2, 0, "key", "key").build(), result.get(2L)); assertTrue(result.containsKey(1L)); -assertEquals("large_key", result.get(1L).getKey().toStringUtf8()); -assertTrue(result.get(1L).getSerializedSize() > 1000); -// Spam worker updates a few times. -int maxTries = 10; -while (--maxTries > 0) { - worker.reportPeriodicWorkerUpdates(); - Uninterruptibles.sleepUninterruptibly(1000, TimeUnit.MILLISECONDS); -} +WorkItemCommitRequest largeCommit = result.get(1L); +assertEquals("large_key", largeCommit.getKey().toStringUtf8()); -// We should see an exception reported for the large commit but not the small one. -ArgumentCaptor workItemStatusCaptor = -ArgumentCaptor.forClass(WorkItemStatus.class); -verify(mockWorkUnitClient, atLeast(2)).reportWorkItemStatus(workItemStatusCaptor.capture()); -List capturedStatuses = workItemStatusCaptor.getAllValues(); -boolean foundErrors = false; -for (WorkItemStatus status : capturedStatuses) { - if (!status.getErrors().isEmpty()) { -assertFalse(foundErrors); -foundErrors = true; -String errorMessage = status.getErrors().get(0).getMessage(); -assertThat(errorMessage, Matchers.containsString("KeyCommitTooLargeException")); - } -} -assertTrue(foundErrors); - } - - @Test - public void testKeyCommitTooLargeException_StreamingEngine() throws Exception { -KvCoder kvCoder = KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()); - -List instructions = -Arrays.asList( -makeSourceInstruction(kvCoder), -makeDoFnInstruction(new LargeCommitFn(), 0, kvCoder), -makeSinkInstruction(kvCoder, 1)); - -FakeWindmillServer server = new FakeWindmillServer(errorCollector); -server.setExpectedExceptionCount(1); - -StreamingDataflowWorkerOptions options = -createTestingPipelineOptions(server, "--experiments=enable_streaming_engine"); -StreamingDataflowWorker worker = makeWorker(instructions, options, true /* publishCounters */); -worker.setMaxWorkItemCommitBytes(1000); -worker.start(); - -server.addWorkToOffer(makeInput(1, 0, "large_key")); -server.addWorkToOffer(makeInput(2, 0, "key")); -server.waitForEmptyWorkQueue(); - -Map result = server.waitForAndGetCommits(1); - -assertEquals(2, result.size()); -assertEquals(makeExpectedOutput(2, 0, "key", "key").build(), result.get(2L)); -assertTrue(result.containsKey(1L)); -assertEquals("large_key", result.get(1L).getKey().toStringUtf8()); -assertTrue(result.get(1L).getSerializedSize() > 1000); +// The large commit should have its flags set marking it for truncation +assertTrue(largeCommit.getExceedsMaxWorkItemCommitBytes()); +assertTrue(largeCommit.getSerializedSize() < 100); Review comment: verify the timers and output messages repeated fields are empty This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339593) Time Spent: 50m (was: 40m) > Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to > be broken up > - > > Key: BEAM-8554 > URL: https://issues.apache.org/jira/browse/BEAM-8554 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Steve Koonce >Priority: Minor > Time Spent: 50m > Remaining Estimate: 0h > > +Background:+ > When a WorkItemCommitRequest is generated that's bigger than the permitted > size (> ~180 MB), a KeyCommitTooLargeException is logged (_not thrown_) and > the request
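The additional assertions the reviewer asks for could look like the sketch below; the repeated-field accessor names (getOutputMessagesCount, getOutputTimersCount) are assumptions about protobuf's generated API for the Windmill fields, not code from the PR.
{code:java}
// Hypothetical follow-up assertions for the truncated commit: the
// replacement request should carry no output messages and no timers.
assertEquals(0, largeCommit.getOutputMessagesCount());
assertEquals(0, largeCommit.getOutputTimersCount());
{code}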
[jira] [Work logged] (BEAM-8554) Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to be broken up
[ https://issues.apache.org/jira/browse/BEAM-8554?focusedWorklogId=339590&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339590 ] ASF GitHub Bot logged work on BEAM-8554: Author: ASF GitHub Bot Created on: 06/Nov/19 21:12 Start Date: 06/Nov/19 21:12 Worklog Time Spent: 10m Work Description: scwhittle commented on pull request #10013: [BEAM-8554] Use WorkItemCommitRequest protobuf fields to signal that … URL: https://github.com/apache/beam/pull/10013#discussion_r343329744 ## File path: runners/google-cloud-dataflow-java/worker/src/test/java/org/apache/beam/runners/dataflow/worker/StreamingDataflowWorkerTest.java ## @@ -949,63 +949,14 @@ public void testKeyCommitTooLargeException() throws Exception { assertEquals(2, result.size()); assertEquals(makeExpectedOutput(2, 0, "key", "key").build(), result.get(2L)); assertTrue(result.containsKey(1L)); -assertEquals("large_key", result.get(1L).getKey().toStringUtf8()); -assertTrue(result.get(1L).getSerializedSize() > 1000); -// Spam worker updates a few times. -int maxTries = 10; -while (--maxTries > 0) { - worker.reportPeriodicWorkerUpdates(); - Uninterruptibles.sleepUninterruptibly(1000, TimeUnit.MILLISECONDS); -} +WorkItemCommitRequest largeCommit = result.get(1L); +assertEquals("large_key", largeCommit.getKey().toStringUtf8()); Review comment: verify sharding_key, work_token, cache_token also (verified in other cases with makeExpectedOutput) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339590) > Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to > be broken up > - > > Key: BEAM-8554 > URL: https://issues.apache.org/jira/browse/BEAM-8554 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Steve Koonce >Priority: Minor > Time Spent: 40m > Remaining Estimate: 0h > > +Background:+ > When a WorkItemCommitRequest is generated that's bigger than the permitted > size (> ~180 MB), a KeyCommitTooLargeException is logged (_not thrown_) and > the request is still sent to the service. The service rejects the commit, > but breaks up input messages that were bundled together and adds them to new, > smaller work items that will later be pulled and re-tried - likely without > generating another commit that is too large. > When a WorkItemCommitRequest is generated that's too large to be sent back to > the service (> 2 GB), a KeyCommitTooLargeException is thrown and nothing is > sent back to the service. > > +Proposed Improvement+ > In both cases, prevent the doomed, large commit item from being sent back to > the service. Instead send flags in the commit request signaling that the > current work item led to a commit that is too large and the work item should > be broken up. -- This message was sent by Atlassian Jira (v8.3.4#803005)
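A sketch of the identity checks requested above; in the real test the expected values would come from the makeExpectedOutput(...) helpers, so the EXPECTED_* constants here are purely illustrative.
{code:java}
// Illustrative only: the truncated commit must still identify the work item
// it replaces. EXPECTED_* constants are hypothetical stand-ins.
assertEquals(EXPECTED_SHARDING_KEY, largeCommit.getShardingKey());
assertEquals(EXPECTED_WORK_TOKEN, largeCommit.getWorkToken());
assertEquals(EXPECTED_CACHE_TOKEN, largeCommit.getCacheToken());
{code}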
[jira] [Work logged] (BEAM-8554) Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to be broken up
[ https://issues.apache.org/jira/browse/BEAM-8554?focusedWorklogId=339592&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339592 ] ASF GitHub Bot logged work on BEAM-8554: Author: ASF GitHub Bot Created on: 06/Nov/19 21:12 Start Date: 06/Nov/19 21:12 Worklog Time Spent: 10m Work Description: scwhittle commented on pull request #10013: [BEAM-8554] Use WorkItemCommitRequest protobuf fields to signal that … URL: https://github.com/apache/beam/pull/10013#discussion_r343327998 ## File path: runners/google-cloud-dataflow-java/worker/windmill/src/main/proto/windmill.proto ## @@ -290,12 +293,19 @@ message WorkItemCommitRequest { optional SourceState source_state_updates = 12; optional int64 source_watermark = 13 [default=-0x8000]; optional int64 source_backlog_bytes = 17 [default=-1]; + optional int64 source_bytes_processed = 22 [default = 0]; + repeated WatermarkHold watermark_holds = 14; + repeated int64 finalize_ids = 19 [packed = true]; + + optional int64 testonly_fake_clock_time_usec = 23; Review comment: rm and instead have in reserved field list below This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339592) Time Spent: 50m (was: 40m) > Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to > be broken up > - > > Key: BEAM-8554 > URL: https://issues.apache.org/jira/browse/BEAM-8554 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Steve Koonce >Priority: Minor > Time Spent: 50m > Remaining Estimate: 0h > > +Background:+ > When a WorkItemCommitRequest is generated that's bigger than the permitted > size (> ~180 MB), a KeyCommitTooLargeException is logged (_not thrown_) and > the request is still sent to the service. The service rejects the commit, > but breaks up input messages that were bundled together and adds them to new, > smaller work items that will later be pulled and re-tried - likely without > generating another commit that is too large. > When a WorkItemCommitRequest is generated that's too large to be sent back to > the service (> 2 GB), a KeyCommitTooLargeException is thrown and nothing is > sent back to the service. > > +Proposed Improvement+ > In both cases, prevent the doomed, large commit item from being sent back to > the service. Instead send flags in the commit request signaling that the > current work item led to a commit that is too large and the work item should > be broken up. -- This message was sent by Atlassian Jira (v8.3.4#803005)
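For reference, the review suggestion above is the standard proto2 idiom for retiring a field: drop the declaration and reserve its tag number so it can never be reused with a different type. A minimal sketch (tag numbers taken from the diff, everything else illustrative):
{code}
message WorkItemCommitRequest {
  optional int64 source_bytes_processed = 22;
  repeated int64 finalize_ids = 19 [packed = true];

  // Tag 23 (testonly_fake_clock_time_usec) joins the already-reserved tag 6,
  // so neither number can be reassigned with a different type later.
  reserved 6, 23;
}
{code}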
[jira] [Work logged] (BEAM-8554) Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to be broken up
[ https://issues.apache.org/jira/browse/BEAM-8554?focusedWorklogId=339591&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339591 ] ASF GitHub Bot logged work on BEAM-8554: Author: ASF GitHub Bot Created on: 06/Nov/19 21:12 Start Date: 06/Nov/19 21:12 Worklog Time Spent: 10m Work Description: scwhittle commented on pull request #10013: [BEAM-8554] Use WorkItemCommitRequest protobuf fields to signal that … URL: https://github.com/apache/beam/pull/10013#discussion_r343328116 ## File path: runners/google-cloud-dataflow-java/worker/windmill/src/main/proto/windmill.proto ## @@ -290,12 +293,19 @@ message WorkItemCommitRequest { optional SourceState source_state_updates = 12; optional int64 source_watermark = 13 [default=-0x8000]; optional int64 source_backlog_bytes = 17 [default=-1]; + optional int64 source_bytes_processed = 22 [default = 0]; + repeated WatermarkHold watermark_holds = 14; + repeated int64 finalize_ids = 19 [packed = true]; + + optional int64 testonly_fake_clock_time_usec = 23; + // DEPRECATED repeated GlobalDataId global_data_id_requests = 9; reserved 6; + Review comment: rm blank line This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339591) > Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to > be broken up > - > > Key: BEAM-8554 > URL: https://issues.apache.org/jira/browse/BEAM-8554 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Steve Koonce >Priority: Minor > Time Spent: 40m > Remaining Estimate: 0h > > +Background:+ > When a WorkItemCommitRequest is generated that's bigger than the permitted > size (> ~180 MB), a KeyCommitTooLargeException is logged (_not thrown_) and > the request is still sent to the service. The service rejects the commit, > but breaks up input messages that were bundled together and adds them to new, > smaller work items that will later be pulled and re-tried - likely without > generating another commit that is too large. > When a WorkItemCommitRequest is generated that's too large to be sent back to > the service (> 2 GB), a KeyCommitTooLargeException is thrown and nothing is > sent back to the service. > > +Proposed Improvement+ > In both cases, prevent the doomed, large commit item from being sent back to > the service. Instead send flags in the commit request signaling that the > current work item led to a commit that is too large and the work item should > be broken up. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8554) Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to be broken up
[ https://issues.apache.org/jira/browse/BEAM-8554?focusedWorklogId=339589&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339589 ] ASF GitHub Bot logged work on BEAM-8554: Author: ASF GitHub Bot Created on: 06/Nov/19 21:12 Start Date: 06/Nov/19 21:12 Worklog Time Spent: 10m Work Description: scwhittle commented on pull request #10013: [BEAM-8554] Use WorkItemCommitRequest protobuf fields to signal that … URL: https://github.com/apache/beam/pull/10013#discussion_r343327746 ## File path: runners/google-cloud-dataflow-java/worker/windmill/src/main/proto/windmill.proto ## @@ -290,12 +293,19 @@ message WorkItemCommitRequest { optional SourceState source_state_updates = 12; optional int64 source_watermark = 13 [default=-0x8000]; optional int64 source_backlog_bytes = 17 [default=-1]; + optional int64 source_bytes_processed = 22 [default = 0]; Review comment: rm default=0, it's the default :) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339589) Time Spent: 40m (was: 0.5h) > Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to > be broken up > - > > Key: BEAM-8554 > URL: https://issues.apache.org/jira/browse/BEAM-8554 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Steve Koonce >Priority: Minor > Time Spent: 40m > Remaining Estimate: 0h > > +Background:+ > When a WorkItemCommitRequest is generated that's bigger than the permitted > size (> ~180 MB), a KeyCommitTooLargeException is logged (_not thrown_) and > the request is still sent to the service. The service rejects the commit, > but breaks up input messages that were bundled together and adds them to new, > smaller work items that will later be pulled and re-tried - likely without > generating another commit that is too large. > When a WorkItemCommitRequest is generated that's too large to be sent back to > the service (> 2 GB), a KeyCommitTooLargeException is thrown and nothing is > sent back to the service. > > +Proposed Improvement+ > In both cases, prevent the doomed, large commit item from being sent back to > the service. Instead send flags in the commit request signaling that the > current work item led to a commit that is too large and the work item should > be broken up. -- This message was sent by Atlassian Jira (v8.3.4#803005)
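The point of the comment above: proto2 numeric fields already default to 0, so the explicit clause is redundant.
{code}
message Example {
  // Behaves exactly like
  //   optional int64 source_bytes_processed = 22 [default = 0];
  // because 0 is the implicit default for proto2 int64 fields.
  optional int64 source_bytes_processed = 22;
}
{code}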
[jira] [Assigned] (BEAM-8561) Add ThriftIO to Support IO for Thrift Files
[ https://issues.apache.org/jira/browse/BEAM-8561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía reassigned BEAM-8561: -- Assignee: Chris Larsen > Add ThriftIO to Support IO for Thrift Files > --- > > Key: BEAM-8561 > URL: https://issues.apache.org/jira/browse/BEAM-8561 > Project: Beam > Issue Type: New Feature > Components: io-java-files >Reporter: Chris Larsen >Assignee: Chris Larsen >Priority: Minor > > Similar to AvroIO, it would be very useful to support reading and writing > to/from Thrift files with a native connector. > Functionality would include: > # read() - Reading from one or more Thrift files. > # write() - Writing to one or more Thrift files. -- This message was sent by Atlassian Jira (v8.3.4#803005)
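A sketch of what the requested connector's surface could look like, modeled on AvroIO; ThriftIO and every name below are hypothetical, since the connector does not exist yet.
{code:java}
// Hypothetical read/write usage, mirroring AvroIO's shape.
PCollection<MyThriftRecord> records =
    pipeline.apply(ThriftIO.read(MyThriftRecord.class).from("input/*.thrift"));
records.apply(ThriftIO.write(MyThriftRecord.class).to("output/part"));
{code}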
[jira] [Updated] (BEAM-8561) Add ThriftIO to Support IO for Thrift Files
[ https://issues.apache.org/jira/browse/BEAM-8561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-8561: --- Status: Open (was: Triage Needed) > Add ThriftIO to Support IO for Thrift Files > --- > > Key: BEAM-8561 > URL: https://issues.apache.org/jira/browse/BEAM-8561 > Project: Beam > Issue Type: New Feature > Components: io-java-files >Reporter: Chris Larsen >Priority: Minor > > Similar to AvroIO, it would be very useful to support reading and writing > to/from Thrift files with a native connector. > Functionality would include: > # read() - Reading from one or more Thrift files. > # write() - Writing to one or more Thrift files. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8572) tox environment: assert on Cython source file presence
[ https://issues.apache.org/jira/browse/BEAM-8572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968713#comment-16968713 ] Udi Meiri commented on BEAM-8572: - Looks like some Cython-requiring tests are always skipped. {code} $ curl -s https://builds.apache.org/job/beam_PreCommit_Python_Phrase/900/consoleText | grep fast_coders_test.FastCoders consoleText test_using_fast_impl (apache_beam.coders.fast_coders_test.FastCoders) ... SKIP: Cython is not installed test_using_fast_impl (apache_beam.coders.fast_coders_test.FastCoders) ... SKIP: Cython is not installed test_using_fast_impl (apache_beam.coders.fast_coders_test.FastCoders) ... SKIP: Cython is not installed test_using_fast_impl (apache_beam.coders.fast_coders_test.FastCoders) ... SKIP: Cython is not installed test_using_fast_impl (apache_beam.coders.fast_coders_test.FastCoders) ... SKIP: Cython is not installed test_using_fast_impl (apache_beam.coders.fast_coders_test.FastCoders) ... SKIP: Cython is not installed test_using_fast_impl (apache_beam.coders.fast_coders_test.FastCoders) ... SKIP: Cython is not installed test_using_fast_impl (apache_beam.coders.fast_coders_test.FastCoders) ... SKIP: Cython is not installed test_using_fast_impl (apache_beam.coders.fast_coders_test.FastCoders) ... SKIP: Cython is not installed test_using_fast_impl (apache_beam.coders.fast_coders_test.FastCoders) ... SKIP: Cython is not installed test_using_fast_impl (apache_beam.coders.fast_coders_test.FastCoders) ... SKIP: Cython is not installed test_using_fast_impl (apache_beam.coders.fast_coders_test.FastCoders) ... SKIP: Cython is not installed {code} > tox environment: assert on Cython source file presence > -- > > Key: BEAM-8572 > URL: https://issues.apache.org/jira/browse/BEAM-8572 > Project: Beam > Issue Type: Bug > Components: sdk-py-core, testing >Reporter: Udi Meiri >Assignee: Udi Meiri >Priority: Major > > Add an assertion somewhere that checks if Cythonized files are present in the > sdist tarball in use. That is for "tox -e py27" assert that these files are > not present, for "tox -e py27-cython" assert that they are present. -- This message was sent by Atlassian Jira (v8.3.4#803005)
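One way to implement the requested assertion, sketched as a unit test; the TOX_ENV_NAME check and the choice of apache_beam.coders.stream as the compiled-module probe are assumptions, not Beam's actual fix.
{code}
# Sketch: fail when the Cython state of the installed SDK does not match
# what the tox environment expects. TOX_ENV_NAME is set by tox itself.
import os
import unittest


class CythonPresenceTest(unittest.TestCase):

  def test_cython_matches_tox_env(self):
    expect_cython = 'cython' in os.environ.get('TOX_ENV_NAME', '')
    try:
      # stream.pyx is only importable from a Cythonized build or sdist.
      from apache_beam.coders import stream  # pylint: disable=unused-import
      has_cython = True
    except ImportError:
      has_cython = False
    self.assertEqual(expect_cython, has_cython)
{code}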
[jira] [Updated] (BEAM-8572) tox environment: assert on Cython source file presence
[ https://issues.apache.org/jira/browse/BEAM-8572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Udi Meiri updated BEAM-8572: Status: Open (was: Triage Needed) > tox environment: assert on Cython source file presence > -- > > Key: BEAM-8572 > URL: https://issues.apache.org/jira/browse/BEAM-8572 > Project: Beam > Issue Type: Bug > Components: sdk-py-core, testing >Reporter: Udi Meiri >Assignee: Udi Meiri >Priority: Major > > Add an assertion somewhere that checks if Cythonized files are present in the > sdist tarball in use. That is for "tox -e py27" assert that these files are > not present, for "tox -e py27-cython" assert that they are present. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8452) TriggerLoadJobs.process in bigquery_file_loads schema is type str
[ https://issues.apache.org/jira/browse/BEAM-8452?focusedWorklogId=339588&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339588 ] ASF GitHub Bot logged work on BEAM-8452: Author: ASF GitHub Bot Created on: 06/Nov/19 21:06 Start Date: 06/Nov/19 21:06 Worklog Time Spent: 10m Work Description: noah-goodrich commented on pull request #1: BEAM-8452 - TriggerLoadJobs.process in bigquery_file_loads schema is type str URL: https://github.com/apache/beam/pull/1#discussion_r343327032 ## File path: sdks/python/tox.ini ## @@ -20,6 +20,9 @@ envlist = py27,py35,py36,py37,py27-{gcp,cython,lint,lint3},py35-{gcp,cython},py36-{gcp,cython},py37-{gcp,cython,lint},docs toxworkdir = {toxinidir}/target/{env:ENV_NAME:.tox} +[flake8] +ignore = E111,E114,E121,E125,E127,E129,E226,E302,E41,E502,W503,W504 Review comment: Locally flake8 identified all of these as errors in the file I am trying to edit. Rather than refactor the file to match all of these rules, I thought it was simpler for myself and anyone else coming behind if flake8 knows that those rules aren't enforced. For example, a lot of the code was indented by multiples of 2 instead of 4, etc. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339588) Time Spent: 1h 10m (was: 1h) > TriggerLoadJobs.process in bigquery_file_loads schema is type str > - > > Key: BEAM-8452 > URL: https://issues.apache.org/jira/browse/BEAM-8452 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.15.0, 2.16.0 >Reporter: Noah Goodrich >Assignee: Noah Goodrich >Priority: Major > Time Spent: 1h 10m > Remaining Estimate: 0h > > I've found a first issue with the BigQueryFileLoads Transform and the type > of the schema parameter. > {code:java} > Triggering job > beam_load_2019_10_11_140829_19_157670e4d458f0ff578fbe971a91b30a_1570802915 to > load data to BigQuery table datasetId: 'pyr_monat_dev' > projectId: 'icentris-ml-dev' > tableId: 'tree_user_types'>.Schema: {"fields": [{"name": "id", "type": > "INTEGER", "mode": "required"}, {"name": "description", "type": "STRING", > "mode": "nullable"}]}. 
Additional parameters: {} > Retry with exponential backoff: waiting for 4.875033410381894 seconds before > retrying _insert_load_job because we caught exception: > apitools.base.protorpclite.messages.ValidationError: Expected type <class > 'apache_beam.io.gcp.internal.clients.bigquery.bigquery_v2_messages.TableSchema'> > for field schema, found {"fields": [{"name": "id", "type": "INTEGER", > "mode": "required"}, {"name": "description", "type" > : "STRING", "mode": "nullable"}]} (type <class 'str'>) > Traceback for above exception (most recent call last): > File "/opt/conda/lib/python3.7/site-packages/apache_beam/utils/retry.py", > line 206, in wrapper > return fun(*args, **kwargs) > File > "/opt/conda/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery_tools.py", > line 344, in _insert_load_job > **additional_load_parameters > File > "/opt/conda/lib/python3.7/site-packages/apitools/base/protorpclite/messages.py", > line 791, in __init__ > setattr(self, name, value) > File > "/opt/conda/lib/python3.7/site-packages/apitools/base/protorpclite/messages.py", > line 973, in __setattr__ > object.__setattr__(self, name, value) > File > "/opt/conda/lib/python3.7/site-packages/apitools/base/protorpclite/messages.py", > line 1652, in __set__ > super(MessageField, self).__set__(message_instance, value) > File > "/opt/conda/lib/python3.7/site-packages/apitools/base/protorpclite/messages.py", > line 1293, in __set__ > value = self.validate(value) > File > "/opt/conda/lib/python3.7/site-packages/apitools/base/protorpclite/messages.py", > line 1400, in validate > return self.__validate(value, self.validate_element) > File > "/opt/conda/lib/python3.7/site-packages/apitools/base/protorpclite/messages.py", > line 1358, in __validate > return validate_element(value) > File > "/opt/conda/lib/python3.7/site-packages/apitools/base/protorpclite/messages.py", > line 1340, in validate_element > (self.type, name, value, type(value))) > > {code} > > The triggering code looks like this: > > options.view_as(DebugOptions).experiments = ['use_beam_bq_sink'] > # Save main session state so pickled functions and classes > # defined
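The ValidationError above happens because a plain JSON string reaches a field that requires the generated TableSchema message. A hedged sketch of the conversion the fix has to perform somewhere before the load job is triggered (whether the PR uses exactly this helper is not shown in this thread):
{code}
# Convert the JSON schema string into the bigquery_v2_messages.TableSchema
# that the apitools client validates against, instead of passing a raw str.
from apache_beam.io.gcp.bigquery_tools import parse_table_schema_from_json

schema_str = ('{"fields": [{"name": "id", "type": "INTEGER", "mode": "required"}, '
              '{"name": "description", "type": "STRING", "mode": "nullable"}]}')
table_schema = parse_table_schema_from_json(schema_str)
{code}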
[jira] [Updated] (BEAM-8573) @SplitRestriction's documented signature is incorrect
[ https://issues.apache.org/jira/browse/BEAM-8573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Alvarez-Gutierrez updated BEAM-8573: - Description: The documented signature for DoFn.SplitRestriction is {code:java} List<RestrictionT> splitRestriction( InputT element, RestrictionT restriction);{code} This fails to compile with, for example: {{@SplitRestriction split(long, OffsetRange): Must return void}} It looks like the correct signature is: {code:java} void splitRestriction(InputT element, RestrictionT restriction, OutputReceiver<RestrictionT> receiver);{code} was: The documented signature for DoFn.SplitRestriction is {code:java} List<RestrictionT> splitRestriction( InputT element, RestrictionT restriction);{code} This fails to compile with: {{@SplitRestriction split(long, OffsetRange): Must return void}} It looks like the correct signature is: {code:java} void splitRestriction(InputT element, RestrictionT restriction, OutputReceiver<RestrictionT> receiver);{code} > @SplitRestriction's documented signature is incorrect > - > > Key: BEAM-8573 > URL: https://issues.apache.org/jira/browse/BEAM-8573 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Affects Versions: 2.15.0 >Reporter: Jonathan Alvarez-Gutierrez >Priority: Minor > Original Estimate: 48h > Remaining Estimate: 48h > > The documented signature for DoFn.SplitRestriction is > {code:java} > List<RestrictionT> splitRestriction( InputT element, RestrictionT > restriction);{code} > This fails to compile with, for example: > {{@SplitRestriction split(long, OffsetRange): Must return void}} > It looks like the correct signature is: > {code:java} > void splitRestriction(InputT element, RestrictionT restriction, > OutputReceiver<RestrictionT> receiver);{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
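For concreteness, a minimal splittable DoFn using the corrected signature; this is a sketch against the 2.15-era SDF API, and the split sizes passed to OffsetRange.split are arbitrary.
{code:java}
import org.apache.beam.sdk.io.range.OffsetRange;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;

// Emits 0..element-1, splitting its restriction into ~1000-offset chunks.
class EmitOffsetsFn extends DoFn<Long, Long> {

  @ProcessElement
  public void process(ProcessContext c, RestrictionTracker<OffsetRange, Long> tracker) {
    for (long i = tracker.currentRestriction().getFrom(); tracker.tryClaim(i); ++i) {
      c.output(i);
    }
  }

  @GetInitialRestriction
  public OffsetRange getInitialRestriction(Long element) {
    return new OffsetRange(0, element);
  }

  // The corrected shape: returns void and emits sub-restrictions through the
  // OutputReceiver instead of returning a List.
  @SplitRestriction
  public void splitRestriction(
      Long element, OffsetRange restriction, OutputReceiver<OffsetRange> receiver) {
    for (OffsetRange part : restriction.split(1000, 100)) {
      receiver.output(part);
    }
  }
}
{code}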
[jira] [Work logged] (BEAM-8554) Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to be broken up
[ https://issues.apache.org/jira/browse/BEAM-8554?focusedWorklogId=339584&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339584 ] ASF GitHub Bot logged work on BEAM-8554: Author: ASF GitHub Bot Created on: 06/Nov/19 20:58 Start Date: 06/Nov/19 20:58 Worklog Time Spent: 10m Work Description: stevekoonce commented on issue #10013: [BEAM-8554] Use WorkItemCommitRequest protobuf fields to signal that … URL: https://github.com/apache/beam/pull/10013#issuecomment-550497575 R: @scwhittle This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339584) Time Spent: 0.5h (was: 20m) > Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to > be broken up > - > > Key: BEAM-8554 > URL: https://issues.apache.org/jira/browse/BEAM-8554 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Steve Koonce >Priority: Minor > Time Spent: 0.5h > Remaining Estimate: 0h > > +Background:+ > When a WorkItemCommitRequest is generated that's bigger than the permitted > size (> ~180 MB), a KeyCommitTooLargeException is logged (_not thrown_) and > the request is still sent to the service. The service rejects the commit, > but breaks up input messages that were bundled together and adds them to new, > smaller work items that will later be pulled and re-tried - likely without > generating another commit that is too large. > When a WorkItemCommitRequest is generated that's too large to be sent back to > the service (> 2 GB), a KeyCommitTooLargeException is thrown and nothing is > sent back to the service. > > +Proposed Improvement+ > In both cases, prevent the doomed, large commit item from being sent back to > the service. Instead send flags in the commit request signaling that the > current work item led to a commit that is too large and the work item should > be broken up. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8554) Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to be broken up
[ https://issues.apache.org/jira/browse/BEAM-8554?focusedWorklogId=339583&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339583 ] ASF GitHub Bot logged work on BEAM-8554: Author: ASF GitHub Bot Created on: 06/Nov/19 20:58 Start Date: 06/Nov/19 20:58 Worklog Time Spent: 10m Work Description: stevekoonce commented on issue #10013: [BEAM-8554] Use WorkItemCommitRequest protobuf fields to signal that … URL: https://github.com/apache/beam/pull/10013#issuecomment-550497470 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339583) Time Spent: 20m (was: 10m) > Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to > be broken up > - > > Key: BEAM-8554 > URL: https://issues.apache.org/jira/browse/BEAM-8554 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Steve Koonce >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > +Background:+ > When a WorkItemCommitRequest is generated that's bigger than the permitted > size (> ~180 MB), a KeyCommitTooLargeException is logged (_not thrown_) and > the request is still sent to the service. The service rejects the commit, > but breaks up input messages that were bundled together and adds them to new, > smaller work items that will later be pulled and re-tried - likely without > generating another commit that is too large. > When a WorkItemCommitRequest is generated that's too large to be sent back to > the service (> 2 GB), a KeyCommitTooLargeException is thrown and nothing is > sent back to the service. > > +Proposed Improvement+ > In both cases, prevent the doomed, large commit item from being sent back to > the service. Instead send flags in the commit request signaling that the > current work item led to a commit that is too large and the work item should > be broken up. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-8573) @SplitRestriction's documented signature is incorrect
Jonathan Alvarez-Gutierrez created BEAM-8573: Summary: @SplitRestriction's documented signature is incorrect Key: BEAM-8573 URL: https://issues.apache.org/jira/browse/BEAM-8573 Project: Beam Issue Type: Improvement Components: sdk-java-core Affects Versions: 2.15.0 Reporter: Jonathan Alvarez-Gutierrez The documented signature for DoFn.SplitRestriction is {code:java} List<RestrictionT> splitRestriction( InputT element, RestrictionT restriction);{code} This fails to compile with: {{@SplitRestriction split(long, OffsetRange): Must return void}} It looks like the correct signature is: {code:java} void splitRestriction(InputT element, RestrictionT restriction, OutputReceiver<RestrictionT> receiver);{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8368) [Python] libprotobuf-generated exception when importing apache_beam
[ https://issues.apache.org/jira/browse/BEAM-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968671#comment-16968671 ] Brian Hulette commented on BEAM-8368: - Awesome, thank you! > [Python] libprotobuf-generated exception when importing apache_beam > --- > > Key: BEAM-8368 > URL: https://issues.apache.org/jira/browse/BEAM-8368 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.15.0, 2.17.0 >Reporter: Ubaier Bhat >Assignee: Brian Hulette >Priority: Blocker > Fix For: 2.17.0 > > Attachments: error_log.txt > > Time Spent: 4h 10m > Remaining Estimate: 0h > > Unable to import apache_beam after upgrading to macOS 10.15 (Catalina). > Cleared all the pipenvs but can't get it working again. > {code} > import apache_beam as beam > /Users/***/.local/share/virtualenvs/beam-etl-ims6DitU/lib/python3.7/site-packages/apache_beam/__init__.py:84: > UserWarning: Some syntactic constructs of Python 3 are not yet fully > supported by Apache Beam. > 'Some syntactic constructs of Python 3 are not yet fully supported by ' > [libprotobuf ERROR google/protobuf/descriptor_database.cc:58] File already > exists in database: > [libprotobuf FATAL google/protobuf/descriptor.cc:1370] CHECK failed: > GeneratedDatabase()->Add(encoded_file_descriptor, size): > libc++abi.dylib: terminating with uncaught exception of type > google::protobuf::FatalException: CHECK failed: > GeneratedDatabase()->Add(encoded_file_descriptor, size): > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8554) Use WorkItemCommitRequest protobuf fields to signal that a WorkItem needs to be broken up
[ https://issues.apache.org/jira/browse/BEAM-8554?focusedWorklogId=339570&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339570 ] ASF GitHub Bot logged work on BEAM-8554: Author: ASF GitHub Bot Created on: 06/Nov/19 20:07 Start Date: 06/Nov/19 20:07 Worklog Time Spent: 10m Work Description: stevekoonce commented on pull request #10013: [BEAM-8554] Use WorkItemCommitRequest protobuf fields to signal that … URL: https://github.com/apache/beam/pull/10013 …a WorkItem needs to be broken up This implements the improvement described in [BEAM-8554](https://issues.apache.org/jira/browse/BEAM-8554): when the serialized size of a WorkItemCommitRequest proto is larger than the maximum size, the commit request will be replaced by a request for a server-side 'truncation' which will cause the WorkItem itself to be broken up and, after reprocessing, result in multiple WorkItemCommitRequests that are each smaller and can be successfully submitted. I updated an existing unit test and removed a redundant one - the StreamingDataflowWorkerTest is already configured to run all tests with and without StreamingEngine and Windmill, so separate, otherwise-identical tests are not necessary.
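A sketch of the commit-path change the PR describes; the builder and flag names are inferred from the test diff earlier in this thread (getExceedsMaxWorkItemCommitBytes and the key/sharding/work-token fields), not copied from the PR itself.
{code:java}
// Illustrative only: replace an oversized commit with a minimal request that
// keeps the identifying fields and asks the service to break the item up.
WorkItemCommitRequest buildCommitRequest(WorkItemCommitRequest commit, long maxCommitBytes) {
  if (commit.getSerializedSize() <= maxCommitBytes) {
    return commit;
  }
  return WorkItemCommitRequest.newBuilder()
      .setKey(commit.getKey())
      .setShardingKey(commit.getShardingKey())
      .setWorkToken(commit.getWorkToken())
      .setCacheToken(commit.getCacheToken())
      .setExceedsMaxWorkItemCommitBytes(true)
      .build();
}
{code}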
[jira] [Work logged] (BEAM-8402) Create a class hierarchy to represent environments
[ https://issues.apache.org/jira/browse/BEAM-8402?focusedWorklogId=339569&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339569 ] ASF GitHub Bot logged work on BEAM-8402: Author: ASF GitHub Bot Created on: 06/Nov/19 20:06 Start Date: 06/Nov/19 20:06 Worklog Time Spent: 10m Work Description: violalyu commented on pull request #9811: [BEAM-8402] Create a class hierarchy to represent environments URL: https://github.com/apache/beam/pull/9811#discussion_r343302106 ## File path: sdks/python/apache_beam/transforms/environments.py ## @@ -0,0 +1,394 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +"""Environments concepts. + +For internal use only. No backwards compatibility guarantees.""" + +from __future__ import absolute_import + +import json + +from google.protobuf import message + +from apache_beam.portability import common_urns +from apache_beam.portability import python_urns +from apache_beam.portability.api import beam_runner_api_pb2 +from apache_beam.portability.api import endpoints_pb2 +from apache_beam.utils import proto_utils + +__all__ = ['Environment', + 'DockerEnvironment', 'ProcessEnvironment', 'ExternalEnvironment', + 'EmbeddedPythonEnvironment', 'EmbeddedPythonGrpcEnvironment', + 'SubprocessSDKEnvironment', 'RunnerAPIEnvironmentHolder'] + + +class Environment(object): + """Abstract base class for environments. + + Represents a type and configuration of environment. + Each type of Environment should have a unique urn. + """ Review comment: Updated! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339569) Time Spent: 2h 40m (was: 2.5h) > Create a class hierarchy to represent environments > -- > > Key: BEAM-8402 > URL: https://issues.apache.org/jira/browse/BEAM-8402 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chad Dombrova >Assignee: Chad Dombrova >Priority: Major > Time Spent: 2h 40m > Remaining Estimate: 0h > > As a first step towards making it possible to assign different environments > to sections of a pipeline, we first need to expose environment classes to the > pipeline API. Unlike PTransforms, PCollections, Coders, and Windowings, > environments exists solely in the portability framework as protobuf objects. > By creating a hierarchy of "native" classes that represent the various > environment types -- external, docker, process, etc -- users will be able to > instantiate these and assign them to parts of the pipeline. The assignment > portion will be covered in a follow-up issue/PR. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
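Based on the __all__ list in the diff above, usage of the new hierarchy would presumably look like the sketch below; the constructor keyword and the to_runner_api serialization hook are assumptions about the PR's final shape.
{code}
# Hypothetical usage of the new environment classes; parameter names are
# assumed, since the PR was still under review at this point.
from apache_beam.transforms import environments

docker_env = environments.DockerEnvironment(
    container_image='apache/beam_python3.7_sdk')

# Each subclass carries a unique urn and serializes itself into the
# portability proto (beam_runner_api_pb2.Environment).
env_proto = docker_env.to_runner_api(None)
{code}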
[jira] [Work logged] (BEAM-8402) Create a class hierarchy to represent environments
[ https://issues.apache.org/jira/browse/BEAM-8402?focusedWorklogId=339568&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339568 ] ASF GitHub Bot logged work on BEAM-8402: Author: ASF GitHub Bot Created on: 06/Nov/19 19:56 Start Date: 06/Nov/19 19:56 Worklog Time Spent: 10m Work Description: chadrik commented on pull request #9811: [BEAM-8402] Create a class hierarchy to represent environments URL: https://github.com/apache/beam/pull/9811#discussion_r343297664 ## File path: sdks/python/apache_beam/transforms/environments.py ## @@ -0,0 +1,394 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +"""Environments concepts. + +For internal use only. No backwards compatibility guarantees.""" + +from __future__ import absolute_import + +import json + +from google.protobuf import message + +from apache_beam.portability import common_urns +from apache_beam.portability import python_urns +from apache_beam.portability.api import beam_runner_api_pb2 +from apache_beam.portability.api import endpoints_pb2 +from apache_beam.utils import proto_utils + +__all__ = ['Environment', + 'DockerEnvironment', 'ProcessEnvironment', 'ExternalEnvironment', + 'EmbeddedPythonEnvironment', 'EmbeddedPythonGrpcEnvironment', + 'SubprocessSDKEnvironment', 'RunnerAPIEnvironmentHolder'] + + +class Environment(object): + """Abstract base class for environments. + + Represents a type and configuration of environment. + Each type of Environment should have a unique urn. + """ Review comment: let's add the internal warning here too. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339568) Time Spent: 2.5h (was: 2h 20m) > Create a class hierarchy to represent environments > -- > > Key: BEAM-8402 > URL: https://issues.apache.org/jira/browse/BEAM-8402 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chad Dombrova >Assignee: Chad Dombrova >Priority: Major > Time Spent: 2.5h > Remaining Estimate: 0h > > As a first step towards making it possible to assign different environments > to sections of a pipeline, we first need to expose environment classes to the > pipeline API. Unlike PTransforms, PCollections, Coders, and Windowings, > environments exists solely in the portability framework as protobuf objects. > By creating a hierarchy of "native" classes that represent the various > environment types -- external, docker, process, etc -- users will be able to > instantiate these and assign them to parts of the pipeline. The assignment > portion will be covered in a follow-up issue/PR. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8402) Create a class hierarchy to represent environments
[ https://issues.apache.org/jira/browse/BEAM-8402?focusedWorklogId=339566&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339566 ] ASF GitHub Bot logged work on BEAM-8402: Author: ASF GitHub Bot Created on: 06/Nov/19 19:54 Start Date: 06/Nov/19 19:54 Worklog Time Spent: 10m Work Description: violalyu commented on issue #9811: [BEAM-8402] Create a class hierarchy to represent environments URL: https://github.com/apache/beam/pull/9811#issuecomment-550474432 @mxm updated! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339566) Time Spent: 2h 20m (was: 2h 10m) > Create a class hierarchy to represent environments > -- > > Key: BEAM-8402 > URL: https://issues.apache.org/jira/browse/BEAM-8402 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chad Dombrova >Assignee: Chad Dombrova >Priority: Major > Time Spent: 2h 20m > Remaining Estimate: 0h > > As a first step towards making it possible to assign different environments > to sections of a pipeline, we first need to expose environment classes to the > pipeline API. Unlike PTransforms, PCollections, Coders, and Windowings, > environments exists solely in the portability framework as protobuf objects. > By creating a hierarchy of "native" classes that represent the various > environment types -- external, docker, process, etc -- users will be able to > instantiate these and assign them to parts of the pipeline. The assignment > portion will be covered in a follow-up issue/PR. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6756) Support lazy iterables in schemas
[ https://issues.apache.org/jira/browse/BEAM-6756?focusedWorklogId=339565&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339565 ] ASF GitHub Bot logged work on BEAM-6756: Author: ASF GitHub Bot Created on: 06/Nov/19 19:54 Start Date: 06/Nov/19 19:54 Worklog Time Spent: 10m Work Description: reuvenlax commented on issue #10003: [BEAM-6756] Create Iterable type for Schema URL: https://github.com/apache/beam/pull/10003#issuecomment-550474375 run sql postcommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339565) Time Spent: 1h (was: 50m) > Support lazy iterables in schemas > - > > Key: BEAM-6756 > URL: https://issues.apache.org/jira/browse/BEAM-6756 > Project: Beam > Issue Type: Sub-task > Components: sdk-java-core >Reporter: Reuven Lax >Assignee: Reuven Lax >Priority: Major > Time Spent: 1h > Remaining Estimate: 0h > > The iterables returned by GroupByKey and CoGroupByKey are lazy; this allows a > runner to page data into memory if the full iterable is too large. We > currently don't support this in Schemas, so the Schema Group and CoGroup > transforms materialize all data into memory. We should add support for this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
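The laziness at issue: the Iterable a runner hands to a GroupByKey consumer may page values in on demand, so schema transforms should not force it into memory. An illustrative consumer (input is assumed to be a PCollection<KV<String, Long>>):
{code:java}
PCollection<KV<String, Iterable<Long>>> grouped = input.apply(GroupByKey.create());
grouped.apply(ParDo.of(new DoFn<KV<String, Iterable<Long>>, Long>() {
  @ProcessElement
  public void process(ProcessContext c) {
    long sum = 0;
    // Iterating streams values in; copying to e.g. an ArrayList here would
    // materialize the whole (possibly huge) iterable, which is what the
    // schema Group/CoGroup transforms currently do.
    for (long v : c.element().getValue()) {
      sum += v;
    }
    c.output(sum);
  }
}));
{code}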
[jira] [Work logged] (BEAM-8427) [SQL] Add support for MongoDB source
[ https://issues.apache.org/jira/browse/BEAM-8427?focusedWorklogId=339559&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339559 ] ASF GitHub Bot logged work on BEAM-8427: Author: ASF GitHub Bot Created on: 06/Nov/19 19:25 Start Date: 06/Nov/19 19:25 Worklog Time Spent: 10m Work Description: 11moon11 commented on issue #9892: [BEAM-8427] [SQL] buildIOWrite from MongoDb Table URL: https://github.com/apache/beam/pull/9892#issuecomment-550462776 R: @TheNeuralBit cc: @apilloud cc: @amaliujia This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339559) Time Spent: 4h 20m (was: 4h 10m) > [SQL] Add support for MongoDB source > > > Key: BEAM-8427 > URL: https://issues.apache.org/jira/browse/BEAM-8427 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Kirill Kozlov >Assignee: Kirill Kozlov >Priority: Major > Time Spent: 4h 20m > Remaining Estimate: 0h > > In progress: > * Create a MongoDB table and table provider. > * Implement buildIOReader > * Support primitive types > Still needs to be done: > * Implement buildIOWrite > * improve getTableStatistics -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8376) Add FirestoreIO connector to Java SDK
[ https://issues.apache.org/jira/browse/BEAM-8376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968627#comment-16968627 ] Jing Chen commented on BEAM-8376: - Chatted with [~chamikara] offline; there are ongoing efforts on this issue, so I am unassigning myself. > Add FirestoreIO connector to Java SDK > - > > Key: BEAM-8376 > URL: https://issues.apache.org/jira/browse/BEAM-8376 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp >Reporter: Stefan Djelekar >Priority: Major > > Motivation: > There is no Firestore connector for Java SDK at the moment. > Having it will enhance the integrations with database options on the Google > Cloud Platform. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8343) Add means for IO APIs to support predicate and/or project push-down when running SQL pipelines
[ https://issues.apache.org/jira/browse/BEAM-8343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kirill Kozlov updated BEAM-8343: Description: The objective is to create a universal way for Beam SQL IO APIs to support predicate/project push-down. A proposed way to achieve that is by introducing an interface responsible for identifying what portion(s) of a Calc can be moved down to IO layer. Also, adding following methods to a BeamSqlTable interface to pass necessary parameters to IO APIs: - BeamSqlTableFilter constructFilter(List<RexNode> filter) - ProjectSupport supportsProjects() - PCollection<Row> buildIOReader(PBegin begin, BeamSqlTableFilter filters, List<String> fieldNames) ProjectSupport is an enum with the following options: * NONE * WITHOUT_FIELD_REORDERING * WITH_FIELD_REORDERING Design doc [link|https://docs.google.com/document/d/1-ysD7U7qF3MAmSfkbXZO_5PLJBevAL9bktlLCerd_jE/edit?usp=sharing]. was: The objective is to create a universal way for Beam SQL IO APIs to support predicate/project push-down. A proposed way to achieve that is by introducing an interface responsible for identifying what portion(s) of a Calc can be moved down to IO layer. Also, adding following methods to a BeamSqlTable interface to pass necessary parameters to IO APIs: - BeamSqlTableFilter supportsFilter(RexProgram program, RexNode filter) - ProjectSupport supportsProjects() - PCollection<Row> buildIOReader(PBegin begin, BeamSqlTableFilter filters, List<String> fieldNames) * ProjectSupport is an enum with the following options: * NONE * WITHOUT_FIELD_REORDERING * WITH_FIELD_REORDERING Design doc [link|https://docs.google.com/document/d/1-ysD7U7qF3MAmSfkbXZO_5PLJBevAL9bktlLCerd_jE/edit?usp=sharing]. > Add means for IO APIs to support predicate and/or project push-down when > running SQL pipelines > -- > > Key: BEAM-8343 > URL: https://issues.apache.org/jira/browse/BEAM-8343 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Kirill Kozlov >Assignee: Kirill Kozlov >Priority: Major > Time Spent: 5h > Remaining Estimate: 0h > > The objective is to create a universal way for Beam SQL IO APIs to support > predicate/project push-down. > A proposed way to achieve that is by introducing an interface responsible > for identifying what portion(s) of a Calc can be moved down to IO layer. > Also, adding following methods to a BeamSqlTable interface to pass necessary > parameters to IO APIs: > - BeamSqlTableFilter constructFilter(List<RexNode> filter) > - ProjectSupport supportsProjects() > - PCollection<Row> buildIOReader(PBegin begin, BeamSqlTableFilter filters, > List<String> fieldNames) > > ProjectSupport is an enum with the following options: > * NONE > * WITHOUT_FIELD_REORDERING > * WITH_FIELD_REORDERING > > Design doc > [link|https://docs.google.com/document/d/1-ysD7U7qF3MAmSfkbXZO_5PLJBevAL9bktlLCerd_jE/edit?usp=sharing]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
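Rendered as code, the proposed additions look roughly like this, with the generic parameters (stripped by Jira's formatting) restored; a sketch of the proposal, not the merged interface.
{code:java}
public interface BeamSqlTable {
  // Which parts of a Calc's filter this IO can absorb.
  BeamSqlTableFilter constructFilter(List<RexNode> filter);

  // Whether, and how, this IO can handle field projection.
  ProjectSupport supportsProjects();

  // Read with the pushed-down filters and projected field names applied.
  PCollection<Row> buildIOReader(
      PBegin begin, BeamSqlTableFilter filters, List<String> fieldNames);
}

enum ProjectSupport { NONE, WITHOUT_FIELD_REORDERING, WITH_FIELD_REORDERING }
{code}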
[jira] [Assigned] (BEAM-8376) Add FirestoreIO connector to Java SDK
[ https://issues.apache.org/jira/browse/BEAM-8376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Chen reassigned BEAM-8376: --- Assignee: (was: Jing Chen) > Add FirestoreIO connector to Java SDK > - > > Key: BEAM-8376 > URL: https://issues.apache.org/jira/browse/BEAM-8376 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp >Reporter: Stefan Djelekar >Priority: Major > > Motivation: > There is no Firestore connector for Java SDK at the moment. > Having it will enhance the integrations with database options on the Google > Cloud Platform. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8343) Add means for IO APIs to support predicate and/or project push-down when running SQL pipelines
[ https://issues.apache.org/jira/browse/BEAM-8343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kirill Kozlov updated BEAM-8343: Description: The objective is to create a universal way for Beam SQL IO APIs to support predicate/project push-down. A proposed way to achieve that is by introducing an interface responsible for identifying what portion(s) of a Calc can be moved down to IO layer. Also, adding following methods to a BeamSqlTable interface to pass necessary parameters to IO APIs: - BeamSqlTableFilter supportsFilter(RexProgram program, RexNode filter) - ProjectSupport supportsProjects() - PCollection buildIOReader(PBegin begin, BeamSqlTableFilter filters, List fieldNames) * ProjectSupport is an enum with the following options: * NONE * WITHOUT_FIELD_REORDERING * WITH_FIELD_REORDERING Design doc [link|https://docs.google.com/document/d/1-ysD7U7qF3MAmSfkbXZO_5PLJBevAL9bktlLCerd_jE/edit?usp=sharing]. was: The objective is to create a universal way for Beam SQL IO APIs to support predicate/project push-down. A proposed way to achieve that is by introducing an interface responsible for identifying what portion(s) of a Calc can be moved down to IO layer. Also, adding following methods to a BeamSqlTable interface to pass necessary parameters to IO APIs: - BeamSqlTableFilter supportsFilter(RexProgram program, RexNode filter) - Boolean supportsProjects() - PCollection buildIOReader(PBegin begin, BeamSqlTableFilter filters, List fieldNames) Design doc [link|https://docs.google.com/document/d/1-ysD7U7qF3MAmSfkbXZO_5PLJBevAL9bktlLCerd_jE/edit?usp=sharing]. > Add means for IO APIs to support predicate and/or project push-down when > running SQL pipelines > -- > > Key: BEAM-8343 > URL: https://issues.apache.org/jira/browse/BEAM-8343 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Kirill Kozlov >Assignee: Kirill Kozlov >Priority: Major > Time Spent: 5h > Remaining Estimate: 0h > > The objective is to create a universal way for Beam SQL IO APIs to support > predicate/project push-down. > A proposed way to achieve that is by introducing an interface responsible > for identifying what portion(s) of a Calc can be moved down to IO layer. > Also, adding following methods to a BeamSqlTable interface to pass necessary > parameters to IO APIs: > - BeamSqlTableFilter supportsFilter(RexProgram program, RexNode filter) > - ProjectSupport supportsProjects() > - PCollection buildIOReader(PBegin begin, BeamSqlTableFilter filters, > List fieldNames) > > * ProjectSupport is an enum with the following options: > * NONE > * WITHOUT_FIELD_REORDERING > * WITH_FIELD_REORDERING > > Design doc > [link|https://docs.google.com/document/d/1-ysD7U7qF3MAmSfkbXZO_5PLJBevAL9bktlLCerd_jE/edit?usp=sharing]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6756) Support lazy iterables in schemas
[ https://issues.apache.org/jira/browse/BEAM-6756?focusedWorklogId=339556&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339556 ] ASF GitHub Bot logged work on BEAM-6756: Author: ASF GitHub Bot Created on: 06/Nov/19 19:17 Start Date: 06/Nov/19 19:17 Worklog Time Spent: 10m Work Description: reuvenlax commented on pull request #10003: [BEAM-6756] Create Iterable type for Schema URL: https://github.com/apache/beam/pull/10003#discussion_r343279150 ## File path: learning/katas/java/.idea/study_project.xml ## @@ -0,0 +1,3151 @@ + Review comment: No. Ignore that file - I will remove it from the PR. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 339556) Time Spent: 50m (was: 40m) > Support lazy iterables in schemas > - > > Key: BEAM-6756 > URL: https://issues.apache.org/jira/browse/BEAM-6756 > Project: Beam > Issue Type: Sub-task > Components: sdk-java-core >Reporter: Reuven Lax >Assignee: Reuven Lax >Priority: Major > Time Spent: 50m > Remaining Estimate: 0h > > The iterables returned by GroupByKey and CoGroupByKey are lazy; this allows a > runner to page data into memory if the full iterable is too large. We > currently don't support this in Schemas, so the Schema Group and CoGroup > transforms materialize all data into memory. We should add support for this. -- This message was sent by Atlassian Jira (v8.3.4#803005)