[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=277230&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-277230 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 16/Jul/19 05:45 Start Date: 16/Jul/19 05:45 Worklog Time Spent: 10m Work Description: amaliujia commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303734439 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamRuleSets.java ## @@ -103,6 +106,10 @@ // join rules JoinPushExpressionsRule.INSTANCE, + JoinCommuteRule.INSTANCE, Review comment: Because this PR implements join reordering, a useful test would be one in which a three-way join is reordered. Since join reordering is already measured in https://docs.google.com/document/d/1DM_bcfFbIoc_vEoqQxhC7AvHBUDVCAwToC8TYGukkII/edit#, wouldn't it be straightforward to add a similar test? Without such a test, how do we even know that join reordering is working? In terms of checking output plans, Flink has many tests of Calcite optimization rules (see [here](https://github.com/apache/flink/blob/master/flink-table/flink-table-planner-blink/src/test/scala/org/apache/flink/table/plan/rules/logical/RewriteMultiJoinConditionRuleTest.scala) and [here](https://github.com/apache/flink/tree/master/flink-table/flink-table-planner-blink/src/test/resources/org/apache/flink/table/plan)). Flink's practice has shown that verifying the output plan is deterministic and stable. 
The basic idea is: if you want to test an optimization, enable only the relevant rules in the test case (so the rules that are hit are known). Following Flink's approach, you can test rules as long as they can be disabled and enabled independently. In [BeamRuleSets.java](https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamRuleSets.java): ``` static final RuleSet BEAM_JOIN_REORDERING_RULES = RuleSets.ofList(JoinCommuteRule.INSTANCE, JoinAssociateRule.INSTANCE); ``` In JoinReorderingTest.java: ``` FrameworkConfig testConfig = createTestConfig(BEAM_JOIN_REORDERING_RULES); Planner testPlanner = Frameworks.getPlanner(testConfig); // set up input tables ... RelNode expectedPlan = PlanLoader.load(testCase.getClass()); verifyPlan(testPlanner.getPlan(sql), expectedPlan); ``` By doing so, if a relevant rule is disabled (e.g. `JoinCommuteRule.INSTANCE`), it will break the existing join reordering tests, which guards join ordering for us. It also demonstrates that this PR actually reorders joins. Because we are starting the effort to add more optimization rules to BeamSQL, Flink's testing practice is a great example that we can learn from and apply to BeamSQL to maintain the health of our codebase. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 277230) Time Spent: 7h 50m (was: 7h 40m) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 7h 50m > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
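The estimation scheme described in the issue (read the first few lines, then divide the total file length by the average sampled line length) can be sketched in Python. This is an illustrative sketch only, not Beam's implementation; `estimate_row_count` and its parameters are hypothetical names.

```python
import io

def estimate_row_count(file_bytes: bytes, sample_lines: int = 10) -> int:
    """Estimate the number of rows in a delimited text file by sampling.

    Reads up to `sample_lines` lines, computes their average length in
    bytes, and extrapolates over the total file size.
    """
    reader = io.BytesIO(file_bytes)
    sampled_lengths = []
    for _ in range(sample_lines):
        line = reader.readline()
        if not line:
            break
        sampled_lengths.append(len(line))
    if not sampled_lengths:
        return 0  # empty file: no rows to estimate
    avg_line_length = sum(sampled_lengths) / len(sampled_lengths)
    return round(len(file_bytes) / avg_line_length)
```

For a file of uniformly sized lines the estimate is exact; for variable-length rows it improves as the sample size grows.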
[jira] [Commented] (BEAM-6855) Side inputs are not supported when using the state API
[ https://issues.apache.org/jira/browse/BEAM-6855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885853#comment-16885853 ] Reuven Lax commented on BEAM-6855: -- [~kenn] it was never addressed. Can you confirm that the described solution is the correct one? > Side inputs are not supported when using the state API > -- > > Key: BEAM-6855 > URL: https://issues.apache.org/jira/browse/BEAM-6855 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Reuven Lax >Assignee: Shehzaad Nakhoda >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-6675) The JdbcIO sink should accept schemas
[ https://issues.apache.org/jira/browse/BEAM-6675?focusedWorklogId=277187&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-277187 ] ASF GitHub Bot logged work on BEAM-6675: Author: ASF GitHub Bot Created on: 16/Jul/19 04:27 Start Date: 16/Jul/19 04:27 Worklog Time Spent: 10m Work Description: amaliujia commented on issue #8962: [BEAM-6675] Generate JDBC statement and preparedStatementSetter automatically when schema is available URL: https://github.com/apache/beam/pull/8962#issuecomment-511659585 LGTM This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 277187) Time Spent: 7.5h (was: 7h 20m) > The JdbcIO sink should accept schemas > - > > Key: BEAM-6675 > URL: https://issues.apache.org/jira/browse/BEAM-6675 > Project: Beam > Issue Type: Sub-task > Components: io-java-jdbc >Reporter: Reuven Lax >Assignee: Shehzaad Nakhoda >Priority: Major > Time Spent: 7.5h > Remaining Estimate: 0h > > If the input has a schema, there should be a default mapping to a > PreparedStatement for writing based on that schema. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
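The default schema-based mapping discussed in this issue boils down to generating an INSERT statement with one placeholder per schema field, then binding row values positionally. A minimal sketch of the statement generation follows; Beam's actual implementation is in Java, and `insert_statement` is a hypothetical helper name.

```python
def insert_statement(table: str, field_names) -> str:
    """Build an INSERT statement with one '?' placeholder per schema field."""
    columns = ", ".join(field_names)
    placeholders = ", ".join("?" for _ in field_names)
    return f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
```

A PreparedStatementSetter then only needs to call `setObject(i + 1, row[i])` for each field, in schema order.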
[jira] [Commented] (BEAM-6855) Side inputs are not supported when using the state API
[ https://issues.apache.org/jira/browse/BEAM-6855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885830#comment-16885830 ] Kenneth Knowles commented on BEAM-6855: --- This was definitely an issue at one time. I thought someone had addressed this, but perhaps not. It would be great to confirm and to add ValidatesRunner tests. > Side inputs are not supported when using the state API > -- > > Key: BEAM-6855 > URL: https://issues.apache.org/jira/browse/BEAM-6855 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Reuven Lax >Assignee: Shehzaad Nakhoda >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7739) Add ValueState in Python sdk
[ https://issues.apache.org/jira/browse/BEAM-7739?focusedWorklogId=277173&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-277173 ] ASF GitHub Bot logged work on BEAM-7739: Author: ASF GitHub Bot Created on: 16/Jul/19 03:27 Start Date: 16/Jul/19 03:27 Worklog Time Spent: 10m Work Description: rakeshcusat commented on issue #9067: [BEAM-7739] Implement ValueState in Python SDK URL: https://github.com/apache/beam/pull/9067#issuecomment-511649980 @robertwb never mind, I found the discussion [here](https://lists.apache.org/thread.html/ccc0d548e440b63897b6784cd7896c266498df64c9c63ce6c52ae098@%3Cdev.beam.apache.org%3E) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 277173) Time Spent: 50m (was: 40m) > Add ValueState in Python sdk > > > Key: BEAM-7739 > URL: https://issues.apache.org/jira/browse/BEAM-7739 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Rakesh Kumar >Assignee: Rakesh Kumar >Priority: Major > Time Spent: 50m > Remaining Estimate: 0h > > Currently, ValueState is missing from the Python SDK but exists in the Java > SDK. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
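The Java SDK's ValueState is a single mutable cell scoped to a key and window, with read/write/clear semantics. The sketch below illustrates only those cell semantics in plain Python; it is not the Python SDK's actual state API, and the class name is hypothetical.

```python
class ValueState:
    """Sketch of ValueState semantics: one mutable cell per key/window.

    A sentinel distinguishes "never written" from a stored None.
    """
    _UNSET = object()

    def __init__(self):
        self._value = self._UNSET

    def write(self, value):
        self._value = value

    def read(self, default=None):
        return default if self._value is self._UNSET else self._value

    def clear(self):
        self._value = self._UNSET
```

In the real SDK, such a cell would be backed by the runner's state backend rather than an instance attribute.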
[jira] [Work logged] (BEAM-7530) Reading None value type BYTES from bigquery: AttributeError
[ https://issues.apache.org/jira/browse/BEAM-7530?focusedWorklogId=277100&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-277100 ] ASF GitHub Bot logged work on BEAM-7530: Author: ASF GitHub Bot Created on: 16/Jul/19 01:10 Start Date: 16/Jul/19 01:10 Worklog Time Spent: 10m Work Description: tvalentyn commented on issue #8875: [BEAM-7530] Add it test to read None values from BigQuery URL: https://github.com/apache/beam/pull/8875#issuecomment-511624977 Remaining postcommit failures (test_metrics_it, test_datastore_write_limit) are unrelated to the test that is modified in this PR, and failed due to lack of quota (known issue). @pabloem would you be comfortable merging this PR? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 277100) Time Spent: 7h 10m (was: 7h) > Reading None value type BYTES from bigquery: AttributeError > --- > > Key: BEAM-7530 > URL: https://issues.apache.org/jira/browse/BEAM-7530 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Juta Staes >Assignee: Juta Staes >Priority: Major > Fix For: 2.14.0 > > Time Spent: 7h 10m > Remaining Estimate: 0h > > When reading bigquery data where a field of type BYTES contains a None value > I get the following error: > {code:java} > Traceback (most recent call last): File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv/lib/python3.5/site-packages/apache_beam/runners/direct/executor.py", > line 343, in call finish_state) File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv/lib/python3.5/site-packages/apache_beam/runners/direct/executor.py", > line 383, in attempt_call result = evaluator.finish_bundle() File > 
"/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv/lib/python3.5/site-packages/apache_beam/runners/direct/transform_evaluator.py", > line 319, in finish_bundle bundles = _read_values_to_bundles(reader) File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv/lib/python3.5/site-packages/apache_beam/runners/direct/transform_evaluator.py", > line 306, in _read_values_to_bundles read_result = > [GlobalWindows.windowed_value(e) for e in reader] File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv/lib/python3.5/site-packages/apache_beam/runners/direct/transform_evaluator.py", > line 306, in <listcomp> read_result = [GlobalWindows.windowed_value(e) for e > in reader] File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv/lib/python3.5/site-packages/apache_beam/io/gcp/bigquery_tools.py", > line 932, in __iter__ row.f[i].v.string_value = > row.f[i].v.string_value.encode('utf-8') AttributeError: 'NoneType' object has > no attribute 'string_value'{code} > This bug was introduced in https://github.com/apache/beam/pull/8621 and is > present in the 2.13 release. > I submitted a pr to fix the issue: https://github.com/apache/beam/pull/8817 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
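The traceback shows `.encode('utf-8')` being called on a field value that is `None` for a NULL BYTES column. The fix amounts to guarding that call; the sketch below illustrates the guard with a hypothetical helper name, not the actual patch in the linked PR.

```python
def encode_bytes_field(value):
    """Encode a BigQuery BYTES field value as UTF-8 bytes, passing
    NULL (None) values through instead of calling .encode() on them."""
    if value is None:
        return None
    return value.encode("utf-8")
```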
[jira] [Work logged] (BEAM-7018) Regex transform for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-7018?focusedWorklogId=277094&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-277094 ] ASF GitHub Bot logged work on BEAM-7018: Author: ASF GitHub Bot Created on: 16/Jul/19 00:58 Start Date: 16/Jul/19 00:58 Worklog Time Spent: 10m Work Description: aaltay commented on issue #8859: [BEAM-7018] Added Regex transform for PythonSDK URL: https://github.com/apache/beam/pull/8859#issuecomment-511622769 @mszb Both Robert and I were out of office for a bit. The change looks good to me. I will give @robertwb another day in case he has any comments. After that we can merge it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 277094) Time Spent: 3h 10m (was: 3h) > Regex transform for Python SDK > -- > > Key: BEAM-7018 > URL: https://issues.apache.org/jira/browse/BEAM-7018 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Rose Nguyen >Assignee: Shehzaad Nakhoda >Priority: Minor > Time Spent: 3h 10m > Remaining Estimate: 0h > > PTransforms to use Regular Expressions to process elements in a PCollection. > It should offer the same API as its Java counterpart: > [https://github.com/apache/beam/blob/11a977b8b26eff2274d706541127c19dc93131a2/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Regex.java] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7018) Regex transform for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-7018?focusedWorklogId=277093&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-277093 ] ASF GitHub Bot logged work on BEAM-7018: Author: ASF GitHub Bot Created on: 16/Jul/19 00:56 Start Date: 16/Jul/19 00:56 Worklog Time Spent: 10m Work Description: aaltay commented on pull request #8859: [BEAM-7018] Added Regex transform for PythonSDK URL: https://github.com/apache/beam/pull/8859#discussion_r303692315 ## File path: sdks/python/apache_beam/transforms/util.py ## @@ -864,3 +867,225 @@ def add_window_info(element, timestamp=DoFn.TimestampParam, def expand(self, pcoll): return pcoll | ParDo(self.add_window_info) + + +@ptransform_fn +def matches_all_object(pcoll, regex, group=None): Review comment: Do we expect these to be public interfaces? If not, can we prefix them with `_` and move them under `Regex`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 277093) Time Spent: 3h (was: 2h 50m) > Regex transform for Python SDK > -- > > Key: BEAM-7018 > URL: https://issues.apache.org/jira/browse/BEAM-7018 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Rose Nguyen >Assignee: Shehzaad Nakhoda >Priority: Minor > Time Spent: 3h > Remaining Estimate: 0h > > PTransforms to use Regular Expressions to process elements in a PCollection. > It should offer the same API as its Java counterpart: > [https://github.com/apache/beam/blob/11a977b8b26eff2274d706541127c19dc93131a2/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Regex.java] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
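The element-wise behavior behind a `matches_all`-style helper can be sketched with only the `re` module: match each element against the pattern and emit the requested group. This is an illustrative sketch over a plain iterable; the real transform operates on PCollections, and `matches_all` here is a hypothetical name.

```python
import re

def matches_all(elements, regex, group=0):
    """Yield the requested group for every element the pattern matches
    at the start of the string (re.match semantics); skip non-matches."""
    pattern = re.compile(regex)
    for element in elements:
        m = pattern.match(element)
        if m is not None:
            yield m.group(group)
```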
[jira] [Assigned] (BEAM-2264) Re-use credential instead of generating a new one on each GCS call
[ https://issues.apache.org/jira/browse/BEAM-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Cwik reassigned BEAM-2264: --- Assignee: Udi Meiri > Re-use credential instead of generating a new one one each GCS call > --- > > Key: BEAM-2264 > URL: https://issues.apache.org/jira/browse/BEAM-2264 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Luke Cwik >Assignee: Udi Meiri >Priority: Minor > Time Spent: 1.5h > Remaining Estimate: 0h > > We should cache the credential used within a Pipeline and re-use it instead > of generating a new one on each GCS call. When executing (against 2.0.0 RC2): > {code} > python -m apache_beam.examples.wordcount --input > "gs://dataflow-samples/shakespeare/*" --output local_counts > {code} > Note that we seemingly generate a new access token each time instead of when > a refresh is required. > {code} > super(GcsIO, cls).__new__(cls, storage_client)) > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 1 files. Estimation > took 0.286200046539 seconds > INFO:root:Running pipeline with DirectRunner. > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 43 files. 
Estimation > took 0.205624818802 seconds > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > ... many more times ... > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
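The caching the issue asks for amounts to fetching a token once and refreshing only when it is near expiry, instead of on every call. A minimal sketch of that pattern follows; `CachedCredential` and its parameters are hypothetical names, not Beam's or oauth2client's API.

```python
import time

class CachedCredential:
    """Fetch a token once and reuse it until shortly before it expires."""

    def __init__(self, fetch_token, lifetime_s=3600.0, skew_s=60.0):
        self._fetch = fetch_token   # callable performing the real refresh
        self._lifetime = lifetime_s
        self._skew = skew_s         # refresh a bit early to avoid races
        self._token = None
        self._expires_at = 0.0

    def token(self):
        now = time.monotonic()
        if self._token is None or now >= self._expires_at - self._skew:
            self._token = self._fetch()
            self._expires_at = now + self._lifetime
        return self._token
```

Sharing one such object across all GCS calls in a pipeline would replace the repeated "Refreshing access_token" log lines with a single initial refresh.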
[jira] [Commented] (BEAM-2264) Re-use credential instead of generating a new one on each GCS call
[ https://issues.apache.org/jira/browse/BEAM-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885720#comment-16885720 ] Luke Cwik commented on BEAM-2264: - Thanks for doing the investigation. > Re-use credential instead of generating a new one one each GCS call > --- > > Key: BEAM-2264 > URL: https://issues.apache.org/jira/browse/BEAM-2264 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Luke Cwik >Priority: Minor > Time Spent: 1.5h > Remaining Estimate: 0h > > We should cache the credential used within a Pipeline and re-use it instead > of generating a new one on each GCS call. When executing (against 2.0.0 RC2): > {code} > python -m apache_beam.examples.wordcount --input > "gs://dataflow-samples/shakespeare/*" --output local_counts > {code} > Note that we seemingly generate a new access token each time instead of when > a refresh is required. > {code} > super(GcsIO, cls).__new__(cls, storage_client)) > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 1 files. Estimation > took 0.286200046539 seconds > INFO:root:Running pipeline with DirectRunner. > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 43 files. 
Estimation > took 0.205624818802 seconds > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > ... many more times ... > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7696) Detect classpath resources contains directory cause exception
[ https://issues.apache.org/jira/browse/BEAM-7696?focusedWorklogId=277073&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-277073 ] ASF GitHub Bot logged work on BEAM-7696: Author: ASF GitHub Bot Created on: 15/Jul/19 23:55 Start Date: 15/Jul/19 23:55 Worklog Time Spent: 10m Work Description: ibzib commented on pull request #9019: [BEAM-7696] Prepare files to stage also in local master of spark runner. URL: https://github.com/apache/beam/pull/9019#discussion_r303678227 ## File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/PipelineResources.java ## @@ -61,7 +61,10 @@ List files = new ArrayList<>(); for (URL url : ((URLClassLoader) classLoader).getURLs()) { try { -files.add(new File(url.toURI()).getAbsolutePath()); +File file = new File(url.toURI()); +if (file.exists()) { Review comment: Since SparkContext will reject all directories [1], it is probably best to exclude directories too. ```suggestion if (file.exists() && !file.isDirectory()) { ``` [1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1796-L1802 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 277073) Time Spent: 1h 40m (was: 1.5h) > Detect classpath resources contains directory cause exception > - > > Key: BEAM-7696 > URL: https://issues.apache.org/jira/browse/BEAM-7696 > Project: Beam > Issue Type: Bug > Components: runner-spark >Reporter: Wang Yanlin >Assignee: Wang Yanlin >Priority: Minor > Fix For: 2.15.0 > > Attachments: addJar_exception.jpg, files_contains_dir.jpg > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Run the unit test SparkPipelineStateTest.testBatchPipelineRunningState in > IntelliJ IDEA on my mac, get the IllegalArgumentException in the console > output. I check the source code, and find the result of > _PipelineResources.detectClassPathResourcesToStage_ contains directory, which > is the cause of the exception. > See the attached file 'addJar_exception.jpg' for detail, and the result of > _PipelineResources.detectClassPathResourcesToStage_ > is showed in attached file 'files_contains_dir.jpg' during debug. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
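The suggested change filters classpath entries to those that exist and are not directories before staging, since Spark's addJar rejects directories. The check itself can be sketched in Python (the actual fix is in Java's PipelineResources; `stageable_files` is a hypothetical name):

```python
import os
import tempfile  # used only by the usage example below

def stageable_files(paths):
    """Keep only paths that exist and are regular files; Spark's addJar
    rejects directories, and missing classpath entries cannot be staged."""
    return [p for p in paths if os.path.exists(p) and not os.path.isdir(p)]
```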
[jira] [Updated] (BEAM-7750) Pipeline instances are not garbage collected
[ https://issues.apache.org/jira/browse/BEAM-7750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Strokach updated BEAM-7750: -- Description: It seems that Apache Beam's Pipeline instances are not garbage collected, even if the pipelines are finished or cancelled and there are no references to those pipelines in the Python interpreter. For pipelines executed in a script, this is not a problem. However, for interactive pipelines executed inside a Jupyter notebook, this limits how well we can track and remove the dependencies of those pipelines. For example, if a pipeline reads from some cache, it would be nice to be able to delete that cache once there are no references to it from a pipeline or the global namespace. The issue can be reproduced using the following script: [https://github.com/ostrokach/beam-notebooks/blob/48718038e63342a5f3acc31352a6326fffd34888/scripts/error_pipeline_gc.py] was: It seems that Apache Beam's Pipeline instances are not garbage collected, even if the pipelines are finished or cancelled, and there are no references to those pipelines in the Python interpreter. For pipelines executed in a script, this is not a problem. However, for interactive pipelines executed inside a Jupyter notebook, this limits how well we can track and remove the dependencies of those pipelines. For example, if a pipeline reads from some cache, it would be nice to be able to delete that cache once there are no references to it from a pipeline or the global namespace. The issue can be reproduced using the following script: https://github.com/ostrokach/beam-notebooks/blob/48718038e63342a5f3acc31352a6326fffd34888/scripts/error_pipeline_gc.py > Pipeline instances are not garbage collected > > > Key: BEAM-7750 > URL: https://issues.apache.org/jira/browse/BEAM-7750 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.14.0 > Environment: OS: Debian rodete. 
> Tested using: > Beam versions: 2.13.0, 2.15.0.dev > Python versions: Python 2.7, Python 3.7. > Runners: DirectRunner, DataflowRunner. >Reporter: Alexey Strokach >Priority: Minor > > It seems that Apache Beam's Pipeline instances are not garbage collected, > even if the pipelines are finished or cancelled and there are no references > to those pipelines in the Python interpreter. > For pipelines executed in a script, this is not a problem. However, for > interactive pipelines executed inside a Jupyter notebook, this limits how > well we can track and remove the dependencies of those pipelines. For > example, if a pipeline reads from some cache, it would be nice to be able to > delete that cache once there are no references to it from a pipeline or the > global namespace. > The issue can be reproduced using the following script: > [https://github.com/ostrokach/beam-notebooks/blob/48718038e63342a5f3acc31352a6326fffd34888/scripts/error_pipeline_gc.py] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
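Whether an instance really becomes unreachable after its last strong reference is dropped can be checked with a weak reference, which is roughly what the linked reproduction script does. In this sketch `FakePipeline` stands in for `beam.Pipeline`, and `is_collected` is a hypothetical helper.

```python
import gc
import weakref

class FakePipeline:
    """Stand-in for beam.Pipeline; any weak-referenceable object works."""

def is_collected(make_obj) -> bool:
    """Return True if the object produced by make_obj() is garbage
    collected once the last strong reference to it is dropped."""
    obj = make_obj()
    ref = weakref.ref(obj)
    del obj
    gc.collect()
    return ref() is None
```

If a hidden registry (or a reference cycle pinned by a C extension) keeps the pipeline alive, the weak reference stays live and this check returns False, which matches the leak described in the issue.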
[jira] [Updated] (BEAM-7750) Pipeline instances are not garbage collected
[ https://issues.apache.org/jira/browse/BEAM-7750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Strokach updated BEAM-7750: -- Description: It seems that Apache Beam's Pipeline instances are not garbage collected, even if the pipelines are finished or cancelled and there are no references to those pipelines in the Python interpreter. For pipelines executed in a script, this is not a problem. However, for interactive pipelines executed inside a Jupyter notebook, this limits how well we can track and remove the dependencies of those pipelines. For example, if a pipeline reads from some cache, it would be nice to be able to delete that cache once there are no references to it from pipelines or the global namespace. The issue can be reproduced using the following script: [https://github.com/ostrokach/beam-notebooks/blob/48718038e63342a5f3acc31352a6326fffd34888/scripts/error_pipeline_gc.py] was: It seems that Apache Beam's Pipeline instances are not garbage collected, even if the pipelines are finished or cancelled and there are no references to those pipelines in the Python interpreter. For pipelines executed in a script, this is not a problem. However, for interactive pipelines executed inside a Jupyter notebook, this limits how well we can track and remove the dependencies of those pipelines. For example, if a pipeline reads from some cache, it would be nice to be able to delete that cache once there are no references to it from a pipeline or the global namespace. The issue can be reproduced using the following script: [https://github.com/ostrokach/beam-notebooks/blob/48718038e63342a5f3acc31352a6326fffd34888/scripts/error_pipeline_gc.py] > Pipeline instances are not garbage collected > > > Key: BEAM-7750 > URL: https://issues.apache.org/jira/browse/BEAM-7750 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.14.0 > Environment: OS: Debian rodete. 
> Tested using: > Beam versions: 2.13.0, 2.15.0.dev > Python versions: Python 2.7, Python 3.7. > Runners: DirectRunner, DataflowRunner. >Reporter: Alexey Strokach >Priority: Minor > > It seems that Apache Beam's Pipeline instances are not garbage collected, > even if the pipelines are finished or cancelled and there are no references > to those pipelines in the Python interpreter. > For pipelines executed in a script, this is not a problem. However, for > interactive pipelines executed inside a Jupyter notebook, this limits how > well we can track and remove the dependencies of those pipelines. For > example, if a pipeline reads from some cache, it would be nice to be able to > delete that cache once there are no references to it from pipelines or the > global namespace. > The issue can be reproduced using the following script: > [https://github.com/ostrokach/beam-notebooks/blob/48718038e63342a5f3acc31352a6326fffd34888/scripts/error_pipeline_gc.py] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (BEAM-7750) Pipeline instances are not garbage collected
Alexey Strokach created BEAM-7750: - Summary: Pipeline instances are not garbage collected Key: BEAM-7750 URL: https://issues.apache.org/jira/browse/BEAM-7750 Project: Beam Issue Type: Bug Components: sdk-py-core Affects Versions: 2.14.0 Environment: OS: Debian rodete. Tested using: Beam versions: 2.13.0, 2.15.0.dev Python versions: Python 2.7, Python 3.7. Runners: DirectRunner, DataflowRunner. Reporter: Alexey Strokach It seems that Apache Beam's Pipeline instances are not garbage collected, even if the pipelines are finished or cancelled, and there are no references to those pipelines in the Python interpreter. For pipelines executed in a script, this is not a problem. However, for interactive pipelines executed inside a Jupyter notebook, this limits how well we can track and remove the dependencies of those pipelines. For example, if a pipeline reads from some cache, it would be nice to be able to delete that cache once there are no references to it from a pipeline or the global namespace. The issue can be reproduced using the following script: https://github.com/ostrokach/beam-notebooks/blob/48718038e63342a5f3acc31352a6326fffd34888/scripts/error_pipeline_gc.py -- This message was sent by Atlassian JIRA (v7.6.14#76016)
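The diagnostic technique used by the reproduction script linked above can be sketched in a few self-contained lines: hold only a weak reference to an object, drop all strong references, force a collection pass, and check whether the weak reference went dead. This is a sketch under assumptions — `FakePipeline` is a hypothetical stand-in for `apache_beam.Pipeline`, since the real reproduction requires constructing and running an actual pipeline; the point here is only the detection method.

```python
import gc
import weakref


class FakePipeline:
    """Hypothetical stand-in for apache_beam.Pipeline.

    A plain object with no hidden external references; unlike what the
    reporter observed for real Pipeline instances, this IS collected.
    """


def is_collected(make_obj):
    """Return True if the object is reclaimed once unreferenced."""
    obj = make_obj()
    ref = weakref.ref(obj)   # weak reference does not keep obj alive
    del obj                  # drop the only strong reference
    gc.collect()             # force a full collection pass
    return ref() is None     # dead weakref => object was reclaimed


print(is_collected(FakePipeline))  # prints True
```

In the reported bug, the equivalent check against a finished or cancelled `Pipeline` (with no remaining references in the interpreter) would return `False`, indicating something — e.g. a runner- or SDK-held reference cycle — is keeping the pipeline alive.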
[jira] [Comment Edited] (BEAM-7674) Define streaming IT tests for the direct runner in a consistent way in Python 2 and Python 3 suites.
[ https://issues.apache.org/jira/browse/BEAM-7674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885697#comment-16885697 ] Pablo Estrada edited comment on BEAM-7674 at 7/15/19 11:30 PM: --- Sorry for the delay to answer. That test should be a pure streaming test (see how it uses TestStream). Edit: ugh, well, that's embarrassing: I see you've added it. Thanks guys : ) was (Author: pabloem): Sorry for the delay to answer. That test should be a pure streaming test (see how it uses TestStream). > Define streaming ITs tests for direct runner in consistent way in Python 2 > and Python 3 suites. > > > Key: BEAM-7674 > URL: https://issues.apache.org/jira/browse/BEAM-7674 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Valentyn Tymofieiev >Assignee: Juta Staes >Priority: Major > > Currently in Python 2 direct runner test suite some tests run in streaming > mode: > https://github.com/apache/beam/blob/fbd1f4cf7118c7b2fb4e3a4cf46646e98f3e3b8d/sdks/python/build.gradle#L130 > However in Python 3, we run both Batch and Streaming direct runner tests in > Batch mode: > https://github.com/apache/beam/blob/fbd1f4cf7118c7b2fb4e3a4cf46646e98f3e3b8d/sdks/python/test-suites/direct/py35/build.gradle#L32 > We should check whether we need to explicitly separate the tests into batch > and streaming and define all directrunner suites consistently. > cc: [~Juta] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7674) Define streaming IT tests for the direct runner in a consistent way in Python 2 and Python 3 suites.
[ https://issues.apache.org/jira/browse/BEAM-7674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885697#comment-16885697 ] Pablo Estrada commented on BEAM-7674: - Sorry for the delay to answer. That test should be a pure streaming test (see how it uses TestStream). > Define streaming ITs tests for direct runner in consistent way in Python 2 > and Python 3 suites. > > > Key: BEAM-7674 > URL: https://issues.apache.org/jira/browse/BEAM-7674 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Valentyn Tymofieiev >Assignee: Juta Staes >Priority: Major > > Currently in Python 2 direct runner test suite some tests run in streaming > mode: > https://github.com/apache/beam/blob/fbd1f4cf7118c7b2fb4e3a4cf46646e98f3e3b8d/sdks/python/build.gradle#L130 > However in Python 3, we run both Batch and Streaming direct runner tests in > Batch mode: > https://github.com/apache/beam/blob/fbd1f4cf7118c7b2fb4e3a4cf46646e98f3e3b8d/sdks/python/test-suites/direct/py35/build.gradle#L32 > We should check whether we need to explicitly separate the tests into batch > and streaming and define all directrunner suites consistently. > cc: [~Juta] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-6611) A Python Sink for BigQuery with File Loads in Streaming
[ https://issues.apache.org/jira/browse/BEAM-6611?focusedWorklogId=277063&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-277063 ] ASF GitHub Bot logged work on BEAM-6611: Author: ASF GitHub Bot Created on: 15/Jul/19 23:27 Start Date: 15/Jul/19 23:27 Worklog Time Spent: 10m Work Description: tvalentyn commented on issue #8871: [BEAM-6611] BigQuery file loads in Streaming for Python SDK URL: https://github.com/apache/beam/pull/8871#issuecomment-511606144 > @tvalentyn are you aware of why these tests could be getting triggered? See: [#8871 (comment)](https://github.com/apache/beam/pull/8871#issuecomment-511364580) @pabloem, PTAL at https://issues.apache.org/jira/browse/BEAM-7674, it may be relevant to this question. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 277063) Time Spent: 3h 20m (was: 3h 10m) > A Python Sink for BigQuery with File Loads in Streaming > --- > > Key: BEAM-6611 > URL: https://issues.apache.org/jira/browse/BEAM-6611 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Pablo Estrada >Assignee: Tanay Tummalapalli >Priority: Major > Labels: gsoc, gsoc2019, mentor > Time Spent: 3h 20m > Remaining Estimate: 0h > > The Java SDK supports a bunch of methods for writing data into BigQuery, > while the Python SDK supports the following: > - Streaming inserts for streaming pipelines [As seen in [bigquery.py and > BigQueryWriteFn|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L649-L813]] > - File loads for batch pipelines [As implemented in [PR > 7655|https://github.com/apache/beam/pull/7655]] > Quick and dirty early design doc: https://s.apache.org/beam-bqfl-py-streaming > The Java SDK also supports File Loads for 
Streaming pipelines [see BatchLoads > application|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1709-L1776]. > File loads have the advantage of being much cheaper than streaming inserts > (although they also are slower for the records to show up in the table). -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-2264) Re-use credential instead of generating a new one on each GCS call
[ https://issues.apache.org/jira/browse/BEAM-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885693#comment-16885693 ] Udi Meiri commented on BEAM-2264: - Migrating off of apitools-based clients (such as the one used for GCS) is a much larger project. Regarding threading issues, I believe that we came across them when trying to reuse GCS clients in multiple threads (https://issues.apache.org/jira/browse/BEAM-3990). The credentials, however, should be thread-safe according to this note: https://github.com/googleapis/google-api-python-client/blob/master/docs/thread_safety.md In any case, the newer credentials objects don't try to be thread-safe, but that is apparently not an issue: https://github.com/googleapis/google-auth-library-python/issues/246#issuecomment-371878855 (it just potentially causes more refreshes) > Re-use credential instead of generating a new one on each GCS call > --- > > Key: BEAM-2264 > URL: https://issues.apache.org/jira/browse/BEAM-2264 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Luke Cwik >Priority: Minor > Time Spent: 1.5h > Remaining Estimate: 0h > > We should cache the credential used within a Pipeline and re-use it instead > of generating a new one on each GCS call. When executing (against 2.0.0 RC2): > {code} > python -m apache_beam.examples.wordcount --input > "gs://dataflow-samples/shakespeare/*" --output local_counts > {code} > Note that we seemingly generate a new access token each time instead of when > a refresh is required. > {code} > super(GcsIO, cls).__new__(cls, storage_client)) > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 1 files. Estimation > took 0.286200046539 seconds > INFO:root:Running pipeline with DirectRunner. 
> INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 43 files. Estimation > took 0.205624818802 seconds > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > ... many more times ... > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-4775) JobService should support returning metrics
[ https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=277060&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-277060 ] ASF GitHub Bot logged work on BEAM-4775: Author: ASF GitHub Bot Created on: 15/Jul/19 23:16 Start Date: 15/Jul/19 23:16 Worklog Time Spent: 10m Work Description: ibzib commented on pull request #9020: [BEAM-4775] Support returning metrics from job service URL: https://github.com/apache/beam/pull/9020#discussion_r303674183 ## File path: runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/jobsubmission/JobInvocation.java ## @@ -93,6 +95,7 @@ public void onSuccess(@Nullable PipelineResult pipelineResult) { switch (pipelineResult.getState()) { case DONE: setState(Enum.DONE); + metrics = pipelineResult.portableMetrics(); Review comment: Sure, looks good :+1: This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 277060) Time Spent: 42h 40m (was: 42.5h) > JobService should support returning metrics > --- > > Key: BEAM-4775 > URL: https://issues.apache.org/jira/browse/BEAM-4775 > Project: Beam > Issue Type: Bug > Components: beam-model >Reporter: Eugene Kirpichov >Assignee: Lukasz Gajowy >Priority: Major > Time Spent: 42h 40m > Remaining Estimate: 0h > > Design doc: [https://s.apache.org/get-metrics-api]. > Further discussion is ongoing on [this > doc|https://docs.google.com/document/d/1m83TsFvJbOlcLfXVXprQm1B7vUakhbLZMzuRrOHWnTg/edit?ts=5c826bb4#heading=h.faqan9rjc6dm]. > We want to report job metrics back to the portability harness from the runner > harness, for displaying to users. > h1. Relevant PRs in flight: > h2. 
Ready for Review: > * [#8022|https://github.com/apache/beam/pull/8022]: correct the Job RPC > protos from [#8018|https://github.com/apache/beam/pull/8018]. > h2. Iterating / Discussing: > * [#7971|https://github.com/apache/beam/pull/7971]: Flink portable metrics: > get ptransform from MonitoringInfo, not stage name > ** this is a simpler, Flink-specific PR that is basically duplicated inside > each of the following two, so may be worth trying to merge in first > * #[7915|https://github.com/apache/beam/pull/7915]: use MonitoringInfo data > model in Java SDK metrics > * [#7868|https://github.com/apache/beam/pull/7868]: MonitoringInfo URN tweaks > h2. Merged > * [#8018|https://github.com/apache/beam/pull/8018]: add job metrics RPC > protos > * [#7867|https://github.com/apache/beam/pull/7867]: key MetricResult by a > MetricKey > * [#7938|https://github.com/apache/beam/pull/7938]: move MonitoringInfo > protos to model/pipeline module > * [#7883|https://github.com/apache/beam/pull/7883]: Add > MetricQueryResults.allMetrics() helper > * [#7866|https://github.com/apache/beam/pull/7866]: move function helpers > from fn-harness to sdks/java/core > * [#7890|https://github.com/apache/beam/pull/7890]: consolidate MetricResult > implementations > h2. Closed > * [#7934|https://github.com/apache/beam/pull/7934]: job metrics RPC + SDK > support > * [#7876|https://github.com/apache/beam/pull/7876]: Clean up metric protos; > support integer distributions, gauges -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7463) BigQueryQueryToTableIT is flaky on Direct runner in PostCommit suites: incorrect checksum
[ https://issues.apache.org/jira/browse/BEAM-7463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885686#comment-16885686 ] Pablo Estrada commented on BEAM-7463: - FWIW I'm running with Py3.5 > BigQueryQueryToTableIT is flaky on Direct runner in PostCommit suites: > incorrect checksum > -- > > Key: BEAM-7463 > URL: https://issues.apache.org/jira/browse/BEAM-7463 > Project: Beam > Issue Type: Bug > Components: io-python-gcp >Reporter: Valentyn Tymofieiev >Assignee: Pablo Estrada >Priority: Major > Labels: currently-failing > Time Spent: 5h 20m > Remaining Estimate: 0h > > {noformat} > 15:03:38 FAIL: test_big_query_new_types > (apache_beam.io.gcp.big_query_query_to_table_it_test.BigQueryQueryToTableIT) > 15:03:38 > -- > 15:03:38 Traceback (most recent call last): > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_it_test.py", > line 211, in test_big_query_new_types > 15:03:38 big_query_query_to_table_pipeline.run_bq_pipeline(options) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_pipeline.py", > line 82, in run_bq_pipeline > 15:03:38 result = p.run() > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/testing/test_pipeline.py", > line 107, in run > 15:03:38 else test_runner_api)) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/pipeline.py", > line 406, in run > 15:03:38 self._options).run(False) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/pipeline.py", > line 419, in run > 15:03:38 return self.runner.run_pipeline(self, self._options) > 15:03:38 File > 
"/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/runners/direct/test_direct_runner.py", > line 51, in run_pipeline > 15:03:38 hc_assert_that(self.result, pickler.loads(on_success_matcher)) > 15:03:38 AssertionError: > 15:03:38 Expected: (Test pipeline expected terminated in state: DONE and > Expected checksum is 24de460c4d344a4b77ccc4cc1acb7b7ffc11a214) > 15:03:38 but: Expected checksum is > 24de460c4d344a4b77ccc4cc1acb7b7ffc11a214 Actual checksum is > da39a3ee5e6b4b0d3255bfef95601890afd80709 > {noformat} > [~Juta] could this be caused by changes to Bigquery matcher? > https://github.com/apache/beam/pull/8621/files#diff-f1ec7e3a3e7e2e5082ddb7043954c108R134 > > cc: [~pabloem] [~chamikara] [~apilloud] > A recent postcommit run has BQ failures in other tests as well: > https://builds.apache.org/job/beam_PostCommit_Python3_Verify/1000/consoleFull -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-5874) Python BQ directRunnerIT test fails with half-empty assertion
[ https://issues.apache.org/jira/browse/BEAM-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885685#comment-16885685 ] Pablo Estrada commented on BEAM-5874: - I am also unable to repro this on my machine with Py3.5 > Python BQ directRunnerIT test fails with half-empty assertion > - > > Key: BEAM-5874 > URL: https://issues.apache.org/jira/browse/BEAM-5874 > Project: Beam > Issue Type: Bug > Components: test-failures >Reporter: Henning Rohde >Assignee: Pablo Estrada >Priority: Major > Labels: currently-failing > > https://scans.gradle.com/s/2fd5wbm7vkkna/console-log?task=:beam-sdks-python:directRunnerIT#L147 > == > FAIL: test_big_query_new_types > (apache_beam.io.gcp.big_query_query_to_table_it_test.BigQueryQueryToTableIT) > -- > Traceback (most recent call last): > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_it_test.py", > line 169, in test_big_query_new_types > big_query_query_to_table_pipeline.run_bq_pipeline(options) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_pipeline.py", > line 67, in run_bq_pipeline > result = p.run() > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/testing/test_pipeline.py", > line 107, in run > else test_runner_api)) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/pipeline.py", > line 416, in run > return self.runner.run_pipeline(self) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/runners/direct/test_direct_runner.py", > line 51, in run_pipeline > hc_assert_that(self.result, pickler.loads(on_success_matcher)) > AssertionError: > Expected: (Test pipeline expected terminated in state: DONE and Expected > checksum is 
e1fbcb5ca479a5ca5f9ecf444d6998beee4d44c6) > but: > -- > XML: > /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/nosetests.xml > -- > Ran 7 tests in 24.434s > FAILED (failures=1) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-2264) Re-use credential instead of generating a new one on each GCS call
[ https://issues.apache.org/jira/browse/BEAM-2264?focusedWorklogId=277040&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-277040 ] ASF GitHub Bot logged work on BEAM-2264: Author: ASF GitHub Bot Created on: 15/Jul/19 22:53 Start Date: 15/Jul/19 22:53 Worklog Time Spent: 10m Work Description: udim commented on pull request #9060: [BEAM-2264] Reuse GCP credentials in GCS calls. URL: https://github.com/apache/beam/pull/9060#discussion_r303669216 ## File path: sdks/python/apache_beam/internal/gcp/auth.py ## @@ -57,77 +71,58 @@ def set_running_in_gce(worker_executing_project): executing_project = worker_executing_project -class AuthenticationException(retry.PermanentException): - pass - - -class _GCEMetadataCredentials(OAuth2Credentials): - """For internal use only; no backwards-compatibility guarantees. - - Credential object initialized using access token from GCE VM metadata.""" - - def __init__(self, user_agent=None): -"""Create an instance of GCEMetadataCredentials. - -These credentials are generated by contacting the metadata server on a GCE -VM instance. - -Args: - user_agent: string, The HTTP User-Agent to provide for this application. -""" -super(_GCEMetadataCredentials, self).__init__( -None, # access_token -None, # client_id -None, # client_secret -None, # refresh_token -datetime.datetime(2010, 1, 1), # token_expiry, set to time in past. 
-None, # token_uri -user_agent) - - @retry.with_exponential_backoff( - retry_filter=retry.retry_on_server_errors_and_timeout_filter) - def _refresh(self, http_request): -refresh_time = datetime.datetime.utcnow() -metadata_root = os.environ.get( -'GCE_METADATA_ROOT', 'metadata.google.internal') -token_url = ('http://{}/computeMetadata/v1/instance/service-accounts/' - 'default/token').format(metadata_root) -req = Request(token_url, headers={'Metadata-Flavor': 'Google'}) -token_data = json.loads(urlopen(req, timeout=60).read().decode('utf-8')) -self.access_token = token_data['access_token'] -self.token_expiry = (refresh_time + - datetime.timedelta(seconds=token_data['expires_in'])) - - def get_service_credentials(): """For internal use only; no backwards-compatibility guarantees. - Get credentials to access Google services.""" - user_agent = 'beam-python-sdk/1.0' - if is_running_in_gce: -# We are currently running as a GCE taskrunner worker. -# -# TODO(ccy): It's not entirely clear if these credentials are thread-safe. -# If so, we can cache these credentials to save the overhead of creating -# them again. -return _GCEMetadataCredentials(user_agent=user_agent) - else: -client_scopes = [ -'https://www.googleapis.com/auth/bigquery', -'https://www.googleapis.com/auth/cloud-platform', -'https://www.googleapis.com/auth/devstorage.full_control', -'https://www.googleapis.com/auth/userinfo.email', -'https://www.googleapis.com/auth/datastore' -] - -try: - credentials = GoogleCredentials.get_application_default() - credentials = credentials.create_scoped(client_scopes) - logging.debug('Connecting using Google Application Default ' -'Credentials.') - return credentials -except Exception as e: - logging.warning( - 'Unable to find default credentials to use: %s\n' - 'Connecting anonymously.', e) - return None + Get credentials to access Google services. + + Returns: +A ``oauth2client.client.OAuth2Credentials`` object or None if credentials +not found. 
Returned object is thread-safe. + """ + return _Credentials.get_service_credentials() + + +class _Credentials(object): + _credentials_lock = threading.Lock() + _credentials_init = False + _credentials = None + + @classmethod + def get_service_credentials(cls): +if cls._credentials_init: + return cls._credentials Review comment: yes; they auto-refresh This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 277040) Time Spent: 1.5h (was: 1h 20m) > Re-use credential instead of generating a new one one each GCS call > --- > > Key: BEAM-2264 > URL: https://issues.apache.org/jira/browse/BEAM-2264 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core
[jira] [Commented] (BEAM-6855) Side inputs are not supported when using the state API
[ https://issues.apache.org/jira/browse/BEAM-6855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885670#comment-16885670 ] Reuven Lax commented on BEAM-6855: -- I believe that the side-input code in SimplePushbackSideInputDoFnRunner needs to be implemented in StatefulDoFnRunner. [~kenn] can you confirm? > Side inputs are not supported when using the state API > -- > > Key: BEAM-6855 > URL: https://issues.apache.org/jira/browse/BEAM-6855 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Reuven Lax >Assignee: Shehzaad Nakhoda >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7463) BigQueryQueryToTableIT is flaky on Direct runner in PostCommit suites: incorrect checksum
[ https://issues.apache.org/jira/browse/BEAM-7463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885665#comment-16885665 ] Pablo Estrada commented on BEAM-7463: - These tests run in about 4s, but they do seem to perform all of the actions they're supposed to. I've run them more than 20 times, and I'm not seeing any problems. It may really be specific to a particular machine? I'll investigate further. > BigQueryQueryToTableIT is flaky on Direct runner in PostCommit suites: > incorrect checksum > -- > > Key: BEAM-7463 > URL: https://issues.apache.org/jira/browse/BEAM-7463 > Project: Beam > Issue Type: Bug > Components: io-python-gcp >Reporter: Valentyn Tymofieiev >Assignee: Pablo Estrada >Priority: Major > Labels: currently-failing > Time Spent: 5h 20m > Remaining Estimate: 0h > > {noformat} > 15:03:38 FAIL: test_big_query_new_types > (apache_beam.io.gcp.big_query_query_to_table_it_test.BigQueryQueryToTableIT) > 15:03:38 > -- > 15:03:38 Traceback (most recent call last): > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_it_test.py", > line 211, in test_big_query_new_types > 15:03:38 big_query_query_to_table_pipeline.run_bq_pipeline(options) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_pipeline.py", > line 82, in run_bq_pipeline > 15:03:38 result = p.run() > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/testing/test_pipeline.py", > line 107, in run > 15:03:38 else test_runner_api)) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/pipeline.py", > line 406, in run > 15:03:38 self._options).run(False) > 15:03:38 File > 
"/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/pipeline.py", > line 419, in run > 15:03:38 return self.runner.run_pipeline(self, self._options) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/runners/direct/test_direct_runner.py", > line 51, in run_pipeline > 15:03:38 hc_assert_that(self.result, pickler.loads(on_success_matcher)) > 15:03:38 AssertionError: > 15:03:38 Expected: (Test pipeline expected terminated in state: DONE and > Expected checksum is 24de460c4d344a4b77ccc4cc1acb7b7ffc11a214) > 15:03:38 but: Expected checksum is > 24de460c4d344a4b77ccc4cc1acb7b7ffc11a214 Actual checksum is > da39a3ee5e6b4b0d3255bfef95601890afd80709 > {noformat} > [~Juta] could this be caused by changes to Bigquery matcher? > https://github.com/apache/beam/pull/8621/files#diff-f1ec7e3a3e7e2e5082ddb7043954c108R134 > > cc: [~pabloem] [~chamikara] [~apilloud] > A recent postcommit run has BQ failures in other tests as well: > https://builds.apache.org/job/beam_PostCommit_Python3_Verify/1000/consoleFull -- This message was sent by Atlassian JIRA (v7.6.14#76016)
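One concrete clue in the failure above: the reported "actual checksum", `da39a3ee5e6b4b0d3255bfef95601890afd80709`, is the SHA-1 digest of empty input. That suggests (though does not prove) the matcher checksummed zero rows — i.e. the query result was empty or not yet visible when the on-success matcher ran — rather than checksumming wrong data:

```python
import hashlib

# The "actual checksum" in the failing assertion equals the SHA-1 of the
# empty byte string, i.e. a checksum computed over no rows at all.
empty_digest = hashlib.sha1(b"").hexdigest()
print(empty_digest)  # prints da39a3ee5e6b4b0d3255bfef95601890afd80709
```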
[jira] [Work logged] (BEAM-2264) Re-use credential instead of generating a new one on each GCS call
[ https://issues.apache.org/jira/browse/BEAM-2264?focusedWorklogId=277009&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-277009 ] ASF GitHub Bot logged work on BEAM-2264: Author: ASF GitHub Bot Created on: 15/Jul/19 21:53 Start Date: 15/Jul/19 21:53 Worklog Time Spent: 10m Work Description: aaltay commented on pull request #9060: [BEAM-2264] Reuse GCP credentials in GCS calls. URL: https://github.com/apache/beam/pull/9060#discussion_r303653342 ## File path: sdks/python/apache_beam/internal/gcp/auth.py ## @@ -57,77 +71,58 @@ def set_running_in_gce(worker_executing_project): executing_project = worker_executing_project -class AuthenticationException(retry.PermanentException): - pass - - -class _GCEMetadataCredentials(OAuth2Credentials): - """For internal use only; no backwards-compatibility guarantees. - - Credential object initialized using access token from GCE VM metadata.""" - - def __init__(self, user_agent=None): -"""Create an instance of GCEMetadataCredentials. - -These credentials are generated by contacting the metadata server on a GCE -VM instance. - -Args: - user_agent: string, The HTTP User-Agent to provide for this application. -""" -super(_GCEMetadataCredentials, self).__init__( -None, # access_token -None, # client_id -None, # client_secret -None, # refresh_token -datetime.datetime(2010, 1, 1), # token_expiry, set to time in past. 
-None, # token_uri -user_agent) - - @retry.with_exponential_backoff( - retry_filter=retry.retry_on_server_errors_and_timeout_filter) - def _refresh(self, http_request): -refresh_time = datetime.datetime.utcnow() -metadata_root = os.environ.get( -'GCE_METADATA_ROOT', 'metadata.google.internal') -token_url = ('http://{}/computeMetadata/v1/instance/service-accounts/' - 'default/token').format(metadata_root) -req = Request(token_url, headers={'Metadata-Flavor': 'Google'}) -token_data = json.loads(urlopen(req, timeout=60).read().decode('utf-8')) -self.access_token = token_data['access_token'] -self.token_expiry = (refresh_time + - datetime.timedelta(seconds=token_data['expires_in'])) - - def get_service_credentials(): """For internal use only; no backwards-compatibility guarantees. - Get credentials to access Google services.""" - user_agent = 'beam-python-sdk/1.0' - if is_running_in_gce: -# We are currently running as a GCE taskrunner worker. -# -# TODO(ccy): It's not entirely clear if these credentials are thread-safe. -# If so, we can cache these credentials to save the overhead of creating -# them again. -return _GCEMetadataCredentials(user_agent=user_agent) - else: -client_scopes = [ -'https://www.googleapis.com/auth/bigquery', -'https://www.googleapis.com/auth/cloud-platform', -'https://www.googleapis.com/auth/devstorage.full_control', -'https://www.googleapis.com/auth/userinfo.email', -'https://www.googleapis.com/auth/datastore' -] - -try: - credentials = GoogleCredentials.get_application_default() - credentials = credentials.create_scoped(client_scopes) - logging.debug('Connecting using Google Application Default ' -'Credentials.') - return credentials -except Exception as e: - logging.warning( - 'Unable to find default credentials to use: %s\n' - 'Connecting anonymously.', e) - return None + Get credentials to access Google services. + + Returns: +A ``oauth2client.client.OAuth2Credentials`` object or None if credentials +not found. 
Returned object is thread-safe. + """ + return _Credentials.get_service_credentials() + + +class _Credentials(object): + _credentials_lock = threading.Lock() + _credentials_init = False + _credentials = None + + @classmethod + def get_service_credentials(cls): +if cls._credentials_init: + return cls._credentials Review comment: Is it possible for credentials to expire? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 277009) Time Spent: 1h 20m (was: 1h 10m) > Re-use credential instead of generating a new one one each GCS call > --- > > Key: BEAM-2264 > URL: https://issues.apache.org/jira/browse/BEAM-2264 > Project: Beam > Issue Type: Improvement >
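The review question above asks whether cached credentials can expire. A minimal sketch of a cache that is both thread-safe and expiry-aware follows; the class and token names here are illustrative stand-ins, not Beam's actual `_Credentials` implementation:

```python
import datetime
import threading


class FakeToken:
    """Illustrative stand-in for an OAuth2 credential object."""

    def __init__(self, lifetime_secs=3600):
        self.expiry = (datetime.datetime.utcnow()
                       + datetime.timedelta(seconds=lifetime_secs))

    def expired(self):
        return datetime.datetime.utcnow() >= self.expiry


class CredentialCache:
    """Lazily create credentials once, re-creating them only on expiry."""

    _lock = threading.Lock()
    _credentials = None

    @classmethod
    def get(cls, factory=FakeToken):
        # Fast path: reuse a cached, still-valid credential without locking.
        creds = cls._credentials
        if creds is not None and not creds.expired():
            return creds
        with cls._lock:
            # Re-check under the lock: another thread may have refreshed.
            if cls._credentials is None or cls._credentials.expired():
                cls._credentials = factory()
            return cls._credentials
```

Compared with a one-shot `_credentials_init` flag, re-checking `expired()` on every call means a stale token gets replaced instead of returned.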
[jira] [Commented] (BEAM-5874) Python BQ directRunnerIT test fails with half-empty assertion
[ https://issues.apache.org/jira/browse/BEAM-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885650#comment-16885650 ] Valentyn Tymofieiev commented on BEAM-5874: --- Assigning to Pablo who is investigating a similar issue (https://issues.apache.org/jira/browse/BEAM-7463). > Python BQ directRunnerIT test fails with half-empty assertion > - > > Key: BEAM-5874 > URL: https://issues.apache.org/jira/browse/BEAM-5874 > Project: Beam > Issue Type: Bug > Components: test-failures >Reporter: Henning Rohde >Assignee: Pablo Estrada >Priority: Major > Labels: currently-failing > > https://scans.gradle.com/s/2fd5wbm7vkkna/console-log?task=:beam-sdks-python:directRunnerIT#L147 > == > FAIL: test_big_query_new_types > (apache_beam.io.gcp.big_query_query_to_table_it_test.BigQueryQueryToTableIT) > -- > Traceback (most recent call last): > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_it_test.py", > line 169, in test_big_query_new_types > big_query_query_to_table_pipeline.run_bq_pipeline(options) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_pipeline.py", > line 67, in run_bq_pipeline > result = p.run() > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/testing/test_pipeline.py", > line 107, in run > else test_runner_api)) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/pipeline.py", > line 416, in run > return self.runner.run_pipeline(self) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/runners/direct/test_direct_runner.py", > line 51, in run_pipeline > hc_assert_that(self.result, pickler.loads(on_success_matcher)) > AssertionError: > Expected: (Test pipeline expected terminated in state: 
DONE and Expected > checksum is e1fbcb5ca479a5ca5f9ecf444d6998beee4d44c6) > but: > -- > XML: > /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/nosetests.xml > -- > Ran 7 tests in 24.434s > FAILED (failures=1) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (BEAM-5874) Python BQ directRunnerIT test fails with half-empty assertion
[ https://issues.apache.org/jira/browse/BEAM-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Valentyn Tymofieiev reassigned BEAM-5874: - Assignee: Pablo Estrada > Python BQ directRunnerIT test fails with half-empty assertion > - > > Key: BEAM-5874 > URL: https://issues.apache.org/jira/browse/BEAM-5874 > Project: Beam > Issue Type: Bug > Components: test-failures >Reporter: Henning Rohde >Assignee: Pablo Estrada >Priority: Major > Labels: currently-failing > > https://scans.gradle.com/s/2fd5wbm7vkkna/console-log?task=:beam-sdks-python:directRunnerIT#L147 > == > FAIL: test_big_query_new_types > (apache_beam.io.gcp.big_query_query_to_table_it_test.BigQueryQueryToTableIT) > -- > Traceback (most recent call last): > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_it_test.py", > line 169, in test_big_query_new_types > big_query_query_to_table_pipeline.run_bq_pipeline(options) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_pipeline.py", > line 67, in run_bq_pipeline > result = p.run() > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/testing/test_pipeline.py", > line 107, in run > else test_runner_api)) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/pipeline.py", > line 416, in run > return self.runner.run_pipeline(self) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/runners/direct/test_direct_runner.py", > line 51, in run_pipeline > hc_assert_that(self.result, pickler.loads(on_success_matcher)) > AssertionError: > Expected: (Test pipeline expected terminated in state: DONE and Expected > checksum is e1fbcb5ca479a5ca5f9ecf444d6998beee4d44c6) > but: > -- > XML: > 
/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/nosetests.xml > -- > Ran 7 tests in 24.434s > FAILED (failures=1) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (BEAM-7656) Add sdk-worker-parallelism arg to flink job server shadow jar
[ https://issues.apache.org/jira/browse/BEAM-7656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Weaver resolved BEAM-7656. --- Resolution: Fixed Fix Version/s: 2.15.0 > Add sdk-worker-parallelism arg to flink job server shadow jar > - > > Key: BEAM-7656 > URL: https://issues.apache.org/jira/browse/BEAM-7656 > Project: Beam > Issue Type: Improvement > Components: runner-flink >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Minor > Fix For: 2.15.0 > > Time Spent: 40m > Remaining Estimate: 0h > > It's unfortunate we have to manually add these args, but /shrug -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (BEAM-7058) Python SDK: Collect metrics in non-cython environment
[ https://issues.apache.org/jira/browse/BEAM-7058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Weaver updated BEAM-7058: -- Description: With the portable Flink runner, the metric is reported as 0, while the count metric works as expected. [https://lists.apache.org/thread.html/25eec8104bda6e4c71cc6c5e9864c335833c3ae2afe225d372479f30@%3Cdev.beam.apache.org%3E] Edit: metrics are collected properly when using cython, but not without cython. This is because state sampling has yet to be implemented in a way that does not depend on cython [1]. [1] [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/statesampler_slow.py#L62] was: With the portable Flink runner, the metric is reported as 0, while the count metric works as expected. [https://lists.apache.org/thread.html/25eec8104bda6e4c71cc6c5e9864c335833c3ae2afe225d372479f30@%3Cdev.beam.apache.org%3E] Edit: metrics are collected properly when using cython, but not without cython. > Python SDK: Collect metrics in non-cython environment > - > > Key: BEAM-7058 > URL: https://issues.apache.org/jira/browse/BEAM-7058 > Project: Beam > Issue Type: New Feature > Components: runner-flink, sdk-py-harness >Reporter: Thomas Weise >Assignee: Alex Amato >Priority: Major > Labels: metrics, portability-flink, portable-metrics-bugs > Attachments: test-metrics.txt > > > With the portable Flink runner, the metric is reported as 0, while the count > metric works as expected. > [https://lists.apache.org/thread.html/25eec8104bda6e4c71cc6c5e9864c335833c3ae2afe225d372479f30@%3Cdev.beam.apache.org%3E] > Edit: metrics are collected properly when using cython, but not without > cython. This is because state sampling has yet to be implemented in a way > that does not depend on cython [1]. > [1] > [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/statesampler_slow.py#L62] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
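The linked `statesampler_slow.py` shows that state sampling is currently a no-op without Cython. As a rough sketch of what a pure-Python implementation could look like (the class and method names are assumptions, not Beam's actual API), a daemon thread can periodically charge elapsed wall time to whichever state is current:

```python
import threading
import time
from collections import defaultdict


class SimpleStateSampler:
    """Minimal wall-clock state sampler with no Cython dependency.

    A background thread wakes up every `period` seconds and charges the
    elapsed time (in milliseconds) to the current state's counter.
    """

    def __init__(self, period=0.01):
        self._period = period
        self._current = 'idle'
        self._msecs = defaultdict(float)
        self._running = False
        self._thread = None

    def enter(self, state):
        # Record which state subsequent samples should be charged to.
        self._current = state

    def start(self):
        self._running = True
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self):
        self._running = False
        self._thread.join()

    def _run(self):
        last = time.monotonic()
        while self._running:
            time.sleep(self._period)
            now = time.monotonic()
            self._msecs[self._current] += (now - last) * 1000.0
            last = now
```

Sampling trades accuracy for low overhead: the per-state totals converge on real execution time without instrumenting every state transition, which is why the Cython version uses the same approach.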
[jira] [Updated] (BEAM-7058) Python SDK: Collect metrics in non-cython environment
[ https://issues.apache.org/jira/browse/BEAM-7058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Weaver updated BEAM-7058: -- Description: With the portable Flink runner, the metric is reported as 0, while the count metric works as expected. [https://lists.apache.org/thread.html/25eec8104bda6e4c71cc6c5e9864c335833c3ae2afe225d372479f30@%3Cdev.beam.apache.org%3E] Edit: metrics are collected properly when using cython, but not without cython. was: With the portable Flink runner, the metric is reported as 0, while the count metric works as expected. [https://lists.apache.org/thread.html/25eec8104bda6e4c71cc6c5e9864c335833c3ae2afe225d372479f30@%3Cdev.beam.apache.org%3E] > Python SDK: Collect metrics in non-cython environment > - > > Key: BEAM-7058 > URL: https://issues.apache.org/jira/browse/BEAM-7058 > Project: Beam > Issue Type: New Feature > Components: runner-flink, sdk-py-harness >Reporter: Thomas Weise >Assignee: Alex Amato >Priority: Major > Labels: metrics, portability-flink, portable-metrics-bugs > Attachments: test-metrics.txt > > > With the portable Flink runner, the metric is reported as 0, while the count > metric works as expected. > [https://lists.apache.org/thread.html/25eec8104bda6e4c71cc6c5e9864c335833c3ae2afe225d372479f30@%3Cdev.beam.apache.org%3E] > Edit: metrics are collected properly when using cython, but not without > cython. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7058) Python SDK: Collect metrics in non-cython environment
[ https://issues.apache.org/jira/browse/BEAM-7058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885630#comment-16885630 ] Kyle Weaver commented on BEAM-7058: --- We will leave this open. I have updated the issue type and description accordingly. > Python SDK: Collect metrics in non-cython environment > - > > Key: BEAM-7058 > URL: https://issues.apache.org/jira/browse/BEAM-7058 > Project: Beam > Issue Type: New Feature > Components: runner-flink, sdk-py-harness >Reporter: Thomas Weise >Assignee: Alex Amato >Priority: Major > Labels: metrics, portability-flink, portable-metrics-bugs > Attachments: test-metrics.txt > > > With the portable Flink runner, the metric is reported as 0, while the count > metric works as expected. > [https://lists.apache.org/thread.html/25eec8104bda6e4c71cc6c5e9864c335833c3ae2afe225d372479f30@%3Cdev.beam.apache.org%3E] > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (BEAM-7058) Python SDK: Collect metrics in non-cython environment
[ https://issues.apache.org/jira/browse/BEAM-7058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Weaver updated BEAM-7058: -- Summary: Python SDK: Collect metrics in non-cython environment (was: Python SDK: Implement state sampling for non-cython environment) > Python SDK: Collect metrics in non-cython environment > - > > Key: BEAM-7058 > URL: https://issues.apache.org/jira/browse/BEAM-7058 > Project: Beam > Issue Type: New Feature > Components: runner-flink, sdk-py-harness >Reporter: Thomas Weise >Assignee: Alex Amato >Priority: Major > Labels: metrics, portability-flink, portable-metrics-bugs > Attachments: test-metrics.txt > > > With the portable Flink runner, the metric is reported as 0, while the count > metric works as expected. > [https://lists.apache.org/thread.html/25eec8104bda6e4c71cc6c5e9864c335833c3ae2afe225d372479f30@%3Cdev.beam.apache.org%3E] > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (BEAM-7058) Python SDK: Implement state sampling for non-cython environment
[ https://issues.apache.org/jira/browse/BEAM-7058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Weaver updated BEAM-7058: -- Summary: Python SDK: Implement state sampling for non-cython environment (was: Python SDK metric process_bundle_msecs reported as zero) > Python SDK: Implement state sampling for non-cython environment > --- > > Key: BEAM-7058 > URL: https://issues.apache.org/jira/browse/BEAM-7058 > Project: Beam > Issue Type: New Feature > Components: runner-flink, sdk-py-harness >Reporter: Thomas Weise >Assignee: Alex Amato >Priority: Major > Labels: metrics, portability-flink, portable-metrics-bugs > Attachments: test-metrics.txt > > > With the portable Flink runner, the metric is reported as 0, while the count > metric works as expected. > [https://lists.apache.org/thread.html/25eec8104bda6e4c71cc6c5e9864c335833c3ae2afe225d372479f30@%3Cdev.beam.apache.org%3E] > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (BEAM-7058) Python SDK metric process_bundle_msecs reported as zero
[ https://issues.apache.org/jira/browse/BEAM-7058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Weaver updated BEAM-7058: -- Issue Type: New Feature (was: Bug) > Python SDK metric process_bundle_msecs reported as zero > --- > > Key: BEAM-7058 > URL: https://issues.apache.org/jira/browse/BEAM-7058 > Project: Beam > Issue Type: New Feature > Components: runner-flink, sdk-py-harness >Reporter: Thomas Weise >Assignee: Alex Amato >Priority: Major > Labels: metrics, portability-flink, portable-metrics-bugs > Attachments: test-metrics.txt > > > With the portable Flink runner, the metric is reported as 0, while the count > metric works as expected. > [https://lists.apache.org/thread.html/25eec8104bda6e4c71cc6c5e9864c335833c3ae2afe225d372479f30@%3Cdev.beam.apache.org%3E] > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276996&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276996 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 21:15 Start Date: 15/Jul/19 21:15 Worklog Time Spent: 10m Work Description: amaliujia commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303639762 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamRuleSets.java ## @@ -103,6 +106,10 @@ // join rules JoinPushExpressionsRule.INSTANCE, + JoinCommuteRule.INSTANCE, Review comment: Give me some time to check the details of the PR and to work out what a useful test could be. Generally, if we are making changes that we cannot test, we should postpone such changes until we figure out how to test them. I think planner rules can be tested, because I see that Calcite and Flink have tests around optimization rules. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276996) Time Spent: 7.5h (was: 7h 20m) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 7.5h > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7749) Establish a strategy to prevent or mitigate quota-related flakes in Beam integration tests.
[ https://issues.apache.org/jira/browse/BEAM-7749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885617#comment-16885617 ] Valentyn Tymofieiev commented on BEAM-7749: --- In the mean time as a mitigation, maintainers of apache-beam-testing project could request quota increases in http://console.cloud.google.com/iam-admin/quotas?project=apache-beam-testing. > Establish a strategy to prevent or mitigate quota-related flakes in Beam > integration tests. > --- > > Key: BEAM-7749 > URL: https://issues.apache.org/jira/browse/BEAM-7749 > Project: Beam > Issue Type: Bug > Components: testing >Reporter: Valentyn Tymofieiev >Assignee: Alan Myrvold >Priority: Major > > Every now and then our postcommit tests fail due to lack of quota in > apache-beam-testing, see [1]. > Filing this as an umbrella issue to establish a strategy for preventing > quota-related failures in Apache Beam tests. > Potential ideas: > - improve resilience of test infrastructure to retry tests that fail due to > lack of quota, or delay the test execution when quota limits fall below a > threshold > - monitor & procure the quota usage in apache-beam-testing > - tune the parameters affecting the number of concurrent test jobs according > to available quota, and document best practices. > - stop previous jenkins job that were triggered on a PR when a new job is > started. > [1] > https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20component%20%3D%20test-failures%20AND%20text%20~%20quota -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (BEAM-2264) Re-use credential instead of generating a new one on each GCS call
[ https://issues.apache.org/jira/browse/BEAM-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885611#comment-16885611 ] Luke Cwik edited comment on BEAM-2264 at 7/15/19 8:58 PM: -- It looks like Python has the same issue where google.auth replaces oauth2client since oauth2client was [deprecated|https://google-auth.readthedocs.io/en/latest/oauth2client-deprecation.html] for very similar reasons to the Java one. But that is separate from this task it seems with BEAM-7352 but it may solve some of the threading issues that are being referred to. WDYT? was (Author: lcwik): It looks like Python has the same issue where google.auth replaces oauth2client since oauth2client was [deprecated|https://google-auth.readthedocs.io/en/latest/oauth2client-deprecation.html] for very similar reasons to the Java one. But that is separate from this task it seems with BEAM-7352 > Re-use credential instead of generating a new one one each GCS call > --- > > Key: BEAM-2264 > URL: https://issues.apache.org/jira/browse/BEAM-2264 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Luke Cwik >Priority: Minor > Time Spent: 1h 10m > Remaining Estimate: 0h > > We should cache the credential used within a Pipeline and re-use it instead > of generating a new one on each GCS call. When executing (against 2.0.0 RC2): > {code} > python -m apache_beam.examples.wordcount --input > "gs://dataflow-samples/shakespeare/*" --output local_counts > {code} > Note that we seemingly generate a new access token each time instead of when > a refresh is required. > {code} > super(GcsIO, cls).__new__(cls, storage_client)) > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 1 files. 
Estimation > took 0.286200046539 seconds > INFO:root:Running pipeline with DirectRunner. > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 43 files. Estimation > took 0.205624818802 seconds > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > ... many more times ... > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (BEAM-7749) Establish a strategy to prevent or mitigate quota-related flakes in Beam integration tests.
Valentyn Tymofieiev created BEAM-7749: - Summary: Establish a strategy to prevent or mitigate quota-related flakes in Beam integration tests. Key: BEAM-7749 URL: https://issues.apache.org/jira/browse/BEAM-7749 Project: Beam Issue Type: Bug Components: testing Reporter: Valentyn Tymofieiev Assignee: Alan Myrvold Every now and then our postcommit tests fail due to lack of quota in apache-beam-testing, see [1]. Filing this as an umbrella issue to establish a strategy for preventing quota-related failures in Apache Beam tests. Potential ideas: - improve resilience of test infrastructure to retry tests that fail due to lack of quota, or delay the test execution when quota limits fall below a threshold - monitor & procure the quota usage in apache-beam-testing - tune the parameters affecting the number of concurrent test jobs according to available quota, and document best practices. - stop previous jenkins job that were triggered on a PR when a new job is started. [1] https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20component%20%3D%20test-failures%20AND%20text%20~%20quota -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-2264) Re-use credential instead of generating a new one on each GCS call
[ https://issues.apache.org/jira/browse/BEAM-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885611#comment-16885611 ] Luke Cwik commented on BEAM-2264: - It looks like Python has the same issue where google.auth replaces oauth2client since oauth2client was [deprecated|https://google-auth.readthedocs.io/en/latest/oauth2client-deprecation.html] for very similar reasons to the Java one. But that is separate from this task it seems with BEAM-7352 > Re-use credential instead of generating a new one one each GCS call > --- > > Key: BEAM-2264 > URL: https://issues.apache.org/jira/browse/BEAM-2264 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Luke Cwik >Priority: Minor > Time Spent: 1h 10m > Remaining Estimate: 0h > > We should cache the credential used within a Pipeline and re-use it instead > of generating a new one on each GCS call. When executing (against 2.0.0 RC2): > {code} > python -m apache_beam.examples.wordcount --input > "gs://dataflow-samples/shakespeare/*" --output local_counts > {code} > Note that we seemingly generate a new access token each time instead of when > a refresh is required. > {code} > super(GcsIO, cls).__new__(cls, storage_client)) > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 1 files. Estimation > took 0.286200046539 seconds > INFO:root:Running pipeline with DirectRunner. > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 43 files. 
Estimation > took 0.205624818802 seconds > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > ... many more times ... > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7058) Python SDK metric process_bundle_msecs reported as zero
[ https://issues.apache.org/jira/browse/BEAM-7058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885610#comment-16885610 ] Rakesh Kumar commented on BEAM-7058: After installing cython in the application environment, I can see that {{\[organization_specific_prefix\].operator.beam-metric-pardo_execution_time-process_bundle_msecs-v1.gauge.mean}} is being populated correctly. Thanks [~angoenka] for looking into this. Should we keep this jira ticket open to provide the implementation for non-cython version? > Python SDK metric process_bundle_msecs reported as zero > --- > > Key: BEAM-7058 > URL: https://issues.apache.org/jira/browse/BEAM-7058 > Project: Beam > Issue Type: Bug > Components: runner-flink, sdk-py-harness >Reporter: Thomas Weise >Assignee: Alex Amato >Priority: Major > Labels: metrics, portability-flink, portable-metrics-bugs > Attachments: test-metrics.txt > > > With the portable Flink runner, the metric is reported as 0, while the count > metric works as expected. > [https://lists.apache.org/thread.html/25eec8104bda6e4c71cc6c5e9864c335833c3ae2afe225d372479f30@%3Cdev.beam.apache.org%3E] > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7666) Pipe memory thrashing signal to Dataflow
[ https://issues.apache.org/jira/browse/BEAM-7666?focusedWorklogId=276976&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276976 ] ASF GitHub Bot logged work on BEAM-7666: Author: ASF GitHub Bot Created on: 15/Jul/19 20:45 Start Date: 15/Jul/19 20:45 Worklog Time Spent: 10m Work Description: angoenka commented on pull request #8971: [BEAM-7666] Throttle piping URL: https://github.com/apache/beam/pull/8971 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276976) Time Spent: 20m (was: 10m) > Pipe memory thrashing signal to Dataflow > > > Key: BEAM-7666 > URL: https://issues.apache.org/jira/browse/BEAM-7666 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Dustin Rhodes >Assignee: Dustin Rhodes >Priority: Critical > Time Spent: 20m > Remaining Estimate: 0h > > For autoscaling we would like to know if the user worker is spending too much > time garbage collecting. Pipe this signal through counters to DF. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7527) Python 3 test parallelization causes test flakines due to ModuleNotFoundError.
[ https://issues.apache.org/jira/browse/BEAM-7527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885597#comment-16885597 ] Valentyn Tymofieiev commented on BEAM-7527: --- I'll take a look at this while Mark is OOO. > Python 3 test parallelization causes test flakines due to > ModuleNotFoundError. > --- > > Key: BEAM-7527 > URL: https://issues.apache.org/jira/browse/BEAM-7527 > Project: Beam > Issue Type: Sub-task > Components: test-failures >Reporter: Valentyn Tymofieiev >Assignee: Valentyn Tymofieiev >Priority: Blocker > > I am seeing several errors in Python SDK Integration test suites, such as > Dataflow ValidatesRunner and Python PostCommit that fail due to one of the > autogenerated files not being found. > For example: > {noformat} > /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/__init__.py:84: > UserWarning: Running the Apache Beam SDK on Python 3 is not yet fully > supported. You may encounter buggy behavior or missing features. > 'Running the Apache Beam SDK on Python 3 is not yet fully supported. ' > Failure: ModuleNotFoundError (No module named 'beam_runner_api_pb2') ... 
> ERROR > == > ERROR: Failure: ModuleNotFoundError (No module named 'beam_runner_api_pb2') > -- > Traceback (most recent call last): > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/site-packages/nose/failure.py", > line 39, in runTest > raise self.exc_val.with_traceback(self.tb) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/site-packages/nose/loader.py", > line 418, in loadTestsFromName > addr.filename, addr.module) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/site-packages/nose/importer.py", > line 47, in importFromPath > return self.importFromDir(dir_path, fqname) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/site-packages/nose/importer.py", > line 94, in importFromDir > mod = load_module(part_fqname, fh, filename, desc) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/imp.py", > line 245, in load_module > return load_package(name, filename) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/imp.py", > line 217, in load_package > return _load(spec) > File "", line 684, in _load > File "", line 665, in _load_unlocked > File "", line 678, in exec_module > File "", line 219, in _call_with_frames_removed > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/__init__.py", > line 97, in > from apache_beam import coders > File "/home/jenkins/jenkins-slave/workspace/beam_Pos > tCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/coders/__init__.py", line > 19, in > from apache_beam.coders.coders import * > File > 
"/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/coders/coders.py", > line 32, in > from apache_beam.coders import coder_impl > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/coders/coder_impl.py", > line 44, in > from apache_beam.utils import windowed_value > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/utils/windowed_value.py", > line 34, in > from apache_beam.utils.timestamp import MAX_TIMESTAMP > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/utils/timestamp.py", > line 34, in > from apache_beam.portability import common_urns > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/portability/common_urns.py", > line 25, in > from apache_beam.portability.api import metrics_pb2 > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/portability/api/metrics_pb2.py", > line 16, in > import beam_runner_api_pb2 as beam__runner__api__pb2 > ModuleNotFoundError: No module named 'beam_runner_api_pb2' > {noformat} > {noformat} > /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/__
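For context on the failure above: protoc-generated modules import sibling modules by bare name (`import beam_runner_api_pb2`), which worked under Python 2's implicit relative imports but raises `ModuleNotFoundError` on Python 3 unless the generated directory is itself on `sys.path`. A self-contained reproduction of the pattern follows; the file and module names are illustrative, not the real generated files:

```python
import os
import sys
import tempfile

# Build a fake "generated" directory: fake_metrics_pb2 imports a sibling
# by bare name, exactly like the protoc output in the traceback above.
gen_dir = tempfile.mkdtemp()
with open(os.path.join(gen_dir, 'sibling_pb2.py'), 'w') as f:
    f.write('VALUE = 42\n')
with open(os.path.join(gen_dir, 'fake_metrics_pb2.py'), 'w') as f:
    f.write('import sibling_pb2\nVALUE = sibling_pb2.VALUE\n')

# On Python 3, the bare 'import sibling_pb2' only resolves if gen_dir is
# on sys.path; adding it is the workaround. The durable fix is to generate
# explicit relative imports, e.g. 'from . import sibling_pb2'.
sys.path.insert(0, gen_dir)
import fake_metrics_pb2
```

Under test parallelization, whether the generated directory ends up on `sys.path` can depend on which module a worker imports first, which would explain why the failure is flaky rather than deterministic.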
[jira] [Assigned] (BEAM-7527) Python 3 test parallelization causes test flakiness due to ModuleNotFoundError.
[ https://issues.apache.org/jira/browse/BEAM-7527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Valentyn Tymofieiev reassigned BEAM-7527: - Assignee: Valentyn Tymofieiev (was: Mark Liu) > Python 3 test parallelization causes test flakiness due to > ModuleNotFoundError. > --- > > Key: BEAM-7527 > URL: https://issues.apache.org/jira/browse/BEAM-7527 > Project: Beam > Issue Type: Sub-task > Components: test-failures >Reporter: Valentyn Tymofieiev >Assignee: Valentyn Tymofieiev >Priority: Blocker > > I am seeing several errors in Python SDK Integration test suites, such as > Dataflow ValidatesRunner and Python PostCommit, that fail due to one of the > autogenerated files not being found. > For example: > {noformat} > /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/__init__.py:84: > UserWarning: Running the Apache Beam SDK on Python 3 is not yet fully > supported. You may encounter buggy behavior or missing features. > 'Running the Apache Beam SDK on Python 3 is not yet fully supported. ' > Failure: ModuleNotFoundError (No module named 'beam_runner_api_pb2') ... 
> ERROR > == > ERROR: Failure: ModuleNotFoundError (No module named 'beam_runner_api_pb2') > -- > Traceback (most recent call last): > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/site-packages/nose/failure.py", > line 39, in runTest > raise self.exc_val.with_traceback(self.tb) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/site-packages/nose/loader.py", > line 418, in loadTestsFromName > addr.filename, addr.module) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/site-packages/nose/importer.py", > line 47, in importFromPath > return self.importFromDir(dir_path, fqname) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/site-packages/nose/importer.py", > line 94, in importFromDir > mod = load_module(part_fqname, fh, filename, desc) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/imp.py", > line 245, in load_module > return load_package(name, filename) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/imp.py", > line 217, in load_package > return _load(spec) > File "", line 684, in _load > File "", line 665, in _load_unlocked > File "", line 678, in exec_module > File "", line 219, in _call_with_frames_removed > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/__init__.py", > line 97, in > from apache_beam import coders > File "/home/jenkins/jenkins-slave/workspace/beam_Pos > tCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/coders/__init__.py", line > 19, in > from apache_beam.coders.coders import * > File > 
"/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/coders/coders.py", > line 32, in > from apache_beam.coders import coder_impl > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/coders/coder_impl.py", > line 44, in > from apache_beam.utils import windowed_value > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/utils/windowed_value.py", > line 34, in > from apache_beam.utils.timestamp import MAX_TIMESTAMP > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/utils/timestamp.py", > line 34, in > from apache_beam.portability import common_urns > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/portability/common_urns.py", > line 25, in > from apache_beam.portability.api import metrics_pb2 > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/portability/api/metrics_pb2.py", > line 16, in > import beam_runner_api_pb2 as beam__runner__api__pb2 > ModuleNotFoundError: No module named 'beam_runner_api_pb2' > {noformat} > {noformat} > /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/__init__.py:84: > UserWarning: Running the Ap
[jira] [Work logged] (BEAM-7656) Add sdk-worker-parallelism arg to flink job server shadow jar
[ https://issues.apache.org/jira/browse/BEAM-7656?focusedWorklogId=276974&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276974 ] ASF GitHub Bot logged work on BEAM-7656: Author: ASF GitHub Bot Created on: 15/Jul/19 20:41 Start Date: 15/Jul/19 20:41 Worklog Time Spent: 10m Work Description: angoenka commented on issue #8967: [BEAM-7656] Add sdk-worker-parallelism arg to flink job server shadow… URL: https://github.com/apache/beam/pull/8967#issuecomment-511561625 LGTM Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276974) Time Spent: 0.5h (was: 20m) > Add sdk-worker-parallelism arg to flink job server shadow jar > - > > Key: BEAM-7656 > URL: https://issues.apache.org/jira/browse/BEAM-7656 > Project: Beam > Issue Type: Improvement > Components: runner-flink >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Minor > Time Spent: 0.5h > Remaining Estimate: 0h > > It's unfortunate we have to manually add these args, but /shrug -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7656) Add sdk-worker-parallelism arg to flink job server shadow jar
[ https://issues.apache.org/jira/browse/BEAM-7656?focusedWorklogId=276975&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276975 ] ASF GitHub Bot logged work on BEAM-7656: Author: ASF GitHub Bot Created on: 15/Jul/19 20:41 Start Date: 15/Jul/19 20:41 Worklog Time Spent: 10m Work Description: angoenka commented on pull request #8967: [BEAM-7656] Add sdk-worker-parallelism arg to flink job server shadow… URL: https://github.com/apache/beam/pull/8967 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276975) Time Spent: 40m (was: 0.5h) > Add sdk-worker-parallelism arg to flink job server shadow jar > - > > Key: BEAM-7656 > URL: https://issues.apache.org/jira/browse/BEAM-7656 > Project: Beam > Issue Type: Improvement > Components: runner-flink >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Minor > Time Spent: 40m > Remaining Estimate: 0h > > It's unfortunate we have to manually add these args, but /shrug -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7733) Flaky tests on Dataflow because job state information is not found
[ https://issues.apache.org/jira/browse/BEAM-7733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885596#comment-16885596 ] Valentyn Tymofieiev commented on BEAM-7733: --- This is a Dataflow issue, and it is being addressed outside of Apache Beam. BEAM-6202 may help if we decide to retry 404s. > Flaky tests on Dataflow because job state information is not found > > > Key: BEAM-7733 > URL: https://issues.apache.org/jira/browse/BEAM-7733 > Project: Beam > Issue Type: Bug > Components: sdk-py-core, testing >Reporter: Kasia Kucharczyk >Priority: Major > Fix For: Not applicable > > > Some of the Python performance cron tests on Dataflow experience problems > because of: > {code:java} > AssertionError: Job did not reach to a terminal state after waiting > indefinitely. > {code} > > Dataflow returns the following errors: > {code:java} > "error": { > "code": 404, > "message": "(bf55d279f24d1255): Information about job > 2019-07-07_08_42_42-8241186059035490445 could not be found in our system. > Please double check the id is correct. If it is please contact customer > support.", > "status": "NOT_FOUND" >} > } > {code} > Those tests succeed on Dataflow, but Jenkins reports failure. > It already happened in Combine and CoGroupByKey tests: > * [example of Combine > test|https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_LoadTests_Python_Combine_Dataflow_Batch/6/console] > * [example of > CoGBK|https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_LoadTests_Python_CoGBK_Dataflow_Batch/3/console] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
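The BEAM-6202 idea mentioned above (retrying 404s) could look roughly like the sketch below. `get_job_state` is a hypothetical stand-in for the Dataflow job-state poll, not Beam's actual client code; the premise is that a NOT_FOUND right after submission can be a propagation delay rather than a real failure.

```python
import time

def poll_with_retry(get_job_state, max_retries=5, base_delay=0.01):
    """Retry transient NOT_FOUND (404) responses with exponential backoff."""
    for attempt in range(max_retries):
        status, body = get_job_state()
        if status != 404:
            return body
        # A 404 right after job submission may just mean the job record
        # has not propagated yet, so back off and try again.
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("job state still NOT_FOUND after %d retries" % max_retries)

# Simulated service: the first two polls 404, then the terminal state appears.
responses = iter([(404, None), (404, None), (200, "JOB_STATE_DONE")])
print(poll_with_retry(lambda: next(responses)))  # JOB_STATE_DONE
```

Whether a 404 should ever be treated as transient is exactly the open question in the comment above; a cap on retries keeps a genuinely missing job from hanging the test indefinitely.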
[jira] [Work logged] (BEAM-2264) Re-use credential instead of generating a new one on each GCS call
[ https://issues.apache.org/jira/browse/BEAM-2264?focusedWorklogId=276972&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276972 ] ASF GitHub Bot logged work on BEAM-2264: Author: ASF GitHub Bot Created on: 15/Jul/19 20:40 Start Date: 15/Jul/19 20:40 Worklog Time Spent: 10m Work Description: udim commented on issue #9060: [BEAM-2264] Reuse GCP credentials in GCS calls. URL: https://github.com/apache/beam/pull/9060#issuecomment-511560682 R: @lukecwik (since you have some state already, lmk if you don't have bandwidth) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276972) Time Spent: 1h 10m (was: 1h) > Re-use credential instead of generating a new one on each GCS call > --- > > Key: BEAM-2264 > URL: https://issues.apache.org/jira/browse/BEAM-2264 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Luke Cwik >Priority: Minor > Time Spent: 1h 10m > Remaining Estimate: 0h > > We should cache the credential used within a Pipeline and re-use it instead > of generating a new one on each GCS call. When executing (against 2.0.0 RC2): > {code} > python -m apache_beam.examples.wordcount --input > "gs://dataflow-samples/shakespeare/*" --output local_counts > {code} > Note that we seemingly generate a new access token each time instead of when > a refresh is required. > {code} > super(GcsIO, cls).__new__(cls, storage_client)) > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 1 files. Estimation > took 0.286200046539 seconds > INFO:root:Running pipeline with DirectRunner. 
> INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 43 files. Estimation > took 0.205624818802 seconds > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > ... many more times ... > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7748) beam_PostCommit_Py_VR_Dataflow -> :sdks:python:test-suites:dataflow:py35:validatesRunnerBatchTests -> No module named 'endpoints_pb2'
[ https://issues.apache.org/jira/browse/BEAM-7748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885590#comment-16885590 ] Valentyn Tymofieiev commented on BEAM-7748: --- This is a duplicate of https://issues.apache.org/jira/browse/BEAM-7527 > beam_PostCommit_Py_VR_Dataflow -> > :sdks:python:test-suites:dataflow:py35:validatesRunnerBatchTests -> No module > named 'endpoints_pb2' > - > > Key: BEAM-7748 > URL: https://issues.apache.org/jira/browse/BEAM-7748 > Project: Beam > Issue Type: Bug > Components: sdk-py-core, test-failures, testing >Reporter: Ankur Goenka >Priority: Major > > beam_PostCommit_Py_VR_Dataflow is failing on Py3-related test suites because > of a missing module > > [https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_PostCommit_Py_VR_Dataflow/4025/] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Closed] (BEAM-7748) beam_PostCommit_Py_VR_Dataflow -> :sdks:python:test-suites:dataflow:py35:validatesRunnerBatchTests -> No module named 'endpoints_pb2'
[ https://issues.apache.org/jira/browse/BEAM-7748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Valentyn Tymofieiev closed BEAM-7748. - Resolution: Duplicate Fix Version/s: Not applicable > beam_PostCommit_Py_VR_Dataflow -> > :sdks:python:test-suites:dataflow:py35:validatesRunnerBatchTests -> No module > named 'endpoints_pb2' > - > > Key: BEAM-7748 > URL: https://issues.apache.org/jira/browse/BEAM-7748 > Project: Beam > Issue Type: Bug > Components: sdk-py-core, test-failures, testing >Reporter: Ankur Goenka >Priority: Major > Fix For: Not applicable > > > beam_PostCommit_Py_VR_Dataflow is failing on Py3-related test suites because > of a missing module > > [https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_PostCommit_Py_VR_Dataflow/4025/] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (BEAM-7748) beam_PostCommit_Py_VR_Dataflow -> :sdks:python:test-suites:dataflow:py35:validatesRunnerBatchTests -> No module named 'endpoints_pb2'
Ankur Goenka created BEAM-7748: -- Summary: beam_PostCommit_Py_VR_Dataflow -> :sdks:python:test-suites:dataflow:py35:validatesRunnerBatchTests -> No module named 'endpoints_pb2' Key: BEAM-7748 URL: https://issues.apache.org/jira/browse/BEAM-7748 Project: Beam Issue Type: Bug Components: sdk-py-core, test-failures, testing Reporter: Ankur Goenka beam_PostCommit_Py_VR_Dataflow is failing on Py3-related test suites because of a missing module [https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_PostCommit_Py_VR_Dataflow/4025/] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
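The `ModuleNotFoundError` in these reports surfaces when a generated proto module performs a bare `import beam_runner_api_pb2` of a sibling file, an implicit relative import that Python 3 no longer performs. Whether that is the exact trigger here is not established in the ticket, but the failure mode itself is easy to reproduce with a toy package; every name below is made up for illustration.

```python
import os
import shutil
import subprocess
import sys
import tempfile

# Build a tiny package where one module imports a sibling the way a
# generated *_pb2.py can: a bare "import sibling" (implicit relative
# import, removed in Python 3) versus "from . import sibling".
pkg = tempfile.mkdtemp()
api = os.path.join(pkg, "api")
os.makedirs(api)
open(os.path.join(api, "__init__.py"), "w").close()
with open(os.path.join(api, "sibling.py"), "w") as f:
    f.write("VALUE = 42\n")
with open(os.path.join(api, "bad.py"), "w") as f:
    f.write("import sibling\n")          # fails on Python 3
with open(os.path.join(api, "good.py"), "w") as f:
    f.write("from . import sibling\n")   # explicit relative import works

def try_import(module):
    """Import api.<module> in a subprocess; return its exit code."""
    return subprocess.run(
        [sys.executable, "-c", "import api.%s" % module],
        cwd=pkg, capture_output=True).returncode

bad_rc = try_import("bad")
good_rc = try_import("good")
print(bad_rc != 0, good_rc == 0)  # True True
shutil.rmtree(pkg)
```

Under Python 2 both variants import successfully, which is consistent with these suites failing only on Py3.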
[jira] [Commented] (BEAM-2264) Re-use credential instead of generating a new one on each GCS call
[ https://issues.apache.org/jira/browse/BEAM-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885587#comment-16885587 ] Udi Meiri commented on BEAM-2264: - I'm not sure. I added a test logging.error('wrapper!') entry to _do_refresh_request() and it was only logged once, and only after the second "Attempting refresh..." entry. That may point to another user of oauth2client, or concurrent refreshes (i.e. the "wrapper!" message is from the first thread doing "Attempting refresh..."). The jsonPayload.thread field in Stackdriver logs is identical for all 3 entries (136:140478770464512), but I have no idea if that corresponds to a Python thread. {code} 2019-07-15T17:32:45.078399658Z Loading main session from the staging area... I 2019-07-15T17:32:45.100733518Z Attempting refresh to obtain initial access_token I 2019-07-15T17:32:45.102308750Z Status HTTP server running at localhost:40229 I 2019-07-15T17:32:46.167937994Z Executing workitem I 2019-07-15T17:32:46.168387413Z Memory usage of worker beamapp-ehudm-0715172721--07151027-edcl-harness-xxm0 is 114 MB I 2019-07-15T17:32:46.195270538Z Attempting refresh to obtain initial access_token I 2019-07-15T17:32:46.195525169Z wrapper! E {code} > Re-use credential instead of generating a new one on each GCS call > --- > > Key: BEAM-2264 > URL: https://issues.apache.org/jira/browse/BEAM-2264 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Luke Cwik >Priority: Minor > Time Spent: 1h > Remaining Estimate: 0h > > We should cache the credential used within a Pipeline and re-use it instead > of generating a new one on each GCS call. When executing (against 2.0.0 RC2): > {code} > python -m apache_beam.examples.wordcount --input > "gs://dataflow-samples/shakespeare/*" --output local_counts > {code} > Note that we seemingly generate a new access token each time instead of when > a refresh is required. 
> {code} > super(GcsIO, cls).__new__(cls, storage_client)) > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 1 files. Estimation > took 0.286200046539 seconds > INFO:root:Running pipeline with DirectRunner. > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 43 files. Estimation > took 0.205624818802 seconds > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > ... many more times ... > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
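The caching the ticket asks for — one token refresh shared across calls instead of one per GCS operation — can be sketched as below. `refresh_fn` stands in for the oauth2client token refresh; none of these names come from the Beam codebase, and the double-checked lock is one way (of several) to keep concurrent callers from each triggering a refresh, which is relevant given the concurrent-refresh suspicion in the comment above.

```python
import threading

_credential_lock = threading.Lock()
_cached_credential = None

def get_credential(refresh_fn):
    """Return a process-wide cached credential, refreshing at most once."""
    global _cached_credential
    if _cached_credential is None:
        with _credential_lock:
            if _cached_credential is None:  # double-checked locking
                _cached_credential = refresh_fn()
    return _cached_credential

calls = []
def fake_refresh():
    # Stand-in for the expensive OAuth refresh logged on every GCS call.
    calls.append(1)
    return "token-abc"

# Many "GCS calls" share one refresh instead of performing one each.
tokens = [get_credential(fake_refresh) for _ in range(5)]
print(tokens[0], len(calls))  # token-abc 1
```

A production version would also need expiry handling (refresh again once the token is near its deadline), which this sketch deliberately omits.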
[jira] [Commented] (BEAM-7748) beam_PostCommit_Py_VR_Dataflow -> :sdks:python:test-suites:dataflow:py35:validatesRunnerBatchTests -> No module named 'endpoints_pb2'
[ https://issues.apache.org/jira/browse/BEAM-7748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885588#comment-16885588 ] Ankur Goenka commented on BEAM-7748: cc [~tvalentyn] > beam_PostCommit_Py_VR_Dataflow -> > :sdks:python:test-suites:dataflow:py35:validatesRunnerBatchTests -> No module > named 'endpoints_pb2' > - > > Key: BEAM-7748 > URL: https://issues.apache.org/jira/browse/BEAM-7748 > Project: Beam > Issue Type: Bug > Components: sdk-py-core, test-failures, testing >Reporter: Ankur Goenka >Priority: Major > > beam_PostCommit_Py_VR_Dataflow is failing on Py3 related test suites becase > of missing module > > [https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_PostCommit_Py_VR_Dataflow/4025/] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7546) Portable WordCount-on-Flink Precommit is flaky - temporary folder not found.
[ https://issues.apache.org/jira/browse/BEAM-7546?focusedWorklogId=276966&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276966 ] ASF GitHub Bot logged work on BEAM-7546: Author: ASF GitHub Bot Created on: 15/Jul/19 20:23 Start Date: 15/Jul/19 20:23 Worklog Time Spent: 10m Work Description: angoenka commented on pull request #9046: [BEAM-7546] Increasing environment cache to avoid chances of recreati… URL: https://github.com/apache/beam/pull/9046 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276966) Time Spent: 50m (was: 40m) > Portable WordCount-on-Flink Precommit is flaky - temporary folder not found. > > > Key: BEAM-7546 > URL: https://issues.apache.org/jira/browse/BEAM-7546 > Project: Beam > Issue Type: Bug > Components: runner-flink >Reporter: Valentyn Tymofieiev >Assignee: Ankur Goenka >Priority: Major > Time Spent: 50m > Remaining Estimate: 0h > > On a few occasions I see this test fail due to a temp directory being missing. > Sample scan from > https://builds.apache.org/job/beam_PreCommit_Portable_Python_Cron/745/: > https://scans.gradle.com/s/3ra4xw4hqvlyw/console-log?task=:sdks:python:portableWordCountBatch > {noformat} > [grpc-default-executor-0] ERROR sdk_worker._execute - Error processing > instruction 8. 
Original traceback is > Traceback (most recent call last): > File > "/usr/local/lib/python2.7/site-packages/apache_beam/runners/worker/sdk_worker.py", > line 157, in _execute > response = task() > File > "/usr/local/lib/python2.7/site-packages/apache_beam/runners/worker/sdk_worker.py", > line 190, in > self._execute(lambda: worker.do_instruction(work), work) > File > "/usr/local/lib/python2.7/site-packages/apache_beam/runners/worker/sdk_worker.py", > line 342, in do_instruction > request.instruction_id) > File > "/usr/local/lib/python2.7/site-packages/apache_beam/runners/worker/sdk_worker.py", > line 368, in process_bundle > bundle_processor.process_bundle(instruction_id)) > File > "/usr/local/lib/python2.7/site-packages/apache_beam/runners/worker/bundle_processor.py", > line 589, in process_bundle > ].process_encoded(data.data) > File > "/usr/local/lib/python2.7/site-packages/apache_beam/runners/worker/bundle_processor.py", > line 143, in process_encoded > self.output(decoded_value) > File "apache_beam/runners/worker/operations.py", line 255, in > apache_beam.runners.worker.operations.Operation.output > def output(self, windowed_value, output_index=0): > File "apache_beam/runners/worker/operations.py", line 256, in > apache_beam.runners.worker.operations.Operation.output > cython.cast(Receiver, > self.receivers[output_index]).receive(windowed_value) > File "apache_beam/runners/worker/operations.py", line 143, in > apache_beam.runners.worker.operations.SingletonConsumerSet.receive > self.consumer.process(windowed_value) > File "apache_beam/runners/worker/operations.py", line 593, in > apache_beam.runners.worker.operations.DoOperation.process > with self.scoped_process_state: > File "apache_beam/runners/worker/operations.py", line 594, in > apache_beam.runners.worker.operations.DoOperation.process > delayed_application = self.dofn_receiver.receive(o) > File "apache_beam/runners/common.py", line 778, in > apache_beam.runners.common.DoFnRun > ner.receive > 
self.process(windowed_value) > File "apache_beam/runners/common.py", line 784, in > apache_beam.runners.common.DoFnRunner.process > self._reraise_augmented(exn) > File "apache_beam/runners/common.py", line 851, in > apache_beam.runners.common.DoFnRunner._reraise_augmented > raise_with_traceback(new_exn) > File "apache_beam/runners/common.py", line 782, in > apache_beam.runners.common.DoFnRunner.process > return self.do_fn_invoker.invoke_process(windowed_value) > File "apache_beam/runners/common.py", line 594, in > apache_beam.runners.common.PerWindowInvoker.invoke_process > self._invoke_process_per_window( > File "apache_beam/runners/common.py", line 666, in > apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window > windowed_value, self.process_method(*args_for_process)) > File "/usr/local/lib/python2.7/site-packages/apache_beam/io/iobase.py", > line 1041, in process > self.writer = self.sink.open_writer(init_result, str(uuid.uui
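The fix referenced above (BEAM-7546) increases the environment cache size. The intuition — a larger LRU cache makes it less likely that a still-needed SDK environment (and its temp directory) gets evicted and recreated mid-job — can be sketched with a toy cache; this is a hypothetical class, not the Flink runner's actual environment cache.

```python
from collections import OrderedDict

class EnvironmentCache:
    """Tiny LRU cache: a larger capacity lowers the chance that a
    still-needed environment (and its temp directory) is evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()
        self.creations = 0  # counts expensive environment startups

    def get(self, env_id, create_fn):
        if env_id in self._entries:
            self._entries.move_to_end(env_id)  # mark as recently used
            return self._entries[env_id]
        if len(self._entries) >= self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
        self.creations += 1
        self._entries[env_id] = create_fn(env_id)
        return self._entries[env_id]

cache = EnvironmentCache(capacity=2)
for env in ["py-env", "java-env", "py-env", "py-env"]:
    cache.get(env, lambda e: "container:" + e)
print(cache.creations)  # 2: each environment is created once, then reused
```

With capacity 1, the same access pattern would evict and recreate `py-env`, which mirrors the recreate-and-lose-the-temp-folder flake being addressed.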
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276964&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276964 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 20:18 Start Date: 15/Jul/19 20:18 Worklog Time Spent: 10m Work Description: akedin commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303617190 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamRuleSets.java ## @@ -103,6 +106,10 @@ // join rules JoinPushExpressionsRule.INSTANCE, + JoinCommuteRule.INSTANCE, Review comment: In general we should have tests. However, in this case the inputs/outputs are not changing depending on the inputs. We should make sure that we have a couple of tests that we know trigger these rules, but we cannot enforce it; i.e., I don't know how we can ensure that we even hit the rule by looking at the results. If we don't know what a rule breaks, can we come up with a meaningful test for it? I also don't think we should check the plans. It would be too brittle. Rules can be enabled and disabled independently and probably interact in hard-to-predict ways. If VolcanoPlanner (or Calcite machinery in general) is tested, then we should not re-test it and should accept that it applies the rules correctly. We add some rules that are applied the same way as all other rules and inherit from them. In this case the rules behave the same way as built-in Calcite rules, and reject the plans that we already don't support in the join rel. It feels like the only issue could be the applicability of the reordering operation itself. Looking at what join operations we support and how it works (e.g. 
window is inherited for non-merging windows, or the input has to be explicitly re-windowed for merging windows, and that we only allow the default trigger that has itself as a continuation trigger), it looks safe, and off the top of my head I cannot come up with a test case that we accept but that would break with the rule / without the rule. Probably a unit test of the join rejection would make sense, but I don't think it's really valuable, as we literally call the same code that we already run in the existing join rel, which probably already has tests (this should be confirmed) that trigger that logic. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276964) Time Spent: 7h 20m (was: 7h 10m) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 7h 20m > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
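The BEAM-7545 description above — sample the first few lines, then divide total file size by the average sampled line length — can be sketched as follows. The helper name is made up for illustration; this is not the actual TextTable implementation.

```python
import os
import tempfile

def estimate_row_count(path, sample_lines=100):
    """Estimate rows in a text file: total size / average sampled line length."""
    total_bytes = os.path.getsize(path)
    if total_bytes == 0:
        return 0
    sampled = []
    with open(path, "rb") as f:
        for _ in range(sample_lines):
            line = f.readline()
            if not line:
                break
            sampled.append(len(line))  # includes the newline byte
    if not sampled:
        return 0
    avg_line_length = sum(sampled) / len(sampled)
    return int(round(total_bytes / avg_line_length))

# Example: 1000 identical lines, so the estimate is exact.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b,c\n" * 1000)
print(estimate_row_count(tmp.name))  # 1000
os.unlink(tmp.name)
```

For files whose early lines are unrepresentative (e.g. a long header or highly variable record widths), the estimate degrades, which is the usual trade-off of prefix sampling against a full scan.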
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276959&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276959 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 20:16 Start Date: 15/Jul/19 20:16 Worklog Time Spent: 10m Work Description: riazela commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303616264 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamJoinAssociateRule.java ## @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.beam.sdk.extensions.sql.impl.rule; + +import org.apache.beam.sdk.extensions.sql.impl.rel.BeamJoinRel; +import org.apache.calcite.plan.RelOptRuleCall; +import org.apache.calcite.rel.core.Join; +import org.apache.calcite.rel.core.RelFactories; +import org.apache.calcite.rel.rules.JoinAssociateRule; +import org.apache.calcite.tools.RelBuilderFactory; + +/** + * This is very similar to {@link org.apache.calcite.rel.rules.JoinAssociateRule}. It only checks if + * the resulting condition is supported before transforming. 
+ */ +public class BeamJoinAssociateRule extends JoinAssociateRule { + // ~ Static fields/initializers - + + /** The singleton. */ + public static final org.apache.beam.sdk.extensions.sql.impl.rule.BeamJoinAssociateRule INSTANCE = + new org.apache.beam.sdk.extensions.sql.impl.rule.BeamJoinAssociateRule( + RelFactories.LOGICAL_BUILDER); + + // ~ Constructors --- Review comment: Removed it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276959) Time Spent: 7h 10m (was: 7h) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 7h 10m > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276957&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276957 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 20:02 Start Date: 15/Jul/19 20:02 Worklog Time Spent: 10m Work Description: amaliujia commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303607515 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamRuleSets.java ## @@ -103,6 +106,10 @@ // join rules JoinPushExpressionsRule.INSTANCE, + JoinCommuteRule.INSTANCE, Review comment: But isn't it uncommon to check in a change without any test to guard it? Also, a lack of tests is what has let us enable rules that then broke users. I am thinking of a test where you can enable the rules and then check the output plan? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276957) Time Spent: 7h (was: 6h 50m) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 7h > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276953&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276953 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 19:59 Start Date: 15/Jul/19 19:59 Worklog Time Spent: 10m Work Description: amaliujia commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303607515 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamRuleSets.java ## @@ -103,6 +106,10 @@ // join rules JoinPushExpressionsRule.INSTANCE, + JoinCommuteRule.INSTANCE, Review comment: But isn't it less common to have a change without any test? Also, a lack of tests is one of the reasons we have enabled rules and then broken users in the past. I am thinking of a test in which you enable the rules and then check the output plan. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276953) Time Spent: 6h 50m (was: 6h 40m) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 6h 50m > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276946&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276946 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 19:54 Start Date: 15/Jul/19 19:54 Worklog Time Spent: 10m Work Description: amaliujia commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303607515 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamRuleSets.java ## @@ -103,6 +106,10 @@ // join rules JoinPushExpressionsRule.INSTANCE, + JoinCommuteRule.INSTANCE, Review comment: But isn't it less common to have a change without any test? I am thinking of a test in which you enable the rules and then check the output plan. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276946) Time Spent: 6h 40m (was: 6.5h) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 6h 40m > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (BEAM-7747) ERROR: test_sink_transform (apache_beam.io.avroio_test.TestFastAvro) Fails on Windows
Valentyn Tymofieiev created BEAM-7747: - Summary: ERROR: test_sink_transform (apache_beam.io.avroio_test.TestFastAvro) Fails on Windows Key: BEAM-7747 URL: https://issues.apache.org/jira/browse/BEAM-7747 Project: Beam Issue Type: Bug Components: io-python-avro, test-failures Reporter: Valentyn Tymofieiev == ERROR: test_sink_transform (apache_beam.io.avroio_test.TestFastAvro) -- Traceback (most recent call last): File "C:\projects\beam\sdks\python\apache_beam\io\avroio_test.py", line 436, in test_sink_transform | avroio.WriteToAvro(path, self.SCHEMA, use_fastavro=self.use_fastavro) File "C:\projects\beam\sdks\python\apache_beam\pipeline.py", line 426, in __exit__ self.run().wait_until_finish() File "C:\projects\beam\sdks\python\apache_beam\testing\test_pipeline.py", line 107, in run else test_runner_api)) File "C:\projects\beam\sdks\python\apache_beam\pipeline.py", line 406, in run self._options).run(False) File "C:\projects\beam\sdks\python\apache_beam\pipeline.py", line 419, in run return self.runner.run_pipeline(self, self._options) File "C:\projects\beam\sdks\python\apache_beam\runners\direct\direct_runner.py", line 128, in run_pipeline return runner.run_pipeline(pipeline, options) File "C:\projects\beam\sdks\python\apache_beam\runners\portability\fn_api_runner.py", line 319, in run_pipeline default_environment=self._default_environment)) File "C:\projects\beam\sdks\python\apache_beam\runners\portability\fn_api_runner.py", line 326, in run_via_runner_api return self.run_stages(stage_context, stages) File "C:\projects\beam\sdks\python\apache_beam\runners\portability\fn_api_runner.py", line 408, in run_stages stage_context.safe_coders) File "C:\projects\beam\sdks\python\apache_beam\runners\portability\fn_api_runner.py", line 681, in _run_stage result, splits = bundle_manager.process_bundle(data_input, data_output) File "C:\projects\beam\sdks\python\apache_beam\runners\portability\fn_api_runner.py", line 1562, in process_bundle part_inputs): File 
"C:\venv\newenv1\lib\site-packages\concurrent\futures\_base.py", line 641, in result_iterator yield fs.pop().result() File "C:\venv\newenv1\lib\site-packages\concurrent\futures\_base.py", line 462, in result return self.__get_result() File "C:\venv\newenv1\lib\site-packages\concurrent\futures\thread.py", line 63, in run result = self.fn(*self.args, **self.kwargs) File "C:\projects\beam\sdks\python\apache_beam\runners\portability\fn_api_runner.py", line 1561, in self._registered).process_bundle(part, expected_outputs), File "C:\projects\beam\sdks\python\apache_beam\runners\portability\fn_api_runner.py", line 1500, in process_bundle result_future = self._controller.control_handler.push(process_bundle_req) File "C:\projects\beam\sdks\python\apache_beam\runners\portability\fn_api_runner.py", line 1017, in push response = self.worker.do_instruction(request) File "C:\projects\beam\sdks\python\apache_beam\runners\worker\sdk_worker.py", line 342, in do_instruction request.instruction_id) File "C:\projects\beam\sdks\python\apache_beam\runners\worker\sdk_worker.py", line 368, in process_bundle bundle_processor.process_bundle(instruction_id)) File "C:\projects\beam\sdks\python\apache_beam\runners\worker\bundle_processor.py", line 593, in process_bundle data.ptransform_id].process_encoded(data.data) File "C:\projects\beam\sdks\python\apache_beam\runners\worker\bundle_processor.py", line 143, in process_encoded self.output(decoded_value) File "C:\projects\beam\sdks\python\apache_beam\runners\worker\operations.py", line 256, in output cython.cast(Receiver, self.receivers[output_index]).receive(windowed_value) File "C:\projects\beam\sdks\python\apache_beam\runners\worker\operations.py", line 143, in receive self.consumer.process(windowed_value) File "C:\projects\beam\sdks\python\apache_beam\runners\worker\operations.py", line 594, in process delayed_application = self.dofn_receiver.receive(o) File "C:\projects\beam\sdks\python\apache_beam\runners\common.py", line 795, in receive 
self.process(windowed_value) File "C:\projects\beam\sdks\python\apache_beam\runners\common.py", line 801, in process self._reraise_augmented(exn) File "C:\projects\beam\sdks\python\apache_beam\runners\common.py", line 868, in _reraise_augmented raise_with_traceback(new_exn) File "C:\projects\beam\sdks\python\apache_beam\runners\common.py", line 799, in process return self.do_fn_invoker.invoke_process(windowed_value) File "C:\projects\beam\sdks\python\apache_beam\runners\common.py", line 611, in invoke_process windowed_value, ad
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276935&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276935 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 19:01 Start Date: 15/Jul/19 19:01 Worklog Time Spent: 10m Work Description: akedin commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303540287 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamJoinAssociateRule.java ## @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.beam.sdk.extensions.sql.impl.rule; + +import org.apache.beam.sdk.extensions.sql.impl.rel.BeamJoinRel; +import org.apache.calcite.plan.RelOptRuleCall; +import org.apache.calcite.rel.core.Join; +import org.apache.calcite.rel.core.RelFactories; +import org.apache.calcite.rel.rules.JoinAssociateRule; +import org.apache.calcite.tools.RelBuilderFactory; + +/** + * This is very similar to {@link org.apache.calcite.rel.rules.JoinAssociateRule}. It only checks if + * the resulting condition is supported before transforming. 
+ */
+public class BeamJoinAssociateRule extends JoinAssociateRule {
+  // ~ Static fields/initializers ---------------------------------------------
+
+  /** The singleton. */
+  public static final org.apache.beam.sdk.extensions.sql.impl.rule.BeamJoinAssociateRule INSTANCE =
+      new org.apache.beam.sdk.extensions.sql.impl.rule.BeamJoinAssociateRule(
+          RelFactories.LOGICAL_BUILDER);
+
+  // ~ Constructors ------------------------------------------------------------
Review comment: nit: we don't usually have these separators in Beam codebase This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276935) Time Spent: 6.5h (was: 6h 20m) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 6.5h > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276932&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276932 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 18:58 Start Date: 15/Jul/19 18:58 Worklog Time Spent: 10m Work Description: akedin commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303579948 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/JoinRelOptRuleCall.java ## @@ -0,0 +1,101 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.beam.sdk.extensions.sql.impl.rule; + +import java.util.List; +import java.util.Map; +import org.apache.calcite.plan.RelOptPlanner; +import org.apache.calcite.plan.RelOptRule; +import org.apache.calcite.plan.RelOptRuleCall; +import org.apache.calcite.plan.RelOptRuleOperand; +import org.apache.calcite.rel.RelNode; +import org.apache.calcite.rel.metadata.RelMetadataQuery; +import org.apache.calcite.tools.RelBuilder; + +/** + * This is a class to catch the built join and check if it is a legal join before passing it to the + * actual RelOptRuleCall. 
+ */
+public class JoinRelOptRuleCall extends RelOptRuleCall {
+  private final RelOptRuleCall originalCall;
+  private final JoinChecker checker;
+
+  JoinRelOptRuleCall(RelOptRuleCall originalCall, JoinChecker checker) {
+    super(originalCall.getPlanner(), originalCall.getOperand0(), originalCall.rels, null, null);
+    this.originalCall = originalCall;
+    this.checker = checker;
+  }
+
+  @Override
+  public RelOptRuleOperand getOperand0() {
+    return originalCall.getOperand0();
+  }
+
+  @Override
+  public RelOptRule getRule() {
+    return originalCall.getRule();
+  }
+
+  @Override
+  public List<RelNode> getRelList() {
+    return originalCall.getRelList();
+  }
+
+  @Override
+  @SuppressWarnings("TypeParameterUnusedInFormals")
+  public <T extends RelNode> T rel(int ordinal) {
+    return originalCall.rel(ordinal);
+  }
+
+  @Override
+  public List<RelNode> getChildRels(RelNode rel) {
+    return originalCall.getChildRels(rel);
+  }
+
+  @Override
+  public RelOptPlanner getPlanner() {
+    return originalCall.getPlanner();
+  }
+
+  @Override
+  public RelMetadataQuery getMetadataQuery() {
+    return originalCall.getMetadataQuery();
+  }
+
+  @Override
+  public List<RelNode> getParents() {
+    return originalCall.getParents();
+  }
+
+  @Override
+  public RelBuilder builder() {
+    return originalCall.builder();
+  }
+
+  @Override
+  public void transformTo(RelNode rel, Map<RelNode, RelNode> equiv) {
Review comment: nit: can you move this and `JoinChecker` to the top and add a comment that these are the only two custom things, and everything else is delegated to the `originalCall`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276932) Time Spent: 6h 10m (was: 6h) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 6h 10m > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276933&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276933 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 18:58 Start Date: 15/Jul/19 18:58 Worklog Time Spent: 10m Work Description: akedin commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303586595 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamJoinPushThroughJoinRule.java ## @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+package org.apache.beam.sdk.extensions.sql.impl.rule;
+
+import org.apache.beam.sdk.extensions.sql.impl.rel.BeamJoinRel;
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.RelFactories;
+import org.apache.calcite.rel.logical.LogicalJoin;
+import org.apache.calcite.rel.rules.JoinPushThroughJoinRule;
+import org.apache.calcite.tools.RelBuilderFactory;
+
+/**
+ * This is nearly identical to {@link org.apache.calcite.rel.rules.JoinPushThroughJoinRule}. It only
+ * checks if the condition of the new bottom join is supported.
+ */
+public class BeamJoinPushThroughJoinRule extends JoinPushThroughJoinRule {
+  /** Instance of the rule that works on logical joins only, and pushes to the right. */
+  public static final RelOptRule RIGHT =
+      new BeamJoinPushThroughJoinRule(
+          "BeamJoinPushThroughJoinRule:right",
+          true,
+          LogicalJoin.class,
+          RelFactories.LOGICAL_BUILDER);
+
+  /** Instance of the rule that works on logical joins only, and pushes to the left. */
+  public static final RelOptRule LEFT =
+      new BeamJoinPushThroughJoinRule(
+          "BeamJoinPushThroughJoinRule:left",
+          false,
+          LogicalJoin.class,
+          RelFactories.LOGICAL_BUILDER);
+
+  /** Creates a JoinPushThroughJoinRule. */
+  private BeamJoinPushThroughJoinRule(
+      String description,
+      boolean right,
+      Class<? extends Join> clazz,
+      RelBuilderFactory relBuilderFactory) {
+    super(description, right, clazz, relBuilderFactory);
+  }
+
+  @Override
+  public void onMatch(RelOptRuleCall call) {
+    super.onMatch(
+        new JoinRelOptRuleCall(
Review comment: just an idea to think about, don't know if it makes sense:
* implement the check `BeamJoinRel.isJoinLegal(topJoin) && BeamJoinRel.isJoinLegal(bottomJoin)` in `JoinRelOptRuleCall`;
* create an interface like `JoinInputNodesMatcher` or something, which returns the top and the bottom join from the node, e.g.:
```java
interface JoinInputNodesMatcher implements ... {
  Pair<Join, Join> getTopAndBottomJoin(RelOptRuleCall call);

  default void onMatch(RelOptRuleCall call) {
    super.onMatch(new JoinRelOptRuleCall(this));
  }
}
```
* make the `BeamJoinPushThroughJoinRule` implement that interface;
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276933) Time Spent: 6h 20m (was: 6h 10m) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 6h 20m > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and est
[jira] [Work logged] (BEAM-7711) Support DATETIME as a logical type in BeamSQL
[ https://issues.apache.org/jira/browse/BEAM-7711?focusedWorklogId=276931&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276931 ] ASF GitHub Bot logged work on BEAM-7711: Author: ASF GitHub Bot Created on: 15/Jul/19 18:56 Start Date: 15/Jul/19 18:56 Worklog Time Spent: 10m Work Description: kennknowles commented on pull request #8994: [BEAM-7711] Add DATETIME as a logical type in BeamSQL URL: https://github.com/apache/beam/pull/8994#discussion_r303585601 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/utils/CalciteUtils.java ## @@ -154,6 +163,7 @@ public static boolean isStringType(FieldType fieldType) { .put(TIME_WITH_LOCAL_TZ, SqlTypeName.TIME_WITH_LOCAL_TIME_ZONE) .put(TIMESTAMP, SqlTypeName.TIMESTAMP) .put(TIMESTAMP_WITH_LOCAL_TZ, SqlTypeName.TIMESTAMP_WITH_LOCAL_TIME_ZONE) + .put(DATETIME, SqlTypeName.TIMESTAMP) Review comment: What are the differences between DATETIME and TIMESTAMP WITHOUT TIME ZONE, so that they need to be distinguished? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276931) Time Spent: 1.5h (was: 1h 20m) > Support DATETIME as a logical type in BeamSQL > - > > Key: BEAM-7711 > URL: https://issues.apache.org/jira/browse/BEAM-7711 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Rui Wang >Priority: Major > Time Spent: 1.5h > Remaining Estimate: 0h > > DATETIME as a type represents a year, month, day, hour, minute, second, and > subsecond (millis); > it ranges from 0001-01-01 00:00:00 to 9999-12-31 23:59:59.999. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
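As an aside, a DATETIME range of year 0001 through year 9999 at millisecond precision coincides with what Python's `datetime` can represent, which gives a quick way to illustrate the bounds. The sketch below is only illustrative and unrelated to the Java implementation under review:

```python
from datetime import datetime

# Lower and upper bounds of a four-digit-year DATETIME at
# millisecond granularity (999 ms expressed as 999000 us).
lo = datetime(1, 1, 1, 0, 0, 0)
hi = datetime(9999, 12, 31, 23, 59, 59, 999000)

print(lo.isoformat())  # 0001-01-01T00:00:00
print(hi.isoformat())  # 9999-12-31T23:59:59.999000
```

Note that neither bound carries a time zone, which is exactly the property that distinguishes a DATETIME (a "wall clock" value) from a zoned timestamp.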
[jira] [Updated] (BEAM-7746) Add type hints to python code
[ https://issues.apache.org/jira/browse/BEAM-7746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chad Dombrova updated BEAM-7746: Description: As a developer of the beam source code, I would like the code to use pep484 type hints so that I can clearly see what types are required, get completion in my IDE, and enforce code correctness via a static analyzer like mypy. This may be considered a precursor to BEAM-7060 Work has been started here: [https://github.com/apache/beam/pull/9056] was: As a developer of the beam source code, I would like the code to use pep484 type hints so that I can clearly see what types are required, get completion in my IDE, and enforce code correctness via a static analyzer like mypy. This is a precursor to BEAM-7060 Work has been started here: [https://github.com/apache/beam/pull/9056] > Add type hints to python code > - > > Key: BEAM-7746 > URL: https://issues.apache.org/jira/browse/BEAM-7746 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chad Dombrova >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > As a developer of the beam source code, I would like the code to use pep484 > type hints so that I can clearly see what types are required, get completion > in my IDE, and enforce code correctness via a static analyzer like mypy. > This may be considered a precursor to BEAM-7060 > Work has been started here: [https://github.com/apache/beam/pull/9056] > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7746) Add type hints to python code
[ https://issues.apache.org/jira/browse/BEAM-7746?focusedWorklogId=276925&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276925 ] ASF GitHub Bot logged work on BEAM-7746: Author: ASF GitHub Bot Created on: 15/Jul/19 18:44 Start Date: 15/Jul/19 18:44 Worklog Time Spent: 10m Work Description: chadrik commented on issue #9056: [BEAM-7746] Add python type hints URL: https://github.com/apache/beam/pull/9056#issuecomment-511522447 > type checking pipeline construction I think the goal of JIRA-7060 is specifically to make type annotations useful to pipeline developers. Making tools like mypy work on the beam codebase, and user pipelines in general, is another (worthy) goal. I made a new ticket for the latter, and changed the subject of this PR: https://issues.apache.org/jira/browse/BEAM-7746 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276925) Time Spent: 10m Remaining Estimate: 0h > Add type hints to python code > - > > Key: BEAM-7746 > URL: https://issues.apache.org/jira/browse/BEAM-7746 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chad Dombrova >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > As a developer of the beam source code, I would like the code to use pep484 > type hints so that I can clearly see what types are required, get completion > in my IDE, and enforce code correctness via a static analyzer like mypy. > This is a precursor to BEAM-7060 > Work has been started here: [https://github.com/apache/beam/pull/9056] > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (BEAM-7746) Add type hints to python code
Chad Dombrova created BEAM-7746: --- Summary: Add type hints to python code Key: BEAM-7746 URL: https://issues.apache.org/jira/browse/BEAM-7746 Project: Beam Issue Type: New Feature Components: sdk-py-core Reporter: Chad Dombrova As a developer of the beam source code, I would like the code to use pep484 type hints so that I can clearly see what types are required, get completion in my IDE, and enforce code correctness via a static analyzer like mypy. This is a precursor to BEAM-7060 Work has been started here: [https://github.com/apache/beam/pull/9056] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
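A small sketch of what PEP 484 annotations enable (the function here is hypothetical, not from the Beam codebase): a static checker such as mypy rejects ill-typed calls at check time, and an IDE can offer completion from the declared types.

```python
from typing import Iterable, Optional


def first_matching(items: Iterable[str], prefix: str) -> Optional[str]:
    """Return the first item starting with `prefix`, or None if absent."""
    for item in items:
        if item.startswith(prefix):
            return item
    return None


# mypy would flag a call like first_matching(42, "x") before the code
# ever runs, and the Optional return type forces callers to handle None.
match = first_matching(["BEAM-7060", "BEAM-7746"], "BEAM-77")
```

The `Optional[str]` return type is the kind of information that otherwise lives only in docstrings, which is the motivation stated in the issue.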
[jira] [Commented] (BEAM-7463) BigQueryQueryToTableIT is flaky on Direct runner in PostCommit suites: incorrect checksum
[ https://issues.apache.org/jira/browse/BEAM-7463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885473#comment-16885473 ] Valentyn Tymofieiev commented on BEAM-7463: --- We don't need to pass SDK location. This is where the command that runs on Jenkins is evaluated, overall it looks similar to yours: https://github.com/apache/beam/blob/5fb21e38d9d0e73db514e13a93c15578302c11fa/sdks/python/test-suites/direct/py35/build.gradle#L50 > BigQueryQueryToTableIT is flaky on Direct runner in PostCommit suites: > incorrect checksum > -- > > Key: BEAM-7463 > URL: https://issues.apache.org/jira/browse/BEAM-7463 > Project: Beam > Issue Type: Bug > Components: io-python-gcp >Reporter: Valentyn Tymofieiev >Assignee: Pablo Estrada >Priority: Major > Labels: currently-failing > Time Spent: 5h 20m > Remaining Estimate: 0h > > {noformat} > 15:03:38 FAIL: test_big_query_new_types > (apache_beam.io.gcp.big_query_query_to_table_it_test.BigQueryQueryToTableIT) > 15:03:38 > -- > 15:03:38 Traceback (most recent call last): > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_it_test.py", > line 211, in test_big_query_new_types > 15:03:38 big_query_query_to_table_pipeline.run_bq_pipeline(options) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_pipeline.py", > line 82, in run_bq_pipeline > 15:03:38 result = p.run() > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/testing/test_pipeline.py", > line 107, in run > 15:03:38 else test_runner_api)) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/pipeline.py", > line 406, in run > 15:03:38 self._options).run(False) > 15:03:38 File > 
"/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/pipeline.py", > line 419, in run > 15:03:38 return self.runner.run_pipeline(self, self._options) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/runners/direct/test_direct_runner.py", > line 51, in run_pipeline > 15:03:38 hc_assert_that(self.result, pickler.loads(on_success_matcher)) > 15:03:38 AssertionError: > 15:03:38 Expected: (Test pipeline expected terminated in state: DONE and > Expected checksum is 24de460c4d344a4b77ccc4cc1acb7b7ffc11a214) > 15:03:38 but: Expected checksum is > 24de460c4d344a4b77ccc4cc1acb7b7ffc11a214 Actual checksum is > da39a3ee5e6b4b0d3255bfef95601890afd80709 > {noformat} > [~Juta] could this be caused by changes to Bigquery matcher? > https://github.com/apache/beam/pull/8621/files#diff-f1ec7e3a3e7e2e5082ddb7043954c108R134 > > cc: [~pabloem] [~chamikara] [~apilloud] > A recent postcommit run has BQ failures in other tests as well: > https://builds.apache.org/job/beam_PostCommit_Python3_Verify/1000/consoleFull -- This message was sent by Atlassian JIRA (v7.6.14#76016)
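One detail worth noting in the failure above: the actual checksum `da39a3ee5e6b4b0d3255bfef95601890afd80709` is the SHA-1 digest of the empty byte string, which suggests the matcher hashed no rows at all rather than different rows. This can be checked directly:

```python
import hashlib

# SHA-1 of zero bytes: the well-known "empty" digest that appears
# as the "Actual checksum" in the failing assertion.
empty_digest = hashlib.sha1(b"").hexdigest()
print(empty_digest)  # da39a3ee5e6b4b0d3255bfef95601890afd80709
```

If the matcher's query returned zero rows, a checksum computed over the (empty) result set would produce exactly this value, pointing at a timing or table-visibility issue rather than wrong data.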
[jira] [Created] (BEAM-7745) StreamingSideInputDoFnRunner/StreamingSideInputFetcher have suboptimal state access pattern during normal operation
Steve Niemitz created BEAM-7745: --- Summary: StreamingSideInputDoFnRunner/StreamingSideInputFetcher have suboptimal state access pattern during normal operation Key: BEAM-7745 URL: https://issues.apache.org/jira/browse/BEAM-7745 Project: Beam Issue Type: Improvement Components: runner-dataflow Reporter: Steve Niemitz I spent some time tracking down sources of uncached state fetches in my job, and one large category was the interaction of StreamingSideInputDoFnRunner + StreamingSideInputFetcher. Basically, during standard operations, when the main input is NOT blocked by the side input, the side input fetcher will perform an uncached state read for every input element. Changing it to cache the blockedMap state gave me a ~30-40% increase in throughput in my job. The interaction is a little complicated, and there's a couple optimizations here I can see. Primarily, the blockedMap is only persisted if it is non-empty. Because the WindmillStateCache won't cache a null value, this means that the "nothing is blocked" signal is never actually cached, and will issue a state read to windmill for each input element. The solution here seems like it is to persist an empty map rather than a null when there are no blocked elements. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
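The pitfall described here (a cache that refuses to store a null value, so the common "nothing is blocked" case always misses) is a general caching pattern. A minimal sketch of the bug and the proposed fix follows; this is a toy model, not Beam's actual WindmillStateCache code:

```python
class StateCache:
    """Toy cache that, like the one described, never stores None."""

    def __init__(self):
        self._store = {}
        self.backend_reads = 0

    def get_or_fetch(self, key, fetch):
        if key in self._store:
            return self._store[key]
        self.backend_reads += 1  # uncached read against the backend
        value = fetch()
        if value is not None:    # None is silently not cached
            self._store[key] = value
        return value


cache = StateCache()

# Buggy pattern: "no blocked elements" represented as None,
# so every lookup becomes a backend read.
for _ in range(3):
    cache.get_or_fetch("blocked", lambda: None)
assert cache.backend_reads == 3

# Fix: persist an empty map instead of None, so only the
# first lookup misses and later ones hit the cache.
for _ in range(3):
    cache.get_or_fetch("blocked_fixed", lambda: {})
assert cache.backend_reads == 4
```

This mirrors the suggestion in the report: persisting an empty map rather than nothing makes the "not blocked" signal cacheable, which is where the reported throughput gain comes from.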
[jira] [Work logged] (BEAM-7744) LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink
[ https://issues.apache.org/jira/browse/BEAM-7744?focusedWorklogId=276905&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276905 ] ASF GitHub Bot logged work on BEAM-7744: Author: ASF GitHub Bot Created on: 15/Jul/19 18:11 Start Date: 15/Jul/19 18:11 Worklog Time Spent: 10m Work Description: ihji commented on issue #9071: [BEAM-7744] cherry-pick BEAM-7689 for 2.7.1 URL: https://github.com/apache/beam/pull/9071#issuecomment-511510734 R: @chamikaramj @aaltay CC: @kennknowles This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276905) Time Spent: 10m Remaining Estimate: 0h > LTS backport: Temporary directory for WriteOperation may not be unique in > FileBaseSink > -- > > Key: BEAM-7744 > URL: https://issues.apache.org/jira/browse/BEAM-7744 > Project: Beam > Issue Type: Bug > Components: io-java-files >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: Major > Fix For: 2.7.1 > > Time Spent: 10m > Remaining Estimate: 0h > > Tracking BEAM-7689 LTS backport for 2.7.1 release -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7463) BigQueryQueryToTableIT is flaky on Direct runner in PostCommit suites: incorrect checksum
[ https://issues.apache.org/jira/browse/BEAM-7463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885465#comment-16885465 ] Pablo Estrada commented on BEAM-7463: - gah - I was running on dataflow. What's the best command to run these tests on direct runner? I am trying `./scripts/run_integration_test.sh --runner TestDirectRunner --sdk_location dist/apache-beam-2.15.0.dev0.tar.gz --test_opts --tests=apache_beam.io.gcp.bigquery_write_it_test:BigQueryWriteIntegrationTests.test_big_query_write_new_types` - but it doesn't look like it's working properly. > BigQueryQueryToTableIT is flaky on Direct runner in PostCommit suites: > incorrect checksum > -- > > Key: BEAM-7463 > URL: https://issues.apache.org/jira/browse/BEAM-7463 > Project: Beam > Issue Type: Bug > Components: io-python-gcp >Reporter: Valentyn Tymofieiev >Assignee: Pablo Estrada >Priority: Major > Labels: currently-failing > Time Spent: 5h 20m > Remaining Estimate: 0h > > {noformat} > 15:03:38 FAIL: test_big_query_new_types > (apache_beam.io.gcp.big_query_query_to_table_it_test.BigQueryQueryToTableIT) > 15:03:38 > -- > 15:03:38 Traceback (most recent call last): > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_it_test.py", > line 211, in test_big_query_new_types > 15:03:38 big_query_query_to_table_pipeline.run_bq_pipeline(options) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_pipeline.py", > line 82, in run_bq_pipeline > 15:03:38 result = p.run() > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/testing/test_pipeline.py", > line 107, in run > 15:03:38 else test_runner_api)) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/pipeline.py", > 
line 406, in run > 15:03:38 self._options).run(False) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/pipeline.py", > line 419, in run > 15:03:38 return self.runner.run_pipeline(self, self._options) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/runners/direct/test_direct_runner.py", > line 51, in run_pipeline > 15:03:38 hc_assert_that(self.result, pickler.loads(on_success_matcher)) > 15:03:38 AssertionError: > 15:03:38 Expected: (Test pipeline expected terminated in state: DONE and > Expected checksum is 24de460c4d344a4b77ccc4cc1acb7b7ffc11a214) > 15:03:38 but: Expected checksum is > 24de460c4d344a4b77ccc4cc1acb7b7ffc11a214 Actual checksum is > da39a3ee5e6b4b0d3255bfef95601890afd80709 > {noformat} > [~Juta] could this be caused by changes to Bigquery matcher? > https://github.com/apache/beam/pull/8621/files#diff-f1ec7e3a3e7e2e5082ddb7043954c108R134 > > cc: [~pabloem] [~chamikara] [~apilloud] > A recent postcommit run has BQ failures in other tests as well: > https://builds.apache.org/job/beam_PostCommit_Python3_Verify/1000/consoleFull -- This message was sent by Atlassian JIRA (v7.6.14#76016)
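One detail worth noting in the failure above: the reported "Actual checksum" is recognizable as the SHA-1 digest of empty input, which suggests the matcher hashed zero result rows, i.e. missing output rather than corrupted output.

```python
import hashlib

# The "Actual checksum" from the failing assertion equals SHA-1 of b"",
# meaning the checksum matcher saw an empty result set.
assert hashlib.sha1(b"").hexdigest() == "da39a3ee5e6b4b0d3255bfef95601890afd80709"
```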
[jira] [Created] (BEAM-7744) LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink
Heejong Lee created BEAM-7744: - Summary: LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink Key: BEAM-7744 URL: https://issues.apache.org/jira/browse/BEAM-7744 Project: Beam Issue Type: Bug Components: io-java-files Reporter: Heejong Lee Assignee: Heejong Lee Fix For: 2.7.1 Tracking BEAM-7689 LTS backport for 2.7.1 release -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (BEAM-7689) Temporary directory for WriteOperation may not be unique in FileBaseSink
[ https://issues.apache.org/jira/browse/BEAM-7689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Heejong Lee updated BEAM-7689: -- Fix Version/s: (was: 2.7.1) > Temporary directory for WriteOperation may not be unique in FileBaseSink > > > Key: BEAM-7689 > URL: https://issues.apache.org/jira/browse/BEAM-7689 > Project: Beam > Issue Type: Bug > Components: io-java-files >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: Blocker > Fix For: 2.14.0 > > Time Spent: 3h 40m > Remaining Estimate: 0h > > The temporary directory for a WriteOperation in FileBasedSink is generated from a > second-granularity timestamp (yyyy-MM-dd_HH-mm-ss) and a unique increasing > index. Such granularity is not enough to make temporary directories unique > across different jobs. When two jobs share the same temporary directory, an > output file may not be produced in one job, since the required temporary file > can be deleted by the other job. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
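The collision window is easy to see, and one way to close it is to append a random token to the directory name (illustrative Python sketch; the actual FileBasedSink code is Java, and the exact scheme used by the BEAM-7689 fix may differ):

```python
import uuid
from datetime import datetime


def temp_dir_second_granularity(base):
    # The pattern described in the issue: a second-granularity timestamp,
    # which collides whenever two jobs start within the same second.
    return "{}/.temp-beam-{:%Y-%m-%d_%H-%M-%S}".format(base, datetime.now())


def temp_dir_unique(base):
    # Appending a random token makes cross-job collisions vanishingly
    # unlikely, even for jobs launched in the same second.
    return "{}-{}".format(temp_dir_second_granularity(base), uuid.uuid4().hex)


# Two "jobs" starting at the same instant now get distinct directories,
# so neither can delete the other's temporary files.
assert temp_dir_unique("gs://bucket/out") != temp_dir_unique("gs://bucket/out")
```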
[jira] [Work logged] (BEAM-7689) Temporary directory for WriteOperation may not be unique in FileBaseSink
[ https://issues.apache.org/jira/browse/BEAM-7689?focusedWorklogId=276900&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276900 ] ASF GitHub Bot logged work on BEAM-7689: Author: ASF GitHub Bot Created on: 15/Jul/19 18:03 Start Date: 15/Jul/19 18:03 Worklog Time Spent: 10m Work Description: ihji commented on pull request #9071: [BEAM-7689] cherry-pick for 2.7.1 URL: https://github.com/apache/beam/pull/9071 Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). 
Post-Commit Tests Status (on master branch): table of Jenkins build-status badges for each SDK (Go, Java, Python) against each runner (Apex, Dataflow, Flink, Gearpump, Samza, Spark); the badge markup is truncated mid-URL in this archive.
[jira] [Commented] (BEAM-7743) Stream position bug in BigQuery storage stream source can lead to data loss
[ https://issues.apache.org/jira/browse/BEAM-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885459#comment-16885459 ] Kenneth Jung commented on BEAM-7743: Update: this issue does not appear to be present in 2.14: Change: [https://github.com/apache/beam/commit/aa8b3d054085548d79c988774923a1f68d223a26#diff-d5be8ca6fea88f6525fc7c49c3d43cd4] Release branch: [https://github.com/apache/beam/commits/release-2.14.0/sdks/java/io/google-cloud-platform] > Stream position bug in BigQuery storage stream source can lead to data loss > --- > > Key: BEAM-7743 > URL: https://issues.apache.org/jira/browse/BEAM-7743 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.14.0 >Reporter: Kenneth Jung >Priority: Blocker > > A bug in the splitAtFraction implementation for the BigQuery Storage API > stream source can lead to data loss in certain conditions. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (BEAM-7743) Stream position bug in BigQuery storage stream source can lead to data loss
[ https://issues.apache.org/jira/browse/BEAM-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenneth Jung updated BEAM-7743: --- Affects Version/s: (was: 2.14.0) 2.15.0 > Stream position bug in BigQuery storage stream source can lead to data loss > --- > > Key: BEAM-7743 > URL: https://issues.apache.org/jira/browse/BEAM-7743 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.15.0 >Reporter: Kenneth Jung >Priority: Blocker > > A bug in the splitAtFraction implementation for the BigQuery Storage API > stream source can lead to data loss in certain conditions. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276895&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276895 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 17:58 Start Date: 15/Jul/19 17:58 Worklog Time Spent: 10m Work Description: riazela commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303537888 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamRuleSets.java ## @@ -103,6 +106,10 @@ // join rules JoinPushExpressionsRule.INSTANCE, + JoinCommuteRule.INSTANCE, Review comment: I have tested the effect of adding these rules in terms of performance gain in: https://docs.google.com/document/d/1DM_bcfFbIoc_vEoqQxhC7AvHBUDVCAwToC8TYGukkII/edit# However, in terms of testing against bugs, all I can say is that it passes our current tests; if it can create an illegal plan for some query, that means the case is not covered by our tests and should be added. I am not sure how we can test this against all possible scenarios and queries. I think one way to improve our test cases is to check other systems' tests (such as Calcite's test cases) and add them to ours. I would appreciate any suggestions on that. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276895) Time Spent: 6h (was: 5h 50m) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 6h > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
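The estimation strategy in the issue description can be sketched as follows (illustrative Python; the real implementation is the Java text-table provider in Beam SQL, and `estimate_row_count` is a hypothetical helper name):

```python
import os


def estimate_row_count(path, sample_lines=100):
    # Sample up to the first `sample_lines` lines, then extrapolate:
    # estimated rows = total file size / average sampled line length.
    total_bytes = os.path.getsize(path)
    lengths = []
    with open(path, "rb") as f:
        for _ in range(sample_lines):
            line = f.readline()
            if not line:
                break
            lengths.append(len(line))
    if not lengths:
        return 0
    avg_line_bytes = sum(lengths) / len(lengths)
    return int(round(total_bytes / avg_line_bytes))
```

For uniform rows the estimate is exact; for skewed files it only needs to be good enough for the planner to order join inputs sensibly, which is the use case discussed in the review thread above.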
[jira] [Created] (BEAM-7743) Stream position bug in BigQuery storage stream source can lead to data loss
Kenneth Jung created BEAM-7743: -- Summary: Stream position bug in BigQuery storage stream source can lead to data loss Key: BEAM-7743 URL: https://issues.apache.org/jira/browse/BEAM-7743 Project: Beam Issue Type: Bug Components: io-java-gcp Affects Versions: 2.14.0 Reporter: Kenneth Jung A bug in the splitAtFraction implementation for the BigQuery Storage API stream source can lead to data loss in certain conditions. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-1746) Include LICENSE and NOTICE in python dist files
[ https://issues.apache.org/jira/browse/BEAM-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885448#comment-16885448 ] Ahmet Altay commented on BEAM-1746: --- Not including these files is a bug. 2.4.0 distribution has these files, and starting from 2.5.0 these files are missing. > Include LICENSE and NOTICE in python dist files > --- > > Key: BEAM-1746 > URL: https://issues.apache.org/jira/browse/BEAM-1746 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Ahmet Altay >Assignee: Ahmet Altay >Priority: Major > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Reopened] (BEAM-1746) Include LICENSE and NOTICE in python dist files
[ https://issues.apache.org/jira/browse/BEAM-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmet Altay reopened BEAM-1746: --- > Include LICENSE and NOTICE in python dist files > --- > > Key: BEAM-1746 > URL: https://issues.apache.org/jira/browse/BEAM-1746 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Ahmet Altay >Assignee: Ahmet Altay >Priority: Major > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=276852&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276852 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 15/Jul/19 17:08 Start Date: 15/Jul/19 17:08 Worklog Time Spent: 10m Work Description: chadrik commented on issue #9056: [BEAM-7060] Add python type hints URL: https://github.com/apache/beam/pull/9056#issuecomment-511487829 > typing module import: Given that we're deprecating the old type hints, I think we should take a sweep through the codebase and replace all non-qualified typehints with their typing equivalents (when possible) or qualified ones (where not). This would allow us to use the common convention of importing from the typing module directly. Sorry, I don't completely follow this. IIUC, we can't replace `apache_beam.typehints` with `typing` until beam knows how to deal with `typing` internally, right? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276852) Time Spent: 2h 10m (was: 2h) > Design Py3-compatible typehints annotation support in Beam 3. 
> - > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Udi Meiri >Priority: Major > Time Spent: 2h 10m > Remaining Estimate: 0h > > The existing [Typehints implementation in > Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/] heavily relies on internal details of the CPython implementation, and some of > the assumptions of this implementation broke as of Python 3.6; see for > example https://issues.apache.org/jira/browse/BEAM-6877, which makes > typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban > Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] > lists several specific typehints-related breakages, prefixed with "TypeHints > Py3 Error". > We need to decide whether to: > - Deprecate the in-house typehints implementation. > - Continue to support the in-house implementation, which at this point is stale > code and has other known issues. > - Attempt to use some off-the-shelf libraries for supporting > type annotations, like Pytype, Mypy, PyAnnotate. > WRT this decision we also need to plan immediate next steps to unblock > adoption of Beam for Python 3.6+ users. One potential option may be to have > the Beam SDK ignore any typehint annotations on Py 3.6+. > cc: [~udim], [~altay], [~robertwb]. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=276847&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276847 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 15/Jul/19 17:06 Start Date: 15/Jul/19 17:06 Worklog Time Spent: 10m Work Description: chadrik commented on pull request #9056: [BEAM-7060] Add python type hints URL: https://github.com/apache/beam/pull/9056#discussion_r303540136 ## File path: sdks/python/apache_beam/io/iobase.py ## @@ -972,14 +979,15 @@ def expand(self, pcoll): 'or be a PTransform. Received : %r' % self.sink) -class WriteImpl(ptransform.PTransform): +class WriteImpl(ptransform.PTransform[InT, None]): Review comment: oh, also, it will require an import of the `apache_beam.pvalue` module into all of those modules as well. you won't be able to protect it inside a `typing.TYPE_CHECKING` block since it's used in the class definition. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276847) Time Spent: 2h (was: 1h 50m) > Design Py3-compatible typehints annotation support in Beam 3. > - > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Udi Meiri >Priority: Major > Time Spent: 2h > Remaining Estimate: 0h > > Existing [Typehints implementaiton in > Beam|[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/ > ] heavily relies on internal details of CPython implementation, and some of > the assumptions of this implementation broke as of Python 3.6, see for > example: https://issues.apache.org/jira/browse/BEAM-6877, which makes > typehints support unusable on Python 3.6 as of now. 
[Python 3 Kanban > Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] > lists several specific typehints-related breakages, prefixed with "TypeHints > Py3 Error". > We need to decide whether to: > - Deprecate in-house typehints implementation. > - Continue to support in-house implementation, which at this point is a stale > code and has other known issues. > - Attempt to use some off-the-shelf libraries for supporting > type-annotations, like Pytype, Mypy, PyAnnotate. > WRT to this decision we also need to plan on immediate next steps to unblock > adoption of Beam for Python 3.6+ users. One potential option may be to have > Beam SDK ignore any typehint annotations on Py 3.6+. > cc: [~udim], [~altay], [~robertwb]. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=276846&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276846 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 15/Jul/19 17:04 Start Date: 15/Jul/19 17:04 Worklog Time Spent: 10m Work Description: chadrik commented on pull request #9056: [BEAM-7060] Add python type hints URL: https://github.com/apache/beam/pull/9056#discussion_r303539509 ## File path: sdks/python/apache_beam/io/gcp/pubsub.py ## @@ -134,8 +136,13 @@ class ReadFromPubSub(PTransform): """A ``PTransform`` for reading from Cloud Pub/Sub.""" # Implementation note: This ``PTransform`` is overridden by Directrunner. - def __init__(self, topic=None, subscription=None, id_label=None, - with_attributes=False, timestamp_attribute=None): + def __init__(self, + topic=None, # type: typing.Optional[str] Review comment: I'll leave that up to the Beam Team, but we're still going to have the line length limit to contend with. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276846) Time Spent: 1h 50m (was: 1h 40m) > Design Py3-compatible typehints annotation support in Beam 3. 
> - > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Udi Meiri >Priority: Major > Time Spent: 1h 50m > Remaining Estimate: 0h > > The existing [Typehints implementation in > Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/] heavily relies on internal details of the CPython implementation, and some of > the assumptions of this implementation broke as of Python 3.6; see for > example https://issues.apache.org/jira/browse/BEAM-6877, which makes > typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban > Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] > lists several specific typehints-related breakages, prefixed with "TypeHints > Py3 Error". > We need to decide whether to: > - Deprecate the in-house typehints implementation. > - Continue to support the in-house implementation, which at this point is stale > code and has other known issues. > - Attempt to use some off-the-shelf libraries for supporting > type annotations, like Pytype, Mypy, PyAnnotate. > WRT this decision we also need to plan immediate next steps to unblock > adoption of Beam for Python 3.6+ users. One potential option may be to have > the Beam SDK ignore any typehint annotations on Py 3.6+. > cc: [~udim], [~altay], [~robertwb]. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=276844&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276844 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 15/Jul/19 17:03 Start Date: 15/Jul/19 17:03 Worklog Time Spent: 10m Work Description: chadrik commented on pull request #9056: [BEAM-7060] Add python type hints URL: https://github.com/apache/beam/pull/9056#discussion_r303539155 ## File path: sdks/python/apache_beam/examples/wordcount.py ## @@ -85,31 +98,66 @@ def run(argv=None): pipeline_options.view_as(SetupOptions).save_main_session = True p = beam.Pipeline(options=pipeline_options) + reveal_type(p) + + _read = ReadFromText(known_args.input) + reveal_type(_read) + + read = 'read' >> _read + reveal_type(read) + # Read the text file[pattern] into a PCollection. - lines = p | 'read' >> ReadFromText(known_args.input) + lines = p | read + reveal_type(lines) # PCollection[unicode*] + + def make_ones(x): +# type: (T) -> Tuple[T, int] Review comment: Can you explain what you mean? Let me try to answer what I think you're asking about. First, even for a simple function like `make_ones`, mypy does not guess at the arguments and return type. Next, as a user annotating pipelines I have a few of choices (all of them are valid): 1. annotate functions, and let mypy infer the types of the `PTransforms` they are passed to 2. annotate `PTransforms` 3. annotate both, as a way to guard against unintentional incompatibilities Option 2 would look like this: ```python def make_ones(x): return (x, 1) makemap = beam.Map(make_ones) # type: beam.Map[T, Tuple[T, int]] ``` To be clear, in option 2, mypy is not using the type of `beam.Map` to infer the type of the `make_ones`, it's simply assuming it's correct, since in our thought experiment it's unannotated. 
In other words, the following would not generate an error because `make_ones` is untyped: ```python def make_ones(x): return (x + 1, 1) # x can only be a numeric type! makemap = beam.Map(make_ones) # type: beam.Map[str, Tuple[str, int]] ``` This module is very messy right now, so I apologize for any confusion that it's generated. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276844) Time Spent: 1h 40m (was: 1.5h) > Design Py3-compatible typehints annotation support in Beam 3. > - > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Udi Meiri >Priority: Major > Time Spent: 1h 40m > Remaining Estimate: 0h > > Existing [Typehints implementaiton in > Beam|[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/ > ] heavily relies on internal details of CPython implementation, and some of > the assumptions of this implementation broke as of Python 3.6, see for > example: https://issues.apache.org/jira/browse/BEAM-6877, which makes > typehints support unusable on Python 3.6 as of now. [Python 3 Kanban > Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] > lists several specific typehints-related breakages, prefixed with "TypeHints > Py3 Error". > We need to decide whether to: > - Deprecate in-house typehints implementation. > - Continue to support in-house implementation, which at this point is a stale > code and has other known issues. > - Attempt to use some off-the-shelf libraries for supporting > type-annotations, like Pytype, Mypy, PyAnnotate. 
> WRT to this decision we also need to plan on immediate next steps to unblock > adoption of Beam for Python 3.6+ users. One potential option may be to have > Beam SDK ignore any typehint annotations on Py 3.6+. > cc: [~udim], [~altay], [~robertwb]. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276843&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276843 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 16:59 Start Date: 15/Jul/19 16:59 Worklog Time Spent: 10m Work Description: riazela commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303537888 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamRuleSets.java ## @@ -103,6 +106,10 @@ // join rules JoinPushExpressionsRule.INSTANCE, + JoinCommuteRule.INSTANCE, Review comment: I have tested the effect of adding these rules in terms of performance gain for some of the queries in https://docs.google.com/document/d/1DM_bcfFbIoc_vEoqQxhC7AvHBUDVCAwToC8TYGukkII/edit# However, in terms of testing against bugs, all I can say is that it passes our current tests; if it can create an illegal plan for some query, that means the case is not covered by our tests and should be added. I am not sure how we can test this against all possible scenarios and queries. I think one way to improve our test cases is to check other systems' tests (such as Calcite's test cases) and add them to ours. I would appreciate any suggestions on that. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276843) Time Spent: 5h 50m (was: 5h 40m) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 5h 50m > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-3645) Support multi-process execution on the FnApiRunner
[ https://issues.apache.org/jira/browse/BEAM-3645?focusedWorklogId=276842&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276842 ] ASF GitHub Bot logged work on BEAM-3645: Author: ASF GitHub Bot Created on: 15/Jul/19 16:58 Start Date: 15/Jul/19 16:58 Worklog Time Spent: 10m Work Description: Hannah-Jiang commented on issue #8979: [BEAM-3645] add multiplexing for python FnApiRunner URL: https://github.com/apache/beam/pull/8979#issuecomment-511483661 kindly ping. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276842) Time Spent: 22h 50m (was: 22h 40m) > Support multi-process execution on the FnApiRunner > -- > > Key: BEAM-3645 > URL: https://issues.apache.org/jira/browse/BEAM-3645 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Charles Chen >Assignee: Hannah Jiang >Priority: Major > Time Spent: 22h 50m > Remaining Estimate: 0h > > https://issues.apache.org/jira/browse/BEAM-3644 gave us a 15x performance > gain over the previous DirectRunner. We can do even better in multi-core > environments by supporting multi-process execution in the FnApiRunner, to > scale past Python GIL limitations. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (BEAM-7742) BigQuery File Loads to work well with load job size limits
Pablo Estrada created BEAM-7742: --- Summary: BigQuery File Loads to work well with load job size limits Key: BEAM-7742 URL: https://issues.apache.org/jira/browse/BEAM-7742 Project: Beam Issue Type: Improvement Components: io-python-gcp Reporter: Pablo Estrada Assignee: Tanay Tummalapalli Load jobs into BigQuery have a number of limitations: [https://cloud.google.com/bigquery/quotas#load_jobs] Currently, the Python BQ sink implemented in `bigquery_file_loads.py` does not handle these limitations well. Improvements need to be made to the implementation, to: * Decide to use temp_tables dynamically at pipeline execution * Add code to determine when a load job to a single destination needs to be partitioned into multiple jobs. * When this happens, we definitely need to use temp_tables, in case one of the load jobs fails and the pipeline is rerun. Tanay, would you be able to look at this? -- This message was sent by Atlassian JIRA (v7.6.14#76016)
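The second bullet above (splitting a single destination's files across multiple load jobs) amounts to a greedy partitioning problem. The sketch below is a hypothetical illustration, not the `bigquery_file_loads.py` implementation; the limit values and function name are assumptions chosen for the example.

```python
def partition_files(files, max_files_per_job=10000, max_job_bytes=15 * (1 << 40)):
    """Greedily group (path, size_in_bytes) pairs into load-job
    partitions so that no partition exceeds the per-job file-count
    or total-byte limits. Returns a list of lists of paths."""
    partitions = [[]]
    current_bytes = 0
    for path, size in files:
        current = partitions[-1]
        # Start a new partition when adding this file would break a limit.
        if current and (len(current) >= max_files_per_job
                        or current_bytes + size > max_job_bytes):
            partitions.append([])
            current_bytes = 0
            current = partitions[-1]
        current.append(path)
        current_bytes += size
    return partitions
```

Once more than one partition exists for a destination, writing each job to a temp table and atomically copying into the final table (the third bullet) avoids a half-written destination if one job fails and the pipeline reruns.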
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=276836&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276836 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 15/Jul/19 16:47 Start Date: 15/Jul/19 16:47 Worklog Time Spent: 10m Work Description: chadrik commented on pull request #9056: [BEAM-7060] Add python type hints URL: https://github.com/apache/beam/pull/9056#discussion_r303532868 ## File path: sdks/python/apache_beam/io/gcp/datastore/v1/query_splitter.py ## @@ -37,7 +37,7 @@ PropertyFilter.GREATER_THAN, PropertyFilter.GREATER_THAN_OR_EQUAL] except ImportError: - UNSUPPORTED_OPERATORS = None + UNSUPPORTED_OPERATORS = None # type: ignore Review comment: Correct. I can add a comment to the code to clarify. The problem is that if we mark `UNSUPPORTED_OPERATORS` as `Optional`, then mypy will generate numerous errors in this module, because the code assumes the import succeeded and that `UNSUPPORTED_OPERATORS` is _not_ `None`. An alternative would be to set it to `[]`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276836) Time Spent: 1.5h (was: 1h 20m) > Design Py3-compatible typehints annotation support in Beam 3.
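The pattern under review can be reproduced in miniature. The module name below is deliberately nonexistent so the `ImportError` branch runs; mypy would infer a list type from the `try` branch, so the `None` fallback is silenced with `# type: ignore` rather than widening the variable to `Optional` (which would flag every later use that assumes a list).

```python
try:
    # Hypothetical optional dependency; the import fails in this sketch.
    from example_optional_dep import OPERATORS  # type: ignore
    UNSUPPORTED_OPERATORS = list(OPERATORS)
except ImportError:
    # mypy sees a conflict with the list type inferred above, so this
    # assignment is silenced instead of typing the name as Optional.
    UNSUPPORTED_OPERATORS = None  # type: ignore
```

The alternative mentioned in the review, `UNSUPPORTED_OPERATORS = []`, keeps the type consistent and membership checks working, at the cost of conflating "dependency missing" with "no unsupported operators".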
> - > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Udi Meiri >Priority: Major > Time Spent: 1.5h > Remaining Estimate: 0h > > Existing [Typehints implementation in > Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/] heavily relies on internal details of the CPython implementation, and some of > the assumptions of this implementation broke as of Python 3.6; see for > example https://issues.apache.org/jira/browse/BEAM-6877, which makes > typehints support unusable on Python 3.6 as of now. [Python 3 Kanban > Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] > lists several specific typehints-related breakages, prefixed with "TypeHints > Py3 Error". > We need to decide whether to: > - Deprecate the in-house typehints implementation. > - Continue to support the in-house implementation, which at this point is stale > code and has other known issues. > - Attempt to use some off-the-shelf libraries for supporting > type annotations, like Pytype, Mypy, PyAnnotate. > WRT this decision we also need to plan immediate next steps to unblock > adoption of Beam for Python 3.6+ users. One potential option may be to have > the Beam SDK ignore any typehint annotations on Py 3.6+. > cc: [~udim], [~altay], [~robertwb]. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=276835&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276835 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 15/Jul/19 16:42 Start Date: 15/Jul/19 16:42 Worklog Time Spent: 10m Work Description: chadrik commented on pull request #9056: [BEAM-7060] Add python type hints URL: https://github.com/apache/beam/pull/9056#discussion_r303530972 ## File path: sdks/python/apache_beam/io/iobase.py ## @@ -972,14 +979,15 @@ def expand(self, pcoll): 'or be a PTransform. Received : %r' % self.sink) -class WriteImpl(ptransform.PTransform): +class WriteImpl(ptransform.PTransform[InT, None]): Review comment: IMHO, that would be redundant. It won't change the behavior of the type checks, and it would add verbosity to _a lot_ of class definitions, since there are numerous `PTransforms` throughout the code. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276835) Time Spent: 1h 20m (was: 1h 10m) > Design Py3-compatible typehints annotation support in Beam 3. > - > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Udi Meiri >Priority: Major > Time Spent: 1h 20m > Remaining Estimate: 0h > > Existing [Typehints implementation in > Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/] heavily relies on internal details of the CPython implementation, and some of > the assumptions of this implementation broke as of Python 3.6; see for > example https://issues.apache.org/jira/browse/BEAM-6877, which makes > typehints support unusable on Python 3.6 as of now.
[Python 3 Kanban > Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] > lists several specific typehints-related breakages, prefixed with "TypeHints > Py3 Error". > We need to decide whether to: > - Deprecate the in-house typehints implementation. > - Continue to support the in-house implementation, which at this point is stale > code and has other known issues. > - Attempt to use some off-the-shelf libraries for supporting > type annotations, like Pytype, Mypy, PyAnnotate. > WRT this decision we also need to plan immediate next steps to unblock > adoption of Beam for Python 3.6+ users. One potential option may be to have > the Beam SDK ignore any typehint annotations on Py 3.6+. > cc: [~udim], [~altay], [~robertwb]. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
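The `PTransform[InT, None]` discussion above concerns parameterizing a generic base class. A minimal stand-in (these are not Beam's actual classes) shows why explicit parameters on every subclass are redundant at runtime: parameterized and unparameterized subclasses behave identically, and the annotations only matter to the static checker.

```python
from typing import Generic, Optional, TypeVar

InT = TypeVar("InT")
OutT = TypeVar("OutT")

class PTransform(Generic[InT, OutT]):
    """Minimal stand-in for a generic transform base class."""
    def expand(self, pcoll: InT) -> OutT:
        raise NotImplementedError

# A subclass may bind the type parameters explicitly...
class WriteImpl(PTransform[int, None]):
    def expand(self, pcoll: int) -> None:
        return None

# ...or leave them unbound; runtime behavior is the same either way,
# which is the redundancy the review comment points out.
class UnparameterizedWrite(PTransform):
    def expand(self, pcoll):
        return None
```

Under this reading, annotating only where the checker needs the information (e.g. at the base class and at call sites) keeps the many `PTransform` definitions in the codebase uncluttered.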
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=276826&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276826 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 15/Jul/19 16:36 Start Date: 15/Jul/19 16:36 Worklog Time Spent: 10m Work Description: chadrik commented on pull request #9056: [BEAM-7060] Add python type hints URL: https://github.com/apache/beam/pull/9056#discussion_r303528558 ## File path: sdks/python/apache_beam/examples/wordcount.py ## @@ -65,8 +74,12 @@ def process(self, element): self.word_lengths_dist.update(len(w)) return words + def to_runner_api_parameter(self, unused_context): +pass + def run(argv=None): + # type: (Optional[Sequence[str]]) -> None Review comment: Sorry, I should have clarified that the changes to wordcount really amount to tests, primarily of the mypy plugin. Ultimately, we'll want a separate example to demonstrate typing. I'll get this reverted soon. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276826) Time Spent: 1h 10m (was: 1h) > Design Py3-compatible typehints annotation support in Beam 3. 
> - > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Udi Meiri >Priority: Major > Time Spent: 1h 10m > Remaining Estimate: 0h > > Existing [Typehints implementation in > Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/] heavily relies on internal details of the CPython implementation, and some of > the assumptions of this implementation broke as of Python 3.6; see for > example https://issues.apache.org/jira/browse/BEAM-6877, which makes > typehints support unusable on Python 3.6 as of now. [Python 3 Kanban > Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] > lists several specific typehints-related breakages, prefixed with "TypeHints > Py3 Error". > We need to decide whether to: > - Deprecate the in-house typehints implementation. > - Continue to support the in-house implementation, which at this point is stale > code and has other known issues. > - Attempt to use some off-the-shelf libraries for supporting > type annotations, like Pytype, Mypy, PyAnnotate. > WRT this decision we also need to plan immediate next steps to unblock > adoption of Beam for Python 3.6+ users. One potential option may be to have > the Beam SDK ignore any typehint annotations on Py 3.6+. > cc: [~udim], [~altay], [~robertwb]. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276825&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276825 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 16:35 Start Date: 15/Jul/19 16:35 Worklog Time Spent: 10m Work Description: riazela commented on issue #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#issuecomment-511475589 run java postcommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276825) Time Spent: 5h 40m (was: 5.5h) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 5h 40m > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-6202) Gracefully handle exceptions when waiting for Dataflow job completion.
[ https://issues.apache.org/jira/browse/BEAM-6202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885369#comment-16885369 ] Kasia Kucharczyk commented on BEAM-6202: Here is an issue with the same problems, but in the Load Tests (with job run examples). > Gracefully handle exceptions when waiting for Dataflow job completion. > -- > > Key: BEAM-6202 > URL: https://issues.apache.org/jira/browse/BEAM-6202 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core, test-failures >Reporter: Robert Bradshaw >Priority: Major > > If there is an error when trying to contact the Dataflow service in Python's > Dataflow.poll_for_job_completion, we may exit the thread prematurely. > A typical manifestation is that the Dataflow runner fails with: > {noformat} > AssertionError: Job did not reach to a terminal state after waiting > indefinitely. > {noformat} > however, job execution continues and succeeds. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (BEAM-7733) Flaky tests on Dataflow because of not found information about job state
[ https://issues.apache.org/jira/browse/BEAM-7733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kasia Kucharczyk resolved BEAM-7733. Resolution: Duplicate Fix Version/s: Not applicable > Flaky tests on Dataflow because of not found information about job state > > > Key: BEAM-7733 > URL: https://issues.apache.org/jira/browse/BEAM-7733 > Project: Beam > Issue Type: Bug > Components: sdk-py-core, testing >Reporter: Kasia Kucharczyk >Priority: Major > Fix For: Not applicable > > > Some of the Python performance cron tests on Dataflow experience problems > because of: > {code:java} > AssertionError: Job did not reach to a terminal state after waiting > indefinitely. > {code} > > Dataflow returns the following error: > {code:java} > "error": { > "code": 404, > "message": "(bf55d279f24d1255): Information about job > 2019-07-07_08_42_42-8241186059035490445 could not be found in our system. > Please double check the id is correct. If it is please contact customer > support.", > "status": "NOT_FOUND" >} > } > {code} > Those tests succeed on Dataflow, but Jenkins reports a failure. > This has already happened in the Combine and CoGroupByKey tests: > * [example of Combine > test|https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_LoadTests_Python_Combine_Dataflow_Batch/6/console] > * [example of > CoGBK|https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_LoadTests_Python_CoGBK_Dataflow_Batch/3/console] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7630) Add ITs to check IO behavior with bytes and unicode strings
[ https://issues.apache.org/jira/browse/BEAM-7630?focusedWorklogId=276791&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276791 ] ASF GitHub Bot logged work on BEAM-7630: Author: ASF GitHub Bot Created on: 15/Jul/19 15:39 Start Date: 15/Jul/19 15:39 Worklog Time Spent: 10m Work Description: tvalentyn commented on issue #8985: [BEAM-7630] add ITs for writing and reading bytes from pubsub URL: https://github.com/apache/beam/pull/8985#issuecomment-511454565 The only test failure on this PR is `(apache_beam.io.gcp.big_query_query_to_table_it_test.BigQueryQueryToTableIT) ... FAIL`, which is a known flake unrelated to this change. @pabloem Could you help merge this PR, please? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276791) Time Spent: 2h 10m (was: 2h) > Add ITs to check IO behavior with bytes and unicode strings > --- > > Key: BEAM-7630 > URL: https://issues.apache.org/jira/browse/BEAM-7630 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Juta Staes >Assignee: Juta Staes >Priority: Major > Fix For: 2.15.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > When porting BigQuery IO to Python 3, we discovered some issues regarding > bytes support with BigQuery. We want to make sure that other IOs don't have > the same issues. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7715) External / Cross language (via expansion) transforms / public API should be marked @Experimental
[ https://issues.apache.org/jira/browse/BEAM-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885331#comment-16885331 ] Ismaël Mejía commented on BEAM-7715: Yes, definitely not a blocker but a really nice-to-have. However, I am stuck because of a random Python error; I am trying to get it done in case we have a second RC, but will change the fix version only if needed. Thanks, Anton. > External / Cross language (via expansion) transforms / public API should be > marked @Experimental > > > Key: BEAM-7715 > URL: https://issues.apache.org/jira/browse/BEAM-7715 > Project: Beam > Issue Type: Improvement > Components: io-java-kafka, sdk-java-core, sdk-py-core >Affects Versions: 2.13.0 >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Minor > Fix For: 2.15.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Beam recently introduced APIs and transforms to invoke transforms from > different languages. These APIs, even if tested, are not yet mature, and it is > probably a good idea to mark them as `@Experimental`, at least for the user-facing parts > (sdks/java/core and concrete transforms). -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (BEAM-3061) BigtableIO writes should support emitting "done" notifications
[ https://issues.apache.org/jira/browse/BEAM-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía resolved BEAM-3061. Resolution: Fixed Fix Version/s: 2.15.0 > BigtableIO writes should support emitting "done" notifications > -- > > Key: BEAM-3061 > URL: https://issues.apache.org/jira/browse/BEAM-3061 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Steve Niemitz >Assignee: Steve Niemitz >Priority: Major > Fix For: 2.15.0 > > Time Spent: 16h 20m > Remaining Estimate: 0h > > There was some discussion of this on the dev@ mailing list [1]. This > approach was taken based on discussion there. > [1] > https://lists.apache.org/thread.html/949b33782f722a9000c9bf9e37042739c6fd0927589b99752b78d7bd@%3Cdev.beam.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-3061) BigtableIO should support emitting a sentinel "done" value when a bundle completes
[ https://issues.apache.org/jira/browse/BEAM-3061?focusedWorklogId=276780&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276780 ] ASF GitHub Bot logged work on BEAM-3061: Author: ASF GitHub Bot Created on: 15/Jul/19 15:27 Start Date: 15/Jul/19 15:27 Worklog Time Spent: 10m Work Description: iemejia commented on pull request #7805: [BEAM-3061] Done notifications for BigtableIO.Write URL: https://github.com/apache/beam/pull/7805 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276780) Time Spent: 16h 20m (was: 16h 10m) > BigtableIO should support emitting a sentinel "done" value when a bundle > completes > -- > > Key: BEAM-3061 > URL: https://issues.apache.org/jira/browse/BEAM-3061 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Steve Niemitz >Assignee: Steve Niemitz >Priority: Major > Time Spent: 16h 20m > Remaining Estimate: 0h > > There was some discussion of this on the dev@ mailing list [1]. This > approach was taken based on discussion there. > [1] > https://lists.apache.org/thread.html/949b33782f722a9000c9bf9e37042739c6fd0927589b99752b78d7bd@%3Cdev.beam.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (BEAM-3061) BigtableIO writes should support emitting "done" notifications
[ https://issues.apache.org/jira/browse/BEAM-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-3061: --- Summary: BigtableIO writes should support emitting "done" notifications (was: BigtableIO should support emitting a sentinel "done" value when a bundle completes) > BigtableIO writes should support emitting "done" notifications > -- > > Key: BEAM-3061 > URL: https://issues.apache.org/jira/browse/BEAM-3061 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Steve Niemitz >Assignee: Steve Niemitz >Priority: Major > Time Spent: 16h 20m > Remaining Estimate: 0h > > There was some discussion of this on the dev@ mailing list [1]. This > approach was taken based on discussion there. > [1] > https://lists.apache.org/thread.html/949b33782f722a9000c9bf9e37042739c6fd0927589b99752b78d7bd@%3Cdev.beam.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-6903) Go IT fails on quota issues frequently
[ https://issues.apache.org/jira/browse/BEAM-6903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885324#comment-16885324 ] Valentyn Tymofieiev commented on BEAM-6903: --- cc: [~alanmyrvold] for awareness as this is related to test flakes due to insufficient quota. > Go IT fails on quota issues frequently > -- > > Key: BEAM-6903 > URL: https://issues.apache.org/jira/browse/BEAM-6903 > Project: Beam > Issue Type: Bug > Components: test-failures >Reporter: Boyuan Zhang >Assignee: Jason Kuster >Priority: Major > > https://builds.apache.org/job/beam_PostCommit_Go/3002/ > https://builds.apache.org/job/beam_PostCommit_Go/3000/ > https://builds.apache.org/job/beam_PostCommit_Go/2997/ > https://builds.apache.org/job/beam_PostCommit_Go/2993/ -- This message was sent by Atlassian JIRA (v7.6.14#76016)