[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=277230&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-277230 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 16/Jul/19 05:45 Start Date: 16/Jul/19 05:45 Worklog Time Spent: 10m Work Description: amaliujia commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303734439 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamRuleSets.java ## @@ -103,6 +106,10 @@ // join rules JoinPushExpressionsRule.INSTANCE, + JoinCommuteRule.INSTANCE, Review comment: Because this PR implements join reordering, a useful test would be one in which a three-way join is reordered. Since join reordering is already measured in https://docs.google.com/document/d/1DM_bcfFbIoc_vEoqQxhC7AvHBUDVCAwToC8TYGukkII/edit#, wouldn't it be straightforward to add a similar test? Without such a test, how do we even know that join reordering is working? In terms of checking output plans, Flink has many tests of Calcite optimization rules (see [here](https://github.com/apache/flink/blob/master/flink-table/flink-table-planner-blink/src/test/scala/org/apache/flink/table/plan/rules/logical/RewriteMultiJoinConditionRuleTest.scala) and [here](https://github.com/apache/flink/tree/master/flink-table/flink-table-planner-blink/src/test/resources/org/apache/flink/table/plan)). Flink's practice has shown that verifying the output plan is deterministic and stable. 
The basic idea is: if you want to test an optimization, enable only the relevant rules in the test case (so the rules that are hit are known). Following Flink's approach, you can test rules as long as they can be disabled and enabled independently. In [BeamRuleSets.java](https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamRuleSets.java): ``` static final RuleSet BEAM_JOIN_REORDERING_RULES = RuleSets.ofList(JoinCommuteRule.INSTANCE, JoinAssociateRule.INSTANCE); ``` In JoinReorderingTest.java: ``` FrameworkConfig testConfig = createTestConfig(BEAM_JOIN_REORDERING_RULES); Planner testPlanner = Frameworks.getPlanner(testConfig); // set up input tables ... RelNode expectedPlan = PlanLoader.load(testCase.getClass()); verifyPlan(testPlanner.getPlan(sql), expectedPlan); ``` By doing so, if a relevant rule is disabled (e.g. `JoinCommuteRule.INSTANCE`), it will break the existing join reordering tests, which guards join ordering for us. It also demonstrates that this PR actually reorders joins. Because we are starting the effort to add more optimization rules to BeamSQL, Flink's testing practice is a great example that we can learn from and apply to BeamSQL to maintain the health of our codebase. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 277230) Time Spent: 7h 50m (was: 7h 40m) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 7h 50m > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
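The estimation scheme described in the issue (read the first few lines, then divide the total file length by the average sampled line length) can be sketched in Python. This is an illustrative sketch only, not Beam's implementation; `estimate_row_count` and its parameters are hypothetical names.

```python
import io

def estimate_row_count(file_bytes: bytes, sample_lines: int = 10) -> int:
    """Estimate the number of rows in a delimited text file by sampling.

    Reads up to `sample_lines` lines, computes their average length in
    bytes, and extrapolates over the total file size.
    """
    reader = io.BytesIO(file_bytes)
    sampled_lengths = []
    for _ in range(sample_lines):
        line = reader.readline()
        if not line:
            break
        sampled_lengths.append(len(line))
    if not sampled_lengths:
        return 0  # empty file: no rows to estimate
    avg_line_length = sum(sampled_lengths) / len(sampled_lengths)
    return round(len(file_bytes) / avg_line_length)
```

For a file of uniformly sized lines the estimate is exact; for variable-length rows it improves as the sample size grows.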
[jira] [Commented] (BEAM-6855) Side inputs are not supported when using the state API
[ https://issues.apache.org/jira/browse/BEAM-6855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885853#comment-16885853 ] Reuven Lax commented on BEAM-6855: -- [~kenn] it was never addressed. Can you confirm that the described solution is the correct one? > Side inputs are not supported when using the state API > -- > > Key: BEAM-6855 > URL: https://issues.apache.org/jira/browse/BEAM-6855 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Reuven Lax >Assignee: Shehzaad Nakhoda >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-6675) The JdbcIO sink should accept schemas
[ https://issues.apache.org/jira/browse/BEAM-6675?focusedWorklogId=277187&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-277187 ] ASF GitHub Bot logged work on BEAM-6675: Author: ASF GitHub Bot Created on: 16/Jul/19 04:27 Start Date: 16/Jul/19 04:27 Worklog Time Spent: 10m Work Description: amaliujia commented on issue #8962: [BEAM-6675] Generate JDBC statement and preparedStatementSetter automatically when schema is available URL: https://github.com/apache/beam/pull/8962#issuecomment-511659585 LGTM This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 277187) Time Spent: 7.5h (was: 7h 20m) > The JdbcIO sink should accept schemas > - > > Key: BEAM-6675 > URL: https://issues.apache.org/jira/browse/BEAM-6675 > Project: Beam > Issue Type: Sub-task > Components: io-java-jdbc >Reporter: Reuven Lax >Assignee: Shehzaad Nakhoda >Priority: Major > Time Spent: 7.5h > Remaining Estimate: 0h > > If the input has a schema, there should be a default mapping to a > PreparedStatement for writing based on that schema. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
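The default schema-based mapping discussed in this issue boils down to generating an INSERT statement with one placeholder per schema field, then binding row values positionally. A minimal sketch of the statement generation follows; Beam's actual implementation is in Java, and `insert_statement` is a hypothetical helper name.

```python
def insert_statement(table: str, field_names) -> str:
    """Build an INSERT statement with one '?' placeholder per schema field."""
    columns = ", ".join(field_names)
    placeholders = ", ".join("?" for _ in field_names)
    return f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
```

A PreparedStatementSetter then only needs to call `setObject(i + 1, row[i])` for each field, in schema order.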
[jira] [Commented] (BEAM-6855) Side inputs are not supported when using the state API
[ https://issues.apache.org/jira/browse/BEAM-6855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885830#comment-16885830 ] Kenneth Knowles commented on BEAM-6855: --- This was definitely an issue at one time. I thought someone had addressed this, but perhaps not. It would be great to confirm and to add ValidatesRunner tests. > Side inputs are not supported when using the state API > -- > > Key: BEAM-6855 > URL: https://issues.apache.org/jira/browse/BEAM-6855 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Reuven Lax >Assignee: Shehzaad Nakhoda >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7739) Add ValueState in Python sdk
[ https://issues.apache.org/jira/browse/BEAM-7739?focusedWorklogId=277173&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-277173 ] ASF GitHub Bot logged work on BEAM-7739: Author: ASF GitHub Bot Created on: 16/Jul/19 03:27 Start Date: 16/Jul/19 03:27 Worklog Time Spent: 10m Work Description: rakeshcusat commented on issue #9067: [BEAM-7739] Implement ValueState in Python SDK URL: https://github.com/apache/beam/pull/9067#issuecomment-511649980 @robertwb never mind, I found the discussion [here](https://lists.apache.org/thread.html/ccc0d548e440b63897b6784cd7896c266498df64c9c63ce6c52ae098@%3Cdev.beam.apache.org%3E) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 277173) Time Spent: 50m (was: 40m) > Add ValueState in Python sdk > > > Key: BEAM-7739 > URL: https://issues.apache.org/jira/browse/BEAM-7739 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Rakesh Kumar >Assignee: Rakesh Kumar >Priority: Major > Time Spent: 50m > Remaining Estimate: 0h > > Currently, ValueState is missing from the Python SDK but exists in the Java > SDK. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
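The Java SDK's ValueState is a single mutable cell scoped to a key and window, with read/write/clear semantics. The sketch below illustrates only those cell semantics in plain Python; it is not the Python SDK's actual state API, and the class name is hypothetical.

```python
class ValueState:
    """Sketch of ValueState semantics: one mutable cell per key/window.

    A sentinel distinguishes "never written" from a stored None.
    """
    _UNSET = object()

    def __init__(self):
        self._value = self._UNSET

    def write(self, value):
        self._value = value

    def read(self, default=None):
        return default if self._value is self._UNSET else self._value

    def clear(self):
        self._value = self._UNSET
```

In the real SDK, such a cell would be backed by the runner's state backend rather than an instance attribute.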
[jira] [Work logged] (BEAM-7530) Reading None value type BYTES from bigquery: AttributeError
[ https://issues.apache.org/jira/browse/BEAM-7530?focusedWorklogId=277100&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-277100 ] ASF GitHub Bot logged work on BEAM-7530: Author: ASF GitHub Bot Created on: 16/Jul/19 01:10 Start Date: 16/Jul/19 01:10 Worklog Time Spent: 10m Work Description: tvalentyn commented on issue #8875: [BEAM-7530] Add it test to read None values from BigQuery URL: https://github.com/apache/beam/pull/8875#issuecomment-511624977 Remaining postcommit failures (test_metrics_it, test_datastore_write_limit) are unrelated to the test that is modified in this PR, and failed due to lack of quota (known issue). @pabloem would you be comfortable merging this PR? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 277100) Time Spent: 7h 10m (was: 7h) > Reading None value type BYTES from bigquery: AttributeError > --- > > Key: BEAM-7530 > URL: https://issues.apache.org/jira/browse/BEAM-7530 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Juta Staes >Assignee: Juta Staes >Priority: Major > Fix For: 2.14.0 > > Time Spent: 7h 10m > Remaining Estimate: 0h > > When reading bigquery data where a field of type BYTES contains a None value > I get the following error: > {code:java} > Traceback (most recent call last): File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv/lib/python3.5/site-packages/apache_beam/runners/direct/executor.py", > line 343, in call finish_state) File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv/lib/python3.5/site-packages/apache_beam/runners/direct/executor.py", > line 383, in attempt_call result = evaluator.finish_bundle() File > 
"/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv/lib/python3.5/site-packages/apache_beam/runners/direct/transform_evaluator.py", > line 319, in finish_bundle bundles = _read_values_to_bundles(reader) File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv/lib/python3.5/site-packages/apache_beam/runners/direct/transform_evaluator.py", > line 306, in _read_values_to_bundles read_result = > [GlobalWindows.windowed_value(e) for e in reader] File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv/lib/python3.5/site-packages/apache_beam/runners/direct/transform_evaluator.py", > line 306, in <listcomp> read_result = [GlobalWindows.windowed_value(e) for e > in reader] File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv/lib/python3.5/site-packages/apache_beam/io/gcp/bigquery_tools.py", > line 932, in __iter__ row.f[i].v.string_value = > row.f[i].v.string_value.encode('utf-8') AttributeError: 'NoneType' object has > no attribute 'string_value'{code} > This bug was introduced in https://github.com/apache/beam/pull/8621 and is > present in the 2.13 release. > I submitted a pr to fix the issue: https://github.com/apache/beam/pull/8817 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
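The traceback shows `.encode('utf-8')` being called on a field value that is `None` for a NULL BYTES column. The fix amounts to guarding that call; the sketch below illustrates the guard with a hypothetical helper name, not the actual patch in the linked PR.

```python
def encode_bytes_field(value):
    """Encode a BigQuery BYTES field value as UTF-8 bytes, passing
    NULL (None) values through instead of calling .encode() on them."""
    if value is None:
        return None
    return value.encode("utf-8")
```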
[jira] [Work logged] (BEAM-7018) Regex transform for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-7018?focusedWorklogId=277094&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-277094 ] ASF GitHub Bot logged work on BEAM-7018: Author: ASF GitHub Bot Created on: 16/Jul/19 00:58 Start Date: 16/Jul/19 00:58 Worklog Time Spent: 10m Work Description: aaltay commented on issue #8859: [BEAM-7018] Added Regex transform for PythonSDK URL: https://github.com/apache/beam/pull/8859#issuecomment-511622769 @mszb Both Robert and I were out of office for a bit. The change looks good to me. I will give @robertwb another day in case he has any comments. After that we can merge it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 277094) Time Spent: 3h 10m (was: 3h) > Regex transform for Python SDK > -- > > Key: BEAM-7018 > URL: https://issues.apache.org/jira/browse/BEAM-7018 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Rose Nguyen >Assignee: Shehzaad Nakhoda >Priority: Minor > Time Spent: 3h 10m > Remaining Estimate: 0h > > PTransforms to use Regular Expressions to process elements in a PCollection. > It should offer the same API as its Java counterpart: > [https://github.com/apache/beam/blob/11a977b8b26eff2274d706541127c19dc93131a2/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Regex.java] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7018) Regex transform for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-7018?focusedWorklogId=277093&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-277093 ] ASF GitHub Bot logged work on BEAM-7018: Author: ASF GitHub Bot Created on: 16/Jul/19 00:56 Start Date: 16/Jul/19 00:56 Worklog Time Spent: 10m Work Description: aaltay commented on pull request #8859: [BEAM-7018] Added Regex transform for PythonSDK URL: https://github.com/apache/beam/pull/8859#discussion_r303692315 ## File path: sdks/python/apache_beam/transforms/util.py ## @@ -864,3 +867,225 @@ def add_window_info(element, timestamp=DoFn.TimestampParam, def expand(self, pcoll): return pcoll | ParDo(self.add_window_info) + + +@ptransform_fn +def matches_all_object(pcoll, regex, group=None): Review comment: Do we expect these to be public interfaces? If not, can we prefix them with `_` and move them under `Regex`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 277093) Time Spent: 3h (was: 2h 50m) > Regex transform for Python SDK > -- > > Key: BEAM-7018 > URL: https://issues.apache.org/jira/browse/BEAM-7018 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Rose Nguyen >Assignee: Shehzaad Nakhoda >Priority: Minor > Time Spent: 3h > Remaining Estimate: 0h > > PTransforms to use Regular Expressions to process elements in a PCollection. > It should offer the same API as its Java counterpart: > [https://github.com/apache/beam/blob/11a977b8b26eff2274d706541127c19dc93131a2/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Regex.java] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
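The element-wise behavior behind a `matches_all`-style helper can be sketched with only the `re` module: match each element against the pattern and emit the requested group. This is an illustrative sketch over a plain iterable; the real transform operates on PCollections, and `matches_all` here is a hypothetical name.

```python
import re

def matches_all(elements, regex, group=0):
    """Yield the requested group for every element the pattern matches
    at the start of the string (re.match semantics); skip non-matches."""
    pattern = re.compile(regex)
    for element in elements:
        m = pattern.match(element)
        if m is not None:
            yield m.group(group)
```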
[jira] [Assigned] (BEAM-2264) Re-use credential instead of generating a new one on each GCS call
[ https://issues.apache.org/jira/browse/BEAM-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Cwik reassigned BEAM-2264: --- Assignee: Udi Meiri > Re-use credential instead of generating a new one one each GCS call > --- > > Key: BEAM-2264 > URL: https://issues.apache.org/jira/browse/BEAM-2264 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Luke Cwik >Assignee: Udi Meiri >Priority: Minor > Time Spent: 1.5h > Remaining Estimate: 0h > > We should cache the credential used within a Pipeline and re-use it instead > of generating a new one on each GCS call. When executing (against 2.0.0 RC2): > {code} > python -m apache_beam.examples.wordcount --input > "gs://dataflow-samples/shakespeare/*" --output local_counts > {code} > Note that we seemingly generate a new access token each time instead of when > a refresh is required. > {code} > super(GcsIO, cls).__new__(cls, storage_client)) > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 1 files. Estimation > took 0.286200046539 seconds > INFO:root:Running pipeline with DirectRunner. > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 43 files. 
Estimation > took 0.205624818802 seconds > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > ... many more times ... > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
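The caching the issue asks for amounts to fetching a token once and refreshing only when it is near expiry, instead of on every call. A minimal sketch of that pattern follows; `CachedCredential` and its parameters are hypothetical names, not Beam's or oauth2client's API.

```python
import time

class CachedCredential:
    """Fetch a token once and reuse it until shortly before it expires."""

    def __init__(self, fetch_token, lifetime_s=3600.0, skew_s=60.0):
        self._fetch = fetch_token   # callable performing the real refresh
        self._lifetime = lifetime_s
        self._skew = skew_s         # refresh a bit early to avoid races
        self._token = None
        self._expires_at = 0.0

    def token(self):
        now = time.monotonic()
        if self._token is None or now >= self._expires_at - self._skew:
            self._token = self._fetch()
            self._expires_at = now + self._lifetime
        return self._token
```

Sharing one such object across all GCS calls in a pipeline would replace the repeated "Refreshing access_token" log lines with a single initial refresh.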
[jira] [Commented] (BEAM-2264) Re-use credential instead of generating a new one on each GCS call
[ https://issues.apache.org/jira/browse/BEAM-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885720#comment-16885720 ] Luke Cwik commented on BEAM-2264: - Thanks for doing the investigation. > Re-use credential instead of generating a new one one each GCS call > --- > > Key: BEAM-2264 > URL: https://issues.apache.org/jira/browse/BEAM-2264 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Luke Cwik >Priority: Minor > Time Spent: 1.5h > Remaining Estimate: 0h > > We should cache the credential used within a Pipeline and re-use it instead > of generating a new one on each GCS call. When executing (against 2.0.0 RC2): > {code} > python -m apache_beam.examples.wordcount --input > "gs://dataflow-samples/shakespeare/*" --output local_counts > {code} > Note that we seemingly generate a new access token each time instead of when > a refresh is required. > {code} > super(GcsIO, cls).__new__(cls, storage_client)) > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 1 files. Estimation > took 0.286200046539 seconds > INFO:root:Running pipeline with DirectRunner. > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 43 files. 
Estimation > took 0.205624818802 seconds > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > ... many more times ... > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7696) Detect classpath resources contains directory cause exception
[ https://issues.apache.org/jira/browse/BEAM-7696?focusedWorklogId=277073&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-277073 ] ASF GitHub Bot logged work on BEAM-7696: Author: ASF GitHub Bot Created on: 15/Jul/19 23:55 Start Date: 15/Jul/19 23:55 Worklog Time Spent: 10m Work Description: ibzib commented on pull request #9019: [BEAM-7696] Prepare files to stage also in local master of spark runner. URL: https://github.com/apache/beam/pull/9019#discussion_r303678227 ## File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/PipelineResources.java ## @@ -61,7 +61,10 @@ List files = new ArrayList<>(); for (URL url : ((URLClassLoader) classLoader).getURLs()) { try { -files.add(new File(url.toURI()).getAbsolutePath()); +File file = new File(url.toURI()); +if (file.exists()) { Review comment: Since SparkContext will reject all directories [1], it is probably best to exclude directories too. ```suggestion if (file.exists() && !file.isDirectory()) { ``` [1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1796-L1802 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 277073) Time Spent: 1h 40m (was: 1.5h) > Detect classpath resources contains directory cause exception > - > > Key: BEAM-7696 > URL: https://issues.apache.org/jira/browse/BEAM-7696 > Project: Beam > Issue Type: Bug > Components: runner-spark >Reporter: Wang Yanlin >Assignee: Wang Yanlin >Priority: Minor > Fix For: 2.15.0 > > Attachments: addJar_exception.jpg, files_contains_dir.jpg > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Run the unit test SparkPipelineStateTest.testBatchPipelineRunningState in > IntelliJ IDEA on my mac, get the IllegalArgumentException in the console > output. I check the source code, and find the result of > _PipelineResources.detectClassPathResourcesToStage_ contains directory, which > is the cause of the exception. > See the attached file 'addJar_exception.jpg' for detail, and the result of > _PipelineResources.detectClassPathResourcesToStage_ > is showed in attached file 'files_contains_dir.jpg' during debug. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
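The suggested change filters classpath entries to those that exist and are not directories before staging, since Spark's addJar rejects directories. The check itself can be sketched in Python (the actual fix is in Java's PipelineResources; `stageable_files` is a hypothetical name):

```python
import os
import tempfile  # used only by the usage example below

def stageable_files(paths):
    """Keep only paths that exist and are regular files; Spark's addJar
    rejects directories, and missing classpath entries cannot be staged."""
    return [p for p in paths if os.path.exists(p) and not os.path.isdir(p)]
```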
[jira] [Updated] (BEAM-7750) Pipeline instances are not garbage collected
[ https://issues.apache.org/jira/browse/BEAM-7750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Strokach updated BEAM-7750: -- Description: It seems that Apache Beam's Pipeline instances are not garbage collected, even if the pipelines are finished or cancelled and there are no references to those pipelines in the Python interpreter. For pipelines executed in a script, this is not a problem. However, for interactive pipelines executed inside a Jupyter notebook, this limits how well we can track and remove the dependencies of those pipelines. For example, if a pipeline reads from some cache, it would be nice to be able to delete that cache once there are no references to it from a pipeline or the global namespace. The issue can be reproduced using the following script: [https://github.com/ostrokach/beam-notebooks/blob/48718038e63342a5f3acc31352a6326fffd34888/scripts/error_pipeline_gc.py] was: It seems that Apache Beam's Pipeline instances are not garbage collected, even if the pipelines are finished or cancelled, and there are no references to those pipelines in the Python interpreter. For pipelines executed in a script, this is not a problem. However, for interactive pipelines executed inside a Jupyter notebook, this limits how well we can track and remove the dependencies of those pipelines. For example, if a pipeline reads from some cache, it would be nice to be able to delete that cache once there are no references to it from a pipeline or the global namespace. The issue can be reproduced using the following script: https://github.com/ostrokach/beam-notebooks/blob/48718038e63342a5f3acc31352a6326fffd34888/scripts/error_pipeline_gc.py > Pipeline instances are not garbage collected > > > Key: BEAM-7750 > URL: https://issues.apache.org/jira/browse/BEAM-7750 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.14.0 > Environment: OS: Debian rodete. 
> Tested using: > Beam versions: 2.13.0, 2.15.0.dev > Python versions: Python 2.7, Python 3.7. > Runners: DirectRunner, DataflowRunner. >Reporter: Alexey Strokach >Priority: Minor > > It seems that Apache Beam's Pipeline instances are not garbage collected, > even if the pipelines are finished or cancelled and there are no references > to those pipelines in the Python interpreter. > For pipelines executed in a script, this is not a problem. However, for > interactive pipelines executed inside a Jupyter notebook, this limits how > well we can track and remove the dependencies of those pipelines. For > example, if a pipeline reads from some cache, it would be nice to be able to > delete that cache once there are no references to it from a pipeline or the > global namespace. > The issue can be reproduced using the following script: > [https://github.com/ostrokach/beam-notebooks/blob/48718038e63342a5f3acc31352a6326fffd34888/scripts/error_pipeline_gc.py] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
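Whether an instance really becomes unreachable after its last strong reference is dropped can be checked with a weak reference, which is roughly what the linked reproduction script does. In this sketch `FakePipeline` stands in for `beam.Pipeline`, and `is_collected` is a hypothetical helper.

```python
import gc
import weakref

class FakePipeline:
    """Stand-in for beam.Pipeline; any weak-referenceable object works."""

def is_collected(make_obj) -> bool:
    """Return True if the object produced by make_obj() is garbage
    collected once the last strong reference to it is dropped."""
    obj = make_obj()
    ref = weakref.ref(obj)
    del obj
    gc.collect()
    return ref() is None
```

If a hidden registry (or a reference cycle pinned by a C extension) keeps the pipeline alive, the weak reference stays live and this check returns False, which matches the leak described in the issue.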
[jira] [Updated] (BEAM-7750) Pipeline instances are not garbage collected
[ https://issues.apache.org/jira/browse/BEAM-7750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Strokach updated BEAM-7750: -- Description: It seems that Apache Beam's Pipeline instances are not garbage collected, even if the pipelines are finished or cancelled and there are no references to those pipelines in the Python interpreter. For pipelines executed in a script, this is not a problem. However, for interactive pipelines executed inside a Jupyter notebook, this limits how well we can track and remove the dependencies of those pipelines. For example, if a pipeline reads from some cache, it would be nice to be able to delete that cache once there are no references to it from pipelines or the global namespace. The issue can be reproduced using the following script: [https://github.com/ostrokach/beam-notebooks/blob/48718038e63342a5f3acc31352a6326fffd34888/scripts/error_pipeline_gc.py] was: It seems that Apache Beam's Pipeline instances are not garbage collected, even if the pipelines are finished or cancelled and there are no references to those pipelines in the Python interpreter. For pipelines executed in a script, this is not a problem. However, for interactive pipelines executed inside a Jupyter notebook, this limits how well we can track and remove the dependencies of those pipelines. For example, if a pipeline reads from some cache, it would be nice to be able to delete that cache once there are no references to it from a pipeline or the global namespace. The issue can be reproduced using the following script: [https://github.com/ostrokach/beam-notebooks/blob/48718038e63342a5f3acc31352a6326fffd34888/scripts/error_pipeline_gc.py] > Pipeline instances are not garbage collected > > > Key: BEAM-7750 > URL: https://issues.apache.org/jira/browse/BEAM-7750 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.14.0 > Environment: OS: Debian rodete. 
> Tested using: > Beam versions: 2.13.0, 2.15.0.dev > Python versions: Python 2.7, Python 3.7. > Runners: DirectRunner, DataflowRunner. >Reporter: Alexey Strokach >Priority: Minor > > It seems that Apache Beam's Pipeline instances are not garbage collected, > even if the pipelines are finished or cancelled and there are no references > to those pipelines in the Python interpreter. > For pipelines executed in a script, this is not a problem. However, for > interactive pipelines executed inside a Jupyter notebook, this limits how > well we can track and remove the dependencies of those pipelines. For > example, if a pipeline reads from some cache, it would be nice to be able to > delete that cache once there are no references to it from pipelines or the > global namespace. > The issue can be reproduced using the following script: > [https://github.com/ostrokach/beam-notebooks/blob/48718038e63342a5f3acc31352a6326fffd34888/scripts/error_pipeline_gc.py] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (BEAM-7750) Pipeline instances are not garbage collected
Alexey Strokach created BEAM-7750: - Summary: Pipeline instances are not garbage collected Key: BEAM-7750 URL: https://issues.apache.org/jira/browse/BEAM-7750 Project: Beam Issue Type: Bug Components: sdk-py-core Affects Versions: 2.14.0 Environment: OS: Debian rodete. Tested using: Beam versions: 2.13.0, 2.15.0.dev Python versions: Python 2.7, Python 3.7. Runners: DirectRunner, DataflowRunner. Reporter: Alexey Strokach It seems that Apache Beam's Pipeline instances are not garbage collected, even if the pipelines are finished or cancelled, and there are no references to those pipelines in the Python interpreter. For pipelines executed in a script, this is not a problem. However, for interactive pipelines executed inside a Jupyter notebook, this limits how well we can track and remove the dependencies of those pipelines. For example, if a pipeline reads from some cache, it would be nice to be able to delete that cache once there are no references to it from a pipeline or the global namespace. The issue can be reproduced using the following script: https://github.com/ostrokach/beam-notebooks/blob/48718038e63342a5f3acc31352a6326fffd34888/scripts/error_pipeline_gc.py -- This message was sent by Atlassian JIRA (v7.6.14#76016)
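The diagnostic technique used by the reproduction script linked above can be sketched in a few self-contained lines: hold only a weak reference to an object, drop all strong references, force a collection pass, and check whether the weak reference went dead. This is a sketch under assumptions — `FakePipeline` is a hypothetical stand-in for `apache_beam.Pipeline`, since the real reproduction requires constructing and running an actual pipeline; the point here is only the detection method.

```python
import gc
import weakref


class FakePipeline:
    """Hypothetical stand-in for apache_beam.Pipeline.

    A plain object with no hidden external references; unlike what the
    reporter observed for real Pipeline instances, this IS collected.
    """


def is_collected(make_obj):
    """Return True if the object is reclaimed once unreferenced."""
    obj = make_obj()
    ref = weakref.ref(obj)   # weak reference does not keep obj alive
    del obj                  # drop the only strong reference
    gc.collect()             # force a full collection pass
    return ref() is None     # dead weakref => object was reclaimed


print(is_collected(FakePipeline))  # prints True
```

In the reported bug, the equivalent check against a finished or cancelled `Pipeline` (with no remaining references in the interpreter) would return `False`, indicating something — e.g. a runner- or SDK-held reference cycle — is keeping the pipeline alive.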
[jira] [Comment Edited] (BEAM-7674) Define streaming IT tests for the direct runner in a consistent way in Python 2 and Python 3 suites.
[ https://issues.apache.org/jira/browse/BEAM-7674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885697#comment-16885697 ] Pablo Estrada edited comment on BEAM-7674 at 7/15/19 11:30 PM: --- Sorry for the delay to answer. That test should be a pure streaming test (see how it uses TestStream). Edit: ugh, well, that's embarrassing: I see you've added it. Thanks guys : ) was (Author: pabloem): Sorry for the delay to answer. That test should be a pure streaming test (see how it uses TestStream). > Define streaming ITs tests for direct runner in consistent way in Python 2 > and Python 3 suites. > > > Key: BEAM-7674 > URL: https://issues.apache.org/jira/browse/BEAM-7674 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Valentyn Tymofieiev >Assignee: Juta Staes >Priority: Major > > Currently in Python 2 direct runner test suite some tests run in streaming > mode: > https://github.com/apache/beam/blob/fbd1f4cf7118c7b2fb4e3a4cf46646e98f3e3b8d/sdks/python/build.gradle#L130 > However in Python 3, we run both Batch and Streaming direct runner tests in > Batch mode: > https://github.com/apache/beam/blob/fbd1f4cf7118c7b2fb4e3a4cf46646e98f3e3b8d/sdks/python/test-suites/direct/py35/build.gradle#L32 > We should check whether we need to explicitly separate the tests into batch > and streaming and define all directrunner suites consistently. > cc: [~Juta] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7674) Define streaming IT tests for the direct runner in a consistent way in Python 2 and Python 3 suites.
[ https://issues.apache.org/jira/browse/BEAM-7674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885697#comment-16885697 ] Pablo Estrada commented on BEAM-7674: - Sorry for the delay to answer. That test should be a pure streaming test (see how it uses TestStream). > Define streaming ITs tests for direct runner in consistent way in Python 2 > and Python 3 suites. > > > Key: BEAM-7674 > URL: https://issues.apache.org/jira/browse/BEAM-7674 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Valentyn Tymofieiev >Assignee: Juta Staes >Priority: Major > > Currently in Python 2 direct runner test suite some tests run in streaming > mode: > https://github.com/apache/beam/blob/fbd1f4cf7118c7b2fb4e3a4cf46646e98f3e3b8d/sdks/python/build.gradle#L130 > However in Python 3, we run both Batch and Streaming direct runner tests in > Batch mode: > https://github.com/apache/beam/blob/fbd1f4cf7118c7b2fb4e3a4cf46646e98f3e3b8d/sdks/python/test-suites/direct/py35/build.gradle#L32 > We should check whether we need to explicitly separate the tests into batch > and streaming and define all directrunner suites consistently. > cc: [~Juta] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-6611) A Python Sink for BigQuery with File Loads in Streaming
[ https://issues.apache.org/jira/browse/BEAM-6611?focusedWorklogId=277063&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-277063 ] ASF GitHub Bot logged work on BEAM-6611: Author: ASF GitHub Bot Created on: 15/Jul/19 23:27 Start Date: 15/Jul/19 23:27 Worklog Time Spent: 10m Work Description: tvalentyn commented on issue #8871: [BEAM-6611] BigQuery file loads in Streaming for Python SDK URL: https://github.com/apache/beam/pull/8871#issuecomment-511606144 > @tvalentyn are you aware of why these tests could be getting triggered? See: [#8871 (comment)](https://github.com/apache/beam/pull/8871#issuecomment-511364580) @pabloem, PTAL at https://issues.apache.org/jira/browse/BEAM-7674, it may be relevant to this question. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 277063) Time Spent: 3h 20m (was: 3h 10m) > A Python Sink for BigQuery with File Loads in Streaming > --- > > Key: BEAM-6611 > URL: https://issues.apache.org/jira/browse/BEAM-6611 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Pablo Estrada >Assignee: Tanay Tummalapalli >Priority: Major > Labels: gsoc, gsoc2019, mentor > Time Spent: 3h 20m > Remaining Estimate: 0h > > The Java SDK supports a bunch of methods for writing data into BigQuery, > while the Python SDK supports the following: > - Streaming inserts for streaming pipelines [As seen in [bigquery.py and > BigQueryWriteFn|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L649-L813]] > - File loads for batch pipelines [As implemented in [PR > 7655|https://github.com/apache/beam/pull/7655]] > Quick and dirty early design doc: https://s.apache.org/beam-bqfl-py-streaming > The Java SDK also supports File Loads for 
Streaming pipelines [see BatchLoads > application|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1709-L1776]. > File loads have the advantage of being much cheaper than streaming inserts > (although they also are slower for the records to show up in the table). -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-2264) Re-use credential instead of generating a new one on each GCS call
[ https://issues.apache.org/jira/browse/BEAM-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885693#comment-16885693 ] Udi Meiri commented on BEAM-2264: - Migrating off of apitools-based clients (such as the one used for GCS) is a much larger project. Regarding threading issues, I believe that we came across them when trying to reuse GCS clients in multiple threads (https://issues.apache.org/jira/browse/BEAM-3990). The credentials, however, should be thread-safe according to this note: https://github.com/googleapis/google-api-python-client/blob/master/docs/thread_safety.md In any case, the newer credentials objects don't try to be thread-safe, but that is apparently not an issue: https://github.com/googleapis/google-auth-library-python/issues/246#issuecomment-371878855 (it just potentially causes more refreshes) > Re-use credential instead of generating a new one on each GCS call > --- > > Key: BEAM-2264 > URL: https://issues.apache.org/jira/browse/BEAM-2264 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Luke Cwik >Priority: Minor > Time Spent: 1.5h > Remaining Estimate: 0h > > We should cache the credential used within a Pipeline and re-use it instead > of generating a new one on each GCS call. When executing (against 2.0.0 RC2): > {code} > python -m apache_beam.examples.wordcount --input > "gs://dataflow-samples/shakespeare/*" --output local_counts > {code} > Note that we seemingly generate a new access token each time instead of when > a refresh is required. > {code} > super(GcsIO, cls).__new__(cls, storage_client)) > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 1 files. Estimation > took 0.286200046539 seconds > INFO:root:Running pipeline with DirectRunner. 
> INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 43 files. Estimation > took 0.205624818802 seconds > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > ... many more times ... > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-4775) JobService should support returning metrics
[ https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=277060&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-277060 ] ASF GitHub Bot logged work on BEAM-4775: Author: ASF GitHub Bot Created on: 15/Jul/19 23:16 Start Date: 15/Jul/19 23:16 Worklog Time Spent: 10m Work Description: ibzib commented on pull request #9020: [BEAM-4775] Support returning metrics from job service URL: https://github.com/apache/beam/pull/9020#discussion_r303674183 ## File path: runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/jobsubmission/JobInvocation.java ## @@ -93,6 +95,7 @@ public void onSuccess(@Nullable PipelineResult pipelineResult) { switch (pipelineResult.getState()) { case DONE: setState(Enum.DONE); + metrics = pipelineResult.portableMetrics(); Review comment: Sure, looks good :+1: This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 277060) Time Spent: 42h 40m (was: 42.5h) > JobService should support returning metrics > --- > > Key: BEAM-4775 > URL: https://issues.apache.org/jira/browse/BEAM-4775 > Project: Beam > Issue Type: Bug > Components: beam-model >Reporter: Eugene Kirpichov >Assignee: Lukasz Gajowy >Priority: Major > Time Spent: 42h 40m > Remaining Estimate: 0h > > Design doc: [https://s.apache.org/get-metrics-api]. > Further discussion is ongoing on [this > doc|https://docs.google.com/document/d/1m83TsFvJbOlcLfXVXprQm1B7vUakhbLZMzuRrOHWnTg/edit?ts=5c826bb4#heading=h.faqan9rjc6dm]. > We want to report job metrics back to the portability harness from the runner > harness, for displaying to users. > h1. Relevant PRs in flight: > h2. 
Ready for Review: > * [#8022|https://github.com/apache/beam/pull/8022]: correct the Job RPC > protos from [#8018|https://github.com/apache/beam/pull/8018]. > h2. Iterating / Discussing: > * [#7971|https://github.com/apache/beam/pull/7971]: Flink portable metrics: > get ptransform from MonitoringInfo, not stage name > ** this is a simpler, Flink-specific PR that is basically duplicated inside > each of the following two, so may be worth trying to merge in first > * #[7915|https://github.com/apache/beam/pull/7915]: use MonitoringInfo data > model in Java SDK metrics > * [#7868|https://github.com/apache/beam/pull/7868]: MonitoringInfo URN tweaks > h2. Merged > * [#8018|https://github.com/apache/beam/pull/8018]: add job metrics RPC > protos > * [#7867|https://github.com/apache/beam/pull/7867]: key MetricResult by a > MetricKey > * [#7938|https://github.com/apache/beam/pull/7938]: move MonitoringInfo > protos to model/pipeline module > * [#7883|https://github.com/apache/beam/pull/7883]: Add > MetricQueryResults.allMetrics() helper > * [#7866|https://github.com/apache/beam/pull/7866]: move function helpers > from fn-harness to sdks/java/core > * [#7890|https://github.com/apache/beam/pull/7890]: consolidate MetricResult > implementations > h2. Closed > * [#7934|https://github.com/apache/beam/pull/7934]: job metrics RPC + SDK > support > * [#7876|https://github.com/apache/beam/pull/7876]: Clean up metric protos; > support integer distributions, gauges -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7463) BigQueryQueryToTableIT is flaky on Direct runner in PostCommit suites: incorrect checksum
[ https://issues.apache.org/jira/browse/BEAM-7463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885686#comment-16885686 ] Pablo Estrada commented on BEAM-7463: - FWIW I'm running with Py3.5 > BigQueryQueryToTableIT is flaky on Direct runner in PostCommit suites: > incorrect checksum > -- > > Key: BEAM-7463 > URL: https://issues.apache.org/jira/browse/BEAM-7463 > Project: Beam > Issue Type: Bug > Components: io-python-gcp >Reporter: Valentyn Tymofieiev >Assignee: Pablo Estrada >Priority: Major > Labels: currently-failing > Time Spent: 5h 20m > Remaining Estimate: 0h > > {noformat} > 15:03:38 FAIL: test_big_query_new_types > (apache_beam.io.gcp.big_query_query_to_table_it_test.BigQueryQueryToTableIT) > 15:03:38 > -- > 15:03:38 Traceback (most recent call last): > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_it_test.py", > line 211, in test_big_query_new_types > 15:03:38 big_query_query_to_table_pipeline.run_bq_pipeline(options) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_pipeline.py", > line 82, in run_bq_pipeline > 15:03:38 result = p.run() > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/testing/test_pipeline.py", > line 107, in run > 15:03:38 else test_runner_api)) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/pipeline.py", > line 406, in run > 15:03:38 self._options).run(False) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/pipeline.py", > line 419, in run > 15:03:38 return self.runner.run_pipeline(self, self._options) > 15:03:38 File > 
"/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/runners/direct/test_direct_runner.py", > line 51, in run_pipeline > 15:03:38 hc_assert_that(self.result, pickler.loads(on_success_matcher)) > 15:03:38 AssertionError: > 15:03:38 Expected: (Test pipeline expected terminated in state: DONE and > Expected checksum is 24de460c4d344a4b77ccc4cc1acb7b7ffc11a214) > 15:03:38 but: Expected checksum is > 24de460c4d344a4b77ccc4cc1acb7b7ffc11a214 Actual checksum is > da39a3ee5e6b4b0d3255bfef95601890afd80709 > {noformat} > [~Juta] could this be caused by changes to Bigquery matcher? > https://github.com/apache/beam/pull/8621/files#diff-f1ec7e3a3e7e2e5082ddb7043954c108R134 > > cc: [~pabloem] [~chamikara] [~apilloud] > A recent postcommit run has BQ failures in other tests as well: > https://builds.apache.org/job/beam_PostCommit_Python3_Verify/1000/consoleFull -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-5874) Python BQ directRunnerIT test fails with half-empty assertion
[ https://issues.apache.org/jira/browse/BEAM-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885685#comment-16885685 ] Pablo Estrada commented on BEAM-5874: - I am also unable to repro this on my machine with Py3.5 > Python BQ directRunnerIT test fails with half-empty assertion > - > > Key: BEAM-5874 > URL: https://issues.apache.org/jira/browse/BEAM-5874 > Project: Beam > Issue Type: Bug > Components: test-failures >Reporter: Henning Rohde >Assignee: Pablo Estrada >Priority: Major > Labels: currently-failing > > https://scans.gradle.com/s/2fd5wbm7vkkna/console-log?task=:beam-sdks-python:directRunnerIT#L147 > == > FAIL: test_big_query_new_types > (apache_beam.io.gcp.big_query_query_to_table_it_test.BigQueryQueryToTableIT) > -- > Traceback (most recent call last): > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_it_test.py", > line 169, in test_big_query_new_types > big_query_query_to_table_pipeline.run_bq_pipeline(options) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_pipeline.py", > line 67, in run_bq_pipeline > result = p.run() > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/testing/test_pipeline.py", > line 107, in run > else test_runner_api)) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/pipeline.py", > line 416, in run > return self.runner.run_pipeline(self) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/runners/direct/test_direct_runner.py", > line 51, in run_pipeline > hc_assert_that(self.result, pickler.loads(on_success_matcher)) > AssertionError: > Expected: (Test pipeline expected terminated in state: DONE and Expected > checksum is 
e1fbcb5ca479a5ca5f9ecf444d6998beee4d44c6) > but: > -- > XML: > /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/nosetests.xml > -- > Ran 7 tests in 24.434s > FAILED (failures=1) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-2264) Re-use credential instead of generating a new one on each GCS call
[ https://issues.apache.org/jira/browse/BEAM-2264?focusedWorklogId=277040&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-277040 ] ASF GitHub Bot logged work on BEAM-2264: Author: ASF GitHub Bot Created on: 15/Jul/19 22:53 Start Date: 15/Jul/19 22:53 Worklog Time Spent: 10m Work Description: udim commented on pull request #9060: [BEAM-2264] Reuse GCP credentials in GCS calls. URL: https://github.com/apache/beam/pull/9060#discussion_r303669216 ## File path: sdks/python/apache_beam/internal/gcp/auth.py ## @@ -57,77 +71,58 @@ def set_running_in_gce(worker_executing_project): executing_project = worker_executing_project -class AuthenticationException(retry.PermanentException): - pass - - -class _GCEMetadataCredentials(OAuth2Credentials): - """For internal use only; no backwards-compatibility guarantees. - - Credential object initialized using access token from GCE VM metadata.""" - - def __init__(self, user_agent=None): -"""Create an instance of GCEMetadataCredentials. - -These credentials are generated by contacting the metadata server on a GCE -VM instance. - -Args: - user_agent: string, The HTTP User-Agent to provide for this application. -""" -super(_GCEMetadataCredentials, self).__init__( -None, # access_token -None, # client_id -None, # client_secret -None, # refresh_token -datetime.datetime(2010, 1, 1), # token_expiry, set to time in past. 
-None, # token_uri -user_agent) - - @retry.with_exponential_backoff( - retry_filter=retry.retry_on_server_errors_and_timeout_filter) - def _refresh(self, http_request): -refresh_time = datetime.datetime.utcnow() -metadata_root = os.environ.get( -'GCE_METADATA_ROOT', 'metadata.google.internal') -token_url = ('http://{}/computeMetadata/v1/instance/service-accounts/' - 'default/token').format(metadata_root) -req = Request(token_url, headers={'Metadata-Flavor': 'Google'}) -token_data = json.loads(urlopen(req, timeout=60).read().decode('utf-8')) -self.access_token = token_data['access_token'] -self.token_expiry = (refresh_time + - datetime.timedelta(seconds=token_data['expires_in'])) - - def get_service_credentials(): """For internal use only; no backwards-compatibility guarantees. - Get credentials to access Google services.""" - user_agent = 'beam-python-sdk/1.0' - if is_running_in_gce: -# We are currently running as a GCE taskrunner worker. -# -# TODO(ccy): It's not entirely clear if these credentials are thread-safe. -# If so, we can cache these credentials to save the overhead of creating -# them again. -return _GCEMetadataCredentials(user_agent=user_agent) - else: -client_scopes = [ -'https://www.googleapis.com/auth/bigquery', -'https://www.googleapis.com/auth/cloud-platform', -'https://www.googleapis.com/auth/devstorage.full_control', -'https://www.googleapis.com/auth/userinfo.email', -'https://www.googleapis.com/auth/datastore' -] - -try: - credentials = GoogleCredentials.get_application_default() - credentials = credentials.create_scoped(client_scopes) - logging.debug('Connecting using Google Application Default ' -'Credentials.') - return credentials -except Exception as e: - logging.warning( - 'Unable to find default credentials to use: %s\n' - 'Connecting anonymously.', e) - return None + Get credentials to access Google services. + + Returns: +A ``oauth2client.client.OAuth2Credentials`` object or None if credentials +not found. 
Returned object is thread-safe. + """ + return _Credentials.get_service_credentials() + + +class _Credentials(object): + _credentials_lock = threading.Lock() + _credentials_init = False + _credentials = None + + @classmethod + def get_service_credentials(cls): +if cls._credentials_init: + return cls._credentials Review comment: yes; they auto-refresh This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 277040) Time Spent: 1.5h (was: 1h 20m) > Re-use credential instead of generating a new one one each GCS call > --- > > Key: BEAM-2264 > URL: https://issues.apache.org/jira/browse/BEAM-2264 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core
[jira] [Commented] (BEAM-6855) Side inputs are not supported when using the state API
[ https://issues.apache.org/jira/browse/BEAM-6855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885670#comment-16885670 ] Reuven Lax commented on BEAM-6855: -- I believe that the side-input code in SimplePushbackSideInputDoFnRunner needs to be implemented in StatefulDoFnRunner. [~kenn] can you confirm? > Side inputs are not supported when using the state API > -- > > Key: BEAM-6855 > URL: https://issues.apache.org/jira/browse/BEAM-6855 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Reuven Lax >Assignee: Shehzaad Nakhoda >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7463) BigQueryQueryToTableIT is flaky on Direct runner in PostCommit suites: incorrect checksum
[ https://issues.apache.org/jira/browse/BEAM-7463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885665#comment-16885665 ] Pablo Estrada commented on BEAM-7463: - These tests run in about 4s, but they do seem to perform all of the actions they're supposed to. I've run them more than 20 times, and I'm not seeing any problems. It may really be specific to a particular machine? I'll investigate further. > BigQueryQueryToTableIT is flaky on Direct runner in PostCommit suites: > incorrect checksum > -- > > Key: BEAM-7463 > URL: https://issues.apache.org/jira/browse/BEAM-7463 > Project: Beam > Issue Type: Bug > Components: io-python-gcp >Reporter: Valentyn Tymofieiev >Assignee: Pablo Estrada >Priority: Major > Labels: currently-failing > Time Spent: 5h 20m > Remaining Estimate: 0h > > {noformat} > 15:03:38 FAIL: test_big_query_new_types > (apache_beam.io.gcp.big_query_query_to_table_it_test.BigQueryQueryToTableIT) > 15:03:38 > -- > 15:03:38 Traceback (most recent call last): > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_it_test.py", > line 211, in test_big_query_new_types > 15:03:38 big_query_query_to_table_pipeline.run_bq_pipeline(options) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_pipeline.py", > line 82, in run_bq_pipeline > 15:03:38 result = p.run() > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/testing/test_pipeline.py", > line 107, in run > 15:03:38 else test_runner_api)) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/pipeline.py", > line 406, in run > 15:03:38 self._options).run(False) > 15:03:38 File > 
"/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/pipeline.py", > line 419, in run > 15:03:38 return self.runner.run_pipeline(self, self._options) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/runners/direct/test_direct_runner.py", > line 51, in run_pipeline > 15:03:38 hc_assert_that(self.result, pickler.loads(on_success_matcher)) > 15:03:38 AssertionError: > 15:03:38 Expected: (Test pipeline expected terminated in state: DONE and > Expected checksum is 24de460c4d344a4b77ccc4cc1acb7b7ffc11a214) > 15:03:38 but: Expected checksum is > 24de460c4d344a4b77ccc4cc1acb7b7ffc11a214 Actual checksum is > da39a3ee5e6b4b0d3255bfef95601890afd80709 > {noformat} > [~Juta] could this be caused by changes to Bigquery matcher? > https://github.com/apache/beam/pull/8621/files#diff-f1ec7e3a3e7e2e5082ddb7043954c108R134 > > cc: [~pabloem] [~chamikara] [~apilloud] > A recent postcommit run has BQ failures in other tests as well: > https://builds.apache.org/job/beam_PostCommit_Python3_Verify/1000/consoleFull -- This message was sent by Atlassian JIRA (v7.6.14#76016)
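One concrete clue in the failure above: the reported "actual checksum", `da39a3ee5e6b4b0d3255bfef95601890afd80709`, is the SHA-1 digest of empty input. That suggests (though does not prove) the matcher checksummed zero rows — i.e. the query result was empty or not yet visible when the on-success matcher ran — rather than checksumming wrong data:

```python
import hashlib

# The "actual checksum" in the failing assertion equals the SHA-1 of the
# empty byte string, i.e. a checksum computed over no rows at all.
empty_digest = hashlib.sha1(b"").hexdigest()
print(empty_digest)  # prints da39a3ee5e6b4b0d3255bfef95601890afd80709
```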
[jira] [Work logged] (BEAM-2264) Re-use credential instead of generating a new one on each GCS call
[ https://issues.apache.org/jira/browse/BEAM-2264?focusedWorklogId=277009&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-277009 ] ASF GitHub Bot logged work on BEAM-2264: Author: ASF GitHub Bot Created on: 15/Jul/19 21:53 Start Date: 15/Jul/19 21:53 Worklog Time Spent: 10m Work Description: aaltay commented on pull request #9060: [BEAM-2264] Reuse GCP credentials in GCS calls. URL: https://github.com/apache/beam/pull/9060#discussion_r303653342 ## File path: sdks/python/apache_beam/internal/gcp/auth.py ## @@ -57,77 +71,58 @@ def set_running_in_gce(worker_executing_project): executing_project = worker_executing_project -class AuthenticationException(retry.PermanentException): - pass - - -class _GCEMetadataCredentials(OAuth2Credentials): - """For internal use only; no backwards-compatibility guarantees. - - Credential object initialized using access token from GCE VM metadata.""" - - def __init__(self, user_agent=None): -"""Create an instance of GCEMetadataCredentials. - -These credentials are generated by contacting the metadata server on a GCE -VM instance. - -Args: - user_agent: string, The HTTP User-Agent to provide for this application. -""" -super(_GCEMetadataCredentials, self).__init__( -None, # access_token -None, # client_id -None, # client_secret -None, # refresh_token -datetime.datetime(2010, 1, 1), # token_expiry, set to time in past. 
-None, # token_uri -user_agent) - - @retry.with_exponential_backoff( - retry_filter=retry.retry_on_server_errors_and_timeout_filter) - def _refresh(self, http_request): -refresh_time = datetime.datetime.utcnow() -metadata_root = os.environ.get( -'GCE_METADATA_ROOT', 'metadata.google.internal') -token_url = ('http://{}/computeMetadata/v1/instance/service-accounts/' - 'default/token').format(metadata_root) -req = Request(token_url, headers={'Metadata-Flavor': 'Google'}) -token_data = json.loads(urlopen(req, timeout=60).read().decode('utf-8')) -self.access_token = token_data['access_token'] -self.token_expiry = (refresh_time + - datetime.timedelta(seconds=token_data['expires_in'])) - - def get_service_credentials(): """For internal use only; no backwards-compatibility guarantees. - Get credentials to access Google services.""" - user_agent = 'beam-python-sdk/1.0' - if is_running_in_gce: -# We are currently running as a GCE taskrunner worker. -# -# TODO(ccy): It's not entirely clear if these credentials are thread-safe. -# If so, we can cache these credentials to save the overhead of creating -# them again. -return _GCEMetadataCredentials(user_agent=user_agent) - else: -client_scopes = [ -'https://www.googleapis.com/auth/bigquery', -'https://www.googleapis.com/auth/cloud-platform', -'https://www.googleapis.com/auth/devstorage.full_control', -'https://www.googleapis.com/auth/userinfo.email', -'https://www.googleapis.com/auth/datastore' -] - -try: - credentials = GoogleCredentials.get_application_default() - credentials = credentials.create_scoped(client_scopes) - logging.debug('Connecting using Google Application Default ' -'Credentials.') - return credentials -except Exception as e: - logging.warning( - 'Unable to find default credentials to use: %s\n' - 'Connecting anonymously.', e) - return None + Get credentials to access Google services. + + Returns: +A ``oauth2client.client.OAuth2Credentials`` object or None if credentials +not found. 
Returned object is thread-safe. + """ + return _Credentials.get_service_credentials() + + +class _Credentials(object): + _credentials_lock = threading.Lock() + _credentials_init = False + _credentials = None + + @classmethod + def get_service_credentials(cls): +if cls._credentials_init: + return cls._credentials Review comment: Is it possible for credentials to expire? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 277009) Time Spent: 1h 20m (was: 1h 10m) > Re-use credential instead of generating a new one one each GCS call > --- > > Key: BEAM-2264 > URL: https://issues.apache.org/jira/browse/BEAM-2264 > Project: Beam > Issue Type: Improvement >
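The review question above asks whether cached credentials can expire. A minimal sketch of a cache that is both thread-safe and expiry-aware follows; the class and token names here are illustrative stand-ins, not Beam's actual `_Credentials` implementation:

```python
import datetime
import threading


class FakeToken:
    """Illustrative stand-in for an OAuth2 credential object."""

    def __init__(self, lifetime_secs=3600):
        self.expiry = (datetime.datetime.utcnow()
                       + datetime.timedelta(seconds=lifetime_secs))

    def expired(self):
        return datetime.datetime.utcnow() >= self.expiry


class CredentialCache:
    """Lazily create credentials once, re-creating them only on expiry."""

    _lock = threading.Lock()
    _credentials = None

    @classmethod
    def get(cls, factory=FakeToken):
        # Fast path: reuse a cached, still-valid credential without locking.
        creds = cls._credentials
        if creds is not None and not creds.expired():
            return creds
        with cls._lock:
            # Re-check under the lock: another thread may have refreshed.
            if cls._credentials is None or cls._credentials.expired():
                cls._credentials = factory()
            return cls._credentials
```

Compared with a one-shot `_credentials_init` flag, re-checking `expired()` on every call means a stale token gets replaced instead of returned.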
[jira] [Commented] (BEAM-5874) Python BQ directRunnerIT test fails with half-empty assertion
[ https://issues.apache.org/jira/browse/BEAM-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885650#comment-16885650 ] Valentyn Tymofieiev commented on BEAM-5874: --- Assigning to Pablo who is investigating a similar issue (https://issues.apache.org/jira/browse/BEAM-7463). > Python BQ directRunnerIT test fails with half-empty assertion > - > > Key: BEAM-5874 > URL: https://issues.apache.org/jira/browse/BEAM-5874 > Project: Beam > Issue Type: Bug > Components: test-failures >Reporter: Henning Rohde >Assignee: Pablo Estrada >Priority: Major > Labels: currently-failing > > https://scans.gradle.com/s/2fd5wbm7vkkna/console-log?task=:beam-sdks-python:directRunnerIT#L147 > == > FAIL: test_big_query_new_types > (apache_beam.io.gcp.big_query_query_to_table_it_test.BigQueryQueryToTableIT) > -- > Traceback (most recent call last): > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_it_test.py", > line 169, in test_big_query_new_types > big_query_query_to_table_pipeline.run_bq_pipeline(options) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_pipeline.py", > line 67, in run_bq_pipeline > result = p.run() > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/testing/test_pipeline.py", > line 107, in run > else test_runner_api)) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/pipeline.py", > line 416, in run > return self.runner.run_pipeline(self) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/runners/direct/test_direct_runner.py", > line 51, in run_pipeline > hc_assert_that(self.result, pickler.loads(on_success_matcher)) > AssertionError: > Expected: (Test pipeline expected terminated in state: 
DONE and Expected > checksum is e1fbcb5ca479a5ca5f9ecf444d6998beee4d44c6) > but: > -- > XML: > /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/nosetests.xml > -- > Ran 7 tests in 24.434s > FAILED (failures=1) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (BEAM-5874) Python BQ directRunnerIT test fails with half-empty assertion
[ https://issues.apache.org/jira/browse/BEAM-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Valentyn Tymofieiev reassigned BEAM-5874: - Assignee: Pablo Estrada > Python BQ directRunnerIT test fails with half-empty assertion > - > > Key: BEAM-5874 > URL: https://issues.apache.org/jira/browse/BEAM-5874 > Project: Beam > Issue Type: Bug > Components: test-failures >Reporter: Henning Rohde >Assignee: Pablo Estrada >Priority: Major > Labels: currently-failing > > https://scans.gradle.com/s/2fd5wbm7vkkna/console-log?task=:beam-sdks-python:directRunnerIT#L147 > == > FAIL: test_big_query_new_types > (apache_beam.io.gcp.big_query_query_to_table_it_test.BigQueryQueryToTableIT) > -- > Traceback (most recent call last): > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_it_test.py", > line 169, in test_big_query_new_types > big_query_query_to_table_pipeline.run_bq_pipeline(options) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_pipeline.py", > line 67, in run_bq_pipeline > result = p.run() > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/testing/test_pipeline.py", > line 107, in run > else test_runner_api)) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/pipeline.py", > line 416, in run > return self.runner.run_pipeline(self) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/runners/direct/test_direct_runner.py", > line 51, in run_pipeline > hc_assert_that(self.result, pickler.loads(on_success_matcher)) > AssertionError: > Expected: (Test pipeline expected terminated in state: DONE and Expected > checksum is e1fbcb5ca479a5ca5f9ecf444d6998beee4d44c6) > but: > -- > XML: > 
/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/nosetests.xml > -- > Ran 7 tests in 24.434s > FAILED (failures=1) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (BEAM-7656) Add sdk-worker-parallelism arg to flink job server shadow jar
[ https://issues.apache.org/jira/browse/BEAM-7656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Weaver resolved BEAM-7656. --- Resolution: Fixed Fix Version/s: 2.15.0 > Add sdk-worker-parallelism arg to flink job server shadow jar > - > > Key: BEAM-7656 > URL: https://issues.apache.org/jira/browse/BEAM-7656 > Project: Beam > Issue Type: Improvement > Components: runner-flink >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Minor > Fix For: 2.15.0 > > Time Spent: 40m > Remaining Estimate: 0h > > It's unfortunate we have to manually add these args, but /shrug -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (BEAM-7058) Python SDK: Collect metrics in non-cython environment
[ https://issues.apache.org/jira/browse/BEAM-7058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Weaver updated BEAM-7058: -- Description: With the portable Flink runner, the metric is reported as 0, while the count metric works as expected. [https://lists.apache.org/thread.html/25eec8104bda6e4c71cc6c5e9864c335833c3ae2afe225d372479f30@%3Cdev.beam.apache.org%3E] Edit: metrics are collected properly when using cython, but not without cython. This is because state sampling has yet to be implemented in a way that does not depend on cython [1]. [1] [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/statesampler_slow.py#L62] was: With the portable Flink runner, the metric is reported as 0, while the count metric works as expected. [https://lists.apache.org/thread.html/25eec8104bda6e4c71cc6c5e9864c335833c3ae2afe225d372479f30@%3Cdev.beam.apache.org%3E] Edit: metrics are collected properly when using cython, but not without cython. > Python SDK: Collect metrics in non-cython environment > - > > Key: BEAM-7058 > URL: https://issues.apache.org/jira/browse/BEAM-7058 > Project: Beam > Issue Type: New Feature > Components: runner-flink, sdk-py-harness >Reporter: Thomas Weise >Assignee: Alex Amato >Priority: Major > Labels: metrics, portability-flink, portable-metrics-bugs > Attachments: test-metrics.txt > > > With the portable Flink runner, the metric is reported as 0, while the count > metric works as expected. > [https://lists.apache.org/thread.html/25eec8104bda6e4c71cc6c5e9864c335833c3ae2afe225d372479f30@%3Cdev.beam.apache.org%3E] > Edit: metrics are collected properly when using cython, but not without > cython. This is because state sampling has yet to be implemented in a way > that does not depend on cython [1]. > [1] > [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/statesampler_slow.py#L62] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
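The linked `statesampler_slow.py` shows that state sampling is currently a no-op without Cython. As a rough sketch of what a pure-Python implementation could look like (the class and method names are assumptions, not Beam's actual API), a daemon thread can periodically charge elapsed wall time to whichever state is current:

```python
import threading
import time
from collections import defaultdict


class SimpleStateSampler:
    """Minimal wall-clock state sampler with no Cython dependency.

    A background thread wakes up every `period` seconds and charges the
    elapsed time (in milliseconds) to the current state's counter.
    """

    def __init__(self, period=0.01):
        self._period = period
        self._current = 'idle'
        self._msecs = defaultdict(float)
        self._running = False
        self._thread = None

    def enter(self, state):
        # Record which state subsequent samples should be charged to.
        self._current = state

    def start(self):
        self._running = True
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self):
        self._running = False
        self._thread.join()

    def _run(self):
        last = time.monotonic()
        while self._running:
            time.sleep(self._period)
            now = time.monotonic()
            self._msecs[self._current] += (now - last) * 1000.0
            last = now
```

Sampling trades accuracy for low overhead: the per-state totals converge on real execution time without instrumenting every state transition, which is why the Cython version uses the same approach.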
[jira] [Updated] (BEAM-7058) Python SDK: Collect metrics in non-cython environment
[ https://issues.apache.org/jira/browse/BEAM-7058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Weaver updated BEAM-7058: -- Description: With the portable Flink runner, the metric is reported as 0, while the count metric works as expected. [https://lists.apache.org/thread.html/25eec8104bda6e4c71cc6c5e9864c335833c3ae2afe225d372479f30@%3Cdev.beam.apache.org%3E] Edit: metrics are collected properly when using cython, but not without cython. was: With the portable Flink runner, the metric is reported as 0, while the count metric works as expected. [https://lists.apache.org/thread.html/25eec8104bda6e4c71cc6c5e9864c335833c3ae2afe225d372479f30@%3Cdev.beam.apache.org%3E] > Python SDK: Collect metrics in non-cython environment > - > > Key: BEAM-7058 > URL: https://issues.apache.org/jira/browse/BEAM-7058 > Project: Beam > Issue Type: New Feature > Components: runner-flink, sdk-py-harness >Reporter: Thomas Weise >Assignee: Alex Amato >Priority: Major > Labels: metrics, portability-flink, portable-metrics-bugs > Attachments: test-metrics.txt > > > With the portable Flink runner, the metric is reported as 0, while the count > metric works as expected. > [https://lists.apache.org/thread.html/25eec8104bda6e4c71cc6c5e9864c335833c3ae2afe225d372479f30@%3Cdev.beam.apache.org%3E] > Edit: metrics are collected properly when using cython, but not without > cython. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7058) Python SDK: Collect metrics in non-cython environment
[ https://issues.apache.org/jira/browse/BEAM-7058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885630#comment-16885630 ] Kyle Weaver commented on BEAM-7058: --- We will leave this open. I have updated the issue type and description accordingly. > Python SDK: Collect metrics in non-cython environment > - > > Key: BEAM-7058 > URL: https://issues.apache.org/jira/browse/BEAM-7058 > Project: Beam > Issue Type: New Feature > Components: runner-flink, sdk-py-harness >Reporter: Thomas Weise >Assignee: Alex Amato >Priority: Major > Labels: metrics, portability-flink, portable-metrics-bugs > Attachments: test-metrics.txt > > > With the portable Flink runner, the metric is reported as 0, while the count > metric works as expected. > [https://lists.apache.org/thread.html/25eec8104bda6e4c71cc6c5e9864c335833c3ae2afe225d372479f30@%3Cdev.beam.apache.org%3E] > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (BEAM-7058) Python SDK: Collect metrics in non-cython environment
[ https://issues.apache.org/jira/browse/BEAM-7058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Weaver updated BEAM-7058: -- Summary: Python SDK: Collect metrics in non-cython environment (was: Python SDK: Implement state sampling for non-cython environment) > Python SDK: Collect metrics in non-cython environment > - > > Key: BEAM-7058 > URL: https://issues.apache.org/jira/browse/BEAM-7058 > Project: Beam > Issue Type: New Feature > Components: runner-flink, sdk-py-harness >Reporter: Thomas Weise >Assignee: Alex Amato >Priority: Major > Labels: metrics, portability-flink, portable-metrics-bugs > Attachments: test-metrics.txt > > > With the portable Flink runner, the metric is reported as 0, while the count > metric works as expected. > [https://lists.apache.org/thread.html/25eec8104bda6e4c71cc6c5e9864c335833c3ae2afe225d372479f30@%3Cdev.beam.apache.org%3E] > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (BEAM-7058) Python SDK: Implement state sampling for non-cython environment
[ https://issues.apache.org/jira/browse/BEAM-7058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Weaver updated BEAM-7058: -- Summary: Python SDK: Implement state sampling for non-cython environment (was: Python SDK metric process_bundle_msecs reported as zero) > Python SDK: Implement state sampling for non-cython environment > --- > > Key: BEAM-7058 > URL: https://issues.apache.org/jira/browse/BEAM-7058 > Project: Beam > Issue Type: New Feature > Components: runner-flink, sdk-py-harness >Reporter: Thomas Weise >Assignee: Alex Amato >Priority: Major > Labels: metrics, portability-flink, portable-metrics-bugs > Attachments: test-metrics.txt > > > With the portable Flink runner, the metric is reported as 0, while the count > metric works as expected. > [https://lists.apache.org/thread.html/25eec8104bda6e4c71cc6c5e9864c335833c3ae2afe225d372479f30@%3Cdev.beam.apache.org%3E] > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (BEAM-7058) Python SDK metric process_bundle_msecs reported as zero
[ https://issues.apache.org/jira/browse/BEAM-7058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Weaver updated BEAM-7058: -- Issue Type: New Feature (was: Bug) > Python SDK metric process_bundle_msecs reported as zero > --- > > Key: BEAM-7058 > URL: https://issues.apache.org/jira/browse/BEAM-7058 > Project: Beam > Issue Type: New Feature > Components: runner-flink, sdk-py-harness >Reporter: Thomas Weise >Assignee: Alex Amato >Priority: Major > Labels: metrics, portability-flink, portable-metrics-bugs > Attachments: test-metrics.txt > > > With the portable Flink runner, the metric is reported as 0, while the count > metric works as expected. > [https://lists.apache.org/thread.html/25eec8104bda6e4c71cc6c5e9864c335833c3ae2afe225d372479f30@%3Cdev.beam.apache.org%3E] > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276996&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276996 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 21:15 Start Date: 15/Jul/19 21:15 Worklog Time Spent: 10m Work Description: amaliujia commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303639762 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamRuleSets.java ## @@ -103,6 +106,10 @@ // join rules JoinPushExpressionsRule.INSTANCE, + JoinCommuteRule.INSTANCE, Review comment: Give me some time to check the details of the PR and to work out what a useful test could be. Generally, if we are making changes that we cannot test, we should postpone such changes until we figure out how to test them. I think planner rules can be tested, because I see that Calcite and Flink have tests around optimization rules. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276996) Time Spent: 7.5h (was: 7h 20m) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 7.5h > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7749) Establish a strategy to prevent or mitigate quota-related flakes in Beam integration tests.
[ https://issues.apache.org/jira/browse/BEAM-7749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885617#comment-16885617 ] Valentyn Tymofieiev commented on BEAM-7749: --- In the mean time as a mitigation, maintainers of apache-beam-testing project could request quota increases in http://console.cloud.google.com/iam-admin/quotas?project=apache-beam-testing. > Establish a strategy to prevent or mitigate quota-related flakes in Beam > integration tests. > --- > > Key: BEAM-7749 > URL: https://issues.apache.org/jira/browse/BEAM-7749 > Project: Beam > Issue Type: Bug > Components: testing >Reporter: Valentyn Tymofieiev >Assignee: Alan Myrvold >Priority: Major > > Every now and then our postcommit tests fail due to lack of quota in > apache-beam-testing, see [1]. > Filing this as an umbrella issue to establish a strategy for preventing > quota-related failures in Apache Beam tests. > Potential ideas: > - improve resilience of test infrastructure to retry tests that fail due to > lack of quota, or delay the test execution when quota limits fall below a > threshold > - monitor & procure the quota usage in apache-beam-testing > - tune the parameters affecting the number of concurrent test jobs according > to available quota, and document best practices. > - stop previous jenkins job that were triggered on a PR when a new job is > started. > [1] > https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20component%20%3D%20test-failures%20AND%20text%20~%20quota -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (BEAM-2264) Re-use credential instead of generating a new one on each GCS call
[ https://issues.apache.org/jira/browse/BEAM-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885611#comment-16885611 ] Luke Cwik edited comment on BEAM-2264 at 7/15/19 8:58 PM: -- It looks like Python has the same issue where google.auth replaces oauth2client since oauth2client was [deprecated|https://google-auth.readthedocs.io/en/latest/oauth2client-deprecation.html] for very similar reasons to the Java one. But that is separate from this task it seems with BEAM-7352 but it may solve some of the threading issues that are being referred to. WDYT? was (Author: lcwik): It looks like Python has the same issue where google.auth replaces oauth2client since oauth2client was [deprecated|https://google-auth.readthedocs.io/en/latest/oauth2client-deprecation.html] for very similar reasons to the Java one. But that is separate from this task it seems with BEAM-7352 > Re-use credential instead of generating a new one one each GCS call > --- > > Key: BEAM-2264 > URL: https://issues.apache.org/jira/browse/BEAM-2264 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Luke Cwik >Priority: Minor > Time Spent: 1h 10m > Remaining Estimate: 0h > > We should cache the credential used within a Pipeline and re-use it instead > of generating a new one on each GCS call. When executing (against 2.0.0 RC2): > {code} > python -m apache_beam.examples.wordcount --input > "gs://dataflow-samples/shakespeare/*" --output local_counts > {code} > Note that we seemingly generate a new access token each time instead of when > a refresh is required. > {code} > super(GcsIO, cls).__new__(cls, storage_client)) > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 1 files. 
Estimation > took 0.286200046539 seconds > INFO:root:Running pipeline with DirectRunner. > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 43 files. Estimation > took 0.205624818802 seconds > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > ... many more times ... > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (BEAM-7749) Establish a strategy to prevent or mitigate quota-related flakes in Beam integration tests.
Valentyn Tymofieiev created BEAM-7749: - Summary: Establish a strategy to prevent or mitigate quota-related flakes in Beam integration tests. Key: BEAM-7749 URL: https://issues.apache.org/jira/browse/BEAM-7749 Project: Beam Issue Type: Bug Components: testing Reporter: Valentyn Tymofieiev Assignee: Alan Myrvold Every now and then our postcommit tests fail due to lack of quota in apache-beam-testing, see [1]. Filing this as an umbrella issue to establish a strategy for preventing quota-related failures in Apache Beam tests. Potential ideas: - improve resilience of test infrastructure to retry tests that fail due to lack of quota, or delay the test execution when quota limits fall below a threshold - monitor & procure the quota usage in apache-beam-testing - tune the parameters affecting the number of concurrent test jobs according to available quota, and document best practices. - stop previous jenkins job that were triggered on a PR when a new job is started. [1] https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20component%20%3D%20test-failures%20AND%20text%20~%20quota -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-2264) Re-use credential instead of generating a new one on each GCS call
[ https://issues.apache.org/jira/browse/BEAM-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885611#comment-16885611 ] Luke Cwik commented on BEAM-2264: - It looks like Python has the same issue where google.auth replaces oauth2client since oauth2client was [deprecated|https://google-auth.readthedocs.io/en/latest/oauth2client-deprecation.html] for very similar reasons to the Java one. But that is separate from this task it seems with BEAM-7352 > Re-use credential instead of generating a new one one each GCS call > --- > > Key: BEAM-2264 > URL: https://issues.apache.org/jira/browse/BEAM-2264 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Luke Cwik >Priority: Minor > Time Spent: 1h 10m > Remaining Estimate: 0h > > We should cache the credential used within a Pipeline and re-use it instead > of generating a new one on each GCS call. When executing (against 2.0.0 RC2): > {code} > python -m apache_beam.examples.wordcount --input > "gs://dataflow-samples/shakespeare/*" --output local_counts > {code} > Note that we seemingly generate a new access token each time instead of when > a refresh is required. > {code} > super(GcsIO, cls).__new__(cls, storage_client)) > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 1 files. Estimation > took 0.286200046539 seconds > INFO:root:Running pipeline with DirectRunner. > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 43 files. 
Estimation > took 0.205624818802 seconds > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > ... many more times ... > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7058) Python SDK metric process_bundle_msecs reported as zero
[ https://issues.apache.org/jira/browse/BEAM-7058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885610#comment-16885610 ] Rakesh Kumar commented on BEAM-7058: After installing cython in the application environment, I can see that {{\[organization_specific_prefix\].operator.beam-metric-pardo_execution_time-process_bundle_msecs-v1.gauge.mean}} is being populated correctly. Thanks [~angoenka] for looking into this. Should we keep this jira ticket open to provide the implementation for non-cython version? > Python SDK metric process_bundle_msecs reported as zero > --- > > Key: BEAM-7058 > URL: https://issues.apache.org/jira/browse/BEAM-7058 > Project: Beam > Issue Type: Bug > Components: runner-flink, sdk-py-harness >Reporter: Thomas Weise >Assignee: Alex Amato >Priority: Major > Labels: metrics, portability-flink, portable-metrics-bugs > Attachments: test-metrics.txt > > > With the portable Flink runner, the metric is reported as 0, while the count > metric works as expected. > [https://lists.apache.org/thread.html/25eec8104bda6e4c71cc6c5e9864c335833c3ae2afe225d372479f30@%3Cdev.beam.apache.org%3E] > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7666) Pipe memory thrashing signal to Dataflow
[ https://issues.apache.org/jira/browse/BEAM-7666?focusedWorklogId=276976&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276976 ] ASF GitHub Bot logged work on BEAM-7666: Author: ASF GitHub Bot Created on: 15/Jul/19 20:45 Start Date: 15/Jul/19 20:45 Worklog Time Spent: 10m Work Description: angoenka commented on pull request #8971: [BEAM-7666] Throttle piping URL: https://github.com/apache/beam/pull/8971 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276976) Time Spent: 20m (was: 10m) > Pipe memory thrashing signal to Dataflow > > > Key: BEAM-7666 > URL: https://issues.apache.org/jira/browse/BEAM-7666 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Dustin Rhodes >Assignee: Dustin Rhodes >Priority: Critical > Time Spent: 20m > Remaining Estimate: 0h > > For autoscaling we would like to know if the user worker is spending too much > time garbage collecting. Pipe this signal through counters to DF. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7527) Python 3 test parallelization causes test flakines due to ModuleNotFoundError.
[ https://issues.apache.org/jira/browse/BEAM-7527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885597#comment-16885597 ] Valentyn Tymofieiev commented on BEAM-7527: --- I'll take a look at this while Mark is OOO. > Python 3 test parallelization causes test flakines due to > ModuleNotFoundError. > --- > > Key: BEAM-7527 > URL: https://issues.apache.org/jira/browse/BEAM-7527 > Project: Beam > Issue Type: Sub-task > Components: test-failures >Reporter: Valentyn Tymofieiev >Assignee: Valentyn Tymofieiev >Priority: Blocker > > I am seeing several errors in Python SDK Integration test suites, such as > Dataflow ValidatesRunner and Python PostCommit that fail due to one of the > autogenerated files not being found. > For example: > {noformat} > /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/__init__.py:84: > UserWarning: Running the Apache Beam SDK on Python 3 is not yet fully > supported. You may encounter buggy behavior or missing features. > 'Running the Apache Beam SDK on Python 3 is not yet fully supported. ' > Failure: ModuleNotFoundError (No module named 'beam_runner_api_pb2') ... 
> ERROR > == > ERROR: Failure: ModuleNotFoundError (No module named 'beam_runner_api_pb2') > -- > Traceback (most recent call last): > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/site-packages/nose/failure.py", > line 39, in runTest > raise self.exc_val.with_traceback(self.tb) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/site-packages/nose/loader.py", > line 418, in loadTestsFromName > addr.filename, addr.module) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/site-packages/nose/importer.py", > line 47, in importFromPath > return self.importFromDir(dir_path, fqname) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/site-packages/nose/importer.py", > line 94, in importFromDir > mod = load_module(part_fqname, fh, filename, desc) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/imp.py", > line 245, in load_module > return load_package(name, filename) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/imp.py", > line 217, in load_package > return _load(spec) > File "", line 684, in _load > File "", line 665, in _load_unlocked > File "", line 678, in exec_module > File "", line 219, in _call_with_frames_removed > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/__init__.py", > line 97, in > from apache_beam import coders > File "/home/jenkins/jenkins-slave/workspace/beam_Pos > tCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/coders/__init__.py", line > 19, in > from apache_beam.coders.coders import * > File > 
"/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/coders/coders.py", > line 32, in > from apache_beam.coders import coder_impl > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/coders/coder_impl.py", > line 44, in > from apache_beam.utils import windowed_value > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/utils/windowed_value.py", > line 34, in > from apache_beam.utils.timestamp import MAX_TIMESTAMP > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/utils/timestamp.py", > line 34, in > from apache_beam.portability import common_urns > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/portability/common_urns.py", > line 25, in > from apache_beam.portability.api import metrics_pb2 > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/portability/api/metrics_pb2.py", > line 16, in > import beam_runner_api_pb2 as beam__runner__api__pb2 > ModuleNotFoundError: No module named 'beam_runner_api_pb2' > {noformat} > {noformat} > /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/__
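For context on the failure above: protoc-generated modules import sibling modules by bare name (`import beam_runner_api_pb2`), which worked under Python 2's implicit relative imports but raises `ModuleNotFoundError` on Python 3 unless the generated directory is itself on `sys.path`. A self-contained reproduction of the pattern follows; the file and module names are illustrative, not the real generated files:

```python
import os
import sys
import tempfile

# Build a fake "generated" directory: fake_metrics_pb2 imports a sibling
# by bare name, exactly like the protoc output in the traceback above.
gen_dir = tempfile.mkdtemp()
with open(os.path.join(gen_dir, 'sibling_pb2.py'), 'w') as f:
    f.write('VALUE = 42\n')
with open(os.path.join(gen_dir, 'fake_metrics_pb2.py'), 'w') as f:
    f.write('import sibling_pb2\nVALUE = sibling_pb2.VALUE\n')

# On Python 3, the bare 'import sibling_pb2' only resolves if gen_dir is
# on sys.path; adding it is the workaround. The durable fix is to generate
# explicit relative imports, e.g. 'from . import sibling_pb2'.
sys.path.insert(0, gen_dir)
import fake_metrics_pb2
```

Under test parallelization, whether the generated directory ends up on `sys.path` can depend on which module a worker imports first, which would explain why the failure is flaky rather than deterministic.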
[jira] [Assigned] (BEAM-7527) Python 3 test parallelization causes test flakiness due to ModuleNotFoundError.
[ https://issues.apache.org/jira/browse/BEAM-7527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Valentyn Tymofieiev reassigned BEAM-7527: - Assignee: Valentyn Tymofieiev (was: Mark Liu) > Python 3 test parallelization causes test flakiness due to > ModuleNotFoundError. > --- > > Key: BEAM-7527 > URL: https://issues.apache.org/jira/browse/BEAM-7527 > Project: Beam > Issue Type: Sub-task > Components: test-failures >Reporter: Valentyn Tymofieiev >Assignee: Valentyn Tymofieiev >Priority: Blocker > > I am seeing several errors in Python SDK Integration test suites, such as > Dataflow ValidatesRunner and Python PostCommit, that fail due to one of the > autogenerated files not being found. > For example: > {noformat} > /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/__init__.py:84: > UserWarning: Running the Apache Beam SDK on Python 3 is not yet fully > supported. You may encounter buggy behavior or missing features. > 'Running the Apache Beam SDK on Python 3 is not yet fully supported. ' > Failure: ModuleNotFoundError (No module named 'beam_runner_api_pb2') ... 
> ERROR > == > ERROR: Failure: ModuleNotFoundError (No module named 'beam_runner_api_pb2') > -- > Traceback (most recent call last): > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/site-packages/nose/failure.py", > line 39, in runTest > raise self.exc_val.with_traceback(self.tb) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/site-packages/nose/loader.py", > line 418, in loadTestsFromName > addr.filename, addr.module) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/site-packages/nose/importer.py", > line 47, in importFromPath > return self.importFromDir(dir_path, fqname) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/site-packages/nose/importer.py", > line 94, in importFromDir > mod = load_module(part_fqname, fh, filename, desc) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/imp.py", > line 245, in load_module > return load_package(name, filename) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/build/gradleenv/-1734967053/lib/python3.6/imp.py", > line 217, in load_package > return _load(spec) > File "", line 684, in _load > File "", line 665, in _load_unlocked > File "", line 678, in exec_module > File "", line 219, in _call_with_frames_removed > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/__init__.py", > line 97, in > from apache_beam import coders > File "/home/jenkins/jenkins-slave/workspace/beam_Pos > tCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/coders/__init__.py", line > 19, in > from apache_beam.coders.coders import * > File > 
"/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/coders/coders.py", > line 32, in > from apache_beam.coders import coder_impl > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/coders/coder_impl.py", > line 44, in > from apache_beam.utils import windowed_value > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/utils/windowed_value.py", > line 34, in > from apache_beam.utils.timestamp import MAX_TIMESTAMP > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/utils/timestamp.py", > line 34, in > from apache_beam.portability import common_urns > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/portability/common_urns.py", > line 25, in > from apache_beam.portability.api import metrics_pb2 > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/portability/api/metrics_pb2.py", > line 16, in > import beam_runner_api_pb2 as beam__runner__api__pb2 > ModuleNotFoundError: No module named 'beam_runner_api_pb2' > {noformat} > {noformat} > /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/__init__.py:84: > UserWarning: Running the Ap
[jira] [Work logged] (BEAM-7656) Add sdk-worker-parallelism arg to flink job server shadow jar
[ https://issues.apache.org/jira/browse/BEAM-7656?focusedWorklogId=276974&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276974 ] ASF GitHub Bot logged work on BEAM-7656: Author: ASF GitHub Bot Created on: 15/Jul/19 20:41 Start Date: 15/Jul/19 20:41 Worklog Time Spent: 10m Work Description: angoenka commented on issue #8967: [BEAM-7656] Add sdk-worker-parallelism arg to flink job server shadow… URL: https://github.com/apache/beam/pull/8967#issuecomment-511561625 LGTM Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276974) Time Spent: 0.5h (was: 20m) > Add sdk-worker-parallelism arg to flink job server shadow jar > - > > Key: BEAM-7656 > URL: https://issues.apache.org/jira/browse/BEAM-7656 > Project: Beam > Issue Type: Improvement > Components: runner-flink >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Minor > Time Spent: 0.5h > Remaining Estimate: 0h > > It's unfortunate we have to manually add these args, but /shrug -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7656) Add sdk-worker-parallelism arg to flink job server shadow jar
[ https://issues.apache.org/jira/browse/BEAM-7656?focusedWorklogId=276975&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276975 ] ASF GitHub Bot logged work on BEAM-7656: Author: ASF GitHub Bot Created on: 15/Jul/19 20:41 Start Date: 15/Jul/19 20:41 Worklog Time Spent: 10m Work Description: angoenka commented on pull request #8967: [BEAM-7656] Add sdk-worker-parallelism arg to flink job server shadow… URL: https://github.com/apache/beam/pull/8967 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276975) Time Spent: 40m (was: 0.5h) > Add sdk-worker-parallelism arg to flink job server shadow jar > - > > Key: BEAM-7656 > URL: https://issues.apache.org/jira/browse/BEAM-7656 > Project: Beam > Issue Type: Improvement > Components: runner-flink >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Minor > Time Spent: 40m > Remaining Estimate: 0h > > It's unfortunate we have to manually add these args, but /shrug -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7733) Flaky tests on Dataflow because job state information is not found
[ https://issues.apache.org/jira/browse/BEAM-7733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885596#comment-16885596 ] Valentyn Tymofieiev commented on BEAM-7733: --- This is a Dataflow issue, and it is being addressed outside of Apache Beam. BEAM-6202 may help if we decide to retry 404s. > Flaky tests on Dataflow because job state information is not found > > > Key: BEAM-7733 > URL: https://issues.apache.org/jira/browse/BEAM-7733 > Project: Beam > Issue Type: Bug > Components: sdk-py-core, testing >Reporter: Kasia Kucharczyk >Priority: Major > Fix For: Not applicable > > > Some of the Python performance cron tests on Dataflow experience problems > because of: > {code:java} > AssertionError: Job did not reach to a terminal state after waiting > indefinitely. > {code} > > Dataflow returns the following errors: > {code:java} > "error": { > "code": 404, > "message": "(bf55d279f24d1255): Information about job > 2019-07-07_08_42_42-8241186059035490445 could not be found in our system. > Please double check the id is correct. If it is please contact customer > support.", > "status": "NOT_FOUND" >} > } > {code} > Those tests succeed on Dataflow, but Jenkins reports failure. > It already happened in Combine and CoGroupByKey tests: > * [example of Combine > test|https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_LoadTests_Python_Combine_Dataflow_Batch/6/console] > * [example of > CoGBK|https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_LoadTests_Python_CoGBK_Dataflow_Batch/3/console] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
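The BEAM-6202 idea mentioned above (retrying 404s) could look roughly like the sketch below. `get_job_state` is a hypothetical stand-in for the Dataflow job-state poll, not Beam's actual client code; the premise is that a NOT_FOUND right after submission can be a propagation delay rather than a real failure.

```python
import time

def poll_with_retry(get_job_state, max_retries=5, base_delay=0.01):
    """Retry transient NOT_FOUND (404) responses with exponential backoff."""
    for attempt in range(max_retries):
        status, body = get_job_state()
        if status != 404:
            return body
        # A 404 right after job submission may just mean the job record
        # has not propagated yet, so back off and try again.
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("job state still NOT_FOUND after %d retries" % max_retries)

# Simulated service: the first two polls 404, then the terminal state appears.
responses = iter([(404, None), (404, None), (200, "JOB_STATE_DONE")])
print(poll_with_retry(lambda: next(responses)))  # JOB_STATE_DONE
```

Whether a 404 should ever be treated as transient is exactly the open question in the comment above; a cap on retries keeps a genuinely missing job from hanging the test indefinitely.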
[jira] [Work logged] (BEAM-2264) Re-use credential instead of generating a new one on each GCS call
[ https://issues.apache.org/jira/browse/BEAM-2264?focusedWorklogId=276972&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276972 ] ASF GitHub Bot logged work on BEAM-2264: Author: ASF GitHub Bot Created on: 15/Jul/19 20:40 Start Date: 15/Jul/19 20:40 Worklog Time Spent: 10m Work Description: udim commented on issue #9060: [BEAM-2264] Reuse GCP credentials in GCS calls. URL: https://github.com/apache/beam/pull/9060#issuecomment-511560682 R: @lukecwik (since you have some state already, lmk if you don't have bandwidth) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276972) Time Spent: 1h 10m (was: 1h) > Re-use credential instead of generating a new one on each GCS call > --- > > Key: BEAM-2264 > URL: https://issues.apache.org/jira/browse/BEAM-2264 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Luke Cwik >Priority: Minor > Time Spent: 1h 10m > Remaining Estimate: 0h > > We should cache the credential used within a Pipeline and re-use it instead > of generating a new one on each GCS call. When executing (against 2.0.0 RC2): > {code} > python -m apache_beam.examples.wordcount --input > "gs://dataflow-samples/shakespeare/*" --output local_counts > {code} > Note that we seemingly generate a new access token each time instead of when > a refresh is required. > {code} > super(GcsIO, cls).__new__(cls, storage_client)) > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 1 files. Estimation > took 0.286200046539 seconds > INFO:root:Running pipeline with DirectRunner. 
> INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 43 files. Estimation > took 0.205624818802 seconds > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > ... many more times ... > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7748) beam_PostCommit_Py_VR_Dataflow -> :sdks:python:test-suites:dataflow:py35:validatesRunnerBatchTests -> No module named 'endpoints_pb2'
[ https://issues.apache.org/jira/browse/BEAM-7748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885590#comment-16885590 ] Valentyn Tymofieiev commented on BEAM-7748: --- This is a duplicate of https://issues.apache.org/jira/browse/BEAM-7527 > beam_PostCommit_Py_VR_Dataflow -> > :sdks:python:test-suites:dataflow:py35:validatesRunnerBatchTests -> No module > named 'endpoints_pb2' > - > > Key: BEAM-7748 > URL: https://issues.apache.org/jira/browse/BEAM-7748 > Project: Beam > Issue Type: Bug > Components: sdk-py-core, test-failures, testing >Reporter: Ankur Goenka >Priority: Major > > beam_PostCommit_Py_VR_Dataflow is failing on Py3-related test suites because > of a missing module > > [https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_PostCommit_Py_VR_Dataflow/4025/] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Closed] (BEAM-7748) beam_PostCommit_Py_VR_Dataflow -> :sdks:python:test-suites:dataflow:py35:validatesRunnerBatchTests -> No module named 'endpoints_pb2'
[ https://issues.apache.org/jira/browse/BEAM-7748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Valentyn Tymofieiev closed BEAM-7748. - Resolution: Duplicate Fix Version/s: Not applicable > beam_PostCommit_Py_VR_Dataflow -> > :sdks:python:test-suites:dataflow:py35:validatesRunnerBatchTests -> No module > named 'endpoints_pb2' > - > > Key: BEAM-7748 > URL: https://issues.apache.org/jira/browse/BEAM-7748 > Project: Beam > Issue Type: Bug > Components: sdk-py-core, test-failures, testing >Reporter: Ankur Goenka >Priority: Major > Fix For: Not applicable > > > beam_PostCommit_Py_VR_Dataflow is failing on Py3-related test suites because > of a missing module > > [https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_PostCommit_Py_VR_Dataflow/4025/] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (BEAM-7748) beam_PostCommit_Py_VR_Dataflow -> :sdks:python:test-suites:dataflow:py35:validatesRunnerBatchTests -> No module named 'endpoints_pb2'
Ankur Goenka created BEAM-7748: -- Summary: beam_PostCommit_Py_VR_Dataflow -> :sdks:python:test-suites:dataflow:py35:validatesRunnerBatchTests -> No module named 'endpoints_pb2' Key: BEAM-7748 URL: https://issues.apache.org/jira/browse/BEAM-7748 Project: Beam Issue Type: Bug Components: sdk-py-core, test-failures, testing Reporter: Ankur Goenka beam_PostCommit_Py_VR_Dataflow is failing on Py3-related test suites because of a missing module [https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_PostCommit_Py_VR_Dataflow/4025/] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
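The `ModuleNotFoundError` in these reports surfaces when a generated proto module performs a bare `import beam_runner_api_pb2` of a sibling file, an implicit relative import that Python 3 no longer performs. Whether that is the exact trigger here is not established in the ticket, but the failure mode itself is easy to reproduce with a toy package; every name below is made up for illustration.

```python
import os
import shutil
import subprocess
import sys
import tempfile

# Build a tiny package where one module imports a sibling the way a
# generated *_pb2.py can: a bare "import sibling" (implicit relative
# import, removed in Python 3) versus "from . import sibling".
pkg = tempfile.mkdtemp()
api = os.path.join(pkg, "api")
os.makedirs(api)
open(os.path.join(api, "__init__.py"), "w").close()
with open(os.path.join(api, "sibling.py"), "w") as f:
    f.write("VALUE = 42\n")
with open(os.path.join(api, "bad.py"), "w") as f:
    f.write("import sibling\n")          # fails on Python 3
with open(os.path.join(api, "good.py"), "w") as f:
    f.write("from . import sibling\n")   # explicit relative import works

def try_import(module):
    """Import api.<module> in a subprocess; return its exit code."""
    return subprocess.run(
        [sys.executable, "-c", "import api.%s" % module],
        cwd=pkg, capture_output=True).returncode

bad_rc = try_import("bad")
good_rc = try_import("good")
print(bad_rc != 0, good_rc == 0)  # True True
shutil.rmtree(pkg)
```

Under Python 2 both variants import successfully, which is consistent with these suites failing only on Py3.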
[jira] [Commented] (BEAM-2264) Re-use credential instead of generating a new one on each GCS call
[ https://issues.apache.org/jira/browse/BEAM-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885587#comment-16885587 ] Udi Meiri commented on BEAM-2264: - I'm not sure. I added a test logging.error('wrapper!') entry to _do_refresh_request() and it was only logged once, and only after the second "Attempting refresh..." entry. That may point to another user of oauth2client, or concurrent refreshes (i.e. the "wrapper!" message is from the first thread doing "Attempting refresh..."). The jsonPayload.thread field in Stackdriver logs is identical for all 3 entries (136:140478770464512), but I have no idea if that corresponds to a Python thread. {code} 2019-07-15T17:32:45.078399658Z Loading main session from the staging area... I 2019-07-15T17:32:45.100733518Z Attempting refresh to obtain initial access_token I 2019-07-15T17:32:45.102308750Z Status HTTP server running at localhost:40229 I 2019-07-15T17:32:46.167937994Z Executing workitem I 2019-07-15T17:32:46.168387413Z Memory usage of worker beamapp-ehudm-0715172721--07151027-edcl-harness-xxm0 is 114 MB I 2019-07-15T17:32:46.195270538Z Attempting refresh to obtain initial access_token I 2019-07-15T17:32:46.195525169Z wrapper! E {code} > Re-use credential instead of generating a new one on each GCS call > --- > > Key: BEAM-2264 > URL: https://issues.apache.org/jira/browse/BEAM-2264 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Luke Cwik >Priority: Minor > Time Spent: 1h > Remaining Estimate: 0h > > We should cache the credential used within a Pipeline and re-use it instead > of generating a new one on each GCS call. When executing (against 2.0.0 RC2): > {code} > python -m apache_beam.examples.wordcount --input > "gs://dataflow-samples/shakespeare/*" --output local_counts > {code} > Note that we seemingly generate a new access token each time instead of when > a refresh is required. 
> {code} > super(GcsIO, cls).__new__(cls, storage_client)) > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 1 files. Estimation > took 0.286200046539 seconds > INFO:root:Running pipeline with DirectRunner. > INFO:root:Starting the size estimation of the input > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:root:Finished the size estimation of the input at 43 files. Estimation > took 0.205624818802 seconds > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > INFO:oauth2client.client:Refreshing access_token > INFO:oauth2client.transport:Attempting refresh to obtain initial access_token > ... many more times ... > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
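The caching the ticket asks for — one token refresh shared across calls instead of one per GCS operation — can be sketched as below. `refresh_fn` stands in for the oauth2client token refresh; none of these names come from the Beam codebase, and the double-checked lock is one way (of several) to keep concurrent callers from each triggering a refresh, which is relevant given the concurrent-refresh suspicion in the comment above.

```python
import threading

_credential_lock = threading.Lock()
_cached_credential = None

def get_credential(refresh_fn):
    """Return a process-wide cached credential, refreshing at most once."""
    global _cached_credential
    if _cached_credential is None:
        with _credential_lock:
            if _cached_credential is None:  # double-checked locking
                _cached_credential = refresh_fn()
    return _cached_credential

calls = []
def fake_refresh():
    # Stand-in for the expensive OAuth refresh logged on every GCS call.
    calls.append(1)
    return "token-abc"

# Many "GCS calls" share one refresh instead of performing one each.
tokens = [get_credential(fake_refresh) for _ in range(5)]
print(tokens[0], len(calls))  # token-abc 1
```

A production version would also need expiry handling (refresh again once the token is near its deadline), which this sketch deliberately omits.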
[jira] [Commented] (BEAM-7748) beam_PostCommit_Py_VR_Dataflow -> :sdks:python:test-suites:dataflow:py35:validatesRunnerBatchTests -> No module named 'endpoints_pb2'
[ https://issues.apache.org/jira/browse/BEAM-7748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885588#comment-16885588 ] Ankur Goenka commented on BEAM-7748: cc [~tvalentyn] > beam_PostCommit_Py_VR_Dataflow -> > :sdks:python:test-suites:dataflow:py35:validatesRunnerBatchTests -> No module > named 'endpoints_pb2' > - > > Key: BEAM-7748 > URL: https://issues.apache.org/jira/browse/BEAM-7748 > Project: Beam > Issue Type: Bug > Components: sdk-py-core, test-failures, testing >Reporter: Ankur Goenka >Priority: Major > > beam_PostCommit_Py_VR_Dataflow is failing on Py3 related test suites becase > of missing module > > [https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_PostCommit_Py_VR_Dataflow/4025/] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7546) Portable WordCount-on-Flink Precommit is flaky - temporary folder not found.
[ https://issues.apache.org/jira/browse/BEAM-7546?focusedWorklogId=276966&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276966 ] ASF GitHub Bot logged work on BEAM-7546: Author: ASF GitHub Bot Created on: 15/Jul/19 20:23 Start Date: 15/Jul/19 20:23 Worklog Time Spent: 10m Work Description: angoenka commented on pull request #9046: [BEAM-7546] Increasing environment cache to avoid chances of recreati… URL: https://github.com/apache/beam/pull/9046 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276966) Time Spent: 50m (was: 40m) > Portable WordCount-on-Flink Precommit is flaky - temporary folder not found. > > > Key: BEAM-7546 > URL: https://issues.apache.org/jira/browse/BEAM-7546 > Project: Beam > Issue Type: Bug > Components: runner-flink >Reporter: Valentyn Tymofieiev >Assignee: Ankur Goenka >Priority: Major > Time Spent: 50m > Remaining Estimate: 0h > > On a few occasions I see this test fail due to a temp directory being missing. > Sample scan from > https://builds.apache.org/job/beam_PreCommit_Portable_Python_Cron/745/: > https://scans.gradle.com/s/3ra4xw4hqvlyw/console-log?task=:sdks:python:portableWordCountBatch > {noformat} > [grpc-default-executor-0] ERROR sdk_worker._execute - Error processing > instruction 8. 
Original traceback is > Traceback (most recent call last): > File > "/usr/local/lib/python2.7/site-packages/apache_beam/runners/worker/sdk_worker.py", > line 157, in _execute > response = task() > File > "/usr/local/lib/python2.7/site-packages/apache_beam/runners/worker/sdk_worker.py", > line 190, in > self._execute(lambda: worker.do_instruction(work), work) > File > "/usr/local/lib/python2.7/site-packages/apache_beam/runners/worker/sdk_worker.py", > line 342, in do_instruction > request.instruction_id) > File > "/usr/local/lib/python2.7/site-packages/apache_beam/runners/worker/sdk_worker.py", > line 368, in process_bundle > bundle_processor.process_bundle(instruction_id)) > File > "/usr/local/lib/python2.7/site-packages/apache_beam/runners/worker/bundle_processor.py", > line 589, in process_bundle > ].process_encoded(data.data) > File > "/usr/local/lib/python2.7/site-packages/apache_beam/runners/worker/bundle_processor.py", > line 143, in process_encoded > self.output(decoded_value) > File "apache_beam/runners/worker/operations.py", line 255, in > apache_beam.runners.worker.operations.Operation.output > def output(self, windowed_value, output_index=0): > File "apache_beam/runners/worker/operations.py", line 256, in > apache_beam.runners.worker.operations.Operation.output > cython.cast(Receiver, > self.receivers[output_index]).receive(windowed_value) > File "apache_beam/runners/worker/operations.py", line 143, in > apache_beam.runners.worker.operations.SingletonConsumerSet.receive > self.consumer.process(windowed_value) > File "apache_beam/runners/worker/operations.py", line 593, in > apache_beam.runners.worker.operations.DoOperation.process > with self.scoped_process_state: > File "apache_beam/runners/worker/operations.py", line 594, in > apache_beam.runners.worker.operations.DoOperation.process > delayed_application = self.dofn_receiver.receive(o) > File "apache_beam/runners/common.py", line 778, in > apache_beam.runners.common.DoFnRun > ner.receive > 
self.process(windowed_value) > File "apache_beam/runners/common.py", line 784, in > apache_beam.runners.common.DoFnRunner.process > self._reraise_augmented(exn) > File "apache_beam/runners/common.py", line 851, in > apache_beam.runners.common.DoFnRunner._reraise_augmented > raise_with_traceback(new_exn) > File "apache_beam/runners/common.py", line 782, in > apache_beam.runners.common.DoFnRunner.process > return self.do_fn_invoker.invoke_process(windowed_value) > File "apache_beam/runners/common.py", line 594, in > apache_beam.runners.common.PerWindowInvoker.invoke_process > self._invoke_process_per_window( > File "apache_beam/runners/common.py", line 666, in > apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window > windowed_value, self.process_method(*args_for_process)) > File "/usr/local/lib/python2.7/site-packages/apache_beam/io/iobase.py", > line 1041, in process > self.writer = self.sink.open_writer(init_result, str(uuid.uui
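The fix referenced above (BEAM-7546) increases the environment cache size. The intuition — a larger LRU cache makes it less likely that a still-needed SDK environment (and its temp directory) gets evicted and recreated mid-job — can be sketched with a toy cache; this is a hypothetical class, not the Flink runner's actual environment cache.

```python
from collections import OrderedDict

class EnvironmentCache:
    """Tiny LRU cache: a larger capacity lowers the chance that a
    still-needed environment (and its temp directory) is evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()
        self.creations = 0  # counts expensive environment startups

    def get(self, env_id, create_fn):
        if env_id in self._entries:
            self._entries.move_to_end(env_id)  # mark as recently used
            return self._entries[env_id]
        if len(self._entries) >= self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
        self.creations += 1
        self._entries[env_id] = create_fn(env_id)
        return self._entries[env_id]

cache = EnvironmentCache(capacity=2)
for env in ["py-env", "java-env", "py-env", "py-env"]:
    cache.get(env, lambda e: "container:" + e)
print(cache.creations)  # 2: each environment is created once, then reused
```

With capacity 1, the same access pattern would evict and recreate `py-env`, which mirrors the recreate-and-lose-the-temp-folder flake being addressed.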
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276964&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276964 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 20:18 Start Date: 15/Jul/19 20:18 Worklog Time Spent: 10m Work Description: akedin commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303617190 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamRuleSets.java ## @@ -103,6 +106,10 @@ // join rules JoinPushExpressionsRule.INSTANCE, + JoinCommuteRule.INSTANCE, Review comment: In general we should have tests. However, in this case the inputs/outputs are not changing depending on the inputs. We should make sure that we have a couple of tests that we know trigger these rules, but we cannot enforce it; i.e., I don't know how we can ensure that we even hit the rule by looking at the results. If we don't know what a rule breaks, can we come up with a meaningful test for it? I also don't think we should check the plans. It would be too brittle. Rules can be enabled and disabled independently and probably interact in hard-to-predict ways. If VolcanoPlanner (or Calcite machinery in general) is tested, then we should not re-test it and should accept that it applies the rules correctly. We add some rules that are applied the same way as all other rules and inherit from them. In this case the rules behave the same way as built-in Calcite rules, and reject the plans that we already don't support in the join rel. It feels like the only issue could be the applicability of the reordering operation itself. Looking at what join operations we support and how it works (e.g. 
window is inherited for non-merging windows, or the input has to be explicitly re-windowed for merging windows, and that we only allow the default trigger that has itself as a continuation trigger), it looks safe, and off the top of my head I cannot come up with a test case that we accept but that would break with the rule / without the rule. Probably a unit test of the join rejection would make sense, but I don't think it's really valuable, as we literally call the same code that we already run in the existing join rel, which probably already has tests (this should be confirmed) that trigger that logic. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276964) Time Spent: 7h 20m (was: 7h 10m) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 7h 20m > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
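The BEAM-7545 description above — sample the first few lines, then divide total file size by the average sampled line length — can be sketched as follows. The helper name is made up for illustration; this is not the actual TextTable implementation.

```python
import os
import tempfile

def estimate_row_count(path, sample_lines=100):
    """Estimate rows in a text file: total size / average sampled line length."""
    total_bytes = os.path.getsize(path)
    if total_bytes == 0:
        return 0
    sampled = []
    with open(path, "rb") as f:
        for _ in range(sample_lines):
            line = f.readline()
            if not line:
                break
            sampled.append(len(line))  # includes the newline byte
    if not sampled:
        return 0
    avg_line_length = sum(sampled) / len(sampled)
    return int(round(total_bytes / avg_line_length))

# Example: 1000 identical lines, so the estimate is exact.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b,c\n" * 1000)
print(estimate_row_count(tmp.name))  # 1000
os.unlink(tmp.name)
```

For files whose early lines are unrepresentative (e.g. a long header or highly variable record widths), the estimate degrades, which is the usual trade-off of prefix sampling against a full scan.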
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276959&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276959 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 20:16 Start Date: 15/Jul/19 20:16 Worklog Time Spent: 10m Work Description: riazela commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303616264 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamJoinAssociateRule.java ## @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.beam.sdk.extensions.sql.impl.rule; + +import org.apache.beam.sdk.extensions.sql.impl.rel.BeamJoinRel; +import org.apache.calcite.plan.RelOptRuleCall; +import org.apache.calcite.rel.core.Join; +import org.apache.calcite.rel.core.RelFactories; +import org.apache.calcite.rel.rules.JoinAssociateRule; +import org.apache.calcite.tools.RelBuilderFactory; + +/** + * This is very similar to {@link org.apache.calcite.rel.rules.JoinAssociateRule}. It only checks if + * the resulting condition is supported before transforming. 
+ */ +public class BeamJoinAssociateRule extends JoinAssociateRule { + // ~ Static fields/initializers - + + /** The singleton. */ + public static final org.apache.beam.sdk.extensions.sql.impl.rule.BeamJoinAssociateRule INSTANCE = + new org.apache.beam.sdk.extensions.sql.impl.rule.BeamJoinAssociateRule( + RelFactories.LOGICAL_BUILDER); + + // ~ Constructors --- Review comment: Removed it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276959) Time Spent: 7h 10m (was: 7h) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 7h 10m > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276957&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276957 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 20:02 Start Date: 15/Jul/19 20:02 Worklog Time Spent: 10m Work Description: amaliujia commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303607515 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamRuleSets.java ## @@ -103,6 +106,10 @@ // join rules JoinPushExpressionsRule.INSTANCE, + JoinCommuteRule.INSTANCE, Review comment: But isn't it uncommon to check in a change without any test to guard it? Also, a lack of tests is what has let us enable rules that then broke users. I am thinking of a test where you can enable the rules and then check the output plan? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276957) Time Spent: 7h (was: 6h 50m) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 7h > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276953&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276953 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 19:59 Start Date: 15/Jul/19 19:59 Worklog Time Spent: 10m Work Description: amaliujia commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303607515 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamRuleSets.java ## @@ -103,6 +106,10 @@ // join rules JoinPushExpressionsRule.INSTANCE, + JoinCommuteRule.INSTANCE, Review comment: But isn't it less common to have a change without any test? Also, a lack of tests is one of the reasons we have enabled rules and then broken users in the past. I am thinking of a test in which you enable the rules and then check the output plan. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276953) Time Spent: 6h 50m (was: 6h 40m) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 6h 50m > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276946&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276946 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 19:54 Start Date: 15/Jul/19 19:54 Worklog Time Spent: 10m Work Description: amaliujia commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303607515 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamRuleSets.java ## @@ -103,6 +106,10 @@ // join rules JoinPushExpressionsRule.INSTANCE, + JoinCommuteRule.INSTANCE, Review comment: But isn't it less common to have a change without any test? I am thinking of a test in which you enable the rules and then check the output plan. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276946) Time Spent: 6h 40m (was: 6.5h) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 6h 40m > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (BEAM-7747) ERROR: test_sink_transform (apache_beam.io.avroio_test.TestFastAvro) Fails on Windows
Valentyn Tymofieiev created BEAM-7747: - Summary: ERROR: test_sink_transform (apache_beam.io.avroio_test.TestFastAvro) Fails on Windows Key: BEAM-7747 URL: https://issues.apache.org/jira/browse/BEAM-7747 Project: Beam Issue Type: Bug Components: io-python-avro, test-failures Reporter: Valentyn Tymofieiev == ERROR: test_sink_transform (apache_beam.io.avroio_test.TestFastAvro) -- Traceback (most recent call last): File "C:\projects\beam\sdks\python\apache_beam\io\avroio_test.py", line 436, in test_sink_transform | avroio.WriteToAvro(path, self.SCHEMA, use_fastavro=self.use_fastavro) File "C:\projects\beam\sdks\python\apache_beam\pipeline.py", line 426, in __exit__ self.run().wait_until_finish() File "C:\projects\beam\sdks\python\apache_beam\testing\test_pipeline.py", line 107, in run else test_runner_api)) File "C:\projects\beam\sdks\python\apache_beam\pipeline.py", line 406, in run self._options).run(False) File "C:\projects\beam\sdks\python\apache_beam\pipeline.py", line 419, in run return self.runner.run_pipeline(self, self._options) File "C:\projects\beam\sdks\python\apache_beam\runners\direct\direct_runner.py", line 128, in run_pipeline return runner.run_pipeline(pipeline, options) File "C:\projects\beam\sdks\python\apache_beam\runners\portability\fn_api_runner.py", line 319, in run_pipeline default_environment=self._default_environment)) File "C:\projects\beam\sdks\python\apache_beam\runners\portability\fn_api_runner.py", line 326, in run_via_runner_api return self.run_stages(stage_context, stages) File "C:\projects\beam\sdks\python\apache_beam\runners\portability\fn_api_runner.py", line 408, in run_stages stage_context.safe_coders) File "C:\projects\beam\sdks\python\apache_beam\runners\portability\fn_api_runner.py", line 681, in _run_stage result, splits = bundle_manager.process_bundle(data_input, data_output) File "C:\projects\beam\sdks\python\apache_beam\runners\portability\fn_api_runner.py", line 1562, in process_bundle part_inputs): File 
"C:\venv\newenv1\lib\site-packages\concurrent\futures\_base.py", line 641, in result_iterator yield fs.pop().result() File "C:\venv\newenv1\lib\site-packages\concurrent\futures\_base.py", line 462, in result return self.__get_result() File "C:\venv\newenv1\lib\site-packages\concurrent\futures\thread.py", line 63, in run result = self.fn(*self.args, **self.kwargs) File "C:\projects\beam\sdks\python\apache_beam\runners\portability\fn_api_runner.py", line 1561, in self._registered).process_bundle(part, expected_outputs), File "C:\projects\beam\sdks\python\apache_beam\runners\portability\fn_api_runner.py", line 1500, in process_bundle result_future = self._controller.control_handler.push(process_bundle_req) File "C:\projects\beam\sdks\python\apache_beam\runners\portability\fn_api_runner.py", line 1017, in push response = self.worker.do_instruction(request) File "C:\projects\beam\sdks\python\apache_beam\runners\worker\sdk_worker.py", line 342, in do_instruction request.instruction_id) File "C:\projects\beam\sdks\python\apache_beam\runners\worker\sdk_worker.py", line 368, in process_bundle bundle_processor.process_bundle(instruction_id)) File "C:\projects\beam\sdks\python\apache_beam\runners\worker\bundle_processor.py", line 593, in process_bundle data.ptransform_id].process_encoded(data.data) File "C:\projects\beam\sdks\python\apache_beam\runners\worker\bundle_processor.py", line 143, in process_encoded self.output(decoded_value) File "C:\projects\beam\sdks\python\apache_beam\runners\worker\operations.py", line 256, in output cython.cast(Receiver, self.receivers[output_index]).receive(windowed_value) File "C:\projects\beam\sdks\python\apache_beam\runners\worker\operations.py", line 143, in receive self.consumer.process(windowed_value) File "C:\projects\beam\sdks\python\apache_beam\runners\worker\operations.py", line 594, in process delayed_application = self.dofn_receiver.receive(o) File "C:\projects\beam\sdks\python\apache_beam\runners\common.py", line 795, in receive 
self.process(windowed_value) File "C:\projects\beam\sdks\python\apache_beam\runners\common.py", line 801, in process self._reraise_augmented(exn) File "C:\projects\beam\sdks\python\apache_beam\runners\common.py", line 868, in _reraise_augmented raise_with_traceback(new_exn) File "C:\projects\beam\sdks\python\apache_beam\runners\common.py", line 799, in process return self.do_fn_invoker.invoke_process(windowed_value) File "C:\projects\beam\sdks\python\apache_beam\runners\common.py", line 611, in invoke_process windowed_value, ad
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276935&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276935 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 19:01 Start Date: 15/Jul/19 19:01 Worklog Time Spent: 10m Work Description: akedin commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303540287 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamJoinAssociateRule.java ## @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.beam.sdk.extensions.sql.impl.rule; + +import org.apache.beam.sdk.extensions.sql.impl.rel.BeamJoinRel; +import org.apache.calcite.plan.RelOptRuleCall; +import org.apache.calcite.rel.core.Join; +import org.apache.calcite.rel.core.RelFactories; +import org.apache.calcite.rel.rules.JoinAssociateRule; +import org.apache.calcite.tools.RelBuilderFactory; + +/** + * This is very similar to {@link org.apache.calcite.rel.rules.JoinAssociateRule}. It only checks if + * the resulting condition is supported before transforming. 
+ */
+public class BeamJoinAssociateRule extends JoinAssociateRule {
+  // ~ Static fields/initializers ---------------------------------------------
+
+  /** The singleton. */
+  public static final org.apache.beam.sdk.extensions.sql.impl.rule.BeamJoinAssociateRule INSTANCE =
+      new org.apache.beam.sdk.extensions.sql.impl.rule.BeamJoinAssociateRule(
+          RelFactories.LOGICAL_BUILDER);
+
+  // ~ Constructors ------------------------------------------------------------
Review comment: nit: we don't usually have these separators in Beam codebase This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276935) Time Spent: 6.5h (was: 6h 20m) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 6.5h > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276932&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276932 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 18:58 Start Date: 15/Jul/19 18:58 Worklog Time Spent: 10m Work Description: akedin commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303579948 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/JoinRelOptRuleCall.java ## @@ -0,0 +1,101 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.beam.sdk.extensions.sql.impl.rule; + +import java.util.List; +import java.util.Map; +import org.apache.calcite.plan.RelOptPlanner; +import org.apache.calcite.plan.RelOptRule; +import org.apache.calcite.plan.RelOptRuleCall; +import org.apache.calcite.plan.RelOptRuleOperand; +import org.apache.calcite.rel.RelNode; +import org.apache.calcite.rel.metadata.RelMetadataQuery; +import org.apache.calcite.tools.RelBuilder; + +/** + * This is a class to catch the built join and check if it is a legal join before passing it to the + * actual RelOptRuleCall. 
+ */
+public class JoinRelOptRuleCall extends RelOptRuleCall {
+  private final RelOptRuleCall originalCall;
+  private final JoinChecker checker;
+
+  JoinRelOptRuleCall(RelOptRuleCall originalCall, JoinChecker checker) {
+    super(originalCall.getPlanner(), originalCall.getOperand0(), originalCall.rels, null, null);
+    this.originalCall = originalCall;
+    this.checker = checker;
+  }
+
+  @Override
+  public RelOptRuleOperand getOperand0() {
+    return originalCall.getOperand0();
+  }
+
+  @Override
+  public RelOptRule getRule() {
+    return originalCall.getRule();
+  }
+
+  @Override
+  public List<RelNode> getRelList() {
+    return originalCall.getRelList();
+  }
+
+  @Override
+  @SuppressWarnings("TypeParameterUnusedInFormals")
+  public <T extends RelNode> T rel(int ordinal) {
+    return originalCall.rel(ordinal);
+  }
+
+  @Override
+  public List<RelNode> getChildRels(RelNode rel) {
+    return originalCall.getChildRels(rel);
+  }
+
+  @Override
+  public RelOptPlanner getPlanner() {
+    return originalCall.getPlanner();
+  }
+
+  @Override
+  public RelMetadataQuery getMetadataQuery() {
+    return originalCall.getMetadataQuery();
+  }
+
+  @Override
+  public List<RelNode> getParents() {
+    return originalCall.getParents();
+  }
+
+  @Override
+  public RelBuilder builder() {
+    return originalCall.builder();
+  }
+
+  @Override
+  public void transformTo(RelNode rel, Map<RelNode, RelNode> equiv) {
Review comment: nit: can you move this and `JoinChecker` to the top and add a comment that these are the only two custom things, and everything else is delegated to the `originalCall`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276932) Time Spent: 6h 10m (was: 6h) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 6h 10m > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276933&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276933 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 18:58 Start Date: 15/Jul/19 18:58 Worklog Time Spent: 10m Work Description: akedin commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303586595 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamJoinPushThroughJoinRule.java ## @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+package org.apache.beam.sdk.extensions.sql.impl.rule;
+
+import org.apache.beam.sdk.extensions.sql.impl.rel.BeamJoinRel;
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.RelFactories;
+import org.apache.calcite.rel.logical.LogicalJoin;
+import org.apache.calcite.rel.rules.JoinPushThroughJoinRule;
+import org.apache.calcite.tools.RelBuilderFactory;
+
+/**
+ * This is nearly identical to {@link org.apache.calcite.rel.rules.JoinPushThroughJoinRule}. It only
+ * checks if the condition of the new bottom join is supported.
+ */
+public class BeamJoinPushThroughJoinRule extends JoinPushThroughJoinRule {
+  /** Instance of the rule that works on logical joins only, and pushes to the right. */
+  public static final RelOptRule RIGHT =
+      new BeamJoinPushThroughJoinRule(
+          "BeamJoinPushThroughJoinRule:right",
+          true,
+          LogicalJoin.class,
+          RelFactories.LOGICAL_BUILDER);
+
+  /** Instance of the rule that works on logical joins only, and pushes to the left. */
+  public static final RelOptRule LEFT =
+      new BeamJoinPushThroughJoinRule(
+          "BeamJoinPushThroughJoinRule:left",
+          false,
+          LogicalJoin.class,
+          RelFactories.LOGICAL_BUILDER);
+
+  /** Creates a JoinPushThroughJoinRule. */
+  private BeamJoinPushThroughJoinRule(
+      String description,
+      boolean right,
+      Class<? extends Join> clazz,
+      RelBuilderFactory relBuilderFactory) {
+    super(description, right, clazz, relBuilderFactory);
+  }
+
+  @Override
+  public void onMatch(RelOptRuleCall call) {
+    super.onMatch(
+        new JoinRelOptRuleCall(
Review comment: just an idea to think about, don't know if it makes sense:
* implement the check `BeamJoinRel.isJoinLegal(topJoin) && BeamJoinRel.isJoinLegal(bottomJoin)` in `JoinRelOptRuleCall`;
* create an interface like `JoinInputNodesMatcher` or something, which returns the top and the bottom join from the node, e.g.:
```java
interface JoinInputNodesMatcher implements ... {
  Pair<Join, Join> getTopAndBottomJoin(RelOptRuleCall call);

  default void onMatch(RelOptRuleCall call) {
    super.onMatch(new JoinRelOptRuleCall(this));
  }
}
```
* make the `BeamJoinPushThroughJoinRule` implement that interface;
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276933) Time Spent: 6h 20m (was: 6h 10m) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 6h 20m > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and est
[jira] [Work logged] (BEAM-7711) Support DATETIME as a logical type in BeamSQL
[ https://issues.apache.org/jira/browse/BEAM-7711?focusedWorklogId=276931&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276931 ] ASF GitHub Bot logged work on BEAM-7711: Author: ASF GitHub Bot Created on: 15/Jul/19 18:56 Start Date: 15/Jul/19 18:56 Worklog Time Spent: 10m Work Description: kennknowles commented on pull request #8994: [BEAM-7711] Add DATETIME as a logical type in BeamSQL URL: https://github.com/apache/beam/pull/8994#discussion_r303585601 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/utils/CalciteUtils.java ## @@ -154,6 +163,7 @@ public static boolean isStringType(FieldType fieldType) { .put(TIME_WITH_LOCAL_TZ, SqlTypeName.TIME_WITH_LOCAL_TIME_ZONE) .put(TIMESTAMP, SqlTypeName.TIMESTAMP) .put(TIMESTAMP_WITH_LOCAL_TZ, SqlTypeName.TIMESTAMP_WITH_LOCAL_TIME_ZONE) + .put(DATETIME, SqlTypeName.TIMESTAMP) Review comment: What are the differences between DATETIME and TIMESTAMP WITHOUT TIME ZONE, so that they need to be distinguished? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276931) Time Spent: 1.5h (was: 1h 20m) > Support DATETIME as a logical type in BeamSQL > - > > Key: BEAM-7711 > URL: https://issues.apache.org/jira/browse/BEAM-7711 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Rui Wang >Priority: Major > Time Spent: 1.5h > Remaining Estimate: 0h > > DATETIME as a type represents a year, month, day, hour, minute, second, and > subsecond (millis); > it ranges from 0001-01-01 00:00:00 to 9999-12-31 23:59:59.999. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
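As an aside, a DATETIME range of year 0001 through year 9999 at millisecond precision coincides with what Python's `datetime` can represent, which gives a quick way to illustrate the bounds. The sketch below is only illustrative and unrelated to the Java implementation under review:

```python
from datetime import datetime

# Lower and upper bounds of a four-digit-year DATETIME at
# millisecond granularity (999 ms expressed as 999000 us).
lo = datetime(1, 1, 1, 0, 0, 0)
hi = datetime(9999, 12, 31, 23, 59, 59, 999000)

print(lo.isoformat())  # 0001-01-01T00:00:00
print(hi.isoformat())  # 9999-12-31T23:59:59.999000
```

Note that neither bound carries a time zone, which is exactly the property that distinguishes a DATETIME (a "wall clock" value) from a zoned timestamp.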
[jira] [Updated] (BEAM-7746) Add type hints to python code
[ https://issues.apache.org/jira/browse/BEAM-7746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chad Dombrova updated BEAM-7746: Description: As a developer of the beam source code, I would like the code to use pep484 type hints so that I can clearly see what types are required, get completion in my IDE, and enforce code correctness via a static analyzer like mypy. This may be considered a precursor to BEAM-7060 Work has been started here: [https://github.com/apache/beam/pull/9056] was: As a developer of the beam source code, I would like the code to use pep484 type hints so that I can clearly see what types are required, get completion in my IDE, and enforce code correctness via a static analyzer like mypy. This is a precursor to BEAM-7060 Work has been started here: [https://github.com/apache/beam/pull/9056] > Add type hints to python code > - > > Key: BEAM-7746 > URL: https://issues.apache.org/jira/browse/BEAM-7746 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chad Dombrova >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > As a developer of the beam source code, I would like the code to use pep484 > type hints so that I can clearly see what types are required, get completion > in my IDE, and enforce code correctness via a static analyzer like mypy. > This may be considered a precursor to BEAM-7060 > Work has been started here: [https://github.com/apache/beam/pull/9056] > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7746) Add type hints to python code
[ https://issues.apache.org/jira/browse/BEAM-7746?focusedWorklogId=276925&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276925 ] ASF GitHub Bot logged work on BEAM-7746: Author: ASF GitHub Bot Created on: 15/Jul/19 18:44 Start Date: 15/Jul/19 18:44 Worklog Time Spent: 10m Work Description: chadrik commented on issue #9056: [BEAM-7746] Add python type hints URL: https://github.com/apache/beam/pull/9056#issuecomment-511522447 > type checking pipeline construction I think the goal of JIRA-7060 is specifically to make type annotations useful to pipeline developers. Making tools like mypy work on the beam codebase, and user pipelines in general, is another (worthy) goal. I made a new ticket for the latter, and changed the subject of this PR: https://issues.apache.org/jira/browse/BEAM-7746 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276925) Time Spent: 10m Remaining Estimate: 0h > Add type hints to python code > - > > Key: BEAM-7746 > URL: https://issues.apache.org/jira/browse/BEAM-7746 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chad Dombrova >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > As a developer of the beam source code, I would like the code to use pep484 > type hints so that I can clearly see what types are required, get completion > in my IDE, and enforce code correctness via a static analyzer like mypy. > This is a precursor to BEAM-7060 > Work has been started here: [https://github.com/apache/beam/pull/9056] > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (BEAM-7746) Add type hints to python code
Chad Dombrova created BEAM-7746: --- Summary: Add type hints to python code Key: BEAM-7746 URL: https://issues.apache.org/jira/browse/BEAM-7746 Project: Beam Issue Type: New Feature Components: sdk-py-core Reporter: Chad Dombrova As a developer of the beam source code, I would like the code to use pep484 type hints so that I can clearly see what types are required, get completion in my IDE, and enforce code correctness via a static analyzer like mypy. This is a precursor to BEAM-7060 Work has been started here: [https://github.com/apache/beam/pull/9056] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
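A small sketch of what PEP 484 annotations enable (the function here is hypothetical, not from the Beam codebase): a static checker such as mypy rejects ill-typed calls at check time, and an IDE can offer completion from the declared types.

```python
from typing import Iterable, Optional


def first_matching(items: Iterable[str], prefix: str) -> Optional[str]:
    """Return the first item starting with `prefix`, or None if absent."""
    for item in items:
        if item.startswith(prefix):
            return item
    return None


# mypy would flag a call like first_matching(42, "x") before the code
# ever runs, and the Optional return type forces callers to handle None.
match = first_matching(["BEAM-7060", "BEAM-7746"], "BEAM-77")
```

The `Optional[str]` return type is the kind of information that otherwise lives only in docstrings, which is the motivation stated in the issue.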
[jira] [Commented] (BEAM-7463) BigQueryQueryToTableIT is flaky on Direct runner in PostCommit suites: incorrect checksum
[ https://issues.apache.org/jira/browse/BEAM-7463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885473#comment-16885473 ] Valentyn Tymofieiev commented on BEAM-7463: --- We don't need to pass SDK location. This is where the command that runs on Jenkins is evaluated, overall it looks similar to yours: https://github.com/apache/beam/blob/5fb21e38d9d0e73db514e13a93c15578302c11fa/sdks/python/test-suites/direct/py35/build.gradle#L50 > BigQueryQueryToTableIT is flaky on Direct runner in PostCommit suites: > incorrect checksum > -- > > Key: BEAM-7463 > URL: https://issues.apache.org/jira/browse/BEAM-7463 > Project: Beam > Issue Type: Bug > Components: io-python-gcp >Reporter: Valentyn Tymofieiev >Assignee: Pablo Estrada >Priority: Major > Labels: currently-failing > Time Spent: 5h 20m > Remaining Estimate: 0h > > {noformat} > 15:03:38 FAIL: test_big_query_new_types > (apache_beam.io.gcp.big_query_query_to_table_it_test.BigQueryQueryToTableIT) > 15:03:38 > -- > 15:03:38 Traceback (most recent call last): > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_it_test.py", > line 211, in test_big_query_new_types > 15:03:38 big_query_query_to_table_pipeline.run_bq_pipeline(options) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_pipeline.py", > line 82, in run_bq_pipeline > 15:03:38 result = p.run() > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/testing/test_pipeline.py", > line 107, in run > 15:03:38 else test_runner_api)) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/pipeline.py", > line 406, in run > 15:03:38 self._options).run(False) > 15:03:38 File > 
"/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/pipeline.py", > line 419, in run > 15:03:38 return self.runner.run_pipeline(self, self._options) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/runners/direct/test_direct_runner.py", > line 51, in run_pipeline > 15:03:38 hc_assert_that(self.result, pickler.loads(on_success_matcher)) > 15:03:38 AssertionError: > 15:03:38 Expected: (Test pipeline expected terminated in state: DONE and > Expected checksum is 24de460c4d344a4b77ccc4cc1acb7b7ffc11a214) > 15:03:38 but: Expected checksum is > 24de460c4d344a4b77ccc4cc1acb7b7ffc11a214 Actual checksum is > da39a3ee5e6b4b0d3255bfef95601890afd80709 > {noformat} > [~Juta] could this be caused by changes to Bigquery matcher? > https://github.com/apache/beam/pull/8621/files#diff-f1ec7e3a3e7e2e5082ddb7043954c108R134 > > cc: [~pabloem] [~chamikara] [~apilloud] > A recent postcommit run has BQ failures in other tests as well: > https://builds.apache.org/job/beam_PostCommit_Python3_Verify/1000/consoleFull -- This message was sent by Atlassian JIRA (v7.6.14#76016)
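One detail worth noting in the failure above: the actual checksum `da39a3ee5e6b4b0d3255bfef95601890afd80709` is the SHA-1 digest of the empty byte string, which suggests the matcher hashed no rows at all rather than different rows. This can be checked directly:

```python
import hashlib

# SHA-1 of zero bytes: the well-known "empty" digest that appears
# as the "Actual checksum" in the failing assertion.
empty_digest = hashlib.sha1(b"").hexdigest()
print(empty_digest)  # da39a3ee5e6b4b0d3255bfef95601890afd80709
```

If the matcher's query returned zero rows, a checksum computed over the (empty) result set would produce exactly this value, pointing at a timing or table-visibility issue rather than wrong data.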
[jira] [Created] (BEAM-7745) StreamingSideInputDoFnRunner/StreamingSideInputFetcher have suboptimal state access pattern during normal operation
Steve Niemitz created BEAM-7745: --- Summary: StreamingSideInputDoFnRunner/StreamingSideInputFetcher have suboptimal state access pattern during normal operation Key: BEAM-7745 URL: https://issues.apache.org/jira/browse/BEAM-7745 Project: Beam Issue Type: Improvement Components: runner-dataflow Reporter: Steve Niemitz I spent some time tracking down sources of uncached state fetches in my job, and one large category was the interaction of StreamingSideInputDoFnRunner + StreamingSideInputFetcher. Basically, during standard operations, when the main input is NOT blocked by the side input, the side input fetcher will perform an uncached state read for every input element. Changing it to cache the blockedMap state gave me a ~30-40% increase in throughput in my job. The interaction is a little complicated, and there's a couple optimizations here I can see. Primarily, the blockedMap is only persisted if it is non-empty. Because the WindmillStateCache won't cache a null value, this means that the "nothing is blocked" signal is never actually cached, and will issue a state read to windmill for each input element. The solution here seems like it is to persist an empty map rather than a null when there are no blocked elements. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
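The pitfall described here (a cache that refuses to store a null value, so the common "nothing is blocked" case always misses) is a general caching pattern. A minimal sketch of the bug and the proposed fix follows; this is a toy model, not Beam's actual WindmillStateCache code:

```python
class StateCache:
    """Toy cache that, like the one described, never stores None."""

    def __init__(self):
        self._store = {}
        self.backend_reads = 0

    def get_or_fetch(self, key, fetch):
        if key in self._store:
            return self._store[key]
        self.backend_reads += 1  # uncached read against the backend
        value = fetch()
        if value is not None:    # None is silently not cached
            self._store[key] = value
        return value


cache = StateCache()

# Buggy pattern: "no blocked elements" represented as None,
# so every lookup becomes a backend read.
for _ in range(3):
    cache.get_or_fetch("blocked", lambda: None)
assert cache.backend_reads == 3

# Fix: persist an empty map instead of None, so only the
# first lookup misses and later ones hit the cache.
for _ in range(3):
    cache.get_or_fetch("blocked_fixed", lambda: {})
assert cache.backend_reads == 4
```

This mirrors the suggestion in the report: persisting an empty map rather than nothing makes the "not blocked" signal cacheable, which is where the reported throughput gain comes from.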
[jira] [Work logged] (BEAM-7744) LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink
[ https://issues.apache.org/jira/browse/BEAM-7744?focusedWorklogId=276905&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276905 ] ASF GitHub Bot logged work on BEAM-7744: Author: ASF GitHub Bot Created on: 15/Jul/19 18:11 Start Date: 15/Jul/19 18:11 Worklog Time Spent: 10m Work Description: ihji commented on issue #9071: [BEAM-7744] cherry-pick BEAM-7689 for 2.7.1 URL: https://github.com/apache/beam/pull/9071#issuecomment-511510734 R: @chamikaramj @aaltay CC: @kennknowles This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276905) Time Spent: 10m Remaining Estimate: 0h > LTS backport: Temporary directory for WriteOperation may not be unique in > FileBaseSink > -- > > Key: BEAM-7744 > URL: https://issues.apache.org/jira/browse/BEAM-7744 > Project: Beam > Issue Type: Bug > Components: io-java-files >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: Major > Fix For: 2.7.1 > > Time Spent: 10m > Remaining Estimate: 0h > > Tracking BEAM-7689 LTS backport for 2.7.1 release -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7463) BigQueryQueryToTableIT is flaky on Direct runner in PostCommit suites: incorrect checksum
[ https://issues.apache.org/jira/browse/BEAM-7463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885465#comment-16885465 ] Pablo Estrada commented on BEAM-7463: - gah - I was running on dataflow. What's the best command to run these tests on direct runner? I am trying `./scripts/run_integration_test.sh --runner TestDirectRunner --sdk_location dist/apache-beam-2.15.0.dev0.tar.gz --test_opts --tests=apache_beam.io.gcp.bigquery_write_it_test:BigQueryWriteIntegrationTests.test_big_query_write_new_types` - but it doesn't look like it's working properly. > BigQueryQueryToTableIT is flaky on Direct runner in PostCommit suites: > incorrect checksum > -- > > Key: BEAM-7463 > URL: https://issues.apache.org/jira/browse/BEAM-7463 > Project: Beam > Issue Type: Bug > Components: io-python-gcp >Reporter: Valentyn Tymofieiev >Assignee: Pablo Estrada >Priority: Major > Labels: currently-failing > Time Spent: 5h 20m > Remaining Estimate: 0h > > {noformat} > 15:03:38 FAIL: test_big_query_new_types > (apache_beam.io.gcp.big_query_query_to_table_it_test.BigQueryQueryToTableIT) > 15:03:38 > -- > 15:03:38 Traceback (most recent call last): > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_it_test.py", > line 211, in test_big_query_new_types > 15:03:38 big_query_query_to_table_pipeline.run_bq_pipeline(options) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/io/gcp/big_query_query_to_table_pipeline.py", > line 82, in run_bq_pipeline > 15:03:38 result = p.run() > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/testing/test_pipeline.py", > line 107, in run > 15:03:38 else test_runner_api)) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/pipeline.py", > 
line 406, in run > 15:03:38 self._options).run(False) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/pipeline.py", > line 419, in run > 15:03:38 return self.runner.run_pipeline(self, self._options) > 15:03:38 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify/src/sdks/python/apache_beam/runners/direct/test_direct_runner.py", > line 51, in run_pipeline > 15:03:38 hc_assert_that(self.result, pickler.loads(on_success_matcher)) > 15:03:38 AssertionError: > 15:03:38 Expected: (Test pipeline expected terminated in state: DONE and > Expected checksum is 24de460c4d344a4b77ccc4cc1acb7b7ffc11a214) > 15:03:38 but: Expected checksum is > 24de460c4d344a4b77ccc4cc1acb7b7ffc11a214 Actual checksum is > da39a3ee5e6b4b0d3255bfef95601890afd80709 > {noformat} > [~Juta] could this be caused by changes to Bigquery matcher? > https://github.com/apache/beam/pull/8621/files#diff-f1ec7e3a3e7e2e5082ddb7043954c108R134 > > cc: [~pabloem] [~chamikara] [~apilloud] > A recent postcommit run has BQ failures in other tests as well: > https://builds.apache.org/job/beam_PostCommit_Python3_Verify/1000/consoleFull -- This message was sent by Atlassian JIRA (v7.6.14#76016)
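One detail worth noting in the failure above: the reported "Actual checksum" is recognizable as the SHA-1 digest of empty input, which suggests the matcher hashed zero result rows, i.e. missing output rather than corrupted output.

```python
import hashlib

# The "Actual checksum" from the failing assertion equals SHA-1 of b"",
# meaning the checksum matcher saw an empty result set.
assert hashlib.sha1(b"").hexdigest() == "da39a3ee5e6b4b0d3255bfef95601890afd80709"
```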
[jira] [Created] (BEAM-7744) LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink
Heejong Lee created BEAM-7744: - Summary: LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink Key: BEAM-7744 URL: https://issues.apache.org/jira/browse/BEAM-7744 Project: Beam Issue Type: Bug Components: io-java-files Reporter: Heejong Lee Assignee: Heejong Lee Fix For: 2.7.1 Tracking BEAM-7689 LTS backport for 2.7.1 release -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (BEAM-7689) Temporary directory for WriteOperation may not be unique in FileBaseSink
[ https://issues.apache.org/jira/browse/BEAM-7689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Heejong Lee updated BEAM-7689: -- Fix Version/s: (was: 2.7.1) > Temporary directory for WriteOperation may not be unique in FileBaseSink > > > Key: BEAM-7689 > URL: https://issues.apache.org/jira/browse/BEAM-7689 > Project: Beam > Issue Type: Bug > Components: io-java-files >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: Blocker > Fix For: 2.14.0 > > Time Spent: 3h 40m > Remaining Estimate: 0h > > The temporary directory for a WriteOperation in FileBasedSink is generated from a > second-granularity timestamp (yyyy-MM-dd_HH-mm-ss) and a unique increasing > index. Such granularity is not enough to make temporary directories unique > across different jobs. When two jobs share the same temporary directory, an > output file may not be produced in one job, since the required temporary file > can be deleted by the other job. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
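The collision window is easy to see, and one way to close it is to append a random token to the directory name (illustrative Python sketch; the actual FileBasedSink code is Java, and the exact scheme used by the BEAM-7689 fix may differ):

```python
import uuid
from datetime import datetime


def temp_dir_second_granularity(base):
    # The pattern described in the issue: a second-granularity timestamp,
    # which collides whenever two jobs start within the same second.
    return "{}/.temp-beam-{:%Y-%m-%d_%H-%M-%S}".format(base, datetime.now())


def temp_dir_unique(base):
    # Appending a random token makes cross-job collisions vanishingly
    # unlikely, even for jobs launched in the same second.
    return "{}-{}".format(temp_dir_second_granularity(base), uuid.uuid4().hex)


# Two "jobs" starting at the same instant now get distinct directories,
# so neither can delete the other's temporary files.
assert temp_dir_unique("gs://bucket/out") != temp_dir_unique("gs://bucket/out")
```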
[jira] [Work logged] (BEAM-7689) Temporary directory for WriteOperation may not be unique in FileBaseSink
[ https://issues.apache.org/jira/browse/BEAM-7689?focusedWorklogId=276900&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276900 ] ASF GitHub Bot logged work on BEAM-7689: Author: ASF GitHub Bot Created on: 15/Jul/19 18:03 Start Date: 15/Jul/19 18:03 Worklog Time Spent: 10m Work Description: ihji commented on pull request #9071: [BEAM-7689] cherry-pick for 2.7.1 URL: https://github.com/apache/beam/pull/9071 Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). 
Post-Commit Tests Status (on master branch): table of Jenkins build-status badges for each SDK (Go, Java, Python) against each runner (Apex, Dataflow, Flink, Gearpump, Samza, Spark); the badge markup is truncated mid-URL in this archive.
[jira] [Commented] (BEAM-7743) Stream position bug in BigQuery storage stream source can lead to data loss
[ https://issues.apache.org/jira/browse/BEAM-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885459#comment-16885459 ] Kenneth Jung commented on BEAM-7743: Update: this issue does not appear to be present in 2.14: Change: [https://github.com/apache/beam/commit/aa8b3d054085548d79c988774923a1f68d223a26#diff-d5be8ca6fea88f6525fc7c49c3d43cd4] Release branch: [https://github.com/apache/beam/commits/release-2.14.0/sdks/java/io/google-cloud-platform] > Stream position bug in BigQuery storage stream source can lead to data loss > --- > > Key: BEAM-7743 > URL: https://issues.apache.org/jira/browse/BEAM-7743 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.14.0 >Reporter: Kenneth Jung >Priority: Blocker > > A bug in the splitAtFraction implementation for the BigQuery Storage API > stream source can lead to data loss in certain conditions. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (BEAM-7743) Stream position bug in BigQuery storage stream source can lead to data loss
[ https://issues.apache.org/jira/browse/BEAM-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenneth Jung updated BEAM-7743: --- Affects Version/s: (was: 2.14.0) 2.15.0 > Stream position bug in BigQuery storage stream source can lead to data loss > --- > > Key: BEAM-7743 > URL: https://issues.apache.org/jira/browse/BEAM-7743 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.15.0 >Reporter: Kenneth Jung >Priority: Blocker > > A bug in the splitAtFraction implementation for the BigQuery Storage API > stream source can lead to data loss in certain conditions. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276895&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276895 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 17:58 Start Date: 15/Jul/19 17:58 Worklog Time Spent: 10m Work Description: riazela commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303537888 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamRuleSets.java ## @@ -103,6 +106,10 @@ // join rules JoinPushExpressionsRule.INSTANCE, + JoinCommuteRule.INSTANCE, Review comment: I have tested the effect of adding these rules in terms of performance gain in: https://docs.google.com/document/d/1DM_bcfFbIoc_vEoqQxhC7AvHBUDVCAwToC8TYGukkII/edit# However, in terms of testing against bugs, all I can say is that it passes our current tests; if it can create an illegal plan for some query, that means the case is not covered by our tests and should be added. I am not sure how we can test this against all possible scenarios and queries. I think one way to improve our test cases is to check other systems' tests (such as Calcite's test cases) and add them to ours. I would appreciate any suggestions on that. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276895) Time Spent: 6h (was: 5h 50m) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 6h > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
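The estimation strategy in the issue description can be sketched as follows (illustrative Python; the real implementation is the Java text-table provider in Beam SQL, and `estimate_row_count` is a hypothetical helper name):

```python
import os


def estimate_row_count(path, sample_lines=100):
    # Sample up to the first `sample_lines` lines, then extrapolate:
    # estimated rows = total file size / average sampled line length.
    total_bytes = os.path.getsize(path)
    lengths = []
    with open(path, "rb") as f:
        for _ in range(sample_lines):
            line = f.readline()
            if not line:
                break
            lengths.append(len(line))
    if not lengths:
        return 0
    avg_line_bytes = sum(lengths) / len(lengths)
    return int(round(total_bytes / avg_line_bytes))
```

For uniform rows the estimate is exact; for skewed files it only needs to be good enough for the planner to order join inputs sensibly, which is the use case discussed in the review thread above.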
[jira] [Created] (BEAM-7743) Stream position bug in BigQuery storage stream source can lead to data loss
Kenneth Jung created BEAM-7743: -- Summary: Stream position bug in BigQuery storage stream source can lead to data loss Key: BEAM-7743 URL: https://issues.apache.org/jira/browse/BEAM-7743 Project: Beam Issue Type: Bug Components: io-java-gcp Affects Versions: 2.14.0 Reporter: Kenneth Jung A bug in the splitAtFraction implementation for the BigQuery Storage API stream source can lead to data loss in certain conditions. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-1746) Include LICENSE and NOTICE in python dist files
[ https://issues.apache.org/jira/browse/BEAM-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885448#comment-16885448 ] Ahmet Altay commented on BEAM-1746: --- Not including these files is a bug. 2.4.0 distribution has these files, and starting from 2.5.0 these files are missing. > Include LICENSE and NOTICE in python dist files > --- > > Key: BEAM-1746 > URL: https://issues.apache.org/jira/browse/BEAM-1746 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Ahmet Altay >Assignee: Ahmet Altay >Priority: Major > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Reopened] (BEAM-1746) Include LICENSE and NOTICE in python dist files
[ https://issues.apache.org/jira/browse/BEAM-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmet Altay reopened BEAM-1746: --- > Include LICENSE and NOTICE in python dist files > --- > > Key: BEAM-1746 > URL: https://issues.apache.org/jira/browse/BEAM-1746 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Ahmet Altay >Assignee: Ahmet Altay >Priority: Major > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=276852&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276852 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 15/Jul/19 17:08 Start Date: 15/Jul/19 17:08 Worklog Time Spent: 10m Work Description: chadrik commented on issue #9056: [BEAM-7060] Add python type hints URL: https://github.com/apache/beam/pull/9056#issuecomment-511487829 > typing module import: Given that we're deprecating the old type hints, I think we should take a sweep through the codebase and replace all non-qualified typehints with their typing equivalents (when possible) or qualified ones (where not). This would allow us to use the common convention of importing from the typing module directly. Sorry, I don't completely follow this. IIUC, we can't replace `apache_beam.typehints` with `typing` until beam knows how to deal with `typing` internally, right? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276852) Time Spent: 2h 10m (was: 2h) > Design Py3-compatible typehints annotation support in Beam 3. 
> - > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Udi Meiri >Priority: Major > Time Spent: 2h 10m > Remaining Estimate: 0h > > The existing [Typehints implementation in > Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/] heavily relies on internal details of the CPython implementation, and some of > the assumptions of this implementation broke as of Python 3.6; see for > example https://issues.apache.org/jira/browse/BEAM-6877, which makes > typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban > Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] > lists several specific typehints-related breakages, prefixed with "TypeHints > Py3 Error". > We need to decide whether to: > - Deprecate the in-house typehints implementation. > - Continue to support the in-house implementation, which at this point is stale > code and has other known issues. > - Attempt to use some off-the-shelf libraries for supporting > type annotations, like Pytype, Mypy, PyAnnotate. > WRT this decision we also need to plan immediate next steps to unblock > adoption of Beam for Python 3.6+ users. One potential option may be to have > the Beam SDK ignore any typehint annotations on Py 3.6+. > cc: [~udim], [~altay], [~robertwb]. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=276847&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276847 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 15/Jul/19 17:06 Start Date: 15/Jul/19 17:06 Worklog Time Spent: 10m Work Description: chadrik commented on pull request #9056: [BEAM-7060] Add python type hints URL: https://github.com/apache/beam/pull/9056#discussion_r303540136 ## File path: sdks/python/apache_beam/io/iobase.py ## @@ -972,14 +979,15 @@ def expand(self, pcoll): 'or be a PTransform. Received : %r' % self.sink) -class WriteImpl(ptransform.PTransform): +class WriteImpl(ptransform.PTransform[InT, None]): Review comment: oh, also, it will require an import of the `apache_beam.pvalue` module into all of those modules as well. you won't be able to protect it inside a `typing.TYPE_CHECKING` block since it's used in the class definition. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276847) Time Spent: 2h (was: 1h 50m) > Design Py3-compatible typehints annotation support in Beam 3. > - > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Udi Meiri >Priority: Major > Time Spent: 2h > Remaining Estimate: 0h > > Existing [Typehints implementaiton in > Beam|[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/ > ] heavily relies on internal details of CPython implementation, and some of > the assumptions of this implementation broke as of Python 3.6, see for > example: https://issues.apache.org/jira/browse/BEAM-6877, which makes > typehints support unusable on Python 3.6 as of now. 
[Python 3 Kanban > Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] > lists several specific typehints-related breakages, prefixed with "TypeHints > Py3 Error". > We need to decide whether to: > - Deprecate in-house typehints implementation. > - Continue to support in-house implementation, which at this point is a stale > code and has other known issues. > - Attempt to use some off-the-shelf libraries for supporting > type-annotations, like Pytype, Mypy, PyAnnotate. > WRT to this decision we also need to plan on immediate next steps to unblock > adoption of Beam for Python 3.6+ users. One potential option may be to have > Beam SDK ignore any typehint annotations on Py 3.6+. > cc: [~udim], [~altay], [~robertwb]. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=276846&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276846 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 15/Jul/19 17:04 Start Date: 15/Jul/19 17:04 Worklog Time Spent: 10m Work Description: chadrik commented on pull request #9056: [BEAM-7060] Add python type hints URL: https://github.com/apache/beam/pull/9056#discussion_r303539509 ## File path: sdks/python/apache_beam/io/gcp/pubsub.py ## @@ -134,8 +136,13 @@ class ReadFromPubSub(PTransform): """A ``PTransform`` for reading from Cloud Pub/Sub.""" # Implementation note: This ``PTransform`` is overridden by Directrunner. - def __init__(self, topic=None, subscription=None, id_label=None, - with_attributes=False, timestamp_attribute=None): + def __init__(self, + topic=None, # type: typing.Optional[str] Review comment: I'll leave that up to the Beam Team, but we're still going to have the line length limit to contend with. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276846) Time Spent: 1h 50m (was: 1h 40m) > Design Py3-compatible typehints annotation support in Beam 3. 
> - > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Udi Meiri >Priority: Major > Time Spent: 1h 50m > Remaining Estimate: 0h > > The existing [Typehints implementation in > Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/] heavily relies on internal details of the CPython implementation, and some of > the assumptions of this implementation broke as of Python 3.6; see for > example https://issues.apache.org/jira/browse/BEAM-6877, which makes > typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban > Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] > lists several specific typehints-related breakages, prefixed with "TypeHints > Py3 Error". > We need to decide whether to: > - Deprecate the in-house typehints implementation. > - Continue to support the in-house implementation, which at this point is stale > code and has other known issues. > - Attempt to use some off-the-shelf libraries for supporting > type annotations, like Pytype, Mypy, PyAnnotate. > WRT this decision we also need to plan immediate next steps to unblock > adoption of Beam for Python 3.6+ users. One potential option may be to have > the Beam SDK ignore any typehint annotations on Py 3.6+. > cc: [~udim], [~altay], [~robertwb]. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=276844&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276844 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 15/Jul/19 17:03 Start Date: 15/Jul/19 17:03 Worklog Time Spent: 10m Work Description: chadrik commented on pull request #9056: [BEAM-7060] Add python type hints URL: https://github.com/apache/beam/pull/9056#discussion_r303539155 ## File path: sdks/python/apache_beam/examples/wordcount.py ## @@ -85,31 +98,66 @@ def run(argv=None): pipeline_options.view_as(SetupOptions).save_main_session = True p = beam.Pipeline(options=pipeline_options) + reveal_type(p) + + _read = ReadFromText(known_args.input) + reveal_type(_read) + + read = 'read' >> _read + reveal_type(read) + # Read the text file[pattern] into a PCollection. - lines = p | 'read' >> ReadFromText(known_args.input) + lines = p | read + reveal_type(lines) # PCollection[unicode*] + + def make_ones(x): +# type: (T) -> Tuple[T, int] Review comment: Can you explain what you mean? Let me try to answer what I think you're asking about. First, even for a simple function like `make_ones`, mypy does not guess at the arguments and return type. Next, as a user annotating pipelines I have a few of choices (all of them are valid): 1. annotate functions, and let mypy infer the types of the `PTransforms` they are passed to 2. annotate `PTransforms` 3. annotate both, as a way to guard against unintentional incompatibilities Option 2 would look like this: ```python def make_ones(x): return (x, 1) makemap = beam.Map(make_ones) # type: beam.Map[T, Tuple[T, int]] ``` To be clear, in option 2, mypy is not using the type of `beam.Map` to infer the type of the `make_ones`, it's simply assuming it's correct, since in our thought experiment it's unannotated. 
In other words, the following would not generate an error because `make_ones` is untyped: ```python def make_ones(x): return (x + 1, 1) # x can only be a numeric type! makemap = beam.Map(make_ones) # type: beam.Map[str, Tuple[str, int]] ``` This module is very messy right now, so I apologize for any confusion that it's generated. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276844) Time Spent: 1h 40m (was: 1.5h) > Design Py3-compatible typehints annotation support in Beam 3. > - > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Udi Meiri >Priority: Major > Time Spent: 1h 40m > Remaining Estimate: 0h > > Existing [Typehints implementaiton in > Beam|[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/ > ] heavily relies on internal details of CPython implementation, and some of > the assumptions of this implementation broke as of Python 3.6, see for > example: https://issues.apache.org/jira/browse/BEAM-6877, which makes > typehints support unusable on Python 3.6 as of now. [Python 3 Kanban > Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] > lists several specific typehints-related breakages, prefixed with "TypeHints > Py3 Error". > We need to decide whether to: > - Deprecate in-house typehints implementation. > - Continue to support in-house implementation, which at this point is a stale > code and has other known issues. > - Attempt to use some off-the-shelf libraries for supporting > type-annotations, like Pytype, Mypy, PyAnnotate. 
> WRT to this decision we also need to plan on immediate next steps to unblock > adoption of Beam for Python 3.6+ users. One potential option may be to have > Beam SDK ignore any typehint annotations on Py 3.6+. > cc: [~udim], [~altay], [~robertwb]. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276843&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276843 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 16:59 Start Date: 15/Jul/19 16:59 Worklog Time Spent: 10m Work Description: riazela commented on pull request #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#discussion_r303537888 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamRuleSets.java ## @@ -103,6 +106,10 @@ // join rules JoinPushExpressionsRule.INSTANCE, + JoinCommuteRule.INSTANCE, Review comment: I have tested the effect of adding these rules in terms of performance gain for some of the queries in https://docs.google.com/document/d/1DM_bcfFbIoc_vEoqQxhC7AvHBUDVCAwToC8TYGukkII/edit# However, in terms of testing against bugs, all I can say is that it passes our current tests; if it can create an illegal plan for some query, that means the case is not covered by our tests and should be added. I am not sure how we can test this against all possible scenarios and queries. I think one way to improve our test cases is to check other systems' tests (such as Calcite's test cases) and add them to ours. I would appreciate any suggestions on that. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276843) Time Spent: 5h 50m (was: 5h 40m) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 5h 50m > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-3645) Support multi-process execution on the FnApiRunner
[ https://issues.apache.org/jira/browse/BEAM-3645?focusedWorklogId=276842&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276842 ] ASF GitHub Bot logged work on BEAM-3645: Author: ASF GitHub Bot Created on: 15/Jul/19 16:58 Start Date: 15/Jul/19 16:58 Worklog Time Spent: 10m Work Description: Hannah-Jiang commented on issue #8979: [BEAM-3645] add multiplexing for python FnApiRunner URL: https://github.com/apache/beam/pull/8979#issuecomment-511483661 kindly ping. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276842) Time Spent: 22h 50m (was: 22h 40m) > Support multi-process execution on the FnApiRunner > -- > > Key: BEAM-3645 > URL: https://issues.apache.org/jira/browse/BEAM-3645 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Charles Chen >Assignee: Hannah Jiang >Priority: Major > Time Spent: 22h 50m > Remaining Estimate: 0h > > https://issues.apache.org/jira/browse/BEAM-3644 gave us a 15x performance > gain over the previous DirectRunner. We can do even better in multi-core > environments by supporting multi-process execution in the FnApiRunner, to > scale past Python GIL limitations. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (BEAM-7742) BigQuery File Loads to work well with load job size limits
Pablo Estrada created BEAM-7742: --- Summary: BigQuery File Loads to work well with load job size limits Key: BEAM-7742 URL: https://issues.apache.org/jira/browse/BEAM-7742 Project: Beam Issue Type: Improvement Components: io-python-gcp Reporter: Pablo Estrada Assignee: Tanay Tummalapalli Load jobs into BigQuery have a number of limitations: [https://cloud.google.com/bigquery/quotas#load_jobs] Currently, the Python BQ sink implemented in `bigquery_file_loads.py` does not handle these limitations well. Improvements need to be made to the implementation, to: * Decide to use temp_tables dynamically at pipeline execution * Add code to determine when a load job to a single destination needs to be partitioned into multiple jobs. * When this happens, we definitely need to use temp_tables, in case one of the load jobs fails and the pipeline is rerun. Tanay, would you be able to look at this? -- This message was sent by Atlassian JIRA (v7.6.14#76016)
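The second bullet above (splitting a single destination's files across multiple load jobs) amounts to a greedy partitioning problem. The sketch below is a hypothetical illustration, not the `bigquery_file_loads.py` implementation; the limit values and function name are assumptions chosen for the example.

```python
def partition_files(files, max_files_per_job=10000, max_job_bytes=15 * (1 << 40)):
    """Greedily group (path, size_in_bytes) pairs into load-job
    partitions so that no partition exceeds the per-job file-count
    or total-byte limits. Returns a list of lists of paths."""
    partitions = [[]]
    current_bytes = 0
    for path, size in files:
        current = partitions[-1]
        # Start a new partition when adding this file would break a limit.
        if current and (len(current) >= max_files_per_job
                        or current_bytes + size > max_job_bytes):
            partitions.append([])
            current_bytes = 0
            current = partitions[-1]
        current.append(path)
        current_bytes += size
    return partitions
```

Once more than one partition exists for a destination, writing each job to a temp table and atomically copying into the final table (the third bullet) avoids a half-written destination if one job fails and the pipeline reruns.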
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=276836&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276836 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 15/Jul/19 16:47 Start Date: 15/Jul/19 16:47 Worklog Time Spent: 10m Work Description: chadrik commented on pull request #9056: [BEAM-7060] Add python type hints URL: https://github.com/apache/beam/pull/9056#discussion_r303532868 ## File path: sdks/python/apache_beam/io/gcp/datastore/v1/query_splitter.py ## @@ -37,7 +37,7 @@ PropertyFilter.GREATER_THAN, PropertyFilter.GREATER_THAN_OR_EQUAL] except ImportError: - UNSUPPORTED_OPERATORS = None + UNSUPPORTED_OPERATORS = None # type: ignore Review comment: Correct. I can add a comment to the code to clarify. The problem is that if we mark `UNSUPPORTED_OPERATORS` as `Optional`, then mypy will generate numerous errors in this module, because the code assumes the import succeeded and that `UNSUPPORTED_OPERATORS` is _not_ `None`. An alternative would be to set it to `[]`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276836) Time Spent: 1.5h (was: 1h 20m) > Design Py3-compatible typehints annotation support in Beam 3.
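The pattern under review can be reproduced in miniature. The module name below is deliberately nonexistent so the `ImportError` branch runs; mypy would infer a list type from the `try` branch, so the `None` fallback is silenced with `# type: ignore` rather than widening the variable to `Optional` (which would flag every later use that assumes a list).

```python
try:
    # Hypothetical optional dependency; the import fails in this sketch.
    from example_optional_dep import OPERATORS  # type: ignore
    UNSUPPORTED_OPERATORS = list(OPERATORS)
except ImportError:
    # mypy sees a conflict with the list type inferred above, so this
    # assignment is silenced instead of typing the name as Optional.
    UNSUPPORTED_OPERATORS = None  # type: ignore
```

The alternative mentioned in the review, `UNSUPPORTED_OPERATORS = []`, keeps the type consistent and membership checks working, at the cost of conflating "dependency missing" with "no unsupported operators".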
> - > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Udi Meiri >Priority: Major > Time Spent: 1.5h > Remaining Estimate: 0h > > Existing [Typehints implementation in > Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/] heavily relies on internal details of the CPython implementation, and some of > the assumptions of this implementation broke as of Python 3.6; see for > example https://issues.apache.org/jira/browse/BEAM-6877, which makes > typehints support unusable on Python 3.6 as of now. [Python 3 Kanban > Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] > lists several specific typehints-related breakages, prefixed with "TypeHints > Py3 Error". > We need to decide whether to: > - Deprecate the in-house typehints implementation. > - Continue to support the in-house implementation, which at this point is stale > code and has other known issues. > - Attempt to use some off-the-shelf libraries for supporting > type annotations, like Pytype, Mypy, PyAnnotate. > WRT this decision we also need to plan immediate next steps to unblock > adoption of Beam for Python 3.6+ users. One potential option may be to have > the Beam SDK ignore any typehint annotations on Py 3.6+. > cc: [~udim], [~altay], [~robertwb]. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=276835&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276835 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 15/Jul/19 16:42 Start Date: 15/Jul/19 16:42 Worklog Time Spent: 10m Work Description: chadrik commented on pull request #9056: [BEAM-7060] Add python type hints URL: https://github.com/apache/beam/pull/9056#discussion_r303530972 ## File path: sdks/python/apache_beam/io/iobase.py ## @@ -972,14 +979,15 @@ def expand(self, pcoll): 'or be a PTransform. Received : %r' % self.sink) -class WriteImpl(ptransform.PTransform): +class WriteImpl(ptransform.PTransform[InT, None]): Review comment: IMHO, that would be redundant. It won't change the behavior of the type checks, and it would add verbosity to _a lot_ of class definitions, since there are numerous `PTransforms` throughout the code. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276835) Time Spent: 1h 20m (was: 1h 10m) > Design Py3-compatible typehints annotation support in Beam 3. > - > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Udi Meiri >Priority: Major > Time Spent: 1h 20m > Remaining Estimate: 0h > > Existing [Typehints implementation in > Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/] heavily relies on internal details of the CPython implementation, and some of > the assumptions of this implementation broke as of Python 3.6; see for > example https://issues.apache.org/jira/browse/BEAM-6877, which makes > typehints support unusable on Python 3.6 as of now.
[Python 3 Kanban > Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] > lists several specific typehints-related breakages, prefixed with "TypeHints > Py3 Error". > We need to decide whether to: > - Deprecate the in-house typehints implementation. > - Continue to support the in-house implementation, which at this point is stale > code and has other known issues. > - Attempt to use some off-the-shelf libraries for supporting > type annotations, like Pytype, Mypy, PyAnnotate. > WRT this decision we also need to plan immediate next steps to unblock > adoption of Beam for Python 3.6+ users. One potential option may be to have > the Beam SDK ignore any typehint annotations on Py 3.6+. > cc: [~udim], [~altay], [~robertwb]. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
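The `PTransform[InT, None]` discussion above concerns parameterizing a generic base class. A minimal stand-in (these are not Beam's actual classes) shows why explicit parameters on every subclass are redundant at runtime: parameterized and unparameterized subclasses behave identically, and the annotations only matter to the static checker.

```python
from typing import Generic, Optional, TypeVar

InT = TypeVar("InT")
OutT = TypeVar("OutT")

class PTransform(Generic[InT, OutT]):
    """Minimal stand-in for a generic transform base class."""
    def expand(self, pcoll: InT) -> OutT:
        raise NotImplementedError

# A subclass may bind the type parameters explicitly...
class WriteImpl(PTransform[int, None]):
    def expand(self, pcoll: int) -> None:
        return None

# ...or leave them unbound; runtime behavior is the same either way,
# which is the redundancy the review comment points out.
class UnparameterizedWrite(PTransform):
    def expand(self, pcoll):
        return None
```

Under this reading, annotating only where the checker needs the information (e.g. at the base class and at call sites) keeps the many `PTransform` definitions in the codebase uncluttered.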
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=276826&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276826 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 15/Jul/19 16:36 Start Date: 15/Jul/19 16:36 Worklog Time Spent: 10m Work Description: chadrik commented on pull request #9056: [BEAM-7060] Add python type hints URL: https://github.com/apache/beam/pull/9056#discussion_r303528558 ## File path: sdks/python/apache_beam/examples/wordcount.py ## @@ -65,8 +74,12 @@ def process(self, element): self.word_lengths_dist.update(len(w)) return words + def to_runner_api_parameter(self, unused_context): +pass + def run(argv=None): + # type: (Optional[Sequence[str]]) -> None Review comment: Sorry, I should have clarified that the changes to wordcount really amount to tests, primarily of the mypy plugin. Ultimately, we'll want a separate example to demonstrate typing. I'll get this reverted soon. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276826) Time Spent: 1h 10m (was: 1h) > Design Py3-compatible typehints annotation support in Beam 3. 
> - > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Udi Meiri >Priority: Major > Time Spent: 1h 10m > Remaining Estimate: 0h > > Existing [Typehints implementation in > Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/] heavily relies on internal details of the CPython implementation, and some of > the assumptions of this implementation broke as of Python 3.6; see for > example https://issues.apache.org/jira/browse/BEAM-6877, which makes > typehints support unusable on Python 3.6 as of now. [Python 3 Kanban > Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] > lists several specific typehints-related breakages, prefixed with "TypeHints > Py3 Error". > We need to decide whether to: > - Deprecate the in-house typehints implementation. > - Continue to support the in-house implementation, which at this point is stale > code and has other known issues. > - Attempt to use some off-the-shelf libraries for supporting > type annotations, like Pytype, Mypy, PyAnnotate. > WRT this decision we also need to plan immediate next steps to unblock > adoption of Beam for Python 3.6+ users. One potential option may be to have > the Beam SDK ignore any typehint annotations on Py 3.6+. > cc: [~udim], [~altay], [~robertwb]. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable
[ https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=276825&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276825 ] ASF GitHub Bot logged work on BEAM-7545: Author: ASF GitHub Bot Created on: 15/Jul/19 16:35 Start Date: 15/Jul/19 16:35 Worklog Time Spent: 10m Work Description: riazela commented on issue #9040: [BEAM-7545] Reordering Beam Joins URL: https://github.com/apache/beam/pull/9040#issuecomment-511475589 run java postcommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276825) Time Spent: 5h 40m (was: 5.5h) > Row Count Estimation for CSV TextTable > -- > > Key: BEAM-7545 > URL: https://issues.apache.org/jira/browse/BEAM-7545 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Fix For: Not applicable > > Time Spent: 5h 40m > Remaining Estimate: 0h > > Implementing Row Count Estimation for CSV Tables by reading the first few > lines of the file and estimating the number of records based on the length of > these lines and the total length of the file. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-6202) Gracefully handle exceptions when waiting for Dataflow job completion.
[ https://issues.apache.org/jira/browse/BEAM-6202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885369#comment-16885369 ] Kasia Kucharczyk commented on BEAM-6202: Here is an issue with the same problems, but in the Load Tests (with job run examples). > Gracefully handle exceptions when waiting for Dataflow job completion. > -- > > Key: BEAM-6202 > URL: https://issues.apache.org/jira/browse/BEAM-6202 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core, test-failures >Reporter: Robert Bradshaw >Priority: Major > > If there is an error when trying to contact the Dataflow service in Python's > Dataflow.poll_for_job_completion, we may exit the thread prematurely. > A typical manifestation is that the Dataflow runner fails with: > {noformat} > AssertionError: Job did not reach to a terminal state after waiting > indefinitely. > {noformat} > however, job execution continues and succeeds. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (BEAM-7733) Flaky tests on Dataflow because of not found information about job state
[ https://issues.apache.org/jira/browse/BEAM-7733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kasia Kucharczyk resolved BEAM-7733. Resolution: Duplicate Fix Version/s: Not applicable > Flaky tests on Dataflow because of not found information about job state > > > Key: BEAM-7733 > URL: https://issues.apache.org/jira/browse/BEAM-7733 > Project: Beam > Issue Type: Bug > Components: sdk-py-core, testing >Reporter: Kasia Kucharczyk >Priority: Major > Fix For: Not applicable > > > Some of the Python performance cron tests on Dataflow experience problems > because of: > {code:java} > AssertionError: Job did not reach to a terminal state after waiting > indefinitely. > {code} > > Dataflow returns the following error: > {code:java} > "error": { > "code": 404, > "message": "(bf55d279f24d1255): Information about job > 2019-07-07_08_42_42-8241186059035490445 could not be found in our system. > Please double check the id is correct. If it is please contact customer > support.", > "status": "NOT_FOUND" >} > } > {code} > Those tests succeed on Dataflow, but Jenkins reports a failure. > This has already happened in the Combine and CoGroupByKey tests: > * [example of Combine > test|https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_LoadTests_Python_Combine_Dataflow_Batch/6/console] > * [example of > CoGBK|https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_LoadTests_Python_CoGBK_Dataflow_Batch/3/console] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7630) Add ITs to check IO behavior with bytes and unicode strings
[ https://issues.apache.org/jira/browse/BEAM-7630?focusedWorklogId=276791&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276791 ] ASF GitHub Bot logged work on BEAM-7630: Author: ASF GitHub Bot Created on: 15/Jul/19 15:39 Start Date: 15/Jul/19 15:39 Worklog Time Spent: 10m Work Description: tvalentyn commented on issue #8985: [BEAM-7630] add ITs for writing and reading bytes from pubsub URL: https://github.com/apache/beam/pull/8985#issuecomment-511454565 The only test failure on this PR is `(apache_beam.io.gcp.big_query_query_to_table_it_test.BigQueryQueryToTableIT) ... FAIL`, which is a known flake unrelated to this change. @pabloem Could you help merge this PR, please? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276791) Time Spent: 2h 10m (was: 2h) > Add ITs to check IO behavior with bytes and unicode strings > --- > > Key: BEAM-7630 > URL: https://issues.apache.org/jira/browse/BEAM-7630 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Juta Staes >Assignee: Juta Staes >Priority: Major > Fix For: 2.15.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > When porting BigQuery IO to Python 3, we discovered some issues regarding > bytes support with BigQuery. We want to make sure that other IOs don't have > the same issues. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7715) External / Cross language (via expansion) transforms / public API should be marked @Experimental
[ https://issues.apache.org/jira/browse/BEAM-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885331#comment-16885331 ] Ismaël Mejía commented on BEAM-7715: Yes, definitely not a blocker but a really nice-to-have. However, I am stuck because of a random Python error; I am trying to get it done in case we have a second RC, but will change the fix version only if needed. Thanks, Anton. > External / Cross language (via expansion) transforms / public API should be > marked @Experimental > > > Key: BEAM-7715 > URL: https://issues.apache.org/jira/browse/BEAM-7715 > Project: Beam > Issue Type: Improvement > Components: io-java-kafka, sdk-java-core, sdk-py-core >Affects Versions: 2.13.0 >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Minor > Fix For: 2.15.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Beam recently introduced APIs and transforms to invoke transforms from > different languages. These APIs, even if tested, are not yet mature, and it is > probably a good idea to mark them as `@Experimental`, at least for the user-facing parts > (sdks/java/core and concrete transforms). -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (BEAM-3061) BigtableIO writes should support emitting "done" notifications
[ https://issues.apache.org/jira/browse/BEAM-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía resolved BEAM-3061. Resolution: Fixed Fix Version/s: 2.15.0 > BigtableIO writes should support emitting "done" notifications > -- > > Key: BEAM-3061 > URL: https://issues.apache.org/jira/browse/BEAM-3061 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Steve Niemitz >Assignee: Steve Niemitz >Priority: Major > Fix For: 2.15.0 > > Time Spent: 16h 20m > Remaining Estimate: 0h > > There was some discussion of this on the dev@ mailing list [1]. This > approach was taken based on discussion there. > [1] > https://lists.apache.org/thread.html/949b33782f722a9000c9bf9e37042739c6fd0927589b99752b78d7bd@%3Cdev.beam.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-3061) BigtableIO should support emitting a sentinel "done" value when a bundle completes
[ https://issues.apache.org/jira/browse/BEAM-3061?focusedWorklogId=276780&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276780 ] ASF GitHub Bot logged work on BEAM-3061: Author: ASF GitHub Bot Created on: 15/Jul/19 15:27 Start Date: 15/Jul/19 15:27 Worklog Time Spent: 10m Work Description: iemejia commented on pull request #7805: [BEAM-3061] Done notifications for BigtableIO.Write URL: https://github.com/apache/beam/pull/7805 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 276780) Time Spent: 16h 20m (was: 16h 10m) > BigtableIO should support emitting a sentinel "done" value when a bundle > completes > -- > > Key: BEAM-3061 > URL: https://issues.apache.org/jira/browse/BEAM-3061 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Steve Niemitz >Assignee: Steve Niemitz >Priority: Major > Time Spent: 16h 20m > Remaining Estimate: 0h > > There was some discussion of this on the dev@ mailing list [1]. This > approach was taken based on discussion there. > [1] > https://lists.apache.org/thread.html/949b33782f722a9000c9bf9e37042739c6fd0927589b99752b78d7bd@%3Cdev.beam.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (BEAM-3061) BigtableIO writes should support emitting "done" notifications
[ https://issues.apache.org/jira/browse/BEAM-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-3061: --- Summary: BigtableIO writes should support emitting "done" notifications (was: BigtableIO should support emitting a sentinel "done" value when a bundle completes) > BigtableIO writes should support emitting "done" notifications > -- > > Key: BEAM-3061 > URL: https://issues.apache.org/jira/browse/BEAM-3061 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Steve Niemitz >Assignee: Steve Niemitz >Priority: Major > Time Spent: 16h 20m > Remaining Estimate: 0h > > There was some discussion of this on the dev@ mailing list [1]. This > approach was taken based on discussion there. > [1] > https://lists.apache.org/thread.html/949b33782f722a9000c9bf9e37042739c6fd0927589b99752b78d7bd@%3Cdev.beam.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-6903) Go IT fails on quota issues frequently
[ https://issues.apache.org/jira/browse/BEAM-6903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885324#comment-16885324 ] Valentyn Tymofieiev commented on BEAM-6903: --- cc: [~alanmyrvold] for awareness as this is related to test flakes due to insufficient quota. > Go IT fails on quota issues frequently > -- > > Key: BEAM-6903 > URL: https://issues.apache.org/jira/browse/BEAM-6903 > Project: Beam > Issue Type: Bug > Components: test-failures >Reporter: Boyuan Zhang >Assignee: Jason Kuster >Priority: Major > > https://builds.apache.org/job/beam_PostCommit_Go/3002/ > https://builds.apache.org/job/beam_PostCommit_Go/3000/ > https://builds.apache.org/job/beam_PostCommit_Go/2997/ > https://builds.apache.org/job/beam_PostCommit_Go/2993/ -- This message was sent by Atlassian JIRA (v7.6.14#76016)