[jira] [Work logged] (BEAM-9144) Beam's own Avro TimeConversion class in beam-sdk-java-core

2020-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9144?focusedWorklogId=374025&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-374025
 ]

ASF GitHub Bot logged work on BEAM-9144:


Author: ASF GitHub Bot
Created on: 18/Jan/20 08:03
Start Date: 18/Jan/20 08:03
Worklog Time Spent: 10m 
  Work Description: iemejia commented on issue #10632: 
[release-2.19.0][BEAM-9144] Beam's own Avro TimeConversion class in 
beam-sdk-java-core
URL: https://github.com/apache/beam/pull/10632#issuecomment-575877015
 
 
   Run Java_Examples_Dataflow PreCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 374025)
Time Spent: 1h 40m  (was: 1.5h)

> Beam's own Avro TimeConversion class in beam-sdk-java-core 
> ---
>
> Key: BEAM-9144
> URL: https://issues.apache.org/jira/browse/BEAM-9144
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Tomo Suzuki
>Assignee: Tomo Suzuki
>Priority: Major
> Fix For: 2.19.0
>
> Attachments: avro-beam-dependency-graph.png
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> From Aaron's comment in 
> https://issues.apache.org/jira/browse/BEAM-8388?focusedCommentId=17016476&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17016476
>  .
> {quote}My org must use Avro 1.9.x (due to some Avro schema resolution issues 
> resolved in 1.9.x) so downgrading Avro is not possible for us.
>  Beam 2.16.0 is compatible with our usage of Avro 1.9.x – but upgrading to 
> 2.17.0 we are broken as 2.17.0 links to Java classes in Avro 1.8.x that are 
> not available in 1.9.x.
> {quote}
> The Java class is 
> {{org.apache.avro.data.TimeConversions.TimestampConversion}} in Avro 1.8.
>  Its enclosing class is renamed to {{org.apache.avro.data.JodaTimeConversions}} in Avro 1.9.
> h1. Beam Java SDK cannot upgrade Avro to 1.9
> Beam has Spark runners and Spark has not yet upgraded to Avro 1.9.
> Illustration of the dependency
> !avro-beam-dependency-graph.png|width=799,height=385!
> h1. Short-term Solution
> As illustrated above, as long as the Beam Java SDK uses only the intersection 
> of Avro classes, methods, and fields between Avro 1.8 and 1.9, it preserves 
> flexibility in the runtime Avro version (as it did until Beam 2.16).
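Beam's actual fix is a Java class, but the compatibility tactic can be sketched as a small, runnable Python probe (the helper name and the missing-module name below are invented for illustration): try the renamed location first, fall back to the old one, and rely only on the surface both provide.

```python
import importlib


def first_importable(*names):
    """Return the first module from `names` that imports successfully.

    This mirrors how code can tolerate a dependency renaming a class
    (TimeConversions -> JodaTimeConversions): probe the new location,
    fall back to the old one, and only use the API common to both.
    """
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("none of {} could be imported".format(names))
```

For example, `first_importable("no_such_module_xyz", "json")` falls back to the stdlib `json` module.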
> h2. Difference of the TimeConversion Classes
> Avro 1.9's TimestampConversion additionally overrides the 
> {{getRecommendedSchema}} method. Details below:
> Avro 1.8's TimeConversions.TimestampConversion:
> {code:java}
>   public static class TimestampConversion extends Conversion<DateTime> {
>     @Override
>     public Class<DateTime> getConvertedType() {
>       return DateTime.class;
>     }
> 
>     @Override
>     public String getLogicalTypeName() {
>       return "timestamp-millis";
>     }
> 
>     @Override
>     public DateTime fromLong(Long millisFromEpoch, Schema schema, LogicalType type) {
>       return new DateTime(millisFromEpoch, DateTimeZone.UTC);
>     }
> 
>     @Override
>     public Long toLong(DateTime timestamp, Schema schema, LogicalType type) {
>       return timestamp.getMillis();
>     }
>   }
> {code}
> Avro 1.9's JodaTimeConversions.TimestampConversion:
> {code:java}
>   public static class TimestampConversion extends Conversion<DateTime> {
>     @Override
>     public Class<DateTime> getConvertedType() {
>       return DateTime.class;
>     }
> 
>     @Override
>     public String getLogicalTypeName() {
>       return "timestamp-millis";
>     }
> 
>     @Override
>     public DateTime fromLong(Long millisFromEpoch, Schema schema, LogicalType type) {
>       return new DateTime(millisFromEpoch, DateTimeZone.UTC);
>     }
> 
>     @Override
>     public Long toLong(DateTime timestamp, Schema schema, LogicalType type) {
>       return timestamp.getMillis();
>     }
> 
>     @Override
>     public Schema getRecommendedSchema() {
>       return LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG));
>     }
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-8684) Beam Dependency Update Request: com.google.apis:google-api-services-bigquery

2020-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8684?focusedWorklogId=374061&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-374061
 ]

ASF GitHub Bot logged work on BEAM-8684:


Author: ASF GitHub Bot
Created on: 18/Jan/20 13:31
Start Date: 18/Jan/20 13:31
Worklog Time Spent: 10m 
  Work Description: suztomo commented on issue #10631: [BEAM-8684] Google 
proto and google-services library upgrades
URL: https://github.com/apache/beam/pull/10631#issuecomment-575898674
 
 
   Will investigate the NoSuchMethodError.
 



Issue Time Tracking
---

Worklog Id: (was: 374061)
Time Spent: 2h 40m  (was: 2.5h)

> Beam Dependency Update Request: com.google.apis:google-api-services-bigquery
> 
>
> Key: BEAM-8684
> URL: https://issues.apache.org/jira/browse/BEAM-8684
> Project: Beam
>  Issue Type: Sub-task
>  Components: dependencies
>Reporter: Beam JIRA Bot
>Priority: Major
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
>  - 2019-11-15 19:39:07.113511 
> -
> Please consider upgrading the dependency 
> com.google.apis:google-api-services-bigquery. 
> The current version is v2-rev20181221-1.28.0. The latest version is 
> v2-rev20190917-1.30.3 
> cc: [~chamikara], 
>  Please refer to the [Beam Dependency 
> Guide|https://beam.apache.org/contribute/dependencies/] for more information. 
> Do Not Modify The Description Above. 

[jira] [Work logged] (BEAM-8684) Beam Dependency Update Request: com.google.apis:google-api-services-bigquery

2020-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8684?focusedWorklogId=374062&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-374062
 ]

ASF GitHub Bot logged work on BEAM-8684:


Author: ASF GitHub Bot
Created on: 18/Jan/20 13:31
Start Date: 18/Jan/20 13:31
Worklog Time Spent: 10m 
  Work Description: suztomo commented on pull request #10631: [BEAM-8684] 
Google proto and google-services library upgrades
URL: https://github.com/apache/beam/pull/10631
 
 
   
 



Issue Time Tracking
---

Worklog Id: (was: 374062)
Time Spent: 2h 50m  (was: 2h 40m)

> Beam Dependency Update Request: com.google.apis:google-api-services-bigquery
> 
>
> Key: BEAM-8684
> URL: https://issues.apache.org/jira/browse/BEAM-8684
> Project: Beam
>  Issue Type: Sub-task
>  Components: dependencies
>Reporter: Beam JIRA Bot
>Priority: Major
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
>  - 2019-11-15 19:39:07.113511 
> -
> Please consider upgrading the dependency 
> com.google.apis:google-api-services-bigquery. 
> The current version is v2-rev20181221-1.28.0. The latest version is 
> v2-rev20190917-1.30.3 
> cc: [~chamikara], 
>  Please refer to the [Beam Dependency 
> Guide|https://beam.apache.org/contribute/dependencies/] for more information. 
> Do Not Modify The Description Above. 

[jira] [Comment Edited] (BEAM-9046) Kafka connector for Python throws ClassCastException when reading KafkaRecord

2020-01-18 Thread Jira


[ 
https://issues.apache.org/jira/browse/BEAM-9046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017933#comment-17017933
 ] 

Berkay Öztürk edited comment on BEAM-9046 at 1/18/20 7:28 PM:
--

[~mxm]

This is what I have done:
 * 
{code:java}
git clone https://github.com/mxm/beam.git beam-mxm{code}

 * 
{code:java}
cd beam-mxm{code}

 * 
{code:java}
git reset --hard b31cf99c75{code}

 * 
{code:java}
./gradlew build -p runners/flink/1.8/job-server{code}

 * 
{code:java}
java -jar 
runners/flink/1.8/job-server/build/libs/beam-runners-flink-1.8-job-server-2.13.0-SNAPSHOT.jar
{code}

 * Run Python pipeline with PortableRunner.

Raises the error below:
{code:java}
Traceback (most recent call last):
  File "main.py", line 49, in <module>
    run()
  File "main.py", line 32, in run
    topics=['test']
  File "/home/USER/beam/lib/python3.5/site-packages/apache_beam/transforms/ptransform.py", line 905, in __ror__
    return self.transform.__ror__(pvalueish, self.label)
  File "/home/USER/beam/lib/python3.5/site-packages/apache_beam/transforms/ptransform.py", line 514, in __ror__
    result = p.apply(self, pvalueish, label)
  File "/home/USER/beam/lib/python3.5/site-packages/apache_beam/pipeline.py", line 481, in apply
    return self.apply(transform, pvalueish)
  File "/home/USER/beam/lib/python3.5/site-packages/apache_beam/pipeline.py", line 517, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
  File "/home/USER/beam/lib/python3.5/site-packages/apache_beam/runners/runner.py", line 175, in apply
    return m(transform, input, options)
  File "/home/USER/beam/lib/python3.5/site-packages/apache_beam/runners/runner.py", line 181, in apply_PTransform
    return transform.expand(input)
  File "/home/USER/beam/lib/python3.5/site-packages/apache_beam/io/external/kafka.py", line 119, in expand
    self.expansion_service))
  File "/home/USER/beam/lib/python3.5/site-packages/apache_beam/pvalue.py", line 110, in apply
    return self.pipeline.apply(*arglist, **kwargs)
  File "/home/USER/beam/lib/python3.5/site-packages/apache_beam/pipeline.py", line 517, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
  File "/home/USER/beam/lib/python3.5/site-packages/apache_beam/runners/runner.py", line 175, in apply
    return m(transform, input, options)
  File "/home/USER/beam/lib/python3.5/site-packages/apache_beam/runners/runner.py", line 181, in apply_PTransform
    return transform.expand(input)
  File "/home/USER/beam/lib/python3.5/site-packages/apache_beam/transforms/external.py", line 142, in expand
    for tag, pcoll_id in self._expanded_transform.outputs.items()
  File "/home/USER/beam/lib/python3.5/site-packages/apache_beam/transforms/external.py", line 142, in <genexpr>
    for tag, pcoll_id in self._expanded_transform.outputs.items()
  File "/home/USER/beam/lib/python3.5/site-packages/apache_beam/runners/pipeline_context.py", line 94, in get_by_id
    self._id_to_proto[id], self._pipeline_context)
  File "/home/USER/beam/lib/python3.5/site-packages/apache_beam/pvalue.py", line 178, in from_runner_api
    eleme

[jira] [Commented] (BEAM-7693) FILE_LOADS option for inserting rows in BigQuery creates a stuck process in Dataflow that saturates all the resources of the Job

2020-01-18 Thread Preston Marshall (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17018728#comment-17018728
 ] 

Preston Marshall commented on BEAM-7693:


This is such a useful pattern. I wish there were an answer for it. Validating up 
front is so slow.

> FILE_LOADS option for inserting rows in BigQuery creates a stuck process in 
> Dataflow that saturates all the resources of the Job
> 
>
> Key: BEAM-7693
> URL: https://issues.apache.org/jira/browse/BEAM-7693
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp, runner-dataflow
>Affects Versions: 2.13.0
> Environment: Dataflow
>Reporter: Juan Urrego
>Priority: Major
> Attachments: Screenshot 2019-07-06 at 15.05.04.png, 
> image-2019-07-06-15-04-17-593.png
>
>
> During a streaming job, when you insert records into BigQuery in batch using 
> the FILE_LOADS option and one of the jobs fails, the thread that failed gets 
> stuck and eventually saturates the job's resources, making the autoscaling 
> option useless (it uses the maximum number of workers and the system latency 
> keeps rising). In some cases it has become ridiculously slow at processing the 
> incoming events.
> Here is an example:
> {code:java}
> BigQueryIO.writeTableRows()
> .to(destinationTableSerializableFunction)
> .withMethod(Method.FILE_LOADS)
> .withJsonSchema(tableSchema)
> .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
> .withWriteDisposition(WriteDisposition.WRITE_APPEND)
> .withTriggeringFrequency(Duration.standardMinutes(5))
> .withNumFileShards(25);
> {code}
> The pipeline works like a charm, but the moment I send a bad tableRow (for 
> instance, a required value set to null), the pipeline starts emitting these 
> messages:
> {code:java}
> Processing stuck in step FILE_LOADS:  in 
> BigQuery/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables) for at 
> least 10m00s without outputting or completing in state finish at 
> java.lang.Thread.sleep(Native Method) at 
> com.google.api.client.util.Sleeper$1.sleep(Sleeper.java:42) at 
> com.google.api.client.util.BackOffUtils.next(BackOffUtils.java:48) at 
> org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.nextBackOff(BigQueryHelpers.java:159)
>  at 
> org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.waitForDone(BigQueryHelpers.java:145)
>  at 
> org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn.finishBundle(WriteTables.java:255)
>  at 
> org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn$DoFnInvoker.invokeFinishBundle(Unknown
>  Source)
> {code}
> It's clear that the step keeps running even though it failed. The BigQuery job 
> reports that the task failed, but Dataflow keeps waiting for a response, even 
> though the job is never executed again.
> !image-2019-07-06-15-04-17-593.png|width=497,height=134!
> At the same time, no message is sent to the DropInputs step; even though I 
> created my own dead-letter step, the process thinks that it hasn't failed 
> yet.
> !Screenshot 2019-07-06 at 15.05.04.png|width=490,height=306!
> The only option I have found so far is to pre-validate all the fields 
> beforehand, but I was expecting the database to do that for me, especially in 
> some extreme cases (like decimal numbers or constraint limitations). Please 
> help fix this issue; otherwise the batch option in streaming jobs is almost 
> useless, because I can't trust the library itself to manage dead letters 
> properly.
>  
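The up-front validation the reporter resorts to can be sketched as a simple pre-filter (the function, field names, and dead-letter handling here are illustrative, not BigQueryIO's API): rows missing a required field are diverted to a dead-letter collection before a load job ever sees them.

```python
def partition_rows(rows, required_fields):
    """Split rows into (valid, dead_letter) before handing them to a load job.

    A row is dead-lettered when any required field is missing or None --
    the kind of row that would otherwise fail the BigQuery load job.
    """
    valid, dead_letter = [], []
    for row in rows:
        if all(row.get(f) is not None for f in required_fields):
            valid.append(row)
        else:
            dead_letter.append(row)
    return valid, dead_letter
```

This only catches missing required values; constraint violations the database enforces (as the reporter notes) still need a server-side answer.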





[jira] [Commented] (BEAM-7693) FILE_LOADS option for inserting rows in BigQuery creates a stuck process in Dataflow that saturates all the resources of the Job

2020-01-18 Thread Preston Marshall (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17018732#comment-17018732
 ] 

Preston Marshall commented on BEAM-7693:


I could be wrong, but it appears to be an explicit choice: 
[https://github.com/apache/beam/blob/148cc71b1c3c6d54835c3b72d7842052fcf6c340/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L2663]
 setMaxRetryJobs seems to be public, though it's not clear how to get at it. I'd 
also like to understand the implications of failing a load job on a streaming 
pipeline, since apparently that's a bad thing.

> FILE_LOADS option for inserting rows in BigQuery creates a stuck process in 
> Dataflow that saturates all the resources of the Job
> 
>
> Key: BEAM-7693
> URL: https://issues.apache.org/jira/browse/BEAM-7693
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp, runner-dataflow
>Affects Versions: 2.13.0
> Environment: Dataflow
>Reporter: Juan Urrego
>Priority: Major
> Attachments: Screenshot 2019-07-06 at 15.05.04.png, 
> image-2019-07-06-15-04-17-593.png
>
>





[jira] [Work logged] (BEAM-5040) BigQueryIO retries infinitely in WriteTable and WriteRename

2020-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-5040?focusedWorklogId=374147&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-374147
 ]

ASF GitHub Bot logged work on BEAM-5040:


Author: ASF GitHub Bot
Created on: 18/Jan/20 23:50
Start Date: 18/Jan/20 23:50
Worklog Time Spent: 10m 
  Work Description: preston-hawkfish commented on pull request #6080: 
[BEAM-5040] Fix retry bug for BigQuery jobs.
URL: https://github.com/apache/beam/pull/6080#discussion_r368255226
 
 

 ##
 File path: 
sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java
 ##
 @@ -1689,6 +1689,11 @@ public WriteResult expand(PCollection<T> input) {
 if (getMaxFileSize() != null) {
   batchLoads.setMaxFileSize(getMaxFileSize());
 }
+// When running in streaming (unbounded mode) we want to retry failed load jobs
+// indefinitely. Failing the bundle is expensive, so we set a fairly high limit
+// on retries.
+if (IsBounded.UNBOUNDED.equals(input.isBounded())) {
+  batchLoads.setMaxRetryJobs(1000);
 
 Review comment:
   How difficult would it be to make them not retry forever in case we want to 
route them to FailedInserts? I'm investigating 
https://issues.apache.org/jira/browse/BEAM-7693 and this seems to be 
implicated. Thanks and sorry for the 🧟‍♂️thread revival :)
 



Issue Time Tracking
---

Worklog Id: (was: 374147)
Time Spent: 2.5h  (was: 2h 20m)

> BigQueryIO retries infinitely in WriteTable and WriteRename
> ---
>
> Key: BEAM-5040
> URL: https://issues.apache.org/jira/browse/BEAM-5040
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.5.0
>Reporter: Reuven Lax
>Assignee: Reuven Lax
>Priority: Major
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> BigQueryIO retries infinitely in WriteTable and WriteRename
> Several failure scenarios with the current code:
>  # It's possible for a load job to return failure even though it actually 
> succeeded (e.g. the reply might have timed out). In this case, BigQueryIO 
> will retry the job which will fail again (because the job id has already been 
> used), leading to indefinite retries. Correct behavior is to stop retrying as 
> the load job has succeeded.
>  # It's possible for a load job to be accepted by BigQuery, but then to fail 
> on the BigQuery side. In this case a retry with the same job id will fail as 
> that job id has already been used. BigQueryIO will sometimes detect this, but 
> if the worker has restarted it will instead issue a load with the old job id 
> and go into a retry loop. Correct behavior is to generate a new deterministic 
> job id and retry using that new job id.
>  # In many cases of worker restart, BigQueryIO ends up in infinite retry 
> loops.
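The fix described in scenario 2 ("generate a new deterministic job id and retry using that") can be sketched like this (the id format and helper are illustrative, not Beam's actual scheme): deriving the id from the job's identity plus a retry index keeps it reproducible across worker restarts, while bumping the index yields an id BigQuery has never seen.

```python
import hashlib


def retry_job_id(table, partition, attempt):
    """Deterministic per-attempt load-job id.

    The same (table, partition, attempt) always yields the same id, so a
    restarted worker re-issues the identical request instead of looping
    forever, while incrementing `attempt` produces a fresh id for a real
    retry after a server-side failure.
    """
    key = "{}|{}|{}".format(table, partition, attempt)
    return "beam_load_" + hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```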



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-8613) Add environment variable support to Docker environment

2020-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8613?focusedWorklogId=374151&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-374151
 ]

ASF GitHub Bot logged work on BEAM-8613:


Author: ASF GitHub Bot
Created on: 19/Jan/20 00:35
Start Date: 19/Jan/20 00:35
Worklog Time Spent: 10m 
  Work Description: stale[bot] commented on issue #10064: [BEAM-8613] Add 
environment variable support to Docker environment
URL: https://github.com/apache/beam/pull/10064#issuecomment-575952376
 
 
   This pull request has been marked as stale due to 60 days of inactivity. It 
will be closed in 1 week if no further activity occurs. If you think that’s 
incorrect or this pull request requires a review, please simply write any 
comment. If closed, you can revive the PR at any time and @mention a reviewer 
or discuss it on the d...@beam.apache.org list. Thank you for your 
contributions.
   
 



Issue Time Tracking
---

Worklog Id: (was: 374151)
Time Spent: 1h 10m  (was: 1h)

> Add environment variable support to Docker environment
> --
>
> Key: BEAM-8613
> URL: https://issues.apache.org/jira/browse/BEAM-8613
> Project: Beam
>  Issue Type: Improvement
>  Components: java-fn-execution, runner-core, runner-direct
>Reporter: Nathan Rusch
>Assignee: Nathan Rusch
>Priority: Trivial
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The Process environment allows specifying environment variables via a map 
> field on its payload message. The Docker environment should support this same 
> pattern, and forward the contents of the map through to the container runtime.
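Forwarding the payload's environment map into the container command line could look like this (the helper and flag layout are a sketch of the idea, not Beam's boot code):

```python
def docker_run_args(image, env):
    """Build a `docker run` argv, forwarding each env entry as `-e KEY=VALUE`.

    `env` plays the role of the environment-variable map on the Docker
    environment payload; sorting the keys keeps the command deterministic.
    """
    args = ["docker", "run"]
    for key in sorted(env):
        args += ["-e", "{}={}".format(key, env[key])]
    args.append(image)
    return args
```

For example, `docker_run_args("beam/sdk:latest", {"B": "2", "A": "1"})` yields `["docker", "run", "-e", "A=1", "-e", "B=2", "beam/sdk:latest"]`.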





[jira] [Work logged] (BEAM-7765) Add test for snippet accessing_valueprovider_info_after_run

2020-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7765?focusedWorklogId=374152&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-374152
 ]

ASF GitHub Bot logged work on BEAM-7765:


Author: ASF GitHub Bot
Created on: 19/Jan/20 00:35
Start Date: 19/Jan/20 00:35
Worklog Time Spent: 10m 
  Work Description: stale[bot] commented on issue #9685: [BEAM-7765] - Add 
test for snippet accessing_valueprovider_info_after_run
URL: https://github.com/apache/beam/pull/9685#issuecomment-575952382
 
 
   This pull request has been closed due to lack of activity. If you think that 
is incorrect, or the pull request requires review, you can revive the PR at any 
time.
   
 



Issue Time Tracking
---

Worklog Id: (was: 374152)
Time Spent: 2h 50m  (was: 2h 40m)

> Add test for snippet accessing_valueprovider_info_after_run
> ---
>
> Key: BEAM-7765
> URL: https://issues.apache.org/jira/browse/BEAM-7765
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: John Patoch
>Priority: Major
>  Labels: easy
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> This snippet needs a unit test.
> It has bugs. For example:
> - apache_beam.utils.value_provider doesn't exist
> - beam.combiners.Sum doesn't exist
> - unused import of: WriteToText
> cc: [~pabloem]





[jira] [Work logged] (BEAM-7765) Add test for snippet accessing_valueprovider_info_after_run

2020-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7765?focusedWorklogId=374153&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-374153
 ]

ASF GitHub Bot logged work on BEAM-7765:


Author: ASF GitHub Bot
Created on: 19/Jan/20 00:35
Start Date: 19/Jan/20 00:35
Worklog Time Spent: 10m 
  Work Description: stale[bot] commented on pull request #9685: [BEAM-7765] 
- Add test for snippet accessing_valueprovider_info_after_run
URL: https://github.com/apache/beam/pull/9685
 
 
   
 



Issue Time Tracking
---

Worklog Id: (was: 374153)
Time Spent: 3h  (was: 2h 50m)

> Add test for snippet accessing_valueprovider_info_after_run
> ---
>
> Key: BEAM-7765
> URL: https://issues.apache.org/jira/browse/BEAM-7765
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: John Patoch
>Priority: Major
>  Labels: easy
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> This snippet needs a unit test.
> It has bugs. For example:
> - apache_beam.utils.value_provider doesn't exist
> - beam.combiners.Sum doesn't exist
> - unused import of: WriteToText
> cc: [~pabloem]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-8935) Fail fast if sdk harness startup failed

2020-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8935?focusedWorklogId=374183&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-374183
 ]

ASF GitHub Bot logged work on BEAM-8935:


Author: ASF GitHub Bot
Created on: 19/Jan/20 03:35
Start Date: 19/Jan/20 03:35
Worklog Time Spent: 10m 
  Work Description: sunjincheng121 commented on pull request #10623: [DO 
NOT REVIEW] Revert "[BEAM-8935] Fail fast if sdk harness startup failed."
URL: https://github.com/apache/beam/pull/10623
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 374183)
Time Spent: 2h 40m  (was: 2.5h)

> Fail fast if sdk harness startup failed
> ---
>
> Key: BEAM-8935
> URL: https://issues.apache.org/jira/browse/BEAM-8935
> Project: Beam
>  Issue Type: Improvement
>  Components: java-fn-execution
>Reporter: sunjincheng
>Assignee: sunjincheng
>Priority: Major
> Fix For: 2.19.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Currently the runner blocks waiting for the SDK harness to start up, until 
> either the harness becomes available or a timeout (1 or 2 minutes) occurs. If 
> the SDK harness fails to start, the runner may not notice the failure until 
> the timeout expires. This is too long.
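The actual change belongs to Beam's Java fn-execution code, but the fail-fast idea can be sketched in a few lines of Python (illustrative only; `HarnessMonitor` and its method names are invented for this sketch, not Beam APIs): instead of sleeping out the full timeout, the waiter wakes as soon as a startup failure is reported.

```python
import threading
import time

# Illustrative sketch only: a single Event signals either "harness ready" or
# "harness failed", so the waiter wakes immediately on failure instead of
# blocking for the full startup timeout.
class HarnessMonitor:
    def __init__(self):
        self._done = threading.Event()
        self.error = None

    def on_ready(self):
        self._done.set()

    def on_failure(self, error):
        self.error = error
        self._done.set()  # fail fast: wake the waiter right away

    def wait(self, timeout_s):
        if not self._done.wait(timeout_s):
            raise TimeoutError('harness did not start within %ss' % timeout_s)
        if self.error is not None:
            raise RuntimeError('harness startup failed: %s' % self.error)

m = HarnessMonitor()
threading.Thread(target=lambda: m.on_failure('boom')).start()
start = time.monotonic()
try:
    m.wait(timeout_s=120)
except RuntimeError:
    pass
assert time.monotonic() - start < 1  # woke well before the 120s timeout
```

The key design point is that startup failure and startup success share one completion signal, so the waiting side never has to poll or wait out the timer to learn about a failure.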



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support

2020-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=374184&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-374184
 ]

ASF GitHub Bot logged work on BEAM-9029:


Author: ASF GitHub Bot
Created on: 19/Jan/20 03:37
Start Date: 19/Jan/20 03:37
Worklog Time Spent: 10m 
  Work Description: icemoon1987 commented on issue #10459: [BEAM-9029]Fix 
two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#issuecomment-575963489
 
 
   @pabloem 
   
   Hi, pabloem :)
   
   I have updated the code and tried to rerun the lint and precommit tests, 
but it seems I failed to trigger them, so I cannot see the results now.
   
   Would you please help rerun the tests? Thank you so much.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 374184)
Remaining Estimate: 21h 40m  (was: 21h 50m)
Time Spent: 2h 20m  (was: 2h 10m)

> Two bugs in Python SDK S3 filesystem support
> 
>
> Key: BEAM-9029
> URL: https://issues.apache.org/jira/browse/BEAM-9029
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Wenhai Pan
>Assignee: Wenhai Pan
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 24h
>  Time Spent: 2h 20m
>  Remaining Estimate: 21h 40m
>
> Hi :)
> There seem to be 2 bugs in the S3 filesystem support.
> I tried to use S3 storage for a simple wordcount demo with DirectRunner.
> The demo script:
> {code:python}
> def main():
>     options = PipelineOptions().view_as(StandardOptions)
>     options.runner = 'DirectRunner'
>     pipeline = beam.Pipeline(options=options)
>     (
>         pipeline
>         | ReadFromText("s3://mx-machine-learning/panwenhai/beam_test/test_data")
>         | "extract_words" >> beam.FlatMap(lambda x: re.findall(r" [A-Za-z\']+", x))
>         | beam.combiners.Count.PerElement()
>         | beam.MapTuple(lambda word, count: "%s: %s" % (word, count))
>         | WriteToText("s3://mx-machine-learning/panwenhai/beam_test/output")
>     )
>     result = pipeline.run()
>     result.wait_until_finish()
>     return
> {code}
>  
> Error message 1:
> {noformat}
> apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-*-of-1': 
> BeamIOError("List operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-': S3ClientError('Tried 
> to list nonexistent S3 path: 
> s3://mx-machine-learning/panwenhai/beam_test/output-', 404)}")} [while 
> running 'WriteToText/Write/WriteImpl/PreFinalize'] with exceptions 
> None{noformat}
>  
> After digging into the code, it seems the Boto3 client's list function raises 
> an exception when trying to list a nonexistent S3 path 
> (beam/sdks/python/apache_beam/io/aws/clients/s3/boto3_client.py line 111), and 
> the S3IO class does not handle this exception in its list_prefix function 
> (beam/sdks/python/apache_beam/io/aws/s3io.py line 121).
> When the runner tries to list and delete any existing output files, and no 
> output file exists yet, it lists a nonexistent S3 path and triggers this 
> exception. That is harmless here: I think we can safely ignore the exception 
> in the S3IO list_prefix function.
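The proposed fix can be sketched in plain Python (a hedged sketch: `S3ClientError` and `FakeClient` below are stand-ins for Beam's real boto3-backed client classes, and the 404 code comes from the error message above): a listing of a nonexistent prefix is treated as an empty result rather than an error.

```python
# Stand-in for the real exception type raised by Beam's boto3 client wrapper.
class S3ClientError(Exception):
    def __init__(self, message, code):
        super().__init__(message)
        self.code = code

def list_prefix(client, prefix):
    """Yield entries under prefix; an absent prefix yields nothing."""
    try:
        yield from client.list(prefix)
    except S3ClientError as e:
        if e.code != 404:  # only swallow "path does not exist"
            raise

# A fake client that behaves like the failure case in the report above.
class FakeClient:
    def list(self, prefix):
        raise S3ClientError('Tried to list nonexistent S3 path: %s' % prefix, 404)

assert list(list_prefix(FakeClient(), 's3://bucket/missing-')) == []
```

Swallowing only the 404 case keeps genuine errors (permissions, throttling) visible while making "no existing output" a normal empty listing.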
> Error message 2:
> {noformat}
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in delete
> exceptions = {path: error for (path, error) in results
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in <dictcomp>
> exceptions = {path: error for (path, error) in results
> ValueError: too many values to unpack (expected 2) [while running 
> 'WriteToText/Write/WriteImpl/FinalizeWrite']{noformat}
>  
> When the runner tries to delete the temporary output directory, it triggers 
> this exception. It is caused by unpacking (path, error) pairs directly from 
> "results", which is a dict 
> (beam/sdks/python/apache_beam/io/aws/s3filesystem.py line 272); iterating a 
> dict yields only its keys, so the two-value unpack fails. I think we should 
> iterate results.items() here.
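The second bug reproduces without Beam at all (a minimal sketch; the paths and values below are invented, only the comprehension shape matches the traceback above): iterating a dict yields keys, so unpacking each key string into two names fails exactly as in the report.

```python
# A dict shaped like the "results" mapping in s3filesystem.py: path -> error.
results = {'s3://bucket/tmp-0': None, 's3://bucket/tmp-1': 'AccessDenied'}

# Buggy form: iterating the dict yields key strings, and unpacking an
# 18-character string into (path, error) raises ValueError.
try:
    bad = {path: error for (path, error) in results}
except ValueError as e:
    print(e)  # too many values to unpack (expected 2)

# Fixed form: .items() yields the (path, error) pairs the comprehension expects.
good = {path: error for (path, error) in results.items()}
assert good == results
```

This is the one-token fix proposed above: `results` becomes `results.items()`.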
> I have submitted a patch for these 2 bugs: 
> https://github.com/apache/beam/pull/10459
>  
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)