[jira] [Work logged] (BEAM-7437) Integration Test for BQ streaming inserts for streaming pipelines
[ https://issues.apache.org/jira/browse/BEAM-7437?focusedWorklogId=272885&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-272885 ]

ASF GitHub Bot logged work on BEAM-7437:
Author: ASF GitHub Bot
Created on: 06/Jul/19 17:43
Start Date: 06/Jul/19 17:43
Worklog Time Spent: 10m

Work Description: ttanay commented on issue #8934: [BEAM-7437] Add streaming flag to BQ streaming inserts IT test
URL: https://github.com/apache/beam/pull/8934#issuecomment-508942895
Run Python PostCommit

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 272885)
Time Spent: 5h 20m (was: 5h 10m)

> Integration Test for BQ streaming inserts for streaming pipelines
> -
>
> Key: BEAM-7437
> URL: https://issues.apache.org/jira/browse/BEAM-7437
> Project: Beam
> Issue Type: Test
> Components: io-python-gcp
> Affects Versions: 2.12.0
> Reporter: Tanay Tummalapalli
> Assignee: Tanay Tummalapalli
> Priority: Minor
> Labels: test
> Time Spent: 5h 20m
> Remaining Estimate: 0h
>
> Integration Test for the BigQuery sink using Streaming Inserts for streaming pipelines.
> Integration tests currently exist for batch pipelines; they can also be added for streaming pipelines using TestStream. This will be a precursor to the failing integration test to be added for [BEAM-6611|https://issues.apache.org/jira/browse/BEAM-6611].

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-7668) Add ability to query a pipeline from a gRPC JobService
[ https://issues.apache.org/jira/browse/BEAM-7668?focusedWorklogId=272861&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-272861 ]

ASF GitHub Bot logged work on BEAM-7668:
Author: ASF GitHub Bot
Created on: 06/Jul/19 15:51
Start Date: 06/Jul/19 15:51
Worklog Time Spent: 10m

Work Description: chadrik commented on issue #8977: [BEAM-7668] Add GetPipeline method to gRPC JobService
URL: https://github.com/apache/beam/pull/8977#issuecomment-508935895
R: @angoenka

Issue Time Tracking
---
Worklog Id: (was: 272861)
Time Spent: 40m (was: 0.5h)

> Add ability to query a pipeline from a gRPC JobService
> --
>
> Key: BEAM-7668
> URL: https://issues.apache.org/jira/browse/BEAM-7668
> Project: Beam
> Issue Type: Improvement
> Components: beam-model
> Reporter: Chad Dombrova
> Assignee: Chad Dombrova
> Priority: Major
> Labels: portability
> Time Spent: 40m
> Remaining Estimate: 0h
>
> As a developer, I want to query the pipeline definition of jobs that have already been submitted to the Job Service, so that I can write a UI to view and monitor Beam pipelines via the portability model.
>
> PR is coming soon!
[jira] [Work logged] (BEAM-7668) Add ability to query a pipeline from a gRPC JobService
[ https://issues.apache.org/jira/browse/BEAM-7668?focusedWorklogId=272860&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-272860 ]

ASF GitHub Bot logged work on BEAM-7668:
Author: ASF GitHub Bot
Created on: 06/Jul/19 15:50
Start Date: 06/Jul/19 15:50
Worklog Time Spent: 10m

Work Description: chadrik commented on issue #8977: [BEAM-7668] Add GetPipeline method to gRPC JobService
URL: https://github.com/apache/beam/pull/8977#issuecomment-508935895
R: @lukecwik
R: @angoenka

Issue Time Tracking
---
Worklog Id: (was: 272860)
Time Spent: 0.5h (was: 20m)
[jira] [Updated] (BEAM-7693) FILE_LOADS option for inserting rows in BigQuery creates a stuck process in Dataflow that saturates all the resources of the Job
[ https://issues.apache.org/jira/browse/BEAM-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Juan Urrego updated BEAM-7693:
--
Description:

During a streaming job, when you insert records into BigQuery in batch using the FILE_LOADS option and one of the load jobs fails, the thread that failed gets stuck and eventually saturates the job's resources, making the autoscaling option useless (the job uses the maximum number of workers and system latency keeps going up). In some cases it has become ridiculously slow at processing the incoming events.

Here is an example:

{code:java}
BigQueryIO.writeTableRows()
    .to(destinationTableSerializableFunction)
    .withMethod(Method.FILE_LOADS)
    .withJsonSchema(tableSchema)
    .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
    .withTriggeringFrequency(Duration.standardMinutes(5))
    .withNumFileShards(25);
{code}

The pipeline works like a charm, but the moment I send a bad tableRow (for instance, a required value set to null) the pipeline starts emitting this message:

{code:java}
Processing stuck in step FILE_LOADS: in BigQuery/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables) for at least 10m00s without outputting or completing in state finish
  at java.lang.Thread.sleep(Native Method)
  at com.google.api.client.util.Sleeper$1.sleep(Sleeper.java:42)
  at com.google.api.client.util.BackOffUtils.next(BackOffUtils.java:48)
  at org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.nextBackOff(BigQueryHelpers.java:159)
  at org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.waitForDone(BigQueryHelpers.java:145)
  at org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn.finishBundle(WriteTables.java:255)
  at org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn$DoFnInvoker.invokeFinishBundle(Unknown Source)
{code}

It's clear that the step keeps running even though it failed. The BigQuery job reports that the task failed, but Dataflow keeps waiting for a response, even though the job is never executed again.

!image-2019-07-06-15-04-17-593.png|width=497,height=134!

At the same time, no message is sent to the DropInputs step. Even though I created my own dead-letter step, the process thinks it hasn't failed yet.

!Screenshot 2019-07-06 at 15.05.04.png|width=490,height=306!

The only option I have found so far is to pre-validate all the fields beforehand, but I was expecting the database to do that for me, especially in some extreme cases (like decimal numbers or constraint limitations). Please help fix this issue; otherwise the batch option in streaming jobs is almost useless, because I can't trust the library itself to manage dead letters properly.
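The pre-validation workaround mentioned in the description can be sketched as a plain helper that checks required fields before rows reach BigQueryIO, so bad rows can be diverted to a dead-letter sink instead of stalling a load job. This is an illustrative sketch only, independent of Beam; the RowValidator class and its field names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical pre-validation helper (not part of Beam or the BigQuery
// client). It checks that every required field is present and non-null
// before a row is handed to the FILE_LOADS write, so invalid rows can be
// routed to a dead-letter output instead of getting the load job stuck.
class RowValidator {
    private final Set<String> requiredFields;

    RowValidator(Set<String> requiredFields) {
        this.requiredFields = requiredFields;
    }

    /** Returns the names of required fields that are missing or null. */
    List<String> missingFields(Map<String, Object> row) {
        List<String> missing = new ArrayList<>();
        for (String field : requiredFields) {
            if (row.get(field) == null) {
                missing.add(field);
            }
        }
        return missing;
    }

    /** True when the row is safe to send to the FILE_LOADS write. */
    boolean isValid(Map<String, Object> row) {
        return missingFields(row).isEmpty();
    }
}
```

In a pipeline, this check would typically live inside a DoFn with two output tags (valid rows to BigQueryIO, failures to a dead-letter table); the helper above only shows the validation logic itself.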
[jira] [Updated] (BEAM-7693) FILE_LOADS option for inserting rows in BigQuery creates a stuck process in Dataflow that saturates all the resources of the Job
[ https://issues.apache.org/jira/browse/BEAM-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Juan Urrego updated BEAM-7693:
--
Attachment: (was: Screenshot 2019-07-06 at 14.51.36.png)

> FILE_LOADS option for inserting rows in BigQuery creates a stuck process in Dataflow that saturates all the resources of the Job
>
> Key: BEAM-7693
> URL: https://issues.apache.org/jira/browse/BEAM-7693
> Project: Beam
> Issue Type: Bug
> Components: io-java-files
> Affects Versions: 2.13.0
> Environment: Dataflow
> Reporter: Juan Urrego
> Priority: Major
> Attachments: Screenshot 2019-07-06 at 15.05.04.png, image-2019-07-06-15-04-17-593.png
[jira] [Updated] (BEAM-7693) FILE_LOADS option for inserting rows in BigQuery creates a stuck process in Dataflow that saturates all the resources of the Job
[ https://issues.apache.org/jira/browse/BEAM-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Juan Urrego updated BEAM-7693:
--
Attachment: Screenshot 2019-07-06 at 15.05.04.png
[jira] [Updated] (BEAM-7693) FILE_LOADS option for inserting rows in BigQuery creates a stuck process in Dataflow that saturates all the resources of the Job
[ https://issues.apache.org/jira/browse/BEAM-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Juan Urrego updated BEAM-7693:
--
Attachment: image-2019-07-06-15-04-17-593.png
[jira] [Updated] (BEAM-7693) FILE_LOADS option for inserting rows in BigQuery creates a stuck process in Dataflow that saturates all the resources of the Job
[ https://issues.apache.org/jira/browse/BEAM-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Juan Urrego updated BEAM-7693:
------------------------------
    Attachment: Screenshot 2019-07-06 at 14.51.36.png

> FILE_LOADS option for inserting rows in BigQuery creates a stuck process in Dataflow that saturates all the resources of the Job
>
>                 Key: BEAM-7693
>                 URL: https://issues.apache.org/jira/browse/BEAM-7693
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-files
>    Affects Versions: 2.13.0
>        Environment: Dataflow
>           Reporter: Juan Urrego
>           Priority: Major
>        Attachments: Screenshot 2019-07-06 at 14.51.36.png
>
> During a streaming job, when you insert records into BigQuery in batches using the FILE_LOADS method and one of the load jobs fails, the thread that failed gets stuck and eventually saturates the job's resources, making the autoscaling option useless (the job uses the maximum number of workers and the system latency keeps rising). In some cases processing of the incoming events becomes ridiculously slow.
> Here is an example:
> {code:java}
> BigQueryIO.writeTableRows()
>     .to(destinationTableSerializableFunction)
>     .withMethod(Method.FILE_LOADS)
>     .withJsonSchema(tableSchema)
>     .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
>     .withWriteDisposition(WriteDisposition.WRITE_APPEND)
>     .withTriggeringFrequency(Duration.standardMinutes(5))
>     .withNumFileShards(25);
> {code}
> The pipeline works like a charm, but the moment I send a bad tableRow (for instance, a required value set to null), the pipeline starts emitting this message:
> {code:java}
> Processing stuck in step FILE_LOADS: in BigQuery/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables) for at least 10m00s without outputting or completing in state finish
>   at java.lang.Thread.sleep(Native Method)
>   at com.google.api.client.util.Sleeper$1.sleep(Sleeper.java:42)
>   at com.google.api.client.util.BackOffUtils.next(BackOffUtils.java:48)
>   at org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.nextBackOff(BigQueryHelpers.java:159)
>   at org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.waitForDone(BigQueryHelpers.java:145)
>   at org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn.finishBundle(WriteTables.java:255)
>   at org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn$DoFnInvoker.invokeFinishBundle(Unknown Source)
> {code}
> Clearly the step keeps running even after it failed. The BigQuery job reports that the load failed, but Dataflow keeps waiting for a response, even though the job is never executed again. At the same time, no message is sent to the DropInputs step; even though I created my own dead-letter step, the process thinks nothing has failed yet.
> The only workaround I have found so far is to pre-validate all the fields beforehand, but I was expecting the database to do that for me, especially in some extreme cases (like decimal numbers or constraint limitations). Please help fix this issue; otherwise the batch option in streaming jobs is almost useless, because I can't trust the library to manage dead letters properly.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
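The report above falls back on pre-validating fields before rows reach BigQueryIO. A minimal sketch of such a required-field check, with the TableRow modeled as a plain Map; the `RowValidator` class and the field names are illustrative, not part of Beam:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical pre-validation for rows headed to a BigQuery load job:
// reject any row whose required fields are missing or null, so it can be
// routed to a dead-letter output instead of poisoning the load job.
public class RowValidator {
    // Illustrative required columns; a real pipeline would derive these
    // from the table schema.
    static final Set<String> REQUIRED = Set.of("id", "timestamp");

    static boolean isValid(Map<String, Object> row) {
        for (String field : REQUIRED) {
            if (row.get(field) == null) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Object> good = Map.of("id", 1, "timestamp", "2019-07-06T14:51:36Z");
        Map<String, Object> bad = new HashMap<>();
        bad.put("id", 1); // "timestamp" missing
        System.out.println(isValid(good)); // true
        System.out.println(isValid(bad));  // false
    }
}
```

In a pipeline this check would typically sit in a ParDo with two tagged outputs, so invalid rows flow to a dead-letter sink rather than into the FILE_LOADS step.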
[jira] [Created] (BEAM-7693) FILE_LOADS option for inserting rows in BigQuery creates a stuck process in Dataflow that saturates all the resources of the Job
Juan Urrego created BEAM-7693:
---------------------------------
             Summary: FILE_LOADS option for inserting rows in BigQuery creates a stuck process in Dataflow that saturates all the resources of the Job
                 Key: BEAM-7693
                 URL: https://issues.apache.org/jira/browse/BEAM-7693
             Project: Beam
          Issue Type: Bug
          Components: io-java-files
    Affects Versions: 2.13.0
         Environment: Dataflow
            Reporter: Juan Urrego

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
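The "Processing stuck … without outputting or completing" message comes from a backoff loop (the `Thread.sleep` / `BackOffUtils.next` frames in the stack trace) that keeps polling the load job. A minimal sketch of that polling pattern with an explicit retry cap, so a permanently failed job surfaces as a failure instead of being waited on indefinitely; all names here are illustrative, not Beam's actual API:

```java
// Sketch of a bounded backoff poll for an asynchronous load job.
// JobStatus, waitForDone, and the timings are hypothetical.
public class BackoffPollDemo {
    interface JobStatus {
        boolean isDone();
        boolean isFailed();
    }

    static String waitForDone(JobStatus job, int maxAttempts) {
        long backoffMillis = 50;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (job.isFailed()) {
                return "FAILED"; // surface the failure instead of retrying forever
            }
            if (job.isDone()) {
                return "DONE";
            }
            try {
                Thread.sleep(backoffMillis); // the Thread.sleep frame in the stack trace
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return "INTERRUPTED";
            }
            backoffMillis = Math.min(backoffMillis * 2, 1000); // exponential backoff, capped
        }
        return "TIMED_OUT";
    }

    public static void main(String[] args) {
        JobStatus failedJob = new JobStatus() {
            public boolean isDone() { return false; }
            public boolean isFailed() { return true; }
        };
        System.out.println(waitForDone(failedJob, 5)); // FAILED
    }
}
```

The symptom in the report suggests the failed-job branch is never taken for this class of error, so the loop keeps sleeping and backing off while holding its worker.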
[jira] [Work logged] (BEAM-7427) JmsCheckpointMark Avro Serialisation issue with UnboundedSource
[ https://issues.apache.org/jira/browse/BEAM-7427?focusedWorklogId=272828&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-272828 ]

ASF GitHub Bot logged work on BEAM-7427:
----------------------------------------
                Author: ASF GitHub Bot
            Created on: 06/Jul/19 11:11
            Start Date: 06/Jul/19 11:11
    Worklog Time Spent: 10m

Work Description: zouabimourad commented on issue #8757: [BEAM-7427] Fix JmsCheckpointMark Avro Encoding
URL: https://github.com/apache/beam/pull/8757#issuecomment-508917576

> @zouabimourad just for some extra context, my comment on testing encoding/decoding comes from the fact that `CheckpointMark`s should be able to be converted from/to bytes. However, the current implementation tries to do that for a list of not-yet-processed `javax.jms.Message`s, which is probably not a good idea because we cannot guarantee that those will be correctly serialized, with or without Avro. So probably a better approach (and, if I understand correctly, what @jbonofre proposes) is to hold a list of `JmsRecord`s in `JmsCheckpointMark`, because we know those are `Serializable`.

I agree with you: `javax.jms.Message` cannot be deserialized since it's an interface. I got errors after using `CoderProperties.coderDecodeEncodeEqual(avroCoder, checkpointMark);` in a test. @jbonofre has a better fix.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------
    Worklog Id: (was: 272828)
    Time Spent: 1h 40m  (was: 1.5h)

> JmsCheckpointMark Avro Serialisation issue with UnboundedSource
> ---------------------------------------------------------------
>
>                 Key: BEAM-7427
>                 URL: https://issues.apache.org/jira/browse/BEAM-7427
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-jms
>    Affects Versions: 2.12.0
>        Environment: Message Broker: Solace
>                     JMS Client (over AMQP): org.apache.qpid:qpid-jms-client:0.42.0
>            Reporter: Mourad
>            Assignee: Mourad
>            Priority: Major
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> I get the following exception when reading from an unbounded JMS source:
>
> {code:java}
> Caused by: org.apache.avro.SchemaParseException: Illegal character in: this$0
>   at org.apache.avro.Schema.validateName(Schema.java:1151)
>   at org.apache.avro.Schema.access$200(Schema.java:81)
>   at org.apache.avro.Schema$Field.<init>(Schema.java:403)
>   at org.apache.avro.Schema$Field.<init>(Schema.java:396)
>   at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:622)
>   at org.apache.avro.reflect.ReflectData.createFieldSchema(ReflectData.java:740)
>   at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:604)
>   at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:218)
>   at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:215)
>   at avro.shaded.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568)
>   at avro.shaded.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350)
>   at avro.shaded.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313)
>   at avro.shaded.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228)
> {code}
>
> The exception is thrown by Avro when introspecting {{JmsCheckpointMark}} to generate a schema.
> JmsIO config:
>
> {code:java}
> PCollection<DFAMessage> messages = pipeline.apply("read messages from the events broker",
>     JmsIO.<DFAMessage>readMessage()
>         .withConnectionFactory(jmsConnectionFactory)
>         .withTopic(options.getTopic())
>         .withMessageMapper(new DFAMessageMapper())
>         .withCoder(AvroCoder.of(DFAMessage.class)));
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
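The "Illegal character in: this$0" error arises because Avro's reflect-based schema generation walks every field of the class, and a non-static inner class carries a compiler-generated `this$0` field referencing its enclosing instance; the `$` is not legal in an Avro field name. A minimal sketch demonstrating the synthetic field, independent of Avro and with illustrative class names:

```java
import java.lang.reflect.Field;

// Sketch: a non-static inner class carries a hidden this$0 field, which
// reflect-based schema generators like Avro's ReflectData will try (and
// fail) to turn into a schema field.
public class SyntheticFieldDemo {
    class Inner {   // non-static: the compiler adds a this$0 field
        int value;
    }

    public static void main(String[] args) {
        for (Field f : Inner.class.getDeclaredFields()) {
            System.out.println(f.getName()); // prints "value" and "this$0"
        }
    }
}
```

This is why the discussion above converges on keeping only `Serializable` data such as `JmsRecord`s inside `JmsCheckpointMark`, rather than Avro-encoding `javax.jms.Message` implementations whose field layout is outside Beam's control.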