[jira] [Updated] (BEAM-4548) Long execution delay when using DirectRunner to read from BigQuery Table
[ https://issues.apache.org/jira/browse/BEAM-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Foo updated BEAM-4548: Description: When using DirectRunner to execute a simple select query against a BigQuery table that contains 100 rows, the pipeline stalls for over 3 minutes. The BigQuery UI can run the same query in under 2 seconds. A similar issue was reported here: [https://stackoverflow.com/questions/46907735/beam-direct-runner-slow-bigquery-read|https://www.google.com/url?q=https://stackoverflow.com/questions/46907735/beam-direct-runner-slow-bigquery-read&sa=D&source=hangouts&ust=1528912448506000&usg=AFQjCNHp9JWHFJOnJlBJmLODU1cGBIeXtg] I ran a thread dump using Visual M seems like the main thread was in a state of backoff: java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at com.google.api.client.util.Sleeper$1.sleep(Sleeper.java:43) at com.google.api.client.util.BackOffUtils.next(BackOffUtils.java:50) at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.nextBackOff(BigQueryServicesImpl.java:870) at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.access$500(BigQueryServicesImpl.java:79) at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.pollJob(BigQueryServicesImpl.java:273) at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.pollJob(BigQueryServicesImpl.java:247) at org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySource.executeQuery(BigQueryQuerySource.java:191) at org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySource.getTableToExtract(BigQueryQuerySource.java:136) at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.extractFiles(BigQuerySourceBase.java:103) at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.split(BigQuerySourceBase.java:134) at org.apache.beam.runners.direct.BoundedReadEvaluatorFactory$InputProvider.getInitialInputs(BoundedReadEvaluatorFactory.java:210) at org.apache.beam.runners.direct.ReadEvaluatorFactory$InputProvider.getInitialInputs(ReadEvaluatorFactory.java:87) at org.apache.beam.runners.direct.RootProviderRegistry.getInitialInputs(RootProviderRegistry.java:62) at org.apache.beam.runners.direct.ExecutorServiceParallelExecutor.start(ExecutorServiceParallelExecutor.java:144) at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:201) at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:62) at org.apache.beam.sdk.Pipeline.run(Pipeline.java:311) at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297) was: When using DirectRunner to execute a simple select query against a BigQuery table that contains 100 rows, the pipeline stalls for about 5 minutes. The BigQuery UI can run the same query in under 2 seconds. A similar issue was reported here: [https://stackoverflow.com/questions/46907735/beam-direct-runner-slow-bigquery-read|https://www.google.com/url?q=https://stackoverflow.com/questions/46907735/beam-direct-runner-slow-bigquery-read&sa=D&source=hangouts&ust=1528912448506000&usg=AFQjCNHp9JWHFJOnJlBJmLODU1cGBIeXtg] I ran a thread dump using Visual M seems like the main thread was in a state of backoff: java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at com.google.api.client.util.Sleeper$1.sleep(Sleeper.java:43) at com.google.api.client.util.BackOffUtils.next(BackOffUtils.java:50) at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.nextBackOff(BigQueryServicesImpl.java:870) at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.access$500(BigQueryServicesImpl.java:79) at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.pollJob(BigQueryServicesImpl.java:273) at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.pollJob(BigQueryServicesImpl.java:247) at org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySource.executeQuery(BigQueryQuerySource.java:191) at org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySource.getTableToExtract(BigQueryQuerySource.java:136) at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.extractFiles(BigQuerySourceBase.java:103) at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.split(BigQuerySourceBase.java:134) at org.apache.beam.runners.direct.BoundedReadEvaluatorFactory$InputProvider.getInitialInputs(BoundedReadEvaluatorFactory.java:210) at org.apache.beam.runners.direct.ReadEvaluatorFactory$InputProvider.getInitialInputs(ReadEvaluatorFactory.java:87) at org.apache.beam.runners.direct.RootProviderRegistry.getInitialInputs(RootProviderRegistry.java:62) at org.apache.beam.runners.direct.ExecutorServiceParallelExecutor.start(ExecutorServiceParallelExecutor.java:144) at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:201) at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:62) at org.apache.beam.sdk.Pipeline.run(Pipeline.java:3
[jira] [Commented] (BEAM-4548) Long execution delay when using DirectRunner to read from BigQuery Table
[ https://issues.apache.org/jira/browse/BEAM-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16510328#comment-16510328 ] Brian Foo commented on BEAM-4548: - Here's a simple piece of code that takes 2.4 seconds on the BQ UI, and 2 minutes on DirectRunner. {code:java} import com.google.api.services.bigquery.model.TableRow; import org.apache.beam.sdk.Pipeline; import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO; import org.apache.beam.sdk.options.PipelineOptionsFactory; import org.apache.beam.sdk.transforms.DoFn; import org.apache.beam.sdk.transforms.ParDo; import org.apache.beam.sdk.values.PCollection; public class Test { public static void main(String[] args) { Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withoutStrictParsing().create()); String query = "SELECT * FROM `bigquery-public-data.samples.gsod` LIMIT 10"; // takes 2.4 seconds PCollection simpleResult = p.apply(BigQueryIO.readTableRows().fromQuery(query).withoutValidation().usingStandardSql()) .apply(ParDo.of(new PrintAndReturnDoFn())); p.run(); } /** Just print the rows and send input to output. */ private static class PrintAndReturnDoFn extends DoFn { @ProcessElement public void processElement(ProcessContext c) { TableRow t = c.element(); System.out.println(t); c.output(t); } } } {code} > Long execution delay when using DirectRunner to read from BigQuery Table > > > Key: BEAM-4548 > URL: https://issues.apache.org/jira/browse/BEAM-4548 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, runner-direct >Affects Versions: 2.4.0 >Reporter: Brian Foo >Assignee: Chamikara Jayalath >Priority: Major > > When using DirectRunner to execute a simple select query against a BigQuery > table that contains 100 rows, the pipeline stalls for about 5 minutes. The > BigQuery UI can run the same query in under 2 seconds. > A similar issue was reported here: > [https://stackoverflow.com/questions/46907735/beam-direct-runner-slow-bigquery-read|https://www.google.com/url?q=https://stackoverflow.com/questions/46907735/beam-direct-runner-slow-bigquery-read&sa=D&source=hangouts&ust=1528912448506000&usg=AFQjCNHp9JWHFJOnJlBJmLODU1cGBIeXtg] > I ran a thread dump using Visual M seems like the main thread was in a state > of backoff: > java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at com.google.api.client.util.Sleeper$1.sleep(Sleeper.java:43) > at com.google.api.client.util.BackOffUtils.next(BackOffUtils.java:50) > at > org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.nextBackOff(BigQueryServicesImpl.java:870) > at > org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.access$500(BigQueryServicesImpl.java:79) > at > org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.pollJob(BigQueryServicesImpl.java:273) > at > org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.pollJob(BigQueryServicesImpl.java:247) > at > org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySource.executeQuery(BigQueryQuerySource.java:191) > at > org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySource.getTableToExtract(BigQueryQuerySource.java:136) > at > org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.extractFiles(BigQuerySourceBase.java:103) > at > org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.split(BigQuerySourceBase.java:134) > at > org.apache.beam.runners.direct.BoundedReadEvaluatorFactory$InputProvider.getInitialInputs(BoundedReadEvaluatorFactory.java:210) > at > org.apache.beam.runners.direct.ReadEvaluatorFactory$InputProvider.getInitialInputs(ReadEvaluatorFactory.java:87) > at > org.apache.beam.runners.direct.RootProviderRegistry.getInitialInputs(RootProviderRegistry.java:62) > at > org.apache.beam.runners.direct.ExecutorServiceParallelExecutor.start(ExecutorServiceParallelExecutor.java:144) > at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:201) > at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:62) > at org.apache.beam.sdk.Pipeline.run(Pipeline.java:311) > at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-4548) Long execution delay when using DirectRunner to read from BigQuery Table
Brian Foo created BEAM-4548: --- Summary: Long execution delay when using DirectRunner to read from BigQuery Table Key: BEAM-4548 URL: https://issues.apache.org/jira/browse/BEAM-4548 Project: Beam Issue Type: Bug Components: io-java-gcp, runner-direct Affects Versions: 2.4.0 Reporter: Brian Foo Assignee: Chamikara Jayalath When using DirectRunner to execute a simple select query against a BigQuery table that contains 100 rows, the pipeline stalls for about 5 minutes. The BigQuery UI can run the same query in under 2 seconds. A similar issue was reported here: [https://stackoverflow.com/questions/46907735/beam-direct-runner-slow-bigquery-read|https://www.google.com/url?q=https://stackoverflow.com/questions/46907735/beam-direct-runner-slow-bigquery-read&sa=D&source=hangouts&ust=1528912448506000&usg=AFQjCNHp9JWHFJOnJlBJmLODU1cGBIeXtg] I ran a thread dump using Visual M seems like the main thread was in a state of backoff: java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at com.google.api.client.util.Sleeper$1.sleep(Sleeper.java:43) at com.google.api.client.util.BackOffUtils.next(BackOffUtils.java:50) at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.nextBackOff(BigQueryServicesImpl.java:870) at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.access$500(BigQueryServicesImpl.java:79) at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.pollJob(BigQueryServicesImpl.java:273) at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.pollJob(BigQueryServicesImpl.java:247) at org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySource.executeQuery(BigQueryQuerySource.java:191) at org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySource.getTableToExtract(BigQueryQuerySource.java:136) at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.extractFiles(BigQuerySourceBase.java:103) at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.split(BigQuerySourceBase.java:134) at org.apache.beam.runners.direct.BoundedReadEvaluatorFactory$InputProvider.getInitialInputs(BoundedReadEvaluatorFactory.java:210) at org.apache.beam.runners.direct.ReadEvaluatorFactory$InputProvider.getInitialInputs(ReadEvaluatorFactory.java:87) at org.apache.beam.runners.direct.RootProviderRegistry.getInitialInputs(RootProviderRegistry.java:62) at org.apache.beam.runners.direct.ExecutorServiceParallelExecutor.start(ExecutorServiceParallelExecutor.java:144) at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:201) at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:62) at org.apache.beam.sdk.Pipeline.run(Pipeline.java:311) at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297) -- This message was sent by Atlassian JIRA (v7.6.3#76005)