[jira] [Updated] (BEAM-4548) Long execution delay when using DirectRunner to read from BigQuery Table

2018-06-12 Thread Brian Foo (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Foo updated BEAM-4548:

Description: 
When using DirectRunner to execute a simple select query against a BigQuery 
table that contains 100 rows, the pipeline stalls for over 3 minutes. The 
BigQuery UI can run the same query in under 2 seconds.

A similar issue was reported here: 
[https://stackoverflow.com/questions/46907735/beam-direct-runner-slow-bigquery-read|https://www.google.com/url?q=https://stackoverflow.com/questions/46907735/beam-direct-runner-slow-bigquery-read&sa=D&source=hangouts&ust=1528912448506000&usg=AFQjCNHp9JWHFJOnJlBJmLODU1cGBIeXtg]

I ran a thread dump using Visual M seems like the main thread was in a state of 
backoff: 

java.lang.Thread.State: TIMED_WAITING (sleeping)
 at java.lang.Thread.sleep(Native Method)
 at com.google.api.client.util.Sleeper$1.sleep(Sleeper.java:43)
 at com.google.api.client.util.BackOffUtils.next(BackOffUtils.java:50)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.nextBackOff(BigQueryServicesImpl.java:870)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.access$500(BigQueryServicesImpl.java:79)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.pollJob(BigQueryServicesImpl.java:273)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.pollJob(BigQueryServicesImpl.java:247)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySource.executeQuery(BigQueryQuerySource.java:191)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySource.getTableToExtract(BigQueryQuerySource.java:136)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.extractFiles(BigQuerySourceBase.java:103)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.split(BigQuerySourceBase.java:134)
 at 
org.apache.beam.runners.direct.BoundedReadEvaluatorFactory$InputProvider.getInitialInputs(BoundedReadEvaluatorFactory.java:210)
 at 
org.apache.beam.runners.direct.ReadEvaluatorFactory$InputProvider.getInitialInputs(ReadEvaluatorFactory.java:87)
 at 
org.apache.beam.runners.direct.RootProviderRegistry.getInitialInputs(RootProviderRegistry.java:62)
 at 
org.apache.beam.runners.direct.ExecutorServiceParallelExecutor.start(ExecutorServiceParallelExecutor.java:144)
 at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:201)
 at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:62)
 at org.apache.beam.sdk.Pipeline.run(Pipeline.java:311)
 at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)

  was:
When using DirectRunner to execute a simple select query against a BigQuery 
table that contains 100 rows, the pipeline stalls for about 5 minutes. The 
BigQuery UI can run the same query in under 2 seconds.

A similar issue was reported here: 
[https://stackoverflow.com/questions/46907735/beam-direct-runner-slow-bigquery-read|https://www.google.com/url?q=https://stackoverflow.com/questions/46907735/beam-direct-runner-slow-bigquery-read&sa=D&source=hangouts&ust=1528912448506000&usg=AFQjCNHp9JWHFJOnJlBJmLODU1cGBIeXtg]

I ran a thread dump using Visual M seems like the main thread was in a state of 
backoff: 

java.lang.Thread.State: TIMED_WAITING (sleeping)
 at java.lang.Thread.sleep(Native Method)
 at com.google.api.client.util.Sleeper$1.sleep(Sleeper.java:43)
 at com.google.api.client.util.BackOffUtils.next(BackOffUtils.java:50)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.nextBackOff(BigQueryServicesImpl.java:870)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.access$500(BigQueryServicesImpl.java:79)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.pollJob(BigQueryServicesImpl.java:273)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.pollJob(BigQueryServicesImpl.java:247)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySource.executeQuery(BigQueryQuerySource.java:191)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySource.getTableToExtract(BigQueryQuerySource.java:136)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.extractFiles(BigQuerySourceBase.java:103)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.split(BigQuerySourceBase.java:134)
 at 
org.apache.beam.runners.direct.BoundedReadEvaluatorFactory$InputProvider.getInitialInputs(BoundedReadEvaluatorFactory.java:210)
 at 
org.apache.beam.runners.direct.ReadEvaluatorFactory$InputProvider.getInitialInputs(ReadEvaluatorFactory.java:87)
 at 
org.apache.beam.runners.direct.RootProviderRegistry.getInitialInputs(RootProviderRegistry.java:62)
 at 
org.apache.beam.runners.direct.ExecutorServiceParallelExecutor.start(ExecutorServiceParallelExecutor.java:144)
 at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:201)
 at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:62)
 at org.apache.beam.sdk.Pipeline.run(Pipeline.java:3

[jira] [Commented] (BEAM-4548) Long execution delay when using DirectRunner to read from BigQuery Table

2018-06-12 Thread Brian Foo (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16510328#comment-16510328
 ] 

Brian Foo commented on BEAM-4548:
-

Here's a simple piece of code that takes 2.4 seconds on the BQ UI, and 2 
minutes on DirectRunner.
{code:java}
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class Test {
  public static void main(String[] args) {
Pipeline p =

Pipeline.create(PipelineOptionsFactory.fromArgs(args).withoutStrictParsing().create());
String query =
"SELECT * FROM `bigquery-public-data.samples.gsod` LIMIT 10"; // takes 
2.4 seconds
PCollection simpleResult =

p.apply(BigQueryIO.readTableRows().fromQuery(query).withoutValidation().usingStandardSql())
.apply(ParDo.of(new PrintAndReturnDoFn()));
p.run();
  }

  /** Just print the rows and send input to output. */
  private static class PrintAndReturnDoFn extends DoFn {

@ProcessElement
public void processElement(ProcessContext c) {
  TableRow t = c.element();
  System.out.println(t);
  c.output(t);
}
  }
}
{code}

> Long execution delay when using DirectRunner to read from BigQuery Table
> 
>
> Key: BEAM-4548
> URL: https://issues.apache.org/jira/browse/BEAM-4548
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp, runner-direct
>Affects Versions: 2.4.0
>Reporter: Brian Foo
>Assignee: Chamikara Jayalath
>Priority: Major
>
> When using DirectRunner to execute a simple select query against a BigQuery 
> table that contains 100 rows, the pipeline stalls for about 5 minutes. The 
> BigQuery UI can run the same query in under 2 seconds.
> A similar issue was reported here: 
> [https://stackoverflow.com/questions/46907735/beam-direct-runner-slow-bigquery-read|https://www.google.com/url?q=https://stackoverflow.com/questions/46907735/beam-direct-runner-slow-bigquery-read&sa=D&source=hangouts&ust=1528912448506000&usg=AFQjCNHp9JWHFJOnJlBJmLODU1cGBIeXtg]
> I ran a thread dump using Visual M seems like the main thread was in a state 
> of backoff: 
> java.lang.Thread.State: TIMED_WAITING (sleeping)
>  at java.lang.Thread.sleep(Native Method)
>  at com.google.api.client.util.Sleeper$1.sleep(Sleeper.java:43)
>  at com.google.api.client.util.BackOffUtils.next(BackOffUtils.java:50)
>  at 
> org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.nextBackOff(BigQueryServicesImpl.java:870)
>  at 
> org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.access$500(BigQueryServicesImpl.java:79)
>  at 
> org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.pollJob(BigQueryServicesImpl.java:273)
>  at 
> org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.pollJob(BigQueryServicesImpl.java:247)
>  at 
> org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySource.executeQuery(BigQueryQuerySource.java:191)
>  at 
> org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySource.getTableToExtract(BigQueryQuerySource.java:136)
>  at 
> org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.extractFiles(BigQuerySourceBase.java:103)
>  at 
> org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.split(BigQuerySourceBase.java:134)
>  at 
> org.apache.beam.runners.direct.BoundedReadEvaluatorFactory$InputProvider.getInitialInputs(BoundedReadEvaluatorFactory.java:210)
>  at 
> org.apache.beam.runners.direct.ReadEvaluatorFactory$InputProvider.getInitialInputs(ReadEvaluatorFactory.java:87)
>  at 
> org.apache.beam.runners.direct.RootProviderRegistry.getInitialInputs(RootProviderRegistry.java:62)
>  at 
> org.apache.beam.runners.direct.ExecutorServiceParallelExecutor.start(ExecutorServiceParallelExecutor.java:144)
>  at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:201)
>  at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:62)
>  at org.apache.beam.sdk.Pipeline.run(Pipeline.java:311)
>  at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-4548) Long execution delay when using DirectRunner to read from BigQuery Table

2018-06-12 Thread Brian Foo (JIRA)
Brian Foo created BEAM-4548:
---

 Summary: Long execution delay when using DirectRunner to read from 
BigQuery Table
 Key: BEAM-4548
 URL: https://issues.apache.org/jira/browse/BEAM-4548
 Project: Beam
  Issue Type: Bug
  Components: io-java-gcp, runner-direct
Affects Versions: 2.4.0
Reporter: Brian Foo
Assignee: Chamikara Jayalath


When using DirectRunner to execute a simple select query against a BigQuery 
table that contains 100 rows, the pipeline stalls for about 5 minutes. The 
BigQuery UI can run the same query in under 2 seconds.

A similar issue was reported here: 
[https://stackoverflow.com/questions/46907735/beam-direct-runner-slow-bigquery-read|https://www.google.com/url?q=https://stackoverflow.com/questions/46907735/beam-direct-runner-slow-bigquery-read&sa=D&source=hangouts&ust=1528912448506000&usg=AFQjCNHp9JWHFJOnJlBJmLODU1cGBIeXtg]

I ran a thread dump using Visual M seems like the main thread was in a state of 
backoff: 

java.lang.Thread.State: TIMED_WAITING (sleeping)
 at java.lang.Thread.sleep(Native Method)
 at com.google.api.client.util.Sleeper$1.sleep(Sleeper.java:43)
 at com.google.api.client.util.BackOffUtils.next(BackOffUtils.java:50)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.nextBackOff(BigQueryServicesImpl.java:870)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.access$500(BigQueryServicesImpl.java:79)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.pollJob(BigQueryServicesImpl.java:273)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.pollJob(BigQueryServicesImpl.java:247)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySource.executeQuery(BigQueryQuerySource.java:191)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySource.getTableToExtract(BigQueryQuerySource.java:136)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.extractFiles(BigQuerySourceBase.java:103)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.split(BigQuerySourceBase.java:134)
 at 
org.apache.beam.runners.direct.BoundedReadEvaluatorFactory$InputProvider.getInitialInputs(BoundedReadEvaluatorFactory.java:210)
 at 
org.apache.beam.runners.direct.ReadEvaluatorFactory$InputProvider.getInitialInputs(ReadEvaluatorFactory.java:87)
 at 
org.apache.beam.runners.direct.RootProviderRegistry.getInitialInputs(RootProviderRegistry.java:62)
 at 
org.apache.beam.runners.direct.ExecutorServiceParallelExecutor.start(ExecutorServiceParallelExecutor.java:144)
 at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:201)
 at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:62)
 at org.apache.beam.sdk.Pipeline.run(Pipeline.java:311)
 at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)