[ https://issues.apache.org/jira/browse/BEAM-5105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572629#comment-16572629 ]
Reuven Lax commented on BEAM-5105:
----------------------------------

Possible to do, but slightly tricky to implement, for a couple of reasons. One is that the retry code for failed inserts will have to be triggered from finishBundle. The polling code also gets more complicated, because you have to poll N jobs instead of one at a time. There will also be more bookkeeping needed to keep windows correct (because multiple windows can occur inside a single bundle).

> Move load job poll to finishBundle() method to better parallelize execution
> ---------------------------------------------------------------------------
>
>                 Key: BEAM-5105
>                 URL: https://issues.apache.org/jira/browse/BEAM-5105
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-gcp
>            Reporter: Chamikara Jayalath
>            Assignee: Reuven Lax
>            Priority: Major
>
> It appears that when we write to BigQuery using WriteTablesDoFn we start a
> load job and wait for that job to finish.
> [https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L318]
>
> In cases where we are trying to write a PCollection of tables (for example,
> when a user uses the dynamic destinations feature) this relies on dynamic
> work rebalancing to parallelize execution of load jobs. If the runner does
> not support dynamic work rebalancing, or does not perform it for some
> reason, this could have significant performance drawbacks. For example,
> scheduling times for load jobs will add up.
>
> A better approach might be to start load jobs in the process() method but
> wait for all load jobs to finish in the finishBundle() method. This will
> parallelize any overheads as well as job execution (assuming more than one
> job is scheduled by BQ).
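For illustration, the approach proposed in the description could be sketched roughly as follows. This is plain Java, not code against the Beam SDK or the BigQuery client: DeferredPollSketch, processElement, finishBundle, and runLoadJob are hypothetical names, and the load job is simulated with a short sleep. It also leaves out the retry handling and per-window bookkeeping raised in the comment above, which a real change would have to add.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Plain-Java stand-in for a DoFn; none of these names are Beam or BigQuery APIs. */
class DeferredPollSketch {
  // Jobs started in the current bundle that have not been waited on yet.
  private final List<CompletableFuture<String>> pendingJobs = new ArrayList<>();
  // Hypothetical stand-in for the service that runs load jobs concurrently.
  private final ExecutorService fakeBigQuery = Executors.newCachedThreadPool();

  /** Per element: start the load job and return immediately, without polling. */
  void processElement(String table) {
    pendingJobs.add(CompletableFuture.supplyAsync(() -> runLoadJob(table), fakeBigQuery));
  }

  /** Per bundle: wait for every job started in this bundle, in start order. */
  List<String> finishBundle() {
    List<String> finished = new ArrayList<>();
    for (CompletableFuture<String> job : pendingJobs) {
      finished.add(job.join()); // blocks until this job completes
    }
    pendingJobs.clear(); // the next bundle starts with no pending jobs
    return finished;
  }

  void shutdown() {
    fakeBigQuery.shutdown();
  }

  // Simulated load job: sleeps briefly, then reports success for the table.
  private static String runLoadJob(String table) {
    try {
      Thread.sleep(50);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return table + ":DONE";
  }
}
```

Because all jobs are already running by the time finishBundle() starts waiting, the scheduling and execution delays of the individual jobs can overlap instead of adding up, which is the benefit the description is after.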