[ 
https://issues.apache.org/jira/browse/BEAM-5105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572629#comment-16572629
 ] 

Reuven Lax commented on BEAM-5105:
----------------------------------

This is possible, but slightly tricky to implement for a couple of reasons. 
One is that the retry code for failed inserts will have to be triggered from 
finishBundle. The polling code also gets more complicated, because you have to 
poll N jobs at once instead of one at a time. Finally, more bookkeeping will be 
needed to keep windows correct (because multiple windows can occur inside a 
single bundle).
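
To make the three complications concrete, here is a rough sketch of what the 
finishBundle-side loop might look like. It is plain Java, not Beam code: 
MultiJobPoller, JobService, Pending, startJob and awaitAll are all hypothetical 
names invented for illustration, windows are reduced to strings, and the retry 
policy (a fixed maximum number of attempts) is an assumption.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: every name here is hypothetical, not a Beam or BigQuery API. */
final class MultiJobPoller {
  enum Status { RUNNING, SUCCEEDED, FAILED }

  /** Stand-in for whatever service starts and polls load jobs. */
  interface JobService {
    String start(String table, int attempt); // returns a job id
    Status poll(String jobId);
  }

  /** Bookkeeping for one in-flight job, including the window it came from. */
  static final class Pending {
    final String window;
    final String table;
    String jobId;
    int attempt;

    Pending(String window, String table, String jobId) {
      this.window = window;
      this.table = table;
      this.jobId = jobId;
    }
  }

  private final JobService service;
  private final int maxAttempts;
  private final long pollIntervalMillis;
  private final List<Pending> pending = new ArrayList<>();

  MultiJobPoller(JobService service, int maxAttempts, long pollIntervalMillis) {
    this.service = service;
    this.maxAttempts = maxAttempts;
    this.pollIntervalMillis = pollIntervalMillis;
  }

  /** Would be called from process(): start the job and return immediately. */
  void startJob(String window, String table) {
    pending.add(new Pending(window, table, service.start(table, 0)));
  }

  /** Would be called from finishBundle(): poll all N jobs until none remain. */
  Map<String, List<String>> awaitAll() {
    Map<String, List<String>> loadedByWindow = new LinkedHashMap<>();
    while (!pending.isEmpty()) {
      Iterator<Pending> it = pending.iterator();
      while (it.hasNext()) {
        Pending p = it.next();
        Status status = service.poll(p.jobId);
        if (status == Status.SUCCEEDED) {
          // Results are grouped by window so each can be output to the right one.
          loadedByWindow.computeIfAbsent(p.window, w -> new ArrayList<>()).add(p.table);
          it.remove();
        } else if (status == Status.FAILED) {
          // The retry now happens here rather than inline in process().
          p.attempt++;
          if (p.attempt >= maxAttempts) {
            throw new IllegalStateException("Load job for " + p.table + " failed");
          }
          p.jobId = service.start(p.table, p.attempt);
        }
      }
      if (!pending.isEmpty()) {
        try {
          Thread.sleep(pollIntervalMillis);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException("Interrupted while polling", e);
        }
      }
    }
    return loadedByWindow;
  }
}
```

Keeping the window on each Pending entry is what lets a single pass over the 
list serve jobs from several windows; a real implementation would carry the 
actual window object rather than a string.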

> Move load job poll to finishBundle() method to better parallelize execution
> ---------------------------------------------------------------------------
>
>                 Key: BEAM-5105
>                 URL: https://issues.apache.org/jira/browse/BEAM-5105
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-gcp
>            Reporter: Chamikara Jayalath
>            Assignee: Reuven Lax
>            Priority: Major
>
> It appears that when we write to BigQuery using WriteTablesDoFn we start a 
> load job and wait for that job to finish.
> [https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L318]
>  
> In cases where we are trying to write a PCollection of tables (for example, 
> when users use the dynamic destinations feature), this relies on dynamic work 
> rebalancing to parallelize execution of load jobs. If the runner does not 
> support dynamic work rebalancing, or does not perform dynamic work rebalancing 
> for some reason, this could have significant performance drawbacks. For 
> example, scheduling times for load jobs will add up.
>  
> A better approach might be to start load jobs in the process() method but 
> wait for all load jobs to finish in the finishBundle() method. This will 
> parallelize any overheads as well as job execution (assuming more than one 
> job is scheduled by BQ).
>  
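
The proposal quoted above can be illustrated with a minimal, single-bundle 
sketch. It is not the WriteTables code: DeferredWaitSketch and the loadJob 
function are hypothetical stand-ins, and the process()/finishBundle() methods 
here only mimic the shape of a DoFn lifecycle using plain Java futures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Sketch only: mimics a DoFn-like lifecycle; not the actual WriteTables code. */
final class DeferredWaitSketch {
  private final ExecutorService executor = Executors.newCachedThreadPool();
  // Hypothetical stand-in: runs the load job for a table and returns its result.
  private final Function<String, String> loadJob;
  private final List<CompletableFuture<String>> inFlight = new ArrayList<>();

  DeferredWaitSketch(Function<String, String> loadJob) {
    this.loadJob = loadJob;
  }

  /** Analogue of process(): kick off the job and return without waiting. */
  void process(String table) {
    inFlight.add(CompletableFuture.supplyAsync(() -> loadJob.apply(table), executor));
  }

  /** Analogue of finishBundle(): wait once for every job started in the bundle. */
  List<String> finishBundle() {
    try {
      List<String> results = new ArrayList<>();
      for (CompletableFuture<String> job : inFlight) {
        // join() rethrows a job failure wrapped in a CompletionException.
        results.add(job.join());
      }
      return results;
    } finally {
      inFlight.clear();
      executor.shutdown(); // this sketch handles exactly one bundle
    }
  }
}
```

Because every job is already running by the time finishBundle() blocks, the 
wait is bounded by the slowest job rather than by the sum of all of them, 
which is the scheduling-time saving the description refers to.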



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
