[ 
https://issues.apache.org/jira/browse/BEAM-50?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Halperin reassigned BEAM-50:
-----------------------------------

    Assignee: Daniel Halperin

> BigQueryIO.Write: reimplement in Java
> -------------------------------------
>
>                 Key: BEAM-50
>                 URL: https://issues.apache.org/jira/browse/BEAM-50
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-gcp
>            Reporter: Daniel Halperin
>            Assignee: Daniel Halperin
>            Priority: Minor
>
> BigQueryIO.Write is currently implemented in a somewhat hacky way.
> Unbounded sink:
> * The DirectPipelineRunner and the DataflowPipelineRunner use 
> StreamingWriteFn and BigQueryTableInserter to insert rows using BigQuery's 
> streaming writes API (a sketch of this path follows below).
> Bounded sink:
> * The DirectPipelineRunner still uses streaming writes.
> * The DataflowPipelineRunner uses a different code path in the Google Cloud 
> Dataflow service that writes to GCS and then initiates a BigQuery load job.
> * Per-window table destinations do not work scalably. (See BEAM-XXX).
> We need to reimplement BigQueryIO.Write fully in Java code in order to 
> support other runners in a scalable way.
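> For concreteness, here is a minimal sketch of what the streaming-writes path 
> amounts to, using the com.google.api.services.bigquery client. This is 
> illustrative only: client construction, batching, and retry logic are elided, 
> and the class and method names are mine, not the actual StreamingWriteFn code.
>
>   import com.google.api.services.bigquery.Bigquery;
>   import com.google.api.services.bigquery.model.TableDataInsertAllRequest;
>   import com.google.api.services.bigquery.model.TableDataInsertAllResponse;
>   import com.google.api.services.bigquery.model.TableRow;
>   import java.io.IOException;
>   import java.util.ArrayList;
>   import java.util.List;
>
>   /** Illustrative only: roughly what the streaming-writes path does per bundle. */
>   class StreamingInsertSketch {
>     static void streamRows(Bigquery client, String project, String dataset,
>         String table, List<TableRow> rows, List<String> insertIds) throws IOException {
>       List<TableDataInsertAllRequest.Rows> content = new ArrayList<>();
>       for (int i = 0; i < rows.size(); i++) {
>         content.add(new TableDataInsertAllRequest.Rows()
>             .setInsertId(insertIds.get(i))  // dedup id so retries are idempotent
>             .setJson(rows.get(i)));         // TableRow is a Map<String, Object> underneath
>       }
>       TableDataInsertAllResponse response = client.tabledata()
>           .insertAll(project, dataset, table,
>               new TableDataInsertAllRequest().setRows(content))
>           .execute();
>       if (response.getInsertErrors() != null && !response.getInsertErrors().isEmpty()) {
>         throw new IOException("Failed inserts: " + response.getInsertErrors());
>       }
>     }
>   }
>
> The real code adds batching and retry handling on top of this call, but this 
> insertAll request is what ties the current implementation to the streaming API.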
> I additionally suggest that we revisit the design of the BigQueryIO sink in 
> the process. A short list:
> * Do not use TableRow as the default value type for rows. It could be 
> Map<String, Object> with well-defined types, for example, or an Avro 
> GenericRecord (see the row sketch after this list). Dropping TableRow will 
> get around a variety of issues with types, fields named 'f', etc., and it 
> will also reduce confusion, since we use TableRow objects differently than 
> usual (for good reason).
> * Possibly support schemas that are not known until pipeline execution time.
> * Our builders for BigQueryIO.Write are useful and we should keep them. Where 
> possible we should also allow users to provide the JSON objects that 
> configure the underlying table creation, write disposition, etc. (see the 
> builder sketch after this list). This would let users directly control 
> things like table expiration time, table location, etc., and it would also 
> let users optimistically take advantage of some new BigQuery features 
> without code changes.
> * We could choose between the streaming write API and load jobs based on user 
> preference or dynamic job properties. We could use streaming writes in a 
> batch pipeline if the data is small, and load jobs in a streaming pipeline if 
> the windows are large enough to make this practical.
> * When issuing BigQuery load jobs, we could leave files in GCS if the import 
> fails, so that data errors can be debugged.
> * We should make per-window table writes scalable in batch.
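> Regarding the row-representation bullet above, here is a minimal sketch of 
> what rows could look like as a plain Map or an Avro GenericRecord. The schema 
> and field names are made up for illustration; this is not a proposed API.
>
>   import org.apache.avro.Schema;
>   import org.apache.avro.SchemaBuilder;
>   import org.apache.avro.generic.GenericData;
>   import org.apache.avro.generic.GenericRecord;
>   import java.util.HashMap;
>   import java.util.Map;
>
>   /** Illustrative only: rows without TableRow. */
>   class RowRepresentationSketch {
>     static Map<String, Object> mapRow() {
>       Map<String, Object> row = new HashMap<>();
>       row.put("user_id", 42L);   // INTEGER carried as a Long, not a JSON string
>       row.put("score", 3.14);    // FLOAT carried as a Double
>       row.put("f", "plain");     // a column literally named "f" is unambiguous here
>       return row;
>     }
>
>     static GenericRecord avroRow() {
>       Schema schema = SchemaBuilder.record("Event").fields()
>           .requiredLong("user_id")
>           .requiredDouble("score")
>           .endRecord();
>       GenericRecord record = new GenericData.Record(schema);
>       record.put("user_id", 42L);
>       record.put("score", 3.14);
>       return record;
>     }
>   }
>
> Either representation keeps field types explicit and avoids the clash with 
> TableRow's own 'f' field.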
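> And for the builder bullet, a rough sketch of what the configuration 
> pass-through and the streaming-vs-load-job choice could look like. None of 
> these names exist today: the Options holder, withTableConfiguration, 
> withMethod, and the InsertMethod enum are hypothetical stand-ins for extra 
> builder methods on BigQueryIO.Write.
>
>   import com.google.api.services.bigquery.model.Table;
>
>   /** Hypothetical sketch; nothing here exists in the SDK today. */
>   class WriteOptionsSketch {
>     // Hypothetical: explicit choice between the two insertion mechanisms.
>     enum InsertMethod { STREAMING_INSERTS, LOAD_JOBS }
>
>     // Hypothetical option holder standing in for new builder methods.
>     static class Options {
>       Table tableConfiguration;  // raw Table resource, passed through at table creation
>       InsertMethod method;
>
>       Options withTableConfiguration(Table t) { this.tableConfiguration = t; return this; }
>       Options withMethod(InsertMethod m) { this.method = m; return this; }
>     }
>
>     static Options example() {
>       // Passing the raw Table resource lets users set expiration, names, and
>       // whatever else BigQuery adds later, without SDK changes.
>       Table tableSpec = new Table()
>           .setExpirationTime(System.currentTimeMillis() + 7L * 24 * 3600 * 1000)
>           .setFriendlyName("Events, expires in a week");
>
>       return new Options()
>           .withTableConfiguration(tableSpec)
>           .withMethod(InsertMethod.LOAD_JOBS);  // or STREAMING_INSERTS for small data
>     }
>   }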
> Caveat, possibly a blocker:
> * (BEAM-XXX): cleanup and temp file management. One advantage of the Google 
> Cloud Dataflow implementation of BigQueryIO.Write is cleanup: we ensure that 
> intermediate files are deleted when bundles or jobs fail, etc. Beam does not 
> currently support this.



