Daniel Halperin created BEAM-50:
-----------------------------------

             Summary: BigQueryIO.Write: reimplement in Java
                 Key: BEAM-50
                 URL: https://issues.apache.org/jira/browse/BEAM-50
             Project: Beam
          Issue Type: New Feature
          Components: sdk-java-gcp
            Reporter: Daniel Halperin
            Priority: Minor


BigQueryIO.Write is currently implemented in a somewhat hacky way.

Unbounded sink:
* The DirectPipelineRunner and the DataflowPipelineRunner use StreamingWriteFn 
and BigQueryTableInserter to insert rows using BigQuery's streaming writes API.

Bounded sink:
* The DirectPipelineRunner still uses streaming writes.
* The DataflowPipelineRunner uses a different code path in the Google Cloud 
Dataflow service that writes to GCS and then initiates a BigQuery load job.
* Per-window table destinations do not scale. (See BEAM-XXX.)

We need to reimplement BigQueryIO.Write fully in Java code in order to support 
other runners in a scalable way.

I additionally suggest that we revisit the design of the BigQueryIO sink in the 
process. A short list:

* Do not use TableRow as the default value for rows. It could be Map<String, 
Object> with well-defined types, for example, or an Avro GenericRecord. 
Dropping TableRow will get around a variety of issues with types, fields named 
'f', etc., and it will also reduce confusion as we use TableRow objects 
differently than usual (for good reason).
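As a sketch of the first point, a plain Map<String, Object> could carry a row with well-defined Java types. The field names and values below are invented for illustration; this is not a proposed API, just what dropping TableRow might look like for users:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a plain Map<String, Object> standing in for TableRow,
// with ordinary Java types per BigQuery column type.
public class MapRowSketch {
    public static Map<String, Object> exampleRow() {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("user_id", 42L);   // INTEGER -> Long
        row.put("name", "alice");  // STRING  -> String
        row.put("score", 3.5);     // FLOAT   -> Double
        row.put("active", true);   // BOOLEAN -> Boolean
        // A column literally named "f" is unproblematic here, unlike with
        // TableRow, where "f" collides with the cell-list accessor.
        row.put("f", "just another field");
        return row;
    }

    public static void main(String[] args) {
        System.out.println(exampleRow());
    }
}
```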

* Possibly support not-knowing the schema until pipeline execution time.

* Our builders for BigQueryIO.Write are useful and we should keep them. Where 
possible we should also allow users to provide the JSON objects that configure 
the underlying table creation, write disposition, etc. This would let users 
directly control things like table expiration time and table location, and 
would also let users take advantage of new BigQuery features without SDK code 
changes.
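One way the pass-through idea could look: typed builder methods for the common cases, plus an escape hatch that forwards raw configuration keys to the BigQuery API untouched. Every name below is hypothetical, invented to illustrate the pattern, and is not the actual BigQueryIO.Write API:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical builder sketch: typed methods plus a raw-option escape hatch.
public class WriteConfigSketch {
    private final Map<String, Object> config = new LinkedHashMap<>();

    public WriteConfigSketch to(String table) {
        config.put("destinationTable", table);
        return this;
    }

    public WriteConfigSketch withWriteDisposition(String disposition) {
        config.put("writeDisposition", disposition);
        return this;
    }

    // Escape hatch: forward any table/job option verbatim, so users could
    // set e.g. an expiration time or a new BigQuery option without SDK changes.
    public WriteConfigSketch withRawOption(String key, Object value) {
        config.put(key, value);
        return this;
    }

    public Map<String, Object> build() {
        return Collections.unmodifiableMap(config);
    }

    public static void main(String[] args) {
        Map<String, Object> cfg = new WriteConfigSketch()
            .to("project:dataset.table")
            .withWriteDisposition("WRITE_APPEND")
            .withRawOption("expirationTime", 86400000L)
            .build();
        System.out.println(cfg);
    }
}
```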

* We could choose between the streaming write API and load jobs based on user 
preference or dynamic job properties. We could use streaming writes in a batch 
pipeline if the data is small. We could use load jobs in streaming pipelines if 
the windows are large enough to make this practical.
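A minimal sketch of such a decision, with thresholds invented purely for illustration (the issue does not specify any numbers):

```java
// Hypothetical heuristic for choosing between streaming inserts and load
// jobs. The thresholds are invented examples, not values from the issue.
public class WriteMethodChooser {
    public enum Method { STREAMING_INSERTS, LOAD_JOB }

    // Example cutoff: below ~100 MB, streaming a batch result may be fine.
    static final long STREAMING_MAX_BYTES = 100L * 1024 * 1024;
    // Example cutoff: windows of >= 5 minutes leave room to run a load job.
    static final long LOAD_JOB_MIN_WINDOW_MILLIS = 5L * 60 * 1000;

    public static Method choose(boolean streamingPipeline,
                                long estimatedBytes,
                                long windowMillis) {
        if (streamingPipeline) {
            // Streaming pipeline: large windows can amortize a load job.
            return windowMillis >= LOAD_JOB_MIN_WINDOW_MILLIS
                ? Method.LOAD_JOB : Method.STREAMING_INSERTS;
        }
        // Batch pipeline: small outputs can just use streaming inserts.
        return estimatedBytes <= STREAMING_MAX_BYTES
            ? Method.STREAMING_INSERTS : Method.LOAD_JOB;
    }

    public static void main(String[] args) {
        System.out.println(choose(false, 10L * 1024 * 1024, 0));
        System.out.println(choose(true, 0, 10L * 60 * 1000));
    }
}
```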

* When issuing BigQuery load jobs, we could leave files in GCS if the import 
fails, so that data errors can be debugged.

* We should make per-window table writes scalable in batch.

Caveat, possibly blocker:
* (BEAM-XXX): cleanup and temp file management. One advantage of the Google 
Cloud Dataflow implementation of BigQueryIO.Write is cleanup: we ensure that 
intermediate files are deleted when bundles or jobs fail, etc. Beam does not 
currently support this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)