Daniel Halperin created BEAM-50:
-----------------------------------
Summary: BigQueryIO.Write: reimplement in Java
Key: BEAM-50
URL: https://issues.apache.org/jira/browse/BEAM-50
Project: Beam
Issue Type: New Feature
Components: sdk-java-gcp
Reporter: Daniel Halperin
Priority: Minor
BigQueryIO.Write is currently implemented in a somewhat hacky way.
Unbounded sink:
* The DirectPipelineRunner and the DataflowPipelineRunner use StreamingWriteFn
and BigQueryTableInserter to insert rows using BigQuery's streaming writes API.
Bounded sink:
* The DirectPipelineRunner still uses streaming writes.
* The DataflowPipelineRunner uses a different code path in the Google Cloud
Dataflow service that writes to GCS and then initiates a BigQuery load job.
* Per-window table destinations do not work scalably. (See BEAM-XXX).
We need to reimplement BigQueryIO.Write fully in Java code in order to support
other runners in a scalable way.
I additionally suggest that we revisit the design of the BigQueryIO sink in the
process. A short list:
* Do not use TableRow as the default value for rows. It could instead be a
Map<String, Object> with well-defined types, for example, or an Avro
GenericRecord. Dropping TableRow would avoid a variety of issues with types,
fields named 'f', etc., and would also reduce confusion, since we currently use
TableRow objects differently than usual (for good reason).
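To illustrate the first point, here is a minimal sketch (not the actual Beam API) of rows as plain Map<String, Object> values; note that a field legitimately named "f" is just an ordinary key here, with none of the special meaning it carries in TableRow's JSON representation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: rows as Map<String, Object> with well-defined types,
// rather than TableRow. The row(...) helper is illustrative only.
public class MapRowExample {
    // Builds a row as an ordinary map from alternating key/value arguments.
    static Map<String, Object> row(Object... keyValues) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (int i = 0; i < keyValues.length; i += 2) {
            row.put((String) keyValues[i], keyValues[i + 1]);
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> r = row("name", "beam", "f", 42L);
        // "f" is a normal field name here, not the reserved repeated-field
        // property that TableRow's JSON form uses.
        System.out.println(r.get("f"));
    }
}
```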
* Possibly support not-knowing the schema until pipeline execution time.
* Our builders for BigQueryIO.Write are useful and we should keep them. Where
possible, we should also allow users to provide the JSON objects that configure
the underlying table creation, write disposition, etc. This would let users
directly control things like table expiration time and table location, and
would also, optimistically, let users take advantage of some new BigQuery
features without code changes.
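One possible shape for such a builder, sketched below with made-up names (WriteConfig, rawOption, etc. are not the actual Beam API): keep the fluent setters, but add an escape hatch that passes arbitrary configuration keys through to the underlying API request, here modeled as a simple map standing in for the JSON:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical builder sketch: fluent setters plus a raw-configuration
// passthrough, so users can set expiration, location, or newer BigQuery
// options the SDK does not yet model explicitly.
public class WriteConfig {
    private String table;
    private String writeDisposition = "WRITE_APPEND";
    private final Map<String, Object> rawOptions = new LinkedHashMap<>();

    public WriteConfig to(String table) { this.table = table; return this; }
    public WriteConfig withWriteDisposition(String d) { this.writeDisposition = d; return this; }
    // Escape hatch: any key the underlying API understands is forwarded as-is.
    public WriteConfig rawOption(String key, Object value) { rawOptions.put(key, value); return this; }

    public String table() { return table; }
    public String writeDisposition() { return writeDisposition; }
    public Map<String, Object> rawOptions() { return rawOptions; }
}
```

Usage might look like `new WriteConfig().to("project:dataset.table").rawOption("expirationTime", 86400000L)`, with the raw options merged into the table-creation request.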
* We could choose between the streaming write API and load jobs based on user
preference or dynamic job properties. For example, we could use streaming
writes in a batch pipeline if the data is small, and load jobs in a streaming
pipeline if the windows are large enough to make this practical.
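That selection logic could be as simple as the sketch below; the thresholds and names are made up for illustration, not a proposed default:

```java
// Hypothetical sketch of write-method selection: in batch, small inputs can
// use streaming inserts and large ones a load job; in streaming, load jobs
// only make sense when windows are long enough to amortize the job overhead.
public class WriteMethodChooser {
    enum Method { STREAMING_INSERTS, LOAD_JOB }

    static final long SMALL_BATCH_BYTES = 10L * 1024 * 1024;    // 10 MiB, illustrative
    static final long MIN_LOAD_WINDOW_MILLIS = 5L * 60 * 1000;  // 5 minutes, illustrative

    static Method choose(boolean boundedInput, long estimatedBytes, long windowMillis) {
        if (boundedInput) {
            // Batch pipeline: stream small inputs, load everything else.
            return estimatedBytes < SMALL_BATCH_BYTES
                ? Method.STREAMING_INSERTS : Method.LOAD_JOB;
        }
        // Streaming pipeline: load jobs only for sufficiently large windows.
        return windowMillis >= MIN_LOAD_WINDOW_MILLIS
            ? Method.LOAD_JOB : Method.STREAMING_INSERTS;
    }
}
```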
* When issuing BigQuery load jobs, we could leave files in GCS if the import
fails, so that data errors can be debugged.
* We should make per-window table writes scalable in batch.
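For the per-window case, one building block is deriving a destination table name from the window itself, so each window's rows can be routed and loaded independently. A minimal sketch, with an illustrative daily naming scheme that is not the actual Beam behavior:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch: compute a per-window destination table from the
// window's start timestamp. The suffix scheme here (one table per UTC day)
// is illustrative only.
public class WindowedTableNamer {
    private static final DateTimeFormatter DAY =
        DateTimeFormatter.ofPattern("yyyyMMdd").withZone(ZoneOffset.UTC);

    // Rows grouped by this key can be written by independent load jobs,
    // one per window, rather than funneled through a single worker.
    static String tableFor(String baseTable, Instant windowStart) {
        return baseTable + "_" + DAY.format(windowStart);
    }
}
```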
Caveat, possibly blocker:
* (BEAM-XXX): cleanup and temp file management. One advantage of the Google
Cloud Dataflow implementation of BigQueryIO.Write is cleanup: we ensure that
intermediate files are deleted when bundles or jobs fail, etc. Beam does not
currently support this.
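The cleanup discipline described above amounts to "delete the intermediate file unless it was successfully handed off." A minimal sketch of that pattern, using a local temp file to stand in for a GCS object (writeBundle and simulateFailure are illustrative names, not Beam code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: stage intermediate output to a temp file and delete
// it in a finally block if the bundle fails before handoff, so failed
// bundles do not leak files.
public class TempFileCleanup {
    static Path writeBundle(byte[] bytes, boolean simulateFailure) throws IOException {
        Path temp = Files.createTempFile("bq-load-", ".json");
        boolean committed = false;
        try {
            Files.write(temp, bytes);
            if (simulateFailure) {
                throw new IOException("bundle failed");
            }
            committed = true;
            return temp;
        } finally {
            // Only keep the file if the bundle committed successfully.
            if (!committed) {
                Files.deleteIfExists(temp);
            }
        }
    }
}
```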
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)