[ https://issues.apache.org/jira/browse/BEAM-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236090#comment-15236090 ]
ASF GitHub Bot commented on BEAM-50:
------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-beam/pull/48

> BigQueryIO.Write: reimplement in Java
> -------------------------------------
>
>                  Key: BEAM-50
>                  URL: https://issues.apache.org/jira/browse/BEAM-50
>              Project: Beam
>           Issue Type: New Feature
>           Components: sdk-java-gcp
>             Reporter: Daniel Halperin
>             Priority: Minor
>
> BigQueryIO.Write is currently implemented in a somewhat hacky way.
>
> Unbounded sink:
> * The DirectPipelineRunner and the DataflowPipelineRunner use StreamingWriteFn and BigQueryTableInserter to insert rows using BigQuery's streaming-write API (see sketch 1 below).
>
> Bounded sink:
> * The DirectPipelineRunner still uses streaming writes.
> * The DataflowPipelineRunner uses a different code path in the Google Cloud Dataflow service that writes to GCS and then initiates a BigQuery load job.
> * Per-window table destinations do not work scalably. (See Beam-XXX.)
>
> We need to reimplement BigQueryIO.Write fully in Java code in order to support other runners in a scalable way.
>
> I additionally suggest that we revisit the design of the BigQueryIO sink in the process. A short list:
> * Do not use TableRow as the default value type for rows. It could be Map<String, Object> with well-defined types, for example, or an Avro GenericRecord. Dropping TableRow avoids a variety of issues with types, fields named 'f', etc., and it also reduces confusion, since we use TableRow objects differently than usual (for good reason). (See sketch 2 below.)
> * Possibly support not knowing the schema until pipeline execution time.
> * Our builders for BigQueryIO.Write are useful and we should keep them. Where possible we should also allow users to provide the JSON objects that configure the underlying table creation, write disposition, etc. This would let users directly control things like table expiration time, table location, etc., and would also let users optimistically take advantage of some new BigQuery features without code changes. (See sketch 3 below.)
> * We could choose between the streaming-write API and load jobs based on user preference or dynamic job properties. We could use streaming writes in a batch pipeline if the data is small. We could use load jobs in streaming pipelines if the windows are large enough to make this practical. (See sketch 4 below.)
> * When issuing BigQuery load jobs, we could leave files in GCS if the import fails, so that data errors can be debugged.
> * We should make per-window table writes scalable in batch. (See sketch 5 below.)
>
> Caveat, possibly a blocker:
> * (Beam-XXX): cleanup and temp-file management. One advantage of the Google Cloud Dataflow implementation of BigQueryIO.Write is cleanup: we ensure that intermediate files are deleted when bundles or jobs fail, etc. Beam does not currently support this.
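
Sketch 1: the streaming-write path. A minimal sketch of the batching that StreamingWriteFn and BigQueryTableInserter do today, written directly against the BigQuery API client (com.google.api.services.bigquery). The StreamingInserter class, the 500-row batch cap, and the error handling are illustrative assumptions; only the tabledata().insertAll() call is the client's real API.

    import com.google.api.services.bigquery.Bigquery;
    import com.google.api.services.bigquery.model.TableDataInsertAllRequest;
    import com.google.api.services.bigquery.model.TableDataInsertAllResponse;
    import com.google.api.services.bigquery.model.TableRow;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    /** Hypothetical helper: buffers rows, flushes via BigQuery streaming insertAll. */
    class StreamingInserter {
      private static final int MAX_BATCH_SIZE = 500;  // illustrative per-request cap

      private final Bigquery client;
      private final String project, dataset, table;
      private final List<TableDataInsertAllRequest.Rows> buffer = new ArrayList<>();

      StreamingInserter(Bigquery client, String project, String dataset, String table) {
        this.client = client;
        this.project = project;
        this.dataset = dataset;
        this.table = table;
      }

      void add(TableRow row, String insertId) throws IOException {
        // insertId lets BigQuery de-duplicate retried rows.
        buffer.add(new TableDataInsertAllRequest.Rows().setInsertId(insertId).setJson(row));
        if (buffer.size() >= MAX_BATCH_SIZE) {
          flush();
        }
      }

      void flush() throws IOException {
        if (buffer.isEmpty()) {
          return;
        }
        TableDataInsertAllRequest request = new TableDataInsertAllRequest().setRows(buffer);
        TableDataInsertAllResponse response =
            client.tabledata().insertAll(project, dataset, table, request).execute();
        if (response.getInsertErrors() != null && !response.getInsertErrors().isEmpty()) {
          throw new IOException("BigQuery streaming insert reported per-row errors");
        }
        buffer.clear();
      }
    }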
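
Sketch 2: a value type other than TableRow. A sketch of what a user-supplied row formatter could look like if the sink accepted Map<String, Object> rows with well-defined types; ClickEvent, RowFormatter, and the column names are hypothetical.

    import java.util.LinkedHashMap;
    import java.util.Map;

    /** Hypothetical user type, standing in for whatever the pipeline produces. */
    class ClickEvent {
      String userId;
      long latencyMillis;
    }

    /** Hypothetical formatter interface: one element in, one typed row out. */
    interface RowFormatter<T> {
      Map<String, Object> format(T element);
    }

    class ClickEventFormatter implements RowFormatter<ClickEvent> {
      @Override
      public Map<String, Object> format(ClickEvent e) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("user_id", e.userId);           // STRING column
        row.put("latency_ms", e.latencyMillis); // INTEGER column
        return row;
      }
    }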
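
Sketch 3: pass-through table configuration. A hypothetical builder shape for letting users hand the sink the raw Table resource it should create; BigQuerySinkConfig and its methods are invented for illustration, while Table and setExpirationTime are real model classes from the BigQuery API client.

    import com.google.api.services.bigquery.model.Table;

    /**
     * Hypothetical builder: the caller supplies the raw Table resource the sink
     * should create, so new BigQuery table features work without sink changes.
     */
    class BigQuerySinkConfig {
      private Table tableTemplate;
      private String writeDisposition = "WRITE_APPEND";  // BigQuery disposition string

      BigQuerySinkConfig withTableTemplate(Table template) {
        this.tableTemplate = template;
        return this;
      }

      BigQuerySinkConfig withWriteDisposition(String disposition) {
        this.writeDisposition = disposition;
        return this;
      }

      static BigQuerySinkConfig example() {
        // Control table expiration directly on the underlying resource (7 days out).
        Table template = new Table()
            .setExpirationTime(System.currentTimeMillis() + 7L * 24 * 3600 * 1000);
        return new BigQuerySinkConfig()
            .withTableTemplate(template)
            .withWriteDisposition("WRITE_TRUNCATE");
      }
    }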
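
Sketch 4: choosing between streaming writes and load jobs. A pure-Java illustration of the proposed selection logic; the WriteMethodChooser class is hypothetical and the 100 MiB threshold is arbitrary.

    /** Hypothetical selection between streaming inserts and load jobs. */
    class WriteMethodChooser {
      enum WriteMethod { STREAMING_INSERTS, FILE_LOADS }

      static WriteMethod choose(
          boolean boundedInput, long estimatedBytes, WriteMethod userPreference) {
        if (userPreference != null) {
          return userPreference;  // an explicit user preference always wins
        }
        long streamingCutoffBytes = 100L << 20;  // 100 MiB, illustrative only
        if (boundedInput) {
          // Small batch inputs could stream; large ones should use load jobs.
          return estimatedBytes <= streamingCutoffBytes
              ? WriteMethod.STREAMING_INSERTS
              : WriteMethod.FILE_LOADS;
        }
        // Unbounded input: streaming inserts by default; load jobs are only
        // practical when windows are large enough to amortize job overhead.
        return WriteMethod.STREAMING_INSERTS;
      }
    }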
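
Sketch 5: per-window table destinations. A hypothetical naming function showing the shape a scalable per-window sink would need, assuming the Beam IntervalWindow API; the class and the dated-suffix convention are illustrative.

    import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
    import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
    import org.joda.time.format.DateTimeFormat;
    import org.joda.time.format.DateTimeFormatter;

    /**
     * Hypothetical per-window naming: rows from each interval window land in a
     * dated table, e.g. my-project:logs.events_20160411.
     */
    class WindowedTableNames {
      private static final DateTimeFormatter DAY = DateTimeFormat.forPattern("yyyyMMdd");

      static String tableFor(
          String project, String dataset, String prefix, BoundedWindow window) {
        if (window instanceof IntervalWindow) {
          return project + ":" + dataset + "." + prefix + "_"
              + DAY.print(((IntervalWindow) window).start());
        }
        return project + ":" + dataset + "." + prefix;  // global window: one table
      }
    }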