[ https://issues.apache.org/jira/browse/BEAM-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243836#comment-15243836 ]
ASF GitHub Bot commented on BEAM-50:
------------------------------------

GitHub user peihe opened a pull request:

    https://github.com/apache/incubator-beam/pull/193

[BEAM-50] Remove BigQueryIO.Write.Bound translator

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/peihe/incubator-beam remove-bq-write-translator

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-beam/pull/193.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #193

----
commit 5dd12fe94a5ea2249cf5156c16df1b22aa663baf
Author: Pei He <pe...@google.com>
Date:   2016-04-15T23:19:54Z

    [BEAM-50] Remove BigQueryIO.Write.Bound translator

----

> BigQueryIO.Write: reimplement in Java
> -------------------------------------
>
>          Key: BEAM-50
>          URL: https://issues.apache.org/jira/browse/BEAM-50
>      Project: Beam
>   Issue Type: New Feature
>   Components: sdk-java-gcp
>     Reporter: Daniel Halperin
>     Priority: Minor
>
> BigQueryIO.Write is currently implemented in a somewhat hacky way.
>
> Unbounded sink:
> * The DirectPipelineRunner and the DataflowPipelineRunner use
>   StreamingWriteFn and BigQueryTableInserter to insert rows using BigQuery's
>   streaming writes API.
>
> Bounded sink:
> * The DirectPipelineRunner still uses streaming writes.
> * The DataflowPipelineRunner uses a different code path in the Google Cloud
>   Dataflow service that writes to GCS and then initiates a BigQuery load job.
> * Per-window table destinations do not work scalably. (See Beam-XXX.)
>
> We need to reimplement BigQueryIO.Write fully in Java code in order to
> support other runners in a scalable way.
>
> I additionally suggest that we revisit the design of the BigQueryIO sink in
> the process. A short list:
> * Do not use TableRow as the default value for rows. It could be Map<String,
>   Object> with well-defined types, for example, or an Avro GenericRecord.
>   Dropping TableRow will get around a variety of issues with types, fields
>   named 'f', etc., and it will also reduce confusion, as we use TableRow
>   objects differently than usual (for good reason).
> * Possibly support not knowing the schema until pipeline execution time.
> * Our builders for BigQueryIO.Write are useful and we should keep them. Where
>   possible we should also allow users to provide the JSON objects that
>   configure the underlying table creation, write disposition, etc. This would
>   let users directly control things like table expiration time and table
>   location, and would also let users take advantage of some new BigQuery
>   features without code changes.
> * We could choose between the streaming write API and load jobs based on user
>   preference or dynamic job properties. We could use streaming writes in a
>   batch pipeline if the data is small. We could use load jobs in streaming
>   pipelines if the windows are large enough to make this practical.
> * When issuing BigQuery load jobs, we could leave files in GCS if the import
>   fails, so that data errors can be debugged.
> * We should make per-window table writes scalable in batch.
>
> Caveat, possibly blocker:
> * (Beam-XXX): cleanup and temp file management. One advantage of the Google
>   Cloud Dataflow implementation of BigQueryIO.Write is cleanup: we ensure
>   that intermediate files are deleted when bundles or jobs fail, etc. Beam
>   does not currently support this.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
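[Editorial note] The streaming-inserts-versus-load-jobs choice proposed in the list above could be sketched as follows. This is a minimal illustration, not Beam code: the class name, method name, and the 100 MB cutoff are all assumptions made up for the example.

```java
// Sketch of choosing an insertion method from job properties, as the
// issue suggests. All names and the threshold are illustrative.
public class WriteMethodChooser {

    enum WriteMethod { STREAMING_INSERTS, LOAD_JOB }

    // Unbounded input must use streaming inserts; bounded input uses
    // streaming inserts only when the data is small, otherwise it is
    // staged to GCS and loaded with a BigQuery load job.
    static WriteMethod chooseWriteMethod(boolean boundedInput, long estimatedBytes) {
        final long SMALL_DATA_THRESHOLD = 100L * 1024 * 1024; // hypothetical 100 MB cutoff
        if (!boundedInput) {
            return WriteMethod.STREAMING_INSERTS;
        }
        return estimatedBytes < SMALL_DATA_THRESHOLD
            ? WriteMethod.STREAMING_INSERTS
            : WriteMethod.LOAD_JOB;
    }

    public static void main(String[] args) {
        System.out.println(chooseWriteMethod(false, 0));                      // unbounded
        System.out.println(chooseWriteMethod(true, 10L * 1024 * 1024));       // small batch
        System.out.println(chooseWriteMethod(true, 5L * 1024 * 1024 * 1024)); // large batch
    }
}
```

A per-window variant of the same rule would cover the issue's streaming case: if a window's data is large enough, a load job becomes practical even in a streaming pipeline.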
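[Editorial note] The first design point, replacing TableRow with a plain Map<String, Object> of well-typed values, could look like the hypothetical sketch below. The field names and types are invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative row representation: a plain Map<String, Object> with
// well-defined Java value types, instead of the BigQuery TableRow class.
public class RowAsMapExample {

    static Map<String, Object> makeRow(String name, long count, double score) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("name", name);   // STRING
        row.put("count", count); // INT64 as a real Java long
        row.put("score", score); // FLOAT64 as a real Java double
        return row;
    }

    public static void main(String[] args) {
        // A field literally named "f" is unambiguous here; in TableRow it
        // collides with the JSON key BigQuery uses for the list of cells,
        // which is one of the issues the description mentions.
        Map<String, Object> row = makeRow("f", 3L, 0.5);
        System.out.println(row);
    }
}
```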