[ https://issues.apache.org/jira/browse/BEAM-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Halperin updated BEAM-48:
--------------------------------
    Assignee: Pei He

> BigQueryIO.Read reimplemented as BoundedSource
> ----------------------------------------------
>
>                 Key: BEAM-48
>                 URL: https://issues.apache.org/jira/browse/BEAM-48
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-gcp
>            Reporter: Daniel Halperin
>            Assignee: Pei He
>
> BigQueryIO.Read is currently implemented in a hacky way: the
> DirectPipelineRunner streams all rows in the table or query result directly
> using the JSON API, in a single-threaded manner. In contrast, the
> DataflowPipelineRunner uses an entirely different code path implemented in
> the Google Cloud Dataflow service (a BigQuery export job to GCS, followed by
> a parallel read from GCS).
>
> We need to reimplement BigQueryIO as a BoundedSource in order to support
> other runners in a scalable way.
>
> I additionally suggest that we revisit the design of the BigQueryIO source
> in the process. A short list:
> * Do not use TableRow as the default value type for rows. It could be a
>   Map<String, Object> with well-defined types, for example, or an Avro
>   GenericRecord. Dropping TableRow will get around a variety of issues with
>   types, fields named 'f', etc., and it will also reduce confusion, since we
>   use TableRow objects differently than usual (for good reason).
> * We could also directly add support for a RowParser to a user's POJO.
> * We should expose the TableSchema as a side output from BigQueryIO.Read.
> * Our builders for BigQueryIO.Read are useful and we should keep them.
>   Where possible we should also allow users to provide the JSON objects that
>   configure the underlying intermediate tables, query export, etc. This
>   would let users directly control result flattening, the location of
>   intermediate tables, table decorators, etc., and would also let users
>   opportunistically take advantage of new BigQuery features without code
>   changes.
> * We could switch between a BigQuery export + parallel scan from GCS and a
>   direct API read based on factors such as the size of the table at
>   pipeline construction time.
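For illustration, below is a minimal sketch of what a BoundedSource-based read could look like, assuming a recent Apache Beam Java SDK. The class and helper names (BigQueryTableSource, TableReader) are hypothetical, the element type is a plain JSON String rather than the Map or GenericRecord values suggested above, and the exact BoundedSource/BoundedReader signatures vary across SDK versions; this is not the implementation tracked by this issue.

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.BoundedSource;
import org.apache.beam.sdk.options.PipelineOptions;

/** Hypothetical bounded source that reads one BigQuery table as JSON strings. */
class BigQueryTableSource extends BoundedSource<String> {

  private final String tableSpec; // e.g. "project:dataset.table"

  BigQueryTableSource(String tableSpec) {
    this.tableSpec = tableSpec;
  }

  @Override
  public long getEstimatedSizeBytes(PipelineOptions options) {
    // A real implementation would call tables.get() for tableSpec and return numBytes.
    // Runners use this for bundle sizing; the issue also suggests using it to
    // choose between an export-and-scan and a direct API read.
    return 0L;
  }

  @Override
  public List<BigQueryTableSource> split(long desiredBundleSizeBytes, PipelineOptions options) {
    // A real implementation would kick off a BigQuery export job to GCS and
    // return one sub-source per exported file so any runner can read in parallel.
    List<BigQueryTableSource> shards = new ArrayList<>();
    shards.add(this); // placeholder: no parallelism in this sketch
    return shards;
  }

  @Override
  public BoundedReader<String> createReader(PipelineOptions options) {
    return new TableReader(this);
  }

  @Override
  public Coder<String> getOutputCoder() {
    return StringUtf8Coder.of(); // rows carried as JSON strings in this sketch
  }

  /** Reader for one shard; a real reader would page through exported files or tabledata.list(). */
  private static class TableReader extends BoundedReader<String> {
    private final BigQueryTableSource source;
    private String current;

    TableReader(BigQueryTableSource source) {
      this.source = source;
    }

    @Override
    public boolean start() throws IOException {
      // Open the shard (exported file or first API page), then position on the first row.
      return advance();
    }

    @Override
    public boolean advance() throws IOException {
      // Fetch the next row here; this sketch produces no rows.
      current = null;
      return false;
    }

    @Override
    public String getCurrent() throws NoSuchElementException {
      if (current == null) {
        throw new NoSuchElementException();
      }
      return current;
    }

    @Override
    public void close() throws IOException {}

    @Override
    public BigQueryTableSource getCurrentSource() {
      return source;
    }
  }
}
{code}

In a fuller sketch, split() might only trigger the GCS export when getEstimatedSizeBytes() exceeds some threshold and otherwise fall back to paged API reads, along the lines of the last bullet above.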