Daniel Halperin created BEAM-48:
-----------------------------------

             Summary: BigQueryIO.Read reimplemented as BoundedSource
                 Key: BEAM-48
                 URL: https://issues.apache.org/jira/browse/BEAM-48
             Project: Beam
          Issue Type: New Feature
          Components: sdk-java-gcp
            Reporter: Daniel Halperin


BigQueryIO.Read is currently implemented in a hacky way: the 
DirectPipelineRunner streams all rows in the table or query result directly 
using the JSON API, in a single-threaded manner.

In contrast, the DataflowPipelineRunner uses an entirely different code path 
implemented in the Google Cloud Dataflow service. (A BigQuery export job to 
GCS, followed by a parallel read from GCS).

We need to reimplement BigQueryIO as a BoundedSource in order to support other 
runners in a scalable way.

I additionally suggest that we revisit the design of the BigQueryIO source in 
the process. A short list:

* Do not use TableRow as the default value for rows. It could be Map<String, 
Object> with well-defined types, for example, or an Avro GenericRecord. 
Dropping TableRow will get around a variety of issues with types, fields named 
'f', etc., and it will also reduce confusion as we use TableRow objects 
differently than usual (for good reason).

* We could also directly add support for a RowParser to a user's POJO.

* We should expose TableSchema as a side output from the BigQueryIO.Read.

* Our builders for BigQueryIO.Read are useful and we should keep them. Where 
possible we should also allow users to provide the JSON objects that configure 
the underlying intermediate tables, query export, etc. This would let users 
directly control result flattening, location of intermediate tables, table 
decorators, etc., and also optimistically let users take advantage of some new 
BigQuery features without code changes.

* We could use switch between whether we use a BigQuery export + parallel scan 
vs API read based on factors such as the size of the table at pipeline 
construction time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to