Hello!

Until now I've been using a custom-built alternative to BigQueryIO.Read
that manually runs a BigQuery extract job (to Avro) and then uses
AvroIO.parseGenericRecords() to read the output.
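
For context, here's a rough sketch of that approach, assuming the extract
job has already written Avro files to a known GCS prefix (the path and the
Serializable MyRecord type are placeholders):

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.values.PCollection;

// 1) Out of band: run a BigQuery extract job with
//    destinationFormat=AVRO targeting gs://my-bucket/extract/
//    (placeholder path).
// 2) Parse each GenericRecord straight into a domain type,
//    skipping TableRow entirely.
PCollection<MyRecord> records =
    p.apply("ReadExtractedAvro",
        AvroIO.parseGenericRecords(
                (GenericRecord r) ->
                    new MyRecord((Long) r.get("id"),
                                 String.valueOf(r.get("name"))))
            .from("gs://my-bucket/extract/*.avro")
            .withCoder(SerializableCoder.of(MyRecord.class)));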

I'm investigating instead enhancing the actual BigQueryIO.Read to allow
something similar, since it appears a good amount of the plumbing is
already in place to do this. However, I'm confused by some of the
implementation details.

To start, it seems like there are two different read paths (both sketched
below):
- If "withTemplateCompatibility" is set, a method similar to the one I
described above is used: an extract job is started to export to Avro, and
AvroSource is used to read the files and transform them into TableRows.

- If it's not set, the BigQueryReader class simply uses the REST API to
read rows from the tables. In practice, I've seen this method run into
significant performance limitations.
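
In code, the choice between the two looks like this (sketch; the table
name is a placeholder, and depending on the SDK version the entry point
may be BigQueryIO.read() rather than readTableRows()):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;

// Per the description above: extract job -> Avro -> AvroSource.
PCollection<TableRow> viaExport =
    p.apply("ExportPath",
        BigQueryIO.readTableRows()
            .from("my-project:dataset.table")
            .withTemplateCompatibility());

// Per the description above: BigQueryReader over the REST API.
PCollection<TableRow> viaRest =
    p.apply("RestPath",
        BigQueryIO.readTableRows()
            .from("my-project:dataset.table"));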

It seems to me that for large tables I'd always want to use the first
method; however, I'm not sure why the implementation is tied to the oddly
named "withTemplateCompatibility" option. Does anyone have insight into
the implementation details here?

Additionally, would the community in general be open to enhancements to
BigQueryIO that allow the final output to be something other than
"TableRow" instances, similar to how AvroIO.parseGenericRecords takes a
parseFn?
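
Something along these lines, purely hypothetical (a parseGenericRecords
method on BigQueryIO does not exist today, and MyRecord is again a
placeholder type):

// Hypothetical API, mirroring AvroIO.parseGenericRecords --
// this does not exist in BigQueryIO today.
PCollection<MyRecord> records =
    p.apply("ReadAndParse",
        BigQueryIO.parseGenericRecords(
                (GenericRecord r) ->
                    new MyRecord((Long) r.get("id"),
                                 String.valueOf(r.get("name"))))
            .from("my-project:dataset.table"));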

Thanks!
