Re: Understanding BigQueryIO.Read performance and options

Eugene Kirpichov Sat, 09 Sep 2017 10:54:07 -0700

This is a bit confusing - BigQueryQuerySource and BigQueryTableSource
indeed use the REST API to read rows if you read them unsplit - however, in
split() they run extract jobs and produce a bunch of Avro sources that are
read in parallel. I'm not sure we have any use cases for reading them
unsplit (except unit tests) - perhaps that code path can be removed?


About outputting non-TableRow: per
https://beam.apache.org/contribute/ptransform-style-guide/#choosing-types-of-input-and-output-pcollections,
it is recommended to output the native type of the connector, unless it's
impossible to provide a coder for it. That exception applies to
AvroIO.parseGenericRecords, but not to TableRow, so I would recommend
against adding it to BigQueryIO: you can always map a TableRow to something
else using MapElements.
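
As a rough sketch of that last point (not code from Beam itself): TableRow
is a Map<String, Object>, so the conversion can be written against a plain
Map and tested without a pipeline. The User type and the "name"/"age"
columns below are made up for illustration, and parsing through
String.valueOf is an assumption meant to cover INTEGER values arriving as
either strings or integral numbers.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical converter from a BigQuery row (viewed as a Map) to a typed value. */
final class UserRowMapper {

  /** Hypothetical target type; Serializable so a serialization-based coder could handle it. */
  static final class User implements Serializable {
    final String name;
    final long age;

    User(String name, long age) {
      this.name = name;
      this.age = age;
    }
  }

  /** Converts one row; "name" and "age" are assumed to be required columns. */
  static User fromRow(Map<String, Object> row) {
    Object name = Objects.requireNonNull(row.get("name"), "name is missing");
    Object age = Objects.requireNonNull(row.get("age"), "age is missing");
    // String.valueOf handles both a String ("36") and an integral Number (36).
    return new User(String.valueOf(name), Long.parseLong(String.valueOf(age)));
  }

  public static void main(String[] args) {
    Map<String, Object> row = new HashMap<>();
    row.put("name", "Ada");
    row.put("age", "36");
    User user = fromRow(row);
    System.out.println(user.name + " is " + user.age);
  }
}
```

In a pipeline this would plug in roughly as

    rows.apply(MapElements.into(TypeDescriptor.of(UserRowMapper.User.class))
        .via((TableRow row) -> UserRowMapper.fromRow(row)));

assuming a coder for User can be inferred or is set explicitly on the
output.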

On Sat, Sep 9, 2017 at 10:37 AM Reuven Lax wrote:

> Hi Steve,
>
> The BigQuery source should always use extract jobs, regardless of
> withTemplateCompatibility. What makes you think otherwise?
>
> Reuven
>
>
> On Sat, Sep 9, 2017 at 9:35 AM, Steve Niemitz wrote:
>
> > Hello!
> >
> > Until now I've been using a custom-built alternative to BigQueryIO.Read
> > that manually runs a BigQuery extract job (to avro), then uses
> > AvroIO.parseGenericRecords() to read the output.
> >
> > I'm investigating instead enhancing the actual BigQueryIO.Read to allow
> > something similar, since it appears a good amount of the plumbing is
> > already in place to do this.  However, I'm confused by some of the
> > implementation details.
> >
> > To start, it seems like there are two different read paths:
> > - If "withTemplateCompatibility" is set, a similar method to what I
> > described above is used; an extract job is started to export to avro, and
> > AvroSource is used to read files and transform them into TableRows.
> >
> > - However, if not set, the BigQueryReader class simply uses the REST API to
> > read rows from the tables.  This method, I've seen in practice, has some
> > significant performance limitations.
> >
> > It seems to me that for large tables, I'd always want to use the first
> > method, however I'm not sure why the implementation is tied to the oddly
> > named "withTemplateCompatibility" option.  Does anyone have insight as to
> > the implementation details here?
> >
> > Additionally, would the community in general be open to enhancements
> > to BigQueryIO to allow the final output to be something other than
> > "TableRow" instances, similar to how AvroIO.parseGenericRecords takes a
> > parseFn?
> >
> > Thanks!
> >
>
