On Sat, Sep 9, 2017 at 10:53 AM, Eugene Kirpichov <[email protected]> wrote:

> This is a bit confusing - BigQueryQuerySource and BigQueryTableSource
> indeed use the REST API to read rows if you read them unsplit - however,
> in split() they run extract jobs and produce a bunch of Avro sources that
> are read in parallel. I'm not sure we have any use cases for reading them
> unsplit (except unit tests) - perhaps that code path can be removed?

I believe split() will always be called in production. Maybe not in unit
tests?

> About outputting non-TableRow: per
> https://beam.apache.org/contribute/ptransform-style-guide/#choosing-types-of-input-and-output-pcollections,
> it is recommended to output the native type of the connector, unless it's
> impossible to provide a coder for it. This is the case for
> AvroIO.parseGenericRecords, but it's not the case for TableRow, so I would
> recommend against it: you can always map a TableRow to something else
> using MapElements.
>
> On Sat, Sep 9, 2017 at 10:37 AM Reuven Lax <[email protected]> wrote:
>
> > Hi Steve,
> >
> > The BigQuery source should always use extract jobs, regardless of
> > withTemplateCompatibility. What makes you think otherwise?
> >
> > Reuven
> >
> > On Sat, Sep 9, 2017 at 9:35 AM, Steve Niemitz <[email protected]>
> > wrote:
> >
> > > Hello!
> > >
> > > Until now I've been using a custom-built alternative to
> > > BigQueryIO.Read that manually runs a BigQuery extract job (to Avro),
> > > then uses AvroIO.parseGenericRecords() to read the output.
> > >
> > > I'm investigating enhancing the actual BigQueryIO.Read to allow
> > > something similar instead, since it appears a good amount of the
> > > plumbing is already in place to do this. However, I'm confused by
> > > some of the implementation details.
> > >
> > > To start, it seems there are two different read paths:
> > >
> > > - If "withTemplateCompatibility" is set, a method similar to the one
> > > I described above is used: an extract job is started to export the
> > > table to Avro, and AvroSource is used to read the files and transform
> > > them into TableRows.
> > >
> > > - However, if it is not set, the BigQueryReader class simply uses the
> > > REST API to read rows from the tables. In practice, I've seen this
> > > method hit some significant performance limitations.
> > >
> > > It seems to me that for large tables I'd always want to use the first
> > > method, but I'm not sure why the implementation is tied to the oddly
> > > named "withTemplateCompatibility" option. Does anyone have insight
> > > into the implementation details here?
> > >
> > > Additionally, would the community in general be open to enhancements
> > > to BigQueryIO that allow the final output to be something other than
> > > "TableRow" instances, similar to how AvroIO.parseGenericRecords takes
> > > a parseFn?
> > >
> > > Thanks!
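
For illustration, here is a minimal sketch of the pattern Eugene
recommends above - keep the connector's native TableRow output and convert
it downstream with MapElements. It assumes the Beam Java SDK of that era
(BigQueryIO.read(), which newer SDKs spell BigQueryIO.readTableRows()); the
MyRecord type, the table name, and the field names are hypothetical.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptor;

public class TableRowToDomain {

  // Hypothetical domain type; Serializable so Beam's fallback
  // SerializableCoder can encode it without a custom coder.
  public static class MyRecord implements java.io.Serializable {
    public final String name;
    public final long count;

    public MyRecord(String name, long count) {
      this.name = name;
      this.count = count;
    }
  }

  public static PCollection<MyRecord> read(Pipeline p) {
    return p
        // The connector's native output type: PCollection<TableRow>.
        .apply(BigQueryIO.read().from("project:dataset.table"))
        // Convert to the domain type downstream, per the style guide.
        .apply(MapElements
            .into(TypeDescriptor.of(MyRecord.class))
            .via((TableRow row) -> new MyRecord(
                (String) row.get("name"),
                Long.parseLong(String.valueOf(row.get("count"))))));
  }
}

MyRecord implements Serializable only so that coder inference succeeds; a
hand-rolled coder or AvroCoder would serve equally well.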

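And a sketch of the read half of the custom approach Steve describes -
after a BigQuery extract job (run separately against the BigQuery API) has
written Avro files to GCS, AvroIO.parseGenericRecords() reads them through
a user-supplied parseFn. The GCS filepattern and the record field are
hypothetical, and the extract job itself is not shown.

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.values.PCollection;

public class ParseExtractedAvro {

  public static PCollection<String> read(Pipeline p) {
    return p.apply(
        AvroIO.parseGenericRecords(
                // parseFn: GenericRecord -> T, here T = String.
                (GenericRecord record) -> String.valueOf(record.get("name")))
            // Filepattern the (separately run) extract job wrote to.
            .from("gs://my-bucket/extract/output-*.avro")
            // Coder inference can fail for lambda parseFns, so set one.
            .withCoder(StringUtf8Coder.of()));
  }
}

This is essentially the shape that a parseFn-taking BigQueryIO read would
generalize, which is what the final question in the thread is asking about.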