I spent a little time trying to refactor it yesterday, but ran into a
couple of tricky points:

- The BigQuerySource implementation's non-split version (mentioned above)
doesn't read from Avro files, so its output would be pretty difficult to
"fake" into GenericRecords.  From the earlier discussion, though, it
sounds like that code path could just be dropped entirely.

- A user-provided SerializableFunction<GenericRecord, T> seems like a
good route to an extension point (it could be passed directly to the
AvroSource used to read the files), but it a) goes against the PTransform
style guide, and b) is tricky to shoehorn into the current implementation,
given how tightly the existing GenericRecord -> TableRow transform is
coupled to the rest of BigQuerySource.  (See the first sketch below.)

- I feel like the ideal solution would really be to provide a coder for
GenericRecord and allow anyone to hook up a MapElements to the output of a
BigQueryIO that produces them.  However, because a GenericRecord bundles
its Avro schema along with the data, every serialized record would also
need to include the full schema.  I've been playing around with ways to
memoize the schema so it doesn't need to be serialized for each record,
but I'm not familiar enough with the guarantees a Coder is given to know
whether that's safe to do.  (See the second sketch below.)
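
To make the parseFn idea concrete, here's roughly the shape I'm imagining
(the first sketch mentioned above).  To be clear, this read(parseFn)
overload doesn't exist today, and MyRecord is just a stand-in for a user
type; the function would be handed straight to the AvroSource that reads
the extracted files:

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;

// 'pipeline' is an existing Pipeline instance.
// Hypothetical overload, mirroring AvroIO.parseGenericRecords(parseFn).
PCollection<MyRecord> rows =
    pipeline.apply(
        BigQueryIO.read(
                new SerializableFunction<GenericRecord, MyRecord>() {
                  @Override
                  public MyRecord apply(GenericRecord r) {
                    // Pull fields straight out of the Avro record.
                    return new MyRecord(
                        (Long) r.get("id"), r.get("name").toString());
                  }
                })
            .from("my-project:my_dataset.my_table"));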

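For the coder idea (the second sketch), the per-record schema overhead
mostly goes away if the schema is known at pipeline construction time,
because the coder itself can carry the schema instead of encoding it with
every element.  A minimal sketch, assuming a hypothetical
readGenericRecords() variant of BigQueryIO.Read and a hypothetical
fetchAvroSchemaForTable() helper that asks the BigQuery API for the table
schema up front:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical helper: fetch the table's Avro schema at construction time.
Schema schema = fetchAvroSchemaForTable("my-project:my_dataset.my_table");

// Hypothetical variant of BigQueryIO.Read that runs the same extract job
// but skips the GenericRecord -> TableRow conversion.
PCollection<GenericRecord> records =
    pipeline
        .apply(BigQueryIO.readGenericRecords()
            .from("my-project:my_dataset.my_table"))
        // AvroCoder carries the schema itself, so the schema isn't
        // re-encoded alongside every element.
        .setCoder(AvroCoder.of(schema));

// Anyone can then hook up a MapElements to produce their own type.
PCollection<String> names =
    records.apply(
        MapElements.via(
            new SimpleFunction<GenericRecord, String>() {
              @Override
              public String apply(GenericRecord r) {
                return r.get("name").toString();
              }
            }));

The obvious snag is that with the current implementation the extract job
(and thus the exact Avro schema) only materializes at execution time, so
the schema would have to come from a separate BigQuery API call up front.
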
I hope to have something concrete soon as an example, but in the meantime
I'm interested to hear other people's thoughts on the above.


On Thu, Sep 14, 2017 at 3:29 PM, Eugene Kirpichov <
[email protected]> wrote:

> Oh, I see what you mean. Yeah, I agree that having BigQueryIO use TableRow
> as the native format was a suboptimal decision in retrospect, and I agree
> that it would be reasonable to provide the ability to go through Avro
> GenericRecord instead. I'm just not sure how to provide it in an
> API-compatible way - that would be particularly challenging since
> BigQueryIO is a beast in terms of amount of code and intermediate
> transforms involved. But if you have ideas, they would be welcome.
>
> On Sat, Sep 9, 2017 at 11:18 AM Steve Niemitz <[email protected]> wrote:
>
> > Ah, that makes sense wrt splitting, but is indeed confusing!  Thanks for
> > the explanation. :)
> >
> > Wrt native types and TableRow, I understand your point, but could also
> > argue that the raw Avro records are just as "native" to the BigQuery
> > connector as the TableRow JSON objects, since both are directly exposed
> > by BigQuery.
> >
> > Maybe my use case is more specialized, but I already have a good amount
> > of code that I used pre-Beam to process BigQuery Avro extract files, and
> > Avro is significantly smaller and more performant than JSON, which is why
> > I'm using it rather than just using TableRows.
> >
> > In any case, if there's no desire for such a feature I can always
> > replicate the functionality of BigQueryIO in my own codebase, so it's not
> > a big deal; it just seems like a feature that would be useful for other
> > people as well.
> >
> > On Sat, Sep 9, 2017 at 1:55 PM, Reuven Lax <[email protected]>
> > wrote:
> >
> > > On Sat, Sep 9, 2017 at 10:53 AM, Eugene Kirpichov <
> > > [email protected]> wrote:
> > >
> > > > This is a bit confusing - BigQueryQuerySource and BigQueryTableSource
> > > > indeed use the REST API to read rows if you read them unsplit -
> > > > however, in split() they run extract jobs and produce a bunch of Avro
> > > > sources that are read in parallel.  I'm not sure we have any use cases
> > > > for reading them unsplit (except unit tests) - perhaps that code path
> > > > can be removed?
> > >
> > > I believe split() will always be called in production.  Maybe not in
> > > unit tests?
> > >
> > >
> > > >
> > > > About outputting non-TableRow: per
> > > > https://beam.apache.org/contribute/ptransform-style-guide/#choosing-types-of-input-and-output-pcollections,
> > > > it is recommended to output the native type of the connector, unless
> > > > it's impossible to provide a coder for it.  This is the case for
> > > > AvroIO.parseGenericRecords, but it's not the case for TableRow, so I
> > > > would recommend against it: you can always map a TableRow to something
> > > > else using MapElements.
> > > >
> > > > On Sat, Sep 9, 2017 at 10:37 AM Reuven Lax <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi Steve,
> > > > >
> > > > > The BigQuery source should always use extract jobs, regardless of
> > > > > withTemplateCompatibility. What makes you think otherwise?
> > > > >
> > > > > Reuven
> > > > >
> > > > >
> > > > > On Sat, Sep 9, 2017 at 9:35 AM, Steve Niemitz <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hello!
> > > > > >
> > > > > > Until now I've been using a custom-built alternative to
> > > > > > BigQueryIO.Read that manually runs a BigQuery extract job (to
> > > > > > Avro), then uses AvroIO.parseGenericRecords() to read the output.
> > > > > >
> > > > > > I'm investigating instead enhancing the actual BigQueryIO.Read to
> > > > > > allow something similar, since it appears a good amount of the
> > > > > > plumbing is already in place to do this.  However, I'm confused by
> > > > > > some of the implementation details.
> > > > > >
> > > > > > To start, it seems like there are two different read paths:
> > > > > >
> > > > > > - If "withTemplateCompatibility" is set, a method similar to what
> > > > > > I described above is used: an extract job is started to export to
> > > > > > Avro, and AvroSource is used to read the files and transform them
> > > > > > into TableRows.
> > > > > >
> > > > > > - However, if it is not set, the BigQueryReader class simply uses
> > > > > > the REST API to read rows from the tables.  This method, I've seen
> > > > > > in practice, has some significant performance limitations.
> > > > > >
> > > > > > It seems to me that for large tables I'd always want to use the
> > > > > > first method; however, I'm not sure why the implementation is tied
> > > > > > to the oddly named "withTemplateCompatibility" option.  Does anyone
> > > > > > have insight into the implementation details here?
> > > > > >
> > > > > > Additionally, would the community in general be open to
> > > > > > enhancements to BigQueryIO that allow the final output to be
> > > > > > something other than "TableRow" instances, similar to how
> > > > > > AvroIO.parseGenericRecords takes a parseFn?
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > >
> > > >
> > >
> >
>
