On Thu, Sep 14, 2017 at 1:10 PM Steve Niemitz <[email protected]> wrote:
> I spent a little time trying to refactor it yesterday, but ran into a couple tricky points:
>
> - The BigQuerySource implementation's non-split version (mentioned above) doesn't read from avro files and this would be pretty difficult to "fake" into GenericRecords. It sounds like this could just be dropped completely from what was mentioned previously though.

Agreed. This would be a very welcome contribution.

> - A user-provided SerializableFunction<GenericRecord, T> seems like a good route to providing an extension point (which could be passed directly to the AvroSource used to read the files), but a) goes against the PTransform style guide, and b) is tricky to shoehorn into the current implementation, due to the existing tight coupling of the GenericRecord -> TableRow transform and the rest of BigQuerySource.

If we create a version of BigQueryIO that goes through GenericRecord, then providing a SerializableFunction<GenericRecord, T> will be reasonable, and not against the PTransform Style Guide, because GenericRecord is not encodable unless you know the schema, which in the case of reading from BigQuery you don't.

> - I feel like the ideal solution would really be to provide a coder for GenericRecord, and allow anyone to hook up a MapElements to the output of a BigQueryIO that produces them. However, the fact that GenericRecord bundles the avro schema along with the object means every record serialized would need to also include the full schema. I was playing around with ways to possibly memoize the schema so it doesn't need to be serialized for each record, but I'm not familiar enough with the guarantees a Coder is provided to know if it's safe to do.

I suggest not spending time looking in this direction - I think it's impossible to provide a Coder for GenericRecord with the current Coder API [changing the Coder API may be possible, but it would be a huge undertaking]. A SerializableFunction is a reasonable way to go.
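For concreteness, a minimal sketch of the kind of parse function being discussed - the UserEvent type and its field names are made up for illustration, and only Beam's SerializableFunction and Avro's GenericRecord are real APIs:

    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.transforms.SerializableFunction;

    // Hypothetical output type; any user type that can be given a coder would do.
    class UserEvent implements java.io.Serializable {
      final String userId;
      final long timestampMillis;

      UserEvent(String userId, long timestampMillis) {
        this.userId = userId;
        this.timestampMillis = timestampMillis;
      }
    }

    // The proposed extension point: GenericRecord -> T.
    class ParseUserEvent implements SerializableFunction<GenericRecord, UserEvent> {
      @Override
      public UserEvent apply(GenericRecord record) {
        // Avro string fields typically come back as Utf8, so convert explicitly.
        return new UserEvent(
            record.get("user_id").toString(),
            (Long) record.get("timestamp_millis"));
      }
    }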
> I hope to have something concrete soon as an example, but in the meantime I'm interested to hear other people's thoughts on the above.
>
> On Thu, Sep 14, 2017 at 3:29 PM, Eugene Kirpichov <[email protected]> wrote:
>
> > Oh, I see what you mean. Yeah, I agree that having BigQueryIO use TableRow as the native format was a suboptimal decision in retrospect, and I agree that it would be reasonable to provide the ability to go through Avro GenericRecord instead. I'm just not sure how to provide it in an API-compatible way - that would be particularly challenging since BigQueryIO is a beast in terms of the amount of code and intermediate transforms involved. But if you have ideas, they would be welcome.
> >
> > On Sat, Sep 9, 2017 at 11:18 AM Steve Niemitz <[email protected]> wrote:
> >
> > > Ah, that makes sense wrt splitting, but is indeed confusing! Thanks for the explanation. :)
> > >
> > > wrt native types and TableRow, I understand your point, but one could also argue that the raw avro records are just as "native" to the BigQuery connector as the TableRow JSON objects, since both are directly exposed by BigQuery.
> > >
> > > Maybe my use-case is more specialized, but I already have a good amount of code that I used pre-Beam to process BigQuery avro extract files, and avro is significantly smaller and more performant than JSON, which is why I'm using it rather than just using TableRows.
> > >
> > > In any case, if there's no desire for such a feature I can always replicate the functionality of BigQueryIO in my own codebase, so it's not a big deal; it just seems like a feature that would be useful for other people as well.
> > >
> > > On Sat, Sep 9, 2017 at 1:55 PM, Reuven Lax <[email protected]> wrote:
> > >
> > > > On Sat, Sep 9, 2017 at 10:53 AM, Eugene Kirpichov <[email protected]> wrote:
> > > >
> > > > > This is a bit confusing - BigQueryQuerySource and BigQueryTableSource indeed use the REST API to read rows if you read them unsplit - however, in split() they run extract jobs and produce a bunch of Avro sources that are read in parallel. I'm not sure we have any use cases for reading them unsplit (except unit tests) - perhaps that code path can be removed?
> > > >
> > > > I believe split() will always be called in production. Maybe not in unit tests?
> > > >
> > > > > About outputting non-TableRow: per https://beam.apache.org/contribute/ptransform-style-guide/#choosing-types-of-input-and-output-pcollections, it is recommended to output the native type of the connector, unless it's impossible to provide a coder for it. This is the case for AvroIO.parseGenericRecords, but it's not the case for TableRow, so I would recommend against it: you can always map a TableRow to something else using MapElements.
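A rough sketch of that TableRow-plus-MapElements pattern, reusing the hypothetical UserEvent type from above - the table name and field names are placeholders, `pipeline` is assumed to be an existing Pipeline, and depending on SDK version the entry point is BigQueryIO.read() or BigQueryIO.readTableRows():

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptor;

    // Read TableRows from BigQuery, then map them to a user type downstream.
    PCollection<UserEvent> events =
        pipeline
            .apply(BigQueryIO.readTableRows().from("my-project:my_dataset.events"))
            .apply(MapElements
                .into(TypeDescriptor.of(UserEvent.class))
                .via((TableRow row) -> new UserEvent(
                    (String) row.get("user_id"),
                    // TableRow values are schema-dependent; numeric fields often
                    // arrive as strings, hence the explicit parse here.
                    Long.parseLong(row.get("timestamp_millis").toString()))));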
> > > > > On Sat, Sep 9, 2017 at 10:37 AM Reuven Lax <[email protected]> wrote:
> > > > >
> > > > > > Hi Steve,
> > > > > >
> > > > > > The BigQuery source should always use extract jobs, regardless of withTemplateCompatibility. What makes you think otherwise?
> > > > > >
> > > > > > Reuven
> > > > > >
> > > > > > On Sat, Sep 9, 2017 at 9:35 AM, Steve Niemitz <[email protected]> wrote:
> > > > > >
> > > > > > > Hello!
> > > > > > >
> > > > > > > Until now I've been using a custom-built alternative to BigQueryIO.Read that manually runs a BigQuery extract job (to avro), then uses AvroIO.parseGenericRecords() to read the output.
> > > > > > >
> > > > > > > I'm investigating instead enhancing the actual BigQueryIO.Read to allow something similar, since it appears a good amount of the plumbing is already in place to do this. However, I'm confused by some of the implementation details.
> > > > > > >
> > > > > > > To start, it seems like there are two different read paths:
> > > > > > >
> > > > > > > - If "withTemplateCompatibility" is set, a similar method to what I described above is used; an extract job is started to export to avro, and AvroSource is used to read the files and transform them into TableRows.
> > > > > > >
> > > > > > > - However, if not set, the BigQueryReader class simply uses the REST API to read rows from the tables. This method, I've seen in practice, has some significant performance limitations.
> > > > > > >
> > > > > > > It seems to me that for large tables I'd always want to use the first method; however, I'm not sure why the implementation is tied to the oddly named "withTemplateCompatibility" option. Does anyone have insight as to the implementation details here?
> > > > > > >
> > > > > > > Additionally, would the community in general be accepting of enhancements to BigQueryIO to allow the final output to be something other than "TableRow" instances, similar to how AvroIO.parseGenericRecords takes a parseFn?
> > > > > > >
> > > > > > > Thanks!
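For reference, a minimal sketch of the custom approach described at the start of this thread (a separately run BigQuery extract job to Avro, then AvroIO.parseGenericRecords over its output). The GCS path is a placeholder, `pipeline` is assumed to be an existing Pipeline, and ParseUserEvent is the hypothetical parse function sketched earlier:

    import org.apache.beam.sdk.coders.SerializableCoder;
    import org.apache.beam.sdk.io.AvroIO;
    import org.apache.beam.sdk.values.PCollection;

    // Parse the extract job's Avro output directly into the user type,
    // skipping the GenericRecord -> TableRow -> user type detour entirely.
    PCollection<UserEvent> events =
        pipeline.apply(
            AvroIO.parseGenericRecords(new ParseUserEvent())
                .from("gs://my-bucket/bq-extract/events-*.avro")
                .withCoder(SerializableCoder.of(UserEvent.class)));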
