I've just about got something to send for a PR, I'll push it shortly.
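For context, the extension point in what I have follows the user-provided
SerializableFunction<GenericRecord, T> route discussed below; on the user
side it would look roughly like this (a sketch only, with hypothetical class
and field names, nothing final):

  import org.apache.avro.generic.GenericRecord;
  import org.apache.beam.sdk.transforms.SerializableFunction;

  // Hypothetical user-supplied parse function handed to the
  // AvroSource-backed read path.
  class ExtractUserId implements SerializableFunction<GenericRecord, Long> {
    @Override
    public Long apply(GenericRecord record) {
      // Avro longs come through as longs here, unlike the TableRow path
      // where INTEGER columns show up as Strings.
      return (Long) record.get("user_id");
    }
  }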
One thing I noticed while running the BigQueryIO unit tests struck me as
strange, though.
It looks like the TableRow / Avro / TableRowJsonCoder interaction currently
changes BigQuery INTEGERs to strings in the public interface, e.g.
TableRow.get("someInteger") returns a String, not a Long. The strange
thing is that some unit tests don't reflect this behavior. For
example testBigQueryTableSourceInitSplit asserts that the long values exist
in the result set, not the string versions. On the other
hand, testValidateReadSetsDefaultProject gets the value as a String and
converts it to a Long.
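In code, the discrepancy looks roughly like this (a sketch only, not the
actual test code):

  // "someInteger" is illustrative; TableRow is
  // com.google.api.services.bigquery.model.TableRow.
  TableRow row = new TableRow().set("someInteger", "1");  // what the read path produces today

  // What testBigQueryTableSourceInitSplit effectively expects:
  //   assertEquals(1L, row.get("someInteger"));   // fails against the String form

  // What testValidateReadSetsDefaultProject does instead:
  Long value = Long.valueOf((String) row.get("someInteger"));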
It seems like the string conversion is the intended behavior and the unit
tests are incorrect? Can anyone confirm?
On Thu, Sep 14, 2017 at 5:31 PM, Steve Niemitz <[email protected]> wrote:
> Cool, thanks for the shove in the right direction!
>
> I'll probably have some more time to spend on this early next week,
> hopefully I'll have a PR to submit after that. :)
>
> On Thu, Sep 14, 2017 at 4:37 PM, Eugene Kirpichov <
> [email protected]> wrote:
>
>> On Thu, Sep 14, 2017 at 1:10 PM Steve Niemitz <[email protected]>
>> wrote:
>>
>> > I spent a little time trying to refactor it yesterday, but ran into a
>> > couple tricky points:
>> >
>> > - The BigQuerySource implementation's non-split version (mentioned
>> above)
>> > doesn't read from avro files and this would be pretty difficult to
>> "fake"
>> > into GenericRecords. It sounds like this could just be dropped
>> completely
>> > from what was mentioned previously though.
>> >
>> Agreed. This would be a very welcome contribution.
>>
>>
>> >
>> > - A user-provided SerializableFunction<GenericRecord, T> seems like a
>> > good route to providing an extension point (which could be passed
>> directly
>> > to the AvroSource used to read the files), but a) goes against the
>> > PTransform style guide, and b) is tricky to shoehorn into the current
>> > implementation, due to the existing tight coupling of the GenericRecord
>> ->
>> > TableRow transform and the rest of BigQuerySource.
>> >
>> If we create a version of BigQueryIO that goes through GenericRecord, then
>> providing a SerializableFunction<GenericRecord, T> will be reasonable,
>> and
>> not against the PT Style Guide, because GenericRecord is not encodable
>> unless you know the schema, which in the case of reading from BigQuery you
>> don't.
>>
>>
>> >
>> > - I feel like the ideal solution would really be to provide a coder for
>> > GenericRecord, and allow anyone to hook up a MapElements to the output
>> of a
>> > BigQueryIO that produces them. However, the fact that GenericRecord
>> > bundles the avro schema along with the object means every record
>> serialized
>> > would need to also include the full schema. I was playing around with
>> ways
>> > to possibly memoize the schema so it doesn't need to be serialized for
>> each
>> > record, but I'm not familiar enough with the guarantees a Coder is
>> provided
>> > to know if it's safe to do.
>> >
>> I suggest not spending time looking in this direction - I think it's
>> impossible to provide a Coder for GenericRecord with the current Coder API
>> [changing the Coder API may be possible, but it would be a huge
>> undertaking]. A SerializableFunction is a reasonable way to go.
>>
>>
>> >
>> > I hope to have something concrete soon as an example, but in the meantime
>> > I'm interested to hear other people's thoughts on the above.
>> >
>> >
>> > On Thu, Sep 14, 2017 at 3:29 PM, Eugene Kirpichov <
>> > [email protected]> wrote:
>> >
>> > > Oh, I see what you mean. Yeah, I agree that having BigQueryIO use
>> > TableRow
>> > > as the native format was a suboptimal decision in retrospect, and I
>> agree
>> > > that it would be reasonable to provide ability to go through Avro
>> > > GenericRecord instead. I'm just not sure how to provide it in an
>> > > API-compatible way - that would be particularly challenging since
>> > > BigQueryIO is a beast in terms of amount of code and intermediate
>> > > transforms involved. But if you have ideas, they would be welcome.
>> > >
>> > > On Sat, Sep 9, 2017 at 11:18 AM Steve Niemitz <[email protected]>
>> > wrote:
>> > >
>> > > > Ah that makes sense wrt splitting, but is indeed confusing! Thanks
>> for
>> > > the
>> > > > explanation. :)
>> > > >
>> > > > wrt native types and TableRow, I understand your point, but could
>> also
>> > > > argue that the raw avro records are just as "native" to the BigQuery
>> > > > connector as the TableRow JSON objects, since both are directly
>> exposed
>> > > by
>> > > > BigQuery.
>> > > >
>> > > > Maybe my use-case is more specialized, but I already have a good
>> amount
>> > > of
>> > > > code that I used pre-Beam to process BigQuery avro extract files,
>> and
>> > > avro
>> > > > is significantly smaller and more performant than JSON, which is why
>> > I'm
>> > > > using it rather than just using TableRows.
>> > > >
>> > > > In any case, if there's no desire for such a feature I can always
>> > > replicate
>> > > > the functionality of BigQueryIO in my own codebase, so it's not a
>> big
>> > > deal,
>> > > > it just seems like a feature that would be useful for other people
>> as
>> > > well.
>> > > >
>> > > > On Sat, Sep 9, 2017 at 1:55 PM, Reuven Lax <[email protected]
>> >
>> > > > wrote:
>> > > >
>> > > > > On Sat, Sep 9, 2017 at 10:53 AM, Eugene Kirpichov <
>> > > > > [email protected]> wrote:
>> > > > >
>> > > > > > This is a bit confusing - BigQueryQuerySource and
>> > BigQueryTableSource
>> > > > > > indeed use the REST API to read rows if you read them unsplit -
>> > > > however,
>> > > > > in
>> > > > > > split() they run extract jobs and produce a bunch of Avro
>> sources
>> > > that
>> > > > > are
>> > > > > > read in parallel. I'm not sure we have any use cases for reading
>> > them
>> > > > > > unsplit (except unit tests) - perhaps that code path can be
>> > removed?
>> > > > > >
>> > > > >
>> > > > > I believe split() will always be called in production. Maybe not
>> in
>> > > unit
>> > > > > tests?
>> > > > >
>> > > > >
>> > > > > >
>> > > > > > About outputting non-TableRow: per
>> > > > > https://beam.apache.org/contribute/ptransform-style-guide/#choosing-types-of-input-and-output-pcollections,
>> > > > > > it is recommended to output the native type of the connector,
>> > unless
>> > > > it's
>> > > > > > impossible to provide a coder for it. This is the case for
>> > > > > > AvroIO.parseGenericRecords, but it's not the case for TableRow,
>> so
>> > I
>> > > > > would
>> > > > > > recommend against it: you can always map a TableRow to something
>> > else
>> > > > > using
>> > > > > > MapElements.
>> > > > > >
>> > > > > > On Sat, Sep 9, 2017 at 10:37 AM Reuven Lax
>> > <[email protected]
>> > > >
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Hi Steve,
>> > > > > > >
>> > > > > > > The BigQuery source should always use extract jobs,
>> regardless
>> > of
>> > > > > > > withTemplateCompatibility. What makes you think otherwise?
>> > > > > > >
>> > > > > > > Reuven
>> > > > > > >
>> > > > > > >
>> > > > > > > On Sat, Sep 9, 2017 at 9:35 AM, Steve Niemitz <
>> > [email protected]
>> > > >
>> > > > > > wrote:
>> > > > > > >
>> > > > > > > > Hello!
>> > > > > > > >
>> > > > > > > > Until now I've been using a custom-built alternative to
>> > > > > BigQueryIO.Read
>> > > > > > > > that manually runs a BigQuery extract job (to avro), then
>> uses
>> > > > > > > > AvroIO.parseGenericRecords() to read the output.
>> > > > > > > >
>> > > > > > > > I'm investigating instead enhancing the actual
>> BigQueryIO.Read
>> > to
>> > > > > allow
>> > > > > > > > something similar, since it appears a good amount of the
>> > plumbing
>> > > > is
>> > > > > > > > already in place to do this. However, I'm confused by some
>> of
>> > the
>> > > > > > > > implementation details.
>> > > > > > > >
>> > > > > > > > To start, it seems like there are two different read paths:
>> > > > > > > > - If "withTemplateCompatibility" is set, a similar method to
>> > > what I
>> > > > > > > > described above is used; an extract job is started to
>> export to
>> > > > avro,
>> > > > > > and
>> > > > > > > > AvroSource is used to read files and transform them into
>> > > TableRows.
>> > > > > > > >
>> > > > > > > > - However, if not set, the BigQueryReader class simply uses
>> the
>> > > > REST
>> > > > > > API
>> > > > > > > to
>> > > > > > > > read rows from the tables. This method, I've seen in
>> practice,
>> > > has
>> > > > > > some
>> > > > > > > > significant performance limitations.
>> > > > > > > >
>> > > > > > > > It seems to me that for large tables, I'd always want to use
>> > the
>> > > > > first
>> > > > > > > > method; however, I'm not sure why the implementation is tied
>> to
>> > > the
>> > > > > > oddly
>> > > > > > > > named "withTemplateCompatibility" option. Does anyone have
>> > > insight
>> > > > > as
>> > > > > > to
>> > > > > > > > the implementation details here?
>> > > > > > > >
>> > > > > > > > Additionally, would the community in general be accepting of
>> > > > > > enhancements
>> > > > > > > > to BigQueryIO to allow the final output to be something
>> other
>> > > than
>> > > > > > > > "TableRow" instances, similar to how
>> AvroIO.parseGenericRecords
>> > > > > takes a
>> > > > > > > > parseFn?
>> > > > > > > >
>> > > > > > > > Thanks!
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>