Heh maybe we can merge our merge requests :P I just posted mine here as well: https://github.com/apache/beam/pull/3894
On Mon, Sep 25, 2017 at 11:30 AM, Eugene Kirpichov <[email protected]> wrote:

Hi,
I actually also made a PR for this myself the other day, but forgot to update this thread, sorry :-|
https://github.com/apache/beam/pull/3891

I also hit this conversion issue, and it is intended behavior, because TableRow is the JSON representation of the data and it matches what BigQuery does when exporting to JSON format: it converts numbers to STRING. Yes, some tests were incorrect.

On Mon, Sep 25, 2017 at 7:13 AM Steve Niemitz <[email protected]> wrote:

I've just about got something to send for a PR, I'll push it shortly.

One thing I noticed, though, while running the BigQueryIO unit tests that I found strange: it looks like the TableRow / Avro / TableRowJsonCoder interaction currently changes BigQuery INTEGERs to strings in the public interface, e.g. TableRow.get("someInteger") returns a String, not a Long. The strange thing is that some unit tests do not mimic this behavior. For example, testBigQueryTableSourceInitSplit asserts that the long values exist in the result set, not the string versions. On the other hand, testValidateReadSetsDefaultProject gets the value as a String and converts it to a Long.

It seems like the behavior of converting to strings is intended, and the unit tests are incorrect? Can anyone confirm?
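A minimal, hand-constructed illustration of the conversion being described ("someInteger" is just the column name from the example above; the row literal is made up):

    import com.google.api.services.bigquery.model.TableRow;

    public class TableRowIntegerExample {
      public static void main(String[] args) {
        // After the Avro -> TableRow conversion in BigQueryIO, an INTEGER
        // column is carried as a String, matching BigQuery's JSON export.
        TableRow row = new TableRow().set("someInteger", "42");

        // So the value has to be re-parsed rather than cast to Long:
        long someInteger = Long.parseLong((String) row.get("someInteger"));
        System.out.println(someInteger);
      }
    }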
On Thu, Sep 14, 2017 at 5:31 PM, Steve Niemitz <[email protected]> wrote:

Cool, thanks for the shove in the right direction!

I'll probably have some more time to spend on this early next week; hopefully I'll have a PR to submit after that. :)

On Thu, Sep 14, 2017 at 4:37 PM, Eugene Kirpichov <[email protected]> wrote:

On Thu, Sep 14, 2017 at 1:10 PM Steve Niemitz <[email protected]> wrote:

> I spent a little time trying to refactor it yesterday, but ran into a couple of tricky points:
>
> - The BigQuerySource implementation's non-split version (mentioned above) doesn't read from avro files, and this would be pretty difficult to "fake" into GenericRecords. It sounds like this could just be dropped completely from what was mentioned previously, though.

Agreed. This would be a very welcome contribution.

> - A user-provided SerializableFunction<GenericRecord, T> seems like a good route to providing an extension point (which could be passed directly to the AvroSource used to read the files), but a) it goes against the PTransform style guide, and b) it is tricky to shoehorn into the current implementation, due to the existing tight coupling of the GenericRecord -> TableRow transform and the rest of BigQuerySource.

If we create a version of BigQueryIO that goes through GenericRecord, then providing a SerializableFunction<GenericRecord, T> will be reasonable, and not against the PTransform style guide, because GenericRecord is not encodable unless you know the schema, which in the case of reading from BigQuery you don't.

> - I feel like the ideal solution would really be to provide a coder for GenericRecord, and allow anyone to hook up a MapElements to the output of a BigQueryIO that produces them. However, the fact that GenericRecord bundles the avro schema along with the object means every serialized record would also need to include the full schema. I was playing around with ways to possibly memoize the schema so it doesn't need to be serialized for each record, but I'm not familiar enough with the guarantees a Coder is given to know if it's safe to do.

I suggest not spending time looking in this direction - I think it's impossible to provide a Coder for GenericRecord with the current Coder API (changing the Coder API may be possible, but it would be a huge undertaking). A SerializableFunction is a reasonable way to go.

> I hope to have something concrete soon as an example, but in the meantime I'm interested to hear other people's thoughts on the above.

On Thu, Sep 14, 2017 at 3:29 PM, Eugene Kirpichov <[email protected]> wrote:

Oh, I see what you mean. Yeah, I agree that having BigQueryIO use TableRow as the native format was a suboptimal decision in retrospect, and I agree that it would be reasonable to provide the ability to go through Avro GenericRecord instead. I'm just not sure how to provide it in an API-compatible way - that would be particularly challenging, since BigQueryIO is a beast in terms of the amount of code and intermediate transforms involved. But if you have ideas, they would be welcome.

On Sat, Sep 9, 2017 at 11:18 AM Steve Niemitz <[email protected]> wrote:

Ah, that makes sense wrt splitting, but is indeed confusing! Thanks for the explanation. :)

wrt native types and TableRow, I understand your point, but one could also argue that the raw avro records are just as "native" to the BigQuery connector as the TableRow JSON objects, since both are directly exposed by BigQuery.

Maybe my use case is more specialized, but I already have a good amount of code that I used pre-Beam to process BigQuery avro extract files, and avro is significantly smaller and more performant than JSON, which is why I'm using it rather than just using TableRows.

In any case, if there's no desire for such a feature I can always replicate the functionality of BigQueryIO in my own codebase, so it's not a big deal; it just seems like a feature that would be useful for other people as well.
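To make the parse-function route from the Sep 14 exchange above concrete, here is a rough sketch - illustrative only, not the API from either PR; the Event type and the "user_id" / "score" columns are made up. The point is that the Avro record carries typed values, so a BigQuery INTEGER can be read as a long instead of being re-parsed from a String:

    import java.io.Serializable;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.transforms.SerializableFunction;

    // Hypothetical domain type, just for the example.
    class Event implements Serializable {
      final long userId;
      final double score;

      Event(long userId, double score) {
        this.userId = userId;
        this.score = score;
      }
    }

    // The kind of parseFn a GenericRecord-based BigQueryIO read could accept,
    // analogous to what AvroIO.parseGenericRecords() takes today.
    class ParseEventFn implements SerializableFunction<GenericRecord, Event> {
      @Override
      public Event apply(GenericRecord record) {
        // BigQuery INTEGER/FLOAT columns map to Avro long/double in extract
        // files, so the values come back with their native types.
        return new Event((Long) record.get("user_id"), (Double) record.get("score"));
      }
    }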
On Sat, Sep 9, 2017 at 1:55 PM, Reuven Lax <[email protected]> wrote:

On Sat, Sep 9, 2017 at 10:53 AM, Eugene Kirpichov <[email protected]> wrote:

> This is a bit confusing - BigQueryQuerySource and BigQueryTableSource indeed use the REST API to read rows if you read them unsplit - however, in split() they run extract jobs and produce a bunch of Avro sources that are read in parallel. I'm not sure we have any use cases for reading them unsplit (except unit tests) - perhaps that code path can be removed?

I believe split() will always be called in production. Maybe not in unit tests?

> About outputting non-TableRow: per https://beam.apache.org/contribute/ptransform-style-guide/#choosing-types-of-input-and-output-pcollections, it is recommended to output the native type of the connector, unless it's impossible to provide a coder for it. This is the case for AvroIO.parseGenericRecords, but it's not the case for TableRow, so I would recommend against it: you can always map a TableRow to something else using MapElements.

On Sat, Sep 9, 2017 at 10:37 AM Reuven Lax <[email protected]> wrote:

Hi Steve,

The BigQuery source should always use extract jobs, regardless of withTemplateCompatibility. What makes you think otherwise?

Reuven
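As a concrete (made-up) version of the "you can always map a TableRow to something else using MapElements" suggestion above - the table and the "user_id" column are placeholders, and the Long.parseLong reflects the String representation of INTEGER values discussed in the Sep 25 messages at the top of the thread:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class MapTableRowsExample {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Read TableRows and immediately convert them to the type the rest
        // of the pipeline wants; the table and column names are made up.
        PCollection<Long> userIds =
            p.apply(BigQueryIO.readTableRows().from("my_project:my_dataset.events"))
             .apply(MapElements.into(TypeDescriptors.longs())
                 .via((TableRow row) -> Long.parseLong((String) row.get("user_id"))));

        p.run().waitUntilFinish();
      }
    }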
On Sat, Sep 9, 2017 at 9:35 AM, Steve Niemitz <[email protected]> wrote:

Hello!

Until now I've been using a custom-built alternative to BigQueryIO.Read that manually runs a BigQuery extract job (to avro), then uses AvroIO.parseGenericRecords() to read the output.

I'm investigating instead enhancing the actual BigQueryIO.Read to allow something similar, since it appears a good amount of the plumbing is already in place to do this. However, I'm confused by some of the implementation details.

To start, it seems like there are two different read paths:

- If "withTemplateCompatibility" is set, a method similar to what I described above is used: an extract job is started to export to avro, and AvroSource is used to read the files and transform them into TableRows.

- However, if it is not set, the BigQueryReader class simply uses the REST API to read rows from the tables. This method, I've seen in practice, has some significant performance limitations.

It seems to me that for large tables I'd always want to use the first method; however, I'm not sure why the implementation is tied to the oddly named "withTemplateCompatibility" option. Does anyone have insight into the implementation details here?

Additionally, would the community in general be accepting of enhancements to BigQueryIO that allow the final output to be something other than "TableRow" instances, similar to how AvroIO.parseGenericRecords takes a parseFn?

Thanks!
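For reference, the read half of that workflow (parsing already-extracted Avro files with a parseFn) looks roughly like this; the GCS path and the "user_id" column are placeholders, and the extract job itself is assumed to have been run separately:

    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.VarLongCoder;
    import org.apache.beam.sdk.io.AvroIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class ReadExtractedAvroExample {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Parse each GenericRecord from the (hypothetical) extract location
        // directly into a typed value - no TableRow/JSON step involved.
        PCollection<Long> userIds =
            p.apply(AvroIO.parseGenericRecords(
                    (GenericRecord record) -> (Long) record.get("user_id"))
                .from("gs://my-bucket/extracts/events-*.avro")
                .withCoder(VarLongCoder.of()));

        p.run().waitUntilFinish();
      }
    }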
