It looks like some unit tests failed.  I'm unable to reproduce the failures
in my environment because the error I received (described in the PR)
suggests I can't even start the job in the first place.  Any help would be
appreciated.  Ideally I'd like to have this included in the next possible
release (2.30.0, it looks like), since my team had to put a few workarounds
in place to get past these issues.

On Sun, Mar 28, 2021 at 2:25 AM Matthew Ouyang <[email protected]>
wrote:

> Thank you for the feedback.  I walked back my initial approach in favour
> of Brian's option (2) and implemented a fix for lists as well (
> https://github.com/apache/beam/pull/14350).  I agree with the tradeoff
> Brian pointed out, as it is consistent with the rest of that component.  If
> performance ends up being a problem, BigQueryUtils could have a different
> mode for TableRow and Avro.
>
> On Tue, Mar 23, 2021 at 1:47 PM Reuven Lax <[email protected]> wrote:
>
>> Logically, all JSON values are strings. We have often put other objects in
>> there, which I believe works simply because of the implicit .toString()
>> call on those objects, but I'm not convinced this is really correct.
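>>
>> For example (a minimal sketch; the "price" field and variable names are
>> made up for illustration):
>>
>>   import com.google.api.services.bigquery.model.TableRow;
>>
>>   TableRow row = new TableRow();
>>   row.set("price", 1.5d);          // stored as a java.lang.Double
>>   Object v = row.get("price");     // still a Double, not the String "1.5"
>>   String s = String.valueOf(v);    // only the implicit toString() yields "1.5"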
>>
>> On Tue, Mar 23, 2021 at 6:38 AM Brian Hulette <[email protected]>
>> wrote:
>>
>>> Thank you for digging into this and figuring out how this bug was
>>> introduced!
>>>
>>> In the long term I think it would be preferable to avoid
>>> TableRow altogether in order to do a schema-aware read of Avro data from
>>> BQ. We can go directly from Avro GenericRecord to Beam Rows now [1]. This
>>> would also be preferable for Arrow, where we could produce Row instances
>>> that are references into an underlying Arrow RecordBatch (using
>>> RowWithGetters), rather than having to materialize each row to make a
>>> TableRow instance.
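>>>
>>> A minimal sketch of that direct path (assuming the AvroUtils helpers in
>>> org.apache.beam.sdk.schemas.utils; the variable names are illustrative):
>>>
>>>   import org.apache.avro.generic.GenericRecord;
>>>   import org.apache.beam.sdk.schemas.Schema;
>>>   import org.apache.beam.sdk.schemas.utils.AvroUtils;
>>>   import org.apache.beam.sdk.values.Row;
>>>
>>>   // Derive the Beam schema once from the Avro schema...
>>>   Schema beamSchema = AvroUtils.toBeamSchema(avroSchema);
>>>   // ...then convert each record directly, with no TableRow in between.
>>>   Row row = AvroUtils.toBeamRowStrict(genericRecord, beamSchema);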
>>>
>>> For a short-term fix there are two options, both of which came up in
>>> Reuven's comments on BEAM-9613:
>>>
>>>> However if we expect to get Long, Double, etc. objects in the TableRow,
>>>> then this mapping code needs to handle those objects. Handling them
>>>> directly would be more efficient - converting to a String would simply be a
>>>> stopgap "one-line" fix.
>>>
>>> 1. Handle them directly [in BigQueryUtils]; this is what you've done in
>>> https://github.com/mouyang/beam/commit/326b291ab333c719a9f54446c34611581ea696eb
>>> 2. Convert to a String [in BigQueryAvroUtils]
>>>
>>> I don't have a strong preference but I think (2) is a cleaner, albeit
>>> less performant, solution. It seems that BigQueryUtils expects all values
>>> in TableRow instances to be String instances. Since BigQueryAvroUtils is
>>> just a shim to convert GenericRecord to TableRow for use in BigQueryUtils,
>>> it should comply with that interface, rather than making BigQueryUtils work
>>> around the discrepancy.
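>>>
>>> Concretely, option (2) would amount to something like this inside the
>>> Avro-to-TableRow conversion (a hedged sketch, not the actual
>>> BigQueryAvroUtils code; avroValue is an illustrative name):
>>>
>>>   // Stringify Avro-native Boolean/Double values so that every value
>>>   // placed in the TableRow stays a String, which is what BigQueryUtils
>>>   // appears to expect.
>>>   Object convertedValue =
>>>       (avroValue instanceof Boolean || avroValue instanceof Double)
>>>           ? avroValue.toString()
>>>           : avroValue;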
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/AvroRecordSchema.java
>>>
>>> On Mon, Mar 22, 2021 at 5:59 PM Matthew Ouyang <[email protected]>
>>> wrote:
>>>
>>>> I'm working on fixing BEAM-9613 because I encountered this issue as a
>>>> result of using BigQueryIO.readTableRowsWithSchema().
>>>>
>>>>    1. BEAM-7526 provided support for Lists and Maps that came from the
>>>>    JSON export format.
>>>>    2. BEAM-2879 switched the export format from JSON to Avro.  This
>>>>    caused issue BEAM-9613, since Avro no longer represented BQ BOOLEAN
>>>>    and FLOAT as Java Strings but rather as Java Boolean and Double.
>>>>    3. The switch from JSON to Avro also introduced an issue with fields
>>>>    with BQ REPEATED mode, because fields of this mode are represented
>>>>    differently in the two export formats (see the example after this
>>>>    list).
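>>>>
>>>> For instance (illustrative field name and values, not taken from the
>>>> codebase), a REPEATED INTEGER field comes back differently in the two
>>>> formats:
>>>>
>>>>   JSON export: "xs": [{"v": "1"}, {"v": "2"}]  (each element wrapped in a {"v": ...} map)
>>>>   Avro export: a plain list of native values, e.g. [1, 2]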
>>>>
>>>> I have a simple fix to handle BQ BOOLEAN and FLOAT (
>>>> https://github.com/mouyang/beam/commit/326b291ab333c719a9f54446c34611581ea696eb),
>>>> however I'm a bit uncomfortable with it because
>>>>
>>>>    1. It would introduce mixing of both the JSON and Avro export
>>>>    formats,
>>>>    2. BEAM-8933, while still in progress, would introduce Arrow and
>>>>    risk a regression,
>>>>    3. I haven't made a fix for REPEATED mode yet, but tests that use
>>>>    BigQueryUtilsTest.BQ_ARRAY_ROW would have to change (
>>>>    https://github.com/apache/beam/blob/e039ca28d6f806f30b87cae82e6af86694c171cd/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtilsTest.java#L306-L311).
>>>>    I don't know if I should change this, because I don't know whether
>>>>    Beam wishes to continue supporting the JSON format.
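>>>>
>>>> For reference, the direct-handling approach in that commit is roughly
>>>> the following (a paraphrased sketch of option (1), not the exact diff;
>>>> jsonBQValue is an illustrative name for the value being converted):
>>>>
>>>>   // Pass Avro-native Boolean/Double values through unchanged instead
>>>>   // of assuming every TableRow value arrives as a String.
>>>>   if (jsonBQValue instanceof Boolean || jsonBQValue instanceof Double) {
>>>>     return jsonBQValue;
>>>>   }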
>>>>
>>>> I'd like to incorporate these changes in my project as soon as
>>>> possible, but I also need guidance on next steps that would be in line with
>>>> the general direction of the project.  I'm looking forward to any
>>>> feedback.  Thanks.
>>>>
>>>
