Hello,

Apologies for not linking to the PR with the failed unit tests in my last
e-mail. The link is below. I'm stuck and need help fixing this. I can't
reproduce the failures in my dev environment because the build can't
trigger a BQ job, and the failed tests seem to complement each other (the
collection has two elements; the first test is missing one element and the
second is missing the other).
https://github.com/apache/beam/pull/14350

On Mon, Mar 29, 2021 at 9:28 AM Matthew Ouyang <[email protected]> wrote:

> It looks like some unit tests failed. I'm unable to reproduce them in my
> environment because the error I received (described in the PR) suggests I
> can't even start the job in the first place. Any help would be
> appreciated. Ideally I'd like to have this included in the next possible
> release (2.30.0, it looks like) since my team had to do a few workarounds
> to get past these issues.
>
> On Sun, Mar 28, 2021 at 2:25 AM Matthew Ouyang <[email protected]> wrote:
>
>> Thank you for the feedback. I walked back my initial approach in favour
>> of Brian's option (2) and also implemented a fix for lists as well
>> (https://github.com/apache/beam/pull/14350). I agree with the tradeoff
>> Brian pointed out, as it is consistent with the rest of that component.
>> If performance ends up being a problem, BigQueryUtils could have a
>> different mode for TableRow and Avro.
>>
>> On Tue, Mar 23, 2021 at 1:47 PM Reuven Lax <[email protected]> wrote:
>>
>>> Logically, all JSON values are strings. We have often put other objects
>>> in there, which I believe works simply because of the implicit
>>> .toString() method on those objects, but I'm not convinced this is
>>> really correct.
>>>
>>> On Tue, Mar 23, 2021 at 6:38 AM Brian Hulette <[email protected]> wrote:
>>>
>>>> Thank you for digging into this and figuring out how this bug was
>>>> introduced!
>>>>
>>>> In the long term I think it would be preferable to avoid TableRow
>>>> altogether in order to do a schema-aware read of Avro data from BQ. We
>>>> can go directly from Avro GenericRecord to Beam Rows now [1]. This
>>>> would also be preferable for Arrow, where we could produce Row
>>>> instances that are references into an underlying Arrow RecordBatch
>>>> (using RowWithGetters), rather than having to materialize each row to
>>>> make a TableRow instance.
>>>>
>>>> For a short-term fix there are two options; both came up in Reuven's
>>>> comments on BEAM-9613:
>>>>
>>>>> However if we expect to get Long, Double, etc. objects in the
>>>>> TableRow, then this mapping code needs to handle those objects.
>>>>> Handling them directly would be more efficient - converting to a
>>>>> String would simply be a stopgap "one-line" fix.
>>>>
>>>> 1. Handle them directly [in BigQueryUtils]; this is what you've done in
>>>> https://github.com/mouyang/beam/commit/326b291ab333c719a9f54446c34611581ea696eb
>>>> 2. Convert to a String [in BigQueryAvroUtils]
>>>>
>>>> I don't have a strong preference, but I think (2) is a cleaner, albeit
>>>> less performant, solution. It seems that BigQueryUtils expects all
>>>> values in TableRow instances to be String instances. Since
>>>> BigQueryAvroUtils is just a shim that converts GenericRecord to
>>>> TableRow for use in BigQueryUtils, it should comply with that
>>>> interface rather than making BigQueryUtils work around the
>>>> discrepancy.
>>>>
>>>> [1] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/AvroRecordSchema.java
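To make option (2) concrete: a minimal sketch of the stringifying shim,
assuming a hypothetical helper inside BigQueryAvroUtils. This illustrates
the idea only; it is not the actual patch.

    import org.apache.avro.Schema;

    // Sketch of option (2): stringify scalar Avro values in the
    // GenericRecord-to-TableRow shim so a TableRow only ever holds
    // Strings, which is what BigQueryUtils already expects from the
    // old JSON export format.
    public class StringifySketch {
      static Object toTableRowValue(Schema.Type avroType, Object value) {
        switch (avroType) {
          case BOOLEAN: // BQ BOOLEAN arrives as java.lang.Boolean
          case DOUBLE:  // BQ FLOAT arrives as java.lang.Double
          case LONG:    // BQ INTEGER arrives as java.lang.Long
            return String.valueOf(value); // the stopgap "one-line" fix
          default:
            return value; // records/arrays handled recursively elsewhere
        }
      }
    }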
>>>> On Mon, Mar 22, 2021 at 5:59 PM Matthew Ouyang <[email protected]> wrote:
>>>>
>>>>> I'm working on fixing BEAM-9613 because I encountered this issue as a
>>>>> result of using BigQueryIO.readTableRowsWithSchema().
>>>>>
>>>>> 1. BEAM-7526 provided support for Lists and Maps that came from the
>>>>> JSON export format.
>>>>> 2. BEAM-2879 switched the export format from JSON to Avro. This
>>>>> caused BEAM-9613, since Avro no longer treated BQ BOOLEAN and FLOAT
>>>>> as a Java String but rather as Java Boolean and Double.
>>>>> 3. The switch from JSON to Avro also introduced an issue with fields
>>>>> in BQ REPEATED mode, because fields of this mode are likewise no
>>>>> longer represented as Strings.
>>>>>
>>>>> I have a simple fix to handle BQ BOOLEAN and FLOAT
>>>>> (https://github.com/mouyang/beam/commit/326b291ab333c719a9f54446c34611581ea696eb),
>>>>> however I'm a bit uncomfortable with it because:
>>>>>
>>>>> 1. It would introduce mixing of both the JSON and Avro export
>>>>> formats.
>>>>> 2. BEAM-8933, while still in progress, would introduce Arrow and
>>>>> risk a regression.
>>>>> 3. I haven't made a fix for REPEATED mode yet, but tests that use
>>>>> BigQueryUtilsTest.BQ_ARRAY_ROW would have to change (
>>>>> https://github.com/apache/beam/blob/e039ca28d6f806f30b87cae82e6af86694c171cd/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtilsTest.java#L306-L311).
>>>>> I don't know if I should change this, because I don't know whether
>>>>> Beam wishes to continue supporting the JSON format.
>>>>>
>>>>> I'd like to incorporate these changes into my project as soon as
>>>>> possible, but I also need guidance on next steps that would be in
>>>>> line with the general direction of the project. I'm looking forward
>>>>> to any feedback. Thanks.
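For contrast, a minimal sketch of option (1): teaching BigQueryUtils to
accept the typed objects the Avro export produces alongside the Strings
the old JSON export produced. The helper names below are hypothetical;
they illustrate the shape of the commit linked above, not its contents.

    // Sketch of option (1): accept both representations of a value.
    public class TypedValueSketch {
      static Object parseBoolean(Object value) {
        if (value instanceof Boolean) {
          return value;                              // Avro export path
        }
        return Boolean.parseBoolean((String) value); // JSON export path
      }

      static Object parseFloat64(Object value) {
        if (value instanceof Double) {
          return value;                              // Avro export path
        }
        return Double.parseDouble((String) value);   // JSON export path
      }
    }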

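As for the long-term direction Brian describes (GenericRecord straight to
Beam Rows, no TableRow in between), a rough usage sketch with the Java
SDK's AvroUtils; the conversion entry points exist in the SDK, while the
wrapper class here is illustrative only.

    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.schemas.utils.AvroUtils;
    import org.apache.beam.sdk.values.Row;

    // Sketch of the long-term fix: map an exported Avro record directly
    // to a Beam Row, so typed values (Boolean, Double, Lists) never pass
    // through the String-oriented TableRow shim at all.
    public class AvroToRowSketch {
      static Row toRow(GenericRecord record) {
        // Derive a Beam schema from the record's Avro schema...
        Schema beamSchema = AvroUtils.toBeamSchema(record.getSchema());
        // ...then convert the record field by field.
        return AvroUtils.toBeamRowStrict(record, beamSchema);
      }
    }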