Commented on the PR; it looks like there are just some double literals that need to be changed to strings.
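For anyone following along, the kind of change described above can be sketched in plain Java (the map and field name here are hypothetical stand-ins for a TableRow fixture, not code from the actual PR): a row whose values are expected to be Strings breaks downstream parsing when a raw double literal is stored, and the fix is to store the stringified form instead.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: TableRow-style values are expected to be Strings,
// so raw double literals in test fixtures must be stringified.
public class DoubleLiteralFix {
    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();

        // Before: a raw double literal. Downstream code that assumes
        // String values (e.g. Double.parseDouble on a cast) breaks on this.
        row.put("temperature", 3.5);
        Object v = row.get("temperature");
        System.out.println("raw double is String: " + (v instanceof String)); // false

        // After: the "one-line" fix, store the string form instead.
        row.put("temperature", String.valueOf(3.5));
        v = row.get("temperature");
        System.out.println("stringified is String: " + (v instanceof String)); // true
        System.out.println("parsed back: " + Double.parseDouble((String) v)); // 3.5
    }
}
```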
On Wed, Mar 31, 2021 at 7:07 AM Matthew Ouyang <[email protected]> wrote:

> Hello. Apologies for not linking to the PR with the failed unit tests in
> my last e-mail; the link is below. I'm stuck and I need help fixing this.
> I can't reproduce it in my dev environment because the build can't
> trigger a BQ job, and it seems like the failed tests complement each
> other (a collection of two elements; the first test is missing one
> element, the second test misses the other).
>
> https://github.com/apache/beam/pull/14350
>
> On Mon, Mar 29, 2021 at 9:28 AM Matthew Ouyang <[email protected]>
> wrote:
>
>> It looks like some unit tests failed. I'm unable to reproduce this in my
>> environment because the error I received (described in the PR) suggests
>> I can't even start the job in the first place. Any help would be
>> appreciated. Ideally I'd like to have this included in the next possible
>> release (2.30.0, it looks like), since my team had to do a few
>> workarounds to get past these issues.
>>
>> On Sun, Mar 28, 2021 at 2:25 AM Matthew Ouyang <[email protected]>
>> wrote:
>>
>>> Thank you for the feedback. I walked back my initial approach in favour
>>> of Brian's option (2) and also implemented a fix for lists as well
>>> (https://github.com/apache/beam/pull/14350). I agree with the tradeoff
>>> Brian pointed out, as it is consistent with the rest of that component.
>>> If performance ends up being a problem, BigQueryUtils could have a
>>> different mode for TableRow and Avro.
>>>
>>> On Tue, Mar 23, 2021 at 1:47 PM Reuven Lax <[email protected]> wrote:
>>>
>>>> Logically, all JSON values are strings. We have often put other
>>>> objects in there, which I believe works simply because of the implicit
>>>> .toString() method on those objects, but I'm not convinced this is
>>>> really correct.
>>>>
>>>> On Tue, Mar 23, 2021 at 6:38 AM Brian Hulette <[email protected]>
>>>> wrote:
>>>>
>>>>> Thank you for digging into this and figuring out how this bug was
>>>>> introduced!
>>>>>
>>>>> In the long term I think it would be preferable to avoid TableRow
>>>>> altogether in order to do a schema-aware read of Avro data from BQ.
>>>>> We can go directly from Avro GenericRecord to Beam Rows now [1]. This
>>>>> would also be preferable for Arrow, where we could produce Row
>>>>> instances that are references into an underlying Arrow RecordBatch
>>>>> (using RowWithGetters), rather than having to materialize each row to
>>>>> make a TableRow instance.
>>>>>
>>>>> For a short-term fix there are two options; both came up in Reuven's
>>>>> comments on BEAM-9613:
>>>>>
>>>>>> However if we expect to get Long, Double, etc. objects in the
>>>>>> TableRow, then this mapping code needs to handle those objects.
>>>>>> Handling them directly would be more efficient - converting to a
>>>>>> String would simply be a stopgap "one-line" fix.
>>>>>
>>>>> 1. Handle them directly [in BigQueryUtils]; this is what you've done in
>>>>> https://github.com/mouyang/beam/commit/326b291ab333c719a9f54446c34611581ea696eb
>>>>> 2. Convert to a String [in BigQueryAvroUtils]
>>>>>
>>>>> I don't have a strong preference, but I think (2) is a cleaner,
>>>>> albeit less performant, solution. It seems that BigQueryUtils expects
>>>>> all values in TableRow instances to be String instances. Since
>>>>> BigQueryAvroUtils is just a shim to convert GenericRecord to TableRow
>>>>> for use in BigQueryUtils, it should comply with that interface rather
>>>>> than making BigQueryUtils work around the discrepancy.
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/AvroRecordSchema.java
>>>>>
>>>>> On Mon, Mar 22, 2021 at 5:59 PM Matthew Ouyang
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> I'm working on fixing BEAM-9613 because I encountered this issue as
>>>>>> a result of using BigQueryIO.readTableRowsWithSchema().
>>>>>>
>>>>>> 1. BEAM-7526 provided support for Lists and Maps that came from the
>>>>>> JSON export format.
>>>>>> 2. BEAM-2879 switched the export format from JSON to Avro. This
>>>>>> caused BEAM-9613, since Avro no longer treated BQ BOOLEAN and FLOAT
>>>>>> as a Java String but rather as Java Boolean and Double.
>>>>>> 3. The switch from JSON to Avro also introduced an issue with fields
>>>>>> with BQ REPEATED mode.
>>>>>>
>>>>>> I have a simple fix to handle BQ BOOLEAN and FLOAT
>>>>>> (https://github.com/mouyang/beam/commit/326b291ab333c719a9f54446c34611581ea696eb),
>>>>>> however I'm a bit uncomfortable with it because:
>>>>>>
>>>>>> 1. This would introduce mixing of both the JSON and Avro export
>>>>>> formats.
>>>>>> 2. BEAM-8933, while still in progress, would introduce Arrow and
>>>>>> risk a regression.
>>>>>> 3. I haven't made a fix for REPEATED mode yet, but tests that use
>>>>>> BigQueryUtilsTest.BQ_ARRAY_ROW would have to change
>>>>>> (https://github.com/apache/beam/blob/e039ca28d6f806f30b87cae82e6af86694c171cd/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtilsTest.java#L306-L311).
>>>>>> I don't know if I should change this, because I don't know if Beam
>>>>>> wishes to continue support for the JSON format.
>>>>>>
>>>>>> I'd like to incorporate these changes in my project as soon as
>>>>>> possible, but I also need guidance on next steps that would be in
>>>>>> line with the general direction of the project. I'm looking forward
>>>>>> to any feedback. Thanks.
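The two short-term options discussed in the thread can be sketched in plain Java. This is an illustrative stand-in only: the method names below are hypothetical, and the real code lives in BigQueryAvroUtils and BigQueryUtils. The core issue is that Avro hands back typed values (Java Boolean for BQ BOOLEAN, Java Double for BQ FLOAT) while the existing TableRow-consuming code assumed String values.

```java
// Hypothetical sketch of the two short-term fixes from the thread.
// Avro yields typed values; the legacy JSON path yielded Strings.
public class ConversionOptions {

    // Option 2 (Brian's preference): stringify in the Avro -> TableRow
    // shim, so the TableRow keeps honoring the "all values are Strings"
    // contract that the downstream conversion expects.
    static Object stringifyAvroValue(Object avroValue) {
        if (avroValue instanceof Boolean || avroValue instanceof Double) {
            return avroValue.toString();
        }
        return avroValue;
    }

    // Option 1: make the TableRow-consuming side handle typed values
    // directly, avoiding the String round-trip (more efficient, but more
    // code paths to maintain).
    static double toDouble(Object tableRowValue) {
        if (tableRowValue instanceof Double) {
            return (Double) tableRowValue;               // typed value from Avro
        }
        return Double.parseDouble((String) tableRowValue); // legacy String form
    }

    public static void main(String[] args) {
        // Option 2: Boolean/Double become Strings before entering the row.
        System.out.println(stringifyAvroValue(true)); // true
        System.out.println(stringifyAvroValue(2.5));  // 2.5

        // Option 1: the reader accepts either representation.
        System.out.println(toDouble(2.5));   // 2.5
        System.out.println(toDouble("2.5")); // 2.5
    }
}
```

The tradeoff noted in the thread shows up directly here: option 2 keeps one contract at the cost of an extra stringify/parse round-trip per value, while option 1 is faster but spreads type handling across both representations.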

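Brian's long-term point, rows as references into an underlying columnar batch rather than materialized per-row objects, can also be sketched in plain Java. The Batch and Row classes below are hypothetical toy stand-ins for an Arrow RecordBatch and a RowWithGetters-style view; none of this is actual Beam or Arrow code.

```java
// Hypothetical sketch of the "row as a view" idea: instead of
// materializing a per-row map (like TableRow), a row holds only a
// reference into a columnar batch plus an index.
public class RowView {
    // Toy stand-in for a columnar batch: a single double column.
    static final class Batch {
        final double[] temperature;
        Batch(double[] temperature) { this.temperature = temperature; }
    }

    // The "row" copies nothing; getters index into the batch on demand.
    static final class Row {
        final Batch batch;
        final int index;
        Row(Batch batch, int index) { this.batch = batch; this.index = index; }
        double getTemperature() { return batch.temperature[index]; }
    }

    public static void main(String[] args) {
        Batch batch = new Batch(new double[] {1.5, 2.5, 3.5});
        Row row = new Row(batch, 1); // a view, not a copy
        System.out.println(row.getTemperature()); // 2.5
    }
}
```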