Logically, all JSON values are strings. We have often put other objects in
there, which I believe works only because of the implicit .toString() call
on those objects, but I'm not convinced this is really correct.
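For example (a minimal sketch; the field names and values here are made
up for illustration, but TableRow.set is the real API):

    import com.google.api.services.bigquery.model.TableRow;

    // TableRow is backed by a Map<String, Object>, so nothing enforces
    // the all-values-are-Strings convention at compile time.
    TableRow row = new TableRow();
    row.set("name", "widget");  // a String, as BigQueryUtils expects
    row.set("price", 1.5d);     // a Double slips in; downstream code only
                                // copes if something calls toString() on it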
On Tue, Mar 23, 2021 at 6:38 AM Brian Hulette <bhule...@google.com> wrote:

> Thank you for digging into this and figuring out how this bug was
> introduced!
>
> In the long term, I think it would be preferable to avoid TableRow
> altogether in order to do a schema-aware read of Avro data from BQ. We
> can go directly from Avro GenericRecord to Beam Rows now [1]. This
> would also be preferable for Arrow, where we could produce Row
> instances that are references into an underlying Arrow RecordBatch
> (using RowWithGetters), rather than having to materialize each row to
> make a TableRow instance.
>
> For a short-term fix there are two options; both came up in Reuven's
> comments on BEAM-9613:
>
>> However, if we expect to get Long, Double, etc. objects in the
>> TableRow, then this mapping code needs to handle those objects.
>> Handling them directly would be more efficient - converting to a
>> String would simply be a stopgap "one-line" fix.
>
> 1. Handle them directly [in BigQueryUtils]; this is what you've done in
> https://github.com/mouyang/beam/commit/326b291ab333c719a9f54446c34611581ea696eb
> 2. Convert to a String [in BigQueryAvroUtils]
>
> I don't have a strong preference, but I think (2) is the cleaner,
> albeit less performant, solution. It seems that BigQueryUtils expects
> all values in TableRow instances to be String instances. Since
> BigQueryAvroUtils is just a shim that converts GenericRecord to
> TableRow for use in BigQueryUtils, it should comply with that
> interface, rather than making BigQueryUtils work around the
> discrepancy.
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/AvroRecordSchema.java
>
> On Mon, Mar 22, 2021 at 5:59 PM Matthew Ouyang <matthew.ouy...@gmail.com>
> wrote:
>
>> I'm working on fixing BEAM-9613 because I encountered this issue as a
>> result of using BigQueryIO.readTableRowsWithSchema().
>>
>> 1. BEAM-7526 provided support for Lists and Maps that came from the
>> JSON export format.
>> 2. BEAM-2879 switched the export format from JSON to Avro. This
>> caused BEAM-9613, since Avro no longer treats BQ BOOLEAN and FLOAT
>> as a Java String but rather as Java Boolean and Double.
>> 3. The switch from JSON to Avro also introduced an issue with fields
>> in BQ REPEATED mode, because fields of this mode are represented
>> differently in the Avro export than in the JSON export.
>>
>> I have a simple fix to handle BQ BOOLEAN and FLOAT
>> (https://github.com/mouyang/beam/commit/326b291ab333c719a9f54446c34611581ea696eb),
>> however I'm a bit uncomfortable with it because:
>>
>> 1. This would introduce mixing of both the JSON and Avro export
>> formats.
>> 2. BEAM-8933, while still in progress, would introduce Arrow and
>> risks a similar regression.
>> 3. I haven't made a fix for REPEATED mode yet, but tests that use
>> BigQueryUtilsTest.BQ_ARRAY_ROW would have to change
>> (https://github.com/apache/beam/blob/e039ca28d6f806f30b87cae82e6af86694c171cd/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtilsTest.java#L306-L311).
>> I don't know if I should change this, because I don't know if Beam
>> wishes to continue supporting the JSON format.
>>
>> I'd like to incorporate these changes into my project as soon as
>> possible, but I also need guidance on next steps that would be in line
>> with the general direction of the project. I'm looking forward to any
>> feedback. Thanks.
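For reference, option (2) would amount to stringifying the typed values
coming out of the Avro export before they land in the TableRow. A rough
sketch (the helper name and its placement in BigQueryAvroUtils are my
assumptions, not the actual code):

    // Hypothetical helper for BigQueryAvroUtils: convert the typed
    // primitives that the Avro export produces back into the Strings
    // that BigQueryUtils expects to find inside a TableRow.
    private static Object maybeStringify(Object avroValue) {
      if (avroValue instanceof Boolean    // BQ BOOLEAN
          || avroValue instanceof Double  // BQ FLOAT
          || avroValue instanceof Long) { // BQ INTEGER
        return avroValue.toString();
      }
      return avroValue;
    }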