Thank you for the feedback. I walked back my initial approach in favour of Brian's option (2) and implemented a fix for lists as well (https://github.com/apache/beam/pull/14350). I agree with the tradeoff Brian pointed out, as it is consistent with the rest of that component. If performance ends up being a problem, BigQueryUtils could have a different mode for TableRow and Avro.
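For anyone following along, here is a minimal sketch of what option (2) amounts to: coerce typed Avro scalars (Boolean, Double, Long, ...) to their String form before they land in a TableRow, so that downstream TableRow-based code only ever sees Strings. This is an illustration, not the actual BigQueryAvroUtils code; the class and method names are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

public class AvroScalarStringify {

    // Convert a decoded Avro scalar to the String form that
    // TableRow-consuming code (e.g. BigQueryUtils) expects.
    public static Object toTableRowValue(Object avroValue) {
        if (avroValue instanceof List) {
            // REPEATED fields arrive as lists; convert each element too.
            return ((List<?>) avroValue).stream()
                .map(AvroScalarStringify::toTableRowValue)
                .collect(Collectors.toList());
        }
        if (avroValue instanceof Boolean || avroValue instanceof Number) {
            return avroValue.toString();
        }
        // Strings (and anything already string-like) pass through unchanged.
        return avroValue;
    }
}
```

The real fix lives in BigQueryAvroUtils and is driven by the BQ field type rather than by `instanceof` checks, but the shape of the conversion is the same.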
On Tue, Mar 23, 2021 at 1:47 PM Reuven Lax <[email protected]> wrote:

> Logically, all JSON values are strings. We have often put other objects in
> there, which I believe works simply because of the implicit .toString()
> method on those objects, but I'm not convinced this is really correct.
>
> On Tue, Mar 23, 2021 at 6:38 AM Brian Hulette <[email protected]> wrote:
>
>> Thank you for digging into this and figuring out how this bug was
>> introduced!
>>
>> In the long term I think it would be preferable to avoid TableRow
>> altogether in order to do a schema-aware read of Avro data from BQ. We can
>> go directly from Avro GenericRecord to Beam Rows now [1]. This would also
>> be preferable for Arrow, where we could produce Row instances that are
>> references into an underlying Arrow RecordBatch (using RowWithGetters),
>> rather than having to materialize each row to make a TableRow instance.
>>
>> For a short-term fix there are two options; both came up in Reuven's
>> comments on BEAM-9613:
>>
>>> However if we expect to get Long, Double, etc. objects in the TableRow,
>>> then this mapping code needs to handle those objects. Handling them
>>> directly would be more efficient - converting to a String would simply be
>>> a stopgap "one-line" fix.
>>
>> 1. handle them directly [in BigQueryUtils], this is what you've done in
>> https://github.com/mouyang/beam/commit/326b291ab333c719a9f54446c34611581ea696eb
>> 2. convert to a String [in BigQueryAvroUtils]
>>
>> I don't have a strong preference, but I think (2) is a cleaner, albeit
>> less performant, solution. It seems that BigQueryUtils expects all values
>> in TableRow instances to be String instances. Since BigQueryAvroUtils is
>> just a shim to convert GenericRecord to TableRow for use in BigQueryUtils,
>> it should comply with that interface rather than making BigQueryUtils work
>> around the discrepancy.
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/AvroRecordSchema.java
>>
>> On Mon, Mar 22, 2021 at 5:59 PM Matthew Ouyang <[email protected]> wrote:
>>
>>> I'm working on fixing BEAM-9613 because I encountered this issue as a
>>> result of using BigQueryIO.readTableRowsWithSchema().
>>>
>>> 1. BEAM-7526 provided support for Lists and Maps that came from the
>>> JSON export format.
>>> 2. BEAM-2879 switched the export format from JSON to Avro. This
>>> introduced BEAM-9613, since Avro no longer treated BQ BOOLEAN and FLOAT
>>> as a Java String but rather as Java Boolean and Double.
>>> 3. The switch from JSON to Avro also introduced an issue with fields
>>> of BQ REPEATED mode.
>>>
>>> I have a simple fix to handle BQ BOOLEAN and FLOAT (
>>> https://github.com/mouyang/beam/commit/326b291ab333c719a9f54446c34611581ea696eb),
>>> however I'm a bit uncomfortable with it because:
>>>
>>> 1. it would introduce mixing of both the JSON and Avro export formats,
>>> 2. BEAM-8933, while still in progress, would introduce Arrow and risk
>>> a regression,
>>> 3. I haven't made a fix for REPEATED mode yet, but tests that use
>>> BigQueryUtilsTest.BQ_ARRAY_ROW would have to change (
>>> https://github.com/apache/beam/blob/e039ca28d6f806f30b87cae82e6af86694c171cd/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtilsTest.java#L306-L311).
>>> I don't know if I should change this, because I don't know if Beam wishes
>>> to continue supporting the JSON format.
>>>
>>> I'd like to incorporate these changes in my project as soon as possible,
>>> but I also need guidance on next steps that would be in line with the
>>> general direction of the project. I'm looking forward to any feedback.
>>> Thanks.
>>
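As an aside on Brian's long-term suggestion, here is a toy sketch (not Beam code; all names here are hypothetical) of the RowWithGetters idea: a row object that is just an index into a shared columnar batch, so reading a field is a lookup into the batch rather than a per-row copy into a TableRow-like container.

```java
import java.util.List;

// Toy stand-in for a lazy row backed by a columnar batch. In Beam the
// batch would be an Arrow RecordBatch and the row a RowWithGetters; here
// the "batch" is simply a list of columns, each a list of values.
public class LazyRow {
    private final List<List<Object>> columns;
    private final int rowIndex;

    public LazyRow(List<List<Object>> columns, int rowIndex) {
        this.columns = columns;
        this.rowIndex = rowIndex;
    }

    // Field access dereferences into the shared batch; nothing is
    // materialized per row.
    public Object getValue(int fieldIndex) {
        return columns.get(fieldIndex).get(rowIndex);
    }
}
```

The point of the design is that N rows over a batch cost N small index objects, not N copies of the data, which is why going GenericRecord/Arrow -> Row directly would be cheaper than materializing a TableRow per record.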
