Logically, all JSON values are strings. We have often put other objects in
there, which I believe works only because of the implicit .toString() call
on those objects, but I'm not convinced this is really correct.
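For example (a minimal sketch; the field names and values here are made
up for illustration, but TableRow.set is the real API):

    import com.google.api.services.bigquery.model.TableRow;

    // TableRow is backed by a Map<String, Object>, so nothing enforces
    // the all-values-are-Strings convention at compile time.
    TableRow row = new TableRow();
    row.set("name", "widget");  // a String, as BigQueryUtils expects
    row.set("price", 1.5d);     // a Double slips in; downstream code only
                                // copes if something calls toString() on it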
On Tue, Mar 23, 2021 at 6:38 AM Brian Hulette <bhule...@google.com> wrote:

> Thank you for digging into this and figuring out how this bug was
> introduced!
>
> In the long term, I think it would be preferable to avoid TableRow
> altogether in order to do a schema-aware read of Avro data from BQ. We
> can go directly from Avro GenericRecord to Beam Rows now [1]. This
> would also be preferable for Arrow, where we could produce Row
> instances that are references into an underlying Arrow RecordBatch
> (using RowWithGetters), rather than having to materialize each row to
> make a TableRow instance.
>
> For a short-term fix there are two options; both came up in Reuven's
> comments on BEAM-9613:
>
>> However, if we expect to get Long, Double, etc. objects in the
>> TableRow, then this mapping code needs to handle those objects.
>> Handling them directly would be more efficient - converting to a
>> String would simply be a stopgap "one-line" fix.
>
> 1. Handle them directly [in BigQueryUtils]; this is what you've done in
> https://github.com/mouyang/beam/commit/326b291ab333c719a9f54446c34611581ea696eb
> 2. Convert to a String [in BigQueryAvroUtils]
>
> I don't have a strong preference, but I think (2) is the cleaner,
> albeit less performant, solution. It seems that BigQueryUtils expects
> all values in TableRow instances to be String instances. Since
> BigQueryAvroUtils is just a shim that converts GenericRecord to
> TableRow for use in BigQueryUtils, it should comply with that
> interface, rather than making BigQueryUtils work around the
> discrepancy.
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/AvroRecordSchema.java
>
> On Mon, Mar 22, 2021 at 5:59 PM Matthew Ouyang <matthew.ouy...@gmail.com>
> wrote:
>
>> I'm working on fixing BEAM-9613 because I encountered this issue as a
>> result of using BigQueryIO.readTableRowsWithSchema().
>>
>> 1. BEAM-7526 provided support for Lists and Maps that came from the
>> JSON export format.
>> 2. BEAM-2879 switched the export format from JSON to Avro. This
>> caused BEAM-9613, since Avro no longer treats BQ BOOLEAN and FLOAT
>> as a Java String but rather as Java Boolean and Double.
>> 3. The switch from JSON to Avro also introduced an issue with fields
>> in BQ REPEATED mode, because fields of this mode are represented
>> differently in the Avro export than in the JSON export.
>>
>> I have a simple fix to handle BQ BOOLEAN and FLOAT
>> (https://github.com/mouyang/beam/commit/326b291ab333c719a9f54446c34611581ea696eb),
>> however I'm a bit uncomfortable with it because:
>>
>> 1. This would introduce mixing of both the JSON and Avro export
>> formats.
>> 2. BEAM-8933, while still in progress, would introduce Arrow and
>> risks a similar regression.
>> 3. I haven't made a fix for REPEATED mode yet, but tests that use
>> BigQueryUtilsTest.BQ_ARRAY_ROW would have to change
>> (https://github.com/apache/beam/blob/e039ca28d6f806f30b87cae82e6af86694c171cd/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtilsTest.java#L306-L311).
>> I don't know if I should change this, because I don't know if Beam
>> wishes to continue supporting the JSON format.
>>
>> I'd like to incorporate these changes into my project as soon as
>> possible, but I also need guidance on next steps that would be in line
>> with the general direction of the project. I'm looking forward to any
>> feedback. Thanks.
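For reference, option (2) would amount to stringifying the typed values
coming out of the Avro export before they land in the TableRow. A rough
sketch (the helper name and its placement in BigQueryAvroUtils are my
assumptions, not the actual code):

    // Hypothetical helper for BigQueryAvroUtils: convert the typed
    // primitives that the Avro export produces back into the Strings
    // that BigQueryUtils expects to find inside a TableRow.
    private static Object maybeStringify(Object avroValue) {
      if (avroValue instanceof Boolean    // BQ BOOLEAN
          || avroValue instanceof Double  // BQ FLOAT
          || avroValue instanceof Long) { // BQ INTEGER
        return avroValue.toString();
      }
      return avroValue;
    }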