Hello,

Apologies for not linking to the PR with the failed unit tests in my last
e-mail. The link is below. I'm stuck and need help fixing this. I can't
reproduce the failures in my dev environment because the build can't
trigger a BQ job, and the failed tests seem to complement each other (the
collection has two elements; the first test is missing one element and the
second is missing the other).
https://github.com/apache/beam/pull/14350

On Mon, Mar 29, 2021 at 9:28 AM Matthew Ouyang <[email protected]> wrote:

> It looks like some unit tests failed. I'm unable to reproduce them in my
> environment because the error I received (described in the PR) suggests I
> can't even start the job in the first place. Any help would be
> appreciated. Ideally I'd like to have this included in the next possible
> release (2.30.0, it looks like) since my team had to do a few workarounds
> to get past these issues.
>
> On Sun, Mar 28, 2021 at 2:25 AM Matthew Ouyang <[email protected]> wrote:
>
>> Thank you for the feedback. I walked back my initial approach in favour
>> of Brian's option (2) and also implemented a fix for lists as well
>> (https://github.com/apache/beam/pull/14350). I agree with the tradeoff
>> Brian pointed out, as it is consistent with the rest of that component.
>> If performance ends up being a problem, BigQueryUtils could have a
>> different mode for TableRow and Avro.
>>
>> On Tue, Mar 23, 2021 at 1:47 PM Reuven Lax <[email protected]> wrote:
>>
>>> Logically, all JSON values are strings. We have often put other objects
>>> in there, which I believe works simply because of the implicit
>>> .toString() method on those objects, but I'm not convinced this is
>>> really correct.
>>>
>>> On Tue, Mar 23, 2021 at 6:38 AM Brian Hulette <[email protected]> wrote:
>>>
>>>> Thank you for digging into this and figuring out how this bug was
>>>> introduced!
>>>>
>>>> In the long term I think it would be preferable to avoid TableRow
>>>> altogether in order to do a schema-aware read of Avro data from BQ. We
>>>> can go directly from Avro GenericRecord to Beam Rows now [1]. This
>>>> would also be preferable for Arrow, where we could produce Row
>>>> instances that are references into an underlying Arrow RecordBatch
>>>> (using RowWithGetters), rather than having to materialize each row to
>>>> make a TableRow instance.
>>>>
>>>> For a short-term fix there are two options; both came up in Reuven's
>>>> comments on BEAM-9613:
>>>>
>>>>> However if we expect to get Long, Double, etc. objects in the
>>>>> TableRow, then this mapping code needs to handle those objects.
>>>>> Handling them directly would be more efficient - converting to a
>>>>> String would simply be a stopgap "one-line" fix.
>>>>
>>>> 1. Handle them directly [in BigQueryUtils]; this is what you've done in
>>>> https://github.com/mouyang/beam/commit/326b291ab333c719a9f54446c34611581ea696eb
>>>> 2. Convert to a String [in BigQueryAvroUtils]
>>>>
>>>> I don't have a strong preference, but I think (2) is a cleaner, albeit
>>>> less performant, solution. It seems that BigQueryUtils expects all
>>>> values in TableRow instances to be String instances. Since
>>>> BigQueryAvroUtils is just a shim that converts GenericRecord to
>>>> TableRow for use in BigQueryUtils, it should comply with that
>>>> interface rather than making BigQueryUtils work around the
>>>> discrepancy.
>>>>
>>>> [1] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/AvroRecordSchema.java
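To make option (2) concrete: a minimal sketch of the stringifying shim,
assuming a hypothetical helper inside BigQueryAvroUtils. This illustrates
the idea only; it is not the actual patch.

    import org.apache.avro.Schema;

    // Sketch of option (2): stringify scalar Avro values in the
    // GenericRecord-to-TableRow shim so a TableRow only ever holds
    // Strings, which is what BigQueryUtils already expects from the
    // old JSON export format.
    public class StringifySketch {
      static Object toTableRowValue(Schema.Type avroType, Object value) {
        switch (avroType) {
          case BOOLEAN: // BQ BOOLEAN arrives as java.lang.Boolean
          case DOUBLE:  // BQ FLOAT arrives as java.lang.Double
          case LONG:    // BQ INTEGER arrives as java.lang.Long
            return String.valueOf(value); // the stopgap "one-line" fix
          default:
            return value; // records/arrays handled recursively elsewhere
        }
      }
    }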
>>>> On Mon, Mar 22, 2021 at 5:59 PM Matthew Ouyang <[email protected]> wrote:
>>>>
>>>>> I'm working on fixing BEAM-9613 because I encountered this issue as a
>>>>> result of using BigQueryIO.readTableRowsWithSchema().
>>>>>
>>>>> 1. BEAM-7526 provided support for Lists and Maps that came from the
>>>>> JSON export format.
>>>>> 2. BEAM-2879 switched the export format from JSON to Avro. This
>>>>> caused BEAM-9613, since Avro no longer treated BQ BOOLEAN and FLOAT
>>>>> as a Java String but rather as Java Boolean and Double.
>>>>> 3. The switch from JSON to Avro also introduced an issue with fields
>>>>> in BQ REPEATED mode, because fields of this mode are likewise no
>>>>> longer represented as Strings.
>>>>>
>>>>> I have a simple fix to handle BQ BOOLEAN and FLOAT
>>>>> (https://github.com/mouyang/beam/commit/326b291ab333c719a9f54446c34611581ea696eb),
>>>>> however I'm a bit uncomfortable with it because:
>>>>>
>>>>> 1. It would introduce mixing of both the JSON and Avro export
>>>>> formats.
>>>>> 2. BEAM-8933, while still in progress, would introduce Arrow and
>>>>> risk a regression.
>>>>> 3. I haven't made a fix for REPEATED mode yet, but tests that use
>>>>> BigQueryUtilsTest.BQ_ARRAY_ROW would have to change (
>>>>> https://github.com/apache/beam/blob/e039ca28d6f806f30b87cae82e6af86694c171cd/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtilsTest.java#L306-L311).
>>>>> I don't know if I should change this, because I don't know whether
>>>>> Beam wishes to continue supporting the JSON format.
>>>>>
>>>>> I'd like to incorporate these changes into my project as soon as
>>>>> possible, but I also need guidance on next steps that would be in
>>>>> line with the general direction of the project. I'm looking forward
>>>>> to any feedback. Thanks.
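For contrast, a minimal sketch of option (1): teaching BigQueryUtils to
accept the typed objects the Avro export produces alongside the Strings
the old JSON export produced. The helper names below are hypothetical;
they illustrate the shape of the commit linked above, not its contents.

    // Sketch of option (1): accept both representations of a value.
    public class TypedValueSketch {
      static Object parseBoolean(Object value) {
        if (value instanceof Boolean) {
          return value;                              // Avro export path
        }
        return Boolean.parseBoolean((String) value); // JSON export path
      }

      static Object parseFloat64(Object value) {
        if (value instanceof Double) {
          return value;                              // Avro export path
        }
        return Double.parseDouble((String) value);   // JSON export path
      }
    }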

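As for the long-term direction Brian describes (GenericRecord straight to
Beam Rows, no TableRow in between), a rough usage sketch with the Java
SDK's AvroUtils; the conversion entry points exist in the SDK, while the
wrapper class here is illustrative only.

    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.schemas.utils.AvroUtils;
    import org.apache.beam.sdk.values.Row;

    // Sketch of the long-term fix: map an exported Avro record directly
    // to a Beam Row, so typed values (Boolean, Double, Lists) never pass
    // through the String-oriented TableRow shim at all.
    public class AvroToRowSketch {
      static Row toRow(GenericRecord record) {
        // Derive a Beam schema from the record's Avro schema...
        Schema beamSchema = AvroUtils.toBeamSchema(record.getSchema());
        // ...then convert the record field by field.
        return AvroUtils.toBeamRowStrict(record, beamSchema);
      }
    }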