Commented on the PR; it looks like there are just some double literals that need to be changed to strings.
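For anyone following along, the kind of change described above can be sketched in plain Java (the map and field name here are hypothetical stand-ins for a TableRow fixture, not code from the actual PR): a row whose values are expected to be Strings breaks downstream parsing when a raw double literal is stored, and the fix is to store the stringified form instead.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: TableRow-style values are expected to be Strings,
// so raw double literals in test fixtures must be stringified.
public class DoubleLiteralFix {
    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();

        // Before: a raw double literal. Downstream code that assumes
        // String values (e.g. Double.parseDouble on a cast) breaks on this.
        row.put("temperature", 3.5);
        Object v = row.get("temperature");
        System.out.println("raw double is String: " + (v instanceof String)); // false

        // After: the "one-line" fix, store the string form instead.
        row.put("temperature", String.valueOf(3.5));
        v = row.get("temperature");
        System.out.println("stringified is String: " + (v instanceof String)); // true
        System.out.println("parsed back: " + Double.parseDouble((String) v)); // 3.5
    }
}
```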
On Wed, Mar 31, 2021 at 7:07 AM Matthew Ouyang <[email protected]> wrote:

> Hello. Apologies for not linking to the PR with the failed unit tests in
> my last e-mail; the link is below. I'm stuck and I need help fixing this.
> I can't reproduce it in my dev environment because the build can't
> trigger a BQ job, and it seems like the failed tests complement each
> other (a collection of two elements; the first test is missing one
> element, the second test misses the other).
>
> https://github.com/apache/beam/pull/14350
>
> On Mon, Mar 29, 2021 at 9:28 AM Matthew Ouyang <[email protected]>
> wrote:
>
>> It looks like some unit tests failed. I'm unable to reproduce this in my
>> environment because the error I received (described in the PR) suggests
>> I can't even start the job in the first place. Any help would be
>> appreciated. Ideally I'd like to have this included in the next possible
>> release (2.30.0, it looks like), since my team had to do a few
>> workarounds to get past these issues.
>>
>> On Sun, Mar 28, 2021 at 2:25 AM Matthew Ouyang <[email protected]>
>> wrote:
>>
>>> Thank you for the feedback. I walked back my initial approach in favour
>>> of Brian's option (2) and also implemented a fix for lists as well
>>> (https://github.com/apache/beam/pull/14350). I agree with the tradeoff
>>> Brian pointed out, as it is consistent with the rest of that component.
>>> If performance ends up being a problem, BigQueryUtils could have a
>>> different mode for TableRow and Avro.
>>>
>>> On Tue, Mar 23, 2021 at 1:47 PM Reuven Lax <[email protected]> wrote:
>>>
>>>> Logically, all JSON values are strings. We have often put other
>>>> objects in there, which I believe works simply because of the implicit
>>>> .toString() method on those objects, but I'm not convinced this is
>>>> really correct.
>>>>
>>>> On Tue, Mar 23, 2021 at 6:38 AM Brian Hulette <[email protected]>
>>>> wrote:
>>>>
>>>>> Thank you for digging into this and figuring out how this bug was
>>>>> introduced!
>>>>>
>>>>> In the long term I think it would be preferable to avoid TableRow
>>>>> altogether in order to do a schema-aware read of Avro data from BQ.
>>>>> We can go directly from Avro GenericRecord to Beam Rows now [1]. This
>>>>> would also be preferable for Arrow, where we could produce Row
>>>>> instances that are references into an underlying Arrow RecordBatch
>>>>> (using RowWithGetters), rather than having to materialize each row to
>>>>> make a TableRow instance.
>>>>>
>>>>> For a short-term fix there are two options; both came up in Reuven's
>>>>> comments on BEAM-9613:
>>>>>
>>>>>> However if we expect to get Long, Double, etc. objects in the
>>>>>> TableRow, then this mapping code needs to handle those objects.
>>>>>> Handling them directly would be more efficient - converting to a
>>>>>> String would simply be a stopgap "one-line" fix.
>>>>>
>>>>> 1. Handle them directly [in BigQueryUtils]; this is what you've done in
>>>>> https://github.com/mouyang/beam/commit/326b291ab333c719a9f54446c34611581ea696eb
>>>>> 2. Convert to a String [in BigQueryAvroUtils]
>>>>>
>>>>> I don't have a strong preference, but I think (2) is a cleaner,
>>>>> albeit less performant, solution. It seems that BigQueryUtils expects
>>>>> all values in TableRow instances to be String instances. Since
>>>>> BigQueryAvroUtils is just a shim to convert GenericRecord to TableRow
>>>>> for use in BigQueryUtils, it should comply with that interface rather
>>>>> than making BigQueryUtils work around the discrepancy.
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/AvroRecordSchema.java
>>>>>
>>>>> On Mon, Mar 22, 2021 at 5:59 PM Matthew Ouyang
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> I'm working on fixing BEAM-9613 because I encountered this issue as
>>>>>> a result of using BigQueryIO.readTableRowsWithSchema().
>>>>>>
>>>>>> 1. BEAM-7526 provided support for Lists and Maps that came from the
>>>>>> JSON export format.
>>>>>> 2. BEAM-2879 switched the export format from JSON to Avro. This
>>>>>> caused BEAM-9613, since Avro no longer treated BQ BOOLEAN and FLOAT
>>>>>> as a Java String but rather as Java Boolean and Double.
>>>>>> 3. The switch from JSON to Avro also introduced an issue with fields
>>>>>> with BQ REPEATED mode.
>>>>>>
>>>>>> I have a simple fix to handle BQ BOOLEAN and FLOAT
>>>>>> (https://github.com/mouyang/beam/commit/326b291ab333c719a9f54446c34611581ea696eb),
>>>>>> however I'm a bit uncomfortable with it because:
>>>>>>
>>>>>> 1. This would introduce mixing of both the JSON and Avro export
>>>>>> formats.
>>>>>> 2. BEAM-8933, while still in progress, would introduce Arrow and
>>>>>> risk a regression.
>>>>>> 3. I haven't made a fix for REPEATED mode yet, but tests that use
>>>>>> BigQueryUtilsTest.BQ_ARRAY_ROW would have to change
>>>>>> (https://github.com/apache/beam/blob/e039ca28d6f806f30b87cae82e6af86694c171cd/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtilsTest.java#L306-L311).
>>>>>> I don't know if I should change this, because I don't know if Beam
>>>>>> wishes to continue support for the JSON format.
>>>>>>
>>>>>> I'd like to incorporate these changes in my project as soon as
>>>>>> possible, but I also need guidance on next steps that would be in
>>>>>> line with the general direction of the project. I'm looking forward
>>>>>> to any feedback. Thanks.
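The two short-term options discussed in the thread can be sketched in plain Java. This is an illustrative stand-in only: the method names below are hypothetical, and the real code lives in BigQueryAvroUtils and BigQueryUtils. The core issue is that Avro hands back typed values (Java Boolean for BQ BOOLEAN, Java Double for BQ FLOAT) while the existing TableRow-consuming code assumed String values.

```java
// Hypothetical sketch of the two short-term fixes from the thread.
// Avro yields typed values; the legacy JSON path yielded Strings.
public class ConversionOptions {

    // Option 2 (Brian's preference): stringify in the Avro -> TableRow
    // shim, so the TableRow keeps honoring the "all values are Strings"
    // contract that the downstream conversion expects.
    static Object stringifyAvroValue(Object avroValue) {
        if (avroValue instanceof Boolean || avroValue instanceof Double) {
            return avroValue.toString();
        }
        return avroValue;
    }

    // Option 1: make the TableRow-consuming side handle typed values
    // directly, avoiding the String round-trip (more efficient, but more
    // code paths to maintain).
    static double toDouble(Object tableRowValue) {
        if (tableRowValue instanceof Double) {
            return (Double) tableRowValue;               // typed value from Avro
        }
        return Double.parseDouble((String) tableRowValue); // legacy String form
    }

    public static void main(String[] args) {
        // Option 2: Boolean/Double become Strings before entering the row.
        System.out.println(stringifyAvroValue(true)); // true
        System.out.println(stringifyAvroValue(2.5));  // 2.5

        // Option 1: the reader accepts either representation.
        System.out.println(toDouble(2.5));   // 2.5
        System.out.println(toDouble("2.5")); // 2.5
    }
}
```

The tradeoff noted in the thread shows up directly here: option 2 keeps one contract at the cost of an extra stringify/parse round-trip per value, while option 1 is faster but spreads type handling across both representations.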

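Brian's long-term point, rows as references into an underlying columnar batch rather than materialized per-row objects, can also be sketched in plain Java. The Batch and Row classes below are hypothetical toy stand-ins for an Arrow RecordBatch and a RowWithGetters-style view; none of this is actual Beam or Arrow code.

```java
// Hypothetical sketch of the "row as a view" idea: instead of
// materializing a per-row map (like TableRow), a row holds only a
// reference into a columnar batch plus an index.
public class RowView {
    // Toy stand-in for a columnar batch: a single double column.
    static final class Batch {
        final double[] temperature;
        Batch(double[] temperature) { this.temperature = temperature; }
    }

    // The "row" copies nothing; getters index into the batch on demand.
    static final class Row {
        final Batch batch;
        final int index;
        Row(Batch batch, int index) { this.batch = batch; this.index = index; }
        double getTemperature() { return batch.temperature[index]; }
    }

    public static void main(String[] args) {
        Batch batch = new Batch(new double[] {1.5, 2.5, 3.5});
        Row row = new Row(batch, 1); // a view, not a copy
        System.out.println(row.getTemperature()); // 2.5
    }
}
```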