[ https://issues.apache.org/jira/browse/BEAM-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307350#comment-17307350 ]
Pablo Estrada commented on BEAM-11330:
--------------------------------------

Thanks [~liamhaworth01] for looking and finding this! I think that's a reasonable observation - I've only noticed this issue once before, so I don't think it's very common, but it is possible. Do you have time to apply a fix for it? : ) If not, I can take a look and fix it later.

> BigQueryServicesImpl.insertAll evaluates maxRowBatchSize after a row is added to the batch
> ------------------------------------------------------------------------------------------
>
>                 Key: BEAM-11330
>                 URL: https://issues.apache.org/jira/browse/BEAM-11330
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-gcp
>    Affects Versions: 2.22.0, 2.23.0, 2.24.0, 2.25.0
>            Reporter: Liam Haworth
>            Assignee: Pablo Estrada
>            Priority: P3
>
> When using the {{BigQueryIO.Write}} transformation, a set of pipeline options defined in {{BigQueryOptions}} becomes available to the pipeline. Two of these options are:
> * {{maxStreamingRowsToBatch}} - "The maximum number of rows to batch in a single streaming insert to BigQuery."
> * {{maxStreamingBatchSize}} - "The maximum byte size of a single streaming insert to BigQuery"
> Reading the description of {{maxStreamingBatchSize}}, I am given the impression that the BigQuery sink will ensure that each batch is at, or under, the configured maximum byte size.
> But after [reviewing the code of the internal sink transformation|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryServicesImpl.java#L826], I can see that the batching code first adds a row to the batch and only then compares the new batch size against the configured maximum (see the sketch below).
> The description of {{maxStreamingBatchSize}} gives the end user the impression that it will protect them from batches that exceed the size limit of the BigQuery streaming inserts API. In reality, it can produce a batch that massively exceeds the limit, and the transformation then gets stuck in a loop of constantly retrying the request.
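To illustrate the difference, here is a minimal, self-contained sketch of the two batching patterns. This is not the actual {{BigQueryServicesImpl.insertAll}} code; the {{flushBatch}} helper, the plain {{String}} rows, and the 20-byte limit are made up for the example. The first loop adds the row and only then compares the batch size to the limit, so the flushed batch can already be far over {{maxStreamingBatchSize}}; the second loop checks whether adding the row would exceed the limit and flushes the pending batch first.

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the batching behaviour described above.
// It is NOT the actual BigQueryServicesImpl code; flushBatch() is made up.
public class AddThenCheckBatching {

  // Stand-in for maxStreamingBatchSize (in bytes).
  static final long MAX_STREAMING_BATCH_SIZE = 20;

  public static void main(String[] args) {
    List<String> rows = List.of("tiny", "a much, much larger row payload", "tiny");

    // Current behaviour: add the row first, then compare the batch size
    // to the limit. The flushed batch can therefore already exceed it.
    System.out.println("add-then-check:");
    List<String> batch = new ArrayList<>();
    long batchBytes = 0;
    for (String row : rows) {
      batch.add(row);
      batchBytes += row.getBytes(StandardCharsets.UTF_8).length;
      if (batchBytes >= MAX_STREAMING_BATCH_SIZE) {
        flushBatch(batch, batchBytes); // may already be well over the limit
        batch = new ArrayList<>();
        batchBytes = 0;
      }
    }
    if (!batch.isEmpty()) {
      flushBatch(batch, batchBytes);
    }

    // One possible fix: flush the pending batch before adding a row that
    // would push it over the limit, so only a single row that is by itself
    // larger than the limit can ever produce an oversized request.
    System.out.println("check-then-add:");
    batch = new ArrayList<>();
    batchBytes = 0;
    for (String row : rows) {
      long rowBytes = row.getBytes(StandardCharsets.UTF_8).length;
      if (!batch.isEmpty() && batchBytes + rowBytes > MAX_STREAMING_BATCH_SIZE) {
        flushBatch(batch, batchBytes);
        batch = new ArrayList<>();
        batchBytes = 0;
      }
      batch.add(row);
      batchBytes += rowBytes;
    }
    if (!batch.isEmpty()) {
      flushBatch(batch, batchBytes);
    }
  }

  static void flushBatch(List<String> batch, long bytes) {
    System.out.println("  flushing " + batch.size() + " row(s), " + bytes + " bytes");
  }
}
{code}

With the check-before-add pattern, the only way a request can exceed the limit is when a single row is itself larger than the limit, which matches the expectation the option description sets.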