criptonyte opened a new issue, #35425:
URL: https://github.com/apache/beam/issues/35425

   ### What happened?
   
   We are reporting an issue where BigQueryIO failed because `maxRequestSize` was exceeded. This occurred when appending ProtoRows where the serialised size of a single row pushed the overall request beyond the BigQuery API's defined limit, despite the code's internal checks and a "TODO" comment suggesting this scenario is "nearly impossible".
   
   ### Problematic Code Location
   This issue originates around line [885](https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StorageApiWritesShardedRecords.java#L885) of `StorageApiWritesShardedRecords`.
   
   Specifically the code block:
   
   ```java
        // Handle the case where the request is too large.
        if (inserts.getSerializedSize() >= maxRequestSize) {
          if (inserts.getSerializedRowsCount() > 1) {
            // TODO(reuvenlax): Is it worth trying to handle this case by splitting the protoRows?
            // Given that we split the ProtoRows iterable at 2MB and the max request size is 10MB,
            // this scenario seems nearly impossible.
            LOG.error(
                "A request containing more than one row is over the request size limit of {}. "
                    + "This is unexpected. All rows in the request will be sent to the failed-rows PCollection.",
                maxRequestSize);
          }
   ```
   
   However, we found that the same problematic code also exists in `StorageApiWriteUnshardedRecords` at line [640](https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StorageApiWriteUnshardedRecords.java#L640).
   
   ### Context and Steps to Reproduce
   1. Pipeline: We utilise an Apache Beam pipeline (running on version 2.62.0) that processes and writes data to BigQuery using BigQueryIO via the Storage Write API.
   2. Data Trigger: The incident was triggered by a specific, albeit rare, data pattern within the input stream (a minimal sketch of this pattern follows the list below). A particular record generated an exceptionally large serialised payload when converted to ProtoRows. This large payload, when included in its micro-batch, caused the total size of the request to exceed the 10MB BigQuery API limit. Crucially, the micro-batch was constructed and submitted before any definitive check or pre-validation could prevent it from exceeding the BigQuery API's hard limit. This suggests a reliance on post-submission validation and a lack of proactive size management for dynamically growing ProtoRows payloads.
   3. Observed Behaviour: Despite the internal BigQueryIO logic and comments suggesting this would not occur, our production pipeline did encounter this exact issue. The described LOG.error message was triggered, leading to the failure of the affected micro-batch and downstream data synchronisation problems.
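   For reference, a self-contained sketch of the kind of pipeline and data pattern involved is below. The project, dataset, field name, payload size, and class name are placeholders rather than our production values, and the exact serialised size at which the limit is breached depends on the proto conversion.

   ```java
import java.util.Arrays;
import java.util.Collections;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;

public class OversizedRowRepro {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Build a single row whose serialised payload is close to the 10MB AppendRows limit.
    char[] filler = new char[9_500_000];
    Arrays.fill(filler, 'x');
    TableRow bigRow = new TableRow().set("payload", new String(filler));

    p.apply(Create.of(bigRow).withCoder(TableRowJsonCoder.of()))
        .apply(
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table") // placeholder destination
                .withSchema(
                    new TableSchema()
                        .setFields(
                            Collections.singletonList(
                                new TableFieldSchema().setName("payload").setType("STRING"))))
                .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

    p.run().waitUntilFinish();
  }
}
   ```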
   ### Expected Behaviour
   The logic should robustly handle scenarios where serialised ProtoRows within 
a micro-batch, or even a single very large row, approach or exceed the 
maxRequestSize (10MB). This should involve:
   1. More sophisticated pre-validation or dynamic splitting mechanisms to ensure that individual AppendRows requests always remain within BigQuery API limits (one possible splitting approach is sketched after this list).
   2. More resilient error handling that minimises the impact of an oversized 
record, ideally allowing the successful processing of other valid records in 
the batch.
   3. A re-evaluation and implementation of a solution for the "TODO" comment, 
acknowledging that this scenario is indeed possible in real-world production 
data.
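   On point 1, one possible direction is sketched below. This is a rough sketch only, not a claim about how BigQueryIO should implement it: the `ProtoRowsSplitter.split` helper is hypothetical, and the size accounting is approximate (it ignores per-row field tag and length overhead).

   ```java
import java.util.ArrayList;
import java.util.List;

import com.google.cloud.bigquery.storage.v1.ProtoRows;
import com.google.protobuf.ByteString;

public class ProtoRowsSplitter {

  /**
   * Splits an oversized ProtoRows batch into sub-batches whose serialised size stays under
   * maxRequestSize. A row that is individually larger than the limit ends up alone in its own
   * sub-batch, which the caller can then route to the failed-rows output instead of failing
   * the whole batch.
   */
  public static List<ProtoRows> split(ProtoRows oversized, long maxRequestSize) {
    List<ProtoRows> batches = new ArrayList<>();
    ProtoRows.Builder current = ProtoRows.newBuilder();
    for (ByteString row : oversized.getSerializedRowsList()) {
      // Start a new sub-batch if adding this row would push the current one over the limit.
      if (current.getSerializedRowsCount() > 0
          && current.build().getSerializedSize() + row.size() >= maxRequestSize) {
        batches.add(current.build());
        current = ProtoRows.newBuilder();
      }
      current.addSerializedRows(row);
    }
    if (current.getSerializedRowsCount() > 0) {
      batches.add(current.build());
    }
    return batches;
  }
}
   ```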
   
   ### Actual Behaviour
   The pipeline failed to write data to BigQuery, logging the LOG.error regarding the maxRequestSize being exceeded. This resulted in:
   1. Data sync failure for the affected micro-batch.
   2. Interruption of critical data flows.
   3. Operational effort due to manual intervention and remediation.
   
   ### Proposed Solution / Request
   1. Prioritise the implementation of a robust solution for the "TODO" within 
`StorageApiWritesShardedRecords` (and `StorageApiWriteUnshardedRecords`), as 
our experience confirms it is a real-world edge case with production impact.
   2. Implement more resilient splitting or batching logic to prevent single large records or small groups of records from causing the entire micro-batch to fail against the 10MB BigQuery limit (the interim, user-side mitigation we are evaluating is sketched below).
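   Until a fix lands, the interim mitigation we are evaluating on our side is sketched below. The class name, the 9MB threshold, and the JSON-based size estimate are assumptions; the estimate only approximates the eventual proto payload size and is not what BigQueryIO does internally.

   ```java
import java.nio.charset.StandardCharsets;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.TupleTag;

import com.google.api.services.bigquery.model.TableRow;

/**
 * Interim, user-side mitigation (sketch): route rows whose approximate encoded size already
 * exceeds a conservative threshold to a side output instead of handing them to BigQueryIO.
 * The 9MB threshold and the JSON-based size estimate are assumptions, not exact proto sizes.
 */
public class OversizedRowFilterFn extends DoFn<TableRow, TableRow> {
  public static final TupleTag<TableRow> VALID_ROWS = new TupleTag<TableRow>() {};
  public static final TupleTag<TableRow> OVERSIZED_ROWS = new TupleTag<TableRow>() {};

  private static final long SIZE_THRESHOLD_BYTES = 9L * 1024 * 1024;

  @ProcessElement
  public void processElement(ProcessContext c) {
    TableRow row = c.element();
    // Rough upper bound on the row's serialised size, using its JSON representation.
    long approxSize = row.toString().getBytes(StandardCharsets.UTF_8).length;
    c.output(approxSize >= SIZE_THRESHOLD_BYTES ? OVERSIZED_ROWS : VALID_ROWS, row);
  }
}
   ```

   The two outputs would be wired up with `ParDo.of(new OversizedRowFilterFn()).withOutputTags(OversizedRowFilterFn.VALID_ROWS, TupleTagList.of(OversizedRowFilterFn.OVERSIZED_ROWS))`, with the oversized output routed to a dead-letter sink.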
   
   ### Issue Priority
   
   Priority: 2 (default / most bugs should be filed as P2)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [x] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [x] Component: IO connector
   - [ ] Component: Beam YAML
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Infrastructure
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [x] Component: Google Cloud Dataflow Runner

