aprochko opened a new issue, #35755:
URL: https://github.com/apache/beam/issues/35755

   ### What happened?
   
   **Versions affected**: The problem exists in the last versions 2.65.0, 2.66.0
   
   **Description**:
   When running a Beam job with two logically isolated BigQueryIO write streams 
(e.g., writing to different tables), an intermittent issue arises where errors 
from one stream (OffsetOutOfRange referencing trinity_02) seem to incorrectly 
influence the state or retries of another stream (landmarks_02). This leads to 
SchemaMismatchedException errors on the landmarks_02 stream, where the schema 
fields of trinity_02 are incorrectly applied or checked against landmarks_02. 
This behavior suggests a potential cross-contamination or shared state issue 
between seemingly independent BigQueryIO operations within the same job.
   
   **Steps to Reproduce:**
   
   Set up a Beam Java job (versions 2.65.0 or 2.66.0).
   Configure two distinct BigQueryIO.write streams within this single job. For 
example:
   Stream A writing to projects/<project>/datasets/events_02/tables/landmarks_02
   Stream B writing to projects/<project>/datasets/events_02/tables/trinity_02
   Ensure both streams are active and processing data concurrently with 1.5k+ 
events per second in every stream.
   Run the job.
   Observe the logs for OffsetOutOfRange and SchemaMismatchedException errors.
   
   
   **Observed Behavior**:
   The job experiences failures with the following sequence of errors in the 
logs:
   
   An OffsetOutOfRange error occurs, where the streamName in the log refers to 
one table (**landmarks_02**), but the Entity field references a different table 
(**trinity_02**).
   
   Append to Context: key=ShardedKey{key=2, shardId=[1]} 
streamName=projects/<PID>/datasets/events_02/tables/**landmarks_02**/streams/**Cic2Y2Y4YjUwOS0wMDAwLTI2YzUtOGU3ZC1mNGY1ZTgwOWVjMmM6czI**
 offset=**5952366** numRows=0 tryIteration: 1 failed with 
com.google.cloud.bigquery.storage.v1.Exceptions$OffsetOutOfRange: OUT_OF_RANGE: 
The offset is beyond stream, expected offset 52858, received 5952366 Entity: 
projects/<PID>/datasets/events_02/tables/**trinity_02**/streams/**CiQ2ZDM0MWFiNy0wMDAwLTIyNDItYmY1Yi04ODNkMjRmOGVhYTQ**
 Will retry with a new stream
   This single log message indicates a mix-up between landmarks_02 and 
trinity_02.
   
   Following this, the landmarks_02 stream enters a retry loop, but subsequent 
retries fail with SchemaMismatchedException. Crucially, the error message 
indicates that the schema fields from trinity_02 (**field1trinity, 
field2trinity, field3trinity**) are being checked against the **landmarks_02** 
table. This suggests that the schema context for landmarks_02 has been 
corrupted or incorrectly switched to that of trinity_02.
   
   Append to Context: key=ShardedKey{key=2, shardId=[1]} 
streamName=projects/<PID>/datasets/events_02/tables/landmarks_02/streams/**Cic2YjI1MmQ0Ny0wMDAwLTJiYjUtYWRkOS0zYzI4NmQzNjJkY2U6czI**
 offset=**0** numRows=0 tryIteration: **240** failed with 
com.google.cloud.bigquery.storage.v1.Exceptions$SchemaMismatchedException: 
INVALID_ARGUMENT: Input schema has more fields than BigQuery schema, extra 
fields: '**field1trinity,field2trinity,field3trinity'** Entity: 
projects/<PID>/datasets/events_02/tables/**landmarks_02**/streams/**Cic2YjI1MmQ0Ny0wMDAwLTJiYjUtYWRkOS0zYzI4NmQzNjJkY2U6czI**
 Will retry with a new stream
   
   This retry loop continues for a significant number of iterations (e.g., 
240), effectively stalling the **landmarks_02** stream.
   
   **Expected Behavior**:
   Each BigQueryIO write stream should operate independently without its state 
or schema context being influenced by errors or operations pertaining to other 
BigQueryIO streams within the same job.
   
   ### Issue Priority
   
   Priority: 2 (default / most bugs should be filed as P2)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [x] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [x] Component: IO connector
   - [ ] Component: Beam YAML
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Infrastructure
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to