aprochko opened a new issue, #35755:
URL: https://github.com/apache/beam/issues/35755
### What happened?
**Versions affected**: The problem exists in the latest releases, 2.65.0 and 2.66.0
**Description**:
When running a Beam job with two logically isolated BigQueryIO write streams
(e.g., writing to different tables), an intermittent issue arises where errors
from one stream (OffsetOutOfRange referencing trinity_02) seem to incorrectly
influence the state or retries of another stream (landmarks_02). This leads to
SchemaMismatchedException errors on the landmarks_02 stream, where the schema
fields of trinity_02 are incorrectly applied or checked against landmarks_02.
This behavior suggests a potential cross-contamination or shared state issue
between seemingly independent BigQueryIO operations within the same job.
**Steps to Reproduce:**
1. Set up a Beam Java job (version 2.65.0 or 2.66.0).
2. Configure two distinct BigQueryIO.write streams within this single job (see the sketch after these steps), for example:
   - Stream A writing to projects/<project>/datasets/events_02/tables/landmarks_02
   - Stream B writing to projects/<project>/datasets/events_02/tables/trinity_02
3. Ensure both streams are active and processing data concurrently at 1.5k+ events per second in each stream.
4. Run the job.
5. Observe the logs for OffsetOutOfRange and SchemaMismatchedException errors.
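For context, here is a minimal sketch of the kind of pipeline this was observed with, assuming the exactly-once Storage Write API path (the per-stream offsets in the logs below point at `STORAGE_WRITE_API` rather than `STORAGE_API_AT_LEAST_ONCE`). The synthetic sources, the schemas, and the landmarks_02 field name are placeholders, not the real job:

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.joda.time.Duration;

public class TwoStreamWriteRepro {

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    // Two logically independent write paths in the same job. The landmarks_02
    // field name is hypothetical; the trinity_02 names come from the error log.
    addWrite(p, "Landmarks", "<project>:events_02.landmarks_02", "field1landmark");
    addWrite(p, "Trinity", "<project>:events_02.trinity_02",
        "field1trinity", "field2trinity", "field3trinity");

    p.run();
  }

  private static void addWrite(Pipeline p, String stepName, String table, String... fieldNames) {
    List<TableFieldSchema> fields = new ArrayList<>();
    for (String f : fieldNames) {
      fields.add(new TableFieldSchema().setName(f).setType("STRING"));
    }
    TableSchema schema = new TableSchema().setFields(fields);
    String firstField = fieldNames[0];

    // Synthetic unbounded source standing in for the real input
    // (~1.5k events/sec per stream in the original job).
    p.apply(stepName + "Source",
            GenerateSequence.from(0).withRate(1500, Duration.standardSeconds(1)))
        .apply(stepName + "ToRow",
            MapElements.into(TypeDescriptor.of(TableRow.class))
                .via((Long i) -> new TableRow().set(firstField, "v" + i)))
        .setCoder(TableRowJsonCoder.of())
        .apply(stepName + "Write",
            BigQueryIO.writeTableRows()
                .to(table)
                .withSchema(schema)
                .withMethod(Method.STORAGE_WRITE_API)
                // Exactly-once Storage Write API streaming needs a triggering
                // frequency and a bounded number of streams.
                .withTriggeringFrequency(Duration.standardSeconds(5))
                .withNumStorageWriteApiStreams(2)
                .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND));
  }
}
```

The only important part is that two independent `BigQueryIO.writeTableRows()` transforms with different destinations and different schemas run concurrently within one job.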
**Observed Behavior**:
The job experiences failures with the following sequence of errors in the
logs:
An OffsetOutOfRange error occurs, where the streamName in the log refers to
one table (**landmarks_02**), but the Entity field references a different table
(**trinity_02**).
Append to Context: key=ShardedKey{key=2, shardId=[1]}
streamName=projects/<PID>/datasets/events_02/tables/**landmarks_02**/streams/**Cic2Y2Y4YjUwOS0wMDAwLTI2YzUtOGU3ZC1mNGY1ZTgwOWVjMmM6czI**
offset=**5952366** numRows=0 tryIteration: 1
failed with com.google.cloud.bigquery.storage.v1.Exceptions$OffsetOutOfRange: OUT_OF_RANGE: The offset is beyond stream, expected offset 52858, received 5952366
Entity: projects/<PID>/datasets/events_02/tables/**trinity_02**/streams/**CiQ2ZDM0MWFiNy0wMDAwLTIyNDItYmY1Yi04ODNkMjRmOGVhYTQ**
Will retry with a new stream
This single log message indicates a mix-up between landmarks_02 and
trinity_02.
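One way to double-check that the two stream names in this log line really belong to different tables is to look them up directly with the Storage Write API client that the connector uses. A minimal diagnostic sketch (the stream names below are placeholders for the ones in the log, and streams that have already been finalized may no longer resolve):

```java
import com.google.cloud.bigquery.storage.v1.BigQueryWriteClient;
import com.google.cloud.bigquery.storage.v1.WriteStream;

public class StreamOwnershipCheck {
  public static void main(String[] args) throws Exception {
    // Placeholders: the streamName from the "Append to Context" part of the log
    // and the stream from the Entity field of the OffsetOutOfRange error.
    String appendContextStream =
        "projects/<PID>/datasets/events_02/tables/landmarks_02/streams/<stream-id-1>";
    String errorEntityStream =
        "projects/<PID>/datasets/events_02/tables/trinity_02/streams/<stream-id-2>";

    try (BigQueryWriteClient client = BigQueryWriteClient.create()) {
      for (String name : new String[] {appendContextStream, errorEntityStream}) {
        // The parent table is part of the stream resource name, so a successful
        // lookup shows which table the server considers the stream to belong to.
        WriteStream ws = client.getWriteStream(name);
        System.out.printf("%s type=%s created=%s%n", ws.getName(), ws.getType(), ws.getCreateTime());
      }
    }
  }
}
```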
Following this, the landmarks_02 stream enters a retry loop, but subsequent
retries fail with SchemaMismatchedException. Crucially, the error message
indicates that the schema fields from trinity_02 (**field1trinity,
field2trinity, field3trinity**) are being checked against the **landmarks_02**
table. This suggests that the schema context for landmarks_02 has been
corrupted or incorrectly switched to that of trinity_02.
Append to Context: key=ShardedKey{key=2, shardId=[1]}
streamName=projects/<PID>/datasets/events_02/tables/landmarks_02/streams/**Cic2YjI1MmQ0Ny0wMDAwLTJiYjUtYWRkOS0zYzI4NmQzNjJkY2U6czI**
offset=**0** numRows=0 tryIteration: **240**
failed with com.google.cloud.bigquery.storage.v1.Exceptions$SchemaMismatchedException: INVALID_ARGUMENT: Input schema has more fields than BigQuery schema, extra fields: '**field1trinity,field2trinity,field3trinity**'
Entity: projects/<PID>/datasets/events_02/tables/**landmarks_02**/streams/**Cic2YjI1MmQ0Ny0wMDAwLTJiYjUtYWRkOS0zYzI4NmQzNjJkY2U6czI**
Will retry with a new stream
This retry loop continues for a large number of iterations (240 in the log above), effectively stalling the **landmarks_02** stream.
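For completeness, a small sketch (plain google-cloud-bigquery client, nothing Beam-specific) that lists the fields of both tables. As described above, field1trinity, field2trinity and field3trinity belong only to trinity_02, so their appearance as "extra fields" in an append against landmarks_02 points to the row payload or schema context prepared for trinity_02 leaking into the landmarks_02 write path:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;

public class SchemaCrossCheck {
  public static void main(String[] args) {
    BigQuery bq = BigQueryOptions.getDefaultInstance().getService();
    for (String tableName : new String[] {"landmarks_02", "trinity_02"}) {
      // Uses the default project from the environment's credentials.
      Table table = bq.getTable(TableId.of("events_02", tableName));
      System.out.println(tableName + ":");
      // Print the field names that BigQuery itself reports for each table.
      for (Field field : table.getDefinition().getSchema().getFields()) {
        System.out.println("  " + field.getName());
      }
    }
  }
}
```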
**Expected Behavior**:
Each BigQueryIO write stream should operate independently without its state
or schema context being influenced by errors or operations pertaining to other
BigQueryIO streams within the same job.
### Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
### Issue Components
- [ ] Component: Python SDK
- [x] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [x] Component: IO connector
- [ ] Component: Beam YAML
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Infrastructure
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner