shahar1 opened a new issue, #66963: URL: https://github.com/apache/airflow/issues/66963
### Apache Airflow Provider(s)

google

### What happened?

`BigQueryStreamingBufferEmptySensor` (added in #66652) decides the streaming buffer is empty by checking `table.streaming_buffer is None`. That check is unreliable because BigQuery's `streamingBuffer` table-metadata field is **eventually consistent**. For a window of several seconds *after* a streaming insert, the row is physically in the streaming buffer but `table.streaming_buffer` is still `None`. During that window the sensor reports the buffer **empty** — a false negative.

`streaming_buffer is None` is therefore ambiguous — it means *either*:

- "fully flushed / table never had streamed rows", or
- "rows were just streamed in, metadata hasn't caught up yet".

The sensor cannot tell these apart, so a DML task placed downstream of it (`UPDATE`/`DELETE`/`MERGE`) can still hit `UPDATE or DELETE statement over table ... would affect rows in the streaming buffer` — the exact error the sensor exists to prevent.

### Evidence

Empirical timing check against real BigQuery — stream one row, then poll `get_table().streaming_buffer`:

```
t=  1.2s      streaming_buffer = None                     -> sensor reports EMPTY (false)
t= 11.6s      streaming_buffer present (estimated_rows=1) -> sensor WAITs (correct)
t= 22s..302s  still present, estimated_rows=1             -> stays correct
```

There is a ~10–12s false-empty window. System tests run with the simulated executor (tasks run back-to-back, near-zero scheduling overhead), so the sensor's first poke fires ~1–2s after the upstream streaming-insert task finishes — squarely inside that window.

### What you think should happen instead

The sensor should not report an empty buffer while rows are still buffered. Possible directions to evaluate:

- Require the sensor to observe a non-empty → empty *transition* rather than trusting a single `None` reading.
- Otherwise, clearly document the eventual-consistency limitation in the sensor docstring and treat it as best-effort, so users know to add their own guard between a streaming insert and the sensor.

**Acceptance criteria:** either the sensor no longer false-passes immediately after a streaming insert, or the limitation is clearly documented — at which point the workaround below can be removed.

### Workaround currently in place

The `example_bigquery_sensors` system test's `streaming_insert` task polls `get_table().streaming_buffer` until it becomes non-`None` before completing, so the metadata is caught up before `check_streaming_buffer_empty` runs its first poke. This makes the system test deterministic but does **not** fix the sensor for real users. It is tagged in the test with a link back to this issue.

### How to reproduce

1. Create a BigQuery table.
2. Stream a row into it (`BigQueryHook.insert_all`).
3. Immediately call `BigQueryStreamingBufferEmptySensor.poke()` — it returns `True` (empty) although the row is in the streaming buffer.

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)

---

Drafted-by: Claude Code (Opus 4.7) (no human review before posting)
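For illustration, the transition-based direction could be sketched roughly as below. This is only a sketch of the core idea, not the provider's API: the `TransitionAwarePoke` class name and the injected `get_buffer` callable (standing in for `client.get_table(...).streaming_buffer`) are hypothetical.

```python
from typing import Callable, Optional


class TransitionAwarePoke:
    """Hypothetical sketch: only report 'empty' after first observing the
    buffer as non-empty, so a stale ``None`` immediately after a streaming
    insert is not mistaken for 'fully flushed'."""

    def __init__(self, get_buffer: Callable[[], Optional[object]]):
        # get_buffer stands in for ``client.get_table(...).streaming_buffer``
        self._get_buffer = get_buffer
        self._seen_non_empty = False

    def poke(self) -> bool:
        buffer = self._get_buffer()
        if buffer is not None:
            self._seen_non_empty = True
            return False  # rows still buffered -> keep waiting
        # ``None`` only counts as "empty" once the non-empty -> empty
        # transition has actually been observed.
        return self._seen_non_empty


# Simulated metadata timeline: stale Nones right after the insert,
# then the buffer appears in metadata, then it drains.
readings = iter([None, None, "buffer", "buffer", None])
sensor = TransitionAwarePoke(lambda: next(readings))
results = [sensor.poke() for _ in range(5)]
print(results)  # the stale early Nones no longer pass the sensor
```

A real implementation would still need a timeout or first-poke grace period for tables that genuinely never had streamed rows (otherwise the sensor would wait forever); the sketch shows only how the transition requirement removes the false-empty window.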

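For reference, the wait-for-metadata step in the workaround above amounts to a small polling loop. A minimal sketch, where `get_streaming_buffer` is an illustrative stand-in for `hook.get_client().get_table(...).streaming_buffer` and the timings are arbitrary:

```python
import time
from typing import Callable, Optional


def wait_for_streaming_buffer(
    get_streaming_buffer: Callable[[], Optional[object]],
    timeout: float = 60.0,
    interval: float = 0.01,
) -> bool:
    """Poll until table metadata shows a streaming buffer, or time out.

    Mirrors the system-test workaround: only finish the streaming-insert
    task once ``streaming_buffer`` is non-None, so the downstream sensor's
    first poke cannot land inside the false-empty metadata window.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_streaming_buffer() is not None:
            return True
        time.sleep(interval)
    return False


# Fake metadata: None for the first two polls, then the buffer shows up.
readings = iter([None, None, "buffer"])
print(wait_for_streaming_buffer(lambda: next(readings)))  # True
```

Note this only makes the *metadata* consistent before the sensor starts; it does not change the sensor's semantics, which is why the issue treats it as a test-level workaround rather than a fix.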