shahar1 opened a new issue, #66963:
URL: https://github.com/apache/airflow/issues/66963

   ### Apache Airflow Provider(s)
   
   google
   
   ### What happened?
   
   `BigQueryStreamingBufferEmptySensor` (added in #66652) decides the streaming 
buffer is empty by checking `table.streaming_buffer is None`. That check is 
unreliable because BigQuery's `streamingBuffer` table-metadata field is 
**eventually consistent**.
   
   For a window of several seconds *after* a streaming insert, the row is 
physically in the streaming buffer but `table.streaming_buffer` is still 
`None`. During that window the sensor reports the buffer **empty** — a false 
negative.
   
   `streaming_buffer is None` is therefore ambiguous — it means *either*:
   - "fully flushed / table never had streamed rows", or
   - "rows were just streamed in, metadata hasn't caught up yet".
   
   The sensor cannot tell these apart, so a DML task placed downstream of it 
(`UPDATE`/`DELETE`/`MERGE`) can still hit `UPDATE or DELETE statement over 
table ... would affect rows in the streaming buffer` — the exact error the 
sensor exists to prevent.
   
   ### Evidence
   
   Empirical timing check against real BigQuery — stream one row, then poll 
`get_table().streaming_buffer`:
   
   ```
   t=  1.2s      streaming_buffer = None                     -> sensor reports EMPTY (false)
   t= 11.6s      streaming_buffer present (estimated_rows=1) -> sensor WAITs (correct)
   t= 22s..302s  still present, estimated_rows=1             -> stays correct
   ```
   
   There is a ~10–12s false-empty window. System tests run with the simulated 
executor (tasks run back-to-back, near-zero scheduling overhead), so the 
sensor's first poke fires ~1–2s after the upstream streaming-insert task 
finishes — squarely inside that window.
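The timing above can be gathered with a small polling loop. This is a sketch, assuming a `google-cloud-bigquery` client and a table that just received a streamed row; the poll logic is factored out so it can be exercised without cloud credentials:

```python
import time


def seconds_until_buffer_visible(get_streaming_buffer, interval=0.5, timeout=60.0):
    """Poll get_streaming_buffer() (a zero-arg callable returning the table's
    streaming_buffer attribute) until it is non-None.
    Returns elapsed seconds, or None on timeout."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if get_streaming_buffer() is not None:
            return time.monotonic() - start
        time.sleep(interval)
    return None


# Hedged usage against real BigQuery (table name is a placeholder):
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   client.insert_rows_json("project.dataset.table", [{"col": "v"}])
#   lag = seconds_until_buffer_visible(
#       lambda: client.get_table("project.dataset.table").streaming_buffer
#   )
```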
   
   ### What you think should happen instead
   
   The sensor should not report an empty buffer while rows are still buffered. 
Possible directions to evaluate:
   
   - Require the sensor to observe a non-empty → empty *transition* rather than 
trusting a single `None` reading.
   - Otherwise, clearly document the eventual-consistency limitation in the 
sensor docstring and treat it as best-effort, so users know to add their own 
guard between a streaming insert and the sensor.
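The transition-based direction could look roughly like this (a sketch under assumptions; the class name is hypothetical, and a real implementation would still need a grace period or timeout so tables that never had streamed rows do not wait forever):

```python
class TransitionAwareBufferCheck:
    """Only trust `streaming_buffer is None` after the buffer has been
    observed non-empty at least once, i.e. after a non-empty -> empty
    transition. Sketch only; not the provider's implementation."""

    def __init__(self):
        self._seen_non_empty = False

    def poke(self, streaming_buffer) -> bool:
        if streaming_buffer is not None:
            self._seen_non_empty = True
            return False  # rows still buffered: keep waiting
        # A None reading is ambiguous until a transition has been observed.
        return self._seen_non_empty
```

Under this rule, the stale `None` reading at t=1.2s in the trace above would make the sensor wait instead of false-passing.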
   
   **Acceptance criteria:** either the sensor no longer false-passes 
immediately after a streaming insert, or the limitation is clearly documented — 
at which point the workaround below can be removed.
   
   ### Workaround currently in place
   
   The `example_bigquery_sensors` system test's `streaming_insert` task polls 
`get_table().streaming_buffer` until it becomes non-`None` before completing, 
so the metadata is caught up before `check_streaming_buffer_empty` runs its 
first poke. This makes the system test deterministic but does **not** fix the 
sensor for real users. It is tagged in the test with a link back to this issue.
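The workaround amounts to a poll-until-visible loop at the tail of the insert task; a minimal sketch (the `get_table` callable and the timeout values are assumptions, e.g. `BigQueryHook(...).get_client().get_table` in the system test):

```python
import time


def wait_for_buffer_metadata(get_table, table_id, interval=2.0, timeout=120.0):
    """Block until get_table(table_id).streaming_buffer is non-None, so a
    downstream emptiness check cannot land in the stale-None window.
    Returns True once the metadata is visible, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_table(table_id).streaming_buffer is not None:
            return True
        time.sleep(interval)
    return False
```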
   
   ### How to reproduce
   
   1. Create a BigQuery table.
   2. Stream a row into it (`BigQueryHook.insert_all`).
   3. Immediately call `BigQueryStreamingBufferEmptySensor.poke()` — it returns 
`True` (empty) although the row is in the streaming buffer.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   
   ---
   Drafted-by: Claude Code (Opus 4.7) (no human review before posting)
   

