prabhnoor0212 opened a new issue, #38035:
URL: https://github.com/apache/beam/issues/38035

   ### What happened?
   
   There appears to be a bug in:
   
   `sdks/python/apache_beam/transforms/enrichment_handlers/bigquery.py`
   
   In batch mode, requests are collected into a map keyed by the enrichment key:
   
   ```python
   requests_map[self.create_row_key(req)] = req
   ```
   
   If more than one request in the same batch has the same enrichment key, the 
newer request overwrites the earlier one in `requests_map`.
   
   As a result, duplicate-key requests in the same batch are not all preserved 
and matched correctly.
   
   ### What did you expect to happen?
   
   When multiple requests in the same batch share the same enrichment key, all 
of them should be retained and handled correctly.
   
   ### What is the actual behavior?
   
   Only the latest request for a given key is stored in `requests_map`, so 
earlier requests with the same key are lost.
   
   ### Example
   
   Given batched requests like:
   
   - `Row(id=1, ...)`
   - `Row(id=1, ...)`
   
   both requests produce the same enrichment key. The second request overwrites 
the first in `requests_map`, which can cause one of the requests to be dropped 
or treated as unmatched incorrectly.
   
   ### Proposed fix
   
   Store a collection of requests per key instead of a single request, for 
example a list or queue, and then consume one request per matching response key 
during result processing.
   
   ### Additional context
   
   This seems to affect the batch path in `BigQueryEnrichmentHandler.__call__`, 
not the single-request path.
   
   
   ### Issue Priority
   
   Priority: 2 (default / most bugs should be filed as P2)
   
   ### Issue Components
   
   - [x] Component: Python SDK
   - [ ] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam YAML
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Infrastructure
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to