prabhnoor0212 opened a new issue, #38035: URL: https://github.com/apache/beam/issues/38035
### What happened? There appears to be a bug in: `sdks/python/apache_beam/transforms/enrichment_handlers/bigquery.py` In batch mode, requests are collected into a map keyed by the enrichment key: ```python requests_map[self.create_row_key(req)] = req ``` If more than one request in the same batch has the same enrichment key, the newer request overwrites the earlier one in `requests_map`. As a result, duplicate-key requests in the same batch are not all preserved and matched correctly. ### What did you expect to happen? When multiple requests in the same batch share the same enrichment key, all of them should be retained and handled correctly. ### What is the actual behavior? Only the latest request for a given key is stored in `requests_map`, so earlier requests with the same key are lost. ### Example Given batched requests like: - `Row(id=1, ...)` - `Row(id=1, ...)` both requests produce the same enrichment key. The second request overwrites the first in `requests_map`, which can cause one of the requests to be dropped or treated as unmatched incorrectly. ### Proposed fix Store a collection of requests per key instead of a single request, for example a list or queue, and then consume one request per matching response key during result processing. ### Additional context This seems to affect the batch path in `BigQueryEnrichmentHandler.__call__`, not the single-request path. ### Issue Priority Priority: 2 (default / most bugs should be filed as P2) ### Issue Components - [x] Component: Python SDK - [ ] Component: Java SDK - [ ] Component: Go SDK - [ ] Component: Typescript SDK - [ ] Component: IO connector - [ ] Component: Beam YAML - [ ] Component: Beam examples - [ ] Component: Beam playground - [ ] Component: Beam katas - [ ] Component: Website - [ ] Component: Infrastructure - [ ] Component: Spark Runner - [ ] Component: Flink Runner - [ ] Component: Samza Runner - [ ] Component: Twister2 Runner - [ ] Component: Hazelcast Jet Runner - [ ] Component: Google Cloud Dataflow Runner -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
