Below0 opened a new pull request, #15740:
URL: https://github.com/apache/iceberg/pull/15740

   ## Problem
   
   `HashKeyGenerator.SelectorKey` was missing `writeParallelism` and 
`distributionMode` from its `equals()` and `hashCode()` methods. As a result, 
`computeIfAbsent` always hit the cache after the first record for a given 
table, silently reusing a stale `KeySelector` even when these values changed.
   
   This contradicts the class-level Javadoc, which states:
   > "Caching ensures that a new key selector is also created when … the 
user-provided metadata changes (e.g. distribution mode, write parallelism)."
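
The failure mode can be reproduced in isolation. The following is a minimal sketch, not the actual Iceberg code: `BuggyKey`, `selectorFor`, and the string "selectors" are hypothetical stand-ins, and only `writeParallelism` is modelled. Because the key ignores `writeParallelism` in `equals()` and `hashCode()`, the second lookup hits the entry created by the first.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

public class StaleSelectorDemo {
  // Hypothetical stand-in for the pre-fix cache key: writeParallelism is stored
  // but deliberately left out of equals() and hashCode().
  static final class BuggyKey {
    final String tableName;
    final int writeParallelism;

    BuggyKey(String tableName, int writeParallelism) {
      this.tableName = tableName;
      this.writeParallelism = writeParallelism;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true;
      }
      if (!(o instanceof BuggyKey)) {
        return false;
      }
      return tableName.equals(((BuggyKey) o).tableName);
    }

    @Override
    public int hashCode() {
      return Objects.hash(tableName);
    }
  }

  // The cached value is a plain string describing the selector that was built.
  static String selectorFor(Map<BuggyKey, String> cache, String tableName, int writeParallelism) {
    return cache.computeIfAbsent(
        new BuggyKey(tableName, writeParallelism),
        key -> "selector(parallelism=" + key.writeParallelism + ")");
  }

  public static void main(String[] args) {
    Map<BuggyKey, String> cache = new ConcurrentHashMap<>();
    System.out.println(selectorFor(cache, "db.t", 2)); // prints selector(parallelism=2)
    System.out.println(selectorFor(cache, "db.t", 4)); // prints selector(parallelism=2) again
  }
}
```

The second call asks for parallelism 4 but receives the selector built for parallelism 2, which is the stale reuse described above.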
   
   ## Fix
   
   Add `writeParallelism` and `distributionMode` to `SelectorKey`'s fields, 
`equals()`, `hashCode()`, and `toString()`. The cache key uses the same 
effective values as the `computeIfAbsent` lambda: `distributionMode` is 
normalized via `firstNonNull(..., NONE)` and `writeParallelism` is capped at 
`maxWriteParallelism`, so equivalent inputs map to the same cache entry.
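
The shape of the fix can be sketched as follows. This is an illustration under stated assumptions rather than the code in this PR: `FixedKey` and `keyFor` are hypothetical names, the local `DistributionMode` enum only exists so the sketch compiles on its own, any other fields of the real key are omitted, and a null check plus `Math.min` stand in for the `firstNonNull` normalization and the cap.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

public class FixedSelectorKeyDemo {
  // Local stand-in for the distribution mode enum (assumption for this sketch).
  enum DistributionMode { NONE, HASH, RANGE }

  // Hypothetical, reduced cache key: both values now take part in equality.
  static final class FixedKey {
    final String tableName;
    final int writeParallelism;
    final DistributionMode distributionMode;

    FixedKey(String tableName, int writeParallelism, DistributionMode distributionMode) {
      this.tableName = tableName;
      this.writeParallelism = writeParallelism;
      this.distributionMode = distributionMode;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true;
      }
      if (!(o instanceof FixedKey)) {
        return false;
      }
      FixedKey other = (FixedKey) o;
      return writeParallelism == other.writeParallelism
          && distributionMode == other.distributionMode
          && tableName.equals(other.tableName);
    }

    @Override
    public int hashCode() {
      return Objects.hash(tableName, writeParallelism, distributionMode);
    }

    @Override
    public String toString() {
      return "FixedKey{" + tableName + ", " + writeParallelism + ", " + distributionMode + "}";
    }
  }

  // Builds the key from effective values: null mode becomes NONE, parallelism is capped.
  static FixedKey keyFor(
      String tableName, DistributionMode mode, int writeParallelism, int maxWriteParallelism) {
    DistributionMode effectiveMode = mode != null ? mode : DistributionMode.NONE;
    int effectiveParallelism = Math.min(writeParallelism, maxWriteParallelism);
    return new FixedKey(tableName, effectiveParallelism, effectiveMode);
  }

  public static void main(String[] args) {
    Map<FixedKey, String> cache = new ConcurrentHashMap<>();
    cache.computeIfAbsent(keyFor("db.t", DistributionMode.HASH, 2, 4), k -> "selector for " + k);
    cache.computeIfAbsent(keyFor("db.t", DistributionMode.HASH, 3, 4), k -> "selector for " + k);
    // These two differ only in raw input; their effective values are identical.
    cache.computeIfAbsent(keyFor("db.t", null, 8, 4), k -> "selector for " + k);
    cache.computeIfAbsent(keyFor("db.t", DistributionMode.NONE, 4, 4), k -> "selector for " + k);
    System.out.println(cache.size()); // prints 3
  }
}
```

Keying on the effective values rather than the raw inputs avoids creating separate cache entries for inputs that would produce the same selector.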
   
   ## Note
   
   `writeParallelism` and `distributionMode` should remain stable per table 
during a streaming job. Changing these values mid-stream — especially when 
equality fields are set — can cause routing changes that break equality delete 
co-location, as the subtask assignment is not monotonic across different 
`writeParallelism` values (i.e., the subtask set for parallelism N is not 
guaranteed to be a subset of the set for parallelism N+1).
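
To make the co-location risk concrete, here is a deliberately simplified sketch. The modulo rule in `subtaskFor` is a hypothetical routing scheme chosen for illustration and is not the assignment logic used by `HashKeyGenerator`; it only shows how the same equality key can land on a different subtask once the parallelism changes.

```java
public class RoutingChangeDemo {
  // Hypothetical routing rule (assumption): NOT the actual Iceberg subtask assignment.
  static int subtaskFor(int keyHash, int writeParallelism) {
    return Math.floorMod(keyHash, writeParallelism);
  }

  public static void main(String[] args) {
    int keyHash = 5; // hash of some equality-field value

    // A row written while writeParallelism is 2 goes to subtask 1.
    System.out.println(subtaskFor(keyHash, 2)); // prints 1

    // A later equality delete for the same key, after a change to 3, goes to subtask 2.
    System.out.println(subtaskFor(keyHash, 3)); // prints 2
  }
}
```

Under this toy rule the insert and the later delete for the same key reach different writers, which is the kind of break in co-location the note warns about.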
   
   Making the subtask assignment monotonic (e.g., via a consistent ordering 
based on `maxWriteParallelism`) could address this limitation in a follow-up.
   
   ## Testing
   
   Added two regression tests to `TestHashKeyGenerator`:
   - `testCacheMissOnWriteParallelismChange`
   - `testCacheMissOnDistributionModeChange`
   
   Closes #15731

