Ma77Ball opened a new pull request, #5411: URL: https://github.com/apache/texera/pull/5411
## What

`ReservoirSamplingOpExec.onFinish` emitted `null` tuples when the input size was less than `k`. Fixed by emitting only the `n` populated reservoir slots via `reservoir.iterator.take(n)`.

## Why

The reservoir is a fixed-size `Array[Tuple](k)`. With fewer than `k` inputs, the trailing slots stay `null` and were returned by `onFinish`. Today they are silently dropped by the null-check in `DataProcessor.outputOneTuple` (`DataProcessor.scala:157`), so there is no crash, but it violates the operator contract and any other consumer of the output would observe the nulls.

## Test

Regression test asserts no null padding when input size is less than `k`. Verified it fails on the old `reservoir.iterator` and passes with `take(n)`.

## Notes

Stacked on #5384 (the ReservoirSamplingOpExec spec) until that merges, so the diff temporarily shows that PR's spec additions. After #5384 merges to main this rebases cleanly.

Closes #5409
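
For illustration only, here is a minimal, self-contained sketch of the behavior described under "Why". It is not the code in this PR: `ReservoirSketch`, `finishOld`, and `finishFixed` are hypothetical names, `String` stands in for Texera's `Tuple`, and the replacement step assumes a standard Algorithm R sampler, which may differ from the actual operator.

```scala
import scala.util.Random

// Hypothetical stand-in for the operator; String replaces Texera's Tuple type.
final class ReservoirSketch(k: Int, rng: Random) {
  private val reservoir = new Array[String](k) // unfilled slots stay null
  private var n = 0                            // number of inputs seen so far

  def process(item: String): Unit = {
    if (n < k) {
      reservoir(n) = item          // fill phase: first k inputs go straight in
    } else {
      val j = rng.nextInt(n + 1)   // uniform index in [0, n]
      if (j < k) reservoir(j) = item
    }
    n += 1
  }

  // Old behavior: exposes every slot, including trailing nulls when n < k.
  def finishOld(): List[String] = reservoir.iterator.toList

  // Fixed behavior: take(n) stops after the populated slots;
  // once n >= k it simply yields all k slots.
  def finishFixed(): List[String] = reservoir.iterator.take(n).toList
}

object ReservoirSketchDemo {
  def main(args: Array[String]): Unit = {
    val op = new ReservoirSketch(k = 5, rng = new Random(42))
    Seq("a", "b", "c").foreach(op.process)
    println(op.finishOld())   // three items followed by two nulls
    println(op.finishFixed()) // only the three items
  }
}
```

With three inputs and `k = 5`, the fill phase never reaches the random replacement branch, so the difference between the two `finish` variants is purely the two trailing `null` slots.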
