Ma77Ball opened a new pull request, #5411:
URL: https://github.com/apache/texera/pull/5411

   ## What
   
   `ReservoirSamplingOpExec.onFinish` emitted `null` tuples when the input size 
was less than `k`. Fixed by emitting only the `n` populated reservoir slots via 
`reservoir.iterator.take(n)`.
   
   ## Why
   
   The reservoir is a fixed-size `Array[Tuple](k)`. With fewer than `k` inputs 
the trailing slots stay `null`, and `onFinish` returned them along with the 
sampled tuples. Today they are silently dropped by the null-check in 
`DataProcessor.outputOneTuple` (`DataProcessor.scala:157`), so there is no 
crash. Emitting them still violates the operator contract, and any other 
consumer of the output would observe the nulls.
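
The mechanism can be reproduced outside Texera. The following is a minimal sketch, not the operator's actual code: `ReservoirSketch`, `finishWithPadding`, and `finishPopulatedOnly` are hypothetical names, `String` stands in for `Tuple`, and `n` is assumed to count the inputs seen so far.

```scala
import scala.util.Random

// Illustrative stand-in, not the Texera operator: String replaces Tuple and
// all names here are hypothetical.
final class ReservoirSketch(k: Int, rng: Random = new Random(0)) {
  private val reservoir = new Array[String](k) // reference slots default to null
  private var n = 0                            // inputs seen so far (assumed meaning of n)

  def add(item: String): Unit = {
    if (n < k) {
      reservoir(n) = item                      // fill phase: slots 0 until k
    } else {
      val j = rng.nextInt(n + 1)               // uniform in [0, n]
      if (j < k) reservoir(j) = item           // replace with probability k / (n + 1)
    }
    n += 1
  }

  // Old behavior: every slot is emitted, including slots that were never filled.
  def finishWithPadding: Iterator[String] = reservoir.iterator

  // Fixed behavior: take(n) stops after the populated slots. Once n >= k it
  // yields all k slots, because take never goes past the end of the iterator.
  def finishPopulatedOnly: Iterator[String] = reservoir.iterator.take(n)
}

object ReservoirSketchDemo {
  def main(args: Array[String]): Unit = {
    val sampler = new ReservoirSketch(k = 5)
    Seq("a", "b").foreach(sampler.add)
    println(sampler.finishWithPadding.toList)   // List(a, b, null, null, null)
    println(sampler.finishPopulatedOnly.toList) // List(a, b)
  }
}
```

Because `take(n)` is capped by the iterator's length, the same expression covers both the under-filled and the full reservoir without a separate branch.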
   
   ## Test
   
   Regression test asserts no null padding when input size is less than `k`. 
Verified it fails on the old `reservoir.iterator` and passes with `take(n)`.
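
For reviewers who want to see the shape of the check without the Texera test harness, here is a rough standalone sketch using plain `assert`. The helper `emit` and the object name are hypothetical, and the actual spec in this PR may be structured differently.

```scala
object TakeBoundaryCheck {
  // Hypothetical helper mirroring the fixed emission step.
  def emit(reservoir: Array[String], n: Int): List[String] =
    reservoir.iterator.take(n).toList

  def main(args: Array[String]): Unit = {
    val k = 3

    // Below k: one input arrived, two slots were never written.
    val partial = new Array[String](k)
    partial(0) = "x"
    assert(emit(partial, 1) == List("x"))          // no null padding
    assert(partial.iterator.toList.contains(null)) // the old path would emit nulls

    // At and above k: every slot is populated and take(n) is capped at k.
    val full = Array("x", "y", "z")
    assert(emit(full, 3) == full.toList)
    assert(emit(full, 10) == full.toList)
  }
}
```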
   
   ## Notes
   
   Stacked on #5384 (the ReservoirSamplingOpExec spec) until that merges, so 
the diff temporarily shows that PR's spec additions. Once #5384 merges to main, 
this branch will rebase cleanly.
   
   Closes #5409
   


