Re: [PR] HIVE-29122: Vectorization - Support IGNORE NULLS for FIRST_VALUE and … [hive]

via GitHub Fri, 15 Aug 2025 00:39:19 -0700


soumyakanti3578 commented on PR #6027:
URL: https://github.com/apache/hive/pull/6027#issuecomment-3190855413


   @kasakrisz 
   Sorry, I should have explained this in more detail. The bug was fixed by this:
   ```
       if (longColVector.isRepeating) {
         if (longColVector.noNulls || !longColVector.isNull[0]) {
           lastValue = longColVector.vector[0];
           isGroupResultNull = false;
         } else {
   --        isGroupResultNull = true;
   ++        isGroupResultNull = doesRespectNulls() || lastValue == null;
         }
       }
   ```
   
   When `longColVector.isRepeating = true`, it means that the column vector 
contains the same value for all the rows in the batch and thus we just store 
the value in the `0th` index. 
   
   When the condition `if (longColVector.noNulls || !longColVector.isNull[0])` is 
`true`, we can simply pick up the value from the `0th` index because all rows 
in the batch have the same repeating non-null value, so `lastValue` is that 
same value too, and hence the group result is not null, as indicated by 
`isGroupResultNull = false`.
   
   But when the condition is `false`, i.e., when values in the batch are 
repeating for the column (`longColVector.isRepeating = true`) AND the value in 
the first element is null (`!(longColVector.noNulls || 
!longColVector.isNull[0])`) we were simply setting `isGroupResultNull = true;` 
as indeed then the value in the column vector is null. This was fine when we 
were not supporting `IGNORE NULLS`.
   
   Now, we must check whether we respect nulls with `doesRespectNulls()`. If we do, 
then the group result is indeed null and we set `isGroupResultNull = true`. Otherwise, 
we are ignoring nulls and must check whether `lastValue` was set by a previous 
batch. If it was, we set `isGroupResultNull = false`; if it was not, the result is still null.
   
   In short, when we are in the `else` block, the value in the column vector is 
indeed `null`. But we must check the `lastValue` because it could have been set 
by the previous batch.
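
   As a rough sketch of that branch (this is not the actual Hive evaluator; 
`LastValueSketch`, `evaluateRepeating` and its parameters are made-up names 
that stand in for the column vector's `noNulls`, `isNull[0]` and `vector[0]`, 
and `lastValue` is assumed to be a boxed `Long` so that "not set yet" can be 
represented as `null`):

```java
// Simplified stand-in for the repeating-batch branch; not Hive's evaluator class.
class LastValueSketch {
    private final boolean respectNulls; // true = RESPECT NULLS, false = IGNORE NULLS
    Long lastValue = null;              // boxed so "no value seen yet" is representable
    boolean isGroupResultNull = true;

    LastValueSketch(boolean respectNulls) {
        this.respectNulls = respectNulls;
    }

    // Models one batch with isRepeating = true, where only element 0 matters.
    void evaluateRepeating(boolean noNulls, boolean firstIsNull, long firstValue) {
        if (noNulls || !firstIsNull) {
            // Every row in the batch carries the same non-null value.
            lastValue = firstValue;
            isGroupResultNull = false;
        } else {
            // Every row in the batch is null. With IGNORE NULLS, a value kept
            // from an earlier batch still decides the result.
            isGroupResultNull = respectNulls || lastValue == null;
        }
    }
}
```

   With `respectNulls = false`, an all-null batch that follows a batch containing 
20 leaves `isGroupResultNull` as `false` and `lastValue` as 20.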
   
   In the int test file, we have these three rows for id = 2: `(2, 20), (2, 
null), (2, null)`. Since we have `id` and `int_col` in the ORDER BY of 
`LAST_VALUE(int_col IGNORE NULLS) OVER(PARTITION BY id ORDER BY id, int_col) AS 
last_int` this would create two batches:
   ```
   batch 1: [[2, 20]]
   batch 2: [[2, null], [2, null]]
   ```
   as we are dealing with `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, 
i.e., both `(2, null)` rows are evaluated together. Also note that for batch 2, 
`longColVector.isRepeating = true`. While computing batch 2, we go to the 
`else` block, and before this fix we would have set the result to null even 
though `lastValue` had already been set to 20 by batch 1.
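
   To make the expected outcome for `id = 2` concrete, here is a small 
row-by-row reference (not vectorized and not Hive code; `IgnoreNullsWalkthrough` 
and `resultsPerBatch` are hypothetical names, and each batch is assumed to be 
non-empty) that carries the last value across the two batches:

```java
import java.util.ArrayList;
import java.util.List;

class IgnoreNullsWalkthrough {
    // Returns the LAST_VALUE result after each batch; null means a null result.
    static List<Long> resultsPerBatch(Long[][] batches, boolean respectNulls) {
        List<Long> results = new ArrayList<>();
        Long lastValue = null; // carried over from one batch to the next
        for (Long[] batch : batches) {
            for (Long value : batch) {
                // RESPECT NULLS takes every row; IGNORE NULLS skips null rows.
                if (value != null || respectNulls) {
                    lastValue = value;
                }
            }
            results.add(lastValue);
        }
        return results;
    }

    public static void main(String[] args) {
        // int_col values for id = 2, split into the two batches described above.
        Long[][] batches = { {20L}, {null, null} };
        System.out.println(resultsPerBatch(batches, false)); // prints [20, 20]
        System.out.println(resultsPerBatch(batches, true));  // prints [20, null]
    }
}
```

   With `IGNORE NULLS`, batch 2 should still report 20, which is the result the 
old `isGroupResultNull = true` assignment was overwriting with null.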
   
   Hope this clarifies the bug!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


