soumyakanti3578 commented on PR #6009:
URL: https://github.com/apache/hive/pull/6009#issuecomment-3162054831

   @kasakrisz 
   I debugged a bit to understand the workflow. To simulate the scenario where 
rows with the same key don't fit into a single batch, we can run the tests 
with `SET hive.vectorized.testing.reducer.batch.size=2;`. With this setting, I 
noticed that `VectorPTFGroupBatches.bufferedBatches` contains batches whose 
size is at most 2. 
   
   For the data in `vectorized_first_last_value_ignore_nulls_int.q`, for key 1 
and 2, I have seen batches like:
   ```
   Batch 1:   [[1, null], [1, null]]
   Batch 2:   [1, null]
   Batch 3:   [[2, null], [2, 20]]
   Batch 4:   [2, null]
   ```
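   As a rough illustration of why the batches have this shape, here is a toy 
sketch (not Hive's reducer code) that cuts the values of one key into chunks of 
at most 2 rows. `BatchSplitSketch` and `split` are made-up names, and only the 
value column of key 2 is modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only; Hive's reducer builds its batches differently.
class BatchSplitSketch {
  // Splits the values of one key into consecutive chunks of at most batchSize rows.
  static List<List<Long>> split(List<Long> values, int batchSize) {
    List<List<Long>> batches = new ArrayList<>();
    for (int start = 0; start < values.size(); start += batchSize) {
      int end = Math.min(start + batchSize, values.size());
      batches.add(new ArrayList<>(values.subList(start, end)));
    }
    return batches;
  }

  public static void main(String[] args) {
    // Value column for key 2, in the order shown in Batch 3 and Batch 4.
    List<Long> key2 = Arrays.asList(null, Long.valueOf(20L), null);
    System.out.println(split(key2, 2)); // prints [[null, 20], [null]]
  }
}
```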
   Let's look at the `last_value ignore nulls` computation.
   First we will compute the last value for the range with key = 1, which 
consists of Batch 1 and 2. A new evaluator will be created for 
`VectorPTFEvaluatorLongLastValue`, which we will keep reusing.
   
   First, 
`VectorPTFEvaluatorLongLastValue.evaluateGroupBatch(batch 1)` will be called, 
and we will have to check both vector elements for non-null values. Since both 
are null, we will have `isGroupResultNull = true` and `lastValue = null`.
   
   Then, we will call `VectorPTFEvaluatorLongLastValue.evaluateGroupBatch(batch 
2)` on the same evaluator object, but the values of `isGroupResultNull` and 
`lastValue` don't change.
   
   Next we will call `VectorPTFEvaluatorLongLastValue.evaluateGroupBatch(batch 
3)` on the same evaluator, and this time, the values will be `isGroupResultNull 
= false` and `lastValue = 20`.
   
   And then, we will call 
`VectorPTFEvaluatorLongLastValue.evaluateGroupBatch(batch 4)` on the same 
evaluator, and the values will be `isGroupResultNull = true` and `lastValue = 
20`. Note that `lastValue` remains unchanged.
   
   Now if we didn't have other rows in the table, we would get out of the 
`while` loop in `VectorPTFGroupBatches.BlockIterator.run` and call 
`getEvaluatorResult(evaluator)` as seen below:
   ```
       Object run(VectorPTFEvaluatorBase evaluator) throws HiveException {
         runEvaluator(evaluator, range);
   
         while (hasNext()) {
           nextBlock();
           runEvaluator(evaluator, range);
         }
   
         return getEvaluatorResult(evaluator);
       }
   
       Object getEvaluatorResult(VectorPTFEvaluatorBase evaluator) {
         return evaluator.isGroupResultNull() ? null : evaluator.getGroupResult();
       }
   ```
   
   `VectorPTFEvaluatorLongLastValue.isGroupResultNull()` was updated in this PR 
as:
   ```
     public boolean isGroupResultNull() {
       return isGroupResultNull && doesRespectNulls();
     }
   ```
   This returns `false` even though the field `isGroupResultNull` is `true` at 
the end, because `doesRespectNulls()` returns `false`. So finally 
`evaluator.getGroupResult()` is called, which returns `lastValue = 20`.
   
   If we were running `last_value respect nulls`, it would have returned `null` 
because `isGroupResultNull()` would have returned `true`.
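
   To make the walkthrough concrete, here is a minimal, self-contained toy 
model of the flag logic. It is not the actual `VectorPTFEvaluatorLongLastValue` 
code: `LastValueSketch` and `run` are hypothetical names, each batch is modeled 
as a `Long[]` holding only the value column, and `lastValue` is assumed to be a 
nullable `Long`.

```java
// Toy model only; this is not Hive's VectorPTFEvaluatorLongLastValue.
class LastValueSketch {
  private final boolean respectNulls;
  private boolean isGroupResultNull = true; // raw flag, rewritten for every batch
  private Long lastValue = null;            // last non-null value seen so far

  LastValueSketch(boolean respectNulls) {
    this.respectNulls = respectNulls;
  }

  // Called once per batch; a null element stands for a NULL column value.
  void evaluateGroupBatch(Long[] batch) {
    if (batch.length == 0) {
      return;
    }
    if (respectNulls) {
      // Only the last row of the batch matters.
      Long last = batch[batch.length - 1];
      isGroupResultNull = (last == null);
      if (last != null) {
        lastValue = last;
      }
    } else {
      // Scan backwards for the last non-null row of this batch.
      isGroupResultNull = true;
      for (int i = batch.length - 1; i >= 0; i--) {
        if (batch[i] != null) {
          lastValue = batch[i];
          isGroupResultNull = false;
          break;
        }
      }
    }
  }

  // Mirrors the check quoted above: the raw flag only counts when nulls are respected.
  boolean isGroupResultNull() {
    return isGroupResultNull && respectNulls;
  }

  Long getGroupResult() {
    return lastValue;
  }

  // Feeds all batches to one evaluator, then reads the result like getEvaluatorResult.
  static Long run(boolean respectNulls, Long[][] batches) {
    LastValueSketch evaluator = new LastValueSketch(respectNulls);
    for (Long[] batch : batches) {
      evaluator.evaluateGroupBatch(batch);
    }
    return evaluator.isGroupResultNull() ? null : evaluator.getGroupResult();
  }

  public static void main(String[] args) {
    // Value column of Batch 1 to Batch 4 from the listing above.
    Long[][] batches = { {null, null}, {null}, {null, 20L}, {null} };
    System.out.println(run(false, batches)); // ignore nulls: prints 20
    System.out.println(run(true, batches));  // respect nulls: prints null
  }
}
```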
   
   `first_value ignore nulls` can be explained similarly. One important 
difference is that `VectorPTFEvaluatorLongFirstValue` has the boolean variable 
`haveFirstValue`. Once we find a non-null value, this variable is set to `true` 
and for all batches after that we directly write the value into the output 
vector.
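
   A small sketch of that short-circuit, under the same caveats: this is a toy 
model rather than the real `VectorPTFEvaluatorLongFirstValue`, 
`FirstValueSketch` is a hypothetical name, the returned value stands in for 
what would be written to the output vector, and the trailing batch with value 
30 is made-up data added only to show that later batches are not scanned.

```java
// Toy model only; this is not Hive's VectorPTFEvaluatorLongFirstValue.
class FirstValueSketch {
  private boolean haveFirstValue = false;
  private Long firstValue = null;

  // Returns the value this toy model would emit for the batch
  // (null while no non-null value has been seen yet).
  Long evaluateGroupBatch(Long[] batch) {
    if (!haveFirstValue) {
      // Still searching: take the first non-null row of this batch, if any.
      for (Long v : batch) {
        if (v != null) {
          firstValue = v;
          haveFirstValue = true;
          break;
        }
      }
    }
    // Once haveFirstValue is set, later batches reuse the stored value unchanged.
    return firstValue;
  }

  public static void main(String[] args) {
    FirstValueSketch evaluator = new FirstValueSketch();
    Long[][] batches = { {null, null}, {null, 20L}, {30L} };
    for (Long[] batch : batches) {
      System.out.println(evaluator.evaluateGroupBatch(batch));
    }
    // prints null, then 20, then 20
  }
}
```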
   
   Hopefully this makes sense, but please let me know if I am missing 
something. If needed, I can add `SET 
hive.vectorized.testing.reducer.batch.size=2;` to all three test files too - 
they are all passing on my local machine.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

