soumyakanti3578 commented on PR #6009:
URL: https://github.com/apache/hive/pull/6009#issuecomment-3162054831
@kasakrisz
I debugged a bit to understand the workflow. To simulate the scenario where
rows with the same key don't fit into a single batch, we can run the tests
with `SET hive.vectorized.testing.reducer.batch.size=2;`. With this setting, I
noticed that `VectorPTFGroupBatches.bufferedBatches` contains batches whose
size is at most 2.
For the data in `vectorized_first_last_value_ignore_nulls_int.q`, for keys 1
and 2, I have seen batches like:
```
Batch 1: [[1, null], [1, null]]
Batch 2: [[1, null]]
Batch 3: [[2, null], [2, 20]]
Batch 4: [[2, null]]
```
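To make the batching concrete, here is a minimal sketch of how six rows could end up in four batches like the ones above. It is not Hive code: `BatchSplitSketch` and `split` are hypothetical names, and it assumes rows arrive sorted by key and that a batch never spans two keys, as the listing suggests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not how Hive is implemented.
class BatchSplitSketch {
    // Splits rows ({key, value} pairs, assumed sorted by key) into batches of
    // at most batchSize rows, starting a new batch whenever the key changes.
    static List<List<Long[]>> split(List<Long[]> rows, int batchSize) {
        List<List<Long[]>> batches = new ArrayList<>();
        List<Long[]> current = new ArrayList<>();
        for (Long[] row : rows) {
            boolean keyChanged =
                !current.isEmpty() && !current.get(0)[0].equals(row[0]);
            if (current.size() == batchSize || keyChanged) {
                batches.add(current);
                current = new ArrayList<>();
            }
            current.add(row);
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        // Same rows as in the listing above.
        List<Long[]> rows = List.of(
            new Long[]{1L, null}, new Long[]{1L, null}, new Long[]{1L, null},
            new Long[]{2L, null}, new Long[]{2L, 20L}, new Long[]{2L, null});
        List<List<Long[]>> batches = split(rows, 2);
        for (int i = 0; i < batches.size(); i++) {
            System.out.println("Batch " + (i + 1) + ": " + batches.get(i).size() + " row(s)");
        }
    }
}
```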
Let's look at the `last_value ignore nulls` computation.
First we will compute the last value for the range with key = 1, which
consists of Batch 1 and 2. A new `VectorPTFEvaluatorLongLastValue` evaluator
will be created, which we will keep reusing.
First, `VectorPTFEvaluatorLongLastValue.evaluateGroupBatch(batch 1)` will be
called, and we will have to check both vector elements for non-null values.
Since they are null, we will have `isGroupResultNull = true` and
`lastValue = null`.
Then, we will call `VectorPTFEvaluatorLongLastValue.evaluateGroupBatch(batch
2)` on the same evaluator object, but the values of `isGroupResultNull` and
`lastValue` don't change.
Next we will call `VectorPTFEvaluatorLongLastValue.evaluateGroupBatch(batch
3)` on the same evaluator, and this time, the values will be `isGroupResultNull
= false` and `lastValue = 20`.
And then, we will call
`VectorPTFEvaluatorLongLastValue.evaluateGroupBatch(batch 4)` on the same
evaluator, and the values will be `isGroupResultNull = true` and `lastValue =
20`. Note that `lastValue` remains unchanged.
Now if we didn't have other rows in the table, we would get out of the
`while` loop in `VectorPTFGroupBatches.BlockIterator.run` and call
`getEvaluatorResult(evaluator)` as seen below:
```
Object run(VectorPTFEvaluatorBase evaluator) throws HiveException {
  runEvaluator(evaluator, range);
  while (hasNext()) {
    nextBlock();
    runEvaluator(evaluator, range);
  }
  return getEvaluatorResult(evaluator);
}

Object getEvaluatorResult(VectorPTFEvaluatorBase evaluator) {
  return evaluator.isGroupResultNull() ? null :
      evaluator.getGroupResult();
}
```
`VectorPTFEvaluatorLongLastValue.isGroupResultNull()` was updated in this PR
as:
```
public boolean isGroupResultNull() {
  return isGroupResultNull && doesRespectNulls();
}
```
This returns `false` even though `isGroupResultNull = true` at the end,
because `doesRespectNulls()` returns `false`. So finally
`evaluator.getGroupResult()` is called, which returns `lastValue = 20`.
If we were running `last_value respect nulls`, it would have returned `null`
because `isGroupResultNull()` would have returned `true`.
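The walk-through above can be replayed with a small stand-alone model. This is only a sketch and not the actual `VectorPTFEvaluatorLongLastValue` class: `LastValueSketch`, `run`, and `result` are hypothetical names, and the per-batch logic is an assumption inferred from the state transitions described above (with ignore nulls, scan the batch for its last non-null value; with respect nulls, look only at the final element).

```java
// Simplified, hypothetical model of the evaluator state; not Hive code.
class LastValueSketch {
    private final boolean respectNulls;
    private boolean isGroupResultNull = true;
    private Long lastValue = null;

    LastValueSketch(boolean respectNulls) {
        this.respectNulls = respectNulls;
    }

    // Assumed per-batch behaviour: pick the final element (respect nulls) or
    // the last non-null element (ignore nulls) of the batch.
    void evaluateGroupBatch(Long[] batch) {
        if (batch.length == 0) {
            return;
        }
        int i = batch.length - 1;
        if (!respectNulls) {
            while (i >= 0 && batch[i] == null) {
                i--;
            }
        }
        if (i < 0 || batch[i] == null) {
            isGroupResultNull = true; // lastValue is deliberately left unchanged
        } else {
            isGroupResultNull = false;
            lastValue = batch[i];
        }
    }

    // Mirrors the check quoted above from the PR.
    boolean isGroupResultNull() {
        return isGroupResultNull && respectNulls;
    }

    Long result() {
        return isGroupResultNull() ? null : lastValue;
    }

    // Feeds every batch to one reused evaluator and returns the final result.
    static Long run(boolean respectNulls, Long[]... batches) {
        LastValueSketch evaluator = new LastValueSketch(respectNulls);
        for (Long[] batch : batches) {
            evaluator.evaluateGroupBatch(batch);
        }
        return evaluator.result();
    }

    public static void main(String[] args) {
        // Value column of batches 1 to 4, in the order used in the walk-through.
        Long[][] batches = {{null, null}, {null}, {null, 20L}, {null}};
        System.out.println("ignore nulls:  " + run(false, batches));
        System.out.println("respect nulls: " + run(true, batches));
    }
}
```

In this model the trailing all-null batch flips `isGroupResultNull` back to `true` but leaves `lastValue` alone, which is why the ignore-nulls result survives while the respect-nulls result is `null`.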
`first_value ignore nulls` can be explained similarly. One important
difference is that `VectorPTFEvaluatorLongFirstValue` has the boolean variable
`haveFirstValue`. Once we find a non-null value, this variable is set to true,
and for all batches after that we directly write the value into the output
vector.
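The role of `haveFirstValue` can be sketched the same way. Again, this is a hypothetical stand-alone model (`FirstValueSketch` is not a Hive class) and it tracks only the evaluator state, not the output vector: once a non-null value has been seen, later batches are no longer scanned.

```java
// Simplified, hypothetical model of the first-value state; not Hive code.
class FirstValueSketch {
    private boolean haveFirstValue = false;
    private Long firstValue = null;

    // Scans the batch only while no non-null value has been found yet and
    // returns the current first value (null if none has been seen so far).
    Long evaluateGroupBatch(Long[] batch) {
        if (!haveFirstValue) {
            for (Long value : batch) {
                if (value != null) {
                    firstValue = value;
                    haveFirstValue = true;
                    break;
                }
            }
        }
        return firstValue;
    }

    public static void main(String[] args) {
        FirstValueSketch evaluator = new FirstValueSketch();
        // Value column of batches 1 to 4 from the walk-through, plus one
        // made-up batch containing 30 to show that later values are ignored.
        Long[][] batches = {{null, null}, {null}, {null, 20L}, {null}, {30L}};
        for (Long[] batch : batches) {
            System.out.println(evaluator.evaluateGroupBatch(batch));
        }
    }
}
```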
Hopefully this makes sense, but please let me know if I am missing
something. If needed, I can add `SET
hive.vectorized.testing.reducer.batch.size=2;` to all three test files too -
they are all passing on my local machine.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.