konstantinb commented on code in PR #6418:
URL: https://github.com/apache/hive/pull/6418#discussion_r3174741544
##########
ql/src/test/results/clientpositive/llap/parquet_types_non_dictionary_encoding_vectorization.q.out:
##########
Review Comment:
The refresh of `parquet_types_non_dictionary_encoding_vectorization.q.out` was
missed in the original impacted-set run. The 14-line diff is in a side query
(`SELECT hex(cbinary), count(*) FROM parquet_types_n1 GROUP BY cbinary`), not
in the test's main subject. `cbinary` is a `binary` column, and the BINARY
branch of Hive's `getColStatistics` never populates `countDistinct`, so the
column arrives at `extractNDVGroupingColumns` with the canonical
`(NDV=0, numNulls>0)` unknown-NDV signature that this PR targets.
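As an illustration of that signature only, here is a minimal Python sketch. It is not Hive's code: `ColStats`, `is_unknown_ndv`, and `null_bucket_estimate` are hypothetical names, the `+1` rule is an assumption read off the "Master (`+1` of `0`)" row in this comment, and the `num_nulls=1` value is a placeholder, since the comment only says `numNulls>0`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColStats:
    """Hypothetical stand-in for a column's statistics; not a Hive class."""
    ndv: int        # distinct-value count; stays 0 if never populated
    num_nulls: int  # number of NULLs recorded for the column


def is_unknown_ndv(stats: ColStats) -> bool:
    # The (NDV=0, numNulls>0) signature: NULLs were counted, but no
    # distinct-value count was ever recorded for the column.
    return stats.ndv == 0 and stats.num_nulls > 0


def null_bucket_estimate(stats: ColStats) -> int:
    # Assumed reading of the master behaviour: recorded NDV plus one
    # extra group for NULL when the column contains NULLs.
    return stats.ndv + (1 if stats.num_nulls > 0 else 0)


# Placeholder stats for a binary column such as `cbinary`.
cbinary = ColStats(ndv=0, num_nulls=1)
print(is_unknown_ndv(cbinary), null_bucket_estimate(cbinary))
```

Under these assumptions, a column with the unknown-NDV signature yields an estimate of 1 group no matter how many distinct values it holds, which is the underestimate the table in this comment quantifies.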
The new estimate is also empirically more accurate. The test's actual data
(visible in the `SELECT` rows of the .q.out) has 36 distinct non-NULL
binary values plus 1 NULL bucket, i.e. **37 actual GROUP BY groups**.
Estimates:
| Source | Estimated groups | Error vs. 37 actual |
|---|---:|---:|
| Master (`+1` of `0`) | 1 | **37× under** |
| This PR (heuristic fallback at hash) | 150 | ~4× over |
| This PR (heuristic at mergepartial) | 75 | **~2× over** |
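
The error factors in the table can be checked with a short Python sketch. The estimates and the actual group count are taken from this comment; the helper name `error_factor` and the dictionary labels are illustrative only.

```python
# 36 distinct non-NULL binary values + 1 NULL bucket, as stated above.
ACTUAL_GROUPS = 36 + 1

# Estimated group counts from the table.
ESTIMATES = {
    "Master": 1,
    "This PR (hash)": 150,
    "This PR (mergepartial)": 75,
}


def error_factor(estimate: int, actual: int) -> tuple[float, str]:
    """Return (factor, direction), where factor >= 1 is the ratio
    between the larger and the smaller of the two counts."""
    if estimate <= 0 or actual <= 0:
        raise ValueError("counts must be positive")
    if estimate >= actual:
        return estimate / actual, "over"
    return actual / estimate, "under"


for name, est in ESTIMATES.items():
    factor, direction = error_factor(est, ACTUAL_GROUPS)
    print(f"{name}: estimate={est}, {factor:.1f}x {direction}")
```

This prints 37.0x under for master, 4.1x over at the hash stage, and 2.0x over at mergepartial, matching the rounded figures in the table.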