konstantinb commented on code in PR #6418:
URL: https://github.com/apache/hive/pull/6418#discussion_r3174741544


##########
ql/src/test/results/clientpositive/llap/parquet_types_non_dictionary_encoding_vectorization.q.out:
##########


Review Comment:
    Refreshing `parquet_types_non_dictionary_encoding_vectorization.q.out` was missed in the original impacted-set run. The 14-line diff is in a side query (`SELECT hex(cbinary), count(*) FROM parquet_types_n1 GROUP BY cbinary`), not the test's main subject. `cbinary` is a `binary` column, and the BINARY branch of Hive's `getColStatistics` never populates `countDistinct`, so the column arrives at `extractNDVGroupingColumns` with the canonical `(NDV=0, numNulls>0)` unknown-NDV signature this PR targets.
   
    The new estimate is also empirically more accurate. The data in the test (visible from the `SELECT` rows in the `.q.out`) has 36 distinct non-NULL binary values plus 1 NULL bucket, which gives **37 actual GROUP BY groups**. Estimates:
   
    | Variant | Estimated groups | Error vs. 37 actual |
    |---|---:|---:|
    | Master (`+1` of `0`) | 1 | **37× under** |
    | This PR (heuristic fallback at hash) | 150 | ~4× over |
    | This PR (heuristic at mergepartial) | 75 | **~2× over** |
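
For reference, the error factors in the table can be reproduced with a few lines of Python. This is only an illustrative sketch: `error_factor` is a hypothetical helper and not part of Hive, and the three estimates and the 37-group count are copied from the figures above rather than computed by the planner.

```python
def error_factor(estimate: int, actual: int) -> tuple[str, float]:
    """Return (direction, ratio) for a cardinality estimate, with ratio >= 1.

    Hypothetical helper for illustration only; it is not Hive code.
    """
    if estimate <= 0 or actual <= 0:
        raise ValueError("estimate and actual must be positive")
    if estimate < actual:
        return "under", actual / estimate
    if estimate > actual:
        return "over", estimate / actual
    return "exact", 1.0


# 36 distinct non-NULL binary values plus 1 NULL bucket (from the .q.out rows).
ACTUAL_GROUPS = 36 + 1

# Estimates as reported in this review comment.
ESTIMATES = {
    "master (+1 of 0)": 1,
    "PR, heuristic fallback at hash": 150,
    "PR, heuristic at mergepartial": 75,
}

for label, estimate in ESTIMATES.items():
    direction, ratio = error_factor(estimate, ACTUAL_GROUPS)
    print(f"{label}: estimate={estimate}, {ratio:.1f}x {direction}")
```

The ratio is taken as the larger value divided by the smaller one, so under- and over-estimates are compared on the same scale.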



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

