[
https://issues.apache.org/jira/browse/FLINK-39399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Timo Walther closed FLINK-39399.
--------------------------------
Fix Version/s: 2.3.0
Resolution: Fixed
Fixed in master: 795a36ee85403b183514a0aeccc022af0ca6068c
> Integer overflow in HyperLogLogPlusPlus.query() causes APPROX_COUNT_DISTINCT
> undercount at high cardinality
> -----------------------------------------------------------------------------------------------------------
>
> Key: FLINK-39399
> URL: https://issues.apache.org/jira/browse/FLINK-39399
> Project: Flink
> Issue Type: Bug
> Components: Table SQL / Runtime
> Reporter: Jim Hughes
> Assignee: Jim Hughes
> Priority: Minor
> Labels: pull-request-available
> Fix For: 2.3.0
>
>
> APPROX_COUNT_DISTINCT produce incorrect (undercounted) results when the true
> cardinality exceeds ~100 million distinct values.
> The root cause is an integer overflow in HyperLogLogPlusPlus.query(). In
> HyperLogLogPlusPlus.java line 4950:
> {code:java}
> zInverse += 1.0 / (1 << mIdx);
> {code}
> mIdx is the per-register value representing the position of the leading 1-bit
> in a hash. At high cardinality, some registers accumulate values >= 32.
> Because 1 is an int, Java's << operator masks the shift count modulo 32, so 1
> << 35 silently evaluates to 1 << 3 = 8 instead of the correct 34359738368.
> This corrupts the zInverse accumulator and produces a dramatically wrong
> cardinality estimate.
> *Fix:*
> Change 1 << mIdx to 1L << mIdx so the shift uses long arithmetic (modulo 64,
> which is sufficient since register values max out around 51 for a 64-bit
> hash).
> {code:java}
> zInverse += 1.0 / (1L << mIdx);
> {code}
> *How to reproduce:*
> Construct an HLL buffer where every register holds a value >= 32 (e.g., 35)
> and call query(). With the bug, the estimate is ~95K. With the fix, the
> estimate correctly reflects the large register values (~4e14). Alternatively,
> run APPROX_COUNT_DISTINCT over ~120M+ distinct values and observe the
> estimate is orders of magnitude too low.
> (Diagnosed and generated with Claude code.)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)