[ 
https://issues.apache.org/jira/browse/FLINK-39399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timo Walther closed FLINK-39399.
--------------------------------
    Fix Version/s: 2.3.0
       Resolution: Fixed

Fixed in master: 795a36ee85403b183514a0aeccc022af0ca6068c

> Integer overflow in HyperLogLogPlusPlus.query() causes APPROX_COUNT_DISTINCT 
> undercount at high cardinality
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39399
>                 URL: https://issues.apache.org/jira/browse/FLINK-39399
>             Project: Flink
>          Issue Type: Bug
>          Components: Table SQL / Runtime
>            Reporter: Jim Hughes
>            Assignee: Jim Hughes
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 2.3.0
>
>
> APPROX_COUNT_DISTINCT produce incorrect (undercounted) results when the true 
> cardinality exceeds ~100 million distinct values.
> The root cause is an integer overflow in HyperLogLogPlusPlus.query(). In 
> HyperLogLogPlusPlus.java line 4950:
> {code:java}
> zInverse += 1.0 / (1 << mIdx);
> {code}
> mIdx is the per-register value representing the position of the leading 1-bit 
> in a hash. At high cardinality, some registers accumulate values >= 32. 
> Because 1 is an int, Java's << operator masks the shift count modulo 32, so 1 
> << 35 silently evaluates to 1 << 3 = 8 instead of the correct 34359738368. 
> This corrupts the zInverse accumulator and produces a dramatically wrong 
> cardinality estimate.
> *Fix:*
> Change 1 << mIdx to 1L << mIdx so the shift uses long arithmetic (modulo 64, 
> which is sufficient since register values max out around 51 for a 64-bit 
> hash).
> {code:java}
> zInverse += 1.0 / (1L << mIdx);
> {code}
> *How to reproduce:*
> Construct an HLL buffer where every register holds a value >= 32 (e.g., 35) 
> and call query(). With the bug, the estimate is ~95K. With the fix, the 
> estimate correctly reflects the large register values (~4e14). Alternatively, 
> run APPROX_COUNT_DISTINCT over ~120M+ distinct values and observe the 
> estimate is orders of magnitude too low.
> (Diagnosed and generated with Claude code.)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to