[
https://issues.apache.org/jira/browse/HIVE-29365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18059197#comment-18059197
]
Thomas Rebele commented on HIVE-29365:
--------------------------------------
In case we want to "interpolate" the KLL sketch, an approach as described in
the paper "Fast and flexible methods for monotone polynomial fitting" could be
used. It seems to work well for the simple case of a Gaussian distribution.
There are quite a few parameters to choose and there might be problems with
over- or underfitting.
!all-monpoly.png|width=743,height=508!
Or as a vector graphic: [^all-monpoly.pdf]
> Range predicate within the same histogram bucket leads to an estimate
> rowcount=1
> --------------------------------------------------------------------------------
>
> Key: HIVE-29365
> URL: https://issues.apache.org/jira/browse/HIVE-29365
> Project: Hive
> Issue Type: Bug
> Affects Versions: 4.2.0
> Reporter: Thomas Rebele
> Priority: Major
> Labels: pull-request-available
> Attachments: all-monpoly.pdf, all-monpoly.png, expected.png,
> kll-k8.png, monpoly.r, tdigest-interpolated-k10.png
>
>
> The rowcount estimate is wrong for range predicates (comparison operators as
> well as BETWEEN) if both borders of the range are within the same bucket.
> Reason: KllFloatsSketch#getCDF calls FloatsSketchSortedView#getRank. Neither
> interpolate the rank for a quantile between the two borders of a bucket.
> Therefore both estimated ranks are the same if the borders of the range are
> within the same bucket. The Wikipedia page Percentile has a [section on
> interpolation|https://en.wikipedia.org/w/index.php?title=Percentile&oldid=1325431679#The_linear_interpolation_between_closest_ranks_method].
> ----
> Example on a metastore dump of a TPC-DS 30TB cluster with histograms (tried
> on a commit of 2025-12-04, ca105f8124072d19d88a83b2ced613d326c9a26b):
> {*}Locating the border{*}:
> {code:java}
> explain cbo joincost select count(*) from customer where c_customer_sk <
> 79000;
> explain cbo joincost select count(*) from customer where c_customer_sk <
> 160500;{code}
> Both give the same estimate for the rowcount of the filter {{{}rowcount =
> 29696.000000000004{}}}.
> {*}Check the estimates{*}:
> {code:java}
> explain cbo joincost select count(*) from customer where c_customer_sk
> between 79000 and 160500; {code}
> returns a plan
> {code:java}
> | HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0
> cpu, 0.0 io}, id = 17183 |
> | HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative
> cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 17181 |
> | HiveFilter(condition=[BETWEEN(false, $0, 79000:BIGINT,
> 160500:BIGINT)]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0
> io}, id = 17180 |
> | HiveTableScan(table=[[default, customer]], table:alias=[customer]):
> rowcount = 8.0E7, cumulative cost = {0}, id = 17134 | {code}
> Please note the estimation {{rowcount = 1.0}} of the HiveFilter. The same
> happens for range predicates with comparison operators:
> {code:java}
> explain cbo joincost select count(*) from customer where c_customer_sk >=
> 79000 and c_customer_sk <= 160500; {code}
> returns a plan
> {code:java}
> | HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0
> cpu, 0.0 io}, id = 17088 |
> | HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative
> cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 17086 |
> | HiveFilter(condition=[BETWEEN(false, $0, 79000:BIGINT,
> 160500:BIGINT)]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0
> io}, id = 17085 |
> | HiveTableScan(table=[[default, customer]], table:alias=[customer]):
> rowcount = 8.0E7, cumulative cost = {0}, id = 17039 | {code}
> *Compare with the expected result*
> Executing the SELECT query using Trino DB gives the following result:
> {code:java}
> use tpcds.sf30000;
> trino:sf30000> select count(*) from customer where c_customer_sk between
> 79000 and 160500;
> _col0
> -------
> 81501 {code}
> So the estimates should be around {{rowcount = 81501.0}}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)