[
https://issues.apache.org/jira/browse/HIVE-29312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18059120#comment-18059120
]
Stamatis Zampetakis commented on HIVE-29312:
--------------------------------------------
[~wechar] Thanks for the additional details. I don't readily see how the
following transformation can improve index efficiency
{code:sql}
PART_NAME like 'a=1/%' and PART_NAME like '%/b=2/%' and PART_NAME like '%/c=3/%'
PART_NAME like 'a=1/b=2/c=3/%'
{code}
The description says that with the transformation the second expression will
match a longer index prefix so I would expect to see this somehow in the query
plan.
I played a bit with the [Postgres TPCDS 30TB metastore
dump|https://hub.docker.com/repository/docker/zabetak/postgres-tpcds-metastore]
but the plans I get for the following queries do not indicate a better index
usage.
{code:sql}
EXPLAIN ANALYZE SELECT * FROM "PARTITIONS" WHERE
"PART_NAME" LIKE 'dt=2020-05-07/%' AND
"PART_NAME" LIKE '%/ns=compute-1588357950-c2h8/%' AND
"PART_NAME" LIKE '%/app=usage-monitor' AND
"TBL_ID" = 56;
{code}
{noformat}
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on "PARTITIONS" (cost=105.05..459.75 rows=1 width=84)
(actual time=1.121..1.162 rows=1 loops=1)
Recheck Cond: ("TBL_ID" = 56)
Filter: ((("PART_NAME")::text ~~ 'dt=2020-05-07/%'::text) AND
(("PART_NAME")::text ~~ '%/ns=compute-1588357950-c2h8/%'::text) AND
(("PART_NAME")::text ~~ '%/app=usage-monitor'::text))
Rows Removed by Filter: 5434
Heap Blocks: exact=99
-> Bitmap Index Scan on "PARTITIONS_N49" (cost=0.00..105.05 rows=5435
width=0) (actual time=0.242..0.242 rows=5435 loops=1)
Index Cond: ("TBL_ID" = 56)
Planning Time: 0.190 ms
Execution Time: 1.189 ms
(9 rows)
{noformat}
{code:sql}
EXPLAIN ANALYZE SELECT * FROM "PARTITIONS" WHERE
"PART_NAME" LIKE 'dt=2020-05-07/ns=compute-1588357950-c2h8/app=usage-monitor%'
AND
"TBL_ID" = 56;
{code}
{noformat}
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on "PARTITIONS" (cost=105.05..432.58 rows=1 width=84)
(actual time=0.897..0.933 rows=1 loops=1)
Recheck Cond: ("TBL_ID" = 56)
Filter: (("PART_NAME")::text ~~
'dt=2020-05-07/ns=compute-1588357950-c2h8/app=usage-monitor%'::text)
Rows Removed by Filter: 5434
Heap Blocks: exact=99
-> Bitmap Index Scan on "PARTITIONS_N49" (cost=0.00..105.05 rows=5435
width=0) (actual time=0.187..0.187 rows=5435 loops=1)
Index Cond: ("TBL_ID" = 56)
Planning Time: 0.113 ms
Execution Time: 0.957 ms
(9 rows)
{noformat}
In both cases the index is accessed based on the TBL_ID; the filter on
PART_NAME is applied as post-processing on the matching tuples.
The only case where the index access is affected is if the filter on PART_NAME
is an exact match where the LIKE does not have a REGEX expression:
{code:sql}
EXPLAIN ANALYZE SELECT * FROM "PARTITIONS" WHERE
"PART_NAME" LIKE 'dt=2020-05-07/ns=compute-1588357950-c2h8/app=usage-monitor'
AND
"TBL_ID" = 56;
{code}
{noformat}
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Index Scan using "UNIQUEPARTITION" on "PARTITIONS" (cost=0.41..8.43 rows=1
width=84) (actual time=0.107..0.109 rows=1 loops=1)
Index Cond: ((("PART_NAME")::text =
'dt=2020-05-07/ns=compute-1588357950-c2h8/app=usage-monitor'::text) AND
("TBL_ID" = 56))
Filter: (("PART_NAME")::text ~~
'dt=2020-05-07/ns=compute-1588357950-c2h8/app=usage-monitor'::text)
Planning Time: 0.110 ms
Execution Time: 0.129 ms
(5 rows)
{noformat}
[~wechar] can you please clarify with which DBMS did you run the experiments
and also outline the impact on the query execution plan similarly to what I
have done above?
> Concatenate equality conditions in AND nodes
> --------------------------------------------
>
> Key: HIVE-29312
> URL: https://issues.apache.org/jira/browse/HIVE-29312
> Project: Hive
> Issue Type: Improvement
> Components: Standalone Metastore
> Affects Versions: 4.1.0
> Reporter: Wechar
> Assignee: Wechar
> Priority: Major
> Labels: pull-request-available
>
> Current partition filtert tree is a binary tree with single condition in the
> leaf node, actually the adjacent AND leaf nodes can be combined together to
> match longer index prefix. For example, a table has partition key: a, b, c,
> d, and a filter is {{a=1 and b=1 and c=1}}, when visit as a binary tree
> *AND(AND(a=1, b=1), c=1)*, the filter is:
> {code:sql}
> PART_NAME like 'a=1/%' and PART_NAME like '%/b=2/%' and PART_NAME like
> '%/c=3/%'
> {code}
> If we combine the adjacent AND conditions, the tree will be *MultiAnd(a=1,
> b=2, c=3)*, aand the filte can be:
> {code:bash}
> PART_NAME like 'a=1/b=2/c=3/%'
> {code}
> Obviously, the latter could match a longer index prefix and is more efficient.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)