[
https://issues.apache.org/jira/browse/HIVE-29646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088728#comment-18088728
]
Denys Kuzmenko edited comment on HIVE-29646 at 6/13/26 12:45 PM:
-----------------------------------------------------------------
If we assume statistics on external tables are generally stale and therefore
cannot be trusted, then what is the point of having statistics-based
optimizations gated behind {{hive.disable.unsafe.external.table.operations}} at
all? The code path is effectively dead for the majority of setups.

I'm also not sure what is meant by "perform these optimizations for external
tables no matter the stats". These optimizations are inherently stats-driven;
they only trigger when statistics are available and considered accurate. If
statistics are maintained regularly, then we should be able to benefit from
them regardless of whether the table is managed or external.

Also, Hive is primarily an ETL and analytics engine where reads vastly
outnumber writes. Sacrificing read performance for all external tables because
statistics _might_ become stale due to out-of-band modifications seems like a
poor tradeoff, especially when many production environments already have
scheduled statistics maintenance as part of their data pipelines.

Finally, while managed tables exist, the most prominent example today is Hive
ACID. In practice, ACID table adoption remains relatively limited compared to
external tables, especially in modern lakehouse deployments where external
tables (including Iceberg and similar formats) are the dominant table type. As
a result, keeping these optimizations disabled for external tables effectively
excludes the majority of real-world Hive workloads from benefiting from
statistics-based query planning.

I understand that this proposal runs counter to the rationale behind
HIVE-19335, which disabled runtime filtering for external tables due to
concerns about stale statistics. However, it is worth noting that other engines
operating on the same datasets make a different tradeoff. For example, Impala
supports runtime filtering on external tables.
> Enable semijoin reduction and map-join conversion on external tables with
> accurate statistics
> ---------------------------------------------------------------------------------------------
>
> Key: HIVE-29646
> URL: https://issues.apache.org/jira/browse/HIVE-29646
> Project: Hive
> Issue Type: Improvement
> Reporter: Denys Kuzmenko
> Priority: Major
> Labels: pull-request-available
>