[ 
https://issues.apache.org/jira/browse/IMPALA-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltán Borók-Nagy updated IMPALA-11583:
---------------------------------------
    Description: 
COMPUTE STATS updates table-level stats via the alter_table() HMS API. This
replaces the whole HMS table, so concurrent modifications made by another
engine, e.g. Hive, can be lost.

This is critical for Iceberg tables, as the 'metadata_location' table property 
must always point to the latest snapshot. Inadvertently rewriting it during 
COMPUTE STATS can result in data loss.
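
To illustrate the lost update, here is a toy sketch in plain Java. It does not
use the real HMS client; the class, its methods and the property values are
hypothetical stand-ins that only mimic "replace the whole table" semantics.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: one property map per table, overwritten as a whole on alter.
class LostUpdateSketch {
    // Hypothetical stand-in for the table properties stored in the metastore.
    static Map<String, String> stored = new HashMap<>();

    // Returns a private copy, as a client holds after loading the table.
    static Map<String, String> load() {
        return new HashMap<>(stored);
    }

    // Whole-object replacement: whatever the caller passes in wins.
    static void alterTable(Map<String, String> replacement) {
        stored = new HashMap<>(replacement);
    }

    static String run() {
        stored = new HashMap<>();
        stored.put("metadata_location", "v1.metadata.json");

        // Engine A (think COMPUTE STATS) loads the table early.
        Map<String, String> staleCopy = load();

        // Engine B commits a new snapshot in the meantime.
        Map<String, String> fresh = load();
        fresh.put("metadata_location", "v2.metadata.json");
        alterTable(fresh);

        // Engine A writes back its stale copy plus one extra property
        // (the timestamp is a made-up illustration value).
        staleCopy.put("impala.lastComputeStatsTime", "1663000000");
        alterTable(staleCopy);

        // Engine B's update is gone: the pointer is back at v1.
        return stored.get("metadata_location");
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this model the final 'metadata_location' is the old one again, which is the
failure mode described above.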

Table-level stats like 'numRows' and 'totalSize' are already updated by Iceberg
during table modifications, so there is no need to update these values during
COMPUTE STATS.

Column stats are not affected, as they are updated via a different API call
([updateTableColumnStatistics()|https://github.com/apache/impala/blob/4e813b7085c995a7244ef886b00c22e9d93cc80c/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java#L1638]),
which doesn't touch the table properties. But updating statistics also
requires us to update the table property "impala.lastComputeStatsTime". We
should update it via Iceberg APIs when HiveCatalog is used:
https://github.com/apache/impala/blob/4e813b7085c995a7244ef886b00c22e9d93cc80c/fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java#L211
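
As a rough illustration of why a property-level commit is safer than a
whole-table replace, the sketch below models an optimistic commit: a change is
accepted only if it was built on the latest state, otherwise it is redone on a
fresh read. This is a toy model, not Iceberg code; all names and values in it
are hypothetical. In Iceberg itself the corresponding call chain is
table.updateProperties().set(key, value).commit().

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Toy model only: each commit swaps in a new immutable property map.
class PropertyCommitSketch {
    static final AtomicReference<Map<String, String>> current =
        new AtomicReference<>(Collections.<String, String>emptyMap());

    // Applies one property change on top of 'base'. The commit is accepted
    // only if 'base' is still the current state (compare-and-set).
    static boolean tryCommit(Map<String, String> base, String key, String value) {
        Map<String, String> updated = new HashMap<>(base);
        updated.put(key, value);
        return current.compareAndSet(base, Collections.unmodifiableMap(updated));
    }

    // Retries until the change has been applied on top of the latest state.
    static void setProperty(String key, String value) {
        while (!tryCommit(current.get(), key, value)) {
            // Lost the race: the loop re-reads the latest state and reapplies.
        }
    }

    static String run() {
        current.set(Collections.singletonMap("metadata_location", "v1.metadata.json"));

        // The stats writer reads its base state early.
        Map<String, String> staleBase = current.get();

        // Another engine commits a new snapshot in the meantime.
        setProperty("metadata_location", "v2.metadata.json");

        // A commit built on the stale base is rejected instead of overwriting.
        boolean staleAccepted =
            tryCommit(staleBase, "impala.lastComputeStatsTime", "1663000000");

        // Retrying on the refreshed state keeps the other engine's change.
        setProperty("impala.lastComputeStatsTime", "1663000000");

        return staleAccepted + " " + current.get().get("metadata_location");
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The point of the design is that the writer only states the property it wants to
change, and the commit layer decides which base state that change lands on.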

For catalogs other than HiveCatalog, we still need to update the table property
via the HMS API. This should be safe, as other catalogs don't depend on HMS
table properties.

Reloading the HMS table before invoking 'alter_table()' can also be considered
in other cases (including non-Iceberg tables), to reduce the chance of losing
concurrent table updates.
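
A minimal sketch of the reload-before-alter idea, again as a toy model in plain
Java rather than real HMS client code (all names and values are hypothetical).
Note that it only narrows the race window: a change that lands between the
reload and the write can still be lost.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: one property map per table, overwritten as a whole on alter.
class ReloadBeforeAlterSketch {
    static Map<String, String> stored = new HashMap<>();

    static Map<String, String> load() {
        return new HashMap<>(stored);
    }

    static void alterTable(Map<String, String> replacement) {
        stored = new HashMap<>(replacement);
    }

    // Reloads right before the write and applies only our own change on top.
    static void setPropertyWithReload(String key, String value) {
        Map<String, String> latest = load();
        latest.put(key, value);
        alterTable(latest);
    }

    static String run() {
        stored = new HashMap<>();
        stored.put("numRows", "10");

        // Another engine changes the table while our statement is running.
        Map<String, String> other = load();
        other.put("comment", "set by Hive");
        alterTable(other);

        // We do not reuse a copy loaded at statement start; we reload first.
        setPropertyWithReload("impala.lastComputeStatsTime", "1663000000");

        return stored.get("comment");
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```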

  was:
COMPUTE STATS updates table-level stats via the alter_table() HMS API. This
replaces the whole HMS table, so concurrent modifications made by another
engine, e.g. Hive, can be lost.

This is critical for Iceberg tables, as the 'metadata_location' table property
must always point to the latest snapshot. Inadvertently rewriting it during
COMPUTE STATS can result in data loss.

Table-level stats like 'numRows' and 'totalSize' are already updated by Iceberg
during table modifications, so there is no need to update these values during
COMPUTE STATS.

Column stats are not affected, as they are updated via a different API call
(updateTableColumnStatistics()), which doesn't touch the table properties. But
updating statistics also requires us to update the table property
"impala.lastComputeStatsTime". We should update it via Iceberg APIs when
HiveCatalog is used:
https://github.com/apache/impala/blob/4e813b7085c995a7244ef886b00c22e9d93cc80c/fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java#L211

For catalogs other than HiveCatalog, we still need to update the table property
via the HMS API. This should be safe, as other catalogs don't depend on HMS
table properties.

Reloading the HMS table before invoking 'alter_table()' can also be considered
in other cases (including non-Iceberg tables), to reduce the chance of losing
concurrent table updates.


> Use Iceberg APIs to update table properties for Iceberg tables
> --------------------------------------------------------------
>
>                 Key: IMPALA-11583
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11583
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>            Reporter: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg
>
> COMPUTE STATS updates table-level stats via the alter_table() HMS API. This
> replaces the whole HMS table, so concurrent modifications made by another
> engine, e.g. Hive, can be lost.
> This is critical for Iceberg tables, as the 'metadata_location' table
> property must always point to the latest snapshot. Inadvertently rewriting it
> during COMPUTE STATS can result in data loss.
> Table-level stats like 'numRows' and 'totalSize' are already updated by
> Iceberg during table modifications, so there is no need to update these
> values during COMPUTE STATS.
> Column stats are not affected, as they are updated via a different API call
> ([updateTableColumnStatistics()|https://github.com/apache/impala/blob/4e813b7085c995a7244ef886b00c22e9d93cc80c/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java#L1638]),
> which doesn't touch the table properties. But updating statistics also
> requires us to update the table property "impala.lastComputeStatsTime". We
> should update it via Iceberg APIs when HiveCatalog is used:
> https://github.com/apache/impala/blob/4e813b7085c995a7244ef886b00c22e9d93cc80c/fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java#L211
> For catalogs other than HiveCatalog, we still need to update the table
> property via the HMS API. This should be safe, as other catalogs don't depend
> on HMS table properties.
> Reloading the HMS table before invoking 'alter_table()' can also be
> considered in other cases (including non-Iceberg tables), to reduce the
> chance of losing concurrent table updates.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
