Zoltán Borók-Nagy created IMPALA-11721:
------------------------------------------

             Summary: Impala query keep being retried over frequently updated 
iceberg table
                 Key: IMPALA-11721
                 URL: https://issues.apache.org/jira/browse/IMPALA-11721
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
            Reporter: Zoltán Borók-Nagy


# Iceberg table loading can fail in local catalog mode if the table gets 
updated frequently.

This is what happens during table loading in local catalog mode:

Every query starts with its own empty local catalog. Table metadata is fetched 
in multiple requests via a MetaProvider, which is always a CatalogdMetaProvider. 
CatalogdMetaProvider caches requests, and the cache key also includes the 
table's catalog version.
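
To illustrate the caching described above, here is a minimal sketch of a cache 
whose key includes the catalog version. It is not Impala's code: the class 
VersionedCacheSketch, its get() method, and the request name "columnStats" are 
hypothetical names chosen for illustration only.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch; these names are not Impala's actual classes.
class VersionedCacheSketch {
  // Key = (table name, request kind, catalog version of the table).
  private final Map<List<Object>, Object> cache = new HashMap<>();
  int loads = 0;  // counts how often a real fetch was needed

  Object get(String table, String request, long version, Supplier<Object> loader) {
    List<Object> key = Arrays.<Object>asList(table, request, version);
    // Only a cache miss invokes the loader (the stand-in for a CatalogD request).
    return cache.computeIfAbsent(key, k -> { loads++; return loader.get(); });
  }

  public static void main(String[] args) {
    VersionedCacheSketch c = new VersionedCacheSketch();
    c.get("db.tbl", "columnStats", 1L, () -> "stats@1");  // miss: real fetch
    c.get("db.tbl", "columnStats", 1L, () -> "unused");   // hit: same version
    c.get("db.tbl", "columnStats", 2L, () -> "stats@2");  // miss: new version
    System.out.println("real fetches: " + c.loads);       // prints: real fetches: 2
  }
}
```

Because the version is part of the key, an entry for an older version is never 
returned for a newer one; it simply stops being looked up.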

The Iceberg table is loaded by the following requests:
 # CatalogdMetaProvider.loadTable()
 # CatalogdMetaProvider.loadIcebergTable()
 # CatalogdMetaProvider.loadIcebergApiTable() # This actually directly loads 
the Iceberg table via Iceberg API (no CatalogD involved)
 # CatalogdMetaProvider.loadTableColumnStatistics()
 # CatalogdMetaProvider.loadPartitionList()
 # CatalogdMetaProvider.loadPartitionsByRefs()

Steps 1-4 happen during table loading, steps 5-6 happen during planning. We 
cannot really reorder these invocations, but since CatalogdMetaProvider caches 
these, only the very first invocations need to reach out to CatalogD and check 
the table's catalog version. Subsequent invocations, i.e. subsequent queries 
that use the Iceberg table, can use the cached metadata, and there is no need 
to check the catalog version of the cached metadata since the cache key also 
includes the catalog version.

I see two things that could resolve the issue:
 # Speed up loadIcebergApiTable()
 ** either by speeding up Iceberg table loading itself
 ** or by making the Iceberg API table serializable, so we can fetch it from CatalogD
 # Pre-warm the cache before issuing loadIcebergApiTable()
 ** so the CatalogdMetaProvider.load*() operations can be served from cache

Option 1 needs contributions to the Iceberg library.
Option 2 can be done relatively easily. We just need to pre-invoke 
loadTableColumnStatistics() and 
FeCatalogUtils.loadAllPartitions() (which invokes loadPartitionList() and 
loadPartitionsByRefs()) before loadIcebergApiTable(). Then, when they are 
needed later, they can be served from cache.
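
The effect of pre-warming can be sketched with a small simulation. This is a 
simplified model under stated assumptions, not Impala's implementation: it 
assumes that a cache miss whose expected catalog version no longer matches 
CatalogD's version is what forces the retry, and that the table is updated 
while the slow Iceberg API load runs. The class PrewarmSketch and its methods 
fetch(), slowIcebergApiLoad() and runQuery() are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model; class and method names are hypothetical, not Impala's code.
class PrewarmSketch {
  long catalogdVersion = 1;  // version currently held by the simulated CatalogD
  final Map<String, String> cache = new HashMap<>();

  // Serves a request from cache, or fetches it and verifies the catalog version.
  String fetch(String request, long expectedVersion) {
    String key = request + "@" + expectedVersion;
    String cached = cache.get(key);
    if (cached != null) return cached;  // hit: no version check needed
    if (catalogdVersion != expectedVersion) {
      throw new IllegalStateException("inconsistent metadata for " + request);
    }
    String value = request + "-v" + expectedVersion;
    cache.put(key, value);
    return value;
  }

  // Stands in for the slow Iceberg API load; a writer updates the table meanwhile.
  void slowIcebergApiLoad() { catalogdVersion++; }

  // Returns true if the query completes without hitting inconsistent metadata.
  static boolean runQuery(boolean prewarm) {
    PrewarmSketch p = new PrewarmSketch();
    long v = p.catalogdVersion;  // version seen by the first request
    try {
      p.fetch("loadTable", v);
      if (prewarm) {  // fill the cache before the slow step
        p.fetch("loadTableColumnStatistics", v);
        p.fetch("loadPartitionList", v);
        p.fetch("loadPartitionsByRefs", v);
      }
      p.slowIcebergApiLoad();
      p.fetch("loadTableColumnStatistics", v);
      p.fetch("loadPartitionList", v);
      p.fetch("loadPartitionsByRefs", v);
      return true;
    } catch (IllegalStateException e) {
      return false;  // in this model, the query would have to be retried
    }
  }

  public static void main(String[] args) {
    System.out.println("without pre-warm: " + runQuery(false));  // false
    System.out.println("with pre-warm:    " + runQuery(true));   // true
  }
}
```

In this model the pre-warmed run succeeds because every request issued after 
the slow step is a cache hit for the version observed at the start.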



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
