[
https://issues.apache.org/jira/browse/IMPALA-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18067684#comment-18067684
]
ASF subversion and git services commented on IMPALA-14583:
----------------------------------------------------------
Commit 20220fb9232b94d228383fe693a383d2c71a4733 in impala's branch
refs/heads/master from Mihaly Szjatinya
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=20220fb92 ]
IMPALA-14583: Support partial RPC dispatch for Iceberg tables
This patch extends IMPALA-11402 to support partial RPC dispatch for
Iceberg tables in local catalog mode. IMPALA-11402 added support for
HDFS partitioned tables where catalogd can truncate the response of
getPartialCatalogObject at partition boundaries when the file count
exceeds catalog_partial_fetch_max_files.
For Iceberg tables, the file list is not organized by partition but is
stored as a flat list of data and delete files. This patch implements
offset-based pagination so that catalogd can truncate the response at
any point in the file list, not just at partition boundaries.
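The offset/limit scheme can be sketched as follows. This is a minimal,
self-contained illustration in which a plain in-memory list stands in for
the content file store; the class and method names are illustrative, not
the actual Impala API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of offset-based pagination over a flat file list.
// The coordinator requests pages of at most 'limit' descriptors (standing
// in for catalog_partial_fetch_max_files) until the list is exhausted.
public class PaginationSketch {
  // Simulates catalogd truncating the response at 'limit' files,
  // starting from 'offset' -- a cut can land anywhere in the list.
  static List<String> fetchPage(List<String> allFiles, int offset, int limit) {
    int end = Math.min(offset + limit, allFiles.size());
    return new ArrayList<>(allFiles.subList(offset, end));
  }

  // Coordinator-side loop: send follow-up requests with an incremented
  // offset until every file descriptor has been received.
  static List<String> fetchAll(List<String> allFiles, int limit) {
    List<String> result = new ArrayList<>();
    int offset = 0;
    while (offset < allFiles.size()) {
      List<String> page = fetchPage(allFiles, offset, limit);
      result.addAll(page);
      offset += page.size();
    }
    return result;
  }
}
```

With 150 files and a limit of 100 (the shape of the first test below), this
takes two requests: one page of 100 files and one of 50.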
Implementation details:
- Added iceberg_file_offset field to TTableInfoSelector thrift struct
- IcebergContentFileStore.toThriftPartial() supports pagination with
offset and limit parameters
- IcebergContentFileStore uses a reverse lookup table
(icebergFileOffsetToContentFile_) for efficient offset-based access to
files
- IcebergTable.getPartialInfo() enforces the file limit configured by
catalog_partial_fetch_max_files (reusing the flag from IMPALA-11402)
- CatalogdMetaProvider.loadIcebergTableWithRetry() implements the retry
loop on the coordinator side, sending follow-up requests with
incremented offsets until all files are fetched
- Coordinator detects catalog version changes between requests and
throws InconsistentMetadataFetchException for query replanning
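The coordinator-side retry loop and the version check between requests can
be sketched like this. All names here (Catalog, PartialResponse,
loadAllFiles) are hypothetical stand-ins for the CatalogdMetaProvider
machinery, not the actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the coordinator-side retry loop with a catalog
// version check between follow-up requests.
public class RetryLoopSketch {
  // One truncated catalogd response: a slice of the file list, the catalog
  // version it was served from, and whether the list is complete.
  static class PartialResponse {
    final List<String> files;
    final long catalogVersion;
    final boolean complete;
    PartialResponse(List<String> files, long catalogVersion, boolean complete) {
      this.files = files;
      this.catalogVersion = catalogVersion;
      this.complete = complete;
    }
  }

  // Stand-in for the RPC to catalogd: fetch the page starting at 'offset'.
  interface Catalog {
    PartialResponse fetch(int offset);
  }

  // Stand-in for the exception that triggers query replanning.
  static class InconsistentMetadataFetchException extends RuntimeException {
    InconsistentMetadataFetchException(String msg) { super(msg); }
  }

  // Keeps fetching with incremented offsets; aborts if the catalog version
  // changes mid-fetch, so pages from different table versions never mix.
  static List<String> loadAllFiles(Catalog catalog) {
    List<String> files = new ArrayList<>();
    long version = -1;
    int offset = 0;
    while (true) {
      PartialResponse resp = catalog.fetch(offset);
      if (version == -1) {
        version = resp.catalogVersion;
      } else if (version != resp.catalogVersion) {
        throw new InconsistentMetadataFetchException(
            "catalog version changed during partial fetch");
      }
      files.addAll(resp.files);
      offset += resp.files.size();
      if (resp.complete) return files;
    }
  }
}
```

The version check is what keeps the stitched-together file list consistent:
rather than mixing pages from two table versions, the fetch is abandoned and
the query replanned.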
Key differences from IMPALA-11402:
- Offset-based pagination instead of partition-based (can split
anywhere)
- Single flat file list instead of per-partition file lists
- Works with both data files and delete files (Iceberg v2)
Tests:
- Added two custom-cluster tests in TestAllowIncompleteData:
* test_incomplete_iceberg_file_list: 150 data files with limit=100
* test_iceberg_with_delete_files: 60+ data+delete files with limit=50
- Both tests verify partial fetch across multiple requests and proper
log messages for truncation warnings and request counts
Change-Id: I7f2c058b7cc8efc15bac9fe0e91baadbb7b92cbb
Reviewed-on: http://gerrit.cloudera.org:8080/24041
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Limit the number of file descriptors per RPC to avoid JVM OOM in Catalog
> ------------------------------------------------------------------------
>
> Key: IMPALA-14583
> URL: https://issues.apache.org/jira/browse/IMPALA-14583
> Project: IMPALA
> Issue Type: Improvement
> Components: Catalog, Frontend
> Reporter: Noémi Pap-Takács
> Assignee: Mihaly Szjatinya
> Priority: Critical
> Labels: Catalog, OOM, impala-iceberg
>
> We often get an OOM error when Impala tries to load a very large Iceberg
> table. This happens because the Catalog loads all the file descriptors and
> sends them to the Coordinator in one RPC, serializing all of them into one
> big byte array. However, the JVM has a limit on array length, so sending
> the entire table in one call can exceed this limit if the table has too
> many files.
> We could limit the number of files per RPC, so that the 2GB JVM array limit
> is not exceeded.
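The ceiling in the description above comes from Java arrays being indexed by
int, so a byte[] tops out near Integer.MAX_VALUE (~2GB). A back-of-envelope
check, using a purely hypothetical 512-byte serialized size per descriptor:

```java
// Rough check of how many file descriptors fit in one serialized byte
// array. The 512-byte per-descriptor figure is a made-up illustration;
// actual serialized descriptor sizes vary.
public class ArrayLimitSketch {
  // Java arrays are int-indexed, so byte[] length is at most
  // Integer.MAX_VALUE (2^31 - 1); VMs typically cap slightly below that.
  static final long MAX_ARRAY_BYTES = Integer.MAX_VALUE;

  // Maximum descriptor count that fits in a single serialized array.
  static long maxFilesPerRpc(long bytesPerDescriptor) {
    return MAX_ARRAY_BYTES / bytesPerDescriptor;
  }
}
```

At that assumed size, a few million files would already exhaust a single
array, which is why capping the file count per RPC is needed.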
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]