Zoltán Borók-Nagy created IMPALA-12765: ------------------------------------------
Summary: Balance consecutive partitions better for Iceberg tables Key: IMPALA-12765 URL: https://issues.apache.org/jira/browse/IMPALA-12765 Project: IMPALA Issue Type: Bug Components: Frontend Reporter: Zoltán Borók-Nagy During scheduling Impala does the following: * Non-Iceberg tables ** The scheduler processes the scan ranges in partition key order ** The scheduler selects N replicas as candidates ** The scheduler chooses the executor from the candidates based on minimum number of assigned bytes ** So consecutive partitions are more likely to be assigned to different executors * Iceberg tables ** The scheduler processes the scan ranges in random order ** The scheduler selects N replicas as candidates ** The scheduler chooses the executor from the candidates based on minimum number of assigned bytes ** So consecutive partitions (by partition key order) are assigned randomly, i.e. there's a higher chances of clustering If the IcebergScanNode ordered its file descriptors based on their paths we would have a more balanced scheduling for consecutive partitions. Queries that operate on a range of partitions are quite common, so it makes sense to optimize that case. -- This message was sent by Atlassian Jira (v8.20.10#820010)