[jira] [Created] (IMPALA-12765) Balance consecutive partitions better for Iceberg tables

Jira Mon, 29 Jan 2024 07:36:07 -0800

Zoltán Borók-Nagy created IMPALA-12765:
------------------------------------------


             Summary: Balance consecutive partitions better for Iceberg tables
                 Key: IMPALA-12765
                 URL: https://issues.apache.org/jira/browse/IMPALA-12765
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
            Reporter: Zoltán Borók-Nagy


During scheduling, Impala does the following:

* Non-Iceberg tables
** The scheduler processes the scan ranges in partition key order
** The scheduler selects N replicas as candidates
** The scheduler chooses the executor from the candidates based on minimum 
number of assigned bytes
** So consecutive partitions are more likely to be assigned to different 
executors
* Iceberg tables
** The scheduler processes the scan ranges in random order
** The scheduler selects N replicas as candidates
** The scheduler chooses the executor from the candidates based on minimum 
number of assigned bytes
** So consecutive partitions (by partition key order) are assigned randomly, 
i.e. there is a higher chance that they cluster on the same executors
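
The effect of processing order can be illustrated with a toy model. This is only a sketch of the greedy "fewest assigned bytes" choice described above, not Impala's scheduler code: the names ScanRange and assign, the executor names, and the file paths are all hypothetical, and every range is assumed to have the same size and the same two candidates.

```python
from typing import NamedTuple, Tuple


class ScanRange(NamedTuple):
    # Hypothetical, simplified stand-in for a scan range.
    path: str                  # file path, including the partition directory
    length: int                # bytes to scan
    replicas: Tuple[str, ...]  # candidate executors for this range


def assign(scan_ranges):
    """Give each range, in the order received, to the candidate with the fewest bytes so far."""
    assigned_bytes = {}
    placement = {}
    for sr in scan_ranges:
        # min() returns the first minimum, so ties go to the earlier candidate.
        chosen = min(sr.replicas, key=lambda ex: assigned_bytes.get(ex, 0))
        assigned_bytes[chosen] = assigned_bytes.get(chosen, 0) + sr.length
        placement[sr.path] = chosen
    return placement


executors = ("exec-0", "exec-1")
ranges = [
    ScanRange(f"tbl/data/day=2024-01-0{d}/f.parquet", 100, executors)
    for d in range(1, 5)
]

# Partition key order: days 1, 2, 3, 4 alternate between the two executors.
in_key_order = assign(ranges)

# One possible random order: days 1, 3, 2, 4.
# Days 1 and 2 both land on exec-0, days 3 and 4 both land on exec-1.
shuffled = assign([ranges[0], ranges[2], ranges[1], ranges[3]])

for sr in ranges:
    print(sr.path, in_key_order[sr.path], shuffled[sr.path])
```

In this model a query that reads only days 1 and 2 uses both executors in the first case and a single executor in the second, even though the total bytes per executor are the same in both.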

If the IcebergScanNode ordered its file descriptors based on their paths, we 
would get more balanced scheduling for consecutive partitions. Queries that 
operate on a range of partitions are quite common, so it makes sense to 
optimize that case.
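
As a rough illustration of why sorting by path helps, the sketch below sorts a shuffled list of file paths. It assumes a Hive-style directory layout in which the partition value is part of the path and sorts lexicographically in partition key order (for example ISO dates or zero-padded numbers); the helper order_by_path and the paths themselves are hypothetical, and this is not the IcebergScanNode implementation.

```python
import random


def order_by_path(file_paths):
    """Return the paths in lexicographic order (hypothetical helper)."""
    return sorted(file_paths)


# Two files in each of three consecutive daily partitions (made-up paths).
paths = [
    f"tbl/data/day=2024-01-{d:02d}/part-{i}.parquet"
    for d in (1, 2, 3)
    for i in (0, 1)
]

# Simulate file descriptors arriving in an arbitrary order.
arrival_order = paths[:]
random.Random(7).shuffle(arrival_order)

# After sorting, files of the same partition are adjacent and the
# partitions follow each other in key order.
ordered = order_by_path(arrival_order)
for p in ordered:
    print(p)
```

If the partition values do not sort lexicographically in key order (for example unpadded integers), sorting by path would only group files per partition and would not reproduce the key order.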



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
