[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-30 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/20973 )

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..

IMPALA-12765: Balance consecutive partitions better for Iceberg tables

During remote read scheduling Impala does the following:

Non-Iceberg tables
 * The scheduler processes the scan ranges in partition key order
 * The scheduler selects N executors as candidates
 * The scheduler chooses the executor from the candidates based on
   minimum number of assigned bytes
 * So consecutive partitions are more likely to be assigned to
   different executors

Iceberg tables
 * The scheduler processes the scan ranges in random order
 * The scheduler selects N executors as candidates
 * The scheduler chooses the executor from the candidates based on
   minimum number of assigned bytes
 * So consecutive partitions (by partition key order) are assigned
   randomly, i.e. there's a higher chance of clustering

With this patch, IcebergScanNode orders its file descriptors based on
their paths, so we will have a more balanced scheduling for consecutive
partitions. It is especially important for queries that prune partitions
via runtime filters (e.g. due to a JOIN), because it doesn't matter that
we schedule the scan ranges evenly, the scan ranges that survive the
runtime filters can still be clustered on certain executors.

E.g. TPC-DS Q22 has the following JOIN and WHERE predicates:

 inv_date_sk=d_date_sk and
 d_month_seq between 1199 and 1199 + 11

The Inventory table is partitioned by column inv_date_sk, and we filter
the rows in the joined table by 'd_month_seq between 1199 and
1199 + 11'. This means that we will only need a range of partitions from
the Inventory table, but that range will only be revealed during
runtime. Scheduling neighbouring partitions to different executors means
that the surviving partitions are spread across executors more evenly.
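
A toy model of the reasoning above, not Impala's scheduler: it assumes
every executor is a candidate, all files are the same size, and ties go to
the lowest executor index. The path layout, the executor count and the
surviving range are made up for illustration. It compares how many of the
surviving files land on the busiest executor when scan ranges are processed
in path order versus shuffled order.

```python
import random
from collections import Counter


def assign(paths, num_executors, size_of=lambda p: 1):
    """Toy scheduler: give each scan range to the least-loaded executor.

    Simplification: every executor is a candidate and all files are
    equal-sized unless size_of says otherwise.
    """
    assigned_bytes = [0] * num_executors
    placement = {}
    for path in paths:
        # min() returns the first minimum, so ties go to the lowest index.
        target = min(range(num_executors), key=lambda e: assigned_bytes[e])
        assigned_bytes[target] += size_of(path)
        placement[path] = target
    return placement


def max_load(placement, surviving):
    """Largest number of surviving files that landed on one executor."""
    return max(Counter(placement[p] for p in surviving).values())


# Hypothetical layout: one file per inv_date_sk partition.
paths = ["inventory/inv_date_sk=%04d/data.parq" % i for i in range(120)]
# Hypothetical contiguous range that a runtime filter keeps.
surviving = paths[40:52]

sorted_placement = assign(sorted(paths), num_executors=4)

shuffled = paths[:]
random.Random(0).shuffle(shuffled)
random_placement = assign(shuffled, num_executors=4)

print("path order    :", max_load(sorted_placement, surviving))
print("shuffled order:", max_load(random_placement, surviving))
```

In this model, path order spreads the 12 surviving files as 3 per executor,
which is the best possible with 4 executors; shuffled order can do no
better and will often do worse.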

Testing:
 * e2e test

Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Reviewed-on: http://gerrit.cloudera.org:8080/20973
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 
---
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M tests/query_test/test_iceberg.py
2 files changed, 64 insertions(+), 2 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/20973
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 8
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-30 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20973 )

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..


Patch Set 7: Verified+1


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 31 Jan 2024 00:39:21 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-30 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20973 )

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..


Patch Set 7:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/10216/ 
DRY_RUN=false


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 30 Jan 2024 20:05:04 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-30 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20973 )

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..


Patch Set 7: Code-Review+2


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 30 Jan 2024 20:05:03 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-30 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20973 )

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..


Patch Set 6:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/15113/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 30 Jan 2024 18:40:57 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-30 Thread Riza Suminto (Code Review)
Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20973 )

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..


Patch Set 6: Code-Review+2


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 30 Jan 2024 18:23:34 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-30 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20973 )

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/20973/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20973/5//COMMIT_MSG@41
PS5, Line 41:
> nit: that
Done


http://gerrit.cloudera.org:8080/#/c/20973/5/tests/query_test/test_iceberg.py
File tests/query_test/test_iceberg.py:

http://gerrit.cloudera.org:8080/#/c/20973/5/tests/query_test/test_iceberg.py@1025
PS5, Line 1025: splits = [l.strip() for l in profile.splitlines() if "Hdfs split stats" in l]
> nit: impala-flake8 catch 1 issue here:
Done
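
For context, the quoted line collects the "Hdfs split stats" entries from
the query profile text. A minimal sketch with a made-up profile excerpt
follows. The flake8 finding itself is not quoted in this thread, so the
rename from `l` to `line` is only a readability choice here and not
necessarily the fix that was applied.

```python
# Made-up profile excerpt, for illustration only.
profile = """\
HDFS_SCAN_NODE (id=0)
  Hdfs split stats (<volume id>:<# splits>/<split lengths>): -1:3/1.50 MB
  Hdfs Read Thread Concurrency Bucket: 0:100%
HDFS_SCAN_NODE (id=0)
  Hdfs split stats (<volume id>:<# splits>/<split lengths>): -1:2/1.00 MB
"""

# Keep only the lines that report split statistics, without indentation.
splits = [line.strip() for line in profile.splitlines()
          if "Hdfs split stats" in line]

for entry in splits:
    print(entry)
```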



--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 30 Jan 2024 18:15:38 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-30 Thread Zoltan Borok-Nagy (Code Review)
Hello Riza Suminto, Daniel Becker, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20973

to look at the new patch set (#6).

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..

IMPALA-12765: Balance consecutive partitions better for Iceberg tables

During remote read scheduling Impala does the following:

Non-Iceberg tables
 * The scheduler processes the scan ranges in partition key order
 * The scheduler selects N executors as candidates
 * The scheduler chooses the executor from the candidates based on
   minimum number of assigned bytes
 * So consecutive partitions are more likely to be assigned to
   different executors

Iceberg tables
 * The scheduler processes the scan ranges in random order
 * The scheduler selects N executors as candidates
 * The scheduler chooses the executor from the candidates based on
   minimum number of assigned bytes
 * So consecutive partitions (by partition key order) are assigned
   randomly, i.e. there's a higher chance of clustering

With this patch, IcebergScanNode orders its file descriptors based on
their paths, so we will have a more balanced scheduling for consecutive
partitions. It is especially important for queries that prune partitions
via runtime filters (e.g. due to a JOIN), because it doesn't matter that
we schedule the scan ranges evenly, the scan ranges that survive the
runtime filters can still be clustered on certain executors.

E.g. TPC-DS Q22 has the following JOIN and WHERE predicates:

 inv_date_sk=d_date_sk and
 d_month_seq between 1199 and 1199 + 11

The Inventory table is partitioned by column inv_date_sk, and we filter
the rows in the joined table by 'd_month_seq between 1199 and
1199 + 11'. This means that we will only need a range of partitions from
the Inventory table, but that range will only be revealed during
runtime. Scheduling neighbouring partitions to different executors means
that the surviving partitions are spread across executors more evenly.

Testing:
 * e2e test

Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
---
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M tests/query_test/test_iceberg.py
2 files changed, 64 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/73/20973/6
--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-30 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20973 )

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..


Patch Set 5: Verified+1


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 30 Jan 2024 18:00:26 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-30 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20973 )

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..


Patch Set 5:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/15104/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 30 Jan 2024 13:34:13 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-30 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20973 )

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..


Patch Set 4:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/15103/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 30 Jan 2024 13:33:52 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-30 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20973 )

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..


Patch Set 5:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/10215/ 
DRY_RUN=true


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 30 Jan 2024 13:27:15 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-30 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20973 )

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..


Patch Set 3:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/15102/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 30 Jan 2024 13:25:37 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-30 Thread Daniel Becker (Code Review)
Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20973 )

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..


Patch Set 5: Code-Review+1


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 30 Jan 2024 13:12:05 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-30 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20973 )

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/20973/3/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java:

http://gerrit.cloudera.org:8080/#/c/20973/3/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@55
PS3, Line 55: need
> Nit: "that need to be".
Went with "needed".



--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 30 Jan 2024 13:09:56 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-30 Thread Zoltan Borok-Nagy (Code Review)
Hello Riza Suminto, Daniel Becker, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20973

to look at the new patch set (#5).

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..

IMPALA-12765: Balance consecutive partitions better for Iceberg tables

During remote read scheduling Impala does the following:

Non-Iceberg tables
 * The scheduler processes the scan ranges in partition key order
 * The scheduler selects N executors as candidates
 * The scheduler chooses the executor from the candidates based on
   minimum number of assigned bytes
 * So consecutive partitions are more likely to be assigned to
   different executors

Iceberg tables
 * The scheduler processes the scan ranges in random order
 * The scheduler selects N executors as candidates
 * The scheduler chooses the executor from the candidates based on
   minimum number of assigned bytes
 * So consecutive partitions (by partition key order) are assigned
   randomly, i.e. there's a higher chance of clustering

With this patch, IcebergScanNode orders its file descriptors based on
their paths, so we will have a more balanced scheduling for consecutive
partitions. It is especially important for queries that prune partitions
via runtime filters (e.g. due to a JOIN), because it doesn't matter that
we schedule the scan ranges evenly, the scan ranges that survive the
runtime filters can still be clustered on certain executors.

E.g. TPC-DS Q22 has the following JOIN and WHERE predicates:

 inv_date_sk=d_date_sk and
 d_month_seq between 1199 and 1199 + 11

The Inventory table is partitioned by column inv_date_sk, and we filter
the rows in the joined table by 'd_month_seq between 1199 and
1199 + 11'. This means the we will only need a range of partitions from
the Inventory table, but that range will only be revealed during
runtime. Scheduling neighbouring partitions to different executors means
that the surviving partitions are spread across executors more evenly.

Testing:
 * e2e test

Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
---
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M tests/query_test/test_iceberg.py
2 files changed, 63 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/73/20973/5
--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-30 Thread Zoltan Borok-Nagy (Code Review)
Hello Riza Suminto, Daniel Becker, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20973

to look at the new patch set (#4).

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..

IMPALA-12765: Balance consecutive partitions better for Iceberg tables

During remote read scheduling Impala does the following:

Non-Iceberg tables
 * The scheduler processes the scan ranges in partition key order
 * The scheduler selects N executors as candidates
 * The scheduler chooses the executor from the candidates based on
   minimum number of assigned bytes
 * So consecutive partitions are more likely to be assigned to
   different executors

Iceberg tables
 * The scheduler processes the scan ranges in random order
 * The scheduler selects N executors as candidates
 * The scheduler chooses the executor from the candidates based on
   minimum number of assigned bytes
 * So consecutive partitions (by partition key order) are assigned
   randomly, i.e. there's a higher chance of clustering

With this patch, IcebergScanNode orders its file descriptors based on
their paths, so we will have a more balanced scheduling for consecutive
partitions. It is especially important for queries that prune partitions
via runtime filters (e.g. due to a JOIN), because it doesn't matter that
we schedule the scan ranges evenly, the scan ranges that survive the
runtime filters can still be clustered on certain executors.

E.g. TPC-DS Q22 has the following JOIN and WHERE predicates:

 inv_date_sk=d_date_sk and
 d_month_seq between 1199 and 1199 + 11

The Inventory table is partitioned by column inv_date_sk, and we filter
the rows in the joined table by 'd_month_seq between 1199 and
1199 + 11'. This means the we will only need a range of partitions from
the Inventory table, but that range will only be revealed during
runtime. Scheduling neighbouring partitions to different executors means
that the surviving partitions are spread across executors more evenly.

Testing:
 * e2e test

Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
---
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M tests/query_test/test_iceberg.py
2 files changed, 63 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/73/20973/4
--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-30 Thread Daniel Becker (Code Review)
Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20973 )

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..


Patch Set 3: Code-Review+1

(1 comment)

Thanks.

http://gerrit.cloudera.org:8080/#/c/20973/3/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java:

http://gerrit.cloudera.org:8080/#/c/20973/3/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@55
PS3, Line 55: need
Nit: "that need to be".



--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 30 Jan 2024 13:05:51 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-30 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20973 )

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..


Patch Set 3:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/20973/3/tests/query_test/test_iceberg.py
File tests/query_test/test_iceberg.py:

http://gerrit.cloudera.org:8080/#/c/20973/3/tests/query_test/test_iceberg.py@1071
PS3, Line 1071: o
flake8: E501 line too long (92 > 90 characters)


http://gerrit.cloudera.org:8080/#/c/20973/3/tests/query_test/test_iceberg.py@1083
PS3, Line 1083: \
flake8: W605 invalid escape sequence '\d'


http://gerrit.cloudera.org:8080/#/c/20973/3/tests/query_test/test_iceberg.py@1083
PS3, Line 1083: \
flake8: W605 invalid escape sequence '\('


http://gerrit.cloudera.org:8080/#/c/20973/3/tests/query_test/test_iceberg.py@1083
PS3, Line 1083: \
flake8: W605 invalid escape sequence '\d'


http://gerrit.cloudera.org:8080/#/c/20973/3/tests/query_test/test_iceberg.py@1083
PS3, Line 1083: \
flake8: W605 invalid escape sequence '\)'
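
W605 is raised when an ordinary string literal contains a backslash
sequence that Python does not recognise, such as '\d' or '\('. The usual
remedy is a raw string, so the backslashes reach the regex engine
unchanged. The pattern and the sample line below are hypothetical, since
the test's actual regex at line 1083 is not quoted in this thread.

```python
import re

# Hypothetical profile line; the real test input is not shown here.
line = "Files rejected: 12 (34)"

# The r"" prefix keeps \d, \( and \) intact for the re module, which is
# what flake8 W605 asks for.
pattern = re.compile(r"Files rejected: (\d+) \((\d+)\)")

match = pattern.search(line)
print(match.groups())
```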



--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 30 Jan 2024 13:00:58 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-30 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20973 )

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..


Patch Set 3:

(5 comments)

Thanks for the comments!

http://gerrit.cloudera.org:8080/#/c/20973/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20973/2//COMMIT_MSG@25
PS2, Line 25: e
> Nit: chance.
Done


http://gerrit.cloudera.org:8080/#/c/20973/2//COMMIT_MSG@27
PS2, Line 27: With this patch, IcebergScanNode orders its file descriptors 
based on
> Could you elaborate why it is beneficial to assign neighbouring partitions t
Added some details and examples.


http://gerrit.cloudera.org:8080/#/c/20973/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java:

http://gerrit.cloudera.org:8080/#/c/20973/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@55
PS1, Line 55: // List of files need to be scanned by t
> It is only sorted if the table is partitioned, isn't it?
Currently yes, because there's no need to sort ranges of unpartitioned tables. 
OTOH, sorting them probably wouldn't add much overhead, and the code would 
become simpler.


http://gerrit.cloudera.org:8080/#/c/20973/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@210
PS1, Line 210: verride
 :   protected Map
> It is only sorted if the table is partitioned, isn't it?
Yes, I added a condition to the sort.
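
A minimal sketch of the idea of sorting only when the table is
partitioned, written in Python rather than the Java of IcebergScanNode.
The function and parameter names are hypothetical and do not come from
the patch.

```python
def order_for_scheduling(file_paths, is_partitioned):
    """Return file paths in the order they would be handed to the scheduler.

    Hypothetical helper: partitioned tables get path order so that
    neighbouring partitions are processed one after another; unpartitioned
    tables keep their incoming order because sorting gains nothing there.
    """
    if is_partitioned:
        return sorted(file_paths)
    return list(file_paths)


files = ["tbl/p=3/a.parq", "tbl/p=1/b.parq", "tbl/p=2/c.parq"]
print(order_for_scheduling(files, is_partitioned=True))
print(order_for_scheduling(files, is_partitioned=False))
```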


http://gerrit.cloudera.org:8080/#/c/20973/2/tests/query_test/test_iceberg.py
File tests/query_test/test_iceberg.py:

http://gerrit.cloudera.org:8080/#/c/20973/2/tests/query_test/test_iceberg.py@1086
PS2, Line 1086:   for files_rejected_str in files_rejected_array:
> Optional: I find 'continue' to be a bit more difficult to follow than a con
Done



--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 30 Jan 2024 13:00:57 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-30 Thread Zoltan Borok-Nagy (Code Review)
Hello Riza Suminto, Daniel Becker, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20973

to look at the new patch set (#3).

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..

IMPALA-12765: Balance consecutive partitions better for Iceberg tables

During remote read scheduling Impala does the following:

Non-Iceberg tables
 * The scheduler processes the scan ranges in partition key order
 * The scheduler selects N executors as candidates
 * The scheduler chooses the executor from the candidates based on
   minimum number of assigned bytes
 * So consecutive partitions are more likely to be assigned to
   different executors

Iceberg tables
 * The scheduler processes the scan ranges in random order
 * The scheduler selects N executors as candidates
 * The scheduler chooses the executor from the candidates based on
   minimum number of assigned bytes
 * So consecutive partitions (by partition key order) are assigned
   randomly, i.e. there's a higher chance of clustering

With this patch, IcebergScanNode orders its file descriptors based on
their paths, so we will have a more balanced scheduling for consecutive
partitions. It is especially important for queries that prune partitions
via runtime filters (e.g. due to a JOIN), because it doesn't matter that
we schedule the scan ranges evenly, the scan ranges that survive the
runtime filters can still be clustered on certain executors.

E.g. TPC-DS Q22 has the following JOIN and WHERE predicates:

 inv_date_sk=d_date_sk and
 d_month_seq between 1199 and 1199 + 11

The Inventory table is partitioned by column inv_date_sk, and we filter
the rows in the joined table by 'd_month_seq between 1199 and
1199 + 11'. This means that we will only need a range of partitions from
the Inventory table, but that range will only be revealed during
runtime. Scheduling neighbouring partitions to different executors means
that the surviving partitions are spread across executors more evenly.
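
The effect can be illustrated with a simplified model. This is a sketch under stated assumptions, not the scheduler's actual code: every partition holds one file of equal size, all executors are candidates for every scan range (the real scheduler selects N candidates), and ties go to the lowest executor id. The names `schedule` and `window_load` are hypothetical.

```python
import random
from collections import Counter


def schedule(partitions, num_executors):
    """Assign each partition (one equal-sized file each) to the executor
    with the fewest assigned bytes so far; ties go to the lowest id."""
    assigned_bytes = [0] * num_executors
    placement = {}
    for part in partitions:  # the processing order is what the patch changes
        executor = min(range(num_executors),
                       key=lambda e: (assigned_bytes[e], e))
        assigned_bytes[executor] += 1  # every file counts as one unit
        placement[part] = executor
    return placement


def window_load(placement, first, last, num_executors):
    """Count how many partitions in [first, last] landed on each executor."""
    counts = Counter(placement[p] for p in range(first, last + 1))
    return [counts.get(e, 0) for e in range(num_executors)]


partitions = list(range(40))  # hypothetical partition keys 0..39
sorted_placement = schedule(partitions, num_executors=4)

shuffled = partitions[:]
random.Random(0).shuffle(shuffled)  # stand-in for the random order
random_placement = schedule(shuffled, num_executors=4)

# Suppose a runtime filter keeps only partitions 12..19.
print("sorted order:", window_load(sorted_placement, 12, 19, 4))
print("random order:", window_load(random_placement, 12, 19, 4))
```

In this model both orders give every executor the same total number of files, but only the sorted order guarantees that any run of consecutive partitions is also spread evenly: the surviving range 12..19 puts exactly two partitions on each of the four executors, while the random order gives no such guarantee.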

Testing:
 * e2e test

Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
---
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M tests/query_test/test_iceberg.py
2 files changed, 62 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/73/20973/3
--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-30 Thread Daniel Becker (Code Review)
Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20973 )

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..


Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/20973/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20973/2//COMMIT_MSG@25
PS2, Line 25: s
Nit: chance.


http://gerrit.cloudera.org:8080/#/c/20973/2//COMMIT_MSG@27
PS2, Line 27: With this patch, IcebergScanNode orders its file descriptors 
based on
Could you elaborate on why it is beneficial to assign neighbouring partitions to 
different executors?


http://gerrit.cloudera.org:8080/#/c/20973/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java:

http://gerrit.cloudera.org:8080/#/c/20973/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@55
PS1, Line 55: private List fileDescs_;
> Put comment that this is always ordered.
It is only sorted if the table is partitioned, isn't it?


http://gerrit.cloudera.org:8080/#/c/20973/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@210
PS1, Line 210: List orderedFds = Lists.newArrayList(fileDescs_);
 : Collections.sort(orderedFds);
> Now that fileDescs_ is always sorted, is this still needed?
It is only sorted if the table is partitioned, isn't it?


http://gerrit.cloudera.org:8080/#/c/20973/2/tests/query_test/test_iceberg.py
File tests/query_test/test_iceberg.py:

http://gerrit.cloudera.org:8080/#/c/20973/2/tests/query_test/test_iceberg.py@1086
PS2, Line 1086: if files_rejected == 0: continue
Optional: I find 'continue' a bit more difficult to follow than a 
conditional, especially since there is only one line after it.
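
For reference, the two forms side by side. This is only a sketch: the loop variable names come from the quoted test lines, while the sample values, the int() parsing and the single statement in the loop body are hypothetical, since the thread does not quote them.

```python
files_rejected_array = ["0", "3", "0", "5"]  # hypothetical sample values


def with_continue(values):
    """Skip zero counts with an early 'continue'."""
    nonzero = []
    for files_rejected_str in values:
        files_rejected = int(files_rejected_str)
        if files_rejected == 0:
            continue
        nonzero.append(files_rejected)  # hypothetical one-line body
    return nonzero


def with_conditional(values):
    """Same behaviour, with the one-line body guarded by a conditional."""
    nonzero = []
    for files_rejected_str in values:
        files_rejected = int(files_rejected_str)
        if files_rejected != 0:
            nonzero.append(files_rejected)  # hypothetical one-line body
    return nonzero


print(with_continue(files_rejected_array))
print(with_conditional(files_rejected_array))
```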



--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Comment-Date: Tue, 30 Jan 2024 12:14:44 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-29 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20973 )

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..


Patch Set 1:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/15092/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Comment-Date: Mon, 29 Jan 2024 18:48:20 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-29 Thread Riza Suminto (Code Review)
Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20973 )

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/20973/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java:

http://gerrit.cloudera.org:8080/#/c/20973/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@55
PS1, Line 55: private List fileDescs_;
Put comment that this is always ordered.


http://gerrit.cloudera.org:8080/#/c/20973/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@210
PS1, Line 210: List orderedFds = Lists.newArrayList(fileDescs_);
 : Collections.sort(orderedFds);
Now that fileDescs_ is always sorted, is this still needed?



--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Comment-Date: Mon, 29 Jan 2024 18:30:17 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-29 Thread Zoltan Borok-Nagy (Code Review)
Hello Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20973

to look at the new patch set (#2).

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..

IMPALA-12765: Balance consecutive partitions better for Iceberg tables

During remote read scheduling Impala does the following:

Non-Iceberg tables
 * The scheduler processes the scan ranges in partition key order
 * The scheduler selects N executors as candidates
 * The scheduler chooses the executor from the candidates based on
  minimum number of assigned bytes
 * So consecutive partitions are more likely to be assigned to
  different executors

Iceberg tables
 * The scheduler processes the scan ranges in random order
 * The scheduler selects N executors as candidates
 * The scheduler chooses the executor from the candidates based on
  minimum number of assigned bytes
 * So consecutive partitions (by partition key order) are assigned
  randomly, i.e. there's a higher chances of clustering

With this patch, IcebergScanNode orders its file descriptors based on
their paths, so we will have a more balanced scheduling for consecutive
partitions. Queries that operate on a range of partitions are quite
common, so it makes sense to optimize this case.

Testing:
 * e2e test

Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
---
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M tests/query_test/test_iceberg.py
2 files changed, 50 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/73/20973/2
--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-29 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20973 )

Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..


Patch Set 1:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/20973/1/tests/query_test/test_iceberg.py
File tests/query_test/test_iceberg.py:

http://gerrit.cloudera.org:8080/#/c/20973/1/tests/query_test/test_iceberg.py@1071
PS1, Line 1071: o
flake8: E501 line too long (92 > 90 characters)


http://gerrit.cloudera.org:8080/#/c/20973/1/tests/query_test/test_iceberg.py@1081
PS1, Line 1081: \
flake8: W605 invalid escape sequence '\d'


http://gerrit.cloudera.org:8080/#/c/20973/1/tests/query_test/test_iceberg.py@1081
PS1, Line 1081: \
flake8: W605 invalid escape sequence '\('


http://gerrit.cloudera.org:8080/#/c/20973/1/tests/query_test/test_iceberg.py@1081
PS1, Line 1081: \
flake8: W605 invalid escape sequence '\d'


http://gerrit.cloudera.org:8080/#/c/20973/1/tests/query_test/test_iceberg.py@1081
PS1, Line 1081: \
flake8: W605 invalid escape sequence '\)'
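
The W605 warnings come from regex backslash escapes written in an ordinary string literal; a raw string is the usual fix. The sketch below uses a hypothetical pattern and sample line, because the actual expression on line 1081 is not quoted in this thread:

```python
import re

# Hypothetical input line; the real test's input is not shown in the review.
sample = "Files rejected: 3 (12)"

# Raw string: '\d', '\(' and '\)' reach the regex engine unchanged,
# so flake8 has no invalid string escape to report.
pattern = re.compile(r"Files rejected: (\d+) \((\d+)\)")

match = pattern.search(sample)
rejected, total = int(match.group(1)), int(match.group(2))
print(rejected, total)
```

Doubling the backslashes in a normal string would also silence the warning, but the raw string keeps the pattern readable.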



--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Mon, 29 Jan 2024 18:19:58 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-12765: Balance consecutive partitions better for Iceberg tables

2024-01-29 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/20973 )


Change subject: IMPALA-12765: Balance consecutive partitions better for Iceberg 
tables
..

IMPALA-12765: Balance consecutive partitions better for Iceberg tables

During remote read scheduling Impala does the following:

Non-Iceberg tables
 * The scheduler processes the scan ranges in partition key order
 * The scheduler selects N replicas as candidates
 * The scheduler chooses the executor from the candidates based on
  minimum number of assigned bytes
 * So consecutive partitions are more likely to be assigned to
  different executors

Iceberg tables
 * The scheduler processes the scan ranges in random order
 * The scheduler selects N replicas as candidates
 * The scheduler chooses the executor from the candidates based on
  minimum number of assigned bytes
 * So consecutive partitions (by partition key order) are assigned
  randomly, i.e. there's a higher chances of clustering

With this patch, IcebergScanNode orders its file descriptors based on
their paths, so we will have a more balanced scheduling for consecutive
partitions. Queries that operate on a range of partitions are quite
common, so it makes sense to optimize this case.

Testing:
 * e2e test

Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
---
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M tests/query_test/test_iceberg.py
2 files changed, 50 insertions(+), 0 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/73/20973/1
--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Gerrit-Change-Number: 20973
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy