[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/20922 ) Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest .. Patch Set 7: Thank you Wenzhe and Quanlong! -- To view, visit http://gerrit.cloudera.org:8080/20922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7 Gerrit-Change-Number: 20922 Gerrit-PatchSet: 7 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Sun, 04 Feb 2024 22:38:01 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Riza Suminto has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/20922 )

Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest

IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest

Querying against large-scale databases is a good way to test Impala. However, it is impractical to do on a single-node development machine. Frontend testing does not run the test query in the backend executor and can benefit from simulated large-scale test cases.

This patch attempts to do so by instrumenting the CatalogD metadata loading code to scale tpcds_partitioned_parquet_snap by injecting column stats from a 3TB TPC-DS dataset in TpcdsCpuCostPlannerTest. The large-scale column stats are expressed in stats-3TB.json, taken by running "SHOW COLUMN STATS" and "DESCRIBE FORMATTED" queries on a 3TB dataset loaded using impala-tpcds-kit. The file is parsed and then piggy-backed through RuntimeEnv. Code that populates stats metadata (callers of FeCatalogUtils.getRowCount(), FeCatalogUtils.getTotalSize(), and FeCatalogUtils.injectColumnStats()) is instrumented to populate stats from RuntimeEnv instead of Metastore. Scaled-up tables are invalidated before a test run to reload them with new high-scale stats. This patch also adds a scan range limit injection to force a ScanNode over a single-file table to act as if it scans a multi-file table.

tpcds_partitioned_schema_template.sql is modified to match column names and types from impala-tpcds-kit. The test files under PlannerTest/tpcds_cpu_cost/ are replaced with queries that are specifically generated to run against the 3TB scale factor of the TPC-DS dataset (https://github.com/cloudera/impala-tpcds-kit/blob/separate_queries_per_scale_factor/queries/sf3000/).

All query plans match the query plans obtained through real query runs in a large cluster, except for a few mismatches due to the hard limit on the number of files in a table. Below are the 3 queries out of 103 that still do not have a matching shape, and the reasons.

+-----+----------------------------------------------+
| Q   | Reason                                       |
+-----+----------------------------------------------+
| 10a | different num files in customer_demographics |
| 34  | different num files in customer              |
| 69  | different num files in customer              |
+-----+----------------------------------------------+

Testing:
- Scale tables of tpcds_partitioned_parquet_snap in TpcdsCpuCostPlannerTest to simulate 3TB TPC-DS. The number of executors is raised from 3 to 10, and REPLICA_PREFERENCE=REMOTE is set to ignore data locality.
- Pass core tests.

Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7
Reviewed-on: http://gerrit.cloudera.org:8080/20922
Reviewed-by: Wenzhe Zhou
Reviewed-by: Quanlong Huang
Tested-by: Impala Public Jenkins
---
M fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
A fe/src/main/java/org/apache/impala/catalog/SideloadTableStats.java
M fe/src/main/java/org/apache/impala/catalog/Table.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalTable.java
M fe/src/main/java/org/apache/impala/common/RuntimeEnv.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java
M fe/src/test/java/org/apache/impala/planner/TpcdsCpuCostPlannerTest.java
A fe/src/test/java/org/apache/impala/testutil/StatsJsonParser.java
M testdata/datasets/tpcds_partitioned/tpcds_partitioned_schema_template.sql
A testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/stats-3TB.json
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q01.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q02.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q03.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q04.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q05.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q06.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q07.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q08.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q09.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q10a.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q11.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q12.test
M testda
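The stats side-loading described in the commit message can be pictured with a small sketch. This is not the code from the patch: the class `SideloadStatsSketch`, its nested `TableStats` record, and the method names are hypothetical stand-ins, and the numbers in the usage note are made up. It only illustrates the idea that a stats lookup prefers a side-loaded value and otherwise falls back to the value that would have come from the Metastore.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of side-loaded table stats; not Impala's actual classes. */
final class SideloadStatsSketch {
  /** Illustrative stand-in for a per-table stats record. */
  static final class TableStats {
    final long numRows;
    final long totalSize;

    TableStats(long numRows, long totalSize) {
      // Reject nonsensical input early, in the spirit of a precondition check.
      if (numRows < 0 || totalSize < 0) {
        throw new IllegalArgumentException("stats must be non-negative");
      }
      this.numRows = numRows;
      this.totalSize = totalSize;
    }
  }

  // Side-loaded stats keyed by table name (assumed key format).
  private final Map<String, TableStats> sideloaded = new HashMap<>();

  void sideload(String tableName, long numRows, long totalSize) {
    sideloaded.put(tableName, new TableStats(numRows, totalSize));
  }

  /** Returns the side-loaded row count if present, else the Metastore value. */
  long rowCount(String tableName, long metastoreRowCount) {
    TableStats stats = sideloaded.get(tableName);
    return stats != null ? stats.numRows : metastoreRowCount;
  }

  /** Returns the side-loaded total size if present, else the Metastore value. */
  long totalSize(String tableName, long metastoreTotalSize) {
    TableStats stats = sideloaded.get(tableName);
    return stats != null ? stats.totalSize : metastoreTotalSize;
  }
}
```

With this sketch, a table registered through `sideload("t", 1000L, 2000L)` reports 1000 rows regardless of the Metastore value passed in, while an unregistered table keeps its Metastore value.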
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/20922 )

Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest

Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/20922/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/20922/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1758
PS6, Line 1758: analyzer.getQueryOptions().getReplica_preference().equals(
              : TReplicaPreference.REMOTE) ?
              : analyzer.numExecutorsForPlanning() :
> This is test only (planner_testcase_mode==true). HDFS minicluster only have
Ack. I missed the check at L1744.

-- To view, visit http://gerrit.cloudera.org:8080/20922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7 Gerrit-Change-Number: 20922 Gerrit-PatchSet: 6 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Sun, 04 Feb 2024 22:13:48 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/20922 ) Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest .. Patch Set 6: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/20922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7 Gerrit-Change-Number: 20922 Gerrit-PatchSet: 6 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Sun, 04 Feb 2024 21:05:31 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/20922 )

Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest

Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/20922/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/20922/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1758
PS6, Line 1758: analyzer.getQueryOptions().getReplica_preference().equals(
              : TReplicaPreference.REMOTE) ?
              : analyzer.numExecutorsForPlanning() :
> Is this only required by the test or does it also fix some bugs?
This is test-only (planner_testcase_mode==true). The HDFS minicluster only has 3 datanodes. Without this change, even if we declare more executors in PlannerTest, it will only plan ScanNodes on 3 executors (due to lines 1750-1756).

-- To view, visit http://gerrit.cloudera.org:8080/20922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7 Gerrit-Change-Number: 20922 Gerrit-PatchSet: 6 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Sun, 04 Feb 2024 16:32:06 + Gerrit-HasComments: Yes
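The effect described in this reply can be sketched as a small decision function. This is a simplified model, not the code in HdfsScanNode.java: the class `ScanNodeCountSketch`, the local `ReplicaPreference` enum, and the `min` used for the locality-bound branch are all assumptions made for illustration. It only shows why REPLICA_PREFERENCE=REMOTE lets a test plan scans across all declared executors instead of being capped by the 3 minicluster datanodes.

```java
/** Hypothetical model of the executor count used for scan planning in test mode. */
final class ScanNodeCountSketch {
  /** Local stand-in for the replica preference option; not the Thrift enum. */
  enum ReplicaPreference { DISK_LOCAL, REMOTE }

  /**
   * Returns how many executors a scan is planned on.
   * Assumption: with a locality-bound preference the count cannot exceed the
   * number of datanodes holding data; with REMOTE, locality is ignored.
   */
  static int numScanExecutors(
      ReplicaPreference pref, int numExecutorsForPlanning, int numDataNodes) {
    if (pref == ReplicaPreference.REMOTE) {
      return numExecutorsForPlanning;
    }
    return Math.min(numExecutorsForPlanning, numDataNodes);
  }
}
```

Under these assumptions, a test that declares 10 executors on a 3-datanode minicluster plans scans on 3 executors with a local preference and on 10 with REMOTE.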
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/20922 ) Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest .. Patch Set 6: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/10236/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/20922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7 Gerrit-Change-Number: 20922 Gerrit-PatchSet: 6 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Sun, 04 Feb 2024 16:35:57 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/20922 )

Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest

Patch Set 6: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/20922/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/20922/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1758
PS6, Line 1758: analyzer.getQueryOptions().getReplica_preference().equals(
              : TReplicaPreference.REMOTE) ?
              : analyzer.numExecutorsForPlanning() :
Is this only required by the test or does it also fix some bugs?

-- To view, visit http://gerrit.cloudera.org:8080/20922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7 Gerrit-Change-Number: 20922 Gerrit-PatchSet: 6 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Sun, 04 Feb 2024 13:13:16 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Wenzhe Zhou has posted comments on this change. ( http://gerrit.cloudera.org:8080/20922 ) Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest .. Patch Set 6: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/20922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7 Gerrit-Change-Number: 20922 Gerrit-PatchSet: 6 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Thu, 01 Feb 2024 18:34:07 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/20922 )

Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest

Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/20922/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20922/5//COMMIT_MSG@32
PS5, Line 32: ar
> nit: are
Done

-- To view, visit http://gerrit.cloudera.org:8080/20922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7 Gerrit-Change-Number: 20922 Gerrit-PatchSet: 6 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Thu, 01 Feb 2024 18:33:18 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Hello Quanlong Huang, Abhishek Rawat, Wenzhe Zhou, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/20922 to look at the new patch set (#6). Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Wenzhe Zhou has posted comments on this change. ( http://gerrit.cloudera.org:8080/20922 )

Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest

Patch Set 5: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/20922/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20922/5//COMMIT_MSG@32
PS5, Line 32: is
nit: are

-- To view, visit http://gerrit.cloudera.org:8080/20922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7 Gerrit-Change-Number: 20922 Gerrit-PatchSet: 5 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Thu, 01 Feb 2024 18:23:31 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/20922 ) Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest .. Patch Set 5: (1 comment) http://gerrit.cloudera.org:8080/#/c/20922/4/fe/src/main/java/org/apache/impala/common/RuntimeEnv.java File fe/src/main/java/org/apache/impala/common/RuntimeEnv.java: http://gerrit.cloudera.org:8080/#/c/20922/4/fe/src/main/java/org/apache/impala/common/RuntimeEnv.java@39 PS4, Line 39: : // Map of > that is used to simula > Looking around org/apache/impala/common, this package is pretty liberal in Done -- To view, visit http://gerrit.cloudera.org:8080/20922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7 Gerrit-Change-Number: 20922 Gerrit-PatchSet: 5 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Thu, 01 Feb 2024 18:05:29 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/20922 ) Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest .. Patch Set 5: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/15135/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/20922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7 Gerrit-Change-Number: 20922 Gerrit-PatchSet: 5 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Thu, 01 Feb 2024 18:05:31 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Hello Quanlong Huang, Abhishek Rawat, Wenzhe Zhou, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/20922 to look at the new patch set (#5). Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/20922 )

Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest

Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/20922/4/fe/src/main/java/org/apache/impala/common/RuntimeEnv.java
File fe/src/main/java/org/apache/impala/common/RuntimeEnv.java:

http://gerrit.cloudera.org:8080/#/c/20922/4/fe/src/main/java/org/apache/impala/common/RuntimeEnv.java@39
PS4, Line 39: The value element is stored as Object to avoid referencing
            : // SideloadTableStats class in org.apache.impala.common package.
Looking around org/apache/impala/common, this package is pretty liberal in doing imports. There are imports from impala.analysis, impala.catalog, impala.thrift, and impala.util in the common package. Maybe it is OK to import SideloadTableStats directly here.

-- To view, visit http://gerrit.cloudera.org:8080/20922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7 Gerrit-Change-Number: 20922 Gerrit-PatchSet: 4 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Thu, 01 Feb 2024 17:28:44 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/20922 ) Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest .. Patch Set 4: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/15131/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/20922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7 Gerrit-Change-Number: 20922 Gerrit-PatchSet: 4 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Wed, 31 Jan 2024 23:07:49 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/20922 )

Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest

Patch Set 4:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/20922/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20922/2//COMMIT_MSG@24
PS2, Line 24: ar
> nit: are
Done

http://gerrit.cloudera.org:8080/#/c/20922/2//COMMIT_MSG@28
PS2, Line 28: mult
> nit: multi
Done

http://gerrit.cloudera.org:8080/#/c/20922/2/fe/src/main/java/org/apache/impala/catalog/SideloadTableStats.java
File fe/src/main/java/org/apache/impala/catalog/SideloadTableStats.java:

http://gerrit.cloudera.org:8080/#/c/20922/2/fe/src/main/java/org/apache/impala/catalog/SideloadTableStats.java@26
PS2, Line 26:
> nit: Could you add description for the new class?
Done

http://gerrit.cloudera.org:8080/#/c/20922/2/fe/src/main/java/org/apache/impala/catalog/SideloadTableStats.java@40
PS2, Line 40: public SideloadTableStats(String tableName, long numRows, long totalSize) {
> add Preconditions check for the input parameters
Done

http://gerrit.cloudera.org:8080/#/c/20922/3/fe/src/main/java/org/apache/impala/common/RuntimeEnv.java
File fe/src/main/java/org/apache/impala/common/RuntimeEnv.java:

http://gerrit.cloudera.org:8080/#/c/20922/3/fe/src/main/java/org/apache/impala/common/RuntimeEnv.java@21
PS3, Line 21:
> nit: remove empty line
The empty line in between is inserted by clang-format auto-formatting. It looks like it comes from the Chromium Java style:
https://chromium.googlesource.com/chromium/src/+/HEAD/styleguide/java/java.md#Import-Order
I chose to remove the StringUtils import instead.

http://gerrit.cloudera.org:8080/#/c/20922/3/fe/src/test/java/org/apache/impala/testutil/StatsJsonParser.java
File fe/src/test/java/org/apache/impala/testutil/StatsJsonParser.java:

http://gerrit.cloudera.org:8080/#/c/20922/3/fe/src/test/java/org/apache/impala/testutil/StatsJsonParser.java@90
PS3, Line 90: colType.substring(0, colType.indexOf("(")
> if colType contains "(", does it contain ")"?
Yes. An example is "DECIMAL(7,2)". In that case, only "DECIMAL" is taken. Any invalid type input will be caught by the default handler at line 153.

-- To view, visit http://gerrit.cloudera.org:8080/20922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7 Gerrit-Change-Number: 20922 Gerrit-PatchSet: 4 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Wed, 31 Jan 2024 22:44:06 + Gerrit-HasComments: Yes
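The type-name handling discussed in the last comment can be illustrated with a minimal sketch. The class `ColTypeSketch` and method `baseTypeName` are hypothetical names, not taken from StatsJsonParser.java; the sketch assumes only what the reply states, namely that a parameterized type such as "DECIMAL(7,2)" is reduced to "DECIMAL", and it adds a guard for types without parentheses.

```java
/** Hypothetical helper showing how a parameterized column type is reduced. */
final class ColTypeSketch {
  /** Strips a trailing "(...)" parameter list, e.g. precision and scale. */
  static String baseTypeName(String colType) {
    int paren = colType.indexOf('(');
    // indexOf returns -1 when there is no "(", so plain types pass through.
    return paren >= 0 ? colType.substring(0, paren) : colType;
  }
}
```

Because only the text before the first "(" is kept, the presence or absence of a closing ")" does not affect the result; validating the resulting name is left to whatever consumes it.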
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Hello Quanlong Huang, Abhishek Rawat, Wenzhe Zhou, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/20922 to look at the new patch set (#4). Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest .. IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest Querying against large-scale databases is a good way for testing Impala. However, it is impractical to do in a single-node development machine. Frontend testing does not run the test query in the backend executor and can benefit from simulated large-scale test cases. This patch attempts to do it by instrumenting the CatalogD metadata loading code to scale tpcds_partitioned_parquet_snap by injecting column stats from a 3TB TPC-DS dataset in TpcdsCpuCostPlannerTest. The large-scale column stats are expressed in stats-3TB.json, taken by running "SHOW COLUMN STATS" and "DESCRIBE FORMATTED" queries on a 3TB dataset loaded using impala-tpcds-kit. It is parsed and then piggy-backed through RuntimeEnv. Code that populates stats metadata (caller of FeCatalogUtils.getRowCount(), FeCatalogUtils.getTotalSize(), and FeCatalogUtils.injectColumnStats()) are instrumented to populate stats from RuntimeEnv instead of Metastore. Scaled-up tables are invalidated before a test run to reload them with new high-scale stats. This patch also adds a scan range limit injection to force ScanNode over a single file table to act as if it scans a multi-files table. tpcds_partitioned_schema_template.sql is modified to match column names and types from impala-tpcds-kit. The test files under PlannerTest/tpcds_cpu_cost/ is replaced with queries that are specifically generated to run against the 3TB scale factor of the TPC-DS dataset (https://github.com/cloudera/impala-tpcds-kit/blob/separate_queries_per_scale_factor/queries/sf3000/). 
All query plans match with query plans obtained through real query runs in a large cluster except for a few mismatches due to the hard limit on the number of files at a table. Below are 3 queries out of 103 that still do not have a matching shape and the reasons.

+-----+----------------------------------------------+
| Q   | Reason                                       |
+-----+----------------------------------------------+
| 10a | different num files in customer_demographics |
| 34  | different num files in customer              |
| 69  | different num files in customer              |
+-----+----------------------------------------------+

Testing:
- Scale tables of tpcds_partitioned_parquet_snap in TpcdsCpuCostPlannerTest to simulate 3TB TPC-DS. The number of executors is raised from 3 to 10, and REPLICA_PREFERENCE=REMOTE to ignore data locality.
- Pass core tests.

Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7
---
M fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java A fe/src/main/java/org/apache/impala/catalog/SideloadTableStats.java M fe/src/main/java/org/apache/impala/catalog/Table.java M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java M fe/src/main/java/org/apache/impala/catalog/local/LocalTable.java M fe/src/main/java/org/apache/impala/common/RuntimeEnv.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java M fe/src/test/java/org/apache/impala/planner/TpcdsCpuCostPlannerTest.java A fe/src/test/java/org/apache/impala/testutil/StatsJsonParser.java M testdata/datasets/tpcds_partitioned/tpcds_partitioned_schema_template.sql A testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/stats-3TB.json M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q01.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q02.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q03.test M 
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q04.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q05.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q06.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q07.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q08.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q09.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q10a.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q11.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q12.test M testdata/workloads/functional-planner/queries
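The commit message above describes stats that are parsed from stats-3TB.json, carried through RuntimeEnv, and used in place of Metastore values when a table is loaded. A minimal sketch of that lookup-with-fallback idea is shown below. It is not the patch's code: SideloadStatsSketch, its nested TableStats class, and the method names are hypothetical, and the numbers in main are made up for illustration rather than taken from stats-3TB.json.

```java
import java.util.HashMap;
import java.util.Map;

public class SideloadStatsSketch {
  // Hypothetical holder for table-level stats parsed from a JSON file.
  public static final class TableStats {
    public final long numRows;
    public final long totalSize;

    public TableStats(long numRows, long totalSize) {
      this.numRows = numRows;
      this.totalSize = totalSize;
    }
  }

  // Hypothetical registry keyed by "db.table". In the patch this role is
  // played by state carried through RuntimeEnv.
  private final Map<String, TableStats> sideloaded = new HashMap<>();

  public void put(String db, String table, TableStats stats) {
    sideloaded.put(db + "." + table, stats);
  }

  // Prefer the injected row count; otherwise keep the Metastore value.
  public long rowCount(String db, String table, long metastoreRowCount) {
    TableStats s = sideloaded.get(db + "." + table);
    return s != null ? s.numRows : metastoreRowCount;
  }

  // Same fallback rule for the total byte size.
  public long totalSize(String db, String table, long metastoreTotalSize) {
    TableStats s = sideloaded.get(db + "." + table);
    return s != null ? s.totalSize : metastoreTotalSize;
  }

  public static void main(String[] args) {
    SideloadStatsSketch env = new SideloadStatsSketch();
    // Illustrative values only.
    env.put("tpcds_partitioned_parquet_snap", "store_sales", new TableStats(1000L, 2000L));
    System.out.println(env.rowCount("tpcds_partitioned_parquet_snap", "store_sales", 10L)); // prints 1000
    System.out.println(env.rowCount("tpcds_partitioned_parquet_snap", "date_dim", 10L));    // prints 10
  }
}
```

Because tables without an entry fall back to their Metastore values, only the tables that were explicitly scaled need to be invalidated and reloaded before a test run.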
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Wenzhe Zhou has posted comments on this change. ( http://gerrit.cloudera.org:8080/20922 )

Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
..

Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/20922/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20922/2//COMMIT_MSG@24
PS2, Line 24: is
nit: are

http://gerrit.cloudera.org:8080/#/c/20922/2//COMMIT_MSG@28
PS2, Line 28: muti
nit: multi

http://gerrit.cloudera.org:8080/#/c/20922/2/fe/src/main/java/org/apache/impala/catalog/SideloadTableStats.java
File fe/src/main/java/org/apache/impala/catalog/SideloadTableStats.java:

http://gerrit.cloudera.org:8080/#/c/20922/2/fe/src/main/java/org/apache/impala/catalog/SideloadTableStats.java@26
PS2, Line 26:
nit: Could you add a description for the new class?

http://gerrit.cloudera.org:8080/#/c/20922/2/fe/src/main/java/org/apache/impala/catalog/SideloadTableStats.java@40
PS2, Line 40: public void addColumnStats(String colName, ColumnStatisticsData colStats) {
add Preconditions check for the input parameters

http://gerrit.cloudera.org:8080/#/c/20922/3/fe/src/main/java/org/apache/impala/common/RuntimeEnv.java
File fe/src/main/java/org/apache/impala/common/RuntimeEnv.java:

http://gerrit.cloudera.org:8080/#/c/20922/3/fe/src/main/java/org/apache/impala/common/RuntimeEnv.java@21
PS3, Line 21:
nit: remove empty line

http://gerrit.cloudera.org:8080/#/c/20922/3/fe/src/test/java/org/apache/impala/testutil/StatsJsonParser.java
File fe/src/test/java/org/apache/impala/testutil/StatsJsonParser.java:

http://gerrit.cloudera.org:8080/#/c/20922/3/fe/src/test/java/org/apache/impala/testutil/StatsJsonParser.java@90
PS3, Line 90: colType.substring(0, colType.indexOf("(")
if colType contains "(", does it contain ")"?
-- To view, visit http://gerrit.cloudera.org:8080/20922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7 Gerrit-Change-Number: 20922 Gerrit-PatchSet: 3 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Wed, 31 Jan 2024 21:17:53 + Gerrit-HasComments: Yes
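The review comment asking for a Preconditions check on addColumnStats can be illustrated with a small sketch. It uses java.util.Objects from the standard library so that the example stays self-contained; whether the patch uses that or another precondition utility is not shown in this thread. The class ColumnStatsHolderSketch and its generic type parameter S are hypothetical stand-ins for SideloadTableStats and ColumnStatisticsData.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for a class that collects per-column stats.
// S represents whatever stats payload type the real class stores.
public class ColumnStatsHolderSketch<S> {
  private final Map<String, S> columnStats = new HashMap<>();

  public void addColumnStats(String colName, S colStats) {
    // Fail fast on bad input instead of storing a null key or value.
    Objects.requireNonNull(colName, "colName must not be null");
    Objects.requireNonNull(colStats, "colStats must not be null");
    if (colName.isEmpty()) {
      throw new IllegalArgumentException("colName must not be empty");
    }
    columnStats.put(colName, colStats);
  }

  public boolean hasColumn(String colName) {
    return columnStats.containsKey(colName);
  }

  public static void main(String[] args) {
    ColumnStatsHolderSketch<Long> holder = new ColumnStatsHolderSketch<>();
    holder.addColumnStats("c_customer_sk", 42L);
    System.out.println(holder.hasColumn("c_customer_sk")); // prints true
  }
}
```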
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/20922 ) Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest .. Patch Set 3: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/15130/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/20922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7 Gerrit-Change-Number: 20922 Gerrit-PatchSet: 3 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Wed, 31 Jan 2024 20:10:14 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/20922 ) Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/20922/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/20922/2//COMMIT_MSG@33 PS2, Line 33: ale factor of the TPC-DS : dataset (https://github.com/cloudera/impala-tpcds-kit/blob/separate_querie > David mention that the test SQL can be different depending on the scale fac Done -- To view, visit http://gerrit.cloudera.org:8080/20922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7 Gerrit-Change-Number: 20922 Gerrit-PatchSet: 3 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Wed, 31 Jan 2024 19:49:38 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Hello Quanlong Huang, Abhishek Rawat, Wenzhe Zhou, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/20922 to look at the new patch set (#3). Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest .. IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest Querying against large-scale databases is a good way for testing Impala. However, it is impractical to do in a single-node development machine. Frontend testing does not run the test query in the backend executor and can benefit from simulated large-scale test cases. This patch attempts to do it by instrumenting the CatalogD metadata loading code to scale tpcds_partitioned_parquet_snap by injecting column stats from a 3TB TPC-DS dataset in TpcdsCpuCostPlannerTest. The large-scale column stats are expressed in stats-3TB.json, taken by running "SHOW COLUMN STATS" and "DESCRIBE FORMATTED" queries on a 3TB dataset loaded using impala-tpcds-kit. It is parsed and then piggy-backed through RuntimeEnv. Code that populates stats metadata (caller of FeCatalogUtils.getRowCount(), FeCatalogUtils.getTotalSize(), and FeCatalogUtils.injectColumnStats()) is instrumented to populate stats from RuntimeEnv instead of Metastore. Scaled-up tables are invalidated before a test run to reload them with new high-scale stats. This patch also adds a scan range limit injection to force ScanNode over a single file table to act as if it scans a multi-files table. tpcds_partitioned_schema_template.sql is modified to match column names and types from impala-tpcds-kit. The test files under PlannerTest/tpcds_cpu_cost/ are replaced with queries that are specifically generated to run against the 3TB scale factor of the TPC-DS dataset (https://github.com/cloudera/impala-tpcds-kit/blob/separate_queries_per_scale_factor/queries/sf3000/). 
All query plans match with query plans obtained through real query runs in a large cluster except for a few mismatches due to the hard limit on the number of files at a table. Below are 3 queries out of 103 that still do not have a matching shape and the reasons.

+-----+----------------------------------------------+
| Q   | Reason                                       |
+-----+----------------------------------------------+
| 10a | different num files in customer_demographics |
| 34  | different num files in customer              |
| 69  | different num files in customer              |
+-----+----------------------------------------------+

Testing:
- Scale tables of tpcds_partitioned_parquet_snap in TpcdsCpuCostPlannerTest to simulate 3TB TPC-DS. The number of executors is raised from 3 to 10, and REPLICA_PREFERENCE=REMOTE to ignore data locality.
- Pass core tests.

Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7
---
M fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java A fe/src/main/java/org/apache/impala/catalog/SideloadTableStats.java M fe/src/main/java/org/apache/impala/catalog/Table.java M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java M fe/src/main/java/org/apache/impala/catalog/local/LocalTable.java M fe/src/main/java/org/apache/impala/common/RuntimeEnv.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java M fe/src/test/java/org/apache/impala/planner/TpcdsCpuCostPlannerTest.java A fe/src/test/java/org/apache/impala/testutil/StatsJsonParser.java M testdata/datasets/tpcds_partitioned/tpcds_partitioned_schema_template.sql A testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/stats-3TB.json M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q01.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q02.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q03.test M 
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q04.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q05.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q06.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q07.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q08.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q09.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q10a.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q11.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q12.test M testdata/workloads/functional-planner/queries/Plan
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/20922 ) Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/20922/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/20922/2//COMMIT_MSG@33 PS2, Line 33: 3TB TPC-DS : dataset (https://github.com/cloudera/impala-tpcds-kit/tree/master/queries) David mentioned that the test SQL can differ depending on the scale factor it is intended to run against. This set of test SQL is better suited to the 3TB scale: https://github.com/cloudera/impala-tpcds-kit/blob/separate_queries_per_scale_factor/queries/sf3000/ -- To view, visit http://gerrit.cloudera.org:8080/20922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7 Gerrit-Change-Number: 20922 Gerrit-PatchSet: 2 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Wed, 31 Jan 2024 00:15:10 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/20922 ) Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest .. Patch Set 2: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/15087/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/20922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7 Gerrit-Change-Number: 20922 Gerrit-PatchSet: 2 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Comment-Date: Mon, 29 Jan 2024 16:58:48 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/20922 ) Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest .. Patch Set 2: (2 comments) http://gerrit.cloudera.org:8080/#/c/20922/1/fe/src/main/java/org/apache/impala/common/RuntimeEnv.java File fe/src/main/java/org/apache/impala/common/RuntimeEnv.java: http://gerrit.cloudera.org:8080/#/c/20922/1/fe/src/main/java/org/apache/impala/common/RuntimeEnv.java@100 PS1, Line 100: > nit: "tables with their" Done http://gerrit.cloudera.org:8080/#/c/20922/1/fe/src/test/java/org/apache/impala/planner/TpcdsCpuCostPlannerTest.java File fe/src/test/java/org/apache/impala/planner/TpcdsCpuCostPlannerTest.java: http://gerrit.cloudera.org:8080/#/c/20922/1/fe/src/test/java/org/apache/impala/planner/TpcdsCpuCostPlannerTest.java@53 PS1, Line 53: / Granular scan limit that will injected into individual ScanNode of tables. : private static Map< > Looks like some dim tables are also scaled but not linearly. I'll check TPC ps2 fully inject all stats based on actual dataset instead of just scaling them. -- To view, visit http://gerrit.cloudera.org:8080/20922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7 Gerrit-Change-Number: 20922 Gerrit-PatchSet: 2 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Comment-Date: Mon, 29 Jan 2024 16:36:58 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Hello Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/20922 to look at the new patch set (#2). Change subject: IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest .. IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest Querying against large-scale databases is a good way for testing Impala. However, it is impractical to do in a single-node development machine. Frontend testing does not run the test query in the backend executor and can benefit from simulated large-scale test cases. This patch attempts to do it by instrumenting the CatalogD metadata loading code to scale tpcds_partitioned_parquet_snap by injecting column stats from a 3TB TPC-DS dataset in TpcdsCpuCostPlannerTest. The large-scale column stats are expressed in stats-3TB.json, taken by running "SHOW COLUMN STATS" and "DESCRIBE FORMATTED" queries on a 3TB dataset loaded using impala-tpcds-kit. It is parsed and then piggy-backed through RuntimeEnv. Code that populates stats metadata (caller of FeCatalogUtils.getRowCount(), FeCatalogUtils.getTotalSize(), and FeCatalogUtils.injectColumnStats()) is instrumented to populate stats from RuntimeEnv instead of Metastore. Scaled-up tables are invalidated before a test run to reload them with new high-scale stats. This patch also adds a scan range limit injection to force ScanNode over a single file table to act as if it scans a multi-files table. tpcds_partitioned_schema_template.sql is modified to match column names and types from impala-tpcds-kit. After this patch, the test files under PlannerTest/tpcds_cpu_cost/ have matching query plan shapes with the actual impala-tpcds-kit queries run against the 3TB TPC-DS dataset (https://github.com/cloudera/impala-tpcds-kit/tree/master/queries), except for a few mismatches due to different SQL and the hard limit on the number of files. Below are 16 queries out of 103 that still do not have a matching shape and the reasons. 
+-----+----------------------------------------------+
| Q   | Reason                                       |
+-----+----------------------------------------------+
| 6   | extra limit 1                                |
| 10a | different num files in customer_demographics |
| 23b | different frequent_ss_items CTE              |
| 22  | extra warehouse table                        |
| 27  | different predicate for store table          |
| 34  | extra limit 10                               |
| 36  | different predicate for store table          |
| 53  | missing avg_quarterly_sales                  |
| 66  | different SQL                                |
| 68  | different predicate for data_dim table       |
| 69  | different num files in customer              |
| 73  | different order by, extra limit 1000         |
| 74  | different num files in customer              |
| 84  | missing customer_demographics table          |
| 96  | missing limit 100                            |
| 98  | extra limit 1000                             |
+-----+----------------------------------------------+

Testing:
- Scale tables of tpcds_partitioned_parquet_snap in TpcdsCpuCostPlannerTest to simulate 3TB TPC-DS. The number of executors is raised from 3 to 10, and REPLICA_PREFERENCE=REMOTE to ignore data locality.
- Pass core tests.

Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7
---
M fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java A fe/src/main/java/org/apache/impala/catalog/SideloadTableStats.java M fe/src/main/java/org/apache/impala/catalog/Table.java M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java M fe/src/main/java/org/apache/impala/catalog/local/LocalTable.java M fe/src/main/java/org/apache/impala/common/RuntimeEnv.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java M fe/src/test/java/org/apache/impala/planner/TpcdsCpuCostPlannerTest.java A fe/src/test/java/org/apache/impala/testutil/StatsJsonParser.java M testdata/datasets/tpcds_partitioned/tpcds_partitioned_schema_template.sql A testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/stats-3TB.json M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q01.test M 
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q02.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q03.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q04.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q05.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q06.test M testdata/workloads/functional-planner/queries
[Impala-ASF-CR] IMPALA-12726: Simulate large scale query in TpcdsCpuCostPlannerTest
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/20922 ) Change subject: IMPALA-12726: Simulate large scale query in TpcdsCpuCostPlannerTest .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/20922/1/fe/src/main/java/org/apache/impala/common/RuntimeEnv.java File fe/src/main/java/org/apache/impala/common/RuntimeEnv.java: http://gerrit.cloudera.org:8080/#/c/20922/1/fe/src/main/java/org/apache/impala/common/RuntimeEnv.java@100 PS1, Line 100: the table with its nit: "tables with their" http://gerrit.cloudera.org:8080/#/c/20922/1/fe/src/test/java/org/apache/impala/planner/TpcdsCpuCostPlannerTest.java File fe/src/test/java/org/apache/impala/planner/TpcdsCpuCostPlannerTest.java: http://gerrit.cloudera.org:8080/#/c/20922/1/fe/src/test/java/org/apache/impala/planner/TpcdsCpuCostPlannerTest.java@53 PS1, Line 53: // Insert 1000x metadata scale to RuntimeEnv for each fact tables. : int scale = 1000; Looks like some dim tables are also scaled but not linearly. I'll check TPC-DS spec. -- To view, visit http://gerrit.cloudera.org:8080/20922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7 Gerrit-Change-Number: 20922 Gerrit-PatchSet: 1 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Comment-Date: Thu, 18 Jan 2024 19:37:42 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-12726: Simulate large scale query in TpcdsCpuCostPlannerTest
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/20922 ) Change subject: IMPALA-12726: Simulate large scale query in TpcdsCpuCostPlannerTest .. Patch Set 1: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/14992/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/20922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7 Gerrit-Change-Number: 20922 Gerrit-PatchSet: 1 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Thu, 18 Jan 2024 19:09:44 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-12726: Simulate large scale query in TpcdsCpuCostPlannerTest
Riza Suminto has uploaded this change for review. ( http://gerrit.cloudera.org:8080/20922 ) Change subject: IMPALA-12726: Simulate large scale query in TpcdsCpuCostPlannerTest .. IMPALA-12726: Simulate large scale query in TpcdsCpuCostPlannerTest Querying against a large-scale database is a good way to test Impala. However, it is impractical to do on a single-node development machine. Frontend testing does not actually run the test query in the backend executor and can benefit from simulated large-scale test cases. This patch attempts to do it by instrumenting the CatalogD metadata loading code to multiply partition numRows, table numRows, numNull, numTrues, and numFalses by 1000x in TpcdsCpuCostPlannerTest. The scaling factor is supplied through RuntimeEnv. Code that populates stats metadata (caller of FeCatalogUtils.getRowCount() and FeCatalogUtils.injectColumnStats()) is instrumented to check against this scaling factor on whether to multiply the stats for a particular table or not. Tables that are scaled up must also be invalidated so that they will be reloaded with new scaled stats. Total byte sizes are not scaled up in this patch because they do not impact the query plan unless stats extrapolation is being used. Testing: - Scale the fact tables of tpcds_partitioned_parquet_snap in TpcdsCpuCostPlannerTest to 1000x to simulate 1TB TPC-DS. The number of executors is raised from 3 to 10, and REPLICA_PREFERENCE is set to REMOTE to ignore data locality. - Compare with the alternative method where instrumentation is done during stats collection (COMPUTE STATS) and confirm that the resulting query plans are the same as with this patch. - Pass FE tests. 
Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7 --- M fe/src/main/java/org/apache/impala/catalog/Column.java M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java M fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/Table.java M fe/src/main/java/org/apache/impala/catalog/TableLoader.java M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java M fe/src/main/java/org/apache/impala/catalog/local/LocalTable.java M fe/src/main/java/org/apache/impala/common/RuntimeEnv.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java M fe/src/test/java/org/apache/impala/planner/TpcdsCpuCostPlannerTest.java M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q01.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q02.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q03.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q04.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q05.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q06.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q07.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q08.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q09.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q10a.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q11.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q12.test M 
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q13.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q14a.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q14b.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q15.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q16.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q17.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q18.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q19.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q20.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q21.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q22.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q23a.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q23b.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q24
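The first patch set described above multiplies row counts and column counters by a per-table factor supplied through RuntimeEnv, an approach that patch set 2 replaced with stats injected from an actual dataset. The multiplication itself can be sketched as below. This is a hedged illustration, not the patch's code: StatsScaleSketch and its methods are hypothetical, and the rule that negative values mean "unknown" and are left untouched is an assumption made for the sketch.

```java
import java.util.HashMap;
import java.util.Map;

public class StatsScaleSketch {
  // Hypothetical per-table scale factors; tables not listed stay unscaled.
  private final Map<String, Long> scaleByTable = new HashMap<>();

  public void setScale(String tableName, long factor) {
    if (factor < 1) {
      throw new IllegalArgumentException("factor must be >= 1");
    }
    scaleByTable.put(tableName, factor);
  }

  // Multiplies a single stat value. Negative values are treated as "unknown"
  // (an assumption of this sketch) and returned unchanged.
  public static long scaleStat(long value, long factor) {
    if (value < 0) {
      return value;
    }
    // multiplyExact throws ArithmeticException instead of silently overflowing.
    return Math.multiplyExact(value, factor);
  }

  public long scaledRowCount(String tableName, long numRows) {
    return scaleStat(numRows, scaleByTable.getOrDefault(tableName, 1L));
  }

  public static void main(String[] args) {
    StatsScaleSketch env = new StatsScaleSketch();
    env.setScale("store_sales", 1000L);
    // Illustrative row counts only.
    System.out.println(env.scaledRowCount("store_sales", 2_000L)); // prints 2000000
    System.out.println(env.scaledRowCount("date_dim", 500L));      // prints 500
  }
}
```

A uniform factor like this cannot capture dimension tables that grow non-linearly with scale factor, which is the concern raised in the patch set 1 review comment.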