Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/20756 )
Change subject: IMPALA-12601: Add a fully partitioned TPC-DS database
......................................................................

IMPALA-12601: Add a fully partitioned TPC-DS database

The current tpcds dataset has only the store_sales table fully
partitioned and leaves the other fact tables unpartitioned. This is
intended for faster data loading during tests. However, it is not an
accurate reflection of larger-scale TPC-DS datasets, where all fact
tables are partitioned. The Impala planner may change the details of a
query plan if a partition column exists.

This patch adds a new dataset, tpcds_partitioned, which loads a fully
partitioned TPC-DS db in parquet format named
tpcds_partitioned_parquet_snap. This dataset cannot be loaded
independently and requires the base 'tpcds' db from the tpcds dataset
to be preloaded first. An example of how to load this dataset can be
seen in the function load-tpcds-data in
testdata/bin/create-load-data.sh.

This patch also changes PlannerTest#testProcessingCost from targeting
tpcds_parquet to tpcds_partitioned_parquet_snap. Other planner tests
that currently target tpcds_parquet will be gradually changed to test
against tpcds_partitioned_parquet_snap in follow-up patches.

This addition adds a couple of seconds to the "Computing table stats"
step, but the added loading time itself is negligible since loading is
parallelized with TPC-H and functional-query. The total loading time
for the three datasets remains similar after this patch.

This patch also adds several improvements in the following files:

bin/load-data.py:
- Log elapsed time on serial steps.

testdata/bin/create-load-data.sh:
- Rename MSG to LOAD_MSG to avoid collision with the same variable
  name in ./testdata/bin/run-step.sh

testdata/bin/generate-schema-statements.py:
- Remove redundant FILE_FORMAT_MAP.
- Add build_partitioned_load to simplify expressing partitioned
  insert query in SQL template.

testdata/datasets/tpcds/tpcds_schema_template.sql:
- Reorder schema template to load all dimension tables before fact
  tables.

Testing:
- Pass core tests.

Change-Id: I3a2e66c405639554f325ae78c66628d464f6c453
Reviewed-on: http://gerrit.cloudera.org:8080/20756
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
---
M bin/load-data.py
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/bin/compute-table-stats.sh
M testdata/bin/create-load-data.sh
M testdata/bin/generate-schema-statements.py
M testdata/datasets/tpcds/tpcds_schema_template.sql
A testdata/datasets/tpcds_partitioned/README
A testdata/datasets/tpcds_partitioned/schema_constraints.csv
A testdata/datasets/tpcds_partitioned/tpcds_partitioned_schema_template.sql
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds-processing-cost.test
A testdata/workloads/tpcds_partitioned/tpcds_partitioned_core.csv
A testdata/workloads/tpcds_partitioned/tpcds_partitioned_dimensions.csv
A testdata/workloads/tpcds_partitioned/tpcds_partitioned_exhaustive.csv
A testdata/workloads/tpcds_partitioned/tpcds_partitioned_pairwise.csv
14 files changed, 2,857 insertions(+), 1,987 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/20756
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I3a2e66c405639554f325ae78c66628d464f6c453
Gerrit-Change-Number: 20756
Gerrit-PatchSet: 8
Gerrit-Owner: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Laszlo Gaal <laszlo.g...@cloudera.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>