Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/15998 )

Change subject: IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15998/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15998/2//COMMIT_MSG@13
PS2, Line 13: When this config is set to false,
            : dynamic partitioning inserts will be run as a map-only job that
            : potentially opens hundreds of files per partition, resulting in lots of
            : small files. Creating all these small files potentially impacts the
            : health of the Namenode, and can cause data-load to fail altogether.
> Couple things here:
Updated the commit message. Yeah, it looks like it's just one file per partition, not multiple. After I removed the hive.optimize.sort.dynamic.partition setting in generate-schema-statements.py, the perf runtime of data load hasn't really changed at all.


http://gerrit.cloudera.org:8080/#/c/15998/2//COMMIT_MSG@20
PS2, Line 20: * Ran core tests for Impala-EC
> The frontend tests can be sensitive to dataload changes, and we don't run f
Done


http://gerrit.cloudera.org:8080/#/c/15998/2/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/15998/2/testdata/bin/generate-schema-statements.py@162
PS2, Line 162: SET_OPTIMIZE_SORT_DYNAMIC_PARTITION = "SET hive.optimize.sort.dynamic.partition=true;\n"\
             :     "SET hive.optimize.sort.dynamic.partition.threshold=1;"
> This applies to all the Hive inserts. To my knowledge, only the insert into
I removed this, and it looks like all the tests pass. Removing it does improve the performance as well.
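
For illustration only, here is a minimal sketch of how the two settings quoted above could be scoped to selected inserts instead of being applied to every Hive insert. It is not the code in the patch: the helper build_hive_insert and the set TABLES_NEEDING_SORTED_INSERT are hypothetical names, and only the two SET strings come from the quoted line.

```python
# The two Hive settings, copied from the line quoted in the review.
SET_OPTIMIZE_SORT_DYNAMIC_PARTITION = (
    "SET hive.optimize.sort.dynamic.partition=true;\n"
    "SET hive.optimize.sort.dynamic.partition.threshold=1;"
)

# Hypothetical: tables whose load statement should carry the settings.
TABLES_NEEDING_SORTED_INSERT = frozenset(["tpcds.store_sales"])


def build_hive_insert(table_name, insert_stmt):
    """Hypothetical helper: prefix the settings only for selected tables."""
    if table_name.lower() in TABLES_NEEDING_SORTED_INSERT:
        return SET_OPTIMIZE_SORT_DYNAMIC_PARTITION + "\n" + insert_stmt
    # All other inserts are returned unchanged.
    return insert_stmt
```

With a helper like this, the insert for tpcds.store_sales would be emitted with the settings in front of it, while every other insert would run with the Hive defaults.
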

Technically the optimization should apply for all dynamic partition inserts, but I guess it makes the biggest difference when generating tpcds.store_sales, probably because tpcds.store_sales gen requires going from unpartitioned --> partitioned table, whereas all the other queries go from partitioned --> partitioned tables.



-- 
To view, visit http://gerrit.cloudera.org:8080/15998
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
Gerrit-Change-Number: 15998
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Comment-Date: Sat, 30 May 2020 23:56:39 +0000
Gerrit-HasComments: Yes
