Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/15998 )
Change subject: IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15998/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15998/2//COMMIT_MSG@13
PS2, Line 13: When this config is set to false,
            : dynamic partitioning inserts will be run as a map-only job that
            : potentially opens hundreds of files per partition, resulting in lots of
            : small files. Creating all these small files potentially impacts the
            : health of the Namenode, and can cause data-load to fail altogether.
Couple things here:
1. Let's mention the original diskspace accounting issue
2. I thought the problem was that multiple partitions are being written simultaneously with one file per partition. Were we creating more than one file per partition?

Also, can we include some output comparing the runtimes for this versus before this change?
Just this part of the dataload output:
14:08:53 Loading workload 'tpch' using exploration strategy 'core' OK (Took: 7 min 12 sec)
14:13:21 Loading workload 'tpcds' using exploration strategy 'core' OK (Took: 11 min 40 sec)
14:27:07 Loading workload 'functional-query' using exploration strategy 'exhaustive' OK (Took: 25 min 26 sec)


http://gerrit.cloudera.org:8080/#/c/15998/2//COMMIT_MSG@20
PS2, Line 20: * Ran core tests for Impala-EC
The frontend tests can be sensitive to dataload changes, and we don't run frontend tests on EC, so we'll need a normal core job.


http://gerrit.cloudera.org:8080/#/c/15998/2/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/15998/2/testdata/bin/generate-schema-statements.py@162
PS2, Line 162: SET_OPTIMIZE_SORT_DYNAMIC_PARTITION = "SET hive.optimize.sort.dynamic.partition=true;\n"\
             :     "SET hive.optimize.sort.dynamic.partition.threshold=1;"
This applies to all the Hive inserts.
To my knowledge, only the insert into the text version of tpcds.store_sales needs this setting. Does the setting cost us anything or change anything for other tables?


-- 
To view, visit http://gerrit.cloudera.org:8080/15998
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
Gerrit-Change-Number: 15998
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Comment-Date: Thu, 28 May 2020 22:37:07 +0000
Gerrit-HasComments: Yes