Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15998 )

Change subject: IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true for dynamic inserts
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15998/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15998/2//COMMIT_MSG@13
PS2, Line 13: When this config is set to false,
            : dynamic partitioning inserts will be run as a map-only job that
            : potentially opens hundreds of files per partition, resulting in lots of
            : small files. Creating all these small files potentially impacts the
            : health of the Namenode, and can cause data-load to fail altogether.
A couple of things here:
1. Let's mention the original disk space accounting issue.
2. I thought the problem was that multiple partitions are being written
simultaneously with one file per partition. Were we creating more than one
file per partition?

Also, can we include some output comparing the runtimes before and after this
change? Just this part of the dataload output:
14:08:53   Loading workload 'tpch' using exploration strategy 'core' OK (Took: 7 min 12 sec)
14:13:21   Loading workload 'tpcds' using exploration strategy 'core' OK (Took: 11 min 40 sec)
14:27:07   Loading workload 'functional-query' using exploration strategy 'exhaustive' OK (Took: 25 min 26 sec)
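
If it helps when gathering the numbers, here is a rough sketch for pulling those timings out of a "before" log and an "after" log. It assumes every line of interest uses the "Took: N min M sec" form shown above (no hours field); the function names are hypothetical and not part of the existing dataload scripts.

```python
import re

# Matches completed load lines such as:
#   Loading workload 'tpch' using exploration strategy 'core' OK (Took: 7 min 12 sec)
# \s+ is used between tokens so lines wrapped by a mail client still match.
LOAD_RE = re.compile(
    r"Loading\s+workload\s+'(?P<workload>[^']+)'\s+using\s+exploration\s+strategy\s+"
    r"'(?P<strategy>[^']+)'\s+OK\s+\(Took:\s+(?P<min>\d+)\s+min\s+(?P<sec>\d+)\s+sec\)")


def parse_load_times(log_text):
    """Return {workload: seconds} for each completed load line in log_text."""
    times = {}
    for match in LOAD_RE.finditer(log_text):
        seconds = int(match.group("min")) * 60 + int(match.group("sec"))
        times[match.group("workload")] = seconds
    return times


def compare_load_times(before_text, after_text):
    """Return {workload: (before_sec, after_sec, delta_sec)} for shared workloads."""
    before = parse_load_times(before_text)
    after = parse_load_times(after_text)
    return {w: (before[w], after[w], after[w] - before[w])
            for w in before if w in after}
```

A negative delta means the workload loaded faster after the change.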


http://gerrit.cloudera.org:8080/#/c/15998/2//COMMIT_MSG@20
PS2, Line 20: * Ran core tests for Impala-EC
The frontend tests can be sensitive to dataload changes, and we don't run 
frontend tests on EC, so we'll need a normal core job.


http://gerrit.cloudera.org:8080/#/c/15998/2/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/15998/2/testdata/bin/generate-schema-statements.py@162
PS2, Line 162: SET_OPTIMIZE_SORT_DYNAMIC_PARTITION = "SET hive.optimize.sort.dynamic.partition=true;\n"\
             :     "SET hive.optimize.sort.dynamic.partition.threshold=1;"
This applies to all the Hive inserts. To my knowledge, only the insert into the 
text version of tpcds.store_sales needs this setting. Does the setting cost us 
anything or change anything for other tables?



--
To view, visit http://gerrit.cloudera.org:8080/15998

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
Gerrit-Change-Number: 15998
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Comment-Date: Thu, 28 May 2020 22:37:07 +0000
Gerrit-HasComments: Yes
