Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15998 )

Change subject: IMPALA-9777: Set hive.optimize.sort.dynamic.partition to true 
for dynamic inserts
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15998/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15998/2//COMMIT_MSG@13
PS2, Line 13: When this config is set to false,
            : dynamic partitioning inserts will be run as a map-only job that
            : potentially opens hundreds of files per partition, resulting in lots of
            : small files. Creating all these small files potentially impacts the
            : health of the Namenode, and can cause data-load to fail altogether.
> Couple things here:
Updated the commit message. Yeah, it looks like it's just one file per 
partition, not multiple.

After I removed the hive.optimize.sort.dynamic.partition setting in 
generate-schema-statements.py, the data load runtime hasn't really changed 
at all.


http://gerrit.cloudera.org:8080/#/c/15998/2//COMMIT_MSG@20
PS2, Line 20: * Ran core tests for Impala-EC
> The frontend tests can be sensitive to dataload changes, and we don't run f
Done


http://gerrit.cloudera.org:8080/#/c/15998/2/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/15998/2/testdata/bin/generate-schema-statements.py@162
PS2, Line 162: SET_OPTIMIZE_SORT_DYNAMIC_PARTITION = "SET hive.optimize.sort.dynamic.partition=true;\n"\
             :     "SET hive.optimize.sort.dynamic.partition.threshold=1;"
> This applies to all the Hive inserts. To my knowledge, only the insert into
I removed this, and it looks like all the tests pass. Removing it does improve 
the performance as well. Technically the optimization should apply to all 
dynamic partition inserts, but I guess it makes the biggest difference when 
generating tpcds.store_sales, probably because generating tpcds.store_sales 
requires going from an unpartitioned --> partitioned table, whereas all the 
other queries go from partitioned --> partitioned tables.
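
For illustration only, here is a rough sketch of how the setting could be scoped to the unpartitioned --> partitioned case instead of being applied to every Hive insert. The helper name, its arguments, and the sample statement are hypothetical and are not taken from the patch; only the two SET lines come from the code quoted above.

```python
# Hypothetical sketch: the helper and the sample statement are assumptions,
# not code from generate-schema-statements.py.
SET_OPTIMIZE_SORT_DYNAMIC_PARTITION = (
    "SET hive.optimize.sort.dynamic.partition=true;\n"
    "SET hive.optimize.sort.dynamic.partition.threshold=1;"
)


def with_sorted_dynamic_partitions(insert_stmt, source_is_partitioned):
    """Prefix insert_stmt with the sort settings only for unpartitioned sources."""
    if source_is_partitioned:
        # Partitioned --> partitioned inserts are left unchanged.
        return insert_stmt
    return SET_OPTIMIZE_SORT_DYNAMIC_PARTITION + "\n" + insert_stmt


# Placeholder statement; the source table name is made up for the example.
stmt = ("INSERT OVERWRITE TABLE tpcds.store_sales PARTITION (ss_sold_date_sk) "
        "SELECT * FROM some_unpartitioned_source;")
print(with_sorted_dynamic_partitions(stmt, source_is_partitioned=False))
```

Whether scoping like this is worthwhile depends on the data load timings; the numbers above suggest the global setting is not needed.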



--
To view, visit http://gerrit.cloudera.org:8080/15998
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic2b7c0ec40a02da2640fae20cf640517fd1f4fef
Gerrit-Change-Number: 15998
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Comment-Date: Sat, 30 May 2020 23:56:39 +0000
Gerrit-HasComments: Yes
