Hello Impala Public Jenkins, I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/14060 to look at the new patch set (#2).

Change subject: [WIP] IMPALA-8821: Use RECOVER PARTITIONS in dataload to get partition metadata
......................................................................

[WIP] IMPALA-8821: Use RECOVER PARTITIONS in dataload to get partition metadata

When using a data snapshot without a metadata snapshot (e.g. when loading to a remote cluster), the data is already in place, and dataload needs to perform all the appropriate DDLs to create the metadata. Currently, for dynamically partitioned tables, dataload gives up and reloads those tables from scratch in this circumstance. However, there is no need to do this, as ALTER TABLE ... RECOVER PARTITIONS can get the partition metadata by looking at the filesystem. This changes dataload to use RECOVER PARTITIONS for dynamically partitioned tables rather than forcing a reload of the table.

Dataload from scratch is not impacted, because there is no existing data and everything needs to be inserted anyway. Dataload with both a data snapshot and a metadata snapshot is also not impacted, because testdata/bin/create-load-data.sh skips most of the bin/load-data.py calls for that codepath. So, this is limited to dataload with a data snapshot and without a metadata snapshot. The biggest impact is that the TPC-DS store_sales table does not have to be reloaded from scratch in this case.

Impala dataload overrides the default location for table directories with its own nonstandard location. These locations reside outside the database *.db directories. The current table existence check is tuned to handle tables that reside in directories with this naming system. It does not handle tables that use the default location (i.e. the location used if LOCATION is not specified). This change detects tables using the standard directory naming and uses a different table existence check for those tables, which eliminates the need to reload them.
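
A minimal sketch of the decision described above, not the actual code in testdata/bin/generate-schema-statements.py. The helper names (choose_load_action, recover_partitions_stmt), the action labels, and the tpcds database name in the usage note are assumptions made for illustration only:

```python
def choose_load_action(is_dynamically_partitioned, data_exists, force_reload=False):
    """Pick how a table should be populated (hypothetical helper).

    Returns one of:
      "reload"             - insert the data from scratch
      "recover_partitions" - data is in place; rebuild partition metadata
      "ddl_only"           - data is in place; only the CREATE DDL is needed
    """
    # A forced reload, or a load with no existing data, always inserts everything.
    if force_reload or not data_exists:
        return "reload"
    # Data snapshot present: partition metadata can be rediscovered from the
    # filesystem instead of reloading the table.
    if is_dynamically_partitioned:
        return "recover_partitions"
    return "ddl_only"


def recover_partitions_stmt(db_name, table_name):
    """Build the Impala statement that rediscovers partitions on disk."""
    return "ALTER TABLE {0}.{1} RECOVER PARTITIONS".format(db_name, table_name)
```

For example, under these assumptions, choose_load_action(True, True) selects "recover_partitions", and recover_partitions_stmt("tpcds", "store_sales") builds the corresponding ALTER TABLE statement.
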

Callers of bin/load-data.py always have the option of forcing a reload via the --force_reload flag.

Testing:
- Ran normal dataload (no snapshots)
- Ran dataload with just a data snapshot (no metadata snapshot)
- Ran dataload with a data snapshot and a metadata snapshot

Change-Id: I2622cd3655cf4521d5ac945759fd35c9abe670ef
---
M testdata/bin/generate-schema-statements.py
1 file changed, 47 insertions(+), 16 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/14060/2
--
To view, visit http://gerrit.cloudera.org:8080/14060
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2622cd3655cf4521d5ac945759fd35c9abe670ef
Gerrit-Change-Number: 14060
Gerrit-PatchSet: 2
Gerrit-Owner: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>