Vihang Karajgaonkar has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp 
directories
......................................................................

IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

The FileMetadataLoader is used to load the file information in when the
table is loaded. By default, it lists all the files in the
table/partition directory. Currently, it only skips the filenames which
are invalid (hidden files and ones starting with "_" etc). However, it
does not skip the directories which are temporary or hidden. In case of
Hive when data is inserted into a table, it creates a temporary staging
directory which is a hidden directory under the table location. When the
insert in hive is completed, such staging directories are removed. But
if there is a refresh called during that time, FileMetadataLoader will
add the files in the staging directory as well. Not only this could
cause temporary invalid results but it causes table to go in a bad state
when these temporary directories are removed. The only work-around in
such a case to issue a refresh on the table again.

This patch adds logic in the filemetadataloader to ignore such temporary
staging directories. Unfortunately, hadoop does not provide a API which
can recursively list files in a directory and skip certain directories.
This patch adds a new FilterIterator which wraps around existing
listFiles, listStatus and RecursingIterator to skip the hidden
directories from the listing result.

Also, the existing code to recover partitions implements its own
recursion logic which includes path validation. This already skips such
hidden directories since they do not conform to the partition spec. The
patch does a minor modification to this method by directly calling the
listStatusIterator instead of going through FileSystemUtil#listStatus
whiche uses the filtering remote iterator now.

Testing:
1. Added a new tests as well as modified existing ones which were
related to cover interesting cases.
2. Ran concurrent inserts from Hive while issuing refresh in a loop on
Impala side. Earlier this would cause the table to go into a bad state.
Now, it works fine for the staging directories. It still runs into a
FileNotFoundException from the impalad when there are insert overwrite
statements in Hive

Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Reviewed-on: http://gerrit.cloudera.org:8080/13665
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Reviewed-by: Vihang Karajgaonkar <vih...@cloudera.com>
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
A fe/src/test/java/org/apache/impala/common/FileSystemUtilTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
M tests/metadata/test_recursive_listing.py
6 files changed, 236 insertions(+), 9 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Vihang Karajgaonkar: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 14
Gerrit-Owner: Vihang Karajgaonkar <vih...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vih...@cloudera.com>

Reply via email to