Vihang Karajgaonkar has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/13665 )
Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................

IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

The FileMetadataLoader is used to load file information when a table
is loaded. By default, it lists all the files in the table/partition
directory. Currently, it only skips filenames which are invalid
(hidden files, ones starting with "_", etc). However, it does not skip
directories which are temporary or hidden. When data is inserted into
a table from Hive, Hive creates a temporary staging directory, which
is a hidden directory under the table location. When the insert in
Hive completes, such staging directories are removed. But if a refresh
is called during that time, FileMetadataLoader will add the files in
the staging directory as well. Not only could this cause temporarily
invalid results, it also causes the table to go into a bad state when
these temporary directories are removed. The only work-around in such
a case is to issue a refresh on the table again.

This patch adds logic to the FileMetadataLoader to ignore such
temporary staging directories. Unfortunately, Hadoop does not provide
an API which can recursively list files in a directory and skip
certain directories. This patch adds a new FilterIterator which wraps
the existing listFiles, listStatus and RecursingIterator to skip the
hidden directories in the listing result.

Also, the existing code to recover partitions implements its own
recursion logic, which includes path validation. This already skips
such hidden directories since they do not conform to the partition
spec. The patch makes a minor modification to this method so that it
calls listStatusIterator directly instead of going through
FileSystemUtil#listStatus, which now uses the filtering remote
iterator.

Testing:
1. Added new tests and modified related existing ones to cover
interesting cases.
2. Ran concurrent inserts from Hive while issuing refresh in a loop on
the Impala side. Earlier this would cause the table to go into a bad
state. Now, it works fine for the staging directories. It still runs
into a FileNotFoundException from the impalad when there are insert
overwrite statements in Hive.

Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Reviewed-on: http://gerrit.cloudera.org:8080/13665
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Reviewed-by: Vihang Karajgaonkar <vih...@cloudera.com>
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
A fe/src/test/java/org/apache/impala/common/FileSystemUtilTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
M tests/metadata/test_recursive_listing.py
6 files changed, 236 insertions(+), 9 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Vihang Karajgaonkar: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 14
Gerrit-Owner: Vihang Karajgaonkar <vih...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vih...@cloudera.com>
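
Illustrative note: the wrapping-iterator idea in the commit message (a
FilterIterator layered over an existing listing so that entries under
hidden or temporary directories never reach the caller) can be sketched
roughly as below. This is not the code from the patch. It is a minimal
stand-alone sketch that assumes the listing is a plain
java.util.Iterator of relative path strings rather than Hadoop's
RemoteIterator of file statuses, and that a directory counts as
hidden/tmp when its name starts with "." or "_". The class and method
names and the sample paths are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch; names and the prefix rule are assumptions, not Impala's code.
final class HiddenDirFilterSketch {

  /** True if any directory component (everything but the last) starts with '.' or '_'. */
  static boolean isInIgnoredDirectory(String relPath) {
    String[] parts = relPath.split("/");
    // The last component is the file name; only the directories are checked here.
    for (int i = 0; i < parts.length - 1; i++) {
      if (parts[i].startsWith(".") || parts[i].startsWith("_")) return true;
    }
    return false;
  }

  /** Wraps a base listing and silently drops entries under ignored directories. */
  static final class FilterIterator implements Iterator<String> {
    private final Iterator<String> base;
    private String pending; // next entry that passed the filter, or null if none buffered

    FilterIterator(Iterator<String> base) { this.base = base; }

    @Override public boolean hasNext() {
      // Advance the base iterator until an acceptable entry is found or it is exhausted.
      while (pending == null && base.hasNext()) {
        String candidate = base.next();
        if (!isInIgnoredDirectory(candidate)) pending = candidate;
      }
      return pending != null;
    }

    @Override public String next() {
      if (!hasNext()) throw new NoSuchElementException();
      String result = pending;
      pending = null;
      return result;
    }
  }

  /** Convenience helper: drains a filtered view of the listing into a list. */
  static List<String> listVisible(List<String> listing) {
    List<String> out = new ArrayList<>();
    Iterator<String> it = new FilterIterator(listing.iterator());
    while (it.hasNext()) out.add(it.next());
    return out;
  }

  public static void main(String[] args) {
    // Made-up listing: two data files plus files under staging/tmp directories.
    List<String> listing = Arrays.asList(
        "part-00000",
        "year=2019/part-00001",
        ".hive-staging_hive_2019/-ext-10000/000000_0",
        "_tmp.base/000000_0");
    System.out.println(listVisible(listing));
  }
}
```

Filtering inside hasNext() keeps the wrapper lazy, so a caller that
walks a large listing never materializes the skipped entries. File
names themselves are deliberately not checked in this sketch, since
the commit message says invalid filenames were already being skipped
separately.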