Bharath Vissapragada has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )
Change subject: IMPALA-8663 : FileMetadataLoader should skip listing files in hidden and tmp directories ...................................................................... Patch Set 3: (8 comments) http://gerrit.cloudera.org:8080/#/c/13665/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/13665/3//COMMIT_MSG@7 PS3, Line 7: IMPALA-8663 : FileMetadataLoader should skip listing files in hidden and tmp directories nit: wrap http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java: http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@549 PS3, Line 549: if (recursive) { : // The Hadoop FileSystem API doesn't provide a recursive listStatus call that : // doesn't also fetch block locations, and fetching block locations is expensive. : // Here, our caller specifically doesn't need block locations, so we don't want to : // call the expensive 'listFiles' call on HDFS. Instead, we need to "manually" : // recursively call FileSystem.listStatusIterator(). : // : // Note that this "manual" recursion is not actually any slower than the recursion : // provided by the HDFS 'listFiles(recursive=true)' API, since the HDFS wire : // protocol doesn't provide any such recursive support anyway. In other words, : // the API that looks like a single recursive call is just as bad as what we're : // doing here. : // : // However, S3 actually implements 'listFiles(recursive=true)' with a faster path : // which natively recurses. In that case, it's quite preferable to use 'listFiles' : // even though it returns LocatedFileStatus objects with "fake" blocks which we : // will ignore. : if (isS3AFileSystem(fs)) { : return listFiles(fs, p, recursive); : } : : return new RemoteIteratorWithFilter(fs, p, true); : } : : return new RemoteIteratorWithFilter(fs, p, false) I think this is equivalent to... if (S3) return listFiles(fs, p, recursive); return new RemoteIteratorWithFilter(fs, p, recursive); http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@581 PS3, Line 581: useful nit: used ? http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@585 PS3, Line 585: PREDICATE nit: FILTER? (predicate is a java construct) http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@642 PS3, Line 642: private final Predicate<FileStatus> fileStatusPredicate_; This is always used, no? Hard code it? http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@643 PS3, Line 643: private final boolean useListStatus_; Document this? http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java File fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java: http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@100 PS2, Line 100: dstFs.deleteOnExit(tmpTestPath); > don't think so. This will clean up the directory when the JVM exits. The ot Are you sure? https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L2505 Its done in close(), so you probably need to wrap it in try-with-resources or explicitly call in a finally {} http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@112 PS2, Line 112: 24 > I am vary of using load() to test load() method's behavior. Seems like a an not sure I follow. I think what were asserting here is that the number of loaded files in the sourcePath would be the same with/without the filters. So why not load the sourcePath, get the the numFiles, add the junk folders, load again and compare the load numbers? I think thats better than hardcoding. -- To view, visit http://gerrit.cloudera.org:8080/13665 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa Gerrit-Change-Number: 13665 Gerrit-PatchSet: 3 Gerrit-Owner: Vihang Karajgaonkar <vih...@cloudera.com> Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-Reviewer: Vihang Karajgaonkar <vih...@cloudera.com> Gerrit-Comment-Date: Thu, 20 Jun 2019 18:44:19 +0000 Gerrit-HasComments: Yes