Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17171 )
Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil ...................................................................... Patch Set 5: (3 comments) http://gerrit.cloudera.org:8080/#/c/17171/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17171/4//COMMIT_MSG@15 PS4, Line 15: infi > nit, I think using "infinite loop" is a more readable. Done http://gerrit.cloudera.org:8080/#/c/17171/4//COMMIT_MSG@18 PS4, Line 18: throw > nit, throw Done http://gerrit.cloudera.org:8080/#/c/17171/4/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java: http://gerrit.cloudera.org:8080/#/c/17171/4/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@847 PS4, Line 847: erIterator(Path startPath, R > Does this mean that if the file which is present inside a directory and whi AFAIK, it depends on the fs implementation. First of all, the exception is more possible to be thrown by hasNext() than next() in this pattern. Both the hasNext() and next() calls could throw IOException. AFAIK, most of the implementations cache some listing results when hasNext() is called. In their next() call, hasNext() will also be called again in case it's not called. So if hasNext() doesn't throw exception, the following next() call won't throw exception either, i.e. next() will return an item from the cache results. It could be a stale item that already removed after the last fetch. Secondly, it's more likely to get FileNotFoundException in these two cases: 1. When this is a recursive RemoteIterator and when the absent item is a subdirectory (instead of a file), the iterator gets exception when entering(listing) the subdirectory. 2. When the current directory being listing is removed, the iterator gets exception when it wants to fetch more results. If the absent directory is a partition folder, it's possible to hit case #2. E.g. when a high level partition is removed. I think it's ok to cause table loading errors. Otherwise, we may have loaded some stale files. If the absent directory is a staging folder (created by Hive/Impala INSERTs), it's more likely to hit case #1, because staging folder usually won't have more than 1000 files. The iterator gets all of them in one fetch. However, it still depends on the fs implementation. Take listing on HDFS as an example. The fetch size defaults to 1000 (configured by dfs.ls.limit). For case #1, our expectation is ignoring the absent subdirectory and continuely listing. So this patch catches the FileNotFoundException in RecursivingIterator#hasNext() when calling handleFileStat(), instead of catching it here. This patch also prefers using RecursivingIterator if the target FS doesn't have a real resursive listing like S3. It avoids hitting case #1. -- To view, visit http://gerrit.cloudera.org:8080/17171 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea Gerrit-Change-Number: 17171 Gerrit-PatchSet: 5 Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Reviewer: Vihang Karajgaonkar <vih...@cloudera.com> Gerrit-Comment-Date: Tue, 16 Mar 2021 07:38:46 +0000 Gerrit-HasComments: Yes