Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17171 )

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................


Patch Set 5:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/17171/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17171/4//COMMIT_MSG@15
PS4, Line 15: infi
> nit, I think using "infinite loop" is a more readable.
Done


http://gerrit.cloudera.org:8080/#/c/17171/4//COMMIT_MSG@18
PS4, Line 18: throw
> nit, throw
Done


http://gerrit.cloudera.org:8080/#/c/17171/4/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

http://gerrit.cloudera.org:8080/#/c/17171/4/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@847
PS4, Line 847: erIterator(Path startPath, R
> Does this mean that if the file which is present inside a directory and whi
AFAIK, it depends on the fs implementation.

First of all, the exception is more possible to be thrown by hasNext() than 
next() in this pattern. Both the hasNext() and next() calls could throw 
IOException. AFAIK, most of the implementations cache some listing results when 
hasNext() is called. In their next() call, hasNext() will also be called again 
in case it's not called. So if hasNext() doesn't throw exception, the following 
next() call won't throw exception either, i.e. next() will return an item from 
the cache results. It could be a stale item that already removed after the last 
fetch.

Secondly, it's more likely to get FileNotFoundException in these two cases:
1. When this is a recursive RemoteIterator and when the absent item is a 
subdirectory (instead of a file), the iterator gets exception when 
entering(listing) the subdirectory.
2. When the current directory being listing is removed, the iterator gets 
exception when it wants to fetch more results.

If the absent directory is a partition folder, it's possible to hit case #2. 
E.g. when a high level partition is removed. I think it's ok to cause table 
loading errors. Otherwise, we may have loaded some stale files.

If the absent directory is a staging folder (created by Hive/Impala INSERTs), 
it's more likely to hit case #1, because staging folder usually won't have more 
than 1000 files. The iterator gets all of them in one fetch. However, it still 
depends on the fs implementation. Take listing on HDFS as an example. The fetch 
size defaults to 1000 (configured by dfs.ls.limit).

For case #1, our expectation is ignoring the absent subdirectory and continuely 
listing. So this patch catches the FileNotFoundException in 
RecursivingIterator#hasNext() when calling handleFileStat(), instead of 
catching it here. This patch also prefers using RecursivingIterator if the 
target FS doesn't have a real resursive listing like S3. It avoids hitting case 
#1.



--
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Vihang Karajgaonkar <vih...@cloudera.com>
Gerrit-Comment-Date: Tue, 16 Mar 2021 07:38:46 +0000
Gerrit-HasComments: Yes

Reply via email to