foxtail463 opened a new issue, #64369: URL: https://github.com/apache/doris/issues/64369
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.

### Version

branch-4.0

### What's Wrong?

In branch-4.0, `RemoteFSPhantomManager` tracks `DFSFileSystem` with a `PhantomReference` and closes the internal `org.apache.hadoop.fs.FileSystem` when the `DFSFileSystem` object is garbage collected.

This can close the underlying Hadoop `FileSystem` too early. Some DFS operations may still be using the Hadoop `FileSystem`, `RemoteIterator`, `FSDataInputStream`, or `FSDataOutputStream` after the owning `DFSFileSystem` object is no longer strongly reachable. In that window, GC can enqueue the `DFSFileSystem` phantom reference, and the cleanup thread may close the Hadoop `FileSystem` while operations such as `listStatus/listFiles`, read, or write are still running.

This may cause intermittent failures in external file system access, especially when:

- `DFSFileSystem` is evicted from `FileSystemCache` by size limit
- `DFSFileSystem` expires after access timeout
- `DFSFileSystem` is created directly and not retained by the cache
- a long-running list/read/write operation only keeps the underlying Hadoop objects alive

### What You Expected?

The internal Hadoop `FileSystem` should not be closed while any active DFS operation is still using it.

Cleanup should happen only after both conditions are true:

1. The owning `DFSFileSystem` is closed or garbage collected.
2. There are no active operations/streams/iterators holding the underlying Hadoop `FileSystem`.

### How to Reproduce?

The issue is timing-sensitive and depends on GC.

General reproduction scenario:

1. Create or obtain a `DFSFileSystem`.
2. Start a long-running HDFS operation, for example recursive `listFiles/listStatus`, or create an input/output stream.
3. Drop the strong reference to the `DFSFileSystem` while the returned Hadoop object is still being used.
4. Trigger GC.
5. Wait for `RemoteFSPhantomManager` cleanup.
6. Continue using the iterator/stream.

The cleanup thread may close the internal Hadoop `FileSystem`, causing the still-running operation to fail with IO errors related to a closed filesystem/client.

This is less likely during normal cache hits because `FileSystemCache` uses a strong-value Caffeine cache, but the race is still possible after cache eviction, expiration, or direct construction outside the cache.

### Anything Else?

The root cause is that phantom cleanup is tied to the lifetime of `DFSFileSystem`, but the actual resource users are the underlying Hadoop `FileSystem` and streams/iterators derived from it.

A safer design is to introduce a resource/lease layer:

- `DFSFileSystemResource` owns the Hadoop `FileSystem`.
- Each operation acquires a lease before using the resource.
- `close()` or phantom cleanup only marks the resource as closing.
- The Hadoop `FileSystem` is physically closed only after all active leases are released.

This avoids closing the Hadoop `FileSystem` while `listFiles`, reads, or writes are still in progress.

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
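
For illustration, the resource/lease layer proposed under "Anything Else?" could look roughly like the sketch below. This is not Doris code: `LeaseSketch`, `LeasedResource`, `Lease`, `acquire()`, and `markClosing()` are hypothetical names, and a plain `java.io.Closeable` stands in for the Hadoop `FileSystem` (which implements `Closeable`). It only shows the counting rule: closing is requested once, and the physical close happens when the last lease is released.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicBoolean;

public class LeaseSketch {

    /** Hypothetical stand-in for the proposed DFSFileSystemResource. */
    public static final class LeasedResource<T extends Closeable> {
        private final T delegate;        // e.g. the Hadoop FileSystem
        private int activeLeases = 0;    // guarded by "this"
        private boolean closing = false; // set by close() or phantom cleanup
        private boolean closed = false;  // true once the delegate close was triggered

        public LeasedResource(T delegate) {
            this.delegate = delegate;
        }

        /** Each operation takes a lease before touching the delegate. */
        public synchronized Lease acquire() {
            if (closing) {
                throw new IllegalStateException("resource is closing");
            }
            activeLeases++;
            return new Lease();
        }

        /** Only marks the resource as closing; closes now if nothing is in flight. */
        public void markClosing() {
            boolean closeNow;
            synchronized (this) {
                closing = true;
                closeNow = activeLeases == 0 && !closed;
                if (closeNow) {
                    closed = true;
                }
            }
            if (closeNow) {
                closeDelegate();
            }
        }

        private void release() {
            boolean closeNow;
            synchronized (this) {
                activeLeases--;
                closeNow = closing && activeLeases == 0 && !closed;
                if (closeNow) {
                    closed = true;
                }
            }
            if (closeNow) {
                closeDelegate();
            }
        }

        private void closeDelegate() {
            try {
                delegate.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        /** Handle held for the duration of one list/read/write operation. */
        public final class Lease implements AutoCloseable {
            private final AtomicBoolean released = new AtomicBoolean(false);

            public T get() {
                return delegate;
            }

            @Override
            public void close() {
                // Idempotent: a double close must not decrement the count twice.
                if (released.compareAndSet(false, true)) {
                    release();
                }
            }
        }
    }

    public static void main(String[] args) {
        AtomicBoolean fsClosed = new AtomicBoolean(false);
        Closeable fakeFs = () -> fsClosed.set(true); // stand-in for the Hadoop FileSystem
        LeasedResource<Closeable> resource = new LeasedResource<>(fakeFs);
        try (LeasedResource<Closeable>.Lease lease = resource.acquire()) {
            resource.markClosing(); // e.g. phantom cleanup fires mid-operation
            System.out.println("closed during operation: " + fsClosed.get());
        }
        System.out.println("closed after release: " + fsClosed.get());
    }
}
```

Running `main` prints `false` for the first line and `true` for the second: the close request arrives while a lease is held, so the delegate is only closed when the `try` block releases it. In a real fix, streams and iterators returned to callers would presumably need to carry their lease until they are closed or exhausted; that wiring is left out here.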
