zclllyybb commented on issue #64369:
URL: https://github.com/apache/doris/issues/64369#issuecomment-4668393563

   Breakwater-GitHub-Analysis-Slot: slot_c7a3882c3932
   
   I checked this against current `upstream/branch-4.0` (`dec383f670a`). The 
report is code-plausible and should be treated as a real branch-4.0 lifecycle 
bug, not only a theoretical cleanup concern.
   
   Evidence from the branch-4.0 code:
   
   - `DFSFileSystem.nativeFileSystem()` creates a Hadoop `FileSystem`, stores 
it in `dfsFileSystem`, creates `HDFSFileOperations`, and registers `this` with 
`RemoteFSPhantomManager`.
   - `RemoteFSPhantomManager` maps `PhantomReference<DFSFileSystem>` directly 
to the raw Hadoop `FileSystem` and the cleanup thread calls `fs.close()` as 
soon as the wrapper reference is enqueued. There is no active-operation, 
active-stream, or active-iterator accounting before close.
   - `FileSystemCache` uses a normal Caffeine loading cache with 
`expireAfterAccess` and `maximumSize`, and no removal listener that coordinates 
resource lifetime. Once the cache no longer strongly retains a `DFSFileSystem`, 
the phantom cleanup path is the only deferred cleanup mechanism for that 
wrapper.
   - Several public paths obtain derived Hadoop objects from the same 
`dfsFileSystem`: `listFiles()` uses a `RemoteIterator<LocatedFileStatus>`, 
`downloadWithFileSize()` / `upload()` use `FSDataInputStream` / 
`FSDataOutputStream`, and `openFile()` / `createFile()` return streams to 
callers. The returned-stream paths are the clearest risk: for example 
`HdfsInputFile.newStream()` returns `HdfsInputStream` containing only the 
`FSDataInputStream` and path, and `HdfsOutputFile.create()/createOrOverwrite()` 
return the raw output stream; these returned objects do not visibly retain the 
owning `DFSFileSystem`.
   
   So the failure mode described here is valid: the object used as the phantom 
referent is not the same object that represents active Hadoop I/O usage. If the 
`DFSFileSystem` wrapper becomes unreachable after cache eviction/expiration or 
direct construction while a derived stream/iterator is still active, phantom 
cleanup can close the underlying Hadoop `FileSystem` out from under that 
operation.
   
   Recommended fix direction:
   
   - Introduce a small resource owner around the Hadoop `FileSystem` with an 
active lease/reference count and a closing flag.
   - Make every operation that touches the Hadoop `FileSystem` acquire a lease 
before obtaining or using Hadoop objects.
   - Returned streams/iterators must hold a lease and release it on 
close/exhaustion/error. This is important for `openFile()`, `createFile()`, 
`HdfsInputFile.newStream()`, `HdfsOutputFile.create()/createOrOverwrite()`, and 
any `RemoteIterator` path.
   - `DFSFileSystem.close()` and phantom cleanup should mark the resource 
closing and prevent new leases, but physically call `FileSystem.close()` only 
after the active count reaches zero.
   - Add a deterministic FE unit test with a fake or mocked Hadoop 
`FileSystem`/stream/iterator that verifies phantom/close cleanup does not close 
the underlying `FileSystem` while a lease-backed stream or iterator is still 
open. Avoid relying only on real GC timing in the test.
   
   Missing information that would help validate an existing production symptom:
   
   - The exact exception and stack trace from the failed read/write/list 
operation.
   - Whether the failure happened through Hive/HMS listing, Iceberg `FileIO`, 
backup/restore, storage vault checking, or another direct 
`FileSystemFactory.get()` caller.
   - The values of `max_remote_file_system_cache_num` and 
`external_cache_expire_time_seconds_after_access`, plus whether the workload 
creates many distinct HDFS identities/properties.
   - A small reproducer that keeps only a returned stream/iterator alive after 
dropping the `DFSFileSystem`/`FileIO` object, then forces cache eviction or GC.
   
   I would scope the PR to branch-4.0's current filesystem implementation. 
Current master has a newer `fe-filesystem` layout and I did not find this exact 
`RemoteFSPhantomManager` path there.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to