Ilan Goldfeld created VFS-862:
---------------------------------
Summary: CacheStrategy.ON_RESOLVE triggers refresh on internal VFS
navigation, causing redundant FTP/SFTP operations
Key: VFS-862
URL: https://issues.apache.org/jira/browse/VFS-862
Project: Commons VFS
Issue Type: Bug
Reporter: Ilan Goldfeld
Fix For: 2.11.0
h2. Description
h3. Steps to reproduce
{code:java}
DefaultFileSystemManager manager = new DefaultFileSystemManager();
manager.addProvider("ftp", new FtpFileProvider());
manager.setCacheStrategy(CacheStrategy.ON_RESOLVE);
manager.init();
{code}
*Example 1 -- pre-existing, before PR #758:*
Checking existence of two sibling files should require 1 LIST on the parent,
but the second file's internal {{getParent()}} re-resolves the parent via
{{resolveFile()}}, triggering an ON_RESOLVE refresh that forces a redundant
LIST.
{code:java}
FileObject file1 = manager.resolveFile("ftp://user:pass@host/dir/file1.txt",
options);
FileObject file2 = manager.resolveFile("ftp://user:pass@host/dir/file2.txt",
options);
file1.exists(); // LIST on parent to determine type
file2.exists(); // redundant LIST — getParent() refreshes parent via ON_RESOLVE
// Expected: 1 LIST
// Actual: 2 LIST commands
{code}
*Example 2 -- after PR #758, amplified to O(N):*
Scanning a directory with N files now produces ~N LIST commands instead of 1.
{code:java}
FileObject dir = manager.resolveFile("ftp://user:pass@host/directory", options);
dir.refresh();
FileObject[] children = dir.findFiles(new FileDepthSelector(1, 1));
// Expected: 1 LIST
// Actual: ~N LIST commands — each child's getParent() refreshes the parent,
// and the unconditional childMap = null in refresh() clears the
// cached listing every time
{code}
h3. The issue
{{CacheStrategy.ON_RESOLVE}} is designed to refresh a file when the user
resolves it via the public API. However, internal VFS navigation methods also
call {{fileSystem.resolveFile()}}, which triggers the same ON_RESOLVE refresh.
This means that internal plumbing -- not user actions -- invalidates cached
state on files the user never asked to refresh.
Affected internal call sites:
* {{AbstractFileObject.getParent()}} -- resolves the parent directory
* {{AbstractFileObject.getChildren()}} -- resolves each child via
{{resolveFiles()}}
* {{AbstractFileSystem.getRoot()}} -- resolves the root file object (called by
{{getParent()}} for the root check)
* {{FtpFileObject.getLinkDestination()}} -- resolves symlink targets
* {{SftpFileObject.doListChildrenResolved()}} -- resolves children during SFTP
listing
For example, resolving two sibling files and checking their existence should
require a single LIST on the parent directory. Instead, the second file's
internal {{getParent()}} call re-resolves the parent via {{resolveFile()}},
triggering an ON_RESOLVE refresh that clears the parent's cached directory
listing, forcing a redundant LIST.
h3. History
This was always a wasteful pattern, but it became a severe performance
regression after PR #758 ("Fix refresh() not clearing cached state when file
was never attached").
PR #758 addressed a real correctness issue: provider-specific cached fields
like {{FtpFileObject.childMap}} could be populated without going through
{{attach()}} (e.g. via {{getChildFile()}} -> {{doGetChildren()}}). When
{{refresh()}} was called on such a file, {{detach()}} skipped {{doDetach()}}
because {{attached == false}}, silently preserving stale data. The fix added
{{childMap = null}} to {{FtpFileObject.refresh()}} to ensure cached state is
always cleared on refresh.
However, this turned the unnecessary internal refreshes into an O(N)
amplification: when scanning a directory with N files via {{findFiles()}}, each
child's {{getType()}} call triggers {{getParent()}}, which re-resolves the
parent via {{resolveFile()}}. With ON_RESOLVE, this refreshes the parent, and
the new unconditional {{childMap = null}} clears the parent's cached directory
listing for every child -- forcing a new FTP LIST command each time. A
directory with 50 files produces ~50 LIST commands instead of 1.
Before PR #758, the parent's {{childMap}} survived these unnecessary refreshes
because {{refresh()}} only cleared {{childMap}} via {{doDetach()}}, which was
guarded by {{if (attached)}}. After the first child's refresh set the parent to
{{attached=false}}, subsequent refreshes left {{childMap}} intact. PR #758
removed this accidental protection.
Even without PR #758, the pattern was wasteful. For example, resolving two
sibling files and checking their existence would produce 2 LIST commands
instead of 1, because the second file's internal {{getParent()}} call refreshes
the parent (clearing its state via {{doDetach()}} when the parent is attached).
h3. Fix
The fix has two parts:
*1. Internal navigation skips ON_RESOLVE.* A new {{resolveFileInternal()}}
method resolves files from the cache without applying the ON_RESOLVE refresh
policy. All internal navigation call sites use this instead of the public
{{resolveFile()}}:
* {{AbstractFileObject.getParent()}} -- parent resolution
* {{AbstractFileObject.resolveFile(FileName)}} -- child resolution in
{{resolveFiles()}}/{{getChild()}}
* {{AbstractFileSystem.getRoot()}} -- root file object lookup
* {{FtpFileObject.getLinkDestination()}} -- symlink target resolution
* {{SftpFileObject.doListChildrenResolved()}} -- SFTP child resolution
A protected bridge method in {{AbstractFileObject}} makes
{{resolveFileInternal()}} available to provider subclasses.
*2. Fresh directory listings propagate metadata to cached children in-place.*
When a directory listing is performed after a refresh, FTP and SFTP providers
update cached child objects directly with the fresh metadata from the listing,
preserving object identity:
* *FTP*: {{FtpFileObject.getChildren()}} detects a fresh LIST (because
{{childMap}} was null after refresh). After the LIST populates {{childMap}}, it
iterates the returned children and sets each child's {{ftpFile}} to the
matching entry from the fresh {{childMap}}, then clears the cached {{type}} via
{{injectType(null)}} so it re-evaluates from the fresh metadata.
* *SFTP*: {{SftpFileObject.doListChildrenResolved()}} already sets fresh
{{attrs}} on each child via {{setStat()}}. Now it also clears the cached
{{type}} via {{injectType(null)}} so it re-evaluates from the updated {{attrs}}.
This approach was chosen over cache eviction (removing and recreating child
objects) because eviction breaks object identity -- existing VFS tests assert
that resolving the same path returns the same FileObject instance. In-place
propagation updates the same objects with fresh data.
Other providers like SMB and local files don't need this -- their
{{doGetType()}} resolves directly from the server/filesystem on every call, so
after {{detach()}} clears the cached {{type}}, children naturally get fresh
data.
Together, these changes establish a clear contract: *cached state is used until
the user explicitly calls {{refresh()}} or resolves a file via the public API.*
Internal navigation never triggers server operations. When a directory is
refreshed, both the directory listing and its children are fresh.
h3. Tests
* {{FtpGetChildrenListCommandTest}}: verifies that {{findFiles()}} on an FTP
directory with 50 files issues exactly 1 LIST command with
{{CacheStrategy.ON_RESOLVE}}.
* {{SftpGetChildrenListCommandTest}}: verifies that refreshing an SFTP
directory and calling {{findFiles()}} returns fresh children reflecting
filesystem changes, proving in-place metadata propagation works for SFTP.
h3. Compatibility note
This changes the observable behavior of {{CacheStrategy.ON_RESOLVE}} for files
reached through internal navigation. The following call sites no longer trigger
ON_RESOLVE refresh:
* {{AbstractFileObject.getParent()}} -- resolving the parent directory
* {{AbstractFileObject.getChildren()}} / {{getChild()}} -- resolving children
* {{AbstractFileSystem.getRoot()}} -- resolving the root file object
* {{FtpFileObject.getLinkDestination()}} -- resolving symlink targets
* {{SftpFileObject.doListChildrenResolved()}} -- resolving children during SFTP
listing
Previously, these methods called {{fileSystem.resolveFile()}}, which triggered
an ON_RESOLVE refresh on each returned file. After this fix, internally
navigated files are served from the cache and only refreshed when the user
explicitly calls {{refresh()}} or re-resolves them via the public API.
When a directory is refreshed and its children are listed, the FTP and SFTP
providers propagate fresh metadata from the listing to cached child objects
in-place (updating {{ftpFile}}/{{attrs}} and clearing the cached {{type}}).
Object identity is preserved -- the same FileObject instances are returned with
updated data.
This behavior is consistent with {{CacheStrategy.MANUAL}} and aligns with the
intended ON_RESOLVE contract: refresh on explicit user resolution, not on
internal navigation.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)