Ilan Goldfeld created VFS-862:
---------------------------------

             Summary: CacheStrategy.ON_RESOLVE triggers refresh on internal VFS 
navigation, causing redundant FTP/SFTP operations
                 Key: VFS-862
                 URL: https://issues.apache.org/jira/browse/VFS-862
             Project: Commons VFS
          Issue Type: Bug
            Reporter: Ilan Goldfeld
             Fix For: 2.11.0


h2. Description

h3. Steps to reproduce

{code:java}
DefaultFileSystemManager manager = new DefaultFileSystemManager();
manager.addProvider("ftp", new FtpFileProvider());
manager.setCacheStrategy(CacheStrategy.ON_RESOLVE);
manager.init();
{code}

*Example 1 -- pre-existing, before PR #758:*
Checking existence of two sibling files should require 1 LIST on the parent, 
but the second file's internal {{getParent()}} re-resolves the parent via 
{{resolveFile()}}, triggering an ON_RESOLVE refresh that forces a redundant 
LIST.

{code:java}
FileObject file1 = manager.resolveFile("ftp://user:pass@host/dir/file1.txt";, 
options);
FileObject file2 = manager.resolveFile("ftp://user:pass@host/dir/file2.txt";, 
options);
file1.exists();  // LIST on parent to determine type
file2.exists();  // redundant LIST — getParent() refreshes parent via ON_RESOLVE
// Expected: 1 LIST
// Actual:   2 LIST commands
{code}

*Example 2 -- after PR #758, amplified to O(N):*
Scanning a directory with N files now produces ~N LIST commands instead of 1.

{code:java}
FileObject dir = manager.resolveFile("ftp://user:pass@host/directory";, options);
dir.refresh();
FileObject[] children = dir.findFiles(new FileDepthSelector(1, 1));
// Expected: 1 LIST
// Actual:   ~N LIST commands — each child's getParent() refreshes the parent,
//           and the unconditional childMap = null in refresh() clears the
//           cached listing every time
{code}

h3. The issue

{{CacheStrategy.ON_RESOLVE}} is designed to refresh a file when the user 
resolves it via the public API. However, internal VFS navigation methods also 
call {{fileSystem.resolveFile()}}, which triggers the same ON_RESOLVE refresh. 
This means that internal plumbing -- not user actions -- invalidates cached 
state on files the user never asked to refresh.

Affected internal call sites:
* {{AbstractFileObject.getParent()}} -- resolves the parent directory
* {{AbstractFileObject.getChildren()}} -- resolves each child via 
{{resolveFiles()}}
* {{AbstractFileSystem.getRoot()}} -- resolves the root file object (called by 
{{getParent()}} for the root check)
* {{FtpFileObject.getLinkDestination()}} -- resolves symlink targets
* {{SftpFileObject.doListChildrenResolved()}} -- resolves children during SFTP 
listing

For example, resolving two sibling files and checking their existence should 
require a single LIST on the parent directory. Instead, the second file's 
internal {{getParent()}} call re-resolves the parent via {{resolveFile()}}, 
triggering an ON_RESOLVE refresh that clears the parent's cached directory 
listing, forcing a redundant LIST.

h3. History

This was always a wasteful pattern, but it became a severe performance 
regression after PR #758 ("Fix refresh() not clearing cached state when file 
was never attached").

PR #758 addressed a real correctness issue: provider-specific cached fields 
like {{FtpFileObject.childMap}} could be populated without going through 
{{attach()}} (e.g. via {{getChildFile()}} -> {{doGetChildren()}}). When 
{{refresh()}} was called on such a file, {{detach()}} skipped {{doDetach()}} 
because {{attached == false}}, silently preserving stale data. The fix added 
{{childMap = null}} to {{FtpFileObject.refresh()}} to ensure cached state is 
always cleared on refresh.

However, this turned the unnecessary internal refreshes into an O(N) 
amplification: when scanning a directory with N files via {{findFiles()}}, each 
child's {{getType()}} call triggers {{getParent()}}, which re-resolves the 
parent via {{resolveFile()}}. With ON_RESOLVE, this refreshes the parent, and 
the new unconditional {{childMap = null}} clears the parent's cached directory 
listing for every child -- forcing a new FTP LIST command each time. A 
directory with 50 files produces ~50 LIST commands instead of 1.

Before PR #758, the parent's {{childMap}} survived these unnecessary refreshes 
because {{refresh()}} only cleared {{childMap}} via {{doDetach()}}, which was 
guarded by {{if (attached)}}. After the first child's refresh set the parent to 
{{attached=false}}, subsequent refreshes left {{childMap}} intact. PR #758 
removed this accidental protection.

Even without PR #758, the pattern was wasteful. For example, resolving two 
sibling files and checking their existence would produce 2 LIST commands 
instead of 1, because the second file's internal {{getParent()}} call refreshes 
the parent (clearing its state via {{doDetach()}} when the parent is attached).

h3. Fix

The fix has two parts:

*1. Internal navigation skips ON_RESOLVE.* A new {{resolveFileInternal()}} 
method resolves files from the cache without applying the ON_RESOLVE refresh 
policy. All internal navigation call sites use this instead of the public 
{{resolveFile()}}:

* {{AbstractFileObject.getParent()}} -- parent resolution
* {{AbstractFileObject.resolveFile(FileName)}} -- child resolution in 
{{resolveFiles()}}/{{getChild()}}
* {{AbstractFileSystem.getRoot()}} -- root file object lookup
* {{FtpFileObject.getLinkDestination()}} -- symlink target resolution
* {{SftpFileObject.doListChildrenResolved()}} -- SFTP child resolution

A protected bridge method in {{AbstractFileObject}} makes 
{{resolveFileInternal()}} available to provider subclasses.

*2. Fresh directory listings propagate metadata to cached children in-place.* 
When a directory listing is performed after a refresh, FTP and SFTP providers 
update cached child objects directly with the fresh metadata from the listing, 
preserving object identity:

* *FTP*: {{FtpFileObject.getChildren()}} detects a fresh LIST (because 
{{childMap}} was null after refresh). After the LIST populates {{childMap}}, it 
iterates the returned children and sets each child's {{ftpFile}} to the 
matching entry from the fresh {{childMap}}, then clears the cached {{type}} via 
{{injectType(null)}} so it re-evaluates from the fresh metadata.
* *SFTP*: {{SftpFileObject.doListChildrenResolved()}} already sets fresh 
{{attrs}} on each child via {{setStat()}}. Now it also clears the cached 
{{type}} via {{injectType(null)}} so it re-evaluates from the updated {{attrs}}.

This approach was chosen over cache eviction (removing and recreating child 
objects) because eviction breaks object identity -- existing VFS tests assert 
that resolving the same path returns the same FileObject instance. In-place 
propagation updates the same objects with fresh data.

Other providers like SMB and local files don't need this -- their 
{{doGetType()}} resolves directly from the server/filesystem on every call, so 
after {{detach()}} clears the cached {{type}}, children naturally get fresh 
data.

Together, these changes establish a clear contract: *cached state is used until 
the user explicitly calls {{refresh()}} or resolves a file via the public API.* 
Internal navigation never triggers server operations. When a directory is 
refreshed, both the directory listing and its children are fresh.

h3. Tests

* {{FtpGetChildrenListCommandTest}}: verifies that {{findFiles()}} on an FTP 
directory with 50 files issues exactly 1 LIST command with 
{{CacheStrategy.ON_RESOLVE}}.
* {{SftpGetChildrenListCommandTest}}: verifies that refreshing an SFTP 
directory and calling {{findFiles()}} returns fresh children reflecting 
filesystem changes, proving in-place metadata propagation works for SFTP.

h3. Compatibility note

This changes the observable behavior of {{CacheStrategy.ON_RESOLVE}} for files 
reached through internal navigation. The following call sites no longer trigger 
ON_RESOLVE refresh:

* {{AbstractFileObject.getParent()}} -- resolving the parent directory
* {{AbstractFileObject.getChildren()}} / {{getChild()}} -- resolving children
* {{AbstractFileSystem.getRoot()}} -- resolving the root file object
* {{FtpFileObject.getLinkDestination()}} -- resolving symlink targets
* {{SftpFileObject.doListChildrenResolved()}} -- resolving children during SFTP 
listing

Previously, these methods called {{fileSystem.resolveFile()}}, which triggered 
an ON_RESOLVE refresh on each returned file. After this fix, internally 
navigated files are served from the cache and only refreshed when the user 
explicitly calls {{refresh()}} or re-resolves them via the public API.

When a directory is refreshed and its children are listed, the FTP and SFTP 
providers propagate fresh metadata from the listing to cached child objects 
in-place (updating {{ftpFile}}/{{attrs}} and clearing the cached {{type}}). 
Object identity is preserved -- the same FileObject instances are returned with 
updated data.

This behavior is consistent with {{CacheStrategy.MANUAL}} and aligns with the 
intended ON_RESOLVE contract: refresh on explicit user resolution, not on 
internal navigation.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to