Re: A two-part vision for Subversion and large binary objects.

Daniel Shahaf Tue, 25 Jan 2022 00:59:06 -0800

Julian Foad wrote on Fri, Jan 21, 2022 at 11:15:04 +0000:
> Scanning with 'stat'
> 
> I'm concerned about the implementation scanning the whole subtree,
> calling 'stat' on every file to determine whether the file is "changed"
> (locally modified). This is done in svn_wc__textbase_sync() with its 
> textbase_walk_cb().
> 
> It does this scan on every sync, which is twice on every syncing
> operation such as diff.
> 
> Don't we already have an optimised scan for local modifications
> implemented in the "status" code?


This? —

   [subversion/libsvn_wc/status.c]
   457            /* If the on-disk dirent exactly matches the expected state
   458               skip all operations in svn_wc__internal_text_modified_p()
   459               to avoid an extra filestat for every file, which can be
   460               expensive on network drives as a filestat usually can't
   461               be cached there */
   462            if (!info->has_checksum)
   463              text_modified_p = TRUE; /* Local addition -> Modified */
   464            else if (ignore_text_mods
   465                    ||(dirent
   466                       && info->recorded_size != SVN_INVALID_FILESIZE
   467                       && info->recorded_time != 0
   468                       && info->recorded_size == dirent->filesize
   469                       && info->recorded_time == dirent->mtime))
   470              text_modified_p = FALSE;

This still does a stat() on every file; how else would it obtain
dirent->mtime?  It doesn't do open()/read()/memcmp().

> Could we re-use this?

textbase_walk_cb() calls check_file_modified() which has a very similar
size-and-mtime check right at the top.  So, we already repeat the logic;
we just implement it twice.

Reuse would be nice, of course.  If nothing else, we could at least add
comments to the two locations cross-referencing them.

> Premature Hydrating
> 
> The present implementation "hydrates" (fetches missing pristines) every
> file within the whole subtree the operation targets. This is done by
> every major client operation calling svn_client__textbase_sync() before
> and afterwards.
> 

Does it?  Looking at textbase_walk_cb(), it only sets REFERENCED to TRUE
for modified files.  If I understand correctly textbase_walk_cb() and
the docstring of svn_wc__db_textbase_walk(), something along the lines of
.
    svn revert -R ./
    echo foo > subversion/tests/README
    svn diff
.
would fetch the pristine only for that one file, wouldn't it?

Sorry, I haven't got time to test this right now.

> That is pessimistic: the operation may not actually touch all these
> files if limited in any way such as by
> 
>   - depth filtering
>   - other filtering (changelist, properties-only, ...)
>   - terminating early (e.g. output piped to 'head')
> 
> That introduces all the fetching overhead for the given subtree as a
> latency before the operation shows its results, which for something
> small at the root of the tree such as "svn diff --depth=empty
> --properties-only ./" may make a significant usability impact.
> 
> Presumably we could add the depth and some other kinds of filtering to
> the tree walk. But that will always leave terminating early, and
> possibly other cases, sub-optimal.
> 
> I would prefer a solution that defers the hydrating until closer to the
> moment of demand.

Agree that from a UX perspective, it would be nice to avoid a long delay
at the start of an operation.

However, in cases such as «svn diff --diff-cmd=<GUI tool>», fetching the
pristines (too) close to the time they are needed could result in having
to reopen RA sessions.  In this case, perhaps it would make sense to
download the pristines in the background in a separate thread (at least
in case APR_HAS_THREADS)?

> Evgeny, have you looked into these possibilities at all? What are your
> thoughts about these?

Cheers,

Daniel

Re: A two-part vision for Subversion and large binary objects.

Reply via email to