On 10/31/2018 3:11 PM, Duy Nguyen wrote:
not really a review, just  a couple quick notes..


Perfect! As an RFC, I'm more looking for high level thoughts/notes than a style/syntax code review.

On Tue, Oct 30, 2018 at 9:40 PM Ben Peart <peart...@gmail.com> wrote:

From: Ben Peart <benpe...@microsoft.com>

On index load, clear/set the skip worktree bits based on the virtual
file system data. Use virtual file system data to update skip-worktree
bit in unpack-trees. Use virtual file system data to exclude files and
folders not explicitly requested.

Signed-off-by: Ben Peart <benpe...@microsoft.com>
---

We have taken several steps to make git perform well on very large repos.
Some of those steps include: improving underlying algorithms, utilizing
multi-threading where possible, and simplifying the behavior of some commands.
These changes typically benefit all git repos to varying degrees.  While
these optimizations all help, they are insufficient to provide adequate
performance on the very large repos we often work with.

To make git perform well on the very largest repos, we had to make more
significant changes.  The biggest performance win by far is the work we have
done to make git operations O(modified) instead of O(size of repo).  This
takes advantage of the fact that the number of files a developer has modified
is a tiny fraction of the overall repo size.

We accomplished this by utilizing the existing internal logic for the skip
worktree bit and excludes to tell git to ignore all files and folders other
than those that have been modified.  This logic is driven by an external
process that monitors writes to the repo and communicates the list of files
and folders with changes to git via the virtual file system hook in this patch.

The external process maintains a list of files and folders that have been
modified.  When git runs, it requests the list of files and folders that
have been modified via the virtual file system hook.  Git then sets/clears
the skip-worktree bit on the cache entries and builds a hashmap of the
modified files/folders that is used by the excludes logic to avoid scanning
the entire repo looking for changes and untracked files.

With this system, we have been able to make local git command performance on
extremely large repos (millions of files, 1/2 million folders) entirely
manageable (30 second checkout, 3.5 seconds status, 4 second add, 7 second
commit, etc).

Our desire is to eliminate all custom patches in our fork of git.  To that
end, I'm submitting this as an RFC to see how much interest there is and how
much willingness to take this type of change into git.

Most of these paragraphs (perhaps except the last one) should be part
of the commit message. You describe briefly what the patch does but
it's even more important to say why you want to do it.

+core.virtualFilesystem::
+       If set, the value of this variable is used as a command which
+       will identify all files and directories that are present in
+       the working directory.  Git will only track and update files
+       listed in the virtual file system.  Using the virtual file system
+       will supersede the sparse-checkout settings which will be ignored.
+       See the "virtual file system" section of linkgit:githooks[6].

It sounds like "virtual file system" is just one of the use cases for
this feature, which is more about a dynamic source of sparse-checkout
bits. Perhaps name the config key with something along sparse checkout
instead of naming it after a use case.

It's more than a dynamic sparse-checkout because the same list is also used to exclude any file/folder not listed. That means any file not listed won't ever be updated by git (like in 'checkout' for example) so 'stale' files could be left in the working directory. It also means git won't find new/untracked files unless they are specifically added to the list.


This is a hook. I notice we start to avoid adding real hooks and just
add config keys instead. Eventually we should have config-based hooks,
but if we're going to add more like this, I think these should be in a
separate section, hook.virtualFileSystem or something.


That is a great idea. I don't personally like specifying the hook as the 'flag' for whether a feature should be used. I'd rather have it be a bool (enable the feature? true/false) and 1) either have the hook name hard coded (like most existing hooks) or 2) as you suggest add a consistent way to have config-based hooks. Config based hooks could also help provide a consistent way to configure them using GIT_TEST_* environment variables for testing.

I don't think the superseding makes sense. There's no reason this
could not be used in combination with $GIT_DIR/info/sparse-checkout.
If you don't want both, disable the other.

One last note. Since this is related to filesystem. Shouldn't it be
part of fsmonitor (the protocol, not the implementation)? Then
watchman user could use it to.


To get this to work properly takes a lot more logic than exists in fsmonitor/watchman. The challenge is that 1) fsmonitor/watchman is focused on "what has changed since <time>" and 2) doesn't currently impact the excludes logic.

If you attempted to use this with watchman there is a chicken and egg problem. The initial git checkout wouldn't write out _any_ files to the working directory as none have been modified. There would be no way to get them populated where they could even get modified to get added to the list. Not very useful. :-)

This works with VFS for Git because it provides a virtual projection and will dynamically write out the contents of the file in the working directory as they are read. It makes it appear that they are there and will fetch the actual contents on demand transparently. If the user ends up modifying the file, it will get added to the virtual projection list so that git will start to pay attention to and update that file.

If the files are only read (and not written) by the user, the version on disk must be maintained by the VFS for Git daemon because git is completely unaware of them. That means the daemon must detect when the git commit changes and remove the contents of all the files that were read but not written and start projecting the files from new commit.

In short, this is only one small piece of what is necessary to get a fully virtualized git repo. It's an important piece but only one of the many pieces. My reason for submitting this RFC is to start the discussion about how interested the community is in enabling repo virtualization in the mainline version of git.

Reply via email to