On 29/06/2021 20:48, Kyle Lippincott wrote:
> Can you elaborate a bit on what this append-only tree looks like (and why that's
> preferred)
It’s a tree in that there are nodes for files and directories. We can quickly find
root nodes, and from a given node we can quickly find its direct child nodes, all
without parsing the entire file.
It’s "append-mostly" because changes are made by adding new nodes at the end of the
file and reusing nodes for unchanged sub-trees. Nodes that have been replaced become
unreachable but still take up space. Occasionally, based on some heuristic, the whole
file is rewritten without unreachable nodes. This makes most writes cheaper than
re-serializing and writing the entire file.
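A minimal sketch of the append-mostly idea, assuming a toy in-memory buffer instead of the real file. The class and method names are hypothetical and the records are opaque bytes, not actual dirstate-v2 nodes:

```python
class AppendMostlyFile:
    """Toy model of a file where updates append records instead of rewriting."""

    def __init__(self):
        self.data = bytearray()
        self.unreachable = 0  # bytes belonging to records that were superseded

    def append(self, record):
        """Append a record and return its (offset, length) pseudo-pointer."""
        offset = len(self.data)
        self.data += record
        return offset, len(record)

    def replace(self, old, record):
        """Supersede an old record: its bytes stay in the file but become dead."""
        self.unreachable += old[1]
        return self.append(record)


store = AppendMostlyFile()
first = store.append(b"aaaa")
second = store.replace(first, b"bbbbbb")
```

Only the new record is written. The old one is left behind and counted as unreachable until a full rewrite drops it.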
> and why stem compression would cause performance issues?
Each node contains its full path from the repository root. This allows status code to
pass around a slice (pointer + length) to the middle of the mmap’ed file. If a node
only had its basename we’d have to allocate a string to reconstitute a path by
concatenating the names of ancestor directories. The cost of many memory allocations
can add up.
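To illustrate the difference, here is a rough Python sketch. The real code is Rust and works on the mmap’ed file; both helper names below are hypothetical, and a plain bytes object stands in for the mapping:

```python
def path_slice(buf, start, length):
    """Return a zero-copy view of a full path stored in buf."""
    return memoryview(buf)[start:start + length]


def rebuild_path(ancestor_names, basename):
    """What a basename-only node would need: a new allocation per lookup."""
    return b"/".join([*ancestor_names, basename])


# Two full paths stored back to back, as they might sit in the data file.
data = b"dir/a.txt" + b"dir/sub/b.txt"
view = path_slice(data, 9, 13)
```

The view only records where the path lives in the buffer, while `rebuild_path` has to build a new byte string every time it is called.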
> When loading this new dirstate, would it require loading the entire thing from the
> beginning and replacing entries with the newer ones?
No, that’s the point of making it a tree of fixed-size nodes that contain data at
fixed offsets, with pseudo-pointers for variable-size data (paths and child nodes).
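As a sketch of how fixed-size nodes allow this kind of random access: the field layout below is made up for illustration and is not the actual dirstate-v2 layout.

```python
import struct

# Hypothetical node layout (big-endian, 14 bytes):
# path offset (u32), path length (u16), first child offset (u32), child count (u32)
NODE = struct.Struct(">IHII")


def read_node(buf, offset):
    """Decode one node at a known offset without parsing the rest of buf."""
    path_start, path_len, children_start, child_count = NODE.unpack_from(buf, offset)
    return {
        "path": bytes(buf[path_start:path_start + path_len]),
        # Children are stored contiguously, so their offsets can be computed.
        "children": [children_start + i * NODE.size for i in range(child_count)],
    }


# A root node "dir" with one child "dir/a", followed by the path bytes.
buf = NODE.pack(28, 3, 14, 1) + NODE.pack(31, 5, 0, 0) + b"dir" + b"dir/a"
root = read_node(buf, 0)
child = read_node(buf, root["children"][0])
```

Reaching `child` takes two fixed-size reads and two slices, however large the rest of the buffer is.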
> You say the Python implementation will offer no purposeful performance improvements,
> but how likely is it that it will be slower than the current format?
The current implementations (Python and C) of dirstate-v1 work by parsing the entire
dirstate into large Python dicts. The Python implementation of dirstate-v2 would do
the same, only parsing a different format.
> What level of performance degradation would be considered acceptable?
Good question. We don’t have hard criteria.
However this fallback implementation of dirstate-v2 will only be used for
accessing an existing local repository that uses that format. When creating a new
clone, dirstate-v2 is only used if a fast implementation is available.
> What happens if the docket and data file get out of sync somehow (maybe hg crashes in
> the middle of writing, or Google has a network write race)?
A docket that refers to a new data file is only swap-renamed into place after the data
file has been completely written.
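The ordering can be sketched as follows. The file names, the helper names, and the `fsync` calls are assumptions of this sketch, not a description of what Mercurial actually does:

```python
import os
import tempfile


def write_atomically(path, content):
    """Write content to a temporary file, then rename it over path."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # readers see either the old or the new docket
    except BaseException:
        os.unlink(tmp)
        raise


def update_dirstate(directory, data_name, data):
    """Write the data file completely first, and only then publish the docket."""
    with open(os.path.join(directory, data_name), "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    write_atomically(os.path.join(directory, "docket"), data_name.encode())
```

If the process dies before the rename, the old docket still points at the old, complete data file.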
I don’t know what ordering guarantees between writes exist on Google’s network
filesystem.
>> - A count of dead (unreachable) bytes
>> - A count of alive (reachable) bytes
> What are these two?
Only one of them is needed; the other can be deduced by subtracting it from the size of
the file. Unreachable means obsolete parts of the file that have been replaced by
other nodes, see "append-mostly" above. The heuristic for rewriting the whole file to
get rid of unreachable data is based on this counter.
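For illustration, a heuristic of this kind could look like the sketch below. The function names and the 50% threshold are hypothetical, not the values Mercurial uses:

```python
def unreachable_bytes(file_size, reachable):
    """Deduce one counter from the other and the size of the file."""
    return file_size - reachable


def should_rewrite(file_size, unreachable, max_ratio=0.5):
    """Rewrite the whole file once dead bytes exceed a fraction of its size."""
    return file_size > 0 and unreachable / file_size > max_ratio
```

A lower threshold keeps the file compact at the cost of more frequent full rewrites, and a higher one does the opposite.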
> Is there a good way of determining what the timestamp resolution of a
> filesystem is?
As far as I know there is not.
What we can do is create a temporary file and take its mtime as the current time with
the same (unknown) truncation as other files’ mtimes. If we observe a "current mtime"
strictly later than a given file’s mtime, we know that further changes to that file
are extremely likely[1] to cause a different mtime since the clock has already ticked
since the last change.
([1] The system clock is not monotonic, so it could jump back and still have the
same clock-reported date happen again. If we get unlucky another change to the file
could happen exactly then, modulo truncation.)
See comments starting at
https://www.mercurial-scm.org/repo/hg-committed/file/5fa083a5ff04/rust/hg-core/src/dirstate_tree/status.rs#l401
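A rough Python rendering of that check (the linked code is Rust; the function names here are hypothetical):

```python
import os
import tempfile


def filesystem_now_ns(directory):
    """Create a temporary file in directory and return its mtime.

    The value goes through the same (unknown) truncation as the mtime of
    any other file on that filesystem.
    """
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        return os.fstat(f.fileno()).st_mtime_ns


def mtime_is_reliable(file_mtime_ns, now_ns):
    """True if the clock has visibly ticked since the file was last changed."""
    return now_ns > file_mtime_ns
```

When the two values are equal, a later write could leave the mtime unchanged, so the cached mtime cannot be trusted to detect changes to that file.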
> (I don't know how various OSes treat
> these timestamps when the underlying filesystem doesn't support higher precision; is
> it 100% guaranteed that they just extend it with zeroes?)
Regardless, there’s also the case where the filesystem can store enough bits but the
kernel only updates an internal clock at some arbitrary ticks:
https://stackoverflow.com/a/14393315/1162888
>> - All of the info needed to get the previous state of a Removed
>> file in case we `hg add` it back
> Can you explain the use case for this (and/or what would be in it)? I would think
> that `hg rm foo && echo hi > foo && hg add foo` should be equivalent to `echo hi >
> foo`, but I might be missing something?
I still don’t fully understand this, but it also exists in dirstate-v1. I think it’s
relevant when in the middle of merging.
https://www.mercurial-scm.org/wiki/DirState#Summary
> My biggest concern is extensibility. As an example, as you were writing this up, you
> thought of something else to add, so we probably don't want to restrict ourselves too
> much :) The file format is already going to not be anything resembling fixed record
> size, having a section for generic key/value data that extensions can use might be
> quite useful (and maybe future core code, though I'm assuming the format can be such
> that this would be able to be made to work without the size/parsing complexity of
> key/value).
The proposed dirstate-v2 is based on fixed-size records. This is what enables
accessing parts of it without parsing.
A problem with file format extensibility is dealing with clients that don’t know
about a given extension. Since we can never rely on "new" fields being present or
kept up to date, they’re of rather limited use.
My opinion is that we can anticipate some things now-ish, and for further changes one
day we can make a dirstate-v3 format.
--
Simon Sapin
_______________________________________________
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel