On Mon, Jun 28, 2021 at 2:50 AM Raphaël Gomès <raphael.go...@octobus.net> wrote:
>
> Hello all,
>
> As you probably know my colleagues at Octobus and I have been working on
> a new version of the dirstate, and we're coming pretty close to
> something usable in production, so we need to freeze the format soon.
> This email is not meant to discuss the exact byte-per-byte layout
> details of the format, but rather its contents: what do you think should
> be included (or at least have space reserved for) in the new version?
>
> We have already discussed this at previous sprints and various other
> discussion channels, but I thought it'd be better to give a "last call"
> chance for people to get their voices heard.
>
> I remember Google people saying they'd like to separate information that
> is frequently written to a separate file to help with their filesystem
> shenanigans. What exactly would be the plan and can we do it easily? I
> may be pessimistic, but this looks like it would require a lot of work
> which (so far) no one wants to sponsor, though I'm happy to be proven
> wrong either way.
>
> To Matt Harbison: you said something about storing exec bit and symlink
> info explicitly to help platforms like Windows that don't have them,
> could you please elaborate?
>
> As a general recap (and to help understand some decisions), the new
> format will be an append-only tree with no stem compression for
> performance reasons. The Python implementation will be functional but
> very basic and will offer no purposeful performance improvements (unless
> someone wants to have fun!), as we currently only have the bandwidth for
> optimizing the Rust implementation.
>
> An overview of the current target (some implementation-detail level
> contents omitted):
>
>      - A docket file that contains global metadata about the dirstate:
>          - NodeID of the parents (32 bytes reserved, 20 used for now)
>          - A total count of files (including Removed ones)
>          - A count of dead (unreachable) bytes
>          - A count of alive (reachable) bytes
>          - A hash of ignore patterns (see
> https://phab.mercurial-scm.org/D10836)
>      - In the data file, for each directory/file (it can be both at the
> same time):
>          - The full path in bytes of the file (or directory)
>          - The full path of the copy source (optional)
>          - How many tracked recursive descendants it has
>          - How many recursive copies it has
>          - Exec bit
>          - mtime (probably up to nanosecond precision, both files and
> directories)
>          - Clean file size when applicable
>          - Its state: if it's removed, added, clean, etc.
>          - Whether it's from p1 or p2
>          - Whether it's ambiguous (it appears clean but the mtime is the
> same as the last status, probably will only happen with the Python
> implementation)
>          - All of the info needed to get the previous state of a Removed
> file in case we `hg add` it back
>          - (My idea as I type this: ) store the "raw bytes" version of
> the OS path if it differs from the normalized hg version (on Windows and
> MacOS for example) to cache the filefoldmap.
>
> I *think* that's it? I might be wrong, if so, please tell me!
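
To make the list above easier to discuss, here is a rough sketch of the proposed contents as Python dataclasses. It is only an illustration: every class, field, and function name is hypothetical, the zero-padding of node IDs is an assumption, and nothing here reflects the actual byte layout (which the original mail explicitly leaves out of scope). The information needed to restore a Removed file is omitted.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

NODE_ID_RESERVED = 32  # bytes reserved per node ID, per the list above
NODE_ID_USED = 20      # bytes used for now


def pad_node_id(node: bytes, reserved: int = NODE_ID_RESERVED) -> bytes:
    """Right-pad a node ID with zero bytes (padding scheme is an assumption)."""
    if len(node) > reserved:
        raise ValueError("node ID longer than the reserved space")
    return node + b"\x00" * (reserved - len(node))


@dataclass
class Docket:
    """Global metadata about the dirstate (hypothetical field names)."""
    parents: Tuple[bytes, bytes]   # node IDs of p1 and p2
    file_count: int                # total files, including Removed ones
    dead_bytes: int                # unreachable bytes in the data file
    alive_bytes: int               # reachable bytes in the data file
    ignore_patterns_hash: bytes


@dataclass
class Node:
    """One file and/or directory entry in the data file (hypothetical)."""
    path: bytes                          # full path
    copy_source: Optional[bytes] = None  # full path of the copy source
    tracked_descendants: int = 0         # recursive count
    copied_descendants: int = 0          # recursive count
    exec_bit: bool = False
    mtime_ns: Optional[int] = None       # up to nanosecond precision
    clean_size: Optional[int] = None     # only when applicable
    state: str = "clean"                 # removed, added, clean, etc.
    from_p2: bool = False                # False means p1
    ambiguous: bool = False              # mtime equals last status time
    os_path: Optional[bytes] = None      # raw OS path if it differs
```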

My recollection of previous discussions can be summarized as "the
dirstate file does multiple things: we should split it up."

Given the breadth of things tracked in this list, I'm a bit concerned
about the potential for write amplification, where changing something
small results in writing out a large number of bytes. But a lot of this
hinges on the layout of this file. If we start adding complexity to
the file layout to minimize I/O, I worry that we'd be reinventing a
bespoke data store and we'd be better served by splitting the content
or leveraging something designed for the purpose (like SQLite or
LevelDB or somesuch).
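
To illustrate the write amplification concern, here is a toy model of an append-only data file with the dead/alive byte counters from the docket. It is a sketch under my own assumptions (class and method names are made up, and it assumes a changed record is re-appended whole), not a description of the proposed implementation.

```python
from dataclasses import dataclass


@dataclass
class AppendOnlyStore:
    """Toy model: counters for an append-only file (hypothetical names)."""
    alive_bytes: int = 0    # bytes still reachable
    dead_bytes: int = 0     # bytes superseded by later appends
    written_bytes: int = 0  # cumulative bytes appended so far

    def append_new(self, size: int) -> None:
        """Append a brand new record of `size` bytes."""
        self.alive_bytes += size
        self.written_bytes += size

    def rewrite(self, old_size: int, new_size: int) -> None:
        """Supersede a record: the old copy turns dead, a new copy is appended."""
        self.alive_bytes += new_size - old_size
        self.dead_bytes += old_size
        self.written_bytes += new_size

    def dead_ratio(self) -> float:
        """Fraction of the file that is unreachable (could trigger a rewrite)."""
        total = self.alive_bytes + self.dead_bytes
        return self.dead_bytes / total if total else 0.0


def write_amplification(bytes_written: int, bytes_changed: int) -> float:
    """Bytes physically written per byte of logical change."""
    return bytes_written / bytes_changed
```

In this model, flipping a single one-byte flag in a 100-byte record appends 100 new bytes and leaves 100 dead ones behind, so the amplification factor is 100. Whether the real format behaves like this depends entirely on the layout.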

The only other thing I'd consider adding to this list is something
that could help unify with external filesystem tracking tools. Maybe
an append-only list of "externally monitored" filesystem changes
(found by Watchman) that could be used to speed up aspects of `hg
status`. I haven't thought too much about this and my comment may be
off base. But my recollection is that the way fsmonitor integrates
today is somewhat hacky. I suspect there's a way to integrate that
functionality more tightly into the "dirstate umbrella" so things are
less hacky.
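
A very rough sketch of what I mean by an append-only change list, purely as an assumption on my part: the function names and the newline-delimited journal format are invented for illustration and are not an existing Mercurial, fsmonitor, or Watchman API. It also assumes paths contain no newline bytes.

```python
def record_change(journal_path, path):
    """Append one externally observed change (a path, as bytes) to the journal."""
    with open(journal_path, "ab") as f:
        f.write(path + b"\n")


def paths_to_check(journal_path):
    """Return the de-duplicated set of paths the journal says may have changed."""
    try:
        with open(journal_path, "rb") as f:
            return {line.rstrip(b"\n") for line in f if line.strip()}
    except FileNotFoundError:
        # No journal yet: nothing has been reported as changed.
        return set()
```

The idea would be that a status run only needs to stat the paths returned by `paths_to_check` instead of walking the whole working directory, then truncates or rotates the journal once it has caught up.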
_______________________________________________
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
