Re: Call for comments on new dirstate format contents

2021-10-18 Thread Augie Fackler


> On Oct 18, 2021, at 4:16 AM, Pierre-Yves David wrote:
> 
> 
> On 10/15/21 2:22 PM, Pierre-Yves David wrote:
>> 
>> On 10/13/21 10:57 AM, Simon Sapin wrote:
>>> Please let us know of any question or comment!
>> 
>> 
>> I remember discussion about storing WC exec-bit and symlink status to help
>> systems without support for those (Windows, we are looking at you). That is
>> necessary to solve things like "issue5883".
>>
>> Storage-wise this should be fairly simple, so we should be able to reserve
>> some useful value for that in the new format. Regarding the implementation
>> of a behavior fixing the associated issues, it seems complicated to get
>> something done as the freeze is a couple of days away.
>>
>> I just remembered this and I am not actively working on it today, so I don't
>> have a very concrete idea about it yet. Matt Harbison might have more
>> concrete ideas about this.
> 
> Okay, so the core of the issue is: Windows has no way of storing the exec
> bit in the file system, and since we read it from the file system at commit
> time, we have no way to set it (or unset it) in general, and during merges in
> particular. The solution people have been leaning toward for years is "store
> the exec bit in the dirstate entry for the file and have some API (+ UI) to
> set/unset it".
>
> So it seems like we need to reserve at least 2 bits for this (WC_EXEC_BITS,
> WC_SYMLINK). Then comes the question of how we manage these two bits:
>
> 1) We could enforce them all the time, requiring `hg status` runs to
> synchronize the value between the fs and the dirstate when they differ, with
> 1 meaning set and 0 meaning unset. That approach requires more work but would
> allow a repository to move to a different fs without losing information.
>
> 2) We could have a third bit to indicate whether the feature is in use.
> Repositories on file systems that need it could start using the feature
> conditionally. This has the advantage that we just need to reserve the flags
> and can implement the feature later, with the code behaving as it does today
> if that third flag is unset. However, this means that moving a repository
> between file systems might lose information.
> In this scenario we would still need code to at least preserve the existing
> value in the code.
>
> I guess I'll start poking at (2) and see how much work it actually saves us
> compared to (1).

This feels like it comes back to past desires to separate the parts of dirstate 
that track user intent and the parts of dirstate that record caches of 
filesystem state, but it also seems awfully late in the process of v2 to 
address that here.

I read through the design: it seems like a good improvement over the old 
dirstate format, with my only gripe that we (still) haven’t resolved the 
cache-vs-userdata duality of dirstate. Seems fine enough to me.

AF

> 
> --
> Pierre-Yves David
> ___
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel



Re: Call for comments on new dirstate format contents

2021-10-18 Thread Pierre-Yves David


On 10/15/21 2:22 PM, Pierre-Yves David wrote:


On 10/13/21 10:57 AM, Simon Sapin wrote:

Please let us know of any question or comment!



I remember discussion about storing WC exec-bit and symlink status to
help systems without support for those (Windows, we are looking at
you). That is necessary to solve things like "issue5883".


Storage-wise this should be fairly simple, so we should be able to
reserve some useful value for that in the new format. Regarding the
implementation of a behavior fixing the associated issues, it seems
complicated to get something done as the freeze is a couple of days away.


I just remembered this and I am not actively working on it today, so I
don't have a very concrete idea about it yet. Matt Harbison might have
more concrete ideas about this.


Okay, so the core of the issue is: Windows has no way of storing the
exec bit in the file system, and since we read it from the file system
at commit time, we have no way to set it (or unset it) in general, and
during merges in particular. The solution people have been leaning toward
for years is "store the exec bit in the dirstate entry for the file and
have some API (+ UI) to set/unset it".


So it seems like we need to reserve at least 2 bits for this
(WC_EXEC_BITS, WC_SYMLINK). Then comes the question of how we manage
these two bits:


1) We could enforce them all the time, requiring `hg status` runs to
synchronize the value between the fs and the dirstate when they differ,
with 1 meaning set and 0 meaning unset. That approach requires more work
but would allow a repository to move to a different fs without losing
information.


2) We could have a third bit to indicate whether the feature is in use.
Repositories on file systems that need it could start using the feature
conditionally. This has the advantage that we just need to reserve the
flags and can implement the feature later, with the code behaving as it
does today if that third flag is unset. However, this means that moving a
repository between file systems might lose information.
In this scenario we would still need code to at least preserve the
existing value in the code.


I guess I'll start poking at (2) and see how much work it actually saves
us compared to (1).
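
To make option (2) more concrete, here is a minimal sketch of how three reserved bits might be interpreted. Only the names WC_EXEC_BITS and WC_SYMLINK come from the message above; the bit positions, the `WC_HAS_FALLBACK` name and both helper functions are hypothetical and are not taken from the actual dirstate-v2 format.

```python
import stat

# Hypothetical bit positions; only the names WC_EXEC_BITS and WC_SYMLINK
# appear in the discussion.
WC_EXEC_BITS = 1 << 0     # recorded exec bit
WC_SYMLINK = 1 << 1       # recorded symlink status
WC_HAS_FALLBACK = 1 << 2  # hypothetical "feature in use" bit from option (2)


def effective_exec(flags, st_mode, fs_supports_exec):
    """Return the exec bit to use at commit time.

    When the file system can store the exec bit, or the fallback bit is
    unset, behave as today and trust the file system.
    """
    if fs_supports_exec or not flags & WC_HAS_FALLBACK:
        return bool(st_mode & stat.S_IXUSR)
    return bool(flags & WC_EXEC_BITS)


def set_fallback_exec(flags, value):
    """Record an explicit exec bit, as an API or UI command might do."""
    flags |= WC_HAS_FALLBACK
    if value:
        return flags | WC_EXEC_BITS
    return flags & ~WC_EXEC_BITS
```

With the third bit unset, the code path is the same as today, which is what would allow reserving the flags now and implementing the behavior later.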


--
Pierre-Yves David


Re: Call for comments on new dirstate format contents

2021-10-15 Thread Pierre-Yves David


On 10/13/21 10:57 AM, Simon Sapin wrote:

On 28/06/2021 11:49, Raphaël Gomès wrote:

Hello all,

As you probably know my colleagues at Octobus and I have been working on
a new version of the dirstate, and we're coming pretty close to
something usable in production, so we need to freeze the format soon.


Hello again,

Together with the Rust implementation of the new status algorithm, 
this dirstate-v2 file format enables great performance improvements of 
`hg status` on large repositories.


We at Octobus are hoping to stabilize it very soon after a few remaining
changes, so that the format will not be experimental anymore in the 
upcoming Mercurial 6.0 release. It will not yet be enabled by default, 
but future Mercurial versions will need to be compatible both ways 
with 6.0 when accessing a given local repository that uses dirstate-v2.



A short user guide (how to enable, upgrade, or downgrade) as well as 
detailed documentation of the file format can be found at:


https://www.mercurial-scm.org/repo/hg-committed/file/tip/mercurial/helptext/internals/dirstate-v2.txt 



… or in a source repository by running `make local` then `./hg help 
internals.dirstate-v2`



The remaining format changes we have planned are:

* Add sub-second precision to stored file/symlink mtime, and share its 
location with that of directory mtime. (This part of the format is a 
bit of a mess right now since we’re in the middle of this change.)


* Maybe add a flag bit to allow marking files as "known modified at 
this mtime". `hg status` sometimes needs to read the contents of files 
in case of possible size-preserving changes. If there is indeed a 
change, currently this read is repeated every time status runs again. 
The new bit would record that result.


* Maybe add some node-specific or dirstate-wide flags or a "mode 
switch" to make the format and its storage of directory mtimes less 
tied to details of the current readdir-skipping optimization. (For 
example, a future version of Mercurial might want to add dirstate 
nodes for unknown or/and ignored files to skip readdir in more cases.)



Non-format changes that we want to have in 6.0:

* Merge D11520 and the rest of that stack to have a Python 
implementation of the format, so that repositories that use it are 
usable when Rust extensions are not enabled. This is slower, in the 
order of 0.1 to 0.3 seconds added to `hg status` commands taking 0.4 
to 2.5 seconds with dirstate-v1 without Rust on various repositories.


* Add configuration to either abort, warn, or silently continue when 
this slow code path is or would be used. And decide its default. I’m 
personally inclined at least not to abort by default since the slow 
path is not *horribly* slow.



Please let us know of any question or comment!



I remember discussion about storing WC exec-bit and symlink status to
help systems without support for those (Windows, we are looking at you).
That is necessary to solve things like "issue5883".


Storage-wise this should be fairly simple, so we should be able to
reserve some useful value for that in the new format. Regarding the
implementation of a behavior fixing the associated issues, it seems
complicated to get something done as the freeze is a couple of days away.


I just remembered this and I am not actively working on it today, so I
don't have a very concrete idea about it yet. Matt Harbison might have
more concrete ideas about this.


Cheers,

--
Pierre-Yves David



Re: Call for comments on new dirstate format contents

2021-10-13 Thread Simon Sapin

On 28/06/2021 11:49, Raphaël Gomès wrote:

Hello all,

As you probably know my colleagues at Octobus and I have been working on
a new version of the dirstate, and we're coming pretty close to
something usable in production, so we need to freeze the format soon.


Hello again,

Together with the Rust implementation of the new status algorithm, this dirstate-v2 
file format enables great performance improvements of `hg status` on large repositories.


We at Octobus are hoping to stabilize it very soon after a few remaining changes, so
that the format will not be experimental anymore in the upcoming Mercurial 6.0 
release. It will not yet be enabled by default, but future Mercurial versions will 
need to be compatible both ways with 6.0 when accessing a given local repository that 
uses dirstate-v2.



A short user guide (how to enable, upgrade, or downgrade) as well as detailed 
documentation of the file format can be found at:


https://www.mercurial-scm.org/repo/hg-committed/file/tip/mercurial/helptext/internals/dirstate-v2.txt

… or in a source repository by running `make local` then `./hg help 
internals.dirstate-v2`



The remaining format changes we have planned are:

* Add sub-second precision to stored file/symlink mtime, and share its location with 
that of directory mtime. (This part of the format is a bit of a mess right now since 
we’re in the middle of this change.)


* Maybe add a flag bit to allow marking files as "known modified at this mtime". `hg 
status` sometimes needs to read the contents of files in case of possible 
size-preserving changes. If there is indeed a change, currently this read is repeated 
every time status runs again. The new bit would record that result.


* Maybe add some node-specific or dirstate-wide flags or a "mode switch" to make the 
format and its storage of directory mtimes less tied to details of the current 
readdir-skipping optimization. (For example, a future version of Mercurial might want 
to add dirstate nodes for unknown or/and ignored files to skip readdir in more cases.)
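
The "known modified at this mtime" flag from the list above could work roughly as in the following sketch. The `entry` dict, its keys and the `contents_differ` callback are hypothetical names for illustration and do not reflect the real format or the real status code.

```python
def status(entry, st_size, st_mtime, contents_differ):
    """Return "clean" or "modified", updating `entry` in place.

    `entry` is a hypothetical record with "size", "mtime" and
    "known_modified" keys; `contents_differ` performs the expensive read.
    """
    if st_size != entry["size"]:
        return "modified"  # size changed: no need to read the file
    if st_mtime == entry["mtime"]:
        # Same mtime as recorded: reuse the cached verdict.
        return "modified" if entry["known_modified"] else "clean"
    # Possible size-preserving change: read the contents once and
    # remember the result for this mtime.
    modified = contents_differ()
    entry["mtime"] = st_mtime
    entry["known_modified"] = modified
    return "modified" if modified else "clean"
```

Without the flag, the "modified" verdict has nowhere to be stored, so the read would be repeated on every status run until the file changes again.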



Non-format changes that we want to have in 6.0:

* Merge D11520 and the rest of that stack to have a Python implementation of the 
format, so that repositories that use it are usable when Rust extensions are not 
enabled. This is slower, in the order of 0.1 to 0.3 seconds added to `hg status` 
commands taking 0.4 to 2.5 seconds with dirstate-v1 without Rust on various repositories.


* Add configuration to either abort, warn, or silently continue when this slow code 
path is or would be used. And decide its default. I’m personally inclined at least 
not to abort by default since the slow path is not *horribly* slow.



Please let us know of any question or comment!

--
Simon Sapin


Re: Call for comments on new dirstate format contents

2021-06-30 Thread Gregory Szorc
On Mon, Jun 28, 2021 at 2:50 AM Raphaël Gomès wrote:
>
> Hello all,
>
> As you probably know my colleagues at Octobus and I have been working on
> a new version of the dirstate, and we're coming pretty close to
> something usable in production, so we need to freeze the format soon.
> This email is not meant to discuss the exact byte-per-byte layout
> details of the format, but rather its contents: what do you think should
> be included (or at least have space reserved for) in the new version?
>
> We have already discussed this at previous sprints and various other
> discussion channels, but I thought it'd be better to give a "last call"
> chance for people to get their voices heard.
>
> I remember Google people saying they'd like to separate information that
> is frequently written to a separate file to help with their filesystem
> shenanigans. What exactly would be the plan and can we do it easily? I
> may be pessimistic, but this looks like it would require a lot of work
> which (so far) no one wants to sponsor, though I'm happy to be proven
> wrong either way.
>
> To Matt Harbison: you said something about storing exec bit and symlink
> info explicitly to help platforms like Windows that don't have them,
> could you please elaborate?
>
> As a general recap (and to help understand some decisions), the new
> format will be an append-only tree with no stem compression for
> performance reasons. The Python implementation will be functional but
> very basic and will offer no purposeful performance improvements (unless
> someone wants to have fun!), as we currently only have the bandwidth for
> optimizing the Rust implementation.
>
> An overview of the current target (some implementation-detail level
> contents omitted):
>
>  - A docket file that contains global metadata about the dirstate:
>    - NodeID of the parents (32 bytes reserved, 20 used for now)
>    - A total count of files (including Removed ones)
>    - A count of dead (unreachable) bytes
>    - A count of alive (reachable) bytes
>    - A hash of ignore patterns (see
>      https://phab.mercurial-scm.org/D10836)
>  - In the data file, for each directory/file (it can be both at the
>    same time):
>    - The full path in bytes of the file (or directory)
>    - The full path of the copy source (optional)
>    - How many tracked recursive descendants it has
>    - How many recursive copies it has
>    - Exec bit
>    - mtime (probably up to nanosecond precision, both files and
>      directories)
>    - Clean file size when applicable
>    - Its state: if it's removed, added, clean, etc.
>    - Whether it's from p1 or p2
>    - Whether it's ambiguous (it appears clean but the mtime is the
>      same as the last status, probably will only happen with the Python
>      implementation)
>    - All of the info needed to get the previous state of a Removed
>      file in case we `hg add` it back
>    - (My idea as I type this: ) store the "raw bytes" version of
>      the OS path if it differs from the normalized hg version (on Windows and
>      MacOS for example) to cache the filefoldmap.
>
> I *think* that's it? I might be wrong, if so, please tell me!

My recollection of previous discussions can be summarized as "the
dirstate file does multiple things: we should split it up."

Given the breadth of things tracked in this list, I'm a bit concerned
about potential for write amplification where changing something small
results in writing out a large number of bytes. But a lot of this
hinges on the layout of this file. If we start adding complexity to
the file layout to minimize I/O, I worry that we'd be reinventing a
bespoke data store and we'd be better served by splitting the content
or leveraging something designed for the purpose (like SQLite or
LevelDB or somesuch).

The only other thing I'd consider adding to this list is something
that could help unify with external filesystem tracking tools. Maybe
an append only list of "externally monitored" filesystem changes
[found from watchman] that could be used to speed up aspects of `hg
status`. I haven't thought too much about this and my comment may be
off base. But my recollection is that the way fsmonitor integrates
today is somewhat hacky. I suspect there's a way to integrate that
functionality more tightly into the "dirstate umbrella" so things are
less hacky.


Re: Call for comments on new dirstate format contents

2021-06-29 Thread Simon Sapin

On 29/06/2021 20:48, Kyle Lippincott wrote:
> Can you elaborate a bit on what this append-only tree looks like (and why that's
> preferred)


It’s a tree in that there are nodes for files and directories. We can quickly find 
root nodes, and from a given node we can quickly find its direct child nodes, all 
without parsing the entire file.


It’s "append-mostly" because changes are made by adding new nodes at the end of the 
file and reusing nodes for unchanged sub-trees. Nodes that have been replaced become 
unreachable but still take up space. Occasionally based on some heuristic, the whole 
file is rewritten without unreachable nodes. This makes most writes cheaper than 
re-serializing and writing the entire file.
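
As a rough illustration of the append-mostly idea, the sketch below uses a flat key/value byte store rather than a tree, so it is a simplification and not the dirstate-v2 layout. The class name, the `max_dead_ratio` threshold and the compaction rule are assumptions made for this example.

```python
class AppendMostlyStore:
    """Toy byte store: updates append, replaced records become unreachable."""

    def __init__(self, max_dead_ratio=0.5):
        self.data = bytearray()
        self.live = {}       # key -> (offset, length) of the current record
        self.dead_bytes = 0  # bytes still in `data` but no longer reachable
        self.max_dead_ratio = max_dead_ratio  # hypothetical heuristic

    def put(self, key, payload):
        old = self.live.get(key)
        if old is not None:
            self.dead_bytes += old[1]  # old record stays in place, unreachable
        self.live[key] = (len(self.data), len(payload))
        self.data += payload
        if self.data and self.dead_bytes / len(self.data) > self.max_dead_ratio:
            self.compact()

    def get(self, key):
        offset, length = self.live[key]
        return bytes(self.data[offset:offset + length])

    def compact(self):
        """Rewrite everything, keeping only reachable records."""
        fresh = bytearray()
        for key, (offset, length) in list(self.live.items()):
            self.live[key] = (len(fresh), length)
            fresh += self.data[offset:offset + length]
        self.data = fresh
        self.dead_bytes = 0
```

Most `put` calls only append; the full rewrite happens only when the share of unreachable bytes crosses the threshold.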




> and why stem compression would cause performance issues?


Each node contains its full path from the repository root. This allows status code to 
pass around a slice (pointer + length) to the middle of the mmap’ed file. If a node 
only had its basename we’d have to allocate a string to reconstitute a path by 
concatenating the names of ancestor directories. The cost of many memory allocations 
can add up.
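
The difference can be sketched in Python with a `memoryview`, which stands in here for a slice into an mmap’ed file. The buffer contents and both helper names are made up for this example.

```python
# Two node paths stored back to back, as in a data file that keeps the
# full path of every node.
data = b"dir/sub/file.txt" b"dir/other.txt"
view = memoryview(data)  # stands in for an mmap'ed file


def full_path(view, offset, length):
    """Return the path as a slice of the buffer, without copying the bytes."""
    return view[offset:offset + length]


def joined_path(components):
    """What a basename-only format would need: build a new bytes object."""
    return b"/".join(components)
```

`full_path(view, 0, 16)` refers to the same underlying buffer, while `joined_path` allocates a fresh string on every call.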




> When loading this new dirstate, would it require loading the entire thing from the
> beginning and replacing entries with the newer ones?


No, that’s the point of making it a tree of fixed-size nodes that contain data at 
fixed-size offsets, with pseudo-pointers for variable-size data (paths and child nodes).
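
A minimal sketch of that idea follows, using a made-up node layout. The field order and widths below are assumptions for illustration and do not match the real dirstate-v2 node format.

```python
import struct

# Hypothetical node layout, NOT the real dirstate-v2 one:
# path offset (u32), path length (u16), children offset (u32), child count (u32)
NODE = struct.Struct(">IHII")


def read_node(buf, offset):
    """Decode one fixed-size node and follow its pseudo-pointer to the path."""
    path_start, path_len, children_start, child_count = NODE.unpack_from(buf, offset)
    path = bytes(buf[path_start:path_start + path_len])
    return path, children_start, child_count


def children(buf, children_start, child_count):
    """Yield child nodes, assumed to be stored contiguously."""
    for i in range(child_count):
        yield read_node(buf, children_start + i * NODE.size)
```

Because every node has the same size, the i-th child is found by arithmetic on offsets, and nothing outside the visited nodes has to be parsed.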



> You say the Python implementation will offer no purposeful performance improvements,
> but how likely is it that it will be slower than the current format?


The current implementations (Python and C) of dirstate-v1 work by parsing the entire 
dirstate into large Python dicts. The Python implementation of dirstate-v2 would do 
the same, only parsing a different format.




> What level of performance degradation would be considered acceptable?


Good question. We don’t have hard criteria.

However, this fallback implementation of dirstate-v2 will only be used when
accessing an existing local repository that uses that format. When creating a new
clone, dirstate-v2 is only used if a fast implementation is available.



> What happens if the docket and data file get out of sync somehow (maybe hg crashes in
> the middle of writing, or Google has a network write race)?


A docket that refers to a new data file is only swap-renamed after the data file was 
finished writing.
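
The ordering can be sketched as below. The file names, the `write_dirstate` helper and the explicit `fsync` calls are assumptions for this example; the message above only states that the docket is swap-renamed after the data file has been written.

```python
import os
import tempfile


def write_dirstate(repo_dir, data_name, data, docket):
    """Write the data file completely, then atomically swap in the docket."""
    data_path = os.path.join(repo_dir, data_name)
    with open(data_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # assumption: flush data before publishing it

    fd, tmp_path = tempfile.mkstemp(dir=repo_dir, prefix="docket-")
    with os.fdopen(fd, "wb") as f:
        f.write(docket)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is an atomic rename on POSIX: readers see either the old
    # docket or the new one, never a partially written file.
    os.replace(tmp_path, os.path.join(repo_dir, "dirstate"))
```

A crash before the final rename leaves the old docket pointing at the old, still complete, data file.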


I don’t know what ordering guarantees between writes exist or not on Google’s network 
filesystem.




>>   - A count of dead (unreachable) bytes
>>   - A count of alive (reachable) bytes
>
> What are these two?


Only one of them is needed, the other can be deduced by subtracting from the size of 
the file. Unreachable means obsolete parts of the file that have been replaced by 
other nodes, see "append-mostly" above. The heuristic for rewriting the whole file to 
get rid of unreachable data is based on this counter.




> Is there a good way of determining what the timestamp resolution of a
> filesystem is?


As far as I know there is not.

What we can do is create a temporary file and take its mtime as the current time with
the same (unknown) truncation as other files’ mtimes. If we observe a "current mtime"
strictly later than a given file’s mtime, we know that further changes to that file
are extremely likely[1] to cause a different mtime since the clock has already ticked
since the last change.


([1] The system clock is not monotonic, so it could jump back and still have the
same clock-reported date happen again. If we get unlucky another change to the file
could happen exactly then, modulo truncation.)
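
The temporary-file trick described above can be sketched as follows. Both function names are hypothetical, and this is only an illustration of the idea, not the code referenced in status.rs.

```python
import os
import tempfile


def filesystem_now_ns(directory):
    """Current time as this file system would record it in an mtime."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        return os.fstat(fd).st_mtime_ns
    finally:
        os.close(fd)
        os.unlink(path)


def mtime_is_reliable(file_mtime_ns, fs_now_ns):
    """True if a later write is very likely to produce a different mtime."""
    return fs_now_ns > file_mtime_ns
```

When `mtime_is_reliable` is false, the file's mtime equals the file system's notion of "now", so a change within the same tick could go unnoticed and the mtime should not be trusted as a cache key.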


See comments starting at
https://www.mercurial-scm.org/repo/hg-committed/file/5fa083a5ff04/rust/hg-core/src/dirstate_tree/status.rs#l401



> (I don't know how various OSes treat
> these timestamps when the underlying filesystem doesn't support higher precision; is
> it 100% guaranteed that they just extend it with zeroes?)


Regardless, there’s also the case where the filesystem can store enough bits but the 
kernel only updates an internal clock at some arbitrary ticks:


https://stackoverflow.com/a/14393315/1162888



>>   - All of the info needed to get the previous state of a Removed
>> file in case we `hg add` it back
>
> Can you explain the use case for this (and/or what would be in it)? I would think
> that `hg rm foo && echo hi > foo && hg add foo` should be equivalent to `echo hi >
> foo`, but I might be missing something?


I still don’t fully understand this, but it also exists in dirstate-v1. I think it’s 
relevant when in the middle of merging.


https://www.mercurial-scm.org/wiki/DirState#Summary


> My biggest concern is extensibility. As an example, as you were writing this up, you
> thought of something else to add, so we probably don't want to restrict ourselves too
> much :) The file format is already going to not be anything resembling fixed record
> size, having a s

Re: Call for comments on new dirstate format contents

2021-06-29 Thread Kyle Lippincott
On Mon, Jun 28, 2021 at 2:50 AM Raphaël Gomès wrote:

> Hello all,
>
> As you probably know my colleagues at Octobus and I have been working on
> a new version of the dirstate, and we're coming pretty close to
> something usable in production, so we need to freeze the format soon.
> This email is not meant to discuss the exact byte-per-byte layout
> details of the format, but rather its contents: what do you think should
> be included (or at least have space reserved for) in the new version?
>
> We have already discussed this at previous sprints and various other
> discussion channels, but I thought it'd be better to give a "last call"
> chance for people to get their voices heard.
>
> I remember Google people saying they'd like to separate information that
> is frequently written to a separate file to help with their filesystem
> shenanigans. What exactly would be the plan and can we do it easily? I
> may be pessimistic, but this looks like it would require a lot of work
> which (so far) no one wants to sponsor, though I'm happy to be proven
> wrong either way.
>

The original thinking had been that we'd have two or three files:
1. p1/p2
2. anything the user did (`hg mv/cp/add/rm`)
3. anything hg can generate in `hg debugrebuilddirstate`

The thinking was that #3 could either be generated by the filesystem
itself, or, if there was a network write race (when using filesystems like
our internal CitC filesystem, or maybe with things like NFS, if it can
determine write races, I'm honestly not sure...), it could just let
one side win arbitrarily.

After learning more, neither of those really works for us. Our virtual
filesystem is "dumb" - it honestly knows very little about the files it's
being asked to store, so it would be a huge change to have it track enough
information to feasibly produce something that could replace the data in #3
above. Additionally, in the network write race scenario, letting one side
win arbitrarily just opens you up to dirstate corruption, which is not a
place anyone wants to be in. :) In the network write race case, we could
teach the virtual filesystem server to delete/poison the file (triggering a
dirstate rebuild on the next command), but it's probably not worthwhile at
this point.

I was then thinking that we could just store #3 "where it belongs", in
.hg/wcache, and just not replicate it. That still opens you up for dirstate
corruption issues (modify the working directory on machine A, and then use
it on machine B - we still need some way of telling machine B it's out of
date; that could be a timestamp in the non-cache part of the dirstate, I
guess?).


> To Matt Harbison: you said something about storing exec bit and symlink
> info explicitly to help platforms like Windows that don't have them,
> could you please elaborate?
>
> As a general recap (and to help understand some decisions), the new
> format will be an append-only tree with no stem compression for
> performance reasons.


Can you elaborate a bit on what this append-only tree looks like (and why
that's preferred) and why stem compression would cause performance issues?

When loading this new dirstate, would it require loading the entire thing
from the beginning and replacing entries with the newer ones? IMHO, we
should be optimizing as much as possible for the read performance, even if
it costs some small amount of write performance. Writes seem less frequent
to me (and more tolerant of slightly higher latency) than things like `hg
status` (being executed by an IDE, or by someone in `watch`, or in a shell
prompt, or something...)


> The Python implementation will be functional but
> very basic and will offer no purposeful performance improvements (unless
> someone wants to have fun!), as we currently only have the bandwidth for
> optimizing the Rust implementation.
>

You say the Python implementation will offer no purposeful performance
improvements, but how likely is it that it will be slower than the current
format? What level of performance degradation would be considered
acceptable?


>
> An overview of the current target (some implementation-detail level
> contents omitted):
>
>  - A docket file that contains global metadata about the dirstate:
>

What happens if the docket and data file get out of sync somehow (maybe hg
crashes in the middle of writing, or Google has a network write race)?


>  - NodeID of the parents (32 bytes reserved, 20 used for now)
>  - A total count of files (including Removed ones)

>  - A count of dead (unreachable) bytes
>  - A count of alive (reachable) bytes
>

What are these two?


>  - A hash of ignore patterns (see
> https://phab.mercurial-scm.org/D10836)

>  - In the data file, for each directory/file (it can be both at the
> same time):
>  - The full path in bytes of the file (or directory)
>  - The full path of the copy source (optional)
>  - How many tracked recursive descendants it has
>  - 

Re: Call for comments on new dirstate format contents

2021-06-28 Thread Matt Harbison
On Mon, Jun 28, 2021 at 5:49 AM Raphaël Gomès wrote:
>
> To Matt Harbison: you said something about storing exec bit and symlink
> info explicitly to help platforms like Windows that don't have them,
> could you please elaborate?

The idea is essentially this, without having to steal
undefined/undocumented bits:

https://www.mercurial-scm.org/wiki/DirState#Proposed_extensions

I'm not sure what the point of having fallback vs real(?) bits was,
but I guess it could be useful to disambiguate on a system that *does*
support +x or symlink.  Support for this would fix issue2020,
issue5883, and maybe a few other corner cases.

> As a general recap (and to help understand some decisions), the new
> format will be an append-only tree with no stem compression for
> performance reasons. The Python implementation will be functional but
> very basic and will offer no purposeful performance improvements (unless
> someone wants to have fun!), as we currently only have the bandwidth for
> optimizing the Rust implementation.

Is a toggle-able bit like this a hassle for an append-only data
structure?  I suppose it's not much different than adding and removing
a file several times, but I haven't paid a lot of attention to this
discussion up to this point.


Call for comments on new dirstate format contents

2021-06-28 Thread Raphaël Gomès

Hello all,

As you probably know my colleagues at Octobus and I have been working on 
a new version of the dirstate, and we're coming pretty close to 
something usable in production, so we need to freeze the format soon. 
This email is not meant to discuss the exact byte-per-byte layout 
details of the format, but rather its contents: what do you think should 
be included (or at least have space reserved for) in the new version?


We have already discussed this at previous sprints and various other 
discussion channels, but I thought it'd be better to give a "last call" 
chance for people to get their voices heard.


I remember Google people saying they'd like to separate information that 
is frequently written to a separate file to help with their filesystem 
shenanigans. What exactly would be the plan and can we do it easily? I 
may be pessimistic, but this looks like it would require a lot of work 
which (so far) no one wants to sponsor, though I'm happy to be proven 
wrong either way.


To Matt Harbison: you said something about storing exec bit and symlink 
info explicitly to help platforms like Windows that don't have them, 
could you please elaborate?


As a general recap (and to help understand some decisions), the new 
format will be an append-only tree with no stem compression for 
performance reasons. The Python implementation will be functional but 
very basic and will offer no purposeful performance improvements (unless 
someone wants to have fun!), as we currently only have the bandwidth for 
optimizing the Rust implementation.


An overview of the current target (some implementation-detail level 
contents omitted):


    - A docket file that contains global metadata about the dirstate:
        - NodeID of the parents (32 bytes reserved, 20 used for now)
        - A total count of files (including Removed ones)
        - A count of dead (unreachable) bytes
        - A count of alive (reachable) bytes
        - A hash of ignore patterns (see
          https://phab.mercurial-scm.org/D10836)
    - In the data file, for each directory/file (it can be both at the
      same time):
        - The full path in bytes of the file (or directory)
        - The full path of the copy source (optional)
        - How many tracked recursive descendants it has
        - How many recursive copies it has
        - Exec bit
        - mtime (probably up to nanosecond precision, both files and
          directories)
        - Clean file size when applicable
        - Its state: if it's removed, added, clean, etc.
        - Whether it's from p1 or p2
        - Whether it's ambiguous (it appears clean but the mtime is the
          same as the last status, probably will only happen with the
          Python implementation)
        - All of the info needed to get the previous state of a Removed
          file in case we `hg add` it back
        - (My idea as I type this: ) store the "raw bytes" version of
          the OS path if it differs from the normalized hg version (on
          Windows and MacOS for example) to cache the filefoldmap.


I *think* that's it? I might be wrong, if so, please tell me!

Raphaël
