Re: [PATCH v3] packed_ref_cache: don't use mmap() for small files
Michael Haggerty writes:

> The change to using `read()` rather than `mmap()` for small
> `packed-refs` feels like it should be an improvement, but it occurred to
> me that the performance numbers quoted in ea68b0ce9f8 (hash-object:
> don't use mmap() for small files, 2010-02-21) are not directly
> applicable to the `packed-refs` file. As far as I understand, the file
> mmapped in `index_fd()` is always read in full, whereas the main point
> of mmapping the packed-refs file is to avoid having to read the whole
> file at all in some situations. That being said, a 32 KiB file would
> only be 8 pages (assuming a page size of 4 KiB), and by the time you've
> read the header and binary-searched to find the desired record, you've
> probably paged in most of the file anyway. Reading the whole file at
> once, in order, is almost certainly cheaper.

Yup.  So unless your "small" is meaningfully large, we are likely to
be better off with read(2), but I suspect that this might not even be
measurable since we are only talking about "small" files.
Re: [PATCH v3] packed_ref_cache: don't use mmap() for small files
On 01/22/2018 08:31 PM, Junio C Hamano wrote:
> Michael Haggerty writes:
>
>> `snapshot->buf` can still be NULL if the `packed-refs` file didn't exist
>> (see the earlier code path in `load_contents()`). So either that code
>> path *also* has to get the `xmalloc()` treatment, or my third patch is
>> still necessary. (My second patch wouldn't be necessary because the
>> ENOENT case makes `load_contents()` return 0, triggering the early exit
>> from `create_snapshot()`.)
>>
>> I don't have a strong preference either way.
>
> Which would be a two-liner, like the attached, which does not look
> too bad by itself.
>
> The direction, if we take this approach, means that we are declaring
> that .buf being NULL is an invalid state for a snapshot to be in,
> instead of saying "an empty snapshot looks exactly like one that was
> freshly initialized", which seems to be the intention of the original
> design.
>
> After Kim's fix and with 3/3 in your follow-up series, various
> helpers are still unsafe against .buf being NULL, like
> sort_snapshot(), verify_buffer_safe(), clear_snapshot_buffer() (only
> when mmapped bit is set), find_reference_location().
>
> packed_ref_iterator_begin() checks if snapshot->buf is NULL and
> returns early.  At the first glance, this appears a useful short cut
> to optimize the empty case away, but the check also is acting as a
> guard to prevent a snapshot with NULL .buf from being fed to an
> unsafe find_reference_location().  An implicit guard like this feels
> a bit more brittle than my liking.  If we ensure .buf is never NULL,
> that check can become a pure short-cut optimization and stop being
> a correctness thing.
>
> So...
>
>  refs/packed-backend.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/refs/packed-backend.c b/refs/packed-backend.c
> index b6e2bc3c1d..1eeb5c7f80 100644
> --- a/refs/packed-backend.c
> +++ b/refs/packed-backend.c
> @@ -473,12 +473,11 @@ static int load_contents(struct snapshot *snapshot)
>  	if (fd < 0) {
>  		if (errno == ENOENT) {
>  			/*
> -			 * This is OK; it just means that no
> -			 * "packed-refs" file has been written yet,
> -			 * which is equivalent to it being empty,
> -			 * which is its state when initialized with
> -			 * zeros.
> +			 * Treat missing "packed-refs" as equivalent to
> +			 * it being empty.
>  			 */
> +			snapshot->eof = snapshot->buf = xmalloc(0);
> +			snapshot->mmapped = 0;
>  			return 0;
>  		} else {
>  			die_errno("couldn't read %s", snapshot->refs->path);
>

That would work, though if you go this way, please also change the
docstring for `snapshot::buf`, which still says that `buf` and `eof` can
be `NULL`.

The other alternative, making `snapshot` safe for NULLs, becomes easier
if `snapshot` stores a pointer to the start of the reference section of
the `packed-refs` contents (i.e., after the header line), rather than
repeatedly computing that address from `snapshot->buf +
snapshot->header_len`. With this change, code that is technically
undefined when the fields are NULL can more easily be replaced with code
that is safe for NULL. For example,

    pos = snapshot->buf + snapshot->header_len

becomes

    pos = snapshot->start

, and

    len = snapshot->eof - pos;
    if (!len)
            [...]

becomes

    if (pos == snapshot->eof)
            [...]
    len = snapshot->eof - pos;

. In this way, most of the special-casing for NULL goes away (and some
code becomes simpler, as well). In a moment I'll send a patch series
illustrating this approach. I think patches 01, 02, and 04 are
improvements regardless of whether we decide to make NULL safe.
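The NULL-safe variant described above can be sketched roughly as follows. This is only an illustration, not the patch series itself: `struct snapshot_sketch` and both helper names are hypothetical, and the struct is a simplified stand-in for git's `struct snapshot`. The point it tries to show is that comparing `start` with `eof` for equality is well defined even when both are NULL, whereas `buf + header_len` and `eof - pos` are not guaranteed to be.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for git's `struct snapshot`. */
struct snapshot_sketch {
	char *buf;   /* start of the file contents, or NULL if there are none */
	char *start; /* first byte after the header line, or NULL */
	char *eof;   /* one past the last byte, or NULL */
};

/*
 * Return 1 if the snapshot holds no reference records. Equality
 * comparison of two pointers is well defined even when both are NULL,
 * so no arithmetic is ever performed on a NULL pointer here.
 */
int snapshot_is_empty_sketch(const struct snapshot_sketch *s)
{
	assert(s != NULL);
	return s->start == s->eof;
}

/*
 * Length of the reference section. The subtraction is only reached
 * for non-empty snapshots, where both pointers refer to the same
 * buffer.
 */
size_t snapshot_refs_len_sketch(const struct snapshot_sketch *s)
{
	if (snapshot_is_empty_sketch(s))
		return 0;
	return (size_t)(s->eof - s->start);
}
```

With this layout a zero-initialized struct already behaves like an empty snapshot, so callers need no separate NULL check before asking for the length.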
The change to using `read()` rather than `mmap()` for small
`packed-refs` feels like it should be an improvement, but it occurred to
me that the performance numbers quoted in ea68b0ce9f8 (hash-object:
don't use mmap() for small files, 2010-02-21) are not directly
applicable to the `packed-refs` file. As far as I understand, the file
mmapped in `index_fd()` is always read in full, whereas the main point
of mmapping the packed-refs file is to avoid having to read the whole
file at all in some situations. That being said, a 32 KiB file would
only be 8 pages (assuming a page size of 4 KiB), and by the time you've
read the header and binary-searched to find the desired record, you've
probably paged in most of the file anyway. Reading the whole file at
once, in order, is almost certainly cheaper.

Michael
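For readers who want to see the trade-off in code, here is a minimal POSIX sketch of "read() for small files, mmap() otherwise". It is not the code from the patch: the struct, the function names and the 32 KiB cutoff (borrowed from the example size mentioned above) are all assumptions, and error handling is reduced to returning -1 (a production version would, for instance, retry on EINTR).

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical cutoff; 32 KiB is only the example size from the thread. */
#define SMALL_FILE_SIZE_SKETCH (32 * 1024)

struct contents_sketch {
	char *buf;
	size_t len;
	int mmapped;
};

/* Load `path` into `c`. Returns 0 on success, -1 on any error. */
int load_contents_sketch(const char *path, struct contents_sketch *c)
{
	struct stat st;
	int fd;

	assert(path != NULL && c != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	c->len = (size_t)st.st_size;

	if (c->len <= SMALL_FILE_SIZE_SKETCH) {
		size_t done = 0;

		/* malloc(0) may return NULL, so ask for at least one byte. */
		c->buf = malloc(c->len ? c->len : 1);
		if (!c->buf) {
			close(fd);
			return -1;
		}
		while (done < c->len) {
			ssize_t n = read(fd, c->buf + done, c->len - done);
			if (n <= 0) {
				free(c->buf);
				close(fd);
				return -1;
			}
			done += (size_t)n;
		}
		c->mmapped = 0;
	} else {
		void *p = mmap(NULL, c->len, PROT_READ, MAP_PRIVATE, fd, 0);
		if (p == MAP_FAILED) {
			close(fd);
			return -1;
		}
		c->buf = p;
		c->mmapped = 1;
	}
	close(fd);
	return 0;
}

/* Release whatever load_contents_sketch() set up. */
void release_contents_sketch(struct contents_sketch *c)
{
	if (c->mmapped)
		munmap(c->buf, c->len);
	else
		free(c->buf);
	c->buf = NULL;
	c->len = 0;
	c->mmapped = 0;
}
```

Note that on the small-file path the buffer is never NULL, even for an empty file, which is the property the rest of this thread is concerned with.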
Re: [PATCH v3] packed_ref_cache: don't use mmap() for small files
Michael Haggerty writes:

> `snapshot->buf` can still be NULL if the `packed-refs` file didn't exist
> (see the earlier code path in `load_contents()`). So either that code
> path *also* has to get the `xmalloc()` treatment, or my third patch is
> still necessary. (My second patch wouldn't be necessary because the
> ENOENT case makes `load_contents()` return 0, triggering the early exit
> from `create_snapshot()`.)
>
> I don't have a strong preference either way.

Which would be a two-liner, like the attached, which does not look
too bad by itself.

The direction, if we take this approach, means that we are declaring
that .buf being NULL is an invalid state for a snapshot to be in,
instead of saying "an empty snapshot looks exactly like one that was
freshly initialized", which seems to be the intention of the original
design.

After Kim's fix and with 3/3 in your follow-up series, various
helpers are still unsafe against .buf being NULL, like
sort_snapshot(), verify_buffer_safe(), clear_snapshot_buffer() (only
when mmapped bit is set), find_reference_location().

packed_ref_iterator_begin() checks if snapshot->buf is NULL and
returns early.  At the first glance, this appears a useful short cut
to optimize the empty case away, but the check also is acting as a
guard to prevent a snapshot with NULL .buf from being fed to an
unsafe find_reference_location().  An implicit guard like this feels
a bit more brittle than my liking.  If we ensure .buf is never NULL,
that check can become a pure short-cut optimization and stop being
a correctness thing.

So...

 refs/packed-backend.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index b6e2bc3c1d..1eeb5c7f80 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -473,12 +473,11 @@ static int load_contents(struct snapshot *snapshot)
 	if (fd < 0) {
 		if (errno == ENOENT) {
 			/*
-			 * This is OK; it just means that no
-			 * "packed-refs" file has been written yet,
-			 * which is equivalent to it being empty,
-			 * which is its state when initialized with
-			 * zeros.
+			 * Treat missing "packed-refs" as equivalent to
+			 * it being empty.
 			 */
+			snapshot->eof = snapshot->buf = xmalloc(0);
+			snapshot->mmapped = 0;
 			return 0;
 		} else {
 			die_errno("couldn't read %s", snapshot->refs->path);
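The two added lines depend on `xmalloc(0)` never returning NULL, which the original patch description attributes to a fallback to a 1-byte allocation. A rough sketch of that behaviour follows; the names `xmalloc_sketch` and `make_empty_sketch` are hypothetical, and this is not git's actual wrapper, only an illustration of the idea.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Allocate `size` bytes or exit. The C standard allows malloc(0) to
 * return NULL, so a zero-size request is retried as a 1-byte request
 * to guarantee a valid, free()-able pointer.
 */
void *xmalloc_sketch(size_t size)
{
	void *ret = malloc(size);

	if (!ret && !size)
		ret = malloc(1);
	if (!ret) {
		fprintf(stderr, "fatal: out of memory\n");
		exit(128);
	}
	return ret;
}

/*
 * Build an "empty" buffer in the style of the diff above: `buf` and
 * `eof` are equal and both non-NULL.
 */
void make_empty_sketch(char **buf, char **eof)
{
	*buf = *eof = xmalloc_sketch(0);
	assert(*buf != NULL);
}
```

Because `buf == eof`, the buffer has length zero and is never dereferenced, yet it can still be passed to `free()` like any other allocation.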
Re: [PATCH v3] packed_ref_cache: don't use mmap() for small files
On 01/17/2018 11:09 PM, Jeff King wrote:
> On Tue, Jan 16, 2018 at 08:38:15PM +0100, Kim Gybels wrote:
>
>> Take a hint from commit ea68b0ce9f8 (hash-object: don't use mmap() for
>> small files, 2010-02-21) and use read() instead of mmap() for small
>> packed-refs files.
>>
>> This also fixes the problem[1] where xmmap() returns NULL for zero
>> length[2], for which munmap() later fails.
>>
>> Alternatively, we could simply check for NULL before munmap(), or
>> introduce xmunmap() that could be used together with xmmap(). However,
>> always setting snapshot->buf to a valid pointer, by relying on
>> xmalloc(0)'s fallback to 1-byte allocation, makes using snapshots
>> easier.
>>
>> [1] https://github.com/git-for-windows/git/issues/1410
>> [2] Logic introduced in commit 9130ac1e196 (Better error messages for
>>     corrupt databases, 2007-01-11)
>>
>> Signed-off-by: Kim Gybels
>> ---
>>
>> Change since v2: removed separate case for zero length as suggested by Peff,
>> ensuring that snapshot->buf is always a valid pointer.
>
> Thanks, this looks fine to me (I'd be curious to hear from Michael if
> this eliminates the need for the other patches).

`snapshot->buf` can still be NULL if the `packed-refs` file didn't exist
(see the earlier code path in `load_contents()`). So either that code
path *also* has to get the `xmalloc()` treatment, or my third patch is
still necessary. (My second patch wouldn't be necessary because the
ENOENT case makes `load_contents()` return 0, triggering the early exit
from `create_snapshot()`.)

I don't have a strong preference either way.

Michael
Re: [PATCH v3] packed_ref_cache: don't use mmap() for small files
On Tue, Jan 16, 2018 at 08:38:15PM +0100, Kim Gybels wrote:

> Take a hint from commit ea68b0ce9f8 (hash-object: don't use mmap() for
> small files, 2010-02-21) and use read() instead of mmap() for small
> packed-refs files.
>
> This also fixes the problem[1] where xmmap() returns NULL for zero
> length[2], for which munmap() later fails.
>
> Alternatively, we could simply check for NULL before munmap(), or
> introduce xmunmap() that could be used together with xmmap(). However,
> always setting snapshot->buf to a valid pointer, by relying on
> xmalloc(0)'s fallback to 1-byte allocation, makes using snapshots
> easier.
>
> [1] https://github.com/git-for-windows/git/issues/1410
> [2] Logic introduced in commit 9130ac1e196 (Better error messages for
>     corrupt databases, 2007-01-11)
>
> Signed-off-by: Kim Gybels
> ---
>
> Change since v2: removed separate case for zero length as suggested by Peff,
> ensuring that snapshot->buf is always a valid pointer.

Thanks, this looks fine to me (I'd be curious to hear from Michael if
this eliminates the need for the other patches).

-Peff
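For comparison, the alternative mentioned in the patch description (pairing the mapping wrapper with an unmapping wrapper that tolerates NULL) might look something like the sketch below. Both function names and signatures are hypothetical and are not git's API; the sketch only mirrors the behaviour described in the quoted text, namely a NULL result for a zero-length mapping.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>

/*
 * Hypothetical mapping wrapper: returns NULL for a zero-length
 * request instead of calling mmap(), and exits on a real failure.
 */
void *xmmap_sketch(size_t length, int fd)
{
	void *ret;

	if (!length)
		return NULL;
	ret = mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0);
	if (ret == MAP_FAILED) {
		perror("mmap");
		exit(128);
	}
	return ret;
}

/*
 * Hypothetical counterpart: a NULL address means "nothing was
 * mapped", so there is nothing to unmap and the call succeeds.
 */
int xmunmap_sketch(void *addr, size_t length)
{
	if (!addr) {
		assert(length == 0);
		return 0;
	}
	return munmap(addr, length);
}
```

This keeps the NULL special case inside the two wrappers, at the cost of every other consumer of the buffer still having to cope with a NULL pointer, which is the drawback the patch description points out.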