Re: [PATCH v3] packed_ref_cache: don't use mmap() for small files

2018-01-24 Thread Junio C Hamano
Michael Haggerty writes:

> The change to using `read()` rather than `mmap()` for small
> `packed-refs` feels like it should be an improvement, but it occurred to
> me that the performance numbers quoted in ea68b0ce9f8 (hash-object:
> don't use mmap() for small files, 2010-02-21) are not directly
> applicable to the `packed-refs` file. As far as I understand, the file
> mmapped in `index_fd()` is always read in full, whereas the main point
> of mmapping the packed-refs file is to avoid having to read the whole
> file at all in some situations. That being said, a 32 KiB file would
> only be 8 pages (assuming a page size of 4 KiB), and by the time you've
> read the header and binary-searched to find the desired record, you've
> probably paged in most of the file anyway. Reading the whole file at
> once, in order, is almost certainly cheaper.

Yup.  So unless your "small" is meaningfully large, we are likely to
be better off with read(2), but I suspect that this might not be
even measurable since we are only talking about "small" files.


Re: [PATCH v3] packed_ref_cache: don't use mmap() for small files

2018-01-24 Thread Michael Haggerty
On 01/22/2018 08:31 PM, Junio C Hamano wrote:
> Michael Haggerty writes:
> 
>> `snapshot->buf` can still be NULL if the `packed-refs` file didn't exist
>> (see the earlier code path in `load_contents()`). So either that code
>> path *also* has to get the `xmalloc()` treatment, or my third patch is
>> still necessary. (My second patch wouldn't be necessary because the
>> ENOENT case makes `load_contents()` return 0, triggering the early exit
>> from `create_snapshot()`.)
>>
>> I don't have a strong preference either way.
> 
> Which would be a two-liner, like the attached, which does not look
> too bad by itself.
> 
> The direction, if we take this approach, means that we are declaring
> that .buf being NULL is an invalid state for a snapshot to be in,
> instead of saying "an empty snapshot looks exactly like one that was
> freshly initialized", which seems to be the intention of the original
> design.
> 
> After Kim's fix and with 3/3 in your follow-up series, various
> helpers are still unsafe against .buf being NULL, like
> sort_snapshot(), verify_buffer_safe(), clear_snapshot_buffer() (only
> when mmapped bit is set), find_reference_location().
> 
> packed_ref_iterator_begin() checks if snapshot->buf is NULL and
> returns early.  At first glance, this appears to be a useful shortcut
> to optimize the empty case away, but the check also acts as a
> guard to prevent a snapshot with NULL .buf from being fed to an
> unsafe find_reference_location().  An implicit guard like this feels
> a bit too brittle for my liking.  If we ensure .buf is never NULL,
> that check can become a pure shortcut optimization and stop being
> a matter of correctness.
> 
> So...
> 
> 
>  refs/packed-backend.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/refs/packed-backend.c b/refs/packed-backend.c
> index b6e2bc3c1d..1eeb5c7f80 100644
> --- a/refs/packed-backend.c
> +++ b/refs/packed-backend.c
> @@ -473,12 +473,11 @@ static int load_contents(struct snapshot *snapshot)
>  	if (fd < 0) {
>  		if (errno == ENOENT) {
>  			/*
> -			 * This is OK; it just means that no
> -			 * "packed-refs" file has been written yet,
> -			 * which is equivalent to it being empty,
> -			 * which is its state when initialized with
> -			 * zeros.
> +			 * Treat missing "packed-refs" as equivalent to
> +			 * it being empty.
>  			 */
> +			snapshot->eof = snapshot->buf = xmalloc(0);
> +			snapshot->mmapped = 0;
>  			return 0;
>  		} else {
>  			die_errno("couldn't read %s", snapshot->refs->path);
> 

That would work, though if you go this way, please also change the
docstring for `snapshot::buf`, which still says that `buf` and `eof` can
be `NULL`.

The other alternative, making `snapshot` safe for NULLs, becomes easier
if `snapshot` stores a pointer to the start of the reference section of
the `packed-refs` contents (i.e., after the header line), rather than
repeatedly computing that address from `snapshot->buf +
snapshot->header_len`. With this change, code that is technically
undefined when the fields are NULL can more easily be replaced with code
that is safe for NULL. For example,

    pos = snapshot->buf + snapshot->header_len

becomes

    pos = snapshot->start

and

    len = snapshot->eof - pos;
    if (!len) [...]

becomes

    if (pos == snapshot->eof) [...]
    len = snapshot->eof - pos;

In this way, most of the special-casing for NULL goes away (and some
code becomes simpler, as well).
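
The reordering above can be sketched in isolation. The struct and function
names below (`snapshot_sketch`, `refs_len`) are hypothetical stand-ins, not
the real definitions in refs/packed-backend.c, and the sketch assumes the
invariant that `start` and `eof` are either both NULL or both point into
the same buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, pared-down stand-in for the real snapshot struct. */
struct snapshot_sketch {
	const char *start; /* first byte of the reference section, or NULL */
	const char *eof;   /* one past the last byte, or NULL */
};

/*
 * Return the number of bytes in the reference section.
 *
 * The equality test comes first: comparing two pointers with == is
 * well defined even when both are NULL, whereas subtracting two NULL
 * pointers is technically undefined in C. The subtraction is only
 * reached when the section is non-empty.
 */
static size_t refs_len(const struct snapshot_sketch *s)
{
	if (s->start == s->eof)
		return 0; /* empty, including the zero-initialized state */

	/* Assumed invariant: a non-empty snapshot has real pointers. */
	assert(s->start && s->eof);
	return (size_t)(s->eof - s->start);
}
```

With this ordering a zero-initialized struct behaves as an empty snapshot
without any explicit NULL check in the caller.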

In a moment I'll send a patch series illustrating this approach. I think
patches 01, 02, and 04 are improvements regardless of whether we decide
to make NULL safe.

The change to using `read()` rather than `mmap()` for small
`packed-refs` feels like it should be an improvement, but it occurred to
me that the performance numbers quoted in ea68b0ce9f8 (hash-object:
don't use mmap() for small files, 2010-02-21) are not directly
applicable to the `packed-refs` file. As far as I understand, the file
mmapped in `index_fd()` is always read in full, whereas the main point
of mmapping the packed-refs file is to avoid having to read the whole
file at all in some situations. That being said, a 32 KiB file would
only be 8 pages (assuming a page size of 4 KiB), and by the time you've
read the header and binary-searched to find the desired record, you've
probably paged in most of the file anyway. Reading the whole file at
once, in order, is almost certainly cheaper.
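
As a rough illustration of the trade-off described above, the following
sketch picks `read()` for files at or below a size threshold and `mmap()`
otherwise. It is not the code from the patch: the names `contents_sketch`,
`load_fd`, and `release_contents` are hypothetical, the 32 KiB threshold is
an assumption taken from the figure mentioned above, and error handling is
reduced to a return code (EINTR is not retried).

```c
#define _POSIX_C_SOURCE 200809L /* for fstat(), mmap(), ssize_t */

#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define SMALL_FILE_SIZE (32 * 1024) /* assumed threshold */

/* Hypothetical holder for file contents. */
struct contents_sketch {
	char *buf;
	size_t len;
	int mmapped; /* 1 if buf came from mmap(), 0 if from malloc() */
};

/* Load the whole file behind fd. Returns 0 on success, -1 on error. */
static int load_fd(int fd, struct contents_sketch *c)
{
	struct stat st;

	if (fstat(fd, &st) < 0)
		return -1;
	c->len = (size_t)st.st_size;

	if (c->len <= SMALL_FILE_SIZE) {
		size_t done = 0;

		/* Allocate at least one byte so buf is never NULL. */
		c->buf = malloc(c->len ? c->len : 1);
		if (!c->buf)
			return -1;
		while (done < c->len) {
			ssize_t n = read(fd, c->buf + done, c->len - done);
			if (n <= 0) { /* error or unexpected end of file */
				free(c->buf);
				c->buf = NULL;
				return -1;
			}
			done += (size_t)n;
		}
		c->mmapped = 0;
	} else {
		void *p = mmap(NULL, c->len, PROT_READ, MAP_PRIVATE, fd, 0);

		if (p == MAP_FAILED)
			return -1;
		c->buf = p;
		c->mmapped = 1;
	}
	return 0;
}

/* Release contents with the function that matches how they were loaded. */
static void release_contents(struct contents_sketch *c)
{
	assert(c->buf != NULL);
	if (c->mmapped)
		munmap(c->buf, c->len);
	else
		free(c->buf);
	c->buf = NULL;
	c->len = 0;
	c->mmapped = 0;
}
```

Because a zero-length file takes the `read()` branch, `mmap()` is never
called with a length of zero in this sketch.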

Michael


Re: [PATCH v3] packed_ref_cache: don't use mmap() for small files

2018-01-22 Thread Junio C Hamano
Michael Haggerty writes:

> `snapshot->buf` can still be NULL if the `packed-refs` file didn't exist
> (see the earlier code path in `load_contents()`). So either that code
> path *also* has to get the `xmalloc()` treatment, or my third patch is
> still necessary. (My second patch wouldn't be necessary because the
> ENOENT case makes `load_contents()` return 0, triggering the early exit
> from `create_snapshot()`.)
>
> I don't have a strong preference either way.

Which would be a two-liner, like the attached, which does not look
too bad by itself.

The direction, if we take this approach, means that we are declaring
that .buf being NULL is an invalid state for a snapshot to be in,
instead of saying "an empty snapshot looks exactly like one that was
freshly initialized", which seems to be the intention of the original
design.

After Kim's fix and with 3/3 in your follow-up series, various
helpers are still unsafe against .buf being NULL, like
sort_snapshot(), verify_buffer_safe(), clear_snapshot_buffer() (only
when mmapped bit is set), find_reference_location().

packed_ref_iterator_begin() checks if snapshot->buf is NULL and
returns early.  At first glance, this appears to be a useful shortcut
to optimize the empty case away, but the check also acts as a
guard to prevent a snapshot with NULL .buf from being fed to an
unsafe find_reference_location().  An implicit guard like this feels
a bit too brittle for my liking.  If we ensure .buf is never NULL,
that check can become a pure shortcut optimization and stop being
a matter of correctness.

So...


 refs/packed-backend.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index b6e2bc3c1d..1eeb5c7f80 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -473,12 +473,11 @@ static int load_contents(struct snapshot *snapshot)
 	if (fd < 0) {
 		if (errno == ENOENT) {
 			/*
-			 * This is OK; it just means that no
-			 * "packed-refs" file has been written yet,
-			 * which is equivalent to it being empty,
-			 * which is its state when initialized with
-			 * zeros.
+			 * Treat missing "packed-refs" as equivalent to
+			 * it being empty.
 			 */
+			snapshot->eof = snapshot->buf = xmalloc(0);
+			snapshot->mmapped = 0;
 			return 0;
 		} else {
 			die_errno("couldn't read %s", snapshot->refs->path);
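
For readers unfamiliar with why `xmalloc(0)` yields a usable pointer, here
is a minimal sketch of the idea, assuming only what the patch description
states: a zero-size request falls back to a 1-byte allocation if
`malloc(0)` returns NULL. `xmalloc_sketch`, `empty_snapshot_sketch`, and
`init_empty` are hypothetical names, not git's actual wrapper or struct.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Allocate size bytes and never return NULL. malloc(0) may legally
 * return NULL, so a failed zero-size request is retried with 1 byte.
 */
static void *xmalloc_sketch(size_t size)
{
	void *ret = malloc(size);

	if (!ret && !size)
		ret = malloc(1);
	if (!ret) {
		fprintf(stderr, "fatal: out of memory\n");
		exit(128);
	}
	return ret;
}

/* Hypothetical, pared-down stand-in for the snapshot fields involved. */
struct empty_snapshot_sketch {
	char *buf;
	char *eof;
	int mmapped;
};

/* Put a snapshot into the "empty but non-NULL" state. */
static void init_empty(struct empty_snapshot_sketch *s)
{
	s->eof = s->buf = xmalloc_sketch(0);
	s->mmapped = 0; /* buf must be released with free(), not munmap() */

	/* buf == eof means zero bytes of content, yet buf is a real pointer. */
	assert(s->buf != NULL && s->buf == s->eof);
}
```

The pointer must not be dereferenced, since the content length is zero, but
it can be compared, passed around, and handed to `free()`.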


Re: [PATCH v3] packed_ref_cache: don't use mmap() for small files

2018-01-20 Thread Michael Haggerty
On 01/17/2018 11:09 PM, Jeff King wrote:
> On Tue, Jan 16, 2018 at 08:38:15PM +0100, Kim Gybels wrote:
> 
>> Take a hint from commit ea68b0ce9f8 (hash-object: don't use mmap() for
>> small files, 2010-02-21) and use read() instead of mmap() for small
>> packed-refs files.
>>
>> This also fixes the problem[1] where xmmap() returns NULL for zero
>> length[2], for which munmap() later fails.
>>
>> Alternatively, we could simply check for NULL before munmap(), or
>> introduce xmunmap() that could be used together with xmmap(). However,
>> always setting snapshot->buf to a valid pointer, by relying on
>> xmalloc(0)'s fallback to 1-byte allocation, makes using snapshots
>> easier.
>>
>> [1] https://github.com/git-for-windows/git/issues/1410
>> [2] Logic introduced in commit 9130ac1e196 (Better error messages for
>> corrupt databases, 2007-01-11)
>>
>> Signed-off-by: Kim Gybels 
>> ---
>>
>> Change since v2: removed separate case for zero length as suggested by Peff,
>> ensuring that snapshot->buf is always a valid pointer.
> 
> Thanks, this looks fine to me (I'd be curious to hear from Michael if
> this eliminates the need for the other patches).

`snapshot->buf` can still be NULL if the `packed-refs` file didn't exist
(see the earlier code path in `load_contents()`). So either that code
path *also* has to get the `xmalloc()` treatment, or my third patch is
still necessary. (My second patch wouldn't be necessary because the
ENOENT case makes `load_contents()` return 0, triggering the early exit
from `create_snapshot()`.)

I don't have a strong preference either way.

Michael


Re: [PATCH v3] packed_ref_cache: don't use mmap() for small files

2018-01-17 Thread Jeff King
On Tue, Jan 16, 2018 at 08:38:15PM +0100, Kim Gybels wrote:

> Take a hint from commit ea68b0ce9f8 (hash-object: don't use mmap() for
> small files, 2010-02-21) and use read() instead of mmap() for small
> packed-refs files.
> 
> This also fixes the problem[1] where xmmap() returns NULL for zero
> length[2], for which munmap() later fails.
> 
> Alternatively, we could simply check for NULL before munmap(), or
> introduce xmunmap() that could be used together with xmmap(). However,
> always setting snapshot->buf to a valid pointer, by relying on
> xmalloc(0)'s fallback to 1-byte allocation, makes using snapshots
> easier.
> 
> [1] https://github.com/git-for-windows/git/issues/1410
> [2] Logic introduced in commit 9130ac1e196 (Better error messages for
> corrupt databases, 2007-01-11)
> 
> Signed-off-by: Kim Gybels 
> ---
> 
> Change since v2: removed separate case for zero length as suggested by Peff,
> ensuring that snapshot->buf is always a valid pointer.

Thanks, this looks fine to me (I'd be curious to hear from Michael if
this eliminates the need for the other patches).

-Peff