Re: [PATCH 04/19] index-helper: new daemon for caching index and related stuff

2016-03-15 Thread Junio C Hamano
Duy Nguyen  writes:

> Another aspect that's not mentioned is, we keep this daemon's logic as
> thin as possible. The "brain" stays in git. So the daemon can read and
> validate stuff, but that's about all it's allowed to do. It's not
> supposed to add/create new contents. It's not even allowed to accept
> direct updates from git.

That explanation does make the intent clear. It is a kind of design
decision that needs to be made early and that is hard to change
later (I am not saying I see the need of changing, though), so it is
worth stating explicitly to guide future readers and updaters of the
code.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/19] index-helper: new daemon for caching index and related stuff

2016-03-15 Thread Duy Nguyen
On Thu, Mar 10, 2016 at 6:09 AM, Junio C Hamano  wrote:
> David Turner  writes:
>
>> From: Nguyễn Thái Ngọc Duy 
>>
>> Instead of reading the index from disk and worrying about disk
>> corruption, the index is cached in memory (memory bit-flips happen
>> too, but hopefully less often). The result is faster read. Read time
>> is reduced by 70%.
>>
>> The biggest gain is not having to verify the trailing SHA-1, which
>> takes lots of time especially on large index files. But this also
>> opens doors for further optimiztions:
>>
>>  - we could create an in-memory format that's essentially the memory
>>dump of the index to eliminate most of parsing/allocation
>>overhead. The mmap'd memory can be used straight away. Experiment
>>[1] shows we could reduce read time by 88%.
>>
>>  - we could cache non-index info such as name hash
>>
>> The shared memory's name folows the template "git--"
>> where  is the trailing SHA-1 of the index file.  is
>> "index" for cached index files (and may be "name-hash" for name-hash
>> cache). If such shared memory exists, it contains the same index
>> content as on disk. The content is already validated by the daemon and
>> git won't validate it again (except comparing the trailing SHA-1s).
>
> This indeed is an interesting approach; what is not explained but
> must be is when the on-disk index is updated to reflect the reality
> (if I am reading the explanation and the code right, while the
> daemon is running, its in-core cache becomes the source of truth by
> forcing everybody's read-index-from() to go to the daemon).  The
> explanation could be "this is only for read side, and updating the
> index happens via the traditional 'write a new file and rename it to
> the final place' codepath, at which time the daemon must be told to
> re-read it."

Another aspect that's not mentioned is, we keep this daemon's logic as
thin as possible. The "brain" stays in git. So the daemon can read and
validate stuff, but that's about all it's allowed to do. It's not
supposed to add/create new contents. It's not even allowed to accept
direct updates from git.
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/19] index-helper: new daemon for caching index and related stuff

2016-03-15 Thread Duy Nguyen
On Thu, Mar 10, 2016 at 1:36 AM, David Turner  wrote:
> Git can poke the daemon to tell it to refresh the index cache, or to
> keep it alive some more minutes via UNIX signals.

The reason I went with UNIX signals was because it made it possible to
make a simple GetMessage loop, the only thing I can remember from my
Windows time, on Windows later. It sounded clever, but because this is
more like UDP (vs TCP) it's harder for communication. For example, we
can't get a confirmation after a request... UNIX sockets would be more
natural.

Since this patch was written, watchman has gained Windows support. I
just looked at the code, it uses named pipe on Windows. So maybe we
can just go with that too (if only because it has been proven working
in practice) and we can go back to UNIX sockets on the *nix side. Too
bad we can't just copy some functions from watchman because of license
incompatibility. But we can leave Windows support to gfw team now, I
think.
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/19] index-helper: new daemon for caching index and related stuff

2016-03-10 Thread Duy Nguyen
On Fri, Mar 11, 2016 at 3:22 AM, David Turner  wrote:
> Reads (ignoring SHA verification) will be slightly slower (due to the
> btree overhead).  If, in general, we only had to read part of the
> index, that would be faster. But a fair amount of git code is written
> to assume that it is reasonable to iterate over the whole index.  So I
> don't see a win from just switching to LMDB without other changes.

I didn't think of the btree overhead. Will need to run some test.. I
think lmdb just gives us pointers to mmap'd file, so an entire index
reading can be fast. We just record where the on-disk data is. We need
to make index-on-lmdb use native byte endian though to avoid
conversion and actually reading all index entries because of that.

> (I guess we could parallelize index deserialization more easily with
> lmdb, but I don't know
>
> The original intent of the index was to be a "staging area" to build up
> a tree before committing it.  This implies an alternate view of the
> index as a diff against some (possibly-empty) tree.  An index diff or
> status operation, then, just requires iterating over the changes.
>
> I haven't really dug into merges to understand whether it would be
> reasonable for them to use a representation like this, and what that
> would do to speed there.

I think we keep the same internal representation of the index, a list
of paths with extra info like stat and some flags. Minimal impacts to
the code base that way. There are some cases (I think) where we just
need to read a portion of the index (filtered by pathspec). If we
don't update the index afterwards, lmdb partial reads definitely help.
I need to check if those cases are common (and worth optimizing for)
though.

> In general, git takes the commit side of the commit-diff duality. This
> is generally a good thing, because commits are simpler to reason about.
> But I wonder if we could do better for this specific case.
>
> But I don't think switching to LMDB would replace index-helper.

The point of index-helper is reduce reading cost (let's leave watchman
out for now). But it looks like we're not certain if using lmdb can
reduce reading cost at lower complexity. I'll give it some thought
over the weekend. Maybe in the end we better just go with
index-helper...
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/19] index-helper: new daemon for caching index and related stuff

2016-03-10 Thread David Turner
On Thu, 2016-03-10 at 18:17 +0700, Duy Nguyen wrote:
> On Thu, Mar 10, 2016 at 6:21 AM, Junio C Hamano 
> wrote:
> > Junio C Hamano  writes:
> > 
> > > David Turner  writes:
> > > 
> > > > From: Nguyễn Thái Ngọc Duy 
> > > > 
> > > > Instead of reading the index from disk and worrying about disk
> > > > corruption, the index is cached in memory (memory bit-flips
> > > > happen
> > > > too, but hopefully less often). The result is faster read. Read
> > > > time
> > > > is reduced by 70%.
> > > > 
> > > > The biggest gain is not having to verify the trailing SHA-1,
> > > > which
> > > > takes lots of time especially on large index files.
> > 
> > Come to think of it, wouldn't it be far less intrusive change to
> > just put the index on a ramdisk and skip the trailing SHA-1
> > verification, to obtain a similar result with the same trade off
> > (i.e. blindly trusting memory instead of being paranoid)?
> 
> I'm straying off-topic again because this reminded me about lmdb :)
> For the record when I looked at lmdb I thought of using it as an
> index
> format too. It has everything we wanted in index-v5. Good enough that
> we would not need index-helper and split-index to work around current
> index format. Though I finally decided it didn't fit well.
> 
> On the good side, lmdb is b+ trees, we can update directly on disk
> (without rewriting whole file), read directly from mmap'd file and
> not
> worry about corrupting it. There's no SHA-1 verification either (and
> we can do crc32 per entry if we want too). It's good.
> 
> But, it does not fit well the our locking model (but we can change
> this). And we sometimes want to create temporary throw-away indexes
> (e.g. partial commits) which I don't think is easy to do. And the
> reading directly from mmap does not give us much because we have to
> do
> byte endian conversion  anyway.
> 
> Hmm.. on second thought, even though lmdb can't be the default format
> (for a bunch of other limitations), it can still be an option for
> super big worktrees, just like index-helper being an optional
> optimization. Hm.. hm.. if we can support lmdb as index format in
> addition to the current format, bringing back some work from Thomas..
> We may still need a daemon or something to deal with watchman, but
> the
> underlying mechanism is exactly the same... David, Junio, what do you
> think?

LMDB makes writes faster, since we only have to write the stuff that's
changed.  That's good.

Reads (ignoring SHA verification) will be slightly slower (due to the
btree overhead).  If, in general, we only had to read part of the
index, that would be faster. But a fair amount of git code is written
to assume that it is reasonable to iterate over the whole index.  So I
don't see a win from just switching to LMDB without other changes.

(I guess we could parallelize index deserialization more easily with
lmdb, but I don't know 

The original intent of the index was to be a "staging area" to build up
a tree before committing it.  This implies an alternate view of the
index as a diff against some (possibly-empty) tree.  An index diff or
status operation, then, just requires iterating over the changes. 

I haven't really dug into merges to understand whether it would be
reasonable for them to use a representation like this, and what that
would do to speed there.

In general, git takes the commit side of the commit-diff duality. This
is generally a good thing, because commits are simpler to reason about.
But I wonder if we could do better for this specific case.

But I don't think switching to LMDB would replace index-helper.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/19] index-helper: new daemon for caching index and related stuff

2016-03-10 Thread Duy Nguyen
On Thu, Mar 10, 2016 at 6:21 AM, Junio C Hamano  wrote:
> Junio C Hamano  writes:
>
>> David Turner  writes:
>>
>>> From: Nguyễn Thái Ngọc Duy 
>>>
>>> Instead of reading the index from disk and worrying about disk
>>> corruption, the index is cached in memory (memory bit-flips happen
>>> too, but hopefully less often). The result is faster read. Read time
>>> is reduced by 70%.
>>>
>>> The biggest gain is not having to verify the trailing SHA-1, which
>>> takes lots of time especially on large index files.
>
> Come to think of it, wouldn't it be far less intrusive change to
> just put the index on a ramdisk and skip the trailing SHA-1
> verification, to obtain a similar result with the same trade off
> (i.e. blindly trusting memory instead of being paranoid)?

I'm straying off-topic again because this reminded me about lmdb :)
For the record when I looked at lmdb I thought of using it as an index
format too. It has everything we wanted in index-v5. Good enough that
we would not need index-helper and split-index to work around current
index format. Though I finally decided it didn't fit well.

On the good side, lmdb is b+ trees, we can update directly on disk
(without rewriting whole file), read directly from mmap'd file and not
worry about corrupting it. There's no SHA-1 verification either (and
we can do crc32 per entry if we want too). It's good.

But, it does not fit well the our locking model (but we can change
this). And we sometimes want to create temporary throw-away indexes
(e.g. partial commits) which I don't think is easy to do. And the
reading directly from mmap does not give us much because we have to do
byte endian conversion  anyway.

Hmm.. on second thought, even though lmdb can't be the default format
(for a bunch of other limitations), it can still be an option for
super big worktrees, just like index-helper being an optional
optimization. Hm.. hm.. if we can support lmdb as index format in
addition to the current format, bringing back some work from Thomas..
We may still need a daemon or something to deal with watchman, but the
underlying mechanism is exactly the same... David, Junio, what do you
think?
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/19] index-helper: new daemon for caching index and related stuff

2016-03-09 Thread David Turner
On Wed, 2016-03-09 at 15:09 -0800, Junio C Hamano wrote:
> David Turner  writes:
> 
> > From: Nguyễn Thái Ngọc Duy 
> > 
> > Instead of reading the index from disk and worrying about disk
> > corruption, the index is cached in memory (memory bit-flips happen
> > too, but hopefully less often). The result is faster read. Read
> > time
> > is reduced by 70%.
> > 
> > The biggest gain is not having to verify the trailing SHA-1, which
> > takes lots of time especially on large index files. But this also
> > opens doors for further optimiztions:
> > 
> >  - we could create an in-memory format that's essentially the
> > memory
> >dump of the index to eliminate most of parsing/allocation
> >overhead. The mmap'd memory can be used straight away.
> > Experiment
> >[1] shows we could reduce read time by 88%.
> > 
> >  - we could cache non-index info such as name hash
> > 
> > The shared memory's name folows the template "git--"
> > where  is the trailing SHA-1 of the index file.  is
> > "index" for cached index files (and may be "name-hash" for name
> > -hash
> > cache). If such shared memory exists, it contains the same index
> > content as on disk. The content is already validated by the daemon
> > and
> > git won't validate it again (except comparing the trailing SHA-1s).
> 
> This indeed is an interesting approach; what is not explained but
> must be is when the on-disk index is updated to reflect the reality
> (if I am reading the explanation and the code right, while the
> daemon is running, its in-core cache becomes the source of truth by
> forcing everybody's read-index-from() to go to the daemon).  The
> explanation could be "this is only for read side, and updating the
> index happens via the traditional 'write a new file and rename it to
> the final place' codepath, at which time the daemon must be told to
> re-read it."

This seems like the explanation (from the current commit message):

"Git can poke the daemon to tell it to refresh the index cache, or to
keep it alive some more minutes via UNIX signals. It can't give any
real index data directly to the daemon. Real data goes to disk first,
then the daemon reads and verifies it from there. Poking only happens
for $GIT_DIR/index, not temporary index files."

I guess this could be rewritten as:

"Index validity is ensured by the following method: When a read is
requested from the index-helper, it checks the SHA1 of its cached index
copy against the on-disk version.  If they differ, index-helper rereads
the index.  In addition, any git process may explicitly suggest a
reread via a UNIX signal, but this is only an optimization and it is
not necessary for correctness.

In addition, Git can signal the daemon with a heartbeat signal, to keep
the daemon alive longer."

How does that sound?


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/19] index-helper: new daemon for caching index and related stuff

2016-03-09 Thread David Turner
On Wed, 2016-03-09 at 15:21 -0800, Junio C Hamano wrote:
> Junio C Hamano  writes:
> 
> > David Turner  writes:
> > 
> > > From: Nguyễn Thái Ngọc Duy 
> > > 
> > > Instead of reading the index from disk and worrying about disk
> > > corruption, the index is cached in memory (memory bit-flips
> > > happen
> > > too, but hopefully less often). The result is faster read. Read
> > > time
> > > is reduced by 70%.
> > > 
> > > The biggest gain is not having to verify the trailing SHA-1,
> > > which
> > > takes lots of time especially on large index files.
> 
> Come to think of it, wouldn't it be far less intrusive change to
> just put the index on a ramdisk and skip the trailing SHA-1
> verification, to obtain a similar result with the same trade off
> (i.e. blindly trusting memory instead of being paranoid)?
> 

1. If the index were stored on a ramdisk, we would *also* have to write
it to durable storage to avoid losing the user's work when they power
off.  That's more complicated.  

2. Duy notes that it is easier to add further optimizations to this --
for instance, we could share a pre-parsed version of the index.  It
would be harder to add these optimizations to the disk format, because
(a) they would take up more space, and (b) they would probably be
harder to write in a single pass, which is presently how index writing
works.

3. If we wanted to just skip SHA-1 verification, we would not need a
ramdisk; it's almost certain that the index would be in the disk cache
most of the time, so regular storage should be very nearly as fast as a
ramdisk. I think mlock might help ensure that the data remains in the
cache, although I'm not sure what the permissions story is on that.


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/19] index-helper: new daemon for caching index and related stuff

2016-03-09 Thread Junio C Hamano
Junio C Hamano  writes:

> David Turner  writes:
>
>> From: Nguyễn Thái Ngọc Duy 
>>
>> Instead of reading the index from disk and worrying about disk
>> corruption, the index is cached in memory (memory bit-flips happen
>> too, but hopefully less often). The result is faster read. Read time
>> is reduced by 70%.
>>
>> The biggest gain is not having to verify the trailing SHA-1, which
>> takes lots of time especially on large index files.

Come to think of it, wouldn't it be far less intrusive change to
just put the index on a ramdisk and skip the trailing SHA-1
verification, to obtain a similar result with the same trade off
(i.e. blindly trusting memory instead of being paranoid)?

 

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/19] index-helper: new daemon for caching index and related stuff

2016-03-09 Thread Junio C Hamano
David Turner  writes:

> From: Nguyễn Thái Ngọc Duy 
>
> Instead of reading the index from disk and worrying about disk
> corruption, the index is cached in memory (memory bit-flips happen
> too, but hopefully less often). The result is faster read. Read time
> is reduced by 70%.
>
> The biggest gain is not having to verify the trailing SHA-1, which
> takes lots of time especially on large index files. But this also
> opens doors for further optimiztions:
>
>  - we could create an in-memory format that's essentially the memory
>dump of the index to eliminate most of parsing/allocation
>overhead. The mmap'd memory can be used straight away. Experiment
>[1] shows we could reduce read time by 88%.
>
>  - we could cache non-index info such as name hash
>
> The shared memory's name folows the template "git--"
> where  is the trailing SHA-1 of the index file.  is
> "index" for cached index files (and may be "name-hash" for name-hash
> cache). If such shared memory exists, it contains the same index
> content as on disk. The content is already validated by the daemon and
> git won't validate it again (except comparing the trailing SHA-1s).

This indeed is an interesting approach; what is not explained but
must be is when the on-disk index is updated to reflect the reality
(if I am reading the explanation and the code right, while the
daemon is running, its in-core cache becomes the source of truth by
forcing everybody's read-index-from() to go to the daemon).  The
explanation could be "this is only for read side, and updating the
index happens via the traditional 'write a new file and rename it to
the final place' codepath, at which time the daemon must be told to
re-read it."
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html