Re: Implementing reftable in Git

2018-05-11 Thread David Turner
On Fri, 2018-05-11 at 11:31 +0200, Michael Haggerty wrote:
> On Wed, May 9, 2018 at 4:33 PM, Christian Couder wrote:
> > I might start working on implementing reftable in Git soon.
> > [...]
> 
> Nice. It'll be great to have a reftable implementation in git core
> (and ideally libgit2, as well). It seems to me that it could someday
> become the new default reference storage method. The file format is
> considerably more complicated than the current loose/packed scheme,
> which is definitely a disadvantage (for example, for other Git
> implementations). But implementing it *with good performance and
> without races* might be no more complicated than the current scheme.

I am somewhat concerned about perf, because as I recall, we have a
bunch of code which effectively loads all refs, which will be more
expensive with reftable than packed-refs (though maybe cheaper than
loose refs).  But maybe we have eliminated this code or can work around
it.

> Testing will be important. There are already many tests specifically
> about testing loose/packed reference storage. These will always have
> to run against repositories that are forced to use that reference
> scheme. And there will need to be new tests specifically about the
> reftable scheme. Both classes of tests should be run every time. That
> much is pretty obvious.
> 
> But currently, there are a lot of tests that assume the loose/packed
> reference format on disk even though the tests are not really related
> to references at all. ISTM that these should be converted to work at a
> higher level, for example using `for-each-ref`, `rev-parse`, etc. to
> examine references rather than reading reference files directly. That
> way the tests should run correctly regardless of which scheme is in
> use.

I agree with that, and I think some of my patches from years ago
attempted to do that.  I probably should have broken those out into a
separate series so that they could have been applied separately.

> And since it's too expensive to run the whole test suite with both
> reference storage schemes, it seems to me that the reference storage
> scheme that is used while running the scheme-neutral tests should be
> easy to choose at runtime.

I ran the whole suite with both schemes during my testing, and I think
it was quite valuable in flushing out bugs.

> David Turner did some analogous work for wiring up and testing his
> proposed LMDB ref storage backend that might be useful [1]. I'm CCing
> him, since he might have thoughts on this topic.

Inline, above.


Re: Implementing reftable in Git

2018-05-11 Thread Michael Haggerty
On Wed, May 9, 2018 at 4:33 PM, Christian Couder wrote:
> I might start working on implementing reftable in Git soon.
> [...]

Nice. It'll be great to have a reftable implementation in git core
(and ideally libgit2, as well). It seems to me that it could someday
become the new default reference storage method. The file format is
considerably more complicated than the current loose/packed scheme,
which is definitely a disadvantage (for example, for other Git
implementations). But implementing it *with good performance and
without races* might be no more complicated than the current scheme.

Testing will be important. There are already many tests specifically
about testing loose/packed reference storage. These will always have
to run against repositories that are forced to use that reference
scheme. And there will need to be new tests specifically about the
reftable scheme. Both classes of tests should be run every time. That
much is pretty obvious.

But currently, there are a lot of tests that assume the loose/packed
reference format on disk even though the tests are not really related
to references at all. ISTM that these should be converted to work at a
higher level, for example using `for-each-ref`, `rev-parse`, etc. to
examine references rather than reading reference files directly. That
way the tests should run correctly regardless of which scheme is in
use.
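
To illustrate what "working at a higher level" could look like, here is
a minimal sketch in Python (the real test suite is shell, so this only
shows the idea). It asks `git for-each-ref` for the refs instead of
reading files under `$GIT_DIR/refs`. The helper names `list_refs` and
`parse_for_each_ref` are hypothetical; the only Git features assumed are
`git -C`, `for-each-ref --format` and the `%(objectname)` and
`%(refname)` atoms.

```python
import subprocess


def parse_for_each_ref(output):
    """Parse '<oid> <refname>' lines into a {refname: oid} dict."""
    refs = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        oid, refname = line.split(" ", 1)
        refs[refname] = oid
    return refs


def list_refs(repo_dir, pattern="refs/"):
    """List refs via Git itself, so the on-disk storage scheme does not matter."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "for-each-ref",
         "--format=%(objectname) %(refname)", pattern],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_for_each_ref(out)
```

A test written this way never opens `.git/refs/heads/master` or
`.git/packed-refs` itself, so it should not care which backend stores
the references.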

And since it's too expensive to run the whole test suite with both
reference storage schemes, it seems to me that the reference storage
scheme that is used while running the scheme-neutral tests should be
easy to choose at runtime.

David Turner did some analogous work for wiring up and testing his
proposed LMDB ref storage backend that might be useful [1]. I'm CCing
him, since he might have thoughts on this topic.

Regarding the reftable spec itself:

I recently gave a little internal talk about it, and while preparing
the talk I noticed a couple of things that should maybe be tweaked:

* The spec proposes to change `$GIT_DIR/refs`, which is currently a
directory that holds the loose refs, into a file that holds the table
of contents of reftable files comprising the full set of references.
This was my suggestion. I was thinking that this would prevent old
refs code from being used accidentally on a reftable-enabled
repository, while still enabling old versions of Git to recognize this as
a git directory [2]. I think that the latter is important to make
things like `git rev-parse --git-dir` work correctly, even if the
installed version of git can't actually *read* the repository.

  The problem is that `is_git_directory()` checks not only whether
`$GIT_DIR/refs` exists, but also whether it is executable (i.e., since
it is normally a directory, that it is searchable). It would be silly
to make the reftable table of contents executable, so this doesn't
seem like a good approach after all.

  So probably `$GIT_DIR/refs` should continue to be a directory. If
it's there, it would probably make sense to place the reftable files
and maybe the ToC inside of it. We would have to rely on older Git
versions refusing to work in the directory because its `config` file
has an unrecognized `core.repositoryFormatVersion`, but that should be
OK I think.

* The scheme for naming reftable files [3] is, I believe, just a
suggestion as far as the spec is concerned (except for the use of
`.ref`/`.log` file extensions). It might be less unwieldy to use
`%d` rather than `%08d`, and more convenient to name compacted files
to `${min_update_index}-${max_update_index}_${n}.{ref,log}` to make it
clearer to see by inspection what each file contains. That would also
make it unnecessary, in most cases, to insert a `_${n}` to make the
filename unique.
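
A minimal sketch of the naming idea, in Python purely for illustration.
The function name `stack_file_name` and its parameters are hypothetical
and not part of the spec; the sketch only shows how `%d` versus `%08d`
and the optional `_${n}` suffix would change the resulting names.

```python
def stack_file_name(min_update_index, max_update_index,
                    ext="ref", pad=False, n=None):
    """Build a name of the form ${min}-${max}[_${n}].{ref,log}."""
    if ext not in ("ref", "log"):
        raise ValueError("extension must be 'ref' or 'log'")
    # pad=True mimics the zero-padded %08d style, pad=False the plain %d style.
    fmt = "%08d" if pad else "%d"
    name = "%s-%s" % (fmt % min_update_index, fmt % max_update_index)
    # The _${n} suffix is only needed when the range alone is not unique.
    if n is not None:
        name += "_%d" % n
    return "%s.%s" % (name, ext)
```

With these assumptions, a compacted table covering update indexes 1
through 42 would be called `1-42.ref`, or `00000001-00000042.ref` in the
padded style.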

Michael

[1] https://github.com/dturner-tw/git/tree/dturner/pluggable-backends
[2] 
https://github.com/git/git/blob/ccdcbd54c4475c2238b310f7113ab3075b5abc9c/setup.c#L309-L347
[3] 
https://github.com/eclipse/jgit/blob/master/Documentation/technical/reftable.md#layout

https://github.com/eclipse/jgit/blob/master/Documentation/technical/reftable.md#compaction
[4] 
https://github.com/eclipse/jgit/blob/master/Documentation/technical/reftable.md#footer


Re: Implementing reftable in Git

2018-05-09 Thread Ævar Arnfjörð Bjarmason

On Wed, May 09 2018, Stefan Beller wrote:

> Hi Christian,
>
> On Wed, May 9, 2018 at 7:33 AM, Christian Couder wrote:
>> Hi,
>>
>> I might start working on implementing reftable in Git soon.
>
> Cool! Everyone is waiting for it as they dream about the
> performance and correctness benefits this brings.
>
> Benefits that I know of:
> * performance in repos with many refs
> * no capitalization issues on case insensitive FS
> * replay-ability of the last fetch ("show the last reflog
>   of any ref under refs/remotes/origin") is easier to do
>   in a correct way. (This is one of my motivations to desire reftables)
> * We *might* be able to use reftables in negotiation later
>   ("client: Last I fetched, you said your latest transaction
>   number was '5' with the hash over all refs to be ;
>   server: ok, here are the refs and the pack, you're welcome").
>
> Why are you (or rather booking.com) interested in this?

We have a lot of refs, which is a longer-term scalability issue (which
I've implemented hacks around (ref archiving)), and we also run into the
capitalization issues you mentioned.


Re: Implementing reftable in Git

2018-05-09 Thread Carlos Martín Nieto
On Wed, 2018-05-09 at 10:54 -0700, Jonathan Nieder wrote:
> Carlos Martín Nieto wrote:
> > On Wed, 2018-05-09 at 09:48 -0700, Jonathan Nieder wrote:
> > > If you would like the patches at
> > > https://git.eclipse.org/r/q/topic:reftable
> > > relicensed for Git's use so that you don't need to include that
> > > license header, let me know.  Separate from any legal concerns, if
> > > you're doing a straight port, a one-line comment crediting the JGit
> > > project would still be appreciated, of course.
> 
> [...]
> > Would you expect that this port would keep the Eclipse Distribution
> > License or would it get relicensed to GPLv2?
> 
> I think you're way overcomplicating things.
> 
> The patches are copyright Google.  We can handle issues as they come.

Fair enough. I just wanted to avoid coming back to this in a few months
and realising we can't use it at all.

Cheers,
   cmn



Re: Implementing reftable in Git

2018-05-09 Thread Stefan Beller
On Wed, May 9, 2018 at 10:48 AM, Jonathan Nieder wrote:
> Stefan Beller wrote:
>
>> * We *might* be able to use reftables in negotiation later
>>   ("client: Last I fetched, you said your latest transaction
>>   number was '5' with the hash over all refs to be ;
>>   server: ok, here are the refs and the pack, you're welcome").
>
> Do you mean that reftable's reflog layout makes this easier?
>
> It's not clear to me why this wouldn't work with the current
> reflogs.

Because of D/F conflicts we may not know all remote refs
(and their ref logs), such that "the hash over all refs" on the remote
is error prone to compute. Without transaction numbers it is also
cumbersome for the server to remember the state.
We could try it based on the current refs, but I'd think
it is not easy to do, whereas reftables bring some subtle
advantages that allow for such easier negotiation.
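
To make the idea a bit more concrete, here is a rough sketch in Python.
Nothing like this exists in the protocol today; the function names, the
choice of SHA-1 and the "<oid> <refname>" line format are all
assumptions made up for illustration. It only shows how a transaction
number plus a hash over the sorted refs could let a server decide
whether a client's view is still current.

```python
import hashlib


def refs_digest(refs):
    """Hash a {refname: oid} mapping in a canonical (sorted) order."""
    h = hashlib.sha1()
    for name in sorted(refs):
        h.update(("%s %s\n" % (refs[name], name)).encode("utf-8"))
    return h.hexdigest()


def needs_full_advertisement(client_index, client_digest,
                             server_index, server_refs):
    """Return True unless the client's (index, digest) matches the server's state."""
    same_state = (client_index == server_index
                  and client_digest == refs_digest(server_refs))
    return not same_state
```

The digest is only meaningful if both sides see the complete set of
refs, which is exactly what D/F conflicts can prevent on the client
side today.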

>
> [...]
>> On Wed, May 9, 2018 at 7:33 AM, Christian Couder wrote:
>
>>> During the last Git Merge conference last March Stefan talked about
>>> reftable. In Alex Vandiver's notes [1] it is asked that people
>>> announce it on the list when they start working on it,
>>
>> Mostly because many parties want to see it implemented
>> and were not sure when they could start implementing it.
>
> And to coordinate / help each other!

Yes. Usually open source contributions are so sparse, that
just doing it and then sending it to the mailing list does not
produce contention or conflict (double work), but this seemed
like a race condition waiting to happen. ;)

>> With that said, please implement it in a way that it can not just be used as
>> a refs backend, but can easily be re-used to write ref advertisements
>> onto the wire?
>
> Can you spell this out a little more for me?  At first glance it's not
> obvious to me how knowing about this potential use would affect the
> initial code.

Yeah me neither. I just want to make Christian aware of the potential
use cases, that come afterwards, so it can influence his design decisions
for the implementation.


Re: Implementing reftable in Git

2018-05-09 Thread Jonathan Nieder
Carlos Martín Nieto wrote:
> On Wed, 2018-05-09 at 09:48 -0700, Jonathan Nieder wrote:

>> If you would like the patches at https://git.eclipse.org/r/q/topic:reftable
>> relicensed for Git's use so that you don't need to include that
>> license header, let me know.  Separate from any legal concerns, if
>> you're doing a straight port, a one-line comment crediting the JGit
>> project would still be appreciated, of course.
[...]
> Would you expect that this port would keep the Eclipse Distribution
> License or would it get relicensed to GPLv2?

I think you're way overcomplicating things.

The patches are copyright Google.  We can handle issues as they come.

Jonathan


Re: Implementing reftable in Git

2018-05-09 Thread Carlos Martín Nieto
Hi all,

On Wed, 2018-05-09 at 09:48 -0700, Jonathan Nieder wrote:
> Hi,
> 
> Christian Couder wrote:
> 
> > I might start working on implementing reftable in Git soon.
> 
> Yay!
> 
> [...]
> > So I think the most straightforward and compatible way to do it would
> > be to port the JGit implementation.
> 
> I suspect following the spec[1] would be even more compatible, since it
> would force us to tighten the spec where it is unclear.
> 
> > It looks like the
> > JGit repo and the reftable code there are licensed under the Eclipse
> > Distribution License - v 1.0 [7] which is very similar to the 3-Clause
> > BSD License also called Modified BSD License
> 
> If you would like the patches at https://git.eclipse.org/r/q/topic:reftable
> relicensed for Git's use so that you don't need to include that
> license header, let me know.  Separate from any legal concerns, if
> you're doing a straight port, a one-line comment crediting the JGit
> project would still be appreciated, of course.
> 
> That said, I would not be surprised if going straight from the spec is
> easier than porting the code.

Would you expect that this port would keep the Eclipse Distribution
License or would it get relicensed to GPLv2?

We would also want to have reftable functionality in the libgit2
project, but it has a slightly different license from git (GPLv2 with
linking exception) which requires explicit consent from the authors for
us to port over the code from git with its GPLv2 license.

The libgit2 project does have permission from Shawn to relicense his
git code, but this would presumably not cover this kind of porting. I
don't believe we would have issues if the code remained this BSD-like
license.

Sorry for being difficult, but fewer distinct reimplementations is
probably a good thing overall.

cc the core libgit2 team

Cheers,
   cmn



Re: Implementing reftable in Git

2018-05-09 Thread Jonathan Nieder
Stefan Beller wrote:

> * We *might* be able to use reftables in negotiation later
>   ("client: Last I fetched, you said your latest transaction
>   number was '5' with the hash over all refs to be ;
>   server: ok, here are the refs and the pack, you're welcome").

Do you mean that reftable's reflog layout makes this easier?

It's not clear to me why this wouldn't work with the current
reflogs.

[...]
> On Wed, May 9, 2018 at 7:33 AM, Christian Couder wrote:

>> During the last Git Merge conference last March Stefan talked about
>> reftable. In Alex Vandiver's notes [1] it is asked that people
>> announce it on the list when they start working on it,
>
> Mostly because many parties want to see it implemented
> and were not sure when they could start implementing it.

And to coordinate / help each other!

[...]
> I volunteer for reviewing.

\o/

[...]
> With that said, please implement it in a way that it can not just be used as
> a refs backend, but can easily be re-used to write ref advertisements
> onto the wire?

Can you spell this out a little more for me?  At first glance it's not
obvious to me how knowing about this potential use would affect the
initial code.

Thanks,
Jonathan


Re: Implementing reftable in Git

2018-05-09 Thread Stefan Beller
Hi Christian,

On Wed, May 9, 2018 at 7:33 AM, Christian Couder wrote:
> Hi,
>
> I might start working on implementing reftable in Git soon.

Cool! Everyone is waiting for it as they dream about the
performance and correctness benefits this brings.

Benefits that I know of:
* performance in repos with many refs
* no capitalization issues on case insensitive FS
* replay-ability of the last fetch ("show the last reflog
  of any ref under refs/remotes/origin") is easier to do
  in a correct way. (This is one of my motivations to desire reftables)
* We *might* be able to use reftables in negotiation later
  ("client: Last I fetched, you said your latest transaction
  number was '5' with the hash over all refs to be ;
  server: ok, here are the refs and the pack, you're welcome").

Why are you (or rather booking.com) interested in this?

> During the last Git Merge conference last March Stefan talked about
> reftable. In Alex Vandiver's notes [1] it is asked that people
> announce it on the list when they start working on it,

Mostly because many parties want to see it implemented
and were not sure when they could start implementing it.

> and it appears
> that there is a reference implementation in JGit.

The reference implementation can be used in tests
to see if we can interact with them, using the JGIT pre-requisite.

> Looking it up, there is indeed some documentation [2], code [3], tests
> [4] and other related stuff [5] in the JGit repo. It looks like the
> JGit repo and the reftable code there are licensed under the Eclipse
> Distribution License - v 1.0 [7] which is very similar to the 3-Clause
> BSD License also called Modified BSD License which is GPL compatible
> according to gnu.org [9]. So from a quick look it appears that I
> should be able to port the JGit to Git if I just keep the copyright
> and license header comments in all the related files.
>
> So I think the most straightforward and compatible way to do it would
> be to port the JGit implementation.

I would think you can go by the spec and then test if it is compatible with
JGit; that way the spec will be ironed out in corner cases.

> Thanks in advance for any suggestion or comment about this.

I volunteer for reviewing.

(Advanced:) The spec allows for some tunable parameters and JGit's use
is heavily optimized for the server side. I think git-core may need to have
slightly different tweaks in different situations, e.g. block sizes and how
many restarts are put into the block.
On the FS we may want to have faster access at the cost of more disk space,
whereas in the future when using reftables on the wire as well for ref
advertisement we may want to opt for smallest tables. (largest blocks,
no restarts)
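
A toy sketch of that trade-off, in Python for illustration only. It is
not the reftable encoding (the real format uses varints, byte offsets
and fixed-size blocks); the function names and the `restart_interval`
parameter are made up here. It prefix-compresses sorted ref names and
stores a full name at each restart point, so fewer restarts give a
smaller table while more restarts give more places to start decoding
from.

```python
def common_prefix_len(a, b):
    """Length of the shared prefix of two byte strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def encode_block(names, restart_interval):
    """Prefix-compress sorted names.

    Every restart_interval-th record is stored in full; 0 means the only
    full record is the first one.
    """
    records, restarts, prev = [], [], b""
    for i, name in enumerate(names):
        is_restart = i == 0 or (restart_interval > 0 and i % restart_interval == 0)
        plen = 0 if is_restart else common_prefix_len(prev, name)
        if is_restart:
            restarts.append(i)
        records.append((plen, name[plen:]))
        prev = name
    return records, restarts


def decode_block(records):
    """Rebuild the full names from (prefix_len, suffix) records."""
    names, prev = [], b""
    for plen, suffix in records:
        prev = prev[:plen] + suffix
        names.append(prev)
    return names


def suffix_bytes(records):
    """Bytes spent on name suffixes, as a rough proxy for table size."""
    return sum(len(suffix) for _, suffix in records)
```

Comparing `suffix_bytes()` for `restart_interval=0` against
`restart_interval=1` on the same list of names shows the size side of
the trade-off; the lookup side is that a reader can only begin decoding
at a restart point.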

With that said, please implement it in a way that it can not just be used as
a refs backend, but can easily be re-used to write ref advertisements
onto the wire?

Thanks,
Stefan


Re: Implementing reftable in Git

2018-05-09 Thread Jonathan Nieder
Hi,

Christian Couder wrote:

> I might start working on implementing reftable in Git soon.

Yay!

[...]
> So I think the most straightforward and compatible way to do it would
> be to port the JGit implementation.

I suspect following the spec[1] would be even more compatible, since it
would force us to tighten the spec where it is unclear.

> It looks like the
> JGit repo and the reftable code there are licensed under the Eclipse
> Distribution License - v 1.0 [7] which is very similar to the 3-Clause
> BSD License also called Modified BSD License

If you would like the patches at https://git.eclipse.org/r/q/topic:reftable
relicensed for Git's use so that you don't need to include that
license header, let me know.  Separate from any legal concerns, if
you're doing a straight port, a one-line comment crediting the JGit
project would still be appreciated, of course.

That said, I would not be surprised if going straight from the spec is
easier than porting the code.

Thanks,
Jonathan

[1] 
https://eclipse.googlesource.com/jgit/jgit/+/master/Documentation/technical/reftable.md


Re: Implementing reftable in Git

2018-05-09 Thread Duy Nguyen
On Wed, May 9, 2018 at 4:33 PM, Christian Couder wrote:
> Hi,
>
> I might start working on implementing reftable in Git soon.

Adding Michael Haggerty who did lots of work on ref stuff. He probably
can give a few suggestions.

You probably should also look at the last attempt to add lmdb as a new
ref backend. I'm not sure why it's still not in, maybe it wasn't the
right time (e.g. infrastructure was not ready).

> During the last Git Merge conference last March Stefan talked about
> reftable. In Alex Vandiver's notes [1] it is asked that people
> announce it on the list when they start working on it, and it appears
> that there is a reference implementation in JGit.
>
> Looking it up, there is indeed some documentation [2], code [3], tests
> [4] and other related stuff [5] in the JGit repo. It looks like the
> JGit repo and the reftable code there are licensed under the Eclipse
> Distribution License - v 1.0 [7] which is very similar to the 3-Clause
> BSD License also called Modified BSD License which is GPL compatible
> according to gnu.org [9]. So from a quick look it appears that I
> should be able to port the JGit to Git if I just keep the copyright
> and license header comments in all the related files.
>
> So I think the most straightforward and compatible way to do it would
> be to port the JGit implementation.
>
> Thanks in advance for any suggestion or comment about this.
>
> Reftable was first described by Shawn and then discussed last July on
> the list [6].
>
> My work on this would be sponsored by Booking.com.
>
> Thanks,
> Christian.
>
> [1] 
> https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/
>
> [2] 
> https://github.com/eclipse/jgit/blob/master/Documentation/technical/reftable.md
>
> [3] 
> https://github.com/eclipse/jgit/tree/master/org.eclipse.jgit/src/org/eclipse/jgit/internal/storage/reftable
>
> [4] 
> https://github.com/eclipse/jgit/tree/master/org.eclipse.jgit.test/tst/org/eclipse/jgit/internal/storage/reftable
>
> [5] 
> https://github.com/eclipse/jgit/tree/master/org.eclipse.jgit.pgm/src/org/eclipse/jgit/pgm/debug
>
> [6] 
> https://public-inbox.org/git/CAJo=hJtyof=HRy=2sLP0ng0uZ4=s-dpz5dr1af+vhvetkg2...@mail.gmail.com/
>
> [7] http://www.eclipse.org/org/documents/edl-v10.php
>
> [8] https://opensource.org/licenses/BSD-3-Clause
>
> [9] https://www.gnu.org/licenses/license-list.en.html#ModifiedBSD
-- 
Duy


Re: Implementing reftable in Git

2018-05-09 Thread Derrick Stolee

On 5/9/2018 10:33 AM, Christian Couder wrote:

> Hi,
>
> I might start working on implementing reftable in Git soon.
>
> During the last Git Merge conference last March Stefan talked about
> reftable. In Alex Vandiver's notes [1] it is asked that people
> announce it on the list when they start working on it, and it appears
> that there is a reference implementation in JGit.


Thanks for starting on this! In addition to the performance gains, this 
will help a lot of users with case-insensitive file systems from getting 
case-errors on refnames.
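
As a small illustration of the case problem, here is a sketch in Python.
The function name is hypothetical, and `str.casefold()` is only an
approximation of what a case-insensitive file system does. It finds
refnames that differ only by case, which cannot coexist as loose ref
files on such a file system.

```python
from collections import defaultdict


def case_colliding_refs(refnames):
    """Group refnames that would map to the same path when case is ignored."""
    groups = defaultdict(list)
    for name in refnames:
        groups[name.casefold()].append(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

With loose refs, `refs/heads/Foo` and `refs/heads/foo` would end up in
the same file on such a system; a reftable stores the names as data
inside a table file, so the file system's case rules do not apply to
them.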



> Looking it up, there is indeed some documentation [2], code [3], tests
> [4] and other related stuff [5] in the JGit repo. It looks like the
> JGit repo and the reftable code there are licensed under the Eclipse
> Distribution License - v 1.0 [7] which is very similar to the 3-Clause
> BSD License also called Modified BSD License which is GPL compatible
> according to gnu.org [9]. So from a quick look it appears that I
> should be able to port the JGit to Git if I just keep the copyright
> and license header comments in all the related files.
>
> So I think the most straightforward and compatible way to do it would
> be to port the JGit implementation.
>
> Thanks in advance for any suggestion or comment about this.
>
> Reftable was first described by Shawn and then discussed last July on
> the list [6].


The hope is that such a direct port should be possible, but someone else 
should comment on the porting process.


This is also something that could be created independently based on the 
documentation you mention. I was planning to attempt that during a 
hackathon in July, but I'm happy you are able to start earlier (and that 
you are announcing your intentions). I would be happy to review your 
patch series, so please keep me posted.


Thanks,
-Stolee


Implementing reftable in Git

2018-05-09 Thread Christian Couder
Hi,

I might start working on implementing reftable in Git soon.

During the last Git Merge conference last March Stefan talked about
reftable. In Alex Vandiver's notes [1] it is asked that people
announce it on the list when they start working on it, and it appears
that there is a reference implementation in JGit.

Looking it up, there is indeed some documentation [2], code [3], tests
[4] and other related stuff [5] in the JGit repo. It looks like the
JGit repo and the reftable code there are licensed under the Eclipse
Distribution License - v 1.0 [7] which is very similar to the 3-Clause
BSD License also called Modified BSD License which is GPL compatible
according to gnu.org [9]. So from a quick look it appears that I
should be able to port the JGit to Git if I just keep the copyright
and license header comments in all the related files.

So I think the most straightforward and compatible way to do it would
be to port the JGit implementation.

Thanks in advance for any suggestion or comment about this.

Reftable was first described by Shawn and then discussed last July on
the list [6].

My work on this would be sponsored by Booking.com.

Thanks,
Christian.

[1] 
https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/

[2] 
https://github.com/eclipse/jgit/blob/master/Documentation/technical/reftable.md

[3] 
https://github.com/eclipse/jgit/tree/master/org.eclipse.jgit/src/org/eclipse/jgit/internal/storage/reftable

[4] 
https://github.com/eclipse/jgit/tree/master/org.eclipse.jgit.test/tst/org/eclipse/jgit/internal/storage/reftable

[5] 
https://github.com/eclipse/jgit/tree/master/org.eclipse.jgit.pgm/src/org/eclipse/jgit/pgm/debug

[6] 
https://public-inbox.org/git/CAJo=hJtyof=HRy=2sLP0ng0uZ4=s-dpz5dr1af+vhvetkg2...@mail.gmail.com/

[7] http://www.eclipse.org/org/documents/edl-v10.php

[8] https://opensource.org/licenses/BSD-3-Clause

[9] https://www.gnu.org/licenses/license-list.en.html#ModifiedBSD