Re: RFC: Missing blob hook might be invoked infinitely recursively

2017-06-30 Thread Jeff Hostetler



On 6/29/2017 2:48 PM, Jonathan Tan wrote:

As some of you may know, I'm currently working on support for partial
clones/fetches in Git (where blobs above a user-specified size threshold
are not downloaded - only their names and sizes are downloaded). To do
this, the client repository needs to be able to download blobs at will
whenever it needs a missing one (for example, upon checkout).

So I have done this by adding support for a hook in Git [1], and
updating the object-reading code in Git to, by default, automatically
invoke this hook whenever necessary. (This means that existing
subsystems will all work by default, in theory at least.) My current
design is for the hook to have maximum flexibility - when invoked with a
list of SHA-1s, it must merely ensure that those objects are in the
local repo, whether packed or loose.

I am also working on a command (fetch-blob) to be bundled with Git to be
used as a default hook, and here is where the problem lies.

Suppose you have missing blob AB12 and CD34 that you now need, so
fetch-blob is invoked. It sends the literals AB12 and CD34 to a new
server endpoint and obtains a packfile, which it then pipes through "git
index-pack". The issue is that "git index-pack" wants to try to access
AB12 and CD34 in the local repo in order to do a SHA-1 collision check,
and therefore fetch-blob is invoked once again, creating infinite
recursion.

This is straightforwardly fixed by making "git index-pack" understand
about missing blobs, but this might be a symptom of this approach being
error-prone (custom hooks that invoke any Git command must be extra
careful).


I'm not sure if this what you meant, but could you pass to index-pack
the set of missing blobs that you passed to fetch-blob?  That is, let
index-pack do it's normal lookups, but just not on the list passed in.



So I have thought of a few solutions, each with its pros and cons:

1. Require the hook to instead output a packfile to stdout. This means
that that hook no longer needs to access the local repo, and thus has
less dependence on Git commands. But this reduces the flexibility in
that its output must be packed, not loose. (This is fine for the use
cases I'm thinking of, but probably not so for others.)

2. Add support for an environment variable to Git that suppresses access
to the missing blob manifest, in effect, suppressing invocation of the
hook. This allows anyone (the person configuring Git or the hook writer)
to suppress this access, although they might need in-depth knowledge to
know whether the hook is meant to be run with such access suppressed or
required.

3. Like the above, except for a command-line argument to Git.

What do you think? Any solutions that I am missing?

[1] Work in progress, but you can see an earlier version here: 
https://public-inbox.org/git/b917a463f0ad4ce0ab115203b3f24894961a2e75.1497558851.git.jonathanta...@google.com/



Re: RFC: Missing blob hook might be invoked infinitely recursively

2017-06-29 Thread Jonathan Nieder
Hi,

Jonathan Tan wrote:

> Suppose you have missing blob AB12 and CD34 that you now need, so
> fetch-blob is invoked. It sends the literals AB12 and CD34 to a new
> server endpoint and obtains a packfile, which it then pipes through "git
> index-pack". The issue is that "git index-pack" wants to try to access
> AB12 and CD34 in the local repo in order to do a SHA-1 collision check,
> and therefore fetch-blob is invoked once again, creating infinite
> recursion.
[...]
> 2. Add support for an environment variable to Git that suppresses access
> to the missing blob manifest, in effect, suppressing invocation of the
> hook. This allows anyone (the person configuring Git or the hook writer)
> to suppress this access, although they might need in-depth knowledge to
> know whether the hook is meant to be run with such access suppressed or
> required.

Small tweak: what if Git itself sets that environment variable when
invoking the hook?  A fetch-blob hook author can then explicitly unset
the environment variable to request recursion if they need it.

I should credit Shawn Pearce for saying this a little more clearly
offline a moment ago.

Thanks,
Jonathan


RFC: Missing blob hook might be invoked infinitely recursively

2017-06-29 Thread Jonathan Tan
As some of you may know, I'm currently working on support for partial
clones/fetches in Git (where blobs above a user-specified size threshold
are not downloaded - only their names and sizes are downloaded). To do
this, the client repository needs to be able to download blobs at will
whenever it needs a missing one (for example, upon checkout).

So I have done this by adding support for a hook in Git [1], and
updating the object-reading code in Git to, by default, automatically
invoke this hook whenever necessary. (This means that existing
subsystems will all work by default, in theory at least.) My current
design is for the hook to have maximum flexibility - when invoked with a
list of SHA-1s, it must merely ensure that those objects are in the
local repo, whether packed or loose.

I am also working on a command (fetch-blob) to be bundled with Git to be
used as a default hook, and here is where the problem lies.

Suppose you have missing blob AB12 and CD34 that you now need, so
fetch-blob is invoked. It sends the literals AB12 and CD34 to a new
server endpoint and obtains a packfile, which it then pipes through "git
index-pack". The issue is that "git index-pack" wants to try to access
AB12 and CD34 in the local repo in order to do a SHA-1 collision check,
and therefore fetch-blob is invoked once again, creating infinite
recursion.

This is straightforwardly fixed by making "git index-pack" understand
about missing blobs, but this might be a symptom of this approach being
error-prone (custom hooks that invoke any Git command must be extra
careful).

So I have thought of a few solutions, each with its pros and cons:

1. Require the hook to instead output a packfile to stdout. This means
that that hook no longer needs to access the local repo, and thus has
less dependence on Git commands. But this reduces the flexibility in
that its output must be packed, not loose. (This is fine for the use
cases I'm thinking of, but probably not so for others.)

2. Add support for an environment variable to Git that suppresses access
to the missing blob manifest, in effect, suppressing invocation of the
hook. This allows anyone (the person configuring Git or the hook writer)
to suppress this access, although they might need in-depth knowledge to
know whether the hook is meant to be run with such access suppressed or
required.

3. Like the above, except for a command-line argument to Git.

What do you think? Any solutions that I am missing?

[1] Work in progress, but you can see an earlier version here: 
https://public-inbox.org/git/b917a463f0ad4ce0ab115203b3f24894961a2e75.1497558851.git.jonathanta...@google.com/