Re: [PATCH] submodule: Fetch the direct sha1 first

2016-02-22 Thread Jens Lehmann

Am 20.02.2016 um 01:11 schrieb Junio C Hamano:

Stefan Beller  writes:


On Fri, Feb 19, 2016 at 2:29 PM, Junio C Hamano  wrote:

Stefan Beller  writes:


Doing a 'git fetch' only and not the fetch for the specific sha1 would be
incorrect?


I thought that was what you are attempting to address.


Yep. In an ideal world I would imagine it would look like

 if $sha1 doesn't exist:
 fetch $sha1
 if server did not support fetching direct sha1:
 fallback to fetch 


It should look more like this:

if $sha1's history and objects are incomplete:
fetch ;# normally just like we have done before
 if $sha1's history and objects are still incomplete:
fetch $sha1


That makes lots of sense, doesn't break existing workflows and
enables the use case Stefan described. And if people want to skip
the first fetch later we could still add a config option to do so.


as existing users already expect that commits and objects that are
reachable from tips of refs configured to be fetched in the
submodule via its configured refspecs are available after this part
of the code runs, regardless of this "Gerrit reviews may not have
arrived to branches yet" issue.  The first "normal" fetch ensures
that the expectation is met.


Not sure if that has come up so far, but I believe we should not
only do that for the submodule command but also for a regular
fetch when it is configured to fetch submodule commits too (which
it is by default unless configured otherwise). Otherwise we'll
lose the plane-safety fetch normally provides in case of these
unconnected submodule sha1s, which would then again break users
expectations.

And if we see demand for only fetching the sha1s without any
extra history in the future (e.g. to minimize the amount of data
to be fetched by a CI server), we could add a new value ("by-sha1"
or such) for both the --recurse-submodules option of fetch and
pull and the submodule..fetchRecurseSubmodules config
setting. Then both a git submodule update and fetch would attempt
to just fetch the sha1(s) needed without any fetching any extra
history.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] submodule: Fetch the direct sha1 first

2016-02-20 Thread Jacob Keller
On Fri, Feb 19, 2016 at 1:13 PM, Junio C Hamano  wrote:
>> Regarding performance, the first fetch should fail quite fast iff the fetch
>> fails and then continue with the normal fetch. In case the first fetch works
>> fine getting the exact sha1, the fetch should be faster than a default fetch
>> as potentially less data needs to be fetched.
>
> "The fetch should be faster" may not be making a good trade-off
> overall--people may have depended on the branches configured to be
> fetched to be fetched after this codepath is exercised, but now if
> the commit bound to the superproject tree happens to be complete,
> even though it is not anchored by any remote tracking ref (hence the
> next GC may clobber it), the fetch of other branches will not
> happen.
>
> My knee-jerk reaction is that the order of fallback is probably the
> other way around.  That is, try "git fetch" as before, check again
> if the commit bound to the superproject tree is now complete, and
> fallback to fetch that commit with an extra "git fetch".
>

FWIW, I think the order you suggest here is probably better. It would
be lower risk of breaking something since we'd only do something more
in this case if the current fetch fails.

I've definitely been bit by this before thinking that the sub module
would be able to be fetched just fine only to discover that it wasn't
able to locate the change.

Regards,
Jake
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] submodule: Fetch the direct sha1 first

2016-02-19 Thread Junio C Hamano
Stefan Beller  writes:

> On Fri, Feb 19, 2016 at 2:29 PM, Junio C Hamano  wrote:
>> Stefan Beller  writes:
>>
>>> Doing a 'git fetch' only and not the fetch for the specific sha1 would be
>>> incorrect?
>>
>> I thought that was what you are attempting to address.
>
> Yep. In an ideal world I would imagine it would look like
>
> if $sha1 doesn't exist:
> fetch $sha1
> if server did not support fetching direct sha1:
> fallback to fetch 

It should look more like this:

if $sha1's history and objects are incomplete:
fetch ;# normally just like we have done before
if $sha1's history and objects are still incomplete:
fetch $sha1

as existing users already expect that commits and objects that are
reachable from tips of refs configured to be fetched in the
submodule via its configured refspecs are available after this part
of the code runs, regardless of this "Gerrit reviews may not have
arrived to branches yet" issue.  The first "normal" fetch ensures
that the expectation is met.

> Would it make sense in case of broken histories to not fetch
> (specially if the user asked to not fetch) and rather repair by
> making it a shallow repository?

Commits whose ancestors, trees and/or blobs are incomplete can and
do exist in a perfectly healthy repository and there is no breakage
in the history as long as such commits are not reachable from any of
the refs.

You can for example make a small fetch from 'pu' today, that results
in unpack-objects to be run instead of index-pack, and then make
another fetch from 'pu', making these loose objects unreachable from
anywhere.  Maybe there were 5 commits worth of objects in the
original transfer, and the objects necessary for the bottom 2 were
pruned away while the tip one still in the repository [*1*].

"cat-file -e" may find that the tip commit is there, but "rev-list
--objects $oldtip --not --all" will find that the old tip of pu that
is left behind is incomplete and cannot be safely used (e.g. "git
log -p" would fail).  The "$sha1's history and objects are
incomplete" check aka "quickfetch()" test is a way to avoid getting
fooled by an object that passes "cat-file -e" test.

I am not sure if it is feasible, given such an island of commits and
associated objects, to craft a proper "shallow" boundary after the
fact.  It should be doable with extra code, but I do not think there
is a canned support at the plumbing level (you can obviously
construct it with "cat-file -e" and following the inter object links
yourself).

This "fetch" is in a cmd_update() codepath, whose purpose is "Update
each submodule path to correct revision, using clone and checkout as
needed", so I am not sure "to not fetch, specically if the user
asked to not fetch" makes much sense in the first place.


[Footnote]

*1* A canonical example used to be a commit walker that fetches from
the tip of a branch that is ahead of us by 5 commits, gets
killed after fetching and storing the tip commit object and some
of its trees and blobs, and before successfully fetching and
storing all the necessary objects, e.g. the parent commits and
its trees and blobs.  That would leave a disconnected island of
objects that are not anchored by any ref.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] submodule: Fetch the direct sha1 first

2016-02-19 Thread Stefan Beller
On Fri, Feb 19, 2016 at 2:29 PM, Junio C Hamano  wrote:
> Stefan Beller  writes:
>
>> Doing a 'git fetch' only and not the fetch for the specific sha1 would be
>> incorrect?
>
> I thought that was what you are attempting to address.

Yep. In an ideal world I would imagine it would look like

if $sha1 doesn't exist:
fetch $sha1
if server did not support fetching direct sha1:
fallback to fetch 

and I thought this would be small enough to be expressed
using the || and &&.

It would be slightly more complicated for first doing the fetch
and then refetching if the sha1 is still missing as the condition
is more complicated to express.

>
>> ('git fetch' with no args finishes successfully, so no fallback is
>> triggered. But we are not sure if we obtained the sha1, so we need to
>> check if we have the sha1 by doing a local check and then try to get the sha1
>> again if we don't have it locally.
>
> Yes, that is what I meant in the "In the opposite fallback order"
> suggestion.
   (clear_local_git_env; cd "$sm_path" &&
 + remote_name=$(get_default_remote)
   ( (rev=$(git rev-list -n 1 $sha1 
 --not --all 2>/dev/null) &&
 -  test -z "$rev") || git-fetch)) ||
 +  test -z "$rev") || git-fetch 
 $remote_name $rev
>>>
>>> Regardless of the "fallback order" issue, I do not think $rev is a
>>> correct thing to fetch here.  The superproject binds $sha1 to its
>>> tree, and you would be checking that out, so shouldn't you be
>>> fetching that commit?
>>
>> Both $sha1 and $rev are in the submodule (because
>> 'git submodule--helper list' puts out the sha1 as the
>> submodule sha1). $rev is either empty or equal to $sha1
>> in my understanding of "rev-list $sha1 --not --all".
>
> Not quite.  The rev-list command expects [*1*] one of three outcomes
> in the original construct:
>
>  * The repository does not know anything about $sha1; the command
>fails, rev is left empty, but thanks to &&, git-fetch runs.

ok. "git cat-file -e" would be able to replace this case.

>
>  * The repository has $sha1 but the history behind it is not
>complete.  While digging from $sha1 following the parent chain,
>it would hit a missing object and fails, rev may or may not be
>empty, but thanks to &&, git-fetch runs.

which I read as a broken shallow clone or a half-way gc'ed repository.
(git fetch repairs that? ok.)

An intact shallow clone which has enough history to contain sha1 should
not be deepened in my understanding of "submodule update".

Rereading the man page for  "git cat-file -e", the output of
"cat-file -e" is the same as in the first case, and we want to also fetch.

>
>  * The repository has $sha1 and its history is all connected.  The
>command succeeds.  If $sha1 is not connected to any of the refs,
>however, that commit may be shown and stored in $rev.  In this
>case, "$rev" happens to be the same as "$sha1".

So it would be possible to checkout $sha1 as a detached HEAD,
but it's not strongly protected against gc. (I assume gc will not
touch objects reachable from HEAD, but not referenced by any ref,
but HEAD can change in a heart beat, so it is not as strong of a protection
as having a branch include that $sha1).

>
> As this "fetch" is run in order to make sure that the history behind
> $sha1 is complete in the submodule repository, so that detaching the
> HEAD at that commit will give the user a useful repository and its
> working tree, the check the code is doing in the original is already
> flawed.  If $sha1 and its ancestry is complete in the repository,
> rev-list would succeed, and if $sha1 is ahead of any of the refs,
> the original code still runs "git fetch", which is not necessary for
> the purpose of detaching the head at $sha1.  On the other hand, by
> using "-n 1", it can cause rev-list stop before discovering a gap in
> history behind $sha1, allowing "git fetch" to be skipped when it
> should be run to fill the gap in the history.
>
> To be complete, the rev-list command line should also run with
> "--objects"; after all, a commit walker fetch may have downloaded
> commit chain completely but haven't fetched necessary trees and
> blobs when it was killed, and "rev-list $sha1 --not --all" would not
> catch such a breakage without "--objects".

So 'cat-file -e' doesn't sound like it would back-test the history at all,
so it doesn't sound like a sufficient replacement.

>
>> Oh! Looking at that I suspect the
>> "test -z $(git rev-list -n 1 $sha1 --not --all 2>/dev/null)"
>> and "git cat-file -e" are serving the same purpose here and should just
>> indicate if the given sha1 is present or not.
>
> That is the simplest explanation why the original "rev-list"
> invocation is already wrong.  It should do an equivalent of
> 

Re: [PATCH] submodule: Fetch the direct sha1 first

2016-02-19 Thread Junio C Hamano
Stefan Beller  writes:

> Doing a 'git fetch' only and not the fetch for the specific sha1 would be
> incorrect?

I thought that was what you are attempting to address.

> ('git fetch' with no args finishes successfully, so no fallback is
> triggered. But we are not sure if we obtained the sha1, so we need to
> check if we have the sha1 by doing a local check and then try to get the sha1
> again if we don't have it locally.

Yes, that is what I meant in the "In the opposite fallback order"
suggestion.
>>>   (clear_local_git_env; cd "$sm_path" &&
>>> + remote_name=$(get_default_remote)
>>>   ( (rev=$(git rev-list -n 1 $sha1 
>>> --not --all 2>/dev/null) &&
>>> -  test -z "$rev") || git-fetch)) ||
>>> +  test -z "$rev") || git-fetch 
>>> $remote_name $rev
>>
>> Regardless of the "fallback order" issue, I do not think $rev is a
>> correct thing to fetch here.  The superproject binds $sha1 to its
>> tree, and you would be checking that out, so shouldn't you be
>> fetching that commit?
>
> Both $sha1 and $rev are in the submodule (because
> 'git submodule--helper list' puts out the sha1 as the
> submodule sha1). $rev is either empty or equal to $sha1
> in my understanding of "rev-list $sha1 --not --all".

Not quite.  The rev-list command expects [*1*] one of three outcomes
in the original construct:

 * The repository does not know anything about $sha1; the command
   fails, rev is left empty, but thanks to &&, git-fetch runs.

 * The repository has $sha1 but the history behind it is not
   complete.  While digging from $sha1 following the parent chain,
   it would hit a missing object and fails, rev may or may not be
   empty, but thanks to &&, git-fetch runs.

 * The repository has $sha1 and its history is all connected.  The
   command succeeds.  If $sha1 is not connected to any of the refs,
   however, that commit may be shown and stored in $rev.  In this
   case, "$rev" happens to be the same as "$sha1".

As this "fetch" is run in order to make sure that the history behind
$sha1 is complete in the submodule repository, so that detaching the
HEAD at that commit will give the user a useful repository and its
working tree, the check the code is doing in the original is already
flawed.  If $sha1 and its ancestry is complete in the repository,
rev-list would succeed, and if $sha1 is ahead of any of the refs,
the original code still runs "git fetch", which is not necessary for
the purpose of detaching the head at $sha1.  On the other hand, by
using "-n 1", it can cause rev-list stop before discovering a gap in
history behind $sha1, allowing "git fetch" to be skipped when it
should be run to fill the gap in the history.

To be complete, the rev-list command line should also run with
"--objects"; after all, a commit walker fetch may have downloaded
commit chain completely but haven't fetched necessary trees and
blobs when it was killed, and "rev-list $sha1 --not --all" would not
catch such a breakage without "--objects".

> Oh! Looking at that I suspect the
> "test -z $(git rev-list -n 1 $sha1 --not --all 2>/dev/null)"
> and "git cat-file -e" are serving the same purpose here and should just
> indicate if the given sha1 is present or not.

That is the simplest explanation why the original "rev-list"
invocation is already wrong.  It should do an equivalent of
builtin/fetch.c::quickfetch() to ensure that $sha1 is something that
is complete, i.e. could be anchored with a ref if we wanted to,
before deciding to avoid running "git fetch".

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] submodule: Fetch the direct sha1 first

2016-02-19 Thread Stefan Beller
On Fri, Feb 19, 2016 at 1:13 PM, Junio C Hamano  wrote:
> Stefan Beller  writes:
>
>> When reviewing a change in Gerrit, which also updates a submodule,
>> a common review practice is to download and cherry-pick the patch locally
>> to test it. However when testing it locally, the 'git submodule update'
>> may fail fetching the correct submodule sha1 as the corresponding commit
>> in the submodule is not yet part of the project history, but also just a
>> proposed change.
>>
>> To ease this, try fetching by sha1 first and when that fails (in case of
>> servers which do not allow fetching by sha1), fall back to the default
>> behavior we already have.
>>
>> Signed-off-by: Stefan Beller 
>> ---
>>
>> I think it's best to apply this on origin/master, there is no collision
>> with sb/submodule-parallel-update.
>>
>> Also I do not see a good way to test this both in correctness as well
>> as performance degeneration. If the first git fetch fails, the second
>> fetch is executed, so it should behave as before this patch w.r.t. 
>> correctness.
>>
>> Regarding performance, the first fetch should fail quite fast iff the fetch
>> fails and then continue with the normal fetch. In case the first fetch works
>> fine getting the exact sha1, the fetch should be faster than a default fetch
>> as potentially less data needs to be fetched.
>
> "The fetch should be faster" may not be making a good trade-off
> overall--people may have depended on the branches configured to be
> fetched to be fetched after this codepath is exercised, but now if
> the commit bound to the superproject tree happens to be complete,
> even though it is not anchored by any remote tracking ref (hence the
> next GC may clobber it), the fetch of other branches will not
> happen.
>
> My knee-jerk reaction is that the order of fallback is probably the
> other way around.  That is, try "git fetch" as before, check again
> if the commit bound to the superproject tree is now complete, and
> fallback to fetch that commit with an extra "git fetch".

I thought about that and assumed we'd need to have an option for
fetch like "--try-to-get-sha1", which depending on the servers capabilities
would just add that sha1 to the "wants" during fetching negotiation if the
server supports it, otherwise just fetch normally.

Doing a 'git fetch' only and not the fetch for the specific sha1 would be
incorrect? ('git fetch' with no args finishes successfully, so no fallback is
triggered. But we are not sure if we obtained the sha1, so we need to
check if we have the sha1 by doing a local check and then try to get the sha1
again if we don't have it locally. So doing the reverse order would be
more code here for correctness.

>
> Jens, what do you think?
>
>>  git-submodule.sh | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/git-submodule.sh b/git-submodule.sh
>> index 9bc5c5f..ee0b985 100755
>> --- a/git-submodule.sh
>> +++ b/git-submodule.sh
>> @@ -746,8 +746,9 @@ Maybe you want to use 'update --init'?")"
>>   # Run fetch only if $sha1 isn't present or it
>>   # is not reachable from a ref.
>>   (clear_local_git_env; cd "$sm_path" &&
>> + remote_name=$(get_default_remote)
>>   ( (rev=$(git rev-list -n 1 $sha1 --not 
>> --all 2>/dev/null) &&
>> -  test -z "$rev") || git-fetch)) ||
>> +  test -z "$rev") || git-fetch 
>> $remote_name $rev
>
> Regardless of the "fallback order" issue, I do not think $rev is a
> correct thing to fetch here.  The superproject binds $sha1 to its
> tree, and you would be checking that out, so shouldn't you be
> fetching that commit?

Both $sha1 and $rev are in the submodule (because
'git submodule--helper list' puts out the sha1 as the
submodule sha1). $rev is either empty or equal to $sha1
in my understanding of "rev-list $sha1 --not --all". However for
readability maybe we want to write:

(clear_local_git_env; cd "$sm_path" &&
test -z $(git rev-list -n 1 $sha1 --not --all 2>/dev/null) ||
git fetch $(get_default_remote) $sha1 ||
git fetch ||
die ...
)

So in case you want to the other order, I'd propose

(clear_local_git_env; cd "$sm_path" &&
test -z $(git rev-list -n 1 $sha1 --not --all 2>/dev/null) ||
git fetch ||
(git cat-file -e $sha1 && git fetch $(get_default_remote) $sha1) ||
die ...
)

Oh! Looking at that I suspect the
"test -z $(git rev-list -n 1 $sha1 --not --all 2>/dev/null)"
and "git cat-file -e" are serving the same purpose here and should just
indicate if the given sha1 is present or not.

So we could reduce it further to

(clear_local_git_env; cd "$sm_path" &&
git cat-file -e $sha1 || git fetch ||
(git cat-file -e $sha1 && 

Re: [PATCH] submodule: Fetch the direct sha1 first

2016-02-19 Thread Junio C Hamano
Stefan Beller  writes:

> When reviewing a change in Gerrit, which also updates a submodule,
> a common review practice is to download and cherry-pick the patch locally
> to test it. However when testing it locally, the 'git submodule update'
> may fail fetching the correct submodule sha1 as the corresponding commit
> in the submodule is not yet part of the project history, but also just a
> proposed change.
>
> To ease this, try fetching by sha1 first and when that fails (in case of
> servers which do not allow fetching by sha1), fall back to the default
> behavior we already have.
>
> Signed-off-by: Stefan Beller 
> ---
>
> I think it's best to apply this on origin/master, there is no collision
> with sb/submodule-parallel-update.
>
> Also I do not see a good way to test this both in correctness as well
> as performance degeneration. If the first git fetch fails, the second
> fetch is executed, so it should behave as before this patch w.r.t. 
> correctness.
>
> Regarding performance, the first fetch should fail quite fast iff the fetch
> fails and then continue with the normal fetch. In case the first fetch works
> fine getting the exact sha1, the fetch should be faster than a default fetch
> as potentially less data needs to be fetched.

"The fetch should be faster" may not be making a good trade-off
overall--people may have depended on the branches configured to be
fetched to be fetched after this codepath is exercised, but now if
the commit bound to the superproject tree happens to be complete,
even though it is not anchored by any remote tracking ref (hence the
next GC may clobber it), the fetch of other branches will not
happen.

My knee-jerk reaction is that the order of fallback is probably the
other way around.  That is, try "git fetch" as before, check again
if the commit bound to the superproject tree is now complete, and
fallback to fetch that commit with an extra "git fetch".

Jens, what do you think?

>  git-submodule.sh | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/git-submodule.sh b/git-submodule.sh
> index 9bc5c5f..ee0b985 100755
> --- a/git-submodule.sh
> +++ b/git-submodule.sh
> @@ -746,8 +746,9 @@ Maybe you want to use 'update --init'?")"
>   # Run fetch only if $sha1 isn't present or it
>   # is not reachable from a ref.
>   (clear_local_git_env; cd "$sm_path" &&
> + remote_name=$(get_default_remote)
>   ( (rev=$(git rev-list -n 1 $sha1 --not 
> --all 2>/dev/null) &&
> -  test -z "$rev") || git-fetch)) ||
> +  test -z "$rev") || git-fetch 
> $remote_name $rev

Regardless of the "fallback order" issue, I do not think $rev is a
correct thing to fetch here.  The superproject binds $sha1 to its
tree, and you would be checking that out, so shouldn't you be
fetching that commit?
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] submodule: Fetch the direct sha1 first

2016-02-19 Thread Stefan Beller
When reviewing a change in Gerrit, which also updates a submodule,
a common review practice is to download and cherry-pick the patch locally
to test it. However when testing it locally, the 'git submodule update'
may fail fetching the correct submodule sha1 as the corresponding commit
in the submodule is not yet part of the project history, but also just a
proposed change.

To ease this, try fetching by sha1 first and when that fails (in case of
servers which do not allow fetching by sha1), fall back to the default
behavior we already have.

Signed-off-by: Stefan Beller 
---

I think it's best to apply this on origin/master, there is no collision
with sb/submodule-parallel-update.

Also I do not see a good way to test this both in correctness as well
as performance degeneration. If the first git fetch fails, the second
fetch is executed, so it should behave as before this patch w.r.t. correctness.

Regarding performance, the first fetch should fail quite fast iff the fetch
fails and then continue with the normal fetch. In case the first fetch works
fine getting the exact sha1, the fetch should be faster than a default fetch
as potentially less data needs to be fetched.

Thanks,
Stefan

 git-submodule.sh | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/git-submodule.sh b/git-submodule.sh
index 9bc5c5f..ee0b985 100755
--- a/git-submodule.sh
+++ b/git-submodule.sh
@@ -746,8 +746,9 @@ Maybe you want to use 'update --init'?")"
# Run fetch only if $sha1 isn't present or it
# is not reachable from a ref.
(clear_local_git_env; cd "$sm_path" &&
+   remote_name=$(get_default_remote)
( (rev=$(git rev-list -n 1 $sha1 --not 
--all 2>/dev/null) &&
-test -z "$rev") || git-fetch)) ||
+test -z "$rev") || git-fetch 
$remote_name $rev || git-fetch)) ||
die "$(eval_gettext "Unable to fetch in 
submodule path '\$displaypath'")"
fi
 
-- 
2.7.0.rc0.34.ga06e0b3.dirty

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html