Re: [PATCH] submodule: Fetch the direct sha1 first
Am 20.02.2016 um 01:11 schrieb Junio C Hamano: Stefan Bellerwrites: On Fri, Feb 19, 2016 at 2:29 PM, Junio C Hamano wrote: Stefan Beller writes: Doing a 'git fetch' only and not the fetch for the specific sha1 would be incorrect? I thought that was what you are attempting to address. Yep. In an ideal world I would imagine it would look like if $sha1 doesn't exist: fetch $sha1 if server did not support fetching direct sha1: fallback to fetch It should look more like this: if $sha1's history and objects are incomplete: fetch ;# normally just like we have done before if $sha1's history and objects are still incomplete: fetch $sha1 That makes lots of sense, doesn't break existing workflows and enables the use case Stefan described. And if people want to skip the first fetch later we could still add a config option to do so. as existing users already expect that commits and objects that are reachable from tips of refs configured to be fetched in the submodule via its configured refspecs are available after this part of the code runs, regardless of this "Gerrit reviews may not have arrived to branches yet" issue. The first "normal" fetch ensures that the expectation is met. Not sure if that has come up so far, but I believe we should not only do that for the submodule command but also for a regular fetch when it is configured to fetch submodule commits too (which it is by default unless configured otherwise). Otherwise we'll lose the plane-safety fetch normally provides in case of these unconnected submodule sha1s, which would then again break users expectations. And if we see demand for only fetching the sha1s without any extra history in the future (e.g. to minimize the amount of data to be fetched by a CI server), we could add a new value ("by-sha1" or such) for both the --recurse-submodules option of fetch and pull and the submodule..fetchRecurseSubmodules config setting. Then both a git submodule update and fetch would attempt to just fetch the sha1(s) needed without any fetching any extra history. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] submodule: Fetch the direct sha1 first
On Fri, Feb 19, 2016 at 1:13 PM, Junio C Hamanowrote: >> Regarding performance, the first fetch should fail quite fast iff the fetch >> fails and then continue with the normal fetch. In case the first fetch works >> fine getting the exact sha1, the fetch should be faster than a default fetch >> as potentially less data needs to be fetched. > > "The fetch should be faster" may not be making a good trade-off > overall--people may have depended on the branches configured to be > fetched to be fetched after this codepath is exercised, but now if > the commit bound to the superproject tree happens to be complete, > even though it is not anchored by any remote tracking ref (hence the > next GC may clobber it), the fetch of other branches will not > happen. > > My knee-jerk reaction is that the order of fallback is probably the > other way around. That is, try "git fetch" as before, check again > if the commit bound to the superproject tree is now complete, and > fallback to fetch that commit with an extra "git fetch". > FWIW, I think the order you suggest here is probably better. It would be lower risk of breaking something since we'd only do something more in this case if the current fetch fails. I've definitely been bit by this before thinking that the sub module would be able to be fetched just fine only to discover that it wasn't able to locate the change. Regards, Jake -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] submodule: Fetch the direct sha1 first
Stefan Bellerwrites: > On Fri, Feb 19, 2016 at 2:29 PM, Junio C Hamano wrote: >> Stefan Beller writes: >> >>> Doing a 'git fetch' only and not the fetch for the specific sha1 would be >>> incorrect? >> >> I thought that was what you are attempting to address. > > Yep. In an ideal world I would imagine it would look like > > if $sha1 doesn't exist: > fetch $sha1 > if server did not support fetching direct sha1: > fallback to fetch It should look more like this: if $sha1's history and objects are incomplete: fetch ;# normally just like we have done before if $sha1's history and objects are still incomplete: fetch $sha1 as existing users already expect that commits and objects that are reachable from tips of refs configured to be fetched in the submodule via its configured refspecs are available after this part of the code runs, regardless of this "Gerrit reviews may not have arrived to branches yet" issue. The first "normal" fetch ensures that the expectation is met. > Would it make sense in case of broken histories to not fetch > (specially if the user asked to not fetch) and rather repair by > making it a shallow repository? Commits whose ancestors, trees and/or blobs are incomplete can and do exist in a perfectly healthy repository and there is no breakage in the history as long as such commits are not reachable from any of the refs. You can for example make a small fetch from 'pu' today, that results in unpack-objects to be run instead of index-pack, and then make another fetch from 'pu', making these loose objects unreachable from anywhere. Maybe there were 5 commits worth of objects in the original transfer, and the objects necessary for the bottom 2 were pruned away while the tip one still in the repository [*1*]. "cat-file -e" may find that the tip commit is there, but "rev-list --objects $oldtip --not --all" will find that the old tip of pu that is left behind is incomplete and cannot be safely used (e.g. "git log -p" would fail). The "$sha1's history and objects are incomplete" check aka "quickfetch()" test is a way to avoid getting fooled by an object that passes "cat-file -e" test. I am not sure if it is feasible, given such an island of commits and associated objects, to craft a proper "shallow" boundary after the fact. It should be doable with extra code, but I do not think there is a canned support at the plumbing level (you can obviously construct it with "cat-file -e" and following the inter object links yourself). This "fetch" is in a cmd_update() codepath, whose purpose is "Update each submodule path to correct revision, using clone and checkout as needed", so I am not sure "to not fetch, specically if the user asked to not fetch" makes much sense in the first place. [Footnote] *1* A canonical example used to be a commit walker that fetches from the tip of a branch that is ahead of us by 5 commits, gets killed after fetching and storing the tip commit object and some of its trees and blobs, and before successfully fetching and storing all the necessary objects, e.g. the parent commits and its trees and blobs. That would leave a disconnected island of objects that are not anchored by any ref. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] submodule: Fetch the direct sha1 first
On Fri, Feb 19, 2016 at 2:29 PM, Junio C Hamanowrote: > Stefan Beller writes: > >> Doing a 'git fetch' only and not the fetch for the specific sha1 would be >> incorrect? > > I thought that was what you are attempting to address. Yep. In an ideal world I would imagine it would look like if $sha1 doesn't exist: fetch $sha1 if server did not support fetching direct sha1: fallback to fetch and I thought this would be small enough to be expressed using the || and &&. It would be slightly more complicated for first doing the fetch and then refetching if the sha1 is still missing as the condition is more complicated to express. > >> ('git fetch' with no args finishes successfully, so no fallback is >> triggered. But we are not sure if we obtained the sha1, so we need to >> check if we have the sha1 by doing a local check and then try to get the sha1 >> again if we don't have it locally. > > Yes, that is what I meant in the "In the opposite fallback order" > suggestion. (clear_local_git_env; cd "$sm_path" && + remote_name=$(get_default_remote) ( (rev=$(git rev-list -n 1 $sha1 --not --all 2>/dev/null) && - test -z "$rev") || git-fetch)) || + test -z "$rev") || git-fetch $remote_name $rev >>> >>> Regardless of the "fallback order" issue, I do not think $rev is a >>> correct thing to fetch here. The superproject binds $sha1 to its >>> tree, and you would be checking that out, so shouldn't you be >>> fetching that commit? >> >> Both $sha1 and $rev are in the submodule (because >> 'git submodule--helper list' puts out the sha1 as the >> submodule sha1). $rev is either empty or equal to $sha1 >> in my understanding of "rev-list $sha1 --not --all". > > Not quite. The rev-list command expects [*1*] one of three outcomes > in the original construct: > > * The repository does not know anything about $sha1; the command >fails, rev is left empty, but thanks to &&, git-fetch runs. ok. "git cat-file -e" would be able to replace this case. > > * The repository has $sha1 but the history behind it is not >complete. While digging from $sha1 following the parent chain, >it would hit a missing object and fails, rev may or may not be >empty, but thanks to &&, git-fetch runs. which I read as a broken shallow clone or a half-way gc'ed repository. (git fetch repairs that? ok.) An intact shallow clone which has enough history to contain sha1 should not be deepened in my understanding of "submodule update". Rereading the man page for "git cat-file -e", the output of "cat-file -e" is the same as in the first case, and we want to also fetch. > > * The repository has $sha1 and its history is all connected. The >command succeeds. If $sha1 is not connected to any of the refs, >however, that commit may be shown and stored in $rev. In this >case, "$rev" happens to be the same as "$sha1". So it would be possible to checkout $sha1 as a detached HEAD, but it's not strongly protected against gc. (I assume gc will not touch objects reachable from HEAD, but not referenced by any ref, but HEAD can change in a heart beat, so it is not as strong of a protection as having a branch include that $sha1). > > As this "fetch" is run in order to make sure that the history behind > $sha1 is complete in the submodule repository, so that detaching the > HEAD at that commit will give the user a useful repository and its > working tree, the check the code is doing in the original is already > flawed. If $sha1 and its ancestry is complete in the repository, > rev-list would succeed, and if $sha1 is ahead of any of the refs, > the original code still runs "git fetch", which is not necessary for > the purpose of detaching the head at $sha1. On the other hand, by > using "-n 1", it can cause rev-list stop before discovering a gap in > history behind $sha1, allowing "git fetch" to be skipped when it > should be run to fill the gap in the history. > > To be complete, the rev-list command line should also run with > "--objects"; after all, a commit walker fetch may have downloaded > commit chain completely but haven't fetched necessary trees and > blobs when it was killed, and "rev-list $sha1 --not --all" would not > catch such a breakage without "--objects". So 'cat-file -e' doesn't sound like it would back-test the history at all, so it doesn't sound like a sufficient replacement. > >> Oh! Looking at that I suspect the >> "test -z $(git rev-list -n 1 $sha1 --not --all 2>/dev/null)" >> and "git cat-file -e" are serving the same purpose here and should just >> indicate if the given sha1 is present or not. > > That is the simplest explanation why the original "rev-list" > invocation is already wrong. It should do an equivalent of >
Re: [PATCH] submodule: Fetch the direct sha1 first
Stefan Bellerwrites: > Doing a 'git fetch' only and not the fetch for the specific sha1 would be > incorrect? I thought that was what you are attempting to address. > ('git fetch' with no args finishes successfully, so no fallback is > triggered. But we are not sure if we obtained the sha1, so we need to > check if we have the sha1 by doing a local check and then try to get the sha1 > again if we don't have it locally. Yes, that is what I meant in the "In the opposite fallback order" suggestion. >>> (clear_local_git_env; cd "$sm_path" && >>> + remote_name=$(get_default_remote) >>> ( (rev=$(git rev-list -n 1 $sha1 >>> --not --all 2>/dev/null) && >>> - test -z "$rev") || git-fetch)) || >>> + test -z "$rev") || git-fetch >>> $remote_name $rev >> >> Regardless of the "fallback order" issue, I do not think $rev is a >> correct thing to fetch here. The superproject binds $sha1 to its >> tree, and you would be checking that out, so shouldn't you be >> fetching that commit? > > Both $sha1 and $rev are in the submodule (because > 'git submodule--helper list' puts out the sha1 as the > submodule sha1). $rev is either empty or equal to $sha1 > in my understanding of "rev-list $sha1 --not --all". Not quite. The rev-list command expects [*1*] one of three outcomes in the original construct: * The repository does not know anything about $sha1; the command fails, rev is left empty, but thanks to &&, git-fetch runs. * The repository has $sha1 but the history behind it is not complete. While digging from $sha1 following the parent chain, it would hit a missing object and fails, rev may or may not be empty, but thanks to &&, git-fetch runs. * The repository has $sha1 and its history is all connected. The command succeeds. If $sha1 is not connected to any of the refs, however, that commit may be shown and stored in $rev. In this case, "$rev" happens to be the same as "$sha1". As this "fetch" is run in order to make sure that the history behind $sha1 is complete in the submodule repository, so that detaching the HEAD at that commit will give the user a useful repository and its working tree, the check the code is doing in the original is already flawed. If $sha1 and its ancestry is complete in the repository, rev-list would succeed, and if $sha1 is ahead of any of the refs, the original code still runs "git fetch", which is not necessary for the purpose of detaching the head at $sha1. On the other hand, by using "-n 1", it can cause rev-list stop before discovering a gap in history behind $sha1, allowing "git fetch" to be skipped when it should be run to fill the gap in the history. To be complete, the rev-list command line should also run with "--objects"; after all, a commit walker fetch may have downloaded commit chain completely but haven't fetched necessary trees and blobs when it was killed, and "rev-list $sha1 --not --all" would not catch such a breakage without "--objects". > Oh! Looking at that I suspect the > "test -z $(git rev-list -n 1 $sha1 --not --all 2>/dev/null)" > and "git cat-file -e" are serving the same purpose here and should just > indicate if the given sha1 is present or not. That is the simplest explanation why the original "rev-list" invocation is already wrong. It should do an equivalent of builtin/fetch.c::quickfetch() to ensure that $sha1 is something that is complete, i.e. could be anchored with a ref if we wanted to, before deciding to avoid running "git fetch". -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] submodule: Fetch the direct sha1 first
On Fri, Feb 19, 2016 at 1:13 PM, Junio C Hamanowrote: > Stefan Beller writes: > >> When reviewing a change in Gerrit, which also updates a submodule, >> a common review practice is to download and cherry-pick the patch locally >> to test it. However when testing it locally, the 'git submodule update' >> may fail fetching the correct submodule sha1 as the corresponding commit >> in the submodule is not yet part of the project history, but also just a >> proposed change. >> >> To ease this, try fetching by sha1 first and when that fails (in case of >> servers which do not allow fetching by sha1), fall back to the default >> behavior we already have. >> >> Signed-off-by: Stefan Beller >> --- >> >> I think it's best to apply this on origin/master, there is no collision >> with sb/submodule-parallel-update. >> >> Also I do not see a good way to test this both in correctness as well >> as performance degeneration. If the first git fetch fails, the second >> fetch is executed, so it should behave as before this patch w.r.t. >> correctness. >> >> Regarding performance, the first fetch should fail quite fast iff the fetch >> fails and then continue with the normal fetch. In case the first fetch works >> fine getting the exact sha1, the fetch should be faster than a default fetch >> as potentially less data needs to be fetched. > > "The fetch should be faster" may not be making a good trade-off > overall--people may have depended on the branches configured to be > fetched to be fetched after this codepath is exercised, but now if > the commit bound to the superproject tree happens to be complete, > even though it is not anchored by any remote tracking ref (hence the > next GC may clobber it), the fetch of other branches will not > happen. > > My knee-jerk reaction is that the order of fallback is probably the > other way around. That is, try "git fetch" as before, check again > if the commit bound to the superproject tree is now complete, and > fallback to fetch that commit with an extra "git fetch". I thought about that and assumed we'd need to have an option for fetch like "--try-to-get-sha1", which depending on the servers capabilities would just add that sha1 to the "wants" during fetching negotiation if the server supports it, otherwise just fetch normally. Doing a 'git fetch' only and not the fetch for the specific sha1 would be incorrect? ('git fetch' with no args finishes successfully, so no fallback is triggered. But we are not sure if we obtained the sha1, so we need to check if we have the sha1 by doing a local check and then try to get the sha1 again if we don't have it locally. So doing the reverse order would be more code here for correctness. > > Jens, what do you think? > >> git-submodule.sh | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/git-submodule.sh b/git-submodule.sh >> index 9bc5c5f..ee0b985 100755 >> --- a/git-submodule.sh >> +++ b/git-submodule.sh >> @@ -746,8 +746,9 @@ Maybe you want to use 'update --init'?")" >> # Run fetch only if $sha1 isn't present or it >> # is not reachable from a ref. >> (clear_local_git_env; cd "$sm_path" && >> + remote_name=$(get_default_remote) >> ( (rev=$(git rev-list -n 1 $sha1 --not >> --all 2>/dev/null) && >> - test -z "$rev") || git-fetch)) || >> + test -z "$rev") || git-fetch >> $remote_name $rev > > Regardless of the "fallback order" issue, I do not think $rev is a > correct thing to fetch here. The superproject binds $sha1 to its > tree, and you would be checking that out, so shouldn't you be > fetching that commit? Both $sha1 and $rev are in the submodule (because 'git submodule--helper list' puts out the sha1 as the submodule sha1). $rev is either empty or equal to $sha1 in my understanding of "rev-list $sha1 --not --all". However for readability maybe we want to write: (clear_local_git_env; cd "$sm_path" && test -z $(git rev-list -n 1 $sha1 --not --all 2>/dev/null) || git fetch $(get_default_remote) $sha1 || git fetch || die ... ) So in case you want to the other order, I'd propose (clear_local_git_env; cd "$sm_path" && test -z $(git rev-list -n 1 $sha1 --not --all 2>/dev/null) || git fetch || (git cat-file -e $sha1 && git fetch $(get_default_remote) $sha1) || die ... ) Oh! Looking at that I suspect the "test -z $(git rev-list -n 1 $sha1 --not --all 2>/dev/null)" and "git cat-file -e" are serving the same purpose here and should just indicate if the given sha1 is present or not. So we could reduce it further to (clear_local_git_env; cd "$sm_path" && git cat-file -e $sha1 || git fetch || (git cat-file -e $sha1 &&
Re: [PATCH] submodule: Fetch the direct sha1 first
Stefan Bellerwrites: > When reviewing a change in Gerrit, which also updates a submodule, > a common review practice is to download and cherry-pick the patch locally > to test it. However when testing it locally, the 'git submodule update' > may fail fetching the correct submodule sha1 as the corresponding commit > in the submodule is not yet part of the project history, but also just a > proposed change. > > To ease this, try fetching by sha1 first and when that fails (in case of > servers which do not allow fetching by sha1), fall back to the default > behavior we already have. > > Signed-off-by: Stefan Beller > --- > > I think it's best to apply this on origin/master, there is no collision > with sb/submodule-parallel-update. > > Also I do not see a good way to test this both in correctness as well > as performance degeneration. If the first git fetch fails, the second > fetch is executed, so it should behave as before this patch w.r.t. > correctness. > > Regarding performance, the first fetch should fail quite fast iff the fetch > fails and then continue with the normal fetch. In case the first fetch works > fine getting the exact sha1, the fetch should be faster than a default fetch > as potentially less data needs to be fetched. "The fetch should be faster" may not be making a good trade-off overall--people may have depended on the branches configured to be fetched to be fetched after this codepath is exercised, but now if the commit bound to the superproject tree happens to be complete, even though it is not anchored by any remote tracking ref (hence the next GC may clobber it), the fetch of other branches will not happen. My knee-jerk reaction is that the order of fallback is probably the other way around. That is, try "git fetch" as before, check again if the commit bound to the superproject tree is now complete, and fallback to fetch that commit with an extra "git fetch". Jens, what do you think? > git-submodule.sh | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/git-submodule.sh b/git-submodule.sh > index 9bc5c5f..ee0b985 100755 > --- a/git-submodule.sh > +++ b/git-submodule.sh > @@ -746,8 +746,9 @@ Maybe you want to use 'update --init'?")" > # Run fetch only if $sha1 isn't present or it > # is not reachable from a ref. > (clear_local_git_env; cd "$sm_path" && > + remote_name=$(get_default_remote) > ( (rev=$(git rev-list -n 1 $sha1 --not > --all 2>/dev/null) && > - test -z "$rev") || git-fetch)) || > + test -z "$rev") || git-fetch > $remote_name $rev Regardless of the "fallback order" issue, I do not think $rev is a correct thing to fetch here. The superproject binds $sha1 to its tree, and you would be checking that out, so shouldn't you be fetching that commit? -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] submodule: Fetch the direct sha1 first
When reviewing a change in Gerrit, which also updates a submodule, a common review practice is to download and cherry-pick the patch locally to test it. However when testing it locally, the 'git submodule update' may fail fetching the correct submodule sha1 as the corresponding commit in the submodule is not yet part of the project history, but also just a proposed change. To ease this, try fetching by sha1 first and when that fails (in case of servers which do not allow fetching by sha1), fall back to the default behavior we already have. Signed-off-by: Stefan Beller--- I think it's best to apply this on origin/master, there is no collision with sb/submodule-parallel-update. Also I do not see a good way to test this both in correctness as well as performance degeneration. If the first git fetch fails, the second fetch is executed, so it should behave as before this patch w.r.t. correctness. Regarding performance, the first fetch should fail quite fast iff the fetch fails and then continue with the normal fetch. In case the first fetch works fine getting the exact sha1, the fetch should be faster than a default fetch as potentially less data needs to be fetched. Thanks, Stefan git-submodule.sh | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/git-submodule.sh b/git-submodule.sh index 9bc5c5f..ee0b985 100755 --- a/git-submodule.sh +++ b/git-submodule.sh @@ -746,8 +746,9 @@ Maybe you want to use 'update --init'?")" # Run fetch only if $sha1 isn't present or it # is not reachable from a ref. (clear_local_git_env; cd "$sm_path" && + remote_name=$(get_default_remote) ( (rev=$(git rev-list -n 1 $sha1 --not --all 2>/dev/null) && -test -z "$rev") || git-fetch)) || +test -z "$rev") || git-fetch $remote_name $rev || git-fetch)) || die "$(eval_gettext "Unable to fetch in submodule path '\$displaypath'")" fi -- 2.7.0.rc0.34.ga06e0b3.dirty -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html