Re: GDPR compliance best practices?
On Tuesday, June 12, 2018 09:12:19 PM Peter Backes wrote: > So? If a thousand lawyers claim 1+1=3, it becomes a > mathematical truth? No, but probably a legal "truth". :) -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: worktrees vs. alternates
On Wednesday, May 16, 2018 01:06:59 PM Jeff King wrote: > On Wed, May 16, 2018 at 01:40:56PM -0600, Martin Fick wrote: > > > In theory the fetch means that it's safe to actually > > > prune in the mother repo, but in practice there are > > > still races. They don't come up often, but if you > > > have enough repositories, they do eventually. :) > > > > Peff, > > > > I would be very curious to hear what you think of this > > approach to mitigating the effect of those races? > > > > https://git.eclipse.org/r/c/122288/2 > > The crux of the problem is that we have no way to > atomically mark an object as "I am using this -- do not > delete" with respect to the actual deletion. > > So if I'm reading your approach correctly, you put objects > into a purgatory rather than delete them, and let some > operations rescue them from purgatory if we had a race. Yes. This has the cost of extra disk space for a while, but we are incurring that cost already: for our repos, we already put things into purgatory to avoid getting stale NFS file handle errors during unrecoverable paths (while streaming an object). So effectively this has no extra space cost beyond what is needed to run safely on NFS. > 1. When do you rescue from purgatory? Any time the > object is referenced? Do you then pull in all of its > reachable objects too? For my approach, I decided a) Yes b) No Because: a) Rescue on reference is cheap and allows any other policy to be built upon it; just ensure that policy references the object at some point before it is pruned from the purgatory. b) The other referenced objects will likely get pulled in on reference anyway, or by virtue of being in the same pack. > 2. How do you decide when to drop an object from > purgatory? And specifically, how do you avoid racing with > somebody using the object as you're pruning purgatory? If you clean the purgatory during repacking, after creating all the new packs and before deleting the old ones, you will have a significant grace window to handle most longer-running operations. In this way, repacking will have re-referenced any missing objects from the purgatory before it gets pruned, causing them to be recovered if necessary. Those missing objects, believed to be in the exact packs in the purgatory at that time, should only ever have been referenced by write operations that started before those packs were moved to the purgatory, which was before the previous repacking round ended. This leaves write operations a full repacking cycle to complete in, to avoid losing objects. > 3. How do you know that an operation has been run that > will actually rescue the object, as opposed to silently > having a corrupted state on disk? > > E.g., imagine this sequence: > > a. git-prune computes reachability and finds that > commit X is ready to be pruned > > b. another process sees that commit X exists and > builds a commit that references it as a parent > > c. git-prune drops the object into purgatory > > Now we have a corrupt state created by the process in > (b), since we have a reachable object in purgatory. But > what if nobody goes back and tries to read those commits > in the meantime? See the answer to #2: repacking itself should rescue any objects that need to be rescued before pruning the purgatory. > I think this might be solvable by using the purgatory as a > kind of "lock", where prune does something like: > > 1. compute reachability > > 2. move candidate objects into purgatory; nobody can > look into purgatory except us I don't think this is needed. 
It should be OK to let others see the objects in the purgatory after 1 and before 3, as long as "seeing" them causes them to be recovered! > 3. compute reachability _again_, making sure that no > purgatory objects are used (if so, rollback the deletion > and try again) Yes, you laid out the formula, but nothing says this recompute can't wait until the next repack (again, see my answer to #2)! i.e. there is no rush to cause a recovery as long as the object gets recovered before it gets pruned from the purgatory. > But even that's not quite there, because you need to have > some consistent atomic view of what's "used". Just > checking refs isn't enough, because some other process > may be planning to reference a purgatory object but not > yet have updated the ref. So you need some atomic way of > saying "I am interested in using this object". As long as all write paths also read the object first (I assume they do, or we would be in big trouble already), then this should not be an issue. The idea is to force all reads (and thus all writes also) to recover the object, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
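To make the sequencing I describe above concrete, here is a rough sketch of one repack/purgatory cycle as I picture it. The purgatory path, and the idea that repack moves (rather than deletes) its redundant packs, are hypothetical illustrations, not what stock git-repack or the jgit change literally does today:

    # repack cycle N (run inside $GIT_DIR)
    git repack -ad                                  # write the new packs first
    mkdir -p objects/pack/purgatory
    mv objects/pack/pack-OLD.pack objects/pack/pack-OLD.idx \
       objects/pack/purgatory/                      # imagine the redundant packs moved here instead of deleted
    # ...normal traffic continues; any reader that only finds an object in
    # purgatory copies ("rescues") it back into the live object store...

    # repack cycle N+1, a full cycle later
    git repack -ad                                  # anything rescued since cycle N gets packed normally
    rm -rf objects/pack/purgatory                   # only now is purgatory pruned

The point of the ordering is simply that nothing is deleted until at least one full repacking cycle has had a chance to rescue it.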
Re: worktrees vs. alternates
On Wednesday, May 16, 2018 12:37:45 PM Jeff King wrote: > On Wed, May 16, 2018 at 03:29:42PM -0400, Konstantin Ryabitsev wrote: > Yes, that's pretty close to what we do at GitHub. Before > doing any repacking in the mother repo, we actually do > the equivalent of: > > git fetch --prune ../$id.git +refs/*:refs/remotes/$id/* > git repack -Adl > > from each child to pick up any new objects to de-duplicate > (our "mother" repos are not real repos at all, but just > big shared-object stores). ... > In theory the fetch means that it's safe to actually prune > in the mother repo, but in practice there are still > races. They don't come up often, but if you have enough > repositories, they do eventually. :) Peff, I would be very curious to hear what you think of this approach to mitigating the effect of those races? https://git.eclipse.org/r/c/122288/2 -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
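For reference, the per-child de-duplication sequence Peff describes could be scripted roughly like this. The ../$id.git layout and the two git commands are his; the loop, the inventory file, and whether the repack runs once or per fetch are my assumptions:

    # run inside the mother object store, once per child repository
    for id in $(cat children.list); do     # hypothetical inventory of child repo ids
        git fetch --prune "../$id.git" "+refs/*:refs/remotes/$id/*"   # absorb objects only the child has
    done
    git repack -Adl     # the mother now holds everything its children still reference
    # in theory it is now safe to prune the mother; in practice, see the races above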
Re: worktrees vs. alternates
On Wednesday, May 16, 2018 03:11:47 PM Konstantin Ryabitsev wrote: > On 05/16/18 15:03, Martin Fick wrote: > >> I'm undecided about that. On the one hand this does > >> create lots of small files and inevitably causes > >> (some) performance degradation. On the other hand, I > >> don't want to keep useless objects in the pack, > >> because that would also cause performance degradation > >> for people cloning the "mother repo." If my > >> assumptions on any of that are incorrect, I'm happy to > >> learn more. > > > > My suggestion is to use science, not logic or hearsay. > > :) > > i.e. test it! > > I think the answer will be "it depends." In many of our > cases the repos that need those loose objects are rarely > accessed -- usually because they are forks with older > data (hence why they need objects that are no longer used > by the mother repo). Therefore, performance impacts of > occasionally touching a handful of loose objects will be > fairly negligible. This is especially true on > non-spinning media where seek times are low anyway. > Having slimmer packs for the mother repo would be more > beneficial in this case. > > On the other hand, if the "child repo" is frequently used, > then the impact of needing a bunch of loose objects would > be greater. For the sake of simplicity, I think I'll > leave things as they are -- it's cheaper to fix this via > reducing seek times than by applying complicated logic > trying to optimize on a per-repo basis. I think a major performance issue with loose objects is not just the seek time, but also the fact that they are not delta compressed. This means that serving them over the wire will likely incur a significant deltification/recompression cost before they can be sent. Unlike the seek time, this cost is not mitigated across concurrent fetches by the FS (or jgit, if you were to use it) caching, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
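For anyone who wants to put numbers on the packed-vs-loose difference before deciding, git reports both sizes directly; the figures below are made up, only the field names are real:

    git count-objects -v
    # count: 152340        <- number of loose objects
    # size: 1843200        <- disk used by loose objects, in KiB (zlib-compressed, not deltified)
    # in-pack: 5210034     <- number of packed objects
    # size-pack: 912345    <- disk used by packs, in KiB (deltified)

Comparing "size" per loose object against "size-pack" per packed object gives a rough sense of how much delta compression is being given up.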
Re: worktrees vs. alternates
On Wednesday, May 16, 2018 03:01:13 PM Konstantin Ryabitsev wrote: > On 05/16/18 14:26, Martin Fick wrote: > > If you are going to keep the unreferenced objects around > > forever, it might be better to keep them around in > > packed > > form? > > I'm undecided about that. On the one hand this does create > lots of small files and inevitably causes (some) > performance degradation. On the other hand, I don't want > to keep useless objects in the pack, because that would > also cause performance degradation for people cloning the > "mother repo." If my assumptions on any of that are > incorrect, I'm happy to learn more. My suggestion is to use science, not logic or hearsay. :) i.e. test it! -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: worktrees vs. alternates
On Wednesday, May 16, 2018 02:12:24 PM Konstantin Ryabitsev wrote: > The loose objects I'm thinking of are those that are > generated when we do "git repack -Ad" -- this takes all > unreachable objects and loosens them (see man git-repack > for more info). Normally, these would be pruned after a > certain period, but we're deliberately keeping them > around forever just in case another repo relies on them > via alternates. I want those repos to "claim" these loose > objects via hardlinks, such that we can run git-prune on > the mother repo instead of dragging all the unreachable > objects on forever just in case. If you are going to keep the unreferenced objects around forever, it might be better to keep them around in packed form? We currently do that because we don't think there is a safe way to prune objects yet on a running server (which is why I am teaching jgit to be able to recover from a racy pruning error), -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
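For stock git, keeping the unreferenced objects in packed form instead of loosening them is already expressible; something like the following should do it (double-check the repack documentation for your git version, since the behavior of these flags has been adjusted over releases):

    # -k / --keep-unreachable: append unreachable objects to the new pack
    # instead of exploding them into loose objects
    git repack -a -d -k

The trade-off is the one discussed in this thread: the pack stays self-contained and delta-compressed, but it also carries objects that nothing in this repo references anymore.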
Re: worktrees vs. alternates
On Wednesday, May 16, 2018 10:58:19 AM Konstantin Ryabitsev wrote: > > 1. Find every repo mentioning the parent repository in > their alternates 2. Repack them without the -l switch > (which copies all the borrowed objects into those repos) > 3. Once all child repos have been repacked this way, prune > the parent repo (it's safe now) This is probably only true if the repos are in read-only mode? I suspect this is still racy on a busy server with no downtime. > 4. Repack child repos again, this time with the -l flag, > to get your savings back. > I would heartily love a way to teach git-repack to > recognize when an object it's borrowing from the parent > repo is in danger of being pruned. The cheapest way of > doing this would probably be to hardlink loose objects > into its own objects directory and only consider "safe" > objects those that are part of the parent repository's > pack. This should make alternates a lot safer, just in > case git-prune happens to run by accident. I think that hard linking is generally a good approach to solving many of the "pruning" races left in git. I have uploaded a "hard linking" proposal to jgit that could potentially solve a similar situation that is not alternate specific, and only for packfiles, with the intent of eventually also doing something similar for loose objects. You can see this here: https://git.eclipse.org/r/c/122288/2 I think it would be good to fill in more of these pruning gaps! -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
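A very rough sketch of what "claiming" the borrowed loose objects from inside a child repo could look like. Everything here is hypothetical: the mother path, and the assumption that both repositories sit on the same filesystem so hardlinks are even possible:

    mother=/repos/mother.git    # hypothetical path to the repo named in objects/info/alternates
    # hardlink every loose object from the mother into this repo's own object
    # store, so a later git-prune in the mother cannot pull them out from under us
    (cd "$mother/objects" && find ?? -type f 2>/dev/null) | while read obj; do
        mkdir -p ".git/objects/${obj%/*}"
        ln -f "$mother/objects/$obj" ".git/objects/$obj"
    done

After this, a repack with -l in the child would still not copy the mother's packed objects, which is why the proposal above treats loose objects and objects in the parent's packs differently.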
Re: Git push error due to hooks error
On Monday, May 14, 2018 05:32:35 PM Barodia, Anjali wrote: > I was trying to push a local git repo to another git repo on Gerrit, > but got stuck with a hook error. This is a very large repo > and while running the command "git push origin --all" I > am getting these errors: > > remote: (W) 92e19d4: too many commit message lines longer > than 70 characters; manually wrap lines remote: (W) > de2245b: too many commit message lines longer than 70 > characters; manually wrap lines remote: (W) dc6e982: too > many commit message lines longer than 70 characters; > manually wrap lines remote: (W) d2e2efd: too many commit > message lines longer than 70 characters; manually wrap > lines remote: error: internal error while processing > changes To ssh_url_path:8282/SI_VF.git > ! [remote rejected] master -> master (Error running > hook /opt/gerrit/hooks/ref-update) error: failed to > push some refs to 'ssh_user@url_path:8282/SI_VF.git' This is standard Gerrit behavior. For Gerrit questions, please post your question to the "Repo and Gerrit Discussion" group. -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On Wednesday, January 10, 2018 02:39:13 PM Derrick Stolee wrote: > On 1/10/2018 1:25 PM, Martin Fick wrote: > > On Sunday, January 07, 2018 01:14:41 PM Derrick Stolee > > > > wrote: > >> This RFC includes a new way to index the objects in > >> multiple packs using one file, called the multi-pack > >> index (MIDX). > > > > ... > > > >> The main goals of this RFC are: > >> > >> * Determine interest in this feature. > >> > >> * Find other use cases for the MIDX feature. > > > > My interest in this feature would be to speed up fetches > > when there is more than one large pack-file with many of > > the same objects that are in other pack-files. What > > does your MIDX design do when it encounters multiple > > copies of the same object in different pack files? > > Does it index them all, or does it keep a single copy? > > The MIDX currently keeps only one reference to each > object. Duplicates are dropped during writing. (See the > care taken in commit 04/18 to avoid duplicates.) Since > midx_sha1_compare() does not use anything other than the > OID to order the objects, there is no decision being made > about which pack is "better". The MIDX writes the first > copy it finds and discards the others. This would likely speed things up then, even if the chosen objects are suboptimal. > It would not be difficult to include a check in > midx_sha1_compare() to favor one packfile over another > based on some measurement (size? mtime?). Since this > would be a heuristic at best, I left it out of the > current patch. Yeah, I didn't know what heuristic to use either. I tended to think that the bigger pack-file would be valuable because it is more likely to share deltas with other objects in that pack, making them easier to send. However, that is likely only true during clones or other large fetches when we want most objects. During small "update" fetches, the newer packs might be better? I also thought that objects in alternates should be considered less valuable for my use case; however, in the GitHub fork use case, the alternates might be more valuable? So yes, heuristics, and I don't know what is best. Perhaps some config options could be used to set heuristics like this. Whatever the heuristics are, since they would be a part of the MIDX packing process it would be easy to change. This assumes that keeping only one copy in the index is the right thing. The question would be, what if we need different heuristics for different operations? Would it make sense to have multiple MIDX files covering the same packs then, one for fetch, one for merge...? > > In our Gerrit instance (Gerrit uses jgit), we have > > multiple copies of the linux kernel repos linked > > together via the alternatives file mechanism. > > GVFS also uses alternates for sharing packfiles across > multiple copies of the repo. The MIDX is designed to > cover all packfiles in the same directory, but is not > designed to cover packfiles in multiple alternates; > currently, each alternate would need its own MIDX file. > Does that cause issues with your setup? No, since the other large packfiles are all in other repos (alternates). Is there a reason the MIDX would not want to cover the alternates? If you don't then you would seemingly lose any benefits of the MIDX when you have alternates in use. ... > > It would be nice if this use case could be improved with > > MIDX. To do so, it seems that it would either require > > that MIDX either only put "the best" version of an > > object (i.e. 
pre-select which one to use), or include > > the extra information to help make the selection > > process of which copy to use (perhaps based on the > > operation being performed) fast. > > I'm not sure if there is sufficient value in storing > multiple references to the same object stored in multiple > packfiles. There could be value in carefully deciding > which copy is "best" during the MIDX write, but during > read is not a good time to make such a decision. It also > increases the size of the file to store multiple copies. Yes, I am not sure either, it would be good to have input from experts here. > > This also leads me to ask, what other additional > > information (bitmaps?) for other operations, besides > > object location, might suddenly be valuable in an index > > that potentially points to multiple copies of objects? > > Would such information be appropriate in MIDX, or would > > it be better in another index? > > For applications to bitmaps, it is probably best that we > only include one copy of each object. Otherwise, we need >
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On Sunday, January 07, 2018 01:14:41 PM Derrick Stolee wrote: > This RFC includes a new way to index the objects in > multiple packs using one file, called the multi-pack > index (MIDX). ... > The main goals of this RFC are: > > * Determine interest in this feature. > > * Find other use cases for the MIDX feature. My interest in this feature would be to speed up fetches when there is more than one large pack-file with many of the same objects that are in other pack-files. What does your MIDX design do when it encounters multiple copies of the same object in different pack files? Does it index them all, or does it keep a single copy? In our Gerrit instance (Gerrit uses jgit), we have multiple copies of the linux kernel repos linked together via the alternates file mechanism. These repos have many different references (mostly Gerrit change references), but they share most of the common objects from the mainline. I have found that during a large fetch such as a clone, jgit spends a significant amount of extra time by having the extra large pack-files from the other repos visible to it, usually around an extra minute per instance of these (without them, the clone takes around 7mins). This adds up easily; with a few extra repos, it can almost double the time. My investigations have shown that this is due to jgit searching each of these pack files to decide which version of each object to send. I don't fully understand its selection criteria; however, if I shortcut it to just pick the first copy of an object that it finds, I regain my lost time. I don't know if git suffers from a similar problem? If git doesn't suffer from this then it likely just uses the first copy of an object it finds (which may not be the best object to send?) It would be nice if this use case could be improved with MIDX. To do so, it seems that it would require that MIDX either only put "the best" version of an object (i.e. pre-select which one to use), or include the extra information to help make the selection process of which copy to use (perhaps based on the operation being performed) fast. This also leads me to ask, what other additional information (bitmaps?) for other operations, besides object location, might suddenly be valuable in an index that potentially points to multiple copies of objects? Would such information be appropriate in MIDX, or would it be better in another index? Thanks, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
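As a quick way to see how much of this cross-pack duplication a repository actually has, one can count object ids that show up in more than one pack index; a sketch for the local pack directory (run the same loop over each alternate's pack directory to include those as well):

    # count object ids that appear in more than one local pack index
    for idx in .git/objects/pack/*.idx; do
        git show-index < "$idx" | awk '{print $2}'
    done | sort | uniq -d | wc -l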
Re: Bring together merge and rebase
> On Jan 4, 2018 11:19 AM, "Martin Fick" <mf...@codeaurora.org> wrote: > > On Tuesday, December 26, 2017 12:40:26 AM Jacob Keller > > > > wrote: > > > On Mon, Dec 25, 2017 at 10:02 PM, Carl Baldwin > > > > <c...@ecbaldwin.net> wrote: > > > >> On Mon, Dec 25, 2017 at 5:16 PM, Carl Baldwin > > > > <c...@ecbaldwin.net> wrote: > > > >> A bit of a tangent here, but a thought I didn't > > > >> wanna > > > >> lose: In the general case where a patch was rebased > > > >> and the original parent pointer was changed, it is > > > >> actually quite hard to show a diff of what changed > > > >> between versions. > > > > > > My biggest gripes are that the gerrit web interface > > > doesn't itself do something like this (and jgit does > > > not > > > appear to be able to generate combined diffs at all!) > > > > I believe it now does, a presentation was given at the > > Gerrit User summit in London describing this work. It > > would indeed be great if git could do this also! On Thursday, January 04, 2018 04:02:40 PM Jacob Keller wrote: > Any chance slides or a recording was posted anywhere? I'm > quite interested in this topic. Slides and video + transcript here: https://gerrit.googlesource.com/summit/2017/+/master/sessions/new-in-2.15.md Watch the part after the backend improvements, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: Bring together merge and rebase
On Tuesday, December 26, 2017 01:31:55 PM Carl Baldwin wrote: ... > What I propose is that gerrit and github could end up more > robust, featureful, and interoperable if they had this > feature to build from. I agree (assuming we come up with a well-defined feature) > With gerrit specifically, adopting this feature would make > the "change" concept richer than it is now because it > could supersede the change-id in the commit message and > allow a change to evolve in a distributed non-linear way > with protection against clobbering work. We (the Gerrit maintainers) would like changes to be able to evolve non-linearly so that we can eventually support distributed Gerrit reviews, and the amended-commit pointer is one way I have thought to resolve this. > I have no intention to disparage either tool. I love them > both. They've both made my career better in different > ways. I know there is no guarantee that github, gerrit, > or any other tool will do anything to adopt this. But, > I'm hoping they are reading this thread and that they > recognize how this feature can make them a little bit > better and jump in and help. I know it is a lot to hope > for but I think it could be great if it happened. We (the Gerrit maintainers) do recognize it, and I am glad that someone is pushing for solutions in this space. I am not sure what the right solution is, and how to modify workflows to deal better with this. I do think that making your local repo track pointers to amended-commits, likely with various git hooks and notes (as also proposed by Johannes Schindelin), would be a good start. With that in place, you can then attack various specific workflows. If you want to then attack the Gerrit workflow, it would be good if you could prevent pushing new patchsets that are amended versions of patchsets that are out of date. While it would be great if Gerrit could reject such pushes, I wonder if, to start, git could detect this and prevent the push in this situation? Could a git push hook analyze the ref advertisements and figure this out (all the patchsets are in the advertisement)? Can a git hook look at the ref advertisement? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
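To partly answer my own question: a pre-push hook is not handed the server's full advertisement, but it can simply request it again with ls-remote. A sketch (the refs/changes/* layout is Gerrit's; the rest is hypothetical and only shows that the information is reachable from a hook):

    #!/bin/sh
    # .git/hooks/pre-push  --  $1 = remote name, $2 = remote URL
    git ls-remote "$2" 'refs/changes/*' > "$(git rev-parse --git-dir)/advertised-patchsets"
    # ...a real hook would now check that the patchset being amended is still the
    # newest advertised patchset of its change, and exit non-zero if it is not...
    exit 0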
Re: Bring together merge and rebase
On Monday, December 25, 2017 06:16:40 PM Carl Baldwin wrote: > On Sun, Dec 24, 2017 at 10:52:15PM -0500, Theodore Ts'o wrote: > Look at what happens in a rebase type workflow in any of > the following scenarios. All of these came up regularly > in my time with Gerrit. > > 1. Make a quick edit through the web UI then later > work on the change again in your local clone. It is easy > to forget to pull down the change made through the UI > before starting to work on it again. If that happens, the > change made through the UI will almost certainly be > clobbered. > > 2. You or someone else creates a second change that is > dependent on yours and works on it while yours is still > evolving. If the second change gets rebased with an older > copy of the base change and then posted back up for > review, newer work in the base change has just been > clobbered. > > 3. As a reviewer, you decide the best way to explain > how you'd like to see something done differently is to > make the quick change yourself and push it up. If the > author fails to fetch what you pushed before continuing > onto something else, it gets clobbered. > > 4. You want to collaborate on a single change with > someone else in any way and for whatever reason. As soon > as that change starts hitting multiple work spaces, there > are synchronization issues that currently take careful > manual intervention. These scenarios seem to come up most for me at Gerrit hack-a-thons where we collaborate a lot in short time spans on changes. We (the Gerrit maintainers) too have wanted and sometimes discussed ways to track the relation of "amended" commits (which is generally what Gerrit patchsets are). We also concluded that some sort of parent commit pointer was needed, although parent is somewhat the wrong term since that already means something in git. Rather, maybe some "predecessor" type of term would be better, maybe "antecedent", but "amended-commit" pointer might be best? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: Bring together merge and rebase
On Sunday, December 24, 2017 12:01:38 AM Johannes Schindelin wrote: > Hi Carl, > > On Sat, 23 Dec 2017, Carl Baldwin wrote: > > I imagine that a "git commit --amend" would also insert > > a "replaces" reference to the original commit but I > > failed to mention that in my original post. > > And cherry-pick, too, of course. > > Both of these examples hint at a rather huge urge of some > users to turn this feature off because the referenced > commits may very well be throw-away commits in their > case, making the newly-recorded information completely > undesired. > > Example: I am working on a topic branch. In the middle, I > see a typo. I commit a fix, continue to work on the topic > branch. Later, I cherry-pick that commit to a separate > topic branch because I really don't think that those two > topics are related. Now I definitely do not want a > reference of the cherry-picked commit to the original > one: the latter will never be pushed to a public > repository, and gc'ed in a few weeks. > > Of course, that is only my wish, other users in similar > situations may want that information. Demonstrating that > you would be better served with an opt-in feature that > uses notes rather than a baked-in commit header. I think what you are highlighting is not when to track this, but rather when to share this tracking. In my local repo, I would definitely want to know that I cherry-picked this from elsewhere, it helps me understand what I have done later when I look back at old commits and branches that need to potentially be thrown away. But I agree you may not want to share these publicly. I am not sure what the right formula is, for when to share these pointers publicly, but it seems like it might be that whenever you push something, it should push along any references to amended commits that were publicly available already. I am not sure how to track that, but I suspect it is a subset of the union of commits you have fetched, and commits you have pushed (i.e. you got it from elsewhere, or you created it and already shared it with others)? Maybe it should also include any commits reachable by advertisements to places you are pushing to (in case it got shared some other way)? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
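For the purely local tracking discussed here, git already tells hooks about amends and rebases, so a notes-based, opt-in version can be prototyped today; a sketch (the notes ref name is made up, and note that cherry-pick does not trigger this hook):

    #!/bin/sh
    # .git/hooks/post-rewrite  --  $1 is "amend" or "rebase";
    # stdin carries lines of the form: <old-sha1> <new-sha1> [extra-info]
    while read old new extra; do
        # record the predecessor ("amended-from") pointer on the new commit
        git notes --ref=refs/notes/amended-from add -f -m "$old" "$new"
    done

Whether and when such notes should be pushed is exactly the sharing question raised above.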
Re: Bring together merge and rebase
On Tuesday, December 26, 2017 12:40:26 AM Jacob Keller wrote: > On Mon, Dec 25, 2017 at 10:02 PM, Carl Baldwin wrote: > >> On Mon, Dec 25, 2017 at 5:16 PM, Carl Baldwin wrote: > >> A bit of a tangent here, but a thought I didn't wanna > >> lose: In the general case where a patch was rebased > >> and the original parent pointer was changed, it is > >> actually quite hard to show a diff of what changed > >> between versions. > > My biggest gripes are that the gerrit web interface > doesn't itself do something like this (and jgit does not > appear to be able to generate combined diffs at all!) I believe it now does; a presentation was given at the Gerrit User Summit in London describing this work. It would indeed be great if git could do this also! -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH] fetch-pack: always allow fetching of literal SHA1s
On Wednesday, May 10, 2017 11:20:49 AM Jonathan Nieder wrote: > Hi, > > Ævar Arnfjörð Bjarmason wrote: > > Just a side question, what are the people who use this > > feature using it for? The only thing I can think of > > myself is some out of band ref advertisement because > > you've got squillions of refs as a hack around git's > > limitations in that area. > > That's one use case. > > Another is when you really care about the exact sha1 (for > example because you are an automated build system and > this is the specific sha1 you have already decided you > want to build). > > Are there other use-cases for this? All the commits[1] > > that touched this feature just explain what, not why. > > Similar to the build system case I described above is when > a human has a sha1 (from a mailing list, or source > browser, or whatever) and wants to fetch just that > revision, with --depth=1. You could use "git archive > --remote", but (1) github doesn't support that and (2) > that doesn't give you all the usual git-ish goodness. Perhaps another use case is submodules and repo (the Android tool) subprojects, since they can be "pinned" to sha1s, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
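For the pinned-sha1 case, the fetch in question looks roughly like this; note the server still has to permit it via the matching uploadpack.allow*SHA1InWant configuration (the sha1 below is just a placeholder):

    # fetch exactly one pinned commit, shallowly, without learning any refs
    git fetch --depth=1 origin 7412ee739b8a20941aa1c2fd03abcc7336b330ba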
Re: Simultaneous gc and repack
On Thursday, April 13, 2017 02:28:07 PM David Turner wrote: > On Thu, 2017-04-13 at 12:08 -0600, Martin Fick wrote: > > On Thursday, April 13, 2017 11:03:14 AM Jacob Keller wrote: > > > On Thu, Apr 13, 2017 at 10:31 AM, David Turner > > > > <nova...@novalis.org> wrote: > > > > Git gc locks the repository (using a gc.pid file) so > > > > that other gcs don't run concurrently. But git > > > > repack > > > > doesn't respect this lock, so it's possible to have > > > > a > > > > repack running at the same time as a gc. This makes > > > > the gc sad when its packs are deleted out from under > > > > it > > > > with: "fatal: ./objects/pack/pack-$sha.pack cannot > > > > be > > > > accessed". Then it dies, leaving a large temp file > > > > hanging around. > > > > > > > > Does the following seem reasonable? > > > > > > > > 1. Make git repack, by default, check for a gc.pid > > > > file > > > > (using the same logic as git gc itself does). > > > > 2. Provide a --force option to git repack to ignore > > > > said > > > > check. 3. Make git gc provide that --force option > > > > when > > > > it calls repack under its own lock. > > > > > > What about just making the code that calls repack > > > today > > > just call gc instead? I guess it's more work if you > > > don't > > > strictly need it but..? > > > > There are many scenarios where this does not achieve > > the > > same thing. On the obvious side, gc does more than > > repacking, but on the other side, repacking has many > > switches that are not available via gc. > > > > Would it make more sense to move the lock to repack > > instead of to gc? > > Other gc operations might step on each other too (e.g. > packing refs). That would be less bad (and less common), > but it still seems worth avoiding. Yes, but all of those operations need to be self-protected already, or they risk the same issue. -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: Simultaneous gc and repack
On Thursday, April 13, 2017 11:03:14 AM Jacob Keller wrote: > On Thu, Apr 13, 2017 at 10:31 AM, David Turner wrote: > > Git gc locks the repository (using a gc.pid file) so > > that other gcs don't run concurrently. But git repack > > doesn't respect this lock, so it's possible to have a > > repack running at the same time as a gc. This makes > > the gc sad when its packs are deleted out from under it > > with: "fatal: ./objects/pack/pack-$sha.pack cannot be > > accessed". Then it dies, leaving a large temp file > > hanging around. > > > > Does the following seem reasonable? > > > > 1. Make git repack, by default, check for a gc.pid file > > (using the same logic as git gc itself does). > > 2. Provide a --force option to git repack to ignore said > > check. 3. Make git gc provide that --force option when > > it calls repack under its own lock. > > What about just making the code that calls repack today > just call gc instead? I guess it's more work if you don't > strictly need it but..? There are many scenarios where this does not achieve the same thing. On the obvious side, gc does more than repacking, but on the other side, repacking has many switches that are not available via gc. Would it make more sense to move the lock to repack instead of to gc? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
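Until something along these lines is built into repack itself, the check can be approximated by a wrapper; a simplistic sketch (the liveness test ignores the hostname recorded in gc.pid, so treat it as illustrative only):

    #!/bin/sh
    # repack-unless-gc.sh: refuse to repack while a gc appears to be running
    gc_pid_file="$(git rev-parse --git-dir)/gc.pid"
    if [ -f "$gc_pid_file" ]; then
        pid=$(cut -d' ' -f1 "$gc_pid_file")
        if kill -0 "$pid" 2>/dev/null; then
            echo "git gc (pid $pid) appears to be running; not repacking" >&2
            exit 1
        fi
    fi
    exec git repack "$@"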
Re: [PATCH v2] repack: Add option to preserve and prune old pack files
On Sunday, March 12, 2017 11:03:44 AM Junio C Hamano wrote: > Jeff King writes: > > I can think of one downside of a time-based solution, > > though: if you run multiple gc's during the time > > period, you may end up using a lot of disk space (one > > repo's worth per gc). But that's a fundamental tension > > in the problem space; the whole point is to waste disk > > to keep helping old processes. > > Yes. If you want to help a process that mmap's a packfile > and wants to keep using it for N seconds, no matter how > many times somebody else ran "git repack" while you are > doing your work within that timeframe, you somehow need > to make sure the NFS server knows the file is still in > use for that N seconds. > > > But you may want a knob that lets you slide between > > those two things. For instance, if you kept a sliding > > window of N sets of preserved packs, and ejected from > > one end of the window (regardless of time), while > > inserting into the other end. James' existing patch is > > that same strategy with a hardcoded window of "1". > > Again, yes. But then the user does not get any guarantee > of how long-living a process the user can have without > getting broken by the NFS server losing track of a > packfile that is still in use. My suggestion for the > "expiry" based approach is essentially that I do not see > a useful tradeoff afforded by having such a knob. > > The other variable you can manipulate is whether to gc > > in the first place. E.g., don't gc if there are N > > preserved sets (or sets consuming more than N bytes, or > > whatever). You could do that check outside of git > > entirely (or in an auto-gc hook, if you're using it). > Yes, "don't gc/repack more than once within N seconds" may > also be an alternative and may generally be more useful > by covering general source of wastage coming from doing > gc too frequently, not necessarily limited to preserved > pack accumulation. As someone who helps manage a Gerrit server for several thousand repos, all on the same NFS disks, I find a time-based expiry impractical, and not something that I am very interested in having. I favor the simpler (single for now) repacking cycle approach, and it is what we have been using for almost 6 months now successfully, without suffering any more stale file handle exceptions. While time is indeed the factor that is going to determine whether someone is going to see the stale file handles or not, on a server (which is what this feature is aimed at), this is secondary in my mind to predictability about space utilization. I have no specific minimum time that I can reason about, i.e. I cannot reasonably say "I want all operations that last less than 1 hour, 1 min, or 1 second... to succeed". I don't really want ANY failures, and I am willing to sacrifice some disk space to prevent as many as possible. So the question to me is "How much disk space am I willing to sacrifice?", not "How long do I want operations to be able to last?". The only way that time enters my equation is to compare it to how long repacking takes, i.e. I want the preserved files cleaned up on the next repack. So effectively I am choosing a repacking-cycle-based approach, so that I can reasonably predict the extra disk space that I need to reserve for my collection of repos. With a single cycle, I am effectively doubling the "static" size of repos. Achieving this predictability with a time-based approach requires coordination between the expiry setting and the repacking time cycle. 
This coordination is extra effort for me, with no apparent gain. It is also an additional risk that I don't want to have. If I decide to bump up how often I run repacking, and I forget to reduce the expiry time, my disk utilization will grow and potentially cause serious issues for all my repositories (since they share the same volume). This problem is even more difficult if I decide to use a usage (instead of time) based algorithm to determine when I repack. Admittedly, a repacking-cycle-based approach happens to be very easy and practical when it is a "single" cycle. If I eventually determine empirically that a single cycle is not long enough for my server, I don't know what I will do. Perhaps I would then want a switch that preserves the packs for another cycle? Maybe it could work the way that log rotation works, adding a number to the end of each file name for each preserved cycle? This option seems preferable to me to a time-based approach because it makes it more obvious what the impact on disk utilization will be. However, so far in practice, this does not seem necessary. I don't really see a good use case for a time-based expiry (other than "this is how it was done for other things in git"). Of course, that doesn't mean such a use case doesn't exist, but I don't support adding a feature unless I really understand why and how someone would want to use it.
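The log-rotation variant I mention could look something like this for preserved packs; the numbering scheme and file names are purely illustrative, not what the current patch implements:

    # keep two repacking cycles worth of preserved packs, logrotate-style
    cd "$(git rev-parse --git-dir)/objects/pack" || exit 1
    rm -f -- *.old-pack.2 *.old-idx.2              # the oldest cycle is finally dropped
    for f in *.old-pack.1 *.old-idx.1; do          # the previous cycle ages by one
        [ -e "$f" ] && mv -- "$f" "${f%.1}.2"
    done
    # a preserving repack would then rename this cycle's redundant packs to
    # *.old-pack.1 / *.old-idx.1 instead of deleting them

This keeps the disk-space bound easy to reason about: at most N extra copies of the repository's packed size, where N is the number of preserved cycles.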
Re: [PATCH] repack: Add options to preserve and prune old pack files
On Thursday, March 09, 2017 10:50:21 AM jmel...@codeaurora.org wrote: > On 2017-03-07 13:33, Junio C Hamano wrote: > > James Melvin writes: > >> These options are designed to prevent stale file handle > >> exceptions during git operations which can happen on > >> users of NFS repos when repacking is done on them. The > >> strategy is to preserve old pack files around until > >> the next repack with the hopes that they will become > >> unreferenced by then and not cause any exceptions to > >> running processes when they are finally deleted > >> (pruned). > > > > I find it a very sensible strategy to work around NFS, > > but it does not explain why the directory the old ones > > are moved to need to be configurable. It feels to me > > that a boolean that causes the old ones renamed > > s/^pack-/^old-&/ in the same directory (instead of > > pruning them right away) would risk less chances of > > mistakes (e.g. making "preserved" subdirectory on a > > separate device mounted there in a hope to reduce disk > > usage of the primary repository, which may defeat the > > whole point of moving the still-active file around > > instead of removing them). > > Moving the preserved pack files to a separate directory > only helped make the pack directory cleaner, but I agree > that having the old* pack files in the same directory is > a better approach as it would ensure that it's still on > the same mounted device. I'll update the logic to reflect > that. > > As for the naming convention of the preserved pack files, > there is already some logic to remove "old-" files in > repack. Currently this is the naming convention I have > for them:
>
> pack-<SHA-1>.old-<extension>
> pack-7412ee739b8a20941aa1c2fd03abcc7336b330ba.old-pack
>
> One advantage of that is the extension is no longer an > expected one, differentiating it from current pack files. > > That said, if that is not a concern, I could prefix them > with "preserved" instead of "old" to differentiate them > from the other logic that cleans up "old-*". What are > your thoughts on that?
>
> preserved-<SHA-1>.<extension>
> preserved-7412ee739b8a20941aa1c2fd03abcc7336b330ba.pack

Some other proposals so that the preserved files do not get returned by naive finds based on their extensions:

preserved-<SHA-1>.<extension>-preserved
preserved-7412ee739b8a20941aa1c2fd03abcc7336b330ba.pack-preserved

or:

preserved-<SHA-1>.preserved-<extension>
preserved-7412ee739b8a20941aa1c2fd03abcc7336b330ba.preserved-pack

or maybe even just:

preserved-<original pack name, without extension>
preserved-pack-7412ee739b8a20941aa1c2fd03abcc7336b330ba

-Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: [RFC] Add support for downloading blobs on demand
On Tuesday, January 17, 2017 04:50:13 PM Ben Peart wrote: > While large files can be a real problem, our biggest issue > today is having a lot (millions!) of source files when > any individual developer only needs a small percentage of > them. Git with 3+ million local files just doesn't > perform well. Honestly, this sounds like a problem better dealt with by using git subtree or git submodules; have you considered that? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: Preserve/Prune Old Pack Files
On Monday, January 09, 2017 01:21:37 AM Jeff King wrote: > On Wed, Jan 04, 2017 at 09:11:55AM -0700, Martin Fick wrote: > > I am replying to this email across lists because I > > wanted to highlight to the git community this jgit > > change to repacking that we have up for review > > > > https://git.eclipse.org/r/#/c/87969/ > > > > This change introduces a new convention for how to > > preserve old pack files in a staging area > > (.git/objects/packs/preserved) before deleting them. I > > wanted to ensure that the new proposed convention would > > be done in a way that would be satisfactory to the git > > community as a whole so that it would be more easy to > > provide the same behavior in git eventually. The > > preserved pack files (and accompanying index and bitmap > > files), are not only moved, but they are also renamed > > so that they no longer will match recursive finds > > looking for pack files. > It looks like objects/pack/pack-123.pack becomes > objects/pack/preserved/pack-123.old-pack, Yes, that's the idea. > and so forth. Which seems reasonable, and I'm happy that: > > find objects/pack -name '*.pack' > > would not find it. :) Cool. > I suspect the name-change will break a few tools that you > might want to use to look at a preserved pack (like > verify-pack). I know that's not your primary use case, > but it seems plausible that somebody may one day want to > use a preserved pack to try to recover from corruption. I > think "git index-pack --stdin" (fed the preserved pack on > stdin) would be a last-resort for re-admitting the objects to > the repository. or even a simple manual rename/move back to its original place? > I notice this doesn't do anything for loose objects. I > think they technically suffer the same issue, though the > race window is much shorter (we mmap them and zlib > inflate immediately, whereas packfiles may stay mapped > across many object requests). Hmm, yeah that's the next change, didn't you see it? :) No, actually I forgot about those. Our server tends to not have too many of those (loose objects), and I don't think we have seen any exceptions yet for them. But, of course, you are right, they should get fixed too. I will work on a followup change to do that. Where would you suggest we store those? Maybe under ".git/objects/preserved/"? Do they need to be renamed also somehow to avoid a find? ... > I've wondered if we could make object pruning more atomic > by speculatively moving items to be deleted into some > kind of "outgoing" object area. ... > I don't have a solution here. I don't think we want to > solve it by locking the repository for updates during a > repack. I have a vague sense that a solution could be > crafted around moving the old pack into a holding area > instead of deleting (during which time nobody else would > see the objects, and thus not reference them), while the > repacking process checks to see if the actual deletion > would break any references (and rolls back the deletion > if it would). > > That's _way_ more complicated than your problem, and as I > said, I do not have a finished solution. But it seems > like they touch on a similar concept (a post-delete > holding area for objects). So I thought I'd mention it in > case it spurs any brilliance. I agree, this is a problem I have wanted to solve also. 
I think having a "preserved" directory does open the door to such "recovery" solutions, although I think you would actually want to modify the many read code paths to fall back to looking at the preserved area and performing immediate "recovery" of the pack file if it ends up being needed. That's a lot of work, but having the packs (and eventually the loose objects) preserved into a location where no new references will be built to depend on them is likely the first step. Does the name "preserved" do well for that use case also, or would there be some better name, what would a transactional system call them? Thanks for the review Peff! -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: Preserve/Prune Old Pack Files
On Monday, January 09, 2017 05:55:45 AM Jeff King wrote: > On Mon, Jan 09, 2017 at 04:01:19PM +0900, Mike Hommey wrote: > > > That's _way_ more complicated than your problem, and > > > as I said, I do not have a finished solution. But it > > > seems like they touch on a similar concept (a > > > post-delete holding area for objects). So I thought > > > I'd mention it in case it spurs any brilliance. > > > > Something that is kind-of in the same family of problems > > is the "loosening" of objects on repacks, before they > > can be pruned. ... > Yes, this can be a problem. The repack is smart enough not > to write out objects which would just get pruned > immediately, but since the grace period is 2 weeks, that > can include a lot of objects (especially with history > rewriting as you note). It would be possible to write > those loose objects to a "cruft" pack, but there are some > management issues around the cruft pack. You do not want > to keep repacking them into a new cruft pack at each > repack, since then they would never expire. So you need > some way of marking the pack as cruft, letting it age > out, and then deleting it after the grace period expires. > > I don't think it would be _that_ hard, but AFAIK nobody > has ever made patches. FYI, jgit does this, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: Preserve/Prune Old Pack Files
I am replying to this email across lists because I wanted to highlight to the git community this jgit change to repacking that we have up for review https://git.eclipse.org/r/#/c/87969/ This change introduces a new convention for how to preserve old pack files in a staging area (.git/objects/packs/preserved) before deleting them. I wanted to ensure that the new proposed convention would be done in a way that would be satisfactory to the git community as a whole so that it would be more easy to provide the same behavior in git eventually. The preserved pack files (and accompanying index and bitmap files), are not only moved, but they are also renamed so that they no longer will match recursive finds looking for pack files. I look forward to any review (it need not happen on the change, replies to this email would be fine also), in particular with respect to the approach and naming conventions. Thanks, -Martin On Tuesday, January 03, 2017 02:46:12 PM jmel...@codeaurora.org wrote: > We’ve noticed cases where Stale File Handle Exceptions > occur during git operations, which can happen on users of > NFS repos when repacking is done on them. > > To address this issue, we’ve added two new options to the > JGit GC command: > > --preserve-oldpacks: moves old pack files into the > preserved subdirectory instead of deleting them after > repacking > > --prune-preserved: prunes old pack files from the > preserved subdirectory after repacking, but before > potentially moving the latest old pack files to this > subdirectory > > The strategy is to preserve old pack files around until > the next repack with the hopes that they will become > unreferenced by then and not cause any exceptions to > running processes when they are finally deleted (pruned). > > Change is uploaded for review here: > https://git.eclipse.org/r/#/c/87969/ > > Thanks, > James -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: storing cover letter of a patch series?
On Friday, August 05, 2016 08:39:58 AM you wrote: > * A new topic, when you merge it to the "lit" branch, you > describe the cover as the merge commit message. > > * When you updated an existing topic, you tell a tool > like "rebase -i -p" to recreate "lit" branch on top of > the mainline. This would give you an opportunity to > update the cover. This is a neat idea. How would this work if there is no merge commit (mainline hasn't moved)? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: GIT admin access
Bringing this back on list so that someone else can help... On Thursday, June 23, 2016 05:01:18 PM John Ajah wrote: > I'm on a private git, installed on a work server. Now the > guy who set it up is not available and I want to give > access to someone working for me, but I don't know how to > do that. I don't know what type of setup a "private git" means. Is this a machine with ssh access, is it git-daemon, GitHub, gitolite, Gerrit, ...? > This is the error the developer got when he tried cloning: > > FATAL ERROR: Network error: Connection timed out > fatal: Could not read from remote repository. > > Please make sure you have the correct access rights > and the repository exists. > > My partner wants to set up another Git server and transfer > our content to the new server from the one we're > currently using. I think this is very risky and I also > think there has to be a way to provide access without > doing this. We need to know what product you are running to help. What risks are you concerned about with setting up another server? And what kind of server would you be setting up? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: RefTree: Alternate ref backend
On Tuesday, December 22, 2015 06:17:28 PM you wrote: > On Tue, Dec 22, 2015 at 7:41 AM, Michael Haggerty wrote: > > At a deeper level, the "refs/" part of reference names is > actually pretty useless in general. I suppose it > originated in the practice of storing loose references > under "refs/" to keep them separate from other metadata > in $GIT_DIR. But really, aside from slightly helping > disambiguate references from paths in the command line, > what is it good for? Would we really be worse off if > references' full names were > > HEAD > heads/master > tags/v1.0.0 > remotes/origin/master (or remotes/origin/heads/master) I think this is a bit off, because HEAD != refs/HEAD, so it is not quite useless. But I agree that the whole refs notation has always bugged me; it is quirky. It makes it hard to disambiguate when something is meant to be absolute or not. What if we added a leading slash for absolute references? Then I could do something like:

/HEAD
/refs/heads/master
/refs/tags/v1.0.0
/refs/remotes/origin/master

I don't like that plumbing has to do a dance to guess at expansions, and how many tools get it wrong (do it in different orders, miss some expansions...). With an absolute notation, plumbing could be built to require absolute notations, giving more predictable interpretations when called from tools. This is a long-term idea, but it might make sense to consider it now just for the sake of storing refs; it would eliminate the need for the ".." notation for "refs/..HEAD". Now if we could only figure out a way to tell plumbing that something is a SHA, not a ref? :) -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: storing cover letter of a patch series?
+repo-disc...@googlegroups.com (to hit Gerrit developers also) On Thursday, September 10, 2015 09:28:52 AM Jacob Keller wrote: > does anyone know of any tricks for storing a cover letter > for a patch series inside of git somehow? I'd guess the > only obvious way currently is to store it at the top of > the series as an empty commit.. but this doesn't get > emailed as the cover letter... ... > I really think it should be possible to store something > somehow as a blob that could be looked up later. On Thursday, September 10, 2015 10:41:54 AM Junio C Hamano wrote: > > I think "should" is too strong here. Yes, you could > implement that way. It is debatable if it is better, or > a flat file kept in a directory (my-topic/ in the example > above) across rerolls is more flexible, lightweight and > with less mental burden to the users. -- As a Gerrit developer and user, I would like a way to see/review cover letters in Gerrit. We have had many internal proposals, most based on git notes, but we have also used the empty commit trick. It would be nice if there were some standard git way to do this so that Gerrit and other tools could benefit from this standard. I am not suggesting that git needs to be modified to do this, but rather that at least some convention be established. -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
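One notes-based convention that works with stock git today, for whatever it is worth, is to hang the cover letter off the series tip under an agreed-upon notes ref (the ref name and file name here are made up):

    # attach (or refresh) the cover letter on the tip of the series
    git notes --ref=refs/notes/cover-letter add -f -F cover-letter.txt my-topic

    # a reviewer, or a tool like Gerrit, reads it back with
    git notes --ref=refs/notes/cover-letter show my-topic

The obvious wart is that the note is keyed to the tip commit's sha1, so every reroll of the series needs the note copied forward to the new tip.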
Re: [PATCH] protocol upload-pack-v2
The current protocol has the following problems that limit us:

- It is not easy to make it resumable, because we recompute every time. This is especially problematic for the initial fetch aka clone as we will be talking about a large transfer. Redirection to a bundle hosted on CDN might be something we could do transparently.

- The protocol extension has a fairly low length limit.

- Because the protocol exchange starts by the server side advertising all its refs, even when the fetcher is interested in a single ref, the initial overhead is nontrivial, especially when you are doing a small incremental update. The worst case is an auto-builder that polls every five minutes, even when there are no new commits to be fetched.

A lot of the focus on the problems with ref advertisement is on the obvious problem mentioned above (a bad problem indeed). I would like to add that there is another related problem that all potential solutions to the above problem do not necessarily improve. When polling regularly, there is also no current efficient way to check on the current state of all refs. It would be nice to also be able to get an incremental update on large ref spaces. Thanks, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
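The cost of that per-poll advertisement is easy to eyeball today, since ls-remote shows essentially the same ref listing the server sends on every connection (the byte count is approximate, as it ignores pkt-line framing and capabilities):

    url=https://example.com/some/repo.git    # whatever the auto-builder polls
    git ls-remote "$url" | wc -c             # rough bytes of ref advertisement per poll
    git ls-remote "$url" | wc -l             # number of advertised refs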
Re: Git Scaling: What factors most affect Git performance for a large repo?
On Friday, February 20, 2015 01:29:12 PM David Turner wrote: ... For a more general solution, perhaps a log of ref updates could be used. Every time a ref is updated on the server, that ref would be written into an append-only log. Every time a client pulls, their pull data includes an index into that log. Then on push, the client could say, I have refs as-of $index, and the server could read the log (or do something more-optimized) and send only refs updated since that index. Interesting idea, I like it. How would you make this reliable? It relies on updates being reliably recorded, which would mean that you would have to ensure that any tool which touches the repo follows this convention. That is unfortunately a tough thing to enforce for most people. But perhaps, instead of logging updates, the server could log snapshots of all refs using an atomically increasing sequence number. Then missed updates do not matter: a sequence number is simply an opaque handle to some full ref state that can be diffed against. The snapshots need not even be taken inline with the client connection, or with every update, for this to work. It might mean that some extra updates are sent when they don't need to be, but at least they will be accurate. I know in the past similar ideas have been passed around, but they typically relied on the server keeping track of the state of each client. Instead, here we are talking about clients keeping track of state for a particular server. Clients already store info about remotes. A very neat idea indeed, thanks! -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
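A hypothetical server-side sketch of the snapshot idea; the directory layout and file names are invented purely for illustration:

    mkdir -p refs-snapshots
    seq=$(( $(cat refs-snapshots/LATEST 2>/dev/null || echo 0) + 1 ))
    git for-each-ref --format='%(objectname) %(refname)' > refs-snapshots/$seq
    echo $seq > refs-snapshots/LATEST
    # a client that last saw snapshot $old only needs the difference:
    diff refs-snapshots/$old refs-snapshots/$seq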
Re: Git Scaling: What factors most affect Git performance for a large repo?
On Feb 19, 2015 5:42 PM, David Turner dtur...@twopensource.com wrote: On Fri, 2015-02-20 at 06:38 +0700, Duy Nguyen wrote: * 'git push'? This one is not affected by how deep your repo's history is, or how wide your tree is, so should be quick.. Ah the number of refs may affect both git-push and git-pull. I think Stefan knows better than I in this area. I can tell you that this is a bit of a problem for us at Twitter. We have over 100k refs, which adds ~20MiB of downstream traffic to every push. I added a hack to improve this locally inside Twitter: The client sends a bloom filter of shas that it believes that the server knows about; the server sends only the sha of master and any refs that are not in the bloom filter. The client uses its local version of the servers' refs as if they had just been sent. This means that some packs will be suboptimal, due to false positives in the bloom filter leading some new refs to not be sent. Also, if there were a repack between the pull and the push, some refs might have been deleted on the server; we repack rarely enough and pull frequently enough that this is hopefully not an issue. We're still testing to see if this works. But due to the number of assumptions it makes, it's probably not that great an idea for general use. Good to hear that others are starting to experiment with solutions to this problem! I hope to hear more updates on this. I have a prototype of a simpler, and I believe more robust solution, but aimed at a smaller use case I think. On connecting, the client sends a sha of all its refs/shas as defined by a refspec, which it also sends to the server, which it believes the server might have the same refs/shas values for. The server can then calculate the value of its refs/shas which meet the same refspec, and then omit sending those refs if the verification sha matches, and instead send only a confirmation that they matched (along with any refs outside of the refspec). On a match, the client can inject the local values of the refs which met the refspec and be guaranteed that they match the server's values. This optimization is aimed at the worst case scenario (and is thus the potentially best case compression), when the client and server match for all refs (a refs/* refspec) This is something that happens often on Gerrit server startup, when it verifies that its mirrors are up-to-date. One reason I chose this as a starting optimization, is because I think it is one use case which will actually not benefit from fixing the git protocol to only send relevant refs since all the refs are in fact relevant here! So something like this will likely be needed in any future git protocol in order for it to be efficient for this use case. And I believe this use case is likely to stick around. With a minor tweak, this optimization should work when replicating actual expected updates also by excluding the expected updating refs from the verification so that the server always sends their values since they will likely not match and would wreck the optimization. However, for this use case it is not clear whether it is actually even worth caring about the non updating refs? In theory the knowledge of the non updating refs can potentially reduce the amount of data transmitted, but I suspect that as the ref count increases, this has diminishing returns and mostly ends up chewing up CPU and memory in a vain attempt to reduce network traffic. Please do keep us up-to-date of your results, -Martin Qualcomm Innovation Center, Inc. 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
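A sketch of the verification-sha proposal above; refs/heads/* is just an example refspec, and sha1sum stands in for whatever digest the protocol would actually use:

    # both ends hash their view of the refs matched by the agreed refspec
    git for-each-ref --format='%(objectname) %(refname)' 'refs/heads/*' | sha1sum
    # if the digests match, the server sends a short "matched" confirmation
    # instead of re-advertising every one of those refs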
Re: Multi-threaded 'git clone'
There currently is a thread on the Gerrit list about how much faster cloning can be when using Gerrit/jgit GCed packs with bitmaps versus C git GCed packs with bitmaps. Some differences outlined are that jgit seems to have more bitmaps, it creates one for every refs/heads, is C git doing that? Another difference seems to be that jgit creates two packs, splitting stuff not reachable from refs/heads into its own pack. This makes a clone have zero CPU server side in the pristine case. In the Gerrit use case, this second unreachable packfile can be sizeable, I wonder if there are other use cases where this might also be the case (and this slowing down clones for C git GCed repos)? If there is not a lot of parallelism left to squeak out, perhaps a focus with better returns is trying to do whatever is possible to make all clones (and potentially any fetch use case deemed important on a particular server) have zero CPU? Depending on what a server's primary mission is, I could envision certain admins willing to sacrifice significant amounts of disk space to speed up their fetches. Perhaps some more extreme thinking (such as what must have led to bitmaps) is worth brainstorming about to improve server use cases? What if an admin were willing to sacrifice a packfile for every use case he deemed important, could git be made to support that easily? For example, maybe the admin considers a clone or a fetch from master to be important, could zero percent CPU be achieved regularly for those two use cases? Cloning is possible if the repository were repacked in the jgit style after any push to a head. Is it worth exploring ways of making GC efficient enough to make this feasible? Can bitmaps be leveraged to make repacking faster? I believe that at least reachability checking could potentially be improved with bitmaps? Are there potentially any ways to make better deltification reuse during repacking (not bitmap related), by somehow reversing or translating deltas to new objects that were just received, without actually recalculating them, but yet still getting most objects deltified against the newest objects (achieving the same packs as git GC would achieve today, but faster)? What other pieces need to be improved to make repacking faster? As for the single branch fetch case, could this somehow be improved by allocating one or more packfiles to this use case? The simplest single branch fetch use case is likely someone doing a git init followed by a single branch fetch. I think the android repo tool can be used in this way, so this may actually be a common use case? With a packfile dedicated to this branch, git should be able to just stream it out without any CPU. But I think git would need to know this packfile exists to be able to use it. It would be nice if bitmaps could help here, but I believe bitmaps can so far only be used for one packfile. I understand that making bitmaps span multiple packfiles would be very complicated, but maybe it would not be so hard to support bitmaps on multiple packfiles if each of these were self contained? By self contained I mean that all objects referenced by objects in the packfile were contained in that packfile. What other still unimplemented caching techniques could be used to improve clone/fetch use cases? - Shallow clones (dedicate a special packfile to this, what about another bitmap format that only maps objects in a single tree to help this)? 
- Small fetches (simple branch FF updates), I suspect these are fast enough, but if not, maybe caching some thin packs (that could result in zero CPU requests for many clients) would be useful? Maybe spread these out exponentially over time so that many will be available for recent updates and fewer for older updates? I know git normally throws away thin packs after receiving them and resolving them, but if it kept them around (maybe in a special directory), it seems that they could be useful for updating other clients with zero CPU? A thin pack cache might be something really easy to manage based on file timestamps, an admin may simply need to set a max cache size. But how can git know what thin packs it has, and what they would be useful for, name them with their start and ending shas? Sorry for the long winded rant. I suspect that some variation of all my suggestions have already been suggested, but maybe they will rekindle some older, now useful thoughts, or inspire some new ones. And maybe some of these are better to pursue then more parallelism? -Martin Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative ProjectOn Feb 16, 2015 8:47 AM, Jeff King p...@peff.net wrote: On Mon, Feb 16, 2015 at 07:31:33AM -0800, David Lang wrote: Then the server streams the data to the client. It might do some light work transforming the data as it comes off the disk,
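Regarding the JGit-style split of reachable and unreachable objects into separate packs discussed above: later C git releases grew a close analogue in cruft packs, so on a new enough git (roughly 2.37 onward) a similar server-side layout can be approximated with:

    git repack --cruft --cruft-expiration=2.weeks.ago -d --write-bitmap-index
    # reachable objects go into the main (bitmapped) pack; unreachable ones are
    # parked in a separate cruft pack instead of being exploded into loose objects

Note that C git picks its own set of commits to cover with bitmaps rather than one per refs/heads, so the bitmap coverage still differs from what JGit produces.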
Re: Diagnosing stray/stale .keep files -- explore what is in a pack?
Perhaps the receiving process is dying hard and leaving stuff behind? Out of memory, out of disk space? -Martin On Tuesday, January 14, 2014 10:10:31 am Martin Langhoff wrote: On Tue, Jan 14, 2014 at 9:54 AM, Martin Langhoff martin.langh...@gmail.com wrote: Is there a handy way to list the blobs in a pack, so I can feed them to git-cat-file and see what's in there? I'm sure that'll help me narrow down on the issue. git show-index < /var/lib/ppg/reports.git/objects/pack/pack-22748bcca7f50a3a49aa4aed61444bf9c4ced685.idx | cut -d' ' -f2 | xargs -iHASH git --git-dir /var/lib/ppg/reports.git/ unpack-file HASH After a bit of looking at the output, clearly I have two clients, out of the many that connect here, that have the problem. I will be looking into those clients to see what's the problem. In my use case, clients push to their own head. Looking at refs/heads shows that there are stale .lock files there. Hmmm. This is on git 1.7.1 (RHEL and CentOS clients). cheers, m -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
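For what it's worth, git verify-pack can answer the "what is in this pack" question directly; its verbose output lists sha1, type, size, packed size and offset per object, so blobs can be filtered out in one step:

    git verify-pack -v /var/lib/ppg/reports.git/objects/pack/pack-22748bcca7f50a3a49aa4aed61444bf9c4ced685.idx | grep ' blob '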
Re: Ideas to speed up repacking
Martin Fick mf...@codeaurora.org writes: * Setup 1: Do a full repack. All loose and packed objects are added ... * Scenario 1: Start with Setup 1. Nothing has changed on the repo contents (no new object/packs, refs all the same), but repacking config options have changed (for example compression level has changed). On Tuesday, December 03, 2013 10:50:07 am Junio C Hamano wrote: Duy Nguyen pclo...@gmail.com writes: Reading Martin's mail again I wonder how we just grab all objects and skip history traversal. Who will decide object order in the new pack if we don't traverse history and collect path information. I vaguely recall raising a related topic for quick repack, assuming everything in existing packfiles are reachable, that only removes loose cruft several weeks ago. Once you decide that your quick repack do not care about ejecting objects from existing packs, like how I suspect Martin's outline will lead us to, we can repack the reachable loose ones on the recent surface of the history and then concatenate the contents of existing packs, excluding duplicates and possibly adjusting the delta base offsets for some entries, without traversing the bulk of the history. From this, it sounds like scenario 1 (a single pack being repacked) might then be doable (just trying to establish a really simple baseline)? Except that it would potentially not result in the same ordering without traversing history? Or, would the current pack ordering be preserved and thus be correct? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Ideas to speed up repacking
I wanted to explore the idea of exploiting knowledge about previous repacks to help speed up future repacks. I had various ideas that seemed like they might be good places to start, but things quickly got away from me. Mainly I wanted to focus on reducing and even sometimes eliminating reachability calculations since that seems to be be the one major unsolved slow piece during repacking. My first line of thinking goes like this: After a full repack, reachability of the current refs is known. Exploit that knowledge for future repacks. There are some very simple scenarios where if we could figure out how to identify them reliably, I think we could simply avoid reachability calculations entirely, and yet end up with the same repacked files as if we had done the reachability calculations. Let me outline some to see if they make sense as starting place for further discussion. - * Setup 1: Do a full repack. All loose and packed objects are added to a single pack file (assumes git config repack options do not create multiple packs). * Scenario 1: Start with Setup 1. Nothing has changed on the repo contents (no new object/packs, refs all the same), but repacking config options have changed (for example compression level has changed). * Scenario 2: Starts with Setup 1. Add one new pack file that was pushed to the repo by adding a new ref to the repo (existing refs did not change). * Scenario 3: Starts with Setup 1. Add one new pack file that was pushed to the repo by updating an existing ref with a fast forward. * Scenario 4: Starts with Setup 1. Add some loose objects to the repo via a local fast forward ref update (I am assuming this is possible without adding any new unreferenced objects?) In all 4 scenarios, I believe we should be able to skip history traversal and simply grab all objects and repack them into a new file? - Of the 4 scenarios above, it seems like #3 and #4 are very common operations (#2 is perhaps even more common for Gerrit)? If these scenarios can be reliably identified somehow, then perhaps they could be used to reduce repacking time for these scenarios, and later used as building blocks to reduce repacking time for other related but slightly more complicated scenarios (with reduced history walking instead of none)? For example to identify scenario 1, what if we kept a copy of all refs and their shas used during a full repack along with the newly repacked file? A simplistic approach would store them in the same format as the packed-refs file as pack-sha.refs. During repacking, if none of the refs have changed and there are no new objects... Then, if none of the refs have changed and there are new objects, we can just throw the new objects away? ... I am going to stop here because this email is long enough and I wanted to get some feedback on the ideas first before offering more solutions. Thanks, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
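A hypothetical sketch of the ref snapshot proposed above; the pack-1234abcd.refs name is a placeholder for whatever pack the repack actually produced, and the paths assume a bare repository:

    # written alongside the newly created pack at full-repack time:
    git for-each-ref --format='%(objectname) %(refname)' > objects/pack/pack-1234abcd.refs

    # at the next repack, scenario 1 is detected by comparing snapshots;
    # if nothing differs, the reachability walk could be skipped:
    git for-each-ref --format='%(objectname) %(refname)' \
        | diff -q - objects/pack/pack-1234abcd.refs \
        && echo "refs unchanged since last full repack"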
Re: RFE: support change-id generation natively
On Monday, October 21, 2013 12:40:58 pm james.mo...@gitblit.com wrote: On Mon, Oct 21, 2013, at 02:29 PM, Thomas Koch wrote: As I understand, a UUID could also be used for the same purpose as the change-id. How is the change-id generated by the way? Would it be a good English name to call it enduring commit identifier? Here is the algorithm: https://git.eclipse.org/c/jgit/jgit.git/tree/org.eclipse.jgit/src/org/eclipse/jgit/util/ChangeIdUtil.java#n78 I think enduring commit id is a fair interpretation of its purpose. I don't speak for the Gerrit developers so I cannot say if they are interested in alternative id generation. I come to the list as a change-id user/consumer. As a Gerrit maintainer, I would suspect that we would welcome a way to track changes natively in git. Despite any compatibility issues with the current Gerrit implementation, I suspect we would be open to new forms if the git community has a better proposal than the current Change-Id. Especially if it does reduce the significant user pain point of installing a hook! -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: pack corruption post-mortem
On Wednesday, October 16, 2013 02:34:01 am Jeff King wrote: I was recently presented with a repository with a corrupted packfile, and was asked if the data was recoverable. This post-mortem describes the steps I took to investigate and fix the problem. I thought others might find the process interesting, and it might help somebody in the same situation. This is awesome Peff, thanks for the great writeup! I have nightmares about this sort of thing every now and then, and we even experience some corruption here and there that needs to be fixed (mainly missing objects when we toy with different git repack arguments). I cannot help but wonder how we can improve git further to either help diagnose or even fix some of these problems? More inline below... The first thing I did was pull the broken data out of the packfile. I needed to know how big the object was, which I found out with: $ git show-index <$idx | cut -d' ' -f1 | sort -n | grep -A1 51653873 51653873 51664736 Show-index gives us the list of objects and their offsets. We throw away everything but the offsets, and then sort them so that our interesting offset (which we got from the fsck output above) is followed immediately by the offset of the next object. Now we know that the object data is 10863 bytes long, and we can grab it with: dd if=$pack of=object bs=1 skip=51653873 count=10863 Is there a current plumbing command that should be enhanced to be able to do the 2 steps above directly for people debugging (maybe with some new switch)? If not, should we create one, git show --zlib, or git cat-file --zlib? Note that the object file isn't fit for feeding straight to zlib; it has the git packed object header, which is variable-length. We want to strip that off so we can start playing with the zlib data directly. You can either work your way through it manually (the format is described in Documentation/technical/pack-format.txt), or you can walk through it in a debugger. I did the latter, creating a valid pack like: # pack magic and version printf 'PACK\0\0\0\2' >tmp.pack # pack has one object printf '\0\0\0\1' >>tmp.pack # now add our object data cat object >>tmp.pack # and then append the pack trailer /path/to/git.git/test-sha1 -b <tmp.pack >trailer cat trailer >>tmp.pack and then running git index-pack tmp.pack in the debugger (stop at unpack_raw_entry). Doing this, I found that there were 3 bytes of header (and the header itself had a sane type and size). So I stripped those off with: dd if=object of=zlib bs=1 skip=3 This too feels like something we should be able to do with a plumbing command eventually? git zlib-extract So I took a different approach. Working under the guess that the corruption was limited to a single byte, I wrote a program to munge each byte individually, and try inflating the result. Since the object was only 10K compressed, that worked out to about 2.5M attempts, which took a few minutes. Awesome! Would this make a good new plumbing command, git zlib-fix? I fixed the packfile itself with: chmod +w $pack printf '\xc7' | dd of=$pack bs=1 seek=51659518 conv=notrunc chmod -w $pack The '\xc7' comes from the replacement byte our munge program found. The offset 51659518 is derived by taking the original object offset (51653873), adding the replacement offset found by munge (5642), and then adding back in the 3 bytes of git header we stripped. Another plumbing command needed? git pack-put --zlib? I am not saying my command suggestions are good, but maybe they will inspire the right answer?
-Martin -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: A naive proposal for preventing loose object explosions
On Friday, September 06, 2013 11:19:02 am Junio C Hamano wrote: mf...@codeaurora.org writes: Object lookups should likely not get any slower than if repack were not run, and the extra new pack might actually help find some objects quicker. In general, having an extra pack, only to keep objects that you know are available in other packs, will make _all_ object accesses, not just the ones that are contained in that extra pack, slower. My assumption was that if the new pack, with all the consolidated reachable objects in it, happens to be searched first, it would actually speed things up. And if it is searched last, then the objects weren't in the other packs so how could it have made it slower? It seems this would only slow down the missing object path? But it sounds like all the index files are mmaped up front? Then yes, I can see how it would slow things down. However, it is one only extra (hopefully now well optimized) pack. My base assumption was that even if it does slow things down, it would likely be unmeasurable and a price worth paying to avoid an extreme penalty. Instead of mmapping all the .idx files for all the available packfiles, we could build a table that records, for each packed object, from which packfile at what offset the data is available to optimize the access, but obviously building that in-core table will take time, so it may not be a good trade-off to do so at runtime (a precomputed super-.idx that we can mmap at runtime might be a good way forward if that turns out to be the case). Does this sound like it would work? Sorry, but it is unclear what problem you are trying to solve. I think you guessed it below, I am trying to prevent loose object explosions by keeping unreachable objects around in packs (instead of loose) until expiry. With the current way that pack-objects works, this is the best I could come up with (I said naive). :( Today the git-repack calls git pack-objects like this: git pack-objects --keep-true-parents --honor-pack-keep -- non-empty --all --reflog $args /dev/null $PACKTMP This has no mechanism to place unreachable objects in a pack. If git pack-objects supported an option which streamed them to a separate file (as you suggest below), that would likely be the main piece needed to avoid the heavy-handed approach I was suggesting. The problem is how to define the interface for this? How do we get the filename of the new unreachable packfile? Today the name of the new packfile is sent to stdout, would we just tack on another name? That seems like it would break some assumptions? Maybe it would be OK if it only did that when an --unreachable flag was added? Then git-repack could be enhanced to understand that flag and the extra filenames it outputs? Is it that you do not like that repack -A ejects unreferenced objects and makes it loose, which you may have many? Yes, several times a week we have people pushing the kernel to wrong projects, this leads to 4M loose objects. :( Without a solution for this regular problem, we are very scared to move our repos off of SSDs. This leads to hour plus long fetches. The loosen_unused_packed_objects() function used by repack -A calls the force_object_loose() function (actually, it is the sole caller of the function). If you tweak the latter to stream to a single new graveyard packfile and mark it as kept until expiry, would it solve the issue the same way but with much smaller impact? Yes. 
There already is an infrastructure available to open a single output packfile and send multiple objects to it in bulk-checkin.c, and I am wondering if you can take advantage of the framework. The existing interface to it assumes that the object data is coming from a file descriptor (the interface was built to support bulk-checkin of many objects in an empty repository), and it needs refactoring to allow stream_to_pack() to take different kind of data sources in the form of stateful callback function, though. That feels beyond what I could currently dedicate the time to do. Like I said, my solution is heavy handed but it felt simple enough for me to try. I can spare the extra disk space and I am not convinced the performance hit would be bad. I would, of course, be delighted if someone else were to do what you suggest, but I get that it's my itch... -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
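A rough sketch of the heavier-handed variant using only existing plumbing, i.e. packing the currently unreachable objects into their own kept pack rather than loosening them; the graveyard base name is made up for illustration:

    pack_dir=$(git rev-parse --git-dir)/objects/pack
    name=$(git fsck --full --unreachable --no-reflogs 2>/dev/null \
           | awk '/^unreachable/ {print $3}' \
           | git pack-objects "$pack_dir/graveyard")
    touch "$pack_dir/graveyard-$name.keep"   # protect the graveyard pack until expiry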
Re: [PATCH/RFC 0/7] Multiple simultaneously locked ref updates
On Thursday, August 29, 2013 08:11:48 am Brad King wrote: fatal: Unable to create 'lock': File exists. If no other git process is currently running, this probably means a git process crashed in this repository earlier. Make sure no other git process is running and remove the file manually to continue. I don't believe git currently tries to do any form of stale lock recovery since it is racy and unreliable (both single server or on a multi-server shared repo), -Martin -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] repack: rewrite the shell script in C.
On Thursday, August 15, 2013 01:46:02 am Stefan Beller wrote: On 08/15/2013 01:25 AM, Martin Fick wrote: On Wednesday, August 14, 2013 04:51:14 pm Matthieu Moy wrote: Antoine Pelisse apeli...@gmail.com writes: On Wed, Aug 14, 2013 at 6:27 PM, Stefan Beller stefanbel...@googlemail.com wrote: builtin/repack.c | 410 + contrib/examples/git-repack.sh | 194 +++ git-repack.sh | 194 --- I'm still not sure I understand the trade-off here. Most of what git-repack does is compute some file paths, (re)move those files and call git-pack-objects, and potentially git-prune-packed and git-update-server-info. Maybe I'm wrong, but I have the feeling that the correct tool for that is Shell, rather than C (and I think the code looks less intuitive in C for that matter). There's a real problem with git-repack being shell (I already mentionned it in the previous thread about the rewrite): it creates dependencies on a few external binaries, and a restricted server may not have them. I have this issue on a fusionforge server where Git repos are accessed in a chroot with very few commands available: everything went OK until the first project grew enough to require a git gc --auto, and then it stopped accepting pushes for that project. I tracked down the origin of the problem and the sysadmins disabled auto-gc, but that's not a very satisfactory solution. C is rather painfull to write, but as a sysadmin, drop the binary on your server and it just works. That's really important. AFAIK, git-repack is the only remaining shell part on the server, and it's rather small. I'd really love to see it disapear. I didn't review the proposed C version, but how was it planning on removing the dependencies on these binaries? Was it planning to reimplement mv, cp, find? These small programms (at least mv and cp) are just convenient interfaces for system calls from within the shell. You can use these system calls to achieve a similar results compared to the commandline option. http://linux.die.net/man/2/rename http://linux.die.net/man/2/unlink Sure, but have you ever looked at the code to mv? It isn't pretty. ;( But in all that ugliness is decades worth of portability and corner cases. Also, mv is smart enough to copy when rename doesn't work (on some systems it doesn't). So C may sound more portable, but I am not sure it actually is. Now hopefully you won't need all of that, but I think that some of the design decision that went into git-repack did consider some of the more eccentric filesystems out there, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] repack: rewrite the shell script in C.
On Wednesday, August 14, 2013 10:49:58 am Antoine Pelisse wrote: On Wed, Aug 14, 2013 at 6:27 PM, Stefan Beller stefanbel...@googlemail.com wrote: builtin/repack.c | 410 + contrib/examples/git-repack.sh | 194 +++ git-repack.sh | 194 --- I'm still not sure I understand the trade-off here. Most of what git-repack does is compute some file paths, (re)move those files and call git-pack-objects, and potentially git-prune-packed and git-update-server-info. Maybe I'm wrong, but I have the feeling that the correct tool for that is Shell, rather than C (and I think the code looks less intuitive in C for that matter). I'm not sure anyone would run that command a thousand times a second, so I'm not sure it would make a real-life performance difference. I have been holding off a bit on expressing this opinion too because I don't want to squash someone's energy to improve things, and because I am not yet a git dev, but since it was brought up anyway... I can say that as a user, having git-repack as a shell script has been very valuable. For example: we have modified it for our internal use to no longer ever overwrite new packfiles with the same name as the current packfile. This modification was relatively easy to do and see in shell script. If this were C code I can't imagine having personally: 1) identified that there was an issue with the original git-repack (it temporarily makes objects unavailable) 2) made a simple custom fix to that policy. The script really is mostly a policy script, and with the discussions happening in other threads about how to improve git gc, I think it is helpful to potentially be able to quickly modify the policies in this script, it makes it easier to prototype things. Shell portability issues aside, this script is not a low level type of tool that I feel will benefit from being in C, I feel it will in fact be worse off in C, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] repack: rewrite the shell script in C.
On Wednesday, August 14, 2013 04:16:35 pm Stefan Beller wrote: On 08/14/2013 07:25 PM, Martin Fick wrote: I have been holding off a bit on expressing this opinion too because I don't want to squash someone's energy to improve things, and because I am not yet a git dev, but since it was brought up anyway... It's ok, if you knew a better topic to work on, I'd gladly switch over. (Given it would be a good beginners topic.) See below... I can say that as a user, having git-repack as a shell script has been very valuable. For example: we have modified it for our internal use to no longer ever overwrite new packfiles with the same name as the current packfile. This modification was relatively easy to do and see in shell script. If this were C code I can't imagine having personally: 1) identified that there was an issue with the original git-repack (it temporarily makes objects unavailable) 2) made a simple custom fix to that policy. Looking at the `git log -- git-repack.sh` the last commit is from April 2012 and the commit before is 2011, so I assumed it stable enough for porting over to C, as there is not much modification going on. I'd be glad to include these changes you're using into the rewrite or the shell script as of now. One suggestion would be to change the repack code to create pack filenames based on the sha1 of the contents of the pack file instead of on the sha1 of the objects in the packfile. Since the same objects can be stored in a packfile in many ways (different deltification/compression options), it is currently possible to have 2 different pack files with the same names. The contents are different, but the contained objects are the same. This causes the object availability bug that I describe above in git repack when a new packfile is generated with the same name as a current one. I am not 100% sure if the change in naming convention I propose wouldn't cause any problems? But if others agree it is a good idea, perhaps it is something a beginner could do? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
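A sketch of what content-based naming could look like: the pack trailer (the last 20 bytes of a v2 packfile) is already a checksum over the file's contents, so it is a natural candidate for the name. For what it's worth, later git releases did move in this direction, with pack-objects naming packs after the trailer hash rather than the object list.

    pack=.git/objects/pack/pack-<current-name>.pack   # placeholder path
    tail -c 20 "$pack" | xxd -p                       # content-derived name candidate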
Re: [RFC PATCH] repack: rewrite the shell script in C.
On Wednesday, August 14, 2013 04:51:14 pm Matthieu Moy wrote: Antoine Pelisse apeli...@gmail.com writes: On Wed, Aug 14, 2013 at 6:27 PM, Stefan Beller stefanbel...@googlemail.com wrote: builtin/repack.c | 410 + contrib/examples/git-repack.sh | 194 +++ git-repack.sh | 194 --- I'm still not sure I understand the trade-off here. Most of what git-repack does is compute some file paths, (re)move those files and call git-pack-objects, and potentially git-prune-packed and git-update-server-info. Maybe I'm wrong, but I have the feeling that the correct tool for that is Shell, rather than C (and I think the code looks less intuitive in C for that matter). There's a real problem with git-repack being shell (I already mentionned it in the previous thread about the rewrite): it creates dependencies on a few external binaries, and a restricted server may not have them. I have this issue on a fusionforge server where Git repos are accessed in a chroot with very few commands available: everything went OK until the first project grew enough to require a git gc --auto, and then it stopped accepting pushes for that project. I tracked down the origin of the problem and the sysadmins disabled auto-gc, but that's not a very satisfactory solution. C is rather painfull to write, but as a sysadmin, drop the binary on your server and it just works. That's really important. AFAIK, git-repack is the only remaining shell part on the server, and it's rather small. I'd really love to see it disapear. I didn't review the proposed C version, but how was it planning on removing the dependencies on these binaries? Was it planning to reimplement mv, cp, find? Were there other binaries that were problematic that you were thinking of? From what I can tell it also uses test, mkdir, sed, chmod and naturally sh, that is 8 dependencies. If those can't be depended upon for existing, perhaps git should just consider bundling busy-box or some other limited shell utils, or yikes!, even its own reimplementation of these instead of implementing these independently inside other git programs? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] repack: rewrite the shell script in C.
On Wednesday, August 14, 2013 04:53:36 pm Junio C Hamano wrote: Martin Fick mf...@codeaurora.org writes: One suggestion would be to change the repack code to create pack filenames based on the sha1 of the contents of the pack file instead of on the sha1 of the objects in the packfile. ... I am not 100% sure if the change in naming convention I propose wouldn't cause any problems? But if others agree it is a good idea, perhaps it is something a beginner could do? I would not be surprised if that change breaks some other people's reimplementation. I know we do not validate the pack name with the hash of the contents in the current code, but at the same time I do remember that was one of the planned things to be done while I and Linus were working on the original pack design, which was the last task we did together before he retired from the maintainership of this project. Perhaps a config option? One that becomes standard for git 2.0? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] repack: rewrite the shell script in C.
On Wednesday, August 14, 2013 05:25:42 pm Martin Fick wrote: On Wednesday, August 14, 2013 04:51:14 pm Matthieu Moy wrote: Antoine Pelisse apeli...@gmail.com writes: On Wed, Aug 14, 2013 at 6:27 PM, Stefan Beller stefanbel...@googlemail.com wrote: builtin/repack.c | 410 + contrib/examples/git-repack.sh | 194 +++ git-repack.sh | 194 --- I'm still not sure I understand the trade-off here. Most of what git-repack does is compute some file paths, (re)move those files and call git-pack-objects, and potentially git-prune-packed and git-update-server-info. Maybe I'm wrong, but I have the feeling that the correct tool for that is Shell, rather than C (and I think the code looks less intuitive in C for that matter). There's a real problem with git-repack being shell (I already mentionned it in the previous thread about the rewrite): it creates dependencies on a few external binaries, and a restricted server may not have them. I have this issue on a fusionforge server where Git repos are accessed in a chroot with very few commands available: everything went OK until the first project grew enough to require a git gc --auto, and then it stopped accepting pushes for that project. I tracked down the origin of the problem and the sysadmins disabled auto-gc, but that's not a very satisfactory solution. C is rather painfull to write, but as a sysadmin, drop the binary on your server and it just works. That's really important. AFAIK, git-repack is the only remaining shell part on the server, and it's rather small. I'd really love to see it disapear. I didn't review the proposed C version, but how was it planning on removing the dependencies on these binaries? Was it planning to reimplement mv, cp, find? Were there other binaries that were problematic that you were thinking of? From what I can tell it also uses test, mkdir, sed, chmod and naturally sh, that is 8 dependencies. If those can't be depended upon for existing, perhaps git should just consider bundling busy-box or some other limited shell utils, or yikes!, even its own reimplementation of these instead of implementing these independently inside other git programs? Sorry I didn't comprehend your email fully when I first read it. I guess that wouldn't really solve your problem unless someone had a way of bundling an sh program and whatever it calls inside a single executable? :( I can see why you would want what you want, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] git exproll: steps to tackle gc aggression
On Thursday, August 08, 2013 10:56:38 am Junio C Hamano wrote: I thought the discussion was about making the local gc cheaper, and the Imagine we have a cheap way was to address it by assuming that the daily pack young objects into a single pack can be sped up if we did not have to traverse history. More permanent packs (the older ones in set of packs staggered by age Martin proposes) in the repository should go through the normal history traversal route. Assuming I understand what you are suggesting, would these young object likely still get deduped in an efficient way without doing history traversal (it sounds like they would)? In other words, if I understand correctly, it would save time by not pruning unreferenced objects, but it would still be deduping things and delta compressing also, so you would still likely get a great benefit from creating these young object packs? In other words, is there still a good chance that my 317 new pack files which included a 33M pack file will still get consolidated down to something near 8M? If so, then yeah this might be nice, especially if the history traversal is what would speed this up. Because today, my solution mostly saves IO and not time. I think it still saves time, I believe I have seen up to a 50% savings, but that is nothing compared to massive, several orders of magnitude IO savings. But if what you suggest could also give massive time (orders of magnitude) savings along with the IO improvements I am seeing, then suddenly repacking regularly would become very cheap even on large repos. The only time consuming piece would be pruning then? Could bitmaps eventually help out there? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
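A sketch of one flavor of the "no history traversal" idea discussed above: enumerate the loose objects straight from the object database and hand them to pack-objects, so nothing gets pruned, only consolidated and deltified (the loose-roll base name is made up):

    obj_dir=$(git rev-parse --git-dir)/objects
    ( cd "$obj_dir" && find ?? -type f 2>/dev/null | sed 's,/,,' ) \
        | git pack-objects "$obj_dir/pack/loose-roll"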
Re: [PATCH] git exproll: steps to tackle gc aggression
On Tuesday, August 06, 2013 06:24:50 am Duy Nguyen wrote: On Tue, Aug 6, 2013 at 9:38 AM, Ramkumar Ramachandra artag...@gmail.com wrote: + Garbage collect using a pseudo logarithmic packfile maintenance + approach. This approach attempts to minimize packfile churn + by keeping several generations of varying sized packfiles around + and only consolidating packfiles (or loose objects) which are + either new packfiles, or packfiles close to the same size as + another packfile. I wonder if a simpler approach may be nearly efficient as this one: keep the largest pack out, repack the rest at fetch/push time so there are at most 2 packs at a time. Or we we could do the repack at 'gc --auto' time, but with lower pack threshold (about 10 or so). When the second pack is as big as, say half the size of the first, merge them into one at gc --auto time. This can be easily implemented in git-repack.sh. It would definitely be better than the current gc approach. However, I suspect it is still at least one to two orders of magnitude off from where it should be. To give you a real world example, on our server today when gitexproll ran on our kernel/msm repo, it consolidated 317 pack files into one almost 8M packfile (it compresses/dedupes shockingly well, one of those new packs was 33M). Our largest packfile in that repo is 1.5G! So let's now imagine that the second closest packfile is only 100M, it would keep getting consolidated with 8M worth of data every day (assuming the same conditions and no extra compression). That would take (750M-100M)/8M ~ 81 days to finally build up large enough to no longer consolidate the new packs with the second largest pack file daily. During those 80+ days, it will be on average writing 325M too much per day (when it should be writing just 8M). So I can see the appeal of a simple solution, unfortunately I think one layer would still suck though. And if you are going to add even just one extra layer, I suspect that you might as well go the full distance since you probably already need to implement the logic to do so? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
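For what it's worth, much later versions of C git grew a built-in form of this layered idea: geometric repacking keeps the on-disk packs in a roughly geometric progression by object count and only rolls up the small ones, leaving the big pack alone most of the time. Assuming a reasonably recent git:

    git repack -d --geometric=2 --write-midx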
Re: [PATCH] git exproll: steps to tackle gc aggression
On Monday, August 05, 2013 08:38:47 pm Ramkumar Ramachandra wrote: This is the rough explanation I wrote down after reading it: So, the problem is that my .git/objects/pack is polluted with little packs everytime I fetch (or push, if you're the server), and this is problematic from the perspective of a overtly (naively) aggressive gc that hammers out all fragmentation. So, on the first run, the little packfiles I have are all consolidated into big packfiles; you also write .keep files to say that don't gc these big packs we just generated. In subsequent runs, the little packfiles from the fetch are absorbed into a pack that is immune to gc. You're also using a size heuristic, to consolidate similarly sized packfiles. You also have a --ratio to tweak the ratio of sizes. From: Martin Fickmf...@codeaurora.org See: https://gerrit-review.googlesource.com/#/c/35215/ Thread: http://thread.gmane.org/gmane.comp.version-control.git/2 31555 (Martin's emails are missing from the archive) --- After analyzing today's data, I recognize that in some circumstances the size estimation after consolidation can be off by huge amounts. The script naively just adds the current sizes together. This gives a very rough estimate, of the new packfile size, but sometimes it can be off by over 2 orders of magnitude. :( While many new packfiles are tiny (several K only), it seems like the larger new packfiles have a terrible tendency to throw the estimate way off (I suspect they simply have many duplicate objects). But despite this poor estimate, the script still offers drastic improvements over plain git gc. So, it has me wondering if there isn't a more accurate way to estimate the new packfile without wasting a ton of time? If not, one approach which might be worth experimenting with is to just assume that new packfiles have size 0! Then just consolidate them with any other packfile which is ready for consolidation, or if none are ready, with the smallest packfile. I would not be surprised to see this work on average better than the current summation, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG?] gc and impatience
On Monday, August 05, 2013 11:34:24 am Ramkumar Ramachandra wrote: Martin Fick wrote: https://gerrit-review.googlesource.com/#/c/35215/ Very cool. Of what I understood: So, the problem is that my .git/objects/pack is polluted with little packs everytime I fetch (or push, if you're the server), and this is problematic from the perspective of a overtly (naively) aggressive gc that hammers out all fragmentation. So, on the first run, the little packfiles I have are all consolidated into big packfiles; you also write .keep files to say that don't gc these big packs we just generated. In subsequent runs, the little packfiles from the fetch are absorbed into a pack that is immune to gc. You're also using a size heuristic, to consolidate similarly sized packfiles. You also have a --ratio to tweak the ratio of sizes. Yes, pretty much. I suspect that a smarter implementation would do a less good job of packing to save time also. I think this can be done by further limiting much of the lookups to the packs being packed (or some limited set of the greater packfiles). I admit I don't really understand how much the packing does today, but I believe it still looks at the larger packs with keeps to potentially deltafy against them, or to determine which objects are duplicated and thus should not be put into the new smaller packfiles? I say this because the time savings of this script is not as significant as I would have expected it to be (but the IO is). I think that it is possible to design a git gc using this rolling approach that would actually greatly reduce the time spent packing also. However, I don't think that can easily be done in a script like mine which just wraps itself around git gc. I hope that someone more familiar with git gc than me might take this on some day. :) I've checked it in and started using it; so yeah: I'll chew on it for a few weeks. The script also does some nasty timestamp manipulations that I am not proud of. They had significant time impacts for us, and likely could have been achieved some other way. They shouldn't be relevant to the packing algo though. I hope it doesn't interfere with the evaluation of the approach. Thanks for taking an interest in it, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How to still kill git fetch with too many refs
On Tuesday, July 02, 2013 03:24:14 am Michael Haggerty wrote: git rev-list HEAD | for nn in $(seq 0 100) ; do for c in $(seq 0 1) ; do read sha ; echo $sha refs/c/$nn/$c$nn ; done ; done .git/packed-refs I believe this generates a packed-refs file that is not sorted lexicographically by refname, whereas all Git-generated packed-refs files are sorted. Yes, you are indeed correct. I was attempting to be too clever with my sharding I guess. Thanks. There are some optimizations in refs.c for adding references in order that might therefore be circumvented by your unsorted file. Please try sorting the file by refname and see if that helps. (You can do so by deleting one of the packed references; then git will sort the remainder while rewriting the file.) A simple git pack-refs seems to clean it up. The original test did complete in ~77mins last night. A rerun with a sorted file takes ~61mins, -Martin PS: This test was performed with git version 1.8.2.1 on linux 2.6.32-37-generic #81-Ubuntu SMP -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
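For anyone reproducing this, a sketch of generating the synthetic packed-refs already sorted by refname; the sharding below is illustrative, not the exact original test:

    git rev-list HEAD \
        | awk '{ printf "%s refs/c/%d/%d\n", $1, NR % 100, NR }' \
        | sort -k2 > .git/packed-refs
    # or simply let git itself rewrite the file in sorted order afterwards:
    git pack-refs --all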
Re: [PATCH 0/3] avoid quadratic behavior in fetch-pack
On Tuesday, July 02, 2013 12:11:49 am Jeff King wrote: Here are my patches to deal with Martin's pathological case, split out for easy reading. I took a few timings to show that the results of the 3rd patch are noticeable even with 50,000 unique refs (which is still a lot, but something that I could conceive of a busy repo accumulating over time). [1/3]: fetch-pack: avoid quadratic list insertion in mark_complete [2/3]: commit.c: make compare_commits_by_commit_date global [3/3]: fetch-pack: avoid quadratic behavior in rev_list_push And here's the diffstat to prove it is really not scary. :) commit.c | 2 +- commit.h | 2 ++ fetch-pack.c | 16 3 files changed, 11 insertions(+), 9 deletions(-) -Peff I applied these 3 patches and it indeed improves things dramatically. Thanks Peff, you are awesome!!! The synthetic test case (but sorted), now comes in at around 15s. The more important real world case (for us), fetching from my production server, which took around 12mins previously, now takes around 30s (I think the extra time is now spent on the Gerrit server, but I will investigate that a bit more)! That is very significant and should make many workflows much more efficient. +1 for merging this. :) Again, thanks, -Martin Note, I tested git-next 1.8.3.2.883.g27cfd27 to be sure that it is still problematic without this patch, it is (running for 10mins now without completing). -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
How to still kill git fetch with too many refs
I have often reported problems with git fetch when there are many refs in a repo, and I have been pleasantly surprised how many problems I reported were so quickly fixed. :) With time, others have created various synthetic test cases to ensure that git can handle many many refs. A simple synthetic test case with 1M refs all pointing to the same sha1 seems to be easily handled by git these days. However, in our experience with our internal git repo, we still have performance issues related to having too many refs, in our kernel/msm instance we have around 400K. When I tried the simple synthetic test case and could not reproduce bad results, so I tried something just a little more complex and was able to get atrocious results!!! Basically, I generate a packed-refs files with many refs which each point to a different sha1. To get a list of valid but unique sha1s for the repo, I simply used rev-list. The result, a copy of linus' repo with a million unique valid refs and a git fetch of a single updated ref taking a very long time (55mins and it did not complete yet). Note, with 100K refs it completes in about 2m40s. It is likely not linear since 2m40s * 10 would be ~26m (but the difference could also just be how the data in the sha1s are ordered). Here is my small reproducible test case for this issue: git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git cp -rp linux linux.1Mrefs-revlist cd linux echo Hello hello ; git add hello ; git ci -a -m 'hello' cd .. cd linux.1Mrefs-revlist git rev-list HEAD | for nn in $(seq 0 100) ; do for c in $(seq 0 1) ; do read sha ; echo $sha refs/c/$nn/$c$nn ; done ; done .git/packed-refs time git fetch file:///$(dirname $PWD)/linux refs/heads/master Any insights as to why it is so slow, and how we could possibly speed it up? Thanks, -Martin PS: My tests were performed with git version 1.8.2.1 on linux 2.6.32-37-generic #81-Ubuntu SMP -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Fixing the git-repack replacement gap?
I have been trying to think of ways to fix git-repack so that it no longer momentarily makes the objects in a repo inaccessible to all processes when it replaces packfiles with the same objects in them as an already existing pack file. To be more explicit, I am talking about the way it moves the existing pack file (and index) to old-sha1.pack before moving the new packfile in place. During this moment in time the objects in that packfile are simply not available to anyone using the repo. This can be particularly problematic for busy servers. There likely are at lest 2 ways that the fundamental design of packfiles, their indexes, and their names have led to this issue. If the packfile and index were stored in a single file, they could have been replaced atomically and thus it would potentially avoid the issue of them being temporarily inaccessible (although admittedly that might not work anyway on some filesystems). Alternatively, if the pack file were named after the sha1 of the packed contents of the file instead of the sha1 of the objects in the sha1, then the replacement would never need to happen since it makes no sense to replace a file with another file with the exact same contents (unless, of course the first one is corrupt, but then you aren't likely making the repo temporarily worse, you are fixing a broken repo). I suspect these 2 ideas have been discussed before, but since they are fundamental changes to the way pack files work (and thus would not be backwards compatible), they are not likely to get implemented soon. This got me wondering if there wasn't an easier backwards compatible solution to avoid making the objects inaccessible? It seems like the problem could be avoided if we could simply change the name of the pack file when a replacement would be needed? Of course, if we just changed the name, then the name would not match the sha1 of the contained objects and would likely be considered bad by git? So, what if we could simply add a dummy object to the file to cause it to deserve a name change? So the idea would be, have git-repack detect the conflict in filenames and have it repack the new file with an additional dummy (unused) object in it, and then deliver the new file which no longer conflicts. Would this be possible? If so, what sort of other problems would this cause? It would likely cause an unreferenced object and likely cause it to want to get pruned by the next git-repack? Is that OK, maybe you want it to get pruned because then the pack file will get repacked once again without the dummy object later and avoid the temporarily inaccessible period for objects in the file? Hmm, but then maybe that could even be done in a single git- repack run (at the expense of extra disk space)? 1) Detect the conflict, 2) Save the replacement file 3) Create a new packfile with the dummy object 4) Put the new file with the dummy object into service 5) Remove the old conflicting file (no gap) 6) Place the new conflicting file in service (no dummy) 7) Remove the new file with dummy object (no gap again) done? Would it work? If so, is there an easy way to create the dummy file? Can any object simply be added at the end of a pack file after the fact (and then added to the index too)? Also, what should the dummy object be? Is there some sort of null object that would be tiny and that would never already be in the pack? 
Thanks for any thoughts, -Martin
Re: git hangs on pthread_join
On Thursday, May 23, 2013 07:01:43 am you wrote: I'm running a rather special configuration, basically i have a gerrit server pushing ... I have found git receive-packs that has been running for days/weeks without terminating ... Anyone that has any clues about what could be going wrong? -- Have you narrowed down whether this is a git client problem or a server problem (gerrit in your case)? Is this a repeatable issue? Try the same operation against a clone of the repo using just git. Check on the server side for .noz files in your repo (a jgit thing), -Martin
Re: inotify to minimize stat() calls
On Sunday, February 10, 2013 12:03:00 pm Robert Zeh wrote: On Sat, Feb 9, 2013 at 1:35 PM, Junio C Hamano gits...@pobox.com wrote: Ramkumar Ramachandra artag...@gmail.com writes: This is much better than Junio's suggestion to study possible implementations on all platforms and designing a generic daemon/communication channel. That's no weekend project. It appears that you misunderstood what I wrote. That was not "here is a design; I want it in my system. Go implement it." It was "If somebody wants to discuss it but does not know where to begin, doing a small experiment like this and reporting how well it worked here may be one way to do so.", nothing more. What if instead of communicating over a socket, the daemon dumped a file containing all of the lstat information after git wrote a file? By definition the daemon should know about file writes. But git doesn't; how will it know when the file is written? Will it use inotify, or poll (which kind of defeats the point)? -Martin
Re: [PATCH 0/2] optimizing pack access on read only fetch repos
Jeff King p...@peff.net wrote: On Sat, Jan 26, 2013 at 10:32:42PM -0800, Junio C Hamano wrote: Both makes sense to me. I also wonder if we would be helped by another repack mode that coalesces small packs into a single one with minimum overhead, and run that often from gc --auto, so that we do not end up having to have 50 packfiles. When we have 2 or more small and young packs, we could:

  - iterate over idx files for these packs to enumerate the objects to be packed, replacing the read_object_list_from_stdin() step;
  - always choose to copy the data we have in these existing packs, instead of doing a full prepare_pack(); and
  - use the order the objects appear in the original packs, bypassing compute_write_order().

I'm not sure. If I understand you correctly, it would basically just be concatenating packs without trying to do delta compression between the objects which are ending up in the same pack. So it would save us from having to do (up to) 50 binary searches to find an object in a pack, but would not actually save us much space. I would be interested to see the timing on how quick it is compared to a real repack, as the I/O that happens during a repack is non-trivial (although if you are leaving aside the big main pack, then it is probably not bad). But how do these somewhat mediocre concatenated packs get turned into real packs? Pack-objects does not consider deltas between objects in the same pack. And when would you decide to make a real pack? How do you know you have 50 young and small packs, and not 50 mediocre coalesced packs? If we are reconsidering repacking strategies, I would like to propose an approach that might be a more general improvement to repacking which would help in more situations. You could roll together any packs which are close in size, say within 50% of each other. With this strategy you will end up with files whose sizes are spread out exponentially. I implemented this strategy on top of the current gc script using keep files, and it works fairly well: https://gerrit-review.googlesource.com/#/c/35215/3/contrib/git-exproll.sh This saves some time, but mostly it saves I/O when repacking regularly. I suspect that if this strategy were used in core git, further optimizations could be made to also reduce the repack time, but I don't know enough about repacking to know. We run it nightly on our servers, both write and read-only mirrors. We use a ratio of 5 currently to drastically reduce large repack file rollovers, -Martin
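[Editorial illustration: a rough sketch of the "roll up small packs, leave big ones alone" idea using .keep files, which "git repack -A -d" honors. This is not the real git-exproll.sh; the ratio value, the exact size criterion, and the paths are illustrative assumptions.]

  RATIO=5
  PACKDIR=.git/objects/pack
  total=0
  for p in $(ls -Sr "$PACKDIR"/pack-*.pack)     # smallest pack first
  do
      size=$(stat -c %s "$p")
      if [ "$total" -gt 0 ] && [ "$size" -gt $((total * RATIO)) ]
      then
          touch "${p%.pack}.keep"   # much bigger than everything smaller:
                                    # not worth rolling it over, keep it
      else
          rm -f "${p%.pack}.keep"   # similar in size to its neighbors:
                                    # let repack consolidate it
      fi
      total=$((total + size))
  done
  git repack -A -d                  # only repacks packs without a .keep file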
Re: [PATCH] refs: do not use cached refs in repack_without_ref
...[Sorry about the previous HTML reposts] Jeff King p...@peff.net wrote: On Mon, Dec 31, 2012 at 03:30:53AM -0700, Martin Fick wrote: The general approach is to setup a transaction and either commit or abort it. A transaction can be setup by renaming an appropriately setup directory to the ref.lock name. If the rename succeeds, the transaction is begun. Any actor can abort the transaction (up until it is committed) by simply deleting the ref.lock directory, so it is not at risk of going stale. Deleting a directory is not atomic, as you first have to remove the contents, putting it into a potentially inconsistent state. I'll assume you deal with that later... Right, these simple single file transactions have at best 1 important file/directory in them, once deleted the transaction is aborted (can no longer complete). However to support multi file transactions, a better approach is likely to rename the uuid directory to have a .delete extension before deleting stuff in it. One important piece of the transaction is the use of uuids. The uuids provide a mechanism to tie the atomic commit pieces to the transactions and thus to prevent long sleeping process from inadvertently performing actions which could be out of date when they wake finally up. Has this been a problem for you in practice? No, but as you say, we don't currently hold locks for very long. I anticipate it being a problem in a clustered environment when transactions start spanning repos from java processes, with insane amounts of RAM, which can sometimes have unpredictable indeterminately long java GC cycles at inopportune times.. It would seem short sighted if Gerrit at least did not assume this will be a problem. But, deletes today in git are not so short and Michael's fixes may make things worse? But, as you point out, that should perhaps be solved a different way. Avoiding this is one of the reasons that git does not take out long locks; instead, it takes the lock only at the moment it is ready to write, and aborts if it has been updated since the longer-term operation began. This has its own problems (you might do a lot of work only to have your operation aborted), but I am not sure that your proposal improves on that. It does not, it might increase this. Git typically holds ref locks for a few syscalls. If you are conservative about leaving potentially stale locks in place (e.g., give them a few minutes to complete before assuming they are now bogus), you will not run into that problem. In a distributed environment even a few minutes might not be enough, processes could be on a remote server with a temporarily split network, that could cause delays longer than your typical local expectations. But there is also the other piece of this problem, how do you detect stale locks? How long will it be stale until a user figures it out and reports it? How many other users will simply have failed pushes and wonder why without reporting them? In each case, the atomic commit piece is the renaming of a file. For the create and update pieces, a file is renamed from the ref.lock dir to the ref file resulting in an update to the sha for the ref. I think we've had problems with cross-directory renames on some filesystems, but I don't recall the details. I know that Coda does not like cross-directory links, but cross-directory renames are OK (and in fact we fall back to the latter when the former does not work). Ah, here we go: 5723fe7 (Avoid cross-directory renames and linking on object creation, 2008-06-14). Looks like NFS is the culprit. 
If the renames fail we can fall back to regular file locking, the hard part to detect and deal with would be if the renames don't fail but become copies/mkdirs. In the case of a delete, the actor may verify that ref currently contains the sha to prune if it needs to, and then renames the ref file to ref.lock/uuid/delete. On success, the ref was deleted. Whether successful or not, the actor may now simply delete the ref.lock directory, clearing the way for a new transaction. Any other actor may delete this directory at any time also, likely either on conflict (if they are attempting to initiate a transaction), or after a grace period just to cleanup the FS. Any actor may also safely cleanup the tmp directories, preferably also after a grace period. Hmm. So what happens to the delete file when the ref.lock directory is being deleted? Presumably deleting the ref.lock directory means doing it recursively (which is non-atomic). But then why are we keeping the delete file at all, if we're just about to remove it? We are not trying to keep it, but we need to ensure that our transaction has not yet been aborted: the rename does this. If we just deleted the file, we may sleep and another transaction may abort our transaction and complete before we wake up and actually delete the file. But by using
Re: Lockless Refs? (Was [PATCH] refs: do not use cached refs in repack_without_ref)
On Friday, January 04, 2013 10:52:43 am Pyeron, Jason J CTR (US) wrote: From: Martin Fick Sent: Thursday, January 03, 2013 6:53 PM Any thoughts on this idea? Is it flawed? I am trying to write it up in a more formal generalized manner and was hoping to get at least one "it seems sane" before I do. If you are assuming that atomic renames, etc. are available, then you should identify a test case and a degrade operation path when it is not available. Thanks, sounds reasonable. Were you thinking of a runtime test case that would be run before every transaction? I was anticipating a per-repo config option called something like core.locks = recoverable that would be needed to turn them on? I was thinking that this was something that server sites could test in advance on their repos and then enable it for them. Maybe a git-lock tool with a --test-recoverable option? -Martin On Monday, December 31, 2012 03:30:53 am Martin Fick wrote: On Thursday, December 27, 2012 04:11:51 pm Martin Fick wrote: It concerns me that git uses any locking at all, even for refs since it has the potential to leave around stale locks. ... [a previous not so great attempt to fix this] ... I may have finally figured out a working loose ref update mechanism which I think can avoid stale locks. Unfortunately it requires atomic directory renames and universally unique identifiers (uuids). These may be no-go criteria? But I figure it is worth at least exploring this idea because of the potential benefits? The general approach is to set up a transaction and either commit or abort it. A transaction can be set up by renaming an appropriately prepared directory to the ref.lock name. If the rename succeeds, the transaction is begun. Any actor can abort the transaction (up until it is committed) by simply deleting the ref.lock directory, so it is not at risk of going stale. However, once the actor who sets up the transaction commits it, deleting the ref.lock directory simply aids in cleaning it up for the next transaction (instead of aborting it). One important piece of the transaction is the use of uuids. The uuids provide a mechanism to tie the atomic commit pieces to the transactions and thus to prevent long-sleeping processes from inadvertently performing actions which could be out of date when they finally wake up. In each case, the atomic commit piece is the renaming of a file. For the create and update pieces, a file is renamed from the ref.lock dir to the ref file, resulting in an update to the sha for the ref. However, in the delete case, the ref file is instead renamed to end up in the ref.lock directory, resulting in a delete of the ref. This scheme does not affect the way refs are read today. To prepare for a transaction, an actor first generates a uuid (an exercise I will delay for now). Next, a tmp directory named after the uuid is generated in the parent directory for the ref to be updated, perhaps something like: .lock_uuid. In this directory is placed either a file or a directory named after the uuid, something like: .lock_uuid/,uuid. In the case of a create or an update, the new sha is written to this file. In the case of a delete, it is a directory. Once the tmp directory is set up, the initiating actor attempts to start the transaction by renaming the tmp directory to ref.lock. If the rename fails, the update fails. If the rename succeeds, the actor can then attempt to commit the transaction (before another actor aborts it).
In the case of a create, the actor verifies that ref does not currently exist, and then renames the now named ref.lock/uuid file to ref. On success, the ref was created. In the case of an update, the actor verifies that ref currently contains the old sha, and then also renames the now named ref.lock/uuid file to ref. On success, the ref was updated. In the case of a delete, the actor may verify that ref currently contains the sha to prune if it needs to, and then renames the ref file to ref.lock/uuid/delete. On success, the ref was deleted. Whether successful or not, the actor may now simply delete the ref.lock directory, clearing the way for a new transaction. Any other actor may delete this directory at any time also, likely either on conflict (if they are attempting to initiate a transaction), or after a grace period just to cleanup the FS. Any actor may also safely cleanup the tmp directories, preferably also after a grace period. One neat part about this scheme is that I believe it would be backwards compatible with the current locking mechanism since the transaction directory will simply appear to be a lock to older clients. And the old
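[Editorial illustration: a minimal shell sketch of the update path in this scheme, assuming only atomic rename. The variable names ($GIT_DIR, $ref, $old, $new), the use of uuidgen, and GNU mv's -T flag are illustrative assumptions, not part of the proposal; error handling is omitted.]

  uuid=$(uuidgen)
  refdir=$(dirname "$GIT_DIR/$ref")

  # Prepare: stage the new value in a uuid-named tmp directory.
  mkdir -p "$refdir/.lock_$uuid"
  echo "$new" > "$refdir/.lock_$uuid/$uuid"

  # Begin: one atomic rename starts the transaction, or fails if another
  # transaction (or an old-style lock) already holds ref.lock.
  mv -T "$refdir/.lock_$uuid" "$GIT_DIR/$ref.lock" || exit 1

  # Commit: verify the old value, then atomically rename the uuid-named
  # file onto the ref.  Because the file is named after the uuid, a
  # long-sleeping process cannot commit a transaction that was aborted.
  test "$(cat "$GIT_DIR/$ref")" = "$old" &&
  mv "$GIT_DIR/$ref.lock/$uuid" "$GIT_DIR/$ref"

  # Cleanup (any actor may do this; before the commit rename it is an abort).
  rm -rf "$GIT_DIR/$ref.lock"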
Re: Lockless Refs? (Was [PATCH] refs: do not use cached refs in repack_without_ref)
Any thoughts on this idea? Is it flawed? I am trying to write it up in a more formal generalized manner and was hoping to get at least one it seems sane before I do. Thanks, -Martin On Monday, December 31, 2012 03:30:53 am Martin Fick wrote: On Thursday, December 27, 2012 04:11:51 pm Martin Fick wrote: It concerns me that git uses any locking at all, even for refs since it has the potential to leave around stale locks. ... [a previous not so great attempt to fix this] ... I may have finally figured out a working loose ref update mechanism which I think can avoid stale locks. Unfortunately it requires atomic directory renames and universally unique identifiers (uuids). These may be no-go criteria? But I figure it is worth at least exploring this idea because of the potential benefits? The general approach is to setup a transaction and either commit or abort it. A transaction can be setup by renaming an appropriately setup directory to the ref.lock name. If the rename succeeds, the transaction is begun. Any actor can abort the transaction (up until it is committed) by simply deleting the ref.lock directory, so it is not at risk of going stale. However, once the actor who sets up the transaction commits it, deleting the ref.lock directory simply aids in cleaning it up for the next transaction (instead of aborting it). One important piece of the transaction is the use of uuids. The uuids provide a mechanism to tie the atomic commit pieces to the transactions and thus to prevent long sleeping process from inadvertently performing actions which could be out of date when they wake finally up. In each case, the atomic commit piece is the renaming of a file. For the create and update pieces, a file is renamed from the ref.lock dir to the ref file resulting in an update to the sha for the ref. However, in the delete case, the ref file is instead renamed to end up in the ref.lock directory resulting in a delete of the ref. This scheme does not affect the way refs are read today, To prepare for a transaction, an actor first generates a uuid (an exercise I will delay for now). Next, a tmp directory named after the uuid is generated in the parent directory for the ref to be updated, perhaps something like: .lock_uuid. In this directory is places either a file or a directory named after the uuid, something like: .lock_uuid/,uuid. In the case of a create or an update, the new sha is written to this file. In the case of a delete, it is a directory. Once the tmp directory is setup, the initiating actor attempts to start the transaction by renaming the tmp directory to ref.lock. If the rename fails, the update fails. If the rename succeeds, the actor can then attempt to commit the transaction (before another actor aborts it). In the case of a create, the actor verifies that ref does not currently exist, and then renames the now named ref.lock/uuid file to ref. On success, the ref was created. In the case of an update, the actor verifies that ref currently contains the old sha, and then also renames the now named ref.lock/uuid file to ref. On success, the ref was updated. In the case of a delete, the actor may verify that ref currently contains the sha to prune if it needs to, and then renames the ref file to ref.lock/uuid/delete. On success, the ref was deleted. Whether successful or not, the actor may now simply delete the ref.lock directory, clearing the way for a new transaction. 
Any other actor may delete this directory at any time also, likely either on conflict (if they are attempting to initiate a transaction), or after a grace period just to cleanup the FS. Any actor may also safely cleanup the tmp directories, preferably also after a grace period. One neat part about this scheme is that I believe it would be backwards compatible with the current locking mechanism since the transaction directory will simply appear to be a lock to older clients. And the old lock file should continue to lock out these newer transactions. Due to this backwards compatibility, I believe that this could be incrementally employed today without affecting very much. It could be deployed in place of any updates which only hold ref.locks to update the loose ref. So for example I think it could replace step 4a below from Michael Haggerty's description of today's loose ref pruning during ref packing: * Pack references: ... 4. prune_refs(): for each ref in the ref_to_prune list, call prune_ref(): a. Lock the reference using lock_ref_sha1(), verifying that the recorded SHA1 is still valid. If it is, unlink the loose reference file then free the lock; otherwise leave the loose reference file untouched. I think it would also therefore be able to replace the loose ref locking in Michael's new ref-packing scheme as well as the locking in Michael's new ref
Re: Lockless Refs? (Was [PATCH] refs: do not use cached refs in repack_without_ref)
On Thursday, December 27, 2012 04:11:51 pm Martin Fick wrote: It concerns me that git uses any locking at all, even for refs since it has the potential to leave around stale locks. ... [a previous not so great attempt to fix this] ... I may have finally figured out a working loose ref update mechanism which I think can avoid stale locks. Unfortunately it requires atomic directory renames and universally unique identifiers (uuids). These may be no-go criteria? But I figure it is worth at least exploring this idea because of the potential benefits? The general approach is to setup a transaction and either commit or abort it. A transaction can be setup by renaming an appropriately setup directory to the ref.lock name. If the rename succeeds, the transaction is begun. Any actor can abort the transaction (up until it is committed) by simply deleting the ref.lock directory, so it is not at risk of going stale. However, once the actor who sets up the transaction commits it, deleting the ref.lock directory simply aids in cleaning it up for the next transaction (instead of aborting it). One important piece of the transaction is the use of uuids. The uuids provide a mechanism to tie the atomic commit pieces to the transactions and thus to prevent long sleeping process from inadvertently performing actions which could be out of date when they wake finally up. In each case, the atomic commit piece is the renaming of a file. For the create and update pieces, a file is renamed from the ref.lock dir to the ref file resulting in an update to the sha for the ref. However, in the delete case, the ref file is instead renamed to end up in the ref.lock directory resulting in a delete of the ref. This scheme does not affect the way refs are read today, To prepare for a transaction, an actor first generates a uuid (an exercise I will delay for now). Next, a tmp directory named after the uuid is generated in the parent directory for the ref to be updated, perhaps something like: .lock_uuid. In this directory is places either a file or a directory named after the uuid, something like: .lock_uuid/,uuid. In the case of a create or an update, the new sha is written to this file. In the case of a delete, it is a directory. Once the tmp directory is setup, the initiating actor attempts to start the transaction by renaming the tmp directory to ref.lock. If the rename fails, the update fails. If the rename succeeds, the actor can then attempt to commit the transaction (before another actor aborts it). In the case of a create, the actor verifies that ref does not currently exist, and then renames the now named ref.lock/uuid file to ref. On success, the ref was created. In the case of an update, the actor verifies that ref currently contains the old sha, and then also renames the now named ref.lock/uuid file to ref. On success, the ref was updated. In the case of a delete, the actor may verify that ref currently contains the sha to prune if it needs to, and then renames the ref file to ref.lock/uuid/delete. On success, the ref was deleted. Whether successful or not, the actor may now simply delete the ref.lock directory, clearing the way for a new transaction. Any other actor may delete this directory at any time also, likely either on conflict (if they are attempting to initiate a transaction), or after a grace period just to cleanup the FS. Any actor may also safely cleanup the tmp directories, preferably also after a grace period. 
One neat part about this scheme is that I believe it would be backwards compatible with the current locking mechanism since the transaction directory will simply appear to be a lock to older clients. And the old lock file should continue to lock out these newer transactions. Due to this backwards compatibility, I believe that this could be incrementally employed today without affecting very much. It could be deployed in place of any updates which only hold ref.locks to update the loose ref. So for example I think it could replace step 4a below from Michael Haggerty's description of today's loose ref pruning during ref packing: * Pack references: ... 4. prune_refs(): for each ref in the ref_to_prune list, call prune_ref(): a. Lock the reference using lock_ref_sha1(), verifying that the recorded SHA1 is still valid. If it is, unlink the loose reference file then free the lock; otherwise leave the loose reference file untouched. I think it would also therefore be able to replace the loose ref locking in Michael's new ref-packing scheme as well as the locking in Michael's new ref deletion scheme (again steps 4): * Delete reference foo: ... 4. Delete loose ref for foo: a. Acquire the lock $GIT_DIR/refs/heads/foo.lock b. Unlink $GIT_DIR/refs/heads/foo if it is unchanged. If it is changed, leave it untouched. If it is deleted, that is OK too. c
Re: Lockless Refs? (Was [PATCH] refs: do not use cached refs in repack_without_ref)
On Saturday, December 29, 2012 03:18:49 pm Martin Fick wrote: Jeff King p...@peff.net wrote: On Thu, Dec 27, 2012 at 04:11:51PM -0700, Martin Fick wrote: My idea is based on using filenames to store sha1s instead of file contents. To do this, the sha1 of a ref would be stored in a file in a directory named after the loose ref. I believe this would then make it possible to have lockless atomic ref updates by renaming the file. To more fully illustrate the idea, imagine that any file (except for the null file) in the directory will represent the value of the ref with its name, then the following transitions can represent atomic state changes to a ref's value and existence: Hmm. So basically you are relying on atomic rename() to move the value around within a directory, rather than using write to move it around within a file. Atomic rename is usually something we have on local filesystems (and I think we rely on it elsewhere). Though I would not be surprised if it is not atomic on all networked filesystems (though it is on NFS, at least). Yes. I assume this is OK because doesn't git already rely on atomic renames? For example to rename the new packed-refs file to unlock it? ... 3) To create a ref, it must be renamed from the null file (sha ...) to the new value just as if it were being updated from any other value, but there is one extra condition: before renaming the null file, a full directory scan must be done to ensure that the null file is the only file in the directory (this condition exists because creating the directory and null file cannot be atomic unless the filesystem supports atomic directory renames, an expectation git does not currently make). I am not sure how this compares to today's approach, but including the setup costs (described below), I suspect it is slower. Hmm. mkdir is atomic. So wouldn't it be sufficient to just mkdir and create the correct sha1 file? But then a process could mkdir and die, leaving a stale empty dir with no reliable recovery mechanism. Unfortunately, I think I see another flaw though! :( I should have known that I cannot separate an important check from its state transitioning action. The following could happen:

  A does mkdir
  A creates null file
  A checks dir - no other files
  B checks dir - no other files
  A renames null file to abcd
  C creates second null file
  B renames second null file to defg

One way to fix this is to rely on directory renames, but I believe this is something git does not want to require of every FS? If we did, we could change #3 to be: 3) To create a ref, it must be renamed from the null file (sha ...) to the new value just as if it were being updated from any other value. (No more scan) Then, with reliable directory renames, a process could do what you suggested to a temporary directory, mkdir + create null file, then rename the temporary dir to refname. This would prevent duplicate null files. With a grace period, the temporary dirs could be cleaned up in case a process dies before the rename. This is your approach with reliable recovery. The whole null file can go away if we use directory renames. Make #3: 3) To create a ref, create a temporary directory containing a file named after the sha1 of the ref to be created and rename the directory to the name of the ref to create. If the rename fails, the create fails. If the rename succeeds, the create succeeds.
With a grace period, the temporary dirs could be cleaned up in case a process dies before the rename, -Martin
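[Editorial illustration: a minimal sketch of that revised create (#3), assuming atomic directory renames. $GIT_DIR, $ref, and $new are placeholders, and GNU mv's -T flag is an assumption used to make the rename fail if the ref already exists.]

  # The ref is a directory; the single file's *name* is the ref's value.
  tmp=$(mktemp -d "$GIT_DIR/.tmp_ref_XXXXXX")
  touch "$tmp/$new"                  # the value is carried by the file name
  mv -T "$tmp" "$GIT_DIR/$ref" || {  # one atomic rename creates the ref;
      rm -rf "$tmp"                  # it fails if the ref already exists,
      exit 1                         # so just clean up our own tmp directory
  }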
Re: Lockless Refs?
Jeff King p...@peff.net wrote: On Fri, Dec 28, 2012 at 09:15:52AM -0800, Junio C Hamano wrote: Martin Fick mf...@codeaurora.org writes: Hmm, actually I believe that with a small modification to the semantics described here it would be possible to make multi repo/branch commits work. Shawn talked about adding multi repo/branch transaction semantics to jgit; this might be something that git wants to support also at some point? Shawn may have talked about it and you may have listened to it, but others wouldn't have any idea what kind of multi repo/branch transaction you are talking about. Is it about "I want to push this ref to that repo and push this other ref to that other repo"; in what situation will it be used/useful, what are the failure modes, what are failure tolerances by the expected use cases, ...? Care to explain? I cannot speak for Martin, but I am assuming the point is to atomically update 2 (or more) refs on the same repo. That is, if I have a branch refs/heads/foo and a ref pointing to meta-information (say, notes about commits in foo, in refs/notes/meta/foo), I would want to git push them, and only update them if _both_ will succeed, and otherwise fail and update nothing. My use case was cross repo/branch dependencies in Gerrit (which do not yet exist). Users want to be able to define several changes (destined for different project/branches) which can only be merged together. If one change cannot be merged, the others should fail too. The solutions we can think of generally need to hold ref locks while acquiring more ref locks; this drastically increases the opportunities for stale locks over the simple lock, check, update, unlock mode for which git locks are currently used. I was perhaps making too big of a leap to assume that there would be other non-Gerrit use cases for this? I assumed that other git projects which are spread across several git repos would need this? But maybe this simply wouldn't be practical with other git server solutions? -Martin Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum
Re: Lockless Refs? (Was [PATCH] refs: do not use cached refs in repack_without_ref)
Jeff King p...@peff.net wrote: On Fri, Dec 28, 2012 at 07:50:14AM -0700, Martin Fick wrote: Hmm, actually I believe that with a small modification to the semantics described here it would be possible to make multi repo/branch commits work. Simply allow the ref filename to be locked by a transaction by appending the transaction ID to the filename. So if transaction 123 wants to lock master which points currently to abcde, then it will move master/abcde to master/abcde_123. If transaction 123 is designed so that any process can commit/complete/abort it without requiring any locks which can go stale, then this ref lock will never go stale either (easy as long as it writes all its proposed updates somewhere upfront and has atomic semantics for starting, committing and aborting). On commit, the ref lock gets updated to its new value: master/newsha and on abort it gets unlocked: master/abcde. Hmm. I thought our goal was to avoid locks? Isn't this just locking by another name? It is a lock, but it is a lock with an owner: the transaction. If the transaction has reliable recovery semantics, then the lock will be recoverable also. This is possible if we have lock ownership (the transaction) which does not exist today for the ref locks. With good lock ownership we gain the ability to reliably delete locks for a specific owner without the risk of deleting the lock when held by another owner (putting the owner in the filename is good, while putting the owner in the filecontents is not). Lastly, for reliable recovery of stale locks we need the ability to determine when an owner has abandoned a lock. I believe that the transaction semantics laid out below give this. I guess your point is to have no locks in the normal case, and have locked transactions as an optional add-on? Basically. If we design the transaction into the git semantics we could ensure that it is recoverable and we should not need to expose these reflocks outside of the transaction APIs. To illustrate a simple transaction approach (borrowing some of Shawn's ideas), we could designate a directory to hold transaction files *1. To prepare a transaction: write a list of repo:ref:oldvalue:newvalue to a file named id.new (in a stable sorted order based on repo:ref to prevent deadlocks). This is not a state change and thus this file could be deleted by any process at anytime (preferably after a long grace period). If file renames are atomic on the filesystem holding the transaction files then 1, 2, 3 below will be atomic state changes. It does not matter who performs state transitions 2 or 3. It does not matter who implements the work following any of the 3 transitions, many processes could attempt the work in parallel (so could a human). 1) To start the transaction, rename the id.new file to id. If the rename fails, start over if desired/still possible. On success, ref locks for each entry should be acquired in listed order (to prevent deadlocks), using transaction id and oldvalue. It is never legal to unlock a ref in this state (because a block could cause the unlock to be delayed until the commit phase). However, it is legal for any process to transition to abort at any time from this state, perhaps because of a failure to acquire a lock (held by another transaction), and definitely if a ref has changed (is no longer oldvalue). 2) To abort the transaction, rename the id file to id.abort. This should only ever fail if commit was achieved first. Once in this state, any process may/should unlock any ref locks belonging to this transaction id. 
Once all refs are unlocked, id.abort may be deleted (it could be deleted earlier, but then cleanup will take longer). 3) To commit the transaction, rename the file to id.commit. This should only ever fail if abort was achieved first. This transition should never be done until every listed ref is locked by the current transaction id. Once in this phase, all refs may/should be moved to their new values and unlocked by any process. Once all refs are unlocked, id.commit may be deleted. Since any process attempting any of the work in these transactions could block at any time for an indefinite amount of time, these processes may wake after the transaction is aborted or committed and the transaction files are cleaned up. I believe that in these cases the only action which could succeed for these waking processes is the ref locking action. All such abandoned ref locks may/should be unlocked by any process. This last rule means that no transaction ids should ever be reused, -Martin *1 We may want to adapt the simple model illustrated above to use git mechanisms such as refs to hold transaction info instead of files in a directory, and git submodule files to hold the list of refs to update. Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum
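[Editorial illustration: a hedged shell sketch of the three state-changing renames described above. The directory layout, file names, and the ref-lock naming (appending the transaction id to the ref's value file, as in the earlier master/abcde -> master/abcde_123 example) are illustrative assumptions, not an existing git feature.]

  txdir=$GIT_DIR/transactions     # assumed location for transaction files
  id=$unique_id                   # never-reused transaction id

  # Prepare: list updates as repo:ref:oldvalue:newvalue, sorted to avoid
  # deadlocks.  This is not yet a state change; anyone may delete the file.
  sort updates.txt > "$txdir/$id.new"

  # 1) Start: one atomic rename makes the transaction live.
  mv "$txdir/$id.new" "$txdir/$id" || exit 1

  # Any process may now take ref locks on behalf of transaction $id, in
  # list order, by tagging each ref's value file with the id, e.g.
  #   mv "$refdir/$old" "$refdir/${old}_$id"

  # 2) Abort, or 3) commit: whichever rename of "$txdir/$id" happens first
  # wins, because the source file disappears; the loser's mv simply fails.
  case "$action" in
  abort)  mv "$txdir/$id" "$txdir/$id.abort"  ;;  # fails if already committed
  commit) mv "$txdir/$id" "$txdir/$id.commit" ;;  # only after all locks held;
  esac                                            # fails if already aborted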
Re: Lockless Refs? (Was [PATCH] refs: do not use cached refs in repack_without_ref)
Jeff King p...@peff.net wrote: On Thu, Dec 27, 2012 at 04:11:51PM -0700, Martin Fick wrote: My idea is based on using filenames to store sha1s instead of file contents. To do this, the sha1 one of a ref would be stored in a file in a directory named after the loose ref. I believe this would then make it possible to have lockless atomic ref updates by renaming the file. To more fully illustrate the idea, imagine that any file (except for the null file) in the directory will represent the value of the ref with its name, then the following transitions can represent atomic state changes to a refs value and existence: Hmm. So basically you are relying on atomic rename() to move the value around within a directory, rather than using write to move it around within a file. Atomic rename is usually something we have on local filesystems (and I think we rely on it elsewhere). Though I would not be surprised if it is not atomic on all networked filesystems (though it is on NFS, at least). Yes. I assume this is OK because doesn't git already rely on atomic renames? For example to rename the new packed-refs file to unlock it? ... 3) To create a ref, it must be renamed from the null file (sha ...) to the new value just as if it were being updated from any other value, but there is one extra condition: before renaming the null file, a full directory scan must be done to ensure that the null file is the only file in the directory (this condition exists because creating the directory and null file cannot be atomic unless the filesystem supports atomic directory renames, an expectation git does not currently make). I am not sure how this compares to today's approach, but including the setup costs (described below), I suspect it is slower. Hmm. mkdir is atomic. So wouldn't it be sufficient to just mkdir and create the correct sha1 file? But then a process could mkdir and die leaving a stale empty dir with no reliable recovery mechanism. Unfortunately, I think I see another flaw though! :( I should have known that I cannot separate an important check from its state transitioning action. The following could happen: A does mkdir A creates null file A checks dir - no other files B checks dir - no other files A renames null file to abcd C creates second null file B renames second null file to defg One way to fix this is to rely on directory renames, but I believe this is something git does not want to require of every FS? If we did, we could Change #3 to be: 3) To create a ref, it must be renamed from the null file (sha ...) to the new value just as if it were being updated from any other value. (No more scan) Then, with reliable directory renames, a process could do what you suggested to a temporary directory, mkdir + create null file, then rename the temporary dir to refname. This would prevent duplicate null files. With a grace period, the temporary dirs could be cleaned up in case a process dies before the rename. This is your approach with reliable recovery. I don't know how this new scheme could be made to work with the current scheme, it seems like perhaps new git releases could be made to understand both the old and the new, and a config option could be used to tell it which method to write new refs with. Since in this new scheme ref directory names would conflict with old ref filenames, this would likely prevent both schemes from erroneously being used simultaneously (so they shouldn't corrupt each other), except for the fact that refs can be nested in directories which confuses things a bit. 
I am not sure what a good solution to this is? I think you would need to bump core.repositoryformatversion, and just never let old versions of git access the repository directly. Not the end of the world, but it certainly increases deployment effort. If we were going to do that, it would probably make sense to think about solving the D/F conflict issues at the same time (i.e., start calling refs/heads/foo in the filesystem refs.d/heads.d/foo.ref so that it cannot conflict with refs.d/heads.d/foo.d/bar.ref). Wouldn't you want to use a non legal ref character instead of dot? And without locks, we free up more of the ref namespace too I think? (Refs could end in .lock) -Martin Employee of Qualcomm Innovation Center,Inc. which is a member of Code Aurora Forum -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Lockless Refs?
On Friday, December 28, 2012 09:58:36 am Junio C Hamano wrote: Martin Fick mf...@codeaurora.org writes: 3) To create a ref, it must be renamed from the null file (sha ...) to the new value just as if it were being updated from any other value, but there is one extra condition: before renaming the null file, a full directory scan must be done to ensure that the null file is the only file in the directory... While you are scanning this directory to make sure it is empty, The objective is not to scan for an empty dir, but to scan for the existence of only the null file. I am contemplating to create the same ref with a different value. You finished checking but haven't created the null. The scan needs to happen after creating the null, not before, so I don't believe the rest of the scenario below is possible then? I have also scanned, created the null and renamed it to my value. Now you try to create the null, succeed, and then rename. We won't know which of the two non-null values are valid, but worse yet, I think one of them should have failed in the first place. Sounds like we would need some form of locking around here. Is your goal no locks, or less locks? (answered below) I don't know how this new scheme could be made to work with the current scheme,... It is much more important to know if/why yours is better than the current scheme in the first place. The goal is: no locks which do not have a clearly defined reliable recovery procedure. Stale locks without a reliable recovery procedure will lead to stolen locks. At this point it is only a matter of luck whether this leads to data loss or not. So I do hope to convince people first that the current scheme is bad, not that my scheme is better! My scheme was proposed to get people thinking that we may not have to use locks to get reliable updates. Without an analysis on how the new scheme interacts with the packed refs and gives better behaviour, that is kinda difficult. Fair enough. I will attempt this if the basic idea seems at least sane? I do hope that eventually the packed-refs piece and its locking will be reconsidered also; as Michael pointed out it has issues already. So, I am hoping to get people thinking more about lockless approaches to all the pieces. I think I have some solutions to some of the other pieces also, but I don't want to overwhelm the discussion all at once (especially if my first piece is shown to be flawed, or if no one has any interest in eliminating the current locks?) I think transition plans can wait until that is done. If it is not even marginally better, we do not have to worry about transitioning at all. If it is only marginally better, the transition has to be designed to be no impact to the existing repositories. If it is vastly better, we might be able to afford a flag day. OK, makes sense, I jumped the gun a bit, -Martin -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Lockless Refs? (Was [PATCH] refs: do not use cached refs in repack_without_ref)
On Wednesday, December 26, 2012 01:24:39 am Michael Haggerty wrote: ... lots of discussion about ref locking... It concerns me that git uses any locking at all, even for refs since it has the potential to leave around stale locks. For a single user repo this is not a big deal, the lock can always be cleaned up manually (and it is a rare occurrence). However, in a multi user server environment, possibly even from multiple hosts over a shared filesystem such as NFS, stale locks could lead to serious downtime and risky recovery (since it is currently hard to figure out if a lock really is stale). Even though stale locks are probably rare even today in the larger shared repo case, as git scales to even larger shared repositories, this will eventually become more of a problem *1. Naturally, this has me thinking that git should possibly consider moving towards a lockless design for refs in the long term. I realize this is hard and that git needs to support many different filesystems with different semantics. I had an idea I think may be close to a functional lockless design for loose refs (one piece at a time) that I thought I should propose, just to get the ball rolling, even if it is just going to be found to be flawed (I realize that history suggests that such schemes usually are). I hope that it does not make use of any semantics which are not currently expected from git of filesystems. I think it relies only on the ability to rename a file atomically, and the ability to scan the contents of a directory reliably to detect the ordered existence of files. My idea is based on using filenames to store sha1s instead of file contents. To do this, the sha1 one of a ref would be stored in a file in a directory named after the loose ref. I believe this would then make it possible to have lockless atomic ref updates by renaming the file. To more fully illustrate the idea, imagine that any file (except for the null file) in the directory will represent the value of the ref with its name, then the following transitions can represent atomic state changes to a refs value and existence: 1) To update the value from a known value to a new value atomically, simply rename the file to the new value. This operation should only succeed if the file exists and is still named old value before the rename. This should even be faster than today's approach, especially on remote filesystems since it would require only 1 round trip in the success case instead of 3! 2) To delete the ref, simply delete the filename representing the current value of the ref. This ensures that you are deleting the ref from a specific value. I am not sure if git needs to be able to delete refs without knowing their values? If so, this would require reading the value and looping until the delete succeeds, this may be a bit slow for a constantly updated ref, but likely a rare situation (and not likely worse than trying to acquire the ref-lock today). Overall, this again would likely be faster than today's approach. 3) To create a ref, it must be renamed from the null file (sha ...) to the new value just as if it were being updated from any other value, but there is one extra condition: before renaming the null file, a full directory scan must be done to ensure that the null file is the only file in the directory (this condition exists because creating the directory and null file cannot be atomic unless the filesystem supports atomic directory renames, an expectation git does not currently make). 
I am not sure how this compares to today's approach, but including the setup costs (described below), I suspect it is slower. While this outlines the state changes, some additional operations may be needed to setup some starting conditions and to clean things up. But these operations could/should be performed by any process/thread and would not cause any state changes to the ref existence or value. For example, when creating a ref, the ref directory would need to be created and the null file needs to be created. Whenever a null file is detected in the directory at the same time as another file, it should be deleted. Whenever the directory is empty, it may be deleted (perhaps after a grace period to reduce retries during ref creation unless the process just deleted the ref). I don't know how this new scheme could be made to work with the current scheme, it seems like perhaps new git releases could be made to understand both the old and the new, and a config option could be used to tell it which method to write new refs with. Since in this new scheme ref directory names would conflict with old ref filenames, this would likely prevent both schemes from erroneously being used simultaneously (so they shouldn't corrupt each other), except for the fact that refs can be nested in directories which confuses things a bit. I am not sure what a good solution to this is?
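[Editorial illustration: a minimal, hedged sketch of transitions 1 and 2 above, assuming only atomic rename; $GIT_DIR, $ref, $old, and $new are placeholder names.]

  refdir=$GIT_DIR/$ref       # the ref is a directory; its current value is
                             # the name of the single file inside it

  # 1) Update from a known value: one atomic rename, which fails if the
  #    ref no longer has the value $old.
  mv "$refdir/$old" "$refdir/$new" || echo "ref was updated by someone else" >&2

  # 2) Delete from a known value: removing that filename only succeeds if
  #    the ref still has the value $old.
  rm "$refdir/$old" || echo "ref changed or was already deleted" >&2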
git-repack.sh not server/multiuse safe?
I have been reading the git-repack.sh script and I have found a piece that I am concerned with. It looks like after repacking there is a window when packfiles could be temporarily inaccessible, making the objects within temporarily inaccessible. If my evaluation is true, it would seem like git repacking is not server safe? In particular, I am talking about this loop:

  # Ok we have prepared all new packfiles.

  # First see if there are packs of the same name and if so
  # if we can move them out of the way (this can happen if we
  # repacked immediately after packing fully).
  rollback=
  failed=
  for name in $names
  do
          for sfx in pack idx
          do
                  file=pack-$name.$sfx
                  test -f "$PACKDIR/$file" || continue
                  rm -f "$PACKDIR/old-$file" &&
                  mv "$PACKDIR/$file" "$PACKDIR/old-$file" || {
                          failed=t
                          break
                  }
                  rollback="$rollback $file"
          done
          test -z "$failed" || break
  done

It would seem that one way to avoid this (at least on systems supporting hardlinks) would be to instead link the original packfile to old-file first, then move the new packfile into place without ever deleting the original one (from its original name), and only then delete the old-file link. Does that make sense at all? Thanks, -Martin -- Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum
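[Editorial illustration: a hedged sketch of that hardlink variant, not a patch to git-repack.sh. The $PACKTMP source path for the freshly written pack and the exact sequencing are assumptions.]

  for sfx in pack idx
  do
          file=pack-$name.$sfx
          test -f "$PACKDIR/$file" || continue
          ln "$PACKDIR/$file" "$PACKDIR/old-$file" &&  # old name stays valid
          mv -f "$PACKTMP/$file" "$PACKDIR/$file" &&   # rename over it: no gap
          rm -f "$PACKDIR/old-$file"                   # drop the extra link
  done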