Re: [BUG?] gc and impatience
Duy Nguyen pclo...@gmail.com writes:

> I worry less about this. It's not the right model to have two machines modify the same shared repository (gc --auto is only triggered when we think there are new objects) even though I think we support it.

I am a bit hesitant to dismiss with "It's not the right model", as the original case of accessing the repository from two terminals while one clearly is being accessed busily by gc falls into the same category.

> If it's two _scripts_ modifying the same repo, I don't care as this is more about user interaction.

It can very well be two terminals, one on one machine each, both with the same human end-user interaction.

--
To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG?] gc and impatience
Junio C Hamano wrote:

> I am a bit hesitant to dismiss with "It's not the right model", as the original case of accessing the repository from two terminals while one clearly is being accessed busily by gc falls into the same category.

As to why I think it makes sense: garbage collecting unreferenced objects has nothing to do with updating refs, or checking out a worktree. Think about my earlier "make push.default = current resolve HEAD early"; why would the user want to update the ref that is being pushed? She'd most likely want to continue working on another feature on some other branch, and that's perfectly fine.

In long-running runtimes, garbage collection is absolutely essential to performance. Often, stupidly written garbage collectors that stop the world (the execution of the program), compact the memory after collection, and then restart the program can cause the user to throw that runtime out the window (Emacs has a really stupid one, by the way). Most modern runtimes have concurrent garbage collectors that are allocated very fine-grained slots by the scheduler, so the program won't suddenly come to a grinding halt to do garbage collection. The reason it's so hard to do concurrent gc is that there can be races between data modification via variables (main program) and data being moved around in memory for compacting (gc).

Having said all this, the problem is highly simplified in git, because the object store is a const-store. A particular key (sha-1) is guaranteed never to point to the wrong data. Frankly, even if there is concurrent access to the object store, the worst thing that can happen is that the gc didn't collect some dangling objects that were created during the gc run. Unless you have some irrational fear of introducing some unexpected behavior in some convoluted corner case, I really don't see what the problem is.

I'm sure server-side implementations have to do it all the time: GitHub and Gerrit certainly don't say "I'm gc'ing; please pull after 10 mins". Perhaps they're more conservative than the client side about gc (space is cheap), but that's just a sane default.

> It can very well be two terminals, one on one machine each, both with the same human end-user interaction.

Someone has an SSH session between my machine and a submarine in Russia over a slow connection. I remove an ordinary file while she's trying to write to it. When did anyone make any guarantees about no races? What does git gc specifically have to do with this?

For the record, you can easily mess up your worktree by running two different worktree updates (checkout/merge) on two different terminals: nothing forbids it. I don't see how _not_ forbidding gc on two different terminals is better than forbidding it. This is quite an obscure feature for a few super-impatient people, and we haven't even advertised it in any documentation.

Unless you can present an alternative now (patch-form, please), I think you're being irrationally conservative about this.
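To make the const-store argument concrete, here is a minimal sketch of a content-addressed store. The blob_id function follows the way git names a blob (SHA-1 over a "blob <size>" header, a NUL byte, and the content); the ObjectStore class is a hypothetical in-memory toy for illustration and is not how git stores objects on disk.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Compute the SHA-1 object name git gives a blob with this content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """Toy in-memory store keyed by content hash (hypothetical, not git's code)."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        key = blob_id(content)
        # The key is derived from the bytes, so writing the same key twice
        # can only ever store identical bytes; nothing is overwritten.
        self._objects.setdefault(key, content)
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]
```

Because a key is computed from the data, two writers that race to store the same object agree on both the key and the value. The only open question under concurrency is whether an object exists yet, which is why the worst case described above is an uncollected dangling object rather than corrupted data.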
Re: [BUG?] gc and impatience
Martin Fick wrote:

> https://gerrit-review.googlesource.com/#/c/35215/

Very cool. Of what I understood:

So, the problem is that my .git/objects/pack is polluted with little packs every time I fetch (or push, if you're the server), and this is problematic from the perspective of an overly (naively) aggressive gc that hammers out all fragmentation. So, on the first run, the little packfiles I have are all consolidated into big packfiles; you also write .keep files to say "don't gc these big packs we just generated". In subsequent runs, the little packfiles from the fetch are absorbed into a pack that is immune to gc. You're also using a size heuristic to consolidate similarly sized packfiles. You also have a --ratio to tweak the ratio of sizes.

I've checked it in and started using it; so yeah: I'll chew on it for a few weeks.

Thanks.
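The size heuristic described above could look roughly like the following. This is only a sketch: the selection rule, the function name packs_to_consolidate, and the way ratio is applied are assumptions made for illustration, not the logic of the script in the Gerrit change.

```python
def packs_to_consolidate(sizes: list[int], ratio: float) -> list[int]:
    """Pick the smallest packs whose sizes stay within `ratio` of the running total.

    Hypothetical rule: walk the packs from smallest to largest and keep adding
    a pack while it is no larger than `ratio` times the combined size of the
    packs already chosen. Stop at the first pack that is too big, so large,
    already consolidated packs are left alone.
    """
    chosen: list[int] = []
    total = 0
    for size in sorted(sizes):
        if chosen and size > ratio * total:
            break
        chosen.append(size)
        total += size
    # Repacking a single pack into itself gains nothing.
    return chosen if len(chosen) > 1 else []
```

With sizes 1, 1, 2, 5 and 100 and a ratio of 2, this rule would merge the four small packs and leave the pack of size 100 untouched, which is the behaviour the .keep files are meant to protect in the description above.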
Re: [BUG?] gc and impatience
On Monday, August 05, 2013 11:34:24 am Ramkumar Ramachandra wrote:

> Martin Fick wrote:
> > https://gerrit-review.googlesource.com/#/c/35215/
>
> Very cool. Of what I understood:
>
> So, the problem is that my .git/objects/pack is polluted with little packs every time I fetch (or push, if you're the server), and this is problematic from the perspective of an overly (naively) aggressive gc that hammers out all fragmentation. So, on the first run, the little packfiles I have are all consolidated into big packfiles; you also write .keep files to say "don't gc these big packs we just generated". In subsequent runs, the little packfiles from the fetch are absorbed into a pack that is immune to gc. You're also using a size heuristic to consolidate similarly sized packfiles. You also have a --ratio to tweak the ratio of sizes.

Yes, pretty much.

I suspect that a smarter implementation would also do a less thorough job of packing in order to save time. I think this can be done by further limiting many of the lookups to the packs being packed (or some limited set of the larger packfiles). I admit I don't really understand how much the packing does today, but I believe it still looks at the larger packs with keeps to potentially deltify against them, or to determine which objects are duplicated and thus should not be put into the new smaller packfiles. I say this because the time savings of this script are not as significant as I would have expected them to be (but the IO savings are).

I think that it is possible to design a git gc using this rolling approach that would also greatly reduce the time spent packing. However, I don't think that can easily be done in a script like mine, which just wraps itself around git gc. I hope that someone more familiar with git gc than me might take this on some day. :)

> I've checked it in and started using it; so yeah: I'll chew on it for a few weeks.

The script also does some nasty timestamp manipulations that I am not proud of. They had significant time impacts for us, and likely could have been achieved some other way. They shouldn't be relevant to the packing algo though. I hope they don't interfere with the evaluation of the approach.

Thanks for taking an interest in it,

-Martin

--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: [BUG?] gc and impatience
Martin Fick wrote:

> I hope that someone more familiar with git gc than me might take this on some day. :)

More likely scenario: someone who is unfamiliar with it will read and patch it little by little :)
[BUG?] gc and impatience
Hi,

I was pulling in some changes in the morning to find:

  Auto packing the repository for optimum performance. You may also run git gc manually. See git help gc for more information.

Being my usual impatient self, I opened another prompt and started merging changes. After the checkout, it started running another gc (why!?), which I attempted to kill using ^C.

  Counting objects: 449291 x$

It didn't just fail to stop; it kept writing output, making my prompt completely unusable. I finally just killed the pane. Now, it's struggling to update-index and update my cache (read: more waiting).

Why is gc not designed for impatient people, and what needs to be done to change this?

Ram
Re: [BUG?] gc and impatience
On Sat, Aug 3, 2013 at 8:48 AM, Ramkumar Ramachandra artag...@gmail.com wrote:

>   Auto packing the repository for optimum performance. You may also run git gc manually. See git help gc for more information.
>
> Being my usual impatient self, I opened another prompt and started merging changes. After the checkout, it started running another gc (why!?),

Good point. I think that is because gc does not check if gc is already running. Adding such a check should not be too hard. I think gc could save its pid in $GIT_DIR/auto-gc.pid. The next auto-gc checks this file and, if the pid is valid, skips auto-gc.

> Why is gc not designed for impatient people, and what needs to be done to change this?

Some improvements could be made in gc, for example warning users about an upcoming gc so they can run it in the background (of course the above bug should be fixed):

http://thread.gmane.org/gmane.comp.version-control.git/197716/focus=197877

or speeding up repack by implementing pack-objects --merge-pack:

http://thread.gmane.org/gmane.comp.version-control.git/57672/focus=57943

Or you could just make a cron job to gc all repos every week and the problem goes away ;-)

--
Duy
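A minimal sketch of the pid-file check proposed above, written in Python for brevity rather than in git's C. The file name auto-gc.pid comes from the message; the function names are hypothetical, and the liveness probe via signal 0 is an assumption about how "the pid is valid" might be tested. It is POSIX only.

```python
import os
from pathlib import Path


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this pid exists on the local machine."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 delivers nothing; it only checks the target
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def should_skip_auto_gc(git_dir: Path) -> bool:
    """Skip auto-gc when the pid file names a process that is still running."""
    pid_file = git_dir / "auto-gc.pid"
    try:
        pid = int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # no pid file, or unreadable contents: nothing to wait for
    return pid_is_alive(pid)


def record_auto_gc(git_dir: Path) -> None:
    """Write our own pid so that the next auto-gc can find it."""
    (git_dir / "auto-gc.pid").write_text(f"{os.getpid()}\n")
```

Note that checking and then writing are two separate steps here, so two processes starting at the same moment could both decide to run; a real implementation would need to create the file atomically.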
Re: [BUG?] gc and impatience
On Fri, Aug 2, 2013 at 8:53 PM, Duy Nguyen pclo...@gmail.com wrote:

> Good point. I think that is because gc does not check if gc is already running. Adding such a check should not be too hard. I think gc could save its pid in $GIT_DIR/auto-gc.pid. The next auto-gc checks this, if the pid is valid, skip auto-gc.

Defining "valid" is a tricky business, though, as pid can and will wrap around, and the directory could be shared on multiple machines. A pid written by a process on one machine has no relation to any pid on another machine.
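One way to address the shared-directory half of this objection is to store the host name next to the pid and only probe the pid when the host matches. The sketch below assumes that file format ("pid hostname") and these function names purely for illustration; it does not solve pid reuse after wrap-around, and a lock left behind by another machine is treated conservatively as still held. It is POSIX only.

```python
import os
import socket
from pathlib import Path


def write_lock(git_dir: Path) -> None:
    """Record both the pid and the machine it belongs to."""
    (git_dir / "auto-gc.pid").write_text(f"{os.getpid()} {socket.gethostname()}\n")


def lock_is_held(git_dir: Path) -> bool:
    """Return True if another gc should be assumed to be running."""
    try:
        pid_text, host = (git_dir / "auto-gc.pid").read_text().split(maxsplit=1)
        pid = int(pid_text)
    except (FileNotFoundError, ValueError):
        return False  # missing or malformed lock file
    if host.strip() != socket.gethostname():
        # Written on another machine: that pid means nothing here, so we
        # cannot probe it and conservatively assume the lock is held.
        return True
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # existence check only, no signal is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True
    return True
```

A stale lock from a remote machine would block auto-gc indefinitely under this rule, so something like an age limit on the file would probably be needed as well; that part is left out of the sketch.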