Re: [RFC] Design for http-pull on repo with packs
Junio C Hamano wrote:
> Dan Holmsand <[EMAIL PROTECTED]> writes:
> > Repacking all of that to a single pack file gives, somewhat
> > surprisingly, a pack size of 62M (+ 1.3M index). In other words,
> > the cost of getting all those branches, and all of the new stuff
> > from Linus, turns out to be *negative* (probably due to some
> > strange deltification coincidence).
>
> We do _not_ want to optimize for initial slurps into empty
> repositories. Quite the opposite. We want to optimize for allowing
> quick updates of reasonably up-to-date developer repos. If initial
> slurps are _also_ efficient then that is an added bonus; that is
> something the baseline big pack (60M Linus pack) would give us
> already.
>
> So repacking everything into a single pack nightly is _not_ what we
> want to do, even though that would give the maximum compression ;-).
> I know you understand this, but just stating the second of the above
> paragraphs would give casual readers a wrong impression.

I agree, to a point: I think the bonus is quite nice to have... As it
is, it's actually faster on my machine to clone a fresh tree of Linus'
than it is to "git clone" a local tree (without doing the hardlinking
"cheating", that is). And it's kind of nice to have the option to
start completely fresh.

Anyway, my point is this: to make pulling efficient, we should ideally
(1) have as few object files to pull as possible, especially when
using http, and (2) have as few packs as possible, to gain some
compression for those who pull more seldom. Point 1 is obviously the
most important one.

To make this happen, relatively frequent repacking and re-repacking
(even if only on parts of the repository) would be necessary. Or at
least nice to have... Which was why I wanted the "dumb fetch" thingies
to at least do some "relatively smart un/repacking" to avoid
duplication. And, ideally, to avoid downloading entire packs that we
just want the beginning of. That would lessen the cost of repacking,
which I happen to think is a good thing.

Also, it's kind of strange that the ssh/local fetching *always*
unpacks everything, while rsync/http *never* does...

> You are correct. For somebody like Jeff, having the Linus baseline
> pack with one pack of all of his heads (incremental that excludes
> what is already in the Linus baseline pack) would help pullers.

That would work, of course. It, however, means that Linus becomes the
"official repository maintainer" in a way that doesn't feel very
distributed. Perhaps then Linus' packs should be marked "official" in
some way?

> > The big problem, however, comes when Jeff (or anyone else) decides
> > to repack. Then, if you fetch both his repo and Linus', you might
> > end up with several really big pack files, that mostly overlap.
> > That could easily mean storing most objects many times, if you
> > don't do some smart selective un/repacking when fetching.
>
> Indeed.
>
> Overlapping packs is a possibility, but my gut feeling is that it
> would not be too bad, if things are arranged so that packs are
> expanded-and-then-repacked _very_ rarely if ever. Instead, at least
> for your public repository, if you only repack incrementally I think
> you would be OK.

To be exact, you're OK (in the sense of avoiding duplicates) as long
as you always rsync in the "official packs", and coordinate with
others you're merging with, before you do any repacking of your own.
Sure, this works. It just feels a bit "un-distributed" for my personal
taste...

/dan

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
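The "relatively smart un/repacking" Dan asks for above could start as a simple heuristic: explode a fetched pack into loose objects when it is small or mostly duplicates what we already have, and keep it as a pack otherwise. A rough sketch in Python (the function name, thresholds, and data model are all invented for illustration; real code would read object ids out of the pack's .idx file):

```python
def should_unpack(pack_object_ids, local_object_ids,
                  small_pack_limit=100, overlap_threshold=0.5):
    """Decide whether a fetched pack should be exploded into loose
    objects (so duplicates can be dropped) or kept as a pack file.

    Heuristic sketch: small packs, and packs that mostly duplicate
    objects we already have, get unpacked; large, mostly-new packs are
    kept whole.  The thresholds are made up for illustration.
    """
    pack_ids = set(pack_object_ids)
    if not pack_ids:
        return True  # nothing in it; trivially safe to unpack
    overlap = len(pack_ids & set(local_object_ids)) / len(pack_ids)
    return len(pack_ids) <= small_pack_limit or overlap >= overlap_threshold
```

The interesting policy question is only where to put the two thresholds; the mechanism itself is just a set intersection between the pack's index and the local object database.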
Re: [RFC] Design for http-pull on repo with packs
Dan Holmsand <[EMAIL PROTECTED]> writes:

> I did a little experiment. I cloned Linus' current tree, and git
> repacked everything (that's 63M + 3.3M worth of pack files). Then I
> got something like 25 or so of Jeff's branches. That's 6.9M of
> object files, and 1.4M packed. Total size: 70M for the entire
> .git/objects/pack directory.
>
> Repacking all of that to a single pack file gives, somewhat
> surprisingly, a pack size of 62M (+ 1.3M index). In other words, the
> cost of getting all those branches, and all of the new stuff from
> Linus, turns out to be *negative* (probably due to some strange
> deltification coincidence).

We do _not_ want to optimize for initial slurps into empty
repositories. Quite the opposite. We want to optimize for allowing
quick updates of reasonably up-to-date developer repos. If initial
slurps are _also_ efficient then that is an added bonus; that is
something the baseline big pack (60M Linus pack) would give us
already.

So repacking everything into a single pack nightly is _not_ what we
want to do, even though that would give the maximum compression ;-).
I know you understand this, but just stating the second of the above
paragraphs would give casual readers a wrong impression.

> I think that this shows that (at least in this case), having many
> branches isn't particularly wasteful (1.4M in this case with one
> incremental pack).
>
> And that fewer packs beats many packs quite handily.

You are correct. For somebody like Jeff, having the Linus baseline
pack with one pack of all of his heads (incremental that excludes what
is already in the Linus baseline pack) would help pullers.

> The big problem, however, comes when Jeff (or anyone else) decides
> to repack. Then, if you fetch both his repo and Linus', you might
> end up with several really big pack files, that mostly overlap. That
> could easily mean storing most objects many times, if you don't do
> some smart selective un/repacking when fetching.

Indeed.

Overlapping packs is a possibility, but my gut feeling is that it
would not be too bad, if things are arranged so that packs are
expanded-and-then-repacked _very_ rarely if ever. Instead, at least
for your public repository, if you only repack incrementally I think
you would be OK.
Re: [RFC] Design for http-pull on repo with packs
> The big problem, however, comes when Jeff (or anyone else) decides
> to repack. Then, if you fetch both his repo and Linus', you might
> end up with several really big pack files, that mostly overlap. That
> could easily mean storing most objects many times, if you don't do
> some smart selective un/repacking when fetching.

So although it is possible to pack and re-pack at any time, perhaps we
need some guidelines? Maybe Linus should just do a re-pack as each
2.6.x release is made (or perhaps just every 2.6.even release if that
is too often).

It has already been noted offlist that repositories hosted on
kernel.org can just copy pack files from Linus (or, even better,
hardlink them).

-Tony
Re: [RFC] Design for http-pull on repo with packs
Junio C Hamano wrote:
> One very minor problem I have with the Holmsand approach [*1*] is
> that the original Barkalow puller allowed a really dumb http server
> by not requiring directory index at all. For somebody like me with a
> cheap ISP account [*2*], it was great that I did not have to update
> 256 index.html files for objects/??/ directories. Admittedly, it
> would be just one directory object/pack/, but still...

I totally agree that you shouldn't have to do any special kind of
prepping to serve a repository thru http. Which was why I thought it
was a good thing to use the default directory listing of the
web server, assuming that this feature would be available on most
servers... Apparently not yours, though :-( And Cogito already relies
on directory listings (to find tags to download).

But if git-repack-script generates a "pack index file" automagically,
then of course everything is fine.

> On the other hand, picking an optimum set of packs from an
> overlapping set of packs is indeed a very interesting (and hard
> combinatorial) problem to solve. I am hoping that in practice people
> would not force clients to do it with "interesting" sets of packs. I
> would hope them to have just a full pack and incrementals, never
> having overlaps, like Linus plans to do on his kernel repo. On the
> other hand, for somebody like Jeff Garzik with 50 heads, it might
> make some sense to have a handful of different overlapping packs,
> optimized for different sets of people wanting to pull some but not
> all of his heads.

Well, it is an interesting problem... But I don't think that the
solution is to create more pack files. In fact, you'd want as few pack
files as possible, for maximum overall efficiency.

I did a little experiment. I cloned Linus' current tree, and git
repacked everything (that's 63M + 3.3M worth of pack files). Then I
got something like 25 or so of Jeff's branches. That's 6.9M of object
files, and 1.4M packed. Total size: 70M for the entire
.git/objects/pack directory.

Repacking all of that to a single pack file gives, somewhat
surprisingly, a pack size of 62M (+ 1.3M index). In other words, the
cost of getting all those branches, and all of the new stuff from
Linus, turns out to be *negative* (probably due to some strange
deltification coincidence).

I think that this shows that (at least in this case), having many
branches isn't particularly wasteful (1.4M in this case with one
incremental pack). And that fewer packs beats many packs quite
handily.

The big problem, however, comes when Jeff (or anyone else) decides to
repack. Then, if you fetch both his repo and Linus', you might end up
with several really big pack files, that mostly overlap. That could
easily mean storing most objects many times, if you don't do some
smart selective un/repacking when fetching.

/dan
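For the record, the size arithmetic behind the "negative cost" claim above, restated with the reported figures:

```python
# Sizes in MB, as reported in the experiment above.
linus_pack, linus_idx = 63.0, 3.3     # Linus' tree, fully repacked
jeff_incremental = 1.4                # ~25 of Jeff's branches, one pack
single_pack, single_idx = 62.0, 1.3   # everything repacked into one pack

before = linus_pack + linus_idx + jeff_incremental  # two-pack layout
after = single_pack + single_idx                    # one-pack layout

# The repacked whole is smaller than Linus' original pack + index
# alone, so adding Jeff's branches "cost" less than nothing.
assert after < linus_pack + linus_idx
```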
Re: [RFC] Design for http-pull on repo with packs
One very minor problem I have with the Holmsand approach [*1*] is that
the original Barkalow puller allowed a really dumb http server by not
requiring directory index at all. For somebody like me with a cheap
ISP account [*2*], it was great that I did not have to update 256
index.html files for objects/??/ directories. Admittedly, it would be
just one directory object/pack/, but still...

On the other hand, picking an optimum set of packs from an overlapping
set of packs is indeed a very interesting (and hard combinatorial)
problem to solve. I am hoping that in practice people would not force
clients to do it with "interesting" sets of packs. I would hope them
to have just a full pack and incrementals, never having overlaps, like
Linus plans to do on his kernel repo. On the other hand, for somebody
like Jeff Garzik with 50 heads, it might make some sense to have a
handful of different overlapping packs, optimized for different sets
of people wanting to pull some but not all of his heads.

Having said that, even if we want to support such a repository, we
should remember that the server side optimization needs to be done
only once per push to support many pulls by different downstream
clients. Maybe preparing more than a "list of pack file names" to help
clients decide which packs to pull is desirable anyway. Say, "here are
the list of packs. If you want to sync with this and that head, I
would suggest starting by getting this pack."

[Footnotes]

*1* I was about to type Dan's, but both of you are ;-).

*2* Not having a public, rsync-reachable repository gave me a lot of
incentive to think about issues to support small/cheap projects well
;-).
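Junio's "hard combinatorial problem" is essentially minimum set cover, which is NP-hard in general, so a client would realistically settle for the standard greedy approximation. A sketch in Python (the dict-of-sets model of packs is invented for illustration; a real client would build it from downloaded .idx files):

```python
def choose_packs(wanted, packs):
    """Greedy approximation of the pack-selection problem: cover the
    set of wanted object ids with few packs.  `packs` maps pack name
    -> set of object ids it contains.  Returns the chosen pack names
    and whatever objects no pack could supply (to be fetched loose).
    """
    remaining = set(wanted)
    chosen = []
    while remaining and packs:
        # Pick the pack covering the most still-missing objects.
        name, ids = max(packs.items(),
                        key=lambda kv: len(kv[1] & remaining))
        if not ids & remaining:
            break  # no pack helps any more; fall back to loose objects
        chosen.append(name)
        remaining -= ids
    return chosen, remaining
```

Greedy set cover is within a logarithmic factor of optimal, which is likely good enough here: the cost of a suboptimal choice is just some redundant download, not incorrectness.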
Re: [RFC] Design for http-pull on repo with packs
Daniel Barkalow wrote:
> On Sun, 10 Jul 2005, Dan Holmsand wrote:
> > Daniel Barkalow wrote:
> > > If an individual file is not available, figure out what packs
> > > are available:
> > >
> > >    Get the list of pack files the repository has
> > >      (currently, I just use "e3117bbaf6a59cb53c3f6f0d9b17b9433f0e4135")
> > >    For any packs we don't have, get the index files.
> >
> > This part might be slightly expensive, for large repositories. If
> > one assumes that packs are named as by git-repack-script, however,
> > one might cache indexes we've already seen (again, see below). Or,
> > if you go for the mandatory "pack-index-file", require that it has
> > a reliable order, so that you can get the last added index first.
>
> Nothing bad happens if you have index files for pack files you don't
> have, as it turns out; the library ignores them. So we can keep the
> index files around so we can quickly check if they have the objects
> we want. That way, we don't have to worry about skipping something
> now (because it's not needed) and then ignoring it when the branch
> gets merged in.
>
> So what I actually do is make a list of the pack files that aren't
> already downloaded that are available from the server, and download
> the index files for any where the index file isn't downloaded,
> either.

Aah. In other words, you do the caching thing as well.

It seems a little ugly, though, to store the index-only index files
with the rest of the pack. It might be preferable to introduce
something like $GIT_DIR/index-cache or something, so that it can be
easily cleaned (and doesn't follow us around forever when
cloning-by-hardlinking-the-entire-object-directory). You might end up
with quite a large number of index files after a while, though, if you
pull from several repositories that are regularly repacked.

> > > Keep a list of the struct packed_gits for the packs the server
> > > has (these are not used as places to look for objects)
> > >
> > > Each time we need an object, check the list for it. If it is in
> > > there, download the corresponding pack and report success.
> >
> > Here you will need some strategy to deal with packs that overlap
> > with what we've already got. Basically, small and overlapping
> > packs should be unpacked, big and non-overlapping ones saved as is
> > (since git-unpack-objects is painfully slow and memory-hungry...).
>
> I don't think there's an issue to having overlapping packs, either
> with each other or with separate objects. If the user wants, stuff
> can be repacked outside of the pull operation (note, though, that
> the index files should be truncated rather than removed, so that the
> program doesn't fetch them again next time some object can't be
> found easily).

Well, the only issue is obviously waste of space. If you fetch a lot
of branches from independently packed repos, it might mean a lot of
waste, though.

About truncating index files: this seems a bit ugly. You get a file
that doesn't contain what it says it contains, which may cause trouble
if for example the git prune thing is used. You might be better off
with a simple list of index files we know we have all the objects of
(and make sure that git-prune-script deletes this file, since it
possibly breaks the contract).

> > One could also optimize the pack-download bit, by figuring out the
> > last object in the pack that we need (easy enough to do from the
> > index file), and just get the part of the pack file leading up to
> > that object. That could be a huge win for independently packed
> > repositories (I don't do that in my code below, though).
>
> That's only possible if you can figure out what you want to have
> before you get it. My code is walking the reachability graph on the
> client; it can only figure out what other objects it needs after
> it's mapped the pack file.

No, but we can find out which objects we *don't* want (i.e. the ones
we have). And that may be a lot, e.g. if a repository is fully
repacked, or if we track branches on several similar but
independently packed repositories. And as far as I understand
git-pack-objects, it tries to put recent objects in the front.

I don't have any numbers to back this up with, though. Some testing
may be needed, but since the population of packed public repositories
is 1, this is tricky...

> I might use that method for listing the available packs, although
> I'd sort of like to encourage a clean solution first.

Encouraging cleanliness is obviously a good thing :-)

/dan
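Dan's partial-download idea is easy to make concrete: a pack's index alone gives the byte offset of every object, so you can bound the prefix of the pack you actually need. A Python sketch (the index is modelled as a plain dict from object id to offset; real .idx parsing, and the pack's trailing checksum, are ignored here):

```python
def partial_pack_range(index, needed, pack_size):
    """Return how many leading bytes of a pack to download so that it
    contains every object in `needed`.

    `index` maps object id -> byte offset of that object in the pack.
    The end of an object is the offset of the next object, or the end
    of the pack for the last one, so the prefix runs up to the offset
    of the first object *after* the last needed one.
    """
    needed_offsets = [off for oid, off in index.items() if oid in needed]
    if not needed_offsets:
        return 0  # nothing needed from this pack: download nothing
    last = max(needed_offsets)
    later = [off for off in index.values() if off > last]
    return min(later) if later else pack_size
```

If git-pack-objects really does put recent objects near the front, the prefix for an up-to-date puller would usually be a small fraction of the pack.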
Re: [RFC] Design for http-pull on repo with packs
On Sun, 10 Jul 2005, Dan Holmsand wrote:

> Daniel Barkalow wrote:
> > I have a design for using http-pull on a packed repository, and it
> > only requires one extra file in the repository: an append-only
> > list of the pack files (because getting the directory listing is
> > very painful and failure-prone).
>
> A few comments (as I've been tinkering with a way to solve the
> problem myself).
>
> As long as the pack files are named sensibly (i.e. if they are
> created by git-repack-script), it's not very error-prone to just get
> the directory listing, and look for matches for pack-*.idx. It seems
> to work quite well (see below). It isn't beautiful in any way, but
> it works...

I may grab your code for that; the version I just sent seems to be
working except for that.

> > If an individual file is not available, figure out what packs are
> > available:
> >
> >    Get the list of pack files the repository has
> >      (currently, I just use "e3117bbaf6a59cb53c3f6f0d9b17b9433f0e4135")
> >    For any packs we don't have, get the index files.
>
> This part might be slightly expensive, for large repositories. If
> one assumes that packs are named as by git-repack-script, however,
> one might cache indexes we've already seen (again, see below). Or,
> if you go for the mandatory "pack-index-file", require that it has a
> reliable order, so that you can get the last added index first.

Nothing bad happens if you have index files for pack files you don't
have, as it turns out; the library ignores them. So we can keep the
index files around so we can quickly check if they have the objects we
want. That way, we don't have to worry about skipping something now
(because it's not needed) and then ignoring it when the branch gets
merged in.

So what I actually do is make a list of the pack files that aren't
already downloaded that are available from the server, and download
the index files for any where the index file isn't downloaded, either.

> > Keep a list of the struct packed_gits for the packs the server has
> > (these are not used as places to look for objects)
> >
> > Each time we need an object, check the list for it. If it is in
> > there, download the corresponding pack and report success.
>
> Here you will need some strategy to deal with packs that overlap
> with what we've already got. Basically, small and overlapping packs
> should be unpacked, big and non-overlapping ones saved as is (since
> git-unpack-objects is painfully slow and memory-hungry...).

I don't think there's an issue to having overlapping packs, either
with each other or with separate objects. If the user wants, stuff can
be repacked outside of the pull operation (note, though, that the
index files should be truncated rather than removed, so that the
program doesn't fetch them again next time some object can't be found
easily).

> One could also optimize the pack-download bit, by figuring out the
> last object in the pack that we need (easy enough to do from the
> index file), and just get the part of the pack file leading up to
> that object. That could be a huge win for independently packed
> repositories (I don't do that in my code below, though).

That's only possible if you can figure out what you want to have
before you get it. My code is walking the reachability graph on the
client; it can only figure out what other objects it needs after it's
mapped the pack file.

> Anyway, here's my attempt at the same thing. It introduces
> "git-dumb-fetch", with usage like git-fetch-pack (except that it
> works with http and rsync). And it adds some ugliness to
> git-cat-file, for figuring out which objects we already have.

I might use that method for listing the available packs, although I'd
sort of like to encourage a clean solution first.

-Daniel
*This .sig left intentionally blank*
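Daniel's per-object fallback can be restated as a small executable sketch (Python; `remote_indexes` stands in for the list of struct packed_gits built from fetched index files, `download_pack` for the actual transfer, and all names are mine, not from his code):

```python
def fetch_object(oid, have_locally, remote_indexes, download_pack):
    """Try to obtain one object: first loose/local, then by scanning
    the cached remote pack indexes and downloading the first pack that
    contains it.

    `have_locally` is the set of object ids already present;
    `remote_indexes` maps pack name -> set of object ids the remote
    pack holds; `download_pack` performs the transfer.
    """
    if oid in have_locally:
        return True
    for pack_name, ids in remote_indexes.items():
        if oid in ids:
            download_pack(pack_name)   # brings in oid and its packmates
            have_locally.update(ids)   # everything in the pack is now local
            return True
    return False  # not available loose or in any known pack
```

The point Daniel makes about keeping stale index files around falls out naturally: an index for a pack we never download only costs a membership test here.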
Re: [RFC] Design for http-pull on repo with packs
Daniel Barkalow wrote:
> I have a design for using http-pull on a packed repository, and it
> only requires one extra file in the repository: an append-only list
> of the pack files (because getting the directory listing is very
> painful and failure-prone).

A few comments (as I've been tinkering with a way to solve the problem
myself).

As long as the pack files are named sensibly (i.e. if they are created
by git-repack-script), it's not very error-prone to just get the
directory listing, and look for matches for pack-*.idx. It seems to
work quite well (see below). It isn't beautiful in any way, but it
works...

[snip]

> If an individual file is not available, figure out what packs are
> available:
>
>    Get the list of pack files the repository has
>      (currently, I just use "e3117bbaf6a59cb53c3f6f0d9b17b9433f0e4135")
>    For any packs we don't have, get the index files.

This part might be slightly expensive, for large repositories. If one
assumes that packs are named as by git-repack-script, however, one
might cache indexes we've already seen (again, see below). Or, if you
go for the mandatory "pack-index-file", require that it has a reliable
order, so that you can get the last added index first.

> Keep a list of the struct packed_gits for the packs the server has
> (these are not used as places to look for objects)
>
> Each time we need an object, check the list for it. If it is in
> there, download the corresponding pack and report success.

Here you will need some strategy to deal with packs that overlap with
what we've already got. Basically, small and overlapping packs should
be unpacked, big and non-overlapping ones saved as is (since
git-unpack-objects is painfully slow and memory-hungry...).

One could also optimize the pack-download bit, by figuring out the
last object in the pack that we need (easy enough to do from the index
file), and just get the part of the pack file leading up to that
object. That could be a huge win for independently packed repositories
(I don't do that in my code below, though).

Anyway, here's my attempt at the same thing. It introduces
"git-dumb-fetch", with usage like git-fetch-pack (except that it works
with http and rsync). And it adds some ugliness to git-cat-file, for
figuring out which objects we already have.

I'm sort of using the same basic strategy as you, except that I check
the pack files first (I didn't want to mess with http-pull.c, and I
wanted something that would work with rsync as well). The strategy is
this:

o Check if the repository has some pack files we haven't seen already

o If there are new pack files, download indexes, and see if they
  contain anything new. If so, download the pack file and store or
  unpack it. In either case, note that we have seen the pack file in
  question (I've used $GIT_DIR/checked_packs).

o Then

  o if http: do the git-http-pull stuff, and we're done

  o if rsync: get a list of all object files in the repository, and
    download the ones we're still missing.

Feel free to take a look, and use anything that might be useful (if
anything...) I'm not claiming that this method is better than your
way; the only main differences are the caching of seen index files,
and that I download packs first.

My way is faster if the repository contains overlapping object files
and packs. And it doesn't require any new infrastructure. On the other
hand, my method risks fetching too many objects, if a pack file solely
contains stuff from a branch we don't want. And it requires the
git-repack-script naming convention to be used on the remote side.
/dan

diff --git a/cat-file.c b/cat-file.c
--- a/cat-file.c
+++ b/cat-file.c
@@ -11,6 +11,42 @@ int main(int argc, char **argv)
 	char type[20];
 	void *buf;
 	unsigned long size;
+	int obj_count = 0;
+	int missing_count = 0;
+	char line[1000];
+
+	if (argc == 2 && !strcmp("--count", argv[1])) {
+		while (fgets(line, sizeof(line), stdin)) {
+			if (get_sha1(line, sha1))
+				die("invalid id %s", line);
+			if (has_sha1_file(sha1))
+				++obj_count;
+			else
+				++missing_count;
+		}
+		printf("%i %i\n", obj_count, missing_count);
+		return 0;
+	}
+
+	if (argc == 2 && !strcmp("--existing", argv[1])) {
+		while (fgets(line, sizeof(line), stdin)) {
+			if (get_sha1(line, sha1))
+				die("invalid id %s", line);
+			if (has_sha1_file(sha1))
+				printf ("%s", line);
+		}
+		return 0;
+	}
+
+	if (argc == 2 && !strcmp("--missing", argv[1])) {
+		while (fgets(line, sizeof(line), stdin)) {
+			if (get_sha1(line, sha1))
+				die("invalid id %s", line);
+			if (!has_sha1_file(sha1))
+				printf ("%s", line);
+		}
+		return 0;
+	}
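The $GIT_DIR/checked_packs bookkeeping described above is essentially a persistent "seen" set. A Python sketch of the same idea (the one-name-per-line file format is an assumption for illustration, not taken from the patch):

```python
def new_packs(remote_packs, checked_packs_file):
    """Return the remote pack names we have not examined before, and
    record them so later fetches skip them.

    The cache file (an assumed stand-in for $GIT_DIR/checked_packs)
    holds one pack name per line and is append-only, mirroring the
    strategy described in the mail.
    """
    try:
        with open(checked_packs_file) as f:
            seen = set(f.read().split())
    except FileNotFoundError:
        seen = set()  # first fetch: nothing checked yet
    fresh = [p for p in remote_packs if p not in seen]
    with open(checked_packs_file, 'a') as f:
        for p in fresh:
            f.write(p + '\n')
    return fresh
```

Because the file is append-only, interrupted fetches at worst re-examine a pack's index; they never skip one.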
[RFC] Design for http-pull on repo with packs
I have a design for using http-pull on a packed repository, and it
only requires one extra file in the repository: an append-only list of
the pack files (because getting the directory listing is very painful
and failure-prone).

The first thing to note is that fetch() is allowed to get more than
just the requested object. This means that we can get the pack file
with the requested object, and this will fulfill the contract of
fetch(), and, hopefully, be extra-helpful (since we expect the
repository owner to have packed stuff together usefully). So I do
this:

Try to get individual files. So long as this works, everything is as
before.

If an individual file is not available, figure out what packs are
available:

   Get the list of pack files the repository has
     (currently, I just use "e3117bbaf6a59cb53c3f6f0d9b17b9433f0e4135")
   For any packs we don't have, get the index files.

Keep a list of the struct packed_gits for the packs the server has
(these are not used as places to look for objects)

Each time we need an object, check the list for it. If it is in there,
download the corresponding pack and report success.

I've nearly got an implementation ready, except for not having a way
of getting a list of available packs. It seems to work for getting
e3117bbaf6a59cb53c3f6f0d9b17b9433f0e4135 when necessary, although I'm
still debugging the last few things.

-Daniel
*This .sig left intentionally blank*
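The proposed append-only list of pack files is simple to consume on the client side. A Python sketch, assuming (my assumption; the mail does not fix a format) one pack name per line, where repeated appends may leave duplicates that should be dropped while preserving order:

```python
def parse_pack_list(text):
    """Parse the proposed append-only pack list.

    Assumed format: one pack base name per line (e.g. 'pack-e311...').
    Because the file is append-only, later lines are newer and a name
    may appear more than once; keep first occurrences, in order.
    """
    seen, packs = set(), []
    for line in text.splitlines():
        name = line.strip()
        if name and name not in seen:
            seen.add(name)
            packs.append(name)
    return packs
```

An append-only file keeps the server requirement to exactly one dumb-HTTP-fetchable resource, which is the whole point of the design: no directory listing, no CGI, just a static file updated at push/repack time.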