Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
On Thu, 15 Jun 2017 16:28:24 -0400 Jeff Hostetler wrote:

> I agree with Peff here. I've been working on my partial/narrow/sparse
> clone/fetch ideas since my original RFC and have come to the conclusion
> that the server can do the size limiting efficiently, but we should
> leave the pathname filtering to the client. That is, let the client
> get the commits and trees and then locally apply pattern matching,
> whether that be a sparse-checkout definition or simple ".git*"
> matching, and then make a later request for the "blobs of interest".

This means that the client would need to inflate all the trees it
received, but that might not be that bad.

> Whether we "fault-in" the missing blobs or have a "fetch-blobs"
> command (like fetch-pack) to get them is open to debate.
>
> There are concerns about the size of the requested blob-id list in a
> fetch-blobs approach, but I think there are ways to say I need all
> of the blobs referenced by the directory /foo in commit (and
> optionally, that aren't present in directory /foo in commit
> or in the range ..). (handwave)

Unless the server no longer has the relevant commit (e.g. because a
branch was rewound). But even then, even if we only sent blob hashes,
the list would probably be much smaller than the downloaded pack anyway,
so things might still be OK.
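The client-side filtering step described above (inflate the trees, apply sparse-checkout-style or ".git*" patterns, then request only the matching blobs) could look roughly like this. This is only an illustrative sketch, not code from the patch series; the entry list and pattern semantics are assumptions, using Python's fnmatch as a stand-in for git's real wildmatch rules:

```python
from fnmatch import fnmatch

def blobs_of_interest(entries, patterns):
    """Given (path, blob_id) pairs obtained by inflating the received
    trees, return the blob ids whose path matches any pattern.

    Hypothetical helper: matches either the full path or the basename,
    so a pattern like ".git*" catches .gitattributes anywhere."""
    wanted = []
    for path, blob_id in entries:
        name = path.rsplit("/", 1)[-1]
        if any(fnmatch(path, p) or fnmatch(name, p) for p in patterns):
            wanted.append(blob_id)
    return wanted

# Example: tree entries from a fetched commit (ids are made up).
entries = [
    ("Makefile", "aaaa"),
    ("src/.gitattributes", "bbbb"),
    ("docs/guide.txt", "cccc"),
]
print(blobs_of_interest(entries, [".git*", "src/*"]))  # → ['bbbb']
```

The resulting blob-id list is what a later "fetch-blobs"-style request would carry, which is where the list-size concern above comes in.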
Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
On 6/2/2017 6:26 PM, Jeff King wrote:
> On Fri, Jun 02, 2017 at 12:38:45PM -0700, Jonathan Tan wrote:
> ...
> We have a name-hash cache extension in the bitmap file, but it doesn't
> carry enough information to deduce the .git-ness of a file. I don't
> think it would be too hard to add a "flags" extension, and give a single
> bit to "this is a .git file".
>
> I do also wonder if the two features would need to be separated for a
> GVFS-style solution. If you're not just avoiding large blobs but trying
> to get a narrow clone, you don't want the .git files from the
> uninteresting parts of the tree. You want to get no blobs at all, and
> then fault them in as they become relevant due to user action.
>
> -Peff

I agree with Peff here. I've been working on my partial/narrow/sparse
clone/fetch ideas since my original RFC and have come to the conclusion
that the server can do the size limiting efficiently, but we should
leave the pathname filtering to the client. That is, let the client
get the commits and trees and then locally apply pattern matching,
whether that be a sparse-checkout definition or simple ".git*"
matching, and then make a later request for the "blobs of interest".

Whether we "fault-in" the missing blobs or have a "fetch-blobs"
command (like fetch-pack) to get them is open to debate.

There are concerns about the size of the requested blob-id list in a
fetch-blobs approach, but I think there are ways to say I need all
of the blobs referenced by the directory /foo in commit (and
optionally, that aren't present in directory /foo in commit
or in the range ..). (handwave)

Jeff
Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
On Fri, Jun 02, 2017 at 04:25:08PM -0700, Jonathan Nieder wrote:

> > We have a name-hash cache extension in the bitmap file, but it doesn't
> > carry enough information to deduce the .git-ness of a file. I don't
> > think it would be too hard to add a "flags" extension, and give a single
> > bit to "this is a .git file".
>
> A nicer approach IMHO is to include an extra bitmap, like the existing
> object-type bitmaps (see the dump_bitmap calls in
> bitmap_writer_finish). This would represent the set of all
> .git* blobs in the pack.

Yeah, it could be stored as a bitmap, which would be slightly smaller
(since it would be mostly 0's). I think either way it would need an
extension flag in the header to signal its presence. Older versions of
Git are OK with having flags they don't understand. I know JGit used to
complain about seeing a bitmap with unknown flags, but I'm not sure if
that is still the case.

> > If you're not just avoiding large blobs but trying
> > to get a narrow clone, you don't want the .git files from the
> > uninteresting parts of the tree.
>
> This is something I've wondered about, too. Part of the story is that
> we haven't started omitting trees, so there is already O(number of
> trees) objects being sent, and some additional small blobs for .git*
> specials don't make it much worse. Sending all .git* blobs keeps
> things simple since the server doesn't have to infer which .git* blobs
> are relevant to this client.
>
> Longer term, we will likely want to allow clients to request omission
> of some trees, too. Omitting the corresponding .git* files becomes
> more straightforward at that point.
>
> Does that make sense?

Yeah, I agree we'd want to avoid the trees, too, in that case.

-Peff
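The "extra bitmap, like the existing object-type bitmaps" idea discussed above amounts to one bit per object in pack order, set for blobs that appear somewhere with a ".git"-prefixed name. A minimal uncompressed sketch (git's real bitmaps are EWAH-compressed and this MSB-first bit layout is an assumption, not the on-disk format):

```python
def build_gitfile_bitmap(pack_order, gitfile_blobs):
    """One bit per object in pack order; bit i is set when object i
    is a blob known to appear with a filename starting '.git'.

    pack_order: object ids in pack index order.
    gitfile_blobs: set of blob ids seen under a .git* name while
    walking the to-be-packed trees."""
    bits = bytearray((len(pack_order) + 7) // 8)
    for i, oid in enumerate(pack_order):
        if oid in gitfile_blobs:
            bits[i >> 3] |= 0x80 >> (i & 7)  # assumed MSB-first layout
    return bytes(bits)

# Example: four objects, the last one is a .git* blob.
pack_order = ["c1", "t1", "b_readme", "b_gitattr"]
print(build_gitfile_bitmap(pack_order, {"b_gitattr"}).hex())  # → "10"
```

As noted above, the set is mostly 0's, so a run-length scheme like EWAH would compress it well; the header extension flag signalling its presence is a separate concern.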
Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
Jeff King writes:

> I do also wonder if the two features would need to be separated for a
> GVFS-style solution. If you're not just avoiding large blobs but trying
> to get a narrow clone, you don't want the .git files from the
> uninteresting parts of the tree. You want to get no blobs at all, and
> then fault them in as they become relevant due to user action.

Thanks; I didn't think of this one while reviewing the overall design,
but I agree that this is something important to think about.
Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
Jeff King wrote:
> On Fri, Jun 02, 2017 at 12:38:45PM -0700, Jonathan Tan wrote:
>> +--blob-max-bytes=::
>> +	This option can only be used with --stdout. If specified, a blob
>> +	larger than this will not be packed unless a to-be-packed tree
>> +	has that blob with a filename beginning with ".git". The size
>> +	can be suffixed with "k", "m", or "g", and may be "0".
>> ++
>> +If specified, after printing the packfile, pack-objects will print the
>> +count of excluded blobs (8 bytes). Subsequently, for each excluded blob
>> +in hash order, pack-objects will print its hash (20 bytes) and its size
>> +(8 bytes). All numbers are in network byte order.
>> ++
>> +If pack-objects cannot efficiently determine, for an arbitrary blob,
>> +that no to-be-packed tree has that blob with a filename beginning with
>> +".git" (for example, if bitmaps are used to calculate the objects to be
>> +packed), it will pack all blobs regardless of size.
>
> Hmm. So this feature can't be used with bitmaps at all? That seems like
> a big downside, as we still have to walk the whole graph to come up with
> the list of blobs (even if we're sending them). That's 30-40 seconds of
> CPU per clone on torvalds/linux. I imagine it's much worse on
> repositories big enough to need a full GVFS-style "don't send any blobs"
> approach.
>
> We have a name-hash cache extension in the bitmap file, but it doesn't
> carry enough information to deduce the .git-ness of a file. I don't
> think it would be too hard to add a "flags" extension, and give a single
> bit to "this is a .git file".

A nicer approach IMHO is to include an extra bitmap, like the existing
object-type bitmaps (see the dump_bitmap calls in
bitmap_writer_finish). This would represent the set of all
.git* blobs in the pack.

[...]
> If you're not just avoiding large blobs but trying
> to get a narrow clone, you don't want the .git files from the
> uninteresting parts of the tree.

This is something I've wondered about, too. Part of the story is that
we haven't started omitting trees, so there is already O(number of
trees) objects being sent, and some additional small blobs for .git*
specials don't make it much worse. Sending all .git* blobs keeps
things simple since the server doesn't have to infer which .git* blobs
are relevant to this client.

Longer term, we will likely want to allow clients to request omission
of some trees, too. Omitting the corresponding .git* files becomes
more straightforward at that point.

Does that make sense?

Jonathan
Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
On Fri, Jun 02, 2017 at 12:38:45PM -0700, Jonathan Tan wrote:

> +--blob-max-bytes=::
> +	This option can only be used with --stdout. If specified, a blob
> +	larger than this will not be packed unless a to-be-packed tree
> +	has that blob with a filename beginning with ".git". The size
> +	can be suffixed with "k", "m", or "g", and may be "0".
> ++
> +If specified, after printing the packfile, pack-objects will print the
> +count of excluded blobs (8 bytes). Subsequently, for each excluded blob
> +in hash order, pack-objects will print its hash (20 bytes) and its size
> +(8 bytes). All numbers are in network byte order.
> ++
> +If pack-objects cannot efficiently determine, for an arbitrary blob,
> +that no to-be-packed tree has that blob with a filename beginning with
> +".git" (for example, if bitmaps are used to calculate the objects to be
> +packed), it will pack all blobs regardless of size.

Hmm. So this feature can't be used with bitmaps at all? That seems like
a big downside, as we still have to walk the whole graph to come up with
the list of blobs (even if we're sending them). That's 30-40 seconds of
CPU per clone on torvalds/linux. I imagine it's much worse on
repositories big enough to need a full GVFS-style "don't send any blobs"
approach.

We have a name-hash cache extension in the bitmap file, but it doesn't
carry enough information to deduce the .git-ness of a file. I don't
think it would be too hard to add a "flags" extension, and give a single
bit to "this is a .git file".

I do also wonder if the two features would need to be separated for a
GVFS-style solution. If you're not just avoiding large blobs but trying
to get a narrow clone, you don't want the .git files from the
uninteresting parts of the tree. You want to get no blobs at all, and
then fault them in as they become relevant due to user action.

-Peff
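For concreteness, the excluded-blob trailer documented in the quoted patch text (an 8-byte count, then per blob a 20-byte hash and an 8-byte size, all in network byte order) could be consumed on the client side like this. A hedged sketch only; the function name and the in-memory trailer bytes are illustrative, not part of the patch:

```python
import struct

def parse_excluded_blobs(data):
    """Parse the trailer a --blob-max-bytes pack would be followed by:
    an 8-byte big-endian count, then for each excluded blob a 20-byte
    SHA-1 and an 8-byte big-endian size."""
    (count,) = struct.unpack(">Q", data[:8])
    blobs, off = [], 8
    for _ in range(count):
        sha1 = data[off:off + 20].hex()
        (size,) = struct.unpack(">Q", data[off + 20:off + 28])
        blobs.append((sha1, size))
        off += 28
    return blobs

# Example trailer: one excluded blob of 12345 bytes (hash is made up).
trailer = struct.pack(">Q", 1) + bytes(range(20)) + struct.pack(">Q", 12345)
print(parse_excluded_blobs(trailer))
```

The ">Q" big-endian unpacking matches the "network byte order" wording in the documentation; the fixed 20-byte hash width is SHA-1-specific.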