Ben Peart <peart...@gmail.com> writes:

> My concern with this proposal is the combination of 1) writing a new
> pack file for every git command that ends up bringing down a missing
> object and 2) gc not compressing those pack files into a single pack
> file.

That you noticed these is a sign that you read the outline of the
design correctly, I think.

The basic idea is that the local fsck should tolerate missing objects
when they are known to be obtainable from that external service, but
should still be able to diagnose missing objects when we do not know
whether the external service has them, especially the ones that have
been newly created locally and not yet made available to the service
by pushing them back.

So we need a way to tell if an object that we do not have (but we know
about) can later be obtained from the external service.  Maintaining
an explicit list of such objects obviously is one way, but we can get
the moral equivalent by using pack files.  After receiving a pack file
that has a commit from such an external service, if the commit refers
to a parent commit that we do not have locally, the design proposes
that we consider the missing parent commit to be available at the
external service that gave the pack to us.  The same goes for missing
trees, blobs, and any other objects that are supposed to be
"reachable" from objects in such a packfile.

We can extend the approach to cover loose objects if we want to; just
define an alternate object store used internally for this purpose and
drop loose objects obtained from such an external service in that
object store.

Because we do not want to leave too many loose objects and small
packfiles lying around, we will need a new way of packing these.  Just
enumerate the objects known to have come from the external service (by
being in packfiles marked as such or being loose objects in the
dedicated alternate object store), and create a single larger
packfile, which is marked as "holding the objects that are known to be
in the external service".  We do not have such a mode of gc, and that
is a new development that needs to happen, but we know that is doable.

> That thinking did lead me back to wondering again if we could live
> with a repo specific flag.  If any clone/fetch was "partial" the flag
> is set and fsck ignore missing objects whether they came from a
> "partial" remote or not.

The only reason people run "git fsck" is to make sure that their local
repository is sound and that they can rely on the objects they have as
the base for building new stuff on top.  That is why we are trying to
find a way to make sure "fsck" can be used to detect broken or missing
objects that cannot be obtained from the lazy-object store, without
incurring undue overhead for the normal codepath (i.e. outside fsck).

It is OK to go back to wondering again, but I think that essentially
tosses "git fsck" out of the window and declares that it is OK to hope
that local objects will never go bad.  We can make such a declaration
anytime, but I do not want to see us doing so without first trying to
solve the issue without punting.
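
To make the packfile-based bookkeeping described earlier a bit more
concrete, here is a rough sketch in Python.  It is only an
illustration under stated assumptions, not git's actual code: the
names Pack, from_external, classify_missing and repack_external are
made up, and a "pack" is modelled as nothing more than a mapping from
an object name to the names of the objects it refers to.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Pack:
    # Hypothetical marker: True if this pack was received from the
    # external service (the "marked as such" packfiles in the text).
    from_external: bool
    # Object name -> names of the objects it directly refers to
    # (parents, trees, blobs, ...).
    objects: Dict[str, List[str]]


def classify_missing(packs: List[Pack]) -> Tuple[Set[str], Set[str]]:
    """Return (tolerated, broken) sets of missing object names."""
    present: Dict[str, List[str]] = {}
    promised: Set[str] = set()
    for pack in packs:
        for name, refs in pack.objects.items():
            present[name] = refs
            if pack.from_external:
                # Anything referred to from a marked pack is assumed
                # to be obtainable from the external service.
                promised.update(refs)

    tolerated: Set[str] = set()
    broken: Set[str] = set()
    for refs in present.values():
        for ref in refs:
            if ref in present:
                continue  # not missing at all
            if ref in promised:
                tolerated.add(ref)
            else:
                broken.add(ref)  # fsck should still complain
    return tolerated, broken


def repack_external(packs: List[Pack]) -> List[Pack]:
    """Fold all marked packs into one marked pack; keep the others."""
    merged: Dict[str, List[str]] = {}
    others: List[Pack] = []
    for pack in packs:
        if pack.from_external:
            merged.update(pack.objects)
        else:
            others.append(pack)
    return [Pack(True, merged)] + others if merged else others
```

With one marked pack holding a commit whose parent and blob are
absent, and one local pack holding a commit whose tree is absent, the
first two would be tolerated and the last one reported.  The point of
repack_external is that the merged pack must keep the marker, since
the marker is the only record of where the objects came from.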