Re: File versioning based on shallow Git repositories?
On 12. april 2018 23:07, Rafael Ascensao wrote:
> Would initiating a repo with an empty root commit, tag it with
> 'base', then use
>
>     $ git rebase --onto base master@{30 days ago} master
>
> be viable?

No... my question was confused from the beginning. With such large
files I _shouldn't_ have history (or grafts), otherwise Git spends a
lot of CPU time creating diffs when I look at a commit, or worse, when
I try git log. Which I discovered quickly when trying real data
instead of test data :-)

Ævar's suggestion was exactly right in that respect. Thanks again!

--
Hallvard
Re: File versioning based on shallow Git repositories?
On 12. april 2018 20:47, Ævar Arnfjörð Bjarmason wrote:
> 1. Create a backup.git repo
> 2. Each time you make a backup, checkout a new orphan branch, see
>    "git checkout --orphan"
> 3. You copy the files over, commit them, "git log" at this point
>    shows one commit no matter if you've done this before.
> 4. You create a tag for this backup, e.g. one named after the
>    current time, delete the branch.
> 5. You then have a retention period for the tags, e.g. only keep the
>    last 30 tags if you do daily backups for 30 days of backups. Then
>    as soon as you delete the tags the old commit will be
>    unreferenced, and you can make git-gc delete the data.

Nice! Why the tags though, instead of branches named after the current
time? One --orphan branch/tag per day with several commits would work
for me.

Also maybe it'll be worthwhile to generate .git/info/grafts in a local
clone of the repo to get back easily visible history. No grafts in the
original repo, grafts mess things up.

--
Hallvard
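The five steps above can be sketched as a shell script. The repo
location, tag naming, and the retention count of 30 are my own
assumptions, and the initial empty commit is only there so the demo
repo has a born branch to start from:

```shell
# Sketch of the orphan-branch backup scheme quoted above.
set -e
top=$(mktemp -d)
git init -q "$top/backup.git"            # step 1
cd "$top/backup.git"
git config user.name backup
git config user.email backup@example.invalid
git commit -q --allow-empty -m "repo created"

# One backup run:
echo "payload $(date +%s)" > data.txt    # stand-in for the real files
git checkout -q --orphan "tmp-$$"        # step 2: history-free branch
git add -A
git commit -q -m "backup $(date -u +%Y-%m-%dT%H:%MZ)"   # step 3
git tag "backup-$(date +%s)"             # step 4: tag named after the time
git checkout -q --detach                 # so the temporary branch can go
git branch -q -D "tmp-$$"

# Step 5: retention - drop all but the 30 newest backup tags,
# then let gc delete whatever became unreferenced.
git for-each-ref --sort=-creatordate --format='%(refname:short)' \
    'refs/tags/backup-*' | tail -n +31 | xargs -r git tag -d
git gc --quiet --prune=now
```

Each snapshot's "git log" then shows exactly one commit, and deleting
a tag is all it takes to schedule that backup for deletion.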
File versioning based on shallow Git repositories?
Can I use a shallow Git repo for file versioning, and regularly purge
history older than e.g. 2 weeks? Purged data MUST NOT be recoverable.
Or is there a backup tool based on shallow Git cloning which does
this? Push/pull to another shallow repo would be nice but is not
required. The files are text files up to 1/4 GB, usually with few
changes.

If using Git - I see "git fetch --depth" can shorten history now. How
do I do that without 'fetch', in the origin repo? Also
Documentation/technical/shallow.txt describes some caveats, I'm not
sure how relevant they are.

To purge old data -

    git config core.logallrefupdates false
    git gc --prune=now --aggressive

Anything else? I'm guessing that without --aggressive, some expired
info might be deduced from studying the packing of the remaining
objects. Don't know if we'll be required to be that paranoid.

--
Hallvard
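There is no built-in "shallow in place" operation on the origin side,
so one workaround is to re-root the branch onto a new parentless
commit and then purge. This is a hedged sketch, not an established
recipe; the repo and file names are invented, and the reflog-expire
step is there because gc alone will not drop reflogged objects:

```shell
# Sketch: drop all history in the origin repo by re-rooting the
# branch, then make the old data unrecoverable.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name t
git config user.email t@example.invalid
git config core.logallrefupdates false    # from the mail: no reflogs

echo v1 > big.txt; git add big.txt; git commit -q -m v1
echo v2 > big.txt; git commit -q -a -m v2
old=$(git rev-parse HEAD)                 # kept only to verify the purge

# Re-root: a new parentless commit carrying the current tree.
new=$(git commit-tree -m snapshot "HEAD^{tree}")
git update-ref "$(git symbolic-ref HEAD)" "$new"

# Make the purged commits unrecoverable.
git reflog expire --expire=now --all
git gc --quiet --prune=now --aggressive
```

After this, the branch has a single root commit and the old commit
objects are gone from the object store.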
GC of alternate object store (was: Bringing a bit more sanity to $GIT_DIR/objects/info/alternates?)
Oswald Buddenhagen wrote:
> (...) so the second approach is the bare aggregator repo which adds
> all other repos as remotes, and the other repos link back via
> alternates. problems:
> - to actually share objects, one always needs to push to the
>   aggregator

Run a cron job which frequently does that?

> - tags having a shared namespace doesn't actually work, because the
>   repos have the same tags on different commits (they are
>   independent repos, after all)

Junio's proposal partially fixes that: It pushes refs/* instead of
refs/heads/*, to refs/remotes/<borrowing repo>/.

However, one still cannot safely garbage-collect the aggregator, as
the refs don't include the stashes and the index, so rebasing may
invalidate these more transient objects.

Also if you copy a repo (e.g. making a backup) instead of cloning it,
and then start using both, they'll push into the same namespace -
overwriting each other's refs. Non-fast-forward pushes can thus lose
refs to objects needed by the other repo.
receive.denyNonFastForwards only rejects pushes to refs/heads/ or
something. (A feature, as I learned when I reported it as a bug :-)
IIRC Git has no config option to reject all non-fast-forward pushes.

> i would re-propose hallvard's volatile alternates (at least i think
> that's what he was talking about two weeks ago): they can be used to
> obtain objects, but every object which is in any way referenced from
> the current clone must be available locally (or from a regular
> alternate). that means that diffing, etc. would get objects only
> temporarily, while cherry-picking would actually copy (some of) the
> objects. this would make it possible to cross-link repositories,
> safely and without any 3rd parties.

I'm afraid that idea by itself won't work :-( Either you borrow from a
store or not. If Git uses an object from the volatile store, it can't
always know if the caller needs the object to be copied.
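The aggregator layout under discussion can be sketched concretely. All
paths and names here are invented for the example, and the push is the
one a cron job would run, using Junio's refs/remotes/<repo>/ scheme:

```shell
# Sketch: a bare aggregator repo that a work repo pushes into and
# borrows objects from via alternates.
set -e
top=$(mktemp -d)
git init -q --bare "$top/aggregator.git"
git init -q "$top/work"
cd "$top/work"
git config user.name t
git config user.email t@example.invalid
echo hello > file.txt
git add file.txt
git commit -q -m initial

git remote add aggregator "$top/aggregator.git"
# Push refs/* into a per-repo namespace, not a shared refs/heads/*:
git push -q aggregator 'refs/*:refs/remotes/work/*'
# Link back so this repo can borrow the aggregator's objects:
echo "$top/aggregator.git/objects" > .git/objects/info/alternates
```

Note that nothing here addresses the gc problem discussed above: the
aggregator still knows nothing about stashes or index objects in the
work repos.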
OTOH volatile stores which you do *not* borrow from would be useful:
Let fetch/repack/gc/whatever copy missing objects from there.

2nd attempt for a way to gc the alternate repo: Copy the
to-be-removed objects into each borrowing repo, then gc them. Like
this:

1. gc, but pack all to-be-removed objects into a removable pack.
2. Hardlink/copy the removable pack - with a .keep file - into
   borrowing repos when feasible: I.e. repos you can find and have
   write access to. Update their .git/objects/info/packs. (Is there a
   Git command for this?) Repeat until nothing is left to do, in case
   someone created a new repo during this step.
3. Move the pack from the alternate repo to a backup object store
   which will keep it for a while.
4. Delete the .keep files from step (2). They were needed in case a
   user gc'ed away an object from the pack and then added an identical
   object - borrowed from the to-be-removed pack.
5. gc/repack the other repos at your leisure.
666. Repos you could not update in step (2) can get temporarily
   broken. Their owners must link the pack from the backup store by
   hand, or use that store as a volatile store and then gc/repack.

Loose objects are a problem: If a repo has longer expiry time(s) than
the alternate store, it will get loads of loose objects from all
repos which push into the alternate store. Worse, gc can *unpack*
those objects, consuming a lot of space. See the threads "git gc ==
git garbage-create from removed branch" (3 May) and "Keeping
unreachable objects in a separate pack instead of loose?" (10 Jun).

Presumably the work-arounds are:
- Use long expiry times in the alternate repo. I don't know which
  expiration config settings are relevant how.
- Add some command which checks and warns if the repo has a longer
  expiry time than the repo it borrows from.
Also I hope Git will be changed to instead pack such loose objects
somewhere, as discussed in the above threads.

All in all, this isn't something you'd want to do every day.
But it looks doable and can be scripted.
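Step 2 in particular can be scripted by hand, since no dedicated Git
command seems to exist for it: hard-link the pack and its .idx into
the borrowing repo and drop a .keep file next to them. In this demo
the "removable" pack is simply the repo's only pack, and all the
paths are invented:

```shell
# Sketch of step 2: propagate a pack into a borrowing repo, protected
# by a .keep file so the borrower's gc won't touch it.
set -e
top=$(mktemp -d)
git init -q "$top/alternate"
cd "$top/alternate"
git config user.name t
git config user.email t@example.invalid
echo data > f.txt; git add f.txt; git commit -q -m c1
git repack -q -a -d                      # stand-in for the removable pack

git init -q "$top/borrower"
dest="$top/borrower/.git/objects/pack"
for pack in .git/objects/pack/pack-*.pack; do
  base=${pack%.pack}
  ln "$base.pack" "$base.idx" "$dest/"
  : > "$dest/$(basename "$base").keep"   # protect from the borrower's gc
done
```

For local access the borrower picks the pack up automatically;
updating .git/objects/info/packs only matters for dumb-protocol
serving, as far as I can tell.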
Re: Bringing a bit more sanity to $GIT_DIR/objects/info/alternates?
Junio C Hamano wrote:
> Some ideas:
> - Make clone --reference without -s not to borrow from the
>   reference repository. (...)

Generalize: Introduce volatile alternate object stores. Commands like
(remote) fetch, repack, gc will copy desired objects they see there.

That allows pruneable alternates if people want them: Make every
borrowing repo also borrow from a companion volatile store. To prune
some shared objects: Move them from the alternate to the volatile.
Repack or gc all borrowing repos. Empty the volatile alternate.

Similar to detach from one alternate repo while keeping others: gc
with the to-be-dropped alternate as a volatile.

Also it gives a simple way to try to repair a repo with missing
objects, if you have some other repositories which might have the
objects: Repack with the other repositories as volatile alternates.

BTW, if a wanted object disappears from the volatile alternate while
fetch is running, fetch should get it from the remote after all.

> - Make the distinction between a regular repository and an object
>   store that is meant to be used for object sharing stronger.
>   Perhaps a configuration item core.objectstore = readonly can be
>   introduced, and we forbid clone -s from pointing at a repository
>   without such a configuration. We also forbid object pruning
>   operations such as gc and repack from being run in a repository
>   marked as such.

I hope Michael's append-only/donor idea is feasible instead. In which
case safer gc/repack are needed, like you outline:

> It may be necessary to allow some special kind of repacking of such
> a readonly object store, in order to reduce the number of packfiles
> (and get rid of loose object files); it needs to be implemented
> carefully not to lose any object, regardless of local reachability.

And it needs to be default behavior in such stores, so users won't
need don't-shoot-myself-in-the-foot options.
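The "detach from an alternate" case can already be approximated today
without the proposed volatile stores. A sketch with invented paths;
the point is that repack without -l does not pass --local to
pack-objects, so borrowed objects get copied into local packs first:

```shell
# Sketch: detach a borrowing repo from its alternate by copying
# everything reachable into local packs, then dropping the link.
set -e
top=$(mktemp -d)
git init -q "$top/store"
cd "$top/store"
git config user.name t
git config user.email t@example.invalid
echo shared > f.txt; git add f.txt; git commit -q -m shared

git clone -q -s "$top/store" "$top/user"   # borrows objects via alternates
cd "$top/user"
git repack -q -a -d                        # copy borrowed objects locally
rm -f .git/objects/info/alternates         # now safe to stop borrowing
```

Unlike the volatile-store proposal, this copies *everything*
reachable, not just the objects about to disappear from the alternate.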
> - It might not be a bad idea to have a dedicated new command to
>   help users manage alternates (git alternates?); obviously this
>   will be one of its subcommands, git alternates detach, if we go
>   that route.

git object-store <subcommand> -- manage alternate object stores?

> - Or just an entry in the documentation is sufficient?

Better doc would be useful anyway, and this command gives a place to
put it :-) I had no idea alternates were intended to be read-only, but
that does explain some seeming defects I'd wondered about.

> - When you have two or more repositories that do not share objects,
>   you may want to rearrange things so that they share their objects
>   from a single common object store. There is no direct UI to do
>   this, as far as I know. You can obviously create a new bare
>   repository, push there from all of these repositories, and then
>   borrow from there, e.g.
>
>       git --bare init shared.git
>       for r in a.git b.git c.git ...
>       do
>           ( cd $r
>             git push ../shared.git "refs/*:refs/remotes/$r/*"
>             echo ../../../shared.git/objects >.git/objects/info/alternates
>           )
>       done
>
>   And then repack shared.git once.

...and finally gc the other repositories.

The refs/remotes/$r/ namespace becomes misleading if the user renames
or copies the corresponding Git repository, and then cleverly does
something to the shared repo and the repo (if any) in directory $r. I
suggest refs/remotes/$unique_number/ and noting $unique_number
somewhere in the borrowing repo. If someone insists on being clever,
this may force them to read up on what they're doing first.

Or store no refs, since the shared repo shouldn't lose objects anyway.

If we're sure objects won't be lost: Create a proper remote for the
shared repo. That way the user can push into it once in a while, and
he can configure just which refs should be shared.
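The $unique_number variant could look like this. Everything here is
invented for illustration: the id derivation, and in particular the
config key "sharedstore.id" used to note the number in the borrowing
repo:

```shell
# Sketch: push under a per-repo unique id instead of the directory
# name, and record the id in the borrowing repo's config.
set -e
top=$(mktemp -d)
git init -q --bare "$top/shared.git"
git init -q "$top/a"
cd "$top/a"
git config user.name t
git config user.email t@example.invalid
echo x > f.txt; git add f.txt; git commit -q -m x

id=$(git rev-parse HEAD | cut -c 1-12)   # any sufficiently unique token
git config sharedstore.id "$id"          # note the number in the repo
git push -q "$top/shared.git" "refs/*:refs/remotes/$id/*"
```

Renaming or copying the work repo's directory then no longer collides
with another repo's namespace in shared.git, though a bit-for-bit copy
of the repo would still share the recorded id.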
> Some ideas:
> - (obvious: give a canned command to do the above, perhaps then set
>   the core.objectstore=readonly in the resulting shared.git)

That's getting closer to 'bzr init-repository': One dir with the
shared repo and all borrowing repositories. A simple model which Git
can track and the user need not think further about. This way, git
clone/init of a new repo in this dir can learn to notice and use the
shared repo.

We can also have a command (git object-store?) to maintain the
repository collection, since Git knows where to find them all: Push
from all repos into the shared repo, gc all repos, even prune unused
objects from the shared repo - after implementing sufficient paranoia.

> - When you have one object store and a repository that does not yet
>   borrow from it, you may want to make the repository borrow from
>   the object store. Obviously you can run echo like the sample
>   script in the previous item above, but it is not obvious how to
>   perform the logical next step of shrinking $GIT_DIR/objects of
>   the repository that