Re: File versioning based on shallow Git repositories?

2018-04-12 Thread Hallvard Breien Furuseth

On 12 April 2018 23:07, Rafael Ascensao wrote:

Would initializing a repo with an empty root commit, tagging it 'base', and
then using $ git rebase --onto base master@{30 days ago} master
be viable?


No... my question was confused from the beginning.  With such large files
I _shouldn't_ have history (or grafts), otherwise Git spends a lot of CPU
time creating diffs when I look at a commit, or worse, when I try git log.
Which I discovered quickly when trying real data instead of test-data:-)

Ævar's suggestion was exactly right in that respect.  Thanks again!

--
Hallvard


Re: File versioning based on shallow Git repositories?

2018-04-12 Thread Hallvard Breien Furuseth

On 12 April 2018 20:47, Ævar Arnfjörð Bjarmason wrote:

1. Create a backup.git repo
2. Each time you make a backup, checkout a new orphan branch, see "git
checkout --orphan"
3. You copy the files over and commit them; "git log" at this point shows
one commit, no matter whether you've done this before.
4. You create a tag for this backup, e.g. one named after the current
time, delete the branch.
5. You then have a retention period for the tags, e.g. only keep the
last 30 tags if you do daily backups and want 30 days of retention.

Then as soon as you delete the tags the old commit will be unreferenced,
and you can make git-gc delete the data.


Nice!
Why the tags though, instead of branches named after the current time?

One --orphan branch/tag per day with several commits would work for me.
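
E.g. (untested sketch; backup.git, $SRC, the tag names and the 30-tag
limit are all placeholders):

  # Untested sketch of the procedure above.
  cd backup.git
  git checkout --orphan backup-tmp
  git rm -rfq . 2>/dev/null || true     # drop the previous snapshot, if any
  cp -a "$SRC"/. .                      # $SRC = the directory being backed up
  git add -A
  git commit -m "backup $(date +%F)"
  git tag "backup/$(date +%F)"
  git checkout --detach                 # so the temporary branch can go
  git branch -D backup-tmp

  # Retention: keep only the newest 30 backup tags, let gc drop the rest.
  git for-each-ref --sort=-creatordate --format='%(refname:short)' \
      refs/tags/backup | tail -n +31 | xargs -r git tag -d
  git reflog expire --expire=now --all
  git gc --prune=now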

Also maybe it'll be worthwhile to generate .git/info/grafts in a local
clone of the repo to get back easily visible history.  No grafts in
the original repo, grafts mess things up.
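
E.g. something like this in the clone (the SHA-1s are placeholders;
real grafts lines need the full 40-hex IDs, one "<child> <parent...>"
entry per line):

  # .git/info/grafts in the throwaway clone, stitching daily snapshots
  <sha1 of day-2 backup commit> <sha1 of day-1 backup commit>
  <sha1 of day-3 backup commit> <sha1 of day-2 backup commit>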

--
Hallvard


File versioning based on shallow Git repositories?

2018-04-12 Thread Hallvard Breien Furuseth
Can I use a shallow Git repo for file versioning, and regularly purge
history older than e.g. 2 weeks?  Purged data MUST NOT be recoverable.

Or is there a backup tool based on shallow Git cloning which does this?
Push/pull to another shallow repo would be nice but is not required.
The files are text files up to 1/4 GB, usually with few changes.


If using Git - I see "git fetch --depth" can shorten history now.
How do I do that without 'fetch', in the origin repo?
Also Documentation/technical/shallow.txt describes some caveats; I'm
not sure how relevant they are.

To purge old data -
  git config core.logallrefupdates false
  git gc --prune=now --aggressive
Anything else?
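
Plus expiring the reflogs first, I assume, or gc keeps the old
commits alive:

  # assumption: reflogs must be emptied even with logallrefupdates off
  git reflog expire --expire=now --expire-unreachable=now --all
  git gc --prune=now --aggressive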

I'm guessing that without --aggressive, some expired info might be
deduced from studying the packing of the remaining objects.  Don't
know if we'll be required to be that paranoid.

-- 
Hallvard


GC of alternate object store (was: Bringing a bit more sanity to $GIT_DIR/objects/info/alternates?)

2012-08-28 Thread Hallvard Breien Furuseth
Oswald Buddenhagen wrote:
 (...)so the second approach is the bare aggregator repo which adds
 all other repos as remotes, and the other repos link back via
 alternates. problems:
 
 - to actually share objects, one always needs to push to the aggregator

Run a cron job which frequently does that?
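
E.g. a crontab entry like this (the paths and the 'aggregator' remote
name are hypothetical; the refspec is the one from Junio's proposal
mentioned below):

  # push all refs to the aggregator every 15 minutes
  */15 * * * * cd /path/to/repo && git push --quiet aggregator 'refs/*:refs/remotes/myrepo/*'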

 - tags having a shared namespace doesn't actually work, because the
 repos have the same tags on different commits (they are independent
 repos, after all)

Junio's proposal partially fixes that: It pushes refs/* instead of
refs/heads/*, to refs/remotes/<borrowing repo>/.  However...

 - one still cannot safely garbage-collect the aggregator, as the refs
 don't include the stashes and the index, so rebasing may invalidate
 these more transient objects.

Also if you copy a repo (e.g. making a backup) instead of cloning it,
and then start using both, they'll push into the same namespace -
overwriting each other's refs.  Non-fast-forward pushes can thus lose
refs to objects needed by the other repo.

receive.denyNonFastForwards only rejects pushes to refs/heads/ or
something.  (A feature, as I learned when I reported it as a bug:-)
IIRC Git has no config option to reject all non-fast-forward pushes.

 i would re-propose hallvard's volatile alternates (at least i think that's
 what he was talking about two weeks ago): they can be used to obtain
 objects, but every object which is in any way referenced from the current
 clone must be available locally (or from a regular alternate). that means
 that diffing, etc.  would get objects only temporarily, while cherry-picking
 would actually copy (some of) the objects. this would make it possible to
 cross-link repositories, safely and without any 3rd parties.

I'm afraid that idea by itself won't work:-(  Either you borrow from a
store or not.  If Git uses an object from the volatile store, it can't
always know if the caller needs the object to be copied.

OTOH volatile stores which you do *not* borrow from would be useful:
Let fetch/repack/gc/whatever copy missing objects from there.


2nd attempt at a way to gc the alternate repo:  Copy the pack with the
removed objects into each borrowing repo, then gc them.  Like this:

1. gc, but pack all to-be-removed objects into a removable pack.

2. Hardlink/copy the removable pack - with a .keep file - into
   borrowing repos when feasible, i.e. repos you can find and
   have write access to.  Update their .git/objects/info/packs.
   (Is there a Git command for this?)  Repeat until nothing is
   left to do, in case someone created a new repo during this
   step.  A sketch of this step follows after the list.

3. Move the pack from the alternate repo to a backup object store
   which will keep it for a while.

4. Delete the .keep files from step (2).  They were needed in case
   a user gc'ed away an object from the pack and then added an
   identical object - borrowed from the to-be-removed pack.

5. gc/repack the other repos at your leisure.

666. Repos you could not update in step (2) can get temporarily
   broken.  Their owners must link the pack from the backup store by
   hand, or use that store as a volatile store and then gc/repack.
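
A sketch of step (2) - untested; $ALT, $PACK and $REPOS are
placeholders, and the borrowing repos are assumed bare (for non-bare
ones, substitute $r/.git):

  for r in $REPOS; do
      for ext in idx pack; do
          ln "$ALT/objects/pack/$PACK.$ext" "$r/objects/pack/" 2>/dev/null ||
              cp "$ALT/objects/pack/$PACK.$ext" "$r/objects/pack/"
      done
      : > "$r/objects/pack/$PACK.keep"   # protect the pack from gc
      GIT_DIR=$r git update-server-info  # regenerates objects/info/packs
  done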

Loose objects are a problem:  If a repo has longer expiry time(s)
than the alternate store, it will get loads of loose objects from all
repos which push into the alternate store.  Worse, gc can *unpack*
those objects, consuming a lot of space.  See the threads "git gc ==
git garbage-create from removed branch" (3 May) and "Keeping
unreachable objects in a separate pack instead of loose?" (10 Jun).

Presumably the work-arounds are:
- Use long expiry times in the alternate repo.  I don't know exactly
  which expiration config settings are relevant here.
- Add some command which checks and warns if the repo has a longer
  expiry time than the repo it borrows from (crude sketch below).
Also I hope Git will be changed to instead pack such loose objects
somewhere, as discussed in the above threads.
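
The check could be as crude as comparing the raw config values, e.g.
(untested; $ALT is the alternate repo, and 2.weeks.ago is the
documented gc.pruneExpire default):

  mine=$(git config gc.pruneExpire);                : "${mine:=2.weeks.ago}"
  theirs=$(GIT_DIR=$ALT git config gc.pruneExpire); : "${theirs:=2.weeks.ago}"
  [ "$mine" = "$theirs" ] ||
      echo "warning: gc.pruneExpire here ($mine) != alternate's ($theirs)"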

All in all, this isn't something you'd want to do every day.  But it
looks doable and can be scripted.
--


Re: Bringing a bit more sanity to $GIT_DIR/objects/info/alternates?

2012-08-11 Thread Hallvard Breien Furuseth
Junio C Hamano wrote:
Some ideas:
 
- Make clone --reference without -s not to borrow from the
  reference repository.  (...)

Generalize: Introduce volatile alternate object stores.  Commands like
(remote) fetch, repack, gc will copy desired objects they see there.

That allows pruneable alternates if people want them: Make every
borrowing repo also borrow from a companion volatile store.  To prune
some shared objects:  Move them from the alternate to the volatile.
Repack or gc all borrowing repos.  Empty the volatile alternate.
Similar to detach from one alternate repo while keeping others:
gc with the to-be-dropped alternate as a volatile.
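
Each borrowing repo's alternates file would then have two lines,
something like this (paths hypothetical; how Git would know that the
second store is volatile is the open design question):

  $ cat repo/.git/objects/info/alternates
  /srv/git/shared/objects
  /srv/git/volatile/objects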

Also it gives a simple way to try to repair a repo with missing
objects, if you have some other repositories which might have the
objects: Repack with the other repositories as volatile alternates.
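
Today that last one can be approximated with the
GIT_ALTERNATE_OBJECT_DIRECTORIES environment variable (untested;
donor path hypothetical):

  # repack -a without -l also packs reachable borrowed objects locally
  GIT_ALTERNATE_OBJECT_DIRECTORIES=/path/to/donor/.git/objects \
      git repack -a -d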

BTW, if a wanted object disappears from the volatile alternate while
fetch is running, fetch should get it from the remote after all.

- Make the distinction between a regular repository and an object
  store that is meant to be used for object sharing stronger.
 
  Perhaps a configuration item core.objectstore = readonly can
  be introduced, and we forbid clone -s from pointing at a
  repository without such a configuration.  We also forbid object
  pruning operations such as gc and repack from being run in
  a repository marked as such.

I hope Michael's append-only/donor approach is feasible instead.  In
which case safer gc/repack are needed, like you outline:

  It may be necessary to allow some special kind of repacking of
  such a readonly object store, in order to reduce the number
  of packfiles (and get rid of loose object files); it needs to
  be implemented carefully not to lose any object, regardless of
  local reachability.

And it needs to be default behavior in such stores, so users won't
need don't-shoot-myself-in-the-foot options.

- It might not be a bad idea to have a dedicated new command to
  help users manage alternates (git alternates?); obviously
  this will be one of its subcommand git alternates detach if
  we go that route.

git object-store <subcommand> -- manage alternates / object stores?

- Or just an entry in the documentation is sufficient?

Better doc would be useful anyway, and this command gives a place to
put it:-)  I had no idea alternates were intended to be read-only,
but that does explain some seeming defects I'd wondered about.

  - When you have two or more repositories that do not share objects,
you may want to rearrange things so that they share their objects
from a single common object store.
 
There is no direct UI to do this, as far as I know.  You can
obviously create a new bare repository, push there from all
of these repositories, and then borrow from there, e.g.

   git --bare init shared.git &&
   for r in a.git b.git c.git ...
   do
     (
       cd $r &&
       git push ../shared.git "refs/*:refs/remotes/$r/*" &&
       echo ../../../shared.git/objects >.git/objects/info/alternates
     )
   done
 
And then repack shared.git once.

...and finally gc the other repositories.

The refs/remotes/$r/ namespace becomes misleading if the user renames
or copies the corresponding Git repository, and then cleverly does
something to the shared repo and the repo (if any) in directory $r.

I suggest refs/remotes/$unique_number/ and note $unique_number
somewhere in the borrowing repo.  If someone insists on being clever,
this may force them to read up on what they're doing first.

Or store no refs, since the shared repo shouldn't lose objects anyway.

If we're sure objects won't be lost: Create a proper remote with the
shared repo.  That way the user can push into it once in a while, and
he can configure just which refs should be shared.

 
Some ideas:
 
- (obvious: give a canned command to do the above, perhaps then
  set the core.objectstore=readonly in the resulting shared.git)

That's getting closer to 'bzr init-repository': One dir with the
shared repo and all borrowing repositories.  A simple model which Git
can track and the user need not think further about.

This way, git clone/init of a new repo in this dir can learn to notice
and use the shared repo.

We can also have a command (git object-store?) to maintain the
repository collection, since Git knows where to find them all:
Push from all repos into the shared repo, gc all repos, even prune
unused objects from the shared repo - after implementing sufficient
paranoia.
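
Roughly (untested; $COLLECTION and the shared.git name are
assumptions):

  # untested sketch of what such a maintenance command might do
  cd "$COLLECTION" &&
  for r in *.git; do
      [ "$r" = shared.git ] && continue
      (cd "$r" &&
       git push ../shared.git "refs/*:refs/remotes/$r/*" &&
       git gc)
  done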

  - When you have one object store and a repository that does not yet
borrow from it, you may want to make the repository borrow from
the object store.  Obviously you can run echo like the sample
script in the previous item above, but it is not obvious how to
perform the logical next step of shrinking $GIT_DIR/objects of
the repository that