Re: Git for backup storage

2023-10-07 Thread gene heskett

On 10/7/23 01:51, to...@tuxteam.de wrote:

> On Fri, Oct 06, 2023 at 01:44:34PM -0700, Mike Castle wrote:
> > Something I played with recently was
> > https://packages.debian.org/stable/vcs/git-filter-repo
>
> Yes, it does work. My typical use case is when someone has put a
> password in the repo you don't even want to have in the history.
>
> But you aren't going to use that in a Git backup with gigabytes
> of data, believe me :-)
>
> > But you definitely want to run tests on real data before you decide
> > that deleting old data saves you anything, particularly with respect
> > to time.
> >
> > If git is so efficient at storing this kind of data, then what do you
> > expect to gain by deleting old stuff, outside of a smaller log to go
> > through?
>
> The backup idea is a good one for medium amounts of smallish files
> (/etc comes to mind). Once big hunks like videos are involved, things
> get sluggish.
>
> Try doing "time sha1sum foo" where foo is a 1.2G video file to see
> what I mean.
>
> Cheers


I have a now-old Sony Handycam, about halfway between never-twice-the-same-color
and modern hi-def. Its MPEG output is quite compressed, but a 22-minute wedding
is 30 GB of raw video. I had to use Kino and edit pretty heavily to make it fit
on a single-layer DVD. A sha1sum would have taken around 35 minutes on this
machine, but that was nearly 20 years ago, on a 500 MHz K-III CPU.


Cheers, Gene Heskett.
--
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author, 1940)
If we desire respect for the law, we must first make the law respectable.
 - Louis D. Brandeis
Gene's Web page



Re: Git for backup storage

2023-10-06 Thread tomas
On Fri, Oct 06, 2023 at 01:44:34PM -0700, Mike Castle wrote:
> Something I played with recently was
> https://packages.debian.org/stable/vcs/git-filter-repo

Yes, it does work. My typical use case is when someone has put a
password in the repo you don't even want to have in the history.

But you aren't going to use that in a Git backup with gigabytes
of data, believe me :-)
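
For the record, stripping such a file from the whole history looks
roughly like this (hypothetical filename; note that filter-repo insists
on being run in a fresh clone unless you pass --force):

    git filter-repo --invert-paths --path secrets.txt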

> But you definitely want to run tests on real data before you decide
> that deleting old data saves you anything, particularly with respect
> to time.
> 
> If git is so efficient at storing this kind of data, then what do you
> expect to gain by deleting old stuff, outside of a smaller log to go
> through?

The backup idea is a good one for medium amounts of smallish files
(/etc comes to mind). Once big hunks like videos are involved, things
get sluggish.

Try doing "time sha1sum foo" where foo is an 1.2G video file to see
what I mean.

Cheers
-- 
t




Re: Git for backup storage

2023-10-06 Thread Mike Castle
Something I played with recently was
https://packages.debian.org/stable/vcs/git-filter-repo

But you definitely want to run tests on real data before you decide
that deleting old data saves you anything, particularly with respect
to time.

If git is so efficient at storing this kind of data, then what do you
expect to gain by deleting old stuff, outside of a smaller log to go
through?

mrc



Re: Git for backup storage

2023-10-06 Thread Stefan Monnier
>> `git gc` does delete the old data (if it's not reachable any more).
> And it is very expensive.  My point exactly.

It's fairly expensive indeed, but it's usually not a very time-sensitive
operation: it can be delayed to a convenient time, and you can run it
infrequently and as a low-priority background task.

A good reason why you usually don't want to run it frequently is that,
due to the sharing ("deduplication"), there's usually not that much
garbage to collect.

[ IOW, often a thousand backups (of the same machine) don't take up
  much more space than a single backup.  ]
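
One way to arrange that (hypothetical path; assumes the usual
coreutils/util-linux tools) is a monthly cron job at idle priority:

    # m h dom mon dow  command
    0 4 1 * *  ionice -c3 nice -n19 git -C /srv/backups/etc.git gc

ionice -c3 puts it in the idle I/O class and nice -n19 gives it the
lowest CPU priority, so the repack only runs when the machine has
nothing better to do.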

>> BTW, if you want to (ab)use a Git repository to do backups, you should
>> definitely look at `bup`.
> Thanks, it might be exactly what I am looking for.

Bup uses the same format as Git, but has its own implementation for most
operations because the performance of Git is tuned for a very different
use-case.  With Bup it's common to have a repository that is much larger
than 100GB, whereas Git very rarely manages repositories of such size.
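
A minimal sketch of what that looks like, assuming a hypothetical
BUP_DIR and the Debian "bup" package:

    export BUP_DIR=/srv/backups/bup   # where the Git-format repository lives
    bup init                          # create the repository
    bup index /etc                    # scan the files to back up
    bup save -n etc /etc              # store a snapshot on the branch "etc"

Each "bup save" becomes a commit, so successive snapshots deduplicate
against each other just as in plain Git.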


Stefan



Re: Git for backup storage

2023-10-06 Thread Nicolas George
Stefan Monnier (12023-10-06):
> `git gc` does delete the old data (if it's not reachable any more).

And it is very expensive. My point exactly.

> BTW, if you want to (ab)use a Git repository to do backups, you should
> definitely look at `bup`.

Thanks, it might be exactly what I am looking for.

Regards,

-- 
  Nicolas George



Re: Git for backup storage

2023-10-06 Thread Stefan Monnier
> Have you tried? By the very principle of Git, removing or updating old
> data requires rewriting the whole subsequent history. Furthermore, this
> is done by creating a new branch; the original data is not actually
> deleted.

`git gc` does delete the old data (if it's not reachable any more).
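
For example, once the branch that pointed at the old data has been
deleted, something along these lines should actually reclaim the space
(--prune=now skips the usual two-week grace period):

    git reflog expire --expire=now --all
    git gc --prune=now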

BTW, if you want to (ab)use a Git repository to do backups, you should
definitely look at `bup`.


Stefan



Re: Git for backup storage

2023-10-06 Thread Nicolas George
john doe (12023-10-06):
> Please elaborate on why Git is so bad at removing data from a single
> repository?

Have you tried? By the very principle of Git, removing or updating old
data requires rewriting the whole subsequent history. Furthermore, this
is done by creating a new branch; the original data is not actually
deleted.

I asked a question and somehow it has become my responsibility to
explain things…

Regards,

-- 
  Nicolas George



Re: Git for backup storage

2023-10-06 Thread john doe

On 10/6/23 13:26, Nicolas George wrote:

> john doe (12023-10-06):
> > I do not understand why you would want multiple repos; to me this
> > looks like it would fit the bill for a Git branching workflow.
>
> Please elaborate. How do you work around the fact that Git is terrible
> at removing data with a single repository?




Please elaborate on why Git is so bad at removing data from a single
repository?

We clearly do not understand each other!

--
John Doe



Re: Git for backup storage

2023-10-06 Thread Nicolas George
Max Nikulin (12023-10-06):
> I have no idea if it is possible to do it in place, but "git clone" and "git
> fetch" have the --depth option. So you can specify how many last commits you
> would like to have in the cloned repository. Using "git rebase

I know. They only allow keeping the most recent commits, not decimating
them.

> --interactive" it is possible to squash e.g. daily commits into weekly or
> monthly ones. The drawback is that git rebase changes commit hashes.

git rebase is too inefficient for that kind of use.

Regards,

-- 
  Nicolas George



Re: Git for backup storage

2023-10-06 Thread Nicolas George
john doe (12023-10-06):
> I do not understand why you would want multiple repos; to me this looks
> like it would fit the bill for a Git branching workflow.

Please elaborate. How do you work around the fact that Git is terrible
at removing data with a single repository?

Regards,

-- 
  Nicolas George



Re: Git for backup storage

2023-10-06 Thread john doe

On 10/6/23 11:14, Nicolas George wrote:

> Hi.
>
> There is a project I have that requires some scripting, but I am
> wondering if somebody already did something similar and there is a
> package that I can just apt-get install.
>
> The idea is to use Git to store backups of text files that change
> rarely or only a little, because Git is very efficient at compressing
> very similar files in time sequences. That would be used for dumps of
> SQL databases, for example, or for records of hashes of all the files
> on a system.
>
> Unfortunately, Git is very bad at removing old data, which is a problem
> for rotating / decimating the oldest backups. To work around this, I am
> considering using several Git repositories with a spillover system:
>
> - The files are committed into a monthly repository, each repository
>   being created on the fly for the first commit of the month.
>
> - Old monthly repositories can be deleted.
>
> - But before they are deleted, one commit every five days can be
>   extracted and committed into a yearly repository.
>
> - And similarly, one commit per month can be committed into a decennial
>   repository before old yearly repositories are removed.
>
> Of course, the month / year / five-day parameters can be tweaked.
>
> So, does anybody know of existing packages in Debian that could make my
> work easier?
>
> Thanks in advance.
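
A rough sketch of the spillover scheme quoted above (hypothetical
paths, one bare repository per month, scheduling left to cron):

    # Monthly repository, created on the fly for the first backup of the month:
    month_repo=/srv/backups/git/$(date +%Y-%m).git
    [ -d "$month_repo" ] || git init --bare "$month_repo"

    # Commit today's dump through a throwaway working clone:
    work=$(mktemp -d)
    git clone -q "$month_repo" "$work"
    cp /var/backups/db-dump.sql "$work"/              # whatever is being backed up
    git -C "$work" add -A
    git -C "$work" commit -q -m "backup $(date +%F)" || true   # no-op when unchanged
    git -C "$work" push -q origin HEAD
    rm -rf "$work"

Rotating then means deleting the oldest monthly directory, optionally
after replaying a handful of its commits into the yearly repository.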



I do not understand why you would want multiple repos; to me this looks
like it would fit the bill for a Git branching workflow.

--
John Doe



Re: Git for backup storage

2023-10-06 Thread Max Nikulin

On 06/10/2023 16:14, Nicolas George wrote:

> Unfortunately, Git is very bad at removing old data


I have no idea if it is possible to do it in place, but "git clone" and
"git fetch" have the --depth option, so you can specify how many of the
most recent commits you would like to keep in the cloned repository.
Using "git rebase --interactive" it is possible to squash e.g. daily
commits into weekly or monthly ones. The drawback is that git rebase
changes commit hashes.
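
For example, to keep only a recent window of a backup repository
(hypothetical paths; the result is a shallow clone whose history simply
stops at the cut-off):

    git clone --depth 90 file:///srv/backups/etc.git /srv/backups/etc-recent.git

The old repository can then be deleted and the shallow copy kept in its
place. Note that --depth is only honoured over a real transport such as
file://, not for a plain local-path clone.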