Timo Weingärtner <t...@tiwe.de> writes:

> Package: wnpp
> Severity: wishlist
> X-Debbugs-CC: debian-de...@lists.debian.org
>
>    Package name: hadori
>         Version: 0.2
> Upstream Author: Timo Weingärtner <t...@tiwe.de>
>             URL: https://github.com/tiwe-de/hadori
>         License: GPL3+
>     Description: Hardlinks identical files
>  This might look like yet another hardlinking tool, but it is the only one
> which only memorizes one filename per inode. That results in lower memory
>  consumption and faster execution compared to its alternatives. Therefore
>  (and because all the other names are already taken) it's called
>  "HArdlinking DOne RIght".
>  .
>  Advantages over other hardlinking tools:
>   * predictability: arguments are scanned in order, each first version is kept
>   * much lower CPU and memory consumption
>   * hashing option: speedup on many equal-sized, mostly identical files
>
> The initial comparison was with hardlink, which got OOM-killed with a hundred
> backups of my home directory. Last night I compared it to duff and rdfind,
> which would have happily linked files with different st_mtime and st_mode.
>
> I need a sponsor. I'll upload it to mentors.d.n as soon as I get the bug 
> number.
>
>
> Greetings
> Timo

I've been thinking about the problem of memory consumption too, but I've
come to a different solution, one that needs no in-memory state at all:
the filesystem itself holds the bookkeeping.

Instead of remembering inodes, filenames and checksums, create a global
cache (e.g. a directory hierarchy like .cache/<start of hash>/<hash>)
and hardlink every file into it. If you want/need uid, gid, mtime and
mode to match as well, make them part of the .cache path.
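A rough sketch in Python of what I mean (CACHE, file_key and dedupe are
made-up names, and the choice of SHA-256 and a two-character fan-out is
arbitrary):

    import hashlib
    import os

    CACHE = ".cache"    # made-up cache root

    def file_key(path):
        # Key = content hash, with the metadata that has to match
        # (uid, gid, mtime, mode) folded in up front.
        st = os.stat(path)
        h = hashlib.sha256()
        h.update(f"{st.st_uid}:{st.st_gid}:{int(st.st_mtime)}:{st.st_mode}".encode())
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(65536), b""):
                h.update(block)
        return h.hexdigest()

    def dedupe(path):
        key = file_key(path)
        bucket = os.path.join(CACHE, key[:2])    # .cache/<start of hash>
        os.makedirs(bucket, exist_ok=True)
        cached = os.path.join(bucket, key)       # .cache/<start of hash>/<hash>
        if os.path.exists(cached):
            os.unlink(path)                      # replace with a link to the cache
            os.link(cached, path)
        else:
            os.link(path, cached)                # first occurrence seeds the cache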

Garbage collection in the cache would simply mean removing all files
with a link count of 1.
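Which is pleasantly short; a sketch, using the same made-up cache layout
as above:

    import os

    def gc(cache=".cache"):
        # A cache entry with st_nlink == 1 is referenced by nothing
        # outside the cache any more.
        for dirpath, _dirs, files in os.walk(cache):
            for name in files:
                path = os.path.join(dirpath, name)
                if os.stat(path).st_nlink == 1:
                    os.unlink(path)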

Going one step further, link files with a unique size [uid, gid, mtime,
...] to .cache/<size> and change that into .cache/<size>/<start of
hash>/<hash> only when you find a second file with the same size that
isn't identical. That would save the expensive hashing of clearly unique
files.
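The promotion step could look like this (again only a sketch: file_key
is the helper from above, and for brevity I'm keying the first level on
size alone rather than folding uid, gid and mtime into the path):

    import filecmp
    import os

    def dedupe_by_size(path, cache=".cache"):
        size_entry = os.path.join(cache, str(os.stat(path).st_size))
        if not os.path.exists(size_entry):
            os.link(path, size_entry)    # unique size so far: no hashing at all
            return
        if os.path.isfile(size_entry):
            if filecmp.cmp(size_entry, path, shallow=False):
                os.unlink(path)          # identical to the lone entry: just link
                os.link(size_entry, path)
                return
            # Second, non-identical file of this size: promote the entry
            # to .cache/<size>/<start of hash>/<hash>.
            tmp = size_entry + ".tmp"
            os.rename(size_entry, tmp)
            os.mkdir(size_entry)
            insert_hashed(tmp, size_entry, move=True)
        insert_hashed(path, size_entry, move=False)

    def insert_hashed(path, bucket, move):
        key = file_key(path)             # helper from the first sketch
        subdir = os.path.join(bucket, key[:2])
        os.makedirs(subdir, exist_ok=True)
        cached = os.path.join(subdir, key)
        if move:
            os.rename(path, cached)
        elif os.path.exists(cached):
            os.unlink(path)
            os.link(cached, path)
        else:
            os.link(path, cached)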

You could also use a hash that computes the first byte from the first
4 KiB, the second byte from 64 KiB, the third from 1 MiB and so on. That
way you can check whether the beginnings of two files match without
having to checksum the whole file or compare the two byte by byte.
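For example (hypothetical again: I'm using a running SHA-256, a factor
of 16 per level, and relying on the size bucketing above so the compared
files always have the same length):

    import hashlib

    def prefix_digest(path, levels=5):
        # One output byte per prefix: 4 KiB, 64 KiB, 1 MiB, ... (x16 each step).
        out = bytearray()
        h = hashlib.sha256()
        size, done = 4096, 0
        with open(path, "rb") as f:
            for _ in range(levels):
                h.update(f.read(size - done))
                out.append(h.digest()[0])    # first byte of the running hash
                done, size = size, size * 16
        return bytes(out)

    def diverge_level(a, b, levels=5):
        # First prefix level at which two equal-sized files differ, or None.
        for i, (x, y) in enumerate(zip(prefix_digest(a, levels),
                                       prefix_digest(b, levels))):
            if x != y:
                return i
        return None

In a real lookup you would store the digest next to the cache entry and
compute the candidate's bytes lazily, stopping at the first mismatch, so
telling two different files apart usually costs only a 4 KiB read.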

Regards
        Goswin


