hw wrote:
> > As you say, deduplication in backup systems is quite common, and works
> > pretty well. There's also an on-disk non-filesystem utility, rdfind,
> > which is packaged in Debian. It can discover identical files and make
> > them hardlinks.
>
> Well, if I had all the disk space to hold 2 full copies of the data to be able
> to deduplicate it only later, I wouldn't need to deduplicate anything.

Only two copies? That's not a good use case for any of the deduplicators.
The point of rdfind is to use it in a cron job while some process is
generating duplicate files. For example, a backup process that copies a
filesystem every six hours will generate four identical copies of almost
every file each day. (rsnapshot would do a better job, here.)

> And how would pretending there are two backups while there's actually only one
> because it got deduplicated be better than having only one backup to begin
> with?
> (Yeah I haven't thought of that before ...)

It's not two backups, it's two very similar backups taken at different
times, so the majority of the files are the same but some are different.
If you want a second backup, it needs to go on a different machine,
preferably in a different location.

Maybe you should tell us what your actual use case is rather than asking
about realtime deduplication? It could be that there's a completely
different solution which would make you happy.

-dsr-
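
P.S. For anyone wondering what "discover identical files and make them hardlinks" amounts to, here is a rough Python sketch of the general idea. It is not rdfind's actual implementation; the function names (file_digest, hardlink_duplicates) and the ".dedup-tmp" suffix are made up for illustration. It assumes everything under the given directory lives on one filesystem, and it ignores ownership, permissions and timestamps, which a real tool has to think about. Also remember that once files are hardlinked, writing to one of them in place changes all of them.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root):
    """Replace duplicate regular files under root with hardlinks.

    Returns the number of files that were replaced by a link.
    Hypothetical helper for illustration only.
    """
    # Group candidates by size first, so unique sizes are never hashed.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            by_size[os.path.getsize(path)].append(path)

    replaced = 0
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        first_seen = {}  # digest -> first path seen with that content
        for path in sorted(paths):
            keeper = first_seen.setdefault(file_digest(path), path)
            if keeper == path or os.path.samefile(keeper, path):
                continue  # the keeper itself, or already linked to it
            # Link under a temporary name, then rename over the duplicate,
            # so the path never disappears if something fails halfway.
            tmp = path + ".dedup-tmp"
            os.link(keeper, tmp)
            os.replace(tmp, path)
            replaced += 1
    return replaced
```

Run from cron against a directory of periodic copies, something like this reclaims the space of files that did not change between runs. It still needs the space for each new copy until the next pass, which is why a tool that hardlinks unchanged files at copy time is usually the better fit.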