On 11/7/22 19:49, hw wrote:
On Mon, 2022-11-07 at 11:32 +0100, didier gaumet wrote:
At the (Linux) filesystem level, I think in-line deduplication is only
provided by ZFS (and perhaps, out of tree, Btrfs)
That's what it seems like, except for VDO. Unfortunately, ZFS deduplication is
said to need 5-6 GB of RAM for each 1 TB of data, and that would require
upgrading my server.
On my ZFS storage and backup servers, ZFS seems to grab the majority of
available memory. I have been unable to figure out a way to measure
memory consumed by deduplication.
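A rough way to reason about that memory is to estimate the in-core dedup
table (DDT) size. The sketch below uses the commonly cited figure of roughly
320 bytes per DDT entry and assumes a 128 KiB recordsize; both numbers are
estimates, not exact values ('zdb -DD <pool>' reports the real DDT
statistics for a pool):

```shell
# Rough in-core DDT footprint for 1 TiB of unique data stored as
# 128 KiB records, assuming ~320 bytes per DDT entry (an estimate).
entries=$(( 1024 * 1024 * 1024 / 128 ))         # 1 TiB / 128 KiB = 8388608 entries
echo "$(( entries * 320 / 1024 / 1024 )) MiB"   # core DDT alone, before any overhead
# → 2560 MiB
```

Smaller records multiply the entry count accordingly, which is presumably
where rules of thumb like 5-6 GB per 1 TB come from.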
When I want to have 2 (or more) generations of backups, do I actually want
deduplication? It leaves me with only one actual copy of the data, which
seems to defeat the idea of having multiple generations of backups, at least
to some extent.
The question then is whether it makes a difference. It also raises the
question of whether I need (or want) multiple generations of backups,
especially when I end up with only one copy anyway. Hmm ...
I put rsync-based backups on ZFS storage with compression and
de-duplication. du(1) reports 33 GiB for the current backups (i.e. the
uncompressed and/or duplicated size). zfs-auto-snapshot takes snapshots
of the backup filesystems daily and monthly, and I take snapshots
manually every week. I have 78 snapshots going back ~6 months. du(1)
reports ~3.5 TiB for the snapshots. 'zfs list' reports 86.2 GiB of
actual disk usage for all 79 backups. So, ZFS de-duplication and
compression give me about 41:1 leverage on my backup storage.
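Those numbers check out with a bit of integer arithmetic (units converted to
GiB, figures taken straight from the output above):

```shell
# Sanity check: ~3.5 TiB of logical snapshot data vs 86.2 GiB actually used.
logical_gib=3584      # 3.5 TiB * 1024
used_gib_x10=862      # 86.2 GiB * 10, to stay in integer math
echo "$(( logical_gib * 10 / used_gib_x10 )):1"   # → 41:1
```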
ZFS compression and de-duplication also work well for jails/VMs.
For general data, I use compression alone.
For compressed and/or encrypted archives, images, etc., I use neither
compression nor de-duplication.
The key is to only use de-duplication when there is a lot of duplication.
And, to a lesser extent, to only use compression on uncompressed data
(lz4 detects incompressible data and aborts early rather than trying to
compress it further).
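That policy maps onto per-dataset ZFS properties. A sketch, with purely
hypothetical pool and dataset names:

```shell
# Per-dataset policy: dedup only where duplication is high, compression
# only where data is not already compressed. Names are placeholders.
zfs set compression=lz4 tank/general       # lz4 aborts early on incompressible data
zfs set dedup=on        tank/backups       # highly duplicated rsync backup trees
zfs set compression=off tank/archives      # already-compressed/encrypted data
zfs set dedup=off       tank/archives
```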
My ZFS pools are built with HDDs. I recently added an SSD-based vdev
as a dedicated 'dedup' device, and write performance improved
significantly when receiving replication streams.
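For reference, adding such a device uses the dedup allocation class
introduced in OpenZFS 0.8; the pool and device names below are
placeholders:

```shell
# Sketch: add a mirrored pair of SSDs as a dedicated dedup vdev,
# then inspect the dedup table summary.
zpool add tank dedup mirror /dev/nvme0n1 /dev/nvme1n1
zpool status -D tank
```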
David