Hi Ludo,
Thoughts? :-)
Super cool! :)
Your research inspired me to conduct some experiments on
de-duplication.
For two similar packages (emacs-27.1 and emacs-26.3) I was able to
de-duplicate ~12% using EROFS and ERIS. Still far from the ~85%
similarity, but an attempt I'd like to share.
The two main ingredients:
- EROFS (Enhanced Read-Only File-System) is a read-only, compressed
  file-system comparable to SquashFS. It is more suitable than
  SquashFS for this purpose because it aligns content to a fixed
  block size. EROFS has been in the mainline Linux kernel since v5.4.
- ERIS (Encoding for Robust Immutable Storage) is an encoding of
  content into uniformly sized blocks that I've been working on. It
  de-couples the encoding of content from the storage and transport
  layers. Transport layers can be things like IPFS, GNUnet, Named
  Data Networking or just a plain old HTTP service.
I make EROFS images of the packages and encode them with ERIS, which
de-duplicates blocks as part of the encoding process.
With this I manage to de-duplicate between 12% and 17% of the
blocks (depending on some parameters).
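
To illustrate why fixed-size, aligned blocks make this kind of
de-duplication possible, here is a minimal Python sketch. It is not
the ERIS encoding itself and not the tooling used for the numbers
above: the 4 KiB block size, the choice of hash, and the names
split_blocks and dedup_ratio are all assumptions made for
illustration. It only counts how many blocks of several images are
identical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, purely illustrative


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Yield fixed-size blocks of data, zero-padding the last one."""
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        yield block.ljust(block_size, b"\0")


def dedup_ratio(images, block_size: int = BLOCK_SIZE) -> float:
    """Return the fraction of blocks saved by storing each distinct
    block only once across all images (hypothetical helper)."""
    total = 0
    store = {}  # block hash -> block content
    for image in images:
        for block in split_blocks(image, block_size):
            total += 1
            key = hashlib.blake2b(block, digest_size=32).digest()
            store.setdefault(key, block)
    if total == 0:
        return 0.0
    return 1 - len(store) / total


if __name__ == "__main__":
    # Two toy "images" that share their first block.
    image_a = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE
    image_b = b"A" * BLOCK_SIZE + b"C" * BLOCK_SIZE
    print(dedup_ratio([image_a, image_b]))
```

Identical content only yields identical blocks when it sits at the
same offset within a block, which is why a file-system that aligns
content to block boundaries helps here.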
This could allow:
- Directly mounting packages instead of unarchiving (a la distri)
- Peer-to-peer distribution of packages (that's what ERIS is for)
- De-duplicating common content in packages to a certain extent
  (topic of this thread)
A more in-depth write-up:
https://gitlab.com/openengiadina/eris/-/tree/main/examples/dedup-fs
Happy Hacking!
-pukkamustard