To summarize our off-list discussion...

On 25/04/2026 17:12, Leonid Evdokimov wrote:
As I mentioned in the thread start,

My use-case is a simple one: I need a content-defined chunker to put
≈100'000 versions of a ≈500 MiB text file in a git repository to use
the excellent xdelta implementation in the git toolkit.

I have now run an end-to-end test for this use case. I am pleased with
the results, so I want to share my joy and raise two more questions.


CDC results
===========

Storage method  | Size, GiB | Compression
----------------+-----------+---------------------
git-sizer blobs |  12,767.8 | yes, 12 TiB of input
xz --best       |        ~  | x12
BorgBackup      |        ~  | x140
Git, as-is      |      82.0 | x156
Git, split 100M |      91.8 | x139 :-(
Git, split CDC  |      17.1 | x747
Git, CDC, CDmv  |       2.4 | x5354
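
The ratios in the table can be rechecked from the rounded sizes with a
few lines of Python. This is only a sanity check on the published
numbers; the dictionary keys and variable names are mine, not part of
any tool. Recomputing the last row from the rounded 2.4 GiB gives
roughly x5320 rather than x5354, presumably because the table's ratio
was derived from an unrounded size.

```python
# Sizes in GiB, copied from the table; the first figure is the total input.
TOTAL_INPUT_GIB = 12767.8

sizes_gib = {
    "Git, as-is": 82.0,
    "Git, split 100M": 91.8,
    "Git, split CDC": 17.1,
    "Git, CDC, CDmv": 2.4,
}

# Compression ratio of each storage method relative to the total input.
ratios = {name: TOTAL_INPUT_GIB / size for name, size in sizes_gib.items()}
for name, r in ratios.items():
    print(f"{name:<16} x{r:.0f}")

# Step-to-step improvements quoted later in the message.
cdc_vs_split = sizes_gib["Git, split 100M"] / sizes_gib["Git, split CDC"]
cdmv_vs_cdc = sizes_gib["Git, split CDC"] / sizes_gib["Git, CDC, CDmv"]
print(f"CDC vs split -b100M: x{cdc_vs_split:.1f}")
print(f"CDmv vs CDC:         x{cdmv_vs_cdc:.1f}")
```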


git-sizer reports the total size of all unique blobs in the Git repo.
Given the repository structure, this is effectively equal to the total
input size, since it mostly consists of a single large file under
version control.

The `xz --best` compression ratio for the latest version of the file
is x12; it is included here as a rough baseline.

BorgBackup uses LZMA, BUZHash-based CDC, and no delta compression.
LZMA alone provides about x10 compression, while deduplication
improves this to x140.

The x156 reduction demonstrates the effectiveness of Git's xdelta
implementation.

split -b100M was introduced due to GitHub's blob size limitations, which
prevented us from publishing the "as-is" version of the repository.
The -b100M option slightly reduced memory pressure on the xdelta
algorithm during git-gc due to smaller chunks, but compression worsened.

Using split with the CDC patch produces smaller, GitHub-compliant chunks
and localizes chunk boundaries, simplifying xdelta's job. This results
in a x5.4 storage reduction compared to -b100M (x747 from baseline).
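
To illustrate why content-defined boundaries help, here is a toy
sketch in Python. It is not the split(1) patch: it uses a simple
Gear-style rolling hash rather than BUZHash, and every name and
parameter in it (cdc_chunks, fixed_chunks, avg_bits, min_size) is
hypothetical. It inserts five bytes near the start of a random buffer
and counts how many chunks of the edited buffer already exist among the
chunks of the original.

```python
import random

MASK64 = (1 << 64) - 1
rng = random.Random(0)  # fixed seed so the toy example is reproducible
GEAR = [rng.getrandbits(64) for _ in range(256)]  # per-byte random table

def cdc_chunks(data, avg_bits=11, min_size=256):
    """Cut wherever the top avg_bits bits of the rolling hash are zero."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Older bytes shift out of the 64-bit state, so the hash depends
        # only on the most recent 64 bytes of content.
        h = ((h << 1) + GEAR[byte]) & MASK64
        if i + 1 - start >= min_size and h >> (64 - avg_bits) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def fixed_chunks(data, size):
    """Fixed-size splitting, in the spirit of split -b."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def reused(old_chunks, new_chunks):
    """Count chunks of the new version that already exist in the old one."""
    known = set(old_chunks)
    return sum(1 for chunk in new_chunks if chunk in known)

original = bytes(rng.getrandbits(8) for _ in range(200_000))
edited = original[:1000] + b"hello" + original[1000:]

fixed_reused = reused(fixed_chunks(original, 2048), fixed_chunks(edited, 2048))
cdc_reused = reused(cdc_chunks(original), cdc_chunks(edited))
print("fixed-size chunks reused:", fixed_reused)
print("content-defined chunks reused:", cdc_reused)
```

With fixed-size splitting, the insertion shifts every later boundary,
so practically no chunk survives unchanged. With content-defined
cutting, the boundaries resynchronize shortly after the edit, and the
remaining chunks are byte-identical to the old ones.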

"CDmv" is shorthand for "content-defined (re)naming". It's another
Git-specific low-hanging fruit for optimization. CDmv names each chunk
as ${SHA1(CDC_Window)}.${serial} instead of the sequential names
produced by split(1). This helps git-pack enumerate candidates for delta
compression when the number of chunks changes. Git's heuristics
are based on the basename of the file and are described here:
https://git-scm.com/docs/pack-heuristics

CDC+CDmv provides a further x7.1 storage reduction compared to CDC.
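
A minimal sketch of the naming idea in Python, under two assumptions
that the description above does not spell out: that "CDC_Window" means
the trailing bytes of the chunk (the window that triggered the cut),
and that the serial is a counter that disambiguates chunks sharing the
same window hash. The function name cdmv_names and the 64-byte window
are hypothetical, not taken from the patch.

```python
import hashlib
from collections import Counter

WINDOW = 64  # assumed window size in bytes; the real patch may differ

def cdmv_names(chunks, window=WINDOW):
    """Return one name per chunk: <sha1 of trailing window>.<serial>."""
    seen = Counter()
    names = []
    for chunk in chunks:
        digest = hashlib.sha1(chunk[-window:]).hexdigest()
        # Assumption: the serial counts earlier chunks with the same digest.
        names.append(f"{digest}.{seen[digest]}")
        seen[digest] += 1
    return names

# Two versions of a file; the new one gains a chunk in the middle.
old = [b"A" * 100, b"B" * 100, b"C" * 100]
new = [b"A" * 100, b"X" * 100, b"B" * 100, b"C" * 100]

old_names = cdmv_names(old)
new_names = cdmv_names(new)
for name, chunk in zip(new_names, new):
    status = "kept" if name in old_names else "new"
    print(name[:12], chunk[:1], status)
```

With sequential names, inserting a chunk would shift the name of every
later chunk, so the same name would refer to different content across
versions. With names derived from content, the unchanged chunks keep
their names across versions, which is the property the basename-based
heuristics can exploit.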


Questions
=========

Now, I would like to ask the maintainers for their opinions
on two questions:

Question #1: Does it make sense to add the CDmv patch to split(1),
or is this becoming too specific to the Git use case?

My intuition is that it may be too specialized, but I may be mistaken.
After all, I originally approached CDC from a Git-specific angle as well.

Yes, it's probably best not to include CDmv in split,
but it would be worth documenting in info as a possibility.

Question #2: BUZHash seeding via --random-source is somewhat concerning,
as the current implementation does not define an API contract
guaranteeing stability across versions. Does this deserve further
consideration or improvement?

The built-in BUZHash seed is explicitly defined to be stable across
versions, which is probably okay.

I don't know enough to comment at present.

I've rebased the patch stack on top of current master:
https://github.com/coreutils/coreutils/compare/master...darkk:coreutils:cdc

Copyright assignment has stalled,
but will hopefully resume now that new emails have been sent.

thanks,
Padraig
