Yeah, a looong long time ago I was hoping Zstd compression + dictionaries would solve the problem. I think I had designed some overly complex systems for doing it, though, and so never got around to setting it up.
I did some tests w/ squashfs and got some good results as well. This option appeals to me for its transparency: The Zstd + dictionary approach means special tools for looking at the data, but squashfs would work w/ a standard CLI toolkit. Those results are below.

I'm collecting (heh) up a design spec for this <https://github.com/orgs/cpan-testers/discussions/24> in the CPAN Testers Discussions under a new Proposal category. And then once we isolate this problem, the rest of the problems seem almost trivial ;)

# The count of all reports
cpantesters@cpantesters4:~$ find reports-dir/_meta/timestamp -type f | xargs cat | wc -l
44987

# The total size on-disk (I'm assuming w/ extra tail blocks)
cpantesters@cpantesters4:~$ du -sh reports-dir/
614M    reports-dir/

# LZ4 squashfs
Exportable Squashfs 4.0 filesystem, lz4 compressed, data block size 131072
        compressed data, compressed metadata, compressed fragments,
        compressed xattrs
        duplicates are removed
Filesystem size 127003.50 Kbytes (124.03 Mbytes)
        30.41% of uncompressed filesystem size (417572.73 Kbytes)
Inode table size 889805 bytes (868.95 Kbytes)
        39.78% of uncompressed inode table size (2237004 bytes)
Directory table size 823072 bytes (803.78 Kbytes)
        36.71% of uncompressed directory table size (2242366 bytes)

# XZ squashfs (best compression)
Exportable Squashfs 4.0 filesystem, xz compressed, data block size 131072
        compressed data, compressed metadata, compressed fragments,
        compressed xattrs
        duplicates are removed
Filesystem size 92831.63 Kbytes (90.66 Mbytes)
        22.23% of uncompressed filesystem size (417572.73 Kbytes)
Inode table size 479692 bytes (468.45 Kbytes)
        21.44% of uncompressed inode table size (2237004 bytes)
Directory table size 493816 bytes (482.24 Kbytes)
        22.02% of uncompressed directory table size (2242366 bytes)

# LZO squashfs
Exportable Squashfs 4.0 filesystem, lzo compressed, data block size 131072
        compressed data, compressed metadata, compressed fragments,
        compressed xattrs
        duplicates are removed
Filesystem size 119522.29 Kbytes (116.72 Mbytes)
        28.62% of uncompressed filesystem size (417572.73 Kbytes)
Inode table size 827963 bytes (808.56 Kbytes)
        37.01% of uncompressed inode table size (2237004 bytes)
Directory table size 743654 bytes (726.22 Kbytes)
        33.16% of uncompressed directory table size (2242366 bytes)

# Gzip squashfs
Exportable Squashfs 4.0 filesystem, gzip compressed, data block size 131072
        compressed data, compressed metadata, compressed fragments,
        compressed xattrs
        duplicates are removed
Filesystem size 111798.37 Kbytes (109.18 Mbytes)
        26.77% of uncompressed filesystem size (417572.73 Kbytes)
Inode table size 627493 bytes (612.79 Kbytes)
        28.05% of uncompressed inode table size (2237004 bytes)
Directory table size 581621 bytes (567.99 Kbytes)
        25.94% of uncompressed directory table size

# Zstd squashfs (needed to move to a Debian 12 box to get this)
Exportable Squashfs 4.0 filesystem, zstd compressed
Filesystem size 100603.81 Kbytes (98.25 Mbytes)
        24.09% of uncompressed filesystem size (417572.73 Kbytes)
Inode table size 537209 bytes (524.62 Kbytes)
        24.01% of uncompressed inode table size (2237004 bytes)
Directory table size 490852 bytes (479.35 Kbytes)
        21.89% of uncompressed directory table size (2242366 bytes)

Doug Bell
d...@preaction.me

> On May 5, 2025, at 6:23 PM, Scott Baker <sc...@perturb.org> wrote:
>
> CPAN Testers:
>
> As part of my research into Magpie we came up against a disk space hurdle.
> Currently CPT is ingesting ~25,000 tests per day. After capturing a sampling
> of about 40,000 tests I was able to determine that the average test is 9,129
> bytes of text. If we store uncompressed text that's 223MB per day (81GB per
> year). Clearly that's not very sustainable so we need to look at compression.
>
> gzip -9 = 3198 bytes
> zstd -12 = 3124 bytes
> brotli -9 = 2699 bytes
>
> Brotli is the clear winner for compressing smallish chunks of text. Not
> surprising as that was one of the primary goals when it was designed.
> Compressing with Brotli gets us down to 66MB per day (24GB per year) which is
> more reasonable for sure.
>
> Doing some research I came across Zstandard dictionaries. Zstandard
> dictionaries fit our use case perfectly: compressing many small but very
> similar (json, xml, etc.) files. I dumped the last 50,000 text test results
> from CPT and created a custom 128KB dictionary file. Using that CPT tuned
> dictionary I was able to get the average size on disk of a test result down
> to 1087 bytes (27MB per day or 10GB per year).
>
> As we move forward with reworking the DB side of CPT we should definitely
> consider Zstandard dictionaries. They are well tested, relatively easy to
> use, and well supported
> <https://metacpan.org/pod/Compress::Stream::Zstd::CompressionDictionary> by
> Perl and other tools.
>
> High speed database-grade cloud storage is not cheap. Whatever we can do to
> decrease the amount of raw storage we need the better. Lower storage usage
> means faster replication and quicker backups. Have you ever tried backing up
> 1TB of data in the cloud? Spoiler alert: it's not easy.
>
> -- Scottchiefbaker
>
> P.S. For bonus points what if we re-worked what we store? Do we need to store
> "Thank you for uploading your work to CPAN..." Do we need to store the
> opening boiler plate paragraph?
>
>> From: metabase:user:314402c4-2aae-11df-837a-5e0a49663a4f
>> Subject: NA Random-Simple-0.24 5.10.1 FreeBSD
>> Date: 2025-03-31T17:20:02Z
>>
>> This distribution has been tested as part of the CPAN Testers
>> project, supporting the Perl programming language. See
>> http://wiki.cpantesters.org/ for more information or email
>> questions to cpan-testers-discuss@perl.org
>
> P.P.S. Raw numbers for reference:
>
>> perlmagpie> SELECT avg(octet_length(txt_zstd)), count(guid), grade FROM
>> test_results INNER JOIN test USING (GUID) GROUP BY grade ORDER BY 1 asc
>> LIMIT 30;
>> +-----------------------+-------+---------+
>> | avg                   | count | grade   |
>> |-----------------------+-------+---------|
>> | 837.1807610993657505  | 1892  | NA      |
>> | 862.9752690411719781  | 72015 | PASS    |
>> | 1286.9555979297194225 | 3671  | UNKNOWN |
>> | 1515.2728811352688452 | 15362 | FAIL    |
>> +-----------------------+-------+---------+
>> SELECT 4
>> Time: 0.223s
>>
>
>
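
For anyone who wants to redo the per-day / per-year projections in the quoted message with a different ingest rate or a new compressor, here is a rough Python sketch. The per-report byte counts and the ~25,000 reports/day figure are copied from the thread; the function names, the decimal units (1 MB = 10**6 bytes) and the 365-day year are my own assumptions, so the output lands within a few percent of the quoted figures rather than matching them exactly.

```python
# Back-of-the-envelope storage projection from the per-report sizes in the thread.
# Assumptions (not from the thread): decimal units (1 MB = 10**6 bytes) and a
# 365-day year. The ingest rate and byte counts are copied from the quoted email.

REPORTS_PER_DAY = 25_000

AVG_BYTES_PER_REPORT = {
    "uncompressed": 9129,
    "gzip -9": 3198,
    "zstd -12": 3124,
    "brotli -9": 2699,
    "zstd + 128KB dictionary": 1087,
}


def project(avg_bytes, reports_per_day=REPORTS_PER_DAY):
    """Return (MB per day, GB per year) for a given average report size."""
    bytes_per_day = avg_bytes * reports_per_day
    mb_per_day = bytes_per_day / 1e6
    gb_per_year = bytes_per_day * 365 / 1e9
    return mb_per_day, gb_per_year


def projection_table():
    """Build one row per method: (name, MB/day, GB/year, ratio vs. uncompressed)."""
    raw = AVG_BYTES_PER_REPORT["uncompressed"]
    rows = []
    for name, size in AVG_BYTES_PER_REPORT.items():
        mb_day, gb_year = project(size)
        rows.append((name, mb_day, gb_year, size / raw))
    return rows


if __name__ == "__main__":
    for name, mb_day, gb_year, ratio in projection_table():
        print(f"{name:<26} {mb_day:7.1f} MB/day {gb_year:6.1f} GB/year {ratio:6.1%}")
```

Changing REPORTS_PER_DAY or adding a row to AVG_BYTES_PER_REPORT is enough to compare another option on the same footing.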