Shannon estimated the entropy of written English to be between 0.6 and 1.3 bits per character. The best compression ratio on enwik9 is 0.1072 (compressed size divided by original size), which is 0.86 bpc. I think there is room for improvement. Maybe 0.08 is possible.

cmix is based on PAQ, which uses 24 bytes to store each set of context statistics. I have some ideas to improve this. The Hutter prize is constrained by memory, so cmix uses a larger PPM model because it stores statistics more efficiently using a context tree. The problem is that PPM only works for contiguous contexts. A semantic model requires whole word contexts with gaps.
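The ratio and bpc figures at the start of this message are related by a factor of 8. Here is a minimal sketch of the conversion, assuming 8 bits per input byte and treating each byte as one character (only approximately true for UTF-8 text like enwik9). The function names are illustrative and not taken from any compressor.

```python
# Convert between compression ratio (compressed size / original size)
# and bits per character (bpc).
# Assumption: each input character is one 8-bit byte.
BITS_PER_BYTE = 8


def ratio_to_bpc(ratio):
    """Bits per character implied by a compression ratio."""
    return ratio * BITS_PER_BYTE


def bpc_to_ratio(bpc):
    """Compression ratio implied by a bits-per-character figure."""
    return bpc / BITS_PER_BYTE


if __name__ == "__main__":
    # Ratios mentioned in this thread.
    for ratio in (0.1072, 0.08, 0.05, 0.04):
        print(f"ratio {ratio:.4f} -> {ratio_to_bpc(ratio):.2f} bpc")
    # Shannon's estimated range for written English.
    for bpc in (0.6, 1.3):
        print(f"{bpc:.1f} bpc -> ratio {bpc_to_ratio(bpc):.4f}")
```

Under these assumptions a ratio of 0.08 corresponds to 0.64 bpc, just above the low end of Shannon's range, while 0.04 would be 0.32 bpc, well below it.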
Here is the experiment I did in 2000 on finding word boundaries in text with the spaces removed.
https://cs.fit.edu/~mmahoney/dissertation/lex1.html
The fact that this works at all is why byte pair encoding works.

Also, I looked at the article reordering preprocessor in cmix. I believe it is the same one used in starlit and fast-cmix to win the last 2 Hutter prizes. It does some light preprocessing on enwik9: it decodes XML/HTML markup like &lt; &quot; &amp; &gt; to < " & >, replaces [[article link]] with [article link] and [website link] with [[website link]], and encodes the XML headers, for example replacing timestamps with integers. This reduces enwik9 from 1000 MB to 934 MB. The articles are then reordered according to a list of 250K numbers. I didn't see any code for producing this list, but I looked at the sorted titles and it groups related genres like movies, historical figures, years, and places. The results below don't include the size of this file, about 1 MB compressed.

Baseline enwik9
  178,051,852 enwik9.zpaq -ms7ci1.1.1.1.2am
  181,583,242 enwik9.bsc -b100 -T
  184,909,536 enwik9.pmd -m256 -o16 -r1
  230,135,777 enwik9.7z
  322,592,218 enwik9.zip -9
1,000,000,000 enwik9
2,097,272,625 bytes

My byte pair encoder.
  166,471,095 x.zpaq
  176,420,366 x.bsc
  178,037,227 x.pmd
  214,109,644 x.7z
  298,750,012 x.zip
  657,627,700 x
1,691,416,044 bytes

cmix v20 article reordering.
  162,899,592 ready4cmix.zpaq
  169,660,387 ready4cmix.pmd
  171,141,389 ready4cmix.bsc
  208,166,180 ready4cmix.7z
  296,085,159 ready4cmix.zip
  934,188,796 ready4cmix
1,942,141,503 bytes

cmix + byte pair encoding.
  154,146,050 x.zpaq
  164,829,889 x.pmd
  168,405,583 x.bsc
  197,333,656 x.7z
  277,562,651 x.zip
  651,495,517 x
1,613,773,346 bytes

BWT works poorly on sorted data, which might explain why bsc improved less from the reordering than the other compressors, where proximity is more important.

On Thu, Nov 23, 2023 at 9:36 AM John Rose <johnr...@polyplexic.com> wrote:
>
> A compression ratio of 0.1072 seems like there is plenty of room still.
> What is the max ratio estimate something like 0.08 to 0.04? Though 0.04 might be
> impossibly tight... even at 0.05 the resource consumption has got to
> exponentiate out of control.... unless there are overlooked discoveries yet
> to be made.

--
-- Matt Mahoney, mattmahone...@gmail.com

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/Tdc371ce11a040352-Me070b1be58aefeaa9a41fb3d