I didn't implement word models with gaps in zpaq, although I did implement byte models with gaps, which doesn't help much on text. Here are some simple models:
2024pre-processed enwik5.txt: 54,804 bytes (uncompressed size)

zpaq a "" "2024pre-processed enwik5.txt" -ms4.0c    -> 40133 (order 0 ICM)
zpaq a "" "2024pre-processed enwik5.txt" -ms4.0c255 -> 41691 (order 0 CM, max count 255x4)
zpaq a "" "2024pre-processed enwik5.txt" -ms4.0c8   -> 40435 (order 0 CM, max count 8x4)

An ICM maps a context to a bit history (an 8 bit state) and then to a probability table. It is updated to reduce the prediction error by 0.1%. A CM maps a context directly to a prediction and then updates the prediction to reduce the error by 1/count, where the maximum count can be 4 to 1020. Higher count limits work better for stationary data like text, but smaller limits adapt faster to changing statistics. An ICM usually works well in both cases.

The options have the following meanings:

-m   method.
-s   streaming format (no dedupe or journaling).
4    selects a block size of 2^4 MB. If the input is bigger than the block size, then the blocks are compressed independently in separate threads.
0    means no preprocessing. You can select LZ77, BWT, or E8E9 filters, but these won't help for maximum compression.
c0   selects an ICM. c8 or c255 selects a CM with a maximum count of 32 or 1020.

Here are some higher order single contexts:

-ms4.0c0.0.255         -> 28191 (order 1)
-ms4.0c0.0.255.255     -> 29463 (order 2)
-ms4.0c0.0.255.255.255 -> 34505 (order 3)

c0.0.255 selects an ICM (0), no special contexts (0), and a bit mask on the past whole byte contexts (255 = 0xFF, all bits). There is an implied context of the previous bits of the current byte which is included in the hash. On larger files, higher order contexts become more important. Here are some results on enwik8 (3 to 5 seconds in all cases).
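The CM update rule described above (prediction moves toward each observed bit by error/count, with a count limit) can be sketched as a toy bit predictor. This is my own illustration, not zpaq's actual code; the class and names are hypothetical:

```python
# Toy direct context model (CM): one prediction per context, updated
# toward each observed bit by (bit - p) / count, with a count limit.
# A small limit adapts fast; a large limit suits stationary data.
class ToyCM:
    def __init__(self, limit=255 * 4):
        self.limit = limit          # maximum update count
        self.table = {}             # context -> (prediction, count)

    def predict(self, cx):
        p, _ = self.table.get(cx, (0.5, 0))
        return p

    def update(self, cx, bit):
        p, n = self.table.get(cx, (0.5, 0))
        n = min(n + 1, self.limit)  # count saturates at the limit
        self.table[cx] = (p + (bit - p) / n, n)

cm = ToyCM(limit=32)                # roughly like zpaq's c8 (8 x 4)
for b in [1, 1, 1, 0, 1, 1]:
    cm.update(0, b)
print(cm.predict(0))                # climbs toward 1 on mostly-1 bits
```

Once the count reaches the limit, the model keeps a fixed learning rate of 1/limit, which is why a larger limit behaves more like a stationary frequency estimate.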
zpaq a "" "enwik8" -ms4.0c0.0.255 -> 46373408 (order 1)
-ms4.0c0.0.255.255             -> 36189176 (order 2)
-ms4.0c0.0.255.255.255         -> 30289924 (order 3)
-ms4.0c0.0.255.255.255.255     -> 29041348 (order 4)
-ms4.0c0.0.255.255.255.255.255 -> 32642131 (order 5)

These are fast because the input is divided into 16 MB blocks which are compressed in separate threads. In a single thread (option -t1), the order 4 compressor takes 12 seconds, but this allows a larger block size to compress better:

zpaq a "" "enwik8" -ms7.0c0.0.255.255.255.255 -> 26741077 (15 sec)

Most of the compression comes from an ICM-ISSE chain:

-ms7.0ci1.1.1.1     -> 24540667 (29 sec)
-ms7.0ci1.1.1.1.2   -> 21744079 (40 sec)
-ms7.0ci1.1.1.1.2am -> 20544090 (73 sec)

The first describes a mix of an order 0 ICM and an order 1, 2, 3, 4 ISSE chain. Each ISSE mixes the previous prediction with a constant 1, using the bit history to select the pair of weights. The order 4 prediction goes to the arithmetic coder. The second is the same with orders 1, 2, 3, 4, 6. The third adds a MATCH model and a mixer that takes all of the prior components as input.

We can add a word model:

-ms7.0ci1.1.1.1.2awm  -> 20236291 (91 sec)
-ms7.0ci1.1.1.1.2aw2m -> 20048225 (88 sec)

w selects the current word as context. A word is a sequence of A-Z, case insensitive, ignoring other characters, mapped to an ICM. w2 is an ICM-ISSE chain using the current and previous words as context. This ranks about top 25 on LTCB. https://mattmahoney.net/dc/text.html

On Fri, Dec 12, 2025 at 7:04 AM <[email protected]> wrote:
>
> Woah. That is very helpful.
>
> Can you tell me what it scores [without Indirect] and only using SSE and MATCH
> and 4 orders (or up to 15 orders if it improves compression)?
>
> And another question, it seems like what you ran with ISSE and ICM and 4
> orders and MATCH is most of the algorithm and you said it scored 22917, but
> what is the score with that + gaps? Is it ~20,000 bytes then I guess? (Like
> when you ran it a few days ago?)
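The ISSE step described above (mix the previous component's prediction with a constant, with the weight pair selected by context and trained online) can be sketched like this. It is a simplified illustration under my own assumptions, not zpaq's implementation: a plain dict stands in for zpaq's bit-history states, and the class and learning rate are hypothetical:

```python
# Toy ISSE step: mix the previous prediction with a constant 1 in the
# logistic (stretched) domain, with a per-context weight pair trained
# by online gradient descent on the coding error.
import math

def stretch(p):                      # probability -> logistic domain
    return math.log(p / (1 - p))

def squash(x):                       # logistic domain -> probability
    return 1 / (1 + math.exp(-x))

class ToyISSE:
    def __init__(self, lr=0.02):
        self.w = {}                  # context -> [w_prev, w_const]
        self.lr = lr

    def predict(self, cx, p_prev):
        w = self.w.setdefault(cx, [1.0, 0.0])  # start as identity mix
        return squash(w[0] * stretch(p_prev) + w[1])

    def update(self, cx, p_prev, bit):
        w = self.w.setdefault(cx, [1.0, 0.0])
        p = squash(w[0] * stretch(p_prev) + w[1])
        err = self.lr * (bit - p)    # move the mix toward the bit seen
        w[0] += err * stretch(p_prev)
        w[1] += err                  # the "constant 1" input's weight

isse = ToyISSE()
for _ in range(500):                 # train on 1-bits in one context
    isse.update(7, 0.6, 1)
print(isse.predict(7, 0.6))          # rises well above the input 0.6
```

With fresh weights the ISSE passes the previous prediction through unchanged, so chaining orders 1-4 after an order 0 ICM starts out as the order 0 model and learns per-context corrections at each higher order.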
>
> And what else is there in your algorithm (if adding gaps to those settings
> above doesn't bring it to ~20,000) that brings the rest of it down to
> ~20,000? (You said it scored that a few days ago.)
>
> ---------------------------------------------------------------------------------------------
> Btw mine makes MATCH and SSE but they are the same thing (and I think I do it
> a different way and I use how far back they are in the last ~500 characters
> (and of course how many times they appear) to tally up the SSE score for each
> character). I do about 4 MATCHes for 4 orders. This is my SSE. But I also do
> it for order 0 (the character "is" the MATCH) and order 1 (of the 2 characters
> for order 1, only one character seeks a MATCH, but I also do a 2 character
> MATCH for order 1 in another loop over order 1, but not for higher orders; it
> didn't help at 100,000 bytes of data, at least, I mean).

--
Matt Mahoney, [email protected]

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/Tf0bedfcd44454678-Mef255312f1584a2648fc8b87
Delivery options: https://agi.topicbox.com/groups/agi/subscription
