Byte level vs. bit level prediction doesn't affect the compression ratio in
theory, because any byte prediction can be factored into 8 bit predictions
by the chain rule, p(ab) = p(a)p(b|a), applied one bit at a time. Where it
makes a difference is speed. If you use a range coder you have to split the
range into 2 parts to encode a bit or 256 parts to encode a byte, in
proportion to the probability distribution in either case. To normalize the
distribution you have to add up the predictions and divide each by the sum
so they add up to 1. For a byte, that's 256 steps. Then to find the right
piece of the range you have to add up all the probabilities that are
lexicographically less than the byte you are encoding, which is 128 steps
on average. For bit encoding, it's just 1 step, but repeated 8 times.
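As a sanity check on the chain rule argument, here is a small Python sketch
(mine, not from any compressor) that factors an arbitrary 256-symbol
distribution into 8 chained bit probabilities and verifies that their
product recovers the byte probability:

```python
import math

def bit_probabilities(p_byte, byte):
    """p_byte: list of 256 probabilities summing to 1.
    Returns the 8 conditional bit probabilities of `byte`, MSB first."""
    lo, hi = 0, 256                    # bytes consistent with the bits seen so far
    probs = []
    for i in range(8):
        mid = (lo + hi) // 2
        bit = (byte >> (7 - i)) & 1
        p_zero = sum(p_byte[lo:mid])   # mass of bytes whose next bit is 0
        p_range = sum(p_byte[lo:hi])   # mass of the whole current range
        p0 = p_zero / p_range
        probs.append(1 - p0 if bit else p0)
        lo, hi = (mid, hi) if bit else (lo, mid)
    return probs

# The 8 bit probabilities multiply back to p(byte) for any distribution,
# because the conditional masses telescope down to p_byte[byte] / 1.
weights = [i + 1 for i in range(256)]
p = [w / sum(weights) for w in weights]
bits = bit_probabilities(p, 200)
assert math.isclose(math.prod(bits), p[200])
```

The telescoping product is exactly why the factoring is lossless in theory:
each conditional probability divides out the mass of the previous range.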

Byte prediction isn't necessarily slower, because most of the computation
is modeling, not coding. For bit level prediction like in PAQ you are
evaluating the model 8 times per byte, each time with one additional bit of
context, and updating the model after each of those 8 predictions.

Most LLMs predict tokens from a vocabulary in the tens of thousands, so the
difference is even greater. But token prediction still might be faster. The
prediction code is highly parallel, and even some of the encoding steps can
be parallelized, like using an addition tree to reduce the encoding to
about the same number of steps as encoding bits.
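One way to realize that addition tree (a sketch of the idea, not any
particular coder's implementation) is a Fenwick tree over the symbol
frequencies: the cumulative count below a symbol, which the range coder
needs, takes log2(256) = 8 additions instead of 128 on average, the same
number of steps as coding 8 bits:

```python
class Fenwick:
    """Fenwick (binary indexed) tree over n symbol frequencies."""
    def __init__(self, n):
        self.t = [0] * (n + 1)

    def add(self, i, d):               # freq[i] += d, in O(log n) steps
        i += 1
        while i < len(self.t):
            self.t[i] += d
            i += i & -i

    def cumulative(self, i):           # sum of freq[0..i-1], in O(log n) steps
        s = 0
        while i > 0:
            s += self.t[i]
            i -= i & -i
        return s

f = Fenwick(256)
for sym, cnt in [(65, 3), (66, 5), (200, 2)]:
    f.add(sym, cnt)
assert f.cumulative(66) == 3           # counts strictly below symbol 66
assert f.cumulative(256) == 10         # total count, used to normalize the range
```

The same structure handles the model update (add) and the range lookup
(cumulative) in logarithmic time, so neither is the 128-step bottleneck.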

The PAQ/ZPAQ components like CM, ICM, ISSE, SSE, and MIX all operate on bit
predictions. Every component in the list takes a context and possibly some
bit predictions from earlier components and outputs a new prediction. The
last component in the list produces the final prediction to be encoded. For
byte or token prediction you would need a new set of components that output
a vector of probabilities instead of a single number. This would mean 256
times more computation and memory for byte prediction, or tens of thousands
of times more for tokens. But that's what neural language models do. LSTMs
and transformers use an attention model to speed this up at the cost of
some compression by setting all but the top few elements to 0.
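For concreteness, here is a toy sketch in the spirit of PAQ's MIX
component, heavily simplified (fixed learning rate, no context selection;
the names are mine, not ZPAQ's): it combines bit predictions from earlier
components in the logistic domain and adjusts the weights from the error
after each bit is coded.

```python
import math

def squash(x):                         # logistic function
    return 1 / (1 + math.exp(-x))

def stretch(p):                        # inverse of squash
    return math.log(p / (1 - p))

class Mix:
    def __init__(self, n, lr=0.02):
        self.w = [0.0] * n             # one weight per input component
        self.lr = lr
        self.st = [0.0] * n

    def predict(self, probs):
        """Mix bit predictions from earlier components into one prediction."""
        self.st = [stretch(p) for p in probs]
        return squash(sum(w * s for w, s in zip(self.w, self.st)))

    def update(self, p, bit):
        """Error feedback: move weights toward the inputs that were right."""
        err = bit - p
        self.w = [w + self.lr * err * s for w, s in zip(self.w, self.st)]

m = Mix(2)
# Train on a stream where input 0 is informative and input 1 is pure noise.
for bit in [1, 1, 0, 1, 1, 0, 1, 1] * 50:
    p = m.predict([0.9 if bit else 0.1, 0.5])
    m.update(p, bit)
assert m.w[0] > m.w[1]                 # weight shifts toward the better model
```

A byte or token version of the same component would carry a whole
probability vector through predict and update, which is where the factor of
256 (or tens of thousands) in computation and memory comes from.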

-- Matt Mahoney, [email protected]

On Tue, Dec 9, 2025, 1:47 PM <[email protected]> wrote:

> Do you know how much the score falls if you change an algorithm's type of
> prediction from byte to bit? For example if an algorithm compresses 100,000
> bytes to 23,000 using byte prediction, will simply changing the same
> algorithm to do bit prediction score 21,500 bytes, or maybe just 22,500 bytes
> or maybe no improvement? How much?
>
> Also how much for SSE? (or ISSE).
> And how much for updating mixer weights based on error feedback?
> *Artificial General Intelligence List <https://agi.topicbox.com/latest>*
> / AGI / see discussions <https://agi.topicbox.com/groups/agi> +
> participants <https://agi.topicbox.com/groups/agi/members> +
> delivery options <https://agi.topicbox.com/groups/agi/subscription>
> Permalink
> <https://agi.topicbox.com/groups/agi/Tf0bedfcd44454678-M077039668592391e49c807bd>
>

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/Tf0bedfcd44454678-M3b659ceca81ead1ef1dbcdc9
Delivery options: https://agi.topicbox.com/groups/agi/subscription