On Sun, Dec 7, 2025 at 3:55 AM <[email protected]> wrote:
>
> I just saw your reply now. I posted a lot of update replies in the link above 
> (read them there).
>
> I didn't understand, what does paq8px_v67 score with the pre-processed 
> enwik5?  I see "20,582 x.paq8px" but is that v67 or a different version? (I'm 
> using the pre-processed enwik5 for now because everyone else at the top seems 
> to use it.)

I was using v67. I just downloaded the latest version
(paq8px_v209fix1) as of Sept. 22, 2025 from
https://encode.su/threads/342-paq8px?p=86059&viewfull=1#post86059
Compression is significantly better than v67. I tested with option -8
(2377 MB memory).

100,000 enwik5
23,616 enwik5.paq8px209fix1 (11 sec)
54,781 pre-processed enwik5.txt
20,201 pre-processed enwik5.txt.paq8px209fix1 (10 sec)

Also, I found an interesting tutorial on transformers including a
small model you can run on a laptop.
https://poloclub.github.io/transformer-explainer/
and the 2017 paper from Google "Attention is all you need".
https://dl.acm.org/doi/10.5555/3295222.3295349

It is still not clear to me how the Q, K, and V matrices are trained,
other than that they use back propagation with some residual (skip)
connections to avoid vanishing gradients. The paper on nncp (top
ranked on LTCB) uses this model, but again the training is not clear
to me.
https://bellard.org/nncp/nncp.pdf
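As far as I understand it (this is my own sketch, not code from either
paper), the Q, K, and V matrices are nothing special: they are just
linear projections of the same input, and their weights (Wq, Wk, Wv
below are assumed names) are ordinary trainable parameters updated by
back propagation like any other layer. A minimal single-head scaled
dot-product attention in NumPy:

```python
# Minimal sketch of single-head scaled dot-product attention.
# Wq, Wk, Wv are ordinary trainable weight matrices; in a real model a
# framework's autograd would update them by back propagation, with
# residual (skip) connections around each sublayer.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, seq_len = 8, 4, 5

# Trainable projection weights (randomly initialized here).
Wq = rng.normal(size=(d_model, d_k))
Wk = rng.normal(size=(d_model, d_k))
Wv = rng.normal(size=(d_model, d_k))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X):
    # Q, K, V are three different linear views of the same input X.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Each query scores every key; scaling by sqrt(d_k) keeps the
    # softmax from saturating.
    scores = Q @ K.T / np.sqrt(d_k)
    # Output is an attention-weighted average of the values.
    return softmax(scores) @ V

X = rng.normal(size=(seq_len, d_model))
out = attention(X)
print(out.shape)  # (5, 4)
```

The only difference between this and an ordinary dense layer is that
the softmax mixes information across sequence positions; gradients
flow through it to Wq, Wk, and Wv in the usual way.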

What I did notice is that the preprocessed output from cmix -s that
you are using preserves spaces between words. I think (but have not
tested) it would get better compression if the spaces were removed,
since they are implied by the token boundaries. I believe that is what
LLMs currently do.
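To illustrate the idea (a hypothetical sketch, not the cmix -s format):
if both sides share the token vocabulary, the spaces between tokens
carry no information, because the token boundaries reconstruct them.

```python
# Sketch: spaces between tokens are redundant given a shared vocabulary.
# TOKENS is an assumed toy vocabulary, not the real cmix dictionary.
TOKENS = ["the", "quick", "brown", "fox"]
TOK2ID = {t: i for i, t in enumerate(TOKENS)}

def drop_spaces(text):
    """Replace space-separated tokens with a list of token IDs."""
    return [TOK2ID[w] for w in text.split(" ")]

def restore_spaces(ids):
    """Rebuild the text; boundaries come from the IDs, not from spaces."""
    return " ".join(TOKENS[i] for i in ids)

s = "the quick brown fox"
ids = drop_spaces(s)
assert restore_spaces(ids) == s  # lossless round trip
print(ids)  # [0, 1, 2, 3]
```

The transform is lossless, so the compressor never has to spend bits
predicting the space that follows nearly every token.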

-- 
-- Matt Mahoney, [email protected]

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/Tf0bedfcd44454678-Mff6422291e514c4bc6d47b78
Delivery options: https://agi.topicbox.com/groups/agi/subscription