On Sun, Dec 7, 2025 at 3:55 AM <[email protected]> wrote:
>
> I just saw your reply now. I posted a lot of update replies in the link above
> (read them there).
>
> I didn't understand, what does paq8px_v67 score with the pre-processed
> enwik5? I see "20,582 x.paq8px" but is that v67 or a different version? (I'm
> using the pre-processed enwik5 for now because everyone else at the top seems
> to use it.)
I was using v67. I just downloaded the latest version (paq8px_v209fix1) as of
Sept. 22, 2025 from
https://encode.su/threads/342-paq8px?p=86059&viewfull=1#post86059
Compression is significantly better than v67. I tested with option -8
(2377 MB memory).

100,000 enwik5
 23,616 enwik5.paq8px209fix1 (11 sec)
 54,781 pre-processed enwik5.txt
 20,201 pre-processed enwik5.txt.paq8px209fix1 (10 sec)

Also, I found an interesting tutorial on transformers, including a small
model you can run on a laptop:
https://poloclub.github.io/transformer-explainer/
and the 2017 Google paper "Attention is all you need":
https://dl.acm.org/doi/10.5555/3295222.3295349
although it is still not clear to me how the Q, K, and V matrices are
trained (other than that they use back propagation, with skip connections
between layers to avoid vanishing gradients). The paper on nncp (top ranked
on LTCB) uses this model, but again it is not clear to me:
https://bellard.org/nncp/nncp.pdf

What I did notice is that the preprocessed output from cmix -s that you are
using preserves the spaces between words. I think (but have not tested) that
it would get better compression if the spaces were removed, since they are
implied by the token boundaries. I believe that is what LLMs currently do.

--
Matt Mahoney, [email protected]

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/Tf0bedfcd44454678-Mff6422291e514c4bc6d47b78
Delivery options: https://agi.topicbox.com/groups/agi/subscription
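[Editor's note: a minimal NumPy sketch of the Q, K, V question above, not code from the paper. In one attention head the trainable parameters are the projection matrices (Wq, Wk, Wv below, with made-up sizes); Q, K, and V themselves are recomputed from the input on every forward pass, and the projections receive gradients through the softmax like any other weights.]

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, seq_len = 8, 4, 5

# Trainable parameters: only the projection matrices. Backpropagation
# updates these; Q, K, V are intermediate activations, not parameters.
Wq = rng.normal(size=(d_model, d_k))
Wk = rng.normal(size=(d_model, d_k))
Wv = rng.normal(size=(d_model, d_k))

def attention(X):
    """Scaled dot-product attention for one head (no masking)."""
    Q = X @ Wq                                  # queries, (seq_len, d_k)
    K = X @ Wk                                  # keys,    (seq_len, d_k)
    V = X @ Wv                                  # values,  (seq_len, d_k)
    scores = Q @ K.T / np.sqrt(d_k)             # (seq_len, seq_len)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # row-wise softmax
    return w @ V                                # (seq_len, d_k)

X = rng.normal(size=(seq_len, d_model))         # token embeddings
out = attention(X)
print(out.shape)                                # (5, 4)
```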
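[Editor's note: a toy sketch of the space-removal idea above, assuming a simplified word tokenizer that splits on single spaces; real dictionary preprocessors such as cmix -s handle punctuation, capitalization, and out-of-vocabulary words, which this ignores. The point is that the transform is lossless: no space symbols appear in the encoded stream, because the decoder reinserts one space at every token boundary.]

```python
def build_vocab(text):
    """Toy vocabulary: every distinct space-separated word."""
    vocab = sorted(set(text.split(" ")))
    return vocab, {w: i for i, w in enumerate(vocab)}

def encode(text, word_to_id):
    # No explicit space symbols in the output: each boundary between
    # consecutive token IDs implies exactly one space.
    return [word_to_id[w] for w in text.split(" ")]

def decode(ids, vocab):
    # Spaces are restored losslessly from the token boundaries.
    return " ".join(vocab[i] for i in ids)

text = "the cat sat on the mat"
vocab, w2i = build_vocab(text)
ids = encode(text, w2i)
assert decode(ids, vocab) == text
assert len(ids) == 6    # six tokens, zero explicit space symbols
```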
