Re: [agi] Re: Lexical model learning for LLMs

Matt Mahoney Thu, 23 Nov 2023 07:20:28 -0800

Shannon estimated the entropy of written English to be between 0.6 and
1.3 bits per character. The best compression ratio on enwik9 is 0.1072,
which is 0.86 bpc. I think there is room for improvement. Maybe 0.08
is possible. cmix is based on PAQ, which uses 24 bytes to store each
set of context statistics. I have some ideas to improve this. The
Hutter prize is constrained by memory, so cmix uses a larger PPM model,
because PPM stores statistics more efficiently using a context tree.
The problem is that PPM only works for contiguous contexts. A semantic
model requires whole word contexts with gaps.
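
To make the unit conversion explicit, here is a minimal sketch in
Python. It assumes each input byte counts as one character (enwik9 is
UTF-8 but mostly ASCII), and the helper names are mine, for
illustration only.

```python
# Convert between compression ratio (output bytes / input bytes) and
# bits per character, assuming one character per 8-bit input byte.

def bits_per_char(ratio, bits_per_input_char=8):
    return ratio * bits_per_input_char

def ratio_from_bpc(bpc, bits_per_input_char=8):
    return bpc / bits_per_input_char

# Current best ratio and the speculated target from the text.
print(round(bits_per_char(0.1072), 2))
print(round(bits_per_char(0.08), 2))

# Shannon's 0.6 to 1.3 bpc range expressed as compression ratios.
print(ratio_from_bpc(0.6), ratio_from_bpc(1.3))
```

Under that assumption, a ratio of 0.08 corresponds to 0.64 bpc, which
is near the low end of Shannon's range.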

Here is the experiment I did in 2000 on finding word boundaries in
text with the spaces removed.
https://cs.fit.edu/~mmahoney/dissertation/lex1.html
The fact that this works at all is why byte pair encoding works.
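
For readers who have not seen byte pair encoding, a minimal greedy
sketch in Python follows. This is not my encoder and not the method
from the 2000 experiment. It only shows the general idea: the most
frequent adjacent pair is merged into a new symbol, and on text with
the spaces removed the merged symbols tend to be pieces of words. The
function name and the stopping rule (stop when no pair occurs twice)
are assumptions for illustration.

```python
from collections import Counter

def bpe_merges(text, num_merges):
    """Greedy pair merging over a list of symbols (illustrative only)."""
    symbols = list(text)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols.
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing left worth merging
        merges.append(a + b)
        # Rewrite the sequence, replacing each (a, b) with one symbol.
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols, merges

text = "thecatsatonthematthecatatethehat"
symbols, merges = bpe_merges(text, 10)
print(symbols)
print(merges)
```

Joining the returned symbols always reproduces the input, so the
merges act as a learned segmentation rather than a lossy transform.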

Also I looked at the article reordering preprocessor in cmix. I
believe it is the same one used in starlit and fast-cmix to win the
last 2 Hutter prizes. It does some light preprocessing on enwik9: it
decodes XML/HTML markup like &lt; &quot; &amp; &gt; to < " & >,
replaces [[article link]] with [article link] and [website link] with
[[website link]], and encodes the XML headers, for example replacing
timestamps with integers. This reduces enwik9 from 1000 MB to 934 MB.
The articles are reordered according to a list of 250K numbers. I didn't see
any code for producing this list, but I looked at the sorted titles
and it groups related genres like movies, historical figures, years,
and places. The results below don't include the size of this file,
about 1 MB compressed.
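
A rough Python sketch of the entity decoding and the link bracket swap
described above. This is not the cmix code. The function name, the
regular expression, and the order of the steps are my assumptions, it
does not handle nested links, and it leaves out the header encoding
and the article reordering.

```python
import re

# Decode only the four entities named above. &amp; goes last so that
# "&amp;lt;" becomes "&lt;" rather than "<".
ENTITIES = [("&lt;", "<"), ("&gt;", ">"), ("&quot;", '"'), ("&amp;", "&")]

# Match [[...]] or [...] with no brackets inside (no nesting).
LINK = re.compile(r"\[\[([^\[\]]*)\]\]|\[([^\[\]]*)\]")

def swap_brackets(match):
    if match.group(1) is not None:
        return "[" + match.group(1) + "]"    # article link: [[x]] -> [x]
    return "[[" + match.group(2) + "]]"      # website link: [x] -> [[x]]

def preprocess(text):
    for entity, char in ENTITIES:
        text = text.replace(entity, char)
    # One pass, so a link that was just swapped is not swapped back.
    return LINK.sub(swap_brackets, text)

print(preprocess("&lt;b&gt; [[Anarchism]] [http://x.org site]"))
```

Article links are far more common than website links in Wikipedia
text, so giving them the shorter form saves two bytes per link.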

Baseline enwik9
 178,051,852 enwik9.zpaq -ms7ci1.1.1.1.2am
 181,583,242 enwik9.bsc -b100 -T
 184,909,536 enwik9.pmd -m256 -o16 -r1
 230,135,777 enwik9.7z
 322,592,218 enwik9.zip -9
1,000,000,000 enwik9
2,097,272,625 bytes

My byte pair encoder.
 166,471,095 x.zpaq
 176,420,366 x.bsc
 178,037,227 x.pmd
 214,109,644 x.7z
 298,750,012 x.zip
 657,627,700 x
1,691,416,044 bytes

cmix v20 article reordering.
 162,899,592 ready4cmix.zpaq
 169,660,387 ready4cmix.pmd
 171,141,389 ready4cmix.bsc
 208,166,180 ready4cmix.7z
 296,085,159 ready4cmix.zip
 934,188,796 ready4cmix
1,942,141,503 bytes

cmix v20 article reordering + my byte pair encoder.
 154,146,050 x.zpaq
 164,829,889 x.pmd
 168,405,583 x.bsc
 197,333,656 x.7z
 277,562,651 x.zip
 651,495,517 x
1,613,773,346 bytes

BWT works poorly on sorted data, which might explain why bsc (a BWT
compressor) improved less from the article reordering than the others,
where proximity is more important.
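
The gain from reordering alone can be checked from the sizes listed
above (baseline enwik9 versus ready4cmix). A short Python sketch, with
the numbers copied from those two tables:

```python
# Compressed sizes in bytes, from the baseline and ready4cmix tables.
baseline = {
    "zpaq": 178_051_852, "bsc": 181_583_242, "pmd": 184_909_536,
    "7z": 230_135_777, "zip": 322_592_218,
}
reordered = {
    "zpaq": 162_899_592, "bsc": 171_141_389, "pmd": 169_660_387,
    "7z": 208_166_180, "zip": 296_085_159,
}

def relative_gain(before, after):
    """Fraction of compressed size saved, per compressor."""
    return {name: 1 - after[name] / before[name] for name in before}

gains = relative_gain(baseline, reordered)
# Print from smallest to largest gain.
for name, gain in sorted(gains.items(), key=lambda kv: kv[1]):
    print(f"{name:5s} {100 * gain:5.2f}%")
```

bsc comes out lowest at a little under 6%, while the other four save
roughly 8% to 10%.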

On Thu, Nov 23, 2023 at 9:36 AM John Rose <johnr...@polyplexic.com> wrote:
>
> A compression ratio of 0.1072 seems like there is plenty of room still. What 
> is the max ratio estimate something like 0.08 to 0.04?  Though 0.04 might be 
> impossibly tight... even at 0.05 the resource consumption has got to 
> exponentiate out of control.... unless there are overlooked discoveries yet 
> to be made.

-- 
-- Matt Mahoney, mattmahone...@gmail.com

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/Tdc371ce11a040352-Me070b1be58aefeaa9a41fb3d
Delivery options: https://agi.topicbox.com/groups/agi/subscription
