Re: [agi] Re: Lexical model learning for LLMs

2023-11-27 Thread Matt Mahoney
I separated the effects of XML parsing and article reordering in cmix
for the Hutter prize. Recall that I first found a baseline for enwik9
compression using generic compression models without specializations
for text.

Baseline enwik9
 178,051,852 enwik9.zpaq -ms7ci1.1.1.1.2am (context mixing)
 181,583,242 enwik9.bsc -b100 -T (Burrows Wheeler transform)
 184,909,536 enwik9.pmd -m256 -o16 -r1 (PPM)
 230,135,777 enwik9.7z (LZ77 + arithmetic)
 322,592,218 enwik9.zip -9 (LZ77 + Huffman)
1,000,000,000 enwik9

Next I preprocessed enwik9 with my small dictionary encoder, built using
byte pair encoding. Words are restricted to characters of a single type:
lowercase letters (including & and ;), digits, the @ used to mark the
next letter as uppercase, white space, or a single punctuation character,
possibly repeated. Dictionary codes are single bytes, with code 255
escaping a literal of length 1-64 bytes or any UTF-8 character. This
improved compression in every case.

 166,044,757 x.zpaq (-12 MB from baseline)
 176,252,128 x.bsc (-5 MB)
 177,765,678 x.pmd (-7 MB)
 213,991,893 x.7z (-17 MB)
 298,426,680 x.zip (-24 MB)
 651,316,380 x
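
For illustration, here is a rough Python sketch of the word-splitting
rule described above (not the actual encoder; the exact handling of @, &
and ; is my approximation of the description):

# Split text into "words" of a single character class, after folding
# uppercase letters to '@' + lowercase as described above.
import re

def case_fold(text):
    # Replace each uppercase letter X with '@x'.
    return re.sub(r'[A-Z]', lambda m: '@' + m.group(0).lower(), text)

def split_words(text):
    # Runs of lowercase letters (with @, & and ;), digits, white space,
    # or a single punctuation character, possibly repeated.
    pattern = re.compile(r'[a-z@&;]+|[0-9]+|\s+|(.)\1*', re.DOTALL)
    return [m.group(0) for m in pattern.finditer(text)]

print(split_words(case_fold("Hello,   world!!! 2023")))
# ['@hello', ',', '   ', 'world', '!!!', ' ', '2023']

The byte pair encoding step then assigns single-byte codes to the most
frequent of these words, with 255 reserved as the literal escape.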

The cmix v20 preprocessor used in the last two Hutter prize winners
parses the ~243,426 XML article headers and encodes them separately
from the text. The articles are reordered according to a list of numbers
in the file .new_article_order. The list is not needed for decompression
because the headers contain an <id> field with numbers in ascending
order, allowing them to be sorted. The preprocessor also does some
Wikipedia-specific text substitutions to make the file more like
ordinary text: it decodes the XML entities for & < " > and swaps [[ ]]
(for article links) with [ ] (for website links, which are less common).
This by itself improves compression. The preprocessor output is
.ready4cmix.

 162,899,592 ready4cmix.zpaq (-16 MB from baseline)
 169,660,387 ready4cmix.pmd (-15 MB)
 171,141,389 ready4cmix.bsc (-10 MB)
 208,166,180 ready4cmix.7z (-22 MB)
 296,085,159 ready4cmix.zip (-26 MB)
 934,188,796 ready4cmix

After my small dictionary encoder:

 153,927,354 x.zpaq (-25 MB from baseline)
 164,584,921 x.pmd (-20 MB)
 168,483,120 x.bsc (-13 MB)
 196,925,172 x.7z (-34 MB)
 277,211,683 x.zip (-45 MB)
 646,368,233 x
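
The substitutions themselves are simple. Here is an illustrative Python
sketch (not the actual cmix code; it ignores the invertibility corner
cases a real preprocessor must handle):

# Decode common XML entities and swap the frequent [[ ]] article-link
# brackets with the rarer [ ] web-link brackets.
def wiki_substitutions(text):
    # Decode &amp; last so that something like "&amp;lt;" is decoded only once.
    for entity, char in (("&lt;", "<"), ("&gt;", ">"),
                         ("&quot;", '"'), ("&amp;", "&")):
        text = text.replace(entity, char)
    # Swap via placeholders; \x00 and \x01 are assumed absent from the input.
    text = text.replace("[[", "\x00").replace("]]", "\x01")
    text = text.replace("[", "[[").replace("]", "]]")
    return text.replace("\x00", "[").replace("\x01", "]")

print(wiki_substitutions('&quot;[[Foo]]&quot; and [http://example.com bar]'))
# "[Foo]" and [[http://example.com bar]]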

My next experiment was to separate the effect of article reordering
from the rest of the XML processing. I wrote a program that reorders
the articles by matching the titles in ready4cmix but makes no other
changes. My small dictionary encoder already finds symbols for the
common XML header tags as well as for [[ ]] and [ ], so any difference
should be mainly due to the header preprocessing. Also, the output of my
reordering program includes the offset and length as decimal numbers at
the beginning of each article, which makes the output 3.5 MB larger
(probably 2 MB compressed). Here is the result of reordering only.

 166,309,461 x.zpaq (+3.5 MB from cmix)
 172,802,959 x.pmd (+3.2 MB)
 173,870,086 x.bsc (+2.7 MB)
 215,778,398 x.7z (+7.7 MB)
 309,881,543 x.zip (+13 MB)
1,003,566,215 x (+3.5 MB from baseline, +69 MB from cmix)

After my small dictionary encoding.

 156,431,185 x1.zpaq (+2.5 MB from cmix + dictionary)
 167,210,773 x1.pmd (+2.7 MB)
 170,550,586 x1.bsc (+2.1 MB)
 201,639,890 x1.7z (+4.7 MB)
 285,309,381 x1.zip (+8.1 MB)
 654,944,554 x1 (+8.6 MB)
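
For reference, an illustrative Python sketch of the reordering step (not
the actual program; it assumes unique titles, that every page has a
<title>, and that the dump fits in memory):

import re

def reorder_articles(dump, ordered_titles):
    # Collect each <page> element with its byte offset in the original dump.
    pages = {}
    for m in re.finditer(r"<page>.*?</page>\n?", dump, re.DOTALL):
        title = re.search(r"<title>(.*?)</title>", m.group(0)).group(1)
        pages[title] = (m.start(), m.group(0))
    out = []
    for title in ordered_titles:
        offset, body = pages[title]
        # Prefix the original offset and length as decimal numbers so the
        # transform is invertible without storing a separate permutation.
        out.append("%d %d\n%s" % (offset, len(body), body))
    return "".join(out)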

The reason all of this works is that all of these compressors are
memory constrained. They forget older statistics, so moving related
sections closer together helps.
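
A quick way to see the effect is with zlib, whose 32 KB LZ77 window
stands in for the limited memory (illustrative; exact numbers will vary):

# The same bytes compress better when the duplicated block is adjacent,
# because the copy then falls inside the 32 KB match window.
import os, zlib

block = os.urandom(16 * 1024)                               # 16 KB, incompressible
filler = b"".join(os.urandom(16 * 1024) for _ in range(8))  # 128 KB more

far  = block + filler + block   # copies separated by 128 KB
near = block + block + filler   # copies adjacent

print(len(zlib.compress(far, 9)), len(zlib.compress(near, 9)))
# 'near' comes out roughly 16 KB smaller because the second copy is matched.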

-- Matt Mahoney, mattmahone...@gmail.com

--
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/Tdc371ce11a040352-Mc3567e16f3aa048c236c3ce0
Delivery options: https://agi.topicbox.com/groups/agi/subscription


Re: [agi] Re: Lexical model learning for LLMs

2023-11-23 Thread John Rose
A compression ratio of 0.1072 seems to leave plenty of room still. What is
the estimated best achievable ratio, something like 0.08 to 0.04? Though 0.04
might be impossibly tight... even at 0.05 the resource consumption has got to
grow exponentially out of control unless there are overlooked discoveries yet
to be made.

--
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/Tdc371ce11a040352-Mb338eb5e669be29a59811a86
Delivery options: https://agi.topicbox.com/groups/agi/subscription


Re: [agi] Re: Lexical model learning for LLMs

2023-11-23 Thread Matt Mahoney
I'm assuming 1 bit per character compression, so 1 GB of input text is 1
billion bits, and thus 1 billion parameters. enwik9 compression is actually a
little better than that.

A neural network with m neurons and n connections can implement roughly
2^n/m! distinct functions, since permuting the m neurons gives equivalent
networks. Taking the log, that's roughly n - m log m bits, or about n bits
when n >> m log m, as is usually the case.
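
A quick numerical check of that approximation, with illustrative sizes for m
and n:

# Capacity in bits of a network with m neurons and n connections under the
# counting argument above: log2(2^n / m!) = n - log2(m!) ~ n - m*log2(m).
import math

def capacity_bits(m, n):
    return n - math.lgamma(m + 1) / math.log(2)   # log2(m!) via lgamma

m, n = 10**6, 10**9               # illustrative sizes, not figures from above
print(capacity_bits(m, n))        # about 9.8e8 bits
print(n - m * math.log2(m))       # Stirling-style estimate, also about 9.8e8

Both are close to n, since here n >> m log m.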

On Wed, Nov 22, 2023, 3:33 PM James Bowery  wrote:

> I'm asking because when you say "ideally" this evokes a *recurrent*
> neural network that approximates what I've called the NiNOR complexity
> of the corpus: the "ideal" "compressed training data".
>
> Then you invoke 0.3 bpp as associated with this "ideal" of a "parameter".
> This is all in the context of enwik9 where the word "billion" has the unit
> "bytes" that may, *somehow*, relate to the occurrence of the word
> "billion" in the sense of the sentence in question, which is associated
> with the unit "bit".
>
> See my confusion?

--
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/Tdc371ce11a040352-M3dee7509367cde2fecf54328
Delivery options: https://agi.topicbox.com/groups/agi/subscription


Re: [agi] Re: Lexical model learning for LLMs

2023-11-22 Thread James Bowery
Matt wrote:
> I am doing experiments on learning the rules for tokenization. Back in
> 2000 I experimented in finding word boundaries in text without spaces.
> These occur where there is low mutual information across boundaries.
> 

Possibly relevant is the sub-answer "Variable-length tokens" to the question I 
posed several years ago to cs.stackexchange titled "Finding a Simple 
Distribution In a Binary String".

That answer rather "cheats" by imposing an inductive bias at the outset when it 
says "In particular, I suggest you identify a set of tokens t1,…,tk that you're 
confident will be a superset of the ones in the real model, and then use 
optimization methods to solve for the repeat-factors n1,…,nk that maximize the 
likelihood of the model."
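
A minimal single-character sketch of the low-mutual-information boundary idea
quoted above (illustrative only; the 2000 experiments presumably used longer
contexts than adjacent character pairs):

# Score each position between characters by pointwise mutual information;
# low scores are candidate word boundaries.
from collections import Counter
from math import log2

def boundary_scores(text):
    unig = Counter(text)
    bigr = Counter(zip(text, text[1:]))
    n = len(text)
    scores = []
    for a, b in zip(text, text[1:]):
        p_ab = bigr[(a, b)] / (n - 1)
        p_a, p_b = unig[a] / n, unig[b] / n
        scores.append(log2(p_ab / (p_a * p_b)))
    return scores   # propose boundaries where the score is low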
--
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/Tdc371ce11a040352-M3207e3dda49b8b5104888b19
Delivery options: https://agi.topicbox.com/groups/agi/subscription


Re: [agi] Re: Lexical model learning for LLMs

2023-11-22 Thread James Bowery
I'm asking because when you say "ideally" this evokes a *recurrent* neural 
network that approximates what I've called the NiNOR complexity of the corpus: 
the "ideal" "compressed training data".

Then you invoke 0.3 bpp as associated with this "ideal" of a "parameter".  This 
is all in the context of enwik9 where the word "billion" has the unit "bytes" 
that may, *somehow*, relate to the occurrence of the word "billion" in the 
sense of the sentence in question, which is associated with the unit "bit".

See my confusion?
--
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/Tdc371ce11a040352-M3c722861bab9531dc3fd786b
Delivery options: https://agi.topicbox.com/groups/agi/subscription


Re: [agi] Re: Lexical model learning for LLMs

2023-11-21 Thread Matt Mahoney
On Tue, Nov 21, 2023, 8:45 PM James Bowery  wrote:

> Please elucidate:
>
>
> Ideally a neural network should use one parameter per bit of compressed
> training data, or 1 billion
>
Approximately, from information theory. A Hopfield associative memory
capacity is 0.3 bits per parameter.
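In other words, 10^9 characters at about 1 bit per character is 10^9 bits of
compressed text, so roughly 10^9 parameters at one bit per parameter, or a bit
over 3 x 10^9 parameters at the Hopfield rate of 0.3 bits per parameter.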

Also I'm not convinced about transformers. Our brains are proof that it is
possible to learn in 1 pass. We also devote similar amounts of brain tissue
to low and high level language processing. My preprocessor runs in about a
minute on enwik9.


--
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/Tdc371ce11a040352-Md511c73ecc9f0f38162ec04d
Delivery options: https://agi.topicbox.com/groups/agi/subscription