To answer your question on Stack Exchange, the way a compressor would guess
that a binary string is made up of 2-bit tokens with different frequencies
is to train on a context that includes both the previous bit and the bit
position mod 2. PAQ has models like this at the byte level for lengths 2, 3,
and 4, and it can discover longer repeating structures by searching for
byte values that repeat at regular intervals.
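
Here is a minimal sketch of that idea (my own illustration, not PAQ code):
predict each bit from the context (previous bit, bit position mod 2) with
adaptive counts, and measure the ideal arithmetic coding cost.

# Sketch: an adaptive bit predictor whose context is the previous bit and
# the bit position mod 2, applied to a stream of skewed 2-bit tokens.
import math, random

random.seed(1)
tokens = random.choices(["00", "01", "10", "11"], weights=[8, 4, 2, 1], k=5000)
bits = "".join(tokens)

counts = {}        # context -> [zero count, one count]
cost = 0.0         # ideal code length in bits, sum of -log2 p(bit)
prev = 0
for i, ch in enumerate(bits):
    ctx = (prev, i % 2)                # previous bit and position within token
    c0, c1 = counts.get(ctx, [1, 1])   # Laplace-smoothed counts
    bit = int(ch)
    p = c1 / (c0 + c1)
    cost += -math.log2(p if bit else 1 - p)
    counts[ctx] = [c0 + (bit == 0), c1 + (bit == 1)]
    prev = bit

print(len(bits), "bits coded to about", round(cost), "bits")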

If the tokens are variable length, then boundaries can be discovered by
searching for positions of high conditional entropy reading in both
directions. I did experiments in 2000 where I removed spaces from text and
restored them with 77% accuracy using only 5-gram statistics.
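
A rough sketch of the idea (not the code from the 2000 experiment): score
each position by the surprisal of the next character given the previous k
characters, and propose boundaries where the surprisal is high. A backward
model over the reversed text can be combined the same way.

# Sketch: n-gram surprisal as a boundary signal for unsegmented text.
from collections import Counter
import math

def boundary_scores(s, k=4):
    ctx, pair = Counter(), Counter()
    for i in range(k, len(s)):          # gather order-k statistics
        ctx[s[i-k:i]] += 1
        pair[s[i-k:i], s[i]] += 1
    scores = {}
    for i in range(k, len(s)):
        p = pair[s[i-k:i], s[i]] / ctx[s[i-k:i]]
        scores[i] = -math.log2(p)       # high surprisal suggests a boundary
    return scores

# In practice the counts would be trained on a large corpus, and the
# forward and backward scores combined before choosing boundaries.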

Byte pair encoding works because English has an intermediate structure
between letters and words, namely syllables. That structure is learnable by
a multilayer neural network. Byte pair encoding is faster, but it fails
when the tokens overlap or have gaps. The most common pair in English text
is "e " (e followed by a space), which is a poor token because the two
characters don't belong to the same semantic unit. I added code so that
only letters or repeated punctuation marks are paired.
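
A minimal sketch of that constraint (my guess at a reasonable rule, not the
actual code): allow a merge only if the result is all letters, or a run of
one repeated punctuation character.

# Sketch of a pair-admissibility test for byte pair encoding.
def mergeable(a, b):
    s = a + b
    if s.isalpha():
        return True                      # e.g. "th" + "e" -> "the"
    if not s[0].isalnum() and not s[0].isspace():
        return all(c == s[0] for c in s) # e.g. "''" + "'" -> "'''"
    return False

assert mergeable("th", "e")
assert mergeable("''", "'")
assert not mergeable("e", " ")           # rejects the common "e " pair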

I'm currently looking at article reordering in enwik9. So far I have not
been able to improve on the order in the Starlit Hutter prize submission,
which groups articles by topic. It is difficult because it is essentially a
traveling salesman problem over 243,000 articles, where the objective is to
maximize the mutual information between adjacent articles. Just calculating
all the pairwise distances is prohibitively expensive, never mind searching
for the best path.
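
A common approximation (my illustration, not the Starlit code) is a
compression-based distance: two articles are close if compressing their
concatenation costs little more than compressing either one alone, and a
greedy nearest-neighbor tour is a cheap stand-in for solving the traveling
salesman problem. The quadratic number of distance computations is exactly
what becomes prohibitive at 243,000 articles.

# Sketch: normalized compression distance plus a greedy tour over articles.
import zlib

def csize(s):
    return len(zlib.compress(s, 9))

def ncd(a, b):
    # Small when a and b share content (their concatenation compresses well).
    return (csize(a + b) - min(csize(a), csize(b))) / max(csize(a), csize(b))

def greedy_order(articles):
    order, remaining = [0], set(range(1, len(articles)))
    while remaining:
        last = articles[order[-1]]
        nxt = min(remaining, key=lambda j: ncd(last, articles[j]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

docs = [b"cats and dogs " * 50, b"stock market prices " * 50,
        b"dogs and wolves " * 50]
print(greedy_order(docs))  # the two dog-related articles should end up adjacent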

Article reordering works for two reasons. First, it improves compression
because compressors have limited memory and forget old statistics. Second,
moving related strings closer together improves compression with
non-stationary algorithms like LZ77 (shorter match offsets in zip and 7zip)
and context mixing (zpaq). Reordering has almost no effect on stationary
algorithms like BWT (bsc) and PPM (ppmd), at least until BWT reaches the
end of its block, or PPM runs out of memory and has to discard and
partially rebuild its model.


On Tue, Feb 20, 2024, 2:11 PM James Bowery <jabow...@gmail.com> wrote:

> https://twitter.com/jabowery/status/1760015755792294174
>
> https://youtu.be/zduSFxRajkE
>
>
>
> On Tue, Nov 21, 2023 at 7:20 PM Matt Mahoney <mattmahone...@gmail.com>
> wrote:
>
>> I started the large text benchmark in 2006
>> (https://mattmahoney.net/dc/text.html ) with the claim that all you
>> need to pass the Turing test is text prediction, which you can measure
>> with compression. Both the benchmark and the Hutter prize use the same
>> 1 GB text file (enwik9), with the goal of encouraging research in
>> language modeling. Those results are now paying off, with the top
>> compressors using neural networks just like the LLMs that have been
>> released in the last year. These are worth studying if you plan to do
>> any actual development work in AI.
>> 
>> The top 2 programs are nntp and cmix. nntp is described in two papers
>> linked on the benchmark. The first paper describes two versions using
>> LSTM and transformer networks, with the LSTM performing slightly
>> better. Later versions of nntp improved the transformer model, which
>> now takes the top spot on LTCB with a compression ratio of 0.1072. It
>> uses a 6 layer transformer with 200M parameters. The input is the
>> current token and the output is a vector of probabilities over a 16K
>> vocabulary for the next token. It takes 3 days to compress or
>> decompress on a Geforce RTX-3090 GPU with 10,496 CUDA cores.
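>>
>> A minimal PyTorch sketch of that kind of next-token model (my
>> illustration; the layer sizes are made up and are not nntp's actual
>> hyperparameters, which are given in the papers):
>>
>> # Sketch only: a decoder-style next-token predictor over a 16K vocabulary.
>> # Dimensions are illustrative, not nntp's real configuration.
>> import torch
>> import torch.nn as nn
>>
>> class NextTokenModel(nn.Module):
>>     def __init__(self, vocab=16384, d_model=512, layers=6, heads=8, ctx=1024):
>>         super().__init__()
>>         self.embed = nn.Embedding(vocab, d_model)
>>         self.pos = nn.Embedding(ctx, d_model)
>>         layer = nn.TransformerEncoderLayer(d_model, heads, 4 * d_model,
>>                                            batch_first=True)
>>         self.encoder = nn.TransformerEncoder(layer, layers)
>>         self.out = nn.Linear(d_model, vocab)
>>
>>     def forward(self, tokens):                  # tokens: (batch, seq)
>>         n = tokens.size(1)
>>         pos = torch.arange(n, device=tokens.device)
>>         x = self.embed(tokens) + self.pos(pos)
>>         mask = torch.triu(torch.full((n, n), float("-inf"),
>>                                      device=tokens.device), diagonal=1)
>>         h = self.encoder(x, mask=mask)          # causal self-attention
>>         return self.out(h)                      # logits for the next token
>>
>> # The predicted distribution drives an arithmetic coder: each token costs
>> # about -log2 p(token) bits in the compressed output.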
>> 
>> Second place is cmix, which doesn't need a GPU, but takes 8 days on a
>> Core i7-7700 with 32 GB of memory to achieve 0.1098 compression.
>> fast-cmix is a modified version that won the Hutter prize with 0.1137,
>> meeting the constraints of 10 GB memory and 1.5 days CPU time on my
>> i7-1165G7 Lenovo laptop. It is a tuned version of starlit, which won
>> an earlier Hutter prize by sorting the Wikipedia articles in the test
>> file enwik9 by mutual information. cmix uses a PAQ-based model which
>> combines the predictions of many context models using simple 2-layer
>> neural networks. To save memory it uses a large PPM model, which
>> predicts at the byte level based on the longest matching context for
>> each possible outcome. This is more memory efficient than the bit
>> level predictors used in PAQ. cmix preprocesses the input by mapping
>> words to 1-3 byte tokens using a fixed 80K dictionary similar to wxrt
>> and drt. The dictionary is organized to group similar words together
>> like "mother" and "father" to allow partial bitwise contexts.
>> 
>> The nntp papers hint at possible improvements. Ideally a neural
>> network should use one parameter per bit of compressed training data,
>> or 1 billion. It could also use a larger vocabulary. The paper notes
>> that the network learns slowly in spite of making 20 training passes
>> every 64 tokens, which causes enwik8 compression (the first 100 MB) to
>> do worse than expected. Context models like those in PAQ and cmix solve this
>> problem, but these lack the arbitrarily deep feature hierarchies that
>> allow neurons to represent highly abstract concepts. Natural language
>> is completely learnable from unlabeled training data starting with the
>> simplest concepts, using neurons to represent letters or phonemes,
>> word segmentation rules, words, semantics, and grammatical categories
>> and structures in the same order that children learn these features.
>> Both nntp and cmix use fixed dictionaries to convert the text to a
>> stream of tokens to be modeled.
>> 
>> I am doing experiments on learning the rules for tokenization. Back in
>> 2000 I experimented with finding word boundaries in text without spaces.
>> These occur where there is low mutual information across boundaries.
>> With an order 5 model, this predicts the missing spaces with about 75%
>> accuracy. In my present experiments, I am using byte pair encoding.
>> Start with a dictionary of 255 codes each representing one byte, plus
>> one code for literal strings with no matching symbols. Find the pair
>> that would encode the most characters, make it a new symbol, and replace
>> the least useful code. Repeat until there is no improvement, then code
>> the input again and keep replacing symbols until the output stops
>> getting smaller.
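>>
>> A compact sketch of that greedy loop (my illustration, not the real
>> program, which works on bytes and manages a fixed 256-entry code table):
>>
>> # Greedy byte pair encoding sketch: repeatedly merge the adjacent symbol
>> # pair that would encode the most characters.
>> from collections import Counter
>>
>> def bpe(data, n_merges=100):
>>     seq = list(data)                  # start with one symbol per character
>>     merges = []
>>     for _ in range(n_merges):
>>         pairs = Counter(zip(seq, seq[1:]))
>>         if not pairs:
>>             break
>>         # Pick the pair covering the most characters (count * pair length).
>>         (a, b), count = max(pairs.items(),
>>                             key=lambda kv: kv[1] * len(kv[0][0] + kv[0][1]))
>>         if count < 2:                 # nothing repeats; no improvement
>>             break
>>         merges.append(a + b)
>>         out, i = [], 0
>>         while i < len(seq):           # replace every occurrence of the pair
>>             if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
>>                 out.append(a + b); i += 2
>>             else:
>>                 out.append(seq[i]); i += 1
>>         seq = out
>>     return seq, merges
>>
>> seq, merges = bpe("the theory of the thing " * 20)
>> print(len(seq), merges[:5])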
>> 
>> Byte pair encoding by itself compresses, but to make the output more
>> compressible it is important that the symbol strings represent
>> independent semantic concepts, such as words, digits, or punctuation
>> characters. To achieve that, I require that all of the characters come
>> from the same set within a symbol. These sets are:
>> 
>> Lower case letters a-z and & # ; (to allow Wikipedia markup like &lt; for < ).
>> @   (to indicate the next lower case letter should be capitalized).
>> Digits 0-9.
>> White space and < / >  (to encode XML markup).
>> All other characters are in their own set.
>> 
>> Encoding and decoding enwik9 takes about a minute. I tested the output
>> with 5 different compression algorithms that don't have text specific
>> models. These are:
>> 
>> zip -9 (LZ77 and Huffman coding. Repeated strings are coded as
>> pointers to the previous occurrence. -9 means max compression).
>> 
>> 7zip (LZ77 and arithmetic coding with a larger window).
>> 
>> bsc -b100 -T  (BWT: suffix sorting followed by a low order, fast
>> adapting model, with a block size of 100 MB).
>> 
>> ppmd -m256 -o16 -r1 (PPM byte level prediction and arithmetic coding.
>> Options for max compression using 256 MB memory, contexts up to 16
>> bytes, and partial rebuilding of the statistics tree when out of memory).
>> 
>> zpaq -ms7ci1.1.1.1.2am (context mixing with an ICM-ISSE chain of order
>> 0, 1, 2, 3, 4, 6, and a match model that predicts the next bit
>> following the last context match of 8 or longer, all mixed using a
>> neural network selected by the order 0 context. An ICM maps the
>> context hash to a bit history and then to a bit prediction. An ISSE
>> mixes the previous prediction with a constant, using the context hash to
>> select a pair of weights).
>> 
>> Here are the results.
>> 
>> Baseline enwik9
>>  178,051,852 enwik9.zpaq
>>  181,583,242 enwik9.bsc
>>  184,909,536 enwik9.pmd
>>  230,135,777 enwik9.7z
>>  322,592,218 enwik9.zip
>> 1,000,000,000 enwik9
>> 
>> Byte pair encoded.
>>  166,471,095 x.zpaq
>>  176,420,366 x.bsc
>>  178,037,227 x.pmd
>>  214,109,644 x.7z
>>  298,750,012 x.zip
>>  657,627,700 x
>> 
>> Here is the dictionary it built. The first number is the symbol code
>> and the second is the number of bytes encoded by that symbol. This is
>> after the capitalization model: encoding "A" as "@a" etc. expands the
>> input by 3.8% but reduces the size overall.
>> 
>> 0 5211047 "
>> "
>> 1 4373548 "
>> 
>> "
>> 2 1167495 "
>>   "
>> 3 117733608 " "
>> 4 3606580 "  "
>> 5 2931025 "&amp;"
>> 6 4118380 "&gt;"
>> 7 4096424 "&lt;"
>> 8 1203660 "&lt;td&gt;"
>> 9 11245188 "&quot;"
>> 10 4976940 "''"
>> 11 2875488 "'''"
>> 12 2521806 "("
>> 13 2523537 ")"
>> 14 2490470 "*"
>> 15 7978922 ","
>> 16 3542288 "-"
>> 17 9578832 "."
>> 18 2689903 "/"
>> 19 4211470 "0"
>> 20 4036090 "1"
>> 21 1301752 "18"
>> 22 2700718 "19"
>> 23 3950090 "2"
>> 24 2313888 "200"
>> 25 2736971 "3"
>> 26 2579245 "4"
>> 27 2596182 "5"
>> 28 2372995 "6"
>> 29 2005136 "7"
>> 30 1792449 "8"
>> 31 1889329 "9"
>> 32 3536470 ":"
>> 33 3714616 "</"
>> 34 1533437 "="
>> 35 2719548 "=="
>> 36 2100736 ">"
>> 37 4937064 ">
>>         <"
>> 38 11365812 ">
>>       <"
>> 39 2434260 ">
>>       </"
>> 40 5119065 ">
>>     <"
>> 41 1947408 ">
>>     </"
>> 42 1460556 ">
>>   </"
>> 43 41512171 "@"
>> 44 19790976 "[["
>> 45 19793156 "]]"
>> 46 7868460 "a"
>> 47 2253310 "ac"
>> 48 1917990 "ad"
>> 49 2411025 "age"
>> 50 6215072 "al"
>> 51 1146796 "also"
>> 52 1408732 "am"
>> 53 2096912 "american"
>> 54 5741346 "an"
>> 55 9907014 "and"
>> 56 4131752 "ar"
>> 57 3551010 "are"
>> 58 3657708 "as"
>> 59 3654828 "at"
>> 60 2307250 "ation"
>> 61 2068732 "b"
>> 62 2246596 "ba"
>> 63 4096522 "be"
>> 64 1856768 "bl"
>> 65 2250030 "bo"
>> 66 1784450 "br"
>> 67 1696670 "bu"
>> 68 1661308 "by"
>> 69 4447960 "c"
>> 70 4783624 "ca"
>> 71 3377360 "category"
>> 72 3413858 "ce"
>> 73 2556450 "census"
>> 74 1331328 "cent"
>> 75 4422008 "ch"
>> 76 2755628 "ci"
>> 77 4664290 "co"
>> 78 3465348 "com"
>> 79 2752603 "comment"
>> 80 1073664 "cont"
>> 81 5374974 "contributor"
>> 82 1448808 "county"
>> 83 2159750 "ct"
>> 84 1639685 "ction"
>> 85 7228135 "d"
>> 86 1698436 "da"
>> 87 5322652 "de"
>> 88 1425969 "der"
>> 89 3671270 "di"
>> 90 1295300 "ding"
>> 91 8313950 "e"
>> 92 2033368 "ea"
>> 93 4077406 "ed"
>> 94 1506072 "el"
>> 95 1415286 "em"
>> 96 3160132 "en"
>> 97 2125554 "ent"
>> 98 4009852 "er"
>> 99 4033320 "es"
>> 100 3449798 "f"
>> 101 1679906 "fa"
>> 102 2191754 "fe"
>> 103 2722322 "fi"
>> 104 5207685 "for"
>> 105 2664300 "from"
>> 106 4533981 "g"
>> 107 1814576 "ga"
>> 108 2976462 "ge"
>> 109 1615233 "ght"
>> 110 2039118 "gr"
>> 111 2393602 "h"
>> 112 2643402 "ha"
>> 113 1485180 "have"
>> 114 2985598 "he"
>> 115 1823524 "hi"
>> 116 1982520 "his"
>> 117 2019822 "ho"
>> 118 1008135 "household"
>> 119 1102920 "households"
>> 120 1653396 "http"
>> 121 5612158 "i"
>> 122 1911778 "ia"
>> 123 2482484 "ic"
>> 124 3937786 "id"
>> 125 2061248 "il"
>> 126 1289110 "image"
>> 127 11788674 "in"
>> 128 3959475 "ing"
>> 129 1121580 "inter"
>> 130 1706220 "ion"
>> 131 5791098 "is"
>> 132 4533192 "it"
>> 133 1693526 "j"
>> 134 3443538 "k"
>> 135 1935802 "ke"
>> 136 1074780 "king"
>> 137 1676720 "km&amp;sup"
>> 138 4688188 "l"
>> 139 3642740 "la"
>> 140 1879684 "land"
>> 141 4808666 "le"
>> 142 5253150 "li"
>> 143 3271514 "ll"
>> 144 3282854 "lo"
>> 145 2545420 "ly"
>> 146 4316302 "m"
>> 147 4111580 "ma"
>> 148 1119480 "males"
>> 149 2319318 "man"
>> 150 1586886 "mar"
>> 151 1228936 "mber"
>> 152 4692548 "me"
>> 153 1968364 "ment"
>> 154 3559026 "mi"
>> 155 1624740 "mi&amp;sup"
>> 156 3325900 "mo"
>> 157 6079102 "n"
>> 158 2871918 "na"
>> 159 1174776 "national"
>> 160 1646913 "nce"
>> 161 2328648 "nd"
>> 162 4591032 "ne"
>> 163 3307664 "ng"
>> 164 2304930 "ni"
>> 165 3850648 "no"
>> 166 1935778 "ns"
>> 167 2174184 "nt"
>> 168 6369810 "o"
>> 169 10290952 "of"
>> 170 1882854 "ol"
>> 171 4176046 "on"
>> 172 1353369 "one"
>> 173 4345328 "or"
>> 174 1680510 "other"
>> 175 2924346 "ou"
>> 176 1277936 "over"
>> 177 5716530 "p"
>> 178 2438090 "pa"
>> 179 2225092 "page"
>> 180 1268168 "part"
>> 181 3636874 "pe"
>> 182 2404846 "pl"
>> 183 2431580 "po"
>> 184 2702850 "population"
>> 185 1053200 "port"
>> 186 1249800 "pres"
>> 187 1996080 "preserve"
>> 188 2405145 "pro"
>> 189 7254014 "r"
>> 190 4395294 "ra"
>> 191 9660304 "re"
>> 192 3926368 "revision"
>> 193 5185568 "ri"
>> 194 3029684 "ro"
>> 195 1910144 "rs"
>> 196 1231892 "rt"
>> 197 11133241 "s"
>> 198 1587710 "sa"
>> 199 1517732 "sc"
>> 200 5074948 "se"
>> 201 2202958 "sh"
>> 202 3748654 "si"
>> 203 2194206 "so"
>> 204 1044084 "some"
>> 205 1752954 "sp"
>> 206 1472240 "space"
>> 207 2045926 "ss"
>> 208 7298282 "st"
>> 209 1140720 "states"
>> 210 2392062 "su"
>> 211 6344108 "t"
>> 212 2786192 "ta"
>> 213 3528818 "te"
>> 214 1681107 "ted"
>> 215 3591453 "ter"
>> 216 2127476 "text"
>> 217 3669638 "th"
>> 218 2487472 "that"
>> 219 26846655 "the"
>> 220 1700620 "there"
>> 221 1412172 "this"
>> 222 4162324 "ti"
>> 223 4382991 "timestamp"
>> 224 3851636 "tion"
>> 225 2792300 "title"
>> 226 7878548 "to"
>> 227 2647202 "tr"
>> 228 1168294 "tt"
>> 229 1495056 "ty"
>> 230 6851023 "u"
>> 231 1791650 "ul"
>> 232 2941346 "un"
>> 233 1188445 "under"
>> 234 1237356 "united"
>> 235 3364592 "ur"
>> 236 2702458 "us"
>> 237 1763349 "use"
>> 238 3288768 "username"
>> 239 3614588 "ve"
>> 240 3622539 "ver"
>> 241 2596844 "vi"
>> 242 1088152 "ving"
>> 243 2913531 "w"
>> 244 1962994 "wa"
>> 245 2598747 "was"
>> 246 2613848 "we"
>> 247 1694874 "wh"
>> 248 1790215 "which"
>> 249 1656796 "wi"
>> 250 3236908 "with"
>> 251 1244163 "wor"
>> 252 663000 "ww"
>> 253 6813940 "y"
>> 254 4907106 "|"
>> 255 18301782 - literal bytes.
>> 
>> Byte pair encoding learns the XML and Wikipedia markup quite well. For
>> example, "&lt;td&gt;" represents <td>, a marker for separating table
>> data. The file is in XML format like this, where the article is in the
>> <text> tag. The characters < " > are written as &lt; &quot; &gt; so
>> the XML parser is not confused. Byte pair encoding finds an efficient
>> encoding of the XML and markup like [[link to article]] or ===Level 3
>> Heading=== or '''bold text''' or &lt;ref&gt; for a numbered reference
>> in addition to learning an efficient way to encode words with fewer
>> symbols.
>> 
>> <page>
>>   <title>...</title>
>>   <id>...</id>                        (decimal)
>>   <revision>
>>     <id>...</id>                      (decimal)
>>     <restrictions>...</restrictions>  (optional)
>>     <timestamp>...</timestamp>        (format yyyy-mm-ddThh:mm:ssZ)
>>     <contributor>
>>       <ip>...</ip>                    (optional)
>>       <username>...</username>        (if no <ip>)
>>       <id>...</id>                    (if no <ip>, 1-6 digits)
>>     </contributor>
>>     <minor />                         (optional)
>>     <comment>...</comment>            (optional, may span lines)
>>     <text xml:space="preserve">...</text>  (may span lines)
>>   </revision>
>> </page>
>> 
>> --
>> -- Matt Mahoney, mattmahone...@gmail.com
