https://twitter.com/jabowery/status/1760015755792294174

https://youtu.be/zduSFxRajkE



On Tue, Nov 21, 2023 at 7:20 PM Matt Mahoney <mattmahone...@gmail.com>
wrote:

> I started the large text benchmark in 2006
> (https://mattmahoney.net/dc/text.html ) with the claim that all you
> need to pass the Turing test is text prediction, which you can measure
> with compression. Both the benchmark and the Hutter prize use the same
> 1 GB text file (enwik9), with the goal of encouraging research in
> language modeling. Those results are now paying off, with the top
> compressors using neural networks just like the LLMs that have been
> released in the last year. These are worth studying if you plan to do
> any actual development work in AI.
> 
> The top 2 programs are nncp and cmix. nncp is described in two papers
> linked on the benchmark. The first paper describes two versions using
> LSTM and transformer networks, with the LSTM performing slightly
> better. Later versions of nncp improved the transformer model, which
> now takes the top spot on LTCB with a compression ratio of 0.1072. It
> uses a 6-layer transformer with 200M parameters. The input is the
> current token and the output is a vector of probabilities over a 16K
> vocabulary for the next token. It takes 3 days to compress or
> decompress on a GeForce RTX 3090 GPU with 10,496 CUDA cores.
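> 
> The link between prediction and compression is that an arithmetic coder
> can store each token in about -log2(p) bits, where p is the probability
> the model assigned to the token that actually occurred, so the compressed
> size is essentially the model's total log loss. A minimal sketch of that
> accounting in Python (toy probabilities, not nncp's model):
> 
>   import math
> 
>   # Probabilities a hypothetical model assigned to the tokens that
>   # actually occurred (made-up values for illustration).
>   predicted_probs = [0.50, 0.05, 0.25, 0.01, 0.40]
> 
>   # An arithmetic coder needs about -log2(p) bits per token, so the
>   # total compressed size equals the cumulative log loss.
>   bits = sum(-math.log2(p) for p in predicted_probs)
>   print(f"about {bits:.1f} bits, {bits / len(predicted_probs):.2f} bits/token")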
> 
> Second place is cmix, which doesn't need a GPU, but takes 8 days on a
> Core i7-7700 with 32 GB of memory to achieve 0.1098 compression.
> fast-cmix is a modified version that won the Hutter prize with 0.1137,
> meeting the constraints of 10 GB memory and 1.5 days CPU time on my
> i7-1165G7 Lenovo laptop. It is a tuned version of starlit, which won
> an earlier Hutter prize by sorting the Wikipedia articles in the test
> file enwik9 by mutual information. cmix uses a PAQ-based model which
> combines the predictions of many context models using simple 2-layer
> neural networks. To save memory it uses a large PPM model, which
> predicts at the byte level based on the longest matching context for
> each possible outcome. This is more memory-efficient than the bit-level
> predictors used in PAQ. cmix preprocesses the input by mapping
> words to 1-3 byte tokens using a fixed 80K dictionary similar to xwrt
> and drt. The dictionary is organized to group similar words together,
> like "mother" and "father", to allow partial bitwise contexts.
> 
> The nncp papers hint at possible improvements. Ideally a neural
> network should use one parameter per bit of compressed training data,
> or about 1 billion, since enwik9 compresses to roughly 10^9 bits. It
> could also use a larger vocabulary. The paper notes that the network
> learns slowly in spite of making 20 training passes every 64 tokens,
> which causes enwik8 compression (the first 100 MB) to do worse than
> expected. Context models like those in PAQ and cmix solve this
> problem, but they lack the arbitrarily deep feature hierarchies that
> allow neurons to represent highly abstract concepts. Natural language
> is completely learnable from unlabeled training data starting with the
> simplest concepts, using neurons to represent letters or phonemes,
> word segmentation rules, words, semantics, and grammatical categories
> and structures, in the same order that children learn these features.
> Both nncp and cmix use fixed dictionaries to convert the text to a
> stream of tokens to be modeled.
> 
> I am doing experiments on learning the rules for tokenization. Back in
> 2000 I experimented with finding word boundaries in text without spaces.
> These occur where there is low mutual information across the boundary.
> With an order 5 model, this predicts the missing spaces with about 75%
> accuracy. In my present experiments, I am using byte pair encoding.
> Start with a dictionary of 255 codes each representing one byte, plus
> one code for literal strings with no matching symbols. Find the pair
> that could encode the most characters and make a new symbol, replacing
> the least useful code. Repeat until there is no improvement, then code
> the input again and replace more symbols until the output stops
> getting smaller.
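> 
> A rough sketch of this kind of greedy loop, in Python. This is standard
> byte pair encoding that grows the dictionary rather than my fixed
> 256-code version, and it ignores the literal-escape code and the
> character-set restriction described below, but the merge step is the
> same idea:
> 
>   from collections import Counter
> 
>   def bpe(data: bytes, merges: int):
>       seq = list(data)                          # start with one symbol per byte
>       vocab = {i: bytes([i]) for i in range(256)}
>       next_id = 256
>       for _ in range(merges):
>           pairs = Counter(zip(seq, seq[1:]))    # count adjacent symbol pairs
>           if not pairs:
>               break
>           (a, b), count = pairs.most_common(1)[0]
>           if count < 2:
>               break                             # nothing worth merging
>           vocab[next_id] = vocab[a] + vocab[b]
>           out, i = [], 0                        # replace the pair everywhere
>           while i < len(seq):
>               if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
>                   out.append(next_id); i += 2
>               else:
>                   out.append(seq[i]); i += 1
>           seq, next_id = out, next_id + 1
>       return seq, vocab
> 
>   seq, vocab = bpe(b"the cat and the dog and the bird", merges=8)
>   print(len(seq), [vocab[s] for s in seq])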
> 
> Byte pair encoding by itself compresses, but to make the output more
> compressible it is important that the symbol strings represent
> independent semantic concepts, such as words, digits, or punctuation
> characters. To achieve that, I require that within a symbol all of the
> characters come from the same set (see the sketch after the list). These sets are:
> 
> Lower case letters a-z and & # ; (to allow Wikipedia markup like &lt; for <).
> @ (to indicate the next lower case letter should be capitalized).
> Digits 0-9.
> White space and < / > (to encode XML markup).
> All other characters are in their own set.
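> 
> A sketch of that grouping rule as a character-class function, matching
> the sets listed above (not necessarily how my code is organized):
> 
>   def char_class(c: str) -> str:
>       """Return the set a character belongs to. BPE symbols may only
>       combine characters from the same set."""
>       if ("a" <= c <= "z") or c in "&#;":
>           return "lower"           # letters plus & # ; for markup like &lt;
>       if c == "@":
>           return "cap"             # capitalization escape
>       if "0" <= c <= "9":
>           return "digit"
>       if c.isspace() or c in "</>":
>           return "xml"             # white space and XML punctuation
>       return "other:" + c          # every other character is its own set
> 
>   def same_set(a: str, b: str) -> bool:
>       return char_class(a) == char_class(b)
> 
>   print(same_set("t", "h"), same_set("1", "9"), same_set("a", "1"))  # True True False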
> 
> Encoding and decoding enwik9 takes about a minute. I tested the output
> with 5 different compression algorithms that don't have text-specific
> models. These are:
> 
> zip -9 (LZ77 and Huffman coding. Repeated strings are coded as
> pointers to the previous occurrence. -9 means max compression).
> 
> 7zip (LZ77 and arithmetic coding with a larger window).
> 
> bsc -b100 -T (BWT: suffix sorting followed by a low-order, fast-adapting
> model, with a block size of 100 MB).
> 
> ppmd -m256 -o16 -r1 (PPM byte-level prediction and arithmetic coding.
> Options for max compression using 256 MB memory, contexts up to 16
> bytes, and partial rebuilding of the statistics tree when out of memory).
> 
> zpaq -ms7ci1.1.1.1.2am (context mixing with an ICM-ISSE chain of orders
> 0, 1, 2, 3, 4, and 6, and a match model that predicts the next bit
> following the last context match of 8 bytes or longer, all mixed using a
> neural network selected by the order 0 context. An ICM maps the
> context hash to a bit history and then to a bit prediction. An ISSE
> mixes the previous prediction with a constant, using the context hash to
> select a pair of weights).
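> 
> The ISSE step is also easy to sketch: stretch the incoming prediction,
> pair it with a constant input, combine with two weights selected by the
> context, and adapt the weights after seeing the bit. A rough Python
> sketch of the idea, not zpaq's implementation (zpaq indexes by bit
> histories and uses fixed-point arithmetic):
> 
>   import math
> 
>   def stretch(p): return math.log(p / (1 - p))
>   def squash(x):  return 1 / (1 + math.exp(-x))
> 
>   class ISSE:
>       def __init__(self, size=4096, lr=0.01):
>           self.w = [[1.0, 0.0] for _ in range(size)]   # [scale, bias] per context
>           self.size, self.lr = size, lr
> 
>       def predict(self, ctx_hash, p_in):
>           self.i = ctx_hash % self.size
>           self.x = (stretch(min(max(p_in, 1e-6), 1 - 1e-6)), 1.0)  # prediction, constant
>           w = self.w[self.i]
>           return squash(w[0] * self.x[0] + w[1] * self.x[1])
> 
>       def update(self, p_out, bit):
>           err = bit - p_out
>           w = self.w[self.i]
>           w[0] += self.lr * err * self.x[0]
>           w[1] += self.lr * err * self.x[1]
> 
>   isse = ISSE()
>   p = isse.predict(ctx_hash=123456, p_in=0.7)   # refine a previous prediction
>   isse.update(p, bit=1)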
> 
> Here are the results.
> 
> Baseline enwik9
>  178,051,852 enwik9.zpaq
>  181,583,242 enwik9.bsc
>  184,909,536 enwik9.pmd
>  230,135,777 enwik9.7z
>  322,592,218 enwik9.zip
> 1,000,000,000 enwik9
> 
> Byte pair encoded.
>  166,471,095 x.zpaq
>  176,420,366 x.bsc
>  178,037,227 x.pmd
>  214,109,644 x.7z
>  298,750,012 x.zip
>  657,627,700 x
> 
> Here is the dictionary it built. Each number is the number of bytes
> encoded by the symbol. This is after the capitalization model: encoding
> "A" as "@a", etc., expands the input by 3.8% but reduces the overall
> compressed size.
> 
> 0 5211047 "
> "
> 1 4373548 "
> 
> "
> 2 1167495 "
>   "
> 3 117733608 " "
> 4 3606580 "  "
> 5 2931025 "&amp;"
> 6 4118380 "&gt;"
> 7 4096424 "&lt;"
> 8 1203660 "&lt;td&gt;"
> 9 11245188 "&quot;"
> 10 4976940 "''"
> 11 2875488 "'''"
> 12 2521806 "("
> 13 2523537 ")"
> 14 2490470 "*"
> 15 7978922 ","
> 16 3542288 "-"
> 17 9578832 "."
> 18 2689903 "/"
> 19 4211470 "0"
> 20 4036090 "1"
> 21 1301752 "18"
> 22 2700718 "19"
> 23 3950090 "2"
> 24 2313888 "200"
> 25 2736971 "3"
> 26 2579245 "4"
> 27 2596182 "5"
> 28 2372995 "6"
> 29 2005136 "7"
> 30 1792449 "8"
> 31 1889329 "9"
> 32 3536470 ":"
> 33 3714616 "</"
> 34 1533437 "="
> 35 2719548 "=="
> 36 2100736 ">"
> 37 4937064 ">
>         <"
> 38 11365812 ">
>       <"
> 39 2434260 ">
>       </"
> 40 5119065 ">
>     <"
> 41 1947408 ">
>     </"
> 42 1460556 ">
>   </"
> 43 41512171 "@"
> 44 19790976 "[["
> 45 19793156 "]]"
> 46 7868460 "a"
> 47 2253310 "ac"
> 48 1917990 "ad"
> 49 2411025 "age"
> 50 6215072 "al"
> 51 1146796 "also"
> 52 1408732 "am"
> 53 2096912 "american"
> 54 5741346 "an"
> 55 9907014 "and"
> 56 4131752 "ar"
> 57 3551010 "are"
> 58 3657708 "as"
> 59 3654828 "at"
> 60 2307250 "ation"
> 61 2068732 "b"
> 62 2246596 "ba"
> 63 4096522 "be"
> 64 1856768 "bl"
> 65 2250030 "bo"
> 66 1784450 "br"
> 67 1696670 "bu"
> 68 1661308 "by"
> 69 4447960 "c"
> 70 4783624 "ca"
> 71 3377360 "category"
> 72 3413858 "ce"
> 73 2556450 "census"
> 74 1331328 "cent"
> 75 4422008 "ch"
> 76 2755628 "ci"
> 77 4664290 "co"
> 78 3465348 "com"
> 79 2752603 "comment"
> 80 1073664 "cont"
> 81 5374974 "contributor"
> 82 1448808 "county"
> 83 2159750 "ct"
> 84 1639685 "ction"
> 85 7228135 "d"
> 86 1698436 "da"
> 87 5322652 "de"
> 88 1425969 "der"
> 89 3671270 "di"
> 90 1295300 "ding"
> 91 8313950 "e"
> 92 2033368 "ea"
> 93 4077406 "ed"
> 94 1506072 "el"
> 95 1415286 "em"
> 96 3160132 "en"
> 97 2125554 "ent"
> 98 4009852 "er"
> 99 4033320 "es"
> 100 3449798 "f"
> 101 1679906 "fa"
> 102 2191754 "fe"
> 103 2722322 "fi"
> 104 5207685 "for"
> 105 2664300 "from"
> 106 4533981 "g"
> 107 1814576 "ga"
> 108 2976462 "ge"
> 109 1615233 "ght"
> 110 2039118 "gr"
> 111 2393602 "h"
> 112 2643402 "ha"
> 113 1485180 "have"
> 114 2985598 "he"
> 115 1823524 "hi"
> 116 1982520 "his"
> 117 2019822 "ho"
> 118 1008135 "household"
> 119 1102920 "households"
> 120 1653396 "http"
> 121 5612158 "i"
> 122 1911778 "ia"
> 123 2482484 "ic"
> 124 3937786 "id"
> 125 2061248 "il"
> 126 1289110 "image"
> 127 11788674 "in"
> 128 3959475 "ing"
> 129 1121580 "inter"
> 130 1706220 "ion"
> 131 5791098 "is"
> 132 4533192 "it"
> 133 1693526 "j"
> 134 3443538 "k"
> 135 1935802 "ke"
> 136 1074780 "king"
> 137 1676720 "km&amp;sup"
> 138 4688188 "l"
> 139 3642740 "la"
> 140 1879684 "land"
> 141 4808666 "le"
> 142 5253150 "li"
> 143 3271514 "ll"
> 144 3282854 "lo"
> 145 2545420 "ly"
> 146 4316302 "m"
> 147 4111580 "ma"
> 148 1119480 "males"
> 149 2319318 "man"
> 150 1586886 "mar"
> 151 1228936 "mber"
> 152 4692548 "me"
> 153 1968364 "ment"
> 154 3559026 "mi"
> 155 1624740 "mi&amp;sup"
> 156 3325900 "mo"
> 157 6079102 "n"
> 158 2871918 "na"
> 159 1174776 "national"
> 160 1646913 "nce"
> 161 2328648 "nd"
> 162 4591032 "ne"
> 163 3307664 "ng"
> 164 2304930 "ni"
> 165 3850648 "no"
> 166 1935778 "ns"
> 167 2174184 "nt"
> 168 6369810 "o"
> 169 10290952 "of"
> 170 1882854 "ol"
> 171 4176046 "on"
> 172 1353369 "one"
> 173 4345328 "or"
> 174 1680510 "other"
> 175 2924346 "ou"
> 176 1277936 "over"
> 177 5716530 "p"
> 178 2438090 "pa"
> 179 2225092 "page"
> 180 1268168 "part"
> 181 3636874 "pe"
> 182 2404846 "pl"
> 183 2431580 "po"
> 184 2702850 "population"
> 185 1053200 "port"
> 186 1249800 "pres"
> 187 1996080 "preserve"
> 188 2405145 "pro"
> 189 7254014 "r"
> 190 4395294 "ra"
> 191 9660304 "re"
> 192 3926368 "revision"
> 193 5185568 "ri"
> 194 3029684 "ro"
> 195 1910144 "rs"
> 196 1231892 "rt"
> 197 11133241 "s"
> 198 1587710 "sa"
> 199 1517732 "sc"
> 200 5074948 "se"
> 201 2202958 "sh"
> 202 3748654 "si"
> 203 2194206 "so"
> 204 1044084 "some"
> 205 1752954 "sp"
> 206 1472240 "space"
> 207 2045926 "ss"
> 208 7298282 "st"
> 209 1140720 "states"
> 210 2392062 "su"
> 211 6344108 "t"
> 212 2786192 "ta"
> 213 3528818 "te"
> 214 1681107 "ted"
> 215 3591453 "ter"
> 216 2127476 "text"
> 217 3669638 "th"
> 218 2487472 "that"
> 219 26846655 "the"
> 220 1700620 "there"
> 221 1412172 "this"
> 222 4162324 "ti"
> 223 4382991 "timestamp"
> 224 3851636 "tion"
> 225 2792300 "title"
> 226 7878548 "to"
> 227 2647202 "tr"
> 228 1168294 "tt"
> 229 1495056 "ty"
> 230 6851023 "u"
> 231 1791650 "ul"
> 232 2941346 "un"
> 233 1188445 "under"
> 234 1237356 "united"
> 235 3364592 "ur"
> 236 2702458 "us"
> 237 1763349 "use"
> 238 3288768 "username"
> 239 3614588 "ve"
> 240 3622539 "ver"
> 241 2596844 "vi"
> 242 1088152 "ving"
> 243 2913531 "w"
> 244 1962994 "wa"
> 245 2598747 "was"
> 246 2613848 "we"
> 247 1694874 "wh"
> 248 1790215 "which"
> 249 1656796 "wi"
> 250 3236908 "with"
> 251 1244163 "wor"
> 252 663000 "ww"
> 253 6813940 "y"
> 254 4907106 "|"
> 255 18301782 - literal bytes.
> 
> Byte pair encoding learns the XML and Wikipedia markup quite well. For
> example, "&lt;td&gt;" represents <td>, a marker for separating table
> data. The file is in XML format like this, where the article is in the
> <text> tag. The characters < " > are written as &lt; &quot; &gt; so
> the XML parser is not confused. Byte pair encoding finds an efficient
> encoding of the XML and markup like [[link to article]] or ===Level 3
> Heading=== or '''bold text''' or &lt;ref&gt; for a numbered reference
> in addition to learning an efficient way to encode words with fewer
> symbols.
> 
> <page>
>   <title>...</title>
>   <id>...</id>                        (decimal)
>   <revision>
>     <id>...</id>                      (decimal)
>     <restrictions>...</restrictions>  (optional)
>     <timestamp>...</timestamp>        (format yyyy-mm-ddThh:mm:ssZ)
>     <contributor>
>       <ip>...</ip>                    (optional)
>       <username>...</username>        (if no <ip>)
>       <id>...</id>                    (if no <ip>, 1-6 digits)
>     </contributor>
>     <minor />                         (optional)
>     <comment>...</comment>            (optional, may span lines)
>     <text xml:space="preserve">...</text>  (may span lines)
>   </revision>
> </page>
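> 
> As a quick check of that structure, here is a minimal sketch that pulls
> the title and article text out of each page (a line-by-line scan relying
> on the layout above, not a real XML parser, and not part of the
> benchmark code):
> 
>   import re
> 
>   def pages(path):
>       """Yield (title, text) for each <page> in an enwik-style dump."""
>       buf = []
>       with open(path, encoding="utf-8", errors="replace") as f:
>           for line in f:
>               buf.append(line)
>               if "</page>" in line:
>                   chunk = "".join(buf)
>                   buf = []
>                   title = re.search(r"<title>(.*?)</title>", chunk, re.S)
>                   text = re.search(r'<text xml:space="preserve">(.*?)</text>', chunk, re.S)
>                   yield (title.group(1) if title else "",
>                          text.group(1) if text else "")
> 
>   for i, (title, text) in enumerate(pages("enwik9")):
>       print(title, len(text))
>       if i == 2:
>           break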
> 
> --
> -- Matt Mahoney, mattmahone...@gmail.com
