https://twitter.com/jabowery/status/1760015755792294174
https://youtu.be/zduSFxRajkE

On Tue, Nov 21, 2023 at 7:20 PM Matt Mahoney <mattmahone...@gmail.com> wrote:

I started the Large Text Compression Benchmark (LTCB) in 2006 (https://mattmahoney.net/dc/text.html) with the claim that all you need to pass the Turing test is text prediction, which you can measure with compression. Both the benchmark and the Hutter prize use the same 1 GB text file (enwik9), with the goal of encouraging research in language modeling. Those results are now paying off, with the top compressors using neural networks just like the LLMs that have been released in the last year. These are worth studying if you plan to do any actual development work in AI.

The top 2 programs are nncp and cmix. nncp is described in two papers linked on the benchmark. The first paper describes two versions using LSTM and transformer networks, with the LSTM performing slightly better. Later versions of nncp improved the transformer model, which now takes the top spot on LTCB with a compression ratio of 0.1072. It uses a 6-layer transformer with 200M parameters. The input is the current token and the output is a vector of probabilities over a 16K vocabulary for the next token. It takes 3 days to compress or decompress on a GeForce RTX 3090 GPU with 10,496 CUDA cores.

Second place is cmix, which doesn't need a GPU but takes 8 days on a Core i7-7700 with 32 GB of memory to achieve 0.1098 compression. fast-cmix is a modified version that won the Hutter prize with 0.1137, meeting the constraints of 10 MB memory and 1.5 days CPU time on my i7-1165G7 Lenovo laptop. It is a tuned version of starlit, which won an earlier Hutter prize by sorting the Wikipedia articles in the test file enwik9 by mutual information. cmix uses a PAQ-based model which combines the predictions of many context models using simple 2-layer neural networks. To save memory it uses a large PPM model, which predicts at the byte level based on the longest matching context for each possible outcome. This is more memory efficient than the bit-level predictors used in PAQ. cmix preprocesses the input by mapping words to 1-3 byte tokens using a fixed 80K dictionary, similar to wxrt and drt. The dictionary is organized to group similar words together, like "mother" and "father", to allow partial bitwise contexts.

The nncp papers hint at possible improvements. Ideally a neural network should use one parameter per bit of compressed training data, or 1 billion. It could also use a larger vocabulary. The paper notes that the network learns slowly in spite of making 20 training passes every 64 tokens, which causes enwik8 compression (the first 100 MB) to do worse than expected. Context models like those in PAQ and cmix solve this problem, but they lack the arbitrarily deep feature hierarchies that allow neurons to represent highly abstract concepts. Natural language is completely learnable from unlabeled training data starting with the simplest concepts, using neurons to represent letters or phonemes, word segmentation rules, words, semantics, and grammatical categories and structures, in the same order that children learn these features. Both nncp and cmix use fixed dictionaries to convert the text to a stream of tokens to be modeled.

I am doing experiments on learning the rules for tokenization. Back in 2000 I experimented with finding word boundaries in text without spaces. These occur where there is low mutual information across boundaries. With an order-5 model, this predicts the missing spaces with about 75% accuracy.
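For illustration, here is a minimal Python sketch of that kind of boundary test, not the original code: train an order-5 character model, then guess a word boundary wherever the next character has low pointwise mutual information with its left context. The model order, the add-one smoothing, and the threshold are illustrative assumptions.

# Sketch only: flag positions in unspaced text where the next character is
# poorly predicted by its left context (low pointwise mutual information).
import math
from collections import Counter

ORDER = 5

def train(text):
    # text: training text with spaces removed
    ctx, pair, chars = Counter(), Counter(), Counter()
    for i in range(ORDER, len(text)):
        c, nxt = text[i-ORDER:i], text[i]
        ctx[c] += 1
        pair[(c, nxt)] += 1
        chars[nxt] += 1
    return ctx, pair, chars, max(1, len(text) - ORDER)

def pmi(model, c, nxt):
    ctx, pair, chars, n = model
    p_cond = (pair[(c, nxt)] + 1) / (ctx[c] + 256)   # P(next | context), add-1 smoothed
    p_next = (chars[nxt] + 1) / (n + 256)            # P(next)
    return math.log2(p_cond / p_next)

def insert_spaces(text, model, threshold=0.0):
    out = list(text[:ORDER])
    for i in range(ORDER, len(text)):
        c, nxt = text[i-ORDER:i], text[i]
        if pmi(model, c, nxt) < threshold:
            out.append(' ')   # low mutual information across this point: guess a boundary
        out.append(nxt)
    return ''.join(out)

Reproducing anything like the 75% figure would require training on real text with the spaces stripped and sweeping the threshold; the constants above are only placeholders.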
In my present experiments, I am using byte pair encoding. Start with a dictionary of 255 codes, each representing one byte, plus one code for literal strings with no matching symbols. Find the pair that could encode the most characters and make it a new symbol, replacing the least useful code. Repeat until there is no improvement, then code the input again and replace more symbols until the output stops getting smaller.

Byte pair encoding by itself compresses, but to make the output more compressible it is important that the symbol strings represent independent semantic concepts, such as words, digits, or punctuation characters. To achieve that, I require that all of the characters within a symbol come from the same set. These sets are:

Lower case letters a-z and & # ; (to allow Wikipedia markup entities like &lt; for <).
@ (to indicate the next lower case letter should be capitalized).
Digits 0-9.
White space and < / > (to encode XML markup).
All other characters are in their own set.
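A rough Python sketch of one round of this constrained pair selection follows. The character classes and the greedy scoring below are simplified assumptions; the 256-code budget, the literal escape code, and the "replace the least useful code" step are omitted.

# Sketch only: greedy byte pair merging where a pair may only become a new
# symbol if all of its characters belong to the same character set.
from collections import Counter

LOWER  = set("abcdefghijklmnopqrstuvwxyz&#;")  # letters plus & # ; for markup entities
DIGITS = set("0123456789")
MARKUP = set(" \t\n</>")                       # white space and < / > for XML tags
CAPS   = set("@")                              # capitalization flag

def char_class(ch):
    for i, s in enumerate((LOWER, DIGITS, MARKUP, CAPS)):
        if ch in s:
            return i
    return ch                                  # every other character is its own set

def mergeable(a, b):
    return len({char_class(ch) for ch in a + b}) == 1

def best_pair(symbols):
    # score each adjacent, mergeable pair by (roughly) how many characters
    # the merge would cover in the coded stream
    score = Counter()
    for x, y in zip(symbols, symbols[1:]):
        if mergeable(x, y):
            score[(x, y)] += len(x) + len(y)
    return max(score, key=score.get) if score else None

def merge_once(symbols):
    pair = best_pair(symbols)
    if pair is None:
        return symbols, None
    new_sym, out, i = pair[0] + pair[1], [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(new_sym)                # replace the pair with the new symbol
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out, new_sym

Starting from symbols = list(text) and repeating merge_once until no pair qualifies, or until the coded output stops shrinking, gives a rough analogue of the dictionary shown below, minus the capitalization transform.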
Encoding and decoding enwik9 takes about a minute. I tested the output with 5 different compression algorithms that don't have text-specific models. These are:

zip -9 (LZ77 and Huffman coding. Repeated strings are coded as pointers to the previous occurrence; -9 means max compression).

7zip (LZ77 and arithmetic coding with a larger window).

bsc -b100 -T (BWT, i.e. suffix sorting followed by a low order, fast adapting model, with a block size of 100 MB).

ppmd -m256 -o16 -r1 (PPM byte-level prediction and arithmetic coding. Options for max compression using 256 MB memory, contexts up to 16 bytes, and partially rebuilding the statistics tree when out of memory).

zpaq -ms7ci1.1.1.1.2am (context mixing with an ICM-ISSE chain of orders 0, 1, 2, 3, 4, 6, and a match model that predicts the next bit following the last context match of 8 or longer, all mixed using a neural network selected by the order 0 context. An ICM maps the context hash to a bit history and then to a bit prediction. An ISSE mixes the previous prediction with a constant, using the context hash to select a pair of weights).

Here are the results.

Baseline enwik9:
178,051,852 enwik9.zpaq
181,583,242 enwik9.bsc
184,909,536 enwik9.pmd
230,135,777 enwik9.7z
322,592,218 enwik9.zip
1,000,000,000 enwik9

Byte pair encoded:
166,471,095 x.zpaq
176,420,366 x.bsc
178,037,227 x.pmd
214,109,644 x.7z
298,750,012 x.zip
657,627,700 x

Here is the dictionary it built. Each number is the number of bytes encoded by the symbol. This is after the capitalization model: encoding "A" as "@a" etc. expands the input by 3.8% but reduces the size overall. (A sketch of this transform follows the dictionary.)

0 5211047 " > "
1 4373548 " > > "
2 1167495 " > "
3 117733608 " "
4 3606580 " "
5 2931025 "&amp;"
6 4118380 "&gt;"
7 4096424 "&lt;"
8 1203660 "&lt;td&gt;"
9 11245188 "&quot;"
10 4976940 "''"
11 2875488 "'''"
12 2521806 "("
13 2523537 ")"
14 2490470 "*"
15 7978922 ","
16 3542288 "-"
17 9578832 "."
18 2689903 "/"
19 4211470 "0"
20 4036090 "1"
21 1301752 "18"
22 2700718 "19"
23 3950090 "2"
24 2313888 "200"
25 2736971 "3"
26 2579245 "4"
27 2596182 "5"
28 2372995 "6"
29 2005136 "7"
30 1792449 "8"
31 1889329 "9"
32 3536470 ":"
33 3714616 "</"
34 1533437 "="
35 2719548 "=="
36 2100736 ">"
37 4937064 "> > <"
38 11365812 "> > <"
39 2434260 "> > </"
40 5119065 "> > <"
41 1947408 "> > </"
42 1460556 "> > </"
43 41512171 "@"
44 19790976 "[["
45 19793156 "]]"
46 7868460 "a"
47 2253310 "ac"
48 1917990 "ad"
49 2411025 "age"
50 6215072 "al"
51 1146796 "also"
52 1408732 "am"
53 2096912 "american"
54 5741346 "an"
55 9907014 "and"
56 4131752 "ar"
57 3551010 "are"
58 3657708 "as"
59 3654828 "at"
60 2307250 "ation"
61 2068732 "b"
62 2246596 "ba"
63 4096522 "be"
64 1856768 "bl"
65 2250030 "bo"
66 1784450 "br"
67 1696670 "bu"
68 1661308 "by"
69 4447960 "c"
70 4783624 "ca"
71 3377360 "category"
72 3413858 "ce"
73 2556450 "census"
74 1331328 "cent"
75 4422008 "ch"
76 2755628 "ci"
77 4664290 "co"
78 3465348 "com"
79 2752603 "comment"
80 1073664 "cont"
81 5374974 "contributor"
82 1448808 "county"
83 2159750 "ct"
84 1639685 "ction"
85 7228135 "d"
86 1698436 "da"
87 5322652 "de"
88 1425969 "der"
89 3671270 "di"
90 1295300 "ding"
91 8313950 "e"
92 2033368 "ea"
93 4077406 "ed"
94 1506072 "el"
95 1415286 "em"
96 3160132 "en"
97 2125554 "ent"
98 4009852 "er"
99 4033320 "es"
100 3449798 "f"
101 1679906 "fa"
102 2191754 "fe"
103 2722322 "fi"
104 5207685 "for"
105 2664300 "from"
106 4533981 "g"
107 1814576 "ga"
108 2976462 "ge"
109 1615233 "ght"
110 2039118 "gr"
111 2393602 "h"
112 2643402 "ha"
113 1485180 "have"
114 2985598 "he"
115 1823524 "hi"
116 1982520 "his"
117 2019822 "ho"
118 1008135 "household"
119 1102920 "households"
120 1653396 "http"
121 5612158 "i"
122 1911778 "ia"
123 2482484 "ic"
124 3937786 "id"
125 2061248 "il"
126 1289110 "image"
127 11788674 "in"
128 3959475 "ing"
129 1121580 "inter"
130 1706220 "ion"
131 5791098 "is"
132 4533192 "it"
133 1693526 "j"
134 3443538 "k"
135 1935802 "ke"
136 1074780 "king"
137 1676720 "km&sup"
138 4688188 "l"
139 3642740 "la"
140 1879684 "land"
141 4808666 "le"
142 5253150 "li"
143 3271514 "ll"
144 3282854 "lo"
145 2545420 "ly"
146 4316302 "m"
147 4111580 "ma"
148 1119480 "males"
149 2319318 "man"
150 1586886 "mar"
151 1228936 "mber"
152 4692548 "me"
153 1968364 "ment"
154 3559026 "mi"
155 1624740 "mi&sup"
156 3325900 "mo"
157 6079102 "n"
158 2871918 "na"
159 1174776 "national"
160 1646913 "nce"
161 2328648 "nd"
162 4591032 "ne"
163 3307664 "ng"
164 2304930 "ni"
165 3850648 "no"
166 1935778 "ns"
167 2174184 "nt"
168 6369810 "o"
169 10290952 "of"
170 1882854 "ol"
171 4176046 "on"
172 1353369 "one"
173 4345328 "or"
174 1680510 "other"
175 2924346 "ou"
176 1277936 "over"
177 5716530 "p"
178 2438090 "pa"
179 2225092 "page"
180 1268168 "part"
181 3636874 "pe"
182 2404846 "pl"
183 2431580 "po"
184 2702850 "population"
185 1053200 "port"
186 1249800 "pres"
187 1996080 "preserve"
188 2405145 "pro"
189 7254014 "r"
190 4395294 "ra"
191 9660304 "re"
192 3926368 "revision"
193 5185568 "ri"
194 3029684 "ro"
195 1910144 "rs"
196 1231892 "rt"
197 11133241 "s"
198 1587710 "sa"
199 1517732 "sc"
200 5074948 "se"
201 2202958 "sh"
202 3748654 "si"
203 2194206 "so"
204 1044084 "some"
205 1752954 "sp"
206 1472240 "space"
207 2045926 "ss"
208 7298282 "st"
209 1140720 "states"
210 2392062 "su"
211 6344108 "t"
212 2786192 "ta"
213 3528818 "te"
214 1681107 "ted"
215 3591453 "ter"
216 2127476 "text"
217 3669638 "th"
218 2487472 "that"
219 26846655 "the"
220 1700620 "there"
221 1412172 "this"
222 4162324 "ti"
223 4382991 "timestamp"
224 3851636 "tion"
225 2792300 "title"
226 7878548 "to"
227 2647202 "tr"
228 1168294 "tt"
229 1495056 "ty"
230 6851023 "u"
231 1791650 "ul"
232 2941346 "un"
233 1188445 "under"
234 1237356 "united"
235 3364592 "ur"
236 2702458 "us"
237 1763349 "use"
238 3288768 "username"
239 3614588 "ve"
240 3622539 "ver"
241 2596844 "vi"
242 1088152 "ving"
243 2913531 "w"
244 1962994 "wa"
245 2598747 "was"
246 2613848 "we"
247 1694874 "wh"
248 1790215 "which"
249 1656796 "wi"
250 3236908 "with"
251 1244163 "wor"
252 663000 "ww"
253 6813940 "y"
254 4907106 "|"
255 18301782 - literal bytes.
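For reference, a minimal Python sketch of the capitalization transform mentioned above; escaping of any literal "@" in the source text is ignored here, which the real transform has to handle.

# Sketch only: replace each upper case letter with '@' plus its lower case
# form, so the byte pair dictionary only has to model lower case words.
def encode_caps(text):
    return "".join("@" + c.lower() if c.isupper() else c for c in text)

def decode_caps(text):
    out, cap = [], False
    for c in text:
        if c == "@":
            cap = True                # capitalize the next letter
        else:
            out.append(c.upper() if cap else c)
            cap = False
    return "".join(out)

assert decode_caps(encode_caps("The United States")) == "The United States"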
"sh" > 202 3748654 "si" > 203 2194206 "so" > 204 1044084 "some" > 205 1752954 "sp" > 206 1472240 "space" > 207 2045926 "ss" > 208 7298282 "st" > 209 1140720 "states" > 210 2392062 "su" > 211 6344108 "t" > 212 2786192 "ta" > 213 3528818 "te" > 214 1681107 "ted" > 215 3591453 "ter" > 216 2127476 "text" > 217 3669638 "th" > 218 2487472 "that" > 219 26846655 "the" > 220 1700620 "there" > 221 1412172 "this" > 222 4162324 "ti" > 223 4382991 "timestamp" > 224 3851636 "tion" > 225 2792300 "title" > 226 7878548 "to" > 227 2647202 "tr" > 228 1168294 "tt" > 229 1495056 "ty" > 230 6851023 "u" > 231 1791650 "ul" > 232 2941346 "un" > 233 1188445 "under" > 234 1237356 "united" > 235 3364592 "ur" > 236 2702458 "us" > 237 1763349 "use" > 238 3288768 "username" > 239 3614588 "ve" > 240 3622539 "ver" > 241 2596844 "vi" > 242 1088152 "ving" > 243 2913531 "w" > 244 1962994 "wa" > 245 2598747 "was" > 246 2613848 "we" > 247 1694874 "wh" > 248 1790215 "which" > 249 1656796 "wi" > 250 3236908 "with" > 251 1244163 "wor" > 252 663000 "ww" > 253 6813940 "y" > 254 4907106 "|" > 255 18301782 - literal bytes. > > Byte pair encoding learns the XML and Wikipedia markup quite well. For > example, "<td>" represents <td>, a marker for separating table > data. The file is in XML format like this, where the article is in the > <text> tag. The characters < " > are written as < " > so > the XML parser is not confused. Byte pair encoding finds an efficient > encoding of the XML and markup like [[link to article]] or ===Level 3 > Heading=== or '''bold text''' or <ref> for a numbered reference > in addition to learning an efficient way to encode words with fewer > symbols. > > <page> > <title>...</title> > <id>...</id> (decimal) > <revision> > <id>...</id> (decimal) > <restrictions>...</restrictions> (optional) > <timestamp>...</timestamp> (format yyyy-mm-ddThh:mm:ssZ) > <contributor> > <ip>...</ip> (optional) > <username>...</username> (if no <ip>) > <id>...</id> (if no <ip>, 1-6 digits) > </contributor> > <minor /> (optional) > <comment>...</comment> (optional, may span lines) > <text xml:space="preserve">...</text> (may span lines) > </revision> > </page> > > -- > -- Matt Mahoney, mattmahone...@gmail.com ------------------------------------------ Artificial General Intelligence List: AGI Permalink: https://agi.topicbox.com/groups/agi/Tdc371ce11a040352-Mc445ea72fa9ab0efdf447062 Delivery options: https://agi.topicbox.com/groups/agi/subscription