And, even worse, I suggested the entire *change log* of Wikipedia as the
corpus, so as to expose the latent identities of information sabotage in
the language models.

Google DeepMind can, and should, finance compression prizes with such
expanded corpora, based on the lessons learned with enwik8 and enwik9.

Unfortunately, measures inferior to self-extracting archive size, such as
"perplexity" or *worse*, now dominate SOTA publications.

For example, one recent publication claimed 0.99 bits per character on
enwik8, but when I went looking for the size of their model, here's what I
found:

transformer-xl/tf/models/pretrained_xl/tf_enwik8/model$ ls -alt
total 3251968
drwxrwxrwx 1 jabowery jabowery       4096 Jan 14 12:53 .
drwxrwxrwx 1 jabowery jabowery       4096 Jan 14 12:51 ..
-rwxrwxrwx 1 jabowery jabowery        171 Dec 25  2018 checkpoint
-rwxrwxrwx 1 jabowery jabowery 3326781856 Dec 25  2018 model.ckpt-0.data-00000-of-00001
-rwxrwxrwx 1 jabowery jabowery      30159 Dec 25  2018 model.ckpt-0.index
-rwxrwxrwx 1 jabowery jabowery    3195458 Dec 25  2018 model.ckpt-0.meta
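
That weight file alone is about 3.3 GB against enwik8's 10^8 bytes.  Here's
a back-of-envelope sketch (my arithmetic, not the paper's method) of what
the reported figure looks like once the checkpoint is charged:

# Assumptions: enwik8 is 10^8 bytes and only the raw weight file
# (model.ckpt-0.data-00000-of-00001) is charged -- no code, index or meta.
corpus_bytes = 10**8
checkpoint_bytes = 3_326_781_856            # from the listing above

reported_bpc = 0.99                         # the paper's test-set figure
payload_bits = reported_bpc * corpus_bytes  # compressed text, at face value
model_bits = 8 * checkpoint_bytes           # cost of shipping the weights

print((payload_bits + model_bits) / corpus_bytes)   # ~267 bpc, not 0.99
# Even compressed tenfold, the weights would still outweigh the corpus
# they are supposed to encode.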

A more recent paper purports 0.97 bpc and, although its authors do admit
the problematic nature of measuring model complexity, they justify
excluding it on the basis that they used the same "model setup" as the
0.99 bpc TransformerXL -- purportedly the prior "SOTA".

Here's my LinkedIn post on the decay of rigor in language model SOTA
metrics, as compared to the size of a self-extracting archive:

The so-called "SOTA" (State Of The Art) in the language modeling world has
wandered so far from the MDL (minimum description length) approximation of
Kolmogorov Complexity as to render papers purporting "SOTA" results highly
suspect.

An example is Table 4 provided by the most recent paper purporting a SOTA
result with the enwik8 corpus.

https://lnkd.in/ejJSNPC

The judging* criterion for the Hutter Prize is the size of a self-extracting
archive of the enwik8 corpus, so as to standardize the algorithmic resources
available to the archive.  This is essential for MDL commensurability.
Dividing the corpus into training and testing sets is neither necessary nor
desirable under this metric.
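
For concreteness, here is a minimal sketch of that accounting (the file
names are hypothetical; the point is that decompressor, model parameters,
and compressed data are all charged against the same budget):

import os

def self_extracting_bpc(paths, corpus_bytes=10**8):
    """Bits per character of the corpus, counting every file the entrant
    ships: decompressor, model parameters, and compressed data alike."""
    total_bytes = sum(os.path.getsize(p) for p in paths)
    return 8 * total_bytes / corpus_bytes

# e.g. self_extracting_bpc(["archive8.exe"])  # one self-extracting binary

A larger model pays for itself under this measure only if its predictions
save more bits than its own description costs.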

Controlling for the same "model setup" is a big step in the right direction
-- as it increases the commensurability with TransformerXL -- particularly
as compared to the other items in Table 4.  Model ablation can produce even
more commensurable measures, but it would be helpful for SOTA comparisons
to be more rigorous in defining the algorithmic resources assumed in their
measurements.

This improved rigor would expose just how important purported improvements,
such as 0.99 to 0.97 bpc, can be.
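
To put that difference in scale (a hedged back-of-envelope on enwik8's 10^8
characters, ignoring everything but the compressed output):

corpus_chars = 10**8
saved_bytes = (0.99 - 0.97) * corpus_chars / 8
print(round(saved_bytes))   # 250000 bytes, i.e. about 250 KB
# A ~250 KB gain is easily swamped by unreported differences in model or
# program size (compare the 3.3 GB checkpoint above).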

*I'm on the Hutter Prize judging committee.


On Mon, Jan 27, 2020 at 3:04 PM Matt Mahoney <mattmahone...@gmail.com>
wrote:

>
>
> On Mon, Jan 27, 2020, 12:04 PM <immortal.discover...@gmail.com> wrote:
>
>> I see the Hutter Prize is a separate contest from Matt's contest/rules:
>> http://mattmahoney.net/dc/textrules.html
>>
>
> Marcus Hutter and I couldn't agree on the details of the contest, which is
> why there are two almost identical contests.
>
> He is offering prize money, so I understand the need for strict hardware
> restrictions (1 GB RAM and 8 hours x 2.2 GHz to extract 100 MB of text) to
> make the contest fair and accessible. But I think this is unrealistic for
> AGI. The human brain takes 20 years to process 1 GB of language, which is
> 10^25 operations on 6 x 10^14 synapses.
>
> The first main result of my 12 years of testing 1000+ versions of 200
> compressors is that compression (as a measure of prediction accuracy or
> intelligence) increases with the log of computing time and the log of
> memory (and probably the log of code complexity, which I didn't measure).
> The best way to establish this relationship is to test over as wide a range
> as possible by removing time and hardware restrictions. The top ranked
> program (cmix) requires 32 GB of RAM and takes a week, which is about a
> million times more time and memory than the fastest programs. But it is
> still a billion times faster and uses 100,000 times less memory than a
> human brain sized neural network.
>
> The other main result is that the most effective text compression
> algorithms are based on neural networks that model human language learning
> (lexicon, semantics, and grammar, in that order). But the grammatical
> modeling is rudimentary and probably requires a lot more hardware to model
> properly.
>

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/T65747f0622d5047f-M2534be684467494d8ecbf677
Delivery options: https://agi.topicbox.com/groups/agi/subscription
