"what method could directly group hierarchies of elements in language
which share predictions?"

My first gut reaction is some form of evolutionary learning where the
genomes are element-groups.

Thinking in terms of NN-ish models, this might mean some Neural
Darwinism-type approach for evolving the groupings.
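
A minimal sketch of the evolutionary idea, under loud assumptions: this is
a plain mutation-plus-selection search, not Neural Darwinism proper; the toy
corpus, the names (next_word_counts, fitness, evolve), the fixed number of
groups, and the choice of "shared predictions" = similar next-word counts
are all hypothetical choices made for illustration. Each genome assigns
every word to a group, and fitness rewards groups whose members predict
the same following words.

```python
import math
import random
from collections import Counter, defaultdict

# Toy corpus, invented purely for illustration.
CORPUS = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat ran to the door",
    "a dog ran to the gate",
]


def next_word_counts(sentences):
    """Map each word to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fitness(genome, vocab, counts):
    """Sum of each word's similarity to its group's pooled predictions."""
    pooled = defaultdict(Counter)
    for word, group in zip(vocab, genome):
        pooled[group].update(counts[word])
    return sum(cosine(counts[w], pooled[g]) for w, g in zip(vocab, genome))


def evolve(vocab, counts, n_groups=4, pop_size=30, generations=60, seed=0):
    """Evolve word-to-group assignments; the genome is the grouping."""
    rng = random.Random(seed)

    def score(genome):
        return fitness(genome, vocab, counts)

    population = [[rng.randrange(n_groups) for _ in vocab]
                  for _ in range(pop_size)]
    history = []  # best fitness at the start of each generation
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        history.append(score(population[0]))
        survivors = population[: pop_size // 2]  # top half survives unchanged
        children = []
        for parent in survivors:
            child = list(parent)
            # Mutation: move one randomly chosen word to a random group.
            child[rng.randrange(len(child))] = rng.randrange(n_groups)
            children.append(child)
        population = survivors + children
    return max(population, key=score), history


counts = next_word_counts(CORPUS)
vocab = sorted(counts)  # only words that are followed by something
best, history = evolve(vocab, counts)

groups = defaultdict(list)
for word, group in zip(vocab, best):
    groups[group].append(word)
print(dict(groups))
```

In this toy data "cat" and "dog" have identical next-word counts, so a
grouping that puts them together scores well on those two words. Getting
hierarchies rather than flat groups would need something more, e.g.
re-running the search with groups treated as the elements; that step is
not shown here.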


On Sat, Jun 25, 2022 at 3:58 AM Rob Freeman <chaotic.langu...@gmail.com> wrote:
> I've been taking a closer look at transformers. The big advance over LSTM was 
> that they relate prediction to long distance dependencies directly, rather 
> than passing long distance dependencies down a long recurrence chain. That's 
> the whole "attention" shtick. I knew that. Nice.
> But something I was less aware of was that having broken long distance 
> dependencies from the recurrence mechanism seems to have liberated them to go 
> wild with directly representing dependencies. And with multi layers it seems 
> they are building hierarchies over what they are "attending" to. So they are 
> basically building grammars.
> This paper makes that clear:
> Piotr Nawrot, Hierarchical Transformers are More Efficient Language Models.
> https://youtu.be/soqWNyrdjkw
> They show that middle layers of language transformers explicitly generalize 
> to reduce dimensions. That's a grammar.
> The question is, whether these grammars are different for each sentence in 
> their data. If they are different they might reduce the dimensions of 
> representation each time, but not in any way which can be abstracted 
> universally.
> If the grammars generated are different for each sentence, then the advantage 
> of transformers over attempts to learn grammar, like OpenCog's, will be that 
> ignoring the hierarchies created and focusing solely on the prediction task, 
> frees them from the expectation of universal primitives. They can generate a 
> different hierarchy for each data sentence, and nobody notices. Ignorance is 
> bliss.
> Set against that advantage, the disadvantage will be that ignoring the actual 
> hierarchies created means we can't access those hierarchies for higher 
> reasoning and constraint using world knowledge. Which is indeed the problem 
> we face with transformers.
> And another disadvantage will be the equally known one that generating 
> billions of subjective hierarchies in advance is enormously costly. And the 
> less known one dependent on the subjective hierarchy insight, that generating 
> hierarchies in advance is enormously wasteful of effort, and limiting. 
> Because there will always be a limit to the number of subjective hierarchies 
> you can generate in advance.
> If all this is true, the next stage to the advance of transformers will be to 
> find a way to generate only relevant subjective hierarchies at run time.
> Transformers learn their hierarchies using back-prop to minimize predictive 
> error over dot products. These dot products will converge on groupings of 
> elements which share predictions. If there were a way to directly find these 
> groupings of elements which share predictions, we might not have to rely on 
> back-prop over dot products. And we might be able to find only relevant 
> hierarchies at run time.
> So the key to improving over transformers would seem to be to leverage their 
> (implicit) discovery that hierarchy is subjective to each sentence, and 
> minimize the burden of generating that infinity of subjective hierarchies in 
> advance, by finding a method to directly group elements which share 
> predictions, without using back-prop over dot products. And applying that 
> method to generate hierarchies which are subjective to each sentence 
> presented to a system, only at the time each sentence is presented.
> If all the above is true, the key question should be: what method could 
> directly group hierarchies of elements in language which share predictions?
Ben Goertzel, PhD

"My humanity is a constant self-overcoming" -- Friedrich Nietzsche

