An interesting thing I seem to have learned is that self-attention in the 
Transformer architecture is actually a way of building a tree horizontally. 
Each word checks every other word, which acts like the IF-condition; then the 
values are summed, weighted by those matches, and the next layer repeats the 
process.
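The "check every word, then sum the values" step above can be sketched as a 
single attention layer. This is a minimal NumPy sketch, not any particular 
model's code; the weight matrices and toy sizes are made up for illustration.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """One self-attention layer over n tokens of dimension d."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Each token "checks" every other token: Q @ K.T is an n x n score
    # matrix -- the soft analogue of a per-pair IF-condition.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax turns each row of scores into weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each token's new vector is a weighted sum of all the value vectors,
    # and the result feeds the next layer, which does it all again.
    return weights @ V

rng = np.random.default_rng(0)
n, d = 4, 8                       # toy sizes: 4 tokens, 8 dimensions each
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): same shape in and out, so layers can stack
```

Note the output has the same shape as the input, which is what lets the next 
layer "do it again" without any tree of branches growing.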

> Transformers are not like traditional ANNs: they don't do hierarchical 
> activations, they use self-attention.
> They don't cross-connect upwards; they process in one layer, so to speak.

Now I'm wondering for later: could this be done without backprop, assuming 
backprop is the worse idea? Notably, it doesn't explode the way a tree would. 
Didn't they cap it at 1024 items with 1600 dimensions each? So what does that 
allow?
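On the "doesn't explode like a tree" point: with a capped context, the 
all-pairs check grows quadratically, not exponentially. A quick arithmetic 
sketch, assuming the 1024 / 1600 figures refer to a GPT-2-style context length 
and embedding width:

```python
# With a capped context of n tokens, attention compares every token with
# every other token, so the score matrix has n * n entries -- quadratic
# growth, not the exponential blow-up of an unpruned branching tree.
n = 1024                  # assumed context cap
d = 1600                  # assumed per-token dimensions
pairwise = n * n          # score-matrix entries per attention map
print(pairwise)           # 1048576 pairwise "IF-condition" checks
# A binary tree that branched at every one of those 1024 positions would
# have 2**1024 leaves; the cap is what keeps the cost tractable.
print(2**1024 > pairwise) # True, by an astronomical margin
```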
------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/T6cf3be509c7cd2f2-M40678c4f4d548e846e6e7f36