All that's needed, architecture-wise, is fairly open: the layers themselves can be arbitrary, but for most sequence processing people will shout "Transformer!" and call it a day.
Most processes do follow a roughly autoregressive pattern, and sequence modeling is a good way to frame the problem, so I get the jump to that. But I'd like to see someone test a few things with the popular Transformers like GPT-2 and BERT:

1. Layer-order dependence: swap layers with one another to determine whether certain layers learn independent things (a sketch of this follows at the end of the post).
2. Robustness to change: progressive growth, random layer insertion and removal.
3. Importance of architecture: add in random decoder-layer connections and observe the effect on overall performance.

This would start to unlock the questions behind forming an AI Brain.
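For experiments 1 and 2, here's a minimal sketch of what I mean, assuming the Hugging Face transformers library and the pretrained "gpt2" checkpoint; the layer indices and the probe sentence are arbitrary choices of mine, and perplexity is just one convenient stand-in for "overall performance":

# Sketch of experiment 1 (layer-order dependence) and the removal half of
# experiment 2, assuming `torch` and `transformers` are installed.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(model, text):
    # exp of the mean token negative log-likelihood under the model
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"], use_cache=False).loss
    return torch.exp(loss).item()

text = "The quick brown fox jumps over the lazy dog."  # arbitrary probe text
print("baseline perplexity:", perplexity(model, text))

# Experiment 1: swap two decoder blocks. model.transformer.h is the
# nn.ModuleList holding GPT-2's 12 Transformer blocks, so swapping layers
# is just an index exchange.
i, j = 5, 6  # arbitrary pair of layers to swap
model.transformer.h[i], model.transformer.h[j] = (
    model.transformer.h[j], model.transformer.h[i],
)
print(f"after swapping blocks {i} and {j}:", perplexity(model, text))

# Experiment 2 (removal): delete a block outright and re-measure.
del model.transformer.h[j]
print(f"after also removing block {j}:", perplexity(model, text))

If the swapped model's perplexity barely moves, that's evidence the blocks learn somewhat interchangeable functions; a large jump argues for strong layer-order dependence. The same probe carries over to BERT-style models if you switch to a masked-LM loss.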