If you take 2GB of diverse domain text and push its storage cost as low as possible while still being able to losslessly and quickly regenerate/extract all of it, you learn general patterns, and you can learn more and better ones if you find and ingest more data. If you "overfit" on the training data and can regenerate it all back perfectly, bit for bit, that is not a bad thing: it means you have learned the general patterns well enough to compress it that tightly.
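To make the "compression as a measure of learned patterns" idea concrete, here is a minimal sketch. It uses zlib as a crude stand-in for a learned compressor, and the two tiny corpora are invented placeholders, not anyone's actual training data; the only point is that bits per byte is the score you would track, and fewer bits per byte means more regularity has been captured.

    import zlib

    def bits_per_byte(text: str) -> float:
        """Compressed size in bits per input byte; lower = more regularity captured."""
        raw = text.encode("utf-8")
        return 8 * len(zlib.compress(raw, 9)) / len(raw)

    # Tiny invented stand-ins for the corpora discussed above.
    narrow = ("To change the oil: drain the pan, swap the filter, "
              "refill, check the level. ") * 400

    varied = [
        "Photosynthesis converts light into sugar.",
        "Quicksort partitions the array around a pivot.",
        "The treaty was signed in 1648.",
        "Interest compounds on the outstanding principal.",
        "A proton carries a positive charge.",
    ]
    diverse = " ".join(varied * 80)

    print(f"narrow  corpus: {bits_per_byte(narrow):.3f} bits/byte")
    print(f"diverse corpus: {bits_per_byte(diverse):.3f} bits/byte")

A real system would be judged by its bits per byte on large, diverse text, not on toys like these; the narrow corpus compresses a little better simply because its regularities are easier, which is exactly the "optimal but not general" situation described next.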
Of course, if your training data is only 2GB and it is all about how to fix a car, the compression will be optimal, but that optimality will not be very general until the system ingests more data, e.g. via online learning, if you don't train it offline on a diverse 800GB right away. So the takeaway is that more diverse data is better; the only way to fail at learning general patterns is to have data that is too small or not diverse enough. Also, as it ingests data it does not always learn new, better patterns, which is why the 50% curves of success appear. Those 50% curves are themselves made of smaller 50% curves, and so on. I expect the error/loss curve to be a fractal as it goes down. You can actually see this in their graph.
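Purely as an illustration of what "curves made of curves" would look like (a toy shape, not data from any real training run), here is a sketch that prints a self-similar loss curve built from a Cantor-style staircase, so the plateaus contain smaller plateaus at every scale:

    def cantor(x: float, depth: int = 12) -> float:
        """Cantor 'devil's staircase' value for x in [0, 1]."""
        if depth == 0:
            return x
        if x < 1/3:
            return 0.5 * cantor(3 * x, depth - 1)
        if x < 2/3:
            return 0.5
        return 0.5 + 0.5 * cantor(3 * x - 2, depth - 1)

    # Toy "loss curve" that plateaus at every scale: loss(t) = 1 - cantor(t).
    steps = 27
    for i in range(steps + 1):
        t = i / steps
        loss = 1.0 - cantor(t)
        print(f"step {i:2d}  loss {loss:.3f}  " + "#" * int(40 * loss))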