Thinking on this slightly more...

So we know that smaller models, in any reasonable language including CoDD (https://arxiv.org/abs/2004.05268), will tend to be more accurate, and thus will also tend to agree better when learned across different subsamples (and thus ensembles formed by combining models learned across various subsamples will tend to have smaller decision tree complexity)...
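
To make "agreement of models learned across different subsamples" a bit more concrete, here is a minimal toy sketch. Everything in it is an illustrative assumption and not something from the papers linked in this thread: the 1-D noisy data, the one-threshold "stump" standing in for a small model, the 1-nearest-neighbour memorizer standing in for a large one, and all function names. It only measures pairwise agreement on fresh probe points; it does not compute CoDD size or decision tree complexity.

```python
import random
from itertools import combinations


def make_data(n, noise, rng):
    """Toy 1-D data: label is 1 when x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data


def fit_stump(sample):
    """Small model: one threshold chosen to minimize training error."""
    best_t, best_err = 0.0, len(sample) + 1
    for t, _ in sample:
        err = sum(int(x > t) != y for x, y in sample)
        if err < best_err:
            best_t, best_err = t, err
    return lambda x, t=best_t: int(x > t)


def fit_memorizer(sample):
    """Large model: 1-nearest-neighbour lookup that memorizes every noisy point."""
    points = list(sample)

    def predict(x):
        return min(points, key=lambda p: abs(p[0] - x))[1]

    return predict


def mean_agreement(models, probes):
    """Average fraction of probe points on which a pair of models agrees."""
    pairs = list(combinations(models, 2))
    total = 0.0
    for f, g in pairs:
        total += sum(f(x) == g(x) for x in probes) / len(probes)
    return total / len(pairs)


def run(seed=0, n=400, k=5, noise=0.2):
    """Fit k small and k large models on random half-subsamples; compare agreement."""
    rng = random.Random(seed)
    data = make_data(n, noise, rng)
    probes = [rng.random() for _ in range(500)]  # fresh out-of-sample points
    subsamples = [rng.sample(data, n // 2) for _ in range(k)]
    small = [fit_stump(s) for s in subsamples]
    large = [fit_memorizer(s) for s in subsamples]
    return mean_agreement(small, probes), mean_agreement(large, probes)


if __name__ == "__main__":
    small_agree, large_agree = run()
    print("stump agreement:    ", round(small_agree, 3))
    print("memorizer agreement:", round(large_agree, 3))
```

Under these toy assumptions one would expect the stumps to agree with each other more than the memorizers do, since each memorizer reproduces the label noise of its own subsample. That is only the easy direction of the argument; it says nothing about whether high agreement implies a small model.
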

In the opposite direction we can say stuff like:

Maximum out-of-sample agreement of models learned on different subsamples
==>
minimum decision tree complexity of ensemble formed from models learned on different subsamples

We can also say that on the average smaller decision trees will tend to have smaller compressed versions, i.e. smaller CoDD size.

So if we have a model class for which smaller decision tree complexity leads w/ very high probability to smaller size in CoDD or whatever other relevant language, then Bob's our uncle...

The hypothesis Poggio is trying to make is that current DNN model classes are of this nature, I guess.

On Sun, Aug 30, 2020 at 10:48 AM James Bowery <jabow...@gmail.com> wrote:
>
> It _is_ obvious that even _deploying_ Big Language Model inference is
> expensive. Sure, running a given inference is nothing compared to the model
> induction, but if you're, say, Google, and you want to deploy your
> latest-greatest BLM to a sizable fraction of Earth's population, it might pay
> to at least exclude zero parameters from the "parameter" count.
>
> ANY "step in the right direction" would fly in the face of the pernicious
> anti-Occam one-upsmanship that provides aid and comfort to enemies of
> humanity occupying positions of high trust and authority in society like
> Jonathan Haidt.
>
> On Sun, Aug 30, 2020 at 12:23 PM Ben Goertzel <b...@goertzel.org> wrote:
>>
>> It's not obvious that cheap/simple/straightforward methods of
>> knowledge distillation would actually perform the needed
>> compactification effectively, though...
>>
>> It may be that performing this compactification of the huge models, is
>> a much harder learning problem than actually learning the huge models
>> with their massive cross-parameter redundancy and
>> overparametrization...
>>
>> The approach Andres Suarez and I took for our AGI-20 paper
>>
>> https://arxiv.org/abs/2005.12533
>>
>> could be viewed as a very fancy form of "distillation" -- i.e.
>> learning a compact symbolic model guided by the massive
>> overparametrized neural model as an "oracle" .... But we just did
>> some promising prototype stuff there, didn't do the work at scale
>> yet...
>>
>> On Sun, Aug 30, 2020 at 10:15 AM James Bowery <jabow...@gmail.com> wrote:
>> >
>> > My point is that when one obtains a good model, one _is_ finding a compact
>> > model but its compactness is obscured by the sloppy definition of the word
>> > "parameter". A big step toward saving civilization from going down the
>> > rat hole would be for these Big Language Model guys to run knowledge
>> > distillation on their language models and publish the size, in bits, of
>> > the distilled weights as the "number of parameters".
>> >
>> > This is not to improve their models, but rather to wake the social
>> > sciences up to the difference between statistical and dynamic models and the
>> > fact that without AIT, they're leading institutional decision makers to
>> > perform experiments on billions of unwilling human subjects with all the
>> > rigor of a Medieval Barber's Humor Theory.
>> >
>> > We don't have much time left if we have any at all.
>> >
>> > On Sun, Aug 30, 2020 at 11:59 AM Ben Goertzel <b...@goertzel.org> wrote:
>> >>
>> >> I.e. taking a big model and then averaging its behavior across
>> >> multiple subsamples, one is effectively getting a "big ensemble of big
>> >> models" that has the same information content as a more compact model
>> >> (which one doesn't actually have explicitly on hand)
>> >>
>> >> It may seem a perverse way to do things, rather than just finding the
>> >> compact model, but it's not sooo perverse if what one has on hand is
>> >> precisely an algorithm for learning overdetermined accurate models in
>> >> acceptable time given the hardware at hand...
>> >>
>> >> On Sun, Aug 30, 2020 at 9:55 AM Ben Goertzel <b...@goertzel.org> wrote:
>> >> >
>> >> > I think the point is that
>> >> >
>> >> > A -- model compactness
>> >> >
>> >> > B -- consistency of model results across data subsamples [regardless
>> >> > of model size]
>> >> >
>> >> > are under broad conditions basically equivalent, and since we have
>> >> > better learning algorithms for finding overparameterized models that are
>> >> > consistent across data subsamples than for finding compact models,
>> >> > there is more focus on B than A at the moment
>> >> >
>> >> > I agree w/ you that we need to consider both A and B, and I think in a
>> >> > neural-symbolic system for instance, A is often a more relevant metric
>> >> > for the symbolic component and B for the neural component.
>> >> >
>> >> > On Sun, Aug 30, 2020 at 9:51 AM James Bowery <jabow...@gmail.com> wrote:
>> >> > >
>> >> > > Not having read his paper, I can state from my own experience going
>> >> > > back to the late 1980s doing multi-source neural image segmentation
>> >> > > that overparameterized models are what you need during initial
>> >> > > training. An appropriate learning algorithm will naturally reduce
>> >> > > the complexity of the model measured, not in terms of the number of
>> >> > > so-called "parameters," but rather in terms of the number of bits
>> >> > > required to encode those parameters. This is more or less what
>> >> > > "knowledge distillation" takes advantage of when one is deploying on
>> >> > > resource limited platforms.
>> >> > >
>> >> > > Remember that there are two roles for AIT here:
>> >> > >
>> >> > > 1) Model selection in the absence of validation data -- which might
>> >> > > not even involve automated induction at all.
>> >> > > 2) Reinforcement signal for automated induction.
>> >> > >
>> >> > > The thing that's sending civilization down a rat-hole is the failure
>> >> > > to recognize AIT's value as model selection.
>> >> > >
>> >> > > On Sun, Aug 30, 2020 at 11:39 AM Ben Goertzel <b...@goertzel.org> wrote:
>> >> > >>
>> >> > >> James, have you seen Poggio's attempt to argue that these
>> >> > >> overparametrized models are actually OK in terms of learning theory?
>> >> > >>
>> >> > >> https://dspace.mit.edu/handle/1721.1/124343
>> >> > >>
>> >> > >> The basic argument seems to be:
>> >> > >>
>> >> > >> -- In a space of overparametrized models, stability under subsampling
>> >> > >> (e.g. leave-one-out accuracy as he describes) is a proxy for
>> >> > >> minimizing error
>> >> > >>
>> >> > >> -- So if it's easier/faster to find overparameterized models than
>> >> > >> compact models, this can still be a route to maximally accurate
>> >> > >> models so long as one uses subsampling to estimate accuracy
>> >> > >>
>> >> > >> His statistical theory is openly hand-wavy but the conceptual
>> >> > >> argument is clear...
>> >> > >>
>> >> > >> In terms of algorithmic information theory ish considerations, he is
>> >> > >> sorta implicitly assuming that the crypticity (difficulty of
>> >> > >> discovering/learning) of overparametrized models is less than the
>> >> > >> crypticity of compact models, which is indeed the experience of the
>> >> > >> ML and NLP worlds so far
>> >> > >>
>> >> > >> I do not think this is the whole story by any means, but it's a
>> >> > >> non-stupid and relevant (if not actually original) point...
>> >> > >>
>> >> > >> ben
>> >> > >>
>> >> > >> On Wed, Jul 1, 2020 at 9:08 AM James Bowery <jabow...@gmail.com> wrote:
>> >> > >> >
>> >> > >> > There seems to be hysteria against algorithmic information theory
>> >> > >> > in language modeling.
>> >> > >> >
>> >> > >> > OpenAI boasts a 175 BILLION parameter model.
>> >> > >> >
>> >> > >> > Now Google boasts a 600 BILLION parameter model.
>> >> > >> >
>> >> > >> > https://youtu.be/1VdEw_mGjFk
>> >> > >> >
>> >> > >> > Now, I wouldn't call this "anti-AIT" if it weren't for the fact
>> >> > >> > that these papers don't even attempt to estimate the actual
>> >> > >> > information content of these parameters. Instead, they seem to
>> >> > >> > take _pride_ in the obviously-inflated "parameter count".
>> >> > >> >
>> >> > >> > Artificial General Intelligence List / AGI / see discussions +
>> >> > >> > participants + delivery options Permalink
>> >> > >>
>> >> > >> --
>> >> > >> Ben Goertzel, PhD
>> >> > >> http://goertzel.org
>> >> > >>
>> >> > >> “The only people for me are the mad ones, the ones who are mad to
>> >> > >> live, mad to talk, mad to be saved, desirous of everything at the
>> >> > >> same time, the ones who never yawn or say a commonplace thing, but burn,
>> >> > >> burn, burn like fabulous yellow roman candles exploding like spiders
>> >> > >> across the stars.” -- Jack Kerouac

--
Ben Goertzel, PhD
http://goertzel.org

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T100f708e32ae7327-Ma749008b42880e365e9950df
Delivery options: https://agi.topicbox.com/groups/agi/subscription