Hi Eliezer,

> > It looks like Williams' book is more about the perils of Asimov's
> > Laws than about hard-wiring. As logical constraints, Asimov's Laws
> > suffer from the grounding problem. Any analysis of brains as purely
> > logical runs afoul of the grounding problem. Brains are statistical
> > (or, if you prefer, "fuzzy"), and logic must emerge from statistical
> > processes. That is, symbols must be grounded in sensory experience,
> > reason and planning must be grounded in learning, and goals must be
> > grounded in values.
>
> This solves a *small* portion of the Friendliness problem. It doesn't
> solve all of it.
>
> There is more work to do even after you ground symbols in experience,
> planning in learned models, and goals (what I would call "subgoals")
> in values (what I would call "supergoals"). For example, Prime
> Intellect *does* do reinforcement learning and, indeed, goes on
> evolving its definitions of, for example, "human", as time goes on,
> yet Lawrence is still locked out of the goal system editor and
> humanity is still stuck in a pretty nightmarish system because
> Lawrence picked the *wrong* reinforcement values and didn't give any
> thought about how to fix that. Afterward, of course, Prime Intellect
> locked Lawrence out of editing the reinforcement values, because that
> would have conflicted with the very reinforcement values he wanted to
> edit. This also happens with the class of system designs you propose.
> If "temporal credit assignment" solves this problem I would like to
> know exactly why it does.
The temporal credit assignment problem is the problem whose solution
causes reason and planning to emerge from learning, in order to
simulate the world and hence predict the effect of actions on values.
It isn't specifically about the problem you describe. I'll answer the
question about Lawrence being locked out in my next set of paragraphs.

> > Also, while I advocate hard-wiring certain values of intelligent
> > machines, I also recognize that such machines will evolve (there
> > is a section on "Evolving God" in my book). And as Ben says, once
> > things evolve there can be no absolute guarantees. But I think
> > that a machine whose primary values are for the happiness of all
> > humans will not learn any behaviors to evolve against human
> > interests. Ask any mother whether she would rewire her brain
> > to want to eat her children. Designing machines with primary
> > values for the happiness of all humans essentially defers their
> > values to the values of humans, so that machine values will
> > adapt to evolving circumstances as human values adapt.
>
> Erm... damn. I've been trying to be nice recently, but I can't think
> of any way to phrase my criticism except "Basically we've got a vague
> magical improvement force that fixes all the flaws in your system?"

If you want to be nasty, you'll have to try harder than that. I think
you've been studying friendliness so long you've internalized it.

My approach is not magic. By making machine values depend on human
happiness (I know you don't like the word "machine", but I use it to
mean artifact, and also use "God" to make it clear I'm not talking
about can openers), they are essentially deferred to human values.
There can never be guarantees. So given that I have to trust something,
I put my trust in the happiness expressed by all humans. In fact, I
trust the expression of happiness by all humans a lot more than I trust
any individual (e.g., Lawrence) to modify machine values.
Lawrence may be a good guy, but lots of individuals aren't, and I
certainly won't trust a programmed set of criteria about which
individuals to trust.

> What kind of evolution? How does it work? What does it do?

The world changes through human action, natural action, and in the
future the actions of intelligent machines. Human happiness will change
in response, and the machines will learn new behaviors based on world
changes and human happiness changes. Furthermore, the mental and
physical capabilities of the machines will change, giving them a
broader array of actions for causing human happiness, and more accurate
simulations for predicting human happiness.

> Where does it go?

That's the big question, isn't it? Who can say for sure where
super-intelligent brains responding to the happiness of all humans will
go? In my book I say the machines will simulate all humans and their
interactions (except for those luddites who opt out). I say they will
probably continue the human science program, driven by continuing human
curiosity. They will probably work hard to reduce humans' natural
xenophobia, which is the source of so much unhappiness. And for any
party animals out there, there will probably be lots of really well
produced lowbrow entertainment.

> If you don't know where it ends up, then what forces determine the
> trajectory and why do you trust them?

If I have to trust anything, it's the happiness of all humans. It's
like politics. Winston Churchill said democracy is a terrible form of
government, but it's better than all the others. All the dictators who
killed millions during the twentieth century at least thought they had
good intentions. I don't trust good intentions. I trust collective
decisions, expressed via votes or happiness.

> Why doesn't your system shut off
> the reinforcement mechanism on top-level goals for exactly the same
> reason Prime Intellect locks Lawrence out of the goal system editor.

Goals emerge from values.
Goals will constantly evolve with the situation and with evolving human
emotional responses. I want Lawrence and any individual (fill in the
name of your favorite villain) locked out from special control.

> Why doesn't
> your system wirehead on infinitely increasing the amount of
> "reinforcement" by direct editing its own code?

I don't completely understand your question (is there a typo?), but in
computer systems there is no hard distinction between code and data
(all data needs is an interpreter to become code). So a learning system
essentially re-programs itself by learning new data.

> We are talking about the fate of the human species here.

I'm with you there.

> Someone has to work out the nitty-gritty, not just to
> implement the system, but to even know for any reason beyond pure
> ungrounded hope that Friendliness *can* be made to work. I understand
> that you *hope that* machines will evolve, and that you hope this
> will be beneficial to humanity. Hope is not evidence. As it stands,
> using reinforcement learning alone as a solution to Friendliness can
> be modeled to malfunction in pretty much the same way Prime Intellect
> does. If you have a world model for solving the temporal credit
> assignment problem, exactly the same thing happens. That's the
> straightforward projection. If evolution is supposed to fix this
> problem, you have to explain how.

My approach isn't hope. It's more like the KISS (Keep It Simple,
Stupid) principle. Of course, no one knows how to implement general
reinforcement learning yet, and that won't be simple (there are folks
on this list working hard at it, though). And robustly recognizing
expressions of human happiness and unhappiness will take some work,
too. But the general principle is simple: defer machine values to human
values, as expressed by human happiness.
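The two points above -- that learned data effectively re-programs a
system once an interpreter executes it, and that machine behavior can
defer to expressed human happiness -- can be combined in a minimal,
purely illustrative sketch. All the names, situations, and the reward
rule here are hypothetical, not anyone's actual proposal:

```python
# Purely illustrative sketch: the learned policy is plain data (a dict),
# an interpreter turns that data into behavior, and reinforcement from
# expressed human happiness rewrites the data -- re-programming the
# system without touching its code.

policy = {}  # learned data: situation -> preferred action

def act(situation, options):
    """Interpreter: data becomes behavior; unknown situations fall back
    to the first available option (i.e., the system explores)."""
    return policy.get(situation, options[0])

def reinforce(situation, action, happiness):
    """Learning step: keep only actions humans expressed happiness about."""
    if happiness > 0:
        policy[situation] = action

# A human expresses happiness about "tell a joke" in the "bored" situation.
reinforce("bored", "tell a joke", happiness=+1)
print(act("bored", ["do nothing", "tell a joke"]))  # -> tell a joke
```

The design choice is the one the paragraph describes: nothing in the
system's fixed code names a goal; the goals live in the data, which the
happiness signal keeps rewriting.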
My approach is really a symbiosis between machines, which learn and
simulate the world in ways that are way beyond human capabilities, and
all of humanity, who supply the values by their expressions of
happiness. Hence my interest in the Global Brain Group. Some of its
members think super-intelligence will emerge from collective human
interactions without any need for machines. I think that the limits of
human communication limit the complexity of simulation and learning
that human groups are capable of. But I think the symbiosis of machines
and human values will form a global brain.

Cheers,
Bill