I was lying in bed thinking about 20Q.net. What's the ideal strategy for playing 20 Questions, given some statistical knowledge base about the world?
The traditional approach, which you can assign in a beginning programming class, is to build a binary tree of questions, adding onto the tip when you guess wrong. This is probably good for a start. For slightly better performance, you can rebalance the tree using the same left and right rotations used to manage a red-black tree, either to reduce the maximum height of the tree or the average depth of the answers.

The main trouble with this approach is that different people will sometimes answer the same question differently for the same item, because the question is vague or they don't know the answer. (Do you hold the movie THX-1138 when you use it? Is the movie carbon-based? Can linoleum cheer you up? Does plaster weigh more than 1 ton, and is it annoying? Can you hold the font named Papyrus, and do you use it at night? Does Django Reinhardt have cash value or win races? Do you hold a lentil when you use it?) So you end up failing to guess the item, and eventually building big question trees full of potential duplicates.

20Q.net uses a neural network to cope with this problem, and it supplies additional answers: in addition to the usual "yes" and "no", it has "Unknown", "Irrelevant", "Sometimes", "Probably", and "Doubtful". Nevertheless, it doesn't do very well; I thought of linoleum, plaster, THX-1138, Django Reinhardt, and a lentil, and it failed all five times, even with 30 questions. Yet all five of these items, like most of the things 20Q tries to guess, have articles in the English Wikipedia, which only has 3.2 million articles --- 22 bits' worth.

Here's a sample analysis of a game I played, where unfortunately 20Q's knowledge base failed pretty badly:

> You were thinking of linoleum.
>
> You said it's classified as Vegetable, 20Q was taught by other
> players that the answer is Other.
>
> Is it orange colored? You said Sometimes, 20Q was taught by
> other players that the answer is Doubtful.
>
> Is it used with animals? You said Sometimes, 20Q was taught by
> other players that the answer is Doubtful.
>
> Does it live above ground? You said No, 20Q was taught by other
> players that the answer is Yes.
>
> Does it come from a plant? You said Yes, 20Q was taught by other
> players that the answer is No.
>
> Does it fit in your wallet? You said Sometimes, 20Q was taught
> by other players that the answer is No.
>
> Do you use it at work? You said Yes, 20Q was taught by other
> players that the answer is No.
>
> Can it be painted? You said Yes, 20Q was taught by other players
> that the answer is No.
>
> Does it have writing on it? You said Sometimes, 20Q was taught
> by other players that the answer is No.
>
> Does it come in a box? You said Yes, 20Q was taught by other
> players that the answer is No.

Now, some of the answers here are simply wrong (linoleum can easily be painted, and when sold at retail it is almost always boxed), and perhaps these are not the optimal questions 20Q could have asked. Is there an optimal approach to this problem?

Consider this model. You have a database of N items and M questions. Each of the N items has a prior probability: an estimate of how likely a person is to choose that item. For each item-question pair, you have a yes-probability P(yes | Q, X): the probability that the person would answer "yes" to question Q if they are thinking of item X. Among the M questions are N questions of the form "Is it X?", whose yes-probability is defined as 1.

Given this model, a naïve Bayesian approach to estimating the probabilities of the items, given a particular set of answers to questions, seems quite viable. If only 5% of players think that the answer to "Do you make things with it?" is "no" when the object is "plaster", then it is probably reasonable, given a "no" answer, to diminish the estimate of the probability that the object is plaster by a factor of 20.
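To make that concrete, here's a minimal sketch of the naïve Bayesian update in Python. The dict layout ({item: probability} and {(question, item): P(yes | Q, X)}) and all the numbers are invented for illustration:

```python
def bayes_update(dist, p_yes, question, answered_yes):
    """Naive-Bayes update of item probabilities after one answer.

    dist:         {item: P(X)}, the current probability of each item
    p_yes:        {(question, item): P(yes | Q, X)}
    answered_yes: True for a "yes" answer, False for "no"
    """
    posterior = {}
    for item, p in dist.items():
        likelihood = p_yes[(question, item)]
        if not answered_yes:
            likelihood = 1.0 - likelihood
        posterior[item] = p * likelihood
    total = sum(posterior.values())  # renormalize so probabilities sum to 1
    return {item: p / total for item, p in posterior.items()}

# The plaster example: only 5% of players answer "no" to
# "Do you make things with it?" when thinking of plaster.
dist = {"plaster": 0.5, "linoleum": 0.5}
p_yes = {("make things?", "plaster"): 0.95,
         ("make things?", "linoleum"): 0.50}
after_no = bayes_update(dist, p_yes, "make things?", answered_yes=False)
```

A "no" answer multiplies plaster's unnormalized weight by 0.05 (the factor-of-20 diminution) before everything is renormalized.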
Where naïve Bayesian models can fall down badly (uh, or so I hear; I don't actually know this stuff) is when there's correlation among their inputs. For example, 20Q.net asked me both, "Does [plaster] come in many varieties?" and "Is [plaster] made in many different styles?" There are very few items for which the answers to these two questions would differ, so it's a mistake to treat them as independent data points.

One approach to choosing questions is to try to maximize the information gained with each question. Suppose you have a particular probability distribution over the N items (not necessarily the prior probabilities). Then you can calculate the information content of someone telling you, "I was thinking of item X"; it's simply -lg P(X) bits. The *expected* information content of that statement, before you know what X is, is the sum of -P(X) lg P(X) over all the possible values of X. (This is the same as the information per symbol on a channel.) This metric is usually called the Shannon entropy of the distribution; I'll confusingly call it the "remaining bit rate". In a sense, what you want to do is choose a next question that reduces the remaining bit rate of the resulting distribution as much as possible --- by as close to one full bit as possible, for a yes/no question.

So, given some way of updating your probabilities given the answer to a question, you can build a question tree as follows:

1. Start with the prior probabilities at the root node.
2. To choose the question for a node:
   1. For each question Q that has not been asked:
      1. Calculate the probability that the answer to the question will be "yes", given the probability distribution at that node and the item-question pair probabilities: the sum of P(X) * P(yes | Q, X) over all items X.
      2. Calculate the new probability distribution given a "yes" answer to that question, and the one given a "no" answer.
      3. Calculate the remaining bit rates of those probability distributions, and then multiply by the answer probabilities in order to come up with an expected remaining bit rate after the question has been answered.
   2. Choose the question Q with the lowest expected remaining bit rate. Create two new child nodes with the probability distributions computed above.
3. Choose the questions for those two new nodes using the same method, unless the question was of the form "Is it X?", in which case the "yes" node is terminal, or unless the tree has gotten too deep.

(This can be straightforwardly generalized to more than two answers per question.)

Now, given a database of past games, you can probably do better than naïve Bayesian updating. Given that you're examining a node at which you know, say, that the item is round, hard, and not usually sliced or carved (from ancestor questions), you can come up with a possibly more accurate probability distribution for, say, "Does it fit in an envelope?" by tabulating the previously-chosen items that were round, hard, not usually sliced or carved, and fit in envelopes, versus those that had those same attributes but did not. (And, when there aren't very many in one category or the other, you can "smooth" by adjusting the probability a bit with other similar games that differed on one attribute or another.)

This procedure should avoid asking redundant questions, like the "Does it come in many varieties?" example earlier, because it will discover that asking a second version of such a question provides very little additional information. Even if it doesn't discover this at first, it will do so after the redundant questions make it into some tree paths that are played through a number of times.

One downside is that the computation described above is at least O(NM), which is probably O(N²).
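As a sketch of that per-node computation (which is where the O(NM) cost comes from), the following Python scores each candidate question by its expected remaining bit rate and picks the best one. The table layout ({item: probability} and {(question, item): P(yes | Q, X)} dicts) is my own invention for illustration, not 20Q's actual method:

```python
import math

def remaining_bit_rate(dist):
    """-sum P(X) lg P(X): expected bits needed to learn which item it is."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def expected_bit_rate(dist, p_yes, question):
    """Expected remaining bit rate after asking `question` at this node."""
    # Probability the answer is "yes": sum of P(X) * P(yes | Q, X).
    p_q = sum(p * p_yes[(question, x)] for x, p in dist.items())

    def posterior(answered_yes):
        # New distribution given one answer, renormalized.
        post = {x: p * (p_yes[(question, x)] if answered_yes
                        else 1.0 - p_yes[(question, x)])
                for x, p in dist.items()}
        total = sum(post.values())
        return {x: v / total for x, v in post.items()} if total > 0 else dist

    # Weight each branch's remaining bit rate by its answer probability.
    return (p_q * remaining_bit_rate(posterior(True))
            + (1.0 - p_q) * remaining_bit_rate(posterior(False)))

def choose_question(dist, p_yes, unasked):
    """Pick the unasked question with the lowest expected remaining bit rate."""
    return min(unasked, key=lambda q: expected_bit_rate(dist, p_yes, q))
```

With four equally likely items (two bits of remaining bit rate), a question whose yes-probabilities split them 2/2 leaves exactly one bit in expectation, so `choose_question` prefers it to a question everyone answers with a coin flip.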
So it's probably feasible for a thousand or a million items, at least as a batch process to update a question tree periodically, but not for a billion.

Like the Open Mind Common Sense project and the Verbosity game on Games With A Purpose, the twenty-questions game could quickly produce a database of common-sense beliefs about objects: not only which things are believed to be "round", but also a statistical strength of belief for each such predicate. 20Q.net claims to have 76 million games played already, which ought to give it a pretty decent database for this purpose, but I don't think it's published.

--
To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-tol