I was lying in bed thinking about 20Q.net. What's the ideal strategy for playing 20 Questions, given some statistical knowledge base about the world?
The traditional approach, which you can assign in a beginning programming class, is to build a binary tree of questions, adding onto the tip when you guess wrong. This is probably good for a start. For slightly better performance, you can rebalance the tree using the same left and right rotations used to manage a red-black tree, either to reduce the maximum height of the tree or the average depth of the answers.

The main trouble with this approach is that different people will sometimes answer the same question differently for the same item, because the question is vague or they don't know the answer. (Do you hold the movie THX-1138 when you use it? Is the movie carbon-based? Can linoleum cheer you up? Does plaster weigh more than 1 ton, and is it annoying? Can you hold the font named Papyrus, and do you use it at night? Does Django Reinhardt have cash value or win races? Do you hold a lentil when you use it?) So you end up failing to guess the item, and eventually building big question trees full of potential duplicates.

20Q.net uses a neural network to cope with this problem, and it supplies additional answers: in addition to the usual "yes" and "no", it has "Unknown", "Irrelevant", "Sometimes", "Probably", and "Doubtful". Nevertheless, it doesn't do very well; I thought of linoleum, plaster, THX-1138, Django Reinhardt, and a lentil, and it failed all five times, even with 30 questions. Yet all five of these items, like most of the things 20Q tries to guess, have articles in the English Wikipedia, which only has 3.2 million articles --- 22 bits' worth.

Here's a sample analysis of a game I played, where unfortunately 20Q's knowledge base failed pretty badly:

> You were thinking of linoleum.
>
> You said it's classified as Vegetable, 20Q was taught by other
> players that the answer is Other.
>
> Is it orange colored? You said Sometimes, 20Q was taught by
> other players that the answer is Doubtful.
>
> Is it used with animals? You said Sometimes, 20Q was taught by
> other players that the answer is Doubtful.
>
> Does it live above ground? You said No, 20Q was taught by other
> players that the answer is Yes.
>
> Does it come from a plant? You said Yes, 20Q was taught by other
> players that the answer is No.
>
> Does it fit in your wallet? You said Sometimes, 20Q was taught
> by other players that the answer is No.
>
> Do you use it at work? You said Yes, 20Q was taught by other
> players that the answer is No.
>
> Can it be painted? You said Yes, 20Q was taught by other players
> that the answer is No.
>
> Does it have writing on it? You said Sometimes, 20Q was taught
> by other players that the answer is No.
>
> Does it come in a box? You said Yes, 20Q was taught by other
> players that the answer is No.

Now, some of the answers here are simply wrong (linoleum can easily be painted, and when sold at retail it is almost always boxed), and perhaps these are not the optimal questions 20Q could have asked. Is there an optimal approach to this problem?

Consider this model. You have a database of N items and M questions. Each of the N items has a prior probability: an estimate of how likely a person is to choose that item. For each item-question pair, you have a yes-probability P(yes | Q, X): the probability that the person would answer "yes" to question Q if they are thinking of item X. Among the M questions are N questions of the form "Is it X?", whose yes-probability is defined as 1.

Given this model, a naïve Bayesian approach to estimating the probabilities of the items, given a particular set of answers to questions, seems quite viable. If only 5% of players think that the answer to "Do you make things with it?" is "no" when the object is "plaster", then it is probably reasonable, given a "no" answer, to diminish the estimate of the probability that the object is plaster by a factor of 20.
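To make that concrete, here's a minimal sketch of the naïve Bayesian update in Python. The dict layout ({item: probability} and {(question, item): P(yes | Q, X)}) and all the numbers are invented for illustration:

```python
def bayes_update(dist, p_yes, question, answered_yes):
    """Naive-Bayes update of item probabilities after one answer.

    dist:         {item: P(X)}, the current probability of each item
    p_yes:        {(question, item): P(yes | Q, X)}
    answered_yes: True for a "yes" answer, False for "no"
    """
    posterior = {}
    for item, p in dist.items():
        likelihood = p_yes[(question, item)]
        if not answered_yes:
            likelihood = 1.0 - likelihood
        posterior[item] = p * likelihood
    total = sum(posterior.values())  # renormalize so probabilities sum to 1
    return {item: p / total for item, p in posterior.items()}

# The plaster example: only 5% of players answer "no" to
# "Do you make things with it?" when thinking of plaster.
dist = {"plaster": 0.5, "linoleum": 0.5}
p_yes = {("make things?", "plaster"): 0.95,
         ("make things?", "linoleum"): 0.50}
after_no = bayes_update(dist, p_yes, "make things?", answered_yes=False)
```

A "no" answer multiplies plaster's unnormalized weight by 0.05 (the factor-of-20 diminution) before everything is renormalized.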
Where naïve Bayesian models can fall down badly (uh, or so I hear; I don't actually know this stuff) is when there's correlation among their inputs. For example, 20Q.net asked me both, "Does [plaster] come in many varieties?" and "Is [plaster] made in many different styles?" There are very few items for which the answers to these two questions would differ, so it's a mistake to treat them as independent data points.

One approach to choosing questions is to try to maximize the information gained with each question. Suppose you have a particular probability distribution over the N items (not necessarily the prior probabilities). Then you can calculate the information content of someone telling you, "I was thinking of item X"; it's simply -lg P(X) bits. The *expected* information content of that statement, before you know what X is, is the sum of -P(X) lg P(X) over all the possible values of X. (This is the same as the information per symbol on a channel.) This metric is usually called the Shannon entropy of the distribution; I'll confusingly call it the "remaining bit rate". In a sense, what you want to do is choose a next question that reduces the remaining bit rate of the resulting distribution as much as possible --- by as close to one full bit as possible, for a yes/no question.

So, given some way of updating your probabilities given the answer to a question, you can build a question tree as follows:

1. Start with the prior probabilities at the root node.
2. To choose the question for a node:
   1. For each question Q that has not been asked:
      1. Calculate the probability that the answer to the question will be "yes", given the probability distribution at that node and the item-question pair probabilities: the sum of P(X) * P(yes | Q, X) over all items X.
      2. Calculate the new probability distribution given a "yes" answer to that question, and the one given a "no" answer.
      3. Calculate the remaining bit rates of those probability distributions, and then multiply by the answer probabilities in order to come up with an expected remaining bit rate after the question has been answered.
   2. Choose the question Q with the lowest expected remaining bit rate. Create two new child nodes with the probability distributions computed above.
3. Choose the questions for those two new nodes using the same method, unless the question was of the form "Is it X?", in which case the "yes" node is terminal, or unless the tree has gotten too deep.

(This can be straightforwardly generalized to more than two answers per question.)

Now, given a database of past games, you can probably do better than naïve Bayesian updating. Given that you're examining a node at which you know, say, that the item is round, hard, and not usually sliced or carved (from ancestor questions), you can come up with a possibly more accurate probability distribution for, say, "Does it fit in an envelope?" by tabulating the previously-chosen items that were round, hard, not usually sliced or carved, and fit in envelopes, versus those that had those same attributes but did not. (And, when there aren't very many in one category or the other, you can "smooth" by adjusting the probability a bit with other similar games that differed on one attribute or another.)

This procedure should avoid asking redundant questions, like the "Does it come in many varieties?" example earlier, because it will discover that asking a second version of such a question provides very little additional information. Even if it doesn't discover this at first, it will do so after the redundant questions make it into some tree paths that are played through a number of times.

One downside is that the computation described above is at least O(NM), which is probably O(N²).
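As a sketch of that per-node computation (which is where the O(NM) cost comes from), the following Python scores each candidate question by its expected remaining bit rate and picks the best one. The table layout ({item: probability} and {(question, item): P(yes | Q, X)} dicts) is my own invention for illustration, not 20Q's actual method:

```python
import math

def remaining_bit_rate(dist):
    """-sum P(X) lg P(X): expected bits needed to learn which item it is."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def expected_bit_rate(dist, p_yes, question):
    """Expected remaining bit rate after asking `question` at this node."""
    # Probability the answer is "yes": sum of P(X) * P(yes | Q, X).
    p_q = sum(p * p_yes[(question, x)] for x, p in dist.items())

    def posterior(answered_yes):
        # New distribution given one answer, renormalized.
        post = {x: p * (p_yes[(question, x)] if answered_yes
                        else 1.0 - p_yes[(question, x)])
                for x, p in dist.items()}
        total = sum(post.values())
        return {x: v / total for x, v in post.items()} if total > 0 else dist

    # Weight each branch's remaining bit rate by its answer probability.
    return (p_q * remaining_bit_rate(posterior(True))
            + (1.0 - p_q) * remaining_bit_rate(posterior(False)))

def choose_question(dist, p_yes, unasked):
    """Pick the unasked question with the lowest expected remaining bit rate."""
    return min(unasked, key=lambda q: expected_bit_rate(dist, p_yes, q))
```

With four equally likely items (two bits of remaining bit rate), a question whose yes-probabilities split them 2/2 leaves exactly one bit in expectation, so `choose_question` prefers it to a question everyone answers with a coin flip.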
So it's probably feasible for a thousand or a million items, at least as a batch process to update a question tree periodically, but not for a billion.

Like the Open Mind Common Sense project and the Verbosity game on Games With A Purpose, the twenty-questions game could quickly produce a database of common-sense beliefs about objects: not only which things are believed to be "round", but also a statistical strength of belief for each such predicate. 20Q.net claims to have 76 million games played already, which ought to give it a pretty decent database for this purpose, but I don't think it's published.

--
To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-tol