--- Mike Dougherty <[EMAIL PROTECTED]> wrote:

> On 11/2/07, Matt Mahoney <[EMAIL PROTECTED]> wrote:
> > Well, one alternative is to deduce that aluminum is a mass noun by the
> > low frequency of phrases like "an aluminum is" in a large corpus of
> > text (or count Google hits).  You could also deduce that aluminum is
> > an adjective from phrases like "an aluminum chair", etc.  More
> > generally, you would cluster words in the high-dimensional vector
> > space of their immediate context, then derive rules for moving from
> > cluster to cluster.
> >
> > However, the fact that this method is not used in the best language models
> > suggests it may exceed the computational limits of your PC.  This might
> > explain why we keep wading into the swamp.
> 
> It is doubtful that this kind of examination of information can support
> 'conversational language' on PC-scale computation for a while.

In theory it could.  A conversational model is a probability distribution over
dialog strings.  Given a question Q, the problem is to output the answer A that
maximizes the probability p(A|Q) = p(QA)/p(Q).
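As a toy sketch of that selection rule (the tiny corpus and candidate answers
here are invented for illustration, not a practical system): since p(Q) is the
same for every candidate answer, maximizing p(QA)/p(Q) reduces to maximizing
p(QA), which a bigram model with add-one smoothing can score.

```python
from collections import Counter

# Invented toy corpus, purely for illustration.
corpus = "aluminum is a metal . aluminum is light . iron is a metal .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size, for add-one smoothing

def p_string(words):
    """Probability of a word string under an add-one-smoothed bigram model."""
    p = unigrams[words[0]] / len(corpus)
    for a, b in zip(words, words[1:]):
        p *= (bigrams[(a, b)] + 1) / (unigrams[a] + V)
    return p

def best_answer(question, candidates):
    """Return the candidate A maximizing p(QA); p(Q) is constant across
    candidates, so this also maximizes p(A|Q) = p(QA)/p(Q)."""
    q = question.split()
    return max(candidates, key=lambda a: p_string(q + a.split()))

print(best_answer("aluminum is", ["a metal", "a country"]))  # -> a metal
```

Of course, this only works when the bigrams of QA have nonzero counts, which
is exactly the sparsity problem described below.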

In practice we don't know how to reliably estimate p(x) for strings x longer
than a few words.  In a 1 GB corpus, most strings longer than about 3 words
have a count of 0.  To model longer strings, a system has to learn semantic
and syntactic constraints and have real-world knowledge and common sense.
Google lacks most of these capabilities but partly compensates with a much
larger corpus, which allows exact matches up to about 5 words.
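The sparsity problem is easy to demonstrate (the toy text below is invented
for illustration): count what fraction of distinct n-grams occur exactly
once.  That fraction climbs toward 100% as n grows, so the count for most
longer strings drawn from *new* text will be 0.

```python
from collections import Counter

# Invented toy text, purely for illustration.
text = ("aluminum is a mass noun . an aluminum chair is light . "
        "the price of aluminum rose . aluminum is a metal . "
        "the price of iron rose . iron is a metal .").split()

def singleton_fraction(words, n):
    """Fraction of distinct n-grams that occur exactly once."""
    grams = Counter(zip(*(words[i:] for i in range(n))))
    return sum(1 for c in grams.values() if c == 1) / len(grams)

for n in range(1, 6):
    print(f"n={n}: {singleton_fraction(text, n):.0%} "
          f"of distinct n-grams occur only once")
```

Even in this short text, every distinct 5-gram occurs exactly once, so the
estimated probability of any unseen 5-word string is 0 without smoothing or
a richer model.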

> What do you
> think about the feasibility of a research request using this method?
> ex:  Find interesting information about: aluminum - to which the
> program builds a structure of information that it can continue
> refining and expanding until I return to check on it several hours
> later.  If I think it's on the right track for my definition of
> interesting, I could let it continue researching for days.  At the end
> of several days work, it would have a body of 'knowledge' that
> represents a cost to compile which makes it a local authority on this
> subject.  Assuming someone else might request information about the
> same topic, my local knowledge store could be included in preliminary
> findings.

Google will collect 80,200,000 facts about aluminum and rank them in 0.22
seconds.  You can also ask questions like "what is the thermal conductivity of
aluminum?" or "what country is the leading producer of aluminum?" verbatim and
get the answer.

> Clearly a distributed network of nodes is never going to be capable of
> the brute-force speed of knowing all things in one place.  I don't
> usually seek to know all things at once, just a useful number of
> things about a limited topic. That might be good enough to make the
> effort worthwhile.

Google uses a cluster of 10^6 CPUs, enough to keep a copy of the searchable
part of the Internet in RAM.



-- Matt Mahoney, [EMAIL PROTECTED]

-----
This list is sponsored by AGIRI: http://www.agiri.org/email
To unsubscribe or change your options, please go to:
http://v2.listbox.com/member/?member_id=8660244&id_secret=60675138-3b2ee3
