There are a couple of ways to handle this.

One is to view the text as a limited horizon Markov process and look for
exceptions.  Thus, we might build a bigram language model and look for cases
where trigrams would do better.  That implies we would be looking for cases
where "clack" occurs after "click and" anomalously more than would be
expected from the number of times "clack" appears after "and".  This comes
down to comparing the counts of "clack" and all other words in the context
of "click and" versus "anything-but-click and".  Since "clack" is probably a
small fraction of the words that appear in the second context, but exhibits
an overwhelming overabundance in the context of "click and", we would
conclude that "click and clack" is an important trigram.  The contingency
table is

                         clack    -clack
             click, and    k11      k12
             -click, and   k21      k22
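
Concretely, the cells fall straight out of the raw counts.  Here is a
sketch of that in Java (the method and variable names are mine, purely
for illustration, and edge effects at sentence boundaries are ignored):

    // Build the 2x2 table from raw corpus counts c(...).
    static long[] contingencyTable(long cClickAndClack, // c("click and clack")
                                   long cClickAnd,      // c("click and")
                                   long cAndClack,      // c("and clack")
                                   long cAnd) {         // c("and")
      long k11 = cClickAndClack;          // "clack" right after "click and"
      long k12 = cClickAnd - k11;         // anything else after "click and"
      long k21 = cAndClack - k11;         // "clack" after some other "x and"
      long k22 = cAnd - k11 - k12 - k21;  // everything else around "and"
      return new long[] {k11, k12, k21, k22};
    }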

Theoretically speaking, this test is part of a likelihood ratio test that
compares a Markov model against a restricted form of the same Markov model
and is an extension of the simpler test for interesting binomials.
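
In code, the test itself comes down to the entropy form of the G^2
statistic.  Something like this, continuing the Java sketch above:

    // x * log(x), with the usual convention that 0 * log(0) == 0.
    static double xLogX(long x) {
      return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized Shannon entropy of a list of counts.
    static double entropy(long... counts) {
      long sum = 0;
      double logs = 0.0;
      for (long k : counts) {
        logs += xLogX(k);
        sum += k;
      }
      return xLogX(sum) - logs;
    }

    // G^2 = 2 * (row entropy + column entropy - matrix entropy).
    static double llr(long k11, long k12, long k21, long k22) {
      double rowEntropy = entropy(k11 + k12, k21 + k22);
      double colEntropy = entropy(k11 + k21, k12 + k22);
      double matEntropy = entropy(k11, k12, k21, k22);
      // Clamp at zero to absorb floating point round-off.
      return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
    }

A big score means the bigram model can't explain how often "clack"
follows "click and", which is exactly the anomaly we are after.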

A second approach is to consider all overlapping n-grams that are in or out
of some context such as a known category, a cluster, or a data source.  Then
we can do a normal LLR test to find n-grams that are over-represented in
some category, cluster or whatever.  The length of the n-grams doesn't
actually matter all that much here.  This technique can be quick because you
handle all lengths of n-grams at the same time instead of building things up
bit by bit.  It is limited by the availability of categories that form
reasonable comparison sets.
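
The scoring there is the same 2x2 machinery with a different table.
Roughly, and again just as a sketch on top of the llr method above,
where kIn and kOut are this n-gram's counts inside and outside the
category and totalIn and totalOut are the pooled counts of all n-grams
on each side:

    // Rows: in-category vs out-of-category.  Columns: this n-gram vs
    // all other n-grams (all lengths pooled together).
    static double categoryScore(long kIn, long kOut,
                                long totalIn, long totalOut) {
      return llr(kIn, totalIn - kIn, kOut, totalOut - kOut);
    }

Sort by that score and the over-represented n-grams come out on top.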

On Fri, Jan 8, 2010 at 5:13 PM, Drew Farris <[email protected]> wrote:

> On Fri, Jan 8, 2010 at 12:06 AM, Robin Anil <[email protected]> wrote:
>
> > I like the formulation that Drew made, using n-1 grams to generate
> > n-grams.
>
> I think Ted first mentioned n-1 grams, and I ran with it. It is very
> useful to think about the problem this way.
>
> One question about the concept of n-1 grams, however. When n is 3 for
> example, are we really interested in the collocation of bigrams, or
> are we interested in non-overlapping tokens? For example, given the
> tri-gram 'click and clack', should we be looking at 'click and' and
> 'and clack', or should we be analyzing 'click' and 'and clack', or
> 'click and' and 'clack'? I suspect it is the first form because that
> extends easily to values larger than 3, but it's worth confirming.
>



-- 
Ted Dunning, CTO
DeepDyve
