As you say, there are three bigrams. The words best, times and worst each
appear once at the beginning of a bigram (total = 3). The words times and
worst appear at the end of a bigram twice and once respectively (total =
3). The occurrences of times at the beginning and at the end of bigrams are
separate cases and should not be confused, so yes, the counts should be
based on position.

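As a rough illustration of those positional counts, here is a minimal Python
sketch. The whitespace tokenization and the names bigrams, first_counts and
second_counts are my own assumptions for the example, not anything from an
existing library.

```python
from collections import Counter


def bigrams(tokens):
    """Return the adjacent token pairs of a token list, in order."""
    return list(zip(tokens, tokens[1:]))


# Assumption: simple whitespace tokenization of the example string.
tokens = "best times worst times".split()
pairs = bigrams(tokens)

# Count each word separately by the position it occupies in a bigram.
first_counts = Counter(a for a, _ in pairs)   # word begins a bigram
second_counts = Counter(b for _, b in pairs)  # word ends a bigram

print(pairs)
print(dict(first_counts))
print(dict(second_counts))
```

Keeping two separate counters is the whole point: times is counted once as a
first word and twice as a second word, and the two totals are never merged.
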
Also, for the record, it is common to augment the corpus with beginning and
ending symbols. Often these are added virtually between sentences rather
than written into the text. In your example, this would give us two more
bigrams: "<start> best" and "times <end>". These extra bigrams allow us, for
instance, to note that the word "the" commonly starts an English sentence,
but rarely ends one.

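One way to sketch the boundary symbols in Python is below. It assumes the
corpus is already split into sentences and tokenized on whitespace; the
function name bigrams_with_boundaries is hypothetical, and the symbols are
only added to the in-memory token list, never to the corpus itself.

```python
START, END = "<start>", "<end>"


def bigrams_with_boundaries(sentences):
    """Yield bigrams per sentence, padded with virtual boundary symbols."""
    pairs = []
    for sentence in sentences:
        # Assumption: whitespace tokenization; padding happens per sentence.
        tokens = [START] + sentence.split() + [END]
        pairs.extend(zip(tokens, tokens[1:]))
    return pairs


padded = bigrams_with_boundaries(["best times worst times"])
for pair in padded:
    print(pair)
```

For the single example sentence this produces the three original bigrams
plus the two boundary bigrams, five in total.
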
For testing the "best times" bigram, the counts would be:

k11 = 1 (best times)
k12 = 0 (best NOT times)
k21 = 1 (NOT best times, i.e. "worst times")
k22 = 1 (NOT best NOT times, i.e. "times worst")

Note that k** = k11 + k12 + k21 + k22 = 3 (total number of bigrams),
k1* = k11 + k12 = 1 (number of times best occurred at the beginning of a
bigram) and k*1 = k11 + k21 = 2 (number of times "times" occurred at the end
of a bigram).

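The table above can be reproduced with a short Python sketch. The function
names contingency and llr are hypothetical, and the llr function uses the
usual G^2 form of the log-likelihood ratio, 2 * sum(k_ij * ln(k_ij * N /
(row_i * col_j))), which is my assumption about the statistic being plugged
into; it is not spelled out in this message.

```python
import math


def contingency(pairs, a, b):
    """Position-aware 2x2 counts: a is tested as first word, b as second."""
    k11 = sum(1 for x, y in pairs if x == a and y == b)
    k12 = sum(1 for x, y in pairs if x == a and y != b)
    k21 = sum(1 for x, y in pairs if x != a and y == b)
    k22 = sum(1 for x, y in pairs if x != a and y != b)
    return k11, k12, k21, k22


def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 table (assumed formulation)."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:  # empty cells contribute nothing (0 * ln 0 is taken as 0)
            total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total


pairs = [("best", "times"), ("times", "worst"), ("worst", "times")]
table = contingency(pairs, "best", "times")
print(table)
print(llr(*table))
```

Note that the first word is only ever compared against the first position
and the second word against the second position, which is exactly the
position-specific counting described above.
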
On Thu, Jan 14, 2010 at 2:01 PM, Drew Farris <[email protected]> wrote:
> I have a question about precisely how the numbers that are plugged into the
> log-likelihood ratio are calculated in the context of the collocation
> discovery task, specifically whether the position of the term in the ngram
> should be taken into account when generating these counts.
>
> Starting with the basic table presented by Ted:
>
> k11 = A and B occurring together
> k12 = A occurring without B
> k21 = B occurring without A
> k22 = Neither A nor B occurring.
>
> In the context of collocation discovery, A and B refer to parts of
> ngrams. Given the simple string 'best times worst times', we have the 3
> bigrams:
>
> best times
> times worst
> worst times
>
> In the case of the ngram 'best times', A = 'best' and B = 'times'. Clearly
> best appears in only one case, but in the context of 'best times' is
> 'times' considered to appear 2 or 3 times? The same question could be asked
> about the term worst, which appears either once or twice.
>
> In other words, should the numbers plugged into the LLR calculation for
> collocations be based on the subgram position?
>
> Drew
>
--
Ted Dunning, CTO
DeepDyve