I have a question about precisely how the numbers that are plugged into the
log-likelihood ratio (LLR) are calculated in the context of the collocation
discovery task, specifically whether the position of the term in the ngram
should be taken into account when generating these counts.

Starting with the basic table presented by Ted:

k11 = A and B occurring together
k12 = A occurring without B
k21 = B occurring without A
k22 = Neither A nor B occurring.
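
For reference, here is a minimal Python sketch of how these four counts might
be turned into an LLR score, using the entropy form of the G^2 statistic. The
function names (x_log_x, entropy, llr) are my own illustration and are not
taken from any particular library.

```python
import math


def x_log_x(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized entropy of a set of counts: N ln N - sum(k ln k).
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    # Rows are A / not A, columns are B / not B.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp at zero to guard against small negative rounding errors.
    return max(0.0, 2.0 * (row + col - mat))
```

The score is zero when A and B are independent (for example llr(1, 1, 1, 1))
and grows as the table departs from independence.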

In the context of collocation discovery, A and B refer to parts of
ngrams. Given the simple string 'best times worst times', we have the 3
bigrams:

best times
times worst
worst times

In the case of the ngram 'best times', A = 'best' and B = 'times'. Clearly
'best' appears in only one bigram, but in the context of 'best times' is
'times' considered to appear 2 times (only where it is the second term) or
3 times (in any position)? The same question could be asked about the term
'worst', which appears either once or twice depending on the answer.
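
To make the two options concrete, here is a small Python sketch that builds
the table for a bigram both ways. The contingency function and its positional
flag are hypothetical names for illustration only, and the position-agnostic
branch is just one possible reading of the alternative (it does not handle a
reversed pair such as 'times best' specially).

```python
def contingency(bigrams, a, b, positional):
    # Build (k11, k12, k21, k22) for the bigram (a, b).
    n = len(bigrams)
    k11 = sum(1 for bg in bigrams if bg == (a, b))
    if positional:
        # Count A only as a first term and B only as a second term.
        count_a = sum(1 for head, _ in bigrams if head == a)
        count_b = sum(1 for _, tail in bigrams if tail == b)
    else:
        # Count every bigram that contains the term in either position.
        count_a = sum(1 for bg in bigrams if a in bg)
        count_b = sum(1 for bg in bigrams if b in bg)
    k12 = count_a - k11
    k21 = count_b - k11
    k22 = n - k11 - k12 - k21
    return k11, k12, k21, k22


tokens = "best times worst times".split()
bigrams = list(zip(tokens, tokens[1:]))

print(contingency(bigrams, "best", "times", positional=True))   # (1, 0, 1, 1)
print(contingency(bigrams, "best", "times", positional=False))  # (1, 0, 2, 0)
```

With position taken into account 'times' is counted twice, and without it
'times' is counted three times, so the two readings give different tables for
the same bigram.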

In other words, should the numbers plugged into the LLR calculation for
collocations be based on the subgram position?

Drew
