[ngram] use of window size in count.pl

ted pedersen Fri, 22 Sep 2006 16:13:55 -0700

Greetings all,

I was corresponding with someone about the --window option in count.pl,
and realized that this might be of general interest to NSP users, so
I have modified that note slightly and sent it here.


When you are counting up the bigrams in a corpus, you can specify a
--window size that allows some number of intervening words between
the two words that make up the bigram. For example...

count.pl --window 5 output input.txt

...will allow up to 3 intervening words between the words in the bigram.  
The window size is 5, and the two words in the bigram occupy the first  
and fifth position respectively, so you have up to three "spaces" for  
words left over. 

Now, if we use the --window 5 option, we simply count all the
possible bigrams that include 0, 1, 2, or 3 intervening words, then
determine the sample size from this count, and then the calculations
of the measures proceed exactly as if you were doing it without any
intervening words. This avoids, I think, any trickery or hacking of the
measures to support a more flexible notion of what a bigram can be.
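
To make the counting concrete, here is a minimal Python sketch of the
idea. It is not the count.pl code: the function name count_bigrams is
made up for illustration, and it assumes the text is simply split on
whitespace, with none of the tokenization or filtering options that
count.pl provides. It uses the sample sentence from the example in
this note.

```python
from collections import Counter


def count_bigrams(tokens, window=2):
    """Count ordered word pairs that fit inside a window of `window` words.

    Hypothetical helper for illustration only, not part of NSP.
    Returns (total, rows), where rows maps (first, second) to
    (pair count, times `first` starts any bigram,
     times `second` ends any bigram).
    """
    pair_counts = Counter()
    for i, first in enumerate(tokens):
        # The second word may sit 1 .. window-1 positions to the right,
        # i.e. with 0 .. window-2 intervening words.
        for second in tokens[i + 1 : i + window]:
            pair_counts[(first, second)] += 1

    total = sum(pair_counts.values())  # sample size
    first_counts = Counter()
    second_counts = Counter()
    for (first, second), n in pair_counts.items():
        first_counts[first] += n
        second_counts[second] += n

    rows = {
        (first, second): (n, first_counts[first], second_counts[second])
        for (first, second), n in pair_counts.items()
    }
    return total, rows


tokens = "my name is jim her name is sally".split()
for window in (2, 5):
    total, rows = count_bigrams(tokens, window)
    print(window, total, rows[("name", "is")])
# prints:
# 2 7 (2, 2, 2)
# 5 22 (2, 6, 6)
```

The only thing the window size changes is which pairs get counted;
the sample size and the per-word totals are then derived from those
pairs in the same way for any window.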

For example, suppose this is your input:

my name is jim her name is sally

If you run count.pl without specifying a window size, it defaults to
allowing no intervening words (a window size of 2). So, you could run...

count.pl output input.txt

...and you would get output like this:

7
name<>is<>2 2 2 
is<>jim<>1 2 1 
jim<>her<>1 1 1 
is<>sally<>1 2 1 
my<>name<>1 1 2 
her<>name<>1 1 2 

This tells us that there are 7 bigrams in the sample, and, for example,
the bigram "name is" occurs 2 times, where "name" occurs as the first 
word in any bigram 2 times and "is" occurs as the second word in any 
bigram 2 times...from that you can construct the 2x2 table thusly:

2  0 |  2
0  5 |  5
----------
2  5    7

...meaning that we have 7 bigrams in the sample, where 2 of them are "name
is", and the other 5 have neither "name" as their first word nor "is" as
their second word.
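
The arithmetic behind that table can be sketched in a few lines of
Python. The function name contingency_table and the argument names
(n11 for the bigram count, n1p and np1 for the two word totals, npp
for the sample size) are labels chosen for this sketch, not something
read out of the NSP code.

```python
def contingency_table(n11, n1p, np1, npp):
    """Build the 2x2 table from one count.pl line plus the sample size.

    n11: times the bigram occurs
    n1p: times the first word is first in any bigram
    np1: times the second word is second in any bigram
    npp: total number of bigrams in the sample
    """
    n12 = n1p - n11              # first word present, second word different
    n21 = np1 - n11              # second word present, first word different
    n22 = npp - n1p - np1 + n11  # neither word in its position
    return [[n11, n12], [n21, n22]]


# "name<>is<>2 2 2" with a sample size of 7
print(contingency_table(2, 2, 2, 7))
# prints: [[2, 0], [0, 5]]
```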
 
Now, if you wanted to allow up to 3 intervening words in the bigrams, you  
could run count.pl like this....

count.pl --window 5 output input.txt

and the output would be like this:

22
name<>is<>2 6 6 
name<>name<>1 6 5 
my<>is<>1 4 6 
is<>name<>1 5 5 
jim<>sally<>1 4 4 
jim<>her<>1 4 4 
name<>jim<>1 6 3 
name<>sally<>1 6 4 
is<>jim<>1 5 3 
name<>her<>1 6 4 
is<>her<>1 5 4 
is<>sally<>1 5 4 
her<>sally<>1 3 4 
my<>name<>1 4 5 
jim<>name<>1 4 5 
her<>is<>1 3 6 
my<>her<>1 4 4 
jim<>is<>1 4 6 
her<>name<>1 3 5 
my<>jim<>1 4 3 
is<>is<>1 5 6 

Notice that our sample size is different, and we have a lot more bigrams. 
But, we can do log-likelihood exactly as we "should" (in my view) without
any tampering or manipulation of the basic formula. 

Note that the table for "name is" does change here...

 2   4 |  6
 4  12 | 16
-----------
 6  16   22

So, what this reflects is the fact that allowing intervening words has 
added bigrams to the sample. We still have only 2 occurrences of "name 
is", but we have 4 other bigrams where "name" is the first word, and 4 
other bigrams where "is" is the second word. That's because of the "reach" 
of the window size pulling in more bigrams. 

You could get log-likelihood values (for example) for either of the above 
outputs from count.pl via:  

statistic.pl ll output.ll output
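
For readers who want to see what goes into such a value, here is a
rough Python sketch of the usual log-likelihood ratio (G^2) computed
from the four counts on a count.pl line. This assumes the standard
formulation, 2 * sum(observed * log(observed / expected)) over the four
cells of the 2x2 table; it has not been checked against the output of
statistic.pl, which may differ in details such as rounding or the
handling of zero cells. The function name log_likelihood is made up
for this sketch.

```python
from math import log


def log_likelihood(n11, n1p, np1, npp):
    """Standard G^2 for a 2x2 table given the bigram count (n11), the two
    word totals (n1p, np1) and the sample size (npp).

    Illustrative only; not taken from the NSP source.
    """
    # Observed cells in the order n11, n12, n21, n22.
    observed = [n11, n1p - n11, np1 - n11, npp - n1p - np1 + n11]
    # Row and column totals that go with each cell.
    row_totals = [n1p, n1p, npp - n1p, npp - n1p]
    col_totals = [np1, npp - np1, np1, npp - np1]

    total = 0.0
    for obs, row, col in zip(observed, row_totals, col_totals):
        expected = row * col / npp  # expected count under independence
        if obs > 0:                 # a zero cell contributes nothing
            total += obs * log(obs / expected)
    return 2 * total


# "name is" without a window (sample size 7) and with --window 5 (22)
print(log_likelihood(2, 2, 2, 7))
print(log_likelihood(2, 6, 6, 22))
```

The formula is the same in both calls; only the counts fed into it
change with the window size.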

I hope this helps clarify what happens when you use the --window option.
It's quite powerful I think, but hopefully fairly easy to understand. Do 
let us know if you have any questions about this (or anything else!)

Just a reminder, the most current version of NSP is now 1.03, and this is
available from links at:

http://www.d.umn.edu/~tpederse/nsp.html

Cordially,
Ted




 