Greetings all,

I was corresponding with someone about the --window option in count.pl and realized that this might be of general interest to NSP users, so I have modified that note slightly and sent it here.

When you are counting up the bigrams in a corpus, you can specify a --window size that will allow there to be some number of intervening words between the two words that make up the bigram. For example...

    count.pl --window 5 output input.txt

...will allow up to 3 intervening words between the words in the bigram. The window size is 5, and at the widest the two words in the bigram occupy the first and fifth positions respectively, so you have up to three "spaces" for words left over.

Now, if we use the --window 5 option, all we do is count all the possible bigrams that include 0, 1, 2, or 3 intervening words, figure out the sample size based on this count, and then the calculations of the measures proceed exactly as if you were doing it without any intervening words. This avoids, I think, any trickery or hacking of the measures to support a more flexible notion of what a bigram can be.

For example, suppose this is your input:

    my name is jim her name is sally

If you run count.pl without any window size, it defaults to allowing no intervening words (a window size of 2). So, you could run...

    count.pl output input.txt

...and you would get output like this:

    7
    name<>is<>2 2 2
    is<>jim<>1 2 1
    jim<>her<>1 1 1
    is<>sally<>1 2 1
    my<>name<>1 1 2
    her<>name<>1 1 2

This tells us that there are 7 bigrams in the sample, and, for example, the bigram "name is" occurs 2 times, where "name" occurs as the first word in any bigram 2 times and "is" occurs as the second word in any bigram 2 times. From that you can construct the 2x2 table thusly:

    2   0 |  2
    0   5 |  5
    ----------
    2   5    7

...meaning that we have 7 bigrams in the sample, where 2 of them are "name is", and the other 5 have neither "name" as their first word nor "is" as their second word.

Now, if you wanted to allow up to 3 intervening words in the bigrams, you could run count.pl like this:

    count.pl --window 5 output input.txt

and the output would be like this:

    22
    name<>is<>2 6 6
    name<>name<>1 6 5
    my<>is<>1 4 6
    is<>name<>1 5 5
    jim<>sally<>1 4 4
    jim<>her<>1 4 4
    name<>jim<>1 6 3
    name<>sally<>1 6 4
    is<>jim<>1 5 3
    name<>her<>1 6 4
    is<>her<>1 5 4
    is<>sally<>1 5 4
    her<>sally<>1 3 4
    my<>name<>1 4 5
    jim<>name<>1 4 5
    her<>is<>1 3 6
    my<>her<>1 4 4
    jim<>is<>1 4 6
    her<>name<>1 3 5
    my<>jim<>1 4 3
    is<>is<>1 5 6

Notice that our sample size is different, and we have a lot more bigrams. But we can do log-likelihood exactly as we "should" (in my view) without any tampering or manipulation of the basic formula. Note that the table for "name is" does change here...

    2   4 |  6
    4  12 | 16
    ----------
    6  16   22

So, what this reflects is the fact that allowing intervening words has added bigrams to the sample. We still have only 2 occurrences of "name is", but we have 4 other bigrams where "name" is the first word, and 4 other bigrams where "is" is the second word. That's because of the "reach" of the window size pulling in more bigrams.

You could get log-likelihood values (for example) for either of the above outputs from count.pl via:

    statistic.pl ll output.ll output

I hope this helps clarify what happens when you use the --window option. It's quite powerful I think, but hopefully fairly easy to understand. Do let us know if you have any questions about this (or anything else!)

Just a reminder, the most current version of NSP is now 1.03, and this is available from links at:

    http://www.d.umn.edu/~tpederse/nsp.html

Cordially,
Ted
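
P.S. For anyone who wants to check the counts above by hand, here is a minimal Python sketch of the windowed counting. This is not count.pl's actual code (NSP is written in Perl); the function names count_bigrams and contingency_table are made up for illustration, and the sketch assumes the whole input is treated as one whitespace-separated sequence of tokens.

```python
from collections import Counter


def count_bigrams(tokens, window=2):
    """Count ordered word pairs whose positions fit inside `window` tokens.

    A window of 2 means adjacent words only; a window of 5 allows up to
    3 intervening words. (Illustrative helper, not part of NSP.)
    """
    pair_counts = Counter()   # how often each (first, second) pair occurs
    left_counts = Counter()   # how often a word is the first member of any pair
    right_counts = Counter()  # how often a word is the second member of any pair
    for i, first in enumerate(tokens):
        # The second word may sit at most window - 1 positions to the right.
        for j in range(i + 1, min(i + window, len(tokens))):
            second = tokens[j]
            pair_counts[(first, second)] += 1
            left_counts[first] += 1
            right_counts[second] += 1
    total = sum(pair_counts.values())  # the sample size
    return total, pair_counts, left_counts, right_counts


def contingency_table(pair, total, pair_counts, left_counts, right_counts):
    """Build the 2x2 table of observed counts for one pair."""
    first, second = pair
    n11 = pair_counts[pair]              # first word AND second word
    n12 = left_counts[first] - n11       # first word, some other second word
    n21 = right_counts[second] - n11     # some other first word, second word
    n22 = total - n11 - n12 - n21        # neither
    return (n11, n12), (n21, n22)


tokens = "my name is jim her name is sally".split()
for window in (2, 5):
    total, pairs, left, right = count_bigrams(tokens, window)
    table = contingency_table(("name", "is"), total, pairs, left, right)
    print(window, total, table)
# prints:
# 2 7 ((2, 0), (0, 5))
# 5 22 ((2, 4), (4, 12))
```

The two printed tables match the ones worked out above for "name is", which is the point: only the counts change with the window size, and the measure is then computed from the table in the usual way.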