Good memory, Drew! No, nothing came out of it. OpenNLP and GATE have sentence boundary detection.
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

----- Original Message ----
> From: Drew Farris <[email protected]>
> To: [email protected]
> Sent: Thu, January 14, 2010 8:58:53 PM
> Subject: Re: Collocation clarification
>
> To follow up on the point about the beginning and ending units. It would
> not be very difficult to add, especially if there is something out there
> that already detects them.
>
> Does anyone know of a component in the Lucene ecosystem that can detect
> sentence boundaries? It would make sense if there was something in the
> Analyzer family that emits tokens with an end of sentence type. I saw a post
> from Otis on the java-dev list back in November regarding this issue -- did
> anything come out of it? (I really like the new Lucene Analyzer API)
>
> Thanks for explaining this again, Ted. I suspected it was the proper way to
> do the calculation, so the implementation in MAHOUT-242 currently works this
> way. It is very nice to have confirmation that I grokked the idea properly.
>
> Drew
>
> On Thu, Jan 14, 2010 at 5:49 PM, Ted Dunning wrote:
>
> > As you say, there are three n-grams. The words best, times and worst each
> > appear once at the beginning of an n-gram (total = 3). The words times and
> > worst appear at the end of a bigram twice and once respectively (total =
> > 3). The occurrences of times at the beginning and at the end of bigrams
> > are separate cases and should not be confused.
> >
> > Also, for the record, it is common to augment the corpus with beginning and
> > ending symbols. Often these are virtually added between sentences. In
> > your example, this would give us two more bigrams: the beginning symbol
> > followed by best, and times followed by the ending symbol.
> > These extra bigrams allow us, for instance, to note that the word "the"
> > commonly starts an English sentence, but rarely ends one.
> >
> > For testing the "best times" bigram, the counts would be:
> >
> > k11 = 1 (best times)
> > k12 = 0 (best NOT times)
> > k21 = 1 (NOT best times)
> > k22 = 1 (NOT best NOT times)
> >
> > Note that k** = k11+k12+k21+k22 = 3 (total number of bigrams) and k1* = k11
> > + k12 = 1 (number of times best occurred at the start of a bigram) and
> > k*1 = k11 + k21 = 2 (number of times "times" occurred at end of bigram).
> >
> > On Thu, Jan 14, 2010 at 2:01 PM, Drew Farris wrote:
> >
> > > I have a question about precisely how the numbers that are plugged into
> > > the log-likelihood ratio are calculated in the context of the collocation
> > > discovery task, specifically whether the position of the term in the
> > > ngram should be taken into account when generating these counts.
> > >
> > > Starting with the basic table presented by Ted:
> > >
> > > k11 = A and B occurring together
> > > k12 = A occurring without B
> > > k21 = B occurring without A
> > > k22 = Neither A nor B occurring.
> > >
> > > In the context of collocation discovery, A and B refer to parts of
> > > ngrams. Given the simple string 'best times worst times', we have the 3
> > > bigrams:
> > >
> > > best times
> > > times worst
> > > worst times
> > >
> > > In the case of the ngram 'best times', A = 'best' and B = 'times'.
> > > Clearly best appears in only one case, but in the context of 'best times'
> > > is 'times' considered to appear 2 or 3 times? The same question could be
> > > asked about the term worst, which appears either once or twice.
> > >
> > > In other words, should the numbers plugged into the LLR calculation for
> > > collocations be based on the subgram position?
> > >
> > > Drew
> > >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
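
For reference, the position-aware counting that Ted describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (bigrams, contingency, llr) are hypothetical and are not taken from MAHOUT-242, no sentence boundary symbols are added, and the score uses the usual entropy form of the log-likelihood ratio for a 2x2 table, which the thread itself does not spell out.

```python
from math import log


def bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))


def contingency(pairs, a, b):
    """Position-aware 2x2 counts for the bigram (a, b).

    a is only matched in the first slot and b only in the second slot,
    so 'times' at the start of 'times worst' does not count towards b.
    """
    k11 = sum(1 for x, y in pairs if x == a and y == b)  # a followed by b
    k12 = sum(1 for x, y in pairs if x == a and y != b)  # a followed by not-b
    k21 = sum(1 for x, y in pairs if x != a and y == b)  # not-a followed by b
    k22 = len(pairs) - k11 - k12 - k21                   # neither
    return k11, k12, k21, k22


def _h(*counts):
    """Sum of k * ln(k / N) over the counts, treating 0 * ln(0) as 0."""
    total = sum(counts)
    return sum(k * log(k / total) for k in counts if k > 0)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio: 2 * (H(cells) - H(row sums) - H(column sums))."""
    return 2.0 * (
        _h(k11, k12, k21, k22)
        - _h(k11 + k12, k21 + k22)
        - _h(k11 + k21, k12 + k22)
    )


if __name__ == "__main__":
    pairs = bigrams("best times worst times".split())
    counts = contingency(pairs, "best", "times")
    print(counts)        # (1, 0, 1, 1), matching the table in the thread
    print(llr(*counts))
```

With the counts from the thread (k11 = 1, k12 = 0, k21 = 1, k22 = 1) the score comes out at about 1.05. A table whose rows and columns are independent, such as (1, 1, 1, 1), scores 0.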
