Good memory, Drew!  No, nothing came out of it.  OpenNLP and GATE have sentence 
boundary detection.

 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



----- Original Message ----
> From: Drew Farris <[email protected]>
> To: [email protected]
> Sent: Thu, January 14, 2010 8:58:53 PM
> Subject: Re: Collocation clarification
> 
> To follow up on the point about the beginning and ending symbols. It would
> not be very difficult to add, especially if there is something out there that
> already detects them.
> 
> Does anyone know of a component in the Lucene ecosystem that can detect
> sentence boundaries? It would make sense if there was something in the
> Analyzer family that emits tokens with an end of sentence type. I saw a post
> from Otis on the java-dev list back in November regarding this issue -- did
> anything come out of it? (I really like the new Lucene Analyzer API)
> 
> Thanks for explaining this again Ted, I suspected it was the proper way to
> do the calculation so the implementation in MAHOUT-242 currently works this
> way. It is very nice to have confirmation that I grokked the idea properly.
> 
> Drew
> 
> On Thu, Jan 14, 2010 at 5:49 PM, Ted Dunning wrote:
> 
> > As you say, there are three n-grams.  The words best, times and worst each
> > appear once at the beginning of an n-gram (total = 3).  The words times and
> > worst appear at the end of a bigram twice and once respectively (total =
> > 3).  The occurrences of times at the beginning and at the end of bigrams
> > are separate cases and should not be confused.
> >
> > Also, for the record, it is common to augment the corpus with beginning and
> > ending symbols.  Often these are virtually added between sentences.  In your
> > example, this would give us two more bigrams: the beginning symbol followed
> > by best, and times followed by the ending symbol.
> > These extra bigrams allow us, for instance, to note that the word "the"
> > commonly starts an English sentence, but rarely ends one.
> >
> > For testing the "best times" bigram, the counts would be:
> >
> >     k11 = 1 (best times)
> >     k12 = 0 (best NOT times)
> >     k21 = 1 (NOT best times)
> >     k22 = 1 (NOT best NOT times)
> >
> > Note that k** = k11+k12+k21+k22 = 3 (total number of bigrams) and k1* = k11
> > + k12 = 1 (number of times best occurred at the start of a bigram) and k*1 =
> > k11 + k21 = 2 (number of times "times" occurred at the end of a bigram).
> >
> > On Thu, Jan 14, 2010 at 2:01 PM, Drew Farris 
> > wrote:
> >
> > > I have a question about precisely how the numbers that are plugged into
> > > the log-likelihood ratio are calculated in the context of the collocation
> > > discovery task, specifically whether the position of the term in the
> > > ngram should be taken into account when generating these counts.
> > >
> > > Starting with the basic table presented by Ted:
> > >
> > > k11 = A and B occurring together
> > > k12 = A occurring without B
> > > k21 = B occurring without A
> > > k22 = Neither A nor B occurring.
> > >
> > > In the context of collocation discovery, A and B refer to parts of
> > > ngrams. Given the simple string 'best times worst times', we have the 3
> > > bigrams:
> > >
> > > best times
> > > times worst
> > > worst times
> > >
> > > In the case of the ngram 'best times', A = 'best' and B = 'times'.
> > > Clearly best appears in only one case, but in the context of 'best times'
> > > is 'times' considered to appear 2 or 3 times? The same question could be
> > > asked about the term worst, which appears either once or twice.
> > >
> > > In other words, should the numbers plugged into the LLR calculation for
> > > collocations be based on the subgram position?
> > >
> > > Drew
> > >
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >

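The position-aware counting Ted describes can be sketched in a few lines of Python. This is only an illustration, not the MAHOUT-242 implementation: the function names bigrams, contingency and llr are made up for this example, and the score uses the standard G^2 form of the log-likelihood ratio, assuming that is the statistic intended in the thread.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return the adjacent token pairs of a token list, in order."""
    return list(zip(tokens, tokens[1:]))


def contingency(pairs, a, b):
    """Build the 2x2 table for the bigram (a, b), respecting position.

    a is only counted when it is the first element of a bigram and
    b only when it is the second element, as described in the thread.
    """
    counts = Counter(pairs)
    total = len(pairs)
    k11 = counts[(a, b)]
    head = sum(n for (first, _), n in counts.items() if first == a)
    tail = sum(n for (_, second), n in counts.items() if second == b)
    k12 = head - k11               # a first, followed by something else
    k21 = tail - k11               # b second, preceded by something else
    k22 = total - k11 - k12 - k21  # neither a first nor b second
    return k11, k12, k21, k22


def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio; empty cells contribute zero."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = (
        (k11, rows[0], cols[0]),
        (k12, rows[0], cols[1]),
        (k21, rows[1], cols[0]),
        (k22, rows[1], cols[1]),
    )
    return 2.0 * sum(
        k * math.log(k * total / (r * c)) for k, r, c in cells if k > 0
    )


pairs = bigrams("best times worst times".split())
table = contingency(pairs, "best", "times")
print(table)        # (1, 0, 1, 1), matching k11, k12, k21, k22 above
print(llr(*table))
```

To try the boundary-symbol variant, pad the token list with marker tokens of your choice before calling bigrams; the marker strings are up to you, since the thread does not fix a spelling for them.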