On Jan 14, 2010, at 9:45 PM, Ted Dunning wrote:

> Pretty OK sentence boundary detection in English can be done by finding full
> stops, ?, ! and excess new-lines and then qualifying the potential breaks.

Yeah, I've found that using Java's BreakIterator (there's one for Sentences and 
it is supposedly Locale aware) plus some simple edge-case modifications does 
quite well.  I've got an implementation/demo in Taming Text, I think, but may 
also have one lying around somewhere else.
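
Something along these lines works as a quick demo -- this is just the general 
shape of the approach with the plain JDK, not the Taming Text code:

import java.text.BreakIterator;
import java.util.Locale;

public class SentenceSplitDemo {
  public static void main(String[] args) {
    String text = "It was the best of times. It was the worst of times! Was it?";
    // Locale-aware sentence BreakIterator from the JDK.
    BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
    it.setText(text);
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
      System.out.println(text.substring(start, end).trim());
    }
  }
}

That prints one sentence per line; the edge-case modifications I mentioned 
(abbreviations, stray punctuation, etc.) would go on top of this.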

The only tricky thing is that you have to buffer the tokens in Lucene, which is 
slightly annoying with the new incrementToken API, but not horrible.  Then, 
once you find the break, just output a special token.  Maybe also consider 
increasing the position increment.
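
Roughly the shape I have in mind, as a sketch only (class and token names are 
made up, it uses the Lucene 3.0-era attribute classes, and it cheats by looking 
at the current token rather than buffering a real window):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class SentenceMarkerFilter extends TokenFilter {
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt =
      addAttribute(PositionIncrementAttribute.class);
  private boolean pendingMarker = false;

  public SentenceMarkerFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pendingMarker) {
      // Emit the special boundary token.  Bumping the position increment
      // here (or on the next real token) keeps phrases and shingles from
      // spanning the sentence break.
      clearAttributes();
      termAtt.setTermBuffer("$SENT$");
      posIncrAtt.setPositionIncrement(10);
      pendingMarker = false;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    // Crude check; assumes the upstream tokenizer keeps trailing punctuation.
    String term = termAtt.term();
    if (term.endsWith(".") || term.endsWith("?") || term.endsWith("!")) {
      pendingMarker = true;
    }
    return true;
  }
}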

Otherwise, OpenNLP has one too, which is pretty easy to use but requires a 
chunk of memory for the model.

> 
> For the full stops, you then have to check the preceding and following token.
> If you have mixed case text, the following token can eliminate some breaks
> by not being capitalized.  The preceding token eliminates the break by being
> one of about two dozen special cases.
> 
> There are a few other cases, but that is 90% of the game.
> 
> Heavy weight framework not needed.
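
(Concretely, that qualification step boils down to something like the 
following -- the names are made up and the abbreviation list is just a sample; 
it is only meant to show the skeleton of the checks Ted describes above:)

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class BreakQualifier {

  // A sample of the "about two dozen special cases": abbreviations that take
  // a trailing full stop without ending the sentence.
  private static final Set<String> ABBREVIATIONS = new HashSet<String>(
      Arrays.asList("mr", "mrs", "dr", "prof", "etc", "vs", "e.g", "i.e"));

  // preceding/following are the tokens around a candidate full stop.
  public static boolean isSentenceBreak(String preceding, String following) {
    // In mixed-case text, a lower-case following token usually means the
    // full stop was not a real break.
    if (following != null && following.length() > 0
        && Character.isLowerCase(following.charAt(0))) {
      return false;
    }
    // A known abbreviation before the full stop also rules out a break.
    if (preceding != null) {
      String p = preceding.toLowerCase().replaceAll("\\.$", "");
      if (ABBREVIATIONS.contains(p)) {
        return false;
      }
    }
    return true;
  }
}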
> 
> On Thu, Jan 14, 2010 at 6:33 PM, Otis Gospodnetic <
> [email protected]> wrote:
> 
>> Good memory, Drew!  No, nothing came out of it.  OpenNLP and GATE have
>> sentence boundary detection.
>> 
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
>> 
>> 
>> 
>> ----- Original Message ----
>>> From: Drew Farris <[email protected]>
>>> To: [email protected]
>>> Sent: Thu, January 14, 2010 8:58:53 PM
>>> Subject: Re: Collocation clarification
>>> 
>>> To follow up on the point about the sentence begin/end units: it would
>>> not be very difficult to add them, especially if there is something out
>>> there that already detects them.
>>> 
>>> Does anyone know of a component in the Lucene ecosystem that can detect
>>> sentence boundaries? It would make sense if there was something in the
>>> Analyzer family that emits tokens with an end-of-sentence type. I saw a
>>> post from Otis on the java-dev list back in November regarding this
>>> issue -- did anything come out of it? (I really like the new Lucene
>>> Analyzer API)
>>> 
>>> Thanks for explaining this again, Ted. I suspected it was the proper way
>>> to do the calculation, so the implementation in MAHOUT-242 currently
>>> works this way. It is very nice to have confirmation that I grokked the
>>> idea properly.
>>> 
>>> Drew
>>> 
>>> On Thu, Jan 14, 2010 at 5:49 PM, Ted Dunning wrote:
>>> 
>>>> As you say, there are three n-grams.  The words best, times and worst
>>>> each appear once at the beginning of an n-gram (total = 3).  The words
>>>> times and worst appear at the end of a bigram twice and once
>>>> respectively (total = 3).  The occurrences of times at the beginning
>>>> and at the end of bigrams are separate cases and should not be
>>>> confused.
>>>> 
>>>> Also, for the record, it is common to augment the corpus with beginning
>>>> and ending symbols.  Often these are virtually added between sentences.
>>>> In your example, this would give us two more bigrams: the begin symbol
>>>> paired with best, and times paired with the end symbol.  These extra
>>>> bigrams allow us, for instance, to note that the word "the" commonly
>>>> starts an English sentence, but rarely ends one.
>>>> 
>>>> For testing the "best times" bigram, the counts would be:
>>>> 
>>>>    k11 = 1  (best times)
>>>>    k12 = 0 (best NOT times)
>>>>    k21 = 1 (NOT best times)
>>>>    k22 = 1 (NOT best NOT times)
>>>> 
>>>> Note that k** = k11+k12+k21+k22 = 3 (total number of bigrams), k1* =
>>>> k11 + k12 = 1 (number of times best occurred at the beginning of a
>>>> bigram), and k*1 = k11 + k21 = 2 (number of times "times" occurred at
>>>> the end of a bigram).
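
For anyone following along at home, those four counts plug into the standard 
2x2 log-likelihood ratio (G^2) as below.  This is just the textbook formula 
for illustration, not the MAHOUT-242 code:

public class LlrExample {

  // x * ln(x), with the usual convention that 0 * ln(0) = 0.
  static double xLogX(double x) {
    return x == 0.0 ? 0.0 : x * Math.log(x);
  }

  // G^2 = 2 * sum_ij k_ij * ln(k_ij * N / (rowSum_i * colSum_j))
  static double llr(long k11, long k12, long k21, long k22) {
    double rowEntropy = xLogX(k11 + k12) + xLogX(k21 + k22);
    double colEntropy = xLogX(k11 + k21) + xLogX(k12 + k22);
    double matrixEntropy = xLogX(k11) + xLogX(k12) + xLogX(k21) + xLogX(k22);
    double n = k11 + k12 + k21 + k22;
    return 2.0 * (matrixEntropy - rowEntropy - colEntropy + xLogX(n));
  }

  public static void main(String[] args) {
    // The "best times" counts above: k11=1, k12=0, k21=1, k22=1.
    System.out.println(llr(1, 0, 1, 1)); // roughly 1.05
  }
}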
>>>> 
>>>> On Thu, Jan 14, 2010 at 2:01 PM, Drew Farris
>>>> wrote:
>>>> 
>>>>> I have a question about precisely how the numbers that are plugged
>>>>> into the log-likelihood ratio are calculated in the context of the
>>>>> collocation discovery task, specifically whether the position of the
>>>>> term in the ngram should be taken into account when generating these
>>>>> counts.
>>>>> 
>>>>> Starting with the basic table presented by Ted:
>>>>> 
>>>>> k11 = A and B occurring together
>>>>> k12 = A occurring without B
>>>>> k21 = B occurring without A
>>>>> k22 = Neither A nor B occurring.
>>>>> 
>>>>> In the context of collocation discovery, A and B refer to parts of
>>>>> ngrams. Given the simple string 'best times worst times', we have the
>>>>> 3 bigrams:
>>>>> 
>>>>> best times
>>>>> times worst
>>>>> worst times
>>>>> 
>>>>> In the case of the ngram 'best times', A = 'best' and B = 'times'.
>>>>> Clearly best appears in only one case, but in the context of 'best
>>>>> times' is 'times' considered to appear 2 or 3 times? The same question
>>>>> could be asked about the term worst, which appears either once or
>>>>> twice depending on the answer.
>>>>> 
>>>>> In other words, should the numbers plugged into the LLR calculation
>>>>> for collocations be based on the subgram position?
>>>>> 
>>>>> Drew
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Ted Dunning, CTO
>>>> DeepDyve
>>>> 
>> 
>> 
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search
