Re: Abbreviation in SentenceDetector

[email protected] Wed, 06 Jul 2011 06:45:00 -0700

Sorry for the late answer...

On Tue, Jun 28, 2011 at 4:43 PM, Jörn Kottmann <[email protected]> wrote:

> On 6/15/11 9:07 PM, [email protected] wrote:
>
>> 1) How is the sentence detector using the abbreviation dictionary? All train
>> methods in SentenceDetectorME take an abbreviation dictionary as an argument,
>> but only save it to the model. The dictionary is not used to create
>> the context generator, but it should be, shouldn't it?
>>
>
> I am not sure how the dictionary is used, or what the intent was.
> Do we have features in the sentence detectors which are based on
> a dictionary?
>

Yes, we have. The constructor of DefaultSDContextGenerator takes a
Set<String> inducedAbbreviations as an argument and uses it to populate the
contextual features. However, this constructor is not used anywhere inside the
project.
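
To make the idea concrete, here is a rough sketch of how a set of abbreviations
could contribute a contextual feature for a candidate sentence boundary. This is
not the OpenNLP code: the class AbbreviationFeatureSketch, the method features,
and the feature names "prefix=" and "abbrev" are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not OpenNLP code: shows how a set of known
// abbreviations could add a feature for a candidate sentence boundary.
final class AbbreviationFeatureSketch {

    // Returns features for the candidate end-of-sentence character at eosIndex.
    static List<String> features(String text, int eosIndex, Set<String> abbreviations) {
        List<String> features = new ArrayList<>();

        // Walk back to the start of the whitespace-delimited token ending at eosIndex.
        int start = eosIndex;
        while (start > 0 && !Character.isWhitespace(text.charAt(start - 1))) {
            start--;
        }
        String prefix = text.substring(start, eosIndex);           // e.g. "Dr"
        String tokenWithEos = text.substring(start, eosIndex + 1); // e.g. "Dr."

        features.add("prefix=" + prefix);

        // Dictionary-based feature: fires when the token, with or without
        // its trailing character, is a known abbreviation.
        if (abbreviations.contains(prefix) || abbreviations.contains(tokenWithEos)) {
            features.add("abbrev");
        }
        return features;
    }
}
```

With a set containing "Dr.", the dot in "Dr. Smith arrived." would get the
"abbrev" feature, while the final dot would not. A model trained with such a
feature could learn that a dot after a known abbreviation rarely ends a sentence.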

BTW, shouldn't we have something similar in the Tokenizer? I noticed that a lot
of the Tokenizer's false positives were caused by abbreviations. My feeling
is that there are so many cases where the token should be separated from the
dot that the Tokenizer will always split it off.

> Let's get the dictionary support into a good state again.
>

I'll start working on that soon.
Thanks
