On 7/6/11 3:44 PM, [email protected] wrote:
Sorry for the late answer...

On Tue, Jun 28, 2011 at 4:43 PM, Jörn Kottmann <[email protected]> wrote:

On 6/15/11 9:07 PM, [email protected] wrote:

1) How is the sentence detector using the abbreviation dictionary? All train
methods in SentenceDetectorME take an abbreviation dictionary as an argument,
but only save it to the model. The dictionary is not used to create the
context generator, but it should be, shouldn't it?

I am not sure how the dictionary is used, or what the intent was.
Do we have features in the sentence detectors which are based on
a dictionary?

Yes, we have. The constructor of the DefaultSDContextGenerator takes a
Set<String> inducedAbbreviations as argument and it is used to populate the
contextual features. This constructor is not used anywhere inside the
project.
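
To make the idea concrete, here is a minimal sketch of how a set of
abbreviations could feed a contextual feature for a candidate sentence end.
It is not the DefaultSDContextGenerator code: the class name
AbbrevContextSketch, the method features, and the feature strings "x=" and
"xabbrev" are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; not the OpenNLP DefaultSDContextGenerator. */
public class AbbrevContextSketch {

    /**
     * Builds features for a candidate sentence end at index pos in text.
     * The feature names used here are assumptions made for this example.
     */
    public static List<String> features(String text, int pos, Set<String> abbreviations) {
        List<String> feats = new ArrayList<>();

        // Scan back to the previous whitespace to find the token before the candidate.
        int start = pos;
        while (start > 0 && !Character.isWhitespace(text.charAt(start - 1))) {
            start--;
        }
        String prefix = text.substring(start, pos);
        feats.add("x=" + prefix);

        // Dictionary-based feature: fires when prefix plus the candidate
        // character is a known abbreviation, e.g. "Dr" + "." = "Dr.".
        if (abbreviations.contains(prefix + text.charAt(pos))) {
            feats.add("xabbrev");
        }
        return feats;
    }

    public static void main(String[] args) {
        Set<String> abbr = new HashSet<>(Arrays.asList("Dr.", "Mr."));
        System.out.println(features("Dr. Smith left.", 2, abbr));  // dot after "Dr"
        System.out.println(features("Dr. Smith left.", 14, abbr)); // final dot
    }
}
```

In a setup like this the statistical model still makes the decision; the
dictionary only contributes one more feature that the model can weigh.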

+1 to fix this, and add proper support for it again.

BTW, shouldn't we have something similar in the Tokenizer? I noticed that a lot
of the Tokenizer's false positives were caused by abbreviations. My feeling
is that there are so many cases where the token should be separated from the
dot that it will always split it.
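
As a rough illustration of the effect a dictionary could have, the sketch
below splits a trailing dot from a token unless the token is a known
abbreviation. It is a simplified rule-based stand-in, not the Tokenizer
implementation; in a statistical tokenizer the lookup would more likely be
exposed as a feature. The names AbbrevTokenizerSketch and tokenize are
hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; not the OpenNLP tokenizer. */
public class AbbrevTokenizerSketch {

    /** Splits on whitespace, then detaches a final dot unless the chunk is an abbreviation. */
    public static List<String> tokenize(String text, Set<String> abbreviations) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue; // happens only for empty or all-whitespace input
            }
            boolean endsWithDot = chunk.length() > 1 && chunk.endsWith(".");
            if (endsWithDot && !abbreviations.contains(chunk)) {
                // Not a known abbreviation: separate the token from the dot.
                tokens.add(chunk.substring(0, chunk.length() - 1));
                tokens.add(".");
            } else {
                // Known abbreviation (or no trailing dot): keep the chunk whole.
                tokens.add(chunk);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> abbr = new HashSet<>(Arrays.asList("Mr.", "Dr."));
        System.out.println(tokenize("Mr. Smith arrived.", abbr));
    }
}
```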

+1 to add dictionary support to the tokenizer also.
Let's get the dictionary support into a good state again.

I'll start working on that soon.


Nice, please open Jira issues for the two changes.

Jörn
