Sorry for the late answer...

On Tue, Jun 28, 2011 at 4:43 PM, Jörn Kottmann <[email protected]> wrote:
> On 6/15/11 9:07 PM, [email protected] wrote:
>
>> 1) How is the sentence detector using the abbreviation dictionary? All
>> train methods in SentenceDetectorME take an abbreviation dictionary as
>> argument, but only save it to the model. It is not using the dictionary
>> to create the context generator, but it should, shouldn't it?
>
> I am not sure how the dictionary is used, or what the intent was.
> Do we have features in the sentence detectors which are based on
> a dictionary?

Yes, we have. The constructor of DefaultSDContextGenerator takes a
Set<String> inducedAbbreviations as an argument and uses it to populate the
contextual features. However, this constructor is not used anywhere inside
the project.

BTW, shouldn't we have something similar in the Tokenizer? I noticed that a
lot of the Tokenizer's false positives were caused by abbreviations. My
feeling is that there are so many cases where the token should be separated
from the dot that the model will always split it off.

> Let's get the dictionary support in a good state again.

I'll start working on that soon.

Thanks
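
To make the idea concrete, here is a minimal, self-contained sketch of how a context generator could turn an abbreviation set into a contextual feature. It is not the OpenNLP implementation: the class AbbreviationFeatureSketch, the method features, and the feature strings are hypothetical names chosen for illustration, and I am assuming the dictionary entries are stored with their trailing dot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is not OpenNLP code.
final class AbbreviationFeatureSketch {

    private AbbreviationFeatureSketch() {
    }

    /**
     * Builds features for a candidate sentence end located at
     * text.charAt(dotIndex). The abbreviation set is assumed to hold
     * entries including the trailing dot, e.g. "Dr.".
     */
    static List<String> features(String text, int dotIndex, Set<String> abbreviations) {
        // Walk left to the start of the whitespace-delimited token that ends at the dot.
        int start = dotIndex;
        while (start > 0 && !Character.isWhitespace(text.charAt(start - 1))) {
            start--;
        }
        String token = text.substring(start, dotIndex + 1);

        List<String> feats = new ArrayList<>();
        feats.add("token=" + token);
        // The dictionary-based feature: fires only for known abbreviations.
        if (abbreviations.contains(token)) {
            feats.add("isAbbrev");
        }
        return feats;
    }
}
```

With the set {"Dr.", "etc."} and the text "Dr. Smith arrived.", the dot at index 2 gets the isAbbrev feature and the final dot does not. The same kind of feature could be offered to the Tokenizer, so the model has evidence for keeping the dot attached to an abbreviation.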
