Re: SentenceDetector & Abbreviations

William Colen Thu, 27 Mar 2014 06:46:38 -0700

Siarhei,

The abbreviation dictionary is used both at training time and at run
time. During training, OpenNLP uses it while extracting features from the
training data: it checks whether a token is present in the dictionary and,
if there is a match, adds a feature to the model. At run time, the
featurizer checks, among other things, whether a token can be an
abbreviation, and adds that to the list of features which will be used to
decide if it is a sentence separator or not.


In this case, you need to keep in mind that:
1) A match between a token and an entry in the abbreviation dictionary is
_not_ enough for OpenNLP to treat the token as an abbreviation; it takes
the whole context into account when deciding.
2) Training is important. If there was no abbreviation dictionary during
training, or if the training data does not contain any abbreviation
matching the entries in the dictionary, OpenNLP will never add an
abbreviation dictionary feature to the model. As a result, at run time
it will not know what to do when an abbreviation dictionary feature is
found.
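
Point 2 can be illustrated with a toy sketch. This is not OpenNLP code: the
feature names and weights below are made up for illustration, and the scoring
is a plain sum of learned weights. It only shows why a feature that never
received a weight during training cannot change the decision at run time.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UnseenFeatureDemo {
    // Hypothetical feature names; OpenNLP's real feature strings differ.
    static final String ABBREV_FEATURE = "inAbbrevDict";
    static final String NEXT_UPPER_FEATURE = "nextIsUpper";

    /**
     * Sums the learned weights of the active features. A feature that was
     * never seen in training has no weight, so it contributes nothing.
     */
    static double score(Map<String, Double> weights, List<String> activeFeatures) {
        double sum = 0.0;
        for (String feature : activeFeatures) {
            sum += weights.getOrDefault(feature, 0.0);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Toy model trained WITHOUT a dictionary: only the context feature
        // has a weight (the values are invented for this example).
        Map<String, Double> trainedWithout = new HashMap<>();
        trainedWithout.put(NEXT_UPPER_FEATURE, 1.5);

        // Toy model trained WITH a dictionary and matching training data:
        // the dictionary feature has a weight as well.
        Map<String, Double> trainedWith = new HashMap<>(trainedWithout);
        trainedWith.put(ABBREV_FEATURE, -2.0);

        // At run time the dictionary matches, so both features are active.
        List<String> active = Arrays.asList(ABBREV_FEATURE, NEXT_UPPER_FEATURE);

        System.out.println("without dictionary at training: " + score(trainedWithout, active));
        System.out.println("with dictionary at training:    " + score(trainedWith, active));
    }
}
```

In the first model the dictionary match is silently ignored, which is the
same symptom as "the result is the same as without a dictionary".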

To understand it better, you can extract the model using a Zip utility and
take a look at the abbreviation dictionary inside it. You can check if
"corp." is there, and also try a few other abbreviations to check the
behavior.
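
If you prefer code to a Zip utility, a minimal sketch like the one below
lists what is packaged inside the model file using only java.util.zip. The
class name ModelEntryLister and the default path "en-sent.bin" are
placeholders for this example; the dictionary will only show up as an entry
if the model was actually built with one.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ModelEntryLister {

    /** Returns the names of all entries in the Zip archive at the given path. */
    static List<String> listEntries(String path) throws IOException {
        List<String> names = new ArrayList<>();
        // try-with-resources closes the archive even if reading fails
        try (ZipFile zip = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder default; pass the real model path as the first argument.
        String path = args.length > 0 ? args[0] : "en-sent.bin";
        for (String name : listEntries(path)) {
            System.out.println(name);
        }
    }
}
```

Once you know the entry names, you can extract the dictionary entry and look
for "corp." in it.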

Regards,
William



2014-03-27 9:27 GMT-03:00 Siarhei Rusak <[email protected]>:

> Hello,
>
> It seems I'm doing something wrong, but the documentation and forum
> aren't very helpful in my case.
> My goal is to add abbreviations to SentenceDetector, but I can't succeed.
> I'm trying to use this constructor overload:
>
> public SentenceModel(String languageCode,
>                      opennlp.model.AbstractModel sentModel,
>                      boolean useTokenEnd,
>                      Dictionary abbreviations)
>
> and a trivial model from the OpenNLP repository.
>
> Here is a code example (it's a C# port via IKVM, so don't be confused):
>
> var abbreviations = new Dictionary();
> abbreviations.put(new StringList("corp."));
>
> // path to the file extracted from "en-sent.bin"
> var modelPath = @"....\sent.model";
> var dataStream = new DataInputStream(new FileInputStream(modelPath));
> var sentenceModel = new BinaryGISModelReader(dataStream).getModel();
> var abbreviatedSentenceModel = new SentenceModel("en", sentenceModel, true,
>     abbreviations);
> .............................
>
> var sentenceSplitter = new SentenceDetectorME(abbreviatedSentenceModel);
> sentenceSplitter.sentDetect(text);
>
> The result of its execution is the same as if there were no
> abbreviation dictionary at all.
> So I suppose that either there is another way to do this, or
> it's a bug.
> Could you help, please?
>
> Thanks in advance,
> Siarhei.
>
