Re: SentenceDetector & Abbreviations

Siarhei Rusak Fri, 28 Mar 2014 10:48:27 -0700

Hello, William.

My goal was to use the existing (default) sentence model and simply add some
abbreviations to it.
If I understood you correctly, there is no way to do that, because I would
need my own training data, which cannot be extracted from the existing model.
Is that correct?


Thanks,
Siarhei.

2014-03-27 16:45 GMT+03:00 William Colen <[email protected]>:

> Siarhei,
>
> The abbreviation dictionary is used both during training and execution
> time. OpenNLP will use it during training time while extracting features
> from training data. It will check if a token is present in the dictionary,
> and if there is a match, it will add a feature to the model. During
> runtime, the featurizer will, among other things, check if a token can be
> an abbreviation, and add it to the list of features which will be used to
> decide if it is a sentence separator or not.
>
> In this case, you need to keep in mind that:
> 1) A match between a token and an entry in the abbreviation dictionary is
> _not_ enough for OpenNLP to treat the token as an abbreviation; it takes
> the whole context into account when deciding.
> 2) Training is important. If there was no abbreviation dictionary during
> training, or if the training data does not contain any abbreviation
> matching the entries in the dictionary, OpenNLP will never add an
> abbreviation dictionary feature to the model. This means that at runtime
> it will not know what to do when an abbreviation dictionary feature is
> found.
>
> To understand it better, you can extract the model using a Zip utility and
> take a look at the abbreviation dictionary inside it. You can check if
> "corp." is there, and also try a few other abbreviations to check the
> behavior.
>
> Regards,
> William
>
>
>
> 2014-03-27 9:27 GMT-03:00 Siarhei Rusak <[email protected]>:
>
> > Hello,
> >
> > I seem to be doing something wrong, but the documentation and the forum
> > aren't very helpful in my case.
> > My goal is to add abbreviations to SentenceDetector, but I can't get it
> > to work. I'm trying to use this constructor overload:
> >
> > public SentenceModel(String languageCode,
> >                      opennlp.model.AbstractModel sentModel,
> >                      boolean useTokenEnd,
> >                      Dictionary abbreviations)
> >
> > (Dictionary:
> > http://opennlp.apache.org/documentation/1.5.3/apidocs/opennlp-tools/opennlp/tools/dictionary/Dictionary.html)
> >
> > and a trivial model from the OpenNLP repository.
> >
> > Here is a code example (it's a C# port via IKVM, so don't be confused):
> >
> > var abbreviations = new Dictionary();
> > abbreviations.put(new StringList("corp."));
> >
> > // path to the file extracted from "en-sent.bin"
> > var modelPath = @"....\sent.model";
> > var dataStream = new DataInputStream(new FileInputStream(modelPath));
> > var sentenceModel = new BinaryGISModelReader(dataStream).getModel();
> > var abbreviatedSentenceModel =
> >     new SentenceModel("en", sentenceModel, true, abbreviations);
> > .............................
> >
> > var sentenceSplitter = new SentenceDetectorME(abbreviatedSentenceModel);
> > sentenceSplitter.sentDetect(text);
> >
> > The result of its execution is the same as if there were no abbreviation
> > dictionary at all.
> > So I suppose that either there is another way to do this, or it's a bug.
> > Could you help, please?
> >
> > Thanks In Advance,
> > Siarhei.
> >
>
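
To make point 2 of William's reply concrete, here is a toy sketch in plain Java. It is not OpenNLP code: the class name, the feature names ("default", "tok=...", "abbrev") and the weights are all made up for illustration. It only shows why a dictionary feature that was never seen during training cannot change the decision at runtime: the model has no weight for it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy illustration only; names and weights are hypothetical, not OpenNLP's.
public class AbbreviationFeatureSketch {

    // Builds features for a candidate token that ends with a period.
    public static List<String> features(String token, Set<String> abbreviations) {
        List<String> feats = new ArrayList<>();
        feats.add("default");                      // present for every candidate
        feats.add("tok=" + token.toLowerCase());   // the token itself
        if (abbreviations.contains(token.toLowerCase())) {
            feats.add("abbrev");                   // dictionary match
        }
        return feats;
    }

    // Sums the learned weights. A feature the model never saw in training
    // has no weight, so it contributes nothing to the score.
    public static double score(List<String> feats, Map<String, Double> weights) {
        double sum = 0.0;
        for (String f : feats) {
            sum += weights.getOrDefault(f, 0.0);
        }
        return sum;
    }

    public static void main(String[] args) {
        Set<String> dict = Collections.singleton("corp.");
        Set<String> noDict = Collections.emptySet();

        // Model "trained" without a dictionary: no weight for "abbrev".
        Map<String, Double> trainedWithout = new HashMap<>();
        trainedWithout.put("default", 1.0);

        // Model "trained" with a dictionary: "abbrev" argues against a split.
        Map<String, Double> trainedWith = new HashMap<>(trainedWithout);
        trainedWith.put("abbrev", -2.5);

        System.out.println(score(features("Corp.", noDict), trainedWithout));
        System.out.println(score(features("Corp.", dict), trainedWithout));
        System.out.println(score(features("Corp.", dict), trainedWith));
    }
}
```

In this sketch a positive score means "sentence ends here". The first two scores are identical, which mirrors the behaviour described in the original question: supplying a dictionary to a model that was trained without one changes nothing.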

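William's suggestion to open the model with a Zip utility can also be done from code. The sketch below uses only java.util.zip from the JDK to list what is inside a model archive such as en-sent.bin; the class and method names are my own, and I am not assuming anything about which entry names the archive contains beyond the sent.model file mentioned above.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper; it treats the model file as an ordinary zip archive.
public class ModelArchiveLister {

    // Returns the names of all entries found in the zip archive.
    public static List<String> listEntries(String archivePath) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(archivePath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Expects the path to the model file as the only argument.
        if (args.length != 1) {
            System.err.println("usage: java ModelArchiveLister <model-file>");
            return;
        }
        for (String name : listEntries(args[0])) {
            System.out.println(name);
        }
    }
}
```

Running it against the model file prints one entry name per line, which is enough to see whether an abbreviation dictionary is packaged with the model at all.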


-- 
Best regards, S. Rusak
