Re: SentenceDetector & Abbreviations

William Colen Sun, 30 Mar 2014 19:10:31 -0700

Exactly. I just checked the English sentence detector model and it was not
trained with an abbreviation dictionary. In this case I believe including
one during runtime has no effect.
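
To illustrate why a dictionary supplied only at runtime changes nothing, here is a toy sketch. It is not OpenNLP's implementation: the class UnseenFeatureDemo, the feature names and the weights are all made up for illustration. The only assumption is that a feature with no learned weight contributes nothing to the decision.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UnseenFeatureDemo {

    // Sum the learned weights of the active features. A feature that has no
    // learned weight (it never appeared during training) contributes nothing.
    static double score(Map<String, Double> weights, List<String> activeFeatures) {
        double total = 0.0;
        for (String feature : activeFeatures) {
            total += weights.getOrDefault(feature, 0.0);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical weights "learned" without an abbreviation dictionary,
        // so there is no entry for the "abbrev" feature.
        Map<String, Double> weights = new HashMap<>();
        weights.put("next=capitalized", 1.5);
        weights.put("prev=short", -0.5);

        List<String> withoutDict = Arrays.asList("next=capitalized", "prev=short");
        List<String> withDict = Arrays.asList("next=capitalized", "prev=short", "abbrev");

        // The extra runtime feature has no weight, so both scores are equal.
        System.out.println(score(weights, withoutDict) == score(weights, withDict)); // prints "true"
    }
}
```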


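One way to repeat this kind of check is to list the entries of the model archive, since the model file can be opened with a Zip utility. Below is a minimal sketch using only java.util.zip. The class name ModelEntries and the method listEntries are made up for illustration, and the sketch does not assume what the abbreviation dictionary entry is called.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ModelEntries {

    // Return the names of all entries in a zip-format model file.
    static List<String> listEntries(String path) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ModelEntries <model file>");
            return;
        }
        // Print one entry name per line; look for a dictionary entry by eye.
        for (String name : listEntries(args[0])) {
            System.out.println(name);
        }
    }
}
```

Run it with the path to a model file such as en-sent.bin. If no dictionary entry shows up next to sent.model, the archive carries no abbreviation dictionary.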

2014-03-28 10:54 GMT-03:00 Siarhei Rusak <[email protected]>:

> Hello, William.
>
> My goal was to use the existing (default) sentence model, but "to add some
> abbreviations" to it.
> If I understood you correctly, there is no way to do that, because I would
> need my own sample data, which I cannot extract from the existing model. Is
> that correct?
>
> Thanks,
> Siarhei.
>
> 2014-03-27 16:45 GMT+03:00 William Colen <[email protected]>:
>
> > Siarhei,
> >
> > The abbreviation dictionary is used both at training time and at runtime.
> > During training, OpenNLP uses it while extracting features from the
> > training data: it checks whether a token is present in the dictionary,
> > and if there is a match, it adds a feature to the model. At runtime, the
> > featurizer will, among other things, check whether a token can be an
> > abbreviation, and add that to the list of features used to decide whether
> > it is a sentence separator or not.
> >
> > In this case, you need to keep in mind that:
> > 1) A match between a token and an entry in the abbreviation dictionary is
> > _not_ enough for OpenNLP to treat the token as an abbreviation; it takes
> > the whole context into account to decide.
> > 2) Training is important. If there wasn't an abbreviation dictionary
> > during training, or if the training data does not contain any abbreviation
> > matching the abbreviations in the dictionary, OpenNLP will never add an
> > abbreviation dictionary feature to the model. This means that at runtime
> > it will not know what to do when an abbreviation dictionary feature is
> > found.
> >
> > To understand it better, you can extract the model using a Zip utility
> > and take a look at the abbreviation dictionary inside it. You can check if
> > "corp." is there, and also try a few other abbreviations to check the
> > behavior.
> >
> > Regards,
> > William
> >
> >
> >
> > 2014-03-27 9:27 GMT-03:00 Siarhei Rusak <[email protected]>:
> >
> > > Hello,
> > >
> > > It seems I'm doing something wrong, but the documentation and forum
> > > aren't very helpful in my case.
> > > My goal is to add abbreviations to SentenceDetector, but I can't
> > > succeed.
> > > I'm trying to use this constructor overload:
> > >
> > > public SentenceModel(String languageCode,
> > >                      opennlp.model.AbstractModel sentModel,
> > >                      boolean useTokenEnd,
> > >                      Dictionary abbreviations)
> > >
> > > (Dictionary: <http://opennlp.apache.org/documentation/1.5.3/apidocs/opennlp-tools/opennlp/tools/dictionary/Dictionary.html>)
> > >
> > > and a trivial model from the OpenNLP repository.
> > >
> > > Here is a code example (it's a C# port via IKVM, so don't be confused):
> > >
> > > var abbreviations = new Dictionary();
> > > abbreviations.put(new StringList("corp."));
> > >
> > > // path to the file extracted from "en-sent.bin"
> > > var modelPath = @"....\sent.model";
> > > var dataStream = new DataInputStream(new FileInputStream(modelPath));
> > > var sentenceModel = new BinaryGISModelReader(dataStream).getModel();
> > > var abbreviatedSentenceModel = new SentenceModel("en", sentenceModel, true, abbreviations);
> > > .............................
> > >
> > > var sentenceSplitter = new SentenceDetectorME(abbreviatedSentenceModel);
> > > sentenceSplitter.sentDetect(text);
> > >
> > > The result of its execution is the same as if there were no
> > > abbreviation dictionary at all.
> > > So I suppose that either there is some other way to do this, or
> > > it's a bug.
> > > Could you help, please?
> > >
> > > Thanks In Advance,
> > > Siarhei.
> > >
> >
>
>
>
> --
> Best regards, Rusak S.
>
