Hello, William.

My goal was to use the existing (default) sentence model and simply add some abbreviations to it. If I understood you correctly, there is no way to do that, because I would need my own training data, which I cannot extract from the existing model. Is that correct?
Thanks,
Siarhei.

2014-03-27 16:45 GMT+03:00 William Colen:

> Siarhei,
>
> The abbreviation dictionary is used both at training time and at
> execution time. During training, OpenNLP uses it while extracting
> features from the training data: it checks whether a token is present in
> the dictionary and, if there is a match, adds a feature to the model. At
> runtime, the featurizer will, among other things, check whether a token
> can be an abbreviation, and add that to the list of features used to
> decide whether it is a sentence separator or not.
>
> In this case, you need to keep in mind that:
> 1) A match between a token and an entry in the abbreviation dictionary is
> _not_ enough for OpenNLP to conclude that it is an abbreviation; it
> takes the whole context into account to decide.
> 2) Training is important. If there was no abbreviation dictionary during
> training, or if the training data does not contain any abbreviation
> matching the abbreviations in the dictionary, OpenNLP will never add an
> abbreviation dictionary feature to the model. This means that at runtime
> it will not know what to do when an abbreviation dictionary feature is
> found.
>
> To understand it better, you can extract the model using a Zip utility and
> take a look at the abbreviation dictionary inside it. You can check if
> "corp." is there, and also try a few other abbreviations to check the
> behavior.
>
> Regards,
> William
>
> 2014-03-27 9:27 GMT-03:00 Siarhei Rusak:
>
> > Hello,
> >
> > It seems I'm doing something wrong, but the documentation and forum
> > aren't very helpful in my case.
> > My goal is to add abbreviations to SentenceDetector, but I can't succeed.
> > I'm trying to use this constructor overload:
> >
> > public *SentenceModel*(String
> > <http://download.oracle.com/javase/1.5.0/docs/api/java/lang/String.html>
> > languageCode,
> > opennlp.model.AbstractModel sentModel,
> > boolean useTokenEnd, Dictionary
> > <http://opennlp.apache.org/documentation/1.5.3/apidocs/opennlp-tools/opennlp/tools/dictionary/Dictionary.html>
> > abbreviations)
> >
> > and a trivial model from the OpenNLP repository.
> >
> > Here is a code example (it's a C# port via IKVM, don't be confused):
> >
> > var abbreviations = new Dictionary();
> > abbreviations.put(new StringList("corp."));
> >
> > var modelPath = @"....\sent.model"; // path to the file extracted from "en-sent.bin"
> > var dataStream = new DataInputStream(new FileInputStream(modelPath));
> > var sentenceModel = new BinaryGISModelReader(dataStream).getModel();
> > var abbreviatedSentenceModel = new SentenceModel("en", sentenceModel, true, abbreviations);
> > .............................
> >
> > var sentenceSplitter = new SentenceDetectorME(abbreviatedSentenceModel);
> > sentenceSplitter.sentDetect(text);
> >
> > The result of its execution is the same as if there were no
> > abbreviation dictionary at all.
> > So I suppose that either there is some other way to do this, or
> > it's a bug.
> > Could you help, please?
> >
> > Thanks in advance,
> > Siarhei.

--
Best regards,
Rusak S.
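
William's suggestion to open the model with a Zip utility can also be done programmatically. The following is only a minimal sketch, assuming the model file (for example en-sent.bin) is an ordinary zip archive, as the thread implies. The class name `ModelEntryLister` and the method `listEntries` are hypothetical names chosen for illustration; the sketch uses only `java.util.zip` and makes no assumption about what the entries inside the model are called.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: lists the entries packed inside a zip-based model file.
public class ModelEntryLister {

    /** Returns the names of all entries in the zip archive at the given path. */
    public static List<String> listEntries(String path) throws IOException {
        List<String> names = new ArrayList<>();
        // try-with-resources closes the archive even if reading fails
        try (ZipFile zip = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ModelEntryLister <path-to-model-file>");
            return;
        }
        // Print one entry name per line so the model's contents can be inspected
        for (String name : listEntries(args[0])) {
            System.out.println(name);
        }
    }
}
```

Running it against the model file prints the entry names, which shows whether the model carries an abbreviation dictionary at all before trying individual abbreviations such as "corp.".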
