Ade,

The abbreviations in the dictionary you provide at training time are used to generate features from the training text. When the trainer finds an end-of-sentence character in the training text, it checks whether the text immediately preceding that character is one of the provided abbreviations. If it is, a feature is generated. The trained model is then better at distinguishing abbreviations from actual ends of sentences in input text.
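To make that concrete, here is a rough sketch of the idea in plain Java. It is not OpenNLP's actual code: the class name, the method names and the "abbrev" feature label are all made up for illustration, and it assumes the abbreviation entries include the trailing period.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is not how OpenNLP names or structures things.
public class AbbreviationFeatureSketch {

    // Returns the run of non-whitespace characters that ends at eosIndex,
    // including the candidate end-of-sentence character itself.
    public static String tokenEndingAt(String text, int eosIndex) {
        int start = eosIndex;
        while (start > 0 && !Character.isWhitespace(text.charAt(start - 1))) {
            start--;
        }
        return text.substring(start, eosIndex + 1);
    }

    // Builds the features for one candidate character. The "abbrev" feature
    // is added only when the token ending at the candidate is in the dictionary.
    public static List<String> features(String text, int eosIndex, Set<String> abbreviations) {
        List<String> feats = new ArrayList<>();
        String token = tokenEndingAt(text, eosIndex);
        feats.add("tok=" + token);
        if (abbreviations.contains(token)) {
            feats.add("abbrev");
        }
        return feats;
    }

    public static void main(String[] args) {
        // Assumed dictionary contents, taken from the abbreviations in your test sentence.
        Set<String> abbreviations = new HashSet<>(Arrays.asList("Ms.", "Mr.", "Mrs.", "ext."));
        String text = "I called 464-6859 ext. 13 and asked for Mr. Frank.";

        // Every period is a candidate; print the features each one would get.
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '.') {
                System.out.println(i + " -> " + features(text, i, abbreviations));
            }
        }
    }
}
```

The feature on its own decides nothing. The model still has to see it alongside EOS and notEOS outcomes in the training data before it can attach any weight to it.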
Jeff

On Wed, Sep 6, 2017 at 12:59 PM, Ade Miller <[email protected]> wrote:

> I train the model on a sample stream with many sentences, one per line.
> The single sentence is just a trivial test example to see if
> abbreviations work.
>
> model = trainer.train(language, sampleStream, fact, trainingParameters);
>
> It seems like I have to define an abbreviation in the dictionary and
> examples in the training data for this to work. In which case I'm not
> clear what the abbreviations dictionary actually does.
>
> -----Original Message-----
> From: Daniel Russ [mailto:[email protected]]
> Sent: Wednesday, September 6, 2017 9:51 AM
> To: [email protected]
> Subject: Re: How do abbreviations work when training a sentence detector
>
> You are trying to train a sentence detector with only 1 sentence. Each
> line should be 1 sentence, the final character in the line marks the EOS.
> It should handle abbreviations correctly. The idea behind the S.D. is that
> every period (or ? or ! ) is classified as EOS or notEOS.
> Daniel
>
> Please see:
> http://opennlp.apache.org/docs/1.8.1/manual/opennlp.html#tools.sentdetect
> for more info.
>
> > On Sep 6, 2017, at 12:21 PM, Ade Miller <[email protected]> wrote:
> >
> > I'm trying to train a sentence detector with a set of abbreviations but
> > am not seeing the behavior I expected.
> >
> > InputStreamFactory factory = new MarkableFileInputStreamFactory(trainingData);
> > PlainTextByLineStream lineStream = new PlainTextByLineStream(factory, Constants.CHARSET);
> > ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);
> >
> > Dictionary abbreviations = new AbbreviationsResourceLoader().load();
> > SentenceDetectorFactory fact = new SentenceDetectorFactory(language, true, abbreviations, null);
> > model = trainer.train(language, sampleStream, fact, trainingParameters);
> >
> > CustomSentenceDetectorME detect = new CustomSentenceDetectorME(model);
> > String[] sentences = detect.sentDetect("The cat, Ms. Furry, sat on the mat. I called 464-6859 ext. 13 and asked for Mr. Frank. The dog, well, it lay in Mrs. Smythe's yard.");
> > for (String s : sentences) {
> >     LOG.info(s);
> > }
> >
> > The output I get shows that sentences are being split on the
> > abbreviations:
> >
> > The cat, Ms.
> > , sat on the mat.
> > I called 464-6859 ext.
> > 13 and asked for Mr.
> > Frank.
> > The dog, well, it lay in Mrs.
> > Smythe's yard.
> >
> > How is the abbreviation dictionary used? Does the training set also have
> > to include examples of the same abbreviation(s).
> >
> > Thanks,
> >
> > Ade
