I'm trying to train a sentence detector with a set of abbreviations, but I'm not
seeing the behavior I expected. This is the code I'm using:
    // Read the training data one sentence per line.
    InputStreamFactory factory = new MarkableFileInputStreamFactory(trainingData);
    PlainTextByLineStream lineStream = new PlainTextByLineStream(factory, Constants.CHARSET);
    ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);

    // Load the abbreviation dictionary and pass it to the factory.
    Dictionary abbreviations = new AbbreviationsResourceLoader().load();
    SentenceDetectorFactory fact = new SentenceDetectorFactory(language, true, abbreviations, null);

    // Train the model, then run the detector on a test string.
    model = trainer.train(language, sampleStream, fact, trainingParameters);
    CustomSentenceDetectorME detect = new CustomSentenceDetectorME(model);
    String[] sentences = detect.sentDetect(
        "The cat, Ms. Furry, sat on the mat. "
        + "I called 464-6859 ext. 13 and asked for Mr. Frank. "
        + "The dog, well, it lay in Mrs. Smythe's yard.");
    for (String s : sentences) {
        LOG.info(s);
    }
The output I get shows that sentences are being split on the abbreviations:
The cat, Ms.
Furry, sat on the mat.
I called 464-6859 ext.
13 and asked for Mr.
Frank.
The dog, well, it lay in Mrs.
Smythe's yard.
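A small check like the following can confirm which of these splits fall directly after a dictionary entry. It is only a sketch in plain Java with no OpenNLP calls: the class name, the method names, and the hard-coded abbreviation set are hypothetical stand-ins for my real dictionary, and it assumes tokens are separated by single spaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AbbreviationSplitCheck {

    // Returns true if the last space-delimited token of the sentence is in
    // the abbreviation set. The set is assumed to hold lower-cased entries.
    static boolean endsWithAbbreviation(String sentence, Set<String> abbreviations) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return false;
        }
        String lastToken = trimmed.substring(trimmed.lastIndexOf(' ') + 1);
        return abbreviations.contains(lastToken.toLowerCase(Locale.ROOT));
    }

    // Collects the detected sentences that end on an abbreviation, i.e. the
    // places where the detector probably split when it should not have.
    static List<String> suspectSplits(String[] sentences, Set<String> abbreviations) {
        List<String> suspects = new ArrayList<>();
        for (String s : sentences) {
            if (endsWithAbbreviation(s, abbreviations)) {
                suspects.add(s);
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the loaded dictionary, lower-cased.
        Set<String> abbreviations =
            new HashSet<>(Arrays.asList("ms.", "mr.", "mrs.", "ext."));

        // A few of the fragments from the output above.
        String[] detected = {
            "The cat, Ms.",
            "I called 464-6859 ext.",
            "13 and asked for Mr.",
            "Frank.",
            "The dog, well, it lay in Mrs.",
            "Smythe's yard."
        };

        for (String s : suspectSplits(detected, abbreviations)) {
            System.out.println("Split after abbreviation: " + s);
        }
    }
}
```

With the detector's real output passed in as the array, every fragment it reports is a split that the abbreviation dictionary was expected to prevent.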
How is the abbreviation dictionary used? Does the training set also have to
include examples of the same abbreviation(s)?
Thanks,
Ade