Hello OpenNLP community, I am a long-time OpenNLP user and rely on several of the NLP components it provides in my application. I have a few basic questions about the SentenceDetector and the default sentence model.
Since sentence detection is a basic first step in many data-processing pipelines, I am trying to make it more robust. The SentenceDetectorFactory class accepts an abbreviation dictionary, which I think is very useful. I checked the default models available from http://opennlp.sourceforge.net/models-1.5/. The sentence model there does not seem to use any abbreviations, because getAbbreviations on the loaded model returns null.

If no abbreviation dictionary was used during training, the model has never seen the features "sabbrev", "vabbrev", and "xabbrev" that DefaultSDContextGenerator.collectFeatures generates from the dictionary, so it has no learned weights for them. In that case, I am not sure that supplying a list of abbreviations through SentenceDetectorFactory at evaluation time will make any difference. Am I missing something? Apologies if I am wrong. If I am right, please suggest alternatives.

Also, I am not sure about the purpose of the useTokenEnd parameter of SentenceDetectorFactory. Can someone explain it or point me to a resource that does?

I am not sure whether the users list is the right place for this post; if not, please let me know and I will move it to dev.

Thanks,
Vihari Piratla
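
P.S. To make the concern about unseen features concrete, here is a toy sketch in plain Java. It is not OpenNLP code. It assumes a simple linear scorer that ignores any feature it has no trained weight for, which is how I understand the situation, but I may be wrong about what OpenNLP does internally. The class name UnseenFeatureDemo, the features "eos=." and "nextcap", and the weights are all made up for illustration; only "sabbrev" comes from the context generator mentioned above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for a trained linear model. NOT OpenNLP code.
final class UnseenFeatureDemo {

    // Hypothetical weights learned for the "sentence boundary" outcome.
    // There is no entry for "sabbrev", "vabbrev", or "xabbrev", mimicking
    // a model that was trained without an abbreviation dictionary.
    static Map<String, Double> trainedWeights() {
        Map<String, Double> weights = new HashMap<>();
        weights.put("eos=.", 1.5);
        weights.put("nextcap", 0.8);
        return weights;
    }

    // Sum the weights of the active features.
    // Assumption: a feature never seen in training contributes 0.
    static double score(Map<String, Double> weights, List<String> features) {
        double sum = 0.0;
        for (String feature : features) {
            sum += weights.getOrDefault(feature, 0.0);
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = trainedWeights();
        double withoutAbbrev = score(weights, List.of("eos=.", "nextcap"));
        double withAbbrev = score(weights, List.of("eos=.", "nextcap", "sabbrev"));
        // Both scores are identical because "sabbrev" has no trained weight.
        System.out.println(withoutAbbrev == withAbbrev);
    }
}
```

If the real model behaves like this toy scorer, then adding the dictionary at evaluation time alone would not change its decisions, and retraining with the dictionary would be needed. That is exactly the point I would like someone to confirm or correct.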
