Hello!
I am restating my problem because I think it is closely related to the
following issue:
https://issues.apache.org/jira/browse/OPENNLP-602
I work with clinical narratives, where end-of-sentence (EOS) characters are
very often simply missing, and I am trying to train a new, robust sentence
model.
In the issue above it is suggested to encode these kinds of sentence endings
with <CR><LF> or just <LF>.
How do I set this up properly?
char[] eosCharacters = {'!', '?', '.'};
SentenceDetectorFactory sentenceFactory =
    new SentenceDetectorFactory("de", true, null, eosCharacters);
eosCharacters is a char array, so how can I add the suggested encodings
'<CR><LF>' and '<LF>' to it?
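To make the question concrete: in Java a <LF> is the single char '\n' and a <CR> is '\r', so <CR><LF> is two chars and cannot be one element of the array. The only thing I can think of is to list them separately, as in the sketch below. Whether this is what the issue actually suggests is my assumption, and the class and method names are made up for illustration.

```java
// Sketch only: EosCharsSketch and eosWithLineBreaks are hypothetical names.
class EosCharsSketch {

    // Standard EOS characters plus line feed and carriage return.
    // <CR><LF> is two chars ('\r' followed by '\n'), so each one
    // has to be its own element of the char[].
    static char[] eosWithLineBreaks() {
        return new char[] {'!', '?', '.', '\n', '\r'};
    }

    public static void main(String[] args) {
        char[] eosCharacters = eosWithLineBreaks();
        System.out.println("Number of EOS chars: " + eosCharacters.length);
        // Assumption: the array would then be passed as the fourth argument,
        // as in my snippet above:
        // new SentenceDetectorFactory("de", true, null, eosCharacters);
    }
}
```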
How should I then prepare my final training data set?
For example, the text contains something like this (with an artificial line
break in the middle of the sentence):
The quick abbr. brown
fox jumps over the lazy dog
Training:
The quick abbr. brown fox jumps over the lazy dog <CR><LF>
If one of the standard EOS characters {'.','?','!'} is present:
The quick abbr. brown
fox jumps over the lazy dog.
Training:
The quick abbr. brown fox jumps over the lazy dog.
If there is an abbreviation at the end of a sentence, do I have to encode it
in a special way?
The quick abbr. brown
fox jumps over the lazy dog abbr.
Training:
The quick abbr. brown fox jumps over the lazy dog abbr.
Once I have trained my model, do I have to adapt the input text so that it
contains e.g. <CR><LF> or <LF>, as used in the training sentences?
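If the answer is yes, I imagine I would at least have to make the line endings consistent before calling the detector. Here is a minimal sketch of what I have in mind, assuming the model was trained with a bare <LF> as the line ending (that is my assumption, and the class and method names are made up for illustration):

```java
// Sketch only: LineEndingSketch and normalizeLineEndings are hypothetical names.
class LineEndingSketch {

    // Convert <CR><LF> and lone <CR> line endings to a single <LF>,
    // so the input uses the same line ending as the training data.
    static String normalizeLineEndings(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String raw = "The quick abbr. brown fox jumps over the lazy dog\r\n"
                   + "The quick abbr. brown fox jumps over the lazy dog abbr.";
        String normalized = normalizeLineEndings(raw);
        // After normalization no carriage return should be left.
        System.out.println("Contains CR: " + (normalized.indexOf('\r') >= 0));
    }
}
```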
Thank you for your help!
Best regards, Markus