I am not suggesting we actually change anything, only that it is more complicated than adding chars to the eos array.
Daniel

> On Sep 29, 2017, at 10:44 AM, Joern Kottmann <[email protected]> wrote:
>
> I think it is a bit unlucky that we have two tags, <LF> and <CR>. I
> would change this and normalize it into just one tag, e.g. <NEW_LINE>,
> and then allow this to be placed in our existing training format as an
> end-of-sentence marker.
>
> The eos array also needs to contain that char; we can just take \n and
> use this as a marker that we need to detect new line chars independent
> of the platform.
>
> And just to remind us all, we have this problem also in other
> components, e.g. the name finder can't take new lines into account,
> but this is obviously needed for certain data sets like a name list
> where each name is written on one line.
>
> Jörn
>
> On Fri, Sep 29, 2017 at 4:32 PM, Dan Russ <[email protected]> wrote:
>> Hi Markus,
>> Just adding the characters <CR> and <LF> to the eos array is not going to
>> solve your problem. You would need to add <CR> and <LF> to your training set,
>> otherwise the sentence detector will ALWAYS end the sentence at <CR><LF>.
>> Think about how the training data looks (including the example you gave). I think
>> this would require OpenNLP to change the format of the sentence detector
>> training data, so we could see <CR> and <LF>, read the next word, and decide
>> whether it is an end of sentence. You would want data like:
>>
>> Patient admitted at 8:00 AM <LF><CR> <End:Sentence> He complained of stomach
>> cramps <LF><CR><End:Sentence>
>>
>> in order to catch the end-of-line as a sentence delimiter.
>>
>> Do you see a way around it? Comments?
>> Daniel
>>
>>> On Sep 29, 2017, at 9:52 AM, Markus Kreuzthaler <[email protected]> wrote:
>>>
>>> Hello!
>>>
>>> I state my problem again as I think it is quite similar to the following
>>> issue:
>>> https://issues.apache.org/jira/browse/OPENNLP-602
>>>
>>> I work with clinical narratives, so eos characters are very often just
>>> missing, and I try to train a new robust sentence model.
>>> The issue above suggests encoding these types of endings with
>>> <CR><LF> or just a <LF>.
>>>
>>> How do I set this up properly?
>>>
>>> char[] eosCharacters = {'!','?','.'};
>>> SentenceDetectorFactory sentenceFactory = new SentenceDetectorFactory("de",
>>>     true, null, eosCharacters);
>>>
>>> eosCharacters is a char array, so how do I put in your suggested encodings
>>> '<CR><LF>', '<LF>'?
>>>
>>> How do I have to prepare my final training data set then?
>>> So I have for example in the text something like (with an artificial line
>>> break in the middle of the sentence):
>>> The quick abbr. brown
>>> fox jumps over the lazy dog
>>>
>>> Training:
>>> The quick abbr. brown fox jumps over the lazy dog <CR><LF>
>>>
>>> If the standard eos characters {'.','?','!'} are present:
>>> The quick abbr. brown
>>> fox jumps over the lazy dog.
>>>
>>> Training:
>>> The quick abbr. brown fox jumps over the lazy dog.
>>>
>>> If I have an abbreviation at the end of a sentence, do I have to encode this
>>> in a special way?
>>> The quick abbr. brown
>>> fox jumps over the lazy dog abbr.
>>>
>>> Training:
>>> The quick abbr. brown fox jumps over the lazy dog abbr.
>>>
>>> When I have trained my model, do I have to adapt the input text to
>>> e.g. <CR><LF> or <LF> inputs as used in the training sentences?
>>>
>>> Thank you for your help!
>>>
>>> lg Markus
>>
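
For illustration only, here is a rough sketch of the normalization step Jörn describes, written in plain Java so it does not depend on OpenNLP at all. The class LineBreakMarker and its two methods are made up for this example and are not part of OpenNLP. It collapses <CR><LF>, a lone <CR>, and a lone <LF> into a single '\n' and builds an eos array that contains that char. Whether the sentence detector and its training format would then treat '\n' as a usable end-of-sentence marker is exactly the open question in this thread, so this only covers the preprocessing side.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of OpenNLP.
class LineBreakMarker {

    // Single platform-independent marker for any kind of line break.
    static final char NEW_LINE = '\n';

    // Collapse <CR><LF>, a lone <CR>, and a lone <LF> into NEW_LINE.
    // "\r\n" is listed first so a Windows line break yields one marker, not two.
    static String normalize(String text) {
        return text.replaceAll("\\r\\n|\\r|\\n", String.valueOf(NEW_LINE));
    }

    // Return a copy of the given eos chars with NEW_LINE appended,
    // unless it is already present.
    static char[] withNewLine(char[] eosCharacters) {
        for (char c : eosCharacters) {
            if (c == NEW_LINE) {
                return eosCharacters.clone();
            }
        }
        char[] extended = Arrays.copyOf(eosCharacters, eosCharacters.length + 1);
        extended[eosCharacters.length] = NEW_LINE;
        return extended;
    }

    public static void main(String[] args) {
        String raw = "Patient admitted at 8:00 AM\r\nHe complained of stomach cramps\r";
        String normalized = normalize(raw);
        char[] eos = withNewLine(new char[] {'!', '?', '.'});

        // The array could then be passed to the constructor quoted in Markus's mail:
        // new SentenceDetectorFactory("de", true, null, eos);
        System.out.println(normalized.replace("\n", "<NEW_LINE>"));
        System.out.println("eos chars: " + eos.length);
    }
}
```

The same normalize call would have to be applied to the text at detection time as well, so that training and input data agree on the marker.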
