I am not suggesting we actually change anything.  Only that it is more 
complicated than adding chars to the eos array.
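
To make that concrete, here is a minimal sketch in plain Java (no OpenNLP
calls) of the kind of preprocessing the single-tag idea quoted below would
imply. The class name, the method name, and the <NEW_LINE> placeholder are
assumptions for illustration only. OpenNLP does not interpret such a tag
today, so the training format and the detector would still have to learn
about it.

```java
// Hypothetical helper, not part of OpenNLP.
class NewLineMarker {

    // Placeholder borrowed from the suggestion quoted below (an assumption).
    static final String NEW_LINE = "<NEW_LINE>";

    // Replace every platform's line break (\r\n, \r or \n) with one
    // whitespace-separated placeholder token, so that the text fits into a
    // one-sentence-per-line training file without a raw line break.
    static String encode(String raw) {
        return raw.replaceAll("\\r\\n|\\r|\\n", " " + NEW_LINE + " ").trim();
    }
}
```

For the clinical example quoted below, encode("Patient admitted at 8:00
AM\r\nHe complained of stomach cramps\r\n") would give "Patient admitted at
8:00 AM <NEW_LINE> He complained of stomach cramps <NEW_LINE>". Whether a
given <NEW_LINE> ends a sentence is still something the model would have to
decide.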

Daniel


> On Sep 29, 2017, at 10:44 AM, Joern Kottmann <[email protected]> wrote:
> 
> I think it is a bit unfortunate that we have two tags, <LF> and <CR>. I
> would change this and normalize them into just one tag, e.g. <NEW_LINE>,
> and then allow it to be placed in our existing training format as an
> end-of-sentence marker.
> 
> The eos array also needs to contain that char; we can just take \n and
> use it as a marker that we need to detect new-line chars independent
> of the platform.
> 
> And just to remind us all, we have this problem also in other
> components, e.g. the name finder can't take new lines into account,
> but this is obviously needed for certain data sets like a name list
> where each name is written on one line.
> 
> Jörn
> 
> On Fri, Sep 29, 2017 at 4:32 PM, Dan Russ <[email protected]> wrote:
>> Hi Markus,
>>   Just adding the characters <CR> and <LF> to the eos array is not going to
>> solve your problem.  You would need to add <CR> and <LF> to your training
>> set; otherwise the sentence detector will ALWAYS end the sentence at
>> <CR><LF>.  Think about the training data (including the example you gave).
>> I think this would require OpenNLP to change the format of the sentence
>> detector training data, so that we could see <CR> and <LF>, read the next
>> word, and decide whether it is an end of sentence.  You would want data like:
>> 
>> Patient admitted at 8:00 AM <LF><CR> <End:Sentence> He complained of stomach 
>> cramps   <LF><CR><End:Sentence>
>> 
>> In order to catch the end-of-line as a sentence delimiter.
>> 
>> Do you see a way around it?  Comments?
>> Daniel
>> 
>>> On Sep 29, 2017, at 9:52 AM, Markus Kreuzthaler 
>>> <[email protected]> wrote:
>>> 
>>> Hello!
>>> 
>>> I state my problem again as I think it is quite similar to the following
>>> issue:
>>> https://issues.apache.org/jira/browse/OPENNLP-602
>>> 
>>> I work with clinical narratives, so eos characters are very often just
>>> missing, and I am trying to train a new, robust sentence model.
>>> The issue above suggests encoding these types of endings with
>>> <CR><LF> or just <LF>.
>>> 
>>> How do I set this up properly?
>>> 
>>> char[] eosCharacters = {'!', '?', '.'};
>>> SentenceDetectorFactory sentenceFactory =
>>>     new SentenceDetectorFactory("de", true, null, eosCharacters);
>>> 
>>> eosCharacters is a char array; how do I put in the suggested encodings
>>> '<CR><LF>', '<LF>'?
>>> 
>>> How do I have to prepare my final training data set then?
>>> So I have for example in the text something like (with an artificial line
>>> break in the middle of the sentence):
>>> The quick abbr. brown
>>> fox jumps over the lazy dog
>>> 
>>> Training:
>>> The quick abbr. brown fox jumps over the lazy dog <CR><LF>
>>> 
>>> If the standard eos characters {'.','?','!'} are present:
>>> The quick abbr. brown
>>> fox jumps over the lazy dog.
>>> 
>>> Training:
>>> The quick abbr. brown fox jumps over the lazy dog.
>>> 
>>> If I have an abbreviation at the end of a sentence do I have to encode this
>>> in a special way?
>>> The quick abbr. brown
>>> fox jumps over the lazy dog abbr.
>>> 
>>> Training:
>>> The quick abbr. brown fox jumps over the lazy dog abbr.
>>> 
>>> Once I have trained my model, do I have to adapt the input text, e.g. to
>>> the <CR><LF> or <LF> markers used in the training sentences?
>>> 
>>> Thank you for your help!
>>> 
>>> lg Markus
>> 
