On 01/23/2014 10:06 PM, Tim Miller wrote:
Just an FYI, a while back I did some of these annotations myself on MIMIC to get around this issue. I replaced the newline character with a special (non-English) character, then pre-processed cTAKES input to replace newlines with that character, then did sentence detection, then added the newlines back in. I would be happy to share these annotations and my code modifications.

I would be really happy to get access to your annotations so I can test the newline support in OpenNLP with them.

Instead of a special character, you would now use tags (<CR> and <LF>) to encode the newlines in the training data. The tags only need to be inserted into the training data; for the actual sentence detection, the document string can be passed in as it is.
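
As a rough illustration of the encoding step, here is a minimal Python sketch. It assumes the training file holds one sentence per line and that each carriage return or line feed inside a sentence is simply replaced by its tag. The helper names (encode_newlines, to_training_lines) are made up for this example and are not part of OpenNLP, and where exactly the tags should sit relative to sentence boundaries is an assumption you would want to check against your annotations.

```python
from typing import List


def encode_newlines(text: str) -> str:
    """Replace each carriage return with <CR> and each line feed with <LF>."""
    return text.replace("\r", "<CR>").replace("\n", "<LF>")


def to_training_lines(sentences: List[str]) -> List[str]:
    """Return one training line per sentence, with embedded newlines encoded as tags.

    Assumption: the training data is one sentence per line, so raw newlines
    inside a sentence must not survive into the output.
    """
    return [encode_newlines(sentence) for sentence in sentences]


if __name__ == "__main__":
    # Hypothetical annotated sentences that still contain raw line breaks.
    annotated = [
        "Patient admitted with\nchest pain.",
        "Vitals stable.\r\n",
    ]
    for line in to_training_lines(annotated):
        print(line)
```

Only the training data goes through this step; the text passed to the sentence detector at run time stays unmodified.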

Jörn
