[ https://issues.apache.org/jira/browse/OPENNLP-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876483#comment-13876483 ]
Joern Kottmann commented on OPENNLP-602:
----------------------------------------

I am not sure I understand what you suggest, so let me rephrase it. You suggest that we make the sentence separator char (which is currently a newline) configurable in the training format parser?

The problem with the separator char is that it should be a char which does not occur in the training data. In my opinion there are two solutions to that: either you escape the char if it occurs, or you use something which is really never there. The issue with the latter solution is that the choice might differ from user to user and would therefore need to be changeable (e.g. with a parameter).

Anyway, I believe we will get the best user experience if we simply allow escaping the two new line chars with a tag. In Java this can be done really easily with a string replace.

As far as I know there will be no issues with span offsets. The format parsing code will replace the tags with the actual chars to construct the sample object for training. At runtime the user has to pass in the text as it is, without replacing the chars.

> SentenceDetector should support new line as and end of sentence char
> --------------------------------------------------------------------
>
>                 Key: OPENNLP-602
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-602
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Sentence Detector
>    Affects Versions: tools-1.5.3
>            Reporter: Joern Kottmann
>            Assignee: Joern Kottmann
>            Priority: Minor
>             Fix For: 1.6.0
>
>
> The Sentence Detector should have support to consider new line chars as the
> end of a sentence. This will probably require special handling in the
> training code to assume that there is a new line char if any other eos is
> missing.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
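
For illustration, the tag-escaping idea from the comment above could look roughly like the following Java sketch. The class name NewlineEscaper and the tags <CR> and <LF> are hypothetical and not part of OpenNLP; the sketch also assumes that "the two new line chars" means carriage return and line feed, and that the tags themselves never occur literally in the training text.

```java
// Hypothetical helper, not an OpenNLP class: escapes new line chars so that a
// sentence containing them still fits on one line of a line-based training file.
final class NewlineEscaper {

    // Assumed tag names; any strings that never occur in the data would do.
    static final String CR_TAG = "<CR>";
    static final String LF_TAG = "<LF>";

    private NewlineEscaper() {
    }

    // Used when writing training data: replace the real chars with tags.
    static String escape(String text) {
        return text.replace("\r", CR_TAG).replace("\n", LF_TAG);
    }

    // Used by the format parser: restore the real chars before building the sample.
    static String unescape(String line) {
        return line.replace(CR_TAG, "\r").replace(LF_TAG, "\n");
    }

    public static void main(String[] args) {
        String raw = "First line\nSecond line\r\n";
        String escaped = escape(raw);
        System.out.println(escaped);
        System.out.println(unescape(escaped).equals(raw));
    }
}
```

Only the training file would contain the tags. Text passed to the detector at runtime would stay unmodified, as the comment describes.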