[ 
https://issues.apache.org/jira/browse/OPENNLP-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876483#comment-13876483
 ] 

Joern Kottmann commented on OPENNLP-602:
----------------------------------------

I am not sure if I understand what you suggest. Let me rephrase it.

You suggest that we could make the sentence separator char (which is currently 
a newline) configurable in the training format parser?

The problem with the separator char is that it should be a char which does not 
occur in the training data. In my opinion there are two solutions to that: 
either you escape the char if it occurs, or you use something which is really 
never there. The issue with the latter solution is that the char might differ 
from user to user and would therefore need to be configurable (e.g. with a 
parameter).

Anyway, I believe we will get the best user experience if we simply allow 
users to escape the two new line chars with a tag. In Java this can be done 
very easily with a string replace.
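
A minimal sketch of the string-replace idea, to make the proposal concrete. 
It assumes the "two new line chars" are carriage return and line feed, and the 
tag names <CR> and <LF> as well as the class and method names are 
hypothetical; the comment does not say which tags would be used.

```java
public class NewlineTagEscaper {

    // Hypothetical tag names, chosen only for illustration.
    static final String CR_TAG = "<CR>";
    static final String LF_TAG = "<LF>";

    // Replace the new line chars with tags so that a sentence containing
    // them still fits on one line of the training file.
    public static String escape(String text) {
        return text.replace("\r", CR_TAG).replace("\n", LF_TAG);
    }

    // Restore the actual chars, as the format parser would do before
    // constructing the sample object for training.
    public static String unescape(String line) {
        return line.replace(CR_TAG, "\r").replace(LF_TAG, "\n");
    }

    public static void main(String[] args) {
        String sentence = "Breaking news\n";
        String trainingLine = escape(sentence);
        System.out.println(trainingLine);
        System.out.println(unescape(trainingLine).equals(sentence));
    }
}
```

Note that this sketch does not handle training data which already contains 
the literal tag text; such data would need an additional escaping rule.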

As far as I know there will be no issues with span offsets. The format parsing 
code will replace the tags with the actual chars to construct the sample object 
for training. At runtime the user has to pass in the text as it is, without 
replacing the chars.
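
The offset argument can be illustrated with a small sketch. This is a 
simplified stand-in for the format parser, not the actual OpenNLP code: it 
assumes a hypothetical <LF> tag and simply concatenates the unescaped 
sentences, recording a [start, end) span for each one.

```java
import java.util.ArrayList;
import java.util.List;

public class SpanOffsetSketch {

    // Hypothetical tag name, chosen only for illustration.
    static final String LF_TAG = "<LF>";

    // Unescape each training line, append it to the document and record
    // the [start, end) offsets of the sentence in the unescaped document.
    public static List<int[]> sentenceSpans(String[] lines, StringBuilder document) {
        List<int[]> spans = new ArrayList<>();
        for (String line : lines) {
            String sentence = line.replace(LF_TAG, "\n");
            int start = document.length();
            document.append(sentence);
            spans.add(new int[] { start, document.length() });
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] lines = { "Breaking news" + LF_TAG, "The story begins here." };
        StringBuilder document = new StringBuilder();
        List<int[]> spans = sentenceSpans(lines, document);

        // The text a user would pass in at runtime, with the real char.
        String runtimeText = "Breaking news\nThe story begins here.";
        System.out.println(document.toString().equals(runtimeText));

        int[] second = spans.get(1);
        System.out.println(runtimeText.substring(second[0], second[1]));
    }
}
```

Because the spans are computed after the tags have been replaced, they index 
into the same char sequence the user passes in at runtime.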

> SentenceDetector should support new line as and end of sentence char
> --------------------------------------------------------------------
>
>                 Key: OPENNLP-602
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-602
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Sentence Detector
>    Affects Versions: tools-1.5.3
>            Reporter: Joern Kottmann
>            Assignee: Joern Kottmann
>            Priority: Minor
>             Fix For: 1.6.0
>
>
> The Sentence Detector should have support to consider new line chars as the 
> end of a sentence. This will probably require special handling in the 
> training code to assume that there is a new line char if any other eos is 
> missing.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)