Re: sentence detector newline behavior

Jörn Kottmann Mon, 20 Jan 2014 05:26:26 -0800

Hi all,

currently I have quite a bit of time to work on OpenNLP, and would like to help you
out with this issue.


Here is the follow up issue for this change:
https://issues.apache.org/jira/browse/OPENNLP-602

I am still trying to figure out what would be the best option to implement this.
In the training data a user could just use a special tag to identify the chars.

Instead of <NEWLINE> it might be better to use <CR> and <LF> to encode these two chars
in the training data. Any thoughts?

I am planning to release this as part of OpenNLP 1.6.0.

Thanks,
Jörn
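
The <CR>/<LF> tag idea above could be sketched roughly as follows. This is only an illustration of the proposal, not how OpenNLP implements it: the class name NewlineTagCodec and its encode/decode helpers are hypothetical, and the sketch assumes the literal strings "<CR>" and "<LF>" never occur in the raw text.

```java
// Hypothetical helper, not part of OpenNLP: encodes line-break chars as tags
// so that a multi-line sentence still fits the one-sentence-per-line format.
final class NewlineTagCodec {
    // Tag names taken from the proposal in this thread.
    static final String CR_TAG = "<CR>";
    static final String LF_TAG = "<LF>";

    // Replace each carriage return and line feed with its tag.
    static String encode(String sentence) {
        return sentence.replace("\r", CR_TAG).replace("\n", LF_TAG);
    }

    // Restore the original chars when reading a training line back.
    static String decode(String line) {
        return line.replace(CR_TAG, "\r").replace(LF_TAG, "\n");
    }

    public static void main(String[] args) {
        String sentence = "This sentence spans\r\ntwo lines.";
        String line = encode(sentence);
        System.out.println(line); // prints: This sentence spans<CR><LF>two lines.
        System.out.println(decode(line).equals(sentence)); // prints: true
    }
}
```

Keeping separate tags for the two chars means the original text can be restored exactly, which a single <NEWLINE> tag would not allow for mixed line endings.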

On 05/22/2013 02:03 PM, Jörn Kottmann wrote:

On 05/22/2013 01:17 PM, Miller, Timothy wrote:
That's awesome! It might be worth trying at least. How does the training
process change? Previously the training data would be one sentence per
line, but with newlines as possible mid-sentence characters that could
be trouble, is there a new representation for training data? Or would we
have to use the training api?

Good point, yes, that will be a problem with the default training format, but it shouldn't be hard
to solve. In the format itself we could define a new line tag, e.g. <NEWLINE>, to mark new lines.
As a hack to make it work with 1.5.3 you could instead use a special char as a replacement
for the new line char.
When you pass the text down to the sentence detector a simple string replace could be used to
convert all new line chars to the special new line marker char.
If things work out for you performance-wise as well we will just integrate it properly into OpenNLP
for the next release.
Could you produce a sentence detector training file with a new line marker char?
You should try to pick a char you can also pass in on a terminal, otherwise you have to use the
API to train the model. The built-in cross validation could be used to evaluate the performance.
Jörn
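
The string-replace hack described in the quoted message might look roughly like this. It is a minimal sketch under stated assumptions: the class NewlineMarker and the choice of the pilcrow as marker char are hypothetical, and the marker is assumed not to occur in the input text. Only the replacement step is shown, the marked string would then be handed to the sentence detector (for example SentenceDetectorME.sentDetect).

```java
// Hypothetical helper, not part of OpenNLP: swaps line-break chars for a
// visible marker char before the text is passed to the sentence detector.
final class NewlineMarker {
    // Assumed marker (pilcrow); any char that is absent from the text and
    // can be typed on a terminal would do.
    static final char MARKER = '\u00B6';

    // Char-for-char replacement, so the string length does not change.
    static String mark(String text) {
        return text.replace('\r', MARKER).replace('\n', MARKER);
    }

    public static void main(String[] args) {
        String text = "First sentence\ncontinues here. Second one.";
        String marked = mark(text);
        System.out.println(marked.indexOf('\n') < 0);          // prints: true
        System.out.println(marked.length() == text.length());  // prints: true
    }
}
```

Because each line-break char is replaced by exactly one marker char, character offsets in the marked string line up with the original text, so detected sentence spans can be applied to the unmodified input.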
