Yes, exactly: OPENNLP-602 is about training a sentence detector model that can treat a newline as an end-of-sentence character.

If you have specific rules for splitting sentences, we should have a look at them. The Sentence Detector could be extended to support a user-provided, rule-based splitter. If there is interest in that, we could probably get it into 1.6.0 as well.

Jörn

On 01/20/2014 10:02 PM, Chen, Pei wrote:
I presume Jörn was suggesting that if he supports newlines in the OpenNLP
SentenceDetector (either as part of the trained models or as post-processing
with some rules?), cTAKES will be able to use it out of the box and we should be
able to remove any additional custom logic that we currently have, which seems
like a good idea.

[But when to use it within individual cTAKES components, such as negation, might
be another discussion?]
--Pei

On Jan 20, 2014, at 12:46 PM, "vijay garla" <vnga...@gmail.com> wrote:

The OpenNLP sentence detection model used by cTAKES does not split
sentences at newlines - there is additional logic in the cTAKES sentence
splitter that does this (and an alternative impl that doesn't is in the
ytex branch). AFAIK no retraining / change to the feature representation is
necessary.
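
To illustrate the post-processing idea, here is a minimal sketch in Python. It is not the cTAKES implementation; the function name and the (start, end) span representation are assumptions made for the example. It takes the character spans a sentence detector produced and splits each one again wherever a newline occurs, trimming whitespace and dropping empty pieces.

```python
def split_spans_at_newlines(text, spans):
    """Split each (start, end) sentence span at CR/LF characters.

    Hypothetical helper: `spans` are half-open character offsets into `text`,
    as a sentence detector might report them.
    """
    result = []
    for start, end in spans:
        piece_start = start
        for i in range(start, end + 1):
            # A piece ends at a newline character or at the end of the span.
            if i == end or text[i] in "\r\n":
                s, e = piece_start, i
                # Trim surrounding whitespace from the piece.
                while s < e and text[s].isspace():
                    s += 1
                while e > s and text[e - 1].isspace():
                    e -= 1
                if s < e:  # skip pieces that are empty after trimming
                    result.append((s, e))
                piece_start = i + 1
    return result
```

With this approach the model itself stays untouched; only the spans it returns are refined.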

Vj

On Monday, January 20, 2014, Jörn Kottmann <kottm...@gmail.com> wrote:

Hi all,

I currently have quite a bit of time to work on OpenNLP, and would like to
help you out with this issue.

Here is the follow-up issue for this change:
https://issues.apache.org/jira/browse/OPENNLP-602

I am still trying to figure out what would be the best option to implement
this.
In the training data a user could just use a special tag to identify the
chars.

Instead of <NEWLINE> it might be better to use <CR> and <LF> to encode
these two chars
in the training data. Any thoughts?
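
A minimal sketch of what such an encoding could look like, written in Python. The <CR> and <LF> tags are only the proposal from this mail, not a format OpenNLP is known to support, and the function names are made up for the example. The training file stays one sentence per line; newline characters inside a sentence are replaced by tags.

```python
CR_TAG = "<CR>"  # proposed tag for carriage return (assumption, not an OpenNLP feature)
LF_TAG = "<LF>"  # proposed tag for line feed (assumption, not an OpenNLP feature)

def to_training_line(sentence):
    """Encode the CR and LF characters inside one sentence as tags."""
    return sentence.replace("\r", CR_TAG).replace("\n", LF_TAG)

def from_training_line(line):
    """Decode the tags back into the original characters."""
    return line.replace(CR_TAG, "\r").replace(LF_TAG, "\n")

def write_training_data(sentences):
    """Return training data with exactly one encoded sentence per line."""
    return "".join(to_training_line(s) + "\n" for s in sentences)
```

Using two tags keeps the round trip lossless for both Unix and Windows line endings. Text that already contains a literal "<CR>" or "<LF>" would be ambiguous; this sketch does not handle that case.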

I am planning to release this as part of OpenNLP 1.6.0.

Thanks,
Jörn

On 05/22/2013 02:03 PM, Jörn Kottmann wrote:

On 05/22/2013 01:17 PM, Miller, Timothy wrote:

That's awesome! It might be worth trying at least. How does the training
process change? Previously the training data would be one sentence per
line, but with newlines as possible mid-sentence characters that could
be trouble, is there a new representation for training data? Or would we
have to use the training API?

Good point, yes, that will be a problem with the default training format,
but it shouldn't be hard to solve. In the format itself we could define a
newline tag, e.g. <NEWLINE>, to mark newlines.
As a hack to make it work with 1.5.3, you could instead use a special char
as a replacement for the newline char.
When you pass the text down to the sentence detector a simple string
replace could be used to
convert all new line chars to the special new line marker char.
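
The string replace could look roughly like this (a Python sketch; the marker character and the function name are arbitrary choices for the example, not anything defined by OpenNLP). Each newline character is swapped for exactly one marker character, so the text keeps its length and any offsets the detector reports still apply to the original text.

```python
# Hypothetical marker: any single character that never occurs in the corpus.
NEWLINE_MARKER = "\u00b6"  # pilcrow sign, chosen only for illustration

def encode_newlines(text, marker=NEWLINE_MARKER):
    """Replace every CR and LF with the marker, one character for one."""
    if len(marker) != 1:
        raise ValueError("marker must be a single character")
    if marker in text:
        raise ValueError("marker already occurs in the text; pick another one")
    return text.replace("\r", marker).replace("\n", marker)
```

The same replacement has to be applied to the training data and to the text passed to the detector at runtime, otherwise the model never sees the marker it was trained on.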

If things work out for you performance-wise as well, we will just
integrate it properly into OpenNLP for the next release.

Could you produce a sentence detector training file with a new line
marker char?

You should try to pick a char you can also pass in on a terminal;
otherwise you have to use the API to train the model. The built-in cross
validation could be used to evaluate the performance.

Jörn

