Re: sentence detector newline behavior

Tim Miller Thu, 06 Jun 2013 06:49:51 -0700

Hi opennlp,

I started a thread on ctakes-dev about training the sentence detector to allow newlines in the middle of sentences. Jörn said it was possible, and now I have a question about how to proceed.

I've replaced all newlines with a special character (ß) and built a small training file of sentences. I have a question about the OpenNLP training file format that I couldn't find answered in the documentation. At the end of a section, there might be a period, multiple newlines, and some miscellaneous whitespace:


   This concludes the recording of family history.ß    ß ßHISTORY OF
   PRESENT ILLNESS:
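
For reference, the newline substitution itself can be as small as the Java sketch below. The class and method names are made up for illustration, and it assumes ß (U+00DF) never occurs in the notes themselves; carriage returns are left untouched.

```java
public class NewlineMarker {
    // Assumption: U+00DF (ß) never appears in the source text itself.
    public static final char MARKER = '\u00DF';

    /** Replaces every line feed with the marker so a sentence can stay on one training line. */
    public static String encode(String text) {
        return text.replace('\n', MARKER);
    }

    /** Restores the original line feeds from the marker. */
    public static String decode(String text) {
        return text.replace(MARKER, '\n');
    }

    public static void main(String[] args) {
        String note = "family history.\n    \n \nHISTORY OF\nPRESENT ILLNESS:";
        String encoded = encode(note);
        System.out.println(encoded);
        // Round trip: decoding gives back the original text.
        System.out.println(decode(encoded).equals(note));
    }
}
```

Because one character replaces one character, the encoded text has the same length as the original.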

Now, for downstream processing we probably want one sentence ending with "..history." and the next beginning "HISTORY...". But what does that mean for the file format? Should it include all the newlines and other whitespace between the period (which "officially" ends the sentence) and the start of the next sentence? If so, does it go at the end of the first line or the beginning of the second? Does the algorithm even use this info?

Sorry about the barrage of questions, and thanks for your help with this. It's already coming along nicely, but I just want to make sure I'm doing it optimally.

Tim

On 05/23/2013 01:52 PM, Tim Miller wrote:

OK, I've started doing this. I was able to get training working on a very small example and will try a slightly bigger one.
Tim

On 05/22/2013 08:03 AM, Jörn Kottmann wrote:
On 05/22/2013 01:17 PM, Miller, Timothy wrote:
That's awesome! It might be worth trying at least. How does the training
process change? Previously the training data would be one sentence per
line, but with newlines as possible mid-sentence characters that could
be trouble. Is there a new representation for training data? Or would we
have to use the training API?
Good point, yes, that will be a problem with the default training format, but it shouldn't be hard to solve. In the format itself we could define a new line tag, e.g. <NEWLINE>, to mark new lines. As a hack to make it work with 1.5.3, you could instead use a special char as a replacement for the new line char.
When you pass the text down to the sentence detector, a simple string replace could be used to convert all new line chars to the special new line marker char.
If things work out for you performance-wise as well, we will just integrate it properly into OpenNLP for the next release.
Could you produce a sentence detector training file with a new line marker char?
You should try to pick a char you can also pass in on a terminal; otherwise you have to use the API to train the model. The built-in cross validation could be used to evaluate the performance.
Jörn
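
The runtime side of the hack described above (string-replace the newlines, then run the detector) might look like the Java sketch below. All names are hypothetical, and `toyDetector` is only a stand-in that splits after the first period; with OpenNLP the spans would instead come from something like `SentenceDetectorME.sentPosDetect`. It assumes the marker char ß (U+00DF) does not occur in the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class MarkedDetection {
    // Assumption: U+00DF (ß) is the marker the model was trained with.
    public static final char MARKER = '\u00DF';

    /**
     * Runs a span-returning detector on the marker-encoded text and cuts the
     * sentences out of the ORIGINAL text. One char replaces one char, so the
     * offsets are identical in both strings.
     */
    public static List<String> sentences(String original, Function<String, int[][]> detector) {
        String marked = original.replace('\n', MARKER);
        List<String> out = new ArrayList<>();
        for (int[] span : detector.apply(marked)) {
            out.add(original.substring(span[0], span[1]));
        }
        return out;
    }

    /** Stand-in for a trained model: splits once, after the first period. */
    public static int[][] toyDetector(String s) {
        int cut = s.indexOf('.') + 1; // 0 when there is no period
        int next = cut;
        // Skip whitespace and markers between the two sentences.
        while (next < s.length()
                && (Character.isWhitespace(s.charAt(next)) || s.charAt(next) == MARKER)) {
            next++;
        }
        if (cut == 0 || next >= s.length()) {
            return new int[][] {{0, s.length()}};
        }
        return new int[][] {{0, cut}, {next, s.length()}};
    }

    public static void main(String[] args) {
        String text = "No fever\nor chills. HISTORY";
        for (String sentence : sentences(text, MarkedDetection::toyDetector)) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```

Cutting the sentences from the original string rather than the marked one means downstream code never sees the marker char.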
