Re: sentence detector newline behavior

Jörn Kottmann Thu, 06 Jun 2013 07:32:54 -0700

On 06/06/2013 03:48 PM, Tim Miller wrote:

Hi opennlp,
I started a thread on ctakes-dev about training the sentence detector to allow newlines in the middle of sentences. Jorn said it was possible, and now I have a question about how to proceed.
I've replaced all newlines with a special character (ß) and built a small training file of sentences. I have a question about the opennlp training file format that I couldn't find answered in the documentation. At the end of a section, there might be a period, multiple newlines, and some miscellaneous whitespace:
   This concludes the recording of family history.ß    ß ßHISTORY OF
   PRESENT ILLNESS:
Now, for downstream processing we probably want one sentence ending with "..history." and the next beginning "HISTORY..."

But what does that mean for the file format? Should it include all the newlines and other whitespace between the period (which "officially" ends the sentence) and the start of the next sentence? If so, does it go at the end of the first line or the beginning of the second? Does the algorithm even use this info?

Sorry about the barrage of questions and thanks for your help with this. It's already coming along nicely but just want to make sure I'm doing it optimally.

I had a look at the code, and as far as I can see the white spaces are just assumed to be between two sentences if you train with a training file. If you use the API directly, that assumption is not made, so in case you have some UIMA CASes with sentence markup, it's probably easier to train directly without using the training format.

The new line chars (ß in your case) should be there, and you probably want them to be consistent. In the current implementation a training event will be generated for each of them, so in this case I would suggest attaching them to the front of the next line rather than leaving them at the end of the previous line. You might want to experiment a bit to see what gives the best results.
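
To make that suggestion concrete, here is a minimal sketch of how such training lines could be built. It is plain Java and does not call any OpenNLP API. The class name `TrainingLineBuilder`, the method `toTrainingLines`, and the idea of passing sentence end offsets as an `int[]` are all assumptions made for illustration. Each output line carries the gap before the sentence (newlines turned into the placeholder) at its front.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
final class TrainingLineBuilder {
    // U+00DF (sharp s): the newline placeholder chosen in this thread.
    static final char NEWLINE_PLACEHOLDER = '\u00DF';

    /**
     * Splits text into one training line per sentence.
     *
     * sentenceEnds holds the exclusive end offset of each sentence, ascending.
     * Whatever lies between the previous sentence end and the next sentence
     * (newlines, blanks) is attached to the FRONT of the next line.
     * Text after the last offset is ignored.
     */
    static List<String> toTrainingLines(String text, int[] sentenceEnds) {
        List<String> lines = new ArrayList<>();
        int start = 0;
        for (int end : sentenceEnds) {
            if (end <= start || end > text.length()) {
                throw new IllegalArgumentException(
                        "sentence ends must be ascending and within the text: " + end);
            }
            // Normalize all line-break styles, then swap in the placeholder.
            String chunk = text.substring(start, end)
                    .replace("\r\n", "\n")
                    .replace('\r', '\n')
                    .replace('\n', NEWLINE_PLACEHOLDER);
            lines.add(chunk);
            start = end;
        }
        return lines;
    }
}
```

Writing the returned lines to a file, one per line, gives a file in the shape described above. Whether plain leading blanks on a line survive the training file reader is something I have not checked here, so treat that part as an open question and compare results both ways.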

The current sentence detector has an optimization that removes all new lines and white spaces around a detected sentence; with your modification that will not work. New lines will be included in the returned sentence span.
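
If the placeholders in the returned spans get in the way downstream, one option is to trim them yourself after detection. The following is only a sketch in plain Java under that assumption. `SpanTrimmer` and `trimSpan` are hypothetical names, and the span is represented as a bare pair of offsets rather than any OpenNLP type.

```java
// Hypothetical post-processing helper, not part of OpenNLP.
final class SpanTrimmer {
    // U+00DF (sharp s): the newline placeholder chosen in this thread.
    static final char NEWLINE_PLACEHOLDER = '\u00DF';

    private static boolean isPadding(char c) {
        return c == NEWLINE_PLACEHOLDER || Character.isWhitespace(c);
    }

    /**
     * Shrinks the half-open range [start, end) so that it neither begins
     * nor ends with whitespace or the newline placeholder.
     * Returns {newStart, newEnd}; both are equal if nothing else remains.
     */
    static int[] trimSpan(CharSequence text, int start, int end) {
        while (start < end && isPadding(text.charAt(start))) {
            start++;
        }
        while (end > start && isPadding(text.charAt(end - 1))) {
            end--;
        }
        return new int[] {start, end};
    }
}
```

Note that ß is also an ordinary letter in German, so a sentence that really begins or ends with that letter would lose it here. For text where that can happen, a placeholder that never occurs in the data would be the safer choice.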


If you want to have a look at the feature generation, it's defined here:
opennlp.tools.sentdetect.DefaultSDContextGenerator

You can also implement your own feature generation, or copy and hack the default one; for this you need to define a custom factory which sets the sentence detector up during instantiation. I can provide you with a sample if you want to try it. During training you can just hand in the class name of the factory, and it will then be used for training and later during model instantiation automatically.

Don't hesitate to ask further questions.

HTH,
Jörn
