On 06/06/2013 03:48 PM, Tim Miller wrote:
Hi opennlp,
I started a thread on ctakes-dev about training the sentence detector to allow newlines in the middle of sentences. Jorn said it was possible, and now I have a question about how to proceed.

I've replaced all newlines with a special character (ß) and built a small training file of sentences. I have a question about opennlp training file format that I couldn't find in the documentation. At the end of a section, there might be a period, multiple newlines, and some miscellaneous whitespace:

   This concludes the recording of family history.ß    ß ßHISTORY OF
   PRESENT ILLNESS:

Now, for downstream processing we probably want one sentence ending with "..history." and the next beginning "HISTORY..." But what does that mean for the file format? Should it include all the newlines and other whitespace between the period (which "officially" ends the sentence) and the start of the next sentence? If so, does it go at the end of the first line or the beginning of the second? Does the algorithm even use this info? Sorry about the barrage of questions and thanks for your help with this. It's already coming along nicely but just want to make sure I'm doing it optimally.

I had a look at the code. As far as I can see, when you train from a training file the whitespace is simply assumed to sit between two sentences. If you use the API directly, that assumption is not made, so if you have UIMA CASes with sentence markup it's probably easier to train directly through the API without using the training format.

The newline chars (ß in your case) should be there, and you probably want them to be placed consistently. In the current implementation a training event is generated for each of them, so for this case I would suggest attaching them to the front of the next line rather than leaving them at the end of the previous line. You might want to experiment a bit to see what gives the best results.
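
To make the "attach them to the front of the next line" layout concrete, here is a minimal sketch in plain Java that does not use OpenNLP at all. The class and method names are made up for illustration, and it assumes that line breaks are single '\n' characters and that you already have gold sentence offsets (for example from your CAS annotations). Whatever sits between two sentences is encoded and prepended to the following training line; anything after the last sentence is dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
final class NewlineMarkerTrainingLines {
    // Stand-in for a line break (U+00DF, the marker used in this thread).
    static final char MARKER = '\u00DF';

    // Replaces every '\n' with the marker; the string length stays the same.
    static String encode(String text) {
        return text.replace('\n', MARKER);
    }

    // spans[i] = {begin, end} of sentence i in doc, end exclusive, in document order.
    static List<String> toTrainingLines(String doc, int[][] spans) {
        List<String> lines = new ArrayList<>();
        int prevEnd = 0;
        for (int[] span : spans) {
            // Text between the previous sentence and this one goes in front of this one.
            String gap = encode(doc.substring(prevEnd, span[0]));
            // Drop plain whitespace before the first marker; the one-sentence-per-line
            // layout already separates the two sentences.
            gap = gap.replaceFirst("^\\s+", "");
            lines.add(gap + encode(doc.substring(span[0], span[1])));
            prevEnd = span[1];
        }
        return lines;
    }

    public static void main(String[] args) {
        String doc = "This concludes the recording of family history.\n    \n \n"
                + "HISTORY OF PRESENT ILLNESS:";
        int[][] spans = {
            {0, doc.indexOf('.') + 1},
            {doc.indexOf("HISTORY"), doc.length()}
        };
        for (String line : toTrainingLines(doc, spans)) {
            System.out.println(line);
        }
    }
}
```

Whether the plain spaces inside the gap should be kept or dropped is one of the things worth experimenting with; the sketch keeps everything from the first marker onwards.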

The current sentence detector has an optimization that removes all newlines and whitespace around a detected sentence. With your modification that will not work: the newline markers will be included in the returned sentence span.
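
If the markers in the returned spans get in the way downstream, one possible workaround is to trim them yourself after detection. The following is only a sketch in plain Java with made-up names, not something OpenNLP provides. It assumes the marker is ß (U+00DF) and that a span is a pair of begin/end offsets with the end exclusive. Since each marker replaces exactly one '\n', the trimmed offsets should still line up with the original text.

```java
// Hypothetical post-processing helper, not part of OpenNLP.
final class MarkerSpanTrimmer {
    // Stand-in for a line break (U+00DF, the marker used in this thread).
    static final char MARKER = '\u00DF';

    static boolean isPadding(char c) {
        return c == MARKER || Character.isWhitespace(c);
    }

    // Shrinks [begin, end) until it neither starts nor ends with a marker
    // or whitespace. Returns {begin, end}; empty if the span was all padding.
    static int[] trim(String text, int begin, int end) {
        while (begin < end && isPadding(text.charAt(begin))) {
            begin++;
        }
        while (end > begin && isPadding(text.charAt(end - 1))) {
            end--;
        }
        return new int[] {begin, end};
    }

    public static void main(String[] args) {
        String text = "history.\u00DF    \u00DF \u00DFHISTORY OF PRESENT ILLNESS:\u00DF";
        // Pretend the detector returned everything after "history." as one span.
        int[] span = trim(text, 8, text.length());
        System.out.println(text.substring(span[0], span[1]));
    }
}
```

Note that this also strips a real ß at the start or end of a sentence, so it only makes sense if that character does not otherwise occur at those positions in your text.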

If you want to have a look at the feature generation, it's defined here:
opennlp.tools.sentdetect.DefaultSDContextGenerator

You can also implement your own feature generation, or copy and hack the default one. For this you need to define a custom factory which sets the sentence detector up during instantiation. I can provide you with a sample if you want to try it. During training you just hand in the class name of the factory, and it will then be used automatically for training and later during model instantiation.

Don't hesitate to ask further questions.

HTH,
Jörn
