On 06/06/2013 03:48 PM, Tim Miller wrote:
Hi opennlp,
I started a thread on ctakes-dev about training the sentence detector
to allow newlines in the middle of sentences. Jörn said it was
possible, and now I have a question about how to proceed.
I've replaced all newlines with a special character (ß) and built a
small training file of sentences. I have a question about opennlp
training file format that I couldn't find in the documentation. At the
end of a section, there might be a period, multiple newlines, and some
miscellaneous whitespace:
This concludes the recording of family history.ß ß ßHISTORY OF
PRESENT ILLNESS:
Now, for downstream processing we probably want one sentence ending
with "..history." and the next beginning "HISTORY..."
But what does that mean for the file format? Should it include all the
newlines and other whitespace between the period (which "officially"
ends the sentence) and the start of the next sentence? If so, does it
go at the end of the first line or the beginning of the second? Does
the algorithm even use this info?
Sorry about the barrage of questions and thanks for your help with
this. It's already coming along nicely but just want to make sure I'm
doing it optimally.
I had a look at the code, and as far as I can see, whitespace is
simply assumed to sit between two sentences when you train from a
training file. If you use the API directly, that assumption is not
made, so if you have some UIMA CASes with sentence markup it's
probably easier to train directly without using the training format.
The newline characters (ß in your case) should be in the training
data, and you probably want to place them consistently. In the current
implementation a training event is generated for each of them, so for
the case you describe I would suggest attaching them to the front of
the next line rather than to the end of the previous line. You might
want to experiment a bit to see what gives the best results.
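
To make the suggested layout concrete, here is a minimal sketch in
plain Java (no OpenNLP calls) that turns a raw document into
one-sentence-per-line training text with the newline markers attached
to the front of the following line. The class name, the method name
and the array of sentence end offsets are hypothetical; the sketch
also assumes newlines are plain \n, so that swapping each one for the
marker keeps all character offsets unchanged.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
final class TrainingLineBuilder {

    // Stand-in character for a newline (U+00DF, the sharp s used above).
    static final char NEWLINE_MARKER = '\u00DF';

    /**
     * Cuts the text at each sentence end offset (exclusive, ascending).
     * Markers and whitespace that follow a sentence end therefore land
     * at the front of the next line instead of the end of this one.
     */
    static List<String> toTrainingLines(String rawText, int[] sentenceEnds) {
        // One-for-one replacement, so the offsets stay valid.
        String marked = rawText.replace('\n', NEWLINE_MARKER);
        List<String> lines = new ArrayList<>();
        int start = 0;
        for (int end : sentenceEnds) {
            lines.add(marked.substring(start, end));
            start = end;
        }
        // Keep any trailing text that follows the last annotated sentence.
        if (start < marked.length()) {
            lines.add(marked.substring(start));
        }
        return lines;
    }
}
```

For the example from the original question, cutting right after
"history." gives one line ending in "history." and a second line that
starts with the three markers followed by "HISTORY OF PRESENT
ILLNESS:".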
The current sentence detector has an optimization that removes all
newlines and whitespace around a detected sentence. With your
modification that will not work, so the newlines will be included in
the returned sentence span.
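
If downstream code needs spans without that padding, one option is to
trim them after detection. The following is only a sketch under that
assumption, in plain Java with hypothetical names; it works on
character offsets and does not touch the OpenNLP classes. Note that it
would also strip a genuine ß at the edge of a sentence, which should
not matter for English clinical text.

```java
// Hypothetical post-processing helper, not part of OpenNLP.
final class SentenceSpanTrimmer {

    // Stand-in character for a newline (U+00DF, the sharp s used above).
    static final char NEWLINE_MARKER = '\u00DF';

    /**
     * Shrinks the half-open range [start, end) until it neither begins
     * nor ends with whitespace or a newline marker.
     * Returns the new offsets as {start, end}.
     */
    static int[] trim(CharSequence text, int start, int end) {
        while (start < end && isPadding(text.charAt(start))) {
            start++;
        }
        while (end > start && isPadding(text.charAt(end - 1))) {
            end--;
        }
        return new int[] {start, end};
    }

    static boolean isPadding(char c) {
        return c == NEWLINE_MARKER || Character.isWhitespace(c);
    }
}
```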
If you want to have a look at the feature generation, it's defined here:
opennlp.tools.sentdetect.DefaultSDContextGenerator
You can also implement your own feature generation, or copy and hack
the default one. For this you need to define a custom factory which
sets the sentence detector up during instantiation. I can provide you
with a sample if you want to try it.
During training you just hand in the class name of the factory, and it
will then be used for training and, later, automatically during model
instantiation.
Don't hesitate to ask further questions.
HTH,
Jörn