On Thu, Feb 9, 2012 at 11:41 AM, Jens Grivolla <[email protected]> wrote:
> Hi,
>
> On 02/08/2012 05:52 PM, Katrin Tomanek wrote:
>
>> [...] I realized that only these EOS (end of sentence)
>>
>> characters are currently supported:
>>
>> '.', '!', '?'
>>
>> However, in our case we have many other EOS (":" as one of the most
>> common ones)
>>
>
> I believe our situation is even worse, because we want to have line breaks
> as possible EOS. We use OpenNLP through UIMA where this should not be an
> issue, but I understand that the algorithms are designed to work with
> training files that use line breaks to represent sentence boundaries, i.e.
> line breaks are used as a meta character that can not actually occur within
> the document.
>
> When introducing configurability of EOS characters it would be good to
> take that into account and provide a way to deal with line breaks in the
> documents.
>
>
Actually, I think you need to detect the basic document/article structure
first, e.g. headline, sub-headline, paragraphs, bylines, ...
The Sentence Detector is designed to split a paragraph into sentences,
not to detect the document structure.
In your case I would write an Analysis Engine that identifies your
"text blocks" and marks each one with an annotation, and then tell the
Sentence Detector AE to perform sentence splitting only on the text within
these annotations (this is already implemented).
I used this approach for news analysis.
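
To make the two-stage idea concrete, here is a minimal sketch in plain Java.
It does not use the OpenNLP or UIMA APIs: the class and method names are
hypothetical, the "one non-empty line = one text block" rule is only an
assumed heuristic, and the naive EOS-based splitter is a stand-in for the
real Sentence Detector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of OpenNLP or UIMA.
final class BlockwiseSentenceSplitter {

    // Stage 1: find "text blocks". Assumption: every non-empty line
    // (headline, paragraph, byline, ...) is one block.
    static List<String> findTextBlocks(String document) {
        List<String> blocks = new ArrayList<>();
        for (String line : document.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                blocks.add(trimmed);
            }
        }
        return blocks;
    }

    // Stage 2: split a single block into sentences. A sentence ends at an
    // EOS character that is followed by whitespace or by the end of the block.
    // This is a naive stand-in for a trained sentence detector.
    static List<String> splitSentences(String block, Set<Character> eosChars) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < block.length(); i++) {
            boolean atEnd = (i == block.length() - 1);
            if (eosChars.contains(block.charAt(i))
                    && (atEnd || Character.isWhitespace(block.charAt(i + 1)))) {
                String sentence = block.substring(start, i + 1).trim();
                if (!sentence.isEmpty()) {
                    sentences.add(sentence);
                }
                start = i + 1;
            }
        }
        // A block without a final EOS character (e.g. a headline) still
        // yields one sentence, because sentences never cross block borders.
        String rest = block.substring(start).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }

    // Run stage 2 only inside the blocks found by stage 1.
    static List<String> split(String document, Set<Character> eosChars) {
        List<String> all = new ArrayList<>();
        for (String block : findTextBlocks(document)) {
            all.addAll(splitSentences(block, eosChars));
        }
        return all;
    }
}
```

Calling BlockwiseSentenceSplitter.split(text, Set.of('.', '!', '?')) keeps a
headline or byline as its own sentence even though it has no EOS character,
and adding ':' to the set shows what a configurable EOS list would change.
In a real pipeline stage 1 would be the block-annotating AE and stage 2 the
Sentence Detector AE restricted to those annotations.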
We had a couple of bugs in the whitespace handling of the sentence
detector; these are now fixed, so you should not have any whitespace
issues anymore.
The sentence detector can also be trained through the UIMA integration;
for that you need to provide CASes with sentence annotations.
Hope this helps,
Jörn