Re: OpenNLP Annotations Proposal

Jörn Kottmann Fri, 24 Jun 2011 09:38:35 -0700

On 6/24/11 11:54 AM, Olivier Grisel wrote:

but we need to agree on a CAS type system first. I don't
know the opennlp-uima myself and won't have time to invest more effort
on this project before mid-July, unfortunately.


I suggest that there are two classes of types in the type system.

The first class contains annotations which describe the input we collect
from our annotators and are also suitable for documenting comments and
disagreements between annotators.

And the second class of annotations contains standard linguistic
annotations such as sentences, tokens, entities, chunks, parses, etc.

The idea is that the annotations in the second class can be automatically
derived from the annotations in the first class. In case the article is
not completely labeled, the statistical models could fill the gaps.

For example, we could ask the annotators to label token splits; from these
token splits we can derive the actual token annotations. For English texts
the annotation UI could make use of the alphanumeric optimization and only
ask the user about questionable token splits.
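
To make the derivation concrete, here is a rough sketch in Python. It is
only an illustration, not existing OpenNLP or opennlp-uima code: the
function names and the representation of a split as a character offset
are assumptions. It also assumes that whitespace always separates tokens
and that "questionable" means a whitespace-delimited chunk which is not
purely alphanumeric, which is my reading of the alphanumeric optimization.

```python
import re


def questionable_chunks(text):
    """Return (start, end) offsets of whitespace-delimited chunks that are
    not purely alphanumeric. Only these would be shown to the annotator
    (assumed reading of the alphanumeric optimization)."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"\S+", text)
        if not m.group().isalnum()
    ]


def tokens_from_splits(text, splits):
    """Derive (start, end) token spans from annotator-marked split offsets.

    `splits` holds character offsets at which a new token begins inside a
    chunk. Whitespace is always treated as a token boundary.
    """
    boundaries = set(splits)
    spans = []
    start = None
    for i, ch in enumerate(text):
        if ch.isspace():
            # Whitespace closes the token that is currently open, if any.
            if start is not None:
                spans.append((start, i))
                start = None
            continue
        if start is None:
            start = i
        elif i in boundaries:
            # The annotator marked a split here: close and reopen.
            spans.append((start, i))
            start = i
    if start is not None:
        spans.append((start, len(text)))
    return spans
```

With the text "Hello, world." and splits marked before the comma and
before the final period (offsets 5 and 12), this yields the four tokens
"Hello", ",", "world" and ".".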

A similar approach could be used for sentence annotations.

For named entity annotations the user could do BIO-style token labeling
through a special UI, similar to the one in Walter. The BIO labels can
then be used to compute the name spans.
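
Computing the name spans from the BIO labels could look roughly like the
following Python sketch. The label format ("B-person", "I-person", "O")
and the function name are assumptions for illustration; spans are token
index ranges with an exclusive end. A stray "I-" label without a
preceding "B-" is treated leniently as the start of a new span, which is
a design choice and not something we have agreed on.

```python
def bio_to_spans(labels):
    """Convert per-token BIO labels into (start, end, type) name spans.

    `start` and `end` are token indexes, `end` is exclusive.
    """
    spans = []
    start = None  # token index where the open span began
    kind = None   # entity type of the open span
    for i, label in enumerate(labels):
        prefix, _, name = label.partition("-")
        continues = prefix == "I" and start is not None and name == kind
        if not continues and start is not None:
            # Anything other than a matching I- label closes the open span.
            spans.append((start, i, kind))
            start, kind = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            # B- always opens a span; an unattached I- is accepted as well.
            start, kind = i, name
    if start is not None:
        spans.append((start, len(labels), kind))
    return spans
```

For the labels B-person, I-person, O, B-location this gives one person
span covering tokens 0 to 2 and one location span covering token 3.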

Our models can either be trained directly on the derived annotations, or
we add a sentence-level annotation where users need to confirm that the
entire sentence is labeled correctly, for example that all person
annotations are marked in this sentence.
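
The second option could be sketched as below, again in Python and again
purely hypothetical: the record layout, the field names and the function
are assumptions made up for this example, not a proposed type system.
The point is only that the confirmation is stored per sentence and per
entity type, and that training uses just the confirmed sentences.

```python
from dataclasses import dataclass, field


@dataclass
class SentenceRecord:
    text: str
    # (start, end, type) token spans derived from the annotator input
    name_spans: list = field(default_factory=list)
    # entity types the annotator confirmed as completely labeled
    confirmed_types: set = field(default_factory=set)


def training_sentences(records, entity_type):
    """Keep only sentences confirmed as completely labeled for the type."""
    return [r for r in records if entity_type in r.confirmed_types]
```

A sentence without any person name still counts as training data for the
person model once the annotator has confirmed it, because the absence of
names is then known and not just unlabeled.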

Any opinions?

Jörn
