On 6/24/11 11:54 AM, Olivier Grisel wrote:
but we need to agree on a CAS type system first. I don't
know the opennlp-uima myself and won't have time to invest more effort
on this project before mid-july unfortunately.

I suggest that there are two classes of types in the type system.

The first class contains annotations which describe the input we collect
from our annotators, and which are also suitable for documenting comments and
disagreements between annotators.

The second class contains standard linguistic annotations
such as sentences, tokens, entities, chunks, parses, etc.

The idea is that the annotations in the second class can be derived
automatically from the annotations in the first class. If an article is not
completely labeled, the statistical models could fill the gaps.

For example, we could ask the annotators to label token splits; from these
token splits we can derive the actual token annotations. For English texts
the annotation UI could make use of the alphanumeric optimization and only
ask the user about questionable token splits.
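
A rough sketch of how the derivation could look, in plain Python rather than
against the CAS. The function names are made up for illustration, and the rule
that a boundary between two alphanumeric characters is never questionable is
my reading of the alphanumeric optimization, not a fixed design:

```python
def questionable_splits(text):
    """Offsets the UI would have to ask about (hypothetical helper).

    A boundary is questionable when both neighbours are non-whitespace
    and at least one of them is not alphanumeric.
    """
    return [
        i for i in range(1, len(text))
        if not text[i - 1].isspace()
        and not text[i].isspace()
        and not (text[i - 1].isalnum() and text[i].isalnum())
    ]


def tokens_from_splits(text, splits):
    """Derive (start, end) token spans from annotator-confirmed split offsets.

    Whitespace always separates tokens; `splits` holds the additional
    character offsets at which an annotator marked a token boundary.
    """
    boundaries = set(splits)
    spans = []
    start = None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                spans.append((start, i))
                start = None
        elif start is None:
            start = i
        elif i in boundaries:
            spans.append((start, i))
            start = i
    if start is not None:
        spans.append((start, len(text)))
    return spans


text = "Mr. Smith's dog."
candidates = questionable_splits(text)         # offsets shown to the annotator
tokens = tokens_from_splits(text, [9, 15])     # annotator confirmed two of them
print([text[s:e] for s, e in tokens])
```

Here the annotator would be asked about four offsets and confirm two of them,
so "Mr." stays one token while "Smith" and "'s" are separated.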

A similar approach could be used for sentence annotations.

For named entity annotations the user could do BIO-style token labeling
through a special UI, similar to the one in Walter. The BIO labels can then
be used to compute the name spans.
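
Computing the spans from the labels could look roughly like this. It is only a
sketch: the label strings ("B-person", "I-person", "O") and the lenient
handling of an "I" label that does not follow a matching "B" are assumptions,
not something we have agreed on:

```python
def bio_to_spans(labels):
    """Turn per-token BIO labels into (start, end, type) spans.

    `start` and `end` are token indices, with `end` exclusive. An "I-x"
    label that does not continue a span of type x is treated as the
    beginning of a new span instead of being rejected.
    """
    spans = []
    start = None
    kind = None
    for i, label in enumerate(labels):
        if label == "O":
            prefix, tag = "O", None
        else:
            prefix, _, tag = label.partition("-")
        continues = prefix == "I" and start is not None and tag == kind
        if start is not None and not continues:
            # The current span ends right before this token.
            spans.append((start, i, kind))
            start, kind = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            start, kind = i, tag
    if start is not None:
        spans.append((start, len(labels), kind))
    return spans


labels = ["B-person", "I-person", "O", "B-location", "B-person"]
print(bio_to_spans(labels))
```

Two adjacent names stay separate because the second one starts with its own
"B" label, which is the main reason to prefer BIO over plain inside/outside
labeling.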

Our models can either be trained directly on the derived annotations, or we
add a sentence-level annotation where users need to confirm that the entire
sentence is labeled correctly, for example that all person annotations are
marked in this sentence.
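
To illustrate the second option, training data could be selected along these
lines. The `Sentence` class and its `confirmed_types` field are hypothetical
stand-ins for whatever the sentence-level annotation ends up being in the type
system:

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    """Hypothetical stand-in for a sentence annotation with its labels."""
    tokens: list
    labels: list
    # Entity types an annotator confirmed as completely labeled here.
    confirmed_types: set = field(default_factory=set)


def training_sentences(sentences, entity_type):
    """Keep only sentences confirmed as fully labeled for `entity_type`."""
    return [s for s in sentences if entity_type in s.confirmed_types]


corpus = [
    Sentence(["Jörn", "wrote", "."], ["B-person", "O", "O"], {"person"}),
    Sentence(["Olivier", "replied", "."], ["B-person", "O", "O"]),
]
usable = training_sentences(corpus, "person")
print(len(usable))
```

Only the first sentence would be used to train a person model; the second one
is labeled but not yet confirmed, so it is held back.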

Any opinions?

Jörn
