2011/6/24 Jörn Kottmann <[email protected]>:
> On 6/24/11 11:54 AM, Olivier Grisel wrote:
>>
>> but we need to agree on a CAS type system first. I don't
>> know the opennlp-uima myself and won't have time to invest more effort
>> on this project before mid-July unfortunately.
>
> I suggest that there are two classes of types in the type system.
>
> The first class contains annotations which describe the input we collect
> from our annotators and are also suitable to document comments and
> disagreements between annotators.
>
> And the second class of annotations contains standard linguistic
> annotations such as sentences, tokens, entities, chunks, parses, etc.
>
> The idea is that the annotations in the second class can be derived
> automatically from the annotations in the first class. If an article is
> not completely labeled, the statistical models could fill the gaps.
>
> For example, we could ask the annotators to label token splits; from these
> token splits we can derive the actual token annotations. For English texts
> the annotation UI could make use of the alphanumeric optimization and only
> ask the user about questionable token splits.
>
> A similar approach could be used for sentence annotations.
>
> For named entity annotations the user could do BIO style token labeling
> through a special UI, similar to the one in Walter. The BIO labels can
> then be used to compute the name spans.
>
> Our models can either be trained directly on the derived annotations, or we
> add a sentence level annotation where users need to confirm that the entire
> sentence is labeled correctly, for example that all person annotations are
> marked in this sentence.
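
To make the "BIO labels to name spans" step concrete, here is a minimal Python sketch of how the derivation could work. It is only an illustration: the label format ("B-person", "I-person", "O"), the function name bio_to_spans and the example sentence are assumptions of mine, not anything taken from opennlp-uima or from Walter.

```python
from typing import List, Optional, Tuple

# (start token index, end token index exclusive, entity type)
Span = Tuple[int, int, str]


def bio_to_spans(labels: List[str]) -> List[Span]:
    """Derive name spans from per-token BIO labels (hypothetical label format)."""
    spans: List[Span] = []
    start: Optional[int] = None    # token index where the open span began
    current: Optional[str] = None  # entity type of the open span

    for i, label in enumerate(labels):
        prefix, _, etype = label.partition("-")
        # "I-x" extends the open span only if it has the same entity type.
        if prefix == "I" and current is not None and etype == current:
            continue
        # Any other label closes the open span, if there is one.
        if current is not None:
            spans.append((start, i, current))
            start, current = None, None
        # "B-x" opens a new span; a stray "I-x" is leniently treated the same way.
        if prefix in ("B", "I") and etype:
            start, current = i, etype

    # Close a span that runs to the end of the sentence.
    if current is not None:
        spans.append((start, len(labels), current))
    return spans


# Made-up example sentence, one label per token.
tokens = ["Olivier", "Grisel", "met", "Jörn", "Kottmann", "in", "Berlin"]
labels = ["B-person", "I-person", "O", "B-person", "I-person", "O", "B-location"]
spans = bio_to_spans(labels)
```

The spans are token index ranges; mapping them back to character offsets in the CAS would be a separate step using the derived token annotations.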

I like the ability to move the UI focus from one sentence to another and
to mark a complete sentence as validated.

+1 for the rest of your proposal.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
