On 6/24/11 7:42 PM, Hannes Korte wrote:
On 06/24/2011 06:37 PM, Jörn Kottmann wrote:
You mean we take the user annotations above a certain agreement level
from the first class types to the second class types to get the gold
annotations? For entities this is no problem, but where do we start for
tokens and sentences? I think we initially apply the current OpenNLP
sentence splitter and tokenizer, right?

Exactly. For sentences we have a special annotation to label end-of-sentence characters as split or not split. We do the same for tokens, but there the split annotation has a length of zero. The users can then vote on these annotations. Since it's a binary decision, each vote
is either true or false.
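
To make the voting idea concrete, here is a rough sketch of how binary split votes could be turned into gold decisions once a certain agreement level is reached. The function name, the 0.75 agreement threshold and the minimum of two votes are assumptions for illustration only; nothing in this thread fixes those values.

```python
from collections import defaultdict


def resolve_split_votes(votes, min_agreement=0.75, min_votes=2):
    """Turn annotator votes into gold split decisions.

    votes: iterable of (offset, is_split) pairs, one pair per vote.
    Returns {offset: bool} only for offsets with enough votes and
    enough agreement; everything else stays undecided.
    The thresholds are hypothetical defaults.
    """
    tally = defaultdict(lambda: [0, 0])  # offset -> [no_count, yes_count]
    for offset, is_split in votes:
        tally[offset][1 if is_split else 0] += 1

    gold = {}
    for offset, (no, yes) in tally.items():
        total = no + yes
        if total < min_votes:
            continue  # not enough votes yet
        if max(no, yes) / total >= min_agreement:
            gold[offset] = yes > no
    return gold


votes = [
    (5, True), (5, True), (5, True),   # unanimous: split
    (11, True), (11, False),           # disagreement: stays open
    (17, False), (17, False),          # unanimous: no split
]
print(resolve_split_votes(votes))
```

Offsets that do not reach the agreement level are simply left out, so they can be handed to more annotators.
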
For example, we could ask the annotators to label token splits; from
these token splits we can derive the actual token annotations. For
English texts the annotation UI could make use of the alphanumeric
optimization and only ask the user about questionable token splits.
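
A minimal sketch of that derivation, under two assumptions that are mine and not stated above: whitespace always separates tokens, and the "alphanumeric optimization" means that chunks consisting only of letters and digits are never offered to the annotator. Both function names are hypothetical.

```python
import re


def candidate_split_offsets(text):
    """Offsets where an annotator would be asked for a split decision.

    Purely alphanumeric chunks are skipped (the assumed optimization),
    so only positions inside chunks such as "arrived." are returned.
    """
    candidates = []
    for match in re.finditer(r"\S+", text):
        chunk = match.group()
        if len(chunk) < 2 or chunk.isalnum():
            continue
        candidates.extend(range(match.start() + 1, match.end()))
    return candidates


def tokens_from_splits(text, split_offsets):
    """Derive (start, end) token spans from zero-length split annotations."""
    splits = set(split_offsets)
    spans = []
    for match in re.finditer(r"\S+", text):
        start = match.start()
        for pos in range(match.start() + 1, match.end()):
            if pos in splits:
                spans.append((start, pos))
                start = pos
        spans.append((start, match.end()))
    return spans


text = "Mr. Smith arrived."
spans = tokens_from_splits(text, [17])  # annotators confirmed a split before the final "."
print([text[s:e] for s, e in spans])
```
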
OK, so similar to the entities, the UI needs to show the token boundaries
and offer a way to change them. Or do you want this
functionality in a different UI than the named entity one?

I am not sure, maybe we should just try something and then refine it after
a little experimenting. I think both would work in the beginning.

I guess we do best if we only hand out articles with high-quality tokenization to tasks that depend on it. So maybe it would be good to have some UI to quickly
confirm that the tokenization is OK.

For named entity annotations the user could do BIO style token
labeling through a special ui, similar to the one in Walter. The BIO
labels can then be used to compute the name spans.
Until the beginning of this post I thought we use the name spans to
compute the BIO labels, not the other way round. But if we show the
tokens as single blocks, then it makes sense to use some sort of
BIO-style annotations.
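
For the direction from BIO labels to name spans, a small sketch could look like the following. It works on token indices (end exclusive) and treats a stray "I-" label without a matching "B-" as the start of a new span, which is a choice made here for robustness and not something decided in this thread.

```python
def bio_to_spans(labels):
    """Convert per-token BIO labels into (start, end, type) spans.

    start/end are token indices, end exclusive. An "I-X" label that does
    not continue a span of type X is treated as the start of a new span.
    """
    spans = []
    start, etype = None, None
    for i, label in enumerate(labels):
        continues = (
            label.startswith("I-") and start is not None and label[2:] == etype
        )
        if continues:
            continue
        if start is not None:
            spans.append((start, i, etype))  # close the open span
            start, etype = None, None
        if label != "O":
            start, etype = i, label[2:]      # open a new span
    if start is not None:
        spans.append((start, len(labels), etype))
    return spans


labels = ["O", "B-PER", "I-PER", "O", "B-PER", "B-PER", "I-PER"]
print(bio_to_spans(labels))
```
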

For example, the user navigates over the tokens with the left and right
arrow keys. If he hits "P" (for "B-PER") then the focus moves to the
next token. Hitting "p" marks it as "I-PER", hitting "P" another time
marks it as a new entity ("B-PER") and hitting "space" marks it as "O",
i.e., removing a previous annotation. The arrow keys don't change the
label. Feels pretty usable in my mind.. :)
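
The key handling described above can be sketched as a tiny state machine. The key names and the function are hypothetical, and one detail is an assumption: the text only says that "P" moves the focus to the next token, so letting "p" and "space" advance the focus as well is my guess at the intended behavior.

```python
def apply_keys(num_tokens, keys, entity_type="PER"):
    """Simulate keyboard labeling over a row of tokens.

    "left"/"right" move the focus without changing labels.
    "P" sets B-<type>, "p" sets I-<type>, "space" resets to "O";
    each of these then advances the focus (assumed for "p" and "space").
    """
    if num_tokens == 0:
        return []
    labels = ["O"] * num_tokens
    key_to_label = {
        "P": "B-" + entity_type,
        "p": "I-" + entity_type,
        "space": "O",
    }
    focus = 0
    for key in keys:
        if key == "left":
            focus = max(0, focus - 1)
        elif key == "right":
            focus = min(num_tokens - 1, focus + 1)
        elif key in key_to_label:
            labels[focus] = key_to_label[key]
            focus = min(num_tokens - 1, focus + 1)
    return labels


tokens = ["Mr.", "John", "Smith", "met", "Anna"]
print(apply_keys(len(tokens), ["right", "P", "p", "right", "P"]))
```
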

Yes, but the labeling UI itself can also use other methods, e.g. confirm existing entities and then confirm the entire sentence. The UI code can simply transform this into BIO-style annotations, and the UI will be able to offer a veto with a comment for a single token. Say we decide to label person names without a title in front, for example Mr. *Smith*, but someone labels it as *Mr. Smith*; then a user vetoes the annotation on "Mr." and inserts a short
comment.

Jörn
