2011/7/4 Jörn Kottmann <[email protected]>:
> On 7/4/11 7:20 PM, Olivier Grisel wrote:
>>
>> Keeping the correct link position from the original markup while
>> cleaning it can be tricky though. Be careful when tweaking the parser.
>> Maybe the Span helper classes from OpenNLP could help make this code
>> more robust.
>
> I wonder how important the links are here, because we do not want to throw
> away sentences which do not have links covering their entities.
>
> But I believe the links might be very interesting for entity identification
> if, let's say, a person name is labeled and also covered by a link. The link
> can be used to identify the person mention.

Yes, this is exactly what pignlproc is doing: building a NameFinder
training corpus automatically from the link position info in the
Wikipedia articles and the entity typing info from the DBpedia dumps
(this article is a person, this one is an organization, ...).
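
As a rough sketch of the idea (this is not pignlproc's actual code), the
snippet below strips simplified `[[Target|anchor]]` wiki links while
tracking where each anchor ends up in the cleaned text, then wraps the
typed ones in the `<START:type> ... <END>` markers of the OpenNLP name
finder training format. The regex, the function names and the
`entity_types` dict (standing in for the DBpedia typing info) are all
assumptions made for illustration; real wiki markup has many more cases
(templates, nested links, files) that this ignores.

```python
import re

# Simplified wiki link syntax: [[Target]] or [[Target|anchor text]].
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def strip_links(markup):
    """Return (clean_text, spans).

    Each span is (start, end, target) with offsets into clean_text,
    so link positions survive the markup removal.
    """
    parts, spans = [], []
    pos = 0       # read position in the original markup
    out_len = 0   # length of the cleaned text produced so far
    for m in LINK_RE.finditer(markup):
        before = markup[pos:m.start()]
        parts.append(before)
        out_len += len(before)
        target = m.group(1)
        anchor = m.group(2) or target
        spans.append((out_len, out_len + len(anchor), target))
        parts.append(anchor)
        out_len += len(anchor)
        pos = m.end()
    parts.append(markup[pos:])
    return "".join(parts), spans


def to_namefinder_line(text, spans, entity_types):
    """Wrap spans whose target has a known type in <START:type> ... <END>."""
    out, pos = [], 0
    for start, end, target in spans:
        etype = entity_types.get(target)
        if etype is None:
            continue  # untyped link target: keep the anchor as plain text
        out.append(text[pos:start])
        out.append("<START:%s> %s <END>" % (etype, text[start:end]))
        pos = end
    out.append(text[pos:])
    return "".join(out)


# Hypothetical input: one whitespace-tokenized sentence and a tiny type map.
markup = ("[[Alan Turing]] joined the "
          "[[Victoria University of Manchester|University of Manchester]] "
          "in [[1948]] .")
entity_types = {
    "Alan Turing": "person",
    "Victoria University of Manchester": "organization",
}

clean, spans = strip_links(markup)
print(to_namefinder_line(clean, spans, entity_types))
```

Keeping the offsets relative to the cleaned text is what makes the later
annotation step trivial; sentences without any typed link simply come out
unannotated rather than being dropped.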

> And after we have a few manually labeled articles we can use the links to
> generate special features which are passed to the name finder.
>
> So in the end, do we just generate an annotation for every link?!

This is very important for building a preannotated corpus to bootstrap
and train a first version of the OpenNLP models automatically. That model
can then be used to annotate new text that has no annotations, and human
refinement can then produce gold annotations rapidly by mostly
validating / fixing existing annotations rather than annotating text
from scratch.

Links can also be useful to build a training corpus for NE disambiguation.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
