2011/7/4 Jörn Kottmann <[email protected]>:
> On 7/4/11 7:41 PM, Olivier Grisel wrote:
>>
>> 2011/7/4 Jörn Kottmann<[email protected]>:
>>>
>>> On 7/4/11 7:20 PM, Olivier Grisel wrote:
>>>>
>>>> Keeping the correct link position from the original markup while
>>>> cleaning it can be tricky though. Be careful when tweaking the parser.
>>>> Maybe the Span helper classes from OpenNLP could help make this code
>>>> more robust.
>>>
>>> I wonder how important the links are here, because we do not want to
>>> throw away sentences which do not have links covering their entities.
>>>
>>> But I believe the links might be very interesting for entity
>>> identification: if, let's say, a person name is labeled and also
>>> covered by a link, the link can be used to identify the person mention.
>>
>> Yes, this is exactly what pignlproc is doing: building a NameFinder
>> training corpus automatically from the link position info from the
>> Wikipedia articles and the entity typing info from the DBpedia dumps
>> (this article is a person, this one is an organization...).
>>
>>> And after we have a few manually labeled articles we can use the links to
>>> generate special features which are passed to the name finder.
>>>
>>> So in the end, do we just generate an annotation for every link?!
>>
>> This is very important to build a preannotated corpus to bootstrap and
>> train a first version of OpenNLP models automatically. This model can
>> then be used to annotate new text without any annotations, and human
>> refinement can then be used to produce gold annotations rapidly by
>> mostly validating / fixing existing annotations rather than annotating
>> text from scratch.
>>
>> Links can also be useful to build a NE disambiguation training corpus.
>>
>
> The automatic labeling can be supported by features generated for the link
> annotations; this way I guess the name finder performs much better, but
> evaluation will show that.

Yes, but such features are useless for tagging other named entity
occurrences for which we don't have any kind of link data. If the data
is linked to a Wikipedia page, you can just run a SPARQL query on DBpedia
to get the type, with no need for any kind of NLP (or, better, build a
local index of entity types using Solr based on the DBpedia dump).

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
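
The local index of entity types mentioned above could look roughly like the sketch below. It is only an illustration under several assumptions: the DBpedia instance types dump is read as N-Triples lines, a plain in-memory dict stands in for the Solr index, the function names (`build_type_index`, `types_for_link`) are hypothetical, and the Wikipedia URL is mapped to a DBpedia resource by a simple prefix swap that ignores redirects and encoding differences. The sample triples are illustrative stand-ins for dump content.

```python
import re
from collections import defaultdict

# Matches an N-Triples line whose subject, predicate and object are all IRIs.
TRIPLE_RE = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$")
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
WIKIPEDIA_PREFIX = "http://en.wikipedia.org/wiki/"
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


def build_type_index(lines):
    """Map each DBpedia resource IRI to the set of its rdf:type IRIs.

    Hypothetical helper: a dict is used here in place of a Solr index.
    """
    index = defaultdict(set)
    for line in lines:
        match = TRIPLE_RE.match(line.strip())
        if match is None:
            continue  # skip comments, blank lines and literal-valued triples
        subject, predicate, obj = match.groups()
        if predicate == RDF_TYPE:
            index[subject].add(obj)
    return dict(index)


def types_for_link(index, wikipedia_url):
    """Return the known types for the target of a Wikipedia link.

    Assumption: the DBpedia resource name equals the Wikipedia page name.
    Returns an empty set for unknown pages or non-Wikipedia URLs.
    """
    if not wikipedia_url.startswith(WIKIPEDIA_PREFIX):
        return set()
    resource = DBPEDIA_PREFIX + wikipedia_url[len(WIKIPEDIA_PREFIX):]
    return index.get(resource, set())


# Illustrative sample in the shape of the instance types dump.
SAMPLE_DUMP = [
    "# comment lines are ignored",
    "<http://dbpedia.org/resource/Barack_Obama> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://dbpedia.org/ontology/Person> .",
    "<http://dbpedia.org/resource/Apache_Software_Foundation> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://dbpedia.org/ontology/Organisation> .",
]

index = build_type_index(SAMPLE_DUMP)
obama_types = types_for_link(index, "http://en.wikipedia.org/wiki/Barack_Obama")
```

With an index like this, typing a linked mention is a dictionary lookup rather than a remote query per link; mentions without any link still need the trained name finder.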
