Re: pignlproc: a new tool to build NER models from Wikipedia / DBpedia dumps

Jörn Kottmann Wed, 19 Jan 2011 14:41:25 -0800

On 1/19/11 11:17 PM, Olivier Grisel wrote:

In that annotation project we could introduce the concept of
"atomic" annotations. These are annotations that are only considered
correct in a part of the article. Some named entity annotations could
maybe be created directly from the wiki markup with an approach similar
to the one you used, and more could be produced by the community.
I guess it is possible to give these partially available named entities
to our name finder to automatically label the rest of the article with
higher precision than usual.

It's worth a try but needs careful manual validation and evaluation of
the quality.

Having these atomic annotations is, I think, very important for a
community labeling project, because it allows people to add information
to the article only where they are really sure it is correct. There may
be a few cases where they are unsure; with atomic annotations they are
not forced to label the whole article. We have to see exactly how that
could be done, which also depends on the component. For the name finder
it would be easy to do it on a sentence level, or maybe even a mixture
of document-level, sentence-level and individual annotations is possible.
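
To make the sentence-level variant concrete, here is a minimal sketch in
Python. It is not how OpenNLP represents training data; the `Sentence`
class, the `verified` flag and `split_for_training` are hypothetical names
used only to illustrate keeping confirmed labels apart from the rest.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    tokens: list
    # (start, end, type) token spans; hypothetical representation.
    spans: list = field(default_factory=list)
    # True only if a person confirmed the labels of this sentence.
    verified: bool = False


def split_for_training(sentences):
    """Separate human-verified sentences from those still needing labels."""
    trusted = [s for s in sentences if s.verified]
    pending = [s for s in sentences if not s.verified]
    return trusted, pending


article = [
    Sentence(["Olivier", "wrote", "pignlproc", "."], [(0, 1, "person")], True),
    Sentence(["It", "reads", "Wikipedia", "dumps", "."]),
]
trusted, pending = split_for_training(article)
```

Only the `trusted` sentences would be used as training data, while the
`pending` ones could be handed to the name finder for automatic labeling.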


If the overall quality is good enough, training on semi-automatically
labeled articles could also be an option.

After we have manually labeled a few hundred articles with entities we
could even go a step further and try to create new features for the name
finder which take the wiki markup into account (such a name finder could
also help your project to process the whole Wikipedia).

Yes, it would be great to add new gazetteer features (names and
alternative spellings for famous entities such as persons, places,
organizations and so on), maybe in a compressed form using Bloom
filters:

   http://en.wikipedia.org/wiki/Bloom_filter

Yes, having something like that would be really nice. There are other
interesting applications of Bloom filters in NLP. Jason once pointed me
to a paper where they used Bloom filters for language models.
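
As a rough illustration of the gazetteer idea, here is a minimal Bloom
filter sketch in Python. It is not an existing OpenNLP feature: the
`BloomFilter` class, the `gazetteer_feature` helper, the filter size and
the sample names are all assumptions made for the example. A Bloom filter
never misses a name that was added, but it can report false positives, so
the feature means "possibly a known name".

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by double hashing one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def gazetteer_feature(tokens, start, end, gazetteer):
    """Return True if the token span [start, end) may be a known name."""
    return " ".join(tokens[start:end]).lower() in gazetteer


# Hypothetical sample entries; a real gazetteer would come from the dumps.
person_names = BloomFilter()
for name in ["Barack Obama", "Angela Merkel"]:
    person_names.add(name.lower())

tokens = ["Yesterday", "Angela", "Merkel", "spoke"]
is_known = gazetteer_feature(tokens, 1, 3, person_names)
```

The memory use is fixed by `num_bits` regardless of how long the names
are, which is what makes the compressed form attractive for large lists.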

+1 to work on that

If we start something like that it might only be useful for the
tokenizer, sentence detector and name finder in the short term. Maybe
over time it is even possible to add annotations for all the components
we have in OpenNLP to this corpus.

What do others think?

+1 overall

We also need user-friendly tooling to quickly review / validate / fix an
annotated corpus (rather than using vim or emacs).


Yes, this tooling should actually exceed the capabilities you could get
with a text editor, in that the annotations in the text are updated as
soon as the user adds one. That way labeling will be sped up dramatically.
Often articles contain the same name over and over again, and it is really
boring labeling a name 5-6 times, because it feels like doing it once
should be enough to get the rest labeled automatically.
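
The "label once, propagate everywhere" behaviour could look roughly like
the following Python sketch. The function name `propagate_annotation` and
the character-offset span format are assumptions for illustration, not
part of the Cas Editor or OpenNLP. It uses plain exact matching on word
boundaries, so the proposed spans would still need a quick review by the
user.

```python
import re


def propagate_annotation(text, name, entity_type):
    """Return (start, end, type) spans for each whole-word match of name."""
    # Lookarounds prevent matching inside longer words, e.g. "Oliviers".
    pattern = re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)")
    return [(m.start(), m.end(), entity_type) for m in pattern.finditer(text)]


article = "Olivier wrote pignlproc. Later Olivier posted it. Oliviers is different."
spans = propagate_annotation(article, "Olivier", "person")
```

Here the user labels "Olivier" once and the tool proposes the second
occurrence as well, while leaving the longer word "Oliviers" alone.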

I am actually the author of the Cas Editor; maybe we could write a plugin
for it or start some completely new web-based tooling.

We also need annotation guidelines which explain what should be labeled
and what should not.

Jörn
