On 26/01/2012 20:39, Olivier Grisel wrote:
You should use the DBpedia NTriples dumps instead of parsing the
Wikipedia templates as done in https://github.com/ogrisel/pignlproc .
The type information for persons, places and organizations is very good.
Ok, it will be my next step.
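
Just to fix the idea, this is the kind of filtering I have in mind for the instance-types dump; a minimal sketch, assuming the usual one-triple-per-line layout of instance_types_en.nt (a real N-Triples parser would be more robust than splitting on spaces, and the exact file name should be checked against the dump):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class DbpediaTypeFilter {

        private static final String RDF_TYPE =
                "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>";

        public static void main(String[] args) throws IOException {
            Set<String> wanted = new HashSet<String>();
            wanted.add("<http://dbpedia.org/ontology/Person>");
            wanted.add("<http://dbpedia.org/ontology/Place>");
            wanted.add("<http://dbpedia.org/ontology/Organisation>");

            // resource URI -> ontology type, kept only for the three types of interest
            Map<String, String> typeByResource = new HashMap<String, String>();
            BufferedReader in = new BufferedReader(new FileReader("instance_types_en.nt"));
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(" ");
                if (parts.length >= 4 && RDF_TYPE.equals(parts[1]) && wanted.contains(parts[2])) {
                    typeByResource.put(parts[0], parts[2]);
                }
            }
            in.close();
            System.out.println(typeByResource.size() + " typed resources kept");
        }
    }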

I don't think it's a huge problem for training, but it is indeed a
problem for the performance evaluation: if you use some held-out folds
from this dataset for performance evaluation (precision, recall,
f1-score of the trained NameFinder model), then the fact that the
dataset itself is missing annotations will artificially inflate the
false positive count, which can have a great impact on the estimated
precision. The actual precision should be higher than what is measured.
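For example, with made-up numbers: if the model predicts 1000 entity mentions and 900 of them are genuinely correct, but 100 of those correct ones are not annotated in the automatically generated data, the evaluation counts only 800 true positives and 200 false positives, so the measured precision is 800 / 1000 = 0.80 while the true precision is 900 / 1000 = 0.90.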
My feeling is that if I train the model with sentences that are missing annotations, these will worsen its performance. Isn't that so?

I think the only way to fix this issue is to manually correct the
annotations of a small portion of the automatically generated dataset
to add the missing ones. I think we probably need 1000 sentences per
type to get a non-ridiculous validation set.

Besides performance evaluation, the missing annotations will also
bias the model towards negative responses, hence increasing the false
negative rate and decreasing the model's true recall.

That's exactly what I mean. The fact is that in our interpretation of Wikipedia, not all the sentences are annotated. That is because not all the sentences containing an entity require a link. So I'm thinking of using only a better subset of my sentences (since there are so many of them). Hence the idea of sampling only featured pages: stubs or poor pages probably have a greater chance of being poorly annotated.

The idea may also be extended with the other proposal, which I'll try to explain with an example. Imagine a page about a vegetable. If a city appears in a sentence of this page, it may well appear unlinked (i.e. not annotated), since the topic of the article isn't closely related. Conversely, I suspect that in a page about geography, places are tagged more frequently. This is obviously a hypothesis, which should be verified.

Another idea is to use only sentences containing links to the entities of interest. For example:
* "[[Milan|Milan]] is an industrial city" becomes "<place>Milan</place> is an industrial city"
* "[[Paris|Paris Hilton]] was drunk last Friday." becomes "Paris was drunk last Friday" (this sentence is kept because the link text is in the list of candidates to be tagged as places, but here the anchor suggests it isn't one, hence it is a good negative example)
* "Paris is a very touristic city." is discarded because it doesn't contain any interesting link
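
Just to make this filter concrete, a rough sketch of the conversion (the candidate lists are hypothetical, and I keep the <place>...</place> notation from the examples above; the actual OpenNLP name finder training format is <START:place> ... <END>):

    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LinkBasedSentenceFilter {

        // Links written as [[surface text|target page]], as in the examples above.
        private static final Pattern LINK = Pattern.compile("\\[\\[([^|\\]]+)\\|([^\\]]+)\\]\\]");

        /** Returns the training sentence, or null if the sentence contains no interesting link. */
        static String toTrainingSentence(String wikiSentence, Set<String> placeTargets,
                                         Set<String> candidateSurfaces) {
            Matcher m = LINK.matcher(wikiSentence);
            StringBuffer out = new StringBuffer();
            boolean interesting = false;
            while (m.find()) {
                String surface = m.group(1);
                String target = m.group(2);
                if (placeTargets.contains(target)) {
                    interesting = true;                            // positive example
                    m.appendReplacement(out, "<place>" + surface + "</place>");
                } else if (candidateSurfaces.contains(surface)) {
                    interesting = true;                            // kept as a negative example
                    m.appendReplacement(out, surface);
                } else {
                    m.appendReplacement(out, surface);             // not interesting by itself
                }
            }
            m.appendTail(out);
            return interesting ? out.toString() : null;            // null = discard the sentence
        }

        public static void main(String[] args) {
            Set<String> places = new HashSet<String>();
            places.add("Milan");
            Set<String> candidates = new HashSet<String>();
            candidates.add("Paris");

            System.out.println(toTrainingSentence("[[Milan|Milan]] is an industrial city", places, candidates));
            System.out.println(toTrainingSentence("[[Paris|Paris Hilton]] was drunk last Friday.", places, candidates));
            System.out.println(toTrainingSentence("Paris is a very touristic city.", places, candidates));
        }
    }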



In my first experiment reported in [1] I had not taken the Wikipedia
redirect links into account, which probably aggravated this problem
even further. The current version of the Pig script has been fixed
w.r.t. redirect handling [2], but I have not found the time to rerun a
complete performance evaluation. This will solve frequent
classification errors such as "China", which is redirected to "People's
Republic of China" in Wikipedia. So just handling the redirects may
improve the quality of the data, and hence of the trained model, by
quite a bit.

[1] http://dev.blogs.nuxeo.com/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html
[2] https://github.com/ogrisel/pignlproc/blob/master/examples/ner-corpus/02_dbpedia_article_types.pig#L22
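
As a side note, once the redirect pairs are extracted, resolving them before the type lookup is essentially a transitive map lookup; a minimal sketch with made-up data:

    import java.util.HashMap;
    import java.util.Map;

    public class RedirectResolver {

        /** Follows redirects until a non-redirect page is reached (with a small cycle guard). */
        static String resolve(String title, Map<String, String> redirects) {
            String current = title;
            for (int i = 0; i < 10 && redirects.containsKey(current); i++) {
                current = redirects.get(current);
            }
            return current;
        }

        public static void main(String[] args) {
            Map<String, String> redirects = new HashMap<String, String>();
            redirects.put("China", "People's Republic of China");

            // The type lookup should be done on the resolved title, not on the raw link target.
            System.out.println(resolve("China", redirects)); // -> People's Republic of China
        }
    }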

Also note that the perceptron model was not available when I ran this
experiment. It's probably more scalable, especially memory-wise, and
would be well worth trying again.

In my case I can handle redirects too, and I will surely also try the perceptron model.
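
From what I can see, switching to the perceptron should only be a matter of training parameters; a minimal sketch, assuming the current opennlp-tools API (the exact train() signature changes between versions, and the training file name is just a placeholder):

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.NameSampleDataStream;
    import opennlp.tools.namefind.TokenNameFinderFactory;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class PerceptronTraining {
        public static void main(String[] args) throws IOException {
            // "wikipedia-place.train" is a placeholder for the OpenNLP-formatted corpus.
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new MarkableFileInputStreamFactory(new File("wikipedia-place.train")),
                    StandardCharsets.UTF_8);
            ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

            TrainingParameters params = new TrainingParameters();
            params.put(TrainingParameters.ALGORITHM_PARAM, "PERCEPTRON"); // instead of the default maxent
            params.put(TrainingParameters.ITERATIONS_PARAM, "300");
            params.put(TrainingParameters.CUTOFF_PARAM, "0");

            TokenNameFinderModel model = NameFinderME.train(
                    "en", "place", samples, params, new TokenNameFinderFactory());

            OutputStream modelOut = new FileOutputStream("en-ner-place.bin");
            model.serialize(modelOut);
            modelOut.close();
        }
    }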

In my experience the DBpedia type links for Person, Place and Organization are of very good quality: no false positives, though there might be some missing links. It might be interesting to do some manual checking of the top 100 recurring false positive names after a first round of DBpedia extraction => model training => model evaluation on held-out data. Then, if a significant portion of those false positive names are actually missing type info in DBpedia or in the redirect links, add them manually and iterate.
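
Something along these lines should be enough to collect the recurring false positive surface forms; a rough sketch against the current OpenNLP API (span types are ignored for simplicity):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.Span;

    public class FalsePositiveNames {

        /** Counts surface forms predicted by the model that are missing from the held-out annotations. */
        static List<Map.Entry<String, Integer>> recurringFalsePositives(
                NameFinderME finder, ObjectStream<NameSample> heldOut, int topN) throws IOException {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            NameSample sample;
            while ((sample = heldOut.read()) != null) {
                String[] tokens = sample.getSentence();
                for (Span predicted : finder.find(tokens)) {
                    boolean inReference = false;
                    for (Span reference : sample.getNames()) {
                        // Compare offsets only, so the span type does not get in the way.
                        if (predicted.getStart() == reference.getStart()
                                && predicted.getEnd() == reference.getEnd()) {
                            inReference = true;
                            break;
                        }
                    }
                    if (!inReference) {
                        String surface = Span.spansToStrings(new Span[] { predicted }, tokens)[0];
                        Integer c = counts.get(surface);
                        counts.put(surface, c == null ? 1 : c + 1);
                    }
                }
                finder.clearAdaptiveData();
            }
            List<Map.Entry<String, Integer>> sorted =
                    new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
            sorted.sort((a, b) -> b.getValue() - a.getValue());
            return sorted.subList(0, Math.min(topN, sorted.size()));
        }
    }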

Ok, now I have a lot of ideas for customizing my experiments. Of course I will publish my results as soon as I run my tests. However, I'd also like to go more in depth on the training parameters, so the discussion goes on :)


Anyway, if you are interested in reviving the annotation sub-project, please feel free to do so: https://cwiki.apache.org/OPENNLP/opennlp-annotations.html We need a database of annotated open data text (Wikipedia, Wikinews, Project Gutenberg...) with human validation metadata and a nice web UI to maintain it.

I think it would be a great thing, and also a piece of work that requires a solid design phase (a mistake here could lead to a lot of problems in the future). I'll think about contributing to the project, but it certainly won't be immediate.

Thanks
    Riccardo
