On 26/01/2012 20:39, Olivier Grisel wrote:
You should use the DBpedia NTriples dumps instead of parsing the
Wikipedia templates as done in https://github.com/ogrisel/pignlproc .
The type information for persons, places and organizations is very good.
Ok, it will be my next step.
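
Just to fix the idea, this is the kind of filtering I have in mind for the instance-types dump; a minimal sketch, assuming the usual one-triple-per-line layout of instance_types_en.nt (a real N-Triples parser would be more robust than splitting on spaces, and the exact file name should be checked against the dump):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class DbpediaTypeFilter {

        private static final String RDF_TYPE =
                "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>";

        public static void main(String[] args) throws IOException {
            Set<String> wanted = new HashSet<String>();
            wanted.add("<http://dbpedia.org/ontology/Person>");
            wanted.add("<http://dbpedia.org/ontology/Place>");
            wanted.add("<http://dbpedia.org/ontology/Organisation>");

            // resource URI -> ontology type, kept only for the three types of interest
            Map<String, String> typeByResource = new HashMap<String, String>();
            BufferedReader in = new BufferedReader(new FileReader("instance_types_en.nt"));
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(" ");
                if (parts.length >= 4 && RDF_TYPE.equals(parts[1]) && wanted.contains(parts[2])) {
                    typeByResource.put(parts[0], parts[2]);
                }
            }
            in.close();
            System.out.println(typeByResource.size() + " typed resources kept");
        }
    }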

I don't think it's a huge problem for training, but it is indeed a
problem for the performance evaluation: if you use some held-out folds
from this dataset for performance evaluation (precision, recall,
f1-score of the trained NameFinder model), then the fact that the
dataset itself is missing annotations will artificially inflate the
false positive count, which can have a great impact on the estimated
precision. The actual precision should be higher than what is measured.
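For example, with made-up numbers: if the model predicts 1000 entity mentions and 900 of them are genuinely correct, but 100 of those correct ones are not annotated in the automatically generated data, the evaluation counts only 800 true positives and 200 false positives, so the measured precision is 800 / 1000 = 0.80 while the true precision is 900 / 1000 = 0.90.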
My feeling is that if I train the model with sentences that are missing annotations, these will worsen its performance. Isn't that so?

I think the only way to fix this issue is to manually correct the
annotations of a small portion of the automatically generated dataset
to add the missing ones. I think we probably need 1000 sentences per
type to get a non-ridiculous validation set.

Besides performance evaluation, the missing annotations will also
bias the model towards negative responses, hence increasing the false
negative rate and decreasing the model's true recall.

That's exactly what I mean. The fact is that in our interpretation of Wikipedia, not all the sentences are annotated. That is because not all the sentences containing an entity require a link. So I'm thinking of using only a better subset of my sentences (since there are so many of them). Hence the idea of sampling only featured pages: stubs or poor pages probably have a greater chance of being poorly annotated.

The idea may also be extended with the other proposal, which I'll try to explain with an example. Imagine a page about a vegetable. If a city appears in a sentence of this page, it may well appear unlinked (i.e. not annotated), since the topic of the article isn't closely related. Conversely, I suspect that in a page about geography, places are tagged more frequently. This is obviously a hypothesis, which should be verified.

Another idea is to use only sentences containing links to the entities of interest. For example:
* "[[Milan|Milan]] is an industrial city" becomes "<place>Milan</place> is an industrial city"
* "[[Paris|Paris Hilton]] was drunk last Friday." becomes "Paris was drunk last Friday" (this sentence is kept because the link text is in the list of candidates to be tagged as places, but here the anchor suggests it isn't one, hence it is a good negative example)
* "Paris is a very touristic city." is discarded because it doesn't contain any interesting link
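
Just to make this filter concrete, a rough sketch of the conversion (the candidate lists are hypothetical, and I keep the <place>...</place> notation from the examples above; the actual OpenNLP name finder training format is <START:place> ... <END>):

    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LinkBasedSentenceFilter {

        // Links written as [[surface text|target page]], as in the examples above.
        private static final Pattern LINK = Pattern.compile("\\[\\[([^|\\]]+)\\|([^\\]]+)\\]\\]");

        /** Returns the training sentence, or null if the sentence contains no interesting link. */
        static String toTrainingSentence(String wikiSentence, Set<String> placeTargets,
                                         Set<String> candidateSurfaces) {
            Matcher m = LINK.matcher(wikiSentence);
            StringBuffer out = new StringBuffer();
            boolean interesting = false;
            while (m.find()) {
                String surface = m.group(1);
                String target = m.group(2);
                if (placeTargets.contains(target)) {
                    interesting = true;                            // positive example
                    m.appendReplacement(out, "<place>" + surface + "</place>");
                } else if (candidateSurfaces.contains(surface)) {
                    interesting = true;                            // kept as a negative example
                    m.appendReplacement(out, surface);
                } else {
                    m.appendReplacement(out, surface);             // not interesting by itself
                }
            }
            m.appendTail(out);
            return interesting ? out.toString() : null;            // null = discard the sentence
        }

        public static void main(String[] args) {
            Set<String> places = new HashSet<String>();
            places.add("Milan");
            Set<String> candidates = new HashSet<String>();
            candidates.add("Paris");

            System.out.println(toTrainingSentence("[[Milan|Milan]] is an industrial city", places, candidates));
            System.out.println(toTrainingSentence("[[Paris|Paris Hilton]] was drunk last Friday.", places, candidates));
            System.out.println(toTrainingSentence("Paris is a very touristic city.", places, candidates));
        }
    }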



In my first experiment reported in [1] I had not taken the Wikipedia
redirect links into account, which probably aggravated this problem
even further. The current version of the Pig script has been fixed
w.r.t. redirect handling [2], but I have not found the time to rerun a
complete performance evaluation. This will solve frequent
classification errors such as "China", which is redirected to "People's
Republic of China" in Wikipedia. So just handling the redirects may
improve the quality of the data, and hence of the trained model, by
quite a bit.

[1] http://dev.blogs.nuxeo.com/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html
[2] https://github.com/ogrisel/pignlproc/blob/master/examples/ner-corpus/02_dbpedia_article_types.pig#L22
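
As a side note, once the redirect pairs are extracted, resolving them before the type lookup is essentially a transitive map lookup; a minimal sketch with made-up data:

    import java.util.HashMap;
    import java.util.Map;

    public class RedirectResolver {

        /** Follows redirects until a non-redirect page is reached (with a small cycle guard). */
        static String resolve(String title, Map<String, String> redirects) {
            String current = title;
            for (int i = 0; i < 10 && redirects.containsKey(current); i++) {
                current = redirects.get(current);
            }
            return current;
        }

        public static void main(String[] args) {
            Map<String, String> redirects = new HashMap<String, String>();
            redirects.put("China", "People's Republic of China");

            // The type lookup should be done on the resolved title, not on the raw link target.
            System.out.println(resolve("China", redirects)); // -> People's Republic of China
        }
    }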

Also note that the perceptron model was not available when I ran this
experiment. It's probably more scalable, especially memory-wise, and
would be well worth trying again.

In my case I can handle redirects too, and I will surely also try the perceptron model.
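
From what I can see, switching to the perceptron should only be a matter of training parameters; a minimal sketch, assuming the current opennlp-tools API (the exact train() signature changes between versions, and the training file name is just a placeholder):

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.NameSampleDataStream;
    import opennlp.tools.namefind.TokenNameFinderFactory;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class PerceptronTraining {
        public static void main(String[] args) throws IOException {
            // "wikipedia-place.train" is a placeholder for the OpenNLP-formatted corpus.
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new MarkableFileInputStreamFactory(new File("wikipedia-place.train")),
                    StandardCharsets.UTF_8);
            ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

            TrainingParameters params = new TrainingParameters();
            params.put(TrainingParameters.ALGORITHM_PARAM, "PERCEPTRON"); // instead of the default maxent
            params.put(TrainingParameters.ITERATIONS_PARAM, "300");
            params.put(TrainingParameters.CUTOFF_PARAM, "0");

            TokenNameFinderModel model = NameFinderME.train(
                    "en", "place", samples, params, new TokenNameFinderFactory());

            OutputStream modelOut = new FileOutputStream("en-ner-place.bin");
            model.serialize(modelOut);
            modelOut.close();
        }
    }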

In my experience the DBpedia type links for Person, Place and Organization are of very good quality: no false positives, though there might be some missing links. It might be interesting to do some manual checking of the top 100 recurring false positive names after a first round of DBpedia extraction => model training => model evaluation on held-out data. Then, if a significant portion of those false positive names are actually missing type info in DBpedia or in the redirect links, add them manually and iterate.
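
Something along these lines should be enough to collect the recurring false positive surface forms; a rough sketch against the current OpenNLP API (span types are ignored for simplicity):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.Span;

    public class FalsePositiveNames {

        /** Counts surface forms predicted by the model that are missing from the held-out annotations. */
        static List<Map.Entry<String, Integer>> recurringFalsePositives(
                NameFinderME finder, ObjectStream<NameSample> heldOut, int topN) throws IOException {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            NameSample sample;
            while ((sample = heldOut.read()) != null) {
                String[] tokens = sample.getSentence();
                for (Span predicted : finder.find(tokens)) {
                    boolean inReference = false;
                    for (Span reference : sample.getNames()) {
                        // Compare offsets only, so the span type does not get in the way.
                        if (predicted.getStart() == reference.getStart()
                                && predicted.getEnd() == reference.getEnd()) {
                            inReference = true;
                            break;
                        }
                    }
                    if (!inReference) {
                        String surface = Span.spansToStrings(new Span[] { predicted }, tokens)[0];
                        Integer c = counts.get(surface);
                        counts.put(surface, c == null ? 1 : c + 1);
                    }
                }
                finder.clearAdaptiveData();
            }
            List<Map.Entry<String, Integer>> sorted =
                    new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
            sorted.sort((a, b) -> b.getValue() - a.getValue());
            return sorted.subList(0, Math.min(topN, sorted.size()));
        }
    }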

Ok, now I have a lot of ideas for customizing my experiments. Of course I will publish my results as soon as I run my tests. However, I'd also like to go more in depth on the training parameters, so the discussion goes on :)


Anyway, if you are interested in reviving the annotation sub-project, please feel free to do so: https://cwiki.apache.org/OPENNLP/opennlp-annotations.html We need a database of annotated open data text (Wikipedia, Wikinews, Project Gutenberg...) with human validation metadata and a nice web UI to maintain it.

I think it would be a great thing, and also a piece of work that requires a solid design phase (a mistake here could lead to a lot of problems in the future). I'll think about contributing to the project, but it certainly won't be immediate.

Thanks
    Riccardo
