2011/1/24 Olivier Grisel <[email protected]>:
> 2011/1/24 Michael Migdol <[email protected]>:
>> Hi Everyone,
>
> Hi Michael and thanks for sharing your experiments,
>
>> So, a few questions:
>>
>> 1) Olivier stated results for English location entities were a recall of
>> 0.64. Does this mean that, in general, detecting only 3 of 5
>> countries mentioned in an article is about what one would expect?
>> There were actually 12 mentions in the article for the 5 distinct
>> countries (it found Sweden twice), so the recall for this simple test
>> was actually more like 42%. And obviously, a single article is not a
>> sufficient sample size to judge with. I know, my next task should be to
>> run the OpenNLP evaluator on a separate dataset, right?
>
> The evaluation I did was using a model trained on more than 100k
> sentences. Maybe the recall is even worse than mine because you used a
> much smaller training set. I realised that I haven't uploaded my
> results for English on S3. I will do so and let you know when it's
> done.

Here it is:

http://pignlproc.s3.amazonaws.com/corpus/en/opennlp_location/part-r-00000

The output is chunked: to get the following chunks, replace the
trailing file name with part-r-00001, part-r-00002, and so on.

Indeed, looking at the output, China is never annotated while most
other countries are. This is likely caused by the fact that the
country article in Wikipedia / DBpedia is named "People's Republic of
China" while "China" is the article for the civilization.

Furthermore, pignlproc does not yet resolve Wikipedia / DBpedia
redirect data such as the data available from
http://downloads.dbpedia.org/3.5.1/en/redirects_en.nt.bz2 .

So I think it would be really worth implementing the additional left
outer JOIN / COGROUP on the redirect data. If you manage to do so,
please send me a patch :)

Also, here are the resulting models I trained in my post for the
English language:

http://pignlproc.s3.amazonaws.com/models/opennlp/en-ner-location.bin
http://pignlproc.s3.amazonaws.com/models/opennlp/en-ner-person.bin
http://pignlproc.s3.amazonaws.com/models/opennlp/en-ner-organization.bin

Best,

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
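
P.S. To make the redirect idea more concrete, here is a rough sketch in
plain Python rather than Pig. This is not how pignlproc does it (it
does not resolve redirects yet). The helper names (load_redirects,
resolve, resolve_links) are made up for illustration, and I am only
assuming that each line of the redirect dump is an N-Triples statement
of the form "<source> <predicate> <target> ." with URIs in all three
positions. A link target with no redirect entry is kept unchanged,
which is the behaviour a left outer join would give.

```python
import re

# One N-Triples statement whose subject, predicate and object are all URIs.
# Assumption: the redirect dump uses this shape for every redirect.
TRIPLE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$')


def load_redirects(lines):
    """Build a dict mapping each redirecting URI to its target URI."""
    redirects = {}
    for line in lines:
        match = TRIPLE.match(line.strip())
        if match:
            source, _predicate, target = match.groups()
            redirects[source] = target
    return redirects


def resolve(uri, redirects, max_hops=5):
    """Follow redirects from uri, stopping on a cycle or after max_hops.

    A URI that has no redirect entry is returned unchanged, which mirrors
    the unmatched side of a left outer join.
    """
    seen = set()
    while uri in redirects and uri not in seen and len(seen) < max_hops:
        seen.add(uri)
        uri = redirects[uri]
    return uri


def resolve_links(links, redirects):
    """Rewrite (anchor_text, target_uri) pairs to point at resolved URIs."""
    return [(text, resolve(uri, redirects)) for text, uri in links]
```

The real dump could be streamed with bz2.open(path, "rt",
encoding="utf-8") and passed directly to load_redirects. In Pig the
same thing would be a left outer JOIN of the link targets against the
(source, target) relation, but I have not written or tested that.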
