2011/1/24 Michael Migdol <[email protected]>:
> Hi Everyone,

Hi Michael and thanks for sharing your experiments.

> So, a few questions:
>
> 1) Olivier stated results for English location entities were a recall of
> 0.64. Does this mean that, in general, detecting only 3 of 5
> countries mentioned in an article is about what one would expect?
> There were actually 12 mentions in the article for the 5 distinct
> countries (it found Sweden twice), so the recall for this simple test
> was actually more like 42%. And obviously, a single article is not a
> sufficient sample size to judge with. I know, my next task should be to
> run the OpenNLP evaluator on a separate dataset, right?

The evaluation I did was using a model trained on more than 100k
sentences. Maybe your recall is even worse than mine because you used a
much smaller training set.

I realised that I haven't uploaded my results for English to S3. I will
do so and let you know when it's done.
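By the way, running the evaluator programmatically should be easy with
the OpenNLP 1.5 API. Here is a minimal, untested sketch (the model and
evaluation file names are just placeholders; the held-out data is
expected in the usual <START:country> ... <END> annotated format, one
sentence per line):

import java.io.FileInputStream;
import java.io.InputStreamReader;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderEvaluator;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class EvaluateCountryModel {

    public static void main(String[] args) throws Exception {
        // Load the trained name finder model (placeholder file name).
        TokenNameFinderModel model = new TokenNameFinderModel(
                new FileInputStream("en-ner-country.bin"));

        // Held-out annotated sentences, one per line (placeholder file name).
        ObjectStream<String> lines = new PlainTextByLineStream(
                new InputStreamReader(
                        new FileInputStream("country-eval.txt"), "UTF-8"));
        ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

        // Run the evaluator and print precision / recall / F1.
        TokenNameFinderEvaluator evaluator =
                new TokenNameFinderEvaluator(new NameFinderME(model));
        evaluator.evaluate(samples);
        System.out.println(evaluator.getFMeasure());
    }
}

That would give you precision / recall / F1 over a whole dataset instead
of a single article.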
> 2) When creating the country Model, all of the sentences passed to the
> trainer had countries in them. Am I supposed to be passing sentences
> that do not contain countries as well?

That would help precision if you have a way to be sure that those
sentences do not contain any country. Since Wikipedia is only partially
labeled / linked, there is no obvious way to do it without manual
annotations.

> 3) The sentences passed to the trainer were not split up into document
> groups. How much of an effect will this have on the results? Is
> there a way to do this split using the existing pignlproc scripts?

As pignlproc only keeps sentences with at least one link / annotation of
the requested type, the resulting documents will be incomplete, hence I
decided to treat each individual sentence as a document. I don't know if
this has a huge impact or not. It should be possible to concatenate
sentences originating from the same document into a single line though.
It needs a bit of tweaking in the pignlproc scripts and maybe the UDF as
well.

> 4) Does anyone have some basic troubleshooting advice for trying to
> understand why specific texts are not being successfully extracted?
> I've got OpenNLP and Maxent up and running in my IDE, so I'm trying to
> understand where the best places to break and look at intermediate
> results. Is this usually your approach, or do you take more of a
> black-box approach? In a similar vein, do you know an easy way to
> view the feature vector for a specific entry in the model?

I don't think introspecting the running Maxent code will help much.
Manually introspecting the training corpus might help you understand its
limitations. Manual completion of the training corpus should really help
but would be tedious without proper tools.

Also you should try the approach described by Ted Dunning in this
comment:

http://blogs.nuxeo.com/dev/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html#comment-6a010536291c30970b0147e1c7e65b970b

In a nutshell: use the trained model to annotate the original corpus to
add some missing annotations and retrain a second model on the resulting
corpus. The second model should have better recall. The precision might
degrade a bit though. Be sure to manually check the quality of the
re-annotated corpus to make sure that this did not introduce too many
wrong annotations.
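For illustration, the re-annotation step could look like the following
(untested sketch; the file names are placeholders, and it assumes
whitespace-tokenized sentences, one per line, and a model trained for
the "country" type):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStreamReader;
import java.io.PrintWriter;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class ReannotateCorpus {

    public static void main(String[] args) throws Exception {
        // First-generation model trained on the partially annotated corpus.
        NameFinderME finder = new NameFinderME(new TokenNameFinderModel(
                new FileInputStream("en-ner-country-v1.bin")));

        BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream("sentences-raw.txt"), "UTF-8"));
        PrintWriter out = new PrintWriter(
                new FileWriter("sentences-reannotated.txt"));

        String line;
        while ((line = in.readLine()) != null) {
            // Assumes whitespace tokenization, one sentence per line.
            String[] tokens = line.split("\\s+");
            Span[] names = finder.find(tokens);

            // Re-emit the sentence with <START:country> ... <END> tags so
            // it can be fed back to the trainer.
            out.println(new NameSample(tokens, names, false).toString());

            // Each sentence is treated as its own document here, so clear
            // the adaptive data after every line.
            finder.clearAdaptiveData();
        }
        in.close();
        out.close();
    }
}

This naive version only keeps the model's predictions; to be closer to
what Ted suggests you would merge those spans with the annotations
already extracted from the Wikipedia links before retraining.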
> 5) Are there any customizations you would suggest making to the
> Feature Generator?

Well, as Jörn mentioned, dictionary-based features (a.k.a. gazetteers)
should indeed help. pignlproc should make it really easy to build such
dictionaries out of the Wikipedia content. Alternatively, querying
DBpedia directly would give an even easier way to build such gazetteers.
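For instance, something along these lines should be enough to dump the
English labels of all DBpedia countries into a flat list (untested
sketch, assuming the Jena ARQ library and the public DBpedia SPARQL
endpoint; the output file name is just a placeholder):

import java.io.FileWriter;
import java.io.PrintWriter;

import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QuerySolution;
import com.hp.hpl.jena.query.ResultSet;

public class CountryGazetteer {

    public static void main(String[] args) throws Exception {
        // Ask DBpedia for the English labels of all resources typed as Country.
        String query =
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
            "PREFIX dbo: <http://dbpedia.org/ontology/> " +
            "SELECT DISTINCT ?label WHERE { " +
            "  ?country a dbo:Country ; rdfs:label ?label . " +
            "  FILTER (lang(?label) = 'en') " +
            "}";

        QueryExecution qe = QueryExecutionFactory.sparqlService(
                "http://dbpedia.org/sparql", query);
        PrintWriter out = new PrintWriter(new FileWriter("countries.txt"));
        try {
            // One entry per line; the flat list can then be tokenized and
            // loaded into an opennlp.tools.dictionary.Dictionary for a
            // dictionary-based feature generator.
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.nextSolution();
                out.println(row.getLiteral("label").getString());
            }
        } finally {
            qe.close();
            out.close();
        }
    }
}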
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel