Hi Everyone, I hope you don't mind a few basic questions from someone who is quite new to NLP. Hopefully they are coherent and relevant to the group.
I have been enjoying Olivier's fantastic work on pignlproc. I am trying to build an OpenNLP model that can extract entities mentioned in news articles, starting with "countries". Using pignlproc, I successfully created a training set of about 10K annotated sentences extracted from a 100MB chunk of Wikipedia, and trained a country model from that set using the default OpenNLP CLI tools.

My first test of the new model was the article at http://hosted.ap.org/dynamic/stories/S/SOC_FOUR_NATIONS?SITE=MOSPL&SECTION=HOME&TEMPLATE=DEFAULT . I ran the name finder on the plain text of the article, after sentence detection and tokenization (my code is sketched below my signature). It successfully detected Sweden, Canada, and Germany, but not the United States or China. So, a few questions:

1) Olivier reported a recall of 0.64 for English location entities. Does this mean that, in general, detecting only 3 of 5 countries mentioned in an article is about what one would expect? There were actually 12 mentions of the 5 distinct countries in the article (it found Sweden twice), so the recall for this simple test was more like 42%. And obviously, a single article is not a sufficient sample to judge by. I know, my next task should be to run the OpenNLP evaluator on a separate dataset, right? (My guess at the invocation is below.)

2) When creating the country model, all of the sentences passed to the trainer had countries in them. Am I supposed to be passing sentences that do not contain countries as well? (See the training-format sample below for what I mean.)

3) The sentences passed to the trainer were not split up into document groups. How much of an effect will this have on the results? Is there a way to do this split using the existing pignlproc scripts? (The empty line in the sample below is my guess at how document boundaries are marked.)

4) Does anyone have basic troubleshooting advice for understanding why specific texts are not being successfully extracted? I've got OpenNLP and Maxent up and running in my IDE, so I'm trying to work out the best places to set breakpoints and look at intermediate results. Is this usually your approach, or do you take more of a black-box approach? In a similar vein, do you know an easy way to view the feature vector for a specific entry in the model? (The closest I've found is printing per-token probabilities; see the snippet below.)

5) Are there any customizations you would suggest making to the feature generator? (My reading of how a custom one gets wired in is at the bottom of this message.)

Thanks to Olivier for sharing his work with the semantic community! I appreciate any and all help you can provide.

Thanks in advance,
Michael
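
P.S. For reference, here is roughly how I am running the extractor, using the OpenNLP 1.5-style API (stream closing omitted for brevity). The model file names (en-sent.bin, en-token.bin, en-ner-country.bin) are just what I happen to call mine:

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.Span;

    public class CountryExtractor {

        public static void main(String[] args) throws Exception {
            // Stock OpenNLP models for sentence detection and tokenization,
            // plus the country model trained from the pignlproc output.
            SentenceDetectorME sentenceDetector =
                new SentenceDetectorME(new SentenceModel(open("en-sent.bin")));
            TokenizerME tokenizer =
                new TokenizerME(new TokenizerModel(open("en-token.bin")));
            NameFinderME nameFinder =
                new NameFinderME(new TokenNameFinderModel(open("en-ner-country.bin")));

            String article = "...";  // plain text of the AP article, fetched separately

            for (String sentence : sentenceDetector.sentDetect(article)) {
                String[] tokens = tokenizer.tokenize(sentence);
                for (Span span : nameFinder.find(tokens)) {
                    // A Span covers tokens [start, end); join them for display.
                    StringBuilder name = new StringBuilder();
                    for (int i = span.getStart(); i < span.getEnd(); i++) {
                        if (name.length() > 0) name.append(' ');
                        name.append(tokens[i]);
                    }
                    System.out.println(span.getType() + ": " + name);
                }
            }
            // Forget adaptive data before the next document, since the
            // previous-map features otherwise carry over between articles.
            nameFinder.clearAdaptiveData();
        }

        private static InputStream open(String path) throws Exception {
            return new FileInputStream(path);
        }
    }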
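
My guess at the evaluator invocation for 1), since the flag names seem to vary between OpenNLP versions (please correct me if this is wrong):

    bin/opennlp TokenNameFinderEvaluator -encoding UTF-8 \
        -model en-ner-country.bin -data country-heldout.txt

where country-heldout.txt is a held-out file in the same training format, kept separate from the 10K training sentences.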
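
For 2) and 3), here is the training-data format as I understand it: one tokenized sentence per line with <START:country> ... <END> marking the entities, plain lines for sentences containing no countries, and an empty line between documents (which, if I read it right, is what triggers clearing of the adaptive data):

    <START:country> Sweden <END> rallied to beat <START:country> Canada <END> 3 - 1 on Saturday .
    The hosts remain unbeaten in the tournament .

    Exports from <START:country> Germany <END> rose sharply last quarter .

At the moment my pignlproc output contains only lines of the first kind, with no plain lines and no empty lines.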
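
For 4), the closest I have found to inspecting the model's decisions is the per-token probabilities of the last decoded sequence, e.g. (with the same tokenizer and nameFinder as above):

    String[] tokens = tokenizer.tokenize("The United States and China also qualified .");
    nameFinder.find(tokens);
    // probs() reports the probability of the outcome chosen for each token
    // in the sentence just decoded; low values flag shaky decisions.
    double[] probs = nameFinder.probs();
    for (int i = 0; i < tokens.length; i++) {
        System.out.printf("%-12s %.3f%n", tokens[i], probs[i]);
    }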
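
And for 5), my reading of how a custom feature generator gets wired in, based on the classes in opennlp.tools.util.featuregen (I believe this mirrors the built-in default, but I may have it wrong):

    import opennlp.tools.util.featuregen.*;

    AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
        new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
        new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
        new OutcomePriorFeatureGenerator(),
        new PreviousMapFeatureGenerator(),
        new BigramNameFeatureGenerator(),
        new SentenceFeatureGenerator(true, false));

    // The same generator then has to be passed both to NameFinderME.train(...)
    // and to new NameFinderME(model, featureGenerator, NameFinderME.DEFAULT_BEAM_SIZE)
    // at extraction time, or the features won't line up.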
