On 1/24/11 11:01 AM, Olivier Grisel wrote:
3) The sentences passed to the trainer were not split up into document
groups.  How much of an effect will this have on the results?  Is
there a way to do this split using the existing pignlproc scripts?
As pignlproc only keeps sentences with at least one link / annotation
of the requested type, the resulting documents will be incomplete,
hence I decided to treat each individual sentence as a document. I
don't know if this has a huge impact or not. It should be possible to
concatenate sentences originating from the same document into a
single line though. It needs a bit of tweaking in the pignlproc
scripts and maybe the UDF as well.

Documents in the native OpenNLP training format are separated by
blank lines, and each line should contain exactly one sentence. Then
OpenNLP can train the previous map feature correctly, which helps to
achieve better recall.
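
As a rough illustration of that layout, here is a minimal Python sketch.
It assumes the extracted sentences are available as (doc_id, sentence)
pairs with sentences from the same document adjacent to each other; that
input shape and the function name are assumptions made for this example,
not something pignlproc is known to produce as-is.

```python
from itertools import groupby


def to_opennlp_documents(rows):
    """Lay out sentences one per line, with a blank line between documents.

    `rows` is an iterable of (doc_id, sentence) pairs (a hypothetical input
    shape). Rows of the same document must be adjacent, because groupby
    only merges consecutive items that share a key.
    """
    blocks = []
    for _doc_id, group in groupby(rows, key=lambda row: row[0]):
        # Drop empty sentences so they are not mistaken for document breaks.
        sentences = [sentence.strip() for _, sentence in group if sentence.strip()]
        if sentences:
            blocks.append("\n".join(sentences))
    if not blocks:
        return ""
    # A single blank line separates consecutive documents.
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    sample = [
        ("doc-1", "First sentence of document one ."),
        ("doc-1", "Second sentence of document one ."),
        ("doc-2", "Only sentence of document two ."),
    ]
    print(to_opennlp_documents(sample), end="")
```

If the rows are not already grouped, sorting them by document id first
(while preserving sentence order within a document) would be needed.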

4) Does anyone have some basic troubleshooting advice for trying to
understand why specific texts are not being successfully extracted?

In a nutshell: use the trained model to annotate the original corpus
to add some missing annotations and retrain a second model on the
resulting corpus.
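
A minimal Python sketch of that augmentation step follows. Everything in
it is illustrative: `find_names` is a hypothetical callable standing in
for the first trained model (it is not an OpenNLP API), spans are
(start, end, type) token offsets with an exclusive end, and the policy of
always keeping the existing link-derived annotations is an assumption.
The output uses the `<START:type> ... <END>` markup of the OpenNLP name
finder training format.

```python
def overlaps(a, b):
    """True if two (start, end, type) spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def merge_spans(existing, predicted):
    """Keep every existing span; add predicted spans that do not overlap."""
    merged = list(existing)
    for span in predicted:
        if not any(overlaps(span, kept) for kept in merged):
            merged.append(span)
    return sorted(merged)


def render(tokens, spans):
    """Render one tokenized sentence with <START:type> ... <END> markup."""
    starts = {start: name_type for start, _end, name_type in spans}
    ends = {end for _start, end, _type in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in starts:
            out.append("<START:%s>" % starts[i])
        out.append(token)
        if i + 1 in ends:
            out.append("<END>")
    return " ".join(out)


def augment(corpus, find_names):
    """Add model predictions to each (tokens, spans) pair and render it.

    `find_names(tokens)` is a hypothetical stand-in for the trained model
    and must return a list of (start, end, type) spans.
    """
    return [
        render(tokens, merge_spans(spans, find_names(tokens)))
        for tokens, spans in corpus
    ]
```

The rendered sentences would then serve as the training corpus for the
second model.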

I would try to set the previous map features manually, based on the
available Wikipedia links, because that will increase the recall.
Maybe we should run a small test where the name finder is trained on
about 1000 sentences and then run once over your extracted data. For
English we could also use the model we distribute on our website.

Jörn
