On 10/01/2013 05:36 PM, Ryan Josal wrote:
That is what I'm doing. I've set up semaphore pools for all my
TokenNameFinders. I wonder what technical concessions one would have to make
to make a TokenNameFinder thread-safe. What would happen to the adaptive
data? On the topic of models, the SourceForge ones have certainly been useful;
I'm mainly using the NER models, but more models, or models trained on
more recent data, would indeed be nice. But I know training data, even without
annotations, doesn't come out of thin air; otherwise I'd have created a few
models myself.
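
For readers unfamiliar with the semaphore-pool approach mentioned above, here is a minimal sketch of how such a pool might look. It is not taken from OpenNLP or from the original poster's code: the class name SemaphorePool and the method withInstance are hypothetical, and the pooled type is left generic so the example compiles without OpenNLP on the classpath. With real name finders, one would presumably pool TokenNameFinder instances and decide when to call clearAdaptiveData() (for example after each document) inside the borrowed section.

```java
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

/**
 * Hypothetical helper (not part of OpenNLP): lends out instances of a
 * non-thread-safe type so that each instance is used by one thread at a time.
 */
final class SemaphorePool<T> {
    private final ConcurrentLinkedQueue<T> idle = new ConcurrentLinkedQueue<>();
    private final Semaphore permits;

    SemaphorePool(List<T> instances) {
        idle.addAll(instances);
        // One permit per pooled instance.
        permits = new Semaphore(instances.size());
    }

    /** Borrows an instance, runs the work on it, and always returns it to the pool. */
    <R> R withInstance(Function<T, R> work) {
        permits.acquireUninterruptibly();
        // Holding a permit guarantees that at least one idle instance is queued.
        T instance = idle.poll();
        try {
            return work.apply(instance);
        } finally {
            // Return the instance before releasing the permit so the
            // guarantee above keeps holding for the next borrower.
            idle.add(instance);
            permits.release();
        }
    }
}
```

Callers block while all instances are in use, so the pool size bounds the number of concurrent find operations. Any per-document state kept by the pooled object stays confined to whichever thread currently holds it.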
If there is interest and there are contributors, it would be possible to
label Wikinews data (we worked a bit on that), but surely there are more
sources of documents that could be obtained under an Apache-compatible license.
Anyway, I guess the process for creating training data as part of
OpenNLP would be roughly as follows:
- Obtain some raw text
- Write an annotation guide (maybe based on some existing ones)
- Agree on an annotation tool to use (e.g. brat)
- Annotate a few hundred documents
- Make the first release of the corpus
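
To make the annotation steps above a bit more concrete, here is a rough sketch of turning brat standoff annotations into the <START:type> ... <END> markup that the OpenNLP name finder trainer reads. Everything in it is an assumption made for illustration: the class BratToNameSample and its methods are hypothetical, the text is assumed to be a single whitespace-tokenized sentence, entity spans are assumed to align with token boundaries and not to overlap, and lowercasing the type name is just a convention picked here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical converter, not part of OpenNLP or brat. */
final class BratToNameSample {

    /** One continuous text-bound annotation from a brat .ann file. */
    static final class Entity {
        final String type;
        final int start;
        final int end;

        Entity(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Parses lines such as "T1<TAB>Person 0 13<TAB>Pierre Vinken". */
    static List<Entity> parseAnn(String ann) {
        List<Entity> entities = new ArrayList<>();
        for (String line : ann.split("\\R")) {
            if (!line.startsWith("T")) continue;   // only text-bound annotations
            String[] cols = line.split("\t");
            if (cols.length < 2) continue;
            String[] spec = cols[1].split(" ");
            if (spec.length != 3) continue;        // skip discontinuous spans
            entities.add(new Entity(spec[0],
                    Integer.parseInt(spec[1]), Integer.parseInt(spec[2])));
        }
        return entities;
    }

    /** Wraps each entity span in START/END markers. */
    static String mark(String text, List<Entity> entities) {
        List<Entity> sorted = new ArrayList<>(entities);
        // Process spans right to left so earlier offsets stay valid.
        sorted.sort((a, b) -> Integer.compare(b.start, a.start));
        StringBuilder out = new StringBuilder(text);
        for (Entity e : sorted) {
            out.insert(e.end, " <END>");
            out.insert(e.start, "<START:" + e.type.toLowerCase() + "> ");
        }
        return out.toString();
    }
}
```

For the text "Pierre Vinken , 61 years old" and the annotation line "T1<TAB>Person 0 13<TAB>Pierre Vinken", mark returns "<START:person> Pierre Vinken <END> , 61 years old". A real pipeline would also need sentence splitting and tokenization that match the annotation guide.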
Jörn