On 10/01/2013 05:36 PM, Ryan Josal wrote:
That is what I'm doing.  I've set up semaphore pools for all my 
TokenNameFinders.  I wonder whether there are any technical concessions one 
would have to make in order to make a TokenNameFinder thread safe.  What would 
happen to the adaptive data?  On the topic of models, the sourceforge ones have 
certainly been useful; I'm mainly using the NER models, but indeed more models, 
or models trained on more recent data, would be nice.  But I know training 
data, even without annotations, doesn't come out of thin air, otherwise I'd 
have created a few models myself.
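
For readers following the thread, here is a minimal sketch of what such a semaphore pool could look like. It is an assumption about the general technique, not Ryan's actual code: the class name SemaphorePool and the method withInstance are hypothetical, and the pool is generic so that it does not depend on OpenNLP itself. Each pooled object is lent to at most one thread at a time, which is the property a non-thread-safe TokenNameFinder needs.

```java
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

/**
 * Hypothetical helper (not part of OpenNLP): lends out non-thread-safe
 * objects so that each one is used by at most one thread at a time.
 */
final class SemaphorePool<T> {
    private final ConcurrentLinkedQueue<T> idle = new ConcurrentLinkedQueue<>();
    private final Semaphore permits;

    SemaphorePool(List<T> instances) {
        idle.addAll(instances);
        // One permit per pooled instance; "true" requests fair (FIFO) ordering.
        permits = new Semaphore(instances.size(), true);
    }

    /** Blocks until an instance is free, runs the work on it, then returns it to the pool. */
    <R> R withInstance(Function<T, R> work) throws InterruptedException {
        permits.acquire();
        // Holding a permit guarantees that at least one idle instance exists.
        T instance = idle.poll();
        try {
            return work.apply(instance);
        } finally {
            idle.add(instance);
            permits.release();
        }
    }
}
```

With OpenNLP, T would be TokenNameFinder and the work would be a call such as finder.find(tokens). Note that every pooled finder keeps its own adaptive data, so under this sketch a caller would probably want to call clearAdaptiveData() at document boundaries, or keep one finder for the whole document.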

If there is interest and there are contributors, it would be possible to label Wikinews data (we worked on that a bit), but there are surely other sources
of documents that could be obtained under an Apache-compatible license.

Anyway, I guess the process for creating training data within OpenNLP would be roughly as follows:
- Obtain some raw text
- Write an annotation guide (maybe based on some existing ones)
- Agree on an annotation tool to use (e.g. brat)
- Annotate a few hundred documents
- Make the first release of the corpus
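
To make the target of the annotation step concrete, here is a small sketch that renders one annotated sentence in the name finder training format (whitespace-separated tokens, with names wrapped in <START:type> ... <END>). The class NameSampleFormatter and its Entity type are hypothetical names used only for illustration. The sketch assumes the spans are already expressed as token indices; a tool like brat records character offsets, so a real converter would need an extra alignment step.

```java
import java.util.List;

/** Hypothetical helper: renders one sentence in the name finder training format. */
final class NameSampleFormatter {

    /** A labeled span over token indices, start inclusive and end exclusive. */
    static final class Entity {
        final int start;
        final int end;
        final String type;

        Entity(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    static String format(String[] tokens, List<Entity> entities) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            // Open a tag in front of the first token of each entity.
            for (Entity e : entities) {
                if (e.start == i) {
                    out.append("<START:").append(e.type).append("> ");
                }
            }
            out.append(tokens[i]);
            // Close the tag after the last token of each entity.
            for (Entity e : entities) {
                if (e.end == i + 1) {
                    out.append(" <END>");
                }
            }
            if (i < tokens.length - 1) {
                out.append(' ');
            }
        }
        return out.toString();
    }
}
```

For the tokens "Ryan", "Josal", "uses", "OpenNLP", "." with a person entity covering the first two tokens, this produces a line of the form <START:person> Ryan Josal <END> uses OpenNLP .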

Jörn
