Hi all, 

OpenNLP provides lots of great features for pre-processing and tagging. 
However, one thing I am missing is a component that works at the higher level 
of corpus management and document handling. Suppose, for instance, that raw 
text is sent through several different pre-processing pipelines. It should be 
possible to store the results in some intermediate format for later 
processing, along with the configuration of the pipeline that produced them. 
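
To make the requirement concrete, a minimal sketch of such a store might look 
like the following. It is plain Java with no OpenNLP calls: the class name 
ProcessedDocument, its fields, and the "doc."/"pipeline." key prefixes are all 
hypothetical, the whitespace split only stands in for a real tokenizer, and 
java.util.Properties is just one assumed choice of intermediate format. 

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical container: one document, its tokens, and the configuration
// of the pipeline that produced those tokens.
final class ProcessedDocument {
    final String docId;
    final String rawText;
    final String[] tokens;
    final Properties pipelineConfig;

    ProcessedDocument(String docId, String rawText, String[] tokens,
                      Properties pipelineConfig) {
        this.docId = docId;
        this.rawText = rawText;
        this.tokens = tokens;
        this.pipelineConfig = pipelineConfig;
    }

    // Stand-in for a real pipeline: splits on whitespace only.
    static ProcessedDocument process(String docId, String rawText,
                                     Properties config) {
        String[] tokens = rawText.trim().split("\\s+");
        return new ProcessedDocument(docId, rawText, tokens, config);
    }

    // Writes the document, the tokens, and the pipeline settings into one
    // Properties text, so results and configuration travel together.
    String serialize() {
        Properties out = new Properties();
        out.setProperty("doc.id", docId);
        out.setProperty("doc.text", rawText);
        out.setProperty("doc.tokens", String.join("\t", tokens));
        for (String key : pipelineConfig.stringPropertyNames()) {
            out.setProperty("pipeline." + key, pipelineConfig.getProperty(key));
        }
        StringWriter writer = new StringWriter();
        try {
            out.store(writer, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return writer.toString();
    }

    // Reads back what serialize() wrote.
    static ProcessedDocument deserialize(String stored) {
        Properties in = new Properties();
        try {
            in.load(new StringReader(stored));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Properties config = new Properties();
        for (String key : in.stringPropertyNames()) {
            if (key.startsWith("pipeline.")) {
                config.setProperty(key.substring("pipeline.".length()),
                                   in.getProperty(key));
            }
        }
        String[] tokens = in.getProperty("doc.tokens").split("\t");
        return new ProcessedDocument(in.getProperty("doc.id"),
                                     in.getProperty("doc.text"), tokens, config);
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty("tokenizer", "whitespace");
        ProcessedDocument doc = process("doc-1", "OpenNLP tags raw text", config);
        ProcessedDocument restored = deserialize(doc.serialize());
        System.out.println(restored.tokens.length + " tokens, tokenizer="
                + restored.pipelineConfig.getProperty("tokenizer"));
    }
}
```

A flat key/value file like this keeps a prototype simple, but it would not 
scale to layered annotations such as POS tags or parses, which is where a 
dedicated corpus-management tool would come in. 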

Up until now, I have been writing my own code for this for prototyping 
purposes, but surely others have faced the same problem and there are useful 
solutions out there. I have looked into UIMA, but it has a relatively steep 
learning curve. 

What are other people using for corpus management? 

Thanks a lot. 

Cheers, 

Martin
 
