Thanks so much to Richard and Jörn for the very helpful comments. It seems 
I should take another look at UIMA, at least for the purposes of pipeline 
management. The projects Richard mentioned look very interesting and promising. 
Hopefully, I will have the time to look into them more closely. I need to 
carefully weigh the benefits of potential time savings and knowledge gain 
against the effort required to learn these systems and libraries. 

Cheers, 

Martin
 

> On 06.05.2015, at 16:00, Joern Kottmann <[email protected]> wrote:
> 
> Hello,
> 
> In the projects I have been involved in, we usually wrote some custom code
> to store exactly the information from the NLP processing that is relevant.
> I believe that is a good strategy if you have to process really huge
> amounts of documents.
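>
> A minimal sketch of what such custom code might look like (the choice of
> tokenizer and the file name are illustrative only, not what we actually
> used):
>
> import java.nio.charset.StandardCharsets;
> import java.nio.file.Files;
> import java.nio.file.Paths;
>
> import opennlp.tools.tokenize.SimpleTokenizer;
>
> public class RelevantOutputOnly {
>     public static void main(String[] args) throws Exception {
>         String text = "OpenNLP provides lots of great features.";
>
>         // Run only the processing step we need (here: tokenization).
>         String[] tokens = SimpleTokenizer.INSTANCE.tokenize(text);
>
>         // Persist only the result we care about, one token per line,
>         // instead of a full annotation structure.
>         Files.write(Paths.get("doc-0001.tokens"),
>                 String.join("\n", tokens).getBytes(StandardCharsets.UTF_8));
>     }
> }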
> 
> If you do things on a smaller scale and use UIMA, I can recommend just
> storing the CAS on disk or in a database.
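>
> For example, a minimal round trip via the XMI serialization that ships
> with UIMA (an empty type system and a made-up file name, just to keep the
> sketch self-contained; a real pipeline would pass its own type system
> description):
>
> import java.io.FileInputStream;
> import java.io.FileOutputStream;
> import java.io.OutputStream;
>
> import org.apache.uima.UIMAFramework;
> import org.apache.uima.cas.CAS;
> import org.apache.uima.cas.impl.XmiCasDeserializer;
> import org.apache.uima.cas.impl.XmiCasSerializer;
> import org.apache.uima.resource.metadata.TypeSystemDescription;
> import org.apache.uima.util.CasCreationUtils;
>
> public class CasRoundTrip {
>     public static void main(String[] args) throws Exception {
>         // Create a CAS; a real pipeline would use its own type system.
>         TypeSystemDescription tsd = UIMAFramework
>                 .getResourceSpecifierFactory().createTypeSystemDescription();
>         CAS cas = CasCreationUtils.createCas(tsd, null, null);
>         cas.setDocumentText("OpenNLP provides lots of great features.");
>
>         // Store the CAS on disk as XMI.
>         try (OutputStream out = new FileOutputStream("doc-0001.xmi")) {
>             XmiCasSerializer.serialize(cas, out);
>         }
>
>         // Reload it later into a CAS built from the same type system.
>         cas.reset();
>         try (FileInputStream in = new FileInputStream("doc-0001.xmi")) {
>             XmiCasDeserializer.deserialize(in, cas);
>         }
>         System.out.println(cas.getDocumentText());
>     }
> }
>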
> Another project worth mentioning is Brat, in case you are interested in
> producing a corpus for training NLP components.
> 
> HTH,
> Jörn
> 
> 
> 
> 
> On Sun, May 3, 2015 at 2:57 PM, Martin Wunderlich <[email protected]> wrote:
> 
>> Hi all,
>> 
>> OpenNLP provides lots of great features for pre-processing and tagging.
>> However, one thing I am missing is a component that works at the higher
>> level of corpus management and document handling. Imagine, for instance,
>> that you have raw text which is sent through different pre-processing
>> pipelines. It should be possible to store the results in some intermediate
>> format for future processing, along with the configuration of the
>> pre-processing pipelines.
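>>
>> As a concrete example of the kind of thing I mean (the names and the
>> format are just a sketch, not a fixed design), the processed text could
>> be written next to a small sidecar file recording the pipeline
>> configuration:
>>
>> import java.io.FileOutputStream;
>> import java.io.OutputStream;
>> import java.nio.charset.StandardCharsets;
>> import java.nio.file.Files;
>> import java.nio.file.Paths;
>> import java.util.Properties;
>>
>> public class PipelineResultStore {
>>     public static void main(String[] args) throws Exception {
>>         String docId = "doc-0001";                     // hypothetical id
>>         String processedText = "token1 token2 token3"; // placeholder output
>>
>>         // Store the intermediate result itself.
>>         Files.write(Paths.get(docId + ".txt"),
>>                 processedText.getBytes(StandardCharsets.UTF_8));
>>
>>         // Store the pipeline configuration that produced it, so the
>>         // result can be reproduced or fed into further processing.
>>         Properties config = new Properties();
>>         config.setProperty("pipeline.steps", "sentence-detect,tokenize,pos-tag");
>>         config.setProperty("tokenizer.model", "en-token.bin");
>>         try (OutputStream out =
>>                 new FileOutputStream(docId + ".pipeline.properties")) {
>>             config.store(out, "pre-processing configuration for " + docId);
>>         }
>>     }
>> }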
>> 
>> Up until now, I have been writing my own code for this for prototyping
>> purposes, but surely others have faced the same problem, and there must
>> be useful solutions out there. I have looked into UIMA, but it has a
>> relatively steep learning curve.
>> 
>> What are other people using for corpus management?
>> 
>> Thanks a lot.
>> 
>> Cheers,
>> 
>> Martin
>> 
