Hi all,

OpenNLP provides lots of great features for pre-processing and tagging. However, one thing I am missing is a component that works at the higher level of corpus management and document handling. Imagine, for instance, that you have raw text that is sent through different pre-processing pipelines. It should be possible to store the results in some intermediate format for future processing, along with the configuration of the pipelines that produced them.
Up until now, I have been writing my own code for this for prototyping purposes, but surely others have faced the same problem and there are useful solutions out there. I have looked into UIMA, but it has a relatively steep learning curve. What are other people using for corpus management?

Thanks a lot.

Cheers,
Martin
