Hello, in the projects I have been involved in, we usually wrote some custom code to store exactly the information from the NLP processing that is relevant. I believe that is a good strategy if you have to process very large numbers of documents.
If you do things on a smaller scale and use UIMA, I can recommend simply storing the CAS on disk or in a database. Another project worth mentioning is Brat, in case you are interested in producing a corpus for training NLP components.

HTH,
Jörn

On Sun, May 3, 2015 at 2:57 PM, Martin Wunderlich <[email protected]> wrote:

> Hi all,
>
> OpenNLP provides lots of great features for pre-processing and tagging.
> However, one thing I am missing is a component that works on the higher
> level of corpus management and document handling. Imagine, for instance,
> if you have raw text that is sent through different pre-processing
> pipelines. It should be possible to store the results in some
> intermediate format for future processing, along with the configuration
> of the pre-processing pipelines.
>
> Up until now, I have been writing my own code for this for prototyping
> purposes, but surely others have faced the same problem and there are
> useful solutions out there. I have looked into UIMA, but it has a
> relatively steep learning curve.
>
> What are other people using for corpus management?
>
> Thanks a lot.
>
> Cheers,
>
> Martin
