Hello,

in the projects I have been involved in, we usually wrote some custom code
to store exactly the information from the NLP processing that is relevant.
I believe that is a good strategy if you have to process really large
volumes of documents.
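As a rough illustration of what such custom code can look like (the function
names and the JSON layout here are just my own sketch, not from any particular
project), one simple approach is to store each document's annotations together
with the configuration of the pipeline that produced them:

```python
import json
from pathlib import Path


def save_result(out_dir, doc_id, tokens, pipeline_config):
    """Store a document's annotations plus the pipeline config that produced them.

    tokens: list of [token, tag] pairs; pipeline_config: dict of settings.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "doc_id": doc_id,
        "tokens": tokens,
        "pipeline": pipeline_config,  # kept alongside the results for reproducibility
    }
    path = out_dir / f"{doc_id}.json"
    path.write_text(json.dumps(record, indent=2))
    return path


def load_result(path):
    """Read a stored record back for further processing."""
    return json.loads(Path(path).read_text())
```

The point is just that the stored record carries its own provenance, so you
can later tell which pre-processing settings produced which intermediate
results.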

If you do things on a smaller scale and use UIMA, I can recommend just
storing the CAS on disk or in a database.
Another project worth mentioning is Brat, in case you are interested in
producing a corpus for training NLP components.

HTH,
Jörn




On Sun, May 3, 2015 at 2:57 PM, Martin Wunderlich <[email protected]> wrote:

> Hi all,
>
> OpenNLP provides lots of great features for pre-processing and tagging.
> However, one thing I am missing is a component that works on the higher
> level of corpus management and document handling. Imagine, for instance, if
> you have raw text that is sent through different pre-processing pipelines.
> It should be possible to store the results in some intermediate format for
> future processing, along with the configuration of the pre-processing
> pipelines.
>
> Up until now, I have been writing my own code for this for prototyping
> purposes, but surely others have faced the same problem and there are
> useful solutions out there. I have looked into UIMA, but it has a
> relatively steep learning curve.
>
> What are other people using for corpus management?
>
> Thanks a lot.
>
> Cheers,
>
> Martin
>
