Hi Martin,

I'm not sure this is quite what you're after, but I use a Lucene index as a
corpus database. Because Lucene provides powerful Analyzers, we can use them
to normalize text (A -> a, ß -> ss, 廣 -> 広, ...). Once the text is
normalized, the normalized words are recorded in the Lucene index. Since
Lucene also provides an API to access the index (the word database), we can
get basic statistics such as word counts, n-gram counts, etc.
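
To make that concrete, here is a minimal sketch (assuming a recent Lucene
5.x release; the class name, the field name "body", and the sample text are
just placeholders) that runs raw text through an Analyzer and prints the
normalized tokens. Note that StandardAnalyzer only lower-cases; the heavier
foldings above (ß -> ss, 廣 -> 広) would need extra filters, e.g. from the
ICU or Kuromoji analysis modules:

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class NormalizeDemo {
      public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = new StandardAnalyzer()) {
          // Run the raw text through the analysis chain.
          TokenStream ts = analyzer.tokenStream("body", "Heiße Suppe");
          CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
          ts.reset();
          while (ts.incrementToken()) {
            // Prints each normalized (here: lower-cased) token.
            System.out.println(term.toString());
          }
          ts.end();
          ts.close();
        }
      }
    }

Once documents are indexed, the index-access API hands back the corpus
statistics directly. Another sketch, with a hypothetical index path, field,
and term:

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    public class TermStatsDemo {
      public static void main(String[] args) throws Exception {
        // Open an existing index (the path is a placeholder).
        try (IndexReader reader = DirectoryReader.open(
            FSDirectory.open(Paths.get("/path/to/index")))) {
          Term term = new Term("body", "suppe");
          // docFreq       = number of documents containing the term,
          // totalTermFreq = total occurrences across the whole corpus.
          System.out.println("docFreq       = " + reader.docFreq(term));
          System.out.println("totalTermFreq = " + reader.totalTermFreq(term));
        }
      }
    }

N-gram counts can be collected the same way if you put a ShingleFilter into
the analysis chain before indexing.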

I'm now working on the NLP4L (NLP for Lucene) project [1]. It has connectors
for Mahout and Spark at the moment, but I think it could have one for
OpenNLP, too. If you have any ideas on how our project could work with
OpenNLP, please let us know.

Thanks,

Koji

[1] https://github.com/NLP4L/nlp4l


On 2015/05/03 21:57, Martin Wunderlich wrote:
Hi all,

OpenNLP provides lots of great features for pre-processing and tagging. 
However, one thing I am missing is a component that works at the higher level 
of corpus management and document handling. Imagine, for instance, if you have 
raw text that is sent through different pre-processing pipelines. It should be 
possible to store the results in some intermediate format for future 
processing, along with the configuration of the pre-processing pipelines.

Up until now, I have been writing my own code for this for prototyping 
purposes, but surely others have faced the same problem and there are useful 
solutions out there. I have looked into UIMA, but it has a relatively steep 
learning curve.

What are other people using for corpus management?

Thanks a lot.

Cheers,

Martin