Hi Koji,

Thanks a lot for the reply. I do have some experience with Lucene, in particular in a Solr environment. However, I would not have thought of it as a system for document/corpus management. I can see some advantages in using Lucene/Solr for this and mapping the concept of a "corpus" onto the index, perhaps simply via a metadata field. Then again, a file-based solution might be more flexible, since it doesn't depend as much on the index structure.
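Just to make the file-based idea concrete, here is a minimal sketch (plain JDK, no Lucene): each corpus is a directory, each document a .txt payload plus a .properties sidecar recording the pre-processing pipeline configuration that produced it. All names (FileCorpusStore, the directory layout, the config keys) are illustrative, not an existing library.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical file-based corpus store: one directory per corpus, and per
// document a .txt payload plus a .properties sidecar with pipeline config.
public class FileCorpusStore {
    private final Path root;

    public FileCorpusStore(Path root) {
        this.root = root;
    }

    // Convenience factory for experiments: store rooted in a temp directory.
    public static FileCorpusStore inTemp() {
        try {
            return new FileCorpusStore(Files.createTempDirectory("corpus"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Store the document text and the pipeline config side by side.
    public void save(String corpus, String docId, String text, Properties pipelineConfig) {
        try {
            Path dir = Files.createDirectories(root.resolve(corpus));
            Files.writeString(dir.resolve(docId + ".txt"), text);
            try (var out = Files.newOutputStream(dir.resolve(docId + ".properties"))) {
                pipelineConfig.store(out, "pre-processing pipeline config");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public String load(String corpus, String docId) {
        try {
            return Files.readString(root.resolve(corpus).resolve(docId + ".txt"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public Properties config(String corpus, String docId) {
        Properties p = new Properties();
        try (var in = Files.newInputStream(root.resolve(corpus).resolve(docId + ".properties"))) {
            p.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }
}
```

The sidecar approach keeps the provenance of each intermediate result next to the data, so a document can later be reprocessed with the exact same (or a deliberately different) pipeline configuration.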
Cheers,

Martin

> On 04.05.2015 at 02:32, Koji Sekiguchi <[email protected]> wrote:
>
> Hi Martin,
>
> I'm not sure this is the right one, but I use a Lucene index as a corpus
> database. Because Lucene provides powerful Analyzers, we can use them to
> normalize text (A -> a, ß -> ss, 廣 -> 広, ...). Once you normalize text,
> those normalized words are recorded in the Lucene index. Since Lucene also
> provides an API to access the index (word database), we can get basic stats
> such as word counts, N-gram counts, etc.
>
> I'm now working on the NLP4L (NLP for Lucene) project [1]. It has connectors
> for Mahout and Spark now, but I think it could have one for OpenNLP, too.
> If you have any ideas on our project for OpenNLP, please let us know.
>
> Thanks,
>
> Koji
>
> [1] https://github.com/NLP4L/nlp4l
>
> On 2015/05/03 21:57, Martin Wunderlich wrote:
>> Hi all,
>>
>> OpenNLP provides lots of great features for pre-processing and tagging.
>> However, one thing I am missing is a component that works at the higher
>> level of corpus management and document handling. Imagine, for instance,
>> that you have raw text that is sent through different pre-processing
>> pipelines. It should be possible to store the results in some intermediate
>> format for future processing, along with the configuration of the
>> pre-processing pipelines.
>>
>> Up until now, I have been writing my own code for this for prototyping
>> purposes, but surely others have faced the same problem and there are
>> useful solutions out there. I have looked into UIMA, but it has a
>> relatively steep learning curve.
>>
>> What are other people using for corpus management?
>>
>> Thanks a lot.
>>
>> Cheers,
>>
>> Martin
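P.S. For the archive: the normalization and counting steps Koji describes above can be approximated without Lucene, using only JDK classes. This is just a rough stand-in for a real Lucene analysis chain (e.g. LowerCaseFilter plus a German normalization filter followed by a shingle filter); the class and method names below are my own, not Lucene API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class CorpusStats {
    // Crude stand-in for an analysis chain: fold the German eszett and
    // lower-case. A real Lucene Analyzer would do this (and much more)
    // via token filters.
    static String normalize(String text) {
        return text.replace("ß", "ss").toLowerCase(Locale.ROOT);
    }

    // Word counts over whitespace-split normalized text.
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tok : normalize(text).split("\\s+")) {
            if (!tok.isEmpty()) counts.merge(tok, 1, Integer::sum);
        }
        return counts;
    }

    // Word-level bigrams, similar in spirit to what a shingle filter emits.
    static List<String> bigrams(String text) {
        String[] toks = normalize(text).split("\\s+");
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 1 < toks.length; i++) {
            grams.add(toks[i] + " " + toks[i + 1]);
        }
        return grams;
    }
}
```

With Lucene itself, the same statistics come for free from the index: the terms dictionary gives per-term document and collection frequencies once the analyzed tokens are indexed.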
