Hi Koji, 

Thanks a lot for the reply. I do have some experience with Lucene, in 
particular in a Solr environment. However, I would not have thought of it as a 
system for document/corpus management. I guess there would be some advantages 
to using Lucene/Solr for this and mapping the concept of a "corpus" onto the 
index, maybe simply via a meta-data field. Then again, a file-based solution 
might be more flexible, since it doesn't depend as much on the index structure. 
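For concreteness, here is a minimal sketch of what I have in mind for the 
file-based variant: each processed document stored as a small JSON record that 
carries a "corpus" field plus the pipeline configuration that produced it. The 
function names and the JSON layout are made up, of course, just to illustrate 
the idea:

```python
import json
import tempfile
from pathlib import Path

def save_processed(doc_id, text, corpus, pipeline_config, out_dir):
    """Store pre-processed text together with its corpus label and the
    configuration of the pipeline that produced it."""
    record = {
        "id": doc_id,
        "corpus": corpus,              # the "corpus" meta-data field
        "pipeline": pipeline_config,   # kept so the run is reproducible later
        "text": text,
    }
    path = Path(out_dir) / (doc_id + ".json")
    path.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")
    return path

def load_corpus(corpus, out_dir):
    """Yield all records belonging to one corpus: the file-based
    equivalent of filtering an index on a meta-data field."""
    for path in sorted(Path(out_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        if record["corpus"] == corpus:
            yield record

# Tiny demonstration with two documents in two different corpora.
out_dir = tempfile.mkdtemp()
save_processed("d1", "ein haus", "de-news", {"lowercase": True}, out_dir)
save_processed("d2", "a house", "en-news", {"lowercase": True}, out_dir)
de_docs = list(load_corpus("de-news", out_dir))
```

The nice part is that the pipeline configuration travels with the document, so 
one can always tell later which pre-processing produced a given intermediate 
result.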

Cheers, 

Martin
  


> On 04.05.2015, at 02:32, Koji Sekiguchi <[email protected]> wrote:
> 
> Hi Martin,
> 
> I'm not sure if this is the right answer, but I use a Lucene index as a corpus
> database. Because Lucene provides powerful Analyzers, we can use them to
> normalize text (A -> a, ß -> ss, 廣 -> 広, ...). Once the text is normalized,
> the normalized words are recorded in the Lucene index. Since Lucene also
> provides an API to access the index (i.e. the word database), we can get basic
> stats such as word counts, N-gram counts, etc.
> 
> I'm now working on the NLP4L (NLP for Lucene) project [1]. It currently has
> connectors for Mahout and Spark, but I think it could have one for OpenNLP, too.
> If you have any ideas about how our project could support OpenNLP, please let
> us know.
> 
> Thanks,
> 
> Koji
> 
> [1] https://github.com/NLP4L/nlp4l
> 
> 
> On 2015/05/03 21:57, Martin Wunderlich wrote:
>> Hi all,
>> 
>> OpenNLP provides lots of great features for pre-processing and tagging. 
>> However, one thing I am missing is a component that works at the higher 
>> level of corpus management and document handling. Imagine, for instance, 
>> that you have raw text that is sent through different pre-processing 
>> pipelines. It should be possible to store the results in some intermediate 
>> format for future processing, along with the configuration of the 
>> pre-processing pipelines.
>> 
>> Up until now, I have been writing my own code for this for prototyping 
>> purposes, but surely others have faced the same problem and there are useful 
>> solutions out there. I have looked into UIMA, but it has a relatively steep 
>> learning curve.
>> 
>> What are other people using for corpus management?
>> 
>> Thanks a lot.
>> 
>> Cheers,
>> 
>> Martin
>> 
>> 
> 
> 
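
P.S. For anyone following along, the normalization-plus-counts idea Koji 
describes can be sketched in a few lines of plain Python. This is a toy 
stand-in for a Lucene Analyzer chain, not actual Lucene API:

```python
from collections import Counter

def normalize(text):
    # Toy stand-in for an Analyzer chain: lower-casing plus ß -> ss folding.
    return text.lower().replace("ß", "ss")

def ngrams(tokens, n):
    # All contiguous n-grams of the token list, joined with spaces.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = normalize("Die Straße und die STRASSE").split()
word_counts = Counter(tokens)                # "straße" and "STRASSE" now match
bigram_counts = Counter(ngrams(tokens, 2))
```

The point being: once everything is normalized the same way, word and N-gram 
counts fall out of a simple frequency table, which is essentially what the 
index gives you for free.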
