Hi Martin,

in the past, I tried using XML files and an XML database (eXist) for corpus management [1]. Even back then, when I found out about UIMA, I immediately switched to it for pipeline management, while still ingesting XML data from the XML DB or from files and writing results back there.
Nowadays, I am very opportunistic when it comes to storing corpora. Typically, I simply use folders in the file system. For pipelines, I still use UIMA. If intermediate results are annotated documents, I store them in a format that can capture the full expressiveness of the UIMA CAS, e.g. one of the UIMA binary formats or XMI (there are two little sketches of this at the very end of this mail). If the results are aggregated data such as frequency counts, extracted features, etc., I use some ad-hoc format, e.g. a tab-separated one. If it is necessary to search over the data, I additionally create an index from the data in the CASes (e.g. using CQP [2]) or store the information in a relational database. Most of the analysis and file-conversion components I create or work with end up in the DKPro Core [3] project.

To manage complex workflows that produce re-usable intermediate results, e.g. in a parameter-sweeping setup, I cooked up a little framework [4] (it works with or without UIMA). Others I know or have heard of cooked up frameworks of their own, implemented custom UIMA flow controllers to support non-linear workflows, run their pipelines on Hadoop, or simply write their workflows in Java, mixing UIMA and non-UIMA stuff via uimaFIT [5]... or they use some other language / workflow engine they fancy ;)

So, to sum it up: after I started using UIMA, I never looked back. We tried to lower the learning curve with uimaFIT [5]. However, UIMA is not a cure-all. If you are into non-linear workflows or parallel algorithms (as opposed to simple scale-out), other frameworks might suit you better, e.g. Spark, Scala, etc. Yet you'll find that people have also built bridges between these and UIMA, because sometimes you just want to run a linear annotation pipeline, and re-using components from one of the available UIMA component collections can be very convenient.

Cheers,

-- Richard

P.S.: I'm working on most of the projects mentioned above, so don't let me fool you ;)

P.P.S.: Most of the projects mentioned are not Apache projects.

[1] http://annolab.org
[2] http://cwb.sourceforge.net
[3] https://code.google.com/p/dkpro-core-asl/
[4] https://code.google.com/p/dkpro-lab/
[5] http://uima.apache.org/uimafit.html

P.P.P.S.: Stuff will migrate from Google Code to GitHub soon...

On 03.05.2015, at 14:57, Martin Wunderlich <[email protected]> wrote:

> Hi all,
>
> OpenNLP provides lots of great features for pre-processing and tagging.
> However, one thing I am missing is a component that works on the higher
> level of corpus management and document handling. Imagine, for instance,
> if you have raw text that is sent through different pre-processing
> pipelines. It should be possible to store the results in some
> intermediate format for future processing, along with the configuration
> of the pre-processing pipelines.
>
> Up until now, I have been writing my own code for this for prototyping
> purposes, but surely others have faced the same problem and there are
> useful solutions out there. I have looked into UIMA, but it has a
> relatively steep learning curve.
>
> What are other people using for corpus management?
>
> Thanks a lot.
>
> Cheers,
>
> Martin
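P.P.P.P.S.: Since the learning curve came up, here is roughly what the "store intermediate results as XMI" part looks like as a uimaFIT pipeline. This is only a sketch: it assumes uimaFIT 2.x and DKPro Core 1.x on the classpath, the folder names are made up, and component/parameter names may differ slightly between versions.

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;

import org.apache.uima.fit.pipeline.SimplePipeline;

import de.tudarmstadt.ukp.dkpro.core.io.text.TextReader;
import de.tudarmstadt.ukp.dkpro.core.io.xmi.XmiWriter;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

public class PreprocessToXmi {
    public static void main(String[] args) throws Exception {
        SimplePipeline.runPipeline(
            // Read raw text files from a plain folder in the file system
            // ("corpus/raw" is just an example location)
            createReaderDescription(TextReader.class,
                TextReader.PARAM_SOURCE_LOCATION, "corpus/raw",
                TextReader.PARAM_PATTERNS, "*.txt",
                TextReader.PARAM_LANGUAGE, "en"),
            // Tokenize/sentence-split and POS-tag via the OpenNLP wrappers
            createEngineDescription(OpenNlpSegmenter.class),
            createEngineDescription(OpenNlpPosTagger.class),
            // Persist the full CAS (text plus all annotations) as XMI
            createEngineDescription(XmiWriter.class,
                XmiWriter.PARAM_TARGET_LOCATION, "corpus/xmi"));
    }
}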
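And the flip side: reading the stored XMI back and writing token frequency counts into an ad-hoc tab-separated file (same assumptions as above). The nice thing is that the expensive pre-processing runs only once, while cheap aggregation jobs like this one can be re-run against the stored CASes as often as you like.

import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;

import java.io.PrintWriter;
import java.util.Map;
import java.util.TreeMap;

import org.apache.uima.fit.pipeline.JCasIterable;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token;
import de.tudarmstadt.ukp.dkpro.core.io.xmi.XmiReader;

public class TokenFrequencies {
    public static void main(String[] args) throws Exception {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        // Iterate over the stored CASes one at a time instead of
        // loading the whole corpus into memory
        for (JCas jcas : new JCasIterable(
                createReaderDescription(XmiReader.class,
                    XmiReader.PARAM_SOURCE_LOCATION, "corpus/xmi",
                    XmiReader.PARAM_PATTERNS, "*.xmi"))) {
            for (Token token : JCasUtil.select(jcas, Token.class)) {
                String text = token.getCoveredText();
                Integer n = counts.get(text);
                counts.put(text, n == null ? 1 : n + 1);
            }
        }
        // Dump the aggregated counts as a simple tab-separated file
        PrintWriter out = new PrintWriter("corpus/counts.tsv", "UTF-8");
        try {
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                out.println(e.getKey() + "\t" + e.getValue());
            }
        }
        finally {
            out.close();
        }
    }
}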
