Hi Martin,

in the past, I tried using XML files and an XML database (eXist) for corpus management [1]. Even back then, when I found out about UIMA, I immediately switched to it for pipeline management, while still ingesting XML data from the XML DB or from files and writing results back there.
Nowadays, I am very opportunistic when it comes to storing corpora. Typically, I simply use folders in the file system. For pipelines, I still use UIMA. If intermediate results are annotated documents, I store them in a format that can capture the full expressiveness of the UIMA CAS, e.g. one of the UIMA binary formats or XMI (there are two little sketches of this at the very end of this mail). If the results are aggregated data such as frequency counts, extracted features, etc., I use some ad-hoc format, e.g. a tab-separated one. If it is necessary to search over the data, I additionally create an index from the data in the CASes (e.g. using CQP [2]) or store the information in a relational database. Most of the analysis and file-conversion components I create or work with end up in the DKPro Core [3] project.

To manage complex workflows that produce re-usable intermediate results, e.g. in a parameter-sweeping setup, I cooked up a little framework [4] (it works with or without UIMA). Others I know or have heard of cooked up frameworks of their own, implemented custom UIMA flow controllers to support non-linear workflows, run their pipelines on Hadoop, or simply write their workflows in Java, mixing UIMA and non-UIMA stuff via uimaFIT [5]... or they use some other language / workflow engine they fancy ;)

So, to sum it up: after I started using UIMA, I never looked back. We tried to lower the learning curve with uimaFIT [5]. However, UIMA is not a cure-all. If you are into non-linear workflows or parallel algorithms (as opposed to simple scale-out), other frameworks might suit you better, e.g. Spark, Scala, etc. Yet you'll find that people have also built bridges between these and UIMA, because sometimes you just want to run a linear annotation pipeline, and re-using components from one of the available UIMA component collections can be very convenient.

Cheers,

-- Richard

P.S.: I'm working on most of the projects mentioned above, so don't let me fool you ;)

P.P.S.: Most of the projects mentioned are not Apache projects.

[1] http://annolab.org
[2] http://cwb.sourceforge.net
[3] https://code.google.com/p/dkpro-core-asl/
[4] https://code.google.com/p/dkpro-lab/
[5] http://uima.apache.org/uimafit.html

P.P.P.S.: Stuff will migrate from Google Code to GitHub soon...

On 03.05.2015, at 14:57, Martin Wunderlich <[email protected]> wrote:

> Hi all,
>
> OpenNLP provides lots of great features for pre-processing and tagging.
> However, one thing I am missing is a component that works on the higher
> level of corpus management and document handling. Imagine, for instance,
> if you have raw text that is sent through different pre-processing
> pipelines. It should be possible to store the results in some
> intermediate format for future processing, along with the configuration
> of the pre-processing pipelines.
>
> Up until now, I have been writing my own code for this for prototyping
> purposes, but surely others have faced the same problem and there are
> useful solutions out there. I have looked into UIMA, but it has a
> relatively steep learning curve.
>
> What are other people using for corpus management?
>
> Thanks a lot.
>
> Cheers,
>
> Martin
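P.P.P.P.S.: Since the learning curve came up, here is roughly what the "store intermediate results as XMI" part looks like as a uimaFIT pipeline. This is only a sketch: it assumes uimaFIT 2.x and DKPro Core 1.x on the classpath, the folder names are made up, and component/parameter names may differ slightly between versions.

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;

import org.apache.uima.fit.pipeline.SimplePipeline;

import de.tudarmstadt.ukp.dkpro.core.io.text.TextReader;
import de.tudarmstadt.ukp.dkpro.core.io.xmi.XmiWriter;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

public class PreprocessToXmi {
    public static void main(String[] args) throws Exception {
        SimplePipeline.runPipeline(
            // Read raw text files from a plain folder in the file system
            // ("corpus/raw" is just an example location)
            createReaderDescription(TextReader.class,
                TextReader.PARAM_SOURCE_LOCATION, "corpus/raw",
                TextReader.PARAM_PATTERNS, "*.txt",
                TextReader.PARAM_LANGUAGE, "en"),
            // Tokenize/sentence-split and POS-tag via the OpenNLP wrappers
            createEngineDescription(OpenNlpSegmenter.class),
            createEngineDescription(OpenNlpPosTagger.class),
            // Persist the full CAS (text plus all annotations) as XMI
            createEngineDescription(XmiWriter.class,
                XmiWriter.PARAM_TARGET_LOCATION, "corpus/xmi"));
    }
}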
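And the flip side: reading the stored XMI back and writing token frequency counts into an ad-hoc tab-separated file (same assumptions as above). The nice thing is that the expensive pre-processing runs only once, while cheap aggregation jobs like this one can be re-run against the stored CASes as often as you like.

import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;

import java.io.PrintWriter;
import java.util.Map;
import java.util.TreeMap;

import org.apache.uima.fit.pipeline.JCasIterable;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token;
import de.tudarmstadt.ukp.dkpro.core.io.xmi.XmiReader;

public class TokenFrequencies {
    public static void main(String[] args) throws Exception {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        // Iterate over the stored CASes one at a time instead of
        // loading the whole corpus into memory
        for (JCas jcas : new JCasIterable(
                createReaderDescription(XmiReader.class,
                    XmiReader.PARAM_SOURCE_LOCATION, "corpus/xmi",
                    XmiReader.PARAM_PATTERNS, "*.xmi"))) {
            for (Token token : JCasUtil.select(jcas, Token.class)) {
                String text = token.getCoveredText();
                Integer n = counts.get(text);
                counts.put(text, n == null ? 1 : n + 1);
            }
        }
        // Dump the aggregated counts as a simple tab-separated file
        PrintWriter out = new PrintWriter("corpus/counts.tsv", "UTF-8");
        try {
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                out.println(e.getKey() + "\t" + e.getValue());
            }
        }
        finally {
            out.close();
        }
    }
}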
