Hej Martin,

I agree with Petr. We are in the process of migrating our existing text
analysis components to UIMA, coming from an approach that closely resembles
what you would call just "gluing things together". That works well while you
are still experimenting with rapid prototypes, and I think UIMA could even
get in the way at that stage if you don't already understand it well.
However, once you need to scale the dev team and move to production, these
ad-hoc approaches become a problem. A framework like UIMA gives the whole
team a systematic development approach, and once you have climbed the steep
learning curve I believe it can even be a faster prototyping tool, because
it makes it easy to quickly combine different components into a new
pipeline. An important factor for us was therefore also the diverse
ecosystem of quality analysis components such as DKPro, cTAKES, ClearTK etc.
You can even integrate GATE components and vice versa (see
https://gate.ac.uk/sale/tao/splitch22.html#chap:uima), although I haven't
played with this myself yet.
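
To give a feel for how quickly components combine, here is a minimal uimaFIT
sketch. It assumes DKPro Core's TextReader and OpenNLP wrappers are on the
classpath; the input path is made up:

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;

import org.apache.uima.fit.pipeline.SimplePipeline;

import de.tudarmstadt.ukp.dkpro.core.io.text.TextReader;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

public class PipelineDemo {
    public static void main(String[] args) throws Exception {
        SimplePipeline.runPipeline(
            // read plain-text documents from a directory
            createReaderDescription(TextReader.class,
                TextReader.PARAM_SOURCE_LOCATION, "docs/*.txt",
                TextReader.PARAM_LANGUAGE, "en"),
            // swapping or adding a component means editing one line here
            createEngineDescription(OpenNlpSegmenter.class),
            createEngineDescription(OpenNlpPosTagger.class));
    }
}

The reader and engine descriptions are plain Java objects, so assembling a
new pipeline variant is a one-line change rather than editing a pile of XML.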

We are not using the distributed scale-out features of UIMA but rely on
various AWS services instead; it takes a bit of tinkering to figure out how
to do this, but we are gradually getting there. Generally, we do the
unstructured NLP processing on a document-by-document basis in UIMA, and
then do the corpus-wide structured analysis outside UIMA using
MapReduce-style approaches. That said, we are now also moving towards
stream-based approaches, since we have to ingest large amounts of data
continuously, and running very large MR batch jobs on a daily basis is in
our case wasteful and impractical.
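
The hand-off between the two worlds is just a consumer at the end of the
UIMA pipeline that writes one record per document for the batch or stream
jobs to pick up. A hypothetical sketch (the class name and output format
are made up; Lemma and DocumentMetaData are DKPro Core types):

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.descriptor.ConfigurationParameter;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.metadata.type.DocumentMetaData;
import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma;

public class LemmaLineWriter extends JCasAnnotator_ImplBase {
    public static final String PARAM_TARGET_DIR = "targetDir";
    @ConfigurationParameter(name = PARAM_TARGET_DIR)
    private String targetDir;

    @Override
    public void process(JCas jCas) throws AnalysisEngineProcessException {
        // one tab-separated line per document: docId, then its lemmas
        String docId = DocumentMetaData.get(jCas).getDocumentId();
        StringBuilder line = new StringBuilder(docId);
        for (Lemma lemma : JCasUtil.select(jCas, Lemma.class)) {
            line.append('\t').append(lemma.getValue());
        }
        try {
            File out = new File(targetDir, docId + ".tsv");
            out.getParentFile().mkdirs();
            Files.write(out.toPath(),
                line.toString().getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new AnalysisEngineProcessException(e);
        }
    }
}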

I think UIMA feels a bit "old school" with all these XML descriptors, but
there is a purpose behind them once you start to understand the
architecture. Luckily, this is where uimaFIT comes to the rescue. We don't
use the Eclipse tools at all but integrate JCasGen with Gradle using this
nice plugin: https://github.com/Dictanova/gradle-jcasgen-plugin. I do wish
UIMA came with direct Gradle support, though. We don't want to rely on
IDE-specific tools ourselves, since we use both Eclipse and IntelliJ IDEA in
development and need the code-generation tools integrated into the automated
build process. The upshot is that we only write the type system definitions
in XML; the analysis engine and pipeline descriptions we express directly
with uimaFIT. However, be prepared to do some digging, since not every
detail is covered as well in the uimaFIT documentation as it is for the
general UIMA framework. The community responses on this mailing list are a
big plus, though.
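
For example, an engine descriptor never has to be written by hand: uimaFIT
derives it from the annotated class, and can still emit the XML when some
tool insists on it. A minimal sketch (the output path is made up):

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

import java.io.FileOutputStream;

import org.apache.uima.analysis_engine.AnalysisEngineDescription;

import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

public class DescriptorDemo {
    public static void main(String[] args) throws Exception {
        // uimaFIT builds the descriptor from the class and its
        // @ConfigurationParameter annotations; no hand-written XML needed.
        AnalysisEngineDescription desc =
                createEngineDescription(OpenNlpSegmenter.class);
        // Only the type system still lives in XML (turned into JCas
        // classes by JCasGen during the Gradle build). If a tool needs
        // an engine descriptor file, it can be generated on demand:
        desc.toXML(new FileOutputStream("target/OpenNlpSegmenter.xml"));
    }
}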

Cheers
Mario


> On 26 Apr 2015, at 11:05 , Petr Baudis <pa...@ucw.cz> wrote:
> 
>  Hi!
> 
> On Sun, Apr 26, 2015 at 10:12:05AM +0200, Martin Wunderlich wrote:
>> To provide a concrete scenario, would UIMA be useful in modeling the 
>> following processing pipeline, given a corpus consisting of a number of text 
>> documents: 
>> 
>> - annotate each doc with meta-data extracted from it, such as publication 
>> date
>> - preprocess the corpus, e.g. by stopword removal and lemmatization
>> - save intermediate pre-processed and annotated versions of corpus (so that 
>> pre-processing has to be done only once)
>> - run LDA (e.g. using Mallet) on the entire training corpus to model topics, 
>> with number of topics ranging, for instance, from 50 to 100
>> - convert each doc to a feature vector as per the LDA model
>> - extract paragraphs from relevant documents and use for unsupervised 
>> pre-training in a deep learning architecture (built using e.g. 
>> Deeplearning4J)
> 
>  I think up to here, UIMA would be a good choice for you.
> 
>> - train and test an SVM for supervised text classification (binary 
>> classification into „relevant“ vs. „non-relevant“) using cross-validation
>> - store each trained SVM
>> - report results of CV into CSV file for further processing
> 
>  The moment you stop dealing with *unstructured* data and just do feature
> vectors and classifier objects, it's imho easier to get out of UIMA,
> but that may not be a big deal.
> 
>> Would UIMA be a good choice to build and manage a project like this? 
>> What would be the advantages of UIMA compared to using simple shell scripts 
>> for „gluing together“ the individual components? 
> 
>  Well, UIMA provides the gluing so you don't have to do it yourself,
> it's not that small an amount of work:
> 
>  (i) a common container (CAS) for annotated data
>  (ii) pipeline flow control that also supports scale out
>  (iii) the DKPro project, which lets you effortlessly perform NLP
> annotations, interface resources etc. using off-the-shelf NLP components
> 
>  For me, UIMA had a rather steep learning curve.  But that was largely
> because my pipeline is highly non-linear and I didn't use the Eclipse
> GUI tools; I would hope things should go pretty easily in a simpler
> scenario with a completely linear pipeline like yours.
> 
>  P.S.: Also, use uimaFIT to build your pipeline, ignore the annotator
> XML descriptors you see in the UIMA User Guide.  I recommend that you
> just look at the DKPro example suite to get started quickly.
> 
> -- 
>                               Petr Baudis
>       If you do not work on an important problem, it's unlikely
>       you'll do important work.  -- R. Hamming
>       http://www.cs.virginia.edu/~robins/YouAndYourResearch.html
