Thanks so much, Petr and Mario, for your detailed views. They confirm my 
initial impression that the learning curve of the system should not be 
underestimated. I might have a look at the DKPro project to see if it would 
be a suitable starting point for my project. That said, I might decide to 
stick with components that are loosely coupled via scripting to get a 
prototype together quickly, and then move the system to UIMA once it has 
stabilised. It definitely seems like a system worth getting familiar with. 

Cheers, 

Martin
 


> On 26.04.2015, at 13:44, Mario Gazzo <mario.ga...@gmail.com> wrote:
> 
> Hi Martin,
> 
> I agree with Petr. We are in the process of migrating our existing text 
> analysis components to UIMA, coming from an approach that closely resembles 
> what you would call just "gluing things together". That works well when you 
> are initially just experimenting with rapid prototypes; in that phase, I 
> think UIMA could even get in the way if you don't already understand it 
> well. However, once you need to scale the dev team and move to production, 
> these ad-hoc approaches become a problem. A framework like UIMA gives the 
> whole team a systematic development approach, and once you have climbed the 
> steep learning curve, I believe it can also be a faster prototyping tool, 
> because it makes it easy to combine different components into a new 
> pipeline quickly. An important factor for us was therefore also the diverse 
> ecosystem of quality analysis components such as DKPro, cTakes, ClearTK, 
> etc. You can even integrate GATE components and vice versa (see 
> https://gate.ac.uk/sale/tao/splitch22.html#chap:uima), although I haven't 
> played with this myself yet.
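> 
> To make that concrete, here is a rough sketch of what "quickly combining 
> components" looks like with UIMAfit. The component and package names are 
> from DKPro Core 1.x, and it assumes the matching DKPro artifacts and models 
> are on the classpath, so adjust to your versions:
> 
> import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
> import org.apache.uima.fit.factory.JCasFactory;
> import org.apache.uima.fit.pipeline.SimplePipeline;
> import org.apache.uima.jcas.JCas;
> import de.tudarmstadt.ukp.dkpro.core.tokit.BreakIteratorSegmenter;
> import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;
> 
> public class QuickPipeline {
>     public static void main(String[] args) throws Exception {
>         // The CAS carries the document text through the pipeline.
>         JCas jcas = JCasFactory.createJCas();
>         jcas.setDocumentText("UIMA pipelines can be assembled in a few lines.");
>         jcas.setDocumentLanguage("en");
>         // Two off-the-shelf components chained without any XML descriptors.
>         SimplePipeline.runPipeline(jcas,
>                 createEngineDescription(BreakIteratorSegmenter.class),
>                 createEngineDescription(OpenNlpPosTagger.class));
>     }
> }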
> 
> We are not using the distributed scale-out features of UIMA but rely on 
> various AWS services instead. It takes a bit of tinkering to figure out how 
> to do this, but we are gradually getting there. Generally, we do the 
> unstructured NLP processing on a document-by-document basis in UIMA, and 
> then do corpus-wide structured analysis outside UIMA using MapReduce-style 
> approaches. That said, we are now also moving towards stream-based 
> approaches, since we have to ingest large amounts of data continuously; 
> running very large MR batch jobs on a daily basis is wasteful and 
> impractical in our case.
> 
> I think UIMA feels a bit "old school" with all those XML descriptors, but 
> there is a purpose behind them once you start to understand the 
> architecture. Luckily, this is where UIMAfit comes to the rescue. We don't 
> use the Eclipse tools at all; instead we integrate JCasGen with Gradle 
> using this nice plugin: https://github.com/Dictanova/gradle-jcasgen-plugin. 
> I do wish JCasGen had direct Gradle support, though. We don't want to rely 
> on IDE-specific tools ourselves, since we use both Eclipse and IntelliJ 
> IDEA in development, and we need the code generation tools integrated with 
> the automated build process. The main difference is that we only need to 
> write the type definitions directly in XML; for the analysis engine and 
> pipeline descriptions we can just use UIMAfit. However, be prepared to do 
> some digging, since not every detail is covered as well in the UIMAfit 
> documentation as it is for the general UIMA framework. Community responses 
> on this mailing list are a big plus, though.
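> 
> For illustration, this is roughly what an annotator plus its engine 
> description look like in pure UIMAfit, with no XML descriptor involved 
> (the annotator itself is hypothetical):
> 
> import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
> import org.apache.uima.analysis_engine.AnalysisEngineDescription;
> import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
> import org.apache.uima.fit.descriptor.ConfigurationParameter;
> import org.apache.uima.jcas.JCas;
> 
> // Hypothetical annotator: configuration is declared with annotations
> // instead of an XML descriptor.
> public class LengthFilter extends JCasAnnotator_ImplBase {
>     public static final String PARAM_MIN_LENGTH = "minLength";
>     @ConfigurationParameter(name = PARAM_MIN_LENGTH, defaultValue = "3")
>     private int minLength;
> 
>     @Override
>     public void process(JCas jcas) {
>         // ... inspect jcas.getDocumentText() and add annotations here ...
>     }
> 
>     public static AnalysisEngineDescription withMinLength(int min) throws Exception {
>         // The whole engine description is built in code.
>         return createEngineDescription(LengthFilter.class, PARAM_MIN_LENGTH, min);
>     }
> }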
> 
> Cheers
> Mario
> 
> 
>> On 26 Apr 2015, at 11:05, Petr Baudis <pa...@ucw.cz> wrote:
>> 
>> Hi!
>> 
>> On Sun, Apr 26, 2015 at 10:12:05AM +0200, Martin Wunderlich wrote:
>>> To provide a concrete scenario, would UIMA be useful in modeling the 
>>> following processing pipeline, given a corpus consisting of a number of 
>>> text documents: 
>>> 
>>> - annotate each doc with meta-data extracted from it, such as publication 
>>> date
>>> - preprocess the corpus, e.g. by stopword removal and lemmatization
>>> - save intermediate pre-processed and annotated versions of corpus (so that 
>>> pre-processing has to be done only once)
>>> - run LDA (e.g. using Mallet) on the entire training corpus to model 
>>> topics, with number of topics ranging, for instance, from 50 to 100
>>> - convert each doc to a feature vector as per the LDA model
>> +
>>> - extract paragraphs from relevant documents and use for unsupervised 
>>> pre-training in a deep learning architecture (built using e.g. 
>>> Deeplearning4J)
>> 
>> I think up to here, UIMA would be a good choice for you.
>> 
>>> - train and test an SVM for supervised text classification (binary 
>>> classification into „relevant“ vs. „non-relevant“) using cross-validation
>>> - store each trained SVM
>>> - report results of CV into CSV file for further processing
>> 
>> The moment you stop dealing with *unstructured* data and just work with
>> feature vectors and classifier objects, it's IMHO easier to get out of
>> UIMA, but that may not be a big deal.
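>> 
>> For example, one way to "get out" at that point is to copy what you need
>> from the CAS into plain Java collections and hand those to your ML code.
>> A sketch (the class and method here are just illustrative, not any
>> official API beyond JCasUtil):
>> 
>> import java.util.ArrayList;
>> import java.util.List;
>> import org.apache.uima.fit.util.JCasUtil;
>> import org.apache.uima.jcas.JCas;
>> import org.apache.uima.jcas.tcas.Annotation;
>> 
>> public class CasExport {
>>     // Collect the covered text of every annotation; feed the result to
>>     // LDA/SVM code that lives entirely outside UIMA.
>>     public static List<String> terms(JCas jcas) {
>>         List<String> out = new ArrayList<>();
>>         for (Annotation a : JCasUtil.select(jcas, Annotation.class)) {
>>             out.add(a.getCoveredText());
>>         }
>>         return out;
>>     }
>> }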
>> 
>>> Would UIMA be a good choice to build and manage a project like this? 
>>> What would be the advantages of UIMA compared to using simple shell scripts 
>>> for „gluing together“ the individual components? 
>> 
>> Well, UIMA provides the gluing so you don't have to do it yourself, and
>> that is not a small amount of work:
>> 
>> (i) a common container (the CAS) for annotated data (see the small sketch
>> after this list)
>> (ii) pipeline flow control that also supports scale-out
>> (iii) the DKPro project, which lets you effortlessly perform NLP
>> annotations, interface resources, etc. using off-the-shelf NLP components
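>> 
>> A minimal CAS sketch (the text and the hand-picked offsets are just a toy
>> example):
>> 
>> import org.apache.uima.fit.factory.JCasFactory;
>> import org.apache.uima.jcas.JCas;
>> import org.apache.uima.jcas.tcas.Annotation;
>> 
>> public class CasDemo {
>>     public static void main(String[] args) throws Exception {
>>         // The CAS holds the text plus stand-off annotations over it.
>>         JCas jcas = JCasFactory.createJCas();
>>         jcas.setDocumentText("Publication date: 2015-04-26.");
>>         Annotation span = new Annotation(jcas, 18, 28); // the date substring
>>         span.addToIndexes();
>>         System.out.println(span.getCoveredText()); // -> "2015-04-26"
>>     }
>> }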
>> 
>> For me, UIMA had a rather steep learning curve.  But that was largely
>> because my pipeline is highly non-linear and I didn't use the Eclipse
>> GUI tools; I would hope things go fairly smoothly in a simpler scenario
>> with a completely linear pipeline like yours.
>> 
>> P.S.: Also, use UIMAfit to build your pipeline and ignore the annotator
>> XML descriptors you see in the UIMA User Guide.  I recommend just looking
>> at the DKPro example suite to get started quickly.
>> 
>> -- 
>>                              Petr Baudis
>>      If you do not work on an important problem, it's unlikely
>>      you'll do important work.  -- R. Hamming
>>      http://www.cs.virginia.edu/~robins/YouAndYourResearch.html
> 
