Hej Martin,

I agree with Petr. We are in the process of migrating our existing text analysis components to UIMA, coming from an approach that more closely resembles what you would call just "gluing things together". That works well while you are still experimenting with rapid prototypes; in this phase, I think UIMA could even get in the way if you don't already understand it well. However, once you need to scale the dev team and move to production, these ad-hoc approaches become a problem. A framework like UIMA gives the whole team a systematic development approach, and once you have climbed the steep learning curve I believe it can also be a faster prototyping tool, because it makes it easy to quickly combine different components into a new pipeline.

An important factor for us was therefore also the diverse ecosystem of quality analysis components like DKPro, cTAKES, ClearTK etc. You can even integrate GATE components and vice versa (see https://gate.ac.uk/sale/tao/splitch22.html#chap:uima), although I haven't played with this myself yet.
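Just to give a flavour of what "quickly combining components" means in practice, here is a minimal uimaFIT sketch. It is not our actual code: it assumes DKPro Core's TextReader, OpenNlpSegmenter and OpenNlpPosTagger on the classpath, and the file pattern is made up:

  import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
  import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;

  import org.apache.uima.fit.pipeline.SimplePipeline;
  import de.tudarmstadt.ukp.dkpro.core.io.text.TextReader;
  import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;
  import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

  public class PipelineDemo {
      public static void main(String[] args) throws Exception {
          SimplePipeline.runPipeline(
              // Read plain-text documents from disk
              createReaderDescription(TextReader.class,
                  TextReader.PARAM_SOURCE_LOCATION, "docs/**/*.txt",
                  TextReader.PARAM_LANGUAGE, "en"),
              // Tokenize and split sentences, then POS-tag
              createEngineDescription(OpenNlpSegmenter.class),
              createEngineDescription(OpenNlpPosTagger.class));
      }
  }

Swapping the tagger for a different implementation is a one-line change, which is exactly what makes recombining pipelines so fast once the type system lines up.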
We are not using the distributed scale-out features of UIMA, but rely on various AWS services instead. It takes a bit of tinkering to figure out how to do this, but we are gradually getting there. Generally we do the unstructured NLP processing on a document-by-document basis in UIMA, and then do corpus-wide structured analysis outside UIMA using map-reduce style approaches. That said, we are now also moving towards stream-based approaches, since we have to ingest large amounts of data continuously; running very large MR batch jobs on a daily basis is, in our case, wasteful and impractical.

I think UIMA feels a bit "old school" with all these XML descriptors, but there is a purpose behind them once you start to understand the architecture. Luckily, this is where uimaFIT comes to the rescue. We don't use the Eclipse tools at all, but integrate JCasGen with Gradle using this nice plugin: https://github.com/Dictanova/gradle-jcasgen-plugin. I do wish the UIMA project itself offered direct Gradle support as well. We don't want to rely on IDE-specific tools, since we use both Eclipse and IntelliJ IDEA in development, and we need the code generation integrated with the automated build process. The main difference is that we only write the type definitions directly in XML (I have appended a small example below the quoted thread); for the analysis engine and pipeline descriptions we can just use uimaFIT. However, be prepared to do some digging, since not every detail is covered as well in the uimaFIT documentation as it is for the general UIMA framework. Community responses on this mailing list are a big plus, though.

Cheers,
Mario

> On 26 Apr 2015, at 11:05, Petr Baudis <pa...@ucw.cz> wrote:
>
> Hi!
>
> On Sun, Apr 26, 2015 at 10:12:05AM +0200, Martin Wunderlich wrote:
>> To provide a concrete scenario, would UIMA be useful in modeling the
>> following processing pipeline, given a corpus consisting of a number of text
>> documents:
>>
>> - annotate each doc with meta-data extracted from it, such as publication
>> date
>> - preprocess the corpus, e.g. by stopword removal and lemmatization
>> - save intermediate pre-processed and annotated versions of corpus (so that
>> pre-processing has to be done only once)
>> - run LDA (e.g. using Mallet) on the entire training corpus to model topics,
>> with number of topics ranging, for instance, from 50 to 100
>> - convert each doc to a feature vector as per the LDA model
> +
>> - extract paragraphs from relevant documents and use for unsupervised
>> pre-training in a deep learning architecture (built using e.g.
>> Deeplearning4J)
>
> I think up to here, UIMA would be a good choice for you.
>
>> - train and test an SVM for supervised text classification (binary
>> classification into „relevant“ vs. „non-relevant“) using cross-validation
>> - store each trained SVM
>> - report results of CV into CSV file for further processing
>
> The moment you stop dealing with *unstructured* data and just do feature
> vectors and classifier objects, it's imho easier to get out of UIMA,
> but that may not be a big deal.
>
>> Would UIMA be a good choice to build and manage a project like this?
>> What would be the advantages of UIMA compared to using simple shell scripts
>> for „gluing together“ the individual components?
>
> Well, UIMA provides the gluing so you don't have to do it yourself,
> and that's not a small amount of work:
>
> (i) a common container (CAS) for annotated data
> (ii) pipeline flow control that also supports scale-out
> (iii) the DKPro project, which lets you effortlessly perform NLP
> annotations, interface resources etc. using off-the-shelf NLP components
>
> For me, UIMA had a rather steep learning curve. But that was largely
> because my pipeline is highly non-linear and I didn't use the Eclipse
> GUI tools; I would hope things should go pretty easily in a simpler
> scenario with a completely linear pipeline like yours.
>
> P.S.: Also, use uimaFIT to build your pipeline; ignore the annotator
> XML descriptors you see in the UIMA User Guide. I recommend that you
> just look at the DKPro example suite to get started quickly.
>
> --
> Petr Baudis
> If you do not work on an important problem, it's unlikely
> you'll do important work. -- R. Hamming
> http://www.cs.virginia.edu/~robins/YouAndYourResearch.html
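P.S. As promised above, here is roughly what one of our hand-written type system descriptors looks like. A minimal sketch with made-up type and feature names, not our real type system:

  <typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
    <name>ExampleTypeSystem</name>
    <types>
      <typeDescription>
        <name>com.example.types.PublicationDate</name>
        <description>Publication date extracted from document metadata.</description>
        <supertypeName>uima.tcas.Annotation</supertypeName>
        <features>
          <featureDescription>
            <name>normalizedDate</name>
            <description>The date normalized to ISO 8601.</description>
            <rangeTypeName>uima.cas.String</rangeTypeName>
          </featureDescription>
        </features>
      </typeDescription>
    </types>
  </typeSystemDescription>

JCasGen (org.apache.uima.tools.jcasgen.Jg) turns this into the Java cover classes; the Gradle plugin mentioned above simply hooks that generation step into the build, so it runs the same way for everyone, with or without an IDE.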
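P.P.S. On Petr's point about getting out of UIMA once you are down to feature vectors: that matches our experience. We keep that boundary explicit by putting a small consumer at the end of the pipeline which dumps whatever the downstream (non-UIMA) ML code needs. A rough sketch, assuming DKPro Core's Token type, with a toy token count standing in for a real feature vector:

  import java.io.FileWriter;
  import java.io.IOException;
  import org.apache.uima.UimaContext;
  import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
  import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
  import org.apache.uima.fit.descriptor.ConfigurationParameter;
  import org.apache.uima.fit.util.JCasUtil;
  import org.apache.uima.jcas.JCas;
  import org.apache.uima.resource.ResourceInitializationException;
  import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token;

  // Last component in the pipeline: writes one CSV row per document.
  // Everything downstream (LDA, SVM training) then runs outside UIMA.
  public class CsvFeatureWriter extends JCasAnnotator_ImplBase {

      public static final String PARAM_OUTPUT_FILE = "outputFile";
      @ConfigurationParameter(name = PARAM_OUTPUT_FILE)
      private String outputFile;

      private FileWriter out;

      @Override
      public void initialize(UimaContext context) throws ResourceInitializationException {
          super.initialize(context);
          try {
              out = new FileWriter(outputFile);
          } catch (IOException e) {
              throw new ResourceInitializationException(e);
          }
      }

      @Override
      public void process(JCas jcas) throws AnalysisEngineProcessException {
          try {
              // Toy "feature": the token count. In a real setup you would
              // emit e.g. the per-document LDA topic vector here.
              out.write(JCasUtil.select(jcas, Token.class).size() + "\n");
              out.flush();
          } catch (IOException e) {
              throw new AnalysisEngineProcessException(e);
          }
      }

      @Override
      public void collectionProcessComplete() throws AnalysisEngineProcessException {
          try {
              out.close();
          } catch (IOException e) {
              throw new AnalysisEngineProcessException(e);
          }
      }
  }

Everything after that file (cross-validation, reporting) can then be plain Java or shell scripts without any UIMA dependency.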