Thanks so much, Petr and Mario, for your detailed views. They confirm my initial impression that the learning curve of the system is not to be underestimated. I might have a look at the DKPro project to see if that would be a suitable starting point for my project. That said, I might decide to stick with components that are very loosely coupled via scripting, to get a prototype together quickly, and then move the system to UIMA once it has stabilised. It definitely seems like a system worth getting familiar with.
Cheers,
Martin

> On 26.04.2015, at 13:44, Mario Gazzo <mario.ga...@gmail.com> wrote:
>
> Hi Martin,
>
> I agree with Petr. We are in the process of migrating our existing text
> analysis components to UIMA, coming from an approach that more closely
> resembles what you would call just "gluing things together". This works
> well when you are initially just experimenting with rapid prototypes; I
> think UIMA could even get in the way in this phase if you don't already
> understand it very well. However, once you need to scale the dev team and
> move to production, these ad-hoc approaches become a problem. A framework
> like UIMA gives you a systematic development approach for the whole team,
> and once you have climbed the steep learning curve I believe it can also
> be a faster prototyping tool, because it makes it easier to quickly
> combine different components into a new pipeline. An important factor for
> us was therefore also the diverse ecosystem of quality analysis
> components like DKPro, cTAKES, ClearTK etc. You can even integrate GATE
> components and vice versa (see
> https://gate.ac.uk/sale/tao/splitch22.html#chap:uima), although I haven't
> played with this myself yet.
>
> We are not using the distributed scale-out features of UIMA but rely on
> various AWS services instead; it takes a bit of tinkering to figure out
> how to do this, but we are gradually getting there. Generally we do the
> unstructured NLP processing on a document-by-document basis in UIMA, but
> we do corpus-wide structured analysis using map-reduce-style approaches
> outside UIMA. That said, we are now also moving towards stream-based
> approaches, since we have to ingest large amounts of data continuously.
> Doing very large MR batch jobs on a daily basis is in our case wasteful
> and impractical.
>
> I think UIMA feels a bit "old school" with all these XML descriptors, but
> there is purpose behind this once you start to understand the
> architecture. Luckily this is where uimaFIT comes to the rescue. We don't
> use the Eclipse tools at all but integrate JCasGen with Gradle using this
> nice plugin: https://github.com/Dictanova/gradle-jcasgen-plugin. I wish
> there was direct support for Gradle as well. We don't want to rely on
> IDE-specific tools ourselves, since we use both Eclipse and IntelliJ IDEA
> in development and we need the code generation integrated with the
> automated build process. The main difference is that we only need to
> write the type definitions directly in XML; for the analysis engine and
> pipeline descriptions we can just use uimaFIT. However, be prepared to do
> some digging, since not every detail is covered as well in the uimaFIT
> documentation as it is for the general UIMA framework. Community
> responses on this mailing list are a big plus, though.
>
> Cheers
> Mario
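To illustrate the style Mario describes, where only the type system lives in XML and the engines are declared in code, here is a minimal uimaFIT sketch. The annotator class and its parameter are hypothetical placeholders; only the construction pattern matters.

import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.descriptor.ConfigurationParameter;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.pipeline.SimplePipeline;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

public class UimaFitStyleSketch {

    // Hypothetical annotator: configuration is declared on annotated
    // fields instead of in an XML descriptor.
    public static class WholeDocAnnotator extends JCasAnnotator_ImplBase {
        public static final String PARAM_LANGUAGE = "language";

        @ConfigurationParameter(name = PARAM_LANGUAGE, defaultValue = "en")
        private String language;

        @Override
        public void process(JCas jcas) {
            // Toy logic: add one Annotation spanning the whole document,
            // just so the sketch puts something into the CAS.
            new Annotation(jcas, 0, jcas.getDocumentText().length()).addToIndexes();
        }
    }

    public static void main(String[] args) throws Exception {
        // The engine description is assembled in code; no ae.xml needed.
        AnalysisEngineDescription desc = AnalysisEngineFactory.createEngineDescription(
                WholeDocAnnotator.class,
                WholeDocAnnotator.PARAM_LANGUAGE, "en");

        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentText("UIMA without per-engine XML descriptors.");
        SimplePipeline.runPipeline(jcas, desc);
        System.out.println("Annotations in CAS: " + jcas.getAnnotationIndex().size());
    }
}

The createEngineDescription call is what replaces the per-engine XML descriptor, which is why only the type definitions still need to live in XML.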
>> On 26 Apr 2015, at 11:05, Petr Baudis <pa...@ucw.cz> wrote:
>>
>> Hi!
>>
>> On Sun, Apr 26, 2015 at 10:12:05AM +0200, Martin Wunderlich wrote:
>>> To provide a concrete scenario, would UIMA be useful in modeling the
>>> following processing pipeline, given a corpus consisting of a number
>>> of text documents:
>>>
>>> - annotate each doc with meta-data extracted from it, such as
>>> publication date
>>> - preprocess the corpus, e.g. by stopword removal and lemmatization
>>> - save intermediate pre-processed and annotated versions of the corpus
>>> (so that pre-processing has to be done only once)
>>> - run LDA (e.g. using Mallet) on the entire training corpus to model
>>> topics, with the number of topics ranging, for instance, from 50 to 100
>>> - convert each doc to a feature vector as per the LDA model
>>> - extract paragraphs from relevant documents and use them for
>>> unsupervised pre-training in a deep learning architecture (built using
>>> e.g. Deeplearning4J)
>>
>> I think up to here, UIMA would be a good choice for you.
>>
>>> - train and test an SVM for supervised text classification (binary
>>> classification into „relevant“ vs. „non-relevant“) using
>>> cross-validation
>>> - store each trained SVM
>>> - report results of the CV into a CSV file for further processing
>>
>> The moment you stop dealing with *unstructured* data and just handle
>> feature vectors and classifier objects, it's imho easier to get out of
>> UIMA, but that may not be a big deal.
>>
>>> Would UIMA be a good choice to build and manage a project like this?
>>> What would be the advantages of UIMA compared to using simple shell
>>> scripts for „gluing together“ the individual components?
>>
>> Well, UIMA provides the gluing so you don't have to do it yourself, and
>> that is not such a small amount of work:
>>
>> (i) a common container (CAS) for annotated data
>> (ii) pipeline flow control that also supports scale-out
>> (iii) the DKPro project, which lets you effortlessly perform NLP
>> annotations, interface resources etc. using off-the-shelf NLP components
>>
>> For me, UIMA had a rather steep learning curve. But that was largely
>> because my pipeline is highly non-linear and I didn't use the Eclipse
>> GUI tools; I would hope things should go pretty easily in a simpler
>> scenario with a completely linear pipeline like yours.
>>
>> P.S.: Also, use uimaFIT to build your pipeline and ignore the annotator
>> XML descriptors you see in the UIMA User Guide. I recommend that you
>> just look at the DKPro example suite to get started quickly.
>>
>> --
>> Petr Baudis
>> If you do not work on an important problem, it's unlikely
>> you'll do important work. -- R. Hamming
>> http://www.cs.virginia.edu/~robins/YouAndYourResearch.html
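To make Petr's P.S. concrete: a completely linear pipeline in the uimaFIT-plus-DKPro style could look roughly like the sketch below. It covers the first steps of the quoted scenario (read the corpus, segment it, save intermediate annotated versions) and assumes DKPro Core's TextReader, OpenNlpSegmenter and XmiWriter components with made-up paths; treat it as a starting point rather than a tested recipe.

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;

import org.apache.uima.fit.pipeline.SimplePipeline;

import de.tudarmstadt.ukp.dkpro.core.io.text.TextReader;
import de.tudarmstadt.ukp.dkpro.core.io.xmi.XmiWriter;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

public class LinearPipelineSketch {
    public static void main(String[] args) throws Exception {
        SimplePipeline.runPipeline(
                // Read plain-text documents from a directory (path made up).
                createReaderDescription(TextReader.class,
                        TextReader.PARAM_SOURCE_LOCATION, "corpus",
                        TextReader.PARAM_PATTERNS, "[+]*.txt", // [+] marks an include pattern
                        TextReader.PARAM_LANGUAGE, "en"),
                // Tokenize and sentence-split with an off-the-shelf component.
                createEngineDescription(OpenNlpSegmenter.class),
                // Persist annotated CASes so preprocessing runs only once.
                createEngineDescription(XmiWriter.class,
                        XmiWriter.PARAM_TARGET_LOCATION, "corpus-annotated"));
    }
}

From the XMI files written in the last step, the LDA and SVM stages can then be driven outside UIMA, along the lines Petr suggests.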