Hi all,

I am relatively new to UIMA and was wondering whether it would be the right choice for a project I am currently working on. In essence, the project deals with a variety of text classification problems at different levels (document, paragraph, and sentence level) using different methods.
To give a concrete scenario: given a corpus of text documents, would UIMA be useful for modeling the following processing pipeline?

- annotate each document with metadata extracted from it, such as the publication date
- preprocess the corpus, e.g. by stopword removal and lemmatization
- save the intermediate preprocessed and annotated versions of the corpus (so that preprocessing has to be done only once)
- run LDA (e.g. using Mallet) on the entire training corpus to model topics, with the number of topics ranging, for instance, from 50 to 100
- convert each document to a feature vector according to the LDA model
- train and test an SVM for supervised text classification (binary classification into "relevant" vs. "non-relevant") using cross-validation
- store each trained SVM
- write the cross-validation results to a CSV file for further processing
- extract paragraphs from the relevant documents and use them for unsupervised pre-training in a deep learning architecture (built using e.g. Deeplearning4J)

Would UIMA be a good choice for building and managing a project like this? What would be the advantages of UIMA over simple shell scripts for "gluing together" the individual components?

Thanks a lot.

Kind regards,
Martin
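
P.S. To make the "gluing together" alternative concrete, here is a minimal sketch of the kind of hand-written glue I have in mind. It is plain Python using only the standard library and does not use UIMA, Mallet, or Deeplearning4J. All names in it (`STOPWORDS`, `remove_stopwords`, `cached_stage`, `k_fold_indices`) are hypothetical and made up for illustration. It only covers two of the steps above: caching a preprocessed corpus so that preprocessing runs once, and producing cross-validation splits. Lemmatization, LDA, and the SVM are left out.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical toy stopword list; a real project would use a proper resource.
STOPWORDS = {"the", "a", "of", "and", "is"}


def remove_stopwords(docs):
    """Lowercase, split on whitespace, and drop stopwords from each document."""
    return [[tok for tok in doc.lower().split() if tok not in STOPWORDS]
            for doc in docs]


def cached_stage(name, func, data, cache_dir):
    """Run func(data) once; later calls reload the saved JSON result instead."""
    path = Path(cache_dir) / f"{name}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = func(data)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result


def k_fold_indices(n_items, k):
    """Yield (train, test) index lists for simple contiguous k-fold CV."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size


if __name__ == "__main__":
    corpus = ["The topic of the day", "A model is trained"]
    with tempfile.TemporaryDirectory() as cache_dir:
        tokens = cached_stage("preprocessed", remove_stopwords, corpus, cache_dir)
        print(tokens)
    for train, test in k_fold_indices(len(corpus), 2):
        print(train, test)
```

Even this small sketch has to invent its own caching convention and its own data format between stages, which is the part I am hoping a framework like UIMA might handle for me.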