Hi all, 

I am relatively new to UIMA and I was wondering if it would be the right 
choice for a project I am currently working on. In essence, the project 
deals with a variety of text classification problems at different levels 
(document level, paragraph level, sentence level), using different 
methods. 

To provide a concrete scenario: would UIMA be useful for modeling the following 
processing pipeline, given a corpus consisting of a number of text documents? 

- annotate each document with metadata extracted from it, such as the publication date
- preprocess the corpus, e.g. by stopword removal and lemmatization
- save intermediate preprocessed and annotated versions of the corpus (so that 
preprocessing has to be done only once)
- run LDA (e.g. using Mallet) on the entire training corpus to model topics, 
with the number of topics ranging, for instance, from 50 to 100
- convert each document to a feature vector as per the LDA model
- train and test an SVM for supervised text classification (binary 
classification into "relevant" vs. "non-relevant") using cross-validation
- store each trained SVM
- report the results of the cross-validation to a CSV file for further processing
- extract paragraphs from relevant documents and use them for unsupervised 
pre-training in a deep learning architecture (built using e.g. Deeplearning4J)
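To make the question more concrete, here is a minimal sketch of the pipeline idea I have in mind — plain Python, not the UIMA API, and all names are made up. Each "annotator" reads from and writes to a shared per-document annotation store, roughly analogous to UIMA's CAS, so that components stay decoupled and can be reordered or swapped:

```python
# Plain-Python sketch (NOT UIMA code) of an annotator pipeline over a
# shared annotation store, loosely analogous to UIMA's CAS.

class Document:
    def __init__(self, text):
        self.text = text
        self.annotations = {}  # annotator name -> result (CAS-like store)

def metadata_annotator(doc):
    # Hypothetical: pretend the first line of the text holds the publication date.
    doc.annotations["pub_date"] = doc.text.splitlines()[0]
    return doc

def preprocess_annotator(doc):
    # Toy stopword removal; a real pipeline would also lemmatize here.
    stopwords = {"the", "a", "of"}
    tokens = [t.lower() for t in doc.text.split() if t.lower() not in stopwords]
    doc.annotations["tokens"] = tokens
    return doc

def run_pipeline(doc, annotators):
    # Apply each annotator in order; each one enriches the same document.
    for annotate in annotators:
        doc = annotate(doc)
    return doc

doc = run_pipeline(Document("2015-06-01\nThe quick brown fox"),
                   [metadata_annotator, preprocess_annotator])
print(doc.annotations["pub_date"])  # → 2015-06-01
print(doc.annotations["tokens"])    # → ['2015-06-01', 'quick', 'brown', 'fox']
```

My question is essentially whether UIMA's machinery (typed annotations, XMI serialization for the intermediate versions, reusable analysis engines) is worth adopting for this, rather than hand-rolling something like the above.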

Would UIMA be a good choice for building and managing a project like this? 
What would be the advantages of UIMA compared to using simple shell scripts to 
"glue together" the individual components? 

Thanks a lot. 

Kind regards, 

Martin
