Hi,

The advantage of using UIMA over plain OpenNLP is that it lets you combine components from different sources more easily, e.g. a tokenizer and POS tagger from OpenNLP, a parser from Stanford, etc.

You then have input components that deal with the different sources, and output components that, for example, index the results in Solr. Within the processing pipeline you have one consistent data representation, so you don't have to write glue code for format conversions.
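
To make the "one consistent data representation" point concrete, here is a minimal sketch in plain Python. It is not UIMA's API: the names Document, Annotation, whitespace_tokenizer, capitalized_entity_tagger and run_pipeline are all made up for illustration, and the "entity tagger" is a toy rule. The only point is that every component reads and writes the same annotated document, so components from different sources can be chained without conversions.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str   # hypothetical type name, e.g. "Token" or "Entity"
    begin: int  # character offset where the span starts
    end: int    # character offset where the span ends (exclusive)


@dataclass
class Document:
    """Shared representation: the text plus stand-off annotations."""
    text: str
    annotations: list = field(default_factory=list)

    def select(self, kind):
        # Returns a new list, so callers may add annotations while iterating.
        return [a for a in self.annotations if a.kind == kind]

    def covered_text(self, annotation):
        return self.text[annotation.begin:annotation.end]


def whitespace_tokenizer(doc):
    # Stand-in for a tokenizer from one library.
    for match in re.finditer(r"\S+", doc.text):
        doc.annotations.append(Annotation("Token", match.start(), match.end()))


def capitalized_entity_tagger(doc):
    # Stand-in for an entity recognizer from another library (toy rule:
    # every capitalized token is an entity). It only relies on the Token
    # annotations, not on which component produced them.
    for token in doc.select("Token"):
        if doc.covered_text(token)[0].isupper():
            doc.annotations.append(Annotation("Entity", token.begin, token.end))


def run_pipeline(doc, components):
    # Each component enriches the same document in turn.
    for component in components:
        component(doc)
    return doc


doc = run_pipeline(
    Document("Jens indexes documents in Solr"),
    [whitespace_tokenizer, capitalized_entity_tagger],
)
entities = [doc.covered_text(a) for a in doc.select("Entity")]
```

Swapping the tokenizer for another one only requires that the replacement also writes Token annotations; the tagger stays untouched. An output component (for example an indexer) would likewise just read the annotations it needs.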

Take a look at DKPro Core (https://code.google.com/p/dkpro-core-asl/), which has UIMA wrappers for many different components.

HTH,
Jens

On 15/01/14 14:52, Burcu B wrote:
Hi,

Thank you, Jens. I was planning to use OpenNLP directly for named entity
recognition, for the analysis you've mentioned, and Lucene for
tokenization. However, UIMA has an OpenNLP component, too. What is the reason
to use UIMA instead of using OpenNLP and Solr together?

I am planning to use Mahout and R together in the application, but other
libraries or algorithms could be added later. The program should also be
extensible through plugins, much like Atlassian's JIRA. Does UIMA's
component architecture make this easier than the other options?

Where does UIMA fit in a system that reads documents from different
sources, removes stop words, identifies named entities, indexes them, and
then classifies and clusters the text and indexes topics/labels? I am
confused about whether and why UIMA should be used.

Regards,





On Wed, Jan 15, 2014 at 1:15 PM, Jens Grivolla <j+...@grivolla.net> wrote:

Hello Burcu,

UIMA has an entirely different purpose actually, and doesn't do
classification or clustering.  You would rather use UIMA to enrich
documents (individually) through text analysis and then use the result to
create better feature vectors to use with Solr, Mahout, etc.

We typically use UIMA to do named entity recognition, sentiment analysis,
chunking, etc. and then index the result in Solr. From there you can either
use it for retrieval (i.e. use the enriched representation to get a better
document similarity measure) or extract the vectors to use with
Mahout/Weka/Cluto/...
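
As a rough illustration of "better feature vectors", here is a small sketch in plain Python. It does not use the UIMA, Solr or Mahout APIs; feature_vector and cosine are hypothetical helpers, and the token and entity lists are typed in by hand where a real system would take them from the analysis results. The idea is only that annotations such as named entities become extra features next to the plain words.

```python
import math
from collections import Counter


def feature_vector(tokens, entities):
    # Bag of lowercased words, plus one extra feature per recognized entity.
    features = Counter(token.lower() for token in tokens)
    features.update(f"ENTITY={entity}" for entity in entities)
    return features


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as Counters.
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Assumed analysis output: "Apple" is tagged as an entity only in the first text.
company_tokens = ["Apple", "released", "a", "phone"]
fruit_tokens = ["I", "ate", "an", "apple"]

plain_company = feature_vector(company_tokens, [])
enriched_company = feature_vector(company_tokens, ["Apple"])
fruit = feature_vector(fruit_tokens, [])

plain_similarity = cosine(plain_company, fruit)
enriched_similarity = cosine(enriched_company, fruit)
```

With the entity feature added, the company text and the fruit text come out less similar than with words alone, which is the kind of signal a retrieval, classification or clustering step can then use.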

HTH,
Jens


On 14/01/14 16:25, Burcu B wrote:

Hi,

I'd like to know why someone should prefer UIMA when developing an
application for end users to classify and cluster general-purpose
documents.

I have two options:
1- integrating Mahout, Solr, R, Hadoop and other file sources such as
   document management systems or the file system.
2- or doing these using UIMA.

Intuitively, I think that UIMA should be preferred, but I could not
justify my feeling. I need a list of pros and cons.

If you could suggest some resources, that would be great.

Thank you.






