Hi guys,
Hi Julien,
Why not use Behemoth to deploy your UIMA application on Hadoop? (
http://code.google.com/p/behemoth-pebble/)
Behemoth uses HDFS for input & output. So far we have integrated Heritrix in combination with the HBase writer ( http://code.google.com/p/hbase-writer/ ) and base our whole architecture on HBase. It would be nice if Behemoth supported HBase in the future.

Behemoth is meant to do exactly what you described and already has adapters
for Nutch & WARC archives. It can take a UIMA pear, deploy it on a Hadoop
cluster, extract some of the UIMA-generated annotations, and store them in a
neutral format which can then be used to generate vectors for Mahout. The
purpose of Behemoth is to facilitate the deployment of NLP components for
large-scale processing and to act as a bridge between common inputs (e.g.
Nutch, WARC) and other projects (Mahout, Tika, etc.).
You are perfectly right about the goal of facilitating the deployment of NLP components.
If we had a mechanism for generating Mahout vectors from Behemoth
annotations, we would be able to use other NLP frameworks such as GATE as
well. Something like this is on the roadmap for Behemoth anyway, but it
sounds like what you are planning to do would be a perfect match.
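To make the idea concrete, here is a minimal sketch of what such a mechanism could look like: turning token annotations extracted from a document into a term-frequency vector, the kind of input Mahout's clustering jobs consume. This is purely illustrative — the class name, the in-memory dictionary, and the plain int[] output are all assumptions of this sketch (real Mahout code would use a sparse vector type and a corpus-wide dictionary), not Behemoth's actual API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: map extracted token annotations to a
// term-frequency vector. A real implementation would emit a Mahout
// sparse vector; a dense int[] keeps this example self-contained.
public class AnnotationVectorizer {

    // Assumed dictionary: term -> dimension index, grown on the fly here;
    // in practice it would be built over the whole corpus in a prior pass.
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();

    // Look up (or assign) a dimension index for a term.
    private int indexOf(String term) {
        return dictionary.computeIfAbsent(term, t -> dictionary.size());
    }

    // Produce term frequencies for one document's token annotations.
    public int[] vectorize(List<String> tokens, int dimensions) {
        int[] vector = new int[dimensions];
        for (String token : tokens) {
            int idx = indexOf(token.toLowerCase());
            if (idx < dimensions) {
                vector[idx]++;  // count this term's occurrence
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        AnnotationVectorizer v = new AnnotationVectorizer();
        List<String> tokens = List.of("UIMA", "on", "Hadoop", "with", "UIMA");
        int[] vec = v.vectorize(tokens, 8);
        System.out.println(vec[0]); // prints 2 ("uima" occurs twice)
    }
}
```

Because the vectorizer only depends on the neutral annotation format, the same step would work whether the annotations came from UIMA or from GATE.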

Any thoughts on this?

Julien


Marc
