Hi guys,
Hi Julien,
Why not use Behemoth to deploy your UIMA application on Hadoop? (
http://code.google.com/p/behemoth-pebble/)
Behemoth uses HDFS for input & output. So far we have integrated Heritrix in combination with the HBase writer ( http://code.google.com/p/hbase-writer/ ) and base our whole architecture on HBase. It would be nice if Behemoth supported HBase in the future.

Behemoth is meant to do exactly what you described and already has adapters
for Nutch & WARC archives. It can take a UIMA pear, deploy it on a Hadoop
cluster, extract some of the UIMA-generated annotations, and store them in a
neutral format which can then be used to generate vectors for Mahout. The
purpose of Behemoth is to facilitate the deployment of NLP components for
large-scale processing and to act as a bridge between common inputs (e.g.
Nutch, WARC) and other projects (Mahout, Tika, etc.).
You are perfectly right about the goal of facilitating the deployment of NLP components.
If we had a mechanism for generating Mahout vectors from Behemoth
annotations, we would be able to use other NLP frameworks such as GATE as
well. Something like this is on the roadmap for Behemoth anyway, but it
sounds like what you are planning to do would be a perfect match.
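To make the idea concrete, here is a minimal sketch of what such a mechanism could look like: turning token annotations extracted from a document into a term-frequency vector, the kind of input Mahout's clustering jobs consume. This is purely illustrative — the class name, the in-memory dictionary, and the plain int[] output are all assumptions of this sketch (real Mahout code would use a sparse vector type and a corpus-wide dictionary), not Behemoth's actual API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: map extracted token annotations to a
// term-frequency vector. A real implementation would emit a Mahout
// sparse vector; a dense int[] keeps this example self-contained.
public class AnnotationVectorizer {

    // Assumed dictionary: term -> dimension index, grown on the fly here;
    // in practice it would be built over the whole corpus in a prior pass.
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();

    // Look up (or assign) a dimension index for a term.
    private int indexOf(String term) {
        return dictionary.computeIfAbsent(term, t -> dictionary.size());
    }

    // Produce term frequencies for one document's token annotations.
    public int[] vectorize(List<String> tokens, int dimensions) {
        int[] vector = new int[dimensions];
        for (String token : tokens) {
            int idx = indexOf(token.toLowerCase());
            if (idx < dimensions) {
                vector[idx]++;  // count this term's occurrence
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        AnnotationVectorizer v = new AnnotationVectorizer();
        List<String> tokens = List.of("UIMA", "on", "Hadoop", "with", "UIMA");
        int[] vec = v.vectorize(tokens, 8);
        System.out.println(vec[0]); // prints 2 ("uima" occurs twice)
    }
}
```

Because the vectorizer only depends on the neutral annotation format, the same step would work whether the annotations came from UIMA or from GATE.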

Any thoughts on this?

Julien


Marc
