Would just like to mention that Behemoth (https://github.com/jnioche/behemoth) can be used to run UIMA on Hadoop. There is also a new project, S4, in incubation at Apache, which could be used for on-the-fly processing, but I haven't looked at the details yet.
Julien

On 6 October 2011 07:01, Thilo Götz <twgo...@gmx.de> wrote:
> Forgot to mention performance. Latency is pretty bad, but once it gets
> going, it's pretty fast in my experience. We get near-linear scale-out
> on multiple nodes. I have less experience with using larger, multi-core
> machines.
>
> So use Hadoop when you have thousands of documents you need to process
> in batch mode, and you can easily replicate your processing pipeline
> multiple times. For those scenarios, it works well. That is not to say
> it couldn't work in other setups as well; I simply never tried it. It
> may well be that you can bring latency down by being a bit cleverer in
> your Hadoop setup, but for the batch scenarios, it's not worth the
> trouble.
>
> --Thilo
>
> On 06/10/11 07:43, Thilo Götz wrote:
> > On 05/10/11 22:43, Marshall Schor wrote:
> >> We use Hadoop with UIMA. Here's the "fit", in one case:
> >>
> >> 1) UIMA runs as the map step; we put the UIMA pipeline into the
> >> mapper. Hadoop has a configure() method where you can stick the
> >> creation and setup of the UIMA pipeline, similar to UIMA's
> >> initialize().
> >>
> >> 2) Write a Hadoop record reader that reads input from Hadoop's
> >> "splits" and creates things that would go into individual CASes.
> >> These are the input to the map step.
> >>
> >> 3) The map takes the input (a string, say), puts it into a CAS, and
> >> then calls the process() method on the engine it set up and
> >> initialized in step 1.
> >>
> >> 4) When the process method returns, the CAS has all the results:
> >> iterate through it, extract whatever you want, stick those values
> >> into your Hadoop output object, and output it.
> >>
> >> 5) The reduce step can take all of these output objects (which can
> >> be sorted as you wish) and do whatever you want with them.
> >
> > That basically sums it up. We (and that's a different "we" than
> > Marshall's) use Hadoop only for batch processing, but since that's
> > the only processing we're currently doing, that works out well. We
> > normally use HDFS as the underlying storage.
> >
> > --Thilo
> >
> >> We usually replicate our data 2x in the Hadoop Distributed File
> >> System, so that big runs don't fail due to single failures of disk
> >> drives.
> >>
> >> HTH. -Marshall
> >>
> >> On 10/5/2011 2:24 PM, Greg Holmberg wrote:
> >>> On Tue, 27 Sep 2011 01:06:02 -0700, Thilo Götz <twgo...@gmx.de> wrote:
> >>>> On 26/09/11 22:31, Greg Holmberg wrote:
> >>>>> This is what I'm doing. I use JavaSpaces (producer/consumer
> >>>>> queue), but I'm sure you can get the same effect with UIMA AS
> >>>>> and ActiveMQ.
> >>>>
> >>>> Or Hadoop.
> >>>
> >>> Thilo, could you expand on this? Exactly how do you use Hadoop to
> >>> scale UIMA?
> >>>
> >>> What storage do you use under Hadoop (HDFS, HBase, Hive, etc.),
> >>> and what is your final storage destination for the CAS data?
> >>>
> >>> Are you doing on-demand, streaming, or batch processing of
> >>> documents?
> >>>
> >>> What are your key/value pairs? URLs? What's your map step, what's
> >>> your reduce step?
> >>>
> >>> How do you partition? Do you find the system is load balanced?
> >>> What level of efficiency do you get? What level of CPU
> >>> utilization?
> >>>
> >>> Do you do just document (UIMA) analysis in Hadoop, or also
> >>> collection (multi-doc) analytics?
> >>>
> >>> The fit between UIMA and Hadoop isn't obvious to me. Just trying
> >>> to figure it out.
> >>>
> >>> Thanks,
> >>>
> >>> Greg Holmberg

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
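
P.S. Marshall's steps 1, 3 and 4 above can be sketched as a Hadoop mapper along the following lines. This is a minimal sketch, not code from either setup: it uses the newer Mapper.setup() rather than the old-API configure(), and the descriptor file name, input format (one document per Text value) and output key/value choice are all assumptions for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.XMLInputSource;

public class UimaMapper extends Mapper<LongWritable, Text, Text, Text> {

  private AnalysisEngine engine;
  private JCas jcas;

  @Override
  protected void setup(Context context) throws IOException {
    try {
      // Step 1: build the UIMA pipeline once per mapper task,
      // analogous to UIMA's initialize(). The descriptor name is
      // a placeholder, not from the thread.
      XMLInputSource in = new XMLInputSource("AggregateDescriptor.xml");
      ResourceSpecifier spec =
          UIMAFramework.getXMLParser().parseResourceSpecifier(in);
      engine = UIMAFramework.produceAnalysisEngine(spec);
      jcas = engine.newJCas();
    } catch (Exception e) {
      throw new IOException("UIMA pipeline initialization failed", e);
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    try {
      // Step 3: one document per map() call goes into the (reused) CAS.
      jcas.reset();
      jcas.setDocumentText(value.toString());
      engine.process(jcas);

      // Step 4: pull results out of the CAS and emit them as
      // Hadoop key/value pairs (here: annotation type -> covered text,
      // an arbitrary choice for the sketch).
      for (Annotation a : jcas.getAnnotationIndex()) {
        context.write(new Text(a.getType().getShortName()),
                      new Text(a.getCoveredText()));
      }
    } catch (AnalysisEngineProcessException e) {
      throw new IOException("UIMA processing failed", e);
    }
  }
}
```

The record reader of step 2 and the reduce of step 5 are independent of UIMA, so any standard InputFormat/Reducer pairing fits around this.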