Would just like to mention that Behemoth (https://github.com/jnioche/behemoth) can be used to run UIMA on Hadoop. There is also a new project, S4, in incubation at Apache, which could be used for on-the-fly processing, but I haven't looked at the details yet.
Julien

On 6 October 2011 07:01, Thilo Götz <twgo...@gmx.de> wrote:
> Forgot to mention performance. Latency is pretty bad, but once it gets
> going, it's pretty fast in my experience. We get near-linear scale-out
> on multiple nodes. I have less experience with using larger, multi-core
> machines.
>
> So use Hadoop when you have thousands of documents you need to process
> in batch mode, and you can easily replicate your processing pipeline
> multiple times. For those scenarios, it works well. That is not to say
> it couldn't work in other setups as well; I simply never tried it. It
> may well be that you can bring latency down by being a bit cleverer in
> your Hadoop setup, but for the batch scenarios, it's not worth the
> trouble.
>
> --Thilo
>
> On 06/10/11 07:43, Thilo Götz wrote:
> > On 05/10/11 22:43, Marshall Schor wrote:
> >> We use Hadoop with UIMA. Here's the "fit", in one case:
> >>
> >> 1) UIMA runs as the map step; we put the UIMA pipeline into the
> >> mapper. Hadoop has a configure() method where you can stick the
> >> creation and setup of the UIMA pipeline, similar to UIMA's
> >> initialize().
> >>
> >> 2) Write a Hadoop record reader that reads input from Hadoop's
> >> "splits" and creates things that would go into individual CASes.
> >> These are the input to the map step.
> >>
> >> 3) The map takes the input (a string, say), puts it into a CAS, and
> >> then calls the process() method on the engine it set up and
> >> initialized in step 1.
> >>
> >> 4) When the process method returns, the CAS has all the results:
> >> iterate through it, extract whatever you want, stick those values
> >> into your Hadoop output object, and output it.
> >>
> >> 5) The reduce step can take all of these output objects (which can
> >> be sorted as you wish) and do whatever you want with them.
> >
> > That basically sums it up. We (and that's a different "we" than
> > Marshall's) use Hadoop only for batch processing, but since that's
> > the only processing we're currently doing, that works out well. We
> > normally use HDFS as the underlying storage.
> >
> > --Thilo
> >
> >> We usually replicate our data 2x in the Hadoop Distributed File
> >> System, so that big runs don't fail due to single failures of disk
> >> drives.
> >>
> >> HTH. -Marshall
> >>
> >> On 10/5/2011 2:24 PM, Greg Holmberg wrote:
> >>> On Tue, 27 Sep 2011 01:06:02 -0700, Thilo Götz <twgo...@gmx.de> wrote:
> >>>> On 26/09/11 22:31, Greg Holmberg wrote:
> >>>>> This is what I'm doing. I use JavaSpaces (producer/consumer
> >>>>> queue), but I'm sure you can get the same effect with UIMA AS
> >>>>> and ActiveMQ.
> >>>>
> >>>> Or Hadoop.
> >>>
> >>> Thilo, could you expand on this? Exactly how do you use Hadoop to
> >>> scale UIMA?
> >>>
> >>> What storage do you use under Hadoop (HDFS, HBase, Hive, etc.),
> >>> and what is your final storage destination for the CAS data?
> >>>
> >>> Are you doing on-demand, streaming, or batch processing of
> >>> documents?
> >>>
> >>> What are your key/value pairs? URLs? What's your map step, what's
> >>> your reduce step?
> >>>
> >>> How do you partition? Do you find the system is load balanced?
> >>> What level of efficiency do you get? What level of CPU
> >>> utilization?
> >>>
> >>> Do you do just document (UIMA) analysis in Hadoop, or also
> >>> collection (multi-doc) analytics?
> >>>
> >>> The fit between UIMA and Hadoop isn't obvious to me. Just trying
> >>> to figure it out.
> >>>
> >>> Thanks,
> >>>
> >>> Greg Holmberg

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
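
P.S. Marshall's steps 1, 3 and 4 above can be sketched as a Hadoop mapper along the following lines. This is a minimal sketch, not code from either setup: it uses the newer Mapper.setup() rather than the old-API configure(), and the descriptor file name, input format (one document per Text value) and output key/value choice are all assumptions for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.XMLInputSource;

public class UimaMapper extends Mapper<LongWritable, Text, Text, Text> {

  private AnalysisEngine engine;
  private JCas jcas;

  @Override
  protected void setup(Context context) throws IOException {
    try {
      // Step 1: build the UIMA pipeline once per mapper task,
      // analogous to UIMA's initialize(). The descriptor name is
      // a placeholder, not from the thread.
      XMLInputSource in = new XMLInputSource("AggregateDescriptor.xml");
      ResourceSpecifier spec =
          UIMAFramework.getXMLParser().parseResourceSpecifier(in);
      engine = UIMAFramework.produceAnalysisEngine(spec);
      jcas = engine.newJCas();
    } catch (Exception e) {
      throw new IOException("UIMA pipeline initialization failed", e);
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    try {
      // Step 3: one document per map() call goes into the (reused) CAS.
      jcas.reset();
      jcas.setDocumentText(value.toString());
      engine.process(jcas);

      // Step 4: pull results out of the CAS and emit them as
      // Hadoop key/value pairs (here: annotation type -> covered text,
      // an arbitrary choice for the sketch).
      for (Annotation a : jcas.getAnnotationIndex()) {
        context.write(new Text(a.getType().getShortName()),
                      new Text(a.getCoveredText()));
      }
    } catch (AnalysisEngineProcessException e) {
      throw new IOException("UIMA processing failed", e);
    }
  }
}
```

The record reader of step 2 and the reduce of step 5 are independent of UIMA, so any standard InputFormat/Reducer pairing fits around this.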