Forgot to mention performance.  Job start-up latency is pretty bad, but
once a job is running, throughput is good in my experience.  We get
near-linear scale-out across multiple nodes.  I have less experience
with larger, multi-core machines.

So use Hadoop when you have thousands of documents to process in batch
mode and you can easily replicate your processing pipeline.  For those
scenarios, it works well.  That is not to say it couldn't work in other
setups as well; I simply never tried it.  It may well be that you can
bring latency down by being a bit cleverer in your Hadoop setup, but for
the batch scenarios, it's not worth the trouble.
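
For concreteness, here is a minimal sketch of what the driver for such a
batch job could look like.  This assumes the newer
org.apache.hadoop.mapreduce API and one document per input line;
UimaBatchJob is a made-up name, and UimaMapper is a hypothetical mapper
class along the lines of Marshall's recipe quoted below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UimaBatchJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "uima-batch");
    job.setJarByClass(UimaBatchJob.class);
    job.setMapperClass(UimaMapper.class);      // hypothetical, sketched further down
    job.setNumReduceTasks(0);                  // map-only; add a reducer for step 5
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormatClass(TextInputFormat.class); // assumes one document per line
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With one document per line in the input files, Hadoop's input splits give
you the fan-out across nodes for free.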

--Thilo

On 06/10/11 07:43, Thilo Götz wrote:
> On 05/10/11 22:43, Marshall Schor wrote:
>> We use Hadoop with UIMA.  Here's the "fit", in one case:
>>
>> 1) UIMA runs as the map step; we put the UIMA pipeline into the mapper.
>> Hadoop's Mapper has a configure() method (setup() in the newer mapreduce
>> API) where you can stick the creation and setup of the UIMA pipeline,
>> similar to UIMA's initialize().
>>
>> 2) Write a Hadoop record reader that reads input from Hadoop's input
>> splits and creates the items that will go into individual CASes.  These
>> are the input to the map step.
>>
>> 3) The map step takes the input (a string, say), puts it into a CAS, and
>> calls the process() method on the engine it set up and initialized in
>> step 1.
>>
>> 4) When process() returns, the CAS has all the results: iterate through
>> it, extract whatever you want, stick those values into your Hadoop
>> output object, and output it.
>>
>> 5) The reduce step can take all of these output objects (which can be
>> sorted as you wish) and do whatever you want with them.
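
To make steps 1) through 4) concrete, here is a minimal sketch of such a
mapper, assuming the plain UIMA Java API and the newer
org.apache.hadoop.mapreduce API.  The descriptor path and the choice of
output (type name plus covered text) are placeholders; in a real job the
descriptor would be shipped to the nodes, e.g. via the distributed cache.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.cas.text.AnnotationFS;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.XMLInputSource;

public class UimaMapper extends Mapper<LongWritable, Text, Text, Text> {

  private AnalysisEngine engine;  // step 1: one pipeline instance per map task
  private CAS cas;                // reused across documents

  @Override
  protected void setup(Context context) throws IOException {
    try {
      // "AggregateAE.xml" is a placeholder for your analysis engine descriptor
      XMLInputSource in = new XMLInputSource("AggregateAE.xml");
      ResourceSpecifier spec =
          UIMAFramework.getXMLParser().parseResourceSpecifier(in);
      engine = UIMAFramework.produceAnalysisEngine(spec);
      cas = engine.newCAS();
    } catch (Exception e) {
      throw new IOException("UIMA pipeline initialization failed", e);
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    cas.reset();
    cas.setDocumentText(value.toString());  // step 3: document text into the CAS
    try {
      engine.process(cas);                  // run the pipeline
    } catch (AnalysisEngineProcessException e) {
      throw new IOException("UIMA processing failed for key " + key, e);
    }
    // step 4: pull results out of the CAS and emit them
    FSIterator<AnnotationFS> it = cas.getAnnotationIndex().iterator();
    while (it.hasNext()) {
      AnnotationFS a = it.next();
      context.write(new Text(a.getType().getName()),
                    new Text(a.getCoveredText()));
    }
  }

  @Override
  protected void cleanup(Context context) {
    engine.destroy();
  }
}

For step 2), TextInputFormat's built-in record reader is enough when each
document is one line; for one-document-per-file input you would write your
own InputFormat/RecordReader.  Step 5) is then a plain Reducer over
whatever key/value pairs the map step emits.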
> 
> That basically sums it up.  We (and that's a different "we" than
> Marshall's) use Hadoop only for batch processing, but since that's the
> only processing we're currently doing, that works out well.  We normally
> use HDFS as the underlying storage.
> 
> --Thilo
> 
>>
>> We usually replicate our data 2x in the Hadoop Distributed File System
>> (HDFS), so that big runs don't fail due to the failure of a single disk
>> drive.
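
The replication factor Marshall mentions is HDFS's standard
dfs.replication setting (the default is 3).  A minimal sketch of
requesting 2x for the files a job writes, assuming the usual Hadoop
Configuration object:

import org.apache.hadoop.conf.Configuration;

// ask HDFS to keep 2 copies of files this job writes (cluster default: 3)
Configuration conf = new Configuration();
conf.setInt("dfs.replication", 2);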
>>
>> HTH. -Marshall
>>
>> On 10/5/2011 2:24 PM, Greg Holmberg wrote:
>>> On Tue, 27 Sep 2011 01:06:02 -0700, Thilo Götz <twgo...@gmx.de> wrote:
>>>
>>>> On 26/09/11 22:31, Greg Holmberg wrote:
>>>>>
>>>>> This is what I'm doing.  I use JavaSpaces (producer/consumer queue),
>>>>> but I'm sure you can get the same effect with UIMA AS and ActiveMQ.
>>>>
>>>> Or Hadoop.
>>>
>>> Thilo, could you expand on this?  Exactly how do you use Hadoop to
>>> scale UIMA?
>>>
>>> What storage do you use under Hadoop (HDFS, HBase, Hive, etc.), and
>>> what is your final storage destination for the CAS data?
>>>
>>> Are you doing on-demand, streaming, or batch processing of documents?
>>>
>>> What are your key/value pairs?  URLs?  What's your map step, what's your
>>> reduce step?
>>>
>>> How do you partition?  Do you find the system is load balanced?  What
>>> level of efficiency do you get?  What level of CPU utilization?
>>>
>>> Do you do just document (UIMA) analysis in Hadoop, or also collection
>>> (multi-doc) analytics?
>>>
>>> The fit between UIMA and Hadoop isn't obvious to me.  Just trying to
>>> figure it out.
>>>
>>> Thanks,
>>>
>>>
>>> Greg Holmberg
>>>
