Each document is processed independently of the others, so that part can
be parallelised and would fit a MapReduce job. Whether it suits this use
case depends on the rate at which new URIs are discovered for processing
and on the acceptable delay in processing a document. The way I see it,
you can batch the URIs and use that batch as the input to a MapReduce job,
with each mapper working on a sublist of the URIs (see the driver sketch
below).
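
For illustration, here is a minimal driver sketch using the newer
org.apache.hadoop.mapreduce API. It assumes the URIs have already been
written one per line to a file in HDFS; the input path, the lines-per-split
figure and the DocumentProcessingMapper class are illustrative placeholders,
not anything from your setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class DocumentProcessingDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "document-processing");
        job.setJarByClass(DocumentProcessingDriver.class);

        // NLineInputFormat splits the input by line count rather than by
        // byte size, so each mapper gets a fixed-size sublist of URIs
        // (one URI per line).
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path("/input/uris.txt"));
        NLineInputFormat.setNumLinesPerSplit(job, 1000);

        job.setMapperClass(DocumentProcessingMapper.class);
        job.setNumReduceTasks(0);             // map-only: inserts from the mapper
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(NullWritable.class);
        job.setOutputFormatClass(NullOutputFormat.class); // results go to the DB, not HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}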
You can choose to make the DB inserts from the mapper itself; in that
case, set the number of reducers to 0 to make the job map-only. Otherwise,
if batching the queries is an option, you can consider making batch
inserts in the reducer instead, which will help reduce the load on the DB.
A mapper sketch for the map-only variant follows.
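
Here is a matching mapper sketch for the map-only variant. The JDBC URL,
the table name and the single INSERT are placeholders standing in for the
real document retrieval, the transformation pipeline and the 600 to 16000
inserts it produces per document.

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DocumentProcessingMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private Connection conn;

    @Override
    protected void setup(Context context) throws IOException {
        try {
            // One connection per mapper, reused across its whole sublist.
            // The URL is a placeholder for your databaseB connection string.
            conn = DriverManager.getConnection("jdbc:yourdb://dbhost/databaseB");
            conn.setAutoCommit(false);
        } catch (SQLException e) {
            throw new IOException("Could not connect to databaseB", e);
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException {
        String uri = line.toString().trim();
        try {
            // Placeholder: retrieve the document at 'uri', run the
            // transformation pipeline, and execute the resulting inserts.
            PreparedStatement stmt =
                    conn.prepareStatement("INSERT INTO processed_docs VALUES (?)");
            stmt.setString(1, uri);
            stmt.executeUpdate();
            stmt.close();
            conn.commit(); // one commit per document keeps failures isolated
        } catch (SQLException e) {
            throw new IOException("Insert failed for " + uri, e);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        try {
            conn.close();
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }
}

One caveat: with many map tasks running at once you are effectively
recreating your 16-consumer setup at a larger scale, so size the splits
(or cap the number of concurrent map tasks) so the mappers do not
overwhelm databaseB.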

- Sharad

Adam Retter wrote:
>
> If I understand correctly, Hadoop forms a general-purpose cluster on
> which you can execute jobs?
>
> We have a Java data processing application here that follows the
> Producer -> Consumer pattern. It has been written with threading as a
> concern from the start, using java.util.concurrent.Callable.
>
> At present the producer is a thread that retrieves a list of document
> URIs from a SQL query against databaseA and adds them to a shared
> (synchronised) queue.
>
> Each consumer is a thread, of which there can be n, but we typically run
> with 16 on the current hardware.
> The consumer sits in a loop, processing the queue until it is empty. It
> removes a document URI from the shared queue, retrieves the document and
> performs a pipeline of transformations on the document, resulting in a
> series of 600 to 16000 SQL insert statements which are then executed
> against databaseB.
>
> I have been reading about both Terracotta and Hadoop. Hadoop appears to
> be the more general-purpose solution that we could use for many
> applications; however, I am not sure how our application would map onto
> Hadoop concepts. I have been studying Hadoop's Map/Reduce approach, but
> our application does not produce any intermediate files that would serve
> as the input/output of the Map/Reduce processes.
>
> Any guidance would be appreciated; it may well be that our application
> is not an appropriate use of Hadoop.
>
>
> Thanks, Adam.
>
> Adam Retter
> Software Developer
> Landmark Information Group
>
> T: 01392 685403 (x5403)
>
> 5-7 Abbey Court, Eagle Way, Sowton,
> Exeter, Devon, EX2 7HY
>
> www.landmark.co.uk
>
>
>
> Registered Office: 7 Abbey Court, Eagle Way, Sowton, Exeter, Devon, EX2 7HY
> Registered Number 2892803 Registered in England and Wales
