us also.
Pulling in data from external machines and then running a pipeline of simple
map/reduces is our standard pattern.
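A minimal sketch of that two-stage pattern, assuming the old
org.apache.hadoop.mapred API: two jobs are chained so that the first stage's
output directory becomes the second stage's input. The identity
mapper/reducer and the path arguments are placeholders for illustration, not
the jobs actually being run.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class TwoStagePipeline {
    public static void main(String[] args) throws IOException {
        Path input  = new Path(args[0]);   // data already pulled into HDFS
        Path stage1 = new Path(args[1]);   // intermediate output
        Path stage2 = new Path(args[2]);   // final output

        // Stage 1: a simple map/reduce over the ingested data.
        JobConf job1 = new JobConf(TwoStagePipeline.class);
        job1.setJobName("stage-1");
        job1.setMapperClass(IdentityMapper.class);
        job1.setReducerClass(IdentityReducer.class);
        job1.setOutputKeyClass(LongWritable.class);
        job1.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job1, input);
        FileOutputFormat.setOutputPath(job1, stage1);
        JobClient.runJob(job1);            // blocks until stage 1 finishes

        // Stage 2: another simple map/reduce over stage 1's output.
        JobConf job2 = new JobConf(TwoStagePipeline.class);
        job2.setJobName("stage-2");
        job2.setMapperClass(IdentityMapper.class);
        job2.setReducerClass(IdentityReducer.class);
        job2.setOutputKeyClass(LongWritable.class);
        job2.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job2, stage1);
        FileOutputFormat.setOutputPath(job2, stage2);
        JobClient.runJob(job2);
    }
}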

Joydeep Sen Sarma wrote:
We have had a lot of peace of mind by building a data pipeline that does
not assume that HDFS is always up and running.  If the application is
primarily non-real-time log processing, I would suggest batch/incremental
copies of data to HDFS that can catch up automatically in case of
failures or downtime.

We have an rsync-like map-reduce job that monitors log directories and
keeps pulling new data in (and I suspect a lot of other users do similar
stuff as well).  It might be a useful notion to generalize and put in
contrib.
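A minimal sketch of such a catch-up copier, written here as a plain driver
program rather than as a map-reduce job (the actual job described above is
not in contrib). It re-scans a local log directory on every run and copies
only the files that are missing from, or shorter than, their HDFS
counterparts, so it simply catches up after an outage. The class name and
paths are placeholders.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LogCatchUp {
    public static void main(String[] args) throws IOException {
        // args[0]: local log directory, args[1]: target directory in HDFS
        Path localDir = new Path(args[0]);
        Path hdfsDir  = new Path(args[1]);

        Configuration conf = new Configuration();
        FileSystem localFs = FileSystem.getLocal(conf);
        FileSystem hdfs    = FileSystem.get(conf);   // namenode from the site config

        hdfs.mkdirs(hdfsDir);

        for (FileStatus log : localFs.listStatus(localDir)) {
            if (log.isDir()) continue;               // skip subdirectories
            Path target = new Path(hdfsDir, log.getPath().getName());

            // Skip files that already made it into HDFS in full.  Because this
            // check is repeated on every run, the copier catches up by itself
            // after an HDFS outage instead of failing permanently.
            if (hdfs.exists(target) && hdfs.getFileStatus(target).getLen() == log.getLen()) {
                continue;
            }
            hdfs.copyFromLocalFile(false, true, log.getPath(), target);
            System.out.println("copied " + log.getPath() + " -> " + target);
        }
    }
}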


-----Original Message-----
From: Steve Sapovits [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 28, 2008 4:54 PM
To: core-user@hadoop.apache.org
Subject: Re: long write operations and data recovery


How does replication affect this?  If there's at least one replicated
client still running, I assume that takes care of it?

Never mind -- I get this now after reading the docs again.

My remaining point of failure question concerns name nodes.  The docs say
manual intervention is still required if a name node goes down.  How is
this typically managed in production environments?  It would seem even a
short name node outage in a data intensive environment would lead to data
loss (no name node to give the data to).
