us also.
Pulling data in from external machines and then running a pipeline of
simple map/reduce jobs is our standard pattern.
Joydeep Sen Sarma wrote:
We have had a lot of peace of mind by building a data pipeline that does
not assume that hdfs is always up and running. If the application is
primarily non real-time log processing - I would suggest
batch/incremental copies of data to hdfs that can catch up automatically
in case of failures/downtimes.
We have an rsync-like map-reduce job that monitors log directories and
keeps pulling new data in (and I suspect a lot of other users do similar
stuff as well). It might be a useful notion to generalize and put in
contrib.
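
The catch-up behaviour described above can be sketched roughly as follows.
This is not the job mentioned in the message, only a minimal illustration
under stated assumptions: a local directory stands in for HDFS, the function
name sync_new_logs is hypothetical, and a file is treated as "new" when it is
missing from the destination or its size differs. Because each pass compares
the two directories rather than relying on saved state, a pass that runs
after an outage simply copies whatever accumulated in the meantime.

```python
import os
import shutil


def sync_new_logs(src_dir, dest_dir):
    """Copy files from src_dir that are missing or stale in dest_dir.

    Hypothetical helper: dest_dir is a local stand-in for an HDFS path.
    Returns the sorted list of file names copied on this pass.
    """
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        if not os.path.isfile(src):
            continue  # this sketch ignores subdirectories
        dest = os.path.join(dest_dir, name)
        # Copy when the destination copy is missing or differs in size.
        if not os.path.exists(dest) or os.path.getsize(dest) != os.path.getsize(src):
            tmp = dest + ".tmp"
            shutil.copyfile(src, tmp)
            # Rename into place so an interrupted copy never leaves a
            # partial file under the final name.
            os.replace(tmp, dest)
            copied.append(name)
    return copied
```

Running the function repeatedly (for example from cron) gives the incremental
behaviour: a pass with nothing new copies nothing, and a pass after downtime
picks up every file that changed or appeared while the destination was
unavailable.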
-----Original Message-----
From: Steve Sapovits [mailto:[EMAIL PROTECTED]
Sent: Thursday, February 28, 2008 4:54 PM
To: core-user@hadoop.apache.org
Subject: Re: long write operations and data recovery
How does replication affect this? If there's at least one replicated
client still running, I assume that takes care of it?
Never mind -- I get this now after reading the docs again.
My remaining point-of-failure question concerns name nodes. The docs
say manual intervention is still required if a name node goes down.
How is this typically managed in production environments? It would
seem even a short name node outage in a data-intensive environment
would lead to data loss (no name node to give the data to).