I agree with Joydeep. For batch processing, it is sufficient to make the application not assume that HDFS is always up and active. However, for real-time applications that are not batch-centric, that might not be sufficient. There are a few things that HDFS could do to better handle Namenode outages:
1. Make clients handle transient Namenode downtime. This requires that Namenode restarts are fast, that clients can handle long Namenode outages, etc.

2. Design the HDFS Namenode to be a set of two: an active one and a passive one. The active Namenode could continuously forward transactions to the passive one; in case of failure of the active Namenode, the passive one could take over. This type of high availability would probably be very necessary for non-batch-type applications.

Thanks,
dhruba

-----Original Message-----
From: Joydeep Sen Sarma [mailto:[EMAIL PROTECTED]
Sent: Thursday, February 28, 2008 6:06 PM
To: core-user@hadoop.apache.org
Subject: RE: long write operations and data recovery

We have had a lot of peace of mind by building a data pipeline that does not assume that HDFS is always up and running. If the application is primarily non-real-time log processing, I would suggest batch/incremental copies of data to HDFS that can catch up automatically in case of failures/downtimes.

We have an rsync-like map-reduce job that monitors log directories and keeps pulling new data in (and I suspect a lot of other users do similar stuff as well). It might be a useful notion to generalize and put in contrib.

-----Original Message-----
From: Steve Sapovits [mailto:[EMAIL PROTECTED]
Sent: Thursday, February 28, 2008 4:54 PM
To: core-user@hadoop.apache.org
Subject: Re: long write operations and data recovery

> How does replication affect this? If there's at least one replicated
> client still running, I assume that takes care of it?

Never mind -- I get this now after reading the docs again.

My remaining point-of-failure question concerns name nodes. The docs say manual intervention is still required if a name node goes down. How is this typically managed in production environments? It would seem even a short name node outage in a data-intensive environment would lead to data loss (no name node to give the data to).

--
Steve Sapovits
Invite Media - http://www.invitemedia.com
[EMAIL PROTECTED]
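To make the client-side handling in Dhruba's point 1 concrete, here is a minimal sketch of an HDFS write wrapped in retry-with-backoff, using the standard FileSystem API. The RetryingHdfsWriter class, the attempt count, and the sleep interval are all made up for illustration; they are not part of the HDFS client.

import java.io.IOException;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RetryingHdfsWriter {

    // Hypothetical helper, not part of HDFS: retry a write while the
    // Namenode is down or restarting. The attempt count and sleep
    // interval are illustrative values, not HDFS defaults.
    public static void writeWithRetry(Path file, byte[] data)
            throws IOException, InterruptedException {
        Configuration conf = new Configuration();
        final int maxAttempts = 10;
        final long sleepMillis = 30L * 1000L;

        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                FileSystem fs = FileSystem.get(conf);
                OutputStream out = fs.create(file);
                try {
                    out.write(data);
                } finally {
                    out.close();
                }
                return; // write succeeded
            } catch (IOException e) {
                // Likely a transient Namenode outage: back off and retry,
                // giving up only after the last attempt.
                if (attempt == maxAttempts) {
                    throw e;
                }
                Thread.sleep(sleepMillis);
            }
        }
    }
}

This only papers over short outages (point 1); long outages still need something like the active/passive pair in point 2 or the batch catch-up approach below.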
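And a minimal, single-process sketch of the catch-up copy Joydeep describes (his version is a map-reduce job). The LogCatchUp class and the local and HDFS directory paths are hypothetical; only files not yet present in HDFS are copied, so a run that fails while HDFS is down is simply picked up by the next scheduled run.

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LogCatchUp {

    // Copy any local log file that does not yet exist in HDFS. If HDFS
    // is down, this run fails and the next run catches up automatically.
    // The directory paths are made up for the example.
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        File localLogDir = new File("/var/log/myapp");
        Path hdfsLogDir = new Path("/data/logs/myapp");

        File[] localFiles = localLogDir.listFiles();
        if (localFiles == null) {
            return; // directory missing or unreadable
        }
        for (File local : localFiles) {
            Path target = new Path(hdfsLogDir, local.getName());
            if (!fs.exists(target)) {
                fs.copyFromLocalFile(new Path(local.getAbsolutePath()), target);
            }
        }
    }
}

Run periodically (e.g. from cron), this gives the property Joydeep is after: the pipeline never assumes HDFS is up, and it catches up on its own after failures or downtime.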