Jason,

FWIW -- based on a daily batch process, requiring 9 Hadoop jobs in
sequence -- 100+2 EC2 nodes, 2 Tb data, 6 hrs run time.

We tend to see a namenode failing early, e.g., the "problem advancing"
exception in the values iterator, particularly during a reduce phase.

Hot-fail would be great. Otherwise, given the duration of our batch
job overall, we use what you describe: shut down cluster, etc.

Would prefer to observe this kind of failure sooner than later. We've
discussed internally how to craft an initial job which could stress
test the namenode.  Think of a "unit test" for the cluster.

The business case for this becomes especially important when you need
to automate the Hadoop cluster launch, e.g. with RightScale or another
"cloud enabler" service.

Anybody else heading in this direction?

Paco


On Tue, Jul 29, 2008 at 11:01 AM, Jason Venner <[EMAIL PROTECTED]> wrote:
> What are people doing?
>
> For jobs that have a long enough SLA, just shutting down the cluster and
> bringing up the secondary as the master works for us.
> We have some jobs where that doesn't work well, because the recovery time is
> not acceptable.
>
> There has been internal discussion of using drdb to hotfail a namenode to a
> backup so that the running job can continue.

Reply via email to