Losing network interfaces during long-running map-reduce jobs

David Howell Fri, 02 Apr 2010 18:17:22 -0700

I'm encountering a completely bizarre failure mode in my Hadoop
cluster. A week ago, I switched from vanilla Apache Hadoop 0.20.1 to
CDH 2.


Ever since then, my tasktracker/datanode machines have been regularly
losing their networking during long (> 1 hour) jobs. Restarting the
network interface brings them back online immediately.

I'm mystified as to how this can be happening: anyone care to venture
a hypothesis? I'm running on CentOS 5.2.

Cheers,
David
