I'm encountering a completely bizarre failure mode in my Hadoop cluster. A week ago, I switched from vanilla apache Hadoop 0.20.1 to CDH 2.
Ever since then, my tasktracker/ datenode machines have been regularly losing their networking during long (> 1 hour) jobs. Restarting the network interface brings them back online immediately. I'm mystified as to how this can be happening: anyone care to venture a hypothesis? I'm running on Centos 5.2. Cheers, David