On Fri, Apr 9, 2010 at 8:18 AM, stephen mulcahy <stephen.mulc...@deri.org> wrote:

> Allen Wittenauer wrote:
>
>> On Apr 8, 2010, at 9:37 AM, stephen mulcahy wrote:
>>
>>> When I run this on the Debian 2.6.32 kernel - over the course of the run,
>>> 1 or 2 datanodes of the cluster enter a state whereby they are no longer
>>> responsive to network traffic.
>>>
>>
>> How much free memory do you have?
>>
>
> Lots, a few GB
>
>
>
>> How many tasks per node do you have?
>>
>
> I left this at the default.
>
>
>
>> What are the service times, etc, on your IO system?
>>
>
> Can you clarify this query?
>
>
>
>>> Has anyone run into similar problems with their environments? I noticed
>>> that when the nodes become unresponsive, it often happens when the
>>> TeraSort is at
>>>
>>
>> I've always seen Linux nodes go unresponsive when they get memory starved
>> to the point that the OOM killer can't function because it can't allocate
>> enough memory.
>>
>
> Sure, but I can log in to the unresponsive nodes via the console - it's just
> the network that has become unresponsive. To be clear here, I don't suspect
> Hadoop is the root cause of the problem - I suspect either a kernel bug or
> some other operating system level bug. I was wondering if others had run
> into similar problems.
>

Most likely a kernel bug. In previous versions of Debian there was a buggy
forcedeth driver, for example, that caused nodes to drop off the network under
high load. Who knows what new bugs are in 2.6.32, which is brand spanking new.


>
> I was also wondering in general what kernel versions and distros people are
> using, especially for larger production clusters.
>
>
The overwhelming majority of production clusters run on RHEL 5.3 or RHEL 5.4
in my experience (I'm lumping CentOS 5.3/5.4 in with RHEL here). I know one
or two production clusters running Debian Lenny, but none running something
as new as what you're talking about. Hadoop doesn't exercise the new
features in very recent kernels, so there's no sense accepting instability -
just go with something old that works!

-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera
