Package: linux-image-2.6.32-5-amd64
Version: 2.6.32-41squeeze2
Severity: important

Hi,

since about December 2011 we've seen systems where SSH sessions suddenly hang
and further logins on the physical TTY or via SSH are no longer possible. In
some cases SSH logins still work and you see the motd and maybe can even issue
one or two commands.
(I brought this issue up on debian-user in March and received a private reply
from a fellow DD yesterday:
http://lists.debian.org/debian-user/2012/03/msg01204.html)


Over time we observed that ssh logins without a pseudo-terminal (ssh -T) still
work. Looking at other sessions, sshd was in state D and the entries in
/dev/pts/ were created correctly.
Searching through munin graphs we could narrow down the starting point of this
issue to the point when the hpet interrupts for one CPU core multiplied,
sometimes by a factor of six. Looking further we found the kernel thread
[events/$x] in state D, where $x is the number of the CPU core which has the
high number of hpet interrupts.
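
The per-core hpet counts we watched in munin can also be read directly from
/proc/interrupts. The following is only a minimal sketch, not the munin plugin
we used: the function names are made up for illustration, and it assumes that
every hpet interrupt line carries "hpet" somewhere in its description columns.

```python
def hpet_counts(interrupts_text):
    """Sum the per-CPU counts of all hpet lines in /proc/interrupts content.

    Returns a list with one total per CPU column.
    """
    lines = interrupts_text.splitlines()
    if not lines:
        return []
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = [0] * ncpu
    for line in lines[1:]:
        fields = line.split()
        # Need the "NN:" label, one count per CPU and a description.
        if len(fields) < ncpu + 2:
            continue
        counts = fields[1:ncpu + 1]
        description = " ".join(fields[ncpu + 1:]).lower()
        # Assumption: hpet lines mention "hpet" in their description.
        if "hpet" not in description:
            continue
        if not all(count.isdigit() for count in counts):
            continue
        for cpu, count in enumerate(counts):
            totals[cpu] += int(count)
    return totals


def hpet_delta(text_before, text_after):
    """Per-CPU increase of hpet interrupts between two snapshots."""
    before = hpet_counts(text_before)
    after = hpet_counts(text_after)
    return [b - a for a, b in zip(before, after)]
```

On a live system one would read /proc/interrupts twice, a fixed interval
apart, and pass both contents to hpet_delta(). A core whose delta is several
times that of the others would match what we saw in the graphs.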

When we ran strace -f on the sshd master process, everything worked until you
logged out. Then you'd again see the forked sshd process hanging in state D.
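
To spot the hanging sshd and [events/$x] tasks without a working interactive
login (for example via ssh -T), the state field of /proc/<pid>/stat can be
scanned. This is a rough sketch; the helper names are hypothetical and not
part of any tool we used.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the content of /proc/<pid>/stat."""
    # comm is enclosed in parentheses and may itself contain spaces or ')',
    # so split at the first '(' and at the last ')'.
    head, _, rest = stat_line.partition("(")
    comm, _, tail = rest.rpartition(")")
    return int(head), comm, tail.split()[0]


def d_state_tasks(proc_root="/proc"):
    """List (pid, comm) for every task currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # task exited meanwhile or the line was not parseable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)
```

Calling d_state_tasks() on an affected machine should list the stuck tasks.
Tasks show up in state D briefly during normal I/O as well, so only entries
that persist across repeated calls are interesting.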

So far we've seen this issue exclusively on Linux 2.6.32 based systems, most
often on Debian/Squeeze, less often on Ubuntu 10.04 and once or twice on a
RHEL 6.1 system.

Searching further I found posts on a Dell PowerEdge mailing list referencing
Red Hat BZ#750201 and Intel CPU erratum AAO67 for Nehalem (rapid C-state
switching).
The Red Hat bug is currently non-public, but through our technical contact at
Red Hat I was able to receive a summary of this bug and of other referenced
bugs, which describe our issue more or less exactly.

According to Red Hat this should be fixed in their kernel 2.6.32-220.7.1.el6,
citing the following in the changelog:
- [x86] hpet: Disable per-cpu hpet timer if ARAT is supported
  (Prarit Bhargava) [772884 750201]
- [x86] Improve TSC calibration using a delayed workqueue
  (Prarit Bhargava) [772884 750201]
- [kernel] clocksource: Add clocksource_register_hz/khz interface
  (Prarit Bhargava) [772884 750201]
- [kernel] clocksource: Provide a generic mult/shift factor calculation
  (Prarit Bhargava) [772884 750201]

(Maybe that helps to track down the relevant changes.)

As a workaround it might help to disable C-states in the BIOS or on the kernel
command line with intel_idle.max_cstate=0 processor.max_cstate=1.
Since we run into this issue only from time to time on the same system, we
could not yet verify either workaround. Rumours indicate that disabling
C-states in the BIOS sometimes did not help because the kernel enabled them
again.
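
Whether the command line workaround is in effect, and whether deeper C-states
are still entered despite it, can be checked roughly as follows. This is an
untested sketch under assumptions: the helper names are hypothetical, the
values compared against are the ones from this report, and the cpuidle sysfs
files (name, usage) are only present if the running kernel provides the
cpuidle interface.

```python
import glob
import os

# Parameter values suggested as a workaround in this report.
WANTED = {"intel_idle.max_cstate": "0", "processor.max_cstate": "1"}


def cstate_params(cmdline):
    """Extract the C-state related parameters from a kernel command line."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in WANTED:
            params[key] = value
    return params


def workaround_active(cmdline):
    """True if both parameters are set to the values from the report."""
    return cstate_params(cmdline) == WANTED


def cpuidle_usage(cpu=0, sysfs="/sys/devices/system/cpu"):
    """Map idle state name to its entry counter for one CPU."""
    usage = {}
    pattern = os.path.join(sysfs, "cpu%d" % cpu, "cpuidle", "state*")
    for state_dir in sorted(glob.glob(pattern)):
        try:
            with open(os.path.join(state_dir, "name")) as handle:
                name = handle.read().strip()
            with open(os.path.join(state_dir, "usage")) as handle:
                usage[name] = int(handle.read())
        except (OSError, ValueError):
            continue  # state directory without readable counters
    return usage
```

One would pass the content of /proc/cmdline to workaround_active() and call
cpuidle_usage() twice some time apart. If the counters of the deeper states
keep growing, the C-states are evidently still in use.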

My current guess is that it's somehow related to the Intel Nehalem CPU bug and
only happens if you have a high single-threaded load, which leads to one or
more cores being switched into the C6 sleep state so that one core can be
overclocked. The marketing name for this is TurboBoost.

Regarding the CPUs I know this happens with:
- Intel X3430
- Intel X3450
- Intel L3426

We see it in almost all cases on Dell R210 with the X3430 CPUs.
Rumours claim it also happens with other Dell models based on other CPUs from
the Intel Nehalem series with TurboBoost.


It would be great if someone could track down the needed changes and
incorporate them into a point release. In general I would be available for
testing, but we still have no way to reproduce it besides waiting a few
months. :(


Regards,
Sven

-- System Information:
Debian Release: 6.0.4
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 3.2.0-0.bpo.1-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages linux-image-2.6.32-5-amd64 depends on:
ii  debconf [debconf-2.0]       1.5.36.1     Debian configuration management sy
ii  initramfs-tools [linux-init 0.99~bpo60+1 tools for generating an initramfs
ii  linux-base                  3.4~bpo60+1  Linux image base package
ii  module-init-tools           3.12-2       tools for managing Linux kernel mo

Versions of packages linux-image-2.6.32-5-amd64 recommends:
pn  firmware-linux-free           <none>     (no description available)

Versions of packages linux-image-2.6.32-5-amd64 suggests:
pn  grub | lilo                   <none>     (no description available)
pn  linux-doc-2.6.32              <none>     (no description available)


