On Tue, Jul 06, 2010 at 10:33:36AM -0500, Brian Moon wrote:
> On no regular schedule, the two production servers will hang. And it is
> a weird hang. They still respond to ping. And TCP connections answer
> (connect) but don't respond.

Most of the time (say, 70%), this indicates a disk subsystem problem (it could be anything disk-related: driver, controller, cable, disk). If the kernel can't (re-)read a portion of program code from disk (say, a previously discarded or not yet loaded memory page for /usr/sbin/sshd), the process is likely to hang on the attempted read for a long time. Ditto for previously swapped-out data pages (but those are arguably less common than discarded code pages).

Less commonly (say, 20%), this is also seen after certain kernel-mode faults ("Oops"): if a process or a thread dies on an unexpected kernel-mode fault while still holding a lock on a resource, the lock is never released, and other processes then block on it.

I "reserved" another 10% for all other possible causes. ;-)

None of the above is OpenVZ-specific.

> There is nothing in syslog on the host server or any containers.

If it's a disk issue, and you only have one RAID array with both the root fs and the logs on it, then logging will likely not work when the issue is triggered - which is why you won't see anything in the logs. Ideally, you'd run "dmesg", but for that you need to be able to run a command.

> There is nothing on the console.

This is not specific enough. ;-) Is the console screen entirely blank or does it show, say, a login prompt? If it's blank, does it get unblanked on a keypress? (I am assuming that you're not running any sort of GUI on the server.)

I recommend that you deactivate the kernel's built-in screensaver by adding:

    echo -e '\033[9;0]'

to /etc/rc.d/rc.local, and also by issuing this command on the running system with output redirected to /dev/console and/or /dev/tty0. Then the console will display the last messages even if the kernel is locked up so badly that a keypress would not unblank the screen.

In another message, you mentioned that you were using a serial console. That's great. Were you referring to it when you said that there was nothing on the console?
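
To make the screensaver advice above concrete, here is a minimal sketch of applying the same escape sequence to a running system. The function name disable_blanking is made up for illustration, and the sketch assumes a shell whose printf understands octal escapes (any POSIX shell should). It skips any target that is not writable instead of failing.

```shell
# Hypothetical helper: write the "blank timeout = 0" console escape
# sequence (the same one as in the echo command above) to each given tty.
disable_blanking() {
    for tty in "$@"; do
        # Skip devices that are missing or not writable by this user.
        if [ -w "$tty" ]; then
            printf '\033[9;0]' > "$tty"
        fi
    done
}
```

As root, you would call it as "disable_blanking /dev/console /dev/tty0". The same printf line, redirected to /dev/console, could go into /etc/rc.d/rc.local instead of the echo.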

If you suspect that the console might not be working well enough (e.g., not being "quick enough" to capture the last messages before the kernel locks up too badly to continue logging even to the console), you could also try netconsole (it sends kernel messages over UDP, which a remote syslogd can receive). In our experience, netconsole usually eliminates the need for serial consoles.

> It sounds like a resource issue.

No, it does not. You seem to have plenty of RAM, and I assume that you have reasonable privvmpages and kmemsize limits set up, right? If so, it is unlikely that a container would unintentionally cause resource starvation this bad.

> Linux atl-vz1 2.6.18-028stab056 #1 SMP Tue Jun 30 07:50:32 EDT 2009
> x86_64 Intel(R) Xeon(R) CPU E5420 @ 2.50GHz GenuineIntel GNU/Linux

You really ought to upgrade to a more recent rhel5 branch kernel, although we've been successfully running both older and newer OpenVZ rhel5 kernels on Dell 2950s without running into any issues. We've always done custom builds (our own CONFIG_* settings), though.

> # vzlist -o ctid,kmemsize,kmemsize.l -s kmemsize

These limits are low enough (for an x86_64 system); no problem here.

I hope this helps.

<plug>
You may also consider outsourcing your sysadmin issues to us. We're quite used to installing and managing OpenVZ-based servers remotely, which we've been doing for years.
</plug>

Alexander