Bug#699692: [Pkg-openssl-devel] Bug#699692: "Illegal instruction" when installing (creating certificate) with Wheezy's version 1.0.1c-4 of openssl on a system with Cyrix MII / IBM 6x86
Hello Kurt,

thanks for your support and for providing the packages! This made testing very easy for me, and I can tell you that it works perfectly.

Setting the environment variable was NOT a full workaround, as there are processes like postgresql which use libssl internally and from a non-interactive shell. This even prevented postgresql from starting, as it calculates a key at each startup.

By the way, the internal capability flags have the following values:

OPENSSL_ia32cap_P[0] = 0x0080A535 (same as before)
OPENSSL_ia32cap_P[1] = 0x0

Was the value of OPENSSL_ia32cap_P[1] a "random" value previously due to wrong assumptions when reading?

Best regards,
Hans-Juergen Mauser

--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
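For readers who want to see what the first capability word contains, here is a minimal Python sketch. It assumes that OPENSSL_ia32cap_P[0] mirrors the EDX feature word of CPUID leaf 1 and uses Intel's published bit numbering; the table and the function name `decode_edx` are illustrative and not part of OpenSSL. Bits without an entry in the table are reported by number only.

```python
# Bit positions of CPUID leaf 1 EDX (Intel numbering) for a few features.
# Assumption: OPENSSL_ia32cap_P[0] mirrors this word on the reporter's system.
EDX_BITS = {
    0: "fpu", 2: "de", 4: "tsc", 5: "msr", 8: "cx8",
    13: "pge", 15: "cmov", 23: "mmx", 25: "sse", 26: "sse2",
}


def decode_edx(word):
    """Return a name (or 'bitN') for every set bit of a 32-bit word."""
    return [EDX_BITS.get(bit, f"bit{bit}") for bit in range(32) if (word >> bit) & 1]


if __name__ == "__main__":
    # Value reported in the message above.
    print(decode_edx(0x0080A535))
```

Under that assumption, the named bits correspond to the fpu, de, tsc, msr, cx8, pge, cmov and mmx flags listed in the /proc/cpuinfo output elsewhere in this thread.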
Bug#699692: [Pkg-openssl-devel] Bug#699692: "Illegal instruction" when installing (creating certificate) with Wheezy's version 1.0.1c-4 of openssl on a system with Cyrix MII / IBM 6x86
Hello Kurt,

thanks for your reply and the proposal of a fix! Is there a possibility to obtain a daily build of the library anywhere, or do I have to compile it myself? If possible, I'd like to keep efforts simple here... but if I have to, I will compile it.

Thanks,
Hans-Juergen Mauser
Bug#699692: [Pkg-openssl-devel] Bug#699692: "Illegal instruction" when installing (creating certificate) with Wheezy's version 1.0.1c-4 of openssl on a system with Cyrix MII / IBM 6x86
Hello Kurt,

thanks for your hint - by disabling this capability flag (bit 62) for RDRAND, it works perfectly!

The _unmodified_ content (without env-var override) of the capability variables looks like this:

OPENSSL_ia32cap_P[0] = 0x0080A535
OPENSSL_ia32cap_P[1] = 0x64616574

As a workaround, I will set the override variable globally to stay compatible with the most recent versions - however, it would be good if a fix could be integrated into the code itself, as former versions are working. If no fix is possible, documentation would be an alternative, as I think that even for advanced users these control variables are very special, internal details of openssl which are anything but obvious.

Thanks and best regards,
Hans-Juergen Mauser
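The bit arithmetic behind this workaround can be sketched in a few lines of Python. The helper names below are hypothetical. The `~0x...` form for clearing bits through the OPENSSL_ia32cap environment variable follows my reading of the OpenSSL manual page and should be checked against the installed version. The last helper simply shows the reported second word as ASCII bytes.

```python
import struct

RDRAND_BIT = 62  # bit number quoted in the message above


def combine(p0, p1):
    """Join the two 32-bit words into one 64-bit capability vector."""
    return (p1 << 32) | p0


def has_bit(vector, bit):
    """True if the given bit is set in the vector."""
    return bool((vector >> bit) & 1)


def disable_mask(bit):
    """String for OPENSSL_ia32cap that clears one bit (assumed '~' syntax)."""
    return f"~0x{1 << bit:x}"


def as_ascii(word):
    """Interpret a 32-bit word as four little-endian ASCII bytes."""
    return struct.pack("<I", word).decode("ascii", errors="replace")


if __name__ == "__main__":
    vector = combine(0x0080A535, 0x64616574)
    print(has_bit(vector, RDRAND_BIT))   # is the RDRAND bit set?
    print(disable_mask(RDRAND_BIT))      # candidate override value
    print(as_ascii(0x64616574))          # second word viewed as text
```

One observation, offered with caution: the bytes of 0x64616574 read "tead", which matches the end of the vendor string "CyrixInstead" shown in the /proc/cpuinfo output in this thread. That would be consistent with the second word having been filled from the vendor-string query rather than from the feature query, but the thread itself does not confirm this.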
Bug#699692: [Pkg-openssl-devel] Bug#699692: "Illegal instruction" when installing (creating certificate) with Wheezy's version 1.0.1c-4 of openssl on a system with Cyrix MII / IBM 6x86
Hello Kurt,

see the required information attached to this mail. If I can provide further information, don't hesitate to ask me.

Thanks and best regards,
Hans-Juergen Mauser

Starting program: /usr/bin/openssl req -new -x509 -out cert.pem -days 3650
Generating a 2048 bit RSA private key

Program received signal SIGILL, Illegal instruction.
0xb7dfa3f5 in OPENSSL_ia32_rdrand () at x86cpuid.s:333
333     x86cpuid.s: No such file or directory.
(gdb) bt
#0  0xb7dfa3f5 in OPENSSL_ia32_rdrand () at x86cpuid.s:333
(gdb) frame
#0  0xb7dfa3f5 in OPENSSL_ia32_rdrand () at x86cpuid.s:333
333     in x86cpuid.s
(gdb) disas
Dump of assembler code for function OPENSSL_ia32_rdrand:
   0xb7dfa3f0 <+0>:     mov    $0x8,%ecx
=> 0xb7dfa3f5 <+5>:     rdrand %eax
   0xb7dfa3f8 <+8>:     jb     0xb7dfa3fc
   0xb7dfa3fa <+10>:    loop   0xb7dfa3f5
   0xb7dfa3fc <+12>:    cmp    $0x0,%eax
   0xb7dfa3ff <+15>:    cmove  %ecx,%eax
   0xb7dfa402 <+18>:    ret
End of assembler dump.
(gdb)
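The backtrace shows the fault at the rdrand instruction. A quick cross-check is whether the kernel itself lists an "rdrand" flag for the CPU. The following is a small Python sketch with hypothetical helper names; it works on the text of /proc/cpuinfo so that it can be tried on any saved copy. The sample line is the flags line from the /proc/cpuinfo output posted in this thread.

```python
def cpu_flags(cpuinfo_text):
    """Collect the entries of the first 'flags' line in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def kernel_reports_rdrand(cpuinfo_text):
    """True if the kernel lists 'rdrand' among the CPU flags."""
    return "rdrand" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    # Flags line as posted for the Cyrix M II in this thread.
    sample = "flags : fpu de tsc msr cx8 pge cmov mmx cyrix_arr\n"
    print(kernel_reports_rdrand(sample))
```

On a live system the same function could be fed the contents of /proc/cpuinfo. If the kernel does not list the flag but the library still executes the instruction, the library's own detection is the likely place to look.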
Bug#699692: [Pkg-openssl-devel] Bug#699692: "Illegal instruction" when installing (creating certificate) with Wheezy's version 1.0.1c-4 of openssl on a system with Cyrix MII / IBM 6x86
A little amendment: openssl 0.9.8 also uses the "cmov" library version. I cross-checked this after generating the debug information for you.

Best regards,
Hans-Juergen Mauser

linux-gate.so.1 => (0xb778)
libssl.so.0.9.8 => /usr/lib/i686/cmov/libssl.so.0.9.8 (0xb7721000)
libcrypto.so.0.9.8 => /usr/lib/i686/cmov/libcrypto.so.0.9.8 (0xb75c9000)
libdl.so.2 => /lib/i386-linux-gnu/i686/cmov/libdl.so.2 (0xb75c5000)
libz.so.1 => /lib/i386-linux-gnu/libz.so.1 (0xb75ac000)
libc.so.6 => /lib/i386-linux-gnu/i686/cmov/libc.so.6 (0xb7449000)
/lib/ld-linux.so.2 (0xb7781000)
Bug#699692: [Pkg-openssl-devel] Bug#699692: "Illegal instruction" when installing (creating certificate) with Wheezy's version 1.0.1c-4 of openssl on a system with Cyrix MII / IBM 6x86
Hello Kurt,

thanks for your reply. Of course I would like to help; see the required information (and a little bit more) attached to this mail.

The "cmov" instruction by itself should not be the problem, as this CPU supports it - but maybe the presence of cmov suggests using other instructions which are missing?

I hope my information helps you a bit - if I can provide further details, don't hesitate to ask!

Best regards,
Hans-Juergen Mauser

Kurt Roeckx wrote:
> On Sun, Feb 03, 2013 at 05:47:35PM +0100, Hans-Juergen Mauser wrote:
> > Package: openssl
> > Version: 1.0.1c-4
> >
> > Hello! When trying a new Wheezy install on a machine with Cyrix MII / IBM 6x86 CPU, openssl cannot complete its install routine because the creation of the default certificate fails reproducibly with the result "illegal instruction". It seems as if the package is compiled with some optimisation not suitable for a regular Pentium machine.
>
> libssl is actually compiled 3 times: once for the default architecture, which is i486, once for i586, and once for i686 with cmov. The dynamic linker should pick up the correct one. Can you verify which version you pick up? You can see this with:
>
> ldd /usr/bin/openssl
>
> Can you also show /proc/cpuinfo?
>
> openssl also contains hand-written assembler, which detects CPU capabilities as well. Maybe something is broken there.
>
> In any case, it would be useful if you could give information about which function has the problem. Can you install libssl1.0.0-dbg, run whatever you wanted to do from gdb, and give me a backtrace?
>
> Kurt

processor       : 0
vendor_id       : CyrixInstead
cpu family      : 6
model           : 2
model name      : M II 3.5x Core/Bus Clock
stepping        : 8
cpu MHz         : 233.895
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu de tsc msr cx8 pge cmov mmx cyrix_arr
bogomips        : 467.79
clflush size    : 32
cache_alignment : 32
address sizes   : 32 bits physical, 32 bits virtual
power management:

[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Linux version 3.2.0-4-486 (debian-ker...@lists.debian.org) (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 Debian 3.2.35-2
[0.00] BIOS-provided physical RAM map:
[0.00] BIOS-e820: - 000a (usable)
[0.00] BIOS-e820: 000f - 0010 (reserved)
[0.00] BIOS-e820: 0010 - 2000 (usable)
[0.00] BIOS-e820: - 0001 (reserved)
[0.00] Notice: NX (Execute Disable) protection missing in CPU!
[0.00] DMI 2.0 present.
[0.00] DMI: System Manufacturer System Name/P/I-P55T2P4, BIOS #401A0-0207-204/28/99
[0.00] e820 update range: - 0001 (usable) ==> (reserved)
[0.00] e820 remove range: 000a - 0010 (usable)
[0.00] last_pfn = 0x2 max_arch_pfn = 0x10
[0.00] initial memory mapped : 0 - 0180
[0.00] Base memory trampoline at [c009c000] 9c000 size 12288
[0.00] init_memory_mapping: -2000
[0.00] 00 - 002000 page 4k
[0.00] kernel direct mapping tables up to 2000 @ 177d000-180
[0.00] RAMDISK: 1f68 - 2000
[0.00] ACPI Error: A valid RSDP was not found (20110623/tbxfroot-219)
[0.00] 0MB HIGHMEM available.
[0.00] 512MB LOWMEM available.
[0.00] mapped low ram: 0 - 2000
[0.00] low ram: 0 - 2000
[0.00] Zone PFN ranges:
[0.00] DMA 0x0010 -> 0x1000
[0.00] Normal 0x1000 -> 0x0002
[0.00] HighMem empty
[0.00] Movable zone start PFN for each node
[0.00] early_node_map[2] active PFN ranges
[0.00] 0: 0x0010 -> 0x00a0
[0.00] 0: 0x0100 -> 0x0002
[0.00] On node 0 totalpages: 130960
[0.00] free_area_init_node: node 0, pgdat c13c5488, node_mem_map df27f200
[0.00] DMA zone: 32 pages used for memmap
[0.00] DMA zone: 0 pages reserved
[0.00] DMA zone: 3952 pages, LIFO batch:0
[0.00] Normal zone: 992 pages used for memmap
[0.00] Normal zone: 125984 pages, LIFO batch:31
[0.00] Using APIC driver default
[0.00] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[0.00] No local APIC present or hardware disabled
[0.00] APIC: disable apic facility
[0.00] APIC: switched to apic NOOP
[0.00] nr_irqs_gsi: 16
[0.00] PM: Registered nosave memory: 000a - 000f
[0.00] PM: Registered nosave memory:
Bug#699693: "Illegal instruction" when installing (creating certificate) with Wheezy's version 1:6.0p1-3 of openssh-server on a system with Cyrix MII / IBM 6x86
Package: openssh-server
Version: 1:6.0p1-3

Hello!

When trying a new Wheezy install on a machine with Cyrix MII / IBM 6x86 CPU, openssh-server cannot complete its install routine because the creation of the default certificate fails reproducibly with the result "illegal instruction". It seems as if the package is compiled with some optimisation not suitable for a regular Pentium machine.

I suggest checking the compilation options, as encryption is a basic requirement of any system and should not be a privilege of the most recent processors. SSL and/or SSH failing at install time even prevents establishing a secure remote connection to take care of the problem.

Expected behaviour: installation / certificate creation succeeds, as it should be possible to install Debian on any Pentium or (with manual work) even on a 486 machine. The MII/6x86 is supposed to be similar to a Pentium Pro in its instruction set, and from my experience, anything optimised for "686" is able to run on this CPU.

WORKAROUND: Using openssh-server 1:5.5p1-6+squeeze2 allows the installation to complete.

THEORETICAL DANGER: Maybe this "illegal instruction" problem can also occur on more recent CPUs when performing a new Debian installation / certificate creation, and is possibly "hidden" as long as existing certificates are present!

Please resolve this problem by removing "special" optimisations or by providing a separate package for "older" processors.

Best regards,
Hans-Juergen Mauser
Bug#699692: "Illegal instruction" when installing (creating certificate) with Wheezy's version 1.0.1c-4 of openssl on a system with Cyrix MII / IBM 6x86
Package: openssl
Version: 1.0.1c-4

Hello!

When trying a new Wheezy install on a machine with Cyrix MII / IBM 6x86 CPU, openssl cannot complete its install routine because the creation of the default certificate fails reproducibly with the result "illegal instruction". It seems as if the package is compiled with some optimisation not suitable for a regular Pentium machine.

I suggest checking the compilation options, as encryption is a basic requirement of any system and should not be a privilege of the most recent processors. SSL and/or SSH failing at install time even prevents establishing a secure remote connection to take care of the problem.

Expected behaviour: installation / certificate creation succeeds, as it should be possible to install Debian on any Pentium or (with manual work) even on a 486 machine. The MII/6x86 is supposed to be similar to a Pentium Pro in its instruction set, and from my experience, anything optimised for "686" is able to run on this CPU.

WORKAROUND: Using openssl 0.9.8o-4squeeze13 allows the installation to complete.

THEORETICAL DANGER: Maybe this "illegal instruction" problem can also occur on more recent CPUs when performing a new Debian installation / certificate creation, and is possibly "hidden" as long as existing certificates are present!

Please resolve this problem by removing "special" optimisations or by providing a separate package for "older" processors.

Best regards,
Hans-Juergen Mauser
Bug#678443: Hard lockups due to "lockup-detector" (NMIs) on multi-Pentium-3 SMP systems on all kernel builds since 2.6.38
Hello,

thanks for your reply. Due to a lot of work "at work", I have not yet managed to report the bug, but I will do so soon.

Today I want to add my current uptime and interrupt state for a last time, as I might have to power down the system in a few days for maintenance (and anyway want to put an end to the compelled uptime watching related to this bug).

In addition to the flawless uptime, the complete system and all running tasks have proven to be absolutely flawless over this amount of time. (Well, that's the way I expect it from a Linux operating system as long as no very risky software is running - but it also confirms that the hardware really has no problems and that my problems were only related to the "lockup detector".) Even the number of shared interrupts and their dependencies on the APIC system and correct driver implementations don't hurt. No kernel errors have been logged since 17 July, and these were link down/up messages due to a switch reboot...

netfinity5000:~$ uptime
 17:14:06 up 46 days, 14 min, 2 users, load average: 0,05, 0,06, 0,05

netfinity5000:~$ cat /proc/interrupts
      CPU0       CPU1
  0:  49  0  IO-APIC-edge  timer
  1:  3  0  IO-APIC-edge  i8042
  6:  3  0  IO-APIC-edge  floppy
  7:  1  0  IO-APIC-edge  parport0
  8:  0  0  IO-APIC-edge  rtc0
  9:  0  0  IO-APIC-fasteoi  acpi
 12:  1  3  IO-APIC-edge  i8042
 14:  42  74  IO-APIC-edge  ata_generic
 15:  0  0  IO-APIC-edge  ata_generic
 16:  49  48  IO-APIC-fasteoi  aic7xxx, aic7xxx
 17:  154500925  154495377  IO-APIC-fasteoi  eth0
 18:26575282728937  IO-APIC-fasteoi  megaraid, ohci_hcd:usb2
 19:  69807511  69703638  IO-APIC-fasteoi  eth1
 22:  91578533  91635430  IO-APIC-fasteoi  ehci_hcd:usb1, ohci_hcd:usb3, ohci_hcd:usb4, eth2, eth3
NMI:  1  1  Non-maskable interrupts
LOC:  262393426  323398808  Local timer interrupts
SPU:  0  0  Spurious interrupts
PMI:  0  0  Performance monitoring interrupts
IWI:  0  0  IRQ work interrupts
RTR:  2  0  APIC ICR read retries
RES:67917116755464  Rescheduling interrupts
CAL:12316441607457  Function call interrupts
TLB:  859984  805603  TLB shootdowns
TRM:  0  0  Thermal event interrupts
THR:  0  0  Threshold APIC interrupts
MCE:  0  0  Machine check exceptions
MCP:  13251  13251  Machine check polls
ERR:  0
MIS:  0

netfinity5000:~$ free
                    total     used     free  shared  buffers  cached
Mem:              2074804  1340228   734576       0   294672  805404
-/+ buffers/cache:          240152  1834652
Swap:             1943860        0  1943860

Best regards,
Hans-Juergen Mauser
Bug#678443: Hard lockups due to "lockup-detector" (NMIs) on multi-Pentium-3 SMP systems on all kernel builds since 2.6.38
Hello,

Ben Hutchings wrote:
> [...]
>
> I think it's fine and has nothing to do with the problem.
>
> Since you say it has taken 1-8 days for any problem to appear, I suppose
> you will have to wait a few weeks to have some confidence that
> 'nowatchdog' makes a difference.

Well, even if you think it has nothing to do with the problem, I am now almost sure it has. Nothing is more evident than uptime:

netfinity5000:~# uptime
 21:39:39 up 21 days, 4:39, 2 users, load average: 0,13, 0,10, 0,07

For comparison, see the last mail I added to this bug: the maximum continuous operation time was no more than about 8 days.

It would be great if someone took care of this bug; maybe there are other people getting hit by this who are not able to track it down. Would you recommend that I report this on bugzilla.kernel.org?

Thanks and best regards,
Hans-Juergen
Bug#678443: Hard lockups due to "lockup-detector" (NMIs) on multi-Pentium-3 SMP systems on all kernel builds since 2.6.38
Hello,

the system is currently reaching an uptime that was hardly possible before setting "nowatchdog":

netfinity5000:~# uptime
 13:43:12 up 10 days, 20:43, 2 users, load average: 0,01, 0,08, 0,07

When we reach 14 days or more, we will know that it's really the watchdog/NMI "feature" causing these SMP systems to lock up intermittently, but quite deterministically, after an uptime of 1 to 8 days.

To avoid any side effects while testing, I did not change anything on the system except this kernel boot parameter after the last lockup 10 days ago. No software updates, no additional change to the kernel (this means the current kernel produced at least one "successful" lockup, as I had tried various configurations and versions before the hint about the NMI/watchdog issue gained my full attention).

After months of frustration, I have quite a detailed impression of this misbehaviour, and nothing has ever made me feel as confident in the restored reliability as setting this boot parameter.
Here is my current interrupt state:

netfinity5000:~# cat /proc/interrupts
      CPU0       CPU1
  0:  49  0  IO-APIC-edge  timer
  1:  3  0  IO-APIC-edge  i8042
  6:  3  0  IO-APIC-edge  floppy
  7:  1  0  IO-APIC-edge  parport0
  8:  0  0  IO-APIC-edge  rtc0
  9:  0  0  IO-APIC-fasteoi  acpi
 12:  1  3  IO-APIC-edge  i8042
 14:  42  74  IO-APIC-edge  ata_generic
 15:  0  0  IO-APIC-edge  ata_generic
 16:  49  48  IO-APIC-fasteoi  aic7xxx, aic7xxx
 17:  19391683  19362804  IO-APIC-fasteoi  eth0
 18:  649647  660452  IO-APIC-fasteoi  megaraid, ohci_hcd:usb2
 19:87614728704241  IO-APIC-fasteoi  eth1
 22:  11804557  11924853  IO-APIC-fasteoi  ehci_hcd:usb1, ohci_hcd:usb3, ohci_hcd:usb4, eth2, eth3
NMI:  1  1  Non-maskable interrupts
LOC:  62410645  76099188  Local timer interrupts
SPU:  0  0  Spurious interrupts
PMI:  0  0  Performance monitoring interrupts
IWI:  0  0  IRQ work interrupts
RTR:  2  0  APIC ICR read retries
RES:16280561619691  Rescheduling interrupts
CAL:  293382  396292  Function call interrupts
TLB:  211292  194994  TLB shootdowns
TRM:  0  0  Thermal event interrupts
THR:  0  0  Threshold APIC interrupts
MCE:  0  0  Machine check exceptions
MCP:  3129  3129  Machine check polls
ERR:  0
MIS:  0

Here are my boot parameters and the reboot date since which the system has been running flawlessly:

Jun 20 17:01:49 netfinity5000 kernel: [0.00] Kernel command line: auto BOOT_IMAGE=Linux ro root=UUID=338417b5-b8c8-47ed-97ee-2ebc9c8afee8 aic7xxx=no_reset,allow_memio pci=bios,use_crs,routeirq libata.force=mwdma2 reboot=warm rootdelay=30 nowatchdog

Just for comparison: before this, reboots/lockups occurred on June 4th, June 6th, June 7th, June 8th, June 11th, June 13th, June 15th and June 20th.

If you need more information, like a full kernel boot log or whatever, just ask me.

Thanks and best regards,
Hans-Juergen
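To confirm on other machines that the workaround is actually in effect, the kernel command line can be checked for the parameter. The following is a minimal Python sketch; the function name is hypothetical, and treating "nmi_watchdog=0" as an equivalent switch is an assumption based on my reading of the kernel parameter documentation, not something stated in this thread.

```python
def watchdog_disabled(cmdline):
    """True if the command line contains 'nowatchdog' or 'nmi_watchdog=0'."""
    tokens = cmdline.split()
    return "nowatchdog" in tokens or "nmi_watchdog=0" in tokens


if __name__ == "__main__":
    # Command line as logged in the message above.
    logged = (
        "auto BOOT_IMAGE=Linux ro root=UUID=338417b5-b8c8-47ed-97ee-2ebc9c8afee8 "
        "aic7xxx=no_reset,allow_memio pci=bios,use_crs,routeirq "
        "libata.force=mwdma2 reboot=warm rootdelay=30 nowatchdog"
    )
    print(watchdog_disabled(logged))
```

On a running system the same check could be applied to the contents of /proc/cmdline. Matching whole tokens avoids false positives from parameters that merely contain the word.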
Bug#678443: Hard lockups due to "lockup-detector" (NMIs) on multi-Pentium-3 SMP systems on all kernel builds since 2.6.38
Hello,

Ben Hutchings wrote:
> On Thu, 2012-06-21 at 21:25 +0200, Hans-Juergen Mauser wrote:
> [...]
> > Here we see again how bad the documentation of open-source projects sometimes is cared about... even when configuring a kernel, the config help says that the nmi watchdog had to be enabled consciously by a boot parameter
>
> I don't see any documentation saying that; maybe you're looking at the wrong version. But thanks for the general criticism anyway, it really helps to motivate developers.

Sorry, that wasn't meant negatively. I know from my own work that it happens - but on the other hand, as a Linux enthusiast, I often ask myself how an "average" user should be able to handle this. And you are right, I mixed up two locations: in the current kernel source the config help is correct, but the information files are still partly wrong, and that's where I took it from: http://www.kernel.org/doc/Documentation/nmi_watchdog.txt

> > - in fact it seems to be activated by default as soon as SMP code is loaded and/or an APIC is detected (but though the presence of an APIC, I have not seen those NMIs on my uniprocessor P3 machines yet).
>
> It actually depends on whether the processor has a PMU (performance monitoring unit) with a useful counter.

Okay, I found at least one system which _does_ "count NMIs" - just for learning, I will take a look at the differences between the systems and the running kernel versions/configurations.

> [...]
>
> I think it's fine and has nothing to do with the problem. Since you say it has taken 1-8 days for any problem to appear, I suppose you will have to wait a few weeks to have some confidence that 'nowatchdog' makes a difference.

That's what I would like to do and also will do; there won't be any other reason to reboot the machine which gets hit by the problem most often. As soon as a definite difference (or definitely the same behaviour) is visible, I will post a reply here.
Anyway, I just wanted to be able to discuss the problem and initially posted it as a reply to the bug referenced above, but I was given a hint that I should open a new one.

At least the bug has one good side for me: I got used again to compiling my own kernels, which I had abandoned with the advent of the 2.6 series and the change to initrds in most distributions, which made me use only pre-packaged binaries for consistency among a number of machines and for simplicity. Now I am happy to be able to optimise some details again or choose other options than the distribution team.

Best regards,
Hans-Juergen
Bug#678443: Hard lockups due to "lockup-detector" (NMIs) on multi-Pentium-3 SMP systems on all kernel builds since 2.6.38
Package: linux-2.6
Version: 2.6.38-5

Hello!

(Background information: the motivation for this bug report comes from bug 639331, which already seems to have dug more deeply into the version introducing this behaviour.)

The NMI watchdog mechanism (as I know today, it is the probable source) has given me serious headaches since Debian kernel 2.6.38 was released. I cannot say so definitely yet, as it is an intermittent error in my case which may take up to a week to appear once, and I only disabled the NMI watchdog mechanism by adding "nowatchdog" yesterday (20120620), when I came across the bug report mentioned above.

A short summary of my problem:

- among several uniprocessor systems with Debian and Ubuntu, I am running several older multiprocessor servers (IBM Netfinity 5000 (Dual-P3 (Coppermine)), IBM Netfinity 7000 M10 (Quad-P3-Xeon (Tanner)) and IBM xSeries 232 (Dual-P3 (Tualatin))) with Debian (using testing as "rolling release" after a long time with lenny)

- the systems were running rock-solid up to and including the Debian-packaged kernel 2.6.32 (all sub-versions).

- when Debian-packaged kernel 2.6.38 came out, my problem started and appeared mainly on the Netfinity 5000 (but less often also on the other systems): after running continuously for one to eight days, the system suddenly locked up hard; in most cases it was just idle when this happened

- this lockup was a classic livelock, which can be diagnosed nicely on these IBM machines as they have activity LEDs for each CPU, which glowed with identical brightness and without any modulation, so both CPUs were switching between each other with short cycles

- when comparing the basic system data and properties, I noticed a difference between kernel 2.6.32 and 2.6.38: the latter caused a continuously rising NMI count on each CPU which could not be seen with 2.6.32! Today I know where these NMIs are coming from: it is the watchdog mechanism also causing your laptop problem

- I hoped that the problem might disappear with kernel 3.4, as there were a few discussions on LKML about several livelocks/deadlocks related to timers and the like (the config change concerning the "lockup detector", which got enabled between 2.6.32 and 2.6.38, went unnoticed by me)

- as you see it on the laptop, this lockup NEVER allows any message to get out via the debugging mechanisms, not even by attaching a serial cable and logging the console output on a second machine

- now using kernel 3.4.2, the problem still exists, but has changed a bit in its consequences - instead of a livelock, it is a deadlock in most cases and activity stays on a single CPU, sometimes even causing a reboot instead of staying locked up

- on a German forum I described the problem, but nobody could point me to this lockup-detector change in the kernel config, though I posted this significant change from "no NMIs" to "continuous NMIs". Here we see again how poorly the documentation of open-source projects is sometimes maintained... even when configuring a kernel, the config help says that the NMI watchdog has to be enabled deliberately by a boot parameter - in fact it seems to be activated by default as soon as SMP code is loaded and/or an APIC is detected (but despite the presence of an APIC, I have not seen those NMIs on my uniprocessor P3 machines yet).

Here is a link to my description on the German "debianforum": http://debianforum.de/forum/viewtopic.php?f=33&t=134210

I would like to report the bug to http://bugzilla.kernel.org if it has not yet been done by someone else. Therefore it would be great if you could give me a short note if you have reported it already.

Basically, I think this mechanism has its bugs and/or wrong assumptions on some machines and should undergo a critical review.
I'm wondering if there are more people in the world getting upset by strange lockups of their machines which are wrongly diagnosed as "hardware errors" etc.

Thanks and best regards,
Hans-Juergen
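The symptom described in this report, an NMI count that keeps rising on each CPU under the newer kernel, can be checked by comparing two samples of /proc/interrupts. The following is a small Python sketch with hypothetical helper names; it works on the file's text so that saved copies from different kernels can be compared.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU counters from the 'NMI:' row, or [] if absent."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # reached the description text
                counts.append(int(field))
            return counts
    return []


def nmi_rising(before, after):
    """True if any CPU's NMI counter increased between two samples."""
    return any(old < new for old, new in zip(before, after))


if __name__ == "__main__":
    earlier = "NMI: 1 1 Non-maskable interrupts\n"
    later = "NMI: 1 1 Non-maskable interrupts\n"
    print(nmi_rising(nmi_counts(earlier), nmi_counts(later)))
```

With the watchdog disabled, the NMI rows posted elsewhere in this thread stay at 1 on both CPUs across samples taken many days apart, which is the kind of result this comparison is meant to show.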
Bug#639331: linux-image-2.6.36-rc6-686-bigmem: Closing laptop lid hangs the system on Dell studio 1555
Hello!

I am very happy to have found this bug report, as it is possible that the NMI watchdog mechanism has given me serious headaches since Debian kernel 2.6.38 was released! I cannot say so definitely yet, as it is an intermittent error in my case which may take up to a week to appear once, and I only disabled the NMI watchdog mechanism by adding "nowatchdog" a few hours ago, when I came across this bug report.

A short summary of my problem:

- among several uniprocessor systems with Debian and Ubuntu, I am running several older multiprocessor servers (IBM Netfinity 5000 (Dual-P3), IBM Netfinity 7000 M10 (Quad-P3-Xeon) and IBM xSeries 232 (Dual P3-Tualatin)) with Debian (using testing as "rolling release" after a long time with lenny)

- the systems were running rock-solid up to and including the Debian-packaged kernel 2.6.32

- when Debian-packaged kernel 2.6.38 came out, my problem started and appeared mainly on the Netfinity 5000 (but less often also on the other systems): after running continuously for one to eight days, the system suddenly locked up hard; in most cases it was just idle when this happened

- this lockup was a classic livelock, which can be diagnosed nicely on these IBM machines as they have activity LEDs for each CPU, which glowed with identical brightness and without any modulation, so both CPUs were switching between each other with short cycles

- when comparing the basic system data and properties, I noticed a difference between kernel 2.6.32 and 2.6.38: the latter caused a continuously rising NMI count on each CPU which could not be seen with 2.6.32! Today I know where these NMIs are coming from: it is the watchdog mechanism also causing your laptop problem

- I hoped that the problem might disappear with kernel 3.4, as there were a few discussions on LKML about several livelocks/deadlocks related to timers and the like (the config change concerning the "lockup detector", which got enabled between 2.6.32 and 2.6.38, went unnoticed by me)

- as you see it on the laptop, this lockup NEVER allows any message to get out via the debugging mechanisms, not even by attaching a serial cable and logging the console output on a second machine

- now using kernel 3.4.2, the problem still exists, but has changed a bit in its consequences - instead of a livelock, it is a deadlock in most cases and activity stays on a single CPU, sometimes even causing a reboot instead of staying locked up

- on a German forum I described the problem, but nobody could point me to this lockup-detector change in the kernel config, though I posted this significant change from "no NMIs" to "continuous NMIs". Here we see again how poorly the documentation of open-source projects is sometimes maintained... even when configuring a kernel, the config help says that the NMI watchdog has to be enabled deliberately by a boot parameter - in fact it seems to be activated by default as soon as SMP code is loaded and/or an APIC is detected (but despite the presence of an APIC, I have not seen those NMIs on my uniprocessor P3 machines yet).

Here is a link to my description on the German "debianforum": http://debianforum.de/forum/viewtopic.php?f=33&t=134210

I would like to report the bug to http://bugzilla.kernel.org if it has not yet been done by someone else. Therefore it would be great if you could give me a short note if you have reported it already.

Basically, I think this mechanism has its bugs and/or wrong assumptions on some machines and should undergo a critical review.
I'm wondering if there are more people in the world getting upset by strange lockups of their machines which are wrongly diagnosed as "hardware errors" etc.

Hope to hear from you soon!

Thanks and best regards,
Hans-Juergen