Bug#699692: [Pkg-openssl-devel] Bug#699692: "Illegal instruction" when installing (creating certificate) with Wheezy's version 1.0.1c-4 of openssl on a system with Cyrix MII / IBM 6x86

2013-03-10 Thread Hans-Juergen Mauser

Hello Kurt,

thanks for your support and for providing the packages! This made
testing very easy for me - and I can tell you that it works perfectly.


Setting the environment variable was NOT a full workaround, as there are
processes like postgresql which use libssl internally and from a
non-interactive shell; this prevented postgresql from even starting, as
it calculates a key at each startup.


By the way, the internal capability flags have the following values:

OPENSSL_ia32cap_P[0] = 0x0080A535 (same as before)
OPENSSL_ia32cap_P[1] = 0x0

Was the value of OPENSSL_ia32cap_P[1] a "random" value previously due to 
wrong assumptions when reading?
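
A possible hint regarding this question, not a confirmed diagnosis: read
as four little-endian ASCII bytes, the value 0x64616574 reported earlier
in this bug spells "tead", which matches the end of the vendor_id
"CyrixInstead" shown in /proc/cpuinfo for this CPU. CPUID leaf 0 returns
the vendor string in EBX, EDX and ECX, so the old value would be
consistent with ECX from leaf 0 having been taken as the capability
word. The Python sketch below only demonstrates the decoding and that
bit 30 of that word (bit 62 overall) happens to be set; the helper names
are hypothetical and it says nothing about what the OpenSSL code does.

```python
import struct

def word_as_ascii(word):
    """Interpret a 32-bit register value as four little-endian ASCII bytes."""
    return struct.pack("<I", word).decode("ascii")

def bit_is_set(word, bit):
    """Return True if the given bit (0 = least significant) is set."""
    return (word >> bit) & 1 == 1

OLD_CAP_WORD_1 = 0x64616574  # OPENSSL_ia32cap_P[1] as reported before the fix
VENDOR_ID = "CyrixInstead"   # vendor_id from /proc/cpuinfo in this bug

text = word_as_ascii(OLD_CAP_WORD_1)
print(text)                            # the four characters hidden in the value
print(VENDOR_ID.endswith(text))        # does it match the vendor string tail?
print(bit_is_set(OLD_CAP_WORD_1, 30))  # bit 30 of word 1 is bit 62 overall
```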


Best regards,

Hans-Juergen Mauser


--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#699692: [Pkg-openssl-devel] Bug#699692: "Illegal instruction" when installing (creating certificate) with Wheezy's version 1.0.1c-4 of openssl on a system with Cyrix MII / IBM 6x86

2013-03-10 Thread Hans-Juergen Mauser

Hello Kurt,

thanks for your reply and the proposal of a fix!

Is there a possibility to obtain a daily build of the library anywhere,
or do I have to compile it myself? If possible, I'd like to keep the
effort low here... but if I have to, I will compile it.




Thanks,

Hans-Juergen Mauser





Bug#699692: [Pkg-openssl-devel] Bug#699692: "Illegal instruction" when installing (creating certificate) with Wheezy's version 1.0.1c-4 of openssl on a system with Cyrix MII / IBM 6x86

2013-03-02 Thread Hans-Juergen Mauser

Hello Kurt,

thanks for your hint - by disabling this capability flag (bit 62) for 
RDRAND, it works perfectly!


The _unmodified_ content (without env-var override) of the capability
variables looks like this:


OPENSSL_ia32cap_P[0] = 0x0080A535
OPENSSL_ia32cap_P[1] = 0x64616574


As a workaround, I will set the override variable globally to be
compatible with the most recent versions - however, it would be good if
a fix could be integrated into the code itself, as earlier versions work.
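
For readers who want to reproduce the workaround, the arithmetic behind
"bit 62" can be sketched in Python as below. The function names are
hypothetical. The variable name OPENSSL_ia32cap and the "~mask" form for
clearing bits are assumptions taken from the OpenSSL documentation and
should be checked against the installed version before relying on them.

```python
RDRAND_BIT = 62  # bit number named in this thread

def clear_bit_mask(bit):
    """Return the 64-bit hex mask with only `bit` set, for a '~mask' override."""
    return f"0x{1 << bit:016x}"

def split_bit(bit):
    """Map a 64-bit capability bit to (word index, bit within the 32-bit word)."""
    return divmod(bit, 32)

mask = clear_bit_mask(RDRAND_BIT)
# Assumed override syntax: '~' asks for the listed bits to be cleared.
print(f"OPENSSL_ia32cap=~{mask}")
print(split_bit(RDRAND_BIT))
```

The second line of output shows why the flag lives in
OPENSSL_ia32cap_P[1]: bit 62 is bit 30 of the second 32-bit word.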


If no fix is possible, documentation would be an alternative as I think 
that even for advanced users, these control variables are very special, 
internal details of openssl which are anything but obvious.



Thanks and best regards,

Hans-Juergen Mauser





Bug#699692: [Pkg-openssl-devel] Bug#699692: "Illegal instruction" when installing (creating certificate) with Wheezy's version 1.0.1c-4 of openssl on a system with Cyrix MII / IBM 6x86

2013-03-01 Thread Hans-Juergen Mauser

Hello Kurt,

please find the requested information attached to this mail. If I can
provide further information, don't hesitate to ask me.


Thanks and best regards,

Hans-Juergen Mauser
Starting program: /usr/bin/openssl req -new -x509 -out cert.pem -days 3650
Generating a 2048 bit RSA private key

Program received signal SIGILL, Illegal instruction.
0xb7dfa3f5 in OPENSSL_ia32_rdrand () at x86cpuid.s:333
333 x86cpuid.s: No such file or directory.
(gdb) bt
#0  0xb7dfa3f5 in OPENSSL_ia32_rdrand () at x86cpuid.s:333
(gdb) frame
#0  0xb7dfa3f5 in OPENSSL_ia32_rdrand () at x86cpuid.s:333
333 in x86cpuid.s
(gdb) disas
Dump of assembler code for function OPENSSL_ia32_rdrand:
   0xb7dfa3f0 <+0>:  mov    $0x8,%ecx
=> 0xb7dfa3f5 <+5>:  rdrand %eax
   0xb7dfa3f8 <+8>:  jb     0xb7dfa3fc
   0xb7dfa3fa <+10>: loop   0xb7dfa3f5
   0xb7dfa3fc <+12>: cmp    $0x0,%eax
   0xb7dfa3ff <+15>: cmove  %ecx,%eax
   0xb7dfa402 <+18>: ret
End of assembler dump.
(gdb) 


Bug#699692: [Pkg-openssl-devel] Bug#699692: "Illegal instruction" when installing (creating certificate) with Wheezy's version 1.0.1c-4 of openssl on a system with Cyrix MII / IBM 6x86

2013-02-03 Thread Hans-Juergen Mauser
A little amendment: openssl 0.9.8 also uses the "cmov" library version, 
I cross-checked this after generating the debug information for you.


Best regards,

Hans-Juergen Mauser
linux-gate.so.1 =>  (0xb778)
libssl.so.0.9.8 => /usr/lib/i686/cmov/libssl.so.0.9.8 (0xb7721000)
libcrypto.so.0.9.8 => /usr/lib/i686/cmov/libcrypto.so.0.9.8 (0xb75c9000)
libdl.so.2 => /lib/i386-linux-gnu/i686/cmov/libdl.so.2 (0xb75c5000)
libz.so.1 => /lib/i386-linux-gnu/libz.so.1 (0xb75ac000)
libc.so.6 => /lib/i386-linux-gnu/i686/cmov/libc.so.6 (0xb7449000)
/lib/ld-linux.so.2 (0xb7781000)
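
The ldd output above can also be checked mechanically. The following
Python sketch parses a few of those lines and reports which build
variant each library was resolved to. The function names and the
"/i586/" directory test are hypothetical; only the "/i686/cmov/" layout
is taken from the output in this mail.

```python
import re

# Lines copied from the ldd output in this mail.
LDD_OUTPUT = """\
libssl.so.0.9.8 => /usr/lib/i686/cmov/libssl.so.0.9.8 (0xb7721000)
libcrypto.so.0.9.8 => /usr/lib/i686/cmov/libcrypto.so.0.9.8 (0xb75c9000)
libz.so.1 => /lib/i386-linux-gnu/libz.so.1 (0xb75ac000)
"""

def resolved_paths(ldd_text):
    """Map each library name to the path the dynamic linker resolved it to."""
    paths = {}
    for line in ldd_text.splitlines():
        match = re.match(r"\s*(\S+) => (/\S+)", line)
        if match:
            paths[match.group(1)] = match.group(2)
    return paths

def variant(path):
    """Guess the build variant from the directory name (assumed layout)."""
    if "/i686/cmov/" in path:
        return "i686+cmov"
    if "/i586/" in path:  # assumption: directory name for the i586 build
        return "i586"
    return "default"

for name, path in resolved_paths(LDD_OUTPUT).items():
    print(name, variant(path))
```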


Bug#699692: [Pkg-openssl-devel] Bug#699692: "Illegal instruction" when installing (creating certificate) with Wheezy's version 1.0.1c-4 of openssl on a system with Cyrix MII / IBM 6x86

2013-02-03 Thread Hans-Juergen Mauser

Hello Kurt,

thanks for your reply. Of course I'd like to help; see the requested
information (and a little bit more) attached to this mail.


The "cmov" instruction by itself should not be the problem as this CPU 
supports it - but maybe the presence of cmov suggests using other 
instructions which are missing?
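
One way to test this suspicion is to compare the flags line of
/proc/cpuinfo with the instructions a code path needs. The Python sketch
below uses the flags line attached to this mail; the function names are
hypothetical, and the list of "required" flags is only an illustration
(rdrand is the instruction the gdb trace elsewhere in this bug points
at), not a statement about what openssl actually checks.

```python
# Flags line as shown in the attached /proc/cpuinfo.
CPUINFO_FLAGS_LINE = "flags   : fpu de tsc msr cx8 pge cmov mmx cyrix_arr"

def parse_flags(line):
    """Return the set of flag names from a /proc/cpuinfo 'flags' line."""
    _, _, value = line.partition(":")
    return set(value.split())

def missing(required, flags):
    """Return the required flags that the CPU does not advertise, sorted."""
    return sorted(set(required) - flags)

flags = parse_flags(CPUINFO_FLAGS_LINE)
print("cmov" in flags)
# Illustrative requirement list, not taken from the openssl sources.
print(missing(["cmov", "rdrand", "sse2"], flags))
```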


Hope my information helps you a bit - if I can provide further details, 
don't hesitate to ask!


Best regards,

Hans-Juergen Mauser




Kurt Roeckx wrote:

> On Sun, Feb 03, 2013 at 05:47:35PM +0100, Hans-Juergen Mauser wrote:
> > Package: openssl
> > Version: 1.0.1c-4
> >
> > Hello!
> >
> > When trying a new Wheezy install on a machine with a Cyrix MII / IBM
> > 6x86 CPU, openssl cannot complete its install routine because the
> > creation of the default certificate fails reproducibly with the
> > result "illegal instruction".
> >
> > It seems as if the package is compiled with some optimisation not
> > suitable for a regular Pentium machine.
>
> libssl is actually compiled 3 times: once for the default
> architecture, which is i486, once for i586, and once for
> i686 with cmov.  The dynamic linker should pick up the correct
> one.
>
> Can you verify which version you pick up?  You can see this
> with:
> ldd /usr/bin/openssl
>
> Can you also show /proc/cpuinfo?
>
> openssl also contains hand-written assembler, which detects CPU
> capabilities as well.  Maybe something is broken there.
>
> In any case, it would be useful if you could give information
> about which function it was that has the problem.  Can you
> install libssl1.0.0-dbg and run whatever you wanted to
> do from gdb and give me a backtrace?
>
>
> Kurt

processor   : 0
vendor_id   : CyrixInstead
cpu family  : 6
model   : 2
model name  : M II 3.5x Core/Bus Clock
stepping: 8
cpu MHz : 233.895
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 1
wp  : yes
flags   : fpu de tsc msr cx8 pge cmov mmx cyrix_arr
bogomips: 467.79
clflush size: 32
cache_alignment : 32
address sizes   : 32 bits physical, 32 bits virtual
power management:

[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Linux version 3.2.0-4-486 (debian-ker...@lists.debian.org) (gcc 
version 4.6.3 (Debian 4.6.3-14) ) #1 Debian 3.2.35-2
[0.00] BIOS-provided physical RAM map:
[0.00]  BIOS-e820:  - 000a (usable)
[0.00]  BIOS-e820: 000f - 0010 (reserved)
[0.00]  BIOS-e820: 0010 - 2000 (usable)
[0.00]  BIOS-e820:  - 0001 (reserved)
[0.00] Notice: NX (Execute Disable) protection missing in CPU!
[0.00] DMI 2.0 present.
[0.00] DMI: System Manufacturer System Name/P/I-P55T2P4, BIOS 
#401A0-0207-204/28/99
[0.00] e820 update range:  - 0001 (usable) 
==> (reserved)
[0.00] e820 remove range: 000a - 0010 (usable)
[0.00] last_pfn = 0x2 max_arch_pfn = 0x10
[0.00] initial memory mapped : 0 - 0180
[0.00] Base memory trampoline at [c009c000] 9c000 size 12288
[0.00] init_memory_mapping: -2000
[0.00]  00 - 002000 page 4k
[0.00] kernel direct mapping tables up to 2000 @ 177d000-180
[0.00] RAMDISK: 1f68 - 2000
[0.00] ACPI Error: A valid RSDP was not found (20110623/tbxfroot-219)
[0.00] 0MB HIGHMEM available.
[0.00] 512MB LOWMEM available.
[0.00]   mapped low ram: 0 - 2000
[0.00]   low ram: 0 - 2000
[0.00] Zone PFN ranges:
[0.00]   DMA  0x0010 -> 0x1000
[0.00]   Normal   0x1000 -> 0x0002
[0.00]   HighMem  empty
[0.00] Movable zone start PFN for each node
[0.00] early_node_map[2] active PFN ranges
[0.00] 0: 0x0010 -> 0x00a0
[0.00] 0: 0x0100 -> 0x0002
[0.00] On node 0 totalpages: 130960
[0.00] free_area_init_node: node 0, pgdat c13c5488, node_mem_map 
df27f200
[0.00]   DMA zone: 32 pages used for memmap
[0.00]   DMA zone: 0 pages reserved
[0.00]   DMA zone: 3952 pages, LIFO batch:0
[0.00]   Normal zone: 992 pages used for memmap
[0.00]   Normal zone: 125984 pages, LIFO batch:31
[0.00] Using APIC driver default
[0.00] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[0.00] No local APIC present or hardware disabled
[0.00] APIC: disable apic facility
[0.00] APIC: switched to apic NOOP
[0.00] nr_irqs_gsi: 16
[0.00] PM: Registered nosave memory: 000a - 000f
[0.00] PM: Registered nosave memory: 

Bug#699693: "Illegal instruction" when installing (creating certificate) with Wheezy's version 1:6.0p1-3 of openssh-server on a system with Cyrix MII / IBM 6x86

2013-02-03 Thread Hans-Juergen Mauser

Package: openssh-server
Version: 1:6.0p1-3

Hello!

When trying a new Wheezy install on a machine with a Cyrix MII / IBM 6x86
CPU, openssh-server cannot complete its install routine because the
creation of the default certificate fails reproducibly with the result
"illegal instruction".


It seems as if the package is compiled with some optimisation not
suitable for a regular Pentium machine.


I suggest checking the compilation options, as encryption is a basic
requirement of any system and should not be restricted to the most recent
processors. SSL and/or SSH failing at install time even prevents
establishing a secure remote connection to take care of the problem.


Expected behaviour: installation / certificate creation succeeds, as
it should be possible to install Debian on any Pentium or (with manual
work) even on a 486 machine. The MII/6x86 is supposed to be similar to a
Pentium Pro in its instruction set, and in my experience, anything
optimised for "686" is able to run on this CPU.


WORKAROUND: Using openssh-server 1:5.5p1-6+squeeze2 allows installation 
completion.



THEORETICAL DANGER: Maybe this "illegal instruction" problem can also 
occur on more recent CPUs when performing a new Debian installation / 
certificate creation and is possibly "hidden" as long as existing 
certificates are present!



Please resolve this problem by removing "special" optimisations or by 
providing a separate package for "older" processors.



Best regards,

Hans-Juergen Mauser





Bug#699692: "Illegal instruction" when installing (creating certificate) with Wheezy's version 1.0.1c-4 of openssl on a system with Cyrix MII / IBM 6x86

2013-02-03 Thread Hans-Juergen Mauser

Package: openssl
Version: 1.0.1c-4

Hello!

When trying a new Wheezy install on a machine with a Cyrix MII / IBM 6x86
CPU, openssl cannot complete its install routine because the creation
of the default certificate fails reproducibly with the result "illegal
instruction".


It seems as if the package is compiled with some optimisation not
suitable for a regular Pentium machine.


I suggest checking the compilation options, as encryption is a basic
requirement of any system and should not be restricted to the most recent
processors. SSL and/or SSH failing at install time even prevents
establishing a secure remote connection to take care of the problem.


Expected behaviour: installation / certificate creation succeeds, as
it should be possible to install Debian on any Pentium or (with manual
work) even on a 486 machine. The MII/6x86 is supposed to be similar to a
Pentium Pro in its instruction set, and in my experience, anything
optimised for "686" is able to run on this CPU.


WORKAROUND: Using openssl 0.9.8o-4squeeze13 allows installation completion.


THEORETICAL DANGER: Maybe this "illegal instruction" problem can also 
occur on more recent CPUs when performing a new Debian installation / 
certificate creation and is possibly "hidden" as long as existing 
certificates are present!



Please resolve this problem by removing "special" optimisations or by 
providing a separate package for "older" processors.



Best regards,

Hans-Juergen Mauser





Bug#678443: Hard lockups due to "lockup-detector" (NMIs) on multi-Pentium-3 SMP systems on all kernel builds since 2.6.38

2012-08-05 Thread Hans-Juergen Mauser

Hello,

thanks for your reply. Due to a lot of work "at work", I did not yet 
manage to report the bug, but I will do so soon.


Today I want to add my current uptime and interrupt state one last
time, as I might have to power down the system in a few days for
maintenance (and anyway want to put an end to the compelled uptime
watching related to this bug). In addition to the uninterrupted uptime,
the complete system and all running tasks have proven to be absolutely
flawless over this period (well, that's what I expect from a Linux
operating system as long as no very risky software is running - but it
also confirms that the hardware really has no problems and that my
problems were only related to the "lockup detector"). Even the number of
shared interrupts and their dependencies on the APIC system and correct
driver implementations don't hurt. No kernel errors have been logged
since 17 July, and those were only link down/up messages due to a switch
reboot...



netfinity5000:~$ uptime
 17:14:06 up 46 days, 14 min,  2 users,  load average: 0,05, 0,06, 0,05


netfinity5000:~$ cat /proc/interrupts
   CPU0   CPU1
  0: 49  0   IO-APIC-edge  timer
  1:  3  0   IO-APIC-edge  i8042
  6:  3  0   IO-APIC-edge  floppy
  7:  1  0   IO-APIC-edge  parport0
  8:  0  0   IO-APIC-edge  rtc0
  9:  0  0   IO-APIC-fasteoi   acpi
 12:  1  3   IO-APIC-edge  i8042
 14: 42 74   IO-APIC-edge  ata_generic
 15:  0  0   IO-APIC-edge  ata_generic
 16: 49 48   IO-APIC-fasteoi   aic7xxx, aic7xxx
 17:  154500925  154495377   IO-APIC-fasteoi   eth0
 18:26575282728937   IO-APIC-fasteoi   megaraid, ohci_hcd:usb2
 19:   69807511   69703638   IO-APIC-fasteoi   eth1
 22:   91578533   91635430   IO-APIC-fasteoi   ehci_hcd:usb1, 
ohci_hcd:usb3, ohci_hcd:usb4, eth2, eth3

NMI:  1  1   Non-maskable interrupts
LOC:  262393426  323398808   Local timer interrupts
SPU:  0  0   Spurious interrupts
PMI:  0  0   Performance monitoring interrupts
IWI:  0  0   IRQ work interrupts
RTR:  2  0   APIC ICR read retries
RES:67917116755464   Rescheduling interrupts
CAL:12316441607457   Function call interrupts
TLB: 859984 805603   TLB shootdowns
TRM:  0  0   Thermal event interrupts
THR:  0  0   Threshold APIC interrupts
MCE:  0  0   Machine check exceptions
MCP:  13251  13251   Machine check polls
ERR:  0
MIS:  0


netfinity5000:~$ free
 total   used   free sharedbuffers cached
Mem:   20748041340228 734576  0 294672 805404
-/+ buffers/cache: 2401521834652
Swap:  1943860  0    1943860


Best regards,

Hans-Juergen Mauser





Bug#678443: Hard lockups due to "lockup-detector" (NMIs) on multi-Pentium-3 SMP systems on all kernel builds since 2.6.38

2012-07-11 Thread Hans-Juergen Mauser

Hello,


Ben Hutchings wrote:
> [...]
>
> I think it's fine and has nothing to do with the problem.
>
> Since you say it has taken 1-8 days for any problem to appear, I suppose
> you will have to wait a few weeks to have some confidence that
> 'nowatchdog' makes a difference.


well, even if you think it has nothing to do with the problem, now I am 
almost sure it has. Nothing is more evident than uptime:


netfinity5000:~# uptime
 21:39:39 up 21 days,  4:39,  2 users,  load average: 0,13, 0,10, 0,07

For comparison, see the last mail I added to this bug: before, the
maximum continuous operation time was no more than about 8 days.


It would be great if anyone took care of this bug, maybe there are other 
people getting hit by this and not being able to track it down.


Would you recommend that I report this on bugzilla.kernel.org?

Thanks and best regards,

Hans-Juergen








Bug#678443: Hard lockups due to "lockup-detector" (NMIs) on multi-Pentium-3 SMP systems on all kernel builds since 2.6.38

2012-07-01 Thread Hans-Juergen Mauser

Hello,

currently the system starts reaching an amount of uptime that was hardly 
possible before setting "nowatchdog":


netfinity5000:~# uptime
 13:43:12 up 10 days, 20:43,  2 users,  load average: 0,01, 0,08, 0,07

When we reach 14 days or more, we will know that it's really the
watchdog/NMI "feature" causing these SMP systems to lock up
intermittently, but quite deterministically, after an uptime of 1 to 8 days.


To avoid any side-effects while testing, I did not change anything on
the system except this kernel boot parameter after the last lockup
10 days ago. No software updates, no additional change to the kernel
(this means the current kernel produced at least one "successful" lockup,
as I had tried various configurations and versions before the hint about
the NMI/watchdog issue gained my full attention).


After months of frustration, I have quite a detailed impression of this
misbehaviour, and nothing has ever made me feel as confident in restored
reliability as setting this boot parameter.


Here is my current interrupt state:

netfinity5000:~# cat /proc/interrupts
   CPU0   CPU1
  0: 49  0   IO-APIC-edge  timer
  1:  3  0   IO-APIC-edge  i8042
  6:  3  0   IO-APIC-edge  floppy
  7:  1  0   IO-APIC-edge  parport0
  8:  0  0   IO-APIC-edge  rtc0
  9:  0  0   IO-APIC-fasteoi   acpi
 12:  1  3   IO-APIC-edge  i8042
 14: 42 74   IO-APIC-edge  ata_generic
 15:  0  0   IO-APIC-edge  ata_generic
 16: 49 48   IO-APIC-fasteoi   aic7xxx, aic7xxx
 17:   19391683   19362804   IO-APIC-fasteoi   eth0
 18: 649647 660452   IO-APIC-fasteoi   megaraid, ohci_hcd:usb2
 19:87614728704241   IO-APIC-fasteoi   eth1
 22:   11804557   11924853   IO-APIC-fasteoi   ehci_hcd:usb1, 
ohci_hcd:usb3, ohci_hcd:usb4, eth2, eth3

NMI:  1  1   Non-maskable interrupts
LOC:   62410645   76099188   Local timer interrupts
SPU:  0  0   Spurious interrupts
PMI:  0  0   Performance monitoring interrupts
IWI:  0  0   IRQ work interrupts
RTR:  2  0   APIC ICR read retries
RES:16280561619691   Rescheduling interrupts
CAL: 293382 396292   Function call interrupts
TLB: 211292 194994   TLB shootdowns
TRM:  0  0   Thermal event interrupts
THR:  0  0   Threshold APIC interrupts
MCE:  0  0   Machine check exceptions
MCP:   3129   3129   Machine check polls
ERR:  0
MIS:  0

Here are my boot parameters and the reboot date since which the system 
has been running flawlessly:


Jun 20 17:01:49 netfinity5000 kernel: [0.00] Kernel command 
line: auto BOOT_IMAGE=Linux ro 
root=UUID=338417b5-b8c8-47ed-97ee-2ebc9c8afee8 
aic7xxx=no_reset,allow_memio pci=bios,use_crs,routeirq 
libata.force=mwdma2 reboot=warm rootdelay=30 nowatchdog


Just for comparison: before this, reboots/lockups occurred on June 4th,
June 6th, June 7th, June 8th, June 11th, June 13th, June 15th and June 20th.
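
The comparison can be made explicit with a little date arithmetic. The
Python sketch below computes the gaps between the lockup dates listed
above; the year 2012 is taken from the date of this mail, and the
function name is hypothetical.

```python
from datetime import date

# Lockup/reboot dates listed above (June 2012).
LOCKUPS = [date(2012, 6, day) for day in (4, 6, 7, 8, 11, 13, 15, 20)]

def gaps_in_days(dates):
    """Return the number of days between each pair of consecutive dates."""
    return [(later - earlier).days for earlier, later in zip(dates, dates[1:])]

gaps = gaps_in_days(LOCKUPS)
print(gaps)       # [2, 1, 1, 3, 2, 2, 5]
print(max(gaps))  # 5
```

The longest gap in June was 5 days, so the current uptime of more than
10 days is about twice as long as any interval in that list.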



If you need more information like a full kernel boot log or whatever, 
just ask me.



Thanks and best regards,

Hans-Juergen








Bug#678443: Hard lockups due to "lockup-detector" (NMIs) on multi-Pentium-3 SMP systems on all kernel builds since 2.6.38

2012-06-22 Thread Hans-Juergen Mauser

Hello,

Ben Hutchings wrote:
> On Thu, 2012-06-21 at 21:25 +0200, Hans-Juergen Mauser wrote:
> [...]
> > Here we see
> > again how bad the documentation of open-source projects sometimes is
> > cared about... even when configuring a kernel, the config help says that
> > the nmi watchdog had to be enabled consciously by a boot parameter
>
> I don't see any documentation saying that; maybe you're looking at the
> wrong version.  But thanks for the general criticism anyway, it really
> helps to motivate developers.


Sorry, that wasn't meant negatively. I know from my own work that it
happens - but on the other hand, as a Linux enthusiast, I often ask
myself how an "average" user should be able to handle this.
And you are right, I mixed up two locations: in the current kernel
source the config help is correct, but the information files are still
partly wrong, and that's where I took it from:


http://www.kernel.org/doc/Documentation/nmi_watchdog.txt


> > - in
> > fact it seems to be activated by default as soon as SMP code is loaded
> > and/or an APIC is detected (but though the presence of an APIC, I have
> > not seen those NMIs on my uniprocessor P3 machines yet).
>
> It actually depends on whether the processor has a PMU (performance
> monitoring unit) with a useful counter.


Okay, I found at least one system which _does_ "count NMIs" - just for
learning I will take a look at the differences between the systems and
the running kernel versions/configurations.



> [...]
>
> I think it's fine and has nothing to do with the problem.
>
> Since you say it has taken 1-8 days for any problem to appear, I suppose
> you will have to wait a few weeks to have some confidence that
> 'nowatchdog' makes a difference.


That's what I would like to do and will do; there won't be any other
reason to reboot the machine which gets hit by the problem most often.
As soon as a definite difference (or definitely the same behaviour) is
visible, I will post a reply here. Anyway, I just wanted to be able to
discuss the problem and initially posted it as a reply to the bug
referenced above, but I was advised to open a new one.


At least the bug has done one good thing for me: I got used to
compiling my own kernels again, which I had abandoned with the advent of
the 2.6 series and the change in most distributions to initrds, which
made me use only pre-packaged binaries for consistency among a number of
machines and for simplicity. Now I am happy to be able to optimise
some details again or choose other options than the distribution team.


Best regards,

Hans-Juergen






Bug#678443: Hard lockups due to "lockup-detector" (NMIs) on multi-Pentium-3 SMP systems on all kernel builds since 2.6.38

2012-06-21 Thread Hans-Juergen Mauser

Package: linux-2.6
Version: 2.6.38-5

Hello!

(Background information: the motivation for this bug report comes from
bug 639331, which already seems to have dug more deeply into the version
introducing this behaviour)


The NMI watchdog mechanism (which I now believe is the probable source)
has given me serious headaches since Debian kernel 2.6.38 was released.
I cannot say so definitely yet, as it is an intermittent error in my
case which may take up to a week to appear once, and I only disabled the
NMI watchdog mechanism by adding "nowatchdog" yesterday (2012-06-20),
when I came across the bug report mentioned above.


A short summary of my problem:

- among several uniprocessor systems with Debian and Ubuntu, I am 
running several older multiprocessor servers (IBM Netfinity 5000 
(Dual-P3 (Coppermine)), IBM Netfinity 7000 M10 (Quad-P3-Xeon (Tanner)) 
and IBM xSeries 232 (Dual-P3 (Tualatin))) with Debian (using testing as 
"rolling release" after a long time with lenny)


- the systems were running rock-solid up to and including the 
Debian-packaged kernel 2.6.32 (all sub-versions).


- when Debian-packaged kernel 2.6.38 came out, my problem started and 
appeared mainly on the Netfinity 5000 (but less often also on the other 
systems): after running continuously for one to eight days, the system 
suddenly locked up hard, in most cases it was just idle when this happened


- this lockup was a classic livelock which can be diagnosed nicely on 
these IBM machines as they have activity LEDs for each CPU which glowed 
with identical brightness and without any modulation, so both CPUs were 
switching between each other with short cycles


- when comparing the basic system data and properties, I noticed a 
difference between kernel 2.6.32 and 2.6.38: the latter caused a 
continuously rising NMI count on each CPU which could not be seen with 
2.6.32! Today I know where these NMIs are coming from: it is the 
watchdog mechanism also causing your laptop problem


- I hoped that the problem might disappear with kernel 3.4 as there were
a few discussions on LKML about several livelocks/deadlocks related to
timers and the like (the config change that enabled the "lockup
detector" between 2.6.32 and 2.6.38 went unnoticed by me)


- as you see on the laptop, this lockup NEVER lets any message get out
via the debugging mechanisms, not even by attaching a serial cable and
logging the console output on a second machine


- now using kernel 3.4.2, the problem still exists, but has changed a 
bit in its consequences - instead of a livelock, it is a deadlock in 
most cases and activity stays on a single CPU, sometimes even causing a 
reboot instead of staying locked up


- on a German forum I described the problem, but nobody could point me
to this lockup-detector change in the kernel config, although I posted
this significant change from "no NMIs" to "continuous NMIs". Here we see
again how poorly the documentation of open-source projects is sometimes
maintained... even when configuring a kernel, the config help says that
the NMI watchdog has to be enabled consciously by a boot parameter - in
fact it seems to be activated by default as soon as SMP code is loaded
and/or an APIC is detected (but despite the presence of an APIC, I have
not seen those NMIs on my uniprocessor P3 machines yet).
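
The "continuously rising NMI count" mentioned in the summary above can
be checked by reading the NMI row of /proc/interrupts twice and
comparing the values. The Python sketch below parses such a row; the
sample text uses rows posted elsewhere in this bug, and the function
names are hypothetical.

```python
# Rows in the format of /proc/interrupts, values as posted in this bug.
SAMPLE = """\
           CPU0       CPU1
  0:         49          0   IO-APIC-edge      timer
NMI:          1          1   Non-maskable interrupts
LOC:   62410645   76099188   Local timer interrupts
"""

def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counters from /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # the description text starts here
                counts.append(int(field))
            return counts
    return []

def nmi_rising(before, after):
    """True if any CPU's NMI counter increased between two samples."""
    return any(new > old for old, new in zip(before, after))

print(nmi_counts(SAMPLE))
```

With the watchdog active, two samples taken some time apart would be
expected to make nmi_rising() return True; with "nowatchdog" the
counters should stay constant, as in the sample rows.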


Here is a link to my description on the German "debianforum": 
http://debianforum.de/forum/viewtopic.php?f=33&t=134210


I would like to report the bug to http://bugzilla.kernel.org if it has 
not yet been done by someone else. Therefore it would be great if you 
could give me a short note if you have reported it already.


Basically I think this mechanism has its bugs and/or wrong assumptions
on some machines and should undergo a critical review. I wonder whether
there are more people in the world being troubled by strange lockups of
their machines which are wrongly diagnosed as "hardware errors" etc.



Thanks and best regards,

Hans-Juergen






Bug#639331: linux-image-2.6.36-rc6-686-bigmem: Closing laptop lid hangs the system on Dell studio 1555

2012-06-20 Thread Hans-Juergen Mauser

Hello!

I am very happy to have found this bug report, as it is possible that
the NMI watchdog mechanism has given me serious headaches since Debian
kernel 2.6.38 was released! I cannot say so definitely yet, as it is an
intermittent error in my case which may take up to a week to appear
once, and I only disabled the NMI watchdog mechanism by adding
"nowatchdog" a few hours ago, when I came across this bug report.


A short summary of my problem:

- among several uniprocessor systems with Debian and Ubuntu, I am 
running several older multiprocessor servers (IBM Netfinity 5000 
(Dual-P3), IBM Netfinity 7000 M10 (Quad-P3-Xeon) and IBM xSeries 232 
(Dual P3-Tualatin)) with Debian (using testing as "rolling release" 
after a long time with lenny)


- the systems were running rock-solid up to and including the 
Debian-packaged kernel 2.6.32


- when Debian-packaged kernel 2.6.38 came out, my problem started and 
appeared mainly on the Netfinity 5000 (but less often also on the other 
systems): after running continuously for one to eight days, the system 
suddenly locked up hard, in most cases it was just idle when this happened


- this lockup was a classic livelock which can be diagnosed nicely on 
these IBM machines as they have activity LEDs for each CPU which glowed 
with identical brightness and without any modulation, so both CPUs were 
switching between each other with short cycles


- when comparing the basic system data and properties, I noticed a 
difference between kernel 2.6.32 and 2.6.38: the latter caused a 
continuously rising NMI count on each CPU which could not be seen with 
2.6.32! Today I know where these NMIs are coming from: it is the 
watchdog mechanism also causing your laptop problem


- I hoped that the problem might disappear with kernel 3.4 as there were
a few discussions on LKML about several livelocks/deadlocks related to
timers and the like (the config change that enabled the "lockup
detector" between 2.6.32 and 2.6.38 went unnoticed by me)


- as you see on the laptop, this lockup NEVER lets any message get out
via the debugging mechanisms, not even by attaching a serial cable and
logging the console output on a second machine


- now using kernel 3.4.2, the problem still exists, but has changed a 
bit in its consequences - instead of a livelock, it is a deadlock in 
most cases and activity stays on a single CPU, sometimes even causing a 
reboot instead of staying locked up


- on a German forum I described the problem, but nobody could point me
to this lockup-detector change in the kernel config, although I posted
this significant change from "no NMIs" to "continuous NMIs". Here we see
again how poorly the documentation of open-source projects is sometimes
maintained... even when configuring a kernel, the config help says that
the NMI watchdog has to be enabled consciously by a boot parameter - in
fact it seems to be activated by default as soon as SMP code is loaded
and/or an APIC is detected (but despite the presence of an APIC, I have
not seen those NMIs on my uniprocessor P3 machines yet).


Here is a link to my description on the German "debianforum": 
http://debianforum.de/forum/viewtopic.php?f=33&t=134210


I would like to report the bug to http://bugzilla.kernel.org if it has 
not yet been done by someone else. Therefore it would be great if you 
could give me a short note if you have reported it already.


Basically I think this mechanism has its bugs and/or wrong assumptions
on some machines and should undergo a critical review. I wonder whether
there are more people in the world being troubled by strange lockups of
their machines which are wrongly diagnosed as "hardware errors" etc.


Hope to read from you soon!

Thanks and best regards,

Hans-Juergen


