Bug#627019: several kernel hangs before geting to login
Monday, December 26, 2011 5:24 PM Will Set wrote:

Sunday, December 25, 2011 4:24 AM Jonathan Nieder wrote:

Will Set wrote:

Jonathan Nieder wrote:

but the boot fails in some way unless you add processor.nocst=1 to the kernel command line.

[...]

I had to test using 3.1.0-1-686-pae (which I believe is an old experimental kernel and not a sid or wheezy kernel).

Oops, sorry for the confusion: 3.1.0-1-686-pae is a sid kernel.

I rebooted 3.1.0-1-686-pae 10 times with hyperthreading enabled, and got 10 different dmesg problems very near, if not exactly, when udev was populating /dev.

Best Regards,
Will
Bug#627019: several kernel hangs before geting to login
Will Set wrote:
> Jonathan Nieder wrote:

>> but the boot fails in some way unless you add processor.nocst=1 to the kernel command line.
>
> Yes, adding processor.nocst=1 has always worked for me on all affected kernels I've tested so far.
[...]
>> This is on the machine with a D865GBF motherboard.
>
> No, this report is and always will be Intel D865GRH mobo.

Sorry for the typo, and thanks for the corrections.

Excellent --- I suspect that udev is actually a red herring and that _any_ code executed during the early boot process is likely to misbehave or segfault on this machine unless processor.nocst=1 is passed. In other words, this looks like incorrect execution or memory corruption during boot, which is consistent with a broken _CST table.

Unfortunately the acpidump you sent does not include a _CST table. The log you sent does not include any complaints about lack of a _CST table, though. Puzzling.

I recommend keeping processor.nocst=1 on the kernel command line for now. We should report this upstream to Len Brown and the linux-a...@vger.kernel.org list, but I would like to delay that until after the holidays to avoid overwhelming them.

> There is another Debian user that has an Intel D865GBF mobo with a very similar debian bug report filed.
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=631597

Does disabling hyperthreading in the BIOS avoid trouble for you, too?

--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Bug#627019: several kernel hangs before geting to login
Friday, December 23, 2011 6:54 PM Jonathan Nieder wrote:

> Hi Will,
>
> Will Set wrote:
>
>> I was able to take three pictures of the boot messages by scrolling up the boot buffer from the login prompt, while booting 3.2.0-rc4-686-pae to illustrate what I did my best to explain yesterday. I'll also attach the dmesg.udev-2 and acpidump-udev-2
>
> Thanks. If I understand you correctly, udev 175-2 segfaults at boot.

No, not always a segfault. Sometimes udev just hangs, leaving the machine without keyboard access, and it's way too early in the boot process to get normal network connectivity. Other times the kernel will panic. When the kernel panics I'm not able to save any data from the boot buffer other than the screenful of data showing when the boot buffer finishes sending the trace data to the buffer.

Boot also fails in at least one other way, where I can see a udev settle message and messages showing the /sys directory structure. When this type of issue happens I am able to log in and run the system console. But if I start the xserver under these conditions I have no keyboard or mouse.

These failures have not changed much since I initially reported this. But I have seen the failures so many times now that I'm a bit less confused by them.

> udev 175-3 does _not_ segfault,

No, udev 175-3 also segfaults iirc, but I have not re-upgraded udev to 175-3 to test exactly what it shows, yet.

> but the boot fails in some way unless you add processor.nocst=1 to the kernel command line.

Yes, adding processor.nocst=1 has always worked for me on all affected kernels I've tested so far. But the boot fails consistently when using udev 175-3 unstable with 3.2.0-rc4-686-pae and without processor.nocst=1 added to the boot command.

> Which is already weird, since the only advertised changes in 175-3 were a fix to the systemd service file and a fix to udev rules for Xen support. Based on the kernel log you sent, you are not using systemd, and I assume you're not using Xen.

Please understand that a failed boot appears, at least from what I can see here, always to have something to do with udev.

> This is on the machine with a D865GBF motherboard.

No, this report is and always will be Intel D865GRH mobo. My other mobo is an Intel D865PERLK.

There is another Debian user that has an Intel D865GBF mobo with a very similar debian bug report filed.
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=631597

[9.132009] Pid: 311, comm: modprobe Tainted: G D 2.6.39-2-686-pae #1 /D865GBF

And this user has also filed a bug report upstream after Ben requested he do so.
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=631597#15

> Anyway, you were able to take advantage of this situation to get an acpidump. Are these results reproducible?

Yes, but the failure is not consistently one failure. I had three failed boot attempts today while testing with a clean kernel command line, i.e. processor.nocst=1 was not added to the command line on any of my 4 boot attempts today. The fourth time the machine booted to a usable state.

I hope you can find some clues in this email that will make this issue less weird to understand. And as always I'll do my best to get timely responses back to you, even though I have been busy elsewhere recently. I've not had my usual amount of time to devote to testing and learning about the kernel.

Best Regards,
Will

> Hope that helps,
> Jonathan
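
The test boots described in this message differ mainly in whether processor.nocst=1 was on the kernel command line. As a minimal sketch, assuming a Linux system that exposes the running kernel's command line at /proc/cmdline, the following Python checks for that parameter so that a saved dmesg can be labelled with the configuration it came from. The names kernel_params and booted_with_nocst are hypothetical helpers for illustration, not part of any tool mentioned in this thread.

```python
from pathlib import Path


def kernel_params(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        # Only the first "=" separates name from value (e.g. root=UUID=...).
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def booted_with_nocst(cmdline: str) -> bool:
    """Return True if processor.nocst=1 appears on the given command line."""
    return kernel_params(cmdline).get("processor.nocst") == "1"


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")  # assumption: Linux procfs is mounted
    if proc_cmdline.exists():
        active = booted_with_nocst(proc_cmdline.read_text())
        print("processor.nocst=1 active:", active)
    else:
        print("/proc/cmdline is not available on this system")
```

Splitting on whitespace ignores quoted values that contain spaces, which should be enough for a simple parameter like this one.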
Bug#627019: several kernel hangs before geting to login
Hi Will,

Will Set wrote:

> I was able to take three pictures of the boot messages by scrolling up the boot buffer from the login prompt, while booting 3.2.0-rc4-686-pae to illustrate what I did my best to explain yesterday. I'll also attach the dmesg.udev-2 and acpidump-udev-2

Thanks. If I understand you correctly, udev 175-2 segfaults at boot. udev 175-3 does _not_ segfault, but the boot fails in some way unless you add processor.nocst=1 to the kernel command line.

Which is already weird, since the only advertised changes in 175-3 were a fix to the systemd service file and a fix to udev rules for Xen support. Based on the kernel log you sent, you are not using systemd, and I assume you're not using Xen.

This is on the machine with a D865GBF motherboard.

Anyway, you were able to take advantage of this situation to get an acpidump. Are these results reproducible?

Hope that helps,
Jonathan
Bug#627019: several kernel hangs before geting to login
Hi,

Will Set wrote:

> I see several segfaults and hangs during boot per each successful login.

By private email, you said:

- you have tested some 3.1.0-1-686-pae kernel (I assume 3.1.0-1~experimental.1 from experimental).
- unless you add processor.nocst=1, it reliably hangs at boot time.
- adding processor.nocst=1 makes it boot without hanging.
- in addition to this machine, you have another machine that has an i865 chipset. It produces the same symptoms.
- in addition, you have a machine with an i915 chipset, which works fine, with no need for special boot parameters.

In the bug log, I see:

- this is an Acer Aspire One AO521, board JV01-NL, BIOS v1.08
- the chipset is indeed an 82865G
- oopses are all over the place. Feels like corruption somewhere.
- with debug=3, we see that the DMI says this is board D865GRH, BIOS BF86510A.86A.0077.P25.0508040031 --- wait, are these even the same machine?
- the other i865 is D865PERLK.

Ok. The processor.nocst=1 workaround indicates that the ACPI tables might be incorrect or being incorrectly parsed.

For the D865GBF, such a problem is being tracked as bug#630031 and upstream bug 38262. Compare v2.6.22-rc1~1112^2^2 (ACPICA: clear fields reserved before FADT r3, 2007-04-28). To move forward on that, the right thing to do would be to get in touch with Len Brown, for example by answering his questions from the Fedora bugtracker at https://bugzilla.redhat.com/show_bug.cgi?id=727865.

For the D865PERLK, a quick web search does not show anyone but you having this problem.

You've said you have three boards you're checking with and only two exhibit the problem. I'm not sure where the JV01-NL fits into the picture.

Anyway, for the future, it would be way less confusing to have one bug per machine, unless they are identically configured or we can be reasonably certain for some other reason that the same fix will apply to all of them.

Please provide a summary of which machines that you use are affected and not affected, and I can clone this bug and let you know the bug number assigned to each.

Thanks for your help and patience.

Regards,
Jonathan
Bug#627019: several kernel hangs before geting to login
Tuesday, November 15, 2011 3:10 AM Jonathan Nieder wrote:

> Hi,
>
> Will Set wrote:
>
> - you have tested some 3.1.0-1-686-pae kernel (I assume 3.1.0-1~experimental.1 from experimental).

Yes, 3.1.0-1~experimental.1 from experimental.

> - unless you add processor.nocst=1, it reliably hangs at boot time.
> - adding processor.nocst=1 makes it boot without hanging.
> - in addition to this machine, you have another machine that has an i865 chipset. It produces the same symptoms.
> - in addition, you have a machine with an i915 chipset, which works fine, with no need for special boot parameters.

Yes.

> In the bug log, I see:
>
> - this is an Acer Aspire One AO521, board JV01-NL, BIOS v1.08
> - the chipset is indeed an 82865G
> - oopses are all over the place. Feels like corruption somewhere.
> - with debug=3, we see that the DMI says this is board D865GRH, BIOS BF86510A.86A.0077.P25.0508040031 --- wait, are these even the same machine?
> - the other i865 is D865PERLK.

From what I have gathered so far from reading docs and reports, it looks like a C-state problem. I think there isn't a CST in this processor... If CST adjusts processor voltage and stepping for energy saving when idle? I'm thinking legacy FADT is all this chip can use.

It's not a big deal for me to use the workaround Len Brown suggested
https://bugzilla.redhat.com/show_bug.cgi?id=727865#c16
for 2.6.38-rc and newer kernels.

Debian stable / 2.6.32-5-686 kernel still works fine. And I'm still OK if it's an upstream (will not fix) issue. But I would like a fix as well, if one is possible.

> Ok. The processor.nocst=1 workaround indicates that the ACPI tables might be incorrect or being incorrectly parsed.
>
> For the D865GBF, such a problem is being tracked as bug#630031 and upstream bug 38262. Compare v2.6.22-rc1~1112^2^2 (ACPICA: clear fields reserved before FADT r3, 2007-04-28). To move forward on that, the right thing to do would be to get in touch with Len Brown, for example by answering his questions from the Fedora bugtracker at https://bugzilla.redhat.com/show_bug.cgi?id=727865.

All my answers to Len Brown's questions are identical to Adam's answers to Len Brown's questions:
https://bugzilla.redhat.com/show_bug.cgi?id=727865#c17

$ grep . /sys/devices/system/cpu/cpu0/cpuidle/*/* -- doesn't exist.
$/sys/firmware/acpi/tables/dynamic/* -- doesn't exist in the filesystem.

> For the D865PERLK, a quick web search does not show anyone but you having this problem.
>
> You've said you have three boards you're checking with and only two exhibit the problem. I'm not sure where the JV01-NL fits into the picture.

I'm not sure how the JV01-NL got into the picture either.

> Anyway, for the future, it would be way less confusing to have one bug per machine,

Yes, I agree 100%.

> unless they are identically configured or we can be reasonably certain for some other reason that the same fix will apply to all of them.

Yes, at this preliminary stage, I think the issue is exactly the same, or at least close enough, on my two Intel 865 chipset machines. Even though the two mobos are not identical, the processors, memory and disks are identical in both machines.

> Please provide a summary of which machines that you use are affected and not affected, and I can clone this bug and let you know the bug number assigned to each.

I will file a separate bug report from the other machine.

> Thanks for your help and patience.
>
> Regards,
> Jonathan

Best Regards,
Will
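
The cpuidle check reported in this message can also be scripted. The following is a minimal sketch, assuming the sysfs path /sys/devices/system/cpu/cpu0/cpuidle quoted above; read_cpuidle is a hypothetical helper name. It roughly mirrors `grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*` by collecting the contents of each readable file, and it returns an empty result when the directory does not exist, which is the situation described for this machine.

```python
from pathlib import Path


def read_cpuidle(base="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return {relative path: contents} for files under base/*/, or {} if absent."""
    root = Path(base)
    found = {}
    if not root.is_dir():
        # Matches the "doesn't exist" result reported in the thread.
        return found
    for entry in sorted(root.glob("*/*")):
        if not entry.is_file():
            continue
        try:
            found[entry.relative_to(root).as_posix()] = entry.read_text().strip()
        except (OSError, UnicodeDecodeError):
            # Some sysfs attributes cannot be read as text; skip them.
            continue
    return found


if __name__ == "__main__":
    states = read_cpuidle()
    if not states:
        print("no cpuidle entries found")
    for path, value in states.items():
        print(f"{path}: {value}")
```

Passing a different base directory makes it possible to run the same check against a copy of /sys saved from another machine.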