Bug#627019: several kernel hangs before geting to login

2011-12-26 Thread Will Set


  

Monday, December 26, 2011 5:24 PM Will Set wrote:

Sunday, December 25, 2011 4:24 AM Jonathan Nieder wrote:
Will Set wrote:
 Jonathan Nieder wrote

 but the boot fails in some way unless you
 add processor.nocst=1 to the kernel command line.  
[...]

I had to test using 3.1.0-1-686-pae
(which I believe is an old experimental kernel and not a sid or wheezy kernel)

Oops, sorry for the confusion: 3.1.0-1-686-pae is a sid kernel.

I rebooted 3.1.0-1-686-pae 10 times with hyperthreading enabled,
and got 10 different dmesg problems, all very near, if not exactly at,
the point where udev was populating /dev.

Best Regards,
Will

Bug#627019: several kernel hangs before geting to login

2011-12-25 Thread Jonathan Nieder
Will Set wrote:
 Jonathan Nieder wrote

 but the boot fails in some way unless you
 add processor.nocst=1 to the kernel command line.  

 Yes,
 adding processor.nocst=1 has always worked for me on all affected kernels
 I've tested so far.
[...]
 This is on the machine with a D865GBF motherboard.

 No,
 this report is and always will be about the Intel D865GRH mobo.

Sorry for the typo, and thanks for the corrections.

Excellent --- I suspect that udev is actually a red herring and that
_any_ code executed during the early boot process is likely to
misbehave or segfault on this machine unless processor.nocst=1 is
passed.

In other words, this looks like incorrect execution or memory
corruption during boot.  Which is consistent with a broken _CST table.

Unfortunately the acpidump you sent does not include a _CST table.
The log you sent does not include any complaints about lack of a _CST
table, though.  Puzzling.

I recommend keeping processor.nocst=1 on the kernel command line for
now.  We should report this upstream to Len Brown and the
linux-a...@vger.kernel.org list, but I would like to delay that until
after the holidays to avoid overwhelming them.
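
To confirm that a given boot really carried the workaround, the kernel command line can be checked for the parameter. The following is a minimal sketch, not something posted in this thread: has_nocst is a hypothetical helper name, and the only real interface it relies on is /proc/cmdline.

```shell
#!/bin/sh
# has_nocst is a hypothetical helper (an assumption, not a Debian tool).
# $1: a kernel command line string, e.g. the contents of /proc/cmdline.
# Returns 0 if processor.nocst=1 appears as a whole argument, 1 otherwise.
has_nocst() {
    case " $1 " in
        *" processor.nocst=1 "*) return 0 ;;
    esac
    return 1
}

# Report on the running kernel when /proc/cmdline is readable.
if [ -r /proc/cmdline ]; then
    if has_nocst "$(cat /proc/cmdline)"; then
        echo "processor.nocst=1 is set for this boot"
    else
        echo "processor.nocst=1 is NOT set for this boot"
    fi
fi
```

On a system that boots through GRUB 2, the parameter is usually made persistent by adding it to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and running update-grub. Which bootloader this machine uses is not stated in the thread, so treat that step as an assumption.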

 There is another Debian user with an Intel D865GBF mobo
 who has filed a very similar Debian bug report.

 http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=631597

Does disabling hyperthreading in the BIOS avoid trouble for you, too?



--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#627019: several kernel hangs before geting to login

2011-12-24 Thread Will Set



Friday, December 23, 2011 6:54 PM Jonathan Nieder wrote:
Hi Will,

Will Set wrote:

 I was able to take three pictures of the boot messages by scrolling up the 
 boot buffer from the login prompt, while booting 3.2.0-rc4-686-pae
 to illustrate what I did my best to explain yesterday.

 I'll also attach the dmesg.udev-2 and acpidump-udev-2 

Thanks.

If I understand you correctly, udev 175-2 segfaults at boot.

No, not always a segfault.

Sometimes udev just hangs, leaving the machine without keyboard access.
And it's way too early in the boot process to get normal network connectivity.

Other times the kernel will panic.
And when the kernel panics I'm not able to save any data from the boot buffer 
other than the screen full of data showing when the boot buffer finishes 
sending the trace data to the buffer.

Boot also fails in at least one other way,
where I can see a udev settle message and messages showing the /sys
directory structure.
When this type of failure happens I am able to log in and use the system
console.
But if I start the X server under these conditions I have no keyboard or mouse.

These failures have not changed much since I initially reported this.
But I have seen the failures so many times now that I'm a bit less confused by
them.


 udev 175-3 does _not_ segfault, 

No, udev 175-3 also segfaults, iirc,
but I have not re-upgraded udev to 175-3 to test exactly what it shows yet.

but the boot fails in some way unless you
add processor.nocst=1 to the kernel command line.  

Yes,
adding processor.nocst=1 has always worked for me on all affected kernels I've
tested so far.

But the boot fails consistently when using udev 175-3 (unstable) with
3.2.0-rc4-686-pae
and without processor.nocst=1 added to the boot command.

Which is already
weird, since the only advertised changes in 175-3 were a fix to the
systemd service file and a fix to udev rules for Xen support.  Based
on the kernel log you sent, you are not using systemd, and I assume
you're not using Xen.

Please understand that a failed boot appears, at least from what I can see
here, always to have something to do with udev.


This is on the machine with a D865GBF motherboard.

No,
this report is and always will be about the Intel D865GRH mobo.
My other mobo is an Intel D865PERLK

There is another Debian user with an Intel D865GBF mobo
who has filed a very similar Debian bug report.

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=631597

[9.132009] Pid: 311, comm: modprobe Tainted: G  D
2.6.39-2-686-pae #1  /D865GBF

And this user has also filed a bug report upstream after Ben requested he do so.

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=631597#15


Anyway, you were able to take advantage of this situation to get an acpidump.

Are these results reproducible?

Yes, but the failure is not consistently the same one.
I had three failed boot attempts today while testing with a clean kernel
command line,
i.e. processor.nocst=1 was not added to the command line on any of my 4 boot
attempts today.
The fourth time the machine booted to a usable state.



I hope you can find some clues in this email that will make this issue less 
weird to understand.
And as always I'll do my best to get timely responses back to you, even though 
I have been busy 
elsewhere recently.
I've not had my usual amount of time to devote to testing and learning about 
the kernel.

Best Regards,
Will

Hope that helps,
Jonathan


Bug#627019: several kernel hangs before geting to login

2011-12-23 Thread Jonathan Nieder
Hi Will,

Will Set wrote:

 I was able to take three pictures of the boot messages by scrolling up the 
 boot buffer from the login prompt, while booting 3.2.0-rc4-686-pae
 to illustrate what I did my best to explain yesterday.

 I'll also attach the dmesg.udev-2 and acpidump-udev-2 

Thanks.

If I understand you correctly, udev 175-2 segfaults at boot.  udev
175-3 does _not_ segfault, but the boot fails in some way unless you
add processor.nocst=1 to the kernel command line.  Which is already
weird, since the only advertised changes in 175-3 were a fix to the
systemd service file and a fix to udev rules for Xen support.  Based
on the kernel log you sent, you are not using systemd, and I assume
you're not using Xen.

This is on the machine with a D865GBF motherboard.

Anyway, you were able to take advantage of this situation to get an acpidump.

Are these results reproducible?

Hope that helps,
Jonathan






Bug#627019: several kernel hangs before geting to login

2011-11-15 Thread Jonathan Nieder
Hi,

Will Set wrote:

 I see several segfaults and hangs during boot for each successful login.

By private email, you said:

 - you have tested some 3.1.0-1-686-pae kernel (I assume
   3.1.0-1~experimental.1 from experimental).
 - unless you add processor.nocst=1, it reliably hangs at boot time.
 - adding processor.nocst=1 makes it boot without hanging.
 - in addition to this machine, you have another machine that has an
   i865 chipset.  It produces the same symptoms.
 - in addition, you have a machine with an i915 chipset, which works
   fine, with no need for special boot parameters.

In the bug log, I see:

 - this is an Acer Aspire One AO521, board JV01-NL, BIOS v1.08
 - the chipset is indeed an 82865G
 - oopses are all over the place.  Feels like corruption somewhere.
 - with debug=3, we see that the DMI says this is board D865GRH, BIOS
   BF86510A.86A.0077.P25.0508040031 --- wait, are these even the same
   machine?
 - the other i865 is D865PERLK.

Ok.  The processor.nocst=1 workaround indicates that the ACPI tables
might be incorrect or being incorrectly parsed.  For the D865GBF, such
a problem is being tracked as bug#630031 and upstream bug 38262.
Compare v2.6.22-rc1~1112^2^2 (ACPICA: clear fields reserved before
FADT r3, 2007-04-28).  To move forward on that, the right thing to do
would be to get in touch with Len Brown, for example by answering his
questions from the Fedora bugtracker at
https://bugzilla.redhat.com/show_bug.cgi?id=727865.

For the D865PERLK, a quick web search does not show anyone but you
having this problem.

You've said you have three boards you're checking with and only two
exhibit the problem.  I'm not sure where the JV01-NL fits into the
picture.

Anyway, for the future, it would be way less confusing to have one bug
per machine, unless they are identically configured or we can be
reasonably certain for some other reason that the same fix will apply
to all of them.  Please provide a summary of which machines that you
use are affected and not affected, and I can clone this bug and let
you know the bug number assigned to each.

Thanks for your help and patience.

Regards,
Jonathan






Bug#627019: several kernel hangs before geting to login

2011-11-15 Thread Will Set

Tuesday, November 15, 2011 3:10 AM Jonathan Nieder wrote:

Hi,

Will Set wrote:

- you have tested some 3.1.0-1-686-pae kernel (I assume
   3.1.0-1~experimental.1 from experimental).

Yes, 3.1.0-1~experimental.1 from experimental 

- unless you add processor.nocst=1, it reliably hangs at boot time.
- adding processor.nocst=1 makes it boot without hanging.
- in addition to this machine, you have another machine that has an
   i865 chipset.  It produces the same symptoms.
- in addition, you have a machine with an i915 chipset, which works
   fine, with no need for special boot parameters.

Yes.


In the bug log, I see:

- this is an Acer Aspire One AO521, board JV01-NL, BIOS v1.08
- the chipset is indeed an 82865G
- oopses are all over the place.  Feels like corruption somewhere.
- with debug=3, we see that the DMI says this is board D865GRH, BIOS
   BF86510A.86A.0077.P25.0508040031 --- wait, are these even the same
   machine?
- the other i865 is D865PERLK.

From what I have gathered so far from reading docs and reports,
it looks like a C-state problem.
I think there isn't a CST in this processor...
(If CST is what adjusts processor voltage and stepping for energy saving when idle?)
I'm thinking legacy FADT is all this chip can use.

It's not a big deal for me to use the workaround Len Brown suggested
https://bugzilla.redhat.com/show_bug.cgi?id=727865#c16
for 2.6.38-rc  and newer kernels. ---
Debian stable / 2.6.32-5-686 kernel still works fine.
 
And I'm still OK if upstream treats it as a "will not fix" issue.
But I would like a fix as well, if one is possible.  


Ok.  The processor.nocst=1 workaround indicates that the ACPI tables
might be incorrect or being incorrectly parsed.  For the D865GBF, such
a problem is being tracked as bug#630031 and upstream bug 38262.
Compare v2.6.22-rc1~1112^2^2 (ACPICA: clear fields reserved before
FADT r3, 2007-04-28).  To move forward on that, the right thing to do
would be to get in touch with Len Brown, for example by answering his
questions from the Fedora bugtracker at
https://bugzilla.redhat.com/show_bug.cgi?id=727865.

All my answers to Len Brown's questions are identical to
Adam's answers at https://bugzilla.redhat.com/show_bug.cgi?id=727865#c17

$ grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*  --  doesn't exist.
/sys/firmware/acpi/tables/dynamic/*  --  doesn't exist in the filesystem.
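
The two checks above can be repeated on each test boot with a small script. This is only a sketch: probe_cst_sysfs is a hypothetical helper name, and the sysfs root is passed as a parameter (an assumption made so the function can be tried against a scratch directory). The two relative paths are the ones named in this message.

```shell
#!/bin/sh
# probe_cst_sysfs is a hypothetical helper (an assumption, not an existing tool).
# $1: sysfs root, normally /sys.
# Prints one "present:" or "missing:" line per directory of interest.
probe_cst_sysfs() {
    root="${1:-/sys}"
    for rel in devices/system/cpu/cpu0/cpuidle firmware/acpi/tables/dynamic; do
        if [ -d "$root/$rel" ]; then
            echo "present: $root/$rel"
        else
            echo "missing: $root/$rel"
        fi
    done
}

# Probe the running system.
probe_cst_sysfs /sys
```

Saving the output alongside each dmesg would show whether these directories appear on boots with and without processor.nocst=1.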


For the D865PERLK, a quick web search does not show anyone but you
having this problem.

You've said you have three boards you're checking with and only two
exhibit the problem.  I'm not sure where the JV01-NL fits into the
picture.

 I'm not sure how the JV01-NL got into the picture either.


Anyway, for the future, it would be way less confusing to have one bug
per machine, 

Yes, I agree 100%

unless they are identically configured or we can be
reasonably certain for some other reason that the same fix will apply
to all of them.

Yes,  at this preliminary stage, I think the issue is exactly the same, 
or at least close enough, on my two Intel 865 chipset machines.

Even though the two mobos are not identical,  
the processors, memory and disks  are identical in both machines.

Please provide a summary of which machines that you
use are affected and not affected, and I can clone this bug and let
you know the bug number assigned to each.

I will file a separate bug report from the other machine.


Thanks for your help and patience.

Regards,
Jonathan

Best Regards,
Will