Re: Re: 3.2.38 most of the time has 100% cpu use reported

2013-02-16 Thread Zhenzhong Duan

On Sat, Feb 16, 2013 at 05:18:10AM +, Ben Hutchings wrote:
> On Fri, 2013-02-15 at 22:23 -0500, tmhik...@gmail.com wrote:
> > On Thu, Feb 14, 2013 at 05:04:07AM +, Ben Hutchings wrote:
> > > On Tue, 2013-02-12 at 00:01 -0500, tmhik...@gmail.com wrote:
> > > > Okay, I finally have located the patch causing this bizzare
> problem
> > > > for me. Before I discuss it, I'm going to drag out the kernel
> bug reporting
> > > > guidelines and try to make a proper bug report out of this.
> > > > 
> > > > [1.] One line summary of the problem:
> > > > 3.2.38 most of the time has 100% cpu use reported
> > > > 
> > > > [2.] Full description of the problem/report:
> > > > Reverse applying the patch for
> > > > 
> > > > [9a1f08a1a192f9177d7063d903773aed800b840f]
> drivers/firmware/dmi_scan.c: fetch dmi version from SMBIOS if it
> exists
> > > > 
> > > > on top of a clean 3.2.38 tree makes the problem go away.
> > > [...]
> > > > I have to admit I have no idea what the patch I'm reversing
> actually does,
> > > 
> > > It changes how we look for the version of a BIOS interface (DMI
> or
> > > SMBIOS).  All the code that it touches, and the version number
> variable,
> > > are discarded after boot and therefore can have very limited
> effect on
> > > what happens later!
> > 
> > Actually upon poking about, I noticed something VERY different about
> what
> > the patch does to my computer during the boot process.  Let me show
> you
> > dmesg output from a good kernel without the patch, then a bad one
> with the
> > patch, then I'll point out the important differences I noticed:
> [...]
> > Important bits I noticed: The 'bad' kernel initially boots with
> acpi
> > disabled, then later turns it back on, whereas the 'good' one boots
> with
> > acpi completely enabled with no problems.
> >
> > Also, very early on just after yelling about the NX bit not existing
> in my
> > cpu, it says this: (-good/+bad)
> > 
> > -[0.00] DMI 2.3 present.
> > -[0.00] DMI: ECS M848A/M848A, BIOS 07.00T 04/02/01
> > +[0.00] Legacy DMI 0.0 present.
> > +[0.00] DMI:  , BIOS  
> 
> Oh wow, this has really gone very wrong.
> 
> > The 'good' kernel without the patch finds something for DMI (That is
> in fact
> > my motherboard model number, so I know it's doing *something* right)
> whereas
> > the bad one with the patch...  Does not.
> >
> > Anyway, I suppose I could try using acpi=force as the kernel
> suggests for a
> > workaround for now, but I'm still convinced something went wrong
> with that
> > patch now more than ever.  Due to the fact that ACPI gets reenabled
> I didn't
> > catch on that something was going wrong so early in the boot process
> with it
> > until now.
> >
> > One thing I want to emphasize is that the only difference between
> the two
> > kernels is that lone patch.
> >
> > As for perf top, I tried running it as you suggested, but I honestly
> can't
> > figure out how to get this thing to dump output to a file. Every
> time I
> > redirect stdio/stderr I get a empty file, I don't know what I'm
> doing wrong.
> > Any tips?
> [...]
> 
> Don't worry about it - I think the log messages are a pretty good
> clue.
> 
> Does this patch fix the log messages and/or the other issues?
> 
> --- a/drivers/firmware/dmi_scan.c
> +++ b/drivers/firmware/dmi_scan.c
> @@ -416,6 +416,9 @@ static int __init dmi_present(const char __iomem
> *p)
>  
>   memcpy_fromio(buf, p, 15);
>   if (dmi_checksum(buf, 15)) {
> + print_hex_dump(KERN_INFO, "DMI entry: ", DUMP_PREFIX_NONE,
> +16, 1, buf, 15, false);
> +
>   dmi_num = (buf[13] << 8) | buf[12];
>   dmi_len = (buf[7] << 8) | buf[6];
>   dmi_base = (buf[11] << 24) | (buf[10] << 16) |
> @@ -442,10 +445,12 @@ static int __init dmi_present(const char __iomem
> *p)
>  static int __init smbios_present(const char __iomem *p)
>  {
>   u8 buf[32];
> - int offset = 0;
>  
>   memcpy_fromio(buf, p, 32);
>   if ((buf[5] < 32) && dmi_checksum(buf, buf[5])) {
> + print_hex_dump(KERN_INFO, "SMBIOS entry: ", DUMP_PREFIX_NONE,
> +16, 1, buf, 16, false);
> +
>   dmi_ver = (buf[6] << 8) + buf[7];
>  
>   /* Some BIOS report weird SMBIOS version, fix that up */
> @@ -461,9 +466,9 @@ static int __init smbios_present(const char
> __iomem *p)
>   dmi_ver = 0x0206;
>   break;
>   }
> - offset = 16;
> + return dmi_present(p + 16);
>   }
> - return dmi_present(buf + offset);
> + return 1;
>  }
>  
>  void __init dmi_scan_machine(void)
> --- END ---
Thanks Ben fix it.
Acked-by: Zhenzhong Duan 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.2.38 most of the time has 100% cpu use reported

2013-02-15 Thread tmhikaru
On Sat, Feb 16, 2013 at 05:18:10AM +, Ben Hutchings wrote:
> Don't worry about it - I think the log messages are a pretty good clue.
> 
> Does this patch fix the log messages and/or the other issues?
> 

Yes! In fact after a quick reboot, the log lines look to me like they exist
properly, acpi looks like it's working properly, and most importantly my cpu
usage is normal.

One thing though:
> If not, can you report back the 'SMBIOS entry:' and 'DMI entry:' log
> lines this produces.

I don't see an SMBIOS entry line in the dmesg log output, though there IS a
DMI entry.  I take it that I should have seen an SMBIOS entry of some
kind...  I've included the entire output of dmesg just in case you would
find it useful below.  For all I know I missed something.

Tim McGrath

[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Linux version 3.2.38 (root@roll) (gcc version 4.7.1 (GCC) ) #3 
Sat Feb 16 01:12:31 EST 2013
[0.00] BIOS-provided physical RAM map:
[0.00]  BIOS-e820:  - 0009fc00 (usable)
[0.00]  BIOS-e820: 0009fc00 - 000a (reserved)
[0.00]  BIOS-e820: 000e8000 - 000ea000 (reserved)
[0.00]  BIOS-e820: 000f - 0010 (reserved)
[0.00]  BIOS-e820: 0010 - 5fff (usable)
[0.00]  BIOS-e820: 5fff - 5fff8000 (ACPI data)
[0.00]  BIOS-e820: 5fff8000 - 6000 (ACPI NVS)
[0.00]  BIOS-e820: fec0 - fec01000 (reserved)
[0.00]  BIOS-e820: fee0 - fee01000 (reserved)
[0.00]  BIOS-e820: ffee - fff0 (reserved)
[0.00]  BIOS-e820: fffc - 0001 (reserved)
[0.00] Notice: NX (Execute Disable) protection missing in CPU!
[0.00] DMI entry: 5f 44 4d 49 5f 73 51 03 50 06 0f 00 19 00 23
[0.00] Legacy DMI 2.3 present.
[0.00] DMI: ECS M848A/M848A, BIOS 07.00T 04/02/01
[0.00] e820 update range:  - 0001 (usable) 
==> (reserved)
[0.00] e820 remove range: 000a - 0010 (usable)
[0.00] last_pfn = 0x5fff0 max_arch_pfn = 0x100
[0.00] MTRR default type: uncachable
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-E uncachable
[0.00]   F-F write-protect
[0.00] MTRR variable ranges enabled:
[0.00]   0 base 0 mask FC000 write-back
[0.00]   1 base 04000 mask FE000 write-back
[0.00]   2 disabled
[0.00]   3 disabled
[0.00]   4 disabled
[0.00]   5 disabled
[0.00]   6 disabled
[0.00]   7 base 0D000 mask FF800 write-combining
[0.00] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
[0.00] initial memory mapped : 0 - 01c0
[0.00] Base memory trampoline at [c009c000] 9c000 size 12288
[0.00] init_memory_mapping: -37bfe000
[0.00]  00 - 20 page 4k
[0.00]  20 - 0037a0 page 2M
[0.00]  0037a0 - 0037bfe000 page 4k
[0.00] kernel direct mapping tables up to 37bfe000 @ 1bfa000-1c0
[0.00] RAMDISK: 5fc59000 - 5fff
[0.00] Allocated new RAMDISK: 37867000 - 37bfd6a3
[0.00] Move RAMDISK from 5fc59000 - 5ffef6a2 to 
37867000 - 37bfd6a2
[0.00] ACPI: RSDP 000fa260 00014 (v00 AMI   )
[0.00] ACPI: RSDT 5fff 0002C (v01 AMIINT SiS740XX 1000 MSFT 
010B)
[0.00] ACPI: FACP 5fff0030 00081 (v01 AMIINT SiS740XX 0011 MSFT 
010B)
[0.00] ACPI: DSDT 5fff0110 034D2 (v01SiS  740 0100 MSFT 
010D)
[0.00] ACPI: FACS 5fff8000 00040
[0.00] ACPI: APIC 5fff00c0 00050 (v01 AMIINT SiS740XX 1000 MSFT 
010B)
[0.00] ACPI: Local APIC address 0xfee0
[0.00] 643MB HIGHMEM available.
[0.00] 891MB LOWMEM available.
[0.00]   mapped low ram: 0 - 37bfe000
[0.00]   low ram: 0 - 37bfe000
[0.00] Zone PFN ranges:
[0.00]   DMA  0x0010 -> 0x1000
[0.00]   Normal   0x1000 -> 0x00037bfe
[0.00]   HighMem  0x00037bfe -> 0x0005fff0
[0.00] Movable zone start PFN for each node
[0.00] early_node_map[2] active PFN ranges
[0.00] 0: 0x0010 -> 0x009f
[0.00] 0: 0x0100 -> 0x0005fff0
[0.00] On node 0 totalpages: 393087
[0.00] free_area_init_node: node 0, pgdat c14e7a08, node_mem_map 
f6c66200
[0.00]   DMA zone: 32 pages used for memmap
[0.00]   DMA zone: 0 pages reserved
[0.00]   DMA zone: 3951 pages, LIFO batch:0
[0.00]   Normal zone: 1752 pages used for memmap
[0.00]   Normal zone: 222502 pages, 

Re: 3.2.38 most of the time has 100% cpu use reported

2013-02-15 Thread Ben Hutchings
On Fri, 2013-02-15 at 22:23 -0500, tmhik...@gmail.com wrote:
> On Thu, Feb 14, 2013 at 05:04:07AM +, Ben Hutchings wrote:
> > On Tue, 2013-02-12 at 00:01 -0500, tmhik...@gmail.com wrote:
> > >   Okay, I finally have located the patch causing this bizzare problem
> > > for me. Before I discuss it, I'm going to drag out the kernel bug 
> > > reporting
> > > guidelines and try to make a proper bug report out of this.
> > > 
> > > [1.] One line summary of the problem:
> > > 3.2.38 most of the time has 100% cpu use reported
> > > 
> > > [2.] Full description of the problem/report:
> > > Reverse applying the patch for
> > > 
> > > [9a1f08a1a192f9177d7063d903773aed800b840f] drivers/firmware/dmi_scan.c: 
> > > fetch dmi version from SMBIOS if it exists
> > > 
> > > on top of a clean 3.2.38 tree makes the problem go away.
> > [...]
> > > I have to admit I have no idea what the patch I'm reversing actually does,
> > 
> > It changes how we look for the version of a BIOS interface (DMI or
> > SMBIOS).  All the code that it touches, and the version number variable,
> > are discarded after boot and therefore can have very limited effect on
> > what happens later!
> 
> Actually upon poking about, I noticed something VERY different about what
> the patch does to my computer during the boot process.  Let me show you
> dmesg output from a good kernel without the patch, then a bad one with the
> patch, then I'll point out the important differences I noticed:
[...]
> Important bits I noticed: The 'bad' kernel initially boots with acpi
> disabled, then later turns it back on, whereas the 'good' one boots with
> acpi completely enabled with no problems.
>
> Also, very early on just after yelling about the NX bit not existing in my
> cpu, it says this: (-good/+bad)
> 
> -[0.00] DMI 2.3 present.
> -[0.00] DMI: ECS M848A/M848A, BIOS 07.00T 04/02/01
> +[0.00] Legacy DMI 0.0 present.
> +[0.00] DMI:  , BIOS  

Oh wow, this has really gone very wrong.

> The 'good' kernel without the patch finds something for DMI (That is in fact
> my motherboard model number, so I know it's doing *something* right) whereas
> the bad one with the patch...  Does not.
>
> Anyway, I suppose I could try using acpi=force as the kernel suggests for a
> workaround for now, but I'm still convinced something went wrong with that
> patch now more than ever.  Due to the fact that ACPI gets reenabled I didn't
> catch on that something was going wrong so early in the boot process with it
> until now.
>
> One thing I want to emphasize is that the only difference between the two
> kernels is that lone patch.
>
> As for perf top, I tried running it as you suggested, but I honestly can't
> figure out how to get this thing to dump output to a file. Every time I
> redirect stdio/stderr I get a empty file, I don't know what I'm doing wrong.
> Any tips?
[...]

Don't worry about it - I think the log messages are a pretty good clue.

Does this patch fix the log messages and/or the other issues?

--- a/drivers/firmware/dmi_scan.c
+++ b/drivers/firmware/dmi_scan.c
@@ -416,6 +416,9 @@ static int __init dmi_present(const char __iomem *p)
 
memcpy_fromio(buf, p, 15);
if (dmi_checksum(buf, 15)) {
+   print_hex_dump(KERN_INFO, "DMI entry: ", DUMP_PREFIX_NONE,
+  16, 1, buf, 15, false);
+
dmi_num = (buf[13] << 8) | buf[12];
dmi_len = (buf[7] << 8) | buf[6];
dmi_base = (buf[11] << 24) | (buf[10] << 16) |
@@ -442,10 +445,12 @@ static int __init dmi_present(const char __iomem *p)
 static int __init smbios_present(const char __iomem *p)
 {
u8 buf[32];
-   int offset = 0;
 
memcpy_fromio(buf, p, 32);
if ((buf[5] < 32) && dmi_checksum(buf, buf[5])) {
+   print_hex_dump(KERN_INFO, "SMBIOS entry: ", DUMP_PREFIX_NONE,
+  16, 1, buf, 16, false);
+
dmi_ver = (buf[6] << 8) + buf[7];
 
/* Some BIOS report weird SMBIOS version, fix that up */
@@ -461,9 +466,9 @@ static int __init smbios_present(const char __iomem *p)
dmi_ver = 0x0206;
break;
}
-   offset = 16;
+   return dmi_present(p + 16);
}
-   return dmi_present(buf + offset);
+   return 1;
 }
 
 void __init dmi_scan_machine(void)
--- END ---

If not, can you report back the 'SMBIOS entry:' and 'DMI entry:' log
lines this produces.

Ben.

-- 
Ben Hutchings
Computers are not intelligent.  They only think they are.


signature.asc
Description: This is a digitally signed message part


Re: 3.2.38 most of the time has 100% cpu use reported

2013-02-13 Thread Ben Hutchings
On Tue, 2013-02-12 at 00:01 -0500, tmhik...@gmail.com wrote:
>   Okay, I finally have located the patch causing this bizzare problem
> for me. Before I discuss it, I'm going to drag out the kernel bug reporting
> guidelines and try to make a proper bug report out of this.
> 
> [1.] One line summary of the problem:
> 3.2.38 most of the time has 100% cpu use reported
> 
> [2.] Full description of the problem/report:
> Reverse applying the patch for
> 
> [9a1f08a1a192f9177d7063d903773aed800b840f] drivers/firmware/dmi_scan.c: fetch 
> dmi version from SMBIOS if it exists
> 
> on top of a clean 3.2.38 tree makes the problem go away.
[...]
> I have to admit I have no idea what the patch I'm reversing actually does,

It changes how we look for the version of a BIOS interface (DMI or
SMBIOS).  All the code that it touches, and the version number variable,
are discarded after boot and therefore can have very limited effect on
what happens later!

> what the cpu is doing when it's stuck at 100%, nor how I would diagnose this
> further or whom I should contact.  I'd like to help anyone who wants to fix
> whatever went wrong, so please contact me.

'perf top' will tell you where the CPU is spending its time.

Ben.

-- 
Ben Hutchings
We get into the habit of living before acquiring the habit of thinking.
  - Albert Camus


signature.asc
Description: This is a digitally signed message part