On Saturday 06 June 2009, Alexander Puchmayr wrote:
> Hi there!
>
> This week I've tried to set up a home server, but the system is highly
> unstable. The first symptoms were lots of page allocation errors, which
> disappeared after switching the kernel memory allocator from SLUB to SLAB
> and increasing min_free_kbytes in /proc/sys/vm from 8MB to 20MB.
>
> The machine is an AMD Athlon64 X2 5050e on an Asus M3A78-Pro board with
> 2x2GB RAM. I'm using kernel 2.6.29.4 (vanilla, but the result is the same
> with 2.6.29-gentoo-r5), and I also upgraded the board's BIOS to the latest
> version (which is 0902).
>
> But the system still freezes after some hours. It just freezes. The console
> is dead, there is no entry in the logs, no network connectivity, and even
> sysrq doesn't seem to do anything. The worst thing is I don't even have an
> idea what the error could be, and in the rare situations when it crashed
> and the console was not blanked, I only saw the end of a stack trace; the
> interesting parts had scrolled off (and I can't scroll back as the console
> is absolutely dead :-( ). The only button that still works is the reset
> button, and after rebooting the log doesn't tell me anything (it just ends
> without any message).
>
> I inspected my dmesg output right after booting more closely, and I found
> some strange entries which could indicate a problem. What do you think
> about them?
>
> [ 0.000000] ACPI Warning (tbfadt-0568): 32/64X length mismatch in Gpe0Block: 64/32 [20081204]
> [ 0.000000] FADT: X_PM1a_EVT_BLK.bit_width (16) does not match PM1_EVT_LEN (4)
> ...
> [ 0.000000] 4 Processors exceeds NR_CPUS limit of 2
> [ 0.000000] SMP: Allowing 2 CPUs, 0 hotplug CPUs
> ...
> [ 0.000999] Aperture pointing to e820 RAM. Ignoring.
> [ 0.000999] Your BIOS doesn't leave a aperture memory hole
> [ 0.000999] Please enable the IOMMU option in the BIOS setup
> [ 0.000999] This costs you 64 MB of RAM
> [ 0.000999] Mapping aperture over 65536 KB of RAM @ 20000000
> [ 0.000999] PM: Registered nosave memory: 0000000020000000 - 0000000024000000
> ...
> [ 0.099055] mtrr: your CPUs had inconsistent fixed MTRR settings
> [ 0.099059] mtrr: probably your BIOS does not setup all CPUs.
> [ 0.099116] mtrr: corrected configuration.
> ...
> [ 0.151260] PCI-DMA: Disabling AGP.
> [ 0.151260] PCI-DMA: aperture base @ 20000000 size 65536 KB
> [ 0.151260] PCI-DMA: using GART IOMMU.
> [ 0.151260] PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
> ...
> [ 0.163241] system 00:09: iomem range 0xfec00000-0xfec00fff has been reserved
> [ 0.163305] system 00:09: iomem range 0xfee00000-0xfee00fff has been reserved
> [ 0.163365] system 00:0a: ioport range 0x4d0-0x4d1 has been reserved
> [ 0.163422] system 00:0a: ioport range 0x40b-0x40b has been reserved
> [ 0.163480] system 00:0a: ioport range 0x4d6-0x4d6 has been reserved
> [ 0.163537] system 00:0a: ioport range 0xc00-0xc01 has been reserved
> [ 0.163595] system 00:0a: ioport range 0xc14-0xc14 has been reserved
> [ 0.163653] system 00:0a: ioport range 0xc50-0xc51 has been reserved
> [ 0.163711] system 00:0a: ioport range 0xc52-0xc52 has been reserved
> [ 0.163769] system 00:0a: ioport range 0xc6c-0xc6c has been reserved
> [ 0.163827] system 00:0a: ioport range 0xc6f-0xc6f has been reserved
> [ 0.163885] system 00:0a: ioport range 0xcd0-0xcd1 has been reserved
> [ 0.163942] system 00:0a: ioport range 0xcd2-0xcd3 has been reserved
> [ 0.163999] system 00:0a: ioport range 0xcd4-0xcd5 has been reserved
> [ 0.164070] system 00:0a: ioport range 0xcd6-0xcd7 has been reserved
> [ 0.164127] system 00:0a: ioport range 0xcd8-0xcdf has been reserved
> [ 0.164184] system 00:0a: ioport range 0x800-0x89f has been reserved
> [ 0.164241] system 00:0a: ioport range 0xb00-0xb3f has been reserved
> [ 0.164305] system 00:0a: ioport range 0x900-0x90f has been reserved
> [ 0.164363] system 00:0a: ioport range 0x910-0x91f has been reserved
> [ 0.164421] system 00:0a: ioport range 0xfe00-0xfefe has been reserved
> [ 0.164480] system 00:0a: iomem range 0xffb80000-0xffbfffff has been reserved
> [ 0.164538] system 00:0a: iomem range 0xfec10000-0xfec1001f has been reserved
> [ 0.164598] system 00:0c: ioport range 0xe00-0xe0f has been reserved
> [ 0.164656] system 00:0c: ioport range 0xe80-0xe8f has been reserved
> [ 0.164713] system 00:0c: ioport range 0xf40-0xf4f has been reserved
> [ 0.164771] system 00:0c: ioport range 0xa30-0xa3f has been reserved
> [ 0.164830] system 00:0d: iomem range 0xe0000000-0xefffffff has been reserved
> [ 0.164890] system 00:0e: iomem range 0x0-0x9ffff could not be reserved
> [ 0.164947] system 00:0e: iomem range 0xc0000-0xcffff has been reserved
> [ 0.165018] system 00:0e: iomem range 0xe0000-0xfffff could not be reserved
> [ 0.165076] system 00:0e: iomem range 0x100000-0xdfffffff could not be reserved
> [ 0.165158] system 00:0e: iomem range 0xfec00000-0xffffffff could not be reserved
> ...
> [ 21.298450] ACPI: I/O resource piix4_smbus [0xb00-0xb07] conflicts with ACPI region SOR1 [0xb00-0xb0f]
> [ 21.298454] ACPI: Device needs an ACPI driver
> [ 21.298461] piix4_smbus 0000:00:14.0: SMBus Host Controller at 0xb00, revision 0
> ...
> [ 73.861479] ACPI: I/O resource it87 [0xe85-0xe86] conflicts with ACPI region HWRE [0xe85-0xe86]
> [ 73.861483] ACPI: Device needs an ACPI driver
>
> What does the message "4 Processors exceeds NR_CPUS" mean? The system is a
> dual-core AMD Athlon64 5050e; AFAIK it has two cores and nothing more. The
> mtrr messages later also suggest that there could be more than 2 CPUs
> available. Wondering...
>
> The next thing which seems somewhat strange to me is the AGP aperture and
> the IOMMU. The mainboard does not have an AGP port, nor does the BIOS have
> any such option to enable. The only thing I can set is the size of the
> memory reserved for the onboard video card, which I set to the smallest
> value of 32MB as the machine will usually not even have a display.
>
> And the iomem range reservation errors at the end: harmful or not?
>
> The last messages come after loading the hw-sensors modules it87.ko and
> i2c_piix4.
>
> Thanks in advance for suggestions
> Alex

*sigh*

Ok, just for starters: all AMD CPUs of the Athlon64 architecture have a
built-in AGP GART, and this GART also functions as an IOMMU. It is a great
hack for getting a hardware IOMMU. Intel does not have this, so they rely on
software. The solution came up while AMD devs and Linux kernel devs worked
together. Please read the following links:

http://en.wikipedia.org/wiki/Iommu
http://marc.info/?l=linux-kernel&m=107759901509280&w=2
http://marc.info/?l=linux-kernel&m=107764033904042&w=2

The IOMMU is needed so that 32-bit PCI devices can reach memory above 4 GB,
and for other sweet things. Sadly, the IOMMU needs a minimum amount of memory
for itself, and it uses the AGP aperture for that. This is fine, but mobo
vendors suck and make the aperture too small or not available at all. In that
case the kernel is forced to use real memory for the IOMMU, which is the
64 MB your dmesg says it costs you.

In short, that message has nothing to do with your problem.

The NR_CPUS message is confusing. I strongly suspect that your kernel config
is really messed up.

The iomem range messages are harmless.

Please enable:
[] Check for low memory corruption
[] Reserve low 64K of RAM on AMI/Phoenix BIOSen
in the kernel config. Also clean it up and remove stuff like 'hyperthreading
scheduler'.

If the problem persists, start testing your hardware. I would suspect the PSU.
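
If you want to check the config suggestions above without clicking through menuconfig, something like this rough Python sketch could do it. The CONFIG_* names are my own mapping of the menu labels for 2.6.29 (CONFIG_X86_CHECK_BIOS_CORRUPTION, CONFIG_X86_RESERVE_LOW_64K, and CONFIG_SCHED_SMT for the hyperthreading scheduler), so please verify them against your tree; the helper names `parse_config` and `review` and the sample text are made up for illustration.

```python
import re

# Assumed mapping of the menu labels to config symbols (verify in your tree):
#   "Check for low memory corruption"        -> CONFIG_X86_CHECK_BIOS_CORRUPTION
#   "Reserve low 64K of RAM on AMI/Phoenix"  -> CONFIG_X86_RESERVE_LOW_64K
#   "SMT (Hyperthreading) scheduler support" -> CONFIG_SCHED_SMT
WANTED = {
    "CONFIG_X86_CHECK_BIOS_CORRUPTION": True,   # should be enabled
    "CONFIG_X86_RESERVE_LOW_64K": True,         # should be enabled
    "CONFIG_SCHED_SMT": False,                  # pointless on this CPU
}


def parse_config(text):
    """Return {symbol: value} for every option that is set in .config text."""
    options = {}
    for line in text.splitlines():
        match = re.match(r"^(CONFIG_\w+)=(.*)$", line.strip())
        if match:
            options[match.group(1)] = match.group(2)
    return options


def review(text):
    """Return a list of human-readable findings for the options in WANTED."""
    options = parse_config(text)
    findings = []
    for symbol, want_enabled in WANTED.items():
        is_enabled = options.get(symbol) == "y"
        if is_enabled != want_enabled:
            state = "enabled" if want_enabled else "disabled"
            findings.append(f"{symbol} should be {state}")
    return findings


# Hypothetical excerpt of a .config, only to show the call.
SAMPLE = """\
CONFIG_NR_CPUS=2
CONFIG_SCHED_SMT=y
# CONFIG_X86_CHECK_BIOS_CORRUPTION is not set
CONFIG_X86_RESERVE_LOW_64K=y
"""

for finding in review(SAMPLE):
    print(finding)
```

To run it against a real kernel, read the text of the `.config` in your source tree and pass it to `review()`. It only looks at these three symbols and says nothing about whether the rest of the config is sane.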