Bug#637190: Random kernel panics general protection faults

2011-11-24 Thread Simon Morvan


Jonathan Nieder jrnie...@gmail.com wrote:

Ben Hutchings wrote:

 So, new theory required.

 Given you said you're not using ECC memory, can you test it with
 memtest86+ for a few hours?

I assume you tried this?

I ran memtest86+ for several days with all RAM modules on the board. It showed
only one error. I then decided to run the same test on each module individually,
again for a couple of days each. No errors at all.

Then I (desperately) started looking at the BIOS settings and found some
references on the web to instability issues related to the AMD ganged/unganged
mode.

In my case I switched from unganged to ganged and have had no issues since
(several months of uptime now, on a system that used to freeze after less than
a day up).




I would also (selfishly) be interested in whether the kernel from sid
behaves any differently.  The only packages from outside squeeze one
would need in order to test are the kernel image itself,
initramfs-tools, and linux-base.  If it is reproducible with a 3.1.y
kernel, we can try pursuing this upstream, and if not, we can try to
look for the patch that fixed it.

I definitely hear and understand your concern, but the box is now in
production and I can no longer afford any testing window (plus it's a big
storage system on which my company is heavily dependent).

I'm not sure whether the issue is purely hardware related or whether there's
something kernel related with that memory management mode on AMD platforms, but
I tend to think it's the former.

If you want more specific details, let me know.


Sincerely,
Jonathan

-- 
Simon



--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#637190: Info received (Bug#637190: linux-image-2.6.32-5-amd64: Random kernel panics general protection faults)

2011-08-11 Thread Simon Morvan
Interesting thing: if I do not mount any filesystem (except the
system's), I get no freezes.


At the moment, there's a resync running on the soft-raid array (5 disks) and an
extent move between LVM PVs (soft-raid array = hardware-based array),
generating a lot of I/O on all those disks, and nothing freezes.


I'm pretty sure that if I mount one of these LVs and start doing I/O on it,
I'll get a freeze (but I have to let the pvmove finish first, as I want to free
some disks to plug them into another backup NAS).


vmstat:
procs ------------memory------------ ---swap-- -----io---- -system-- -----cpu-----
 r  b   swpd     free   buff  cache   si   so    bi    bo   in   cs us sy  id wa
 2  0      0 16096936  20156  65712    0    0 29440 29696  698  786  0  0 100  0
 0  0      0 16097060  20156  65712    0    0 26288 25856  795  911  0  1  99  0
 0  0      0 16097060  20156  65712    0    0 30208 30400  701  792  0  0 100  0
 0  0      0 16097060  20156  65712    0    0 30208 30272  703  805  0  1  99  0



--
Simon







Bug#637190: linux-image-2.6.32-5-amd64: Random kernel panics general protection faults

2011-08-09 Thread Simon Morvan
Package: linux-2.6
Version: 2.6.32-35
Severity: grave
Justification: renders package unusable


This system is a standard PC loaded with a bunch of SATA disks (15):
8 on an LSI RAID card,
the remainder on standard SATA ports on the motherboard.
This is primarily a NAS for the LAN (Samba and netatalk).

We're getting random crashes of the system (panics, GPFs). The stack trace is
always different.

I tried to disable every unneeded integrated motherboard peripheral (sound,
FireWire, USB 3.0, ...).
Some of them still appear in lspci, though.

I also blacklisted a bunch of modules to prevent those subsystems from starting:

root@tank:~# cat /etc/modprobe.d/radeon.conf 
blacklist radeon
root@tank:~# cat /etc/modprobe.d/snd.conf 
blacklist snd
root@tank:~# cat /etc/modprobe.d/snd_hda_codec_atihdmi.conf 
blacklist snd_hda_codec_atihdmi
root@tank:~# cat /etc/modprobe.d/snd_hda_intel.conf 
blacklist snd_hda_intel

Here's the latest kernel trace I had a chance to retrieve (tail -f on
/var/log/messages through SSH):

Message from syslogd@tank at Aug  9 11:55:49 ...
 kernel:[ 2967.226046] general protection fault:  [#1] SMP

Message from syslogd@tank at Aug  9 11:55:49 ...
 kernel:[ 2967.226057] last sysfs file: 
/sys/devices/pci:00/:00:15.0/:05:00.0/host9/scsi_host/host9/proc_name
Aug  9 11:55:49 tank kernel: [ 2967.226070] CPU 0
Aug  9 11:55:49 tank kernel: [ 2967.226075] Modules linked in: ip6table_filter 
ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables kvm_amd kvm 
quota_v2 quota_tree bridge stp ext4 jbd2 crc16 loop snd_pcm snd_timer snd 
soundcore snd_page_alloc pcspkr i2c_piix4 i2c_core k10temp evdev edac_core 
edac_mce_amd shpchp pci_hotplug ext3 jbd mbcache dm_mod raid456 
async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 
md_mod sd_mod crc_t10dif ata_generic ohci_hcd pata_jmicron pata_atiixp 
megaraid_sas ahci libata scsi_mod ehci_hcd r8169 mii usbcore nls_base [last 
unloaded: scsi_wait_scan]
Aug  9 11:55:49 tank kernel: [ 2967.226164] Pid: 349, comm: md2_raid5 Not 
tainted 2.6.32-5-amd64 #1 GA-890GPA-UD3H
Aug  9 11:55:49 tank kernel: [ 2967.226172] RIP: 0010:[a014600b]  
[a014600b] handle_stripe+0x29/0x1785 [raid456]
Aug  9 11:55:49 tank kernel: [ 2967.226186] RSP: 0018:880421855ce0  EFLAGS: 
00010282
Aug  9 11:55:49 tank kernel: [ 2967.226191] RAX:  RBX: 
880421f32120 RCX: 880421f32130
Aug  9 11:55:49 tank kernel: [ 2967.226196] RDX: 880421f3216c RSI: 
0286 RDI: 880421855d00
Aug  9 11:55:49 tank kernel: [ 2967.226202] RBP: 880421f32160 R08: 
880422a91ce0 R09: 880422842000
Aug  9 11:55:49 tank kernel: [ 2967.226208] R10: 1000 R11: 
880421c308d0 R12: f7ff880422842000
Aug  9 11:55:49 tank kernel: [ 2967.226218] R13: 0004 R14: 
880421855e80 R15: 880422842188
Aug  9 11:55:49 tank kernel: [ 2967.226224] FS:  7f2eb9d4f700() 
GS:88000fa0() knlGS:
Aug  9 11:55:49 tank kernel: [ 2967.226232] CS:  0010 DS: 0018 ES: 0018 CR0: 
8005003b
Aug  9 11:55:49 tank kernel: [ 2967.226237] CR2: 00618d50 CR3: 
00034fd94000 CR4: 06f0
Aug  9 11:55:49 tank kernel: [ 2967.226242] DR0:  DR1: 
 DR2: 
Aug  9 11:55:49 tank kernel: [ 2967.226248] DR3:  DR6: 
0ff0 DR7: 0400
Aug  9 11:55:49 tank kernel: [ 2967.226253] Process md2_raid5 (pid: 349, 
threadinfo 880421854000, task 880422cdcdb0)

Message from syslogd@tank at Aug  9 11:55:49 ...
 kernel:[ 2967.226260] Stack:
Aug  9 11:55:49 tank kernel: [ 2967.226264]  000300015780 880422842150 
0086 880422842000
Aug  9 11:55:49 tank kernel: [ 2967.226273] 0  
0002  0001
Aug  9 11:55:49 tank kernel: [ 2967.226283] 0  
  

Message from syslogd@tank at Aug  9 11:55:49 ...
 kernel:[ 2967.226296] Call Trace:
Aug  9 11:55:49 tank kernel: [ 2967.226302]  [a0147b0c] ? 
raid5d+0x3a5/0x3ee [raid456]
Aug  9 11:55:49 tank kernel: [ 2967.226310]  [812fb53d] ? 
schedule_timeout+0x2e/0xdd
Aug  9 11:55:49 tank kernel: [ 2967.226319]  [a00e1855] ? 
md_thread+0xf1/0x10f [md_mod]
Aug  9 11:55:49 tank kernel: [ 2967.226326]  [81064f1a] ? 
autoremove_wake_function+0x0/0x2e
Aug  9 11:55:49 tank kernel: [ 2967.226334]  [a00e1764] ? 
md_thread+0x0/0x10f [md_mod]
Aug  9 11:55:49 tank kernel: [ 2967.226339]  [81064c4d] ? 
kthread+0x79/0x81
Aug  9 11:55:49 tank kernel: [ 2967.226345]  [81011baa] ? 
child_rip+0xa/0x20
Aug  9 11:55:49 tank kernel: [ 2967.226350]  [81064bd4] ? 
kthread+0x0/0x81
Aug  9 11:55:49 tank kernel: [ 2967.226356]  [81011ba0] ? 
child_rip+0x0/0x20

Message from syslogd@tank at Aug  9 11:55:49 ...
 kernel:[ 2967.226360] Code: 5f c3 41 57 41 56 41 55 41 

Bug#637190: linux-image-2.6.32-5-amd64: Random kernel panics general protection faults

2011-08-09 Thread Simon Morvan

On 09/08/2011 14:50, Ben Hutchings wrote:

On Tue, 2011-08-09 at 12:22 +0200, Simon Morvan wrote:

Package: linux-2.6
Version: 2.6.32-35
Severity: grave
Justification: renders package unusable


This system is a standard PC loaded with a bunch of SATA disks (15):
8 on an LSI RAID card,
the remainder on standard SATA ports on the motherboard.
This is primarily a NAS for the LAN (Samba and netatalk).

We're getting random crashes of the system (panics, GPFs). The stack trace is
always different.

Can you check that the power supply is sufficient for all these disks?

Do you have any recommendations? I haven't found much information on how
to estimate the power requirement. Currently this is a 600 W power supply
(FWIW: Cooler Master Silent Pro M - 600W)
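
A back-of-the-envelope way to estimate the requirement is to add up the
worst-case draw of every disk (spin-up is usually the peak) and the rest of the
system. A minimal sketch follows; estimate_watts is a hypothetical helper, and
every wattage figure in it is a placeholder to be replaced with numbers from
the drive and component datasheets, not a measurement from this machine:

```shell
#!/bin/sh
# estimate_watts DISKS WATTS_PER_DISK BASE_WATTS
# Hypothetical helper: prints a rough peak power estimate in watts.
# All three inputs are integer assumptions supplied by the caller.
estimate_watts() {
    disks="$1"
    per_disk="$2"
    base="$3"
    echo $(( disks * per_disk + base ))
}

# Example with placeholder figures: 15 disks, an assumed 25 W each at
# spin-up, and an assumed 150 W for CPU, motherboard and RAID controller.
estimate_watts 15 25 150
```

The result can then be compared with the rated output of the power supply,
keeping in mind that the rating is split across the individual rails.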





[...]

[5.088841] EDAC amd64: This node reports that Memory ECC is currently 
disabled, set F3x44[22] (:00:18.3).
[5.05] EDAC amd64: ECC disabled in the BIOS or no ECC capability, 
module will not load.
[5.06]  Either enable ECC checking or force module loading by setting 
'ecc_enable_override'.
[5.07]  (Note that use of the override may cause unknown side effects.)
[5.088978] amd64_edac: probe of :00:18.2 failed with error -22

[...]

It would also be sensible to enable ECC on such an important machine.

This requires specific RAM chips, doesn't it?


--
Simon







Bug#637190: linux-image-2.6.32-5-amd64: Random kernel panics general protection faults

2011-08-09 Thread Simon Morvan

On 09/08/2011 15:09, Ben Hutchings wrote:

On Tue, 2011-08-09 at 14:55 +0200, Simon Morvan wrote:

On 09/08/2011 14:50, Ben Hutchings wrote:

On Tue, 2011-08-09 at 12:22 +0200, Simon Morvan wrote:

We're getting random crashes of the system (panics, GPFs). The stack trace is
always different.

Can you check that the power supply is sufficient for all these disks?

Do you have any recommendations? I haven't found much information on how
to estimate the power requirement. Currently this is a 600 W power supply
(FWIW: Cooler Master Silent Pro M - 600W)

Many motherboards have a voltage monitoring chip, which you should be
able to read with the 'sensors' command from the 'lm-sensors' package.
This should show whether the actual voltages are being pulled down
because the power supply is overloaded.  You would need to actually make
all the hard drives active while checking this.

it8720-isa-0228
Adapter: ISA adapter
Vcore:        +1.33 V  (min =  +0.78 V, max =  +1.50 V)
Vdram:        +1.50 V  (min =  +1.42 V, max =  +1.57 V)
+3.3V:        +3.30 V  (min =  +3.14 V, max =  +3.47 V)
*+5V:         +4.92 V  (min =  +4.76 V, max =  +5.24 V)*
+12V:        +12.36 V  (min = +11.41 V, max = +12.62 V)
in5:          +2.70 V  (min =  +0.00 V, max =  +4.08 V)
5VSB:         +4.92 V  (min =  +4.76 V, max =  +5.24 V)
Vbat:         +3.25 V
CPU Fan:        0 RPM  (min =    0 RPM)
Sys Fan:        0 RPM  (min =    0 RPM)
Sys Fan:        0 RPM  (min =    0 RPM)
fan5:           0 RPM  (min =    0 RPM)
temp1:        +44.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermistor
CPU Temp:     +59.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermal diode
temp3:        +54.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermistor

cpu0_vid:    +0.513 V

Assuming the sensors.conf is correct (which I'm not 100% sure of for this
Gigabyte GA-890GPA-UD3H motherboard), do you think 4.92 V on the 5 V rail
is too low?
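
One quick check is to compare each reading with the min/max limits that
sensors prints next to it. A minimal sketch follows; in_range is a hypothetical
helper, and the limits are only as trustworthy as the sensors.conf they come
from:

```shell
#!/bin/sh
# in_range VALUE MIN MAX
# Hypothetical helper: succeeds when MIN <= VALUE <= MAX.
# awk is used because POSIX sh arithmetic is integer-only.
in_range() {
    awk -v v="$1" -v lo="$2" -v hi="$3" \
        'BEGIN { exit !((v + 0 >= lo + 0) && (v + 0 <= hi + 0)) }'
}

# Reading and limits copied from the +5V line of the sensors output above.
if in_range 4.92 4.76 5.24; then
    echo "+5V within limits"
else
    echo "+5V out of limits"
fi
```

A single snapshot says little about sag under load, so the check would need to
be repeated while all the drives are active.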


At the time I was running hdparm on some disks while compiling a kernel, and a
RAID5 sync was in progress (and it froze, of course).


--
Simon







Bug#569930: mysql-server upgrade stop and restart a stopped server

2010-06-08 Thread Simon Morvan

Hey there,

Same story again this time:

MySQL is *stopped* when I run aptitude safe-upgrade, because the service
is migrated to the failover node during the upgrade to minimize downtime.



Preparing to replace mysql-server-5.0 5.0.51a-24+lenny3 (using 
.../mysql-server-5.0_5.0.51a-24+lenny4_i386.deb) ...

Stopping MySQL database server: mysqld.
Stopping MySQL database server: mysqld.
Unpacking replacement mysql-server-5.0 ...
Processing triggers for man-db ...
(...)
Setting up mysql-server-5.0 (5.0.51a-24+lenny4) ...
Stopping MySQL database server: mysqld.
Starting MySQL database server: mysqld.


Ideally, MySQL shouldn't be started in this case.

--
Simon Morvan







Bug#569930: mysql-server upgrade stop and restart a stopped server

2010-02-15 Thread Simon Morvan
Package: mysql-server-5.0
Version: 5.0.51a-24+lenny3_i386

Hello folks,

When upgrading mysql-server on a lenny server on which mysqld is stopped
(because it's a standby heartbeat node), the upgrade script restarts the
server at the end.

This could be considered a minor problem, but in my case (an
active/passive heartbeat cluster with DRBD sync) it starts the server
outside of heartbeat's control without the data partition being mounted.
That creates stale files inside the directory that serves as the
mount point and forces me to stop the process after each upgrade.

The correct behavior would be to test for a running process before
killing/restarting, or to check the rc.d configuration to see whether
the process is started automatically at boot time.
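
The first suggestion could look roughly like the following. This is a minimal
sketch, not the package's actual maintainer script; the function names and the
pidfile path are assumptions made for illustration:

```shell
#!/bin/sh
# daemon_is_running PIDFILE
# Hypothetical helper: succeeds if PIDFILE names a live process.
daemon_is_running() {
    pidfile="$1"
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    [ -n "$pid" ] || return 1
    # Signal 0 only tests whether the process exists; nothing is sent.
    kill -0 "$pid" 2>/dev/null
}

# restart_decision PIDFILE
# Prints "restart" if the daemon was running, "leave-stopped" otherwise.
restart_decision() {
    if daemon_is_running "$1"; then
        echo restart
    else
        echo leave-stopped
    fi
}

# Assumed pidfile location; the real path depends on the MySQL configuration.
PIDFILE="${PIDFILE:-/var/run/mysqld/mysqld.pid}"
restart_decision "$PIDFILE"
```

For this to help, the check would have to run before the old package stops the
daemon, with the result remembered until the end of the upgrade.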

Cheers,

-- 
Simon Morvan




-- 
Archive: http://lists.debian.org/4b790b33.1010...@zone84.net



Bug#550116: Package update

2009-11-04 Thread Simon Morvan

Is there any chance of a package update for stable?

--
Simon.







Bug#432610: same problem here

2007-07-20 Thread Simon Morvan
Same version, same behavior here. I had to install modutils before
lvm-common, or the "Setting up lvm-common..." step would not succeed.


Maybe the dependency should be added.

--
Simon Morvan

