Hello Jens,

JA> Then lets wait for Jim to repeat his testing with all the debugging
JA> options enabled, that should make us a little wiser.
Ok, I'll enable the kernel rebuilt with these options and report any
findings later. So far, I'll move to the other questions aroused.

I remember when I ran with most of debugging ON last week, it made the
server run darn slow, with LA around 40-60 and little responsiveness,
and numerous messages like: 
  Hangcheck: hangcheck value past margin!
  BUG: soft lockup detected on CPU#1!
which hide the interesting traces :)
However, I can post these captured traces of several system lifetimes
to the list or privately.

Concerning other questions:
1) The workload on the software raid is rather small. It's a set of
   system partitions which keep fileserver's logs, etc. The file
   storage is on 3Ware cards and has substantial load. MD arrays are
   checked nightly, though (echo check > sync_action), and most often
   this triggers the problem. These drives only contain mirrored
   partitions, so there should be no I/O to these drives around
   MD, except for rare cases of lilo running :)

2) I have installed a third submirror disk this weekend, it's an IDE
   slave device hdd (near the failing hdc). Since then I got errors
   on other partitions, attached below as "2*)".
   
3) The failures which lead to reboots are usually preceded by a long
   history of dma_intr errors 0x40 and 0x51, but that sample I sent
   was rather full. A few errors preceded it every 5 seconds, making
   the full trace like this:

[87319.049902] hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
[87319.057393] hdc: dma_intr: error=0x01 { AddrMarkNotFound }, 
LBAsect=176315718, sector=176315631
[87319.067205] ide: failed opcode was: unknown
[87323.956399] hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
[87323.963681] hdc: dma_intr: error=0x01 { AddrMarkNotFound }, 
LBAsect=176315718, sector=176315631
[87323.973171] ide: failed opcode was: unknown
[87328.846265] hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
[87328.853485] hdc: dma_intr: error=0x01 { AddrMarkNotFound }, 
LBAsect=176315718, sector=176315631
[87328.862834] ide: failed opcode was: unknown
[87333.736127] hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
[87333.743535] hdc: dma_intr: error=0x01 { AddrMarkNotFound }, 
LBAsect=176315718, sector=176315631
[87333.752876] ide: failed opcode was: unknown
[87333.806569] ide1: reset: success
[87338.675891] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete 
DataRequest Error }
[87338.685143] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, 
LBAsect=176315718, sector=176315711
[87338.694791] ide: failed opcode was: unknown
[87343.557424] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete 
DataRequest Error }
[87343.566388] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, 
LBAsect=176315718, sector=176315711
[87343.576105] ide: failed opcode was: unknown
[87348.472226] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete 
DataRequest Error }
[87348.481170] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, 
LBAsect=176315718, sector=176315711
[87348.490843] ide: failed opcode was: unknown
[87353.387028] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete 
DataRequest Error }
[87353.395735] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, 
LBAsect=176315718, sector=176315711
[87353.405500] ide: failed opcode was: unknown
[87353.461342] ide1: reset: success
[87358.326783] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete 
DataRequest Error }
[87358.335739] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, 
LBAsect=176315718, sector=176315718
[87358.345395] ide: failed opcode was: unknown
[87363.208313] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete 
DataRequest Error }
[87363.217319] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, 
LBAsect=176315718, sector=176315718
[87363.228371] ide: failed opcode was: unknown
[87368.106472] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete 
DataRequest Error }
[87368.115414] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, 
LBAsect=176315718, sector=176315718
[87368.125275] ide: failed opcode was: unknown
[87372.979686] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete 
DataRequest Error }
[87372.988706] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, 
LBAsect=176315718, sector=176315718
[87372.998849] ide: failed opcode was: unknown
[87373.052152] ide1: reset: success
[87377.927744] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete 
DataRequest Error }
[87377.936682] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, 
LBAsect=176315718, sector=176315718
[87377.946399] ide: failed opcode was: unknown
[87382.800953] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete 
DataRequest Error }
[87382.809881] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, 
LBAsect=176315718, sector=176315718
[87382.819511] ide: failed opcode was: unknown
[87387.682479] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete 
DataRequest Error }
[87387.691473] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, 
LBAsect=176315718, sector=176315718
[87387.701287] ide: failed opcode was: unknown
[87392.564004] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete 
DataRequest Error }
[87392.572790] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, 
LBAsect=176315718, sector=176315718
[87392.582454] ide: failed opcode was: unknown
[87392.635961] ide1: reset: success
[87397.528687] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete 
DataRequest Error }
[87397.537607] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, 
LBAsect=176315718, sector=176315718
[87397.547335] ide: failed opcode was: unknown
[87397.551897] end_request: I/O error, dev hdc, sector 176315718
[87398.520820] raid1: Disk failure on hdc11, disabling device. 
[87398.520826]  Operation continuing on 1 devices
[87398.531579] blk: request botched
[87398.535098] hdc: task_out_intr: status=0x50 { DriveReady SeekComplete }
[87398.542129] ide: failed opcode was: unknown
[87403.582775] ------------[ cut here ]------------
[87403.587748] kernel BUG at mm/filemap.c:541!
[87403.592082] invalid opcode: 0000 [#1]
[87403.596063] SMP 
[87403.598217] Modules linked in: w83781d hwmon_vid i2c_isa i2c_core 
w83627hf_wdt
[87403.606114] CPU:    0
[87403.606117] EIP:    0060:[<c01406a7>]    Not tainted VLI
[87403.606120] EFLAGS: 00010046   (2.6.18.2debug #1) 
[87403.619728] EIP is at unlock_page+0x12/0x2d
[87403.624170] eax: 00000000   ebx: c2d5caa8   ecx: e8148680   edx: c2d5caa8
[87403.631543] esi: da71c600   edi: 00000001   ebp: c04cfe28   esp: c04cfe24
[87403.638924] ds: 007b   es: 007b   ss: 0068
[87403.643419] Process swapper (pid: 0, ti=c04ce000 task=c041e500 
task.ti=c04ce000)
[87403.650774] Stack: e81487e8 c04cfe3c c0180e0a da71c600 00000000 c0180dac 
c04cfe64 c0164af9 
[87403.659985]        f7d49000 c04cfe84 f2dea5a0 f2dea5a0 00000000 da71c600 
00000000 da71c600 
[87403.669288]        c04cfea8 c0256778 c041e500 00000000 c04cbd90 00000046 
00000000 00000000 
[87403.678603] Call Trace:
[87403.681462]  [<c0103bba>] show_stack_log_lvl+0x8d/0xaa
[87403.686911]  [<c0103ddc>] show_registers+0x1b0/0x221
[87403.692306]  [<c0103ffc>] die+0x124/0x1ee
[87403.696558]  [<c0104165>] do_trap+0x9f/0xa1
[87403.700988]  [<c0104427>] do_invalid_op+0xa7/0xb1
[87403.706012]  [<c0103871>] error_code+0x39/0x40
[87403.710794]  [<c0180e0a>] mpage_end_io_read+0x5e/0x72
[87403.716154]  [<c0164af9>] bio_endio+0x56/0x7b
[87403.720798]  [<c0256778>] __end_that_request_first+0x1e0/0x301
[87403.726985]  [<c02568a4>] end_that_request_first+0xb/0xd
[87403.732699]  [<c02bd73c>] __ide_end_request+0x54/0xe1
[87403.738214]  [<c02bd807>] ide_end_request+0x3e/0x5c
[87403.743382]  [<c02c35df>] task_error+0x5b/0x97
[87403.748113]  [<c02c36fa>] task_in_intr+0x6e/0xa2
[87403.753120]  [<c02bf19e>] ide_intr+0xaf/0x12c
[87403.757815]  [<c013e5a7>] handle_IRQ_event+0x23/0x57
[87403.763135]  [<c013e66f>] __do_IRQ+0x94/0xfd
[87403.767802]  [<c0105192>] do_IRQ+0x32/0x68
[87403.772278]  [<c010372e>] common_interrupt+0x1a/0x20
[87403.777586]  [<c0100cfe>] cpu_idle+0x7d/0x86
[87403.782184]  [<c01002b7>] rest_init+0x23/0x25
[87403.786869]  [<c04d4889>] start_kernel+0x175/0x19d
[87403.791963]  [<00000000>] 0x0
[87403.795270] Code: ff ff ff b9 0b 00 14 c0 8d f0 <0>Kernel panic - not 
syncing: Fatal exception in interrupt
smp_call_function+0x12b/0x130
[87403.935200]  [<c013e66f>] 


2*) The new problem log:

... rebuilding new spare partitions into a 3-way array went well...
[114683.721498]  --- wd:3 rd:3
[114683.721508]  disk 0, wo:0, o:1, dev:hdc10
[114683.721541]  disk 1, wo:0, o:1, dev:hda10
[114683.721556]  disk 2, wo:0, o:1, dev:hdd10
[115150.208060] RAID1 conf printout:
[115150.211893]  --- wd:3 rd:3
[115150.214923]  disk 0, wo:0, o:1, dev:hdc8
[115150.219269]  disk 1, wo:0, o:1, dev:hda8
[115150.224022]  disk 2, wo:0, o:1, dev:hdd8
[115296.149849] RAID1 conf printout:
[115296.149868]  --- wd:3 rd:3
[115296.149895]  disk 0, wo:0, o:1, dev:hdc7
[115296.149915]  disk 1, wo:0, o:1, dev:hda7
[115296.149931]  disk 2, wo:0, o:1, dev:hdd7
[115460.641824] hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
[115460.649294] hdc: dma_intr: error=0x40 { UncorrectableError }, 
LBAsect=28433205, sector=28433023
[115460.658774] ide: failed opcode was: unknown
[115460.663287] end_request: I/O error, dev hdc, sector 28433023
[115465.938995] hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
[115465.946451] hdc: dma_intr: error=0x40 { UncorrectableError }, 
LBAsect=28433201, sector=28433031
[115465.957536] ide: failed opcode was: unknown
[115465.962041] end_request: I/O error, dev hdc, sector 28433031
[115488.743391] hdc: dma_timer_expiry: dma status == 0x60
[115488.748714] hdc: DMA timeout retry
[115488.752450] hdc: timeout waiting for DMA
[115488.979377] hdc: status error: status=0x58 { DriveReady SeekComplete 
DataRequest }
[115488.987829] ide: failed opcode was: unknown
[115488.992469] hdc: drive not ready for command
[115518.991827] hdc: lost interrupt
[115518.995201] hdc: task_out_intr: status=0x58 { DriveReady SeekComplete 
DataRequest }
[115519.003493] ide: failed opcode was: unknown
[115524.009004] hdc: status timeout: status=0xd0 { Busy }
[115524.014441] ide: failed opcode was: unknown
[115524.018988] hdc: drive not ready for command
[115524.114548] ide1: reset: success
[115524.198619] raid1: Disk failure on hdc6, disabling device. 
[115524.198624]         Operation continuing on 2 devices
[115524.209260] hdc: task_out_intr: status=0x50 { DriveReady SeekComplete }
[115544.209530] hdc: dma_timer_expiry: dma status == 0x20
[115544.215027] hdc: DMA timeout retry
[115544.218669] hdc: timeout waiting for DMA
[115544.461490] hdc: status error: status=0x58 { DriveReady SeekComplete 
DataRequest }
[115544.469875] ide: failed opcode was: unknown
[115544.474388] hdc: drive not ready for command
[115544.503556] RAID1 conf printout:
[115544.507091]  --- wd:1 rd:2
[115544.510120]  disk 0, wo:0, o:1, dev:hdd1
[115544.712573] ------------[ cut here ]------------
[115544.717459] kernel BUG at mm/filemap.c:541!
[115544.721902] invalid opcode: 0000 [#1]
[115544.725806] SMP 
[115544.728015] Modules linked in: w83781d hwmon_vid i2c_isa i2c_core 
w83627hf_wdt
[115544.735981] CPU:    0
[115544.735983] EIP:    0060:[<c013c436>]    Not tainted VLI
[115544.735986] EFLAGS: 00010046   (2.6.18.2server #4) 
[115544.749248] EIP is at unlock_page+0xf/0x28
[115544.753589] eax: 00000000   ebx: c1f5d8c0   ecx: eab25680   edx: c1f5d8c0
[115544.760627] esi: f6a78f00   edi: 00000001   ebp: 00000000   esp: c0481e7c
[115544.767703] ds: 007b   es: 007b   ss: 0068
[115544.772033] Process swapper (pid: 0, ti=c0480000 task=c040e460 
task.ti=c0480000)
[115544.779512] Stack: eab25674 c017bed6 f6a78f00 00000000 c017be7c c016069e 
00000003 c2e2aa20 
[115544.788824]        c0115cf1 00000020 f6a78f00 00003400 f6a78f00 00000000 
c024f8ac f7f37a00 
[115544.798171]        f7f37a00 00000001 c04fcd94 00000001 00000000 00000000 
00000000 00002c00 
[115544.807683] Call Trace:
[115544.810797]  [<c017bed6>] mpage_end_io_read+0x5a/0x6e
[115544.816407]  [<c017be7c>] mpage_end_io_read+0x0/0x6e
[115544.821865]  [<c016069e>] bio_endio+0x5f/0x84
[115544.826749]  [<c0115cf1>] find_busiest_group+0x153/0x4df
[115544.832582]  [<c024f8ac>] __end_that_request_first+0x1ec/0x31c
[115544.838889]  [<c02b508a>] __ide_end_request+0x56/0xe4
[115544.844462]  [<c02b5159>] ide_end_request+0x41/0x65
[115544.849671]  [<c02bad69>] task_end_request+0x37/0x7e
[115544.854950]  [<c02baed9>] task_out_intr+0x84/0xb9
[115544.860128]  [<c01243bd>] del_timer+0x56/0x58
[115544.864839]  [<c02bae55>] task_out_intr+0x0/0xb9
[115544.869760]  [<c02b6a4b>] ide_intr+0xb7/0x132
[115544.874467]  [<c013a2f4>] handle_IRQ_event+0x26/0x59
[115544.879740]  [<c013a3ae>] __do_IRQ+0x87/0xed
[115544.884368]  [<c0105199>] do_IRQ+0x31/0x69
[115544.888843]  [<c0103746>] common_interrupt+0x1a/0x20
[115544.894124]  [<c0100c4f>] default_idle+0x35/0x5b
[115544.899233]  [<c0100d03>] cpu_idle+0x7a/0x83
[115544.903844]  [<c0486810>] start_kernel+0x16a/0x194
[115544.908950]  [<c0486286>] unknown_bootoption+0x0/0x1b6
[115544.914384] Code: e8 5e ff ff ff b9 __wake_up+0x38/0x4e
[115545.083930]  [<c01243bd>] 


-- 
Best regards,
 Jim Klimov                            mailto:[EMAIL PROTECTED]

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to