Hi Jens!

[I made a typo in the Cc: list, so lkml is only included as of now. Actually I copied the typo from you ;-) ]
On 24 Apr 2007, at 11:40, Jens Axboe wrote:
> On Tue, Apr 24 2007, Jens Axboe wrote:
>> On Tue, Apr 24 2007, Roland Kuhn wrote:
>>> Hi Jens!
>>>
>>> On 24 Apr 2007, at 11:18, Jens Axboe wrote:
>>>> On Tue, Apr 24 2007, Roland Kuhn wrote:
>>>>> Hi Jens!
>>>>>
>>>>> We're using a custom built fileserver (dual core Athlon64, using
>>>>> x86_64 arch) with 22 disks in a RAID6 and while resyncing /dev/md2
>>>>> (9.1GB ext3) after a hardware incident (cable pulled on one disk) the
>>>>> machine would reliably oops while serving some large files over
>>>>> NFSv3. The oops message scrolled partly off the screen, but the IP
>>>>> was in cfq_dispatch_insert, so I tried your debug patch from
>>>>> yesterday with 2.6.21-rc7. I used netconsole for capturing the output
>>>>> (which works nicely, thanks Matt!) and as usual the condition
>>>>> triggered after about half a minute, this time with the following
>>>>> printout instead of crashing (still works fine):
>>>>>
>>>>> cfq: rbroot not empty, but ->next_rq == NULL! Fixing up, report the issue to [EMAIL PROTECTED]
>>>>> cfq: busy=1,drv=1,timer=0
>>>>> cfq rr_list:
>>>>> cfq busy_list:
>>>>>   4272: sort=0,next=0000000000000000,q=0/1,a=2/0,d=0/1,f=221
>>>>> cfq idle_list:
>>>>> cfq cur_rr:
>>>>> cfq: rbroot not empty, but ->next_rq == NULL! Fixing up, report the issue to [EMAIL PROTECTED]
>>>>> cfq: busy=1,drv=1,timer=0
>>>>> cfq rr_list:
>>>>> cfq busy_list:
>>>>>   4276: sort=0,next=0000000000000000,q=0/1,a=2/0,d=0/1,f=221
>>>>> cfq idle_list:
>>>>> cfq cur_rr:
>>>>>
>>>>> There was no backtrace, so the only thing I can tell is that for the
>>>>> previous crashes some nfs threads were always involved, only once did
>>>>> it happen inside an interrupt handler (with the "aieee" kind of
>>>>> message). If you want me to try something else, don't hesitate to
>>>>> ask!
>>>>
>>>> Nifty, great that you can reproduce so quickly. I'll try a 3-drive
>>>> raid6 here and see if read activity along with a resync will trigger
>>>> anything. If that doesn't work for me, I'll provide you with a more
>>>> extensive debug patch (if you don't mind).
>>>
>>> Sure. You might want to include NFS file access into your tests,
>>> since we've not triggered this with locally accessing the disks.
>>
>> BTW: How are you exporting the directory (what exports options) - how
>> is it mounted by the client(s)? What chunksize is your raid6 using?
>
> And what are the nature of the files on the raid (huge, small, ?) and
> what are the client(s) doing? Just approximately, I know these things
> can be hard/difficult/impossible to specify.

The files are 100-400MB in size and the client is merging them into a new file in the same directory using the ROOT library, which does in essence alternating sequences of

_llseek(somewhere)
read(n bytes)
_llseek(somewhere+n)
read(m bytes)
...

and then

_llseek(somewhere)
rt_sigaction(ignore INT)
write(n bytes)
rt_sigaction(INT->DFL)
time()
_llseek(somewhere+n)
...

where n is of the order of 30kB. The input files are treated sequentially, not randomly.
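
In case it helps with reproducing the load, here is a minimal Python sketch of roughly that access pattern. It is not the ROOT code: it simply concatenates the inputs, and the names merge_files and write_uninterruptible, the batch size, and the exact 30 kB chunk are assumptions made up for illustration. Only the seek/read and seek/ignore-INT/write/restore ordering is taken from the syscall sequence above.

```python
import os
import signal
import threading
import time

CHUNK = 30 * 1024  # "n is of the order of 30kB"; the exact value is an assumption


def write_uninterruptible(fd, offset, data):
    """Seek, then write with SIGINT ignored, like the rt_sigaction pair above."""
    # Python only allows changing signal handlers from the main thread.
    in_main = threading.current_thread() is threading.main_thread()
    if in_main:
        old = signal.signal(signal.SIGINT, signal.SIG_IGN)  # ignore INT
    try:
        os.lseek(fd, offset, os.SEEK_SET)                   # _llseek(somewhere)
        written = 0
        while written < len(data):
            written += os.write(fd, data[written:])         # write(n bytes)
    finally:
        if in_main:
            # The trace restores SIG_DFL; here the previous handler is restored.
            signal.signal(signal.SIGINT, old)
    time.time()                                             # time()
    return written


def merge_files(inputs, output, chunk=CHUNK, batch=4):
    """Read each input sequentially in small chunks and append it to output.

    A burst of up to `batch` reads is followed by a burst of writes, which
    approximates the "alternating sequences" described above.
    Returns the number of bytes written.
    """
    out_fd = os.open(output, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    out_pos = 0
    try:
        for path in inputs:
            in_fd = os.open(path, os.O_RDONLY)
            in_pos = 0
            try:
                while True:
                    chunks = []
                    for _ in range(batch):
                        os.lseek(in_fd, in_pos, os.SEEK_SET)  # _llseek(somewhere)
                        data = os.read(in_fd, chunk)          # read(n bytes)
                        if not data:
                            break
                        chunks.append(data)
                        in_pos += len(data)
                    if not chunks:
                        break
                    for data in chunks:
                        out_pos += write_uninterruptible(out_fd, out_pos, data)
            finally:
                os.close(in_fd)
    finally:
        os.close(out_fd)
    return out_pos
```

Run against a few 100-400MB files on the NFS mount, something like merge_files(["a.root", "b.root"], "merged.out") (file names hypothetical) should produce a similar mix of small sequential reads and writes in one directory, though whether it triggers the same condition is untested.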
BTW: the machine just stopped dead, no sign whatsoever on console or netconsole, so I rebooted with elevator=deadline (need to get some work done besides ;-) )

Ciao,
Roland

--
TU Muenchen, Physik-Department E18, James-Franck-Str., 85748 Garching
Telefon 089/289-12575; Telefax 089/289-12570
--
CERN office: 892-1-D23 phone: +41 22 7676540 mobile: +41 76 487 4482
--
Any society that would give up a little liberty to gain a little security will deserve neither and lose both. - Benjamin Franklin
-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GS/CS/M/MU d-(++) s:+ a-> C+++ UL++++ P+++ L+++ E(+) W+ !N K- w--- M + !V Y+
PGP++ t+(++) 5 R+ tv-- b+ DI++ e+++>++++ h---- y+++
------END GEEK CODE BLOCK------