Re: panic: biodone2 already
On Fri, Aug 17, 2018 at 12:59:57PM +0200, Jaromir Dolecek wrote:

> Yes, one of the problems is the code happily handles stale bufs. It does not
> clear the buf pointer when the response is already handled. We should add
> some KASSERTs there for this and clear the response structure on reuse.

Yes, that also makes it clear whether a request is processed twice (and
not just a buffer).

> Also, DPRINTF() in the loop assumes bp is valid, so it uses a stale pointer
> for the disk flush ...

It also tries to handle invalid operation codes, but then assumes other
data (i.e. the buffer pointer) is valid.

Greetings,
-- 
Michael van Elst
Internet: mlel...@serpens.de
                "A potential Snark may lurk in every tree."
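As an illustration, a minimal sketch of that hardening, written against the
xbd_handler loop and field names quoted later in the thread (sc_ring,
sc_reqs, req_bp); where exactly it would go in xbd_xenbus.c is an
assumption, not a committed patch:

	for (i = sc->sc_ring.rsp_cons; i != resp_prod; i++) {
		blkif_response_t *rep = RING_GET_RESPONSE(&sc->sc_ring, i);
		struct xbd_req *xbdreq = &sc->sc_reqs[rep->id];

		bp = xbdreq->req_bp;
		/* a stale or replayed response must not map to an empty slot */
		KASSERT(bp != NULL);
		/* mark the request slot as handled before completing the buf */
		xbdreq->req_bp = NULL;

		/* ... existing per-operation handling, ending in dk_done() ... */
	}

With req_bp cleared on completion, a second response carrying the same id
trips the KASSERT instead of reaching dk_done() and biodone() twice.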
Re: panic: biodone2 already
> On 17 Aug 2018, at 07:07, Michael van Elst wrote:
>
>> On Fri, Aug 17, 2018 at 02:23:16AM +, Emmanuel Dreyfus wrote:
>>
>>	blkif_response_t *rep = RING_GET_RESPONSE(&sc->sc_ring, i);
>>	struct xbd_req *xbdreq = &sc->sc_reqs[rep->id];
>>	bp = xbdreq->req_bp;
>>
>> It decides to call dk_done for the last occurrence and return. Next
>> call to xbd_handler finds the same offending buf_t leading the queue.
>> dk_done is called again, leading to the panic.
>
> It should not do this, since cons should equal prod and it should not
> enter the loop.

I was investigating whether it could be some interaction between
DIOCCACHEFLUSH and bio, or raw vs block I/O; nothing found yet.

Yes, one of the problems is the code happily handles stale bufs. It does not
clear the buf pointer when the response is already handled. We should add
some KASSERTs there for this and clear the response structure on reuse.

Also, DPRINTF() in the loop assumes bp is valid, so it uses a stale pointer
for the disk flush ...

The whole xbd code really needs a cleanup and proper mpification.

I was not able to repeat the panic yet on my machine though. Maybe I need a
bigger virtual disk.
Re: panic: biodone2 already
On Fri, Aug 17, 2018 at 02:23:16AM +, Emmanuel Dreyfus wrote:

>	blkif_response_t *rep = RING_GET_RESPONSE(&sc->sc_ring, i);
>	struct xbd_req *xbdreq = &sc->sc_reqs[rep->id];
>	bp = xbdreq->req_bp;
>
> It decides to call dk_done for the last occurrence and return. Next
> call to xbd_handler finds the same offending buf_t leading the queue.
> dk_done is called again, leading to the panic.

The ring returns the response with the request ID. Is that ID returned
multiple times, or do several requests reference the same buffer?

> Where should I look for the code where the receive ring is populated in
> the dom0?

The answers are pushed back to the ring by the dom0 in xbdback_iodone(),
which calls xbdback_send_reply().

Greetings,
-- 
Michael van Elst
Internet: mlel...@serpens.de
                "A potential Snark may lurk in every tree."
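For orientation, the reply side has to follow the generic Xen shared-ring
producer pattern from xen/public/io/ring.h; a rough sketch, with 'ring' and
'evtchn' used as placeholder names rather than the actual xbdback fields:

	blkif_response_t *resp;
	int notify;

	resp = RING_GET_RESPONSE(&ring, ring.rsp_prod_pvt);
	resp->id = id;			/* echoes the request id back to the frontend */
	resp->operation = op;
	resp->status = status;
	ring.rsp_prod_pvt++;

	RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(&ring, notify);
	if (notify)
		hypervisor_notify_via_evtchn(evtchn);	/* wakes xbd_handler in the domU */

A duplicate id pushed here, or the same response slots published twice,
would show up in the domU exactly as reported: the same xbd_req, and
therefore the same buf_t, completed more than once.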
Re: panic: biodone2 already
On Fri, Aug 10, 2018 at 06:55:46AM -, Michael van Elst wrote:

> a queued operation eventually returns with a call to xbd_handler.
> - for every buffer returned, dk_done is called which finally ends
>   in invoking biodone.

After adding debug statements, I can now tell the offending buf_t is queued
once, but xbd_handler finds it hundreds of times in the response queue here:

	for (i = sc->sc_ring.rsp_cons; i != resp_prod; i++) {
		blkif_response_t *rep = RING_GET_RESPONSE(&sc->sc_ring, i);
		struct xbd_req *xbdreq = &sc->sc_reqs[rep->id];

		bp = xbdreq->req_bp;

It decides to call dk_done for the last occurrence and return. The next
call to xbd_handler finds the same offending buf_t leading the queue.
dk_done is called again, leading to the panic.

Where should I look for the code where the receive ring is populated in
the dom0?

-- 
Emmanuel Dreyfus
m...@netbsd.org
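For context, the loop above is the consumer half of the standard Xen
shared-ring protocol; a simplified sketch of that pattern (declarations
omitted, roughly following xen/public/io/ring.h), showing where rsp_cons is
supposed to advance:

again:
	resp_prod = sc->sc_ring.sring->rsp_prod;
	xen_rmb();		/* read the responses only after seeing rsp_prod */
	for (i = sc->sc_ring.rsp_cons; i != resp_prod; i++) {
		/* look up sc_reqs[rep->id] and hand its buf to dk_done() */
	}
	sc->sc_ring.rsp_cons = i;	/* consume everything just processed */
	RING_FINAL_CHECK_FOR_RESPONSES(&sc->sc_ring, more_to_do);
	if (more_to_do)
		goto again;

If rsp_cons is not advanced past a response that has already been handled,
or the backend republishes the same slots, the next interrupt looks up the
same id again and the same buf_t reaches dk_done() a second time, which
matches what the debug output shows.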
Re: panic: biodone2 already
m...@netbsd.org (Emmanuel Dreyfus) writes:

> I can tell that in vfs_bio.c, bread() -> bio_doread() will call
> VOP_STRATEGY once for the offending buf_t, but biodone() is called twice
> in interrupt context for the buf_t, leading to the "biodone2 already"
> panic later.

> Since you know the xbd code, you could save me some time: where do we go
> below VOP_STRATEGY?

The buffer is passed to xbdstrategy, which tries to run the I/O queue.
  -> on error it finishes with biodone
  Regularly it calls dk_strategy, which iterates over xbd_diskstart.
  -> on error from xbd_diskstart it again finishes with biodone

xbd_diskstart tries to push a single buffer; it either
  - queues the buffer to the "hardware" (a hypervisor event), or
  - returns EAGAIN to retry the operation, or
  - returns another error to fail the operation.

A queued operation eventually returns with a call to xbd_handler.
  - for every buffer returned, dk_done is called, which finally ends
    in invoking biodone.

Hypervisor events link to an 'xbd_req' structure (one per request, which
can be I/O but also a cache flush operation). For I/O requests the
'xbd_req' structure points to the buffer.

-- 
Michael van Elst
Internet: mlel...@serpens.de
                "A potential Snark may lurk in every tree."
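To make the three outcomes explicit, here is a schematic version of that
contract; the condition names are placeholders, not actual xbd variables,
and the real code lives in sys/arch/xen/xen/xbd_xenbus.c:

	/* Schematic only: not the actual xbd_diskstart(). */
	static int
	xbd_diskstart_sketch(device_t self, struct buf *bp)
	{
		bool ring_has_room = true;	/* placeholder for the real ring-space check */
		bool can_setup_xfer = true;	/* placeholder for the real setup/mapping check */

		if (!ring_has_room)
			return EAGAIN;	/* dk layer keeps bp queued and retries later */
		if (!can_setup_xfer)
			return EIO;	/* example error: dk layer fails bp, one biodone() */

		/*
		 * Otherwise attach bp to a free xbd_req, put the request on
		 * the shared ring and notify the backend; completion arrives
		 * later via xbd_handler() -> dk_done() -> biodone().
		 */
		return 0;
	}

Both the error returns and the later completion should end in exactly one
biodone() for bp; the bug being chased is a second completion arriving
through the xbd_handler() path.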
Re: panic: biodone2 already
Emmanuel Dreyfus wrote:

> > xbd is not mpsafe, so it shouldn't even be a race due to parallel
> > processing on different CPUs. Maybe it would be useful to check if the
> > problem still happens when you assign just a single CPU to the DOMU.
>
> I get the crash with vcpu = 1 for the domU. I also tried to pin a single
> CPU for the test domU, and I still get it to crash:

I started tracing the code to see where the problem comes from. So far,
I can tell that in vfs_bio.c, bread() -> bio_doread() will call
VOP_STRATEGY once for the offending buf_t, but biodone() is called twice
in interrupt context for the buf_t, leading to the "biodone2 already"
panic later.

Since you know the xbd code, you could save me some time: where do we go
below VOP_STRATEGY?

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
Re: panic: biodone2 already
On Wed, Aug 08, 2018 at 10:30:23AM +0700, Robert Elz wrote:

> This suggests to me that something is getting totally scrambled in
> the buf headers when things get busy.

I tried to crash with BIOHIST enabled; here is the story of the buf_t that
triggers the panic because it has BO_DONE set:

1533732322.896722 bread#23928@0: called!
1533732322.896727 biowait#53190@0: called: bp=0xa57d96c0, oflags=0x0, ret_addr=0x804c2e40
1533732322.896727 biowait#53190@0: waiting bp=0xa57d96c0
1533732322.897058 biodone#16699@0: called: bp=0xa57d96c0, softint scheduled
1533732322.897058 biodone#16700@0: called: bp=0xa57d96c0, softint scheduled
1533732322.897064 biointr#16597@0: called!
1533732322.897064 biointr#16597@0: bp=0xa57d96c0
1533732322.897064 biodone2#16777@0: called: bp=0xa57d96c0
1533732322.897064 biodone2#16777@0: wake-up
1533732322.897065 biointr#16597@0: bp=0xa57d96c0
1533732322.897065 biodone2#16778@0: called: bp=0xa57d96c0

There does not seem to be anything corrupted. The buf_t is handled twice in
biointr because it has been inserted twice into ci_data.cpu_biodone by
biodone.

-- 
Emmanuel Dreyfus
m...@netbsd.org
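A rough sketch of the deferral mechanism involved, heavily simplified from
sys/kern/vfs_bio.c (the softint handle name is illustrative and the
spl/locking details are omitted):

	void
	biodone(buf_t *bp)
	{

		if (cpu_intr_p()) {
			/* interrupt context: queue bp for the per-CPU softint */
			TAILQ_INSERT_TAIL(&curcpu()->ci_data.cpu_biodone,
			    bp, b_actq);
			softint_schedule(biodone_sih);	/* illustrative handle name */
		} else {
			/* process now - the buffer may be freed soon */
			biodone2(bp);
		}
	}

Nothing on this path checks BO_DONE, so a second biodone() on the same bp
in interrupt context simply queues it a second time; the duplication is
only detected when biointr() drains the queue and biodone2() finds BO_DONE
already set, which is exactly what the trace above shows.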
Re: panic: biodone2 already
Robert Elz wrote:

> This suggests to me that something is getting totally scrambled in
> the buf headers when things get busy.

I tried dumping the buf_t before the panic, to check whether it could be
completely corrupted, but it seems it is not the case. The blkno is
4904744; the filesystem has 131891200 blocks.

bp = 0xa5c1e000
bp->b_error = 0
bp->b_resid = 0
bp->b_flags = 0x10
bp->b_prio = 1
bp->b_bufsize = 2048
bp->b_bcount = 2048
bp->b_dev = 0x8e00
bp->b_blkno = 4904744
bp->b_proc = 0x0
bp->b_saveaddr = 0x0
bp->b_private = 0x0
bp->b_dcookie = 0
bp->b_refcnt = 1
bp->b_lblkno = 0
bp->b_cflags = 0x10
bp->b_vp = 0xa00005c2b2a8
bp->b_oflags = 0x200
panic: biodone2 already

db{0}> show vnode 0xa5c2b2a8
OBJECT 0xa5c2b2a8: locked=1, pgops=0x8058ac00, npages=0, refs=2

vnode 0xa5c2b2a8 flags 0x30
	tag VT_UFS(1) type VDIR(2) mount 0xa448c000 typedata 0x0
	usecount 2 writecount 0 holdcount 1
	size 200 writesize 200 numoutput 0
	data 0xa5c2cf00 lock 0xa5c2b3d8
	state LOADED key(0xa448c000 8) 3a 9f 04 00 00 00 00 00
	lrulisthd 0x8067bb80

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
Re: panic: biodone2 already
For what it is worth, and in this case it might not be much, I did a
similar test on my test amd64 DomU last night. Running dump and /etc/daily
in parallel did nothing, but running lots of them in parallel eventually
did cause a crash.

But I saw a different crash -- rather than a biodone2, I got a KASSERT
from one I added as part of attempting to diagnose the babylon5 "install
failures" - that is, if my test kernel ever gets an I/O error, the KASSERT
(which is just KASSERT(0)) causes a crash. This was intended for generic
kernels that run in qemu - but I use the same sources for my generic
testing, and simply left that there. My DomU test system *never* gets an
I/O error, so it simply did not matter (its filesystem is on a raid on the
Dom0, and neither the Dom0 nor the raid report anything even smelling like
I/O errors; what's more, the Dom0 is more likely to crash than ever allow
a real I/O error through to the DomU).

This is the I/O error that occurred...

[ 485570.8105971] xbd0a: error writing fsbn 49691936 of 49691936-49691951 (xbd0 bn 49691936; cn 24263 tn 0 sn 1312)
panic: kernel diagnostic assertion "0" failed: file "/readonly/release/testing/src/sys/kern/subr_disk.c", line 163

What's kind of interesting about that, is that the DomU filesystem is ...

format	FFSv1
endian	little-endian
magic	11954	time	Wed Aug  8 03:57:00 2018
superblock location	8192	id	[ 57248bd0 6db5a772 ]
cylgrp	dynamic	inodes	4.4BSD	sblock	FFSv2	fslevel 4
nbfree	4037762	ndir	2334	nifree	2009341	nffree	3116
ncg	624	size	33554432	blocks	33289830

(no idea how I managed to make it FFSv1, but that doesn't matter).

What is interesting is "blocks 33289830" when compared with the I/O error
"error writing fsbn 49691936", which is 16402106 blocks beyond the end of
the filesystem.

This suggests to me that something is getting totally scrambled in
the buf headers when things get busy.

kre
Re: panic: biodone2 already
Jaromír Doleček wrote:

> Thanks. Could you please try a -current kernel for DOMU and see if it
> crashes the same? If possible a DOMU kernel from daily builds, to rule
> out a local compiler issue.

It crashes the same way with a kernel from 201808050730. Here is the
uname -a output:

NetBSD bacasable 8.99.23 NetBSD 8.99.23 (XEN3PAE_DOMU) #0: Sun Aug  5 06:48:50 UTC 2018  mkre...@mkrepro.netbsd.org:/usr/src/sys/arch/xen/compile/XEN3PAE_DOMU i386

> xbd is not mpsafe, so it shouldn't even be a race due to parallel
> processing on different CPUs. Maybe it would be useful to check if the
> problem still happens when you assign just a single CPU to the DOMU.

I get the crash with vcpu = 1 for the domU. I also tried to pin a single
CPU for the test domU, and I still get it to crash:

xl vcpu-pin bacasable 0 0
xl vcpu-pin $other_domU all 1-3

An interesting point: adding the -X flag to dump seems to let it work
without a panic. It may be luck, but that has not crashed so far.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
Re: panic: biodone2 already
2018-08-07 18:42 GMT+02:00 Emmanuel Dreyfus:

> kern/53506

Thanks. Could you please try a -current kernel for DOMU and see if it
crashes the same? If possible a DOMU kernel from daily builds, to rule
out a local compiler issue.

There are not really many differences in the xbd/evtchn code itself
between 8.0 and -current, however. There was some interrupt code
reorganization which might have affected this, but that happened after
netbsd-8 was created.

xbd is not mpsafe, so it shouldn't even be a race due to parallel
processing on different CPUs. Maybe it would be useful to check if the
problem still happens when you assign just a single CPU to the DOMU.

Jaromir
Re: panic: biodone2 already
Martin Husemann wrote:

> Please file a PR.

kern/53506

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
Re: panic: biodone2 already
Emmanuel Dreyfus wrote:

> /sbin/dump -a0f /dev/null /
> sh /etc/daily

The second command can be replaced by a simple 'grep -r something /etc'.
But so far I have not managed to crash it without running dump.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
Re: panic: biodone2 already
On Tue, Aug 07, 2018 at 05:30:27PM +0200, Emmanuel Dreyfus wrote:

> I can reproduce the crash at will: running at the same time the two
> following commands reliably triggers "panic: biodone2 already"
>
> /sbin/dump -a0f /dev/null /
> sh /etc/daily

Please file a PR.

Martin
Re: panic: biodone2 already
Jaromír Doleček wrote:

> This is always a bug: the driver processes the same buf twice. It can do
> harm. If the buf is reused for some other I/O, the system can fail to
> store data, or claim to have read data when it didn't.

I can reproduce the crash at will: running at the same time the two
following commands reliably triggers "panic: biodone2 already"

/sbin/dump -a0f /dev/null /
sh /etc/daily

That crashes a NetBSD-8.0 domU, either i386 or amd64. Here, the only
machines that have been spared the bug so far are the ones where the dump
is finished when /etc/daily starts. All the others keep rebooting at least
once a day.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
Re: panic: biodone2 already
Emmanuel Dreyfus wrote:

> Here it is. And here is another flavor below

I am now convinced the problem came with NetBSD 8.0: I found that two
other domUs crashed on the daily backup since the NetBSD 8.0 upgrade, and
the panic is also "biodone2 already". I start downgrading today.

uvm_fault(0xc06d4960, 0, 1) -> 0xe
fatal page fault in supervisor mode
trap type 6 code 0 eip 0xc0360835 cs 0x9 eflags 0x210282 cr2 0x1c ilevel 0 esp 0xc59010b4
curlwp 0xc56af000 pid 0 lid 39 lowest kstack 0xdd8bc2c0
panic: trap
cpu0: Begin traceback...
vpanic(c04da74f,dd8bde0c,dd8bde90,c010bb65,c04da74f,dd8bde9c,dd8bde9c,27,dd8bc2c0,210282) at netbsd:vpanic+0x12d
panic(c04da74f,dd8bde9c,dd8bde9c,27,dd8bc2c0,210282,1c,0,c59010b4,dd8bde9c) at netbsd:panic+0x18
trap() at netbsd:trap+0xc75
--- trap (number 6) ---
uvm_aio_aiodone_pages(dd8bdf20,8,0,0,0,0,0,0,0,0) at netbsd:uvm_aio_aiodone_pages+0x15
uvm_aio_aiodone(c59010b4,0,c52b0118,c52b0110,c52b0104,c52b00c0,c03ca660,c56af000,0,c0102031) at netbsd:uvm_aio_aiodone+0x9a
workqueue_worker(c52b00c0,b6f000,c0595200,0,c0100075,0,0,0,0,0) at netbsd:workqueue_worker+0xbe
cpu0: End traceback...

dumping to dev 142,1 offset 8
dump uvm_fault(0xc06d2be0, 0xfd894000, 2) -> 0xe
fatal page fault in supervisor mode
trap type 6 code 0x2 eip 0xc010806d cs 0x9 eflags 0x210202 cr2 0xfd894fff ilevel 0x8 esp 0xc06ba580
curlwp 0xc56af000 pid 0 lid 39 lowest kstack 0xdd8bc2c0
Skipping crash dump on recursive panic
panic: trap
cpu0: Begin traceback...
vpanic(c04da74f,dd8bdc98,dd8bdd1c,c010bb65,c04da74f,dd8bdd28,dd8bdd28,27,dd8bc2c0,210202) at netbsd:vpanic+0x12d
panic(c04da74f,dd8bdd28,dd8bdd28,27,dd8bc2c0,210202,fd894fff,8,c06ba580,dd8bdd28) at netbsd:panic+0x18
trap() at netbsd:trap+0xc75
--- trap (number 6) ---
dodumpsys(dd8bde0c,0,104,c01091fa,8,0,fffa,104,c04da74f,dd8bddf0) at netbsd:dodumpsys+0x39d
dumpsys(104,0,c04da74f,dd8bde0c,5,c56af000,6,dd8bde00,c03c4e48,c04da74f) at netbsd:dumpsys+0x14
vpanic(c04da74f,dd8bde0c,dd8bde90,c010bb65,c04da74f,dd8bde9c,dd8bde9c,27,dd8bc2c0,210282) at netbsd:vpanic+0x13b
panic(c04da74f,dd8bde9c,dd8bde9c,27,dd8bc2c0,210282,1c,0,c59010b4,dd8bde9c) at netbsd:panic+0x18
trap() at netbsd:trap+0xc75
--- trap (number 6) ---
uvm_aio_aiodone_pages(dd8bdf20,8,0,0,0,0,0,0,0,0) at netbsd:uvm_aio_aiodone_pages+0x15
uvm_aio_aiodone(c59010b4,0,c52b0118,c52b0110,c52b0104,c52b00c0,c03ca660,c56af000,0,c0102031) at netbsd:uvm_aio_aiodone+0x9a
workqueue_worker(c52b00c0,b6f000,c0595200,0,c0100075,0,0,0,0,0) at netbsd:workqueue_worker+0xbe
cpu0: End traceback...

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
Re: panic: biodone2 already
Martin Husemann wrote:

> What driver is this?

xbd. This is a NetBSD-8.0/i386 Xen domU on top of a NetBSD-8.0/amd64 dom0
running on Xen 4.8.3. In the dom0, the disk image is in a file in a FFSv2
filesystem on a RAIDframe RAID 1, with two wd disks.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
Re: panic: biodone2 already
On Mon, Aug 06, 2018 at 08:37:56PM +0200, Emmanuel Dreyfus wrote:

> cpu0: Begin traceback...
> vpanic(c04da74f,dcbdfdd4,dcbdfe58,c010bb65,c04da74f,dcbdfe64,dcbdfe64,4,dcbde2c0,210202) at netbsd:vpanic+0x12d
> panic(c04da74f,dcbdfe64,dcbdfe64,4,dcbde2c0,210202,fd893fff,8,c06ba580,ff491) at netbsd:panic+0x18
> trap() at netbsd:trap+0xc75
> --- trap (number 6) ---
> dodumpsys(dcbdff48,0,104,c01091fa,8,0,fffb,104,c04f9e2c,dcbdff2c) at netbsd:dodumpsys+0x39d
> dumpsys(104,0,c04f9e2c,dcbdff48,c5bf9950,c057fe00,dcc182cc,dcbdff3c,c03c4e48,c04f9e2c) at netbsd:dumpsys+0x14
> vpanic(c04f9e2c,dcbdff48,dcbdff54,c03fdd2f,c04f9e2c,c0114934,c5bf9950,c057fe00,dcbdff68,c03fdd5d) at netbsd:vpanic+0x13b
> panic(c04f9e2c,c0114934,c5bf9950,c057fe00,dcbdff68,c03fdd5d,0,0,dcc18004,dcbdff8c) at netbsd:panic+0x18
> biodone2(0,0,dcc18004,dcbdff8c,c039e66d,0,dcbdff88,c52ae7e0,0,dcc18004) at netbsd:biodone2+0xff
> biointr(0,dcbdff88,c52ae7e0,0,dcc18004,c039e610,c52ae7e0,0,c0102031,dcc18004) at netbsd:biointr+0x2d
> softint_thread(dcc18004,b6f000,c0595200,0,c0100075,0,0,0,0,0) at netbsd:softint_thread+0x5d
> cpu0: End traceback...

What driver is this?

Martin
Re: panic: biodone2 already
Jaromír Doleček wrote:

> Can you give full backtrace?

Here it is. I wonder if it could change things without -o log.

cpu0: Begin traceback...
vpanic(c04da74f,dcbdfdd4,dcbdfe58,c010bb65,c04da74f,dcbdfe64,dcbdfe64,4,dcbde2c0,210202) at netbsd:vpanic+0x12d
panic(c04da74f,dcbdfe64,dcbdfe64,4,dcbde2c0,210202,fd893fff,8,c06ba580,ff491) at netbsd:panic+0x18
trap() at netbsd:trap+0xc75
--- trap (number 6) ---
dodumpsys(dcbdff48,0,104,c01091fa,8,0,fffb,104,c04f9e2c,dcbdff2c) at netbsd:dodumpsys+0x39d
dumpsys(104,0,c04f9e2c,dcbdff48,c5bf9950,c057fe00,dcc182cc,dcbdff3c,c03c4e48,c04f9e2c) at netbsd:dumpsys+0x14
vpanic(c04f9e2c,dcbdff48,dcbdff54,c03fdd2f,c04f9e2c,c0114934,c5bf9950,c057fe00,dcbdff68,c03fdd5d) at netbsd:vpanic+0x13b
panic(c04f9e2c,c0114934,c5bf9950,c057fe00,dcbdff68,c03fdd5d,0,0,dcc18004,dcbdff8c) at netbsd:panic+0x18
biodone2(0,0,dcc18004,dcbdff8c,c039e66d,0,dcbdff88,c52ae7e0,0,dcc18004) at netbsd:biodone2+0xff
biointr(0,dcbdff88,c52ae7e0,0,dcc18004,c039e610,c52ae7e0,0,c0102031,dcc18004) at netbsd:biointr+0x2d
softint_thread(dcc18004,b6f000,c0595200,0,c0100075,0,0,0,0,0) at netbsd:softint_thread+0x5d
cpu0: End traceback...

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
Re: panic: biodone2 already
This is always a bug: the driver processes the same buf twice. It can do
harm. If the buf is reused for some other I/O, the system can fail to
store data, or claim to have read data when it didn't.

Can you give a full backtrace?

Jaromir

2018-08-06 17:56 GMT+02:00 Emmanuel Dreyfus:

> Hello
>
> I have a Xen domU that now frequently panics with "biodone2 already". I
> suspect it started after upgrading to NetBSD 8.0.
>
> The offending code is in src/sys/kern/vfs_bio.c (see below). I wonder if
> we could not just release the mutex and exit in that case. Can it do
> some harm?
>
> static void
> biodone2(buf_t *bp)
> {
> 	void (*callout)(buf_t *);
>
> 	SDT_PROBE1(io, kernel, , done, bp);
>
> 	BIOHIST_FUNC(__func__);
> 	BIOHIST_CALLARGS(biohist, "bp=%#jx", (uintptr_t)bp, 0, 0, 0);
>
> 	mutex_enter(bp->b_objlock);
> 	/* Note that the transfer is done. */
> 	if (ISSET(bp->b_oflags, BO_DONE))
> 		panic("biodone2 already");
>
> Cou
>
> -- 
> Emmanuel Dreyfus
> http://hcpnet.free.fr/pubz
> m...@netbsd.org
panic: biodone2 already
Hello

I have a Xen domU that now frequently panics with "biodone2 already". I
suspect it started after upgrading to NetBSD 8.0.

The offending code is in src/sys/kern/vfs_bio.c (see below). I wonder if
we could not just release the mutex and exit in that case. Can it do some
harm?

static void
biodone2(buf_t *bp)
{
	void (*callout)(buf_t *);

	SDT_PROBE1(io, kernel, , done, bp);

	BIOHIST_FUNC(__func__);
	BIOHIST_CALLARGS(biohist, "bp=%#jx", (uintptr_t)bp, 0, 0, 0);

	mutex_enter(bp->b_objlock);
	/* Note that the transfer is done. */
	if (ISSET(bp->b_oflags, BO_DONE))
		panic("biodone2 already");

Cou

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
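For concreteness, the change being wondered about would look roughly like
this; it is only a sketch of the idea, and, as the follow-ups above point
out, it would paper over a double completion in the driver rather than fix
it:

	mutex_enter(bp->b_objlock);
	/* Note that the transfer is done. */
	if (ISSET(bp->b_oflags, BO_DONE)) {
		/* ignore the duplicate completion instead of panicking */
		mutex_exit(bp->b_objlock);
		return;
	}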