Re: [Qemu-devel] [PATCH] Fix Guest VM crash due to iSCSI Sense Key error
I do not have a test case to reproduce this issue. It is seen rarely. The fix looks good to me, will confirm if I am able to reproduce the error scenario. Regards Shaju On 8/14/19, 4:21 AM, "John Snow" wrote: On 7/7/19 10:55 PM, shaju.abra...@nutanix.com wrote: > From: Shaju Abraham > > During the IDE DMA transfer for a ISCSI target,when libiscsi encounters > a SENSE KEY error, it sets the task->sense to the value "COMMAND ABORTED". > The function iscsi_translate_sense() later translaters this error to -ECANCELED > and this value is passed to the callback function. In the case of IDE DMA read > or write, the callback function returns immediately if the value of the ret > argument is -ECANCELED. > Later when ide_cancel_dma_sync() function is invoked the assertion > "s->bus->dma->aiocb == ((void *)0)" fails and the qemu process gets terminated. > Fix the issue by making the value of s->bus->dma->aiocb = NULL when > -ECANCELED is passed to the callback. > > Signed-off-by: Shaju Abraham > --- > hw/ide/core.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/hw/ide/core.c b/hw/ide/core.c > index 6afadf8..78ea357 100644 > --- a/hw/ide/core.c > +++ b/hw/ide/core.c > @@ -841,6 +841,7 @@ static void ide_dma_cb(void *opaque, int ret) > bool stay_active = false; > > if (ret == -ECANCELED) { > +s->bus->dma->aiocb = NULL; > return; > } > > Hopefully just as adequately addressed by the patches in https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_jnsnow_qemu_commits_ide&d=DwICaQ&c=s883GpUCOChKOHiocYtGcg&r=sY-XeNqcuy_ruBQ9T7A2LmG6ktyYXXSxRB1ljkxMepI&m=lmNnHLnsZKEaZkunWBMldNPiL87un4Q2Brtsa0zCKiQ&s=KGmAtez5AckTpNugzMzxMObkZKQ3A5vIIiukShVYUXM&e= but if you wanted to give it a test and confirm for me, I wouldn't be upset by that. --js
Re: [Qemu-devel] [PATCH] Fix Guest VM crash due to iSCSI Sense Key error
On 7/7/19 10:55 PM, shaju.abra...@nutanix.com wrote: > From: Shaju Abraham > > During the IDE DMA transfer for a ISCSI target,when libiscsi encounters > a SENSE KEY error, it sets the task->sense to the value "COMMAND ABORTED". > The function iscsi_translate_sense() later translaters this error to > -ECANCELED > and this value is passed to the callback function. In the case of IDE DMA > read > or write, the callback function returns immediately if the value of the ret > argument is -ECANCELED. > Later when ide_cancel_dma_sync() function is invoked the assertion > "s->bus->dma->aiocb == ((void *)0)" fails and the qemu process gets > terminated. > Fix the issue by making the value of s->bus->dma->aiocb = NULL when > -ECANCELED is passed to the callback. > > Signed-off-by: Shaju Abraham > --- > hw/ide/core.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/hw/ide/core.c b/hw/ide/core.c > index 6afadf8..78ea357 100644 > --- a/hw/ide/core.c > +++ b/hw/ide/core.c > @@ -841,6 +841,7 @@ static void ide_dma_cb(void *opaque, int ret) > bool stay_active = false; > > if (ret == -ECANCELED) { > +s->bus->dma->aiocb = NULL; > return; > } > > Hopefully just as adequately addressed by the patches in https://github.com/jnsnow/qemu/commits/ide but if you wanted to give it a test and confirm for me, I wouldn't be upset by that. --js
Re: [Qemu-devel] [PATCH] Fix Guest VM crash due to iSCSI Sense Key error
On 29/07/19 23:37, John Snow wrote: >>> >>> If it does -- if there are indeed no places in the code today that >>> artificially inject -ECANCELED -- I need to remove these special stanzas >>> from the IDE code and allow the IDE state machine to handle these errors >>> as true errors. >> The bug is that there is no place to inject -ECANCELED in the dbs->bh >> case. I've sent an obviously^W untested patch. > > Where does it inject -ECANCELED in the non-dbs->bh case? It's simply part of the contract of bdrv_aio_cancel_async that the completion callback will be invoked (most of the time the I/O will just go on without any cancellation, as you noted). Paolo
Re: [Qemu-devel] [PATCH] Fix Guest VM crash due to iSCSI Sense Key error
On 7/29/19 5:32 PM, Paolo Bonzini wrote: > On 29/07/19 21:45, John Snow wrote: >> Next, we'll unschedule the BH if there is one. I think the only case >> where there is one is the reschedule_dma case of dma_blk_cb. (I'm not >> too familiar with these DMA helpers: in what cases do we expect the iov >> to be empty?) > > When there is another I/O that is using the DMA bounce buffer (the one > case that comes to mind in which you do DMA from MMIO areas is > loading/saving VGA RAM). > >> So it looks like this cancellation will produce one of two effects, >> depending on when it's invoked: >> >> 1) We'll stall the DMA permanently by deleting that BH, because >> dma_complete will never get invoked and therefore nobody will ever call >> ide_dma_cb with any return value of any kind. The IDE state machine >> likely just hangs waiting for the DMA to finish until the guest OS >> decides to reset the errant controller. >> >> 2) The DMA will continue blissfully unaware it was canceled, because the >> lower AIOCB has no cancel method, and so will finish, call back to >> dma_blk_cb, and continue the transfer loop unaware. >> >> >> ... Does your reading align with mine? >> >> >> If it does -- if there are indeed no places in the code today that >> artificially inject -ECANCELED -- I need to remove these special stanzas >> from the IDE code and allow the IDE state machine to handle these errors >> as true errors. > > The bug is that there is no place to inject -ECANCELED in the dbs->bh > case. I've sent an obviously^W untested patch. > Where does it inject -ECANCELED in the non-dbs->bh case? --js
Re: [Qemu-devel] [PATCH] Fix Guest VM crash due to iSCSI Sense Key error
On 29/07/19 21:45, John Snow wrote: > Next, we'll unschedule the BH if there is one. I think the only case > where there is one is the reschedule_dma case of dma_blk_cb. (I'm not > too familiar with these DMA helpers: in what cases do we expect the iov > to be empty?) When there is another I/O that is using the DMA bounce buffer (the one case that comes to mind in which you do DMA from MMIO areas is loading/saving VGA RAM). > So it looks like this cancellation will produce one of two effects, > depending on when it's invoked: > > 1) We'll stall the DMA permanently by deleting that BH, because > dma_complete will never get invoked and therefore nobody will ever call > ide_dma_cb with any return value of any kind. The IDE state machine > likely just hangs waiting for the DMA to finish until the guest OS > decides to reset the errant controller. > > 2) The DMA will continue blissfully unaware it was canceled, because the > lower AIOCB has no cancel method, and so will finish, call back to > dma_blk_cb, and continue the transfer loop unaware. > > > ... Does your reading align with mine? > > > If it does -- if there are indeed no places in the code today that > artificially inject -ECANCELED -- I need to remove these special stanzas > from the IDE code and allow the IDE state machine to handle these errors > as true errors. The bug is that there is no place to inject -ECANCELED in the dbs->bh case. I've sent an obviously^W untested patch. Paolo > I'm just not confident enough in my unwinding of the DMA callback > spaghetti, though. > > --js >
Re: [Qemu-devel] [PATCH] Fix Guest VM crash due to iSCSI Sense Key error
On 7/29/19 6:09 AM, Stefan Hajnoczi wrote: > On Fri, Jul 26, 2019 at 04:18:46PM -0400, John Snow wrote: >> Paolo, Stefan and Kevin: can I loop you in here? I'm quite uncertain >> about this and I'd like to clear this up quickly if it's possible: >> >> On 7/25/19 8:58 PM, John Snow wrote: >>> >>> >>> On 7/7/19 10:55 PM, shaju.abra...@nutanix.com wrote: From: Shaju Abraham During the IDE DMA transfer for a ISCSI target,when libiscsi encounters a SENSE KEY error, it sets the task->sense to the value "COMMAND ABORTED". The function iscsi_translate_sense() later translaters this error to -ECANCELED and this value is passed to the callback function. In the case of IDE DMA read or write, the callback function returns immediately if the value of the ret argument is -ECANCELED. Later when ide_cancel_dma_sync() function is invoked the assertion "s->bus->dma->aiocb == ((void *)0)" fails and the qemu process gets terminated. Fix the issue by making the value of s->bus->dma->aiocb = NULL when -ECANCELED is passed to the callback. Signed-off-by: Shaju Abraham --- hw/ide/core.c | 1 + 1 file changed, 1 insertion(+) diff --git a/hw/ide/core.c b/hw/ide/core.c index 6afadf8..78ea357 100644 --- a/hw/ide/core.c +++ b/hw/ide/core.c @@ -841,6 +841,7 @@ static void ide_dma_cb(void *opaque, int ret) bool stay_active = false; if (ret == -ECANCELED) { +s->bus->dma->aiocb = NULL; return; } >>> >>> The part that makes me nervous here is that I can't remember why we do >>> NO cleanup whatsoever for the ECANCELED case. >>> >>> commit 0d910cfeaf2076b116b4517166d5deb0fea76394 >>> Author: Fam Zheng >>> Date: Thu Sep 11 13:41:07 2014 +0800 >>> >>> ide/ahci: Check for -ECANCELED in aio callbacks >>> >>> >>> ... This looks like we never expected the aio callbacks to ever get >>> called with ECANCELED, so we treat this as a QEMU-internal signal. >>> >>> It looks like we expect these callbacks to do NOTHING in this case; but >>> I'm not sure where the IDE state machine does its cleanup otherwise. >>> (The DMA might have been canceled, but the DMA and IDE state machines >>> still need to exit their loop.) >>> >>> If you take a look at this patch from 2014 though, there are many other >>> spots where we have littered ECANCELED checks that might also cause >>> problems if we're receiving error codes we thought we couldn't get normally. >>> >>> I am worried this patch papers over something worse. >>> >> I'm not clear why Fam's patch adds a do-nothing return to the ide_dma_cb >> if it's invoked with ECANCELED: shouldn't it be the case that the IDE >> state machine needs to know that a transfer it was relying on to service >> an ATA command was canceled and treat it like an error? >> >> Why was it ever correct to ignore these? Is it because we only ever >> canceled DMA during reset/shutdown/etc? >> >> It appears as if iscsi requests can actually genuinely return an >> ECANCELED errno, so there are likely several places in the IDE code that >> need to accommodate this from happening. >> >> The easiest fix LOOKS like just deleting the special-casing of ECANCELED >> altogether and letting the error pathways handle things as normal. >> >> Am I mistaken? > > I think your instincts are right that there are deeper issues. The > first step would be test cases, then you can be sure various scenarios > have been handled correctly. > Suggestions? I'm not sure what's supposed to work and in what way here. I guess this stuff was introduced for bdrv_aio_cancel_async, but it's not immediately clear what's supposed to happen when you call that. > I noticed that ide_sector_read_cb(), ide_sector_write_cb(), and > ide_flush_cb() all differ in whether they reset s->pio_aiocb and > s->status before returning early due to -ECANCELED. That must be a bug. > > I didn't look at the ide_dma_cb() code path. > > Stefan > Hm ... It looks like canceling the ide_dma_cb AIOCB objects doesn't do anything too useful? dma_blk_io and friends will establish dma_aio_cancel as the async cancel callback. So if we do cancel these objects, we're going to call this function: static void dma_aio_cancel(BlockAIOCB *acb) { DMAAIOCB *dbs = container_of(acb, DMAAIOCB, common); trace_dma_aio_cancel(dbs); if (dbs->acb) { blk_aio_cancel_async(dbs->acb); } if (dbs->bh) { cpu_unregister_map_client(dbs->bh); qemu_bh_delete(dbs->bh); dbs->bh = NULL; } } but there's no cancel callback for the lower layer in dbs->acb, so that's just a nop, I think -- blk_aio_prwv doesn't offer an asynchronous cancel mechanism. Next, we'll unschedule the BH if there is one. I think the only case where there is one is the reschedule_dma case of dma_blk_cb. (I'm not too familiar with these DMA helpers: in what cases do we expect the
Re: [Qemu-devel] [PATCH] Fix Guest VM crash due to iSCSI Sense Key error
On Fri, Jul 26, 2019 at 04:18:46PM -0400, John Snow wrote: > Paolo, Stefan and Kevin: can I loop you in here? I'm quite uncertain > about this and I'd like to clear this up quickly if it's possible: > > On 7/25/19 8:58 PM, John Snow wrote: > > > > > > On 7/7/19 10:55 PM, shaju.abra...@nutanix.com wrote: > >> From: Shaju Abraham > >> > >> During the IDE DMA transfer for a ISCSI target,when libiscsi encounters > >> a SENSE KEY error, it sets the task->sense to the value "COMMAND ABORTED". > >> The function iscsi_translate_sense() later translaters this error to > >> -ECANCELED > >> and this value is passed to the callback function. In the case of IDE DMA > >> read > >> or write, the callback function returns immediately if the value of the ret > >> argument is -ECANCELED. > >> Later when ide_cancel_dma_sync() function is invoked the assertion > >> "s->bus->dma->aiocb == ((void *)0)" fails and the qemu process gets > >> terminated. > >> Fix the issue by making the value of s->bus->dma->aiocb = NULL when > >> -ECANCELED is passed to the callback. > >> > >> Signed-off-by: Shaju Abraham > >> --- > >> hw/ide/core.c | 1 + > >> 1 file changed, 1 insertion(+) > >> > >> diff --git a/hw/ide/core.c b/hw/ide/core.c > >> index 6afadf8..78ea357 100644 > >> --- a/hw/ide/core.c > >> +++ b/hw/ide/core.c > >> @@ -841,6 +841,7 @@ static void ide_dma_cb(void *opaque, int ret) > >> bool stay_active = false; > >> > >> if (ret == -ECANCELED) { > >> +s->bus->dma->aiocb = NULL; > >> return; > >> } > >> > >> > > > > The part that makes me nervous here is that I can't remember why we do > > NO cleanup whatsoever for the ECANCELED case. > > > > commit 0d910cfeaf2076b116b4517166d5deb0fea76394 > > Author: Fam Zheng > > Date: Thu Sep 11 13:41:07 2014 +0800 > > > > ide/ahci: Check for -ECANCELED in aio callbacks > > > > > > ... This looks like we never expected the aio callbacks to ever get > > called with ECANCELED, so we treat this as a QEMU-internal signal. > > > > It looks like we expect these callbacks to do NOTHING in this case; but > > I'm not sure where the IDE state machine does its cleanup otherwise. > > (The DMA might have been canceled, but the DMA and IDE state machines > > still need to exit their loop.) > > > > If you take a look at this patch from 2014 though, there are many other > > spots where we have littered ECANCELED checks that might also cause > > problems if we're receiving error codes we thought we couldn't get normally. > > > > I am worried this patch papers over something worse. > > > I'm not clear why Fam's patch adds a do-nothing return to the ide_dma_cb > if it's invoked with ECANCELED: shouldn't it be the case that the IDE > state machine needs to know that a transfer it was relying on to service > an ATA command was canceled and treat it like an error? > > Why was it ever correct to ignore these? Is it because we only ever > canceled DMA during reset/shutdown/etc? > > It appears as if iscsi requests can actually genuinely return an > ECANCELED errno, so there are likely several places in the IDE code that > need to accommodate this from happening. > > The easiest fix LOOKS like just deleting the special-casing of ECANCELED > altogether and letting the error pathways handle things as normal. > > Am I mistaken? I think your instincts are right that there are deeper issues. The first step would be test cases, then you can be sure various scenarios have been handled correctly. I noticed that ide_sector_read_cb(), ide_sector_write_cb(), and ide_flush_cb() all differ in whether they reset s->pio_aiocb and s->status before returning early due to -ECANCELED. That must be a bug. I didn't look at the ide_dma_cb() code path. Stefan signature.asc Description: PGP signature
Re: [Qemu-devel] [PATCH] Fix Guest VM crash due to iSCSI Sense Key error
Paolo, Stefan and Kevin: can I loop you in here? I'm quite uncertain about this and I'd like to clear this up quickly if it's possible: On 7/25/19 8:58 PM, John Snow wrote: > > > On 7/7/19 10:55 PM, shaju.abra...@nutanix.com wrote: >> From: Shaju Abraham >> >> During the IDE DMA transfer for a ISCSI target,when libiscsi encounters >> a SENSE KEY error, it sets the task->sense to the value "COMMAND ABORTED". >> The function iscsi_translate_sense() later translaters this error to >> -ECANCELED >> and this value is passed to the callback function. In the case of IDE DMA >> read >> or write, the callback function returns immediately if the value of the ret >> argument is -ECANCELED. >> Later when ide_cancel_dma_sync() function is invoked the assertion >> "s->bus->dma->aiocb == ((void *)0)" fails and the qemu process gets >> terminated. >> Fix the issue by making the value of s->bus->dma->aiocb = NULL when >> -ECANCELED is passed to the callback. >> >> Signed-off-by: Shaju Abraham >> --- >> hw/ide/core.c | 1 + >> 1 file changed, 1 insertion(+) >> >> diff --git a/hw/ide/core.c b/hw/ide/core.c >> index 6afadf8..78ea357 100644 >> --- a/hw/ide/core.c >> +++ b/hw/ide/core.c >> @@ -841,6 +841,7 @@ static void ide_dma_cb(void *opaque, int ret) >> bool stay_active = false; >> >> if (ret == -ECANCELED) { >> +s->bus->dma->aiocb = NULL; >> return; >> } >> >> > > The part that makes me nervous here is that I can't remember why we do > NO cleanup whatsoever for the ECANCELED case. > > commit 0d910cfeaf2076b116b4517166d5deb0fea76394 > Author: Fam Zheng > Date: Thu Sep 11 13:41:07 2014 +0800 > > ide/ahci: Check for -ECANCELED in aio callbacks > > > ... This looks like we never expected the aio callbacks to ever get > called with ECANCELED, so we treat this as a QEMU-internal signal. > > It looks like we expect these callbacks to do NOTHING in this case; but > I'm not sure where the IDE state machine does its cleanup otherwise. > (The DMA might have been canceled, but the DMA and IDE state machines > still need to exit their loop.) > > If you take a look at this patch from 2014 though, there are many other > spots where we have littered ECANCELED checks that might also cause > problems if we're receiving error codes we thought we couldn't get normally. > > I am worried this patch papers over something worse. > I'm not clear why Fam's patch adds a do-nothing return to the ide_dma_cb if it's invoked with ECANCELED: shouldn't it be the case that the IDE state machine needs to know that a transfer it was relying on to service an ATA command was canceled and treat it like an error? Why was it ever correct to ignore these? Is it because we only ever canceled DMA during reset/shutdown/etc? It appears as if iscsi requests can actually genuinely return an ECANCELED errno, so there are likely several places in the IDE code that need to accommodate this from happening. The easiest fix LOOKS like just deleting the special-casing of ECANCELED altogether and letting the error pathways handle things as normal. Am I mistaken? --js
Re: [Qemu-devel] [PATCH] Fix Guest VM crash due to iSCSI Sense Key error
On 7/7/19 10:55 PM, shaju.abra...@nutanix.com wrote: > From: Shaju Abraham > > During the IDE DMA transfer for a ISCSI target,when libiscsi encounters > a SENSE KEY error, it sets the task->sense to the value "COMMAND ABORTED". > The function iscsi_translate_sense() later translaters this error to > -ECANCELED > and this value is passed to the callback function. In the case of IDE DMA > read > or write, the callback function returns immediately if the value of the ret > argument is -ECANCELED. > Later when ide_cancel_dma_sync() function is invoked the assertion > "s->bus->dma->aiocb == ((void *)0)" fails and the qemu process gets > terminated. > Fix the issue by making the value of s->bus->dma->aiocb = NULL when > -ECANCELED is passed to the callback. > > Signed-off-by: Shaju Abraham > --- > hw/ide/core.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/hw/ide/core.c b/hw/ide/core.c > index 6afadf8..78ea357 100644 > --- a/hw/ide/core.c > +++ b/hw/ide/core.c > @@ -841,6 +841,7 @@ static void ide_dma_cb(void *opaque, int ret) > bool stay_active = false; > > if (ret == -ECANCELED) { > +s->bus->dma->aiocb = NULL; > return; > } > > The part that makes me nervous here is that I can't remember why we do NO cleanup whatsoever for the ECANCELED case. commit 0d910cfeaf2076b116b4517166d5deb0fea76394 Author: Fam Zheng Date: Thu Sep 11 13:41:07 2014 +0800 ide/ahci: Check for -ECANCELED in aio callbacks ... This looks like we never expected the aio callbacks to ever get called with ECANCELED, so we treat this as a QEMU-internal signal. It looks like we expect these callbacks to do NOTHING in this case; but I'm not sure where the IDE state machine does its cleanup otherwise. (The DMA might have been canceled, but the DMA and IDE state machines still need to exit their loop.) If you take a look at this patch from 2014 though, there are many other spots where we have littered ECANCELED checks that might also cause problems if we're receiving error codes we thought we couldn't get normally. I am worried this patch papers over something worse. --js
[Qemu-devel] [PATCH] Fix Guest VM crash due to iSCSI Sense Key error
From: Shaju Abraham During the IDE DMA transfer for a ISCSI target,when libiscsi encounters a SENSE KEY error, it sets the task->sense to the value "COMMAND ABORTED". The function iscsi_translate_sense() later translaters this error to -ECANCELED and this value is passed to the callback function. In the case of IDE DMA read or write, the callback function returns immediately if the value of the ret argument is -ECANCELED. Later when ide_cancel_dma_sync() function is invoked the assertion "s->bus->dma->aiocb == ((void *)0)" fails and the qemu process gets terminated. Fix the issue by making the value of s->bus->dma->aiocb = NULL when -ECANCELED is passed to the callback. Signed-off-by: Shaju Abraham --- hw/ide/core.c | 1 + 1 file changed, 1 insertion(+) diff --git a/hw/ide/core.c b/hw/ide/core.c index 6afadf8..78ea357 100644 --- a/hw/ide/core.c +++ b/hw/ide/core.c @@ -841,6 +841,7 @@ static void ide_dma_cb(void *opaque, int ret) bool stay_active = false; if (ret == -ECANCELED) { +s->bus->dma->aiocb = NULL; return; } -- 1.9.4