Re: [PATCH] crypto: virtio-crypto: call finalize with bh disabled
On Tue, 26 Sep 2023 18:41:58 +0200 Halil Pasic wrote: > > + local_bh_disable(); > > crypto_finalize_akcipher_request(vc_akcipher_req->base.dataq->engine, > > req, err); > > + local_bh_enable(); > > Thanks Gonglei! > > I did this a quick spin, and it does not seem to be sufficient on s390x. > Which does not come as a surprise to me, because > > #define lockdep_assert_in_softirq() \ > do {\ > WARN_ON_ONCE(__lockdep_enabled && \ > (!in_softirq() || in_irq() || in_nmi())); \ > } while (0) > > will still warn because in_irq() still evaluates to true (your patch > addresses the !in_softirq() part). > > I don't have any results on x86 yet. My current understanding is that the > virtio-pci transport code disables interrupts locally somewhere in the > call chain (actually in vp_vring_interrupt() via spin_lock_irqsave()) > and then x86 would be fine. But I will get that verified. [ 35.177962][ C0] WARNING: CPU: 0 PID: 152 at kernel/softirq.c:306 __local_bh_disable_ip (kernel/softirq.c:306 (discriminator 1)) [ 35.178551][C0] Modules linked in: vmw_vsock_virtio_transport(+) vmw_vsock_virtio_transport_common virtio_crypto(+) crypto_engine vsock [ 35.179930][C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014 [ 35.180548][ C0] RIP: 0010:__local_bh_disable_ip (kernel/softirq.c:306 (discriminator 1)) [ 35.180936][ C0] Code: eb 7d 65 8b 05 ef 90 eb 7d 31 f0 f6 c4 ff 74 13 9c 58 f6 c4 02 75 17 80 e7 02 74 01 fb 5b c3 cc cc cc cc e8 48 2f 15 00 eb e6 <0f> 0b eb ca e8 2d 88 03 03 eb e2 66 66 2e 0f 1f 84 00 00 00 00 00 All code 0: eb 7d jmp0x7f 2: 65 8b 05 ef 90 eb 7dmov%gs:0x7deb90ef(%rip),%eax# 0x7deb90f8 9: 31 f0 xor%esi,%eax b: f6 c4 fftest $0xff,%ah e: 74 13 je 0x23 10: 9c pushf 11: 58 pop%rax 12: f6 c4 02test $0x2,%ah 15: 75 17 jne0x2e 17: 80 e7 02and$0x2,%bh 1a: 74 01 je 0x1d 1c: fb sti 1d: 5b pop%rbx 1e: c3 ret 1f: cc int3 20: cc int3 21: cc int3 22: cc int3 23: e8 48 2f 15 00 call 0x152f70 28: eb e6 jmp0x10 2a:* 0f 0b ud2 <-- trapping instruction 2c: eb ca jmp0xfff8 2e: e8 2d 88 03 03 call 0x3038860 33: eb e2 jmp0x17 35: 66 66 2e 0f 1f 84 00data16 cs nopw 0x0(%rax,%rax,1) 3c: 00 00 00 00 Code starting with the faulting instruction === 0: 0f 0b ud2 2: eb ca jmp0xffce 4: e8 2d 88 03 03 call 0x3038836 9: eb e2 jmp0xffed b: 66 66 2e 0f 1f 84 00data16 cs nopw 0x0(%rax,%rax,1) 12: 00 00 00 00 [ 35.182237][C0] RSP: 0018:c9007d88 EFLAGS: 00010006 [ 35.182637][C0] RAX: 80010003 RBX: 888108308538 RCX: c9007d50 [ 35.183186][C0] RDX: 88811ae36300 RSI: 0200 RDI: c02b16cc [ 35.183700][C0] RBP: 8881083084e8 R08: R09: fbfff0d04f04 [ 35.184216][C0] R10: 86827823 R11: 852013e6 R12: 0001 [ 35.184730][C0] R13: R14: 888108308538 R15: dc00 [ 35.185248][C0] FS: 7f06cb551800() GS:88811ae0() knlGS: [ 35.185831][C0] CS: 0010 DS: ES: CR0: 80050033 [ 35.186271][C0] CR2: 55dc93010628 CR3: 000116b28000 CR4: 06f0 [ 35.186789][C0] Call Trace: [ 35.187010][C0] [ 35.187204][ C0] ? __warn (kernel/panic.c:673) [ 35.187505][ C0] ? __local_bh_disable_ip (kernel/softirq.c:306 (discriminator 1)) [ 35.187857][ C0] ? report_bug (lib/bug.c:180 lib/bug.c:219) [ 35.188197][ C0] ? handle_bug (arch/x86/kernel/traps.c:237 (discriminator 1)) [ 35.188483][ C0] ? exc_invalid_op (arch/x86/kernel/traps.c:258 (discriminator 1)) [ 35.188790][ C0] ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:568) [ 35.189120][ C0] ? asm_common_interrupt (./arch/x86/include/asm/idtentry.h:636) [ 35.189466][ C0] ? virtio_crypto_dataq_sym_callback (drivers/crypto/virtio/virtio_crypto_skcipher_algs.c:567 drivers/crypto/virtio/virtio_crypto_sk
Re: [PATCH] crypto: virtio-crypto: call finalize with bh disabled
On Wed, 27 Sep 2023 09:24:09 + "Gonglei (Arei)" wrote: > > On a related note, config change callback is also handled incorrectly in > > this > > driver, it takes a mutex from interrupt context. > > Good catch. Will fix it. Thanks Gonglei! Sorry I first misunderstood this as a problem within the virtio-ccw driver, but it is actually about virtio-crypto. Thanks for fixing this! Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] crypto: virtio-crypto: call finalize with bh disabled
On Wed, 27 Sep 2023 14:12:19 +0200 Cornelia Huck wrote: > On Wed, Sep 27 2023, Halil Pasic wrote: > > > On Wed, 27 Sep 2023 12:08:43 +0200 > > Cornelia Huck wrote: > > > >> > On the other hand virtio_airq_handler() calls vring_interrupt() with > >> > interrupts enabled. (While vring_interrupt() is called in a (read) > >> > critical section in virtio_airq_handler() we use read_lock() and > >> > not read_lock_irqsave() to grab the lock. Whether that is correct in > >> > it self (i.e. disregarding the crypto problem) or not I'm not sure right > >> > now. Will think some more about it tomorrow.) If the way to go forward > >> > is disabling interrupts in virtio-ccw before vring_interrupt() is > >> > called, I would be glad to spin a patch for that. > >> > >> virtio_airq_handler() is supposed to be an interrupt handler for an > >> adapter interrupt -- as such I would expect it to always run with > >> interrupts disabled (and I'd expect vring_interrupt() to be called > >> with interrupts disabled as well; if that's not the case, I think it > >> would need to run asynchronously.) At least that was my understanding at > >> the time I wrote the code. > > > > Thanks Connie! I don't quite understand what do you mean by "run with > > interrupts disabled" in this context. > > > > Do you mean that if I were to add the following warning: > > > > diff --git a/drivers/s390/virtio/virtio_ccw.c > > b/drivers/s390/virtio/virtio_ccw.c > > index ac67576301bf..2a9c73f5964f 100644 > > --- a/drivers/s390/virtio/virtio_ccw.c > > +++ b/drivers/s390/virtio/virtio_ccw.c > > @@ -211,6 +211,8 @@ static void virtio_airq_handler(struct airq_struct > > *airq, > > struct airq_info *info = container_of(airq, struct airq_info, airq); > > unsigned long ai; > > > > + WARN_ONCE(in_irq(), "irqs are ought to be disabled but are not\n"); > > + > > inc_irq_stat(IRQIO_VAI); > > > > it would/should never trigger, or do you mean something different? > > > > If yes, does that mean that you would expect the common airq code (i.e. > > something > > like do_airq_interrupt()) to disable interrupts, or call > > virtio_airq_handler()? > > asynchronously sort of as a bottom half (my understanding of bottom halves > > is currently > > not complete). > > > > If no what do you actually mean? > > My understanding (at the time) was that we're coming from the low-level > interrupt handler (which disables interrupts via the NEW PSW); > interrupts will be re-enabled once the basic processing is done. This > might no longer be the case, but I currently don't have the time to dig > into the code -- it has been some time. > It disables IO interrupts. I happen to have the PSW :) But AFAIU we may still get machine check type interrupts. So I'm leaning towards: the code is actually safe, but I will double check again. But then what we do on s390x probably does not fit well with Linux abstractions. AFAIU in Linux we don't have the granularity "this lock is used in irq context but only IO irq context, so we don't care that we may get interrupted by a non-IO irq"... This complication is why I asked what do you mean by "run with interrupts disabled", because this code does run with "IO interrupts locally disabled, but not with *all* interrupts disabled". I fully understand that you are bandwith limited. I'm adding Peter and Vineeth, since this also concerns CIO. The easy fix for this warning is to disable interrupts locally I guess. In any case I will take care of this one way or another. Thanks Conny! Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] crypto: virtio-crypto: call finalize with bh disabled
On Wed, 27 Sep 2023 12:08:43 +0200 Cornelia Huck wrote: > > On the other hand virtio_airq_handler() calls vring_interrupt() with > > interrupts enabled. (While vring_interrupt() is called in a (read) > > critical section in virtio_airq_handler() we use read_lock() and > > not read_lock_irqsave() to grab the lock. Whether that is correct in > > it self (i.e. disregarding the crypto problem) or not I'm not sure right > > now. Will think some more about it tomorrow.) If the way to go forward > > is disabling interrupts in virtio-ccw before vring_interrupt() is > > called, I would be glad to spin a patch for that. > > virtio_airq_handler() is supposed to be an interrupt handler for an > adapter interrupt -- as such I would expect it to always run with > interrupts disabled (and I'd expect vring_interrupt() to be called > with interrupts disabled as well; if that's not the case, I think it > would need to run asynchronously.) At least that was my understanding at > the time I wrote the code. Thanks Connie! I don't quite understand what do you mean by "run with interrupts disabled" in this context. Do you mean that if I were to add the following warning: diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c index ac67576301bf..2a9c73f5964f 100644 --- a/drivers/s390/virtio/virtio_ccw.c +++ b/drivers/s390/virtio/virtio_ccw.c @@ -211,6 +211,8 @@ static void virtio_airq_handler(struct airq_struct *airq, struct airq_info *info = container_of(airq, struct airq_info, airq); unsigned long ai; + WARN_ONCE(in_irq(), "irqs are ought to be disabled but are not\n"); + inc_irq_stat(IRQIO_VAI); it would/should never trigger, or do you mean something different? If yes, does that mean that you would expect the common airq code (i.e. something like do_airq_interrupt()) to disable interrupts, or call virtio_airq_handler()? asynchronously sort of as a bottom half (my understanding of bottom halves is currently not complete). If no what do you actually mean? Regards, Halil Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] crypto: virtio-crypto: call finalize with bh disabled
On Tue, 26 Sep 2023 13:13:51 -0400 "Michael S. Tsirkin" wrote: > > On the other hand virtio_airq_handler() calls vring_interrupt() with > > interrupts enabled. (While vring_interrupt() is called in a (read) > > critical section in virtio_airq_handler() we use read_lock() and > > not read_lock_irqsave() to grab the lock. Whether that is correct in > > it self (i.e. disregarding the crypto problem) or not I'm not sure right > > now. Will think some more about it tomorrow.) If the way to go forward > > is disabling interrupts in virtio-ccw before vring_interrupt() is > > called, I would be glad to spin a patch for that. > > > > Copying Conny, as she may have an opinion on this (if I'm not wrong she > > authored that code). > > > > Regards, > > Halil > > On a related note, config change callback is also handled incorrectly > in this driver, it takes a mutex from interrupt context. Thanks Michael! I intend to give Connie a little more time to chime in. Assumed no controversies emerge I intend to cook up two patches for the two issues later during the day. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] crypto: virtio-crypto: call finalize with bh disabled
[..] > --- a/drivers/crypto/virtio/virtio_crypto_akcipher_algs.c > +++ b/drivers/crypto/virtio/virtio_crypto_akcipher_algs.c > @@ -61,8 +61,9 @@ static void virtio_crypto_akcipher_finalize_req( > vc_akcipher_req->src_buf = NULL; > vc_akcipher_req->dst_buf = NULL; > virtcrypto_clear_request(_akcipher_req->base); > - > + local_bh_disable(); > crypto_finalize_akcipher_request(vc_akcipher_req->base.dataq->engine, > req, err); > + local_bh_enable(); Thanks Gonglei! I did this a quick spin, and it does not seem to be sufficient on s390x. Which does not come as a surprise to me, because #define lockdep_assert_in_softirq() \ do {\ WARN_ON_ONCE(__lockdep_enabled && \ (!in_softirq() || in_irq() || in_nmi())); \ } while (0) will still warn because in_irq() still evaluates to true (your patch addresses the !in_softirq() part). I don't have any results on x86 yet. My current understanding is that the virtio-pci transport code disables interrupts locally somewhere in the call chain (actually in vp_vring_interrupt() via spin_lock_irqsave()) and then x86 would be fine. But I will get that verified. On the other hand virtio_airq_handler() calls vring_interrupt() with interrupts enabled. (While vring_interrupt() is called in a (read) critical section in virtio_airq_handler() we use read_lock() and not read_lock_irqsave() to grab the lock. Whether that is correct in it self (i.e. disregarding the crypto problem) or not I'm not sure right now. Will think some more about it tomorrow.) If the way to go forward is disabling interrupts in virtio-ccw before vring_interrupt() is called, I would be glad to spin a patch for that. Copying Conny, as she may have an opinion on this (if I'm not wrong she authored that code). Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: virtcrypto_dataq_callback calls crypto_finalize_request() from irq context
On Sun, 24 Sep 2023 11:56:25 + "Gonglei (Arei)" wrote: > Hi Halil, > > Commit 4058cf08945 introduced a check for detecting crypto completion > function > called with enable BH, and indeed the virtio-crypto driver didn't disable BH, > which needs > a patch to fix it. > > P.S.: > https://lore.kernel.org/lkml/20220221120833.2618733-5-cla...@baylibre.com/T/ > > Regards, > -Gonglei Thanks Gonglei! Thanks! I would be glad to test that fix on s390x. Are you about to send one? Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
BUG: virtcrypto_dataq_callback calls crypto_finalize_request() from irq context
Hi Gonglei! Our CI has found that virtio-crypto does not honor the requirement of crypto_finalize_request() being called in softirq context which is asserted in that function via lockdep_assert_in_softirq() since commit 4058cf08945c ("crypto: engine - check if BH is disabled during completion"). The problem was originally found on s390x but Marc Hartmayer was so kind to reproduce it on amd64. Please find the corresponding kernel messages at the end of this email. The call chain looks like this. interrupt handler for queue notification --> virtcrypto_dataq_callback() --> via vc_req->alg_cb either virtio_crypto_skcipher_finalize_req() or virtio_crypto_akcipher_finalize_req() --> crypto_finalize_skcipher_request() or crypto_finalize_akcipher_request() --> crypto_finalize_request() Everything above is happening in the interrupt handler (and in "hard" irq context). I'm not really familiar with the implementation of virtio_crypto or with the crypto_engine interfaces. I assume the problem is on the side of virtio-crypto so I would like to kindly ask you as the maintainer of virtio-crypt to have a look at it. But if you think it is rather in the crypto_engine, please clarify that with Herbert. I have no strong opinion on this issue. Regards, Halil [ 31.033415][ C0] WARNING: CPU: 0 PID: 136 at crypto/crypto_engine.c:58 crypto_finalize_request (crypto/crypto_engine.c:58 (discriminator 23)) crypto_engine [ 31.034131][C0] Modules linked in: virtio_crypto(+) vmw_vsock_virtio_transport_common(+) crypto_engine vsock [ 31.035326][C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014 [ 31.035917][ C0] RIP: 0010:crypto_finalize_request (crypto/crypto_engine.c:58 (discriminator 23)) crypto_engine [ 31.036398][ C0] Code: 08 5b 5d 41 5c 41 5d e9 bf 88 1c c1 65 8b 05 b0 36 01 40 f6 c4 ff 74 12 a9 00 00 0f 00 75 0b a9 00 00 f0 00 0f 84 54 ff ff ff <0f> 0b e9 4d ff ff ff 4c 8d 6b 38 4c 89 ef e8 8e 47 1b c4 48 8d bb All code 0: 08 5b 5dor %bl,0x5d(%rbx) 3: 41 5c pop%r12 5: 41 5d pop%r13 7: e9 bf 88 1c c1 jmp0xc11c88cb c: 65 8b 05 b0 36 01 40mov%gs:0x400136b0(%rip),%eax# 0x400136c3 13: f6 c4 fftest $0xff,%ah 16: 74 12 je 0x2a 18: a9 00 00 0f 00 test $0xf,%eax 1d: 75 0b jne0x2a 1f: a9 00 00 f0 00 test $0xf0,%eax 24: 0f 84 54 ff ff ff je 0xff7e 2a:* 0f 0b ud2 <-- trapping instruction 2c: e9 4d ff ff ff jmp0xff7e 31: 4c 8d 6b 38 lea0x38(%rbx),%r13 35: 4c 89 efmov%r13,%rdi 38: e8 8e 47 1b c4 call 0xc41b47cb 3d: 48 rex.W 3e: 8d .byte 0x8d 3f: bb .byte 0xbb Code starting with the faulting instruction === 0: 0f 0b ud2 2: e9 4d ff ff ff jmp0xff54 7: 4c 8d 6b 38 lea0x38(%rbx),%r13 b: 4c 89 efmov%r13,%rdi e: e8 8e 47 1b c4 call 0xc41b47a1 13: 48 rex.W 14: 8d .byte 0x8d 15: bb .byte 0xbb [ 31.037591][C0] RSP: 0018:c9007da0 EFLAGS: 00010046 [ 31.037976][C0] RAX: 80010002 RBX: 888006c87428 RCX: 10c0e523 [ 31.038471][C0] RDX: RSI: 88810d0819e8 RDI: 888006c87449 [ 31.038967][C0] RBP: 88810d0819e8 R08: R09: fbfff0b04f04 [ 31.039463][C0] R10: 85827823 R11: 842013e6 R12: [ 31.039963][C0] R13: 0001 R14: 88810d081a18 R15: dc00 [ 31.040475][C0] FS: 7f80c0cc6800() GS:88811ae0() knlGS: [ 31.041058][C0] CS: 0010 DS: ES: CR0: 80050033 [ 31.041473][C0] CR2: 7f22ad455270 CR3: 000106b22000 CR4: 06f0 [ 31.042024][C0] Call Trace: [ 31.042250][C0] [ 31.042433][ C0] ? __warn (kernel/panic.c:673) [ 31.042710][ C0] ? crypto_finalize_request (crypto/crypto_engine.c:58 (discriminator 23)) crypto_engine [ 31.043161][ C0] ? report_bug (lib/bug.c:180 lib/bug.c:219) [ 31.043451][ C0] ? handle_bug (arch/x86/kernel/traps.c:237 (discriminator 1)) [ 31.043728][ C0] ? exc_invalid_op (arch/x86/kernel/traps.c:258 (discriminator 1)) [ 31.044039][ C0] ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:568) [ 31.044385][ C0] ? asm_common_interrupt (./arch/x86/include/asm/idtentry.h:636) [ 31.044746][ C0] ? crypto_finalize_request (crypto/crypto_engine.c:58 (discriminator 23)) crypto_engine [
Re: [PATCH V5 6/9] virtio-ccw: implement synchronize_cbs()
On Wed, 18 May 2022 11:59:48 +0800 Jason Wang wrote: > This patch tries to implement the synchronize_cbs() for ccw. For the > vring_interrupt() that is called via virtio_airq_handler(), the > synchronization is simply done via the airq_info's lock. For the > vring_interrupt() that is called via virtio_ccw_int_handler(), a per > device rwlock is introduced ans used in the synchronization method. > > Cc: Thomas Gleixner > Cc: Peter Zijlstra > Cc: "Paul E. McKenney" > Cc: Marc Zyngier > Cc: Halil Pasic > Cc: Cornelia Huck > Cc: Vineeth Vijayan > Cc: Peter Oberparleiter > Cc: linux-s...@vger.kernel.org > Signed-off-by: Jason Wang Reviewed-by: Halil Pasic ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V5 0/9] rework on the IRQ hardening of virtio
On Mon, 23 May 2022 10:53:23 +0200 Halil Pasic wrote: > On Wed, 18 May 2022 11:59:42 +0800 > Jason Wang wrote: > > > Hi All: > > Sorry for being slow on this one. I'm pretty much under water. Will try > to get some regression-testing done till tomorrow end of day. > Did some testing with the two stage indicators disabled. Didn't see any significant difference in performance, and with that also no performance regression. IMHO we are good to go ahead! Sorry it took so long. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V5 0/9] rework on the IRQ hardening of virtio
On Wed, 18 May 2022 11:59:42 +0800 Jason Wang wrote: > Hi All: Sorry for being slow on this one. I'm pretty much under water. Will try to get some regression-testing done till tomorrow end of day. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V4 0/9] rework on the IRQ hardening of virtio
On Thu, 12 May 2022 11:31:08 +0800 Jason Wang wrote: > > > It looks to me we need to use write_lock_irq()/write_unlock_irq() to > > > do the synchronization. > > > > > > And we probably need to keep the > > > read_lock_irqsave()/read_lock_irqrestore() logic since I can see the > > > virtio_ccw_int_handler() to be called from process context (e.g from > > > the io_subchannel_quiesce()). > > > > > > > Sounds correct. > > As Cornelia and Vineeth pointed out, all the paths the vring_interrupt > is called with irq disabled. > > So I will use spin_lock()/spin_unlock() in the next version. Can we do some sort of an assertion that if the kernel is built with the corresponding debug features will make sure this assumption holds (and warn if it does not)? That assertion would also document the fact. If an assertion is not possible, I think we should at least place a strategic comment that documents our assumption. Regards, Halil > > Thanks ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V4 0/9] rework on the IRQ hardening of virtio
On Wed, 11 May 2022 10:22:59 +0800 Jason Wang wrote: > >CPU0 > > > > lock(>irq_lock); > > > > lock(>irq_lock); > > > > *** DEADLOCK *** > > It looks to me we need to use write_lock_irq()/write_unlock_irq() to > do the synchronization. > > And we probably need to keep the > read_lock_irqsave()/read_lock_irqrestore() logic since I can see the > virtio_ccw_int_handler() to be called from process context (e.g from > the io_subchannel_quiesce()). > Sounds correct. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V4 8/9] virtio: harden vring IRQ
On Wed, 11 May 2022 17:27:44 +0800 Jason Wang wrote: > On Wed, May 11, 2022 at 4:44 PM Cornelia Huck wrote: > > > > On Wed, May 11 2022, Jason Wang wrote: > > > > > On Tue, May 10, 2022 at 7:32 PM Michael S. Tsirkin > > > wrote: > > >> > > >> On Sat, May 07, 2022 at 03:19:53PM +0800, Jason Wang wrote: > > >> > diff --git a/include/linux/virtio_config.h > > >> > b/include/linux/virtio_config.h > > >> > index d8a2340f928e..23f1694cdbd5 100644 > > >> > --- a/include/linux/virtio_config.h > > >> > +++ b/include/linux/virtio_config.h > > >> > @@ -256,6 +256,18 @@ void virtio_device_ready(struct virtio_device > > >> > *dev) > > >> > unsigned status = dev->config->get_status(dev); > > >> > > > >> > BUG_ON(status & VIRTIO_CONFIG_S_DRIVER_OK); > > >> > + > > >> > + /* > > >> > + * The virtio_synchronize_cbs() makes sure vring_interrupt() > > >> > + * will see the driver specific setup if it sees vq->broken > > >> > + * as false. > > >> > + */ > > >> > + virtio_synchronize_cbs(dev); > > >> > > >> since you mention vq->broken above, maybe add > > >> "set vq->broken to false" > > > > > > Ok. > > > > > >> > > >> > + __virtio_unbreak_device(dev); > > >> > + /* > > >> > + * The transport is expected ensure the visibility of > > >> > > >> to ensure > > > > > > Will fix. > > > > > >> > > >> > + * vq->broken > > >> > > >> let's add: "visibility by vq callbacks" > > > > > > Sure. > > > > > >> > > >> > before setting VIRTIO_CONFIG_S_DRIVER_OK. > > >> > + */ > > >> > > >> > > >> Can I see some analysis of existing transports showing > > >> this is actually the case for them? > > > > > > Yes. > > > > > >> And maybe add a comment near set_status to document the > > >> requirement. > > > > > > For PCI and MMIO, we can quote the memory-barriers.txt or explain that > > > wmb() is not needed before the MMIO writel(). > > > For CCW, it looks not obvious, it looks to me the IO was submitted via > > > __ssch() which has an inline assembly. Cornelia and Hali, could you > > > help me to understand if and how did virtio_ccw_set_status() can > > > ensure the visibility of the previous driver setup and vq->broken > > > here? > > > > I'm not sure I completely understand the question here, but let me try: > > It's something like the following case: > > CPU 0: vq->broken = false > CPU 0: set_status(DRIVER_OK) > CPU 1: vring_interrupt() { if (vq->broken) return IRQ_NONE; } > > We need to make sure the CPU 1 sees the vq->broken if the interrupt is > raised after DRVER_OK. > > For PCI, we use MMIO of writel() for set_status(), a wmb() is not > needed in this case according to memory-barriers.txt. > > " > Note that, when using writel(), a prior > wmb() is not needed to guarantee that the cache coherent memory writes > have completed before writing to the MMIO region. > " IMHO the key facts here are the following: * ssch and all other I/O instructions are serializing instructions * all interruptions are serializing operations For reference see https://www.ibm.com/resources/publications/OutputPubsDetails?PubID=SA22783213 page 5-138. Maybe we should add that to the linux documentation somewhere if not already mentioned. So IMHO we don't need CPU0 to do a wmb() because of the ssch. > > So CPU 1 will see the broken as false. But barriers need to be paired. And in my understanding the ssch doesn't really ensure that CPU1 is about to see the change, unless there is a suitable barrier that pairs with the barrier implied the ssch instruction. Assumed vring_interrupt() is always done in hard-irq context, AFAIU, we should be fine. Is that assumption correct? Why are we fine: * Either the ssch was performed before the interrupt for vring_interrupt() got delivered on CPU1, and then we are guaranteed to see the updated value for vq->broken, * or the interrupt that triggered vring_interrupt() was delivered before the ssch instruction got executed. But in this case it is fine to ignore the notification, because this is actually the bad case we want to guard against: we got a notification when notifications are not allowed. We may end up with !vq->broken and !DEVICE_OK as well, but that should be fine because, although that notification would be a should not happen one, I understand it would not catch us with our pants down. Regards, Halil > > > > > virtio_ccw_set_status() uses a channel command to set the status, with > > the interesting stuff done inside ccw_io_helper(). That function > > - takes the subchannel lock, disabling interrupts > > Then it is, for x86 the operation to disable interrupt is a full > barrier. I guess this should apply to other architecture like s390. I > see a stnsm is used in this case but a quick google doesn't tell me if > it's a barrier. > If this is true. The vring_interrupt will see broken as false. > > > - does the ssch; this instruction will fail if there's already another
Re: [PATCH V3 6/9] virtio-ccw: implement synchronize_cbs()
On Wed, 27 Apr 2022 11:27:03 +0200 Cornelia Huck wrote: > On Tue, Apr 26 2022, "Michael S. Tsirkin" wrote: > > > On Tue, Apr 26, 2022 at 05:47:17PM +0200, Cornelia Huck wrote: > >> On Mon, Apr 25 2022, "Michael S. Tsirkin" wrote: > >> > >> > On Mon, Apr 25, 2022 at 11:53:24PM -0400, Michael S. Tsirkin wrote: > >> >> On Tue, Apr 26, 2022 at 11:42:45AM +0800, Jason Wang wrote: > >> >> > > >> >> > 在 2022/4/26 11:38, Michael S. Tsirkin 写道: > >> >> > > On Mon, Apr 25, 2022 at 11:35:41PM -0400, Michael S. Tsirkin wrote: > >> >> > > > >> >> > > > On Tue, Apr 26, 2022 at 04:29:11AM +0200, Halil Pasic wrote: > >> >> > > > > On Mon, 25 Apr 2022 09:59:55 -0400 > >> >> > > > > "Michael S. Tsirkin" wrote: > >> >> > > > > > >> >> > > > > > On Mon, Apr 25, 2022 at 10:54:24AM +0200, Cornelia Huck > >> >> > > > > > wrote: > >> >> > > > > > > On Mon, Apr 25 2022, "Michael S. Tsirkin" > >> >> > > > > > > wrote: > >> >> > > > > > > > On Mon, Apr 25, 2022 at 10:44:15AM +0800, Jason Wang > >> >> > > > > > > > wrote: > >> >> > > > > > > > > This patch tries to implement the synchronize_cbs() for > >> >> > > > > > > > > ccw. For the > >> >> > > > > > > > > vring_interrupt() that is called via > >> >> > > > > > > > > virtio_airq_handler(), the > >> >> > > > > > > > > synchronization is simply done via the airq_info's > >> >> > > > > > > > > lock. For the > >> >> > > > > > > > > vring_interrupt() that is called via > >> >> > > > > > > > > virtio_ccw_int_handler(), a per > >> >> > > > > > > > > device spinlock for irq is introduced ans used in the > >> >> > > > > > > > > synchronization > >> >> > > > > > > > > method. > >> >> > > > > > > > > > >> >> > > > > > > > > Cc: Thomas Gleixner > >> >> > > > > > > > > Cc: Peter Zijlstra > >> >> > > > > > > > > Cc: "Paul E. McKenney" > >> >> > > > > > > > > Cc: Marc Zyngier > >> >> > > > > > > > > Cc: Halil Pasic > >> >> > > > > > > > > Cc: Cornelia Huck > >> >> > > > > > > > > Signed-off-by: Jason Wang > >> >> > > > > > > > > >> >> > > > > > > > This is the only one that is giving me pause. Halil, > >> >> > > > > > > > Cornelia, > >> >> > > > > > > > should we be concerned about the performance impact here? > >> >> > > > > > > > Any chance it can be tested? > >> >> > > > > > > We can have a bunch of devices using the same airq > >> >> > > > > > > structure, and the > >> >> > > > > > > sync cb creates a choke point, same as > >> >> > > > > > > registering/unregistering. > >> >> > > > > > BTW can callbacks for multiple VQs run on multiple CPUs at > >> >> > > > > > the moment? > >> >> > > > > I'm not sure I understand the question. > >> >> > > > > > >> >> > > > > I do think we can have multiple CPUs that are executing some > >> >> > > > > portion of > >> >> > > > > virtio_ccw_int_handler(). So I guess the answer is yes. Connie > >> >> > > > > what do you think? > >> >> > > > > > >> >> > > > > On the other hand we could also end up serializing > >> >> > > > > synchronize_cbs() > >> >> > > > > calls for different devices if they happen to use the same > >> >> > >
Re: [PATCH V3 6/9] virtio-ccw: implement synchronize_cbs()
On Mon, 25 Apr 2022 10:54:24 +0200 Cornelia Huck wrote: > On Mon, Apr 25 2022, "Michael S. Tsirkin" wrote: > > > On Mon, Apr 25, 2022 at 10:44:15AM +0800, Jason Wang wrote: > >> This patch tries to implement the synchronize_cbs() for ccw. For the > >> vring_interrupt() that is called via virtio_airq_handler(), the > >> synchronization is simply done via the airq_info's lock. For the > >> vring_interrupt() that is called via virtio_ccw_int_handler(), a per > >> device spinlock for irq is introduced ans used in the synchronization > >> method. > >> > >> Cc: Thomas Gleixner > >> Cc: Peter Zijlstra > >> Cc: "Paul E. McKenney" > >> Cc: Marc Zyngier > >> Cc: Halil Pasic > >> Cc: Cornelia Huck > >> Signed-off-by: Jason Wang > > > > > > This is the only one that is giving me pause. Halil, Cornelia, > > should we be concerned about the performance impact here? > > Any chance it can be tested? > > We can have a bunch of devices using the same airq structure, and the > sync cb creates a choke point, same as registering/unregistering. If > invoking the sync cb is a rare operation (same as (un)registering), it > should not affect interrupt processing for other devices too much, but > it really should be rare. With the notable difference that the critical section in sync_cb is basically empty, so it should be less intrusive that register/unregister. I would also argue, that since after the reset we (re-)discover our virtqueues and (re-)register adapter interrupts, and thus before or as a part of the reset we probably do an unregister to clean up the adapter interrupts and de-allocate the bits in the info, this should not incur any mayor overhead for the airq case, which is the preferred one. Or am I missing something? > > For testing, you would probably want to use a setup with many devices > that share the same airq area (you can fit a lot of devices if they have > few queues), generate traffic on the queues, and then do something that > triggers the callback (adding/removing a new device in a loop?) > > I currently don't have such a setup handy; Halil, would you be able to > test that? Neither do I. I would also have to start from scratch. I guess it would be also sufficient to do a setup with two devices: a nic with many busy queues, and another device that is responsible for generating the resets. Regards, Halil > > > > >> --- > >> drivers/s390/virtio/virtio_ccw.c | 27 +++ > >> 1 file changed, 27 insertions(+) > >> > >> diff --git a/drivers/s390/virtio/virtio_ccw.c > >> b/drivers/s390/virtio/virtio_ccw.c > >> index d35e7a3f7067..c19f07a82d62 100644 > >> --- a/drivers/s390/virtio/virtio_ccw.c > >> +++ b/drivers/s390/virtio/virtio_ccw.c > >> @@ -62,6 +62,7 @@ struct virtio_ccw_device { > >>unsigned int revision; /* Transport revision */ > >>wait_queue_head_t wait_q; > >>spinlock_t lock; > >> + spinlock_t irq_lock; > >>struct mutex io_lock; /* Serializes I/O requests */ > >>struct list_head virtqueues; > >>bool is_thinint; > >> @@ -984,6 +985,27 @@ static const char *virtio_ccw_bus_name(struct > >> virtio_device *vdev) > >>return dev_name(>cdev->dev); > >> } > >> > >> +static void virtio_ccw_synchronize_cbs(struct virtio_device *vdev) > >> +{ > >> + struct virtio_ccw_device *vcdev = to_vc_device(vdev); > >> + struct airq_info *info = vcdev->airq_info; > >> + > >> + /* > >> + * Synchronize with the vring_interrupt() called by > >> + * virtio_ccw_int_handler(). > >> + */ > >> + spin_lock(>irq_lock); > >> + spin_unlock(>irq_lock); > >> + > >> + if (info) { > >> + /* > >> + * Synchronize with the vring_interrupt() with airq indicator > >> + */ > >> + write_lock(>lock); > >> + write_unlock(>lock); > >> + } > > I think we can make this an either/or operation (devices will either use > classic interrupts or adapter interrupts)? Right, for virtqueue notifications. I second Connie's motion! > > >> +} > >> + > >> static const struct virtio_config_ops virtio_ccw_config_ops = { > >>.get_features = virtio_ccw_get_features, > >>.finalize_features = virtio_ccw_finalize_features, > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V3 6/9] virtio-ccw: implement synchronize_cbs()
On Mon, 25 Apr 2022 09:59:55 -0400 "Michael S. Tsirkin" wrote: > On Mon, Apr 25, 2022 at 10:54:24AM +0200, Cornelia Huck wrote: > > On Mon, Apr 25 2022, "Michael S. Tsirkin" wrote: > > > > > On Mon, Apr 25, 2022 at 10:44:15AM +0800, Jason Wang wrote: > > >> This patch tries to implement the synchronize_cbs() for ccw. For the > > >> vring_interrupt() that is called via virtio_airq_handler(), the > > >> synchronization is simply done via the airq_info's lock. For the > > >> vring_interrupt() that is called via virtio_ccw_int_handler(), a per > > >> device spinlock for irq is introduced ans used in the synchronization > > >> method. > > >> > > >> Cc: Thomas Gleixner > > >> Cc: Peter Zijlstra > > >> Cc: "Paul E. McKenney" > > >> Cc: Marc Zyngier > > >> Cc: Halil Pasic > > >> Cc: Cornelia Huck > > >> Signed-off-by: Jason Wang > > > > > > > > > This is the only one that is giving me pause. Halil, Cornelia, > > > should we be concerned about the performance impact here? > > > Any chance it can be tested? > > > > We can have a bunch of devices using the same airq structure, and the > > sync cb creates a choke point, same as registering/unregistering. > > BTW can callbacks for multiple VQs run on multiple CPUs at the moment? I'm not sure I understand the question. I do think we can have multiple CPUs that are executing some portion of virtio_ccw_int_handler(). So I guess the answer is yes. Connie what do you think? On the other hand we could also end up serializing synchronize_cbs() calls for different devices if they happen to use the same airq_info. But this probably was not your question > this patch serializes them on a spinlock. > Those could then pile up on the newly introduced spinlock. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V2 4/5] virtio-pci: implement synchronize_vqs()
On Tue, 12 Apr 2022 10:24:35 +0800 Jason Wang wrote: > > Regarding the question "are we safe against notifications before > > indicators have been registered" I think we really need to think about > > something like Secure Execution. We don't have, and we are unlikely > > to have in hardware virtio-ccw implementations, and for a malicious > > hypervisor > > that has full access to the guest memory hardening makes no sense. > > Does s390 have something like memory encryption? (I guess yes). In the > case of x86 VM encryption, the I/O buffers were now done via software > IOTLB, that's why hardening of the virtio driver is needed to prevent > the hypervisor to poke the swiotlb etc. Yep! Secure Execution is a confidential computing solution which is much like encrypted guest memory, except for one gets exceptions when trying to access private memory instead of ending up with garbage because of the encryption. These improvements are IMHO relevant to us! Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V2 4/5] virtio-pci: implement synchronize_vqs()
On Mon, 11 Apr 2022 16:27:41 +0200 Cornelia Huck wrote: > On Mon, Apr 11 2022, Jason Wang wrote: > > > On Sun, Apr 10, 2022 at 3:51 PM Michael S. Tsirkin wrote: > > > >> > >> On Fri, Apr 08, 2022 at 03:03:07PM +0200, Halil Pasic wrote: > >> > On Wed, 06 Apr 2022 15:04:32 +0200 > >> > Cornelia Huck wrote: > >> > > >> > > On Wed, Apr 06 2022, "Michael S. Tsirkin" wrote: > >> > > > >> > > > On Wed, Apr 06, 2022 at 04:35:37PM +0800, Jason Wang wrote: > >> > > >> This patch implements PCI version of synchronize_vqs(). > >> > > >> > >> > > >> Cc: Thomas Gleixner > >> > > >> Cc: Peter Zijlstra > >> > > >> Cc: "Paul E. McKenney" > >> > > >> Cc: Marc Zyngier > >> > > >> Signed-off-by: Jason Wang > >> > > > > >> > > > Please add implementations at least for ccw and mmio. > >> > > > >> > > I'm not sure what (if anything) can/should be done for ccw... > >> > > >> > If nothing needs to be done I would like to have at least a comment in > >> > the code that explains why. So that somebody who reads the code > >> > doesn't wonder: why is virtio-ccw not implementing that callback. > >> > >> Right. > >> > >> I am currently thinking instead of making this optional in the > >> core we should make it mandatory, and have transports which do not > >> need to sync have an empty stub with documentation explaining why. > > Yes, that makes sense to me. If we can explain why we don't need to do > anything, we should keep that explanation easily accessible. > > >> > >> Also, do we want to document this sync is explicitly for irq > >> enable/disable? > >> synchronize_irq_enable_disable? > > > > I would not since the transport is not guaranteed to use an interrupt > > for callbacks. > > > >> > >> > >> > > > >> > > > > >> > > >> --- > >> > > >> drivers/virtio/virtio_pci_common.c | 14 ++ > >> > > >> drivers/virtio/virtio_pci_common.h | 2 ++ > >> > > >> drivers/virtio/virtio_pci_legacy.c | 1 + > >> > > >> drivers/virtio/virtio_pci_modern.c | 2 ++ > >> > > >> 4 files changed, 19 insertions(+) > >> > > >> > >> > > >> diff --git a/drivers/virtio/virtio_pci_common.c > >> > > >> b/drivers/virtio/virtio_pci_common.c > >> > > >> index d724f676608b..b78c8bc93a97 100644 > >> > > >> --- a/drivers/virtio/virtio_pci_common.c > >> > > >> +++ b/drivers/virtio/virtio_pci_common.c > >> > > >> @@ -37,6 +37,20 @@ void vp_synchronize_vectors(struct virtio_device > >> > > >> *vdev) > >> > > >> synchronize_irq(pci_irq_vector(vp_dev->pci_dev, > >> > > >> i)); > >> > > >> } > >> > > >> > >> > > >> +void vp_synchronize_vqs(struct virtio_device *vdev) > >> > > >> +{ > >> > > >> +struct virtio_pci_device *vp_dev = to_vp_device(vdev); > >> > > >> +int i; > >> > > >> + > >> > > >> +if (vp_dev->intx_enabled) { > >> > > >> +synchronize_irq(vp_dev->pci_dev->irq); > >> > > >> +return; > >> > > >> +} > >> > > >> + > >> > > >> +for (i = 0; i < vp_dev->msix_vectors; ++i) > >> > > >> +synchronize_irq(pci_irq_vector(vp_dev->pci_dev, > >> > > >> i)); > >> > > >> +} > >> > > >> + > >> > > > >> > > ...given that this seems to synchronize threaded interrupt handlers? > >> > > Halil, do you think ccw needs to do anything? (AFAICS, we only have one > >> > > 'irq' for channel devices anyway, and the handler just calls the > >> > > relevant callbacks directly.) > >> > > >> > Sorry I don't understand enough yet. A more verbose documentation on > >> > "virtio_synchronize_vqs - synchronize with virtqueue callbacks" would &g
Re: [PATCH V2 4/5] virtio-pci: implement synchronize_vqs()
On Wed, 06 Apr 2022 15:04:32 +0200 Cornelia Huck wrote: > On Wed, Apr 06 2022, "Michael S. Tsirkin" wrote: > > > On Wed, Apr 06, 2022 at 04:35:37PM +0800, Jason Wang wrote: > >> This patch implements PCI version of synchronize_vqs(). > >> > >> Cc: Thomas Gleixner > >> Cc: Peter Zijlstra > >> Cc: "Paul E. McKenney" > >> Cc: Marc Zyngier > >> Signed-off-by: Jason Wang > > > > Please add implementations at least for ccw and mmio. > > I'm not sure what (if anything) can/should be done for ccw... If nothing needs to be done I would like to have at least a comment in the code that explains why. So that somebody who reads the code doesn't wonder: why is virtio-ccw not implementing that callback. > > > > >> --- > >> drivers/virtio/virtio_pci_common.c | 14 ++ > >> drivers/virtio/virtio_pci_common.h | 2 ++ > >> drivers/virtio/virtio_pci_legacy.c | 1 + > >> drivers/virtio/virtio_pci_modern.c | 2 ++ > >> 4 files changed, 19 insertions(+) > >> > >> diff --git a/drivers/virtio/virtio_pci_common.c > >> b/drivers/virtio/virtio_pci_common.c > >> index d724f676608b..b78c8bc93a97 100644 > >> --- a/drivers/virtio/virtio_pci_common.c > >> +++ b/drivers/virtio/virtio_pci_common.c > >> @@ -37,6 +37,20 @@ void vp_synchronize_vectors(struct virtio_device *vdev) > >>synchronize_irq(pci_irq_vector(vp_dev->pci_dev, i)); > >> } > >> > >> +void vp_synchronize_vqs(struct virtio_device *vdev) > >> +{ > >> + struct virtio_pci_device *vp_dev = to_vp_device(vdev); > >> + int i; > >> + > >> + if (vp_dev->intx_enabled) { > >> + synchronize_irq(vp_dev->pci_dev->irq); > >> + return; > >> + } > >> + > >> + for (i = 0; i < vp_dev->msix_vectors; ++i) > >> + synchronize_irq(pci_irq_vector(vp_dev->pci_dev, i)); > >> +} > >> + > > ...given that this seems to synchronize threaded interrupt handlers? > Halil, do you think ccw needs to do anything? (AFAICS, we only have one > 'irq' for channel devices anyway, and the handler just calls the > relevant callbacks directly.) Sorry I don't understand enough yet. A more verbose documentation on "virtio_synchronize_vqs - synchronize with virtqueue callbacks" would surely benefit me. It may be more than enough for a back-belt but it ain't enough for me to tell what is the callback supposed to accomplish. I will have to study this discussion and the code more thoroughly. Tentatively I side with Jason and Michael in a sense, that I don't believe virtio-ccw is safe against rough interrupts. Sorry for the late response. I intend to revisit this on Monday. If I don't please feel encouraged to ping. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] virtio: acknowledge all features before access
On Fri, 14 Jan 2022 15:09:14 -0500 "Michael S. Tsirkin" wrote: > The feature negotiation was designed in a way that > makes it possible for devices to know which config > fields will be accessed by drivers. > > This is broken since commit 404123c2db79 ("virtio: allow drivers to > validate features") with fallout in at least block and net. > We have a partial work-around in commit 2f9a174f918e ("virtio: write > back F_VERSION_1 before validate") which at least lets devices > find out which format should config space have, but this > is a partial fix: guests should not access config space > without acknowledging features since otherwise we'll never > be able to change the config space format. I agree with that. The crux is what does "acknowledge features" exactly mean. Is it "write features" or "complete the feature negotiation, including setting FEATURES_OK". My understanding is, that we should not rely on that the device is going to act according to the negotiated feature set unless FEATURES_OK was set successfully. That would mean, that this change ain't guaranteed to help with the stated problem. We simply don't know if the fact that features were written is going to have a side-effect or not. Also see below. > > As a side effect, this also reduces the amount of hypervisor accesses - > we now only acknowledge features once unless we are clearing any > features when validating. My understanding is that this patch basically does for all the features, what commit 2f9a174f918e ("virtio: write back F_VERSION_1 before validate") did only for F_VERSION_1 and under certain conditions to be minimally invasive. I don't like when s390 is the oddball, so I'm very happy to see us moving away from that. > > Cc: sta...@vger.kernel.org > Fixes: 404123c2db79 ("virtio: allow drivers to validate features") > Fixes: 2f9a174f918e ("virtio: write back F_VERSION_1 before validate") > Cc: "Halil Pasic" > Signed-off-by: Michael S. Tsirkin > --- > > Halil, I thought hard about our situation with transitional and > today I finally thought of something I am happy with. > Pls let me know what you think. Testing on big endian would > also be much appreciated! Thanks! I will first provide some comments, and I intend to come back with the test results later. > > drivers/virtio/virtio.c | 31 +-- > 1 file changed, 17 insertions(+), 14 deletions(-) > > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c > index d891b0a354b0..2ed6e2451fd8 100644 > --- a/drivers/virtio/virtio.c > +++ b/drivers/virtio/virtio.c > @@ -168,12 +168,10 @@ EXPORT_SYMBOL_GPL(virtio_add_status); > > static int virtio_finalize_features(struct virtio_device *dev) > { > - int ret = dev->config->finalize_features(dev); > unsigned status; > + int ret; > > might_sleep(); > - if (ret) > - return ret; > > ret = arch_has_restricted_virtio_memory_access(); > if (ret) { > @@ -244,17 +242,6 @@ static int virtio_dev_probe(struct device *_d) > driver_features_legacy = driver_features; > } > > - /* > - * Some devices detect legacy solely via F_VERSION_1. Write > - * F_VERSION_1 to force LE config space accesses before FEATURES_OK for > - * these when needed. > - */ > - if (drv->validate && !virtio_legacy_is_little_endian() > - && device_features & BIT_ULL(VIRTIO_F_VERSION_1)) { > - dev->features = BIT_ULL(VIRTIO_F_VERSION_1); > - dev->config->finalize_features(dev); > - } > - > if (device_features & (1ULL << VIRTIO_F_VERSION_1)) > dev->features = driver_features & device_features; > else > @@ -265,10 +252,22 @@ static int virtio_dev_probe(struct device *_d) > if (device_features & (1ULL << i)) > __virtio_set_bit(dev, i); > > + err = dev->config->finalize_features(dev); A side note: config->finalize_features() ain't the best name for what the thing does. After config->finalize_features() the features are not final. Unlike after virtio_finalize_features(). IMHO filter_and_write_features() would be a more accurate, although longer name. After this point, the features aren't final yet, and one can not say that a some feature X has been negotiated. But with regards to features, the spec does not really consider this limbo state. Should this change? Do we want to say: the device SHOULD pick up, and act upon the new features *before* FEATURES_OK is set? ... > + if (err) > + goto err;
Re: [PATCH] virtio: acknowledge all features before access
On Fri, 14 Jan 2022 15:09:14 -0500 "Michael S. Tsirkin" wrote: > The feature negotiation was designed in a way that > makes it possible for devices to know which config > fields will be accessed by drivers. > > This is broken since commit 404123c2db79 ("virtio: allow drivers to > validate features") with fallout in at least block and net. > We have a partial work-around in commit 2f9a174f918e ("virtio: write > back F_VERSION_1 before validate") which at least lets devices > find out which format should config space have, but this > is a partial fix: guests should not access config space > without acknowledging features since otherwise we'll never > be able to change the config space format. > > As a side effect, this also reduces the amount of hypervisor accesses - > we now only acknowledge features once unless we are clearing any > features when validating. > > Cc: sta...@vger.kernel.org > Fixes: 404123c2db79 ("virtio: allow drivers to validate features") > Fixes: 2f9a174f918e ("virtio: write back F_VERSION_1 before validate") > Cc: "Halil Pasic" > Signed-off-by: Michael S. Tsirkin > --- > > Halil, I thought hard about our situation with transitional and > today I finally thought of something I am happy with. > Pls let me know what you think. Testing on big endian would > also be much appreciated! Hi Michael! I was just about to have a look into this. But it does not apply cleanly to Linus master (fetched a couple of minutes ago). I also tride with d9679d0013a66849~1 but no luck. What is a suitable base for this patch? Regards, Halil > drivers/virtio/virtio.c | 31 +-- > 1 file changed, 17 insertions(+), 14 deletions(-) > > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c > index d891b0a354b0..2ed6e2451fd8 100644 > --- a/drivers/virtio/virtio.c > +++ b/drivers/virtio/virtio.c > @@ -168,12 +168,10 @@ EXPORT_SYMBOL_GPL(virtio_add_status); > > static int virtio_finalize_features(struct virtio_device *dev) > { > - int ret = dev->config->finalize_features(dev); > unsigned status; > + int ret; > > might_sleep(); > - if (ret) > - return ret; > > ret = arch_has_restricted_virtio_memory_access(); > if (ret) { > @@ -244,17 +242,6 @@ static int virtio_dev_probe(struct device *_d) > driver_features_legacy = driver_features; > } > > - /* > - * Some devices detect legacy solely via F_VERSION_1. Write > - * F_VERSION_1 to force LE config space accesses before FEATURES_OK for > - * these when needed. > - */ > - if (drv->validate && !virtio_legacy_is_little_endian() > - && device_features & BIT_ULL(VIRTIO_F_VERSION_1)) { > - dev->features = BIT_ULL(VIRTIO_F_VERSION_1); > - dev->config->finalize_features(dev); > - } > - > if (device_features & (1ULL << VIRTIO_F_VERSION_1)) > dev->features = driver_features & device_features; > else > @@ -265,10 +252,22 @@ static int virtio_dev_probe(struct device *_d) > if (device_features & (1ULL << i)) > __virtio_set_bit(dev, i); > > + err = dev->config->finalize_features(dev); > + if (err) > + goto err; > + > if (drv->validate) { > + u64 features = dev->features; > + > err = drv->validate(dev); > if (err) > goto err; > + > + if (features != dev->features) { > + err = dev->config->finalize_features(dev); > + if (err) > + goto err; > + } > } > > err = virtio_finalize_features(dev); > @@ -495,6 +494,10 @@ int virtio_device_restore(struct virtio_device *dev) > /* We have a driver! */ > virtio_add_status(dev, VIRTIO_CONFIG_S_DRIVER); > > + ret = dev->config->finalize_features(dev); > + if (ret) > + goto err; > + > ret = virtio_finalize_features(dev); > if (ret) > goto err; ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V5 1/4] virtio_ring: validate used buffer length
On Wed, 24 Nov 2021 10:33:28 +0800 Jason Wang wrote: > > > > Let's see how far we can get. But yes, maybe we were too aggressive in > > > > breaking things by default, a warning might be a better choice for a > > > > couple of cycles. > > > > Ok, considering we saw the issues with balloons I think I can post a > > patch to use warn instead. I wonder if we need to taint the kernel in > > this case. > > Rethink this, consider we still have some time, I tend to convert the > drivers to validate the length by themselves. Does this make sense? I do find value in doing the validation in a single place for every driver. This is really a common concern. But I think, not breaking what used to work before is a good idea. So I would opt for producing a warning, but otherwise preserving old behavior unless there is an explicit opt-in for something more strict. BTW AFAIU if we detect a problem here, there are basically two cases: (1) Either the device is over-reporting what it has written, or (2) we have a memory corruption in the guest because the device has written beyond the end of the provided buffer. This can be because (2.1) the driver provided a smaller buffer than mandated by the spec, or (2.2) the device is broken. Case (1) is relatively harmless, and I believe a warning for it is more than appropriate. Whoever sees the warning should push for a fixed device if possible. Case (2) is nasty. What would be the sanest course of action if we were reasonably sure we have have case (2.2)? Maybe we can detect case (2) with a canary. I.e. artificially extend the buffer with an extra descriptor that has a poisoned buffer, and check if the value of that poisoned buffer is different than poison. I'm not sure it is worth the effort though in production. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/2] vhost/vsock: fix incorrect used length reported to the guest
On Mon, 22 Nov 2021 17:35:24 +0100 Stefano Garzarella wrote: > The "used length" reported by calling vhost_add_used() must be the > number of bytes written by the device (using "in" buffers). > > In vhost_vsock_handle_tx_kick() the device only reads the guest > buffers (they are all "out" buffers), without writing anything, > so we must pass 0 as "used length" to comply virtio spec. > > Fixes: 433fc58e6bf2 ("VSOCK: Introduce vhost_vsock.ko") > Cc: sta...@vger.kernel.org > Reported-by: Halil Pasic > Suggested-by: Jason Wang > Signed-off-by: Stefano Garzarella Reviewed-by: Halil Pasic > --- > drivers/vhost/vsock.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c > index 938aefbc75ec..4e3b95af7ee4 100644 > --- a/drivers/vhost/vsock.c > +++ b/drivers/vhost/vsock.c > @@ -554,7 +554,7 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work > *work) > virtio_transport_free_pkt(pkt); > > len += sizeof(pkt->hdr); > - vhost_add_used(vq, head, len); > + vhost_add_used(vq, head, 0); > total_len += len; > added = true; > } while(likely(!vhost_exceeds_weight(vq, ++pkts, total_len))); ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V5 1/4] virtio_ring: validate used buffer length
On Tue, 23 Nov 2021 07:17:05 -0500 "Michael S. Tsirkin" wrote: > On Mon, Nov 22, 2021 at 02:50:03PM +0100, Halil Pasic wrote: > > On Mon, 22 Nov 2021 14:25:26 +0800 > > Jason Wang wrote: > > > > > On Mon, Nov 22, 2021 at 1:49 PM Halil Pasic wrote: > > > > > > > > On Mon, 22 Nov 2021 06:35:18 +0100 > > > > Halil Pasic wrote: > > > > > > > > > > I think it should be a common issue, looking at > > > > > > vhost_vsock_handle_tx_kick(), it did: > > > > > > > > > > > > len += sizeof(pkt->hdr); > > > > > > vhost_add_used(vq, head, len); > > > > > > > > > > > > which looks like a violation of the spec since it's TX. > > > > > > > > > > I'm not sure the lines above look like a violation of the spec. If you > > > > > examine vhost_vsock_alloc_pkt() I believe that you will agree that: > > > > > len == pkt->len == pkt->hdr.len > > > > > which makes sense since according to the spec both tx and rx messages > > > > > are hdr+payload. And I believe hdr.len is the size of the payload, > > > > > although that does not seem to be properly documented by the spec. > > > > > > Sorry for being unclear, what I meant is that we probably should use > > > zero here. TX doesn't use in buffer actually. > > > > > > According to the spec, 0 should be the used length: > > > > > > "and len the total of bytes written into the buffer." > > > > Right, I was wrong. I somehow assumed this is the total length and not > > just the number of bytes written. > > > > > > > > > > > > > > > On the other hand tx messages are stated to be device read-only (in > > > > > the > > > > > spec) so if the device writes stuff, that is certainly wrong. > > > > > > > > > > > Yes. > > > > > > > > If that is what happens. > > > > > > > > > > Looking at virtqueue_get_buf_ctx_split() I'm not sure that is what > > > > > happens. My hypothesis is that we just a last descriptor is an 'in' > > > > > type descriptor (i.e. a device writable one). For tx that assumption > > > > > would be wrong. > > > > > > > > > > I will have another look at this today and send a fix patch if my > > > > > suspicion is confirmed. > > > > Yeah, I didn't remember the semantic of > > vq->split.vring.used->ring[last_used].len > > correctly, and in fact also how exactly the rings work. So your objection > > is correct. > > > > Maybe updating some stuff would make it easier to not make this mistake. > > > > For example the spec and also the linux header says: > > > > /* le32 is used here for ids for padding reasons. */ > > struct virtq_used_elem { > > /* Index of start of used descriptor chain. */ > > le32 id; > > /* Total length of the descriptor chain which was used (written to) > > */ > > le32 len; > > }; > > > > I think that comment isn't as clear as it could be. I would prefer: > > /* The number of bytes written into the device writable portion of the > > buffer described by the descriptor chain. */ > > > > I believe "the descriptor chain which was used" includes both the > > descriptors that map the device read only and the device write > > only portions of the buffer described by the descriptor chain. And the > > total length of that descriptor chain may be defined either as a number > > of the descriptors that form the chain, or the length of the buffer. > > > > One has to use the descriptor chain even if the whole buffer is device > > read only. So "used" == "written to" does not make any sense to me. > > The virtio spec actually says > > Total length of the descriptor chain which was written to > > without the "used" part. Yes, that is in the text after the (pseudo-)code listing which contains the description I was referring to (in 2.6.8 The Virtqueue Used Ring). > > > Also something like > > int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int > > bytes_written) > > instead of > > int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len) > > would make it easier to read the code correctly. > > I think we agree here. Patches? > Will send some :D Thanks! [..] ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V5 1/4] virtio_ring: validate used buffer length
On Mon, 22 Nov 2021 14:25:26 +0800 Jason Wang wrote: > I think the fixes are: > > 1) fixing the vhost vsock > 2) use suppress_used_validation=true to let vsock driver to validate > the in buffer length > 3) probably a new feature so the driver can only enable the validation > when the feature is enabled. I'm not sure, I would consider a F_DEV_Y_FIXED_BUG_X a perfectly good feature. Frankly the set of such bugs is device implementation specific and it makes little sense to specify a feature bit that says the device implementation claims to adhere to some aspect of the specification. Also what would be the semantic of not negotiating F_DEV_Y_FIXED_BUG_X? On the other hand I see no other way to keep the validation permanently enabled for fixed implementations, and get around the problem with broken implementations. So we could have something like VHOST_USED_LEN_STRICT. Maybe, we can also think of 'warn and don't alter behavior' instead of 'warn' and alter behavior. Or maybe even not having such checks on in production, but only when testing. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V5 1/4] virtio_ring: validate used buffer length
On Mon, 22 Nov 2021 12:08:22 +0100 Stefano Garzarella wrote: > On Mon, Nov 22, 2021 at 08:55:24AM +0100, Stefano Garzarella wrote: > >On Mon, Nov 22, 2021 at 02:25:26PM +0800, Jason Wang wrote: > >>On Mon, Nov 22, 2021 at 1:49 PM Halil Pasic wrote: > >>> > >>>On Mon, 22 Nov 2021 06:35:18 +0100 > >>>Halil Pasic wrote: > >>> > >>>> > I think it should be a common issue, looking at > >>>> > vhost_vsock_handle_tx_kick(), it did: > >>>> > > >>>> > len += sizeof(pkt->hdr); > >>>> > vhost_add_used(vq, head, len); > >>>> > > >>>> > which looks like a violation of the spec since it's TX. > >>>> > >>>> I'm not sure the lines above look like a violation of the spec. If you > >>>> examine vhost_vsock_alloc_pkt() I believe that you will agree that: > >>>> len == pkt->len == pkt->hdr.len > >>>> which makes sense since according to the spec both tx and rx messages > >>>> are hdr+payload. And I believe hdr.len is the size of the payload, > >>>> although that does not seem to be properly documented by the spec. > >> > >>Sorry for being unclear, what I meant is that we probably should use > >>zero here. TX doesn't use in buffer actually. > >> > >>According to the spec, 0 should be the used length: > >> > >>"and len the total of bytes written into the buffer." > >> > >>>> > >>>> On the other hand tx messages are stated to be device read-only (in the > >>>> spec) so if the device writes stuff, that is certainly wrong. > >>>> > >> > >>Yes. > >> > >>>> If that is what happens. > >>>> > >>>> Looking at virtqueue_get_buf_ctx_split() I'm not sure that is what > >>>> happens. My hypothesis is that we just a last descriptor is an 'in' > >>>> type descriptor (i.e. a device writable one). For tx that assumption > >>>> would be wrong. > >>>> > >>>> I will have another look at this today and send a fix patch if my > >>>> suspicion is confirmed. > >>> > >>>If my suspicion is right something like: > >>> > >>>diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > >>>index 00f64f2f8b72..efb57898920b 100644 > >>>--- a/drivers/virtio/virtio_ring.c > >>>+++ b/drivers/virtio/virtio_ring.c > >>>@@ -764,6 +764,7 @@ static void *virtqueue_get_buf_ctx_split(struct > >>>virtqueue *_vq, > >>>struct vring_virtqueue *vq = to_vvq(_vq); > >>>void *ret; > >>>unsigned int i; > >>>+ bool has_in; > >>>u16 last_used; > >>> > >>>START_USE(vq); > >>>@@ -787,6 +788,9 @@ static void *virtqueue_get_buf_ctx_split(struct > >>>virtqueue *_vq, > >>>vq->split.vring.used->ring[last_used].id); > >>>*len = virtio32_to_cpu(_vq->vdev, > >>>vq->split.vring.used->ring[last_used].len); > >>>+ has_in = virtio16_to_cpu(_vq->vdev, > >>>+ vq->split.vring.used->ring[last_used].flags) > >>>+ & VRING_DESC_F_WRITE; > >> > >>Did you mean vring.desc actually? If yes, it's better not depend on > >>the descriptor ring which can be modified by the device. We've stored > >>the flags in desc_extra[]. > >> > >>> > >>>if (unlikely(i >= vq->split.vring.num)) { > >>>BAD_RING(vq, "id %u out of range\n", i); > >>>@@ -796,7 +800,7 @@ static void *virtqueue_get_buf_ctx_split(struct > >>>virtqueue *_vq, > >>>BAD_RING(vq, "id %u is not a head!\n", i); > >>>return NULL; > >>>} > >>>- if (vq->buflen && unlikely(*len > vq->buflen[i])) { > >>>+ if (has_in && q->buflen && unlikely(*len > vq->buflen[i])) { > >>>BAD_RING(vq, "used len %d is larger than in buflen %u\n", > >>>*len, vq->buflen[i]); > >>>return NULL; > >>> > >>>would fix the problem for split. I will
Re: [PATCH V5 1/4] virtio_ring: validate used buffer length
On Mon, 22 Nov 2021 14:25:26 +0800 Jason Wang wrote: > On Mon, Nov 22, 2021 at 1:49 PM Halil Pasic wrote: > > > > On Mon, 22 Nov 2021 06:35:18 +0100 > > Halil Pasic wrote: > > > > > > I think it should be a common issue, looking at > > > > vhost_vsock_handle_tx_kick(), it did: > > > > > > > > len += sizeof(pkt->hdr); > > > > vhost_add_used(vq, head, len); > > > > > > > > which looks like a violation of the spec since it's TX. > > > > > > I'm not sure the lines above look like a violation of the spec. If you > > > examine vhost_vsock_alloc_pkt() I believe that you will agree that: > > > len == pkt->len == pkt->hdr.len > > > which makes sense since according to the spec both tx and rx messages > > > are hdr+payload. And I believe hdr.len is the size of the payload, > > > although that does not seem to be properly documented by the spec. > > Sorry for being unclear, what I meant is that we probably should use > zero here. TX doesn't use in buffer actually. > > According to the spec, 0 should be the used length: > > "and len the total of bytes written into the buffer." Right, I was wrong. I somehow assumed this is the total length and not just the number of bytes written. > > > > > > > On the other hand tx messages are stated to be device read-only (in the > > > spec) so if the device writes stuff, that is certainly wrong. > > > > > Yes. > > > > If that is what happens. > > > > > > Looking at virtqueue_get_buf_ctx_split() I'm not sure that is what > > > happens. My hypothesis is that we just a last descriptor is an 'in' > > > type descriptor (i.e. a device writable one). For tx that assumption > > > would be wrong. > > > > > > I will have another look at this today and send a fix patch if my > > > suspicion is confirmed. Yeah, I didn't remember the semantic of vq->split.vring.used->ring[last_used].len correctly, and in fact also how exactly the rings work. So your objection is correct. Maybe updating some stuff would make it easier to not make this mistake. For example the spec and also the linux header says: /* le32 is used here for ids for padding reasons. */ struct virtq_used_elem { /* Index of start of used descriptor chain. */ le32 id; /* Total length of the descriptor chain which was used (written to) */ le32 len; }; I think that comment isn't as clear as it could be. I would prefer: /* The number of bytes written into the device writable portion of the buffer described by the descriptor chain. */ I believe "the descriptor chain which was used" includes both the descriptors that map the device read only and the device write only portions of the buffer described by the descriptor chain. And the total length of that descriptor chain may be defined either as a number of the descriptors that form the chain, or the length of the buffer. One has to use the descriptor chain even if the whole buffer is device read only. So "used" == "written to" does not make any sense to me. Also something like int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int bytes_written) instead of int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len) would make it easier to read the code correctly. > > > > If my suspicion is right something like: > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > > index 00f64f2f8b72..efb57898920b 100644 > > --- a/drivers/virtio/virtio_ring.c > > +++ b/drivers/virtio/virtio_ring.c > > @@ -764,6 +764,7 @@ static void *virtqueue_get_buf_ctx_split(struct > > virtqueue *_vq, > > struct vring_virtqueue *vq = to_vvq(_vq); > > void *ret; > > unsigned int i; > > + bool has_in; > > u16 last_used; > > > > START_USE(vq); > > @@ -787,6 +788,9 @@ static void *virtqueue_get_buf_ctx_split(struct > > virtqueue *_vq, > > vq->split.vring.used->ring[last_used].id); > > *len = virtio32_to_cpu(_vq->vdev, > > vq->split.vring.used->ring[last_used].len); > > + has_in = virtio16_to_cpu(_vq->vdev, > > + vq->split.vring.used->ring[last_used].flags) > > + & VRING_DESC_F_WRITE; > > Did you mean vring.desc actually? If yes, it's better not depend on > the descriptor ring which can be modified by the device. We've stored > the flags in desc_extra[]. > > > > &g
Re: [PATCH V5 1/4] virtio_ring: validate used buffer length
On Mon, 22 Nov 2021 06:35:18 +0100 Halil Pasic wrote: > > I think it should be a common issue, looking at > > vhost_vsock_handle_tx_kick(), it did: > > > > len += sizeof(pkt->hdr); > > vhost_add_used(vq, head, len); > > > > which looks like a violation of the spec since it's TX. > > I'm not sure the lines above look like a violation of the spec. If you > examine vhost_vsock_alloc_pkt() I believe that you will agree that: > len == pkt->len == pkt->hdr.len > which makes sense since according to the spec both tx and rx messages > are hdr+payload. And I believe hdr.len is the size of the payload, > although that does not seem to be properly documented by the spec. > > On the other hand tx messages are stated to be device read-only (in the > spec) so if the device writes stuff, that is certainly wrong. > > If that is what happens. > > Looking at virtqueue_get_buf_ctx_split() I'm not sure that is what > happens. My hypothesis is that we just a last descriptor is an 'in' > type descriptor (i.e. a device writable one). For tx that assumption > would be wrong. > > I will have another look at this today and send a fix patch if my > suspicion is confirmed. If my suspicion is right something like: diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 00f64f2f8b72..efb57898920b 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -764,6 +764,7 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq, struct vring_virtqueue *vq = to_vvq(_vq); void *ret; unsigned int i; + bool has_in; u16 last_used; START_USE(vq); @@ -787,6 +788,9 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq, vq->split.vring.used->ring[last_used].id); *len = virtio32_to_cpu(_vq->vdev, vq->split.vring.used->ring[last_used].len); + has_in = virtio16_to_cpu(_vq->vdev, + vq->split.vring.used->ring[last_used].flags) + & VRING_DESC_F_WRITE; if (unlikely(i >= vq->split.vring.num)) { BAD_RING(vq, "id %u out of range\n", i); @@ -796,7 +800,7 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq, BAD_RING(vq, "id %u is not a head!\n", i); return NULL; } - if (vq->buflen && unlikely(*len > vq->buflen[i])) { + if (has_in && q->buflen && unlikely(*len > vq->buflen[i])) { BAD_RING(vq, "used len %d is larger than in buflen %u\n", *len, vq->buflen[i]); return NULL; would fix the problem for split. I will try that out and let you know later. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V5 1/4] virtio_ring: validate used buffer length
On Mon, 22 Nov 2021 11:51:09 +0800 Jason Wang wrote: > On Fri, Nov 19, 2021 at 11:10 PM Halil Pasic wrote: > > > > On Wed, 27 Oct 2021 10:21:04 +0800 > > Jason Wang wrote: > > > > > This patch validate the used buffer length provided by the device > > > before trying to use it. This is done by record the in buffer length > > > in a new field in desc_state structure during virtqueue_add(), then we > > > can fail the virtqueue_get_buf() when we find the device is trying to > > > give us a used buffer length which is greater than the in buffer > > > length. > > > > > > Since some drivers have already done the validation by themselves, > > > this patch tries to makes the core validation optional. For the driver > > > that doesn't want the validation, it can set the > > > suppress_used_validation to be true (which could be overridden by > > > force_used_validation module parameter). To be more efficient, a > > > dedicate array is used for storing the validate used length, this > > > helps to eliminate the cache stress if validation is done by the > > > driver. > > > > > > Signed-off-by: Jason Wang > > > > Hi Jason! > > > > Our CI has detected, that virtio-vsock became unusable with this > > patch on s390x. I didn't test on x86 yet. The guest kernel says > > something like: > > vmw_vsock_virtio_transport virtio1: tx: used len 44 is larger than in > > buflen 0 > > > > Did you, or anybody else, see something like this on platforms other that > > s390x? > > Adding Stefan and Stefano. > > I think it should be a common issue, looking at > vhost_vsock_handle_tx_kick(), it did: > > len += sizeof(pkt->hdr); > vhost_add_used(vq, head, len); > > which looks like a violation of the spec since it's TX. I'm not sure the lines above look like a violation of the spec. If you examine vhost_vsock_alloc_pkt() I believe that you will agree that: len == pkt->len == pkt->hdr.len which makes sense since according to the spec both tx and rx messages are hdr+payload. And I believe hdr.len is the size of the payload, although that does not seem to be properly documented by the spec. On the other hand tx messages are stated to be device read-only (in the spec) so if the device writes stuff, that is certainly wrong. If that is what happens. Looking at virtqueue_get_buf_ctx_split() I'm not sure that is what happens. My hypothesis is that we just a last descriptor is an 'in' type descriptor (i.e. a device writable one). For tx that assumption would be wrong. I will have another look at this today and send a fix patch if my suspicion is confirmed. > > > > > I had a quick look at this code, and I speculate that it probably > > uncovers a pre-existig bug, rather than introducing a new one. > > I agree. > :) I'm not so sure any more myself. > > > > If somebody is already working on this please reach out to me. > > AFAIK, no. Thanks for the info! Then I will dig a little deeper. I asked in order to avoid doing the debugging and fixing just to see that somebody was faster :D > I think the plan is to fix both the device and drive side > (but I'm not sure we need a new feature for this if we stick to the > validation). > > Thanks > Thank you! Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V5 1/4] virtio_ring: validate used buffer length
On Wed, 27 Oct 2021 10:21:04 +0800 Jason Wang wrote: > This patch validate the used buffer length provided by the device > before trying to use it. This is done by record the in buffer length > in a new field in desc_state structure during virtqueue_add(), then we > can fail the virtqueue_get_buf() when we find the device is trying to > give us a used buffer length which is greater than the in buffer > length. > > Since some drivers have already done the validation by themselves, > this patch tries to makes the core validation optional. For the driver > that doesn't want the validation, it can set the > suppress_used_validation to be true (which could be overridden by > force_used_validation module parameter). To be more efficient, a > dedicate array is used for storing the validate used length, this > helps to eliminate the cache stress if validation is done by the > driver. > > Signed-off-by: Jason Wang Hi Jason! Our CI has detected, that virtio-vsock became unusable with this patch on s390x. I didn't test on x86 yet. The guest kernel says something like: vmw_vsock_virtio_transport virtio1: tx: used len 44 is larger than in buflen 0 Did you, or anybody else, see something like this on platforms other that s390x? I had a quick look at this code, and I speculate that it probably uncovers a pre-existig bug, rather than introducing a new one. If somebody is already working on this please reach out to me. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 1/1] virtio: write back F_VERSION_1 before validate
On Wed, 13 Oct 2021 08:24:53 -0400 "Michael S. Tsirkin" wrote: > > > OK this looks good! How about a QEMU patch to make it spec compliant on > > > BE? > > > > Who is going to do that? Halil? you? Conny? > > Halil said he'll do it... Right, Halil? I can do it but not right away. Maybe in a couple of weeks. I have some other bugs to hunt down, before proceeding to this. If somebody else wants to do it, I'm fine with that as well. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v3 1/1] virtio: write back F_VERSION_1 before validate
The virtio specification virtio-v1.1-cs01 states: "Transitional devices MUST detect Legacy drivers by detecting that VIRTIO_F_VERSION_1 has not been acknowledged by the driver." This is exactly what QEMU as of 6.1 has done relying solely on VIRTIO_F_VERSION_1 for detecting that. However, the specification also says: "... the driver MAY read (but MUST NOT write) the device-specific configuration fields to check that it can support the device ..." before setting FEATURES_OK. In that case, any transitional device relying solely on VIRTIO_F_VERSION_1 for detecting legacy drivers will return data in legacy format. In particular, this implies that it is in big endian format for big endian guests. This naturally confuses the driver which expects little endian in the modern mode. It is probably a good idea to amend the spec to clarify that VIRTIO_F_VERSION_1 can only be relied on after the feature negotiation is complete. Before validate callback existed, config space was only read after FEATURES_OK. However, we already have two regressions, so let's address this here as well. The regressions affect the VIRTIO_NET_F_MTU feature of virtio-net and the VIRTIO_BLK_F_BLK_SIZE feature of virtio-blk for BE guests when virtio 1.0 is used on both sides. The latter renders virtio-blk unusable with DASD backing, because things simply don't work with the default. See Fixes tags for relevant commits. For QEMU, we can work around the issue by writing out the feature bits with VIRTIO_F_VERSION_1 bit set. We (ab)use the finalize_features config op for this. This isn't enough to address all vhost devices since these do not get the features until FEATURES_OK, however it looks like the affected devices actually never handled the endianness for legacy mode correctly, so at least that's not a regression. No devices except virtio net and virtio blk seem to be affected. Long term the right thing to do is to fix the hypervisors. Cc: #v4.11 Signed-off-by: Halil Pasic Fixes: 82e89ea077b9 ("virtio-blk: Add validation for block size in config space") Fixes: fe36cbe0671e ("virtio_net: clear MTU when out of range") Reported-by: mark...@us.ibm.com Reviewed-by: Cornelia Huck --- @Connie: I made some more commit message changes to accommodate Michael's requests. I just assumed these will work or you as well and kept your r-b. Please shout at me if it needs to be dropped :) --- drivers/virtio/virtio.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c index 0a5b54034d4b..236081afe9a2 100644 --- a/drivers/virtio/virtio.c +++ b/drivers/virtio/virtio.c @@ -239,6 +239,17 @@ static int virtio_dev_probe(struct device *_d) driver_features_legacy = driver_features; } + /* +* Some devices detect legacy solely via F_VERSION_1. Write +* F_VERSION_1 to force LE config space accesses before FEATURES_OK for +* these when needed. +*/ + if (drv->validate && !virtio_legacy_is_little_endian() + && device_features & BIT_ULL(VIRTIO_F_VERSION_1)) { + dev->features = BIT_ULL(VIRTIO_F_VERSION_1); + dev->config->finalize_features(dev); + } + if (device_features & (1ULL << VIRTIO_F_VERSION_1)) dev->features = driver_features & device_features; else base-commit: 60a9483534ed0d99090a2ee1d4bb0b8179195f51 -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 1/1] virtio: write back F_VERSION_1 before validate
On Fri, 8 Oct 2021 09:05:03 -0400 "Michael S. Tsirkin" wrote: > On Fri, Oct 08, 2021 at 02:34:22PM +0200, Halil Pasic wrote: > > The virtio specification virtio-v1.1-cs01 states: "Transitional devices > > MUST detect Legacy drivers by detecting that VIRTIO_F_VERSION_1 has not > > been acknowledged by the driver." This is exactly what QEMU as of 6.1 > > has done relying solely on VIRTIO_F_VERSION_1 for detecting that. > > > > However, the specification also says: "... the driver MAY read (but MUST > > NOT write) the device-specific configuration fields to check that it can > > support the device ..." before setting FEATURES_OK. > > > > In that case, any transitional device relying solely on > > VIRTIO_F_VERSION_1 for detecting legacy drivers will return data in > > legacy format. In particular, this implies that it is in big endian > > format for big endian guests. This naturally confuses the driver which > > expects little endian in the modern mode. > > > > It is probably a good idea to amend the spec to clarify that > > VIRTIO_F_VERSION_1 can only be relied on after the feature negotiation > > is complete. However, we already have a regression so let's try to address > > actually, regressions. and we can add > "since originally before validate callback existed > config space was only read after > FEATURES_OK. See Fixes tags for relevant commits" > > > it. How about replacing the paragraph above with the following? "It is probably a good idea to amend the spec to clarify that VIRTIO_F_VERSION_1 can only be relied on after the feature negotiation is complete. Before validate callback existed, config space was only read after FEATURES_OK. However, we already have two regression, so let's address this here as well." > > > > The regressions affect the VIRTIO_NET_F_MTU feature of virtio-net and > > the VIRTIO_BLK_F_BLK_SIZE feature of virtio-blk for BE guests when > > virtio 1.0 is used on both sides. The latter renders virtio-blk > > unusable with DASD backing, because things simply don't work with > > the default. and add "See Fixes tags for relevant commits." here. > > Let's add a work around description now: > > > For QEMU, we can work around the issue by writing out the features > register with VIRTIO_F_VERSION_1 bit set. We (ab) use the s/features register/feature bits/ rationale: ccw does not have a features register, and qemu does not really act as if its behavior was controlled by the values in a features register. I.e. when we read the register we see VIRTIO_F_VERSION_! because the feature is offered. In QEMU we basically read host_featues but write the guest_features. And what drives device behavior is mostly guest_features. s/(ab) use/(ab)use/ > finalize_features config op for this. It's not enough to address vhost s/It's/This is/ > user and vhost block devices since these do not get the features until s/vhost user and vhost block/some vhost-user and vhost-vdpa/ ? Ratioale: I think vhost block is just a vhost-user device. On the other hand vhost-user-fs works like charm because the config space is implemented in qemu and not in the vhost-user device. I didn't check vhost_net. I'm not even sure qemu offers a vhost_net implementation. Anyway I wouldn't like to make any false statements here. > FEATURES_OK, however it looks like these two actually never handled the > endian-ness for legacy mode correctly, so at least that's not a > regression. > > No devices except virtio net and virtio blk seem to be affected. > > Long term the right thing to do is to fix the hypervisors. > Sounds good. Thanks! Are you OK with my changes proposed to your changes? Regards, Halil > > > > > Cc: #v4.11 > > Signed-off-by: Halil Pasic > > Fixes: 82e89ea077b9 ("virtio-blk: Add validation for block size in > > config space") Fixes: fe36cbe0671e ("virtio_net: clear MTU when out > > of range") Reported-by: mark...@us.ibm.com > > --- > > drivers/virtio/virtio.c | 11 +++ > > 1 file changed, 11 insertions(+) > > > > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c > > index 0a5b54034d4b..236081afe9a2 100644 > > --- a/drivers/virtio/virtio.c > > +++ b/drivers/virtio/virtio.c > > @@ -239,6 +239,17 @@ static int virtio_dev_probe(struct device *_d) > > driver_features_legacy = driver_features; > > } > > > > + /* > > +* Some devices detect legacy solely via F_VERSION_1. Write > > +* F_VERSION_1 to force LE config space accesses before > > FEATURES_OK for > > +* these when needed. > &g
[PATCH v2 1/1] virtio: write back F_VERSION_1 before validate
The virtio specification virtio-v1.1-cs01 states: "Transitional devices MUST detect Legacy drivers by detecting that VIRTIO_F_VERSION_1 has not been acknowledged by the driver." This is exactly what QEMU as of 6.1 has done relying solely on VIRTIO_F_VERSION_1 for detecting that. However, the specification also says: "... the driver MAY read (but MUST NOT write) the device-specific configuration fields to check that it can support the device ..." before setting FEATURES_OK. In that case, any transitional device relying solely on VIRTIO_F_VERSION_1 for detecting legacy drivers will return data in legacy format. In particular, this implies that it is in big endian format for big endian guests. This naturally confuses the driver which expects little endian in the modern mode. It is probably a good idea to amend the spec to clarify that VIRTIO_F_VERSION_1 can only be relied on after the feature negotiation is complete. However, we already have a regression so let's try to address it. The regressions affect the VIRTIO_NET_F_MTU feature of virtio-net and the VIRTIO_BLK_F_BLK_SIZE feature of virtio-blk for BE guests when virtio 1.0 is used on both sides. The latter renders virtio-blk unusable with DASD backing, because things simply don't work with the default. Cc: #v4.11 Signed-off-by: Halil Pasic Fixes: 82e89ea077b9 ("virtio-blk: Add validation for block size in config space") Fixes: fe36cbe0671e ("virtio_net: clear MTU when out of range") Reported-by: mark...@us.ibm.com --- drivers/virtio/virtio.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c index 0a5b54034d4b..236081afe9a2 100644 --- a/drivers/virtio/virtio.c +++ b/drivers/virtio/virtio.c @@ -239,6 +239,17 @@ static int virtio_dev_probe(struct device *_d) driver_features_legacy = driver_features; } + /* +* Some devices detect legacy solely via F_VERSION_1. Write +* F_VERSION_1 to force LE config space accesses before FEATURES_OK for +* these when needed. +*/ + if (drv->validate && !virtio_legacy_is_little_endian() + && device_features & BIT_ULL(VIRTIO_F_VERSION_1)) { + dev->features = BIT_ULL(VIRTIO_F_VERSION_1); + dev->config->finalize_features(dev); + } + if (device_features & (1ULL << VIRTIO_F_VERSION_1)) dev->features = driver_features & device_features; else base-commit: 60a9483534ed0d99090a2ee1d4bb0b8179195f51 -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/1] virtio: write back F_VERSION_1 before validate
On Thu, 07 Oct 2021 17:25:52 +0200 Cornelia Huck wrote: > On Thu, Oct 07 2021, Halil Pasic wrote: > > > On Thu, 07 Oct 2021 13:52:24 +0200 > > Cornelia Huck wrote: > > > >> On Wed, Oct 06 2021, Halil Pasic wrote: > >> > >> > The virtio specification virtio-v1.1-cs01 states: "Transitional devices > >> > MUST detect Legacy drivers by detecting that VIRTIO_F_VERSION_1 has not > >> > been acknowledged by the driver." This is exactly what QEMU as of 6.1 > >> > has done relying solely on VIRTIO_F_VERSION_1 for detecting that. > >> > > >> > However, the specification also says: "... driver MAY read (but MUST NOT > >> > write) the device-specific configuration fields to check that it can > >> > support the device ..." before setting FEATURES_OK. > >> > >> Suggest to put the citations from the spec into quotes, so that they are > >> distinguishable from the rest of the text. > > > > For the record: I basically took Michael's description, the one which you > > said you prefer, with some minor changes. > > Well I did look at what the text said, not the details in the formatting... > > > > > This is one of the changes, which renders this a paraphrase and not a > > quote. Michael didn't use quotation marks so I was not sure it is was > > a word by word quote anyway. It was. But the spec depends on "During this > > step" which does not make any sense without the context. That is why I made > > the end of step explicit. > > I still think that would be nicer while using some quotation marks, even > if you are just doing a partial quote. > > In the first paragraph, however, we really should mark the quote > properly. It gave me a stop when I first read it. I've added in some quotation marks and ellipsis marks. Does that look good for you? > > > > > I think we are fine without quotation marks. Those who care can read the > > spec. > > > >> > >> > > >> > In that case, any transitional device relying solely on > >> > VIRTIO_F_VERSION_1 for detecting legacy drivers will return data in > >> > legacy format. In particular, this implies that it is in big endian > >> > format for big endian guests. This naturally confuses the driver which > >> > expects little endian in the modern mode. > >> > > >> > It is probably a good idea to amend the spec to clarify that > >> > VIRTIO_F_VERSION_1 can only be relied on after the feature negotiation > >> > is complete. However, we already have regression so let's try to address > >> > > >> > >> s/regression/a regression/ > >> > > > > Yes. Was like this in the original. Will change > > > >> > it. > >> > >> Maybe mention what the regression is? > > > > How about the following? > > > > The regressions affect the VIRTIO_NET_F_MTU feature of virtio-net and the > > VIRTIO_BLK_F_BLK_SIZE feature of virtio-blk for BE guests when virtio > > 1.0 is used on both sides. The latter renders virtio-blk unusable with > > DASD backing, because things simply don't work with the default. > > Sounds good to me. Will add it to the end. > > > > >> > >> Also mention that we use this workaround for modern on BE only? > > > > We have that already, don't we. The sentence that starts with "In > > particular". The regression description should reinforce that > > sufficiently IMHO. > > No strong opinion here. Anyone else? > > > > >> > >> > > >> > Signed-off-by: Halil Pasic > >> > Fixes: 82e89ea077b9 ("virtio-blk: Add validation for block size in > >> > config space") > >> > Fixes: fe36cbe0671e ("virtio_net: clear MTU when out of range") > >> > Reported-by: mark...@us.ibm.com > >> > --- > >> > drivers/virtio/virtio.c | 10 ++ > >> > 1 file changed, 10 insertions(+) > >> > > >> > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c > >> > index 0a5b54034d4b..494cfecd3376 100644 > >> > --- a/drivers/virtio/virtio.c > >> > +++ b/drivers/virtio/virtio.c > >> > @@ -239,6 +239,16 @@ static int virtio_dev_probe(struct device *_d) > >> > driver_features_legacy = driver_features; > >> > } > >> > > >> &g
Re: [PATCH 1/1] virtio: write back F_VERSION_1 before validate
On Thu, 07 Oct 2021 13:52:24 +0200 Cornelia Huck wrote: > On Wed, Oct 06 2021, Halil Pasic wrote: > > > The virtio specification virtio-v1.1-cs01 states: Transitional devices > > MUST detect Legacy drivers by detecting that VIRTIO_F_VERSION_1 has not > > been acknowledged by the driver. This is exactly what QEMU as of 6.1 > > has done relying solely on VIRTIO_F_VERSION_1 for detecting that. > > > > However, the specification also says: driver MAY read (but MUST NOT > > write) the device-specific configuration fields to check that it can > > support the device before setting FEATURES_OK. > > Suggest to put the citations from the spec into quotes, so that they are > distinguishable from the rest of the text. For the record: I basically took Michael's description, the one which you said you prefer, with some minor changes. This is one of the changes, which renders this a paraphrase and not a quote. Michael didn't use quotation marks so I was not sure it is was a word by word quote anyway. It was. But the spec depends on "During this step" which does not make any sense without the context. That is why I made the end of step explicit. I think we are fine without quotation marks. Those who care can read the spec. > > > > > In that case, any transitional device relying solely on > > VIRTIO_F_VERSION_1 for detecting legacy drivers will return data in > > legacy format. In particular, this implies that it is in big endian > > format for big endian guests. This naturally confuses the driver which > > expects little endian in the modern mode. > > > > It is probably a good idea to amend the spec to clarify that > > VIRTIO_F_VERSION_1 can only be relied on after the feature negotiation > > is complete. However, we already have regression so let's try to address > > s/regression/a regression/ > Yes. Was like this in the original. Will change > > it. > > Maybe mention what the regression is? How about the following? The regressions affect the VIRTIO_NET_F_MTU feature of virtio-net and the VIRTIO_BLK_F_BLK_SIZE feature of virtio-blk for BE guests when virtio 1.0 is used on both sides. The latter renders virtio-blk unusable with DASD backing, because things simply don't work with the default. > > Also mention that we use this workaround for modern on BE only? We have that already, don't we. The sentence that starts with "In particular". The regression description should reinforce that sufficiently IMHO. > > > > > Signed-off-by: Halil Pasic > > Fixes: 82e89ea077b9 ("virtio-blk: Add validation for block size in config > > space") > > Fixes: fe36cbe0671e ("virtio_net: clear MTU when out of range") > > Reported-by: mark...@us.ibm.com > > --- > > drivers/virtio/virtio.c | 10 ++ > > 1 file changed, 10 insertions(+) > > > > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c > > index 0a5b54034d4b..494cfecd3376 100644 > > --- a/drivers/virtio/virtio.c > > +++ b/drivers/virtio/virtio.c > > @@ -239,6 +239,16 @@ static int virtio_dev_probe(struct device *_d) > > driver_features_legacy = driver_features; > > } > > > > + /* > > +* Some devices detect legacy solely via F_VERSION_1. Write > > +* F_VERSION_1 to force LE for these when needed. > > "...to force LE config space accesses before FEATURES_OK for these when > needed (BE)." > > ? Can do, but I would rather omit the (BE) at the end. All the conditions are necessary: * have validate callback * device offered VERSION_1 * virtio legacy is be > > > +*/ > > + if (drv->validate && !virtio_legacy_is_little_endian() > > + && BIT_ULL(VIRTIO_F_VERSION_1) & device_features) { > > Nit: putting device_features first would read more naturally to me. > Can do. > > + dev->features = BIT_ULL(VIRTIO_F_VERSION_1); > > + dev->config->finalize_features(dev); > > + } > > + > > if (device_features & (1ULL << VIRTIO_F_VERSION_1)) > > dev->features = driver_features & device_features; > > else > > Patch LGTM. > > Thanks for having a look. If you are fine with the proposed solution please tell me, so I can send out a v2. If not let us work towards an acceptable solution. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 1/1] virtio: write back F_VERSION_1 before validate
The virtio specification virtio-v1.1-cs01 states: Transitional devices MUST detect Legacy drivers by detecting that VIRTIO_F_VERSION_1 has not been acknowledged by the driver. This is exactly what QEMU as of 6.1 has done relying solely on VIRTIO_F_VERSION_1 for detecting that. However, the specification also says: driver MAY read (but MUST NOT write) the device-specific configuration fields to check that it can support the device before setting FEATURES_OK. In that case, any transitional device relying solely on VIRTIO_F_VERSION_1 for detecting legacy drivers will return data in legacy format. In particular, this implies that it is in big endian format for big endian guests. This naturally confuses the driver which expects little endian in the modern mode. It is probably a good idea to amend the spec to clarify that VIRTIO_F_VERSION_1 can only be relied on after the feature negotiation is complete. However, we already have regression so let's try to address it. Signed-off-by: Halil Pasic Fixes: 82e89ea077b9 ("virtio-blk: Add validation for block size in config space") Fixes: fe36cbe0671e ("virtio_net: clear MTU when out of range") Reported-by: mark...@us.ibm.com --- drivers/virtio/virtio.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c index 0a5b54034d4b..494cfecd3376 100644 --- a/drivers/virtio/virtio.c +++ b/drivers/virtio/virtio.c @@ -239,6 +239,16 @@ static int virtio_dev_probe(struct device *_d) driver_features_legacy = driver_features; } + /* +* Some devices detect legacy solely via F_VERSION_1. Write +* F_VERSION_1 to force LE for these when needed. +*/ + if (drv->validate && !virtio_legacy_is_little_endian() + && BIT_ULL(VIRTIO_F_VERSION_1) & device_features) { + dev->features = BIT_ULL(VIRTIO_F_VERSION_1); + dev->config->finalize_features(dev); + } + if (device_features & (1ULL << VIRTIO_F_VERSION_1)) dev->features = driver_features & device_features; else base-commit: 60a9483534ed0d99090a2ee1d4bb0b8179195f51 -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH 1/1] virtio: write back features before verify
On Tue, 05 Oct 2021 13:13:31 +0200 Cornelia Huck wrote: > On Tue, Oct 05 2021, Halil Pasic wrote: > > > On Mon, 4 Oct 2021 05:07:13 -0400 > > "Michael S. Tsirkin" wrote: > >> Well we established that we can know. Here's an alternative explanation: > > > > > > I thin we established how this should be in the future, where a transport > > specific mechanism is used to decide are we operating in legacy mode or > > in modern mode. But with the current QEMU reality, I don't think so. > > Namely currently the switch native-endian config -> little endian config > > happens when the VERSION_1 is negotiated, which may happen whenever > > the VERSION_1 bit is changed, or only when FEATURES_OK is set > > (vhost-user). > > > > This is consistent with device should detect a legacy driver by checking > > for VERSION_1, which is what the spec currently says. > > > > So for transitional we start out with native-endian config. For modern > > only the config is always LE. > > > > The guest can distinguish between a legacy only device and a modern > > capable device after the revision negotiation. A legacy device would > > reject the CCW. > > > > But both a transitional device and a modern only device would accept > > a revision > 0. So the guest does not know for ccw. > > Well, for pci I think the driver knows that it is using either legacy or > modern, no? It is mighty complicated. virtio-blk-pci-non-transitional and virtio-net-pci-non-transitional will give you BE, but virtio-crypto-pci, which is also non-transitional will get you LE, before VERSION_1 is set (becausevirtio-crypto uses stl_le_p()). That is fact. The deal is that virtio-blk and virtion-net was written with transitional in mind, and config code is the same for transitional and non-transitional. That is how things are now. With the QEMU changes things will be simpler. > > And for ccw, the driver knows at that point in time which revision it > negotiated, so it should know that a revision > 0 will use LE (and the > device will obviously know that as well.) With the future changes in QEMU, yes. Without these changes no. Without these changes we get BE when the guest code things it is going to get LE. That is what causes the regression. The commit message for this patch is written from the perspective of right now, and not from the perspective of future changes. Or can you hack up a guest patch that looks at the revision, figures out what endiannes is the early config access in, and does the right thing? I don't think so. I tried to explain why that is impossible. Because that would be preferable to messing with the the device and introducing another exit. > > Or am I misunderstanding what you're getting at? > Probably. I'm talking about pre- "do transport specific legacy detection in the device instead of looking at VERSION_1" you are probably talking about the post-state. If we had this new behavior for all relevant hypervisors then we wouldn't need to do a thing in the guest. The current code would work like charm. Does that answer your question? Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [virtio-dev] Re: [RFC PATCH 1/1] virtio: write back features before verify
On Mon, 4 Oct 2021 16:01:12 -0400 "Michael S. Tsirkin" wrote: > > > > Ok, so what about something like > > > > "If FEATURES_OK is not set, the driver MAY change the set of features it > > accepts." > > > > in the device initialization section? > > Maybe "as long as". However Halil implied that some features are not > turned off properly if that happens. Halil could you pls provide > some examples? static void virtio_net_set_features(VirtIODevice *vdev, uint64_t features) { ... if (virtio_has_feature(features, VIRTIO_NET_F_STANDBY)) { qapi_event_send_failover_negotiated(n->netclient_name); qatomic_set(>failover_primary_hidden, false); failover_add_primary(n, ); if (err) { warn_report_err(err); } } } This is probably the only one in QEMU. Back then I stopped looking after the first hit. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH 1/1] virtio: write back features before verify
On Tue, 5 Oct 2021 03:53:17 -0400 "Michael S. Tsirkin" wrote: > > Wouldn't a call from transport code into virtio core > > be more handy? What I have in mind is stuff like vhost-user and vdpa. My > > understanding is, that for vhost setups where the config is outside qemu, > > we probably need a new command that tells the vhost backend what > > endiannes to use for config. I don't think we can use > > VHOST_USER_SET_VRING_ENDIAN because that one is on a virtqueue basis > > according to the doc. So for vhost-user and similar we would fire that > > command and probably also set the filed, while for devices for which > > control plane is handled by QEMU we would just set the field. > > > > Does that sound about right? > > I'm fine either way, but when would you invoke this? > With my idea backends can check the field when get_config > is invoked. > > As for using this in VHOST, can we maybe re-use SET_FEATURES? > > Kind of hacky but nice in that it will actually make existing backends > work... Basically the equivalent of this patch, just on the vhost interface, right? Could work I have to look into it :) ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH 1/1] virtio: write back features before verify
On Mon, 4 Oct 2021 05:07:13 -0400 "Michael S. Tsirkin" wrote: > On Mon, Oct 04, 2021 at 04:23:23AM +0200, Halil Pasic wrote: > > On Sat, 2 Oct 2021 14:13:37 -0400 > > "Michael S. Tsirkin" wrote: > > > > > > Anyone else have an idea? This is a nasty regression; we could revert > > > > the > > > > patch, which would remove the symptoms and give us some time, but that > > > > doesn't really feel right, I'd do that only as a last resort. > > > > > > Well we have Halil's hack (except I would limit it > > > to only apply to BE, only do devices with validate, > > > and only in modern mode), and we will fix QEMU to be spec compliant. > > > Between these why do we need any conditional compiles? > > > > We don't. As I stated before, this hack is flawed because it > > effectively breaks fencing features by the driver with QEMU. Some > > features can not be unset after once set, because we tend to try to > > enable the corresponding functionality whenever we see a write > > features operation with the feature bit set, and we don't disable, if a > > subsequent features write operation stores the feature bit as not set. > > Something to fix in QEMU too, I think. Possibly. But it is the same situation: it probably has a long history. And it may even make some sense. The obvious trigger for doing the conditional initialization for modern is the setting of FEATURES_OK. The problem is, legacy doesn't do FEATURES_OK. So we would need a different trigger. > > > But it looks like VIRTIO_1 is fine to get cleared afterwards. > > We'd never clear it though - why would we? > Right. > > So my hack > > should actually look like posted below, modulo conditions. > > > Looking at it some more, I see that vhost-user actually > does not send features to the backend until FEATURES_OK. I.e. the hack does not work for transitional vhost-user devices, but it doesn't break them either. Furthermore, I believe there is not much we can do to support transitional devices with vhost-user and similar, without extending the protocol. The transport specific detection idea would need a new vhost-user thingy to tell the device what has been figured out, right? In theory modern only could work, if the backends were paying extra attention to endianness, instead of just assuming that the code is running little-endian. > However, the code in contrib for vhost-user-blk at least seems > broken wrt endian-ness ATM. Agree. For example config is native endian ATM AFAICT. > What about other backends though? I think whenever the config is owned and managed by the vhost-backend we have a problem with transitional. And we don't have everything in the protocol to deal with this problem. I didn't check modern for the different vhost-user backends. I don't think we recommend our users on s390 to use those. My understanding of the use-cases is far form complete. > Hard to be sure right? I agree. > Cc Raphael and Stefan so they can take a look. > And I guess it's time we CC'd qemu-devel too. > > For now I am beginning to think we should either revert or just limit > validation to LE and think about all this some more. And I am inclining > to do a revert. I'm fine with either of these as a quick fix, but we will eventually have to find a solution. AFAICT this solution works for the s390 setups we care about the most, but so would a revert. > These are all hypervisors that shipped for a long time. > Do we need a flag for early config space access then? You mean a feature bit? I think it is a good idea even if it weren't strictly necessary. We will have a behavior change for some devices, and I think the ability to detect those is valuable. Your spec change proposal, makes it IMHO pretty clear, that we are changing our understanding of how transitional should work. Strictly, transitional is not a normative part of the spec AFAIU, but still... > > > > > > > Regarding the conditions I guess checking that driver_features has > > F_VERSION_1 already satisfies "only modern mode", or? > > Right. > > > For now > > I've deliberately omitted the has verify and the is big endian > > conditions so we have a better chance to see if something breaks > > (i.e. the approach does not work). I can add in those extra conditions > > later. > > Or maybe if we will go down that road just the verify check (for > performance). I'm a bit unhappy we have the extra exit but consistency > seems more important. > I'm fine either way. The extra exit is only for the initialization and one per 1 device, I have no feeling if this has a measurable performance impact. > > > >
Re: [RFC PATCH 1/1] virtio: write back features before verify
On Mon, 4 Oct 2021 09:11:04 -0400 "Michael S. Tsirkin" wrote: > > >> static inline bool virtio_access_is_big_endian(VirtIODevice *vdev) > > >> { > > >> #if defined(LEGACY_VIRTIO_IS_BIENDIAN) > > >> return virtio_is_big_endian(vdev); > > >> #elif defined(TARGET_WORDS_BIGENDIAN) > > >> if (virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1)) { > > >> /* Devices conforming to VIRTIO 1.0 or later are always LE. */ > > >> return false; > > >> } > > >> return true; > > >> #else > > >> return false; > > >> #endif > > >> } > > >> > > > > > > ok so that's a QEMU bug. Any virtio 1.0 and up > > > compatible device must use LE. > > > It can also present a legacy config space where the > > > endian depends on the guest. > > > > So, how is the virtio core supposed to determine this? A > > transport-specific callback? > > I'd say a field in VirtIODevice is easiest. Wouldn't a call from transport code into virtio core be more handy? What I have in mind is stuff like vhost-user and vdpa. My understanding is, that for vhost setups where the config is outside qemu, we probably need a new command that tells the vhost backend what endiannes to use for config. I don't think we can use VHOST_USER_SET_VRING_ENDIAN because that one is on a virtqueue basis according to the doc. So for vhost-user and similar we would fire that command and probably also set the filed, while for devices for which control plane is handled by QEMU we would just set the field. Does that sound about right? ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH 1/1] virtio: write back features before verify
On Mon, 04 Oct 2021 09:01:42 +0200 Cornelia Huck wrote: > On Sat, Oct 02 2021, "Michael S. Tsirkin" wrote: > > > On Fri, Oct 01, 2021 at 05:18:46PM +0200, Cornelia Huck wrote: > >> I'd say we need a hack here so that we assume little-endian config space > >> if VERSION_1 has been offered; if your patch here works, I assume QEMU > >> does what we expect (assmuming little-endian as well.) I'm mostly > >> wondering what happens if you use a different VMM; can we expect it to > >> work similar to QEMU? > > > > Hard to say of course ... hopefully other VMMs are actually > > implementing the spec. E.g. IIUC rust vmm is modern only. > > Yes, I kind of hope they are simply doing LE config space accesses. > > Are there any other VMMs that are actually supported on s390x (or other > BE architectures)? > I think zCX (z/OS Container Extensions) is relevant as it uses virtio. That is all I know about. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH 1/1] virtio: write back features before verify
On Sat, 2 Oct 2021 14:13:37 -0400 "Michael S. Tsirkin" wrote: > > Anyone else have an idea? This is a nasty regression; we could revert the > > patch, which would remove the symptoms and give us some time, but that > > doesn't really feel right, I'd do that only as a last resort. > > Well we have Halil's hack (except I would limit it > to only apply to BE, only do devices with validate, > and only in modern mode), and we will fix QEMU to be spec compliant. > Between these why do we need any conditional compiles? We don't. As I stated before, this hack is flawed because it effectively breaks fencing features by the driver with QEMU. Some features can not be unset after once set, because we tend to try to enable the corresponding functionality whenever we see a write features operation with the feature bit set, and we don't disable, if a subsequent features write operation stores the feature bit as not set. But it looks like VIRTIO_1 is fine to get cleared afterwards. So my hack should actually look like posted below, modulo conditions. Regarding the conditions I guess checking that driver_features has F_VERSION_1 already satisfies "only modern mode", or? For now I've deliberately omitted the has verify and the is big endian conditions so we have a better chance to see if something breaks (i.e. the approach does not work). I can add in those extra conditions later. ------8<- From: Halil Pasic Date: Thu, 30 Sep 2021 02:38:47 +0200 Subject: [PATCH] virtio: write back feature VERSION_1 before verify This patch fixes a regression introduced by commit 82e89ea077b9 ("virtio-blk: Add validation for block size in config space") and enables similar checks in verify() on big endian platforms. The problem with checking multi-byte config fields in the verify callback, on big endian platforms, and with a possibly transitional device is the following. The verify() callback is called between config->get_features() and virtio_finalize_features(). That we have a device that offered F_VERSION_1 then we have the following options either the device is transitional, and then it has to present the legacy interface, i.e. a big endian config space until F_VERSION_1 is negotiated, or we have a non-transitional device, which makes F_VERSION_1 mandatory, and only implements the non-legacy interface and thus presents a little endian config space. Because at this point we can't know if the device is transitional or non-transitional, we can't know do we need to byte swap or not. The virtio spec explicitly states that the driver MAY read config between reading and writing the features so saying that first accessing the config before feature negotiation is done is not an option. The specification ain't clear about setting the features multiple times before FEATURES_OK, so I guess that should be fine to set F_VERSION_1 since at this point we already know that we are about to negotiate F_VERSION_1. I don't consider this patch super clean, but frankly I don't think we have a ton of options. Another option that may or man not be cleaner, but is also IMHO much uglier is to figure out whether the device is transitional by rejecting _F_VERSION_1, then resetting it and proceeding according tho what we have figured out, hoping that the characteristics of the device didn't change. Signed-off-by: Halil Pasic Fixes: 82e89ea077b9 ("virtio-blk: Add validation for block size in config space") Reported-by: mark...@us.ibm.com --- drivers/virtio/virtio.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c index 0a5b54034d4b..2b9358f2e22a 100644 --- a/drivers/virtio/virtio.c +++ b/drivers/virtio/virtio.c @@ -239,6 +239,12 @@ static int virtio_dev_probe(struct device *_d) driver_features_legacy = driver_features; } + /* Write F_VERSION_1 feature to pin down endianness */ + if (device_features & (1ULL << VIRTIO_F_VERSION_1) & driver_features) { + dev->features = (1ULL << VIRTIO_F_VERSION_1); + dev->config->finalize_features(dev); + } + if (device_features & (1ULL << VIRTIO_F_VERSION_1)) dev->features = driver_features & device_features; else -- 2.31.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH 1/1] virtio: write back features before verify
On Sat, 2 Oct 2021 14:20:47 -0400 "Michael S. Tsirkin" wrote: > > >From my perspective the problem is that the version of the device > > remains in limbo as long as the features have not yet been finalized, > > which means that the endianness of the config space remains in limbo as > > well. Both device and driver might come to different conclusions. > > Version === legacy versus modern? > It is true that feature negotiation can not be used by device to decide that > question simply because it happens too late. > So let's not use it for that then ;) > > Yes we have VERSION_1 which looks like it should allow this, but > unfortunately it only helps with that for the driver, not the device. > > In practice legacy versus modern has to be determined by > transport specific versioning, luckily we have that for all > specified transports (can't say what happens with rproc). So if we look at ccw, you say that the revision negotiation already determines whether VERSION_1 is negotiated or not, and the feature bit VERSION_1 is superfluous? That would also imply, that 1) if revision > 0 was negotiated then the device must offer VERSION_1 2) if revision > 0 was negotiated and the driver cleared VERSION_1 the device must refuse to operate. 3) if revision > 0 was negotiated then the driver should reject to drive a device if it does not offer VERSION_1 4) if revision > 0 was negotiated the driver must accept VERSION_1 5) if revision > 0 was *not* negotiated then the device should not offer VERSION_1 because at this point it is already certain that the device can not act in accordance to the virtio 1.0 or higher interface. Does that sound about right? IMHO we should also change https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-160003 and the definition of VERSION_1 because both sides have to know what is going on before features are fully negotiated. Or? Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH 1/1] virtio: write back features before verify
On Thu, 30 Sep 2021 13:31:04 +0200 Cornelia Huck wrote: > On Thu, Sep 30 2021, Halil Pasic wrote: > > > On Thu, 30 Sep 2021 11:28:23 +0200 > > Cornelia Huck wrote: > > > >> On Thu, Sep 30 2021, Halil Pasic wrote: > >> > >> > This patch fixes a regression introduced by commit 82e89ea077b9 > >> > ("virtio-blk: Add validation for block size in config space") and > >> > enables similar checks in verify() on big endian platforms. > >> > > >> > The problem with checking multi-byte config fields in the verify > >> > callback, on big endian platforms, and with a possibly transitional > >> > device is the following. The verify() callback is called between > >> > config->get_features() and virtio_finalize_features(). That we have a > >> > device that offered F_VERSION_1 then we have the following options > >> > either the device is transitional, and then it has to present the legacy > >> > interface, i.e. a big endian config space until F_VERSION_1 is > >> > negotiated, or we have a non-transitional device, which makes > >> > F_VERSION_1 mandatory, and only implements the non-legacy interface and > >> > thus presents a little endian config space. Because at this point we > >> > can't know if the device is transitional or non-transitional, we can't > >> > know do we need to byte swap or not. > >> > > >> > The virtio spec explicitly states that the driver MAY read config > >> > between reading and writing the features so saying that first accessing > >> > the config before feature negotiation is done is not an option. The > >> > specification ain't clear about setting the features multiple times > >> > before FEATURES_OK, so I guess that should be fine. > >> > > >> > I don't consider this patch super clean, but frankly I don't think we > >> > have a ton of options. Another option that may or man not be cleaner, > >> > but is also IMHO much uglier is to figure out whether the device is > >> > transitional by rejecting _F_VERSION_1, then resetting it and proceeding > >> > according tho what we have figured out, hoping that the characteristics > >> > of the device didn't change. > >> > > >> > Signed-off-by: Halil Pasic > >> > Fixes: 82e89ea077b9 ("virtio-blk: Add validation for block size in > >> > config space") > >> > Reported-by: mark...@us.ibm.com > >> > --- > >> > drivers/virtio/virtio.c | 4 > >> > 1 file changed, 4 insertions(+) > >> > > >> > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c > >> > index 0a5b54034d4b..9dc3cfa17b1c 100644 > >> > --- a/drivers/virtio/virtio.c > >> > +++ b/drivers/virtio/virtio.c > >> > @@ -249,6 +249,10 @@ static int virtio_dev_probe(struct device *_d) > >> > if (device_features & (1ULL << i)) > >> > __virtio_set_bit(dev, i); > >> > > >> > +/* Write back features before validate to know endianness */ > >> > +if (device_features & (1ULL << VIRTIO_F_VERSION_1)) > >> > +dev->config->finalize_features(dev); > >> > >> This really looks like a mess :( > >> > >> We end up calling ->finalize_features twice: once before ->validate, and > >> once after, that time with the complete song and dance. The first time, > >> we operate on one feature set; after validation, we operate on another, > >> and there might be interdependencies between the two (like a that a bit > >> is cleared because of another bit, which would not happen if validate > >> had a chance to clear that bit before). > > > > Basically the second set is a subset of the first set. > > I don't think that's clear. Validate can only remove features, or? So I guess after validate is a subset of before validate. > > > > >> > >> I'm not sure whether that is even a problem in the spec: while the > >> driver may read the config before finally accepting features > > > > I'm not sure I'm following you. Let me please qoute the specification: > > """ > > 4. Read device feature bits, and write the subset of feature bits > > understood by the OS and driver to the device. During this step the driver > > MAY read (but MUST NOT write) the device-specific co
Re: [RFC PATCH 1/1] virtio: write back features before verify
On Thu, 30 Sep 2021 07:12:21 -0400 "Michael S. Tsirkin" wrote: > On Thu, Sep 30, 2021 at 03:20:49AM +0200, Halil Pasic wrote: > > This patch fixes a regression introduced by commit 82e89ea077b9 > > ("virtio-blk: Add validation for block size in config space") and > > enables similar checks in verify() on big endian platforms. > > > > The problem with checking multi-byte config fields in the verify > > callback, on big endian platforms, and with a possibly transitional > > device is the following. The verify() callback is called between > > config->get_features() and virtio_finalize_features(). That we have a > > device that offered F_VERSION_1 then we have the following options > > either the device is transitional, and then it has to present the legacy > > interface, i.e. a big endian config space until F_VERSION_1 is > > negotiated, or we have a non-transitional device, which makes > > F_VERSION_1 mandatory, and only implements the non-legacy interface and > > thus presents a little endian config space. Because at this point we > > can't know if the device is transitional or non-transitional, we can't > > know do we need to byte swap or not. > > Hmm which transport does this refer to? It is the same with virtio-ccw and virtio-pci. I see the same problem with both on s390x. I didn't try with virtio-blk-pci-non-transitional yet (have to figure out how to do that with libvirt) for pci I used virtio-blk-pci. > Distinguishing between legacy and modern drivers is transport > specific. PCI presents > legacy and modern at separate addresses so distinguishing > between these two should be no trouble. You mean the device id? Yes that is bolted down in the spec, but currently we don't exploit that information. Furthermore there is a fat chance that with QEMU even the allegedly non-transitional devices only present a little endian config space after VERSION_1 was negotiated. Namely get_config for virtio-blk is implemented in virtio_blk_update_config() which does virtio_stl_p(vdev, _size, blk_size) and in there we don't care about transitional or not: static inline bool virtio_access_is_big_endian(VirtIODevice *vdev) { #if defined(LEGACY_VIRTIO_IS_BIENDIAN) return virtio_is_big_endian(vdev); #elif defined(TARGET_WORDS_BIGENDIAN) if (virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1)) { /* Devices conforming to VIRTIO 1.0 or later are always LE. */ return false; } return true; #else return false; #endif } > Channel i/o has versioning so same thing? > Don't think so. Both a transitional and a non-transitional device would have to accept revisions higher than 0 if the driver tried to negotiate those (and we do in our case). > > The virtio spec explicitly states that the driver MAY read config > > between reading and writing the features so saying that first accessing > > the config before feature negotiation is done is not an option. The > > specification ain't clear about setting the features multiple times > > before FEATURES_OK, so I guess that should be fine. > > > > I don't consider this patch super clean, but frankly I don't think we > > have a ton of options. Another option that may or man not be cleaner, > > but is also IMHO much uglier is to figure out whether the device is > > transitional by rejecting _F_VERSION_1, then resetting it and proceeding > > according tho what we have figured out, hoping that the characteristics > > of the device didn't change. > > I am confused here. So is the problem at the device or at the driver level? We have a driver regression. Since the 82e89ea077b9 ("virtio-blk: Add validation for block size in config space") virtio-blk is broken on s390. The deeper problem is in the spec. We stated that the driver may read config space before the feature negotiation is finalized, but we didn't think enough about what happens when native endiannes is not little endian in the different cases. I believe, for non-transitional devices we have a problem in the host as well (i.e. in QEMU). > I suspect it's actually the host that has the issue, not > the guest? I tend to say we have a problem both in the host and in the guest. I'm more concerned about the problem in the guest, because that is a really nasty regression. For the host. I think for legacy we don't have a problem, because both sides would operate on the assumption no _F_VERSION_1, IMHO the implementation for the transitional devices is correct. For non-transitional flavor, it depends on the device. For example virtio-net and virtio-blk is broken, because we use primitives like virtio_stl_p() and those don't do the right thing before feature negotiation is completed. On the other hand virtio-crypto.c as a truly non-transitional device uses
Re: [RFC PATCH 1/1] virtio: write back features before verify
On Thu, 30 Sep 2021 11:28:23 +0200 Cornelia Huck wrote: > On Thu, Sep 30 2021, Halil Pasic wrote: > > > This patch fixes a regression introduced by commit 82e89ea077b9 > > ("virtio-blk: Add validation for block size in config space") and > > enables similar checks in verify() on big endian platforms. > > > > The problem with checking multi-byte config fields in the verify > > callback, on big endian platforms, and with a possibly transitional > > device is the following. The verify() callback is called between > > config->get_features() and virtio_finalize_features(). That we have a > > device that offered F_VERSION_1 then we have the following options > > either the device is transitional, and then it has to present the legacy > > interface, i.e. a big endian config space until F_VERSION_1 is > > negotiated, or we have a non-transitional device, which makes > > F_VERSION_1 mandatory, and only implements the non-legacy interface and > > thus presents a little endian config space. Because at this point we > > can't know if the device is transitional or non-transitional, we can't > > know do we need to byte swap or not. > > > > The virtio spec explicitly states that the driver MAY read config > > between reading and writing the features so saying that first accessing > > the config before feature negotiation is done is not an option. The > > specification ain't clear about setting the features multiple times > > before FEATURES_OK, so I guess that should be fine. > > > > I don't consider this patch super clean, but frankly I don't think we > > have a ton of options. Another option that may or man not be cleaner, > > but is also IMHO much uglier is to figure out whether the device is > > transitional by rejecting _F_VERSION_1, then resetting it and proceeding > > according tho what we have figured out, hoping that the characteristics > > of the device didn't change. > > > > Signed-off-by: Halil Pasic > > Fixes: 82e89ea077b9 ("virtio-blk: Add validation for block size in config > > space") > > Reported-by: mark...@us.ibm.com > > --- > > drivers/virtio/virtio.c | 4 > > 1 file changed, 4 insertions(+) > > > > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c > > index 0a5b54034d4b..9dc3cfa17b1c 100644 > > --- a/drivers/virtio/virtio.c > > +++ b/drivers/virtio/virtio.c > > @@ -249,6 +249,10 @@ static int virtio_dev_probe(struct device *_d) > > if (device_features & (1ULL << i)) > > __virtio_set_bit(dev, i); > > > > + /* Write back features before validate to know endianness */ > > + if (device_features & (1ULL << VIRTIO_F_VERSION_1)) > > + dev->config->finalize_features(dev); > > This really looks like a mess :( > > We end up calling ->finalize_features twice: once before ->validate, and > once after, that time with the complete song and dance. The first time, > we operate on one feature set; after validation, we operate on another, > and there might be interdependencies between the two (like a that a bit > is cleared because of another bit, which would not happen if validate > had a chance to clear that bit before). Basically the second set is a subset of the first set. > > I'm not sure whether that is even a problem in the spec: while the > driver may read the config before finally accepting features I'm not sure I'm following you. Let me please qoute the specification: """ 4. Read device feature bits, and write the subset of feature bits understood by the OS and driver to the device. During this step the driver MAY read (but MUST NOT write) the device-specific configuration fields to check that it can support the device before accepting it. 5. Set the FEATURES_OK status bit. The driver MUST NOT accept new feature bits after this step. """ https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-930001 > , it does > not really make sense to do so before a feature bit as basic as > VERSION_1 which determines the endianness has been negotiated. Are you suggesting that ->verify() should be after virtio_finalize_features()? Wouldn't that mean that verify() can't reject feature bits? But that is the whole point of commit 82e89ea077b9 ("virtio-blk: Add validation for block size in config space"). Do you think that the commit in question is conceptually flawed? My understanding of the verify is, that it is supposed to fence features and feature bits we can't support, e.g. because of config space things, but I may be wrong. The trouble is, feature bits are not negotiated one
[RFC PATCH 1/1] virtio: write back features before verify
This patch fixes a regression introduced by commit 82e89ea077b9 ("virtio-blk: Add validation for block size in config space") and enables similar checks in verify() on big endian platforms. The problem with checking multi-byte config fields in the verify callback, on big endian platforms, and with a possibly transitional device is the following. The verify() callback is called between config->get_features() and virtio_finalize_features(). That we have a device that offered F_VERSION_1 then we have the following options either the device is transitional, and then it has to present the legacy interface, i.e. a big endian config space until F_VERSION_1 is negotiated, or we have a non-transitional device, which makes F_VERSION_1 mandatory, and only implements the non-legacy interface and thus presents a little endian config space. Because at this point we can't know if the device is transitional or non-transitional, we can't know do we need to byte swap or not. The virtio spec explicitly states that the driver MAY read config between reading and writing the features so saying that first accessing the config before feature negotiation is done is not an option. The specification ain't clear about setting the features multiple times before FEATURES_OK, so I guess that should be fine. I don't consider this patch super clean, but frankly I don't think we have a ton of options. Another option that may or man not be cleaner, but is also IMHO much uglier is to figure out whether the device is transitional by rejecting _F_VERSION_1, then resetting it and proceeding according tho what we have figured out, hoping that the characteristics of the device didn't change. Signed-off-by: Halil Pasic Fixes: 82e89ea077b9 ("virtio-blk: Add validation for block size in config space") Reported-by: mark...@us.ibm.com --- drivers/virtio/virtio.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c index 0a5b54034d4b..9dc3cfa17b1c 100644 --- a/drivers/virtio/virtio.c +++ b/drivers/virtio/virtio.c @@ -249,6 +249,10 @@ static int virtio_dev_probe(struct device *_d) if (device_features & (1ULL << i)) __virtio_set_bit(dev, i); + /* Write back features before validate to know endianness */ + if (device_features & (1ULL << VIRTIO_F_VERSION_1)) + dev->config->finalize_features(dev); + if (drv->validate) { err = drv->validate(dev); if (err) base-commit: 02d5e016800d082058b3d3b7c3ede136cdc6ddcb -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/1] virtio/s390: fix vritio-ccw device teardown
On Tue, 21 Sep 2021 15:31:03 +0200 Vineeth Vijayan wrote: > On Tue, 2021-09-21 at 05:25 +0200, Halil Pasic wrote: > > On Mon, 20 Sep 2021 12:07:23 +0200 > > Cornelia Huck wrote: > > > > > On Mon, Sep 20 2021, Vineeth Vijayan wrote: > > > > > > > On Mon, 2021-09-20 at 00:39 +0200, Halil Pasic wrote: > > > > > On Fri, 17 Sep 2021 10:40:20 +0200 > > > > > Cornelia Huck wrote: > > > > > > > > > ...snip... > > > > > > > Thanks, if I find time for it, I will try to understand > > > > > > > this > > > > > > > better and > > > > > > > come back with my findings. > > > > > > > > > > > > > > > > * Can virtio_ccw_remove() get called while !cdev- > > > > > > > > > >online and > > > > > > > > > virtio_ccw_online() is running on a different cpu? If > > > > > > > > > yes, > > > > > > > > > what would > > > > > > > > > happen then? > > > > > > > > > > > > > > > > All of the remove/online/... etc. callbacks are invoked > > > > > > > > via the > > > > > > > > ccw bus > > > > > > > > code. We have to trust that it gets it correct :) (Or > > > > > > > > have the > > > > > > > > common > > > > > > > > I/O layer maintainers double-check it.) > > > > > > > > > > > > > > > > > > > > > > Vineeth, what is your take on this? Are the struct > > > > > > > ccw_driver > > > > > > > virtio_ccw_remove and the virtio_ccw_online callbacks > > > > > > > mutually > > > > > > > exclusive. Please notice that we may initiate the onlining > > > > > > > by > > > > > > > calling ccw_device_set_online() from a workqueue. > > > > > > > > > > > > > > @Conny: I'm not sure what is your definition of 'it gets it > > > > > > > correct'... > > > > > > > I doubt CIO can make things 100% foolproof in this > > > > > > > area. > > > > > > > > > > > > Not 100% foolproof, but "don't online a device that is in the > > > > > > progress > > > > > > of going away" seems pretty basic to me. > > > > > > > > > > > > > > > > I hope Vineeth will chime in on this. > > > > Considering the online/offline processing, > > > > The ccw_device_set_offline function or the online/offline is > > > > handled > > > > inside device_lock. Also, the online_store function takes care of > > > > avoiding multiple online/offline processing. > > > > > > > > Now, when we consider the unconditional remove of the device, > > > > I am not familiar with the virtio_ccw driver. My assumptions are > > > > based > > > > on how CIO/dasd drivers works. If i understand correctly, the > > > > dasd > > > > driver sets different flags to make sure that a device_open is > > > > getting > > > > prevented while the the device is in progress of offline-ing. > > > > > > Hm, if we are invoking the online/offline callbacks under the > > > device > > > lock already, > > > > I believe we have a misunderstanding here. I believe that Vineeth is > > trying to tell us, that online_store_handle_offline() and > > online_store_handle_offline() are called under the a device lock of > > the ccw device. Right, Vineeth? > Yes. I wanted to bring-out both the scenario.The set_offline/_online() > calls and the unconditional-remove call. I don't understand the paragraph above. I can't map the terms set_offline/_online() and unconditional-remove call to chunks of code. :( > For the set_online The virtio_ccw_online() also invoked with ccwlock > held. (ref: ccw_device_set_online) I don't think virtio_ccw_online() is invoked with the ccwlock held. I think we call virtio_ccw_online() in this line: https://elixir.bootlin.com/linux/v5.15-rc2/source/drivers/s390/cio/device.c#L394 and we have released the cdev->ccwlock literally 2 lines above. > > > > Conny, I believe, by online/offline callbacks, you mean
Re: [PATCH 1/1] virtio/s390: fix vritio-ccw device teardown
On Mon, 20 Sep 2021 12:07:23 +0200 Cornelia Huck wrote: > On Mon, Sep 20 2021, Vineeth Vijayan wrote: > > > On Mon, 2021-09-20 at 00:39 +0200, Halil Pasic wrote: > >> On Fri, 17 Sep 2021 10:40:20 +0200 > >> Cornelia Huck wrote: > >> > > ...snip... > >> > > > >> > > Thanks, if I find time for it, I will try to understand this > >> > > better and > >> > > come back with my findings. > >> > > > >> > > > > * Can virtio_ccw_remove() get called while !cdev->online and > >> > > > > virtio_ccw_online() is running on a different cpu? If yes, > >> > > > > what would > >> > > > > happen then? > >> > > > > >> > > > All of the remove/online/... etc. callbacks are invoked via the > >> > > > ccw bus > >> > > > code. We have to trust that it gets it correct :) (Or have the > >> > > > common > >> > > > I/O layer maintainers double-check it.) > >> > > > > >> > > > >> > > Vineeth, what is your take on this? Are the struct ccw_driver > >> > > virtio_ccw_remove and the virtio_ccw_online callbacks mutually > >> > > exclusive. Please notice that we may initiate the onlining by > >> > > calling ccw_device_set_online() from a workqueue. > >> > > > >> > > @Conny: I'm not sure what is your definition of 'it gets it > >> > > correct'... > >> > > I doubt CIO can make things 100% foolproof in this area. > >> > > >> > Not 100% foolproof, but "don't online a device that is in the > >> > progress > >> > of going away" seems pretty basic to me. > >> > > >> > >> I hope Vineeth will chime in on this. > > Considering the online/offline processing, > > The ccw_device_set_offline function or the online/offline is handled > > inside device_lock. Also, the online_store function takes care of > > avoiding multiple online/offline processing. > > > > Now, when we consider the unconditional remove of the device, > > I am not familiar with the virtio_ccw driver. My assumptions are based > > on how CIO/dasd drivers works. If i understand correctly, the dasd > > driver sets different flags to make sure that a device_open is getting > > prevented while the the device is in progress of offline-ing. > > Hm, if we are invoking the online/offline callbacks under the device > lock already, I believe we have a misunderstanding here. I believe that Vineeth is trying to tell us, that online_store_handle_offline() and online_store_handle_offline() are called under the a device lock of the ccw device. Right, Vineeth? Conny, I believe, by online/offline callbacks, you mean virtio_ccw_online() and virtio_ccw_offline(), right? But the thing is that virtio_ccw_online() may get called (and is typically called, AFAICT) with no locks held via: virtio_ccw_probe() --> async_schedule(virtio_ccw_auto_online, cdev) -*-> virtio_ccw_auto_online(cdev) --> ccw_device_set_online(cdev) --> virtio_ccw_online() Furthermore after a closer look, I believe because we don't take a reference to the cdev in probe, we may get virtio_ccw_auto_online() called with an invalid pointer (the pointer is guaranteed to be valid in probe, but because of async we have no guarantee that it will be called in the context of probe). Shouldn't we take a reference to the cdev in probe? BTW what is the reason for the async? > how would that affect the remove callback? I believe dev->bus->remove(dev) is called by bus_remove_device() with the device lock held. I.e. I believe that means that virtio_ccw_remove() is called with the ccw devices device lock held. Vineeth can you confirm that? The thing is, both virtio_ccw_remove() and virtio_ccw_offline() are very similar, with the notable exception that offline assumes we are online() at the moment, while remove() does the same only if it decides based on vcdev && cdev->online that we are online. > Shouldn't they > be serialized under the device lock already? I think we are fine. AFAICT virtio_ccw_remove() and virtio_ccw_offline() are serialized against each other under the device lock. And also against virtio_ccw_online() iff it was initiated via the sysfs, and not via the auto-online mechanism. Thus I don't think we are fine at the moment. > > For dasd, I think they also need to deal with the block device > lifetimes. For virtio-ccw, we are basically a transport that does not > know about devices further down the chain (that are associated with the > virtio device, whose lifetime is tied to online/offline processing.) I'd > presume that the serialization above would be enough. > I don't know about dasd that much. For the reasons stated above, I don't think the serialization we have right now is entirely sufficient. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/1] virtio/s390: fix vritio-ccw device teardown
On Mon, 20 Sep 2021 12:30:39 +0200 Cornelia Huck wrote: > On Mon, Sep 20 2021, Halil Pasic wrote: [..] > > Basically, the vcdev is supposed to be around while the ccw device is > online (with a tail end until references have been given up, of course.) > It embeds a virtio device that has the ccw device as a parent, which > will give us a reference on the ccw device as long as the virtio device > is alive. Any interactions with the ccw device (except freeing the dma > buffer) are limited to the time where we still have a reference to it > via the virtio device. > I didn't remember that device_add() takes a reference to the parent, and that device_del() before device_put(dev) and remove callback. > > > >> > >> > So my intuition tells me that > >> > drivers should manage explicitly. Yes virtio_ccw happens to have dma > >> > memory whose lifetime is more or less the lifetime of struct virtio_ccw, > >> > but that may not be always the case. > >> > >> I'm not sure what you're getting at here. Regardless of the lifetime of > >> the dma memory, it depends on the presence of the ccw device to which it > >> is tied. This means that the ccw device must not be released while the > >> dma memory is alive. We can use the approach in your patch here due to > >> the lifetime of the dma memory that virtio-ccw allocates when we start > >> using the device and frees when we stop using the device, or we can use > >> get/put with every allocate/release dma memory pair, which should be > >> safe for everyone? > >> > > > > What I mean is that ccw_device_dma_[zalloc,free]() take a pointer to the > > ccw_device. If we get/put in those we can ensure that, provided the > > alloc and the free calls are properly paired, the device will be still > > alive (and the pointer valid) for the free, if it was valid for the > > alloc. But it does not ensure that each and every call to alloc is with > > a valid pointer, or that other uses of the pointer are OK. So I don't > > think it is completely safe for everyone, because we could try to use > > a pointer to a ccw device when not having any dma memory allocated from > > its pool. > > But the problem is the dma memory, right? Also, it is the same issue for > any potential caller of the ccw_device_dma_* interfaces. I tend to agree, my argument was based on the assumption that we did not use to take a reference to the ccw device in virtio_ccw_online(), but we do via register_virtio_device(). This reference however gets dropped right before virtio_ccw_release_dev() is called. > > > > > This patch takes reference to cdev before the pointer is published via > > vcdev->cdev and drops the reference after *vcdev is freed. The idea is > > that the pointee basically outlives the pointer. (Without having a full > > understanding of how things are synchronized). > > I don't think we have to care about accessing ->cdev (see above.) Plus, > as we give up the dma memory at the very last point, we would also give > up the reference via that memory at the very last point, so I'm not sure > what additional problems could come up. I understand now. Let me think about it some more. I'm wonderning about leafs. Will come back at you shortly. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/1] virtio/s390: fix vritio-ccw device teardown
On Fri, 17 Sep 2021 10:40:20 +0200 Cornelia Huck wrote: > On Thu, Sep 16 2021, Halil Pasic wrote: > > > On Thu, 16 Sep 2021 10:59:15 +0200 > > Cornelia Huck wrote: > > > >> > Since commit 48720ba56891 ("virtio/s390: use DMA memory for ccw I/O and > >> > classic notifiers") we were supposed to make sure that > >> > virtio_ccw_release_dev() completes before the ccw device, and the > >> > attached dma pool are torn down, but unfortunately we did not. > >> > Before that commit it used to be OK to delay cleaning up the memory > >> > allocated by virtio-ccw indefinitely (which isn't really intuitive for > >> > guys used to destruction happens in reverse construction order). > >> > > >> > To accomplish this let us take a reference on the ccw device before we > >> > allocate the dma_area and give it up after dma_area was freed. > >> > > >> > Signed-off-by: Halil Pasic > >> > Fixes: 48720ba56891 ("virtio/s390: use DMA memory for ccw I/O and > >> > classic notifiers") > >> > Reported-by: b...@redhat.com > >> > --- > >> > > >> > I'm not certain this is the only hot-unplug and teardonw related problem > >> > with virtio-ccw. > >> > > >> > Some things that are not perfectly clear to me: > >> > * What would happen if we observed an hot-unplug while we are doing > >> > wait_event() in ccw_io_helper()? Do we get stuck? I don't thin we > >> > are guaranteed to receive an irq for a subchannel that is gone. > >> > >> Hm. I think we may need to do a wake_up during remove handling. > > > > My guess is that the BQL is saving us from ever seeing this with QEMU > > as the hypervisor-userspace. Nevertheless I don't think we should rely > > on that. > > I agree. Let's do that via a separate patch. > I understand you would like us to finish the discussion on the alternate approach before giving an r-b for this patch, right? > > > >> > >> > * cdev->online seems to be manipulated under cdev->ccwlock, but > >> > in virtio_ccw_remove() we look at it to decide should we clean up > >> > or not. What is the idea there? I guess we want to avoid doing > >> > if nothing is there or twice. But I don't understand how stuff > >> > interlocks. > >> > >> We only created the virtio device when we onlined the ccw device. Do you > >> have a better idea how to check for that? (And yes, I'm not sure the > >> locking is correct.) > >> > > > > Thanks, if I find time for it, I will try to understand this better and > > come back with my findings. > > > >> > * Can virtio_ccw_remove() get called while !cdev->online and > >> > virtio_ccw_online() is running on a different cpu? If yes, what would > >> > happen then? > >> > >> All of the remove/online/... etc. callbacks are invoked via the ccw bus > >> code. We have to trust that it gets it correct :) (Or have the common > >> I/O layer maintainers double-check it.) > >> > > > > Vineeth, what is your take on this? Are the struct ccw_driver > > virtio_ccw_remove and the virtio_ccw_online callbacks mutually > > exclusive. Please notice that we may initiate the onlining by > > calling ccw_device_set_online() from a workqueue. > > > > @Conny: I'm not sure what is your definition of 'it gets it correct'... > > I doubt CIO can make things 100% foolproof in this area. > > Not 100% foolproof, but "don't online a device that is in the progress > of going away" seems pretty basic to me. > I hope Vineeth will chime in on this. > > > >> > > >> > The main addresse of these questions is Conny ;). > > > > In any case, I think we can go step by step. I would like the issue > > this patch intends to address, addressed first. Then we can think > > about the rest. > > > >> > > >> > An alternative to this approach would be to inc and dec the refcount > >> > in ccw_device_dma_zalloc() and ccw_device_dma_free() respectively. > >> > >> Yeah, I also thought about that. This would give us more get/put > >> operations, but might be the safer option. > > > > My understanding is, that having the ccw device go away while in a > > middle of doing ccw stuff (about to submit, or waiting for a channel > > program, or whatever) was bad befor
Re: [PATCH 1/1] virtio/s390: fix vritio-ccw device teardown
On Thu, 16 Sep 2021 10:59:15 +0200 Cornelia Huck wrote: > > Since commit 48720ba56891 ("virtio/s390: use DMA memory for ccw I/O and > > classic notifiers") we were supposed to make sure that > > virtio_ccw_release_dev() completes before the ccw device, and the > > attached dma pool are torn down, but unfortunately we did not. > > Before that commit it used to be OK to delay cleaning up the memory > > allocated by virtio-ccw indefinitely (which isn't really intuitive for > > guys used to destruction happens in reverse construction order). > > > > To accomplish this let us take a reference on the ccw device before we > > allocate the dma_area and give it up after dma_area was freed. > > > > Signed-off-by: Halil Pasic > > Fixes: 48720ba56891 ("virtio/s390: use DMA memory for ccw I/O and > > classic notifiers") > > Reported-by: b...@redhat.com > > --- > > > > I'm not certain this is the only hot-unplug and teardonw related problem > > with virtio-ccw. > > > > Some things that are not perfectly clear to me: > > * What would happen if we observed an hot-unplug while we are doing > > wait_event() in ccw_io_helper()? Do we get stuck? I don't thin we > > are guaranteed to receive an irq for a subchannel that is gone. > > Hm. I think we may need to do a wake_up during remove handling. My guess is that the BQL is saving us from ever seeing this with QEMU as the hypervisor-userspace. Nevertheless I don't think we should rely on that. > > > * cdev->online seems to be manipulated under cdev->ccwlock, but > > in virtio_ccw_remove() we look at it to decide should we clean up > > or not. What is the idea there? I guess we want to avoid doing > > if nothing is there or twice. But I don't understand how stuff > > interlocks. > > We only created the virtio device when we onlined the ccw device. Do you > have a better idea how to check for that? (And yes, I'm not sure the > locking is correct.) > Thanks, if I find time for it, I will try to understand this better and come back with my findings. > > * Can virtio_ccw_remove() get called while !cdev->online and > > virtio_ccw_online() is running on a different cpu? If yes, what would > > happen then? > > All of the remove/online/... etc. callbacks are invoked via the ccw bus > code. We have to trust that it gets it correct :) (Or have the common > I/O layer maintainers double-check it.) > Vineeth, what is your take on this? Are the struct ccw_driver virtio_ccw_remove and the virtio_ccw_online callbacks mutually exclusive. Please notice that we may initiate the onlining by calling ccw_device_set_online() from a workqueue. @Conny: I'm not sure what is your definition of 'it gets it correct'... I doubt CIO can make things 100% foolproof in this area. > > > > The main addresse of these questions is Conny ;). In any case, I think we can go step by step. I would like the issue this patch intends to address, addressed first. Then we can think about the rest. > > > > An alternative to this approach would be to inc and dec the refcount > > in ccw_device_dma_zalloc() and ccw_device_dma_free() respectively. > > Yeah, I also thought about that. This would give us more get/put > operations, but might be the safer option. My understanding is, that having the ccw device go away while in a middle of doing ccw stuff (about to submit, or waiting for a channel program, or whatever) was bad before. So my intuition tells me that drivers should manage explicitly. Yes virtio_ccw happens to have dma memory whose lifetime is more or less the lifetime of struct virtio_ccw, but that may not be always the case. Thanks for your comments! Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/1] virtio/s390: fix vritio-ccw device teardown
s/vritio/virtio/ (subject) [..] On Wed, 15 Sep 2021 23:57:42 +0200 Halil Pasic wrote: > Since commit 48720ba56891 ("virtio/s390: use DMA memory for ccw I/O and > classic notifiers") we were supposed to make sure that > virtio_ccw_release_dev() completes before the ccw device, and the > attached dma pool are torn down, but unfortunately we did not. > Before that commit it used to be OK to delay cleaning up the memory > allocated by virtio-ccw indefinitely (which isn't really intuitive for > guys used to destruction happens in reverse construction order). > [..] ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 1/1] virtio/s390: fix vritio-ccw device teardown
Since commit 48720ba56891 ("virtio/s390: use DMA memory for ccw I/O and classic notifiers") we were supposed to make sure that virtio_ccw_release_dev() completes before the ccw device, and the attached dma pool are torn down, but unfortunately we did not. Before that commit it used to be OK to delay cleaning up the memory allocated by virtio-ccw indefinitely (which isn't really intuitive for guys used to destruction happens in reverse construction order). To accomplish this let us take a reference on the ccw device before we allocate the dma_area and give it up after dma_area was freed. Signed-off-by: Halil Pasic Fixes: 48720ba56891 ("virtio/s390: use DMA memory for ccw I/O and classic notifiers") Reported-by: b...@redhat.com --- I'm not certain this is the only hot-unplug and teardonw related problem with virtio-ccw. Some things that are not perfectly clear to me: * What would happen if we observed an hot-unplug while we are doing wait_event() in ccw_io_helper()? Do we get stuck? I don't thin we are guaranteed to receive an irq for a subchannel that is gone. * cdev->online seems to be manipulated under cdev->ccwlock, but in virtio_ccw_remove() we look at it to decide should we clean up or not. What is the idea there? I guess we want to avoid doing if nothing is there or twice. But I don't understand how stuff interlocks. * Can virtio_ccw_remove() get called while !cdev->online and virtio_ccw_online() is running on a different cpu? If yes, what would happen then? The main addresse of these questions is Conny ;). An alternative to this approach would be to inc and dec the refcount in ccw_device_dma_zalloc() and ccw_device_dma_free() respectively. --- drivers/s390/virtio/virtio_ccw.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c index d35e7a3f7067..99141df3259b 100644 --- a/drivers/s390/virtio/virtio_ccw.c +++ b/drivers/s390/virtio/virtio_ccw.c @@ -1006,10 +1006,12 @@ static void virtio_ccw_release_dev(struct device *_d) { struct virtio_device *dev = dev_to_virtio(_d); struct virtio_ccw_device *vcdev = to_vc_device(dev); + struct ccw_device *cdev = READ_ONCE(vcdev->cdev); ccw_device_dma_free(vcdev->cdev, vcdev->dma_area, sizeof(*vcdev->dma_area)); kfree(vcdev); + put_device(>dev); } static int irb_is_error(struct irb *irb) @@ -1262,6 +1264,7 @@ static int virtio_ccw_online(struct ccw_device *cdev) struct virtio_ccw_device *vcdev; unsigned long flags; + get_device(>dev); vcdev = kzalloc(sizeof(*vcdev), GFP_KERNEL); if (!vcdev) { dev_warn(>dev, "Could not get memory for virtio\n"); @@ -1315,6 +1318,7 @@ static int virtio_ccw_online(struct ccw_device *cdev) sizeof(*vcdev->dma_area)); } kfree(vcdev); + put_device(>dev); return ret; } base-commit: 3ca706c189db861b2ca2019a0901b94050ca49d8 -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] virtio/s390: implement virtio-ccw revision 2 correctly
On Tue, 16 Feb 2021 11:39:07 +0100 Cornelia Huck wrote: > > > > Reviewed-by: Halil Pasic > > Thanks! > > I'll do a v2 with a tweaked commit message and cc:stable. Sounds good! ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] virtio/s390: implement virtio-ccw revision 2 correctly
On Mon, 15 Feb 2021 12:47:02 +0100 Cornelia Huck wrote: > On Fri, 12 Feb 2021 18:04:11 +0100 > Cornelia Huck wrote: > > > CCW_CMD_READ_STATUS was introduced with revision 2 of virtio-ccw, > > and drivers should only rely on it being implemented when they > > negotiated at least that revision with the device. > > > > However, virtio_ccw_get_status() issued READ_STATUS for any > > device operating at least at revision 1. If the device accepts > > READ_STATUS regardless of the negotiated revision (which it is > > free to do), > > So, looking at the standard again, the device is actually required to > reject the READ_STATUS if only rev 1 had been negotiated... regardless > of that, I don't think we should change QEMU's behaviour, as it would > affect existing guests (they would lose access to the status bits as > observed by the device, including DEVICE_NEEDS_RESET.) Not only that, without READ_STATUS, we can't do device reset which is a prerequisite for a proper cleanup, as required by the spec. You certainly remember, the driver has may not assume the reset was performed (and thus virtqueues are not live) until it reads back status 0. But without READ_STATUS virtio_ccw_get_status() will keep returning the status the driver last set via virtio_ccw_set_status(). And CCW_CMD_VDEV_RESET is of course revision 1 material. This looks ugly! > > > everything works as intended; a device rejecting the > > command should also be handled gracefully. For correctness, we > > should really limit the command to revision 2 or higher, though. > > > > We also negotiated the revision to at most 1, as we never bumped > > the maximum revision; let's do that now. > > > > Fixes: 7d3ce5ab9430 ("virtio/s390: support READ_STATUS command for > > virtio-ccw") > > Signed-off-by: Cornelia Huck > > --- > > > > QEMU does not fence off READ_STATUS for revisions < 2, which is probably > > why we never noticed this. I'm not aware of other hypervisors that do > > fence it off, nor any that cannot deal properly with an unknown command. > > > > Not sure whether this is stable worthy? > > Maybe it is, given the MUST reject clause in the standard? > Yes, IMHO this must be backported. A device that ain't violating the spec would currently reject READ_STATUS. Which would break RESET_VDEV like I described above. Can we change that MUST to should? There are now good reasons for not doing like the spec says in case of READ_STATUS. > > > > --- > > drivers/s390/virtio/virtio_ccw.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/s390/virtio/virtio_ccw.c > > b/drivers/s390/virtio/virtio_ccw.c > > index 5730572b52cd..54e686dca6de 100644 > > --- a/drivers/s390/virtio/virtio_ccw.c > > +++ b/drivers/s390/virtio/virtio_ccw.c > > @@ -117,7 +117,7 @@ struct virtio_rev_info { > > }; > > > > /* the highest virtio-ccw revision we support */ > > -#define VIRTIO_CCW_REV_MAX 1 > > +#define VIRTIO_CCW_REV_MAX 2 > > > > struct virtio_ccw_vq_info { > > struct virtqueue *vq; > > @@ -952,7 +952,7 @@ static u8 virtio_ccw_get_status(struct virtio_device > > *vdev) > > u8 old_status = vcdev->dma_area->status; > > struct ccw1 *ccw; > > > > - if (vcdev->revision < 1) > > + if (vcdev->revision < 2) > > return vcdev->dma_area->status; I don't think our faking of the status read (i.e. returning the old one) is contributing to spec compliance. Especially not if the inability to READ is not transient. Also return old_status; would tell the story better, but on the other hand, that would be an unrelated cosmetic change. Maybe a separate patch? Reviewed-by: Halil Pasic Regards, Halil > > > > ccw = ccw_device_dma_zalloc(vcdev->cdev, sizeof(*ccw)); > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v11 0/2] s390: virtio: let arch validate VIRTIO features
On Mon, 7 Sep 2020 11:39:05 +0200 Pierre Morel wrote: > Hi all, > > The goal of the series is to give a chance to the architecture > to validate VIRTIO device features. Michael, is this going in via your tree? ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v11 2/2] s390: virtio: PV needs VIRTIO I/O device protection
On Mon, 7 Sep 2020 11:39:07 +0200 Pierre Morel wrote: > If protected virtualization is active on s390, VIRTIO has only retricted > access to the guest memory. > Define CONFIG_ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS and export > arch_has_restricted_virtio_memory_access to advertize VIRTIO if that's > the case, preventing a host error on access attempt. The description is a little inaccurate, but I don't care hence the r-b. The function arch_has_restricted_virtio_memory_access() returning true can not prevent the host from attempting to access memory if it decides to do so. And as far as I know there was no host error on access attempt. The page gets exported, and the host will operate on the encrypted page. But in the end we do run into trouble, which is usually fatal for the guest (not the host). What we actually do here is the following. If we detect an ill configured device we fail it (device status field), because attempting to drive it is a recipe for disaster. > > Signed-off-by: Pierre Morel > Reviewed-by: Cornelia Huck Reviewed-by: Halil Pasic ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v11 1/2] virtio: let arch advertise guest's memory access restrictions
On Mon, 7 Sep 2020 11:39:06 +0200 Pierre Morel wrote: > An architecture may restrict host access to guest memory, > e.g. IBM s390 Secure Execution or AMD SEV. > > Provide a new Kconfig entry the architecture can select, > CONFIG_ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS, when it provides > the arch_has_restricted_virtio_memory_access callback to advertise > to VIRTIO common code when the architecture restricts memory access > from the host. > > The common code can then fail the probe for any device where > VIRTIO_F_ACCESS_PLATFORM is required, but not set. > > Signed-off-by: Pierre Morel > Reviewed-by: Cornelia Huck Reviewed-by: Halil Pasic [..] > > +config ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS > + bool > + help > + This option is selected if the architecture may need to enforce > + VIRTIO_F_IOMMU_PLATFORM. > + A small nit: you use F_ACCESS_PLATFORM everywhere but here. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v5 2/2] s390: virtio: PV needs VIRTIO I/O device protection
On Thu, 9 Jul 2020 16:51:04 +0200 Pierre Morel wrote: > > > On 2020-07-09 16:47, Halil Pasic wrote: > > On Thu, 9 Jul 2020 12:51:58 +0200 > > Pierre Morel wrote: > > > >>>> +int arch_validate_virtio_features(struct virtio_device *dev) > >>>> +{ > >>>> +if (!is_prot_virt_guest()) > >>>> +return 0; > >>>> + > >>>> +if (!virtio_has_feature(dev, VIRTIO_F_VERSION_1)) { > >>>> +dev_warn(>dev, "device must provide > >>>> VIRTIO_F_VERSION_1\n"); > >>> > >>> I'd probably use "legacy virtio not supported with protected > >>> virtualization". > >>> > >>>> +return -ENODEV; > >>>> +} > >>>> + > >>>> +if (!virtio_has_feature(dev, VIRTIO_F_IOMMU_PLATFORM)) { > >>>> +dev_warn(>dev, > >>>> + "device must provide > >>>> VIRTIO_F_IOMMU_PLATFORM\n"); > >>> > >>> "support for limited memory access required for protected > >>> virtualization" > >>> > >>> ? > >>> > >>> Mentioning the feature flag is shorter in both cases, though. > >> > >> And I think easier to look for in case of debugging purpose. > >> I change it if there is more demands. > > > > Not all our end users are kernel and/or qemu developers. I find the > > messages from v4 less technical, more informative, and way better. > > > > Regards, > > Halil > > > > Can you please tell me the messages you are speaking of, because for me > the warning's messages are exactly the same in v4 and v5!? > > I checked many times, but may be I still missed something. > Sorry, my bad. My brain is over-generating. The messages where discussed at v3 and Connie made a very similar proposal to the one above which I seconded (for reference look at Message-ID: <833c71f2-0057-896a-5e21-2c6263834...@linux.ibm.com>). I was under the impression that it got implemented in v4, but it was not. That's why I ended up talking bs. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v5 2/2] s390: virtio: PV needs VIRTIO I/O device protection
On Thu, 9 Jul 2020 12:51:58 +0200 Pierre Morel wrote: > >> +int arch_validate_virtio_features(struct virtio_device *dev) > >> +{ > >> + if (!is_prot_virt_guest()) > >> + return 0; > >> + > >> + if (!virtio_has_feature(dev, VIRTIO_F_VERSION_1)) { > >> + dev_warn(>dev, "device must provide > >> VIRTIO_F_VERSION_1\n"); > > > > I'd probably use "legacy virtio not supported with protected > > virtualization". > > > >> + return -ENODEV; > >> + } > >> + > >> + if (!virtio_has_feature(dev, VIRTIO_F_IOMMU_PLATFORM)) { > >> + dev_warn(>dev, > >> + "device must provide VIRTIO_F_IOMMU_PLATFORM\n"); > > > > "support for limited memory access required for protected > > virtualization" > > > > ? > > > > Mentioning the feature flag is shorter in both cases, though. > > And I think easier to look for in case of debugging purpose. > I change it if there is more demands. Not all our end users are kernel and/or qemu developers. I find the messages from v4 less technical, more informative, and way better. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v5 1/2] virtio: let arch validate VIRTIO features
On Thu, 9 Jul 2020 10:39:18 +0200 Pierre Morel wrote: > An architecture may need to validate the VIRTIO devices features > based on architecture specifics. > > Signed-off-by: Pierre Morel > Reviewed-by: Cornelia Huck > Acked-by: Christian Borntraeger Acked-by: Halil Pasic > --- > drivers/virtio/virtio.c | 19 +++ > include/linux/virtio_config.h | 1 + > 2 files changed, 20 insertions(+) > > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c > index a977e32a88f2..c4e14d46a5b6 100644 > --- a/drivers/virtio/virtio.c > +++ b/drivers/virtio/virtio.c > @@ -167,6 +167,21 @@ void virtio_add_status(struct virtio_device *dev, > unsigned int status) > } > EXPORT_SYMBOL_GPL(virtio_add_status); > > +/* > + * arch_validate_virtio_features - provide arch specific hook when finalizing > + * features for VIRTIO device dev > + * @dev: the VIRTIO device being added > + * > + * Permits the platform to handle architecture-specific requirements when > + * device features are finalized. This is the default implementation. > + * Architecture implementations can override this. > + */ > + > +int __weak arch_validate_virtio_features(struct virtio_device *dev) > +{ > + return 0; > +} > + > int virtio_finalize_features(struct virtio_device *dev) > { > int ret = dev->config->finalize_features(dev); > @@ -176,6 +191,10 @@ int virtio_finalize_features(struct virtio_device *dev) > if (ret) > return ret; > > + ret = arch_validate_virtio_features(dev); > + if (ret) > + return ret; > + > if (!virtio_has_feature(dev, VIRTIO_F_VERSION_1)) > return 0; > > diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h > index bb4cc4910750..3f4117adf311 100644 > --- a/include/linux/virtio_config.h > +++ b/include/linux/virtio_config.h > @@ -459,4 +459,5 @@ static inline void virtio_cwrite64(struct virtio_device > *vdev, > _r; \ > }) > > +int arch_validate_virtio_features(struct virtio_device *dev); > #endif /* _LINUX_VIRTIO_CONFIG_H */ ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v5 2/2] s390: virtio: PV needs VIRTIO I/O device protection
On Thu, 9 Jul 2020 10:57:33 +0200 Cornelia Huck wrote: > On Thu, 9 Jul 2020 10:39:19 +0200 > Pierre Morel wrote: > > > If protected virtualization is active on s390, the virtio queues are > > not accessible to the host, unless VIRTIO_F_IOMMU_PLATFORM has been > > negotiated. Use the new arch_validate_virtio_features() interface to > > fail probe if that's not the case, preventing a host error on access > > attempt Punctuation at the end? Also 'that's not the case' refers to the negation 'VIRTIO_F_IOMMU_PLATFORM has been negotiated', arch_validate_virtio_features() is however part of virtio_finalize_features(), which is in turn part of the feature negotiation. But that is details. I'm fine with keeping the message as is. > > > > Signed-off-by: Pierre Morel > > --- > > arch/s390/mm/init.c | 27 +++ > > 1 file changed, 27 insertions(+) > > > > diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c > > index 6dc7c3b60ef6..b8e6f90117da 100644 > > --- a/arch/s390/mm/init.c > > +++ b/arch/s390/mm/init.c > > @@ -45,6 +45,7 @@ > > #include > > #include > > #include > > +#include > > > > pgd_t swapper_pg_dir[PTRS_PER_PGD] __section(.bss..swapper_pg_dir); > > > > @@ -161,6 +162,32 @@ bool force_dma_unencrypted(struct device *dev) > > return is_prot_virt_guest(); > > } > > > > +/* > > + * arch_validate_virtio_features > > + * @dev: the VIRTIO device being added > > + * > > + * Return an error if required features are missing on a guest running > > + * with protected virtualization. > > + */ > > +int arch_validate_virtio_features(struct virtio_device *dev) > > +{ > > + if (!is_prot_virt_guest()) > > + return 0; > > + > > + if (!virtio_has_feature(dev, VIRTIO_F_VERSION_1)) { > > + dev_warn(>dev, "device must provide VIRTIO_F_VERSION_1\n"); > > I'd probably use "legacy virtio not supported with protected > virtualization". > > > + return -ENODEV; > > + } > > + > > + if (!virtio_has_feature(dev, VIRTIO_F_IOMMU_PLATFORM)) { > > + dev_warn(>dev, > > +"device must provide VIRTIO_F_IOMMU_PLATFORM\n"); > > "support for limited memory access required for protected > virtualization" > > ? > > Mentioning the feature flag is shorter in both cases, though. I liked the messages in v4. Why did we change those? Did somebody complain? I prefer the old ones, but it any case: Acked-by: Halil Pasic > > > + return -ENODEV; > > + } > > + > > + return 0; > > +} > > + > > /* protected virtualization */ > > static void pv_init(void) > > { > > Either way, > > Reviewed-by: Cornelia Huck > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 2/2] s390: virtio: PV needs VIRTIO I/O device protection
On Tue, 7 Jul 2020 12:38:17 +0200 Pierre Morel wrote: > > > On 2020-07-07 11:46, Cornelia Huck wrote: > > On Tue, 7 Jul 2020 10:44:37 +0200 > > Pierre Morel wrote: > > > >> S390, protecting the guest memory against unauthorized host access > >> needs to enforce VIRTIO I/O device protection through the use of > >> VIRTIO_F_VERSION_1 and VIRTIO_F_IOMMU_PLATFORM. > > > > Hm... what about: > > > > "If protected virtualization is active on s390, the virtio queues are > > not accessible to the host, unless VIRTIO_F_IOMMU_PLATFORM has been > > negotiated. Use the new arch_validate_virtio_features() interface to > > enforce this." > > Yes, thanks. > > > > > >> > >> Signed-off-by: Pierre Morel > >> --- > >> arch/s390/kernel/uv.c | 25 + Is this the right place to put this stuff? This file seems to be about implementing the interface for interacting with the ultravisor. I would rather expect something like arch/s390/kernel/virtio.c Should we ever get arch hooks for balloon those could go in arch/s390/kernel/virtio.c as well. > >> 1 file changed, 25 insertions(+) > >> > >> diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c > >> index c296e5c8dbf9..106330f6eda1 100644 > >> --- a/arch/s390/kernel/uv.c > >> +++ b/arch/s390/kernel/uv.c > >> @@ -14,6 +14,7 @@ > >> #include > >> #include > >> #include > >> +#include > >> #include > >> #include > >> #include > >> @@ -413,3 +414,27 @@ static int __init uv_info_init(void) > >> } > >> device_initcall(uv_info_init); > >> #endif > >> + > >> +/* > >> + * arch_validate_virtio_iommu_platform > > > > s/arch_validate_virtio_iommu_platform/arch_validate_virtio_features/ > > > >> + * @dev: the VIRTIO device being added > >> + * > >> + * Return value: returns -ENODEV if any features of the > >> + * device breaks the protected virtualization > >> + * 0 otherwise. > > > > I don't think you need to specify the contract here: that belongs to > > the definition in the virtio core. What about simply adding a sentence > > "Return an error if required features are missing on a guest running > > with protected virtualization." ? > > OK, right. > > > > >> + */ > >> +int arch_validate_virtio_features(struct virtio_device *dev) > >> +{ > > > > Maybe jump out immediately if the guest is not protected? > > > >> + if (!virtio_has_feature(dev, VIRTIO_F_VERSION_1)) { > >> + dev_warn(>dev, "device must provide VIRTIO_F_VERSION_1\n"); > >> + return is_prot_virt_guest() ? -ENODEV : 0; > >> + } > >> + > >> + if (!virtio_has_feature(dev, VIRTIO_F_IOMMU_PLATFORM)) { > >> + dev_warn(>dev, > >> + "device must provide VIRTIO_F_IOMMU_PLATFORM\n"); > >> + return is_prot_virt_guest() ? -ENODEV : 0; > >> + } > > > > if (!is_prot_virt_guest()) > > return 0; > > > > if (!virtio_has_feature(dev, VIRTIO_F_VERSION_1)) { > > dev_warn(>dev, > > "legacy virtio is incompatible with protected guests"); > > return -ENODEV; > > } > > > > if (!virtio_has_feature(dev, VIRTIO_F_IOMMU_PLATFORM)) { > > dev_warn(>dev, > > "device does not work with limited memory access in protected > > guests"); > > return -ENODEV; > > } > > Yes, easier to read. > Not only easier to read but does not produce warnings if !is_prot_virt_guest(). I strongly prefer the variant proposed by Connie. Otherwise LGTM. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 1/1] s390: virtio: let arch accept devices without IOMMU feature
On Fri, 19 Jun 2020 11:20:51 +0200 Cornelia Huck wrote: > > > + if (arch_needs_virtio_iommu_platform(dev) && > > > + !virtio_has_feature(dev, VIRTIO_F_IOMMU_PLATFORM)) { > > > + dev_warn(>dev, > > > + "virtio: device must provide > > > VIRTIO_F_IOMMU_PLATFORM\n"); > > > > I'm not sure, divulging the current Linux name of this feature bit is a > > good idea, but if everybody else is fine with this, I don't care that > > Not sure if that feature name will ever change, as it is exported in > headers. At most, we might want to add the new ACCESS_PLATFORM define > and keep the old one, but that would still mean some churn. > > > much. An alternative would be: > > "virtio: device falsely claims to have full access to the memory, > > aborting the device" > > "virtio: device does not work with limited memory access" ? > > But no issue with keeping the current message. I think I prefer Conny's version, but no strong feelings here. Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 1/1] s390: virtio: let arch accept devices without IOMMU feature
On Wed, 17 Jun 2020 12:43:57 +0200 Pierre Morel wrote: > An architecture protecting the guest memory against unauthorized host > access may want to enforce VIRTIO I/O device protection through the > use of VIRTIO_F_IOMMU_PLATFORM. > > Let's give a chance to the architecture to accept or not devices > without VIRTIO_F_IOMMU_PLATFORM. > [..] I'm still not really satisfied with your commit message, furthermore I did some thinking about the abstraction you introduce here. I will give a short analysis of that, but first things first. Your patch does the job of preventing calamity, and the details can be changed any time, thus: Acked-by: Halil Pasic Regarding the interaction of architecture specific code with virtio core, I believe we could have made the interface more generic. One option is to introduce virtio_arch_finalize_features(), a hook that could reject any feature that is inappropriate. Another option would be to find a common name for is_prot_virt_guest() (arch/s390) sev_active() (arch/x86) and is_secure_guest() (arch/powerpc) and use that instead of arch_needs_virtio_iommu_platform() and where-ever appropriate. Currently we seem to want this info in driver code only for virtio, but if the virtio driver has a legitimate need to know, other drivers may as well have a legitimate need to know. For example if we wanted to protect ourselves in ccw device drivers from somebody setting up a vfio-ccw device and attach it to the prot-virt guest (AFAICT we only lack guest enablement for this) such a function could be useful. But since this can be rewritten any time, let's go with the option people already agree with, instead of more discussion. Just another question. Do we want this backported? Do we need cc stable? [..] > int virtio_finalize_features(struct virtio_device *dev) > { > int ret = dev->config->finalize_features(dev); > @@ -179,6 +194,13 @@ int virtio_finalize_features(struct virtio_device *dev) > if (!virtio_has_feature(dev, VIRTIO_F_VERSION_1)) > return 0; > > + if (arch_needs_virtio_iommu_platform(dev) && > + !virtio_has_feature(dev, VIRTIO_F_IOMMU_PLATFORM)) { > + dev_warn(>dev, > + "virtio: device must provide > VIRTIO_F_IOMMU_PLATFORM\n"); I'm not sure, divulging the current Linux name of this feature bit is a good idea, but if everybody else is fine with this, I don't care that much. An alternative would be: "virtio: device falsely claims to have full access to the memory, aborting the device" Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 1/1] s390: virtio: let arch accept devices without IOMMU feature
On Tue, 16 Jun 2020 12:52:50 +0200 Pierre Morel wrote: > >> int virtio_finalize_features(struct virtio_device *dev) > >> { > >>int ret = dev->config->finalize_features(dev); > >> @@ -179,6 +184,10 @@ int virtio_finalize_features(struct virtio_device > >> *dev) > >>if (!virtio_has_feature(dev, VIRTIO_F_VERSION_1)) > >>return 0; > >> > >> + if (arch_needs_iommu_platform(dev) && > >> + !virtio_has_feature(dev, VIRTIO_F_IOMMU_PLATFORM)) > >> + return -EIO; > >> + > > > > Why EIO? > > Because I/O can not occur correctly? > I am open to suggestions. We use -ENODEV if feature when the device rejects the features we tried to negotiate (see virtio_finalize_features()) and -EINVAL when the F_VERSION_1 and the virtio-ccw revision ain't coherent (in virtio_ccw_finalize_features()). Any of those seems more fitting that EIO to me. BTW does the error code itself matter in any way, or is it just OK vs some error? Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 1/1] s390: virtio: let arch accept devices without IOMMU feature
On Mon, 15 Jun 2020 14:39:24 +0200 Pierre Morel wrote: I find the subject (commit short) sub optimal. The 'arch' is already accepting devices 'without IOMMU feature'. What you are introducing is the ability to reject. > An architecture protecting the guest memory against unauthorized host > access may want to enforce VIRTIO I/O device protection through the > use of VIRTIO_F_IOMMU_PLATFORM. > > Let's give a chance to the architecture to accept or not devices > without VIRTIO_F_IOMMU_PLATFORM. > I don't particularly like the commit message. In general, I believe using access_platform instead of iommu_platform would really benefit us. > Signed-off-by: Pierre Morel > --- > arch/s390/mm/init.c | 6 ++ > drivers/virtio/virtio.c | 9 + > include/linux/virtio.h | 2 ++ > 3 files changed, 17 insertions(+) > > diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c > index 87b2d024e75a..3f04ad09650f 100644 > --- a/arch/s390/mm/init.c > +++ b/arch/s390/mm/init.c > @@ -46,6 +46,7 @@ > #include > #include > #include > +#include arch/s390/mm/init.c including virtio.h looks a bit strange to me, but if Heiko and Vasily don't mind, neither do I. > > pgd_t swapper_pg_dir[PTRS_PER_PGD] __section(.bss..swapper_pg_dir); > > @@ -162,6 +163,11 @@ bool force_dma_unencrypted(struct device *dev) > return is_prot_virt_guest(); > } > > +int arch_needs_iommu_platform(struct virtio_device *dev) Maybe prefixing the name with virtio_ would help provide the proper context. > +{ > + return is_prot_virt_guest(); > +} > + > /* protected virtualization */ > static void pv_init(void) > { > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c > index a977e32a88f2..30091089bee8 100644 > --- a/drivers/virtio/virtio.c > +++ b/drivers/virtio/virtio.c > @@ -167,6 +167,11 @@ void virtio_add_status(struct virtio_device *dev, > unsigned int status) > } > EXPORT_SYMBOL_GPL(virtio_add_status); > > +int __weak arch_needs_iommu_platform(struct virtio_device *dev) > +{ > + return 0; > +} > + Adding some people that could be interested in overriding this as well to the cc list. > int virtio_finalize_features(struct virtio_device *dev) > { > int ret = dev->config->finalize_features(dev); > @@ -179,6 +184,10 @@ int virtio_finalize_features(struct virtio_device *dev) > if (!virtio_has_feature(dev, VIRTIO_F_VERSION_1)) > return 0; > > + if (arch_needs_iommu_platform(dev) && > + !virtio_has_feature(dev, VIRTIO_F_IOMMU_PLATFORM)) > + return -EIO; > + Why EIO? Overall, I think it is a good idea to have something that is going to protect us from this scenario. Regards, Halil > virtio_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK); > status = dev->config->get_status(dev); > if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) { > diff --git a/include/linux/virtio.h b/include/linux/virtio.h > index a493eac08393..2c46b310c38c 100644 > --- a/include/linux/virtio.h > +++ b/include/linux/virtio.h > @@ -195,4 +195,6 @@ void unregister_virtio_driver(struct virtio_driver *drv); > #define module_virtio_driver(__virtio_driver) \ > module_driver(__virtio_driver, register_virtio_driver, \ > unregister_virtio_driver) > + > +int arch_needs_iommu_platform(struct virtio_device *dev); > #endif /* _LINUX_VIRTIO_H */ ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] s390: protvirt: virtio: Refuse device without IOMMU
On Mon, 15 Jun 2020 11:01:55 +0800 Jason Wang wrote: > > hum, in between I found another way which seems to me much better: > > > > We already have the force_dma_unencrypted() function available which > > AFAIU is what we want for encrypted memory protection and is already > > used by power and x86 SEV/SME in a way that seems AFAIU compatible > > with our problem. > > > > Even DMA and IOMMU are different things, I think they should be used > > together in our case. > > > > What do you think? > > > > The patch would then be something like: > > > > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c > > index a977e32a88f2..53476d5bbe35 100644 > > --- a/drivers/virtio/virtio.c > > +++ b/drivers/virtio/virtio.c > > @@ -4,6 +4,7 @@ > > #include > > #include > > #include > > +#include > > #include > > > > /* Unique numbering for virtio devices. */ > > @@ -179,6 +180,10 @@ int virtio_finalize_features(struct virtio_device > > *dev) > > if (!virtio_has_feature(dev, VIRTIO_F_VERSION_1)) > > return 0; > > > > + if (force_dma_unencrypted(>dev) && > > + !virtio_has_feature(dev, VIRTIO_F_IOMMU_PLATFORM)) > > + return -EIO; > > + > > virtio_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK); > > status = dev->config->get_status(dev); > > if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) { > > > I think this can work but need to listen from Michael I don't think Christoph Hellwig will like force_dma_unencrypted() in virtio code: https://lkml.org/lkml/2020/2/20/630 Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] s390: protvirt: virtio: Refuse device without IOMMU
On Wed, 10 Jun 2020 15:11:51 +0200 Pierre Morel wrote: > Protected Virtualisation protects the memory of the guest and > do not allow a the host to access all of its memory. > > Let's refuse a VIRTIO device which does not use IOMMU > protected access. > Should we ever get a virtio-ccw device implemented in hardware, we could in theory have a trusted device, i.e. one that is not affected by the memory protection. IMHO this restriction applies to paravitualized devices, that is devices realized by the hypervisor. In that sense this is not about ccw, but could in the future affect PCI as well. Thus the transport code may not be the best place to fence this, but frankly looking at how the QEMU discussion is going (the one where I try to prevent this condition) I would be glad to have something like this as a safety net. I would however like the commit message to reflect what is written above. Do we need backports (and cc-stable) for this? Connie what do you think? > Signed-off-by: Pierre Morel > --- > drivers/s390/virtio/virtio_ccw.c | 5 + > 1 file changed, 5 insertions(+) > > diff --git a/drivers/s390/virtio/virtio_ccw.c > b/drivers/s390/virtio/virtio_ccw.c > index 5730572b52cd..06ffbc96587a 100644 > --- a/drivers/s390/virtio/virtio_ccw.c > +++ b/drivers/s390/virtio/virtio_ccw.c > @@ -986,6 +986,11 @@ static void virtio_ccw_set_status(struct virtio_device > *vdev, u8 status) > if (!ccw) > return; > > + /* Protected Virtualisation guest needs IOMMU */ > + if (is_prot_virt_guest() && > + !__virtio_test_bit(vdev, VIRTIO_F_IOMMU_PLATFORM)) If you were to add && !__virtio_test_bit(vdev, VIRTIO_F_ORDER_PLATFORM) we could confine this check and the bail out to paravirtualized devices, because a device realized in HW is expected to give us both F_ACCESS_PLATFORM and F_ORDER_PLATFORM. I would not bet it will ever matter for virtio-ccw though. Connie, do you have an opinion on this? Regards, Halil > + status &= ~VIRTIO_CONFIG_S_FEATURES_OK; > + > /* Write the status to the host. */ > vcdev->dma_area->status = status; > ccw->cmd_code = CCW_CMD_WRITE_STATUS; ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] s390/virtio: remove unused pm callbacks
On Tue, 26 May 2020 11:36:29 +0200 Cornelia Huck wrote: > Support for hibernation on s390 has been recently been removed with > commit 394216275c7d ("s390: remove broken hibernate / power management > support"), no need to keep unused code around. > > Signed-off-by: Cornelia Huck Reviewed-by: Halil Pasic > --- > drivers/s390/virtio/virtio_ccw.c | 26 -- > 1 file changed, 26 deletions(-) > > diff --git a/drivers/s390/virtio/virtio_ccw.c > b/drivers/s390/virtio/virtio_ccw.c > index 957889a42d2e..5730572b52cd 100644 > --- a/drivers/s390/virtio/virtio_ccw.c > +++ b/drivers/s390/virtio/virtio_ccw.c > @@ -1372,27 +1372,6 @@ static struct ccw_device_id virtio_ids[] = { > {}, > }; > > -#ifdef CONFIG_PM_SLEEP > -static int virtio_ccw_freeze(struct ccw_device *cdev) > -{ > - struct virtio_ccw_device *vcdev = dev_get_drvdata(>dev); > - > - return virtio_device_freeze(>vdev); > -} > - > -static int virtio_ccw_restore(struct ccw_device *cdev) > -{ > - struct virtio_ccw_device *vcdev = dev_get_drvdata(>dev); > - int ret; > - > - ret = virtio_ccw_set_transport_rev(vcdev); > - if (ret) > - return ret; > - > - return virtio_device_restore(>vdev); > -} > -#endif > - > static struct ccw_driver virtio_ccw_driver = { > .driver = { > .owner = THIS_MODULE, > @@ -1405,11 +1384,6 @@ static struct ccw_driver virtio_ccw_driver = { > .set_online = virtio_ccw_online, > .notify = virtio_ccw_cio_notify, > .int_class = IRQIO_VIR, > -#ifdef CONFIG_PM_SLEEP > - .freeze = virtio_ccw_freeze, > - .thaw = virtio_ccw_restore, > - .restore = virtio_ccw_restore, > -#endif > }; > > static int __init pure_hex(char **cp, unsigned int *val, int min_digit, ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 0/2] virtio-blk: improve handling of DMA mapping failures
On Tue, 3 Mar 2020 09:49:21 -0500 "Michael S. Tsirkin" wrote: > On Tue, Mar 03, 2020 at 03:12:52PM +0100, Halil Pasic wrote: > > On Thu, 13 Feb 2020 13:37:26 +0100 > > Halil Pasic wrote: > > > > > Two patches are handling new edge cases introduced by doing DMA mappings > > > (which can fail) in virtio core. > > > > > > I stumbled upon this while stress testing I/O for Protected Virtual > > > Machines. I deliberately chose a tiny swiotlb size and have generated > > > load with fio. With more than one virtio-blk disk in use I experienced > > > hangs. > > > > > > The goal of this series is to fix those hangs. > > > > > > Halil Pasic (2): > > > virtio-blk: fix hw_queue stopped on arbitrary error > > > virtio-blk: improve virtqueue error to BLK_STS > > > > > > drivers/block/virtio_blk.c | 17 - > > > 1 file changed, 12 insertions(+), 5 deletions(-) > > > > > > > > > base-commit: 39bed42de2e7d74686a2d5a45638d6a5d7e7d473 > > > > ping > > > > Hi Michael, hi Jason, > > > > I got some favorable reviews for this, but AFAIK I got nothing form the > > maintainers yet. > > > > Is some of you going to pick these? > > > > Regards, > > Halil > > I've queued this, thanks! > Thank you! ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 0/2] virtio-blk: improve handling of DMA mapping failures
On Thu, 13 Feb 2020 13:37:26 +0100 Halil Pasic wrote: > Two patches are handling new edge cases introduced by doing DMA mappings > (which can fail) in virtio core. > > I stumbled upon this while stress testing I/O for Protected Virtual > Machines. I deliberately chose a tiny swiotlb size and have generated > load with fio. With more than one virtio-blk disk in use I experienced > hangs. > > The goal of this series is to fix those hangs. > > Halil Pasic (2): > virtio-blk: fix hw_queue stopped on arbitrary error > virtio-blk: improve virtqueue error to BLK_STS > > drivers/block/virtio_blk.c | 17 - > 1 file changed, 12 insertions(+), 5 deletions(-) > > > base-commit: 39bed42de2e7d74686a2d5a45638d6a5d7e7d473 ping Hi Michael, hi Jason, I got some favorable reviews for this, but AFAIK I got nothing form the maintainers yet. Is some of you going to pick these? Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/2] virtio: let virtio use DMA API when guest RAM is protected
On Fri, 21 Feb 2020 17:36:45 +0100 Christoph Hellwig wrote: > > By "legacy devices" I assume you mean pre-virtio-1.0 devices, that > > lack the F_VERSION_1 feature flag. Is that right? Because I don't > > see how being a legacy device or not relates to use of the DMA API. > > No. "legacy" is anything that does not set F_ACCESS_PLATFORM. FYI in virtio-speak the term 'legacy devices' is already taken and it ain't congruent with your intended semantics. Please look it up https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-60001 But with that understood your statement does provide insisting in how do you see F_ACCESS_PLATFORM. I.e. something that should be enabled in general, except for legacy reasons. But I'm afraid Michael sees it differently: i.e. something that should be enabled when necessary, and otherwise not (because it is likely to cost performance). Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/2] mm: move force_dma_unencrypted() to mem_encrypt.h
On Mon, 24 Feb 2020 14:33:14 +1100 David Gibson wrote: > On Fri, Feb 21, 2020 at 07:07:02PM +0100, Halil Pasic wrote: > > On Fri, 21 Feb 2020 10:48:15 -0500 > > "Michael S. Tsirkin" wrote: > > > > > On Fri, Feb 21, 2020 at 02:06:39PM +0100, Halil Pasic wrote: > > > > On Fri, 21 Feb 2020 14:27:27 +1100 > > > > David Gibson wrote: > > > > > > > > > On Thu, Feb 20, 2020 at 05:31:35PM +0100, Christoph Hellwig wrote: > > > > > > On Thu, Feb 20, 2020 at 05:23:20PM +0100, Christian Borntraeger > > > > > > wrote: > > > > > > > >From a users perspective it makes absolutely perfect sense to > > > > > > > >use the > > > > > > > bounce buffers when they are NEEDED. > > > > > > > Forcing the user to specify iommu_platform just because you need > > > > > > > bounce buffers > > > > > > > really feels wrong. And obviously we have a severe performance > > > > > > > issue > > > > > > > because of the indirections. > > > > > > > > > > > > The point is that the user should not have to specify > > > > > > iommu_platform. > > > > > > We need to make sure any new hypervisor (especially one that might > > > > > > require > > > > > > bounce buffering) always sets it, > > > > > > > > > > So, I have draft qemu patches which enable iommu_platform by default. > > > > > But that's really because of other problems with !iommu_platform, not > > > > > anything to do with bounce buffering or secure VMs. > > > > > > > > > > The thing is that the hypervisor *doesn't* require bounce buffering. > > > > > In the POWER (and maybe s390 as well) models for Secure VMs, it's the > > > > > *guest*'s choice to enter secure mode, so the hypervisor has no reason > > > > > to know whether the guest needs bounce buffering. As far as the > > > > > hypervisor and qemu are concerned that's a guest internal detail, it > > > > > just expects to get addresses it can access whether those are GPAs > > > > > (iommu_platform=off) or IOVAs (iommu_platform=on). > > > > > > > > I very much agree! > > > > > > > > > > > > > > > as was a rather bogus legacy hack > > > > > > > > > > It was certainly a bad idea, but it was a bad idea that went into a > > > > > public spec and has been widely deployed for many years. We can't > > > > > just pretend it didn't happen and move on. > > > > > > > > > > Turning iommu_platform=on by default breaks old guests, some of which > > > > > we still care about. We can't (automatically) do it only for guests > > > > > that need bounce buffering, because the hypervisor doesn't know that > > > > > ahead of time. > > > > > > > > Turning iommu_platform=on for virtio-ccw makes no sense whatsover, > > > > because for CCW I/O there is no such thing as IOMMU and the addresses > > > > are always physical addresses. > > > > > > Fix the name then. The spec calls is ACCESS_PLATFORM now, which > > > makes much more sense. > > > > I don't quite get it. Sorry. Maybe I will revisit this later. > > Halil, I think I can clarify this. > > The "iommu_platform" flag doesn't necessarily have anything to do with > an iommu, although it often will. Basically it means "access guest > memory via the bus's normal DMA mechanism" rather than "access guest > memory using GPA, because you're the hypervisor and you can do that". > Unfortunately, I don't think this is what is conveyed to the end users. Let's see what do we have documented: Neither Qemu user documentation (https://www.qemu.org/docs/master/qemu-doc.html) nor online help: qemu-system-s390x -device virtio-net-ccw,?|grep iommu iommu_platform= - on/off (default: false) has any documentation on it. But libvirt does have have documenttion on the knob that contros iommu_platform for QEMU (when QEMU is managed by libvirt): """ Virtio-related options QEMU's virtio devices have some attributes related to the virtio transport under the driver element: The iommu attribute enables the use of emulated IOMMU by the device. The attribute ats controls the Address Translation Service support for PCIe devices. This is needed to make use of IOTLB support (
Re: [PATCH 0/2] virtio: decouple protected guest RAM form VIRTIO_F_IOMMU_PLATFORM
On Mon, 24 Feb 2020 17:26:20 +0800 Jason Wang wrote: > That's better. > > How about attached? > > Thanks Thanks Jason! It does avoid the translation overhead in vhost. Tested-by: Halil Pasic Regarding the code, you fence it in virtio-net.c, but AFAIU this feature has relevance for other vhost devices as well. E.g. what about vhost user? Would it be the responsibility of each virtio device to fence this on its own? I'm also a bit confused about the semantics of the vhost feature bit F_ACCESS_PLATFORM. What we have specified on virtio level is: """ This feature indicates that the device can be used on a platform where device access to data in memory is limited and/or translated. E.g. this is the case if the device can be located behind an IOMMU that translates bus addresses from the device into physical addresses in memory, if the device can be limited to only access certain memory addresses or if special commands such as a cache flush can be needed to synchronise data in memory with the device. Whether accesses are actually limited or translated is described by platform-specific means. If this feature bit is set to 0, then the device has same access to memory addresses supplied to it as the driver has. In particular, the device will always use physical addresses matching addresses used by the driver (typically meaning physical addresses used by the CPU) and not translated further, and can access any address supplied to it by the driver. When clear, this overrides any platform-specific description of whether device access is limited or translated in any way, e.g. whether an IOMMU may be present. """ I read this like the addresses may be IOVAs which require IMMU translation or GPAs which don't. On the vhost level however, it seems that F_IOMMU_PLATFORM means that vhost has to do the translation (via IOTLB API). Do I understand this correctly? If yes, I believe we should document this properly. BTW I'm still not 100% on the purpose and semantics of the F_ACCESS_PLATFORM feature bit. But that is a different problem. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/2] virtio: let virtio use DMA API when guest RAM is protected
On Fri, 21 Feb 2020 17:39:38 +0100 Christoph Hellwig wrote: > On Fri, Feb 21, 2020 at 03:33:40PM +0100, Halil Pasic wrote: > > > Hell no. This is a detail of the platform DMA direct implementation. > > > > I beg to differ. If it was a detail of the DMA direct implementation, it > > should have/would have been private to kernel/dma/direct.c. > > It can't given that platforms have to implement it. It is an arch hook > for dma-direct. > > > Consider what would we have to do to make PCI devices do I/O trough > > pages that were shared when the guest is running in a protected VM. The > > s390_pci_dma_ops would also need to know whether to 'force dma uencrypted' > > or not, and it's the exact same logic. I doubt simply using DMA direct > > for zPCI would do, because we still have to do all the Z specific IOMMU > > management. > > And your IOMMU can't deal with the encryption bit? There is no encrypt bit, and our memory is not encrypted, but protected. Means e.g. when buggy/malicious hypervisor tries to read a protected page it wont get ciphertext, but a slap on its finger. In order do make memory accessible to the hypervisor (or another guest, or a real device) the guest must make a so called utlravisor call (talk to the firmware) and share the respective page. We tapped into the memory encryption infrastructure, because both is protecting the guest memory form the host (just by different means), and because it made no sense to build up something in parallel when most of the stuff we need was already there. But most unfortunately the names are deceiving when it comes to s390 protected virtualization and it's guest I/O enablement. > In the case we > could think of allowing IOMMU implementation to access it. But the > point that it is an internal detail of the DMA implementation and by > now means for drivers. >From the perspective, that any driver that does anything remotely DMAish, that is, some external entity (possibly a hypervisor, possibly a channel subsystem, possibly a DMA controller) to should the memory, should do DMA API first, to make sure, the DMAish goes well, your argument makes perfect sense. But form that perspective !F_ACCESS_PLATFORM is also a DMAish. And the virtio spec mandates that !F_ACCESS_PLATFORM implies GPA's. For virtio-ccw I want GPA's and not IOVA's on s390, for virtio-pci, which we also support in general but not with protected virtualization, well, it's a different story. With protected visualization however I must make sure all I/O goes through shared pages. We use swiotlb for that. But then the old infrastructure won't cut it. Jet we still need GPA's on the ring (with the extra requirement that the page must be shared). DMA API is a nice fit there because we can allocate DMA coherent memory (such that what comes back from our DMA ops is a GPA), so we have shared memory that the hypervisor and the guest is allowed to look at concurrently, and for the buffers that are going to be put on the vring, we can use the streaming API, which uses bounce buffers. The returned IOVA (in DMA API speak) is a GPA of the bounce buffer, and the guest is not allowed to peek until it unmaps, so everything is cozy. But for that to work, we all (AMD SEV, power, and s390) must go through the DMA API, because the old infrastructure in virtio core simply won't cut it. And it has nothing to do with the device. David explained it very well. My series is about controlling virtio-core's usage of DMA API. I believe, I did it in a way that doesn't hurt any arch at the moment. Maybe the conflict can be resolved if the transport gets a say in whether to use the DMA API or not. In the end the VIRTIO spec does say that "Whether accesses are actually limited or translated is described by platform-specific means." Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/2] mm: move force_dma_unencrypted() to mem_encrypt.h
On Fri, 21 Feb 2020 10:48:15 -0500 "Michael S. Tsirkin" wrote: > On Fri, Feb 21, 2020 at 02:06:39PM +0100, Halil Pasic wrote: > > On Fri, 21 Feb 2020 14:27:27 +1100 > > David Gibson wrote: > > > > > On Thu, Feb 20, 2020 at 05:31:35PM +0100, Christoph Hellwig wrote: > > > > On Thu, Feb 20, 2020 at 05:23:20PM +0100, Christian Borntraeger wrote: > > > > > >From a users perspective it makes absolutely perfect sense to use the > > > > > bounce buffers when they are NEEDED. > > > > > Forcing the user to specify iommu_platform just because you need > > > > > bounce buffers > > > > > really feels wrong. And obviously we have a severe performance issue > > > > > because of the indirections. > > > > > > > > The point is that the user should not have to specify iommu_platform. > > > > We need to make sure any new hypervisor (especially one that might > > > > require > > > > bounce buffering) always sets it, > > > > > > So, I have draft qemu patches which enable iommu_platform by default. > > > But that's really because of other problems with !iommu_platform, not > > > anything to do with bounce buffering or secure VMs. > > > > > > The thing is that the hypervisor *doesn't* require bounce buffering. > > > In the POWER (and maybe s390 as well) models for Secure VMs, it's the > > > *guest*'s choice to enter secure mode, so the hypervisor has no reason > > > to know whether the guest needs bounce buffering. As far as the > > > hypervisor and qemu are concerned that's a guest internal detail, it > > > just expects to get addresses it can access whether those are GPAs > > > (iommu_platform=off) or IOVAs (iommu_platform=on). > > > > I very much agree! > > > > > > > > > as was a rather bogus legacy hack > > > > > > It was certainly a bad idea, but it was a bad idea that went into a > > > public spec and has been widely deployed for many years. We can't > > > just pretend it didn't happen and move on. > > > > > > Turning iommu_platform=on by default breaks old guests, some of which > > > we still care about. We can't (automatically) do it only for guests > > > that need bounce buffering, because the hypervisor doesn't know that > > > ahead of time. > > > > Turning iommu_platform=on for virtio-ccw makes no sense whatsover, > > because for CCW I/O there is no such thing as IOMMU and the addresses > > are always physical addresses. > > Fix the name then. The spec calls is ACCESS_PLATFORM now, which > makes much more sense. I don't quite get it. Sorry. Maybe I will revisit this later. Regards, Halil > > > > > > > > that isn't extensibe for cases that for example require bounce > > > > buffering. > > > > > > In fact bounce buffering isn't really the issue from the hypervisor > > > (or spec's) point of view. It's the fact that not all of guest memory > > > is accessible to the hypervisor. Bounce buffering is just one way the > > > guest might deal with that. > > > > > > > Agreed. > > > > Regards, > > Halil > > > > > > > > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/2] virtio: let virtio use DMA API when guest RAM is protected
On Fri, 21 Feb 2020 10:56:45 -0500 "Michael S. Tsirkin" wrote: > On Fri, Feb 21, 2020 at 02:12:30PM +0100, Halil Pasic wrote: > > On Thu, 20 Feb 2020 15:55:14 -0500 > > "Michael S. Tsirkin" wrote: [..] > > > To summarize, the necessary conditions for a hack along these lines > > > (using DMA API without VIRTIO_F_ACCESS_PLATFORM) are that we detect that: > > > > > > - secure guest mode is enabled - so we know that since we don't share > > > most memory regular virtio code won't > > > work, even though the buggy hypervisor didn't set > > > VIRTIO_F_ACCESS_PLATFORM > > > > force_dma_unencrypted(>dev) is IMHO exactly about this. > > > > > - DMA API is giving us addresses that are actually also physical > > > addresses > > > > In case of s390 this is given. > > I talked with the power people before > > posting this, and they ensured me they can are willing to deal with > > this. I was hoping to talk abut this with the AMD SEV people here (hence > > the cc). > > We'd need a part of DMA API that promises this though. Platform > maintainers aren't going to go out of their way to do the > right thing just for virtio, and I can't track all arches > to make sure they don't violate virtio requirements. > One would have to track only the arches that have force_dma_unecrypted(), and generally speaking the arches shall make sure the DMA ops are suitable, whatever that means in the given context. > > > > > - Hypervisor is buggy and didn't enable VIRTIO_F_ACCESS_PLATFORM > > > > > > > I don't get this point. The argument where the hypervisor is buggy is a > > bit hard to follow for me. If hypervisor is buggy we have already lost > > anyway most of the time, or? > > If VIRTIO_F_ACCESS_PLATFORM is set then things just work. If > VIRTIO_F_ACCESS_PLATFORM is clear device is supposed to have access to > all of memory. You can argue in various ways but it's easier to just > declare a behaviour that violates this a bug. Which might still be worth > working around, for various reasons. > I don't agree. The spec explicitly states "If this feature bit is set to 0, then the device has same access to memory addresses supplied to it as the driver has." That ain't any guest memory, but the addresses supplied to it. BTW this is how channel I/O works as well: the channel program authorizes the device to use the memory locations specified by the channel program, for as long as the channel program is executed. That's how channel I/O does isolation without an architected IOMMU. Can you please show me the words in the specification that say, the device has to have access to the entire guest memory all the time? Maybe I have to understand all the intentions behind VIRTIO_F_ACCESS_PLATFORM better. I've read the spec several times, but I still have the feeling, when we discuss, that I didn't get it right. IOMMUs and PCI style DMA are unfortunately not my bread and butter. I only know that the devices do not need any new device capability (I assume VIRTIO_F_ACCESS_PLATFORM does express a device capability), to work with protected virtualization. Unless one defines VIRTIO_F_ACCESS_PLATFORM is the capability that the device won't poke *arbitrary* guest memory. From that perspective mandating the flag feels wrong. (CCW devices are never allowed to poke arbitrary memory.) Yet, if VIRTIO_F_ACCESS_PLATFORM is a flag that every nice and modern virtio device should have (i.e. a lack of it means kida broken), then I have the feeling virtio-ccw should probably evolve in the direction, that having VIRTIO_F_ACCESS_PLATFORM set does not hurt. I have to think some more. > > > > I don't see how this patch does this. > > > > I do get your point. I don't know of a good way to check that DMA API > > is giving us addresses that are actually physical addresses, and the > > situation you describe definitely has some risk to it. > > One way would be to extend the DMA API with such an API. Seems Christoph does not like that idea. > > Another would be to make virtio always use DMA API > and hide the logic in there. I thought Christoph wants that, but I was wrong. > This second approach is not easy, in particular since DMA API adds > a bunch of overhead which we need to find ways to > measure and mitigate. > Agreed. From s390 perspective, I think it ain't to bad if we get GFP_DMA. Many thanks for your patience! Halil [..] ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 0/2] virtio: decouple protected guest RAM form VIRTIO_F_IOMMU_PLATFORM
On Fri, 21 Feb 2020 14:22:26 +0800 Jason Wang wrote: > > On 2020/2/21 上午12:06, Halil Pasic wrote: > > Currently if one intends to run a memory protection enabled VM with > > virtio devices and linux as the guest OS, one needs to specify the > > VIRTIO_F_IOMMU_PLATFORM flag for each virtio device to make the guest > > linux use the DMA API, which in turn handles the memory > > encryption/protection stuff if the guest decides to turn itself into > > a protected one. This however makes no sense due to multiple reasons: > > * The device is not changed by the fact that the guest RAM is > > protected. The so called IOMMU bypass quirk is not affected. > > * This usage is not congruent with standardised semantics of > > VIRTIO_F_IOMMU_PLATFORM. Guest memory protected is an orthogonal reason > > for using DMA API in virtio (orthogonal with respect to what is > > expressed by VIRTIO_F_IOMMU_PLATFORM). > > > > This series aims to decouple 'have to use DMA API because my (guest) RAM > > is protected' and 'have to use DMA API because the device told me > > VIRTIO_F_IOMMU_PLATFORM'. > > > > Please find more detailed explanations about the conceptual aspects in > > the individual patches. There is however also a very practical problem > > that is addressed by this series. > > > > For vhost-net the feature VIRTIO_F_IOMMU_PLATFORM has the following side > > effect The vhost code assumes it the addresses on the virtio descriptor > > ring are not guest physical addresses but iova's, and insists on doing a > > translation of these regardless of what transport is used (e.g. whether > > we emulate a PCI or a CCW device). (For details see commit 6b1e6cc7855b > > "vhost: new device IOTLB API".) On s390 this results in severe > > performance degradation (c.a. factor 10). > > > Do you see a consistent degradation on the performance, or it only > happen when for during the beginning of the test? > AFAIK the degradation is consistent. > > > BTW with ccw I/O there is > > (architecturally) no IOMMU, so the whole address translation makes no > > sense in the context of virtio-ccw. > > > I suspect we can do optimization in qemu side. > > E.g send memtable entry via IOTLB API when vIOMMU is not enabled. > > If this makes sense, I can draft patch to see if there's any difference. Frankly I would prefer to avoid IOVAs on the descriptor ring (and the then necessary translation) for virtio-ccw altogether. But Michael voiced his opinion that we should mandate F_IOMMU_PLATFORM for devices that could be used with guests running in protected mode. I don't share his opinion, but that's an ongoing discussion. Should we end up having to do translation from IOVA in vhost, we are very interested in that translation being fast and efficient. In that sense we would be very happy to test any optimization that aim into that direction. Thank you very much for your input! Regards, Halil > > Thanks > > > > > > Halil Pasic (2): > >mm: move force_dma_unencrypted() to mem_encrypt.h > >virtio: let virtio use DMA API when guest RAM is protected > > > > drivers/virtio/virtio_ring.c | 3 +++ > > include/linux/dma-direct.h | 9 - > > include/linux/mem_encrypt.h | 10 ++ > > 3 files changed, 13 insertions(+), 9 deletions(-) > > > > > > base-commit: ca7e1fd1026c5af6a533b4b5447e1d2f153e28f2 > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/2] virtio: let virtio use DMA API when guest RAM is protected
On Thu, 20 Feb 2020 17:13:09 +0100 Christoph Hellwig wrote: > On Thu, Feb 20, 2020 at 05:06:06PM +0100, Halil Pasic wrote: > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > > index 867c7ebd3f10..fafc8f924955 100644 > > --- a/drivers/virtio/virtio_ring.c > > +++ b/drivers/virtio/virtio_ring.c > > @@ -243,6 +243,9 @@ static bool vring_use_dma_api(struct virtio_device > > *vdev) > > if (!virtio_has_iommu_quirk(vdev)) > > return true; > > > > + if (force_dma_unencrypted(>dev)) > > + return true; > > Hell no. This is a detail of the platform DMA direct implementation. I beg to differ. If it was a detail of the DMA direct implementation, it should have/would have been private to kernel/dma/direct.c. A look at $ git grep -e force_dma_unencrypted arch/powerpc/include/asm/mem_encrypt.h:static inline bool force_dma_unencrypted(struct device *dev) arch/s390/mm/init.c:bool force_dma_unencrypted(struct device *dev) arch/x86/mm/mem_encrypt.c:bool force_dma_unencrypted(struct device *dev) include/linux/dma-direct.h:bool force_dma_unencrypted(struct device *dev); include/linux/dma-direct.h:static inline bool force_dma_unencrypted(struct device *dev) kernel/dma/direct.c:if (force_dma_unencrypted(dev)) kernel/dma/direct.c:if (force_dma_unencrypted(dev)) kernel/dma/direct.c:!force_dma_unencrypted(dev)) { kernel/dma/direct.c:if (force_dma_unencrypted(dev)) kernel/dma/direct.c:if (force_dma_unencrypted(dev)) kernel/dma/direct.c:!force_dma_unencrypted(dev)) { kernel/dma/direct.c:if (force_dma_unencrypted(dev)) tells you, that force_dma_unencrypted() is *consumed* by dma direct, but *provided* by the memory encryption or memory management code. I.e. platform code tells the dma (direct) code what decisions to make under certain circumstances. Consider what would we have to do to make PCI devices do I/O trough pages that were shared when the guest is running in a protected VM. The s390_pci_dma_ops would also need to know whether to 'force dma uencrypted' or not, and it's the exact same logic. I doubt simply using DMA direct for zPCI would do, because we still have to do all the Z specific IOMMU management. > Drivers have no business looking at this flag, and virtio finally needs > to be fixed to use the DMA API properly for everything but legacy devices. See the follow on discussion with David Gibson. In short: I'm in favor of always using DMA API iff we keep conformance with the VIRTIO spec and if it does not imply any degradations for s390 (virtio-ccw), or any other regressions. > > No amount of desparate hacks is going to fix that fundamental problem, > and I'm starting to get really sick and tired of all the crap patches > published in this area. I don't think I deserve the offensive language. AFAIU you have a positive attitude towards the idea, that !F_VIRTIO_PLATFORM implies 'no DMA API is used by virtio' should be scrapped. I would like to accomplish that without adverse effects to virtio-ccw (because caring for virtio-ccw is a part of job description). Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 0/2] virtio: decouple protected guest RAM form VIRTIO_F_IOMMU_PLATFORM
On Thu, 20 Feb 2020 16:33:35 -0500 "Michael S. Tsirkin" wrote: > On Thu, Feb 20, 2020 at 05:06:04PM +0100, Halil Pasic wrote: > > For vhost-net the feature VIRTIO_F_IOMMU_PLATFORM has the following side > > effect The vhost code assumes it the addresses on the virtio descriptor > > ring are not guest physical addresses but iova's, and insists on doing a > > translation of these regardless of what transport is used (e.g. whether > > we emulate a PCI or a CCW device). (For details see commit 6b1e6cc7855b > > "vhost: new device IOTLB API".) On s390 this results in severe > > performance degradation (c.a. factor 10). BTW with ccw I/O there is > > (architecturally) no IOMMU, so the whole address translation makes no > > sense in the context of virtio-ccw. > > So it sounds like a host issue: the emulation of s390 unnecessarily > complicated. > Working around it by the guest looks wrong ... While do think that forcing IOMMU_PLATFORM on devices that do not want or need it, just to trigger DMA API usage in guest is conceptually wrong, I do agree, that we might have a host issue. Namely, unlike PCI, CCW does not have an IOMMU, and the spec clearly states that "Whether accesses are actually limited or translated is described by platform-specific means.". With CCW devices we don't have address translation by IOMMU, and in that sense vhost is probably wrong about trying to do the translation. That is why I mentioned the patch, and that it's done regardless of the transport/platform. Regards, Halil > > > Halil Pasic (2): > > mm: move force_dma_unencrypted() to mem_encrypt.h > > virtio: let virtio use DMA API when guest RAM is protected > > > > drivers/virtio/virtio_ring.c | 3 +++ > > include/linux/dma-direct.h | 9 - > > include/linux/mem_encrypt.h | 10 ++ > > 3 files changed, 13 insertions(+), 9 deletions(-) > > > > > > base-commit: ca7e1fd1026c5af6a533b4b5447e1d2f153e28f2 > > -- > > 2.17.1 > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 0/2] virtio: decouple protected guest RAM form VIRTIO_F_IOMMU_PLATFORM
On Thu, 20 Feb 2020 16:29:50 -0500 "Michael S. Tsirkin" wrote: > On Thu, Feb 20, 2020 at 05:06:04PM +0100, Halil Pasic wrote: > > * This usage is not congruent with standardised semantics of > > VIRTIO_F_IOMMU_PLATFORM. Guest memory protected is an orthogonal reason > > for using DMA API in virtio (orthogonal with respect to what is > > expressed by VIRTIO_F_IOMMU_PLATFORM). > > Quoting the spec: > > \item[VIRTIO_F_ACCESS_PLATFORM(33)] This feature indicates that > the device can be used on a platform where device access to data > in memory is limited and/or translated. E.g. this is the case if the device > can be located > behind an IOMMU that translates bus addresses from the device into physical > addresses in memory, if the device can be limited to only access > certain memory addresses or if special commands such as > a cache flush can be needed to synchronise data in memory with > the device. Whether accesses are actually limited or translated > is described by platform-specific means. > If this feature bit is set to 0, then the device > has same access to memory addresses supplied to it as the > driver has. > In particular, the device will always use physical addresses > matching addresses used by the driver (typically meaning > physical addresses used by the CPU) > and not translated further, and can access any address supplied to it by > the driver. When clear, this overrides any platform-specific description of > whether device access is limited or translated in any way, e.g. > whether an IOMMU may be present. > > since device can't access encrypted memory, > this seems to match your case reasonably well. > As David already explained, the device does not have to access encrypted memory. In fact, we don't have memory encryption but memory protection on s390. All the device *should* ever see is non-protected memory (one that was previously shared by the guest). Our protected guests start as non-protected ones, and may or may not turn protected during their life-span. From the device perspective, really, nothing changes. I believe David explained this much better than I did. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/2] virtio: let virtio use DMA API when guest RAM is protected
On Fri, 21 Feb 2020 11:41:57 +0800 Jason Wang wrote: > > I *think* what you are suggesting here is that virtio devices that > > have !F_IOMMU_PLATFORM should have their dma_ops set up so that the > > DMA API treats IOVA==PA, which will satisfy what the device expects. > > > Can this work for swiotlb? It works on s390. I guess it would be the responsibility of however provides the dma ops for the virtio device to ensure that if !F_IOMMU_PLATFORM the addresses are GPA like *mandated* by the VIRTIO specification. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/2] virtio: let virtio use DMA API when guest RAM is protected
On Fri, 21 Feb 2020 13:59:15 +1100 David Gibson wrote: > On Thu, Feb 20, 2020 at 05:13:09PM +0100, Christoph Hellwig wrote: > > On Thu, Feb 20, 2020 at 05:06:06PM +0100, Halil Pasic wrote: > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > > > index 867c7ebd3f10..fafc8f924955 100644 > > > --- a/drivers/virtio/virtio_ring.c > > > +++ b/drivers/virtio/virtio_ring.c > > > @@ -243,6 +243,9 @@ static bool vring_use_dma_api(struct virtio_device > > > *vdev) > > > if (!virtio_has_iommu_quirk(vdev)) > > > return true; > > > > > > + if (force_dma_unencrypted(>dev)) > > > + return true; > > > > Hell no. This is a detail of the platform DMA direct implementation. > > Drivers have no business looking at this flag, and virtio finally needs > > to be fixed to use the DMA API properly for everything but legacy devices. > > So, this patch definitely isn't right as it stands, but I'm struggling > to understand what it is you're saying is the right way. > > By "legacy devices" I assume you mean pre-virtio-1.0 devices, that > lack the F_VERSION_1 feature flag. Is that right? Because I don't > see how being a legacy device or not relates to use of the DMA API. > > I *think* what you are suggesting here is that virtio devices that > have !F_IOMMU_PLATFORM should have their dma_ops set up so that the > DMA API treats IOVA==PA, which will satisfy what the device expects. > Then the virtio driver can use the DMA API the same way for both > F_IOMMU_PLATFORM and !F_IOMMU_PLATFORM devices. I've considered this idea, and as a matter a fact would be my preferred solution. It would be, if we could use GFP_DMA when allocating coherent memory through the DMA API. For CCW devices we *must* have some of the device accessible memory 31 bit addressable (in physical address space), because the s390 architecture *mandates it*. I had patches (https://lkml.org/lkml/2019/9/23/313) that do that but Christoph was not open to the idea. Always pushing all the stuff allocated by virtio devices into the DMA zone is not a good option for us. We did that for protected guests, because that was what Christoph was willing to accept, but not happy with it. I'm willing to go down that rute, but I really need the capability to tell the DMA API 'do this allocation from ZONE_DMA'. I don't care how. > > But if that works for !F_IOMMU_PLATFORM_DEVICES+F_VERSION_1 devices, > then AFAICT it will work equally well for legacy devices. I agree. > > Using the DMA API for *everything* in virtio, legacy or not, seems > like a reasonable approach to me. But, AFAICT, that does require the > DMA layer to have some kind of explicit call to turn on this > behaviour, which the virtio driver would call during initializsation. > I don't think we can do it 100% within the DMA layer, because only the > driver can reasonably know when a device has this weird non-standard > DMA behaviour. > IMHO one could make the DMA API work with different DMA ops. The virtio transport or the virtio core can probably intervene. But then, we (virtio-ccw) would need the iommu=off DMA ops to do the bounce buffering iff force_dma_unencrypted(). Regards, Halil pgpPLLTZnydZy.pgp Description: OpenPGP digital signature ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/2] virtio: let virtio use DMA API when guest RAM is protected
On Thu, 20 Feb 2020 15:55:14 -0500 "Michael S. Tsirkin" wrote: > On Thu, Feb 20, 2020 at 05:06:06PM +0100, Halil Pasic wrote: > > Currently the advanced guest memory protection technologies (AMD SEV, > > powerpc secure guest technology and s390 Protected VMs) abuse the > > VIRTIO_F_IOMMU_PLATFORM flag to make virtio core use the DMA API, which > > is in turn necessary, to make IO work with guest memory protection. > > > > But VIRTIO_F_IOMMU_PLATFORM a.k.a. VIRTIO_F_ACCESS_PLATFORM is really a > > different beast: with virtio devices whose implementation runs on an SMP > > CPU we are still fine with doing all the usual optimizations, it is just > > that we need to make sure that the memory protection mechanism does not > > get in the way. The VIRTIO_F_ACCESS_PLATFORM mandates more work on the > > side of the guest (and possibly he host side as well) than we actually > > need. > > > > An additional benefit of teaching the guest to make the right decision > > (and use DMA API) on it's own is: removing the need, to mandate special > > VM configuration for guests that may run with protection. This is > > especially interesting for s390 as VIRTIO_F_IOMMU_PLATFORM pushes all > > the virtio control structures into the first 2G of guest memory: > > something we don't necessarily want to do per-default. > > > > Signed-off-by: Halil Pasic > > Tested-by: Ram Pai > > Tested-by: Michael Mueller > > This might work for you but it's fragile, since without > VIRTIO_F_ACCESS_PLATFORM hypervisor assumes it gets > GPA's, not DMA addresses. > Thanks for your constructive approach. I do want the hypervisor to assume it gets GPA's. My train of thought was that the guys that need to use IOVA's that are not GPA's when force_dma_unencrypted() will have to to specify VIRTIO_F_ACCESS_PLATFORM (at the device) anyway, because otherwise it won't work. But I see your point: in case of a mis-configuration and provided the DMA API returns IOVA's one could end up trying to touch wrong memory locations. But this should be similar to what would happen if DMA ops are not used, and memory is not made accessible. > > > IOW this looks like another iteration of: > > virtio: Support encrypted memory on powerpc secure guests > > which I was under the impression was abandoned as unnecessary. Unnecessary for powerpc because they do normal PCI. In the context of CCW there are only guest physical addresses (CCW I/O has no concept of IOMMU or IOVAs). > > > To summarize, the necessary conditions for a hack along these lines > (using DMA API without VIRTIO_F_ACCESS_PLATFORM) are that we detect that: > > - secure guest mode is enabled - so we know that since we don't share > most memory regular virtio code won't > work, even though the buggy hypervisor didn't set VIRTIO_F_ACCESS_PLATFORM force_dma_unencrypted(>dev) is IMHO exactly about this. > - DMA API is giving us addresses that are actually also physical > addresses In case of s390 this is given. I talked with the power people before posting this, and they ensured me they can are willing to deal with this. I was hoping to talk abut this with the AMD SEV people here (hence the cc). > - Hypervisor is buggy and didn't enable VIRTIO_F_ACCESS_PLATFORM > I don't get this point. The argument where the hypervisor is buggy is a bit hard to follow for me. If hypervisor is buggy we have already lost anyway most of the time, or? > I don't see how this patch does this. I do get your point. I don't know of a good way to check that DMA API is giving us addresses that are actually physical addresses, and the situation you describe definitely has some risk to it. Let me comment on other ideas that came up. I would be very happy to go with the best one. Thank you very much. Regards, Halil > > > > --- > > drivers/virtio/virtio_ring.c | 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > > index 867c7ebd3f10..fafc8f924955 100644 > > --- a/drivers/virtio/virtio_ring.c > > +++ b/drivers/virtio/virtio_ring.c > > @@ -243,6 +243,9 @@ static bool vring_use_dma_api(struct virtio_device > > *vdev) > > if (!virtio_has_iommu_quirk(vdev)) > > return true; > > > > + if (force_dma_unencrypted(>dev)) > > + return true; > > + > > /* Otherwise, we are left to guess. */ > > /* > > * In theory, it's possible to have a buggy QEMU-supposed > > -- > > 2.17.1 > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/2] mm: move force_dma_unencrypted() to mem_encrypt.h
On Fri, 21 Feb 2020 14:27:27 +1100 David Gibson wrote: > On Thu, Feb 20, 2020 at 05:31:35PM +0100, Christoph Hellwig wrote: > > On Thu, Feb 20, 2020 at 05:23:20PM +0100, Christian Borntraeger wrote: > > > >From a users perspective it makes absolutely perfect sense to use the > > > bounce buffers when they are NEEDED. > > > Forcing the user to specify iommu_platform just because you need bounce > > > buffers > > > really feels wrong. And obviously we have a severe performance issue > > > because of the indirections. > > > > The point is that the user should not have to specify iommu_platform. > > We need to make sure any new hypervisor (especially one that might require > > bounce buffering) always sets it, > > So, I have draft qemu patches which enable iommu_platform by default. > But that's really because of other problems with !iommu_platform, not > anything to do with bounce buffering or secure VMs. > > The thing is that the hypervisor *doesn't* require bounce buffering. > In the POWER (and maybe s390 as well) models for Secure VMs, it's the > *guest*'s choice to enter secure mode, so the hypervisor has no reason > to know whether the guest needs bounce buffering. As far as the > hypervisor and qemu are concerned that's a guest internal detail, it > just expects to get addresses it can access whether those are GPAs > (iommu_platform=off) or IOVAs (iommu_platform=on). I very much agree! > > > as was a rather bogus legacy hack > > It was certainly a bad idea, but it was a bad idea that went into a > public spec and has been widely deployed for many years. We can't > just pretend it didn't happen and move on. > > Turning iommu_platform=on by default breaks old guests, some of which > we still care about. We can't (automatically) do it only for guests > that need bounce buffering, because the hypervisor doesn't know that > ahead of time. Turning iommu_platform=on for virtio-ccw makes no sense whatsover, because for CCW I/O there is no such thing as IOMMU and the addresses are always physical addresses. > > > that isn't extensibe for cases that for example require bounce buffering. > > In fact bounce buffering isn't really the issue from the hypervisor > (or spec's) point of view. It's the fact that not all of guest memory > is accessible to the hypervisor. Bounce buffering is just one way the > guest might deal with that. > Agreed. Regards, Halil pgpOjODca04D4.pgp Description: OpenPGP digital signature ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 1/2] mm: move force_dma_unencrypted() to mem_encrypt.h
Currently force_dma_unencrypted() is only used by the direct implementation of the DMA API, and thus resides in dma-direct.h. But there is nothing dma-direct specific about it: if one was -- for whatever reason -- to implement custom DMA ops that have to in the encrypted/protected scenarios dma-direct currently deals with, one would need exactly this kind of information. More importantly, virtio has to use DMA API (so that the memory encryption (x86) or protection (power, s390) is handled) under the very same circumstances force_dma_unencrypted() returns true. Furthermore, the in these cases the reason to go via the DMA API is distinct, compared to the reason indicated by VIRTIO_F_IOMMU_PLATFORM: we need to use DMA API independently of the device's properties with regards to access to memory. I.e. the addresses in the descriptors are still guest physical addresses, the device may still be implemented by a SMP CPU, and thus the device shall use those without any further translation. See [1]. Let's move force_dma_unencrypted() the so virtio, or other implementations of DMA ops can make the right decisions. [1] https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-416 (In the spec VIRTIO_F_IOMMU_PLATFORM is called VIRTIO_F_ACCESS_PLATFORM). Signed-off-by: Halil Pasic Tested-by: Ram Pai Tested-by: Michael Mueller --- include/linux/dma-direct.h | 9 - include/linux/mem_encrypt.h | 10 ++ 2 files changed, 10 insertions(+), 9 deletions(-) diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h index 24b8684aa21d..590b15d881b0 100644 --- a/include/linux/dma-direct.h +++ b/include/linux/dma-direct.h @@ -26,15 +26,6 @@ static inline phys_addr_t __dma_to_phys(struct device *dev, dma_addr_t dev_addr) } #endif /* !CONFIG_ARCH_HAS_PHYS_TO_DMA */ -#ifdef CONFIG_ARCH_HAS_FORCE_DMA_UNENCRYPTED -bool force_dma_unencrypted(struct device *dev); -#else -static inline bool force_dma_unencrypted(struct device *dev) -{ - return false; -} -#endif /* CONFIG_ARCH_HAS_FORCE_DMA_UNENCRYPTED */ - /* * If memory encryption is supported, phys_to_dma will set the memory encryption * bit in the DMA address, and dma_to_phys will clear it. The raw __phys_to_dma diff --git a/include/linux/mem_encrypt.h b/include/linux/mem_encrypt.h index 5c4a18a91f89..64a48c4b01a2 100644 --- a/include/linux/mem_encrypt.h +++ b/include/linux/mem_encrypt.h @@ -22,6 +22,16 @@ static inline bool mem_encrypt_active(void) { return false; } #endif /* CONFIG_ARCH_HAS_MEM_ENCRYPT */ +struct device; +#ifdef CONFIG_ARCH_HAS_FORCE_DMA_UNENCRYPTED +bool force_dma_unencrypted(struct device *dev); +#else +static inline bool force_dma_unencrypted(struct device *dev) +{ + return false; +} +#endif /* CONFIG_ARCH_HAS_FORCE_DMA_UNENCRYPTED */ + #ifdef CONFIG_AMD_MEM_ENCRYPT /* * The __sme_set() and __sme_clr() macros are useful for adding or removing -- 2.17.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 0/2] virtio: decouple protected guest RAM form VIRTIO_F_IOMMU_PLATFORM
Currently if one intends to run a memory protection enabled VM with virtio devices and linux as the guest OS, one needs to specify the VIRTIO_F_IOMMU_PLATFORM flag for each virtio device to make the guest linux use the DMA API, which in turn handles the memory encryption/protection stuff if the guest decides to turn itself into a protected one. This however makes no sense due to multiple reasons: * The device is not changed by the fact that the guest RAM is protected. The so called IOMMU bypass quirk is not affected. * This usage is not congruent with standardised semantics of VIRTIO_F_IOMMU_PLATFORM. Guest memory protected is an orthogonal reason for using DMA API in virtio (orthogonal with respect to what is expressed by VIRTIO_F_IOMMU_PLATFORM). This series aims to decouple 'have to use DMA API because my (guest) RAM is protected' and 'have to use DMA API because the device told me VIRTIO_F_IOMMU_PLATFORM'. Please find more detailed explanations about the conceptual aspects in the individual patches. There is however also a very practical problem that is addressed by this series. For vhost-net the feature VIRTIO_F_IOMMU_PLATFORM has the following side effect The vhost code assumes it the addresses on the virtio descriptor ring are not guest physical addresses but iova's, and insists on doing a translation of these regardless of what transport is used (e.g. whether we emulate a PCI or a CCW device). (For details see commit 6b1e6cc7855b "vhost: new device IOTLB API".) On s390 this results in severe performance degradation (c.a. factor 10). BTW with ccw I/O there is (architecturally) no IOMMU, so the whole address translation makes no sense in the context of virtio-ccw. Halil Pasic (2): mm: move force_dma_unencrypted() to mem_encrypt.h virtio: let virtio use DMA API when guest RAM is protected drivers/virtio/virtio_ring.c | 3 +++ include/linux/dma-direct.h | 9 - include/linux/mem_encrypt.h | 10 ++ 3 files changed, 13 insertions(+), 9 deletions(-) base-commit: ca7e1fd1026c5af6a533b4b5447e1d2f153e28f2 -- 2.17.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 2/2] virtio: let virtio use DMA API when guest RAM is protected
Currently the advanced guest memory protection technologies (AMD SEV, powerpc secure guest technology and s390 Protected VMs) abuse the VIRTIO_F_IOMMU_PLATFORM flag to make virtio core use the DMA API, which is in turn necessary, to make IO work with guest memory protection. But VIRTIO_F_IOMMU_PLATFORM a.k.a. VIRTIO_F_ACCESS_PLATFORM is really a different beast: with virtio devices whose implementation runs on an SMP CPU we are still fine with doing all the usual optimizations, it is just that we need to make sure that the memory protection mechanism does not get in the way. The VIRTIO_F_ACCESS_PLATFORM mandates more work on the side of the guest (and possibly he host side as well) than we actually need. An additional benefit of teaching the guest to make the right decision (and use DMA API) on it's own is: removing the need, to mandate special VM configuration for guests that may run with protection. This is especially interesting for s390 as VIRTIO_F_IOMMU_PLATFORM pushes all the virtio control structures into the first 2G of guest memory: something we don't necessarily want to do per-default. Signed-off-by: Halil Pasic Tested-by: Ram Pai Tested-by: Michael Mueller --- drivers/virtio/virtio_ring.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 867c7ebd3f10..fafc8f924955 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -243,6 +243,9 @@ static bool vring_use_dma_api(struct virtio_device *vdev) if (!virtio_has_iommu_quirk(vdev)) return true; + if (force_dma_unencrypted(>dev)) + return true; + /* Otherwise, we are left to guess. */ /* * In theory, it's possible to have a buggy QEMU-supposed -- 2.17.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/2] virtio-blk: fix hw_queue stopped on arbitrary error
On Wed, 19 Feb 2020 09:46:56 +0800 Ming Lei wrote: > On Tue, Feb 18, 2020 at 8:35 PM Halil Pasic wrote: > > > > On Tue, 18 Feb 2020 10:21:18 +0800 > > Ming Lei wrote: > > > > > On Thu, Feb 13, 2020 at 8:38 PM Halil Pasic wrote: > > > > > > > > Since nobody else is going to restart our hw_queue for us, the > > > > blk_mq_start_stopped_hw_queues() is in virtblk_done() is not sufficient > > > > necessarily sufficient to ensure that the queue will get started again. > > > > In case of global resource outage (-ENOMEM because mapping failure, > > > > because of swiotlb full) our virtqueue may be empty and we can get > > > > stuck with a stopped hw_queue. > > > > > > > > Let us not stop the queue on arbitrary errors, but only on -EONSPC which > > > > indicates a full virtqueue, where the hw_queue is guaranteed to get > > > > started by virtblk_done() before when it makes sense to carry on > > > > submitting requests. Let us also remove a stale comment. > > > > > > The generic solution may be to stop queue only when there is any > > > in-flight request > > > not completed. > > > > > > > I think this is a pretty close to that. The queue is stopped only on > > ENOSPC, which means virtqueue is full. > > > > > Checking -ENOMEM may not be enough, given -EIO can be returned from > > > virtqueue_add() > > > too in case of dma map failure. > > > > I'm not checking on -ENOMEM. So the queue would not be stopped on EIO. > > Maybe I'm misunderstanding something In any case, please have another > > look at the diff, and if your concerns persist please help me understand. > > Looks I misread the patch, and this patch is fine: > > Reviewed-by: Ming Lei Thank you very much! Regards, Halil > > > Thanks, > Ming Lei ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/2] virtio-blk: fix hw_queue stopped on arbitrary error
On Tue, 18 Feb 2020 10:21:18 +0800 Ming Lei wrote: > On Thu, Feb 13, 2020 at 8:38 PM Halil Pasic wrote: > > > > Since nobody else is going to restart our hw_queue for us, the > > blk_mq_start_stopped_hw_queues() is in virtblk_done() is not sufficient > > necessarily sufficient to ensure that the queue will get started again. > > In case of global resource outage (-ENOMEM because mapping failure, > > because of swiotlb full) our virtqueue may be empty and we can get > > stuck with a stopped hw_queue. > > > > Let us not stop the queue on arbitrary errors, but only on -EONSPC which > > indicates a full virtqueue, where the hw_queue is guaranteed to get > > started by virtblk_done() before when it makes sense to carry on > > submitting requests. Let us also remove a stale comment. > > The generic solution may be to stop queue only when there is any > in-flight request > not completed. > I think this is a pretty close to that. The queue is stopped only on ENOSPC, which means virtqueue is full. > Checking -ENOMEM may not be enough, given -EIO can be returned from > virtqueue_add() > too in case of dma map failure. I'm not checking on -ENOMEM. So the queue would not be stopped on EIO. Maybe I'm misunderstanding something In any case, please have another look at the diff, and if your concerns persist please help me understand. Thanks for having a look! Regards, Halil > > Thanks, ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/2] virtio-blk: fix hw_queue stopped on arbitrary error
On Fri, 14 Feb 2020 10:20:44 -0800 dongli.zh...@oracle.com wrote: > Hi Halil, > > When swiotlb full is hit for virtio_blk, there is below warning for once (the > warning is not by this patch set). Is this expected or just false positive? The warning is kind of expected. Certainly not a false positive, but it probably looks more dramatic than I would like it to look. If swiotlb cmdline parameter can be chosen so that the swiotlb won't run out of space, that is certainly preferable. But out of swiotlb space should merely result in performance degradation (provided the device drivers properly handle the condition). Thanks for having a look! Regards, Halil > > [ 54.767257] virtio-pci :00:04.0: swiotlb buffer is full (sz: 16 bytes), > total 32768 (slots), used 258 (slots) > [ 54.767260] virtio-pci :00:04.0: overflow 0x75770110+16 of DMA > mask bus limit 0 > [ 54.769192] [ cut here ] > [ 54.769200] WARNING: CPU: 3 PID: 102 at kernel/dma/direct.c:35 > report_addr+0x71/0x77 > [ 54.769200] Modules linked in: > [ 54.769203] CPU: 3 PID: 102 Comm: kworker/u8:2 Not tainted 5.6.0-rc1+ #2 > [ 54.769204] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 > [ 54.769208] Workqueue: writeback wb_workfn (flush-253:0) > [ 54.769211] RIP: 0010:report_addr+0x71/0x77 > ... ... > [ 54.769226] Call Trace: > [ 54.769241] dma_direct_map_page+0xc9/0xe0 > [ 54.769245] virtqueue_add+0x172/0xaa0 > [ 54.769248] virtqueue_add_sgs+0x85/0xa0 > [ 54.769251] virtio_queue_rq+0x292/0x480 > [ 54.769255] __blk_mq_try_issue_directly+0x13e/0x1f0 > [ 54.769257] blk_mq_request_issue_directly+0x48/0xa0 > [ 54.769259] blk_mq_try_issue_list_directly+0x3c/0xb0 > [ 54.769261] blk_mq_sched_insert_requests+0xb6/0x100 > [ 54.769263] blk_mq_flush_plug_list+0x146/0x210 > [ 54.769264] blk_flush_plug_list+0xba/0xe0 > [ 54.769266] blk_mq_make_request+0x331/0x5b0 > [ 54.769268] generic_make_request+0x10d/0x2e0 > [ 54.769270] submit_bio+0x69/0x130 > [ 54.769273] ext4_io_submit+0x44/0x50 > [ 54.769275] ext4_writepages+0x56f/0xd30 > [ 54.769278] ? cpumask_next_and+0x19/0x20 > [ 54.769280] ? find_busiest_group+0x11a/0xa40 > [ 54.769283] do_writepages+0x15/0x70 > [ 54.769288] __writeback_single_inode+0x38/0x330 > [ 54.769290] writeback_sb_inodes+0x219/0x4c0 > [ 54.769292] __writeback_inodes_wb+0x82/0xb0 > [ 54.769293] wb_writeback+0x267/0x300 > [ 54.769295] wb_workfn+0x1aa/0x430 > [ 54.769298] process_one_work+0x156/0x360 > [ 54.769299] worker_thread+0x41/0x3b0 > [ 54.769300] kthread+0xf3/0x130 > [ 54.769302] ? process_one_work+0x360/0x360 > [ 54.769303] ? kthread_bind+0x10/0x10 > [ 54.769305] ret_from_fork+0x35/0x40 > [ 54.769307] ---[ end trace 923a87a9ce0e777a ]--- > > Thank you very much! > > Dongli Zhang > > On 2/13/20 4:37 AM, Halil Pasic wrote: > > Since nobody else is going to restart our hw_queue for us, the > > blk_mq_start_stopped_hw_queues() is in virtblk_done() is not sufficient > > necessarily sufficient to ensure that the queue will get started again. > > In case of global resource outage (-ENOMEM because mapping failure, > > because of swiotlb full) our virtqueue may be empty and we can get > > stuck with a stopped hw_queue. > > > > Let us not stop the queue on arbitrary errors, but only on -EONSPC which > > indicates a full virtqueue, where the hw_queue is guaranteed to get > > started by virtblk_done() before when it makes sense to carry on > > submitting requests. Let us also remove a stale comment. > > > > Signed-off-by: Halil Pasic > > Cc: Jens Axboe > > Fixes: f7728002c1c7 ("virtio_ring: fix return code on DMA mapping fails") > > --- > > > > I'm in doubt with regards to the Fixes tag. The thing is, virtio-blk > > probably made an assumption on virtqueue_add: the failure is either > > because the virtqueue is full, or the failure is fatal. In both cases it > > seems acceptable to stop the queue, although the fatal case is arguable. > > Since commit f7728002c1c7 it the dma mapping has failed returns -ENOMEM > > and not -EIO, and thus we have a recoverable failure that ain't > > virtqueue full. So I lean towards to a fixes tag that references that > > commit, although it ain't broken. Alternatively one could say 'Fixes: > > e467cde23818 ("Block driver using virtio.")', if the aforementioned > > assumption shouldn't have made in the first place. > > --- > > drivers/block/virtio_blk.c | 8 +--- > > 1 file changed, 5 insertions(+), 3 de
[PATCH 2/2] virtio-blk: improve virtqueue error to BLK_STS
Let's change the mapping between virtqueue_add errors to BLK_STS statuses, so that -ENOSPC, which indicates virtqueue full is still mapped to BLK_STS_DEV_RESOURCE, but -ENOMEM which indicates non-device specific resource outage is mapped to BLK_STS_RESOURCE. Signed-off-by: Halil Pasic --- See comment about BLK_STS_DEV_RESOURCE in include/linux/blk_types.h --- drivers/block/virtio_blk.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c index adfe43f5ffe4..0736248999b0 100644 --- a/drivers/block/virtio_blk.c +++ b/drivers/block/virtio_blk.c @@ -251,9 +251,14 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx, if (err == -ENOSPC) blk_mq_stop_hw_queue(hctx); spin_unlock_irqrestore(>vqs[qid].lock, flags); - if (err == -ENOMEM || err == -ENOSPC) + switch (err) { + case -ENOSPC: return BLK_STS_DEV_RESOURCE; - return BLK_STS_IOERR; + case -ENOMEM: + return BLK_STS_RESOURCE; + default: + return BLK_STS_IOERR; + } } if (bd->last && virtqueue_kick_prepare(vblk->vqs[qid].vq)) -- 2.17.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 1/2] virtio-blk: fix hw_queue stopped on arbitrary error
Since nobody else is going to restart our hw_queue for us, the blk_mq_start_stopped_hw_queues() is in virtblk_done() is not sufficient necessarily sufficient to ensure that the queue will get started again. In case of global resource outage (-ENOMEM because mapping failure, because of swiotlb full) our virtqueue may be empty and we can get stuck with a stopped hw_queue. Let us not stop the queue on arbitrary errors, but only on -EONSPC which indicates a full virtqueue, where the hw_queue is guaranteed to get started by virtblk_done() before when it makes sense to carry on submitting requests. Let us also remove a stale comment. Signed-off-by: Halil Pasic Cc: Jens Axboe Fixes: f7728002c1c7 ("virtio_ring: fix return code on DMA mapping fails") --- I'm in doubt with regards to the Fixes tag. The thing is, virtio-blk probably made an assumption on virtqueue_add: the failure is either because the virtqueue is full, or the failure is fatal. In both cases it seems acceptable to stop the queue, although the fatal case is arguable. Since commit f7728002c1c7 it the dma mapping has failed returns -ENOMEM and not -EIO, and thus we have a recoverable failure that ain't virtqueue full. So I lean towards to a fixes tag that references that commit, although it ain't broken. Alternatively one could say 'Fixes: e467cde23818 ("Block driver using virtio.")', if the aforementioned assumption shouldn't have made in the first place. --- drivers/block/virtio_blk.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c index 54158766334b..adfe43f5ffe4 100644 --- a/drivers/block/virtio_blk.c +++ b/drivers/block/virtio_blk.c @@ -245,10 +245,12 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx, err = virtblk_add_req(vblk->vqs[qid].vq, vbr, vbr->sg, num); if (err) { virtqueue_kick(vblk->vqs[qid].vq); - blk_mq_stop_hw_queue(hctx); + /* Don't stop the queue if -ENOMEM: we may have failed to +* bounce the buffer due to global resource outage. +*/ + if (err == -ENOSPC) + blk_mq_stop_hw_queue(hctx); spin_unlock_irqrestore(>vqs[qid].lock, flags); - /* Out of mem doesn't actually happen, since we fall back -* to direct descriptors */ if (err == -ENOMEM || err == -ENOSPC) return BLK_STS_DEV_RESOURCE; return BLK_STS_IOERR; -- 2.17.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 0/2] virtio-blk: improve handling of DMA mapping failures
Two patches are handling new edge cases introduced by doing DMA mappings (which can fail) in virtio core. I stumbled upon this while stress testing I/O for Protected Virtual Machines. I deliberately chose a tiny swiotlb size and have generated load with fio. With more than one virtio-blk disk in use I experienced hangs. The goal of this series is to fix those hangs. Halil Pasic (2): virtio-blk: fix hw_queue stopped on arbitrary error virtio-blk: improve virtqueue error to BLK_STS drivers/block/virtio_blk.c | 17 - 1 file changed, 12 insertions(+), 5 deletions(-) base-commit: 39bed42de2e7d74686a2d5a45638d6a5d7e7d473 -- 2.17.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/1] virtio_ring: fix return code on DMA mapping fails
On Tue, 26 Nov 2019 19:45:27 +0100 Christoph Hellwig wrote: > On Sat, Nov 23, 2019 at 09:39:08AM -0600, Tom Lendacky wrote: > > Ideally, having a pool of shared pages for DMA, outside of standard > > SWIOTLB, might be a good thing. On x86, SWIOTLB really seems geared > > towards devices that don't support 64-bit DMA. If a device supports 64-bit > > DMA then it can use shared pages that reside anywhere to perform the DMA > > and bounce buffering. I wonder if the SWIOTLB support can be enhanced to > > support something like this, using today's low SWIOTLB buffers if the DMA > > mask necessitates it, otherwise using a dynamically sized pool of shared > > pages that can live anywhere. > > I think that can be done relatively easily. I've actually been thinking > of multiple pool support for a whіle to replace the bounce buffering > in the block layer for ISA devices (24-bit addressing). > > I've also been looking into a dma_alloc_pages interface to help people > just allocate pages that are always dma addressable, but don't need > a coherent allocation. My last version I shared is here: > > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma_alloc_pages > > But it turns out this still doesn't work with SEV as we'll always > bounce. And I've been kinda lost on figuring out a way how to > allocate unencrypted pages that we we can feed into the normal > dma_map_page & co interfaces due to the magic encryption bit in > the address. I guess we could have a fallback path in the mapping > path and just unconditionally clear that bit in the dma_to_phys > path. Thanks Christoph! Thanks Tom! I will do some looking and thinking and report back. Regards, Halil ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization