Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
>>> On 13.05.16 at 17:27, wrote:
> We're approaching the release date so I would like to wrap this up.
>
> As I understand it, there is indeed an issue in the emulator, but a
> proper fix that takes into consideration all cases has not been
> proposed.
>
> Should we make this a blocker for the release? I'm inclined to say no
> because it has been like this for a long time and doesn't seem to cause
> trouble for our main use case.
>
> Thoughts?

I agree.

Jan

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
We're approaching the release date so I would like to wrap this up.

As I understand it, there is indeed an issue in the emulator, but a
proper fix that takes into consideration all cases has not been proposed.

Should we make this a blocker for the release? I'm inclined to say no
because it has been like this for a long time and doesn't seem to cause
trouble for our main use case.

Thoughts?

Wei.
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
>>> Razvan Cojocaru 05/05/16 11:24 AM >>>
> On 05/04/2016 04:42 PM, Jan Beulich wrote:
>> On 04.05.16 at 13:32, wrote:
>>> But while implementing a stub that falls back to the actual LOCK CMPXCHG
>>> and replacing hvm_copy_to_guest_virt() with it would indeed be an
>>> improvement (with the added advantage of being able to treat
>>> non-emulated LOCK CMPXCHG cases), I don't understand how that would
>>> solve the read-modify-write atomicity problem.
>>>
>>> AFAICT, this would only solve the write problem. Assuming we have VCPU1
>>> and VCPU2 emulating a LOCKed instruction expecting rmw atomicity, the
>>> stub alone would not prevent this:
>>>
>>> VCPU1: read, modify
>>> VCPU2: read, modify, write
>>> VCPU1: write
>>
>> I'm not sure I follow what you mean here: Does the above represent
>> what the guest does, or what the hypervisor does as steps to emulate
>> a _single_ guest instruction? In the former case, I don't see what
>> you're after. And in the latter case I don't understand why you think
>> using CMPXCHG instead of WRITE wouldn't help.
>
> Briefly, this is the scenario: assuming a guest with two VCPUs and an
> introspection application that has restricted access to a page, the
> guest runs two LOCK instructions that touch the page, causing a page
> fault for each instruction. This further translates to two EPT fault
> vm_events being placed in the ring buffer.
>
> By the time the introspection application polls the event channel, both
> VCPUs are paused, waiting for replies to the vm_events.
>
> If the monitoring application processes both events (puts both replies,
> with the emulate option on, in the ring buffer), then signals the event
> channel, it is possible that both VCPUs get woken up, ending up running
> x86_emulate() simultaneously.
>
> In this case, my understanding is that just using CMPXCHG will not work
> (although it is clearly superior to the current implementation), because
> the read part and the write part of x86_emulate() (when LOCKed
> instructions are involved) should be executed atomically, but writing
> the CMPXCHG stub would only make sure that two simultaneous writes won't
> occur.
>
> In other words, this would still be possible (atomicity would still not
> be guaranteed for LOCKed instructions):
>
> VCPU1: read
> VCPU2: read, write
> VCPU1: write
>
> when what we want for LOCKed instructions is:
>
> VCPU1: read, write
> VCPU2: read, write

Okay, in short I take this to mean "single instruction" as the answer to
my actual question.

> Am I misunderstanding how x86_emulate() works?

No, but I suppose you're misunderstanding what I'm trying to suggest.
What you write above is not what will result when using CMPXCHG.
Instead what we'll have is

vCPU1: read
vCPU2: read
vCPU2: cmpxchg
vCPU1: cmpxchg

Note that the second cmpxchg will fail unless the first one wrote back
an unchanged value. Hence vCPU1 will be told to re-execute the
instruction.

Jan
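[Editor's note] The read / read / cmpxchg / cmpxchg sequence Jan describes can be
sketched in plain C11. This is illustrative only - the names `guest_mem` and
`emulated_lock_add()` are hypothetical stand-ins for the guest memory location and
the emulator's read-modify-write path, not actual Xen code:

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t guest_mem = 40;  /* stands in for the guest location */

/*
 * Emulate a LOCKed "add $operand, mem": read, modify, then write back with
 * a compare-exchange that fails if another vCPU wrote the location in
 * between - in which case the whole instruction is simply re-executed,
 * mirroring "vCPU1 will be told to re-execute the instruction".
 */
static void emulated_lock_add(uint32_t operand)
{
    uint32_t old, new;

    do {
        old = atomic_load(&guest_mem);   /* the "read" step */
        new = old + operand;             /* the "modify" step */
    } while ( !atomic_compare_exchange_weak(&guest_mem, &old, new) );
}
```

Run concurrently from two threads, no increment can be lost: a failed
compare-exchange corresponds exactly to the second vCPU's cmpxchg failing above.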
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 05/04/2016 04:42 PM, Jan Beulich wrote:
>>>> On 04.05.16 at 13:32, wrote:
>> But while implementing a stub that falls back to the actual LOCK CMPXCHG
>> and replacing hvm_copy_to_guest_virt() with it would indeed be an
>> improvement (with the added advantage of being able to treat
>> non-emulated LOCK CMPXCHG cases), I don't understand how that would
>> solve the read-modify-write atomicity problem.
>>
>> AFAICT, this would only solve the write problem. Assuming we have VCPU1
>> and VCPU2 emulating a LOCKed instruction expecting rmw atomicity, the
>> stub alone would not prevent this:
>>
>> VCPU1: read, modify
>> VCPU2: read, modify, write
>> VCPU1: write
>
> I'm not sure I follow what you mean here: Does the above represent
> what the guest does, or what the hypervisor does as steps to emulate
> a _single_ guest instruction? In the former case, I don't see what
> you're after. And in the latter case I don't understand why you think
> using CMPXCHG instead of WRITE wouldn't help.

Briefly, this is the scenario: assuming a guest with two VCPUs and an
introspection application that has restricted access to a page, the
guest runs two LOCK instructions that touch the page, causing a page
fault for each instruction. This further translates to two EPT fault
vm_events being placed in the ring buffer.

By the time the introspection application polls the event channel, both
VCPUs are paused, waiting for replies to the vm_events.

If the monitoring application processes both events (puts both replies,
with the emulate option on, in the ring buffer), then signals the event
channel, it is possible that both VCPUs get woken up, ending up running
x86_emulate() simultaneously.

In this case, my understanding is that just using CMPXCHG will not work
(although it is clearly superior to the current implementation), because
the read part and the write part of x86_emulate() (when LOCKed
instructions are involved) should be executed atomically, but writing
the CMPXCHG stub would only make sure that two simultaneous writes won't
occur.

In other words, this would still be possible (atomicity would still not
be guaranteed for LOCKed instructions):

VCPU1: read
VCPU2: read, write
VCPU1: write

when what we want for LOCKed instructions is:

VCPU1: read, write
VCPU2: read, write

Am I misunderstanding how x86_emulate() works?

Thanks,
Razvan
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
>>> On 04.05.16 at 13:32, wrote:
> But while implementing a stub that falls back to the actual LOCK CMPXCHG
> and replacing hvm_copy_to_guest_virt() with it would indeed be an
> improvement (with the added advantage of being able to treat
> non-emulated LOCK CMPXCHG cases), I don't understand how that would
> solve the read-modify-write atomicity problem.
>
> AFAICT, this would only solve the write problem. Assuming we have VCPU1
> and VCPU2 emulating a LOCKed instruction expecting rmw atomicity, the
> stub alone would not prevent this:
>
> VCPU1: read, modify
> VCPU2: read, modify, write
> VCPU1: write

I'm not sure I follow what you mean here: Does the above represent
what the guest does, or what the hypervisor does as steps to emulate
a _single_ guest instruction? In the former case, I don't see what
you're after. And in the latter case I don't understand why you think
using CMPXCHG instead of WRITE wouldn't help.

Jan

> Moreover, since reads and writes are not synchronized, it would be
> possible for VCPU2's read to occur while VCPU1 writes, and VCPU1 could
> read part of the old data + part of the new data.
>
> So the problem originally addressed by the patch would still need to be
> addressed like that: with a read / write lock covering all the relevant
> parts of x86_emulate(). Unless I'm mistaken, the stub part is only
> needed to make sure that CMPXCHG alone does not race when a VCPU
> emulates and another does not.
>
> Thanks,
> Razvan
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 05/03/2016 06:13 PM, Jan Beulich wrote:
>>>> On 03.05.16 at 16:41, wrote:
>> On 05/03/2016 05:30 PM, Jan Beulich wrote:
>>>> On 03.05.16 at 16:20, wrote:
>>>> I've kept experimenting with the patch but can't quite figure out why
>>>> minimizing the lock scope to the writeback part would not be
>>>> sufficient, but it isn't.
>>>>
>>>> I.e. with this code:
>>>>
>>>> 3824  writeback:
>>>> 3825      ops->smp_lock(lock_prefix);
>>>> 3826      switch ( dst.type )
>>>> 3827      {
>>>> 3828      case OP_REG:
>>>> 3829          /* The 4-byte case *is* correct: in 64-bit mode we zero-extend. */
>>>> 3830          switch ( dst.bytes )
>>>> 3831          {
>>>> 3832          case 1: *(uint8_t *)dst.reg = (uint8_t)dst.val; break;
>>>> 3833          case 2: *(uint16_t *)dst.reg = (uint16_t)dst.val; break;
>>>> 3834          case 4: *dst.reg = (uint32_t)dst.val; break; /* 64b: zero-ext */
>>>> 3835          case 8: *dst.reg = dst.val; break;
>>>> 3836          }
>>>> 3837          break;
>>>> 3838      case OP_MEM:
>>>> 3839          if ( !(d & Mov) && (dst.orig_val == dst.val) &&
>>>> 3840               !ctxt->force_writeback )
>>>> 3841              /* nothing to do */;
>>>> 3842          else if ( lock_prefix )
>>>> 3843              rc = ops->cmpxchg(
>>>> 3844                  dst.mem.seg, dst.mem.off, &dst.orig_val,
>>>> 3845                  &dst.val, dst.bytes, ctxt);
>>>> 3846          else
>>>> 3847              rc = ops->write(
>>>> 3848                  dst.mem.seg, dst.mem.off, &dst.val, dst.bytes, ctxt);
>>>> 3849          if ( rc != 0 )
>>>> 3850          {
>>>> 3851              ops->smp_unlock(lock_prefix);
>>>> 3852              goto done;
>>>> 3853          }
>>>> 3854      default:
>>>> 3855          break;
>>>> 3856      }
>>>> 3857      ops->smp_unlock(lock_prefix);
>>>>
>>>> I can still reproduce the guest hang. But if I lock at the very
>>>> beginning of x86_emulate() and unlock before each return, no more
>>>> hangs.
>>>
>>> Isn't that obvious? Locked instructions are necessarily
>>> read-modify-write ones, and hence the lock needs to be taken
>>> before the read, and dropped after the write. But remember, I'll
>>> continue to show opposition to this getting "fixed" this way (in the
>>> emulator itself), as long as no proper explanation can be given
>>> why making hvmemul_cmpxchg() do what its name says isn't all
>>> we need (and hence why i386-like bus lock behavior is needed).
>>
>> Yes, that's what I thought, but at a previous time I've described my
>> attempt to lock _only_ hvmemul_cmpxchg() (which failed) - and the
>> failure of that change to address the issue has been considered curious.
>> I've probably not been able to explain clearly what I've tried, or have
>> misunderstood the answer, and took it to mean that for some reason a
>> similar change is supposed to be able to fix it.
>>
>> Obviously locking hvmemul_cmpxchg() would have only affected the OP_MEM
>> case above (with lock_prefix == 1), actually an even smaller scope than
>> the new one, and with no read locking either.
>
> Yeah, I may have got confused there too (resulting in me confusing
> you, perhaps): Adding a spin lock to hvmemul_cmpxchg() of course
> won't help. Making it use (or force use of) LOCK CMPXCHG would, I
> suppose.
>
>> I guess the question now is what avenues are there to make
>> hvmemul_cmpxchg() do what its name says - I'm certainly open to trying
>> out any alternatives - my main concern is to have the problem fixed in
>> the best way possible, certainly not to have any specific version of
>> this patch make it into Xen.
>
> I guess making it no longer call hvmemul_write(), copying most of
> that function's code while suitably replacing just the call to
> hvm_copy_to_guest_virt() (with the assumption that the MMIO
> paths at least for now don't need strict LOCK handling) is the only
> viable route.

Thank you for clearing that up!

But while implementing a stub that falls back to the actual LOCK CMPXCHG
and replacing hvm_copy_to_guest_virt() with it would indeed be an
improvement (with the added advantage of being able to treat
non-emulated LOCK CMPXCHG cases), I don't understand how that would
solve the read-modify-write atomicity problem.

AFAICT, this would only solve the write problem. Assuming we have VCPU1
and VCPU2 emulating a LOCKed instruction expecting rmw atomicity, the
stub alone would not prevent this:

VCPU1: read, modify
VCPU2: read, modify, write
VCPU1: write

Moreover, since reads and writes are not synchronized, it would be
possible for VCPU2's read to occur while VCPU1 writes, and VCPU1 could
read part of the old data + part of the new data.

So the problem originally addressed by the patch would still need to be
addressed like that: with a read / write lock covering all the relevant
parts of x86_emulate(). Unless I'm mistaken, the stub part is only
needed to make sure that CMPXCHG alone does not race when a VCPU
emulates and another does not.

Thanks,
Razvan
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
>>> On 03.05.16 at 16:41, wrote:
> On 05/03/2016 05:30 PM, Jan Beulich wrote:
>> On 03.05.16 at 16:20, wrote:
>>> I've kept experimenting with the patch but can't quite figure out why
>>> minimizing the lock scope to the writeback part would not be
>>> sufficient, but it isn't.
>>>
>>> I.e. with this code:
>>>
>>> 3824  writeback:
>>> 3825      ops->smp_lock(lock_prefix);
>>> 3826      switch ( dst.type )
>>> 3827      {
>>> 3828      case OP_REG:
>>> 3829          /* The 4-byte case *is* correct: in 64-bit mode we zero-extend. */
>>> 3830          switch ( dst.bytes )
>>> 3831          {
>>> 3832          case 1: *(uint8_t *)dst.reg = (uint8_t)dst.val; break;
>>> 3833          case 2: *(uint16_t *)dst.reg = (uint16_t)dst.val; break;
>>> 3834          case 4: *dst.reg = (uint32_t)dst.val; break; /* 64b: zero-ext */
>>> 3835          case 8: *dst.reg = dst.val; break;
>>> 3836          }
>>> 3837          break;
>>> 3838      case OP_MEM:
>>> 3839          if ( !(d & Mov) && (dst.orig_val == dst.val) &&
>>> 3840               !ctxt->force_writeback )
>>> 3841              /* nothing to do */;
>>> 3842          else if ( lock_prefix )
>>> 3843              rc = ops->cmpxchg(
>>> 3844                  dst.mem.seg, dst.mem.off, &dst.orig_val,
>>> 3845                  &dst.val, dst.bytes, ctxt);
>>> 3846          else
>>> 3847              rc = ops->write(
>>> 3848                  dst.mem.seg, dst.mem.off, &dst.val, dst.bytes, ctxt);
>>> 3849          if ( rc != 0 )
>>> 3850          {
>>> 3851              ops->smp_unlock(lock_prefix);
>>> 3852              goto done;
>>> 3853          }
>>> 3854      default:
>>> 3855          break;
>>> 3856      }
>>> 3857      ops->smp_unlock(lock_prefix);
>>>
>>> I can still reproduce the guest hang. But if I lock at the very
>>> beginning of x86_emulate() and unlock before each return, no more
>>> hangs.
>>
>> Isn't that obvious? Locked instructions are necessarily
>> read-modify-write ones, and hence the lock needs to be taken
>> before the read, and dropped after the write. But remember, I'll
>> continue to show opposition to this getting "fixed" this way (in the
>> emulator itself), as long as no proper explanation can be given
>> why making hvmemul_cmpxchg() do what its name says isn't all
>> we need (and hence why i386-like bus lock behavior is needed).
>
> Yes, that's what I thought, but at a previous time I've described my
> attempt to lock _only_ hvmemul_cmpxchg() (which failed) - and the
> failure of that change to address the issue has been considered curious.
> I've probably not been able to explain clearly what I've tried, or have
> misunderstood the answer, and took it to mean that for some reason a
> similar change is supposed to be able to fix it.
>
> Obviously locking hvmemul_cmpxchg() would have only affected the OP_MEM
> case above (with lock_prefix == 1), actually an even smaller scope than
> the new one, and with no read locking either.

Yeah, I may have got confused there too (resulting in me confusing
you, perhaps): Adding a spin lock to hvmemul_cmpxchg() of course
won't help. Making it use (or force use of) LOCK CMPXCHG would, I
suppose.

> I guess the question now is what avenues are there to make
> hvmemul_cmpxchg() do what its name says - I'm certainly open to trying
> out any alternatives - my main concern is to have the problem fixed in
> the best way possible, certainly not to have any specific version of
> this patch make it into Xen.

I guess making it no longer call hvmemul_write(), copying most of
that function's code while suitably replacing just the call to
hvm_copy_to_guest_virt() (with the assumption that the MMIO
paths at least for now don't need strict LOCK handling) is the only
viable route.

Jan
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 05/03/2016 05:30 PM, Jan Beulich wrote:
>>> On 03.05.16 at 16:20, wrote:
>> I've kept experimenting with the patch but can't quite figure out why
>> minimizing the lock scope to the writeback part would not be
>> sufficient, but it isn't.
>>
>> I.e. with this code:
>>
>> 3824  writeback:
>> 3825      ops->smp_lock(lock_prefix);
>> 3826      switch ( dst.type )
>> 3827      {
>> 3828      case OP_REG:
>> 3829          /* The 4-byte case *is* correct: in 64-bit mode we zero-extend. */
>> 3830          switch ( dst.bytes )
>> 3831          {
>> 3832          case 1: *(uint8_t *)dst.reg = (uint8_t)dst.val; break;
>> 3833          case 2: *(uint16_t *)dst.reg = (uint16_t)dst.val; break;
>> 3834          case 4: *dst.reg = (uint32_t)dst.val; break; /* 64b: zero-ext */
>> 3835          case 8: *dst.reg = dst.val; break;
>> 3836          }
>> 3837          break;
>> 3838      case OP_MEM:
>> 3839          if ( !(d & Mov) && (dst.orig_val == dst.val) &&
>> 3840               !ctxt->force_writeback )
>> 3841              /* nothing to do */;
>> 3842          else if ( lock_prefix )
>> 3843              rc = ops->cmpxchg(
>> 3844                  dst.mem.seg, dst.mem.off, &dst.orig_val,
>> 3845                  &dst.val, dst.bytes, ctxt);
>> 3846          else
>> 3847              rc = ops->write(
>> 3848                  dst.mem.seg, dst.mem.off, &dst.val, dst.bytes, ctxt);
>> 3849          if ( rc != 0 )
>> 3850          {
>> 3851              ops->smp_unlock(lock_prefix);
>> 3852              goto done;
>> 3853          }
>> 3854      default:
>> 3855          break;
>> 3856      }
>> 3857      ops->smp_unlock(lock_prefix);
>>
>> I can still reproduce the guest hang. But if I lock at the very
>> beginning of x86_emulate() and unlock before each return, no more
>> hangs.
>
> Isn't that obvious? Locked instructions are necessarily
> read-modify-write ones, and hence the lock needs to be taken
> before the read, and dropped after the write. But remember, I'll
> continue to show opposition to this getting "fixed" this way (in the
> emulator itself), as long as no proper explanation can be given
> why making hvmemul_cmpxchg() do what its name says isn't all
> we need (and hence why i386-like bus lock behavior is needed).

Yes, that's what I thought, but at a previous time I've described my
attempt to lock _only_ hvmemul_cmpxchg() (which failed) - and the
failure of that change to address the issue has been considered curious.
I've probably not been able to explain clearly what I've tried, or have
misunderstood the answer, and took it to mean that for some reason a
similar change is supposed to be able to fix it.

Obviously locking hvmemul_cmpxchg() would have only affected the OP_MEM
case above (with lock_prefix == 1), actually an even smaller scope than
the new one, and with no read locking either.

I guess the question now is what avenues are there to make
hvmemul_cmpxchg() do what its name says - I'm certainly open to trying
out any alternatives - my main concern is to have the problem fixed in
the best way possible, certainly not to have any specific version of
this patch make it into Xen.

Thanks,
Razvan
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
>>> On 03.05.16 at 16:20, wrote:
> I've kept experimenting with the patch but can't quite figure out why
> minimizing the lock scope to the writeback part would not be
> sufficient, but it isn't.
>
> I.e. with this code:
>
> 3824  writeback:
> 3825      ops->smp_lock(lock_prefix);
> 3826      switch ( dst.type )
> 3827      {
> 3828      case OP_REG:
> 3829          /* The 4-byte case *is* correct: in 64-bit mode we zero-extend. */
> 3830          switch ( dst.bytes )
> 3831          {
> 3832          case 1: *(uint8_t *)dst.reg = (uint8_t)dst.val; break;
> 3833          case 2: *(uint16_t *)dst.reg = (uint16_t)dst.val; break;
> 3834          case 4: *dst.reg = (uint32_t)dst.val; break; /* 64b: zero-ext */
> 3835          case 8: *dst.reg = dst.val; break;
> 3836          }
> 3837          break;
> 3838      case OP_MEM:
> 3839          if ( !(d & Mov) && (dst.orig_val == dst.val) &&
> 3840               !ctxt->force_writeback )
> 3841              /* nothing to do */;
> 3842          else if ( lock_prefix )
> 3843              rc = ops->cmpxchg(
> 3844                  dst.mem.seg, dst.mem.off, &dst.orig_val,
> 3845                  &dst.val, dst.bytes, ctxt);
> 3846          else
> 3847              rc = ops->write(
> 3848                  dst.mem.seg, dst.mem.off, &dst.val, dst.bytes, ctxt);
> 3849          if ( rc != 0 )
> 3850          {
> 3851              ops->smp_unlock(lock_prefix);
> 3852              goto done;
> 3853          }
> 3854      default:
> 3855          break;
> 3856      }
> 3857      ops->smp_unlock(lock_prefix);
>
> I can still reproduce the guest hang. But if I lock at the very
> beginning of x86_emulate() and unlock before each return, no more hangs.

Isn't that obvious? Locked instructions are necessarily
read-modify-write ones, and hence the lock needs to be taken
before the read, and dropped after the write. But remember, I'll
continue to show opposition to this getting "fixed" this way (in the
emulator itself), as long as no proper explanation can be given
why making hvmemul_cmpxchg() do what its name says isn't all
we need (and hence why i386-like bus lock behavior is needed).

Jan
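[Editor's note] Jan's point - that for a read-modify-write instruction the lock
must span the read as well as the writeback - can be illustrated with a toy C11
sketch. All names here are hypothetical; the real patch's smp_lock()/smp_unlock()
hooks are Xen-internal:

```c
#include <stdatomic.h>
#include <stdint.h>

static atomic_flag emul_lock = ATOMIC_FLAG_INIT;  /* toy spin lock */
static uint32_t guest_mem;

static void lock(void)   { while ( atomic_flag_test_and_set(&emul_lock) ) ; }
static void unlock(void) { atomic_flag_clear(&emul_lock); }

/*
 * Mirrors locking only the writeback step: the read escapes the critical
 * section, so two vCPUs can both read the same old value and one
 * increment is lost - the hang Razvan observed with the narrow scope.
 */
static void emulate_inc_writeback_locked(void)
{
    uint32_t v = guest_mem;   /* read happens before the lock is taken */
    v += 1;                   /* modify */
    lock();
    guest_mem = v;            /* write may overwrite a newer value */
    unlock();
}

/*
 * Mirrors taking the lock at the top of x86_emulate(): read, modify and
 * write all sit inside the critical section, so no update can be lost.
 */
static void emulate_inc_rmw_locked(void)
{
    lock();
    guest_mem += 1;
    unlock();
}
```

Single-threaded both variants behave identically; only under concurrent callers
does the first one lose updates, which is why narrowing the lock to the
writeback was not sufficient.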
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 04/27/2016 10:14 AM, Razvan Cojocaru wrote:
> On 04/27/2016 09:22 AM, Jan Beulich wrote:
>>>> On 26.04.16 at 19:23, wrote:
>>> On 04/26/16 19:03, George Dunlap wrote:
>>>> On 19/04/16 17:35, Jan Beulich wrote:
>>>>>>> Razvan Cojocaru 04/19/16 1:01 PM >>>
>>>>>> I think this might be because the LOCK prefix should guarantee that
>>>>>> the instruction that follows it has exclusive use of shared memory
>>>>>> (for both reads and writes) but I might be misreading the docs:
>>>>>
>>>>> LOCK definitely has no effect on other than the instruction it gets
>>>>> applied to.
>>>>
>>>> Sorry I wasn't involved in this discussion -- what was the conclusion
>>>> here?
>>>>
>>>> FWIW Andy's suggestion of using a stub seemed like the most robust
>>>> solution, if that could be made to work.
>>>>
>>>> If you're going to submit a patch substantially similar to this one,
>>>> let me know so I can review the mm bits of the original patch.
>>>
>>> I'm not really sure.
>>>
>>> Regarding this version of the patch, Jan has asked for more information
>>> on the performance impact, but I'm not sure how to obtain it in a
>>> rigorous manner.
>>
>> That was only one half, which Andrew has now answered. The
>> other was the not understood issue with a later variant you had
>> tried.
>
> Right, there's the fact that this version (with read / write locking the
> whole function) works whereas just synchronizing hvmemul_cmpxchg() with
> a spin lock does not. It is not yet fully clear why (the part of the
> conversation at the very top of this email was an early attempt to
> elucidate it).

I've kept experimenting with the patch but can't quite figure out why
minimizing the lock scope to the writeback part would not be sufficient,
but it isn't.

I.e. with this code:

3824  writeback:
3825      ops->smp_lock(lock_prefix);
3826      switch ( dst.type )
3827      {
3828      case OP_REG:
3829          /* The 4-byte case *is* correct: in 64-bit mode we zero-extend. */
3830          switch ( dst.bytes )
3831          {
3832          case 1: *(uint8_t *)dst.reg = (uint8_t)dst.val; break;
3833          case 2: *(uint16_t *)dst.reg = (uint16_t)dst.val; break;
3834          case 4: *dst.reg = (uint32_t)dst.val; break; /* 64b: zero-ext */
3835          case 8: *dst.reg = dst.val; break;
3836          }
3837          break;
3838      case OP_MEM:
3839          if ( !(d & Mov) && (dst.orig_val == dst.val) &&
3840               !ctxt->force_writeback )
3841              /* nothing to do */;
3842          else if ( lock_prefix )
3843              rc = ops->cmpxchg(
3844                  dst.mem.seg, dst.mem.off, &dst.orig_val,
3845                  &dst.val, dst.bytes, ctxt);
3846          else
3847              rc = ops->write(
3848                  dst.mem.seg, dst.mem.off, &dst.val, dst.bytes, ctxt);
3849          if ( rc != 0 )
3850          {
3851              ops->smp_unlock(lock_prefix);
3852              goto done;
3853          }
3854      default:
3855          break;
3856      }
3857      ops->smp_unlock(lock_prefix);

I can still reproduce the guest hang. But if I lock at the very
beginning of x86_emulate() and unlock before each return, no more hangs.

I would appreciate any suggestions on how to go about trying to narrow
the scope of the lock, or figure out what subtlety is at work here.

Thanks,
Razvan
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 27/04/2016 07:25, Jan Beulich wrote:
>>>> On 26.04.16 at 19:39, wrote:
>> On 26/04/16 18:23, Razvan Cojocaru wrote:
>>> Regarding this version of the patch, Jan has asked for more information
>>> on the performance impact, but I'm not sure how to obtain it in a
>>> rigorous manner. If it is decided that a version of this patch is
>>> desirable, I can go on fixing the issues we've found and address the
>>> comments we've had so far and submit a new version.
>>
>> XenServer did performance testing. No observable impact for normal VM
>> workloads (which is to be expected, as an OS wouldn't normally LOCK the
>> instructions it uses for MMIO). The per-cpu rwlocks have ~0 overhead
>> when the lock isn't held for writing.
>
> So how about PV guests doing locked page table updates, or the
> impact on log-dirty mode?

As I said - no observable change.

Emulated pagetable writes are already a slowpath for PV guests, so
avoided (hypercalls are several times faster).

Logdirty doesn't directly cause any instructions to be emulated. For hap
guests, the EPT/NPT violation is resolved and the guest re-entered. For
PV guests and shadow hvm, we do get emulated pagetable writes, but how
many of those are actually locked? The majority in Xen at least are just
explicit-width mov's.

~Andrew
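[Editor's note] The reader/writer pattern Andrew refers to keeps the read side
nearly free when no writer is active. A much-simplified, single-counter sketch
(this is NOT Xen's percpu_rwlock, which avoids even this shared counter by
keeping per-CPU state; names are illustrative):

```c
#include <stdatomic.h>

/* >= 0: number of readers holding the lock; -1: a writer holds it. */
static atomic_int readers;

static void read_lock(void)
{
    for ( ;; )
    {
        int r = atomic_load(&readers);
        /* Fast path: one compare-exchange when no writer is active,
         * which is why ordinary (non-LOCKed) emulation sees ~0 overhead. */
        if ( r >= 0 && atomic_compare_exchange_weak(&readers, &r, r + 1) )
            return;
    }
}

static void read_unlock(void) { atomic_fetch_sub(&readers, 1); }

static void write_lock(void)
{
    int expected = 0;
    /* Wait until there are no readers and no writer, then claim. */
    while ( !atomic_compare_exchange_weak(&readers, &expected, -1) )
        expected = 0;
}

static void write_unlock(void) { atomic_store(&readers, 0); }
```

In the patch under discussion, the write side would only be taken for LOCKed
emulation, so the common emulation paths pay just the read-side cost.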
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 04/27/2016 09:22 AM, Jan Beulich wrote:
>>>> On 26.04.16 at 19:23, wrote:
>> On 04/26/16 19:03, George Dunlap wrote:
>>> On 19/04/16 17:35, Jan Beulich wrote:
>>>>>> Razvan Cojocaru 04/19/16 1:01 PM >>>
>>>>> I think this might be because the LOCK prefix should guarantee that
>>>>> the instruction that follows it has exclusive use of shared memory
>>>>> (for both reads and writes) but I might be misreading the docs:
>>>>
>>>> LOCK definitely has no effect on other than the instruction it gets
>>>> applied to.
>>>
>>> Sorry I wasn't involved in this discussion -- what was the conclusion
>>> here?
>>>
>>> FWIW Andy's suggestion of using a stub seemed like the most robust
>>> solution, if that could be made to work.
>>>
>>> If you're going to submit a patch substantially similar to this one,
>>> let me know so I can review the mm bits of the original patch.
>>
>> I'm not really sure.
>>
>> Regarding this version of the patch, Jan has asked for more information
>> on the performance impact, but I'm not sure how to obtain it in a
>> rigorous manner.
>
> That was only one half, which Andrew has now answered. The
> other was the not understood issue with a later variant you had
> tried.

Right, there's the fact that this version (with read / write locking the
whole function) works whereas just synchronizing hvmemul_cmpxchg() with a
spin lock does not. It is not yet fully clear why (the part of the
conversation at the very top of this email was an early attempt to
elucidate it).

Thanks,
Razvan
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
>>> On 26.04.16 at 19:39, wrote:
> On 26/04/16 18:23, Razvan Cojocaru wrote:
>> Regarding this version of the patch, Jan has asked for more information
>> on the performance impact, but I'm not sure how to obtain it in a
>> rigorous manner. If it is decided that a version of this patch is
>> desirable, I can go on fixing the issues we've found and address the
>> comments we've had so far and submit a new version.
>
> XenServer did performance testing. No observable impact for normal VM
> workloads (which is to be expected, as an OS wouldn't normally LOCK the
> instructions it uses for MMIO). The per-cpu rwlocks have ~0 overhead
> when the lock isn't held for writing.

So how about PV guests doing locked page table updates, or the
impact on log-dirty mode?

Jan
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
>>> On 26.04.16 at 19:23, wrote:
> On 04/26/16 19:03, George Dunlap wrote:
>> On 19/04/16 17:35, Jan Beulich wrote:
>>>>> Razvan Cojocaru 04/19/16 1:01 PM >>>
>>>> I think this might be because the LOCK prefix should guarantee that
>>>> the instruction that follows it has exclusive use of shared memory
>>>> (for both reads and writes) but I might be misreading the docs:
>>>
>>> LOCK definitely has no effect on other than the instruction it gets
>>> applied to.
>>
>> Sorry I wasn't involved in this discussion -- what was the conclusion
>> here?
>>
>> FWIW Andy's suggestion of using a stub seemed like the most robust
>> solution, if that could be made to work.
>>
>> If you're going to submit a patch substantially similar to this one,
>> let me know so I can review the mm bits of the original patch.
>
> I'm not really sure.
>
> Regarding this version of the patch, Jan has asked for more information
> on the performance impact, but I'm not sure how to obtain it in a
> rigorous manner.

That was only one half, which Andrew has now answered. The
other was the not understood issue with a later variant you had
tried.

Jan
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 26/04/16 18:23, Razvan Cojocaru wrote:
> On 04/26/16 19:03, George Dunlap wrote:
>> On 19/04/16 17:35, Jan Beulich wrote:
>>>>> Razvan Cojocaru 04/19/16 1:01 PM >>>
>>>> I think this might be because the LOCK prefix should guarantee that
>>>> the instruction that follows it has exclusive use of shared memory
>>>> (for both reads and writes) but I might be misreading the docs:
>>>
>>> LOCK definitely has no effect on other than the instruction it gets
>>> applied to.
>>
>> Sorry I wasn't involved in this discussion -- what was the conclusion
>> here?
>>
>> FWIW Andy's suggestion of using a stub seemed like the most robust
>> solution, if that could be made to work.
>>
>> If you're going to submit a patch substantially similar to this one,
>> let me know so I can review the mm bits of the original patch.
>
> I'm not really sure.
>
> Regarding this version of the patch, Jan has asked for more information
> on the performance impact, but I'm not sure how to obtain it in a
> rigorous manner. If it is decided that a version of this patch is
> desirable, I can go on fixing the issues we've found and address the
> comments we've had so far and submit a new version.

XenServer did performance testing. No observable impact for normal VM
workloads (which is to be expected, as an OS wouldn't normally LOCK the
instructions it uses for MMIO). The per-cpu rwlocks have ~0 overhead
when the lock isn't held for writing.

> I'm not familiar with what the stub solution would imply, so I'm afraid
> I can't comment on that. This is not code I've had that much contact
> with prior to stumbling into this problem.

As for the fix I suggested, it's probably prohibitive to fix the current
emulator, given the plans for a rewrite. (And on that note, I really
need to write a design doc and post to the list).

~Andrew
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 04/26/16 19:03, George Dunlap wrote:
> On 19/04/16 17:35, Jan Beulich wrote:
>>>> Razvan Cojocaru 04/19/16 1:01 PM >>>
>>> I think this might be because the LOCK prefix should guarantee that the
>>> instruction that follows it has exclusive use of shared memory (for
>>> both reads and writes) but I might be misreading the docs:
>>
>> LOCK definitely has no effect on other than the instruction it gets
>> applied to.
>
> Sorry I wasn't involved in this discussion -- what was the conclusion
> here?
>
> FWIW Andy's suggestion of using a stub seemed like the most robust
> solution, if that could be made to work.
>
> If you're going to submit a patch substantially similar to this one, let
> me know so I can review the mm bits of the original patch.

I'm not really sure.

Regarding this version of the patch, Jan has asked for more information
on the performance impact, but I'm not sure how to obtain it in a
rigorous manner. If it is decided that a version of this patch is
desirable, I can go on fixing the issues we've found and address the
comments we've had so far and submit a new version.

I'm not familiar with what the stub solution would imply, so I'm afraid
I can't comment on that. This is not code I've had that much contact
with prior to stumbling into this problem.

Thanks,
Razvan
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 19/04/16 17:35, Jan Beulich wrote: Razvan Cojocaru 04/19/16 1:01 PM >>> >> I think this might be because the LOCK prefix should guarantee that the >> instruction that follows it has exclusive use of shared memory (for both >> reads and writes) but I might be misreading the docs: > > LOCK definitely has no effect on other than the instruction it gets applied > to. Sorry I wasn't involved in this discussion -- what was the conclusion here? FWIW Andy's suggestion of using a stub seemed like the most robust solution, if that could be made to work. If you're going to submit a patch substantially similar to this one, let me know so I can review the mm bits of the original patch. -George
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
>>> Razvan Cojocaru04/19/16 1:01 PM >>> >I think this might be because the LOCK prefix should guarantee that the >instruction that follows it has exclusive use of shared memory (for both >reads and writes) but I might be misreading the docs: LOCK definitely has no effect on other than the instruction it gets applied to. Jan >(From the Intel SDM) "Causes the processor’s LOCK# signal to be asserted >during execution of the accompanying instruction (turns the instruction >into an atomic instruction). In a multiprocessor environment, the LOCK# >signal ensures that the processor has exclusive use of any shared memory >while the signal is asserted." > >Using a spin lock (or the rwlock in the manner described above) in >hvmemul_cmpxchg() only prevents two LOCKed instructions from running at >the same time, but presumably even non-LOCKed instructions just reading >that memory area should be prevented from running until the LOCKed >instruction is done (hence the guest starting up with the rwlock in >x86_emulate() but not with it locked only in hvmemul_cmpxchg()). > >Hopefully I haven't misunderstood the issue. > > >Thanks, >Razvan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 04/18/2016 07:45 PM, Jan Beulich wrote: Razvan Cojocaru04/18/16 2:40 PM >>> >> On 04/14/2016 07:08 PM, Jan Beulich wrote: >> Razvan Cojocaru 04/14/16 5:45 PM >>> On 04/14/2016 06:40 PM, Jan Beulich wrote: > To be honest, just having remembered that we do the write back for locked > instructions using CMPXCHG, I'd first of all like to see a proper > description > of "the _whole_ issue". I believe at least part of the issue has to do with the comment on line 1013 from xen/arch/x86/hvm/emulate.c: >>> >994 static int hvmemul_cmpxchg( >>> >995 enum x86_segment seg, >>> >996 unsigned long offset, >>> >997 void *p_old, >>> >998 void *p_new, >>> >999 unsigned int bytes, 1000 struct x86_emulate_ctxt *ctxt) 1001 { 1002 struct hvm_emulate_ctxt *hvmemul_ctxt = 1003 container_of(ctxt, struct hvm_emulate_ctxt, ctxt); 1004 1005 if ( unlikely(hvmemul_ctxt->set_context) ) 1006 { 1007 int rc = set_context_data(p_new, bytes); 1008 1009 if ( rc != X86EMUL_OKAY ) 1010 return rc; 1011 } 1012 1013 /* Fix this in case the guest is really relying on r-m-w atomicity. */ 1014 return hvmemul_write(seg, offset, p_new, bytes, ctxt); 1015 } >>> >>> Ah, so _that's_ where the problem wants to be fixed then (leaving - afaict - >>> PV emulation paths completely unaffected). >> >> That's what I had hoped too, an early version of this patch simply used >> a spinlock that locked / unlock on entry / exit of hvmemul_cmpxchg(). 
>> Even with this patch, I've just tried it again with all ops->smp_lock() >> / ops->smp_unlock() calls commented out in x86_emulate(), and >> hvmemul_cmpxchg() modified as follows: >> > >994 static int hvmemul_cmpxchg( > >995 enum x86_segment seg, > >996 unsigned long offset, > >997 void *p_old, > >998 void *p_new, > >999 unsigned int bytes, >> 1000 struct x86_emulate_ctxt *ctxt) >> 1001 { >> 1002 struct hvm_emulate_ctxt *hvmemul_ctxt = >> 1003 container_of(ctxt, struct hvm_emulate_ctxt, ctxt); >> 1004 int rc; >> 1005 >> 1006 emulate_smp_lock(1); >> 1007 >> 1008 if ( unlikely(hvmemul_ctxt->set_context) ) >> 1009 { >> 1010 int rc = set_context_data(p_new, bytes); >> 1011 >> 1012 if ( rc != X86EMUL_OKAY ) >> 1013 return rc; >> 1014 } >> 1015 >> 1016 /* Fix this in case the guest is really relying on r-m-w >> atomicity. */ >> 1017 rc = hvmemul_write(seg, offset, p_new, bytes, ctxt); >> 1018 >> 1019 emulate_smp_unlock(1); >> 1020 >> 1021 return rc; >> 1022 } >> >> Unfortunately, with just this the guest still hangs, while with read and >> write locking in x86_emulate() it does not. > > That's unexpected at least at the first glance, but justifying some variant of > the patch you have submitted would require understanding why the above > change isn't enough and can't be suitably extended to be sufficient. I think this might be because the LOCK prefix should guarantee that the instruction that follows it has exclusive use of shared memory (for both reads and writes) but I might be misreading the docs: (From the Intel SDM) "Causes the processor’s LOCK# signal to be asserted during execution of the accompanying instruction (turns the instruction into an atomic instruction). In a multiprocessor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted." 
Using a spin lock (or the rwlock in the manner described above) in hvmemul_cmpxchg() only prevents two LOCKed instructions from running at the same time, but presumably even non-LOCKed instructions that merely read that memory area should be prevented from running until the LOCKed instruction is done (hence the guest boots with the rwlock taken in x86_emulate(), but not when it is taken only in hvmemul_cmpxchg()). Hopefully I haven't misunderstood the issue. Thanks, Razvan
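As an aside, Jan's earlier point — that doing the write-back with a real CMPXCHG instead of a plain write would help — can be illustrated outside Xen. The following is a minimal sketch (all names invented, using the GCC `__atomic` builtins rather than any Xen API): a compare-and-swap against the value read earlier fails if another writer slipped in, so the emulator could detect the interference and retry, whereas a plain write silently loses the other update.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Succeeds only if *mem still holds the value the emulator read earlier. */
static bool cas_writeback(uint32_t *mem, uint32_t seen, uint32_t want)
{
    return __atomic_compare_exchange_n(mem, &seen, want, false,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

/* Re-enact an interfering write between the read and the write-back. */
static uint32_t demo(void)
{
    uint32_t mem = 5;
    uint32_t seen = mem;                /* emulator's read phase */

    mem = 7;                            /* another vCPU writes in between */

    assert(!cas_writeback(&mem, seen, seen + 1)); /* stale value: CAS fails */

    seen = mem;                         /* re-read and retry */
    assert(cas_writeback(&mem, seen, seen + 1));
    return mem;                         /* now 8 */
}
```

A plain `mem = seen + 1` at the point of the first write-back would instead have clobbered the intervening store — which is exactly the difference between hvmemul_write() and a genuine CMPXCHG-based write-back.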
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
>>> Razvan Cojocaru04/18/16 2:40 PM >>> >On 04/14/2016 07:08 PM, Jan Beulich wrote: > Razvan Cojocaru 04/14/16 5:45 PM >>> >>> On 04/14/2016 06:40 PM, Jan Beulich wrote: To be honest, just having remembered that we do the write back for locked instructions using CMPXCHG, I'd first of all like to see a proper description of "the _whole_ issue". >>> >>> I believe at least part of the issue has to do with the comment on line >>> 1013 from xen/arch/x86/hvm/emulate.c: >>> >> >994 static int hvmemul_cmpxchg( >> >995 enum x86_segment seg, >> >996 unsigned long offset, >> >997 void *p_old, >> >998 void *p_new, >> >999 unsigned int bytes, >>> 1000 struct x86_emulate_ctxt *ctxt) >>> 1001 { >>> 1002 struct hvm_emulate_ctxt *hvmemul_ctxt = >>> 1003 container_of(ctxt, struct hvm_emulate_ctxt, ctxt); >>> 1004 >>> 1005 if ( unlikely(hvmemul_ctxt->set_context) ) >>> 1006 { >>> 1007 int rc = set_context_data(p_new, bytes); >>> 1008 >>> 1009 if ( rc != X86EMUL_OKAY ) >>> 1010 return rc; >>> 1011 } >>> 1012 >>> 1013 /* Fix this in case the guest is really relying on r-m-w >>> atomicity. */ >>> 1014 return hvmemul_write(seg, offset, p_new, bytes, ctxt); >>> 1015 } >> >> Ah, so _that's_ where the problem wants to be fixed then (leaving - afaict - >> PV emulation paths completely unaffected). > >That's what I had hoped too, an early version of this patch simply used >a spinlock that locked / unlock on entry / exit of hvmemul_cmpxchg(). 
>Even with this patch, I've just tried it again with all ops->smp_lock() >/ ops->smp_unlock() calls commented out in x86_emulate(), and >hvmemul_cmpxchg() modified as follows: > >994 static int hvmemul_cmpxchg( >995 enum x86_segment seg, >996 unsigned long offset, >997 void *p_old, >998 void *p_new, >999 unsigned int bytes, >1000 struct x86_emulate_ctxt *ctxt) >1001 { >1002 struct hvm_emulate_ctxt *hvmemul_ctxt = >1003 container_of(ctxt, struct hvm_emulate_ctxt, ctxt); >1004 int rc; >1005 >1006 emulate_smp_lock(1); >1007 >1008 if ( unlikely(hvmemul_ctxt->set_context) ) >1009 { >1010 int rc = set_context_data(p_new, bytes); >1011 >1012 if ( rc != X86EMUL_OKAY ) >1013 return rc; >1014 } >1015 >1016 /* Fix this in case the guest is really relying on r-m-w >atomicity. */ >1017 rc = hvmemul_write(seg, offset, p_new, bytes, ctxt); >1018 >1019 emulate_smp_unlock(1); >1020 >1021 return rc; >1022 } > >Unfortunately, with just this the guest still hangs, while with read and >write locking in x86_emulate() it does not. That's unexpected at least at the first glance, but justifying some variant of the patch you have submitted would require understanding why the above change isn't enough and can't be suitably extended to be sufficient. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 04/14/2016 07:08 PM, Jan Beulich wrote: Razvan Cojocaru04/14/16 5:45 PM >>> >> On 04/14/2016 06:40 PM, Jan Beulich wrote: >>> To be honest, just having remembered that we do the write back for locked >>> instructions using CMPXCHG, I'd first of all like to see a proper >>> description >>> of "the _whole_ issue". >> >> I believe at least part of the issue has to do with the comment on line >> 1013 from xen/arch/x86/hvm/emulate.c: >> > >994 static int hvmemul_cmpxchg( > >995 enum x86_segment seg, > >996 unsigned long offset, > >997 void *p_old, > >998 void *p_new, > >999 unsigned int bytes, >> 1000 struct x86_emulate_ctxt *ctxt) >> 1001 { >> 1002 struct hvm_emulate_ctxt *hvmemul_ctxt = >> 1003 container_of(ctxt, struct hvm_emulate_ctxt, ctxt); >> 1004 >> 1005 if ( unlikely(hvmemul_ctxt->set_context) ) >> 1006 { >> 1007 int rc = set_context_data(p_new, bytes); >> 1008 >> 1009 if ( rc != X86EMUL_OKAY ) >> 1010 return rc; >> 1011 } >> 1012 >> 1013 /* Fix this in case the guest is really relying on r-m-w atomicity. >> */ >> 1014 return hvmemul_write(seg, offset, p_new, bytes, ctxt); >> 1015 } > > Ah, so _that's_ where the problem wants to be fixed then (leaving - afaict - > PV emulation paths completely unaffected). That's what I had hoped too, an early version of this patch simply used a spinlock that locked / unlock on entry / exit of hvmemul_cmpxchg(). 
Even with this patch, I've just tried it again with all ops->smp_lock() / ops->smp_unlock() calls commented out in x86_emulate(), and hvmemul_cmpxchg() modified as follows: 994 static int hvmemul_cmpxchg( 995 enum x86_segment seg, 996 unsigned long offset, 997 void *p_old, 998 void *p_new, 999 unsigned int bytes, 1000 struct x86_emulate_ctxt *ctxt) 1001 { 1002 struct hvm_emulate_ctxt *hvmemul_ctxt = 1003 container_of(ctxt, struct hvm_emulate_ctxt, ctxt); 1004 int rc; 1005 1006 emulate_smp_lock(1); 1007 1008 if ( unlikely(hvmemul_ctxt->set_context) ) 1009 { 1010 int rc = set_context_data(p_new, bytes); 1011 1012 if ( rc != X86EMUL_OKAY ) 1013 return rc; 1014 } 1015 1016 /* Fix this in case the guest is really relying on r-m-w atomicity. */ 1017 rc = hvmemul_write(seg, offset, p_new, bytes, ctxt); 1018 1019 emulate_smp_unlock(1); 1020 1021 return rc; 1022 } Unfortunately, with just this the guest still hangs, while with read and write locking in x86_emulate() it does not. Thanks, Razvan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
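The failure mode Razvan observes — serializing only hvmemul_cmpxchg() while the read phase happened earlier in x86_emulate(), outside any lock — can be re-enacted deterministically. This is a single-threaded sketch (names invented, not Xen code) of two vCPUs each emulating `lock incl (mem)` where only the write-backs are serialized:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Both "vCPUs" read before either takes the write-side lock, so even
 * perfectly serialized write-backs lose one of the two increments.
 */
static uint32_t lost_update_demo(void)
{
    uint32_t mem = 0;

    uint32_t vcpu1_seen = mem;   /* VCPU1: read phase (unlocked) */
    uint32_t vcpu2_seen = mem;   /* VCPU2: read phase (unlocked) */

    mem = vcpu2_seen + 1;        /* VCPU2: serialized write-back */
    mem = vcpu1_seen + 1;        /* VCPU1: serialized write-back */

    return mem;                  /* 1, not the expected 2 */
}
```

This is consistent with the guest hanging unless the lock covers the whole read-modify-write in x86_emulate(): locking only inside hvmemul_cmpxchg() leaves the read outside the critical section.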
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
>>> Razvan Cojocaru 04/14/16 6:00 PM >>> >On 04/14/2016 06:44 PM, Jan Beulich wrote: >> That's the performance effect on the hypervisor you talk about. But what >> about >> the guest, which all of the sudden gets another domain wide lock applied? > >Well, yes, there's bound to be some performance loss - but I think the >question is: is the added safety worth it? As always, there are >trade-offs. I am quite possibly missing something, but surely a slightly >slower, running guest is better than a fast unstable one. > >As for the patch, it's definitely perfectible, and I appreciate all >suggestions to make it the best it can be (or even to scrap it for a >better solution). That's the whole thing here: If what you have presented comes at a reasonable price, we may well take it with obvious issues fixed. Yet if the price is quite high, investing effort into finding a better solution is imo not just warranted, but necessary. Jan
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
>>> Andrew Cooper 04/14/16 5:46 PM >>> >Short of having the instruction emulator convert any locked instruction >into a stub, I can't think of a solution which works for both emulated >and non-emulated instructions. That's my understanding too. Jan
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
>>> Razvan Cojocaru04/14/16 5:45 PM >>> >On 04/14/2016 06:40 PM, Jan Beulich wrote: >> To be honest, just having remembered that we do the write back for locked >> instructions using CMPXCHG, I'd first of all like to see a proper description >> of "the _whole_ issue". > >I believe at least part of the issue has to do with the comment on line >1013 from xen/arch/x86/hvm/emulate.c: > >994 static int hvmemul_cmpxchg( >995 enum x86_segment seg, >996 unsigned long offset, >997 void *p_old, >998 void *p_new, >999 unsigned int bytes, >1000 struct x86_emulate_ctxt *ctxt) >1001 { >1002 struct hvm_emulate_ctxt *hvmemul_ctxt = >1003 container_of(ctxt, struct hvm_emulate_ctxt, ctxt); >1004 >1005 if ( unlikely(hvmemul_ctxt->set_context) ) >1006 { >1007 int rc = set_context_data(p_new, bytes); >1008 >1009 if ( rc != X86EMUL_OKAY ) >1010 return rc; >1011 } >1012 >1013 /* Fix this in case the guest is really relying on r-m-w atomicity. */ >1014 return hvmemul_write(seg, offset, p_new, bytes, ctxt); >1015 } Ah, so _that's_ where the problem wants to be fixed then (leaving - afaict - PV emulation paths completely unaffected). Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 04/14/2016 06:44 PM, Jan Beulich wrote: > Razvan Cojocaru04/14/16 7:57 AM >>> >> On 04/14/16 07:35, Jan Beulich wrote: >> Razvan Cojocaru 04/13/16 7:53 PM >>> @@ -1589,6 +1591,8 @@ x86_emulate( >>> >} >>> >done_prefixes: >>> > +ops->smp_lock(lock_prefix); + >>> >if ( rex_prefix & REX_W ) >>> >op_bytes = 8; >>> >>> Already from the context it is obvious that the lock can be taken later. >> >> I'll take it later. > > And that alone won't suffice: Instructions not touching memory shouldn't take > any lock at all. As shouldn't instructions that only read or only write > memory. > >>> Overall I can see why you want this, but what is the performance impact? >>> After >>> all you're basically making the emulator act like very old systems using a >>> bus >>> lock (i.e. a global one) instead of utilizing cacheline exclusive >>> ownership. Plus >>> you still don't (and cannot, afaict) deal with one LOCKed instruction >>> getting >>> emulated and another in parallel getting executed by the CPU without >>> emulation >>> (granted this shouldn't normally happen, but we also can't exclude that >>> case). >> >> I was under the impression that for reads, with the new percpu_rwlocks >> the performance impact is close to zero, with some performance impact >> for writes - but writes should, for one, be more infrequent, and then >> the added safety factor should make up for that. > > That's the performance effect on the hypervisor you talk about. But what about > the guest, which all of the sudden gets another domain wide lock applied? Well, yes, there's bound to be some performance loss - but I think the question is: is the added safety worth it? As always, there are trade-offs. I am quite possibly missing something, but surely a slightly slower, running guest is better than a fast unstable one. As for the patch, it's definitely perfectible, and I appreciate all suggestions to make it the best it can be (or even to scrape it for a better solution). 
Thanks, Razvan
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 14/04/16 16:40, Jan Beulich wrote: Razvan Cojocaru04/14/16 1:43 PM >>> >> On 04/14/2016 01:35 PM, David Vrabel wrote: >>> On 13/04/16 13:26, Razvan Cojocaru wrote: LOCK-prefixed instructions are currenly allowed to run in parallel in x86_emulate(), which can lead the guest into an undefined state. This patch fixes the issue. >>> Is this sufficient? What if another VCPU is running on another PCPU and >>> doing an unemulated LOCK-prefixed instruction to the same memory address? >>> >>> This other VCPU could be for another domain (or Xen for that matter). >> The patch is only sufficient for parallel runs of emulated instructions, >> as previously stated. It is, however, able to prevent nasty guest lockups. >> >> This is what happened in a previous thread where I was hunting down the >> issue and initially thought that the xen-access.c model was broken when >> used with emulation, and even proceeded to check that the ring buffer >> memory accesses are synchronized properly. They were alright, the >> problem was in fact LOCKed instruction emulation happening in parallel, >> i.e. a race condition there. >> >> This is less obvious if we signal that vm_event responses are available >> immediately after processing each one (which greatly reduces the chances >> of a race happening), and more obvious with guests that have 2 (or more) >> VCPUs where all of them are paused waiting for a vm_event reply, and all >> of them are woken up at the same time, after processing all of the >> events, and asked to emulate. >> >> I do believe that somewhere in Xen emulating in this manner could occur, >> so I hope to make emulation generally safer. >> >> As for not fixing the _whole_ issue, as Jan has rightly pointed out, >> that's a rather difficult thing to do. > To be honest, just having remembered that we do the write back for locked > instructions using CMPXCHG, I'd first of all like to see a proper description > of "the _whole_ issue". 
All emulated instructions with a lock prefix end up calling into hvmemul_cmpxchg() I suspect the issue is to do with the implementation of hvmemul_cmpxchg(), which contains a TODO from 2008 of /* Fix this in case the guest is really relying on r-m-w atomicity. */ which, amongst other things, never updates *p_old. Short of having the instruction emulator convert any locked instruction into a stub, I can't think of a solution which works for both emulated and non-emulated instructions. ~Andrew ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
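Andrew's observation that hvmemul_cmpxchg() "never updates *p_old" points at the contract a correct hook would need. Below is a hedged sketch of that contract (return codes and names invented; the real hook takes a segment/offset pair and returns X86EMUL_* codes, and the stub approach would execute the guest's own LOCKed instruction instead). The key detail is that on failure the hook reports the value actually found in memory, so the emulator can retry or raise ZF appropriately:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_OKAY  0   /* stand-in for X86EMUL_OKAY */
#define DEMO_RETRY 1   /* stand-in for a "CAS failed, retry" result */

static int demo_cmpxchg(uint32_t *mem, uint32_t *p_old, uint32_t p_new)
{
    uint32_t expected = *p_old;

    if ( __atomic_compare_exchange_n(mem, &expected, p_new, false,
                                     __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) )
        return DEMO_OKAY;

    *p_old = expected;  /* report the value actually seen, unlike the TODO */
    return DEMO_RETRY;
}
```

With this shape, an interleaved write by another vCPU surfaces as a failed compare-exchange rather than a silently lost update — though, as discussed above, it still doesn't serialize the read phase of two emulated r-m-w instructions by itself.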
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 04/14/2016 06:40 PM, Jan Beulich wrote: Razvan Cojocaru04/14/16 1:43 PM >>> >> On 04/14/2016 01:35 PM, David Vrabel wrote: >>> On 13/04/16 13:26, Razvan Cojocaru wrote: LOCK-prefixed instructions are currenly allowed to run in parallel in x86_emulate(), which can lead the guest into an undefined state. This patch fixes the issue. >>> >>> Is this sufficient? What if another VCPU is running on another PCPU and >>> doing an unemulated LOCK-prefixed instruction to the same memory address? >>> >>> This other VCPU could be for another domain (or Xen for that matter). >> >> The patch is only sufficient for parallel runs of emulated instructions, >> as previously stated. It is, however, able to prevent nasty guest lockups. >> >> This is what happened in a previous thread where I was hunting down the >> issue and initially thought that the xen-access.c model was broken when >> used with emulation, and even proceeded to check that the ring buffer >> memory accesses are synchronized properly. They were alright, the >> problem was in fact LOCKed instruction emulation happening in parallel, >> i.e. a race condition there. >> >> This is less obvious if we signal that vm_event responses are available >> immediately after processing each one (which greatly reduces the chances >> of a race happening), and more obvious with guests that have 2 (or more) >> VCPUs where all of them are paused waiting for a vm_event reply, and all >> of them are woken up at the same time, after processing all of the >> events, and asked to emulate. >> >> I do believe that somewhere in Xen emulating in this manner could occur, >> so I hope to make emulation generally safer. >> >> As for not fixing the _whole_ issue, as Jan has rightly pointed out, >> that's a rather difficult thing to do. > > To be honest, just having remembered that we do the write back for locked > instructions using CMPXCHG, I'd first of all like to see a proper description > of "the _whole_ issue". 
I believe at least part of the issue has to do with the comment on line 1013 from xen/arch/x86/hvm/emulate.c: 994 static int hvmemul_cmpxchg( 995 enum x86_segment seg, 996 unsigned long offset, 997 void *p_old, 998 void *p_new, 999 unsigned int bytes, 1000 struct x86_emulate_ctxt *ctxt) 1001 { 1002 struct hvm_emulate_ctxt *hvmemul_ctxt = 1003 container_of(ctxt, struct hvm_emulate_ctxt, ctxt); 1004 1005 if ( unlikely(hvmemul_ctxt->set_context) ) 1006 { 1007 int rc = set_context_data(p_new, bytes); 1008 1009 if ( rc != X86EMUL_OKAY ) 1010 return rc; 1011 } 1012 1013 /* Fix this in case the guest is really relying on r-m-w atomicity. */ 1014 return hvmemul_write(seg, offset, p_new, bytes, ctxt); 1015 } Thanks, Razvan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
Razvan Cojocaru04/14/16 7:57 AM >>> >On 04/14/16 07:35, Jan Beulich wrote: > Razvan Cojocaru 04/13/16 7:53 PM >>> >>> @@ -1589,6 +1591,8 @@ x86_emulate( >> >} >> >done_prefixes: >> > >>> +ops->smp_lock(lock_prefix); >>> + >> >if ( rex_prefix & REX_W ) >> >op_bytes = 8; >> >> Already from the context it is obvious that the lock can be taken later. > >I'll take it later. And that alone won't suffice: Instructions not touching memory shouldn't take any lock at all. As shouldn't instructions that only read or only write memory. >> Overall I can see why you want this, but what is the performance impact? >> After >> all you're basically making the emulator act like very old systems using a >> bus >> lock (i.e. a global one) instead of utilizing cacheline exclusive ownership. >> Plus >> you still don't (and cannot, afaict) deal with one LOCKed instruction getting >> emulated and another in parallel getting executed by the CPU without >> emulation >> (granted this shouldn't normally happen, but we also can't exclude that >> case). > >I was under the impression that for reads, with the new percpu_rwlocks >the performance impact is close to zero, with some performance impact >for writes - but writes should, for one, be more infrequent, and then >the added safety factor should make up for that. That's the performance effect on the hypervisor you talk about. But what about the guest, which all of the sudden gets another domain wide lock applied? Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 04/14/2016 06:31 PM, Jan Beulich wrote: Razvan Cojocaru04/14/16 10:50 AM >>> >> On 04/14/2016 07:35 AM, Jan Beulich wrote: --- a/xen/arch/x86/hvm/emulate.c > +++ b/xen/arch/x86/hvm/emulate.c > @@ -25,6 +25,8 @@ >>> >#include >>> >#include >>> > > +DEFINE_PERCPU_RWLOCK_GLOBAL(emulate_locked_rwlock); >>> You should try hard to make this static. >> >> On second though, this would make the code somewhat more convoluted, as >> the functions in emulate.c need to access this variable, and so does the >> code in domain.c that initializes the lock when the domain is created. >> >> What I've done is similar to what I found in the current source code with: >> >> arch/x86/mm/p2m.c:DEFINE_PERCPU_RWLOCK_GLOBAL(p2m_percpu_rwlock); >> common/grant_table.c:DEFINE_PERCPU_RWLOCK_GLOBAL(grant_rwlock); > > Well, that's why I said "you should try hard" instead of "you have to". > >> But I could add a function to emulate.h, for example: >> >> void init_emulate_smp_lock(); > > Exactly (just that this is perhaps again the wrong header, considering you > also need this for PV emulation). I understand, I'll look for a more suitable place for the function declarations and definitions then (unless a specific place is requested). Thanks, Razvan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
>>> Razvan Cojocaru04/14/16 1:43 PM >>> >On 04/14/2016 01:35 PM, David Vrabel wrote: >> On 13/04/16 13:26, Razvan Cojocaru wrote: >>> LOCK-prefixed instructions are currenly allowed to run in parallel >>> in x86_emulate(), which can lead the guest into an undefined state. >>> This patch fixes the issue. >> >> Is this sufficient? What if another VCPU is running on another PCPU and >> doing an unemulated LOCK-prefixed instruction to the same memory address? >> >> This other VCPU could be for another domain (or Xen for that matter). > >The patch is only sufficient for parallel runs of emulated instructions, >as previously stated. It is, however, able to prevent nasty guest lockups. > >This is what happened in a previous thread where I was hunting down the >issue and initially thought that the xen-access.c model was broken when >used with emulation, and even proceeded to check that the ring buffer >memory accesses are synchronized properly. They were alright, the >problem was in fact LOCKed instruction emulation happening in parallel, >i.e. a race condition there. > >This is less obvious if we signal that vm_event responses are available >immediately after processing each one (which greatly reduces the chances >of a race happening), and more obvious with guests that have 2 (or more) >VCPUs where all of them are paused waiting for a vm_event reply, and all >of them are woken up at the same time, after processing all of the >events, and asked to emulate. > >I do believe that somewhere in Xen emulating in this manner could occur, >so I hope to make emulation generally safer. > >As for not fixing the _whole_ issue, as Jan has rightly pointed out, >that's a rather difficult thing to do. To be honest, just having remembered that we do the write back for locked instructions using CMPXCHG, I'd first of all like to see a proper description of "the _whole_ issue". Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
>>> Razvan Cojocaru04/14/16 11:08 AM >>> >On 04/14/2016 08:56 AM, Razvan Cojocaru wrote: --- a/xen/arch/x86/mm.c >> +++ b/xen/arch/x86/mm.c >> @@ -112,6 +112,7 @@ >>> > >#include >>> > >#include >>> > >#include >> +#include >>> > >>> > This header shouldn't be included here. You need to move the declarations >>> > elsewhere for them to be usable here. >> Understood. Is there any place particularly fitting you could suggest? > >Actually, I've removed that #include and the code continues to compile, >so it would appear that it is somehow #included indirectly, which makes >the declarations accessible implicitly. But - see my other reply just sent - it is _still_ the wrong header then. Stuff shared between PV and HVM doesn't belong in HVM specific files. There are a few counterexamples, but we shouldn't add more. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
>>> Razvan Cojocaru04/14/16 10:50 AM >>> >On 04/14/2016 07:35 AM, Jan Beulich wrote: >>> --- a/xen/arch/x86/hvm/emulate.c >>> >+++ b/xen/arch/x86/hvm/emulate.c >>> >@@ -25,6 +25,8 @@ >> >#include >> >#include >> > >>> >+DEFINE_PERCPU_RWLOCK_GLOBAL(emulate_locked_rwlock); >> You should try hard to make this static. > >On second though, this would make the code somewhat more convoluted, as >the functions in emulate.c need to access this variable, and so does the >code in domain.c that initializes the lock when the domain is created. > >What I've done is similar to what I found in the current source code with: > >arch/x86/mm/p2m.c:DEFINE_PERCPU_RWLOCK_GLOBAL(p2m_percpu_rwlock); >common/grant_table.c:DEFINE_PERCPU_RWLOCK_GLOBAL(grant_rwlock); Well, that's why I said "you should try hard" instead of "you have to". >But I could add a function to emulate.h, for example: > >void init_emulate_smp_lock(); Exactly (just that this is perhaps again the wrong header, considering you also need this for PV emulation). >and call that from arch/x86/domain.c (as suggested by Andrew Cooper), >which would allow domain.c to stop referencing emulate_locked_rwlock >directly. Right, and to be honest I can't really see how my earlier comment could have been taken for "add an ARM stub". Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 04/14/2016 01:35 PM, David Vrabel wrote: > On 13/04/16 13:26, Razvan Cojocaru wrote: >> LOCK-prefixed instructions are currenly allowed to run in parallel >> in x86_emulate(), which can lead the guest into an undefined state. >> This patch fixes the issue. > > Is this sufficient? What if another VCPU is running on another PCPU and > doing an unemulated LOCK-prefixed instruction to the same memory address? > > This other VCPU could be for another domain (or Xen for that matter). The patch is only sufficient for parallel runs of emulated instructions, as previously stated. It is, however, able to prevent nasty guest lockups. This is what happened in a previous thread where I was hunting down the issue and initially thought that the xen-access.c model was broken when used with emulation, and even proceeded to check that the ring buffer memory accesses are synchronized properly. They were alright, the problem was in fact LOCKed instruction emulation happening in parallel, i.e. a race condition there. This is less obvious if we signal that vm_event responses are available immediately after processing each one (which greatly reduces the chances of a race happening), and more obvious with guests that have 2 (or more) VCPUs where all of them are paused waiting for a vm_event reply, and all of them are woken up at the same time, after processing all of the events, and asked to emulate. I do believe that somewhere in Xen emulating in this manner could occur, so I hope to make emulation generally safer. As for not fixing the _whole_ issue, as Jan has rightly pointed out, that's a rather difficult thing to do. I will add a comment in the V2 of the patch to clearly state the limitations of the patch, as well as more information about how the patch proposes to fix the issue described (as requested by Jan Beulich). Thanks, Razvan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 13/04/16 13:26, Razvan Cojocaru wrote: > LOCK-prefixed instructions are currently allowed to run in parallel > in x86_emulate(), which can lead the guest into an undefined state. > This patch fixes the issue. Is this sufficient? What if another VCPU is running on another PCPU and doing an unemulated LOCK-prefixed instruction to the same memory address? This other VCPU could be for another domain (or Xen for that matter). David
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 04/14/2016 08:56 AM, Razvan Cojocaru wrote:
>>> --- a/xen/arch/x86/mm.c
>>> +++ b/xen/arch/x86/mm.c
>>> @@ -112,6 +112,7 @@
>>> #include
>>> #include
>>> #include
>>> +#include
>>
>> This header shouldn't be included here. You need to move the declarations
>> elsewhere for them to be usable here.
>
> Understood. Is there any place particularly fitting you could suggest?

Actually, I've removed that #include and the code continues to compile, so it would appear that it is somehow #included indirectly, which makes the declarations accessible implicitly.

Thanks,
Razvan
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 04/14/2016 07:35 AM, Jan Beulich wrote:
>> --- a/xen/arch/x86/hvm/emulate.c
>> +++ b/xen/arch/x86/hvm/emulate.c
>> @@ -25,6 +25,8 @@
>> #include
>> #include
>>
>> +DEFINE_PERCPU_RWLOCK_GLOBAL(emulate_locked_rwlock);
>
> You should try hard to make this static.

On second thought, this would make the code somewhat more convoluted, as the functions in emulate.c need to access this variable, and so does the code in domain.c that initializes the lock when the domain is created. What I've done is similar to what I found in the current source code with:

arch/x86/mm/p2m.c:DEFINE_PERCPU_RWLOCK_GLOBAL(p2m_percpu_rwlock);
common/grant_table.c:DEFINE_PERCPU_RWLOCK_GLOBAL(grant_rwlock);

But I could add a function to emulate.h, for example:

void init_emulate_smp_lock(void);

and call that from arch/x86/domain.c (as suggested by Andrew Cooper), which would allow domain.c to stop referencing emulate_locked_rwlock directly. Would you prefer this?

Thanks,
Razvan
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 04/14/2016 11:18 AM, Juergen Gross wrote:
> On 14/04/16 10:01, Andrew Cooper wrote:
>> On 14/04/2016 08:46, Juergen Gross wrote:
>>> On 14/04/16 08:31, Razvan Cojocaru wrote:
>>>> On 04/14/16 09:09, Juergen Gross wrote:
>>>>> On 14/04/16 07:56, Razvan Cojocaru wrote:
>>>>>> This indeed doesn't guard against LOCKed instructions being run in
>>>>>> parallel with and without emulation, however that is a case that should
>>>>>> almost never occur - at least not with introspection, where currently
>>>>>> all emulation happens as a result of EPT faults - so either all
>>>>>> instructions hitting a restricted page are emulated, or all are run
>>>>>> directly. As long as all emulation can safely run in parallel and all
>>>>>> parallel non-emulation is also safe, it should be alright. But, yes,
>>>>>> this patch doesn't cover the case you're mentioning.
>>>>> What about grant pages? There could be parallel accesses from different
>>>>> domains, one being introspected, the other not.
>>>> I'm not familiar with the code there, but the main issue is, I think,
>>>> LOCKed instructions that access (read / write) the same memory area - as
>>>> long as that doesn't happen, it should be fine, which may be the reason
>>>> why it hasn't caused problems so far.
>>> Depends on the guest, I suppose. :-)
>>>
>>> I've been bitten by this before in my former position: we had a custom
>>> pv-driver in dom0 which wasn't using LOCKed instructions accessing a
>>> grant page. Reason was dom0 had one vcpu only and the Linux kernel
>>> patched all LOCKs away as it didn't deem them being necessary. This
>>> resulted in a very hard to debug communication failure between domU
>>> and dom0.
>>>
>>>> While not perfect, I believe that the added safety is worth the small
>>>> performance impact for writes. I feel that going from unsafe parallel
>>>> emulation to safe parallel emulation is a good step to take, at least
>>>> until the problem can be fixed completely by more complex measures.
>>> I'm fine with you saying for your use case the solution is good enough.
>>> Just wanted to point out a possible problem. This might not happen
>>> for most guest types, but you can't be sure. :-)
>>
>> But accesses into a mapped grant don't trap for emulation. Why would
>> locks here be any different to usual?
>
> With memory introspection switched on they will trap, won't they?

Only write or execute instructions referencing a handful of an HVM guest's pages trap for emulation with our introspection application, otherwise performance would be terrible - we don't trap all instructions for emulation.

Thanks,
Razvan
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 14/04/16 10:01, Andrew Cooper wrote:
> On 14/04/2016 08:46, Juergen Gross wrote:
>> On 14/04/16 08:31, Razvan Cojocaru wrote:
>>> On 04/14/16 09:09, Juergen Gross wrote:
>>>> On 14/04/16 07:56, Razvan Cojocaru wrote:
>>>>> This indeed doesn't guard against LOCKed instructions being run in
>>>>> parallel with and without emulation, however that is a case that should
>>>>> almost never occur - at least not with introspection, where currently
>>>>> all emulation happens as a result of EPT faults - so either all
>>>>> instructions hitting a restricted page are emulated, or all are run
>>>>> directly. As long as all emulation can safely run in parallel and all
>>>>> parallel non-emulation is also safe, it should be alright. But, yes,
>>>>> this patch doesn't cover the case you're mentioning.
>>>> What about grant pages? There could be parallel accesses from different
>>>> domains, one being introspected, the other not.
>>> I'm not familiar with the code there, but the main issue is, I think,
>>> LOCKed instructions that access (read / write) the same memory area - as
>>> long as that doesn't happen, it should be fine, which may be the reason
>>> why it hasn't caused problems so far.
>> Depends on the guest, I suppose. :-)
>>
>> I've been bitten by this before in my former position: we had a custom
>> pv-driver in dom0 which wasn't using LOCKed instructions accessing a
>> grant page. Reason was dom0 had one vcpu only and the Linux kernel
>> patched all LOCKs away as it didn't deem them being necessary. This
>> resulted in a very hard to debug communication failure between domU
>> and dom0.
>>
>>> While not perfect, I believe that the added safety is worth the small
>>> performance impact for writes. I feel that going from unsafe parallel
>>> emulation to safe parallel emulation is a good step to take, at least
>>> until the problem can be fixed completely by more complex measures.
>> I'm fine with you saying for your use case the solution is good enough.
>>
>> Just wanted to point out a possible problem.
>> This might not happen for most guest types, but you can't be sure. :-)
>
> But accesses into a mapped grant don't trap for emulation. Why would
> locks here be any different to usual?

With memory introspection switched on they will trap, won't they?

Juergen
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 04/14/2016 11:07 AM, Andrew Cooper wrote:
> On 14/04/2016 06:56, Razvan Cojocaru wrote:
>>>> --- a/xen/common/domain.c
>>>> +++ b/xen/common/domain.c
>>>> @@ -272,6 +272,8 @@ struct domain *domain_create(domid_t domid, unsigned int domcr_flags,
>>>>
>>>> TRACE_1D(TRC_DOM0_DOM_ADD, d->domain_id);
>>>>
>>>> +percpu_rwlock_resource_init(&d->arch.emulate_lock, emulate_locked_rwlock);
>>> I cannot see how this would build on ARM.
>> I'll add separate functions for ARM and x86, with a no-op for ARM.
>
> Please move this line into arch_domain_create() in arch/x86/domain.c
>
> Strictly speaking, common code should never reference ->arch. The bits
> of ->arch which common code does touch should move up into common struct
> domain.

Of course, my bad. Will move it.

Thanks,
Razvan
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 14/04/2016 06:56, Razvan Cojocaru wrote:
>>> --- a/xen/common/domain.c
>>> +++ b/xen/common/domain.c
>>> @@ -272,6 +272,8 @@ struct domain *domain_create(domid_t domid, unsigned int domcr_flags,
>>>
>>> TRACE_1D(TRC_DOM0_DOM_ADD, d->domain_id);
>>>
>>> +percpu_rwlock_resource_init(&d->arch.emulate_lock, emulate_locked_rwlock);
>> I cannot see how this would build on ARM.
> I'll add separate functions for ARM and x86, with a no-op for ARM.

Please move this line into arch_domain_create() in arch/x86/domain.c

Strictly speaking, common code should never reference ->arch. The bits of ->arch which common code does touch should move up into common struct domain.

~Andrew
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 14/04/2016 08:46, Juergen Gross wrote:
> On 14/04/16 08:31, Razvan Cojocaru wrote:
>> On 04/14/16 09:09, Juergen Gross wrote:
>>> On 14/04/16 07:56, Razvan Cojocaru wrote:
>>>> This indeed doesn't guard against LOCKed instructions being run in
>>>> parallel with and without emulation, however that is a case that should
>>>> almost never occur - at least not with introspection, where currently
>>>> all emulation happens as a result of EPT faults - so either all
>>>> instructions hitting a restricted page are emulated, or all are run
>>>> directly. As long as all emulation can safely run in parallel and all
>>>> parallel non-emulation is also safe, it should be alright. But, yes,
>>>> this patch doesn't cover the case you're mentioning.
>>> What about grant pages? There could be parallel accesses from different
>>> domains, one being introspected, the other not.
>> I'm not familiar with the code there, but the main issue is, I think,
>> LOCKed instructions that access (read / write) the same memory area - as
>> long as that doesn't happen, it should be fine, which may be the reason
>> why it hasn't caused problems so far.
> Depends on the guest, I suppose. :-)
>
> I've been bitten by this before in my former position: we had a custom
> pv-driver in dom0 which wasn't using LOCKed instructions accessing a
> grant page. Reason was dom0 had one vcpu only and the Linux kernel
> patched all LOCKs away as it didn't deem them being necessary. This
> resulted in a very hard to debug communication failure between domU
> and dom0.
>
>> While not perfect, I believe that the added safety is worth the small
>> performance impact for writes. I feel that going from unsafe parallel
>> emulation to safe parallel emulation is a good step to take, at least
>> until the problem can be fixed completely by more complex measures.
> I'm fine with you saying for your use case the solution is good enough.
>
> Just wanted to point out a possible problem. This might not happen
> for most guest types, but you can't be sure.
> :-)

But accesses into a mapped grant don't trap for emulation. Why would locks here be any different to usual?

~Andrew
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 14/04/16 08:31, Razvan Cojocaru wrote:
> On 04/14/16 09:09, Juergen Gross wrote:
>> On 14/04/16 07:56, Razvan Cojocaru wrote:
>>> This indeed doesn't guard against LOCKed instructions being run in
>>> parallel with and without emulation, however that is a case that should
>>> almost never occur - at least not with introspection, where currently
>>> all emulation happens as a result of EPT faults - so either all
>>> instructions hitting a restricted page are emulated, or all are run
>>> directly. As long as all emulation can safely run in parallel and all
>>> parallel non-emulation is also safe, it should be alright. But, yes,
>>> this patch doesn't cover the case you're mentioning.
>>
>> What about grant pages? There could be parallel accesses from different
>> domains, one being introspected, the other not.
>
> I'm not familiar with the code there, but the main issue is, I think,
> LOCKed instructions that access (read / write) the same memory area - as
> long as that doesn't happen, it should be fine, which may be the reason
> why it hasn't caused problems so far.

Depends on the guest, I suppose. :-)

I've been bitten by this before in my former position: we had a custom pv-driver in dom0 which wasn't using LOCKed instructions accessing a grant page. Reason was dom0 had one vcpu only and the Linux kernel patched all LOCKs away as it didn't deem them being necessary. This resulted in a very hard to debug communication failure between domU and dom0.

> While not perfect, I believe that the added safety is worth the small
> performance impact for writes. I feel that going from unsafe parallel
> emulation to safe parallel emulation is a good step to take, at least
> until the problem can be fixed completely by more complex measures.

I'm fine with you saying for your use case the solution is good enough.

Just wanted to point out a possible problem. This might not happen for most guest types, but you can't be sure.
:-)

Juergen
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 04/14/16 09:09, Juergen Gross wrote:
> On 14/04/16 07:56, Razvan Cojocaru wrote:
>> This indeed doesn't guard against LOCKed instructions being run in
>> parallel with and without emulation, however that is a case that should
>> almost never occur - at least not with introspection, where currently
>> all emulation happens as a result of EPT faults - so either all
>> instructions hitting a restricted page are emulated, or all are run
>> directly. As long as all emulation can safely run in parallel and all
>> parallel non-emulation is also safe, it should be alright. But, yes,
>> this patch doesn't cover the case you're mentioning.
>
> What about grant pages? There could be parallel accesses from different
> domains, one being introspected, the other not.

I'm not familiar with the code there, but the main issue is, I think, LOCKed instructions that access (read / write) the same memory area - as long as that doesn't happen, it should be fine, which may be the reason why it hasn't caused problems so far.

While not perfect, I believe that the added safety is worth the small performance impact for writes. I feel that going from unsafe parallel emulation to safe parallel emulation is a good step to take, at least until the problem can be fixed completely by more complex measures.

Thanks,
Razvan
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 14/04/16 07:56, Razvan Cojocaru wrote:
> This indeed doesn't guard against LOCKed instructions being run in
> parallel with and without emulation, however that is a case that should
> almost never occur - at least not with introspection, where currently
> all emulation happens as a result of EPT faults - so either all
> instructions hitting a restricted page are emulated, or all are run
> directly. As long as all emulation can safely run in parallel and all
> parallel non-emulation is also safe, it should be alright. But, yes,
> this patch doesn't cover the case you're mentioning.

What about grant pages? There could be parallel accesses from different domains, one being introspected, the other not.

Juergen
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
On 04/14/16 07:35, Jan Beulich wrote:
>>>> Razvan Cojocaru 04/13/16 7:53 PM >>>
>> LOCK-prefixed instructions are currently allowed to run in parallel
>> in x86_emulate(), which can lead the guest into an undefined state.
>> This patch fixes the issue.
>
> ... by ... (read: Too brief a description)

I'll make it more detailed.

>> --- a/xen/arch/x86/hvm/emulate.c
>> +++ b/xen/arch/x86/hvm/emulate.c
>> @@ -25,6 +25,8 @@
>> #include
>> #include
>>
>> +DEFINE_PERCPU_RWLOCK_GLOBAL(emulate_locked_rwlock);
>
> You should try hard to make this static.

I'll try.

>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -112,6 +112,7 @@
>> #include
>> #include
>> #include
>> +#include
>
> This header shouldn't be included here. You need to move the declarations
> elsewhere for them to be usable here.

Understood. Is there any place particularly fitting you could suggest? Thanks!

>> @@ -1589,6 +1591,8 @@ x86_emulate(
>> }
>> done_prefixes:
>>
>> +ops->smp_lock(lock_prefix);
>> +
>> if ( rex_prefix & REX_W )
>> op_bytes = 8;
>
> Already from the context it is obvious that the lock can be taken later.

I'll take it later.

>> @@ -2380,7 +2390,12 @@ x86_emulate(
>> }
>> /* Write back the memory destination with implicit LOCK prefix. */
>> dst.val = src.val;
>> -lock_prefix = 1;
>> +if ( !lock_prefix )
>> +{
>> +ops->smp_unlock(lock_prefix);
>> +lock_prefix = 1;
>> +ops->smp_lock(lock_prefix);
>> +}
>
> This can't be correct: You need to take the write lock _before_ reading the
> memory location.

Right.

>> @@ -3859,6 +3874,9 @@ x86_emulate(
>> done:
>> _put_fpu();
>> put_stub(stub);
>> +
>> +ops->smp_unlock(lock_prefix);
>
> And just like above from the context alone it is clear that the unlock can
> happen earlier. Locked regions should always be as small as possible.

I'll move it up. I was just following returns.
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -272,6 +272,8 @@ struct domain *domain_create(domid_t domid, unsigned int domcr_flags,
>>
>> TRACE_1D(TRC_DOM0_DOM_ADD, d->domain_id);
>>
>> +percpu_rwlock_resource_init(&d->arch.emulate_lock, emulate_locked_rwlock);
>
> I cannot see how this would build on ARM.

I'll add separate functions for ARM and x86, with a no-op for ARM.

> Overall I can see why you want this, but what is the performance impact? After
> all you're basically making the emulator act like very old systems using a bus
> lock (i.e. a global one) instead of utilizing cacheline exclusive ownership. Plus
> you still don't (and cannot, afaict) deal with one LOCKed instruction getting
> emulated and another in parallel getting executed by the CPU without emulation
> (granted this shouldn't normally happen, but we also can't exclude that case).

I was under the impression that for reads, with the new percpu_rwlocks, the performance impact is close to zero, with some performance impact for writes - but writes should, for one, be more infrequent, and then the added safety factor should make up for that.

This indeed doesn't guard against LOCKed instructions being run in parallel with and without emulation, however that is a case that should almost never occur - at least not with introspection, where currently all emulation happens as a result of EPT faults - so either all instructions hitting a restricted page are emulated, or all are run directly. As long as all emulation can safely run in parallel and all parallel non-emulation is also safe, it should be alright. But, yes, this patch doesn't cover the case you're mentioning.

> With the above I'm also not convinced this is an issue that absolutely needs to
> be addressed in 4.7 - it's not a regression, and I suppose not a problem for
> guests that aren't under introspection.

The issue can also affect the pagetable and mmio hooks.
Since Xen does emulate outside of introspection, I thought that this has some immediate importance. But obviously you know more about the code, so if you prefer I can drop the "for-4.7" tag.

Thanks,
Razvan
Re: [Xen-devel] [for-4.7] x86/emulate: synchronize LOCKed instruction emulation
>>> Razvan Cojocaru 04/13/16 7:53 PM >>>
> LOCK-prefixed instructions are currently allowed to run in parallel
> in x86_emulate(), which can lead the guest into an undefined state.
> This patch fixes the issue.

... by ... (read: Too brief a description)

> --- a/xen/arch/x86/hvm/emulate.c
> +++ b/xen/arch/x86/hvm/emulate.c
> @@ -25,6 +25,8 @@
> #include
> #include
>
> +DEFINE_PERCPU_RWLOCK_GLOBAL(emulate_locked_rwlock);

You should try hard to make this static.

> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -112,6 +112,7 @@
> #include
> #include
> #include
> +#include

This header shouldn't be included here. You need to move the declarations elsewhere for them to be usable here.

> @@ -1589,6 +1591,8 @@ x86_emulate(
> }
> done_prefixes:
>
> +ops->smp_lock(lock_prefix);
> +
> if ( rex_prefix & REX_W )
> op_bytes = 8;

Already from the context it is obvious that the lock can be taken later.

> @@ -2380,7 +2390,12 @@ x86_emulate(
> }
> /* Write back the memory destination with implicit LOCK prefix. */
> dst.val = src.val;
> -lock_prefix = 1;
> +if ( !lock_prefix )
> +{
> +ops->smp_unlock(lock_prefix);
> +lock_prefix = 1;
> +ops->smp_lock(lock_prefix);
> +}

This can't be correct: You need to take the write lock _before_ reading the memory location.

> @@ -3859,6 +3874,9 @@ x86_emulate(
> done:
> _put_fpu();
> put_stub(stub);
> +
> +ops->smp_unlock(lock_prefix);

And just like above from the context alone it is clear that the unlock can happen earlier. Locked regions should always be as small as possible.

> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -272,6 +272,8 @@ struct domain *domain_create(domid_t domid, unsigned int domcr_flags,
>
> TRACE_1D(TRC_DOM0_DOM_ADD, d->domain_id);
>
> +percpu_rwlock_resource_init(&d->arch.emulate_lock, emulate_locked_rwlock);

I cannot see how this would build on ARM.

Overall I can see why you want this, but what is the performance impact? After all you're basically making the emulator act like very old systems using a bus lock (i.e.
a global one) instead of utilizing cacheline exclusive ownership. Plus you still don't (and cannot, afaict) deal with one LOCKed instruction getting emulated and another in parallel getting executed by the CPU without emulation (granted this shouldn't normally happen, but we also can't exclude that case).

With the above I'm also not convinced this is an issue that absolutely needs to be addressed in 4.7 - it's not a regression, and I suppose not a problem for guests that aren't under introspection.

Jan