Re: Memory ordering question in the shutdown deferral code
(+ Xen-devel) Sorry I forgot to CC xen-devel. On 21/09/2020 12:38, Julien Grall wrote: Hi all, I have started to look at the deferral code (see vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and Arm will soon use it. The current implementation is using an smp_mb() to ensure ordering between a write then a read. The code looks roughly (I have slightly adapted it to make my question more obvious): domain_shutdown() d->is_shutting_down = 1; smp_mb(); if ( !vcpu0->defer_shutdown ) { vcpu_pause_nosync(v); v->paused_for_shutdown = 1; } vcpu_start_shutdown_deferral() vcpu0->defer_shutdown = 1; smp_mb(); if ( unlikely(d->is_shutting_down) ) vcpu_check_shutdown(v); return vcpu0->defer_shutdown; smp_mb() should only guarantee ordering (this may be stronger on some arch), so I think there is a race between the two functions. It would be possible to pause the vCPU in domain_shutdown() because vcpu0->defer_shutdown wasn't yet seen. Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down and therefore Xen may continue to send the I/O. Yet the vCPU will be paused so the I/O will never complete. I am not fully familiar with the IOREQ code, but it sounds to me this is not the behavior that was intended. Can someone more familiar with the code confirm it? Cheers, -- Julien Grall
RE: Memory ordering question in the shutdown deferral code
> -Original Message- > From: Julien Grall > Sent: 21 September 2020 12:41 > To: Jan Beulich ; Stefano Stabellini > ; > andrew.coop...@citrix.com; George Dunlap ; Durrant, > Paul > > Cc: Xia, Hongyan ; xen-devel@lists.xenproject.org > Subject: RE: [EXTERNAL] Memory ordering question in the shutdown deferral code > > CAUTION: This email originated from outside of the organization. Do not click > links or open > attachments unless you can confirm the sender and know the content is safe. > > > > (+ Xen-devel) > > Sorry I forgot to CC xen-devel. > > On 21/09/2020 12:38, Julien Grall wrote: > > Hi all, > > > > I have started to look at the deferral code (see > > vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and > > Arm will soon use it. > > > > The current implementation is using an smp_mb() to ensure ordering > > between a write then a read. The code looks roughly (I have slightly > > adapted it to make my question more obvious): > > > > domain_shutdown() > > d->is_shutting_down = 1; > > smp_mb(); > > if ( !vcpu0->defer_shutdown ) > > { > >vcpu_pause_nosync(v); > >v->paused_for_shutdown = 1; > > } > > > > vcpu_start_shutdown_deferral() > > vcpu0->defer_shutdown = 1; > > smp_mb(); > > if ( unlikely(d->is_shutting_down) ) > >vcpu_check_shutdown(v); > > > > return vcpu0->defer_shutdown; > > > > smp_mb() should only guarantee ordering (this may be stronger on some > > arch), so I think there is a race between the two functions. > > > > It would be possible to pause the vCPU in domain_shutdown() because > > vcpu0->defer_shutdown wasn't yet seen. > > > > Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down > > and therefore Xen may continue to send the I/O. Yet the vCPU will be > > paused so the I/O will never complete. > > The barrier enforces global order, right? So, if domain_shutdown() pauses the vcpu then is_shutting_down must necessarily be visible all cpus. Thus vcpu_start_shutdown referral will execute vcpu_check_shutdown(), so I'm having a hard time seeing the race. > > I am not fully familiar with the IOREQ code, but it sounds to me this is > > not the behavior that was intended. Can someone more familiar with the > > code confirm it? > > No indeed. I think emulation should complete before the vcpu pauses. Paul > > Cheers, > > > > -- > Julien Grall
Re: Memory ordering question in the shutdown deferral code
On 21.09.2020 13:40, Julien Grall wrote: > (+ Xen-devel) > > Sorry I forgot to CC xen-devel. > > On 21/09/2020 12:38, Julien Grall wrote: >> Hi all, >> >> I have started to look at the deferral code (see >> vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and >> Arm will soon use it. >> >> The current implementation is using an smp_mb() to ensure ordering >> between a write then a read. The code looks roughly (I have slightly >> adapted it to make my question more obvious): >> >> domain_shutdown() >> d->is_shutting_down = 1; >> smp_mb(); >> if ( !vcpu0->defer_shutdown ) >> { >> vcpu_pause_nosync(v); >> v->paused_for_shutdown = 1; >> } >> >> vcpu_start_shutdown_deferral() >> vcpu0->defer_shutdown = 1; >> smp_mb(); >> if ( unlikely(d->is_shutting_down) ) >> vcpu_check_shutdown(v); >> >> return vcpu0->defer_shutdown; >> >> smp_mb() should only guarantee ordering (this may be stronger on some >> arch), so I think there is a race between the two functions. >> >> It would be possible to pause the vCPU in domain_shutdown() because >> vcpu0->defer_shutdown wasn't yet seen. >> >> Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down >> and therefore Xen may continue to send the I/O. Yet the vCPU will be >> paused so the I/O will never complete. Individually for each of these I agree. But isn't the goal merely to prevent both to enter their if()-s' bodies at the same time? And isn't the combined effect of the two barriers preventing just this? >> I am not fully familiar with the IOREQ code, but it sounds to me this is >> not the behavior that was intended. Can someone more familiar with the >> code confirm it? As to original intentions, I'm afraid among the people still listed as maintainers for any part of Xen it may only be Tim to possibly have been involved in the original installation of this model, and hence who may know of the precise intentions and considerations back at the time. As far as I'm concerned, to be honest I don't think I've ever managed to fully convince myself of the correctness of the model in the general case. But since it did look good enough for x86 ... Jan
Re: Memory ordering question in the shutdown deferral code
On 21/09/2020 13:55, Durrant, Paul wrote: -Original Message- From: Julien Grall Sent: 21 September 2020 12:41 To: Jan Beulich ; Stefano Stabellini ; andrew.coop...@citrix.com; George Dunlap ; Durrant, Paul Cc: Xia, Hongyan ; xen-devel@lists.xenproject.org Subject: RE: [EXTERNAL] Memory ordering question in the shutdown deferral code CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. (+ Xen-devel) Sorry I forgot to CC xen-devel. On 21/09/2020 12:38, Julien Grall wrote: Hi all, I have started to look at the deferral code (see vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and Arm will soon use it. The current implementation is using an smp_mb() to ensure ordering between a write then a read. The code looks roughly (I have slightly adapted it to make my question more obvious): domain_shutdown() d->is_shutting_down = 1; smp_mb(); if ( !vcpu0->defer_shutdown ) { vcpu_pause_nosync(v); v->paused_for_shutdown = 1; } vcpu_start_shutdown_deferral() vcpu0->defer_shutdown = 1; smp_mb(); if ( unlikely(d->is_shutting_down) ) vcpu_check_shutdown(v); return vcpu0->defer_shutdown; smp_mb() should only guarantee ordering (this may be stronger on some arch), so I think there is a race between the two functions. It would be possible to pause the vCPU in domain_shutdown() because vcpu0->defer_shutdown wasn't yet seen. Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down and therefore Xen may continue to send the I/O. Yet the vCPU will be paused so the I/O will never complete. The barrier enforces global order, right? It is not clear to me what you mean by "global ordering". This seems to suggest a very expensive synchronization barrier between all the processors. From an arch-agnostic PoV, smp_mb() will enforce an ordering between loads/stores but it doesn't guarantee *when* they will be observed. So, if domain_shutdown() pauses the vcpu then is_shutting_down must necessarily be visible all cpus. That's not the guarantee provided by smp_mb() (see above). Thus vcpu_start_shutdown referral will execute vcpu_check_shutdown(), so I'm having a hard time seeing the race. I am not fully familiar with the IOREQ code, but it sounds to me this is not the behavior that was intended. Can someone more familiar with the code confirm it? No indeed. I think emulation should complete before the vcpu pauses. I think this part is racy at least on non-x86 platform as x86 seems to implement smp_mb() with a strong memory barrier (mfence). Cheers, -- Julien Grall
Re: Memory ordering question in the shutdown deferral code
On Mon, 2020-09-21 at 12:55 +, Durrant, Paul wrote: > > -Original Message- > > From: Julien Grall > > Sent: 21 September 2020 12:41 > > To: Jan Beulich ; Stefano Stabellini < > > sstabell...@kernel.org>; > > andrew.coop...@citrix.com; George Dunlap > > ; Durrant, Paul > > > > Cc: Xia, Hongyan ; > > xen-devel@lists.xenproject.org > > Subject: RE: [EXTERNAL] Memory ordering question in the shutdown > > deferral code > > > > CAUTION: This email originated from outside of the organization. Do > > not click links or open > > attachments unless you can confirm the sender and know the content > > is safe. > > > > > > > > (+ Xen-devel) > > > > Sorry I forgot to CC xen-devel. > > > > On 21/09/2020 12:38, Julien Grall wrote: > > > Hi all, > > > > > > I have started to look at the deferral code (see > > > vcpu_start_shutdown_deferral()) because we need it for LiveUpdate > > > and > > > Arm will soon use it. > > > > > > The current implementation is using an smp_mb() to ensure > > > ordering > > > between a write then a read. The code looks roughly (I have > > > slightly > > > adapted it to make my question more obvious): > > > > > > domain_shutdown() > > > d->is_shutting_down = 1; > > > smp_mb(); > > > if ( !vcpu0->defer_shutdown ) > > > { > > >vcpu_pause_nosync(v); > > >v->paused_for_shutdown = 1; > > > } > > > > > > vcpu_start_shutdown_deferral() > > > vcpu0->defer_shutdown = 1; > > > smp_mb(); > > > if ( unlikely(d->is_shutting_down) ) > > >vcpu_check_shutdown(v); > > > > > > return vcpu0->defer_shutdown; > > > > > > smp_mb() should only guarantee ordering (this may be stronger on > > > some > > > arch), so I think there is a race between the two functions. > > > > > > It would be possible to pause the vCPU in domain_shutdown() > > > because > > > vcpu0->defer_shutdown wasn't yet seen. > > > > > > Equally, vcpu_start_shutdown_deferral() may not see d- > > > >is_shutting_down > > > and therefore Xen may continue to send the I/O. Yet the vCPU will > > > be > > > paused so the I/O will never complete. > > > > > The barrier enforces global order, right? So, if domain_shutdown() > pauses the vcpu then is_shutting_down must necessarily be visible all > cpus. Thus vcpu_start_shutdown referral will execute > vcpu_check_shutdown(), so I'm having a hard time seeing the race. I think the question is whether smp_mb() can enforce immediate visibility. It at least enforces ordering, so when other CPUs eventually see the memory accesses they will be in the right order. On the x86 side I think an mfence enforces that when the later load is executed, the previous store is globally visible. But, if it is only about the ordering seen by other cores, it could be possible that the write is seen much later. Hongyan
Re: Memory ordering question in the shutdown deferral code
On 21.09.2020 15:27, Julien Grall wrote: > I think this part is racy at least on non-x86 platform as x86 seems to > implement smp_mb() with a strong memory barrier (mfence). The "strength" of the memory barrier doesn't matter here imo. It's the fully coherent memory model (for WB type memory) which makes this be fine on x86. The barrier still only guarantees ordering, not visibility. Jan
RE: Memory ordering question in the shutdown deferral code
> -Original Message- > From: Jan Beulich > Sent: 21 September 2020 14:32 > To: Julien Grall > Cc: Durrant, Paul ; Stefano Stabellini > ; > andrew.coop...@citrix.com; George Dunlap ; Xia, > Hongyan > ; xen-devel@lists.xenproject.org > Subject: RE: [EXTERNAL] Memory ordering question in the shutdown deferral code > > CAUTION: This email originated from outside of the organization. Do not click > links or open > attachments unless you can confirm the sender and know the content is safe. > > > > On 21.09.2020 15:27, Julien Grall wrote: > > I think this part is racy at least on non-x86 platform as x86 seems to > > implement smp_mb() with a strong memory barrier (mfence). > > The "strength" of the memory barrier doesn't matter here imo. It's > the fully coherent memory model (for WB type memory) which makes > this be fine on x86. The barrier still only guarantees ordering, > not visibility. > In which case I misunderstood what the 'smp' means in this context then. Paul
Re: Memory ordering question in the shutdown deferral code
On 21.09.2020 15:35, Durrant, Paul wrote: >> From: Jan Beulich >> Sent: 21 September 2020 14:32 >> >> On 21.09.2020 15:27, Julien Grall wrote: >>> I think this part is racy at least on non-x86 platform as x86 seems to >>> implement smp_mb() with a strong memory barrier (mfence). >> >> The "strength" of the memory barrier doesn't matter here imo. It's >> the fully coherent memory model (for WB type memory) which makes >> this be fine on x86. The barrier still only guarantees ordering, >> not visibility. >> > > In which case I misunderstood what the 'smp' means in this context then. I find this confusing too, at times. But according to my reading of the doc the "smp" in there really only means "simple compiler barrier when UP". Jan
Re: Memory ordering question in the shutdown deferral code
Hi Jan, On 21/09/2020 14:11, Jan Beulich wrote: On 21.09.2020 13:40, Julien Grall wrote: (+ Xen-devel) Sorry I forgot to CC xen-devel. On 21/09/2020 12:38, Julien Grall wrote: Hi all, I have started to look at the deferral code (see vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and Arm will soon use it. The current implementation is using an smp_mb() to ensure ordering between a write then a read. The code looks roughly (I have slightly adapted it to make my question more obvious): domain_shutdown() d->is_shutting_down = 1; smp_mb(); if ( !vcpu0->defer_shutdown ) { vcpu_pause_nosync(v); v->paused_for_shutdown = 1; } vcpu_start_shutdown_deferral() vcpu0->defer_shutdown = 1; smp_mb(); if ( unlikely(d->is_shutting_down) ) vcpu_check_shutdown(v); return vcpu0->defer_shutdown; smp_mb() should only guarantee ordering (this may be stronger on some arch), so I think there is a race between the two functions. It would be possible to pause the vCPU in domain_shutdown() because vcpu0->defer_shutdown wasn't yet seen. Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down and therefore Xen may continue to send the I/O. Yet the vCPU will be paused so the I/O will never complete. Individually for each of these I agree. But isn't the goal merely to prevent both to enter their if()-s' bodies at the same time? And isn't the combined effect of the two barriers preventing just this? The code should already be able to deal with that as vcpu_check_shutdown() will request to hold d->shutdown_lock and then check v->paused_for_shutdown. So I am not sure why the barriers would matter here. I am not fully familiar with the IOREQ code, but it sounds to me this is not the behavior that was intended. Can someone more familiar with the code confirm it? As to original intentions, I'm afraid among the people still listed as maintainers for any part of Xen it may only be Tim to possibly have been involved in the original installation of this model, and hence who may know of the precise intentions and considerations back at the time. It would be useful to know the original intentions, so I have CCed Tim. However, I think it is more important to agree on what we want to achieve so we can decide whether the existing code is suitable. Do you agree that we only want to shutdown (or pause it at an architecturally restartable bounday) a domain with no I/Os inflights? As far as I'm concerned, to be honest I don't think I've ever managed to fully convince myself of the correctness of the model in the general case. But since it did look good enough for x86 ... Right, the memory model on x86 is quite simple compare to Arm :). I am pretty sure we need some sort of ordering, but I am not convinced we have the correct one in place if we want to cater architecture with more relaxed memory model. Cheers, -- Julien Grall
Re: Memory ordering question in the shutdown deferral code
On Mon, 21 Sep 2020, Julien Grall wrote: > On 21/09/2020 13:55, Durrant, Paul wrote: > > > (+ Xen-devel) > > > > > > Sorry I forgot to CC xen-devel. > > > > > > On 21/09/2020 12:38, Julien Grall wrote: > > > > Hi all, > > > > > > > > I have started to look at the deferral code (see > > > > vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and > > > > Arm will soon use it. > > > > > > > > The current implementation is using an smp_mb() to ensure ordering > > > > between a write then a read. The code looks roughly (I have slightly > > > > adapted it to make my question more obvious): > > > > > > > > domain_shutdown() > > > > d->is_shutting_down = 1; > > > > smp_mb(); > > > > if ( !vcpu0->defer_shutdown ) > > > > { > > > > vcpu_pause_nosync(v); > > > > v->paused_for_shutdown = 1; > > > > } > > > > > > > > vcpu_start_shutdown_deferral() > > > > vcpu0->defer_shutdown = 1; > > > > smp_mb(); > > > > if ( unlikely(d->is_shutting_down) ) > > > > vcpu_check_shutdown(v); > > > > > > > > return vcpu0->defer_shutdown; > > > > > > > > smp_mb() should only guarantee ordering (this may be stronger on some > > > > arch), so I think there is a race between the two functions. > > > > > > > > It would be possible to pause the vCPU in domain_shutdown() because > > > > vcpu0->defer_shutdown wasn't yet seen. > > > > > > > > Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down > > > > and therefore Xen may continue to send the I/O. Yet the vCPU will be > > > > paused so the I/O will never complete. > > > > > > > > The barrier enforces global order, right? > > It is not clear to me what you mean by "global ordering". This seems to > suggest a very expensive synchronization barrier between all the processors. > > From an arch-agnostic PoV, smp_mb() will enforce an ordering between > loads/stores but it doesn't guarantee *when* they will be observed. > > > So, if domain_shutdown() pauses the vcpu then is_shutting_down must > > necessarily be visible all cpus. > > That's not the guarantee provided by smp_mb() (see above). I simplified the code further to help us reason about it: thread#1 | thread#2 | 1) WRITE A | WRITE B 2) BARRIER | BARRIER 3) READ B | READ A I think it is (theoretically) possible for thread#1 to be at 1) and about to do 2), while thread#2 goes ahead and does 1) 2) 3). By the time thread#1 does 2), thread#2 has already completed the entire sequence. If thread#2 has already done 2), and thread#1 is about to do 3), then I think we are guaranteed that thread#1 will see the new value of B. Or is this the core of the issue we are discussing? If it works the way I wrote, then it would confirm Paul's view. For your information I went to check what the Linux memory model has to say about this. It says: "smp_mb() guarantees to restore sequential consistency among accesses that use READ_ONCE, WRITE_ONCE(), or stronger. For example, the following Linux-kernel code would forbid non-SC outcomes" It is interesting that they chose the words "restore sequential consistency". It would be difficult to come up with an example that has "sequential consistency" but doesn't work the way described earlier.
Re: Memory ordering question in the shutdown deferral code
Hi, On 23/09/2020 23:57, Stefano Stabellini wrote: On Mon, 21 Sep 2020, Julien Grall wrote: On 21/09/2020 13:55, Durrant, Paul wrote: (+ Xen-devel) Sorry I forgot to CC xen-devel. On 21/09/2020 12:38, Julien Grall wrote: Hi all, I have started to look at the deferral code (see vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and Arm will soon use it. The current implementation is using an smp_mb() to ensure ordering between a write then a read. The code looks roughly (I have slightly adapted it to make my question more obvious): domain_shutdown() d->is_shutting_down = 1; smp_mb(); if ( !vcpu0->defer_shutdown ) { vcpu_pause_nosync(v); v->paused_for_shutdown = 1; } vcpu_start_shutdown_deferral() vcpu0->defer_shutdown = 1; smp_mb(); if ( unlikely(d->is_shutting_down) ) vcpu_check_shutdown(v); return vcpu0->defer_shutdown; smp_mb() should only guarantee ordering (this may be stronger on some arch), so I think there is a race between the two functions. It would be possible to pause the vCPU in domain_shutdown() because vcpu0->defer_shutdown wasn't yet seen. Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down and therefore Xen may continue to send the I/O. Yet the vCPU will be paused so the I/O will never complete. The barrier enforces global order, right? It is not clear to me what you mean by "global ordering". This seems to suggest a very expensive synchronization barrier between all the processors. From an arch-agnostic PoV, smp_mb() will enforce an ordering between loads/stores but it doesn't guarantee *when* they will be observed. So, if domain_shutdown() pauses the vcpu then is_shutting_down must necessarily be visible all cpus. That's not the guarantee provided by smp_mb() (see above). I simplified the code further to help us reason about it: thread#1 | thread#2 | 1) WRITE A | WRITE B 2) BARRIER | BARRIER 3) READ B | READ A Thank you for writing a simpler example. It allowed me to find a litmus (see [1]) and now I understood why it works. See more below. I think it is (theoretically) possible for thread#1 to be at 1) and about to do 2), while thread#2 goes ahead and does 1) 2) 3). Well it is not that theoritical :). There are many reasons where this situation can happen. To only cite a few: - Threads may run on the same pCPUs - The pCPU running the threads may get interrupted - The data modified is not in the L1 cache, there will be delay to access it. By the time thread#1 does 2), thread#2 has already completed the entire sequence. If thread#2 has already done 2), and thread#1 is about to do 3), then I think we are guaranteed that thread#1 will see the new value of B. > Or is this the core of the issue we are discussing? No you are right. I got confused because smp_mb() doesn't guarantee when the write/read is completed. So I blindly assumed that the ordering would be done just at the processor level. Instead, the ordering is done at the innershareable level (e.g between all the processors) as we are using dmb ish. Assuming A and B are initialized to 0 and we are writing 1, then there is no way for both thread to read 0. In which case, the existing shutdown code is fine. Cheers, [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/litmus-tests/SB+fencembonceonces.litmus -- Julien Grall