Re: Memory ordering question in the shutdown deferral code

2020-09-24 Thread Julien Grall

Hi,

On 23/09/2020 23:57, Stefano Stabellini wrote:

On Mon, 21 Sep 2020, Julien Grall wrote:

On 21/09/2020 13:55, Durrant, Paul wrote:

(+ Xen-devel)

Sorry I forgot to CC xen-devel.

On 21/09/2020 12:38, Julien Grall wrote:

Hi all,

I have started to look at the deferral code (see
vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and
Arm will soon use it.

The current implementation is using an smp_mb() to ensure ordering
between a write then a read. The code looks roughly (I have slightly
adapted it to make my question more obvious):

domain_shutdown()
   d->is_shutting_down = 1;
   smp_mb();
   if ( !vcpu0->defer_shutdown )
   {
 vcpu_pause_nosync(v);
 v->paused_for_shutdown = 1;
   }

vcpu_start_shutdown_deferral()
   vcpu0->defer_shutdown = 1;
   smp_mb();
   if ( unlikely(d->is_shutting_down) )
 vcpu_check_shutdown(v);

   return vcpu0->defer_shutdown;

smp_mb() should only guarantee ordering (this may be stronger on some
arch), so I think there is a race between the two functions.

It would be possible to pause the vCPU in domain_shutdown() because
vcpu0->defer_shutdown wasn't yet seen.

Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down
and therefore Xen may continue to send the I/O. Yet the vCPU will be
paused so the I/O will never complete.



The barrier enforces global order, right?


It is not clear to me what you mean by "global ordering". This seems to
suggest a very expensive synchronization barrier between all the processors.

 From an arch-agnostic PoV, smp_mb() will enforce an ordering between
loads/stores but it doesn't guarantee *when* they will be observed.


So, if domain_shutdown() pauses the vcpu then is_shutting_down must
necessarily be visible all cpus.


That's not the guarantee provided by smp_mb() (see above).



I simplified the code further to help us reason about it:


thread#1 |  thread#2
 |
1) WRITE A  |  WRITE B
2) BARRIER  |  BARRIER
3) READ B   |  READ A


Thank you for writing a simpler example. It allowed me to find a litmus 
(see [1]) and now I understood why it works. See more below.





I think it is (theoretically) possible for thread#1 to be at 1) and
about to do 2), while thread#2 goes ahead and does 1) 2) 3).


Well it is not that theoritical :). There are many reasons where this 
situation can happen. To only cite a few:

- Threads may run on the same pCPUs
- The pCPU running the threads may get interrupted
- The data modified is not in the L1 cache, there will be delay to 
access it.



By the time
thread#1 does 2), thread#2 has already completed the entire sequence.

If thread#2 has already done 2), and thread#1 is about to do 3), then I
think we are guaranteed that thread#1 will see the new value of B. > Or
is this the core of the issue we are discussing?


No you are right. I got confused because smp_mb() doesn't guarantee when 
the write/read is completed.


So I blindly assumed that the ordering would be done just at the 
processor level. Instead, the ordering is done at the innershareable 
level (e.g between all the processors) as we are using dmb ish.


Assuming A and B are initialized to 0 and we are writing 1, then there 
is no way for both thread to read 0. In which case, the existing 
shutdown code is fine.


Cheers,

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/litmus-tests/SB+fencembonceonces.litmus


--
Julien Grall



Re: Memory ordering question in the shutdown deferral code

2020-09-23 Thread Stefano Stabellini
On Mon, 21 Sep 2020, Julien Grall wrote:
> On 21/09/2020 13:55, Durrant, Paul wrote:
> > > (+ Xen-devel)
> > > 
> > > Sorry I forgot to CC xen-devel.
> > > 
> > > On 21/09/2020 12:38, Julien Grall wrote:
> > > > Hi all,
> > > > 
> > > > I have started to look at the deferral code (see
> > > > vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and
> > > > Arm will soon use it.
> > > > 
> > > > The current implementation is using an smp_mb() to ensure ordering
> > > > between a write then a read. The code looks roughly (I have slightly
> > > > adapted it to make my question more obvious):
> > > > 
> > > > domain_shutdown()
> > > >   d->is_shutting_down = 1;
> > > >   smp_mb();
> > > >   if ( !vcpu0->defer_shutdown )
> > > >   {
> > > > vcpu_pause_nosync(v);
> > > > v->paused_for_shutdown = 1;
> > > >   }
> > > > 
> > > > vcpu_start_shutdown_deferral()
> > > >   vcpu0->defer_shutdown = 1;
> > > >   smp_mb();
> > > >   if ( unlikely(d->is_shutting_down) )
> > > > vcpu_check_shutdown(v);
> > > > 
> > > >   return vcpu0->defer_shutdown;
> > > > 
> > > > smp_mb() should only guarantee ordering (this may be stronger on some
> > > > arch), so I think there is a race between the two functions.
> > > > 
> > > > It would be possible to pause the vCPU in domain_shutdown() because
> > > > vcpu0->defer_shutdown wasn't yet seen.
> > > > 
> > > > Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down
> > > > and therefore Xen may continue to send the I/O. Yet the vCPU will be
> > > > paused so the I/O will never complete.
> > > > 
> > 
> > The barrier enforces global order, right?
> 
> It is not clear to me what you mean by "global ordering". This seems to
> suggest a very expensive synchronization barrier between all the processors.
> 
> From an arch-agnostic PoV, smp_mb() will enforce an ordering between
> loads/stores but it doesn't guarantee *when* they will be observed.
> 
> > So, if domain_shutdown() pauses the vcpu then is_shutting_down must
> > necessarily be visible all cpus.
> 
> That's not the guarantee provided by smp_mb() (see above).


I simplified the code further to help us reason about it:


   thread#1 |  thread#2
|
1) WRITE A  |  WRITE B
2) BARRIER  |  BARRIER
3) READ B   |  READ A


I think it is (theoretically) possible for thread#1 to be at 1) and
about to do 2), while thread#2 goes ahead and does 1) 2) 3). By the time
thread#1 does 2), thread#2 has already completed the entire sequence.

If thread#2 has already done 2), and thread#1 is about to do 3), then I
think we are guaranteed that thread#1 will see the new value of B. Or
is this the core of the issue we are discussing?

If it works the way I wrote, then it would confirm Paul's view.


For your information I went to check what the Linux memory model has to
say about this. It says:

"smp_mb() guarantees to restore sequential consistency among accesses that use 
READ_ONCE, WRITE_ONCE(), or stronger. For example, the following Linux-kernel 
code would forbid non-SC outcomes"

It is interesting that they chose the words "restore sequential
consistency". It would be difficult to come up with an example that has
"sequential consistency" but doesn't work the way described earlier.



Re: Memory ordering question in the shutdown deferral code

2020-09-21 Thread Julien Grall

Hi Jan,

On 21/09/2020 14:11, Jan Beulich wrote:

On 21.09.2020 13:40, Julien Grall wrote:

(+ Xen-devel)

Sorry I forgot to CC xen-devel.

On 21/09/2020 12:38, Julien Grall wrote:

Hi all,

I have started to look at the deferral code (see
vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and
Arm will soon use it.

The current implementation is using an smp_mb() to ensure ordering
between a write then a read. The code looks roughly (I have slightly
adapted it to make my question more obvious):

domain_shutdown()
      d->is_shutting_down = 1;
      smp_mb();
      if ( !vcpu0->defer_shutdown )
      {
    vcpu_pause_nosync(v);
    v->paused_for_shutdown = 1;
      }

vcpu_start_shutdown_deferral()
      vcpu0->defer_shutdown = 1;
      smp_mb();
      if ( unlikely(d->is_shutting_down) )
    vcpu_check_shutdown(v);

      return vcpu0->defer_shutdown;

smp_mb() should only guarantee ordering (this may be stronger on some
arch), so I think there is a race between the two functions.

It would be possible to pause the vCPU in domain_shutdown() because
vcpu0->defer_shutdown wasn't yet seen.

Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down
and therefore Xen may continue to send the I/O. Yet the vCPU will be
paused so the I/O will never complete.


Individually for each of these I agree. But isn't the goal merely
to prevent both to enter their if()-s' bodies at the same time?
And isn't the combined effect of the two barriers preventing just
this?


The code should already be able to deal with that as 
vcpu_check_shutdown() will request to hold d->shutdown_lock and then 
check v->paused_for_shutdown.


So I am not sure why the barriers would matter here.




I am not fully familiar with the IOREQ code, but it sounds to me this is
not the behavior that was intended. Can someone more familiar with the
code confirm it?


As to original intentions, I'm afraid among the people still
listed as maintainers for any part of Xen it may only be Tim to
possibly have been involved in the original installation of
this model, and hence who may know of the precise intentions
and considerations back at the time.


It would be useful to know the original intentions, so I have CCed Tim.

However, I think it is more important to agree on what we want to 
achieve so we can decide whether the existing code is suitable.


Do you agree that we only want to shutdown (or pause it at an 
architecturally restartable bounday) a domain with no I/Os inflights?




As far as I'm concerned, to be honest I don't think I've ever
managed to fully convince myself of the correctness of the
model in the general case. But since it did look good enough
for x86 ...


Right, the memory model on x86 is quite simple compare to Arm :). I am 
pretty sure we need some sort of ordering, but I am not convinced we 
have the correct one in place if we want to cater architecture with more 
relaxed memory model.


Cheers,

--
Julien Grall



Re: Memory ordering question in the shutdown deferral code

2020-09-21 Thread Jan Beulich
On 21.09.2020 15:35, Durrant, Paul wrote:
>> From: Jan Beulich 
>> Sent: 21 September 2020 14:32
>>
>> On 21.09.2020 15:27, Julien Grall wrote:
>>> I think this part is racy at least on non-x86 platform as x86 seems to
>>> implement smp_mb() with a strong memory barrier (mfence).
>>
>> The "strength" of the memory barrier doesn't matter here imo. It's
>> the fully coherent memory model (for WB type memory) which makes
>> this be fine on x86. The barrier still only guarantees ordering,
>> not visibility.
>>
> 
> In which case I misunderstood what the 'smp' means in this context then.

I find this confusing too, at times. But according to my reading
of the doc the "smp" in there really only means "simple compiler
barrier when UP".

Jan



RE: Memory ordering question in the shutdown deferral code

2020-09-21 Thread Durrant, Paul
> -Original Message-
> From: Jan Beulich 
> Sent: 21 September 2020 14:32
> To: Julien Grall 
> Cc: Durrant, Paul ; Stefano Stabellini 
> ;
> andrew.coop...@citrix.com; George Dunlap ; Xia, 
> Hongyan
> ; xen-devel@lists.xenproject.org
> Subject: RE: [EXTERNAL] Memory ordering question in the shutdown deferral code
> 
> CAUTION: This email originated from outside of the organization. Do not click 
> links or open
> attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On 21.09.2020 15:27, Julien Grall wrote:
> > I think this part is racy at least on non-x86 platform as x86 seems to
> > implement smp_mb() with a strong memory barrier (mfence).
> 
> The "strength" of the memory barrier doesn't matter here imo. It's
> the fully coherent memory model (for WB type memory) which makes
> this be fine on x86. The barrier still only guarantees ordering,
> not visibility.
> 

In which case I misunderstood what the 'smp' means in this context then.

  Paul


Re: Memory ordering question in the shutdown deferral code

2020-09-21 Thread Jan Beulich
On 21.09.2020 15:27, Julien Grall wrote:
> I think this part is racy at least on non-x86 platform as x86 seems to 
> implement smp_mb() with a strong memory barrier (mfence).

The "strength" of the memory barrier doesn't matter here imo. It's
the fully coherent memory model (for WB type memory) which makes
this be fine on x86. The barrier still only guarantees ordering,
not visibility.

Jan



Re: Memory ordering question in the shutdown deferral code

2020-09-21 Thread Xia, Hongyan
On Mon, 2020-09-21 at 12:55 +, Durrant, Paul wrote:
> > -Original Message-
> > From: Julien Grall 
> > Sent: 21 September 2020 12:41
> > To: Jan Beulich ; Stefano Stabellini <
> > sstabell...@kernel.org>;
> > andrew.coop...@citrix.com; George Dunlap 
> > ; Durrant, Paul
> > 
> > Cc: Xia, Hongyan ; 
> > xen-devel@lists.xenproject.org
> > Subject: RE: [EXTERNAL] Memory ordering question in the shutdown
> > deferral code
> > 
> > CAUTION: This email originated from outside of the organization. Do
> > not click links or open
> > attachments unless you can confirm the sender and know the content
> > is safe.
> > 
> > 
> > 
> > (+ Xen-devel)
> > 
> > Sorry I forgot to CC xen-devel.
> > 
> > On 21/09/2020 12:38, Julien Grall wrote:
> > > Hi all,
> > > 
> > > I have started to look at the deferral code (see
> > > vcpu_start_shutdown_deferral()) because we need it for LiveUpdate
> > > and
> > > Arm will soon use it.
> > > 
> > > The current implementation is using an smp_mb() to ensure
> > > ordering
> > > between a write then a read. The code looks roughly (I have
> > > slightly
> > > adapted it to make my question more obvious):
> > > 
> > > domain_shutdown()
> > >  d->is_shutting_down = 1;
> > >  smp_mb();
> > >  if ( !vcpu0->defer_shutdown )
> > >  {
> > >vcpu_pause_nosync(v);
> > >v->paused_for_shutdown = 1;
> > >  }
> > > 
> > > vcpu_start_shutdown_deferral()
> > >  vcpu0->defer_shutdown = 1;
> > >  smp_mb();
> > >  if ( unlikely(d->is_shutting_down) )
> > >vcpu_check_shutdown(v);
> > > 
> > >  return vcpu0->defer_shutdown;
> > > 
> > > smp_mb() should only guarantee ordering (this may be stronger on
> > > some
> > > arch), so I think there is a race between the two functions.
> > > 
> > > It would be possible to pause the vCPU in domain_shutdown()
> > > because
> > > vcpu0->defer_shutdown wasn't yet seen.
> > > 
> > > Equally, vcpu_start_shutdown_deferral() may not see d-
> > > >is_shutting_down
> > > and therefore Xen may continue to send the I/O. Yet the vCPU will
> > > be
> > > paused so the I/O will never complete.
> > > 
> 
> The barrier enforces global order, right? So, if domain_shutdown()
> pauses the vcpu then is_shutting_down must necessarily be visible all
> cpus. Thus vcpu_start_shutdown referral will execute
> vcpu_check_shutdown(), so I'm having a hard time seeing the race.

I think the question is whether smp_mb() can enforce immediate
visibility. It at least enforces ordering, so when other CPUs
eventually see the memory accesses they will be in the right order. On
the x86 side I think an mfence enforces that when the later load is
executed, the previous store is globally visible. But, if it is only
about the ordering seen by other cores, it could be possible that the
write is seen much later.

Hongyan


Re: Memory ordering question in the shutdown deferral code

2020-09-21 Thread Julien Grall




On 21/09/2020 13:55, Durrant, Paul wrote:

-Original Message-
From: Julien Grall 
Sent: 21 September 2020 12:41
To: Jan Beulich ; Stefano Stabellini 
;
andrew.coop...@citrix.com; George Dunlap ; Durrant, 
Paul

Cc: Xia, Hongyan ; xen-devel@lists.xenproject.org
Subject: RE: [EXTERNAL] Memory ordering question in the shutdown deferral code

CAUTION: This email originated from outside of the organization. Do not click 
links or open
attachments unless you can confirm the sender and know the content is safe.



(+ Xen-devel)

Sorry I forgot to CC xen-devel.

On 21/09/2020 12:38, Julien Grall wrote:

Hi all,

I have started to look at the deferral code (see
vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and
Arm will soon use it.

The current implementation is using an smp_mb() to ensure ordering
between a write then a read. The code looks roughly (I have slightly
adapted it to make my question more obvious):

domain_shutdown()
  d->is_shutting_down = 1;
  smp_mb();
  if ( !vcpu0->defer_shutdown )
  {
vcpu_pause_nosync(v);
v->paused_for_shutdown = 1;
  }

vcpu_start_shutdown_deferral()
  vcpu0->defer_shutdown = 1;
  smp_mb();
  if ( unlikely(d->is_shutting_down) )
vcpu_check_shutdown(v);

  return vcpu0->defer_shutdown;

smp_mb() should only guarantee ordering (this may be stronger on some
arch), so I think there is a race between the two functions.

It would be possible to pause the vCPU in domain_shutdown() because
vcpu0->defer_shutdown wasn't yet seen.

Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down
and therefore Xen may continue to send the I/O. Yet the vCPU will be
paused so the I/O will never complete.



The barrier enforces global order, right?


It is not clear to me what you mean by "global ordering". This seems to 
suggest a very expensive synchronization barrier between all the processors.


From an arch-agnostic PoV, smp_mb() will enforce an ordering between 
loads/stores but it doesn't guarantee *when* they will be observed.



So, if domain_shutdown() pauses the vcpu then is_shutting_down must necessarily 
be visible all cpus.


That's not the guarantee provided by smp_mb() (see above).


Thus vcpu_start_shutdown referral will execute vcpu_check_shutdown(), so I'm 
having a hard time seeing the race.


I am not fully familiar with the IOREQ code, but it sounds to me this is
not the behavior that was intended. Can someone more familiar with the
code confirm it?



No indeed. I think emulation should complete before the vcpu pauses.


I think this part is racy at least on non-x86 platform as x86 seems to 
implement smp_mb() with a strong memory barrier (mfence).


Cheers,

--
Julien Grall



Re: Memory ordering question in the shutdown deferral code

2020-09-21 Thread Jan Beulich
On 21.09.2020 13:40, Julien Grall wrote:
> (+ Xen-devel)
> 
> Sorry I forgot to CC xen-devel.
> 
> On 21/09/2020 12:38, Julien Grall wrote:
>> Hi all,
>>
>> I have started to look at the deferral code (see 
>> vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and 
>> Arm will soon use it.
>>
>> The current implementation is using an smp_mb() to ensure ordering 
>> between a write then a read. The code looks roughly (I have slightly 
>> adapted it to make my question more obvious):
>>
>> domain_shutdown()
>>      d->is_shutting_down = 1;
>>      smp_mb();
>>      if ( !vcpu0->defer_shutdown )
>>      {
>>    vcpu_pause_nosync(v);
>>    v->paused_for_shutdown = 1;
>>      }
>>
>> vcpu_start_shutdown_deferral()
>>      vcpu0->defer_shutdown = 1;
>>      smp_mb();
>>      if ( unlikely(d->is_shutting_down) )
>>    vcpu_check_shutdown(v);
>>
>>      return vcpu0->defer_shutdown;
>>
>> smp_mb() should only guarantee ordering (this may be stronger on some 
>> arch), so I think there is a race between the two functions.
>>
>> It would be possible to pause the vCPU in domain_shutdown() because 
>> vcpu0->defer_shutdown wasn't yet seen.
>>
>> Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down 
>> and therefore Xen may continue to send the I/O. Yet the vCPU will be 
>> paused so the I/O will never complete.

Individually for each of these I agree. But isn't the goal merely
to prevent both to enter their if()-s' bodies at the same time?
And isn't the combined effect of the two barriers preventing just
this?

>> I am not fully familiar with the IOREQ code, but it sounds to me this is 
>> not the behavior that was intended. Can someone more familiar with the 
>> code confirm it?

As to original intentions, I'm afraid among the people still
listed as maintainers for any part of Xen it may only be Tim to
possibly have been involved in the original installation of
this model, and hence who may know of the precise intentions
and considerations back at the time.

As far as I'm concerned, to be honest I don't think I've ever
managed to fully convince myself of the correctness of the
model in the general case. But since it did look good enough
for x86 ...

Jan



RE: Memory ordering question in the shutdown deferral code

2020-09-21 Thread Durrant, Paul
> -Original Message-
> From: Julien Grall 
> Sent: 21 September 2020 12:41
> To: Jan Beulich ; Stefano Stabellini 
> ;
> andrew.coop...@citrix.com; George Dunlap ; Durrant, 
> Paul
> 
> Cc: Xia, Hongyan ; xen-devel@lists.xenproject.org
> Subject: RE: [EXTERNAL] Memory ordering question in the shutdown deferral code
> 
> CAUTION: This email originated from outside of the organization. Do not click 
> links or open
> attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> (+ Xen-devel)
> 
> Sorry I forgot to CC xen-devel.
> 
> On 21/09/2020 12:38, Julien Grall wrote:
> > Hi all,
> >
> > I have started to look at the deferral code (see
> > vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and
> > Arm will soon use it.
> >
> > The current implementation is using an smp_mb() to ensure ordering
> > between a write then a read. The code looks roughly (I have slightly
> > adapted it to make my question more obvious):
> >
> > domain_shutdown()
> >  d->is_shutting_down = 1;
> >  smp_mb();
> >  if ( !vcpu0->defer_shutdown )
> >  {
> >vcpu_pause_nosync(v);
> >v->paused_for_shutdown = 1;
> >  }
> >
> > vcpu_start_shutdown_deferral()
> >  vcpu0->defer_shutdown = 1;
> >  smp_mb();
> >  if ( unlikely(d->is_shutting_down) )
> >vcpu_check_shutdown(v);
> >
> >  return vcpu0->defer_shutdown;
> >
> > smp_mb() should only guarantee ordering (this may be stronger on some
> > arch), so I think there is a race between the two functions.
> >
> > It would be possible to pause the vCPU in domain_shutdown() because
> > vcpu0->defer_shutdown wasn't yet seen.
> >
> > Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down
> > and therefore Xen may continue to send the I/O. Yet the vCPU will be
> > paused so the I/O will never complete.
> >

The barrier enforces global order, right? So, if domain_shutdown() pauses the 
vcpu then is_shutting_down must necessarily be visible all cpus. Thus 
vcpu_start_shutdown referral will execute vcpu_check_shutdown(), so I'm having 
a hard time seeing the race.

> > I am not fully familiar with the IOREQ code, but it sounds to me this is
> > not the behavior that was intended. Can someone more familiar with the
> > code confirm it?
> >

No indeed. I think emulation should complete before the vcpu pauses.

  Paul

> > Cheers,
> >
> 
> --
> Julien Grall


Re: Memory ordering question in the shutdown deferral code

2020-09-21 Thread Julien Grall

(+ Xen-devel)

Sorry I forgot to CC xen-devel.

On 21/09/2020 12:38, Julien Grall wrote:

Hi all,

I have started to look at the deferral code (see 
vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and 
Arm will soon use it.


The current implementation is using an smp_mb() to ensure ordering 
between a write then a read. The code looks roughly (I have slightly 
adapted it to make my question more obvious):


domain_shutdown()
     d->is_shutting_down = 1;
     smp_mb();
     if ( !vcpu0->defer_shutdown )
     {
   vcpu_pause_nosync(v);
   v->paused_for_shutdown = 1;
     }

vcpu_start_shutdown_deferral()
     vcpu0->defer_shutdown = 1;
     smp_mb();
     if ( unlikely(d->is_shutting_down) )
   vcpu_check_shutdown(v);

     return vcpu0->defer_shutdown;

smp_mb() should only guarantee ordering (this may be stronger on some 
arch), so I think there is a race between the two functions.


It would be possible to pause the vCPU in domain_shutdown() because 
vcpu0->defer_shutdown wasn't yet seen.


Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down 
and therefore Xen may continue to send the I/O. Yet the vCPU will be 
paused so the I/O will never complete.


I am not fully familiar with the IOREQ code, but it sounds to me this is 
not the behavior that was intended. Can someone more familiar with the 
code confirm it?


Cheers,



--
Julien Grall