Re: sched_pin() versus PCPU_GET

2010-08-08 Thread Attilio Rao
2010/8/4  m...@freebsd.org:
 On Fri, Jul 30, 2010 at 2:31 PM, John Baldwin j...@freebsd.org wrote:
 On Friday, July 30, 2010 10:08:22 am John Baldwin wrote:
 On Thursday, July 29, 2010 7:39:02 pm m...@freebsd.org wrote:
  We've seen a few instances at work where witness_warn() in ast()
  indicates the sched lock is still held, but the place it claims the
  lock was acquired is sometimes one where it is in fact impossible to
  still hold the lock, like:
 
      thread_lock(td);
      td->td_flags &= ~TDF_SELECT;
      thread_unlock(td);
 
  What I was wondering is, even though the assembly I see in objdump -S
  for witness_warn has the increment of td_pinned before the PCPU_GET:
 
  802db210:   65 48 8b 1c 25 00 00    mov    %gs:0x0,%rbx
  802db217:   00 00
  802db219:   ff 83 04 01 00 00       incl   0x104(%rbx)
       * Pin the thread in order to avoid problems with thread migration.
       * Once that all verifies are passed about spinlocks ownership,
       * the thread is in a safe path and it can be unpinned.
       */
      sched_pin();
      lock_list = PCPU_GET(spinlocks);
  802db21f:   65 48 8b 04 25 48 00    mov    %gs:0x48,%rax
  802db226:   00 00
      if (lock_list != NULL && lock_list->ll_count != 0) {
  802db228:   48 85 c0                test   %rax,%rax
       * Pin the thread in order to avoid problems with thread migration.
       * Once that all verifies are passed about spinlocks ownership,
       * the thread is in a safe path and it can be unpinned.
       */
      sched_pin();
      lock_list = PCPU_GET(spinlocks);
  802db22b:   48 89 85 f0 fe ff ff    mov    %rax,-0x110(%rbp)
  802db232:   48 89 85 f8 fe ff ff    mov    %rax,-0x108(%rbp)
      if (lock_list != NULL && lock_list->ll_count != 0) {
  802db239:   0f 84 ff 00 00 00       je     802db33e <witness_warn+0x30e>
  802db23f:   44 8b 60 50             mov    0x50(%rax),%r12d
 
  is it possible for the hardware to do any re-ordering here?
 
  The reason I'm suspicious is not just that the code doesn't have a
  lock leak at the indicated point, but in one instance I can see in the
  dump that the lock_list local from witness_warn is from the pcpu
  structure for CPU 0 (and I was warned about sched lock 0), but the
  thread id in panic_cpu is 2.  So clearly the thread was being migrated
  right around panic time.
 
  This is the amd64 kernel on stable/7.  I'm not sure exactly what kind
  of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC.
 
  So... do we need some kind of barrier in the code for sched_pin() for
  it to really do what it claims?  Could the hardware have re-ordered
  the "mov    %gs:0x48,%rax" PCPU_GET to before the sched_pin()
  increment?

 Hmmm, I think it might be able to because they refer to different locations.

 Note this rule in section 8.2.2 of Volume 3A:

   • Reads may be reordered with older writes to different locations but not
     with older writes to the same location.

 It is certainly true that sparc64 could reorder with RMO.  I believe ia64
 could reorder as well.  Since sched_pin/unpin are frequently used to provide
 this sort of synchronization, we could use memory barriers in pin/unpin
 like so:

 sched_pin()
 {
       td->td_pinned = atomic_load_acq_int(&td->td_pinned) + 1;
 }

 sched_unpin()
 {
       atomic_store_rel_int(&td->td_pinned, td->td_pinned - 1);
 }

 We could also just use atomic_add_acq_int() and atomic_sub_rel_int(), but
 they are slightly more heavyweight, though it would be more clear what is
 happening, I think.

 However, to actually get a race you'd have to have an interrupt fire and
 migrate you so that the speculative read was from the other CPU.  However, I
 don't think the speculative read would be preserved in that case.  The CPU
 has to return to a specific PC when it returns from the interrupt and it has
 no way of storing the state for what speculative reordering it might be
 doing, so presumably it is thrown away?  I suppose it is possible that it
 actually retires both instructions (but reordered) and then returns to the PC
 value after the read of listlocks after the interrupt.  However, in that case
 the scheduler would not migrate as it would see td_pinned != 0.  To get the
 race you have to have the interrupt take effect prior to modifying td_pinned,
 so I think the processor would have to discard the reordered read of
 listlocks so it could safely resume execution at the 'incl' instruction.

 The other nit, on x86 at least, is that the incl instruction does
 both a read and a write, and another rule in section 8.2.2 is this:

  • Reads are not reordered with other reads.

 That would seem to prevent the read of listlocks from passing the read of
 td_pinned in the incl instruction on x86.

 I wonder how that's interpreted in the microcode, though?  I.e., if the
 incl instruction decodes to load, add, store, does the h/w allow the
 later reads to pass the final store?

 I added the 

Re: sched_pin() versus PCPU_GET

2010-08-08 Thread mdf
On Sun, Aug 8, 2010 at 2:43 PM, Attilio Rao atti...@freebsd.org wrote:
[quoted text trimmed; it duplicates the message above]

Re: sched_pin() versus PCPU_GET

2010-08-06 Thread Attilio Rao
2010/8/5 John Baldwin j...@freebsd.org:
[quoted text trimmed; it duplicates earlier messages in this thread]

Re: sched_pin() versus PCPU_GET

2010-08-05 Thread mdf
On Wed, Aug 4, 2010 at 11:55 AM, John Baldwin j...@freebsd.org wrote:
[quoted text trimmed; it duplicates earlier messages in this thread]

Re: sched_pin() versus PCPU_GET

2010-08-05 Thread mdf
On Wed, Aug 4, 2010 at 9:20 AM,  m...@freebsd.org wrote:
 On Wed, Aug 4, 2010 at 2:26 PM, John Baldwin j...@freebsd.org wrote:
 Actually, I would beg to differ in that case.  If PCPU_GET(spinlocks)
 returns non-NULL, then it means that you hold a spin lock,

 ll_count is 0 for the correct pc_spinlocks and non-zero for the
 wrong one, though.  So I think it can be non-NULL but the current
 thread/CPU doesn't hold a spinlock.

 I don't believe we have any code in the NMI handler.  I'm on vacation
 today so I'll check tomorrow.

I checked and ipi_nmi_handler() doesn't appear to have any local
changes.  I assume that's where I should look?

Thanks,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: sched_pin() versus PCPU_GET

2010-08-05 Thread Kostik Belousov
On Thu, Aug 05, 2010 at 09:01:22AM -0700, m...@freebsd.org wrote:
 On Wed, Aug 4, 2010 at 9:20 AM,  m...@freebsd.org wrote:
  On Wed, Aug 4, 2010 at 2:26 PM, John Baldwin j...@freebsd.org wrote:
  Actually, I would beg to differ in that case.  If PCPU_GET(spinlocks)
  returns non-NULL, then it means that you hold a spin lock,
 
  ll_count is 0 for the correct pc_spinlocks and non-zero for the
  wrong one, though.  So I think it can be non-NULL but the current
  thread/CPU doesn't hold a spinlock.
 
  I don't believe we have any code in the NMI handler.  I'm on vacation
  today so I'll check tomorrow.
 
 I checked and ipi_nmi_handler() doesn't appear to have any local
 changes.  I assume that's where I should look?

I think that, in case you do not use hwpmc, you could add iretq
as the first instruction of the NMI handler in exception.S and see
whether the issue is reproducible.
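[Editor's note: a rough sketch of the experiment being suggested, against the
stable/7-era amd64 exception.S.  The label and macro names here are written
from memory and may not match the tree exactly; this is a debugging hack, not
a fix.]

```asm
IDTVEC(nmi)
	iretq			/* debug hack: discard the NMI immediately */
	/* original handler body follows and is now unreachable */
	subq	$TF_RIP,%rsp
	...
```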




Re: sched_pin() versus PCPU_GET

2010-08-05 Thread John Baldwin
On Thursday, August 05, 2010 12:01:22 pm m...@freebsd.org wrote:
 On Wed, Aug 4, 2010 at 9:20 AM,  m...@freebsd.org wrote:
  On Wed, Aug 4, 2010 at 2:26 PM, John Baldwin j...@freebsd.org wrote:
  Actually, I would beg to differ in that case.  If PCPU_GET(spinlocks)
  returns non-NULL, then it means that you hold a spin lock,
 
  ll_count is 0 for the correct pc_spinlocks and non-zero for the
  wrong one, though.  So I think it can be non-NULL but the current
  thread/CPU doesn't hold a spinlock.
 
  I don't believe we have any code in the NMI handler.  I'm on vacation
  today so I'll check tomorrow.
 
 I checked and ipi_nmi_handler() doesn't appear to have any local
 changes.  I assume that's where I should look?

The tricky bits are all in the assembly rather than in C, probably in 
exception.S.  However, if %gs were corrupt I would not expect it to point to 
another CPU's data, but garbage from userland.

-- 
John Baldwin


Re: sched_pin() versus PCPU_GET

2010-08-05 Thread John Baldwin
On Thursday, August 05, 2010 11:59:37 am m...@freebsd.org wrote:
[quoted text trimmed; it duplicates earlier messages in this thread]

Re: sched_pin() versus PCPU_GET

2010-08-05 Thread mdf
 (gdb) p panic_cpu
 $9 = 2
 (gdb) p dumptid
 $12 = 100751
 (gdb) p cpuhead.slh_first->pc_allcpu.sle_next->pc_curthread->td_tid
 $14 = 100751

 (gdb) p *cpuhead.slh_first->pc_allcpu.sle_next
 $6 = {
   pc_curthread = 0xff00716d6960,
   pc_cpuid = 2,
   pc_spinlocks = 0x80803198,

 (gdb) p lock_list
 $2 = (struct lock_list_entry *) 0x80803fb0

 (gdb) p *cpuhead.slh_first->pc_allcpu.sle_next->pc_allcpu.sle_next->pc_allcpu.sle_next
 $8 = {
   pc_curthread = 0xff0005479960,
   pc_cpuid = 0,
   pc_spinlocks = 0x80803fb0,

 I.e. we're dumping on CPU 2, but the lock_list pointer that was saved
 in the dump matches that of CPU 0.

 Can you print out the tid's for the two curthreads?  It's not impossible that
 the thread migrated after calling panic.  In fact we force threads to CPU 0
 during shutdown.

dumptid matches the pc_curthread for CPU 2 and is printed above.

The lock_list local variable matches the PCPU for CPU 0, which has tid:

(gdb) p cpuhead.slh_first->pc_allcpu.sle_next->pc_allcpu.sle_next->pc_allcpu.sle_next->pc_curthread->td_tid
$2 = 15

(gdb) p cpuhead.slh_first->pc_allcpu.sle_next->pc_allcpu.sle_next->pc_allcpu.sle_next->pc_curthread->td_proc->p_comm
$3 = "idle: cpu0\000\000\000\000\000\000\000\000\000"

Note that lock_list->ll_count is now 0, but of course wasn't when we
panic'd.  Also, the panic message showed "exclusive spin mutex sched
lock 0 (sched lock) r = 0 (0x806cf640) locked @
/build/mnt/src/sys/kern/sys_generic.c:826"; i.e. the lock was for CPU
0 as well.  If we truly were returning to user space with that lock
held it would still be held and we'd still be on CPU 0.

Cheers,
matthew


Re: sched_pin() versus PCPU_GET

2010-08-04 Thread John Baldwin
On Tuesday, August 03, 2010 9:46:16 pm m...@freebsd.org wrote:
 On Fri, Jul 30, 2010 at 2:31 PM, John Baldwin j...@freebsd.org wrote:
  On Friday, July 30, 2010 10:08:22 am John Baldwin wrote:
  On Thursday, July 29, 2010 7:39:02 pm m...@freebsd.org wrote:
   We've seen a few instances at work where witness_warn() in ast()
   indicates the sched lock is still held, but the place it claims it was
   held by is in fact sometimes not possible to keep the lock, like:
  
   thread_lock(td);
   td-td_flags = ~TDF_SELECT;
   thread_unlock(td);
  
   What I was wondering is, even though the assembly I see in objdump -S
   for witness_warn has the increment of td_pinned before the PCPU_GET:
  
   802db210:   65 48 8b 1c 25 00 00mov%gs:0x0,%rbx
   802db217:   00 00
   802db219:   ff 83 04 01 00 00   incl   0x104(%rbx)
* Pin the thread in order to avoid problems with thread migration.
* Once that all verifies are passed about spinlocks ownership,
* the thread is in a safe path and it can be unpinned.
*/
   sched_pin();
   lock_list = PCPU_GET(spinlocks);
   802db21f:   65 48 8b 04 25 48 00mov%gs:0x48,%rax
   802db226:   00 00
   if (lock_list != NULL  lock_list-ll_count != 0) {
   802db228:   48 85 c0test   %rax,%rax
* Pin the thread in order to avoid problems with thread migration.
* Once that all verifies are passed about spinlocks ownership,
* the thread is in a safe path and it can be unpinned.
*/
   sched_pin();
   lock_list = PCPU_GET(spinlocks);
   802db22b:   48 89 85 f0 fe ff ffmov%rax,-0x110(%rbp)
   802db232:   48 89 85 f8 fe ff ffmov%rax,-0x108(%rbp)
   if (lock_list != NULL  lock_list-ll_count != 0) {
   802db239:   0f 84 ff 00 00 00   je 802db33e
   witness_warn+0x30e
   802db23f:   44 8b 60 50 mov0x50(%rax),%r12d
  
   is it possible for the hardware to do any re-ordering here?
  
   The reason I'm suspicious is not just that the code doesn't have a
   lock leak at the indicated point, but in one instance I can see in the
   dump that the lock_list local from witness_warn is from the pcpu
   structure for CPU 0 (and I was warned about sched lock 0), but the
   thread id in panic_cpu is 2.  So clearly the thread was being migrated
   right around panic time.
  
   This is the amd64 kernel on stable/7.  I'm not sure exactly what kind
   of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC.
  
   So... do we need some kind of barrier in the code for sched_pin() for
   it to really do what it claims?  Could the hardware have re-ordered
   the mov%gs:0x48,%rax PCPU_GET to before the sched_pin()
   increment?
 
  Hmmm, I think it might be able to because they refer to different
  locations.
 
  Note this rule in section 8.2.2 of Volume 3A:
 
    • Reads may be reordered with older writes to different locations but not
      with older writes to the same location.
 
  It is certainly true that sparc64 could reorder with RMO.  I believe ia64
  could reorder as well.  Since sched_pin/unpin are frequently used to provide
  this sort of synchronization, we could use memory barriers in pin/unpin
  like so:
 
  sched_pin()
  {
        td->td_pinned = atomic_load_acq_int(&td->td_pinned) + 1;
  }
 
  sched_unpin()
  {
        atomic_store_rel_int(&td->td_pinned, td->td_pinned - 1);
  }
 
  We could also just use atomic_add_acq_int() and atomic_sub_rel_int(), but
  they are slightly more heavyweight, though it would be more clear what is
  happening, I think.
 
  However, to actually get a race you'd have to have an interrupt fire and
  migrate you so that the speculative read was from the other CPU.  However, I
  don't think the speculative read would be preserved in that case.  The CPU
  has to return to a specific PC when it returns from the interrupt and it has
  no way of storing the state for what speculative reordering it might be
  doing, so presumably it is thrown away?  I suppose it is possible that it
  actually retires both instructions (but reordered) and then returns to the PC
  value after the read of listlocks after the interrupt.  However, in that case
  the scheduler would not migrate as it would see td_pinned != 0.  To get the
  race you have to have the interrupt take effect prior to modifying td_pinned,
  so I think the processor would have to discard the reordered read of
  listlocks so it could safely resume execution at the 'incl' instruction.
 
  The other nit there on x86 at least is that the incl instruction is doing
  both a read and a write and another rule in the section 8.2.2 is this:
 
   • Reads are not reordered with other reads.
 
  That would seem to prevent the read of listlocks from passing the read of
  td_pinned in the incl instruction on x86.
 
 I wonder how that's interpreted in the microcode, though?  I.e. if the
 incr instruction decodes to load, add, store, does the h/w allow the
 later reads to pass the final store?

Re: sched_pin() versus PCPU_GET

2010-08-04 Thread mdf
On Wed, Aug 4, 2010 at 2:26 PM, John Baldwin j...@freebsd.org wrote:
 [...]

Re: sched_pin() versus PCPU_GET

2010-08-04 Thread John Baldwin
On Wednesday, August 04, 2010 12:20:31 pm m...@freebsd.org wrote:
 [...]

Re: sched_pin() versus PCPU_GET

2010-08-03 Thread mdf
On Fri, Jul 30, 2010 at 2:31 PM, John Baldwin j...@freebsd.org wrote:
 [...]

 The other nit there on x86 at least is that the incl instruction is doing
 both a read and a write and another rule in the section 8.2.2 is this:

  • Reads are not reordered with other reads.

 That would seem to prevent the read of listlocks from passing the read of
 td_pinned in the incl instruction on x86.

I wonder how that's interpreted in the microcode, though?  I.e. if the
incr instruction decodes to load, add, store, does the h/w allow the
later reads to pass the final store?

I added the following:

sched_pin();
   

Re: sched_pin() versus PCPU_GET

2010-07-30 Thread Kostik Belousov
On Thu, Jul 29, 2010 at 04:57:25PM -0700, m...@freebsd.org wrote:
 [...]
 
 So after some research, the answer I'm getting is maybe.  What I'm
 concerned about is whether the h/w reordered the read of PCPU_GET in
 front of the previous store to increment td_pinned.  While not an
 ultimate authority,
 http://en.wikipedia.org/wiki/Memory_ordering#In_SMP_microprocessor_systems
 implies that stores can be reordered after loads for both Intel and
 amd64 chips, which would I believe account for the behavior seen here.
 

Am I right that you suggest that in the sequence
        mov     %gs:0x0,%rbx      [1]
        incl    0x104(%rbx)       [2]
        mov     %gs:0x48,%rax     [3]
interrupt and preemption happen between points [2] and [3] ?
And the %rax value after the thread was put back onto the (different) new
CPU and executed [3] was still from the old CPU's pcpu area ?

I do not believe this is possible. CPU is always self-consistent. Context
switch from the thread can only occur on the return from interrupt
handler, in critical_exit() or such. This code is executing on the
same processor, and thus should already see the effect of [2], that
would prevent context switch.

If interrupt happens between [1] and [2], then context saving code
should still see the consistent view of the register file state,
regardless of the processor issuing speculative read of
*%gs:0x48. Return from the interrupt is the serialization point due to
iret, causing read in [3] to be reissued.





Re: sched_pin() versus PCPU_GET

2010-07-30 Thread mdf
2010/7/30 Kostik Belousov kostik...@gmail.com:
 [...]

 Am I right that you suggest that in the sequence
        mov     %gs:0x0,%rbx      [1]
        incl    0x104(%rbx)       [2]
        mov     %gs:0x48,%rax     [3]
 interrupt and preemption happen between points [2] and [3] ?
 And the %rax value after the thread was put back onto the (different) new
 CPU and executed [3] was still from the old CPU's pcpu area ?

Right, but I'm also asking if it's possible the hardware executed the
instructions as:

        mov     %gs:0x0,%rbx      [1]
        mov     %gs:0x48,%rax     [3]
        incl    0x104(%rbx)       [2]

On PowerPC this is definitely possible and I'd use an isync to prevent
the re-ordering.  I haven't been able to confirm that Intel/AMD
present such a strict ordering that no barrier is needed.

It's admittedly a very tight window, and we've only seen it twice, but
I have no other way to explain the symptom.  Unfortunately in the dump
gdb shows both %rax and %gs as 0, so I can't confirm that they had a
value I'd expect from another CPU.  The only thing I do have is
panic_cpu being different than the CPU at the time of
PCPU_GET(spinlock), but of course there's definitely a window there.

 I do not believe this is possible. CPU is always self-consistent. Context
 switch from the thread can only occur on the return from interrupt
 handler, in critical_exit() or such. This code is executing on the
 same processor, and thus should already see the effect of [2], that
 would prevent context switch.

Right, but if the hardware allowed reads to pass writes, then %rax
would have an incorrect value which would be saved at interrupt time,
and restored on another processor.

I can add a few sanity asserts to try to prove this one way or another
and hope they don't mess with the timing; 

Re: sched_pin() versus PCPU_GET

2010-07-30 Thread John Baldwin
On Thursday, July 29, 2010 7:39:02 pm m...@freebsd.org wrote:
 [...]
 
 So... do we need some kind of barrier in the code for sched_pin() for
 it to really do what it claims?  Could the hardware have re-ordered
 the "mov %gs:0x48,%rax" PCPU_GET to before the sched_pin()
 increment?

Hmmm, I think it might be able to because they refer to different locations.

Note this rule in section 8.2.2 of Volume 3A:

  • Reads may be reordered with older writes to different locations but not
with older writes to the same location.

It is certainly true that sparc64 could reorder with RMO.  I believe ia64 
could reorder as well.  Since sched_pin/unpin are frequently used to provide 
this sort of synchronization, we could use memory barriers in pin/unpin
like so:

sched_pin()
{
td->td_pinned = atomic_load_acq_int(&td->td_pinned) + 1;
}

sched_unpin()
{
atomic_store_rel_int(&td->td_pinned, td->td_pinned - 1);
}

We could also just use atomic_add_acq_int() and atomic_sub_rel_int(), but they 
are slightly more heavyweight, though it would be more clear what is happening 
I think.

-- 
John Baldwin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: sched_pin() versus PCPU_GET

2010-07-30 Thread John Baldwin
On Friday, July 30, 2010 10:08:22 am John Baldwin wrote:
 [...]

However, to actually get a race you'd have to have an interrupt fire and
migrate you so that the speculative read was from the other CPU.  However, I
don't think the speculative read would be preserved in that case.  The CPU
has to return to a specific PC when it returns from the interrupt and it has
no way of storing the state for what speculative reordering it might be
doing, so presumably it is thrown away?  I suppose it is possible that it
actually retires both instructions (but reordered) and then returns to the PC
value after the read of listlocks after the interrupt.  However, in that case
the scheduler would not migrate as it would see td_pinned != 0.  To get the
race you have to have the interrupt take effect prior to modifying td_pinned,
so I think the processor would have to discard the reordered read of
listlocks so it could safely resume execution at the 'incl' instruction.

The other nit there on x86 at least is that the incl instruction is doing
both a read and a write, and another rule in section 8.2.2 is this:

  • Reads are not reordered with other reads.

That would seem to prevent the read of listlocks from passing the read of
td_pinned in the incl instruction on x86.

-- 
John Baldwin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


sched_pin() versus PCPU_GET

2010-07-29 Thread mdf
We've seen a few instances at work where witness_warn() in ast()
indicates the sched lock is still held, but the place it claims the lock
was acquired sometimes cannot possibly still hold it, like:

thread_lock(td);
td->td_flags &= ~TDF_SELECT;
thread_unlock(td);

What I was wondering is, even though the assembly I see in objdump -S
for witness_warn has the increment of td_pinned before the PCPU_GET:

802db210:   65 48 8b 1c 25 00 00    mov    %gs:0x0,%rbx
802db217:   00 00
802db219:   ff 83 04 01 00 00       incl   0x104(%rbx)
         * Pin the thread in order to avoid problems with thread migration.
         * Once that all verifies are passed about spinlocks ownership,
         * the thread is in a safe path and it can be unpinned.
         */
        sched_pin();
        lock_list = PCPU_GET(spinlocks);
802db21f:   65 48 8b 04 25 48 00    mov    %gs:0x48,%rax
802db226:   00 00
        if (lock_list != NULL && lock_list->ll_count != 0) {
802db228:   48 85 c0                test   %rax,%rax
         * Pin the thread in order to avoid problems with thread migration.
         * Once that all verifies are passed about spinlocks ownership,
         * the thread is in a safe path and it can be unpinned.
         */
        sched_pin();
        lock_list = PCPU_GET(spinlocks);
802db22b:   48 89 85 f0 fe ff ff    mov    %rax,-0x110(%rbp)
802db232:   48 89 85 f8 fe ff ff    mov    %rax,-0x108(%rbp)
        if (lock_list != NULL && lock_list->ll_count != 0) {
802db239:   0f 84 ff 00 00 00       je     802db33e witness_warn+0x30e
802db23f:   44 8b 60 50             mov    0x50(%rax),%r12d

is it possible for the hardware to do any re-ordering here?

The reason I'm suspicious is not just that the code doesn't have a
lock leak at the indicated point, but in one instance I can see in the
dump that the lock_list local from witness_warn is from the pcpu
structure for CPU 0 (and I was warned about sched lock 0), but the
thread id in panic_cpu is 2.  So clearly the thread was being migrated
right around panic time.

This is the amd64 kernel on stable/7.  I'm not sure exactly what kind
of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC.

So... do we need some kind of barrier in the code for sched_pin() for
it to really do what it claims?  Could the hardware have re-ordered
the mov %gs:0x48,%rax PCPU_GET to before the sched_pin()
increment?

Thanks,
matthew


Re: sched_pin() versus PCPU_GET

2010-07-29 Thread mdf
On Thu, Jul 29, 2010 at 4:39 PM,  m...@freebsd.org wrote:
 We've seen a few instances at work where witness_warn() in ast()
 indicates the sched lock is still held, but the place it claims the lock
 was acquired sometimes cannot possibly still hold it, like:

         thread_lock(td);
         td->td_flags &= ~TDF_SELECT;
         thread_unlock(td);

 What I was wondering is, even though the assembly I see in objdump -S
 for witness_warn has the increment of td_pinned before the PCPU_GET:

 802db210:       65 48 8b 1c 25 00 00    mov    %gs:0x0,%rbx
 802db217:       00 00
 802db219:       ff 83 04 01 00 00       incl   0x104(%rbx)
          * Pin the thread in order to avoid problems with thread migration.
          * Once that all verifies are passed about spinlocks ownership,
          * the thread is in a safe path and it can be unpinned.
          */
         sched_pin();
         lock_list = PCPU_GET(spinlocks);
 802db21f:       65 48 8b 04 25 48 00    mov    %gs:0x48,%rax
 802db226:       00 00
         if (lock_list != NULL && lock_list->ll_count != 0) {
 802db228:       48 85 c0                test   %rax,%rax
          * Pin the thread in order to avoid problems with thread migration.
          * Once that all verifies are passed about spinlocks ownership,
          * the thread is in a safe path and it can be unpinned.
          */
         sched_pin();
         lock_list = PCPU_GET(spinlocks);
 802db22b:       48 89 85 f0 fe ff ff    mov    %rax,-0x110(%rbp)
 802db232:       48 89 85 f8 fe ff ff    mov    %rax,-0x108(%rbp)
         if (lock_list != NULL && lock_list->ll_count != 0) {
 802db239:       0f 84 ff 00 00 00       je     802db33e
  witness_warn+0x30e
 802db23f:       44 8b 60 50             mov    0x50(%rax),%r12d

 is it possible for the hardware to do any re-ordering here?

 The reason I'm suspicious is not just that the code doesn't have a
 lock leak at the indicated point, but in one instance I can see in the
 dump that the lock_list local from witness_warn is from the pcpu
 structure for CPU 0 (and I was warned about sched lock 0), but the
 thread id in panic_cpu is 2.  So clearly the thread was being migrated
 right around panic time.

 This is the amd64 kernel on stable/7.  I'm not sure exactly what kind
 of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC.

 So... do we need some kind of barrier in the code for sched_pin() for
 it to really do what it claims?  Could the hardware have re-ordered
 the mov    %gs:0x48,%rax PCPU_GET to before the sched_pin()
 increment?

So after some research, the answer I'm getting is "maybe".  What I'm
concerned about is whether the h/w reordered the read of PCPU_GET in
front of the previous store that increments td_pinned.  While not an
ultimate authority,
http://en.wikipedia.org/wiki/Memory_ordering#In_SMP_microprocessor_systems
implies that stores can be reordered after loads for both Intel and
amd64 chips, which I believe would account for the behavior seen here.
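[Editorial note: not part of the original thread.  The suspected hazard is
exactly the store/load pair that x86 is permitted to reorder: a plain store
to td_pinned followed by a load of the per-CPU spinlocks pointer.  The
textbook remedy is a full fence between them, sketched here with C11
atomics; pinned and spinlocks_slot are hypothetical stand-ins, not kernel
fields.]

```c
#include <assert.h>
#include <stdatomic.h>

static _Atomic int  pinned;
static _Atomic long spinlocks_slot;	/* stand-in for PCPU spinlocks */

/*
 * Dekker-style pattern: without the seq_cst fence, x86 may make the load
 * of spinlocks_slot globally visible before the store to pinned, which is
 * the reordering suspected in this thread.  The fence (mfence on x86)
 * forces the store to drain before the load executes.
 */
static long
pin_then_read(void)
{
	atomic_store_explicit(&pinned, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return (atomic_load_explicit(&spinlocks_slot, memory_order_relaxed));
}
```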

Thanks,
matthew