Re: sched_pin() versus PCPU_GET
On Sun, Aug 8, 2010 at 2:43 PM, Attilio Rao wrote:
> 2010/8/4 :
>> On Fri, Jul 30, 2010 at 2:31 PM, John Baldwin wrote:
>>> On Friday, July 30, 2010 10:08:22 am John Baldwin wrote:
>>>> On Thursday, July 29, 2010 7:39:02 pm m...@freebsd.org wrote:
>>>>> We've seen a few instances at work where witness_warn() in ast()
>>>>> indicates the sched lock is still held, but the place it claims it was
>>>>> held by is in fact sometimes not possible to keep the lock, like:
>>>>>
>>>>> thread_lock(td);
>>>>> td->td_flags &= ~TDF_SELECT;
>>>>> thread_unlock(td);
>>>>>
>>>>> What I was wondering is, even though the assembly I see in objdump -S
>>>>> for witness_warn has the increment of td_pinned before the PCPU_GET:
>>>>>
>>>>> 802db210: 65 48 8b 1c 25 00 00    mov    %gs:0x0,%rbx
>>>>> 802db217: 00 00
>>>>> 802db219: ff 83 04 01 00 00       incl   0x104(%rbx)
>>>>>          * Pin the thread in order to avoid problems with thread migration.
>>>>>          * Once that all verifies are passed about spinlocks ownership,
>>>>>          * the thread is in a safe path and it can be unpinned.
>>>>>          */
>>>>>         sched_pin();
>>>>>         lock_list = PCPU_GET(spinlocks);
>>>>> 802db21f: 65 48 8b 04 25 48 00    mov    %gs:0x48,%rax
>>>>> 802db226: 00 00
>>>>>         if (lock_list != NULL && lock_list->ll_count != 0) {
>>>>> 802db228: 48 85 c0                test   %rax,%rax
>>>>>          * Pin the thread in order to avoid problems with thread migration.
>>>>>          * Once that all verifies are passed about spinlocks ownership,
>>>>>          * the thread is in a safe path and it can be unpinned.
>>>>>          */
>>>>>         sched_pin();
>>>>>         lock_list = PCPU_GET(spinlocks);
>>>>> 802db22b: 48 89 85 f0 fe ff ff    mov    %rax,-0x110(%rbp)
>>>>> 802db232: 48 89 85 f8 fe ff ff    mov    %rax,-0x108(%rbp)
>>>>>         if (lock_list != NULL && lock_list->ll_count != 0) {
>>>>> 802db239: 0f 84 ff 00 00 00       je     802db33e
>>>>>
>>>>> 802db23f: 44 8b 60 50             mov    0x50(%rax),%r12d
>>>>>
>>>>> is it possible for the hardware to do any re-ordering here?
>>>>>
>>>>> The reason I'm suspicious is not just that the code doesn't have a
>>>>> lock leak at the indicated point, but in one instance I can see in the
>>>>> dump that the lock_list local from witness_warn is from the pcpu
>>>>> structure for CPU 0 (and I was warned about sched lock 0), but the
>>>>> thread id in panic_cpu is 2.  So clearly the thread was being migrated
>>>>> right around panic time.
>>>>>
>>>>> This is the amd64 kernel on stable/7.  I'm not sure exactly what kind
>>>>> of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC.
>>>>>
>>>>> So... do we need some kind of barrier in the code for sched_pin() for
>>>>> it to really do what it claims?  Could the hardware have re-ordered
>>>>> the "mov %gs:0x48,%rax" PCPU_GET to before the sched_pin()
>>>>> increment?
>>>>
>>>> Hmmm, I think it might be able to because they refer to different locations.
>>>>
>>>> Note this rule in section 8.2.2 of Volume 3A:
>>>>
>>>> • Reads may be reordered with older writes to different locations but not
>>>>   with older writes to the same location.
>>>>
>>>> It is certainly true that sparc64 could reorder with RMO.  I believe ia64
>>>> could reorder as well.  Since sched_pin/unpin are frequently used to provide
>>>> this sort of synchronization, we could use memory barriers in pin/unpin
>>>> like so:
>>>>
>>>> sched_pin()
>>>> {
>>>>         td->td_pinned = atomic_load_acq_int(&td->td_pinned) + 1;
>>>> }
>>>>
>>>> sched_unpin()
>>>> {
>>>>         atomic_store_rel_int(&td->td_pinned, td->td_pinned - 1);
>>>> }
>>>>
>>>> We could also just use atomic_add_acq_int() and atomic_sub_rel_int(), but
>>>> they are slightly more heavyweight, though it would be more clear what is
>>>> happening I think.
>>>
>>> However, to actually get a race you'd have to have an interrupt fire and
>>> migrate you so that the speculative read was from the other CPU.  However, I
>>> don't think the speculative read would be preserved in that case.  The CPU
>>> has to return to a specific PC when it returns from the interrupt and it has
>>> no way of storing the state for what speculative reordering it might be
>>> doing, so presumably it is thrown away?  I suppose it is possible that it
>>> actually retires both instructions (but reordered) and then returns to the PC
>>> value after the read of listlocks after the interrupt.  However, in that case
>>> the scheduler would not migrate as it would see td_pinned != 0.  To get the
>>> race you have to have the interrupt take effect prior to modifying td_pinned,
>>> so I think the processor would have to discard the reordered read of
>>> listlocks so it could safely resume execution at the 'incl' instruction.
Re: sched_pin() versus PCPU_GET
2010/8/4 :
> On Fri, Jul 30, 2010 at 2:31 PM, John Baldwin wrote:
>> On Friday, July 30, 2010 10:08:22 am John Baldwin wrote:
>>> [snip]
>>
>> The other nit there on x86 at least is that the incl instruction is doing
>> both a read and a write, and another rule in section 8.2.2 is this:
>>
>> • Reads are not reordered with other reads.
Re: sched_pin() versus PCPU_GET
>> (gdb) p panic_cpu
>> $9 = 2
>> (gdb) p dumptid
>> $12 = 100751
>> (gdb) p cpuhead.slh_first->pc_allcpu.sle_next->pc_curthread->td_tid
>> $14 = 100751
>>
>> (gdb) p *cpuhead.slh_first->pc_allcpu.sle_next
>> $6 = {
>>   pc_curthread = 0xff00716d6960,
>>   pc_cpuid = 2,
>>   pc_spinlocks = 0x80803198,
>>
>> (gdb) p lock_list
>> $2 = (struct lock_list_entry *) 0x80803fb0
>>
>> (gdb) p *cpuhead.slh_first->pc_allcpu.sle_next->pc_allcpu.sle_next->pc_allcpu.sle_next
>> $8 = {
>>   pc_curthread = 0xff0005479960,
>>   pc_cpuid = 0,
>>   pc_spinlocks = 0x80803fb0,
>>
>> I.e. we're dumping on CPU 2, but the lock_list pointer that was saved
>> in the dump matches that of CPU 0.
>
> Can you print out the tid's for the two curthreads?  It's not impossible that
> the thread migrated after calling panic.  In fact we force threads to CPU 0
> during shutdown.

dumptid matches the pc_curthread for CPU 2 and is printed above.  The
lock_list local variable matches the PCPU for CPU 0, which has tid:

(gdb) p cpuhead.slh_first->pc_allcpu.sle_next->pc_allcpu.sle_next->pc_allcpu.sle_next->pc_curthread->td_tid
$2 = 15
(gdb) p cpuhead.slh_first->pc_allcpu.sle_next->pc_allcpu.sle_next->pc_allcpu.sle_next->pc_curthread->td_proc->p_comm
$3 = "idle: cpu0\000\000\000\000\000\000\000\000\000"

Note that lock_list->ll_count is now 0, but of course wasn't when we
panic'd.  Also, the panic message showed "exclusive spin mutex sched lock 0
(sched lock) r = 0 (0x806cf640) locked @
/build/mnt/src/sys/kern/sys_generic.c:826"; i.e. the lock was for CPU 0 as
well.  If we truly were returning to user space with that lock held it would
still be held and we'd still be on CPU 0.

Cheers,
matthew

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
Re: sched_pin() versus PCPU_GET
On Thursday, August 05, 2010 12:01:22 pm m...@freebsd.org wrote:
> On Wed, Aug 4, 2010 at 9:20 AM,  wrote:
>> On Wed, Aug 4, 2010 at 2:26 PM, John Baldwin wrote:
>>> Actually, I would beg to differ in that case.  If PCPU_GET(spinlocks)
>>> returns non-NULL, then it means that you hold a spin lock,
>>
>> ll_count is 0 for the "correct" pc_spinlocks and non-zero for the
>> "wrong" one, though.  So I think it can be non-NULL but the current
>> thread/CPU doesn't hold a spinlock.
>>
>> I don't believe we have any code in the NMI handler.  I'm on vacation
>> today so I'll check tomorrow.
>
> I checked and ipi_nmi_handler() doesn't appear to have any local
> changes.  I assume that's where I should look?

The tricky bits are all in the assembly rather than in C, probably in
exception.S.  However, if %gs were corrupt I would not expect it to point to
another CPU's data, but garbage from userland.

-- 
John Baldwin
Re: sched_pin() versus PCPU_GET
On Thu, Aug 05, 2010 at 09:01:22AM -0700, m...@freebsd.org wrote:
> On Wed, Aug 4, 2010 at 9:20 AM,  wrote:
>> On Wed, Aug 4, 2010 at 2:26 PM, John Baldwin wrote:
>>> Actually, I would beg to differ in that case.  If PCPU_GET(spinlocks)
>>> returns non-NULL, then it means that you hold a spin lock,
>>
>> ll_count is 0 for the "correct" pc_spinlocks and non-zero for the
>> "wrong" one, though.  So I think it can be non-NULL but the current
>> thread/CPU doesn't hold a spinlock.
>>
>> I don't believe we have any code in the NMI handler.  I'm on vacation
>> today so I'll check tomorrow.
>
> I checked and ipi_nmi_handler() doesn't appear to have any local
> changes.  I assume that's where I should look?

I think that, in case you do not use hwpmc, you could add "iretq" as the
first instruction of the nmi handler in exception.S and see whether the
issue is reproducible.
Re: sched_pin() versus PCPU_GET
On Wed, Aug 4, 2010 at 9:20 AM,  wrote:
> On Wed, Aug 4, 2010 at 2:26 PM, John Baldwin wrote:
>> Actually, I would beg to differ in that case.  If PCPU_GET(spinlocks)
>> returns non-NULL, then it means that you hold a spin lock,
>
> ll_count is 0 for the "correct" pc_spinlocks and non-zero for the
> "wrong" one, though.  So I think it can be non-NULL but the current
> thread/CPU doesn't hold a spinlock.
>
> I don't believe we have any code in the NMI handler.  I'm on vacation
> today so I'll check tomorrow.

I checked and ipi_nmi_handler() doesn't appear to have any local
changes.  I assume that's where I should look?

Thanks,
matthew
Re: sched_pin() versus PCPU_GET
On Wed, Aug 4, 2010 at 2:26 PM, John Baldwin wrote: > On Tuesday, August 03, 2010 9:46:16 pm m...@freebsd.org wrote: >> On Fri, Jul 30, 2010 at 2:31 PM, John Baldwin wrote: >> > On Friday, July 30, 2010 10:08:22 am John Baldwin wrote: >> >> On Thursday, July 29, 2010 7:39:02 pm m...@freebsd.org wrote: >> >> > We've seen a few instances at work where witness_warn() in ast() >> >> > indicates the sched lock is still held, but the place it claims it was >> >> > held by is in fact sometimes not possible to keep the lock, like: >> >> > >> >> > thread_lock(td); >> >> > td->td_flags &= ~TDF_SELECT; >> >> > thread_unlock(td); >> >> > >> >> > What I was wondering is, even though the assembly I see in objdump -S >> >> > for witness_warn has the increment of td_pinned before the PCPU_GET: >> >> > >> >> > 802db210: 65 48 8b 1c 25 00 00 mov %gs:0x0,%rbx >> >> > 802db217: 00 00 >> >> > 802db219: ff 83 04 01 00 00 incl 0x104(%rbx) >> >> > * Pin the thread in order to avoid problems with thread migration. >> >> > * Once that all verifies are passed about spinlocks ownership, >> >> > * the thread is in a safe path and it can be unpinned. >> >> > */ >> >> > sched_pin(); >> >> > lock_list = PCPU_GET(spinlocks); >> >> > 802db21f: 65 48 8b 04 25 48 00 mov %gs:0x48,%rax >> >> > 802db226: 00 00 >> >> > if (lock_list != NULL && lock_list->ll_count != 0) { >> >> > 802db228: 48 85 c0 test %rax,%rax >> >> > * Pin the thread in order to avoid problems with thread migration. >> >> > * Once that all verifies are passed about spinlocks ownership, >> >> > * the thread is in a safe path and it can be unpinned. 
>> >> > */ >> >> > sched_pin(); >> >> > lock_list = PCPU_GET(spinlocks); >> >> > 802db22b: 48 89 85 f0 fe ff ff mov %rax,-0x110(%rbp) >> >> > 802db232: 48 89 85 f8 fe ff ff mov %rax,-0x108(%rbp) >> >> > if (lock_list != NULL && lock_list->ll_count != 0) { >> >> > 802db239: 0f 84 ff 00 00 00 je 802db33e >> >> > >> >> > 802db23f: 44 8b 60 50 mov 0x50(%rax),%r12d >> >> > >> >> > is it possible for the hardware to do any re-ordering here? >> >> > >> >> > The reason I'm suspicious is not just that the code doesn't have a >> >> > lock leak at the indicated point, but in one instance I can see in the >> >> > dump that the lock_list local from witness_warn is from the pcpu >> >> > structure for CPU 0 (and I was warned about sched lock 0), but the >> >> > thread id in panic_cpu is 2. So clearly the thread was being migrated >> >> > right around panic time. >> >> > >> >> > This is the amd64 kernel on stable/7. I'm not sure exactly what kind >> >> > of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC. >> >> > >> >> > So... do we need some kind of barrier in the code for sched_pin() for >> >> > it to really do what it claims? Could the hardware have re-ordered >> >> > the "mov %gs:0x48,%rax" PCPU_GET to before the sched_pin() >> >> > increment? >> >> >> >> Hmmm, I think it might be able to because they refer to different >> >> locations. >> >> >> >> Note this rule in section 8.2.2 of Volume 3A: >> >> >> >> • Reads may be reordered with older writes to different locations but >> >> not >> >> with older writes to the same location. >> >> >> >> It is certainly true that sparc64 could reorder with RMO. I believe ia64 >> >> could reorder as well. 
Since sched_pin/unpin are frequently used to >> >> provide >> >> this sort of synchronization, we could use memory barriers in pin/unpin >> >> like so: >> >> >> >> sched_pin() >> >> { >> >> td->td_pinned = atomic_load_acq_int(&td->td_pinned) + 1; >> >> } >> >> >> >> sched_unpin() >> >> { >> >> atomic_store_rel_int(&td->td_pinned, td->td_pinned - 1); >> >> } >> >> >> >> We could also just use atomic_add_acq_int() and atomic_sub_rel_int(), but >> >> they >> >> are slightly more heavyweight, though it would be more clear what is >> >> happening >> >> I think. >> > >> > However, to actually get a race you'd have to have an interrupt fire and >> > migrate you so that the speculative read was from the other CPU. However, >> > I >> > don't think the speculative read would be preserved in that case. The CPU >> > has to return to a specific PC when it returns from the interrupt and it >> > has >> > no way of storing the state for what speculative reordering it might be >> > doing, so presumably it is thrown away? I suppose it is possible that it >> > actually retires both instructions (but reordered) and then returns to the >> > PC >> > value after the read of listlocks after the interrupt. However, in that >> > case >> > the scheduler would not migrate as it would see td_pinned != 0. To get the >> > race you have to have the interrupt take effect prior to modifying >> > td_pinned, >> > so I think the processor would have to discard the reordered read of listlocks so it could safely resume execution at the 'incl' instruction.
Re: sched_pin() versus PCPU_GET
On Tuesday, August 03, 2010 9:46:16 pm m...@freebsd.org wrote: > On Fri, Jul 30, 2010 at 2:31 PM, John Baldwin wrote: > > On Friday, July 30, 2010 10:08:22 am John Baldwin wrote: > >> On Thursday, July 29, 2010 7:39:02 pm m...@freebsd.org wrote: > >> > We've seen a few instances at work where witness_warn() in ast() > >> > indicates the sched lock is still held, but the place it claims it was > >> > held by is in fact sometimes not possible to keep the lock, like: > >> > > >> > thread_lock(td); > >> > td->td_flags &= ~TDF_SELECT; > >> > thread_unlock(td); > >> > > >> > What I was wondering is, even though the assembly I see in objdump -S > >> > for witness_warn has the increment of td_pinned before the PCPU_GET: > >> > > >> > 802db210: 65 48 8b 1c 25 00 00mov%gs:0x0,%rbx > >> > 802db217: 00 00 > >> > 802db219: ff 83 04 01 00 00 incl 0x104(%rbx) > >> > * Pin the thread in order to avoid problems with thread migration. > >> > * Once that all verifies are passed about spinlocks ownership, > >> > * the thread is in a safe path and it can be unpinned. > >> > */ > >> > sched_pin(); > >> > lock_list = PCPU_GET(spinlocks); > >> > 802db21f: 65 48 8b 04 25 48 00mov%gs:0x48,%rax > >> > 802db226: 00 00 > >> > if (lock_list != NULL && lock_list->ll_count != 0) { > >> > 802db228: 48 85 c0test %rax,%rax > >> > * Pin the thread in order to avoid problems with thread migration. > >> > * Once that all verifies are passed about spinlocks ownership, > >> > * the thread is in a safe path and it can be unpinned. > >> > */ > >> > sched_pin(); > >> > lock_list = PCPU_GET(spinlocks); > >> > 802db22b: 48 89 85 f0 fe ff ffmov%rax,-0x110(%rbp) > >> > 802db232: 48 89 85 f8 fe ff ffmov%rax,-0x108(%rbp) > >> > if (lock_list != NULL && lock_list->ll_count != 0) { > >> > 802db239: 0f 84 ff 00 00 00 je 802db33e > >> > > >> > 802db23f: 44 8b 60 50 mov0x50(%rax),%r12d > >> > > >> > is it possible for the hardware to do any re-ordering here? 
> >> > > >> > The reason I'm suspicious is not just that the code doesn't have a > >> > lock leak at the indicated point, but in one instance I can see in the > >> > dump that the lock_list local from witness_warn is from the pcpu > >> > structure for CPU 0 (and I was warned about sched lock 0), but the > >> > thread id in panic_cpu is 2. So clearly the thread was being migrated > >> > right around panic time. > >> > > >> > This is the amd64 kernel on stable/7. I'm not sure exactly what kind > >> > of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC. > >> > > >> > So... do we need some kind of barrier in the code for sched_pin() for > >> > it to really do what it claims? Could the hardware have re-ordered > >> > the "mov%gs:0x48,%rax" PCPU_GET to before the sched_pin() > >> > increment? > >> > >> Hmmm, I think it might be able to because they refer to different > >> locations. > >> > >> Note this rule in section 8.2.2 of Volume 3A: > >> > >> • Reads may be reordered with older writes to different locations but not > >> with older writes to the same location. > >> > >> It is certainly true that sparc64 could reorder with RMO. I believe ia64 > >> could reorder as well. Since sched_pin/unpin are frequently used to > >> provide > >> this sort of synchronization, we could use memory barriers in pin/unpin > >> like so: > >> > >> sched_pin() > >> { > >> td->td_pinned = atomic_load_acq_int(&td->td_pinned) + 1; > >> } > >> > >> sched_unpin() > >> { > >> atomic_store_rel_int(&td->td_pinned, td->td_pinned - 1); > >> } > >> > >> We could also just use atomic_add_acq_int() and atomic_sub_rel_int(), but > >> they > >> are slightly more heavyweight, though it would be more clear what is > >> happening > >> I think. > > > > However, to actually get a race you'd have to have an interrupt fire and > > migrate you so that the speculative read was from the other CPU. However, I > > don't think the speculative read would be preserved in that case. 
The CPU > > has to return to a specific PC when it returns from the interrupt and it has > > no way of storing the state for what speculative reordering it might be > > doing, so presumably it is thrown away? I suppose it is possible that it > > actually retires both instructions (but reordered) and then returns to the > > PC > > value after the read of listlocks after the interrupt. However, in that > > case > > the scheduler would not migrate as it would see td_pinned != 0. To get the > > race you have to have the interrupt take effect prior to modifying > > td_pinned, > > so I think the processor would have to discard the reordered read of > > listlocks so it could safely resume execution at the 'incl' instruction. > > > > The other nit there on x86 at least is that the incl instruction is doing both a read and a write, and another rule in section 8.2.2 is that reads are not reordered with other reads.
Re: sched_pin() versus PCPU_GET
On Fri, Jul 30, 2010 at 2:31 PM, John Baldwin wrote: > On Friday, July 30, 2010 10:08:22 am John Baldwin wrote: >> On Thursday, July 29, 2010 7:39:02 pm m...@freebsd.org wrote: >> > We've seen a few instances at work where witness_warn() in ast() >> > indicates the sched lock is still held, but the place it claims it was >> > held by is in fact sometimes not possible to keep the lock, like: >> > >> > thread_lock(td); >> > td->td_flags &= ~TDF_SELECT; >> > thread_unlock(td); >> > >> > What I was wondering is, even though the assembly I see in objdump -S >> > for witness_warn has the increment of td_pinned before the PCPU_GET: >> > >> > 802db210: 65 48 8b 1c 25 00 00 mov %gs:0x0,%rbx >> > 802db217: 00 00 >> > 802db219: ff 83 04 01 00 00 incl 0x104(%rbx) >> > * Pin the thread in order to avoid problems with thread migration. >> > * Once that all verifies are passed about spinlocks ownership, >> > * the thread is in a safe path and it can be unpinned. >> > */ >> > sched_pin(); >> > lock_list = PCPU_GET(spinlocks); >> > 802db21f: 65 48 8b 04 25 48 00 mov %gs:0x48,%rax >> > 802db226: 00 00 >> > if (lock_list != NULL && lock_list->ll_count != 0) { >> > 802db228: 48 85 c0 test %rax,%rax >> > * Pin the thread in order to avoid problems with thread migration. >> > * Once that all verifies are passed about spinlocks ownership, >> > * the thread is in a safe path and it can be unpinned. >> > */ >> > sched_pin(); >> > lock_list = PCPU_GET(spinlocks); >> > 802db22b: 48 89 85 f0 fe ff ff mov %rax,-0x110(%rbp) >> > 802db232: 48 89 85 f8 fe ff ff mov %rax,-0x108(%rbp) >> > if (lock_list != NULL && lock_list->ll_count != 0) { >> > 802db239: 0f 84 ff 00 00 00 je 802db33e >> > >> > 802db23f: 44 8b 60 50 mov 0x50(%rax),%r12d >> > >> > is it possible for the hardware to do any re-ordering here? 
>> > >> > The reason I'm suspicious is not just that the code doesn't have a >> > lock leak at the indicated point, but in one instance I can see in the >> > dump that the lock_list local from witness_warn is from the pcpu >> > structure for CPU 0 (and I was warned about sched lock 0), but the >> > thread id in panic_cpu is 2. So clearly the thread was being migrated >> > right around panic time. >> > >> > This is the amd64 kernel on stable/7. I'm not sure exactly what kind >> > of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC. >> > >> > So... do we need some kind of barrier in the code for sched_pin() for >> > it to really do what it claims? Could the hardware have re-ordered >> > the "mov %gs:0x48,%rax" PCPU_GET to before the sched_pin() >> > increment? >> >> Hmmm, I think it might be able to because they refer to different locations. >> >> Note this rule in section 8.2.2 of Volume 3A: >> >> • Reads may be reordered with older writes to different locations but not >> with older writes to the same location. >> >> It is certainly true that sparc64 could reorder with RMO. I believe ia64 >> could reorder as well. Since sched_pin/unpin are frequently used to provide >> this sort of synchronization, we could use memory barriers in pin/unpin >> like so: >> >> sched_pin() >> { >> td->td_pinned = atomic_load_acq_int(&td->td_pinned) + 1; >> } >> >> sched_unpin() >> { >> atomic_store_rel_int(&td->td_pinned, td->td_pinned - 1); >> } >> >> We could also just use atomic_add_acq_int() and atomic_sub_rel_int(), but >> they >> are slightly more heavyweight, though it would be more clear what is >> happening >> I think. > > However, to actually get a race you'd have to have an interrupt fire and > migrate you so that the speculative read was from the other CPU. However, I > don't think the speculative read would be preserved in that case. 
The CPU > has to return to a specific PC when it returns from the interrupt and it has > no way of storing the state for what speculative reordering it might be > doing, so presumably it is thrown away? I suppose it is possible that it > actually retires both instructions (but reordered) and then returns to the PC > value after the read of listlocks after the interrupt. However, in that case > the scheduler would not migrate as it would see td_pinned != 0. To get the > race you have to have the interrupt take effect prior to modifying td_pinned, > so I think the processor would have to discard the reordered read of > listlocks so it could safely resume execution at the 'incl' instruction. > > The other nit there on x86 at least is that the incl instruction is doing > both a read and a write and another rule in the section 8.2.2 is this: > > • Reads are not reordered with other reads. > > That would seem to prevent the read of listlocks from passing the read of > td_pinned in the incl instruction on x86. I won
Re: sched_pin() versus PCPU_GET
On Friday, July 30, 2010 10:08:22 am John Baldwin wrote: > On Thursday, July 29, 2010 7:39:02 pm m...@freebsd.org wrote: > > We've seen a few instances at work where witness_warn() in ast() > > indicates the sched lock is still held, but the place it claims it was > > held by is in fact sometimes not possible to keep the lock, like: > > > > thread_lock(td); > > td->td_flags &= ~TDF_SELECT; > > thread_unlock(td); > > > > What I was wondering is, even though the assembly I see in objdump -S > > for witness_warn has the increment of td_pinned before the PCPU_GET: > > > > 802db210: 65 48 8b 1c 25 00 00mov%gs:0x0,%rbx > > 802db217: 00 00 > > 802db219: ff 83 04 01 00 00 incl 0x104(%rbx) > > * Pin the thread in order to avoid problems with thread migration. > > * Once that all verifies are passed about spinlocks ownership, > > * the thread is in a safe path and it can be unpinned. > > */ > > sched_pin(); > > lock_list = PCPU_GET(spinlocks); > > 802db21f: 65 48 8b 04 25 48 00mov%gs:0x48,%rax > > 802db226: 00 00 > > if (lock_list != NULL && lock_list->ll_count != 0) { > > 802db228: 48 85 c0test %rax,%rax > > * Pin the thread in order to avoid problems with thread migration. > > * Once that all verifies are passed about spinlocks ownership, > > * the thread is in a safe path and it can be unpinned. > > */ > > sched_pin(); > > lock_list = PCPU_GET(spinlocks); > > 802db22b: 48 89 85 f0 fe ff ffmov%rax,-0x110(%rbp) > > 802db232: 48 89 85 f8 fe ff ffmov%rax,-0x108(%rbp) > > if (lock_list != NULL && lock_list->ll_count != 0) { > > 802db239: 0f 84 ff 00 00 00 je 802db33e > > > > 802db23f: 44 8b 60 50 mov0x50(%rax),%r12d > > > > is it possible for the hardware to do any re-ordering here? 
> > > > The reason I'm suspicious is not just that the code doesn't have a > > lock leak at the indicated point, but in one instance I can see in the > > dump that the lock_list local from witness_warn is from the pcpu > > structure for CPU 0 (and I was warned about sched lock 0), but the > > thread id in panic_cpu is 2. So clearly the thread was being migrated > > right around panic time. > > > > This is the amd64 kernel on stable/7. I'm not sure exactly what kind > > of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC. > > > > So... do we need some kind of barrier in the code for sched_pin() for > > it to really do what it claims? Could the hardware have re-ordered > > the "mov%gs:0x48,%rax" PCPU_GET to before the sched_pin() > > increment? > > Hmmm, I think it might be able to because they refer to different locations. > > Note this rule in section 8.2.2 of Volume 3A: > > • Reads may be reordered with older writes to different locations but not > with older writes to the same location. > > It is certainly true that sparc64 could reorder with RMO. I believe ia64 > could reorder as well. Since sched_pin/unpin are frequently used to provide > this sort of synchronization, we could use memory barriers in pin/unpin > like so: > > sched_pin() > { > td->td_pinned = atomic_load_acq_int(&td->td_pinned) + 1; > } > > sched_unpin() > { > atomic_store_rel_int(&td->td_pinned, td->td_pinned - 1); > } > > We could also just use atomic_add_acq_int() and atomic_sub_rel_int(), but > they > are slightly more heavyweight, though it would be more clear what is > happening > I think. However, to actually get a race you'd have to have an interrupt fire and migrate you so that the speculative read was from the other CPU. However, I don't think the speculative read would be preserved in that case. 
The CPU has to return to a specific PC when it returns from the interrupt and it has no way of storing the state for what speculative reordering it might be doing, so presumably it is thrown away? I suppose it is possible that it actually retires both instructions (but reordered) and then returns to the PC value after the read of listlocks after the interrupt. However, in that case the scheduler would not migrate as it would see td_pinned != 0. To get the race you have to have the interrupt take effect prior to modifying td_pinned, so I think the processor would have to discard the reordered read of listlocks so it could safely resume execution at the 'incl' instruction. The other nit there on x86 at least is that the incl instruction is doing both a read and a write and another rule in the section 8.2.2 is this: • Reads are not reordered with other reads. That would seem to prevent the read of listlocks from passing the read of td_pinned in the incl instruction on x86. -- John Baldwin ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
Re: sched_pin() versus PCPU_GET
On Thursday, July 29, 2010 7:39:02 pm m...@freebsd.org wrote: > We've seen a few instances at work where witness_warn() in ast() > indicates the sched lock is still held, but the place it claims it was > held by is in fact sometimes not possible to keep the lock, like: > > thread_lock(td); > td->td_flags &= ~TDF_SELECT; > thread_unlock(td); > > What I was wondering is, even though the assembly I see in objdump -S > for witness_warn has the increment of td_pinned before the PCPU_GET: > > 802db210: 65 48 8b 1c 25 00 00 mov %gs:0x0,%rbx > 802db217: 00 00 > 802db219: ff 83 04 01 00 00 incl 0x104(%rbx) > * Pin the thread in order to avoid problems with thread migration. > * Once that all verifies are passed about spinlocks ownership, > * the thread is in a safe path and it can be unpinned. > */ > sched_pin(); > lock_list = PCPU_GET(spinlocks); > 802db21f: 65 48 8b 04 25 48 00 mov %gs:0x48,%rax > 802db226: 00 00 > if (lock_list != NULL && lock_list->ll_count != 0) { > 802db228: 48 85 c0 test %rax,%rax > * Pin the thread in order to avoid problems with thread migration. > * Once that all verifies are passed about spinlocks ownership, > * the thread is in a safe path and it can be unpinned. > */ > sched_pin(); > lock_list = PCPU_GET(spinlocks); > 802db22b: 48 89 85 f0 fe ff ff mov %rax,-0x110(%rbp) > 802db232: 48 89 85 f8 fe ff ff mov %rax,-0x108(%rbp) > if (lock_list != NULL && lock_list->ll_count != 0) { > 802db239: 0f 84 ff 00 00 00 je 802db33e > > 802db23f: 44 8b 60 50 mov 0x50(%rax),%r12d > > is it possible for the hardware to do any re-ordering here? > > The reason I'm suspicious is not just that the code doesn't have a > lock leak at the indicated point, but in one instance I can see in the > dump that the lock_list local from witness_warn is from the pcpu > structure for CPU 0 (and I was warned about sched lock 0), but the > thread id in panic_cpu is 2. So clearly the thread was being migrated > right around panic time. > > This is the amd64 kernel on stable/7.
I'm not sure exactly what kind > of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC. > > So... do we need some kind of barrier in the code for sched_pin() for > it to really do what it claims? Could the hardware have re-ordered > the "mov %gs:0x48,%rax" PCPU_GET to before the sched_pin() > increment? Hmmm, I think it might be able to because they refer to different locations. Note this rule in section 8.2.2 of Volume 3A: • Reads may be reordered with older writes to different locations but not with older writes to the same location. It is certainly true that sparc64 could reorder with RMO. I believe ia64 could reorder as well. Since sched_pin/unpin are frequently used to provide this sort of synchronization, we could use memory barriers in pin/unpin like so: sched_pin() { td->td_pinned = atomic_load_acq_int(&td->td_pinned) + 1; } sched_unpin() { atomic_store_rel_int(&td->td_pinned, td->td_pinned - 1); } We could also just use atomic_add_acq_int() and atomic_sub_rel_int(), but they are slightly more heavyweight, though it would be more clear what is happening I think. -- John Baldwin
Re: sched_pin() versus PCPU_GET
2010/7/30 Kostik Belousov : > On Thu, Jul 29, 2010 at 04:57:25PM -0700, m...@freebsd.org wrote: >> On Thu, Jul 29, 2010 at 4:39 PM, wrote: >> > We've seen a few instances at work where witness_warn() in ast() >> > indicates the sched lock is still held, but the place it claims it was >> > held by is in fact sometimes not possible to keep the lock, like: >> > >> > thread_lock(td); >> > td->td_flags &= ~TDF_SELECT; >> > thread_unlock(td); >> > >> > What I was wondering is, even though the assembly I see in objdump -S >> > for witness_warn has the increment of td_pinned before the PCPU_GET: >> > >> > 802db210: 65 48 8b 1c 25 00 00 mov %gs:0x0,%rbx >> > 802db217: 00 00 >> > 802db219: ff 83 04 01 00 00 incl 0x104(%rbx) >> > * Pin the thread in order to avoid problems with thread migration. >> > * Once that all verifies are passed about spinlocks ownership, >> > * the thread is in a safe path and it can be unpinned. >> > */ >> > sched_pin(); >> > lock_list = PCPU_GET(spinlocks); >> > 802db21f: 65 48 8b 04 25 48 00 mov %gs:0x48,%rax >> > 802db226: 00 00 >> > if (lock_list != NULL && lock_list->ll_count != 0) { >> > 802db228: 48 85 c0 test %rax,%rax >> > * Pin the thread in order to avoid problems with thread migration. >> > * Once that all verifies are passed about spinlocks ownership, >> > * the thread is in a safe path and it can be unpinned. >> > */ >> > sched_pin(); >> > lock_list = PCPU_GET(spinlocks); >> > 802db22b: 48 89 85 f0 fe ff ff mov %rax,-0x110(%rbp) >> > 802db232: 48 89 85 f8 fe ff ff mov %rax,-0x108(%rbp) >> > if (lock_list != NULL && lock_list->ll_count != 0) { >> > 802db239: 0f 84 ff 00 00 00 je 802db33e >> > >> > 802db23f: 44 8b 60 50 mov 0x50(%rax),%r12d >> > >> > is it possible for the hardware to do any re-ordering here? 
>> > >> > The reason I'm suspicious is not just that the code doesn't have a >> > lock leak at the indicated point, but in one instance I can see in the >> > dump that the lock_list local from witness_warn is from the pcpu >> > structure for CPU 0 (and I was warned about sched lock 0), but the >> > thread id in panic_cpu is 2. So clearly the thread was being migrated >> > right around panic time. >> > >> > This is the amd64 kernel on stable/7. I'm not sure exactly what kind >> > of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC. >> > >> > So... do we need some kind of barrier in the code for sched_pin() for >> > it to really do what it claims? Could the hardware have re-ordered >> > the "mov %gs:0x48,%rax" PCPU_GET to before the sched_pin() >> > increment? >> >> So after some research, the answer I'm getting is "maybe". What I'm >> concerned about is whether the h/w reordered the read of PCPU_GET in >> front of the previous store to increment td_pinned. While not an >> ultimate authority, >> http://en.wikipedia.org/wiki/Memory_ordering#In_SMP_microprocessor_systems >> implies that stores can be reordered after loads for both Intel and >> amd64 chips, which would I believe account for the behavior seen here. > > Am I right that you suggest that in the sequence > mov %gs:0x0,%rbx [1] > incl 0x104(%rbx) [2] > mov %gs:0x48,%rax [3] > interrupt and preemption happen between points [2] and [3] ? > And the %rax value after the thread was put back onto the (different) new > CPU and executed [3] was still from the old cpu' pcpu area ? Right, but I'm also asking if it's possible the hardware executed the instructions as: mov %gs:0x0,%rbx [1] mov %gs:0x48,%rax [3] incl 0x104(%rbx) [2] On PowerPC this is definitely possible and I'd use an isync to prevent the re-ordering. I haven't been able to confirm that Intel/AMD present such a strict ordering that no barrier is needed. 
It's admittedly a very tight window, and we've only seen it twice, but I have no other way to explain the symptom. Unfortunately in the dump gdb shows both %rax and %gs as 0, so I can't confirm that they had a value I'd expect from another CPU. The only thing I do have is panic_cpu being different than the CPU at the time of PCPU_GET(spinlock), but of course there's definitely a window there. > I do not believe this is possible. CPU is always self-consistent. Context > switch from the thread can only occur on the return from interrupt > handler, in critical_exit() or such. This code is executing on the > same processor, and thus should already see the effect of [2], that > would prevent context switch. Right, but if the hardware allowed reads to pass writes, then %rax would have an incorrect value which would be saved at interrupt time, and
Re: sched_pin() versus PCPU_GET
On Thu, Jul 29, 2010 at 04:57:25PM -0700, m...@freebsd.org wrote: > On Thu, Jul 29, 2010 at 4:39 PM, wrote: > > We've seen a few instances at work where witness_warn() in ast() > > indicates the sched lock is still held, but the place it claims it was > > held by is in fact sometimes not possible to keep the lock, like: > > > > thread_lock(td); > > td->td_flags &= ~TDF_SELECT; > > thread_unlock(td); > > > > What I was wondering is, even though the assembly I see in objdump -S > > for witness_warn has the increment of td_pinned before the PCPU_GET: > > > > 802db210: 65 48 8b 1c 25 00 00 mov %gs:0x0,%rbx > > 802db217: 00 00 > > 802db219: ff 83 04 01 00 00 incl 0x104(%rbx) > > * Pin the thread in order to avoid problems with thread migration. > > * Once that all verifies are passed about spinlocks ownership, > > * the thread is in a safe path and it can be unpinned. > > */ > > sched_pin(); > > lock_list = PCPU_GET(spinlocks); > > 802db21f: 65 48 8b 04 25 48 00 mov %gs:0x48,%rax > > 802db226: 00 00 > > if (lock_list != NULL && lock_list->ll_count != 0) { > > 802db228: 48 85 c0 test %rax,%rax > > * Pin the thread in order to avoid problems with thread migration. > > * Once that all verifies are passed about spinlocks ownership, > > * the thread is in a safe path and it can be unpinned. > > */ > > sched_pin(); > > lock_list = PCPU_GET(spinlocks); > > 802db22b: 48 89 85 f0 fe ff ff mov %rax,-0x110(%rbp) > > 802db232: 48 89 85 f8 fe ff ff mov %rax,-0x108(%rbp) > > if (lock_list != NULL && lock_list->ll_count != 0) { > > 802db239: 0f 84 ff 00 00 00 je 802db33e > > > > 802db23f: 44 8b 60 50 mov 0x50(%rax),%r12d > > > > is it possible for the hardware to do any re-ordering here? 
> > > > The reason I'm suspicious is not just that the code doesn't have a > > lock leak at the indicated point, but in one instance I can see in the > > dump that the lock_list local from witness_warn is from the pcpu > > structure for CPU 0 (and I was warned about sched lock 0), but the > > thread id in panic_cpu is 2. So clearly the thread was being migrated > > right around panic time. > > > > This is the amd64 kernel on stable/7. I'm not sure exactly what kind > > of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC. > > > > So... do we need some kind of barrier in the code for sched_pin() for > > it to really do what it claims? Could the hardware have re-ordered > > the "mov %gs:0x48,%rax" PCPU_GET to before the sched_pin() > > increment? > > So after some research, the answer I'm getting is "maybe". What I'm > concerned about is whether the h/w reordered the read of PCPU_GET in > front of the previous store to increment td_pinned. While not an > ultimate authority, > http://en.wikipedia.org/wiki/Memory_ordering#In_SMP_microprocessor_systems > implies that stores can be reordered after loads for both Intel and > amd64 chips, which would I believe account for the behavior seen here. > Am I right that you suggest that in the sequence mov %gs:0x0,%rbx [1] incl 0x104(%rbx) [2] mov %gs:0x48,%rax [3] interrupt and preemption happen between points [2] and [3] ? And the %rax value after the thread was put back onto the (different) new CPU and executed [3] was still from the old CPU's pcpu area ? I do not believe this is possible. CPU is always self-consistent. Context switch from the thread can only occur on the return from interrupt handler, in critical_exit() or such. This code is executing on the same processor, and thus should already see the effect of [2], that would prevent context switch.
If interrupt happens between [1] and [2], then context saving code should still see the consistent view of the register file state, regardless of the processor issuing speculative read of *%gs:0x48. Return from the interrupt is the serialization point due to iret, causing read in [3] to be reissued.
Re: sched_pin() versus PCPU_GET
On Thu, Jul 29, 2010 at 4:39 PM, wrote: > We've seen a few instances at work where witness_warn() in ast() > indicates the sched lock is still held, but the place it claims it was > held by is in fact sometimes not possible to keep the lock, like: > > thread_lock(td); > td->td_flags &= ~TDF_SELECT; > thread_unlock(td); > > What I was wondering is, even though the assembly I see in objdump -S > for witness_warn has the increment of td_pinned before the PCPU_GET: > > 802db210: 65 48 8b 1c 25 00 00 mov %gs:0x0,%rbx > 802db217: 00 00 > 802db219: ff 83 04 01 00 00 incl 0x104(%rbx) > * Pin the thread in order to avoid problems with thread migration. > * Once that all verifies are passed about spinlocks ownership, > * the thread is in a safe path and it can be unpinned. > */ > sched_pin(); > lock_list = PCPU_GET(spinlocks); > 802db21f: 65 48 8b 04 25 48 00 mov %gs:0x48,%rax > 802db226: 00 00 > if (lock_list != NULL && lock_list->ll_count != 0) { > 802db228: 48 85 c0 test %rax,%rax > * Pin the thread in order to avoid problems with thread migration. > * Once that all verifies are passed about spinlocks ownership, > * the thread is in a safe path and it can be unpinned. > */ > sched_pin(); > lock_list = PCPU_GET(spinlocks); > 802db22b: 48 89 85 f0 fe ff ff mov %rax,-0x110(%rbp) > 802db232: 48 89 85 f8 fe ff ff mov %rax,-0x108(%rbp) > if (lock_list != NULL && lock_list->ll_count != 0) { > 802db239: 0f 84 ff 00 00 00 je 802db33e > > 802db23f: 44 8b 60 50 mov 0x50(%rax),%r12d > > is it possible for the hardware to do any re-ordering here? > > The reason I'm suspicious is not just that the code doesn't have a > lock leak at the indicated point, but in one instance I can see in the > dump that the lock_list local from witness_warn is from the pcpu > structure for CPU 0 (and I was warned about sched lock 0), but the > thread id in panic_cpu is 2. So clearly the thread was being migrated > right around panic time. > > This is the amd64 kernel on stable/7. 
> I'm not sure exactly what kind
> of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC.
>
> So... do we need some kind of barrier in the code for sched_pin() for
> it to really do what it claims?  Could the hardware have re-ordered
> the "mov %gs:0x48,%rax" PCPU_GET to before the sched_pin()
> increment?

So after some research, the answer I'm getting is "maybe".  What I'm
concerned about is whether the hardware reordered the read in PCPU_GET
ahead of the earlier store that increments td_pinned.  While not an
ultimate authority,
http://en.wikipedia.org/wiki/Memory_ordering#In_SMP_microprocessor_systems
implies that loads can be reordered ahead of older stores on both Intel
and AMD chips, which I believe would account for the behavior seen here.

Thanks,
matthew

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
sched_pin() versus PCPU_GET
We've seen a few instances at work where witness_warn() in ast()
indicates the sched lock is still held, but the place it claims it was
held by is in fact sometimes not possible to keep the lock, like:

thread_lock(td);
td->td_flags &= ~TDF_SELECT;
thread_unlock(td);

What I was wondering is, even though the assembly I see in objdump -S
for witness_warn has the increment of td_pinned before the PCPU_GET:

802db210:       65 48 8b 1c 25 00 00    mov    %gs:0x0,%rbx
802db217:       00 00
802db219:       ff 83 04 01 00 00       incl   0x104(%rbx)
         * Pin the thread in order to avoid problems with thread migration.
         * Once that all verifies are passed about spinlocks ownership,
         * the thread is in a safe path and it can be unpinned.
         */
        sched_pin();
        lock_list = PCPU_GET(spinlocks);
802db21f:       65 48 8b 04 25 48 00    mov    %gs:0x48,%rax
802db226:       00 00
        if (lock_list != NULL && lock_list->ll_count != 0) {
802db228:       48 85 c0                test   %rax,%rax
         * Pin the thread in order to avoid problems with thread migration.
         * Once that all verifies are passed about spinlocks ownership,
         * the thread is in a safe path and it can be unpinned.
         */
        sched_pin();
        lock_list = PCPU_GET(spinlocks);
802db22b:       48 89 85 f0 fe ff ff    mov    %rax,-0x110(%rbp)
802db232:       48 89 85 f8 fe ff ff    mov    %rax,-0x108(%rbp)
        if (lock_list != NULL && lock_list->ll_count != 0) {
802db239:       0f 84 ff 00 00 00       je     802db33e

802db23f:       44 8b 60 50             mov    0x50(%rax),%r12d

is it possible for the hardware to do any re-ordering here?

The reason I'm suspicious is not just that the code doesn't have a
lock leak at the indicated point, but in one instance I can see in the
dump that the lock_list local from witness_warn is from the pcpu
structure for CPU 0 (and I was warned about sched lock 0), but the
thread id in panic_cpu is 2.  So clearly the thread was being migrated
right around panic time.

This is the amd64 kernel on stable/7.  I'm not sure exactly what kind
of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC.

So... do we need some kind of barrier in the code for sched_pin() for
it to really do what it claims?  Could the hardware have re-ordered
the "mov %gs:0x48,%rax" PCPU_GET to before the sched_pin()
increment?

Thanks,
matthew