freebsd exokernel
Hi folks, I know this is a bit off topic, but I think this is the perfect list for it. Does anyone here think a little like me and like the exokernel idea? The primary idea is to keep only things like scheduling and drivers in the kernel ring, and push things like the VFS and memory management down to userland rings as libraries. An application could optionally use those libraries as generic abstractions (like training wheels on a bicycle), and when you really need or want to, you can go down to the bare metal and build your own application-specific abstraction.

Imagine what that could mean for performance: the layers get optimized instead of being stacked on top of other layers, and there is no context switch between the user and kernel rings. Think of an application virtual machine like Java (with its own scheduling, memory management, and filesystem layers), or a database (filesystem and memory layers), or virtualization software. If we write a database, for instance, and want to get the most out of the disks, the current choice is: either you invade the OS kernel and implement your abstraction there (and you have to know all of its source) while keeping part of your code in userland, or you don't mess with the kernel at all (it's too impractical) and stay in userland. Either way, everybody is ruled by one homogeneous way to "see" things. Your application may be lucky and the kernel's abstraction may fit it well, but it may not, and even if you can see the gold, you can't advance any further.

The MIT guys created one around '98 (I think) based on OpenBSD, and they built a web server on it where the (now optional) TCP protocol state was persisted on disk, so it is protocol agnostic and can change its communication layer at runtime.

Sometimes when I look at where everything is going in technology, it feels like we are stepping back: putting more layers on top of other layers and slowing everything down, instead of making it as fast as it can be. I would like to share experiences and hear what you think about this. Would it be a feasible project to borrow things from FreeBSD and start a project like this? Anyone like this idea? Anyway, just some thoughts for now.
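To make the kernel-vs-library split concrete, here is a tiny hypothetical C sketch. Every exo_*/libfs_* name is invented for illustration, and no real exokernel exposes this exact API: the point is only that the kernel's entire storage interface is protected raw-block access, while everything that looks like a filesystem is a userland library the application may replace.

/*
 * Hypothetical sketch of the exokernel / library-OS split described
 * above.  None of these names are a real API; exo_* and libfs_* are
 * invented for illustration only.  The "kernel" side is stubbed with
 * an in-memory disk so the sketch compiles and runs.
 */
#include <stdio.h>
#include <string.h>

/* The kernel's whole storage interface: protected raw block access. */
static char disk[16][512];

static int
exo_disk_read(unsigned blk, void *buf)
{
	memcpy(buf, disk[blk], sizeof(disk[blk]));
	return (0);
}

static int
exo_disk_write(unsigned blk, const void *buf)
{
	memcpy(disk[blk], buf, sizeof(disk[blk]));
	return (0);
}

/*
 * A "filesystem" is just a userland library over raw blocks.  A
 * generic one would implement directories and inodes here; a
 * database could replace it with its own log-structured layout,
 * with no kernel VFS and no user/kernel crossing per operation.
 */
static void
libfs_put(unsigned rec, const char *data)
{
	char buf[512] = { 0 };

	strncpy(buf, data, sizeof(buf) - 1);
	exo_disk_write(rec, buf);
}

static void
libfs_get(unsigned rec, char *out)
{
	exo_disk_read(rec, out);
}

int
main(void)
{
	char buf[512];

	libfs_put(3, "hello from the library OS");
	libfs_get(3, buf);
	printf("%s\n", buf);
	return (0);
}

A generic libfs would build directories and inodes on top of the same two calls; a database could bypass it entirely with its own on-disk layout, which is the performance argument above.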
Re: possible NFS lockups
On Tue, Jul 27, 2010 at 10:29 AM, krad wrote:
> I have a production mail system with an nfs backend. Every now and again we
> see the nfs die on a particular head end. However, it doesn't die across all
> the nodes. This suggests to me there isn't an issue with the filer itself, and
> the stats from the filer concur with that.
>
> The symptoms are lines like this appearing in dmesg:
>
> nfs server 10.44.17.138:/vol/vol1/mail: not responding
> nfs server 10.44.17.138:/vol/vol1/mail: is alive again
>
> Trussing df, it seems to hang on getfsstat; this is presumably when it tries
> the nfs mounts.

I also have this problem, where NFS locks up on a FreeBSD 9 server and a
FreeBSD RELENG_8 client.

--
Sam Fourman Jr.
Fourman Networks
http://www.fourmannetworks.com
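For diagnosing this, note that getfsstat(2) takes an MNT_NOWAIT mode that reports the statistics the kernel already has cached instead of contacting each server, so it does not block on a dead NFS mount the way df's MNT_WAIT-style query can. A minimal sketch using only the standard FreeBSD API (this sidesteps the hang for inspection purposes; it is not a fix for the underlying lockup):

/*
 * List mounted filesystems without blocking on unresponsive NFS
 * servers: MNT_NOWAIT returns cached statistics rather than
 * querying each filesystem.
 */
#include <sys/param.h>
#include <sys/ucred.h>
#include <sys/mount.h>

#include <err.h>
#include <stdio.h>

int
main(void)
{
	struct statfs fs[128];
	int i, n;

	/* bufsize is in bytes; returns the number of entries filled. */
	n = getfsstat(fs, sizeof(fs), MNT_NOWAIT);
	if (n == -1)
		err(1, "getfsstat");
	for (i = 0; i < n; i++)
		printf("%-8s %-24s on %s\n", fs[i].f_fstypename,
		    fs[i].f_mntfromname, fs[i].f_mntonname);
	return (0);
}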
ls, mount point aware
As far as I can see, more and more base apps are aware of mount points. For instance, in 8.1, chgrp(1), chown(8) and cp(1) now have an -x flag. And what about human users? The 'ls' command, in its long listing of directories, should show something like: "hey, this directory is also a mount point." A one-letter flag, perhaps?
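The check such a flag would need is cheap. A minimal sketch of one way to do it (the helper and program names are invented; the test compares a directory's device number against its parent's, ignoring the special case of the filesystem root):

/*
 * Sketch of the check an "ls" flag could use: an entry is a mount
 * point when its st_dev differs from its parent directory's.
 */
#include <sys/stat.h>

#include <err.h>
#include <stdio.h>

static int
is_mount_point(const char *parent, const char *path)
{
	struct stat sbp, sb;

	if (stat(parent, &sbp) == -1 || stat(path, &sb) == -1)
		err(1, "stat");
	return (sb.st_dev != sbp.st_dev);
}

int
main(int argc, char **argv)
{
	if (argc != 3)
		errx(1, "usage: ismnt parent dir");
	printf("%s%s\n", argv[2],
	    is_mount_point(argv[1], argv[2]) ? " (mount point)" : "");
	return (0);
}

For example, "./ismnt / /usr" would print "/usr (mount point)" on a system where /usr is a separate filesystem.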
Re: sched_pin() versus PCPU_GET
On Friday, July 30, 2010 10:08:22 am John Baldwin wrote:
> On Thursday, July 29, 2010 7:39:02 pm m...@freebsd.org wrote:
> > We've seen a few instances at work where witness_warn() in ast()
> > indicates the sched lock is still held, but the place it claims it was
> > held by is in fact sometimes not possible to keep the lock, like:
> >
> >     thread_lock(td);
> >     td->td_flags &= ~TDF_SELECT;
> >     thread_unlock(td);
> >
> > What I was wondering is, even though the assembly I see in objdump -S
> > for witness_warn has the increment of td_pinned before the PCPU_GET:
> >
> > 802db210:  65 48 8b 1c 25 00 00   mov    %gs:0x0,%rbx
> > 802db217:  00 00
> > 802db219:  ff 83 04 01 00 00      incl   0x104(%rbx)
> >      * Pin the thread in order to avoid problems with thread migration.
> >      * Once that all verifies are passed about spinlocks ownership,
> >      * the thread is in a safe path and it can be unpinned.
> >      */
> >     sched_pin();
> >     lock_list = PCPU_GET(spinlocks);
> > 802db21f:  65 48 8b 04 25 48 00   mov    %gs:0x48,%rax
> > 802db226:  00 00
> >     if (lock_list != NULL && lock_list->ll_count != 0) {
> > 802db228:  48 85 c0               test   %rax,%rax
> >      * Pin the thread in order to avoid problems with thread migration.
> >      * Once that all verifies are passed about spinlocks ownership,
> >      * the thread is in a safe path and it can be unpinned.
> >      */
> >     sched_pin();
> >     lock_list = PCPU_GET(spinlocks);
> > 802db22b:  48 89 85 f0 fe ff ff   mov    %rax,-0x110(%rbp)
> > 802db232:  48 89 85 f8 fe ff ff   mov    %rax,-0x108(%rbp)
> >     if (lock_list != NULL && lock_list->ll_count != 0) {
> > 802db239:  0f 84 ff 00 00 00      je     802db33e
> >
> > 802db23f:  44 8b 60 50            mov    0x50(%rax),%r12d
> >
> > is it possible for the hardware to do any re-ordering here?
> >
> > The reason I'm suspicious is not just that the code doesn't have a
> > lock leak at the indicated point, but in one instance I can see in the
> > dump that the lock_list local from witness_warn is from the pcpu
> > structure for CPU 0 (and I was warned about sched lock 0), but the
> > thread id in panic_cpu is 2. So clearly the thread was being migrated
> > right around panic time.
> >
> > This is the amd64 kernel on stable/7. I'm not sure exactly what kind
> > of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC.
> >
> > So... do we need some kind of barrier in the code for sched_pin() for
> > it to really do what it claims? Could the hardware have re-ordered
> > the "mov %gs:0x48,%rax" PCPU_GET to before the sched_pin()
> > increment?
>
> Hmmm, I think it might be able to because they refer to different locations.
>
> Note this rule in section 8.2.2 of Volume 3A:
>
>   • Reads may be reordered with older writes to different locations but not
>     with older writes to the same location.
>
> It is certainly true that sparc64 could reorder with RMO. I believe ia64
> could reorder as well. Since sched_pin/unpin are frequently used to provide
> this sort of synchronization, we could use memory barriers in pin/unpin
> like so:
>
> sched_pin()
> {
>     td->td_pinned = atomic_load_acq_int(&td->td_pinned) + 1;
> }
>
> sched_unpin()
> {
>     atomic_store_rel_int(&td->td_pinned, td->td_pinned - 1);
> }
>
> We could also just use atomic_add_acq_int() and atomic_sub_rel_int(), but
> they are slightly more heavyweight, though it would be more clear what is
> happening I think.

However, to actually get a race you'd have to have an interrupt fire and
migrate you so that the speculative read was from the other CPU. However, I
don't think the speculative read would be preserved in that case. The CPU
has to return to a specific PC when it returns from the interrupt, and it
has no way of storing the state for whatever speculative reordering it
might be doing, so presumably it is thrown away? I suppose it is possible
that it actually retires both instructions (but reordered) and then returns
to the PC value after the read of lock_list after the interrupt. However,
in that case the scheduler would not migrate, as it would see td_pinned !=
0. To get the race you have to have the interrupt take effect prior to
modifying td_pinned, so I think the processor would have to discard the
reordered read of lock_list so it could safely resume execution at the
'incl' instruction.

The other nit there, on x86 at least, is that the incl instruction is doing
both a read and a write, and another rule in section 8.2.2 is this:

  • Reads are not reordered with other reads.

That would seem to prevent the read of lock_list from passing the read of
td_pinned in the incl instruction on x86.

--
John Baldwin
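The Volume 3A rule quoted above ("reads may be reordered with older writes to different locations") is the classic store-buffer case, and it can be watched from userland. Here is a minimal C11 litmus-test sketch; the thread names and iteration count are arbitrary, and because the race window is narrow (thread creation dominates), it is an illustration rather than a reliable reproducer:

/*
 * Store-buffer litmus test: each thread writes one variable, then
 * reads the other, using relaxed atomics.  On real x86 hardware the
 * outcome r1 == 0 && r2 == 0 is permitted, meaning each CPU's read
 * passed its own earlier write.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static _Atomic int x, y;
static int r1, r2;

static void *
t0(void *arg)
{
	(void)arg;
	atomic_store_explicit(&x, 1, memory_order_relaxed); /* write x */
	r1 = atomic_load_explicit(&y, memory_order_relaxed); /* read y */
	return (NULL);
}

static void *
t1(void *arg)
{
	(void)arg;
	atomic_store_explicit(&y, 1, memory_order_relaxed); /* write y */
	r2 = atomic_load_explicit(&x, memory_order_relaxed); /* read x */
	return (NULL);
}

int
main(void)
{
	for (int i = 0; i < 100000; i++) {
		pthread_t a, b;

		atomic_store(&x, 0);
		atomic_store(&y, 0);
		pthread_create(&a, NULL, t0, NULL);
		pthread_create(&b, NULL, t1, NULL);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		if (r1 == 0 && r2 == 0) {
			printf("reordering observed at iteration %d\n", i);
			return (0);
		}
	}
	printf("no reordering observed\n");
	return (0);
}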
Re: sched_pin() versus PCPU_GET
On Thursday, July 29, 2010 7:39:02 pm m...@freebsd.org wrote:
> We've seen a few instances at work where witness_warn() in ast()
> indicates the sched lock is still held, but the place it claims it was
> held by is in fact sometimes not possible to keep the lock, like:
>
>     thread_lock(td);
>     td->td_flags &= ~TDF_SELECT;
>     thread_unlock(td);
>
> What I was wondering is, even though the assembly I see in objdump -S
> for witness_warn has the increment of td_pinned before the PCPU_GET:
>
> 802db210:  65 48 8b 1c 25 00 00   mov    %gs:0x0,%rbx
> 802db217:  00 00
> 802db219:  ff 83 04 01 00 00      incl   0x104(%rbx)
>      * Pin the thread in order to avoid problems with thread migration.
>      * Once that all verifies are passed about spinlocks ownership,
>      * the thread is in a safe path and it can be unpinned.
>      */
>     sched_pin();
>     lock_list = PCPU_GET(spinlocks);
> 802db21f:  65 48 8b 04 25 48 00   mov    %gs:0x48,%rax
> 802db226:  00 00
>     if (lock_list != NULL && lock_list->ll_count != 0) {
> 802db228:  48 85 c0               test   %rax,%rax
>      * Pin the thread in order to avoid problems with thread migration.
>      * Once that all verifies are passed about spinlocks ownership,
>      * the thread is in a safe path and it can be unpinned.
>      */
>     sched_pin();
>     lock_list = PCPU_GET(spinlocks);
> 802db22b:  48 89 85 f0 fe ff ff   mov    %rax,-0x110(%rbp)
> 802db232:  48 89 85 f8 fe ff ff   mov    %rax,-0x108(%rbp)
>     if (lock_list != NULL && lock_list->ll_count != 0) {
> 802db239:  0f 84 ff 00 00 00      je     802db33e
>
> 802db23f:  44 8b 60 50            mov    0x50(%rax),%r12d
>
> is it possible for the hardware to do any re-ordering here?
>
> The reason I'm suspicious is not just that the code doesn't have a
> lock leak at the indicated point, but in one instance I can see in the
> dump that the lock_list local from witness_warn is from the pcpu
> structure for CPU 0 (and I was warned about sched lock 0), but the
> thread id in panic_cpu is 2. So clearly the thread was being migrated
> right around panic time.
>
> This is the amd64 kernel on stable/7. I'm not sure exactly what kind
> of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC.
>
> So... do we need some kind of barrier in the code for sched_pin() for
> it to really do what it claims? Could the hardware have re-ordered
> the "mov %gs:0x48,%rax" PCPU_GET to before the sched_pin()
> increment?

Hmmm, I think it might be able to because they refer to different locations.

Note this rule in section 8.2.2 of Volume 3A:

  • Reads may be reordered with older writes to different locations but not
    with older writes to the same location.

It is certainly true that sparc64 could reorder with RMO. I believe ia64
could reorder as well. Since sched_pin/unpin are frequently used to provide
this sort of synchronization, we could use memory barriers in pin/unpin
like so:

sched_pin()
{
    td->td_pinned = atomic_load_acq_int(&td->td_pinned) + 1;
}

sched_unpin()
{
    atomic_store_rel_int(&td->td_pinned, td->td_pinned - 1);
}

We could also just use atomic_add_acq_int() and atomic_sub_rel_int(), but
they are slightly more heavyweight, though it would be more clear what is
happening I think.

--
John Baldwin
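For readers without the kernel tree handy, here is a standalone C11 rendering of the acquire/release shape proposed above. The my_* names are placeholders, C11's atomic_*_explicit calls stand in for FreeBSD's atomic_load_acq_int()/atomic_store_rel_int(), and td_pinned is reduced to a global so the sketch is self-contained; it mirrors the shape of the proposal, not FreeBSD's actual implementation:

/*
 * C11 sketch of the proposed barrier placement.  td_pinned is only
 * ever modified by its own thread, so the plain (non-RMW) updates
 * are fine; the barriers order the count update against the
 * surrounding reads, which is the whole point of the proposal.
 */
#include <stdatomic.h>

static _Atomic int td_pinned;

static void
my_sched_pin(void)
{
	/*
	 * Acquire: later reads (e.g. the PCPU_GET of spinlocks)
	 * cannot be speculated ahead of observing td_pinned.
	 */
	int v = atomic_load_explicit(&td_pinned, memory_order_acquire);

	atomic_store_explicit(&td_pinned, v + 1, memory_order_relaxed);
}

static void
my_sched_unpin(void)
{
	int v = atomic_load_explicit(&td_pinned, memory_order_relaxed);

	/* Release: earlier accesses complete before the count drops. */
	atomic_store_explicit(&td_pinned, v - 1, memory_order_release);
}

int
main(void)
{
	my_sched_pin();
	/* ... work that must not observe stale per-CPU state ... */
	my_sched_unpin();
	return (0);
}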
Re: sched_pin() versus PCPU_GET
2010/7/30 Kostik Belousov :
> On Thu, Jul 29, 2010 at 04:57:25PM -0700, m...@freebsd.org wrote:
>> On Thu, Jul 29, 2010 at 4:39 PM, wrote:
>> > We've seen a few instances at work where witness_warn() in ast()
>> > indicates the sched lock is still held, but the place it claims it was
>> > held by is in fact sometimes not possible to keep the lock, like:
>> >
>> >     thread_lock(td);
>> >     td->td_flags &= ~TDF_SELECT;
>> >     thread_unlock(td);
>> >
>> > What I was wondering is, even though the assembly I see in objdump -S
>> > for witness_warn has the increment of td_pinned before the PCPU_GET:
>> >
>> > 802db210:  65 48 8b 1c 25 00 00   mov    %gs:0x0,%rbx
>> > 802db217:  00 00
>> > 802db219:  ff 83 04 01 00 00      incl   0x104(%rbx)
>> >      * Pin the thread in order to avoid problems with thread migration.
>> >      * Once that all verifies are passed about spinlocks ownership,
>> >      * the thread is in a safe path and it can be unpinned.
>> >      */
>> >     sched_pin();
>> >     lock_list = PCPU_GET(spinlocks);
>> > 802db21f:  65 48 8b 04 25 48 00   mov    %gs:0x48,%rax
>> > 802db226:  00 00
>> >     if (lock_list != NULL && lock_list->ll_count != 0) {
>> > 802db228:  48 85 c0               test   %rax,%rax
>> >      * Pin the thread in order to avoid problems with thread migration.
>> >      * Once that all verifies are passed about spinlocks ownership,
>> >      * the thread is in a safe path and it can be unpinned.
>> >      */
>> >     sched_pin();
>> >     lock_list = PCPU_GET(spinlocks);
>> > 802db22b:  48 89 85 f0 fe ff ff   mov    %rax,-0x110(%rbp)
>> > 802db232:  48 89 85 f8 fe ff ff   mov    %rax,-0x108(%rbp)
>> >     if (lock_list != NULL && lock_list->ll_count != 0) {
>> > 802db239:  0f 84 ff 00 00 00      je     802db33e
>> >
>> > 802db23f:  44 8b 60 50            mov    0x50(%rax),%r12d
>> >
>> > is it possible for the hardware to do any re-ordering here?
>> >
>> > The reason I'm suspicious is not just that the code doesn't have a
>> > lock leak at the indicated point, but in one instance I can see in the
>> > dump that the lock_list local from witness_warn is from the pcpu
>> > structure for CPU 0 (and I was warned about sched lock 0), but the
>> > thread id in panic_cpu is 2. So clearly the thread was being migrated
>> > right around panic time.
>> >
>> > This is the amd64 kernel on stable/7. I'm not sure exactly what kind
>> > of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC.
>> >
>> > So... do we need some kind of barrier in the code for sched_pin() for
>> > it to really do what it claims? Could the hardware have re-ordered
>> > the "mov %gs:0x48,%rax" PCPU_GET to before the sched_pin()
>> > increment?
>>
>> So after some research, the answer I'm getting is "maybe". What I'm
>> concerned about is whether the h/w reordered the read of PCPU_GET in
>> front of the previous store to increment td_pinned. While not an
>> ultimate authority,
>> http://en.wikipedia.org/wiki/Memory_ordering#In_SMP_microprocessor_systems
>> implies that stores can be reordered after loads for both Intel and
>> amd64 chips, which I believe would account for the behavior seen here.
>
> Am I right that you suggest that in the sequence
>     mov     %gs:0x0,%rbx    [1]
>     incl    0x104(%rbx)     [2]
>     mov     %gs:0x48,%rax   [3]
> interrupt and preemption happen between points [2] and [3]?
> And the %rax value after the thread was put back onto the (different) new
> CPU and executed [3] was still from the old CPU's pcpu area?

Right, but I'm also asking if it's possible the hardware executed the
instructions as:

    mov     %gs:0x0,%rbx    [1]
    mov     %gs:0x48,%rax   [3]
    incl    0x104(%rbx)     [2]

On PowerPC this is definitely possible, and I'd use an isync to prevent the
re-ordering. I haven't been able to confirm that Intel/AMD present such a
strict ordering that no barrier is needed.

It's admittedly a very tight window, and we've only seen it twice, but I
have no other way to explain the symptom. Unfortunately, in the dump gdb
shows both %rax and %gs as 0, so I can't confirm that they had a value I'd
expect from another CPU. The only thing I do have is panic_cpu being
different than the CPU at the time of PCPU_GET(spinlocks), but of course
there's definitely a window there.

> I do not believe this is possible. The CPU is always self-consistent. A
> context switch away from the thread can only occur on return from an
> interrupt handler, in critical_exit() or such. This code is executing on
> the same processor, and thus should already see the effect of [2], which
> would prevent the context switch.

Right, but if the hardware allowed reads to pass writes, then %rax would
have an incorrect value which would be saved at interrupt time, and
Re: sched_pin() versus PCPU_GET
On Thu, Jul 29, 2010 at 04:57:25PM -0700, m...@freebsd.org wrote:
> On Thu, Jul 29, 2010 at 4:39 PM, wrote:
> > We've seen a few instances at work where witness_warn() in ast()
> > indicates the sched lock is still held, but the place it claims it was
> > held by is in fact sometimes not possible to keep the lock, like:
> >
> >     thread_lock(td);
> >     td->td_flags &= ~TDF_SELECT;
> >     thread_unlock(td);
> >
> > What I was wondering is, even though the assembly I see in objdump -S
> > for witness_warn has the increment of td_pinned before the PCPU_GET:
> >
> > 802db210:  65 48 8b 1c 25 00 00   mov    %gs:0x0,%rbx
> > 802db217:  00 00
> > 802db219:  ff 83 04 01 00 00      incl   0x104(%rbx)
> >      * Pin the thread in order to avoid problems with thread migration.
> >      * Once that all verifies are passed about spinlocks ownership,
> >      * the thread is in a safe path and it can be unpinned.
> >      */
> >     sched_pin();
> >     lock_list = PCPU_GET(spinlocks);
> > 802db21f:  65 48 8b 04 25 48 00   mov    %gs:0x48,%rax
> > 802db226:  00 00
> >     if (lock_list != NULL && lock_list->ll_count != 0) {
> > 802db228:  48 85 c0               test   %rax,%rax
> >      * Pin the thread in order to avoid problems with thread migration.
> >      * Once that all verifies are passed about spinlocks ownership,
> >      * the thread is in a safe path and it can be unpinned.
> >      */
> >     sched_pin();
> >     lock_list = PCPU_GET(spinlocks);
> > 802db22b:  48 89 85 f0 fe ff ff   mov    %rax,-0x110(%rbp)
> > 802db232:  48 89 85 f8 fe ff ff   mov    %rax,-0x108(%rbp)
> >     if (lock_list != NULL && lock_list->ll_count != 0) {
> > 802db239:  0f 84 ff 00 00 00      je     802db33e
> >
> > 802db23f:  44 8b 60 50            mov    0x50(%rax),%r12d
> >
> > is it possible for the hardware to do any re-ordering here?
> >
> > The reason I'm suspicious is not just that the code doesn't have a
> > lock leak at the indicated point, but in one instance I can see in the
> > dump that the lock_list local from witness_warn is from the pcpu
> > structure for CPU 0 (and I was warned about sched lock 0), but the
> > thread id in panic_cpu is 2. So clearly the thread was being migrated
> > right around panic time.
> >
> > This is the amd64 kernel on stable/7. I'm not sure exactly what kind
> > of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC.
> >
> > So... do we need some kind of barrier in the code for sched_pin() for
> > it to really do what it claims? Could the hardware have re-ordered
> > the "mov %gs:0x48,%rax" PCPU_GET to before the sched_pin()
> > increment?
>
> So after some research, the answer I'm getting is "maybe". What I'm
> concerned about is whether the h/w reordered the read of PCPU_GET in
> front of the previous store to increment td_pinned. While not an
> ultimate authority,
> http://en.wikipedia.org/wiki/Memory_ordering#In_SMP_microprocessor_systems
> implies that stores can be reordered after loads for both Intel and
> amd64 chips, which I believe would account for the behavior seen here.

Am I right that you suggest that in the sequence

    mov     %gs:0x0,%rbx    [1]
    incl    0x104(%rbx)     [2]
    mov     %gs:0x48,%rax   [3]

interrupt and preemption happen between points [2] and [3]? And the %rax
value after the thread was put back onto the (different) new CPU and
executed [3] was still from the old CPU's pcpu area?

I do not believe this is possible. The CPU is always self-consistent. A
context switch away from the thread can only occur on return from an
interrupt handler, in critical_exit() or such. This code is executing on
the same processor, and thus should already see the effect of [2], which
would prevent the context switch.
If the interrupt happens between [1] and [2], then the context-saving code
should still see a consistent view of the register file state, regardless
of the processor issuing a speculative read of *%gs:0x48. Return from the
interrupt is a serialization point due to iret, causing the read in [3] to
be reissued.