freebsd exokernel

2010-07-30 Thread Fabio Kaminski
Hi folks,

I know it's kind of off topic, but I think this is the perfect list for
it.

Does anyone here think a little like me and like the exokernel idea?

The primary idea is to leave only things like scheduling and drivers in
the kernel ring, and move things like the VFS and memory management down to
userland rings as libraries. An application could optionally use those libs,
which abstract things generically (like a bicycle with wheels), and when you
really need or want to, you can go down to the bare metal and create your
own application-specific abstraction.

Imagine what that could mean for performance, since each layer gets
optimized and does not sit on top of other layers, and there is no context
switch between the user and kernel rings.

Think of an application virtual machine like Java (with its own scheduling,
memory-management and filesystem layers), a database (filesystem and memory
layers), or virtualization software.

If we write a database, for instance, and want to outperform the disks, the
current scenario is: either you invade the OS kernel and implement your
abstraction there (you have to know all of its source), with part of your
code left in userland :s, or you don't mess with the kernel at all (it's too
impractical) and stay in userland, where everybody is ruled by the single
homogeneous way of "seeing" things. Your application may be lucky and the
kernel's abstraction suits it, but it may not be; and even if you can see
the gold, you can't advance any further.

The MIT guys created one based on OpenBSD (in '98, I think), and they built
a web server where the (now optional) TCP protocol state was persisted on
disk, so it is protocol agnostic and can change its communication layer at
runtime.

Sometimes, looking at where everything in technology is going, it seems we
are kind of stepping back: putting more layers on top of other layers and
slowing everything down, instead of making it as fast as it can be.

I would like to share experiences and hear what you think about this.

Would it be a feasible project to borrow things from FreeBSD and start a
project like this? Does anyone like the idea?

Anyway, just some thoughts for now.
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: possible NFS lockups

2010-07-30 Thread Sam Fourman Jr.
On Tue, Jul 27, 2010 at 10:29 AM, krad  wrote:
> I have a production mail system with an nfs backend. Every now and again we
> see the nfs die on a particular head end. However it doesn't die across all
> the nodes. This suggests to me there isnt an issue with the filer itself and
> the stats from the filer concur with that.
>
> The symptoms are lines like this appearing in dmesg
>
> nfs server 10.44.17.138:/vol/vol1/mail: not responding
> nfs server 10.44.17.138:/vol/vol1/mail: is alive again
>
> trussing df it seems to hang on getfsstat, this is presumably when it tries
> the nfs mounts
>

I also see this problem, where NFS locks up with a FreeBSD 9 server
and a FreeBSD RELENG_8 client.


-- 

Sam Fourman Jr.
Fourman Networks
http://www.fourmannetworks.com


ls, mount point aware

2010-07-30 Thread Domagoj S.
As far as I can see, more and more base apps are aware of mount points.
E.g., in 8.1, chgrp(1), chown(8) and cp(1) now have an -x flag.

And what about human users?

The ls command should, in its long listing of directories, show something like:
"Hey, this directory is also a mount point."

A one-letter flag?


Re: sched_pin() versus PCPU_GET

2010-07-30 Thread John Baldwin
On Friday, July 30, 2010 10:08:22 am John Baldwin wrote:
> On Thursday, July 29, 2010 7:39:02 pm m...@freebsd.org wrote:
> > We've seen a few instances at work where witness_warn() in ast()
> > indicates the sched lock is still held, but the place it claims it was
> > held by is in fact sometimes not possible to keep the lock, like:
> > 
> > thread_lock(td);
> > td->td_flags &= ~TDF_SELECT;
> > thread_unlock(td);
> > 
> > What I was wondering is, even though the assembly I see in objdump -S
> > for witness_warn has the increment of td_pinned before the PCPU_GET:
> > 
> > 802db210:   65 48 8b 1c 25 00 00    mov    %gs:0x0,%rbx
> > 802db217:   00 00
> > 802db219:   ff 83 04 01 00 00       incl   0x104(%rbx)
> >  * Pin the thread in order to avoid problems with thread migration.
> >  * Once that all verifies are passed about spinlocks ownership,
> >  * the thread is in a safe path and it can be unpinned.
> >  */
> > sched_pin();
> > lock_list = PCPU_GET(spinlocks);
> > 802db21f:   65 48 8b 04 25 48 00    mov    %gs:0x48,%rax
> > 802db226:   00 00
> > if (lock_list != NULL && lock_list->ll_count != 0) {
> > 802db228:   48 85 c0                test   %rax,%rax
> >  * Pin the thread in order to avoid problems with thread migration.
> >  * Once that all verifies are passed about spinlocks ownership,
> >  * the thread is in a safe path and it can be unpinned.
> >  */
> > sched_pin();
> > lock_list = PCPU_GET(spinlocks);
> > 802db22b:   48 89 85 f0 fe ff ff    mov    %rax,-0x110(%rbp)
> > 802db232:   48 89 85 f8 fe ff ff    mov    %rax,-0x108(%rbp)
> > if (lock_list != NULL && lock_list->ll_count != 0) {
> > 802db239:   0f 84 ff 00 00 00       je     802db33e
> > 
> > 802db23f:   44 8b 60 50             mov    0x50(%rax),%r12d
> > 
> > is it possible for the hardware to do any re-ordering here?
> > 
> > The reason I'm suspicious is not just that the code doesn't have a
> > lock leak at the indicated point, but in one instance I can see in the
> > dump that the lock_list local from witness_warn is from the pcpu
> > structure for CPU 0 (and I was warned about sched lock 0), but the
> > thread id in panic_cpu is 2.  So clearly the thread was being migrated
> > right around panic time.
> > 
> > This is the amd64 kernel on stable/7.  I'm not sure exactly what kind
> > of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC.
> > 
> > So... do we need some kind of barrier in the code for sched_pin() for
> > it to really do what it claims?  Could the hardware have re-ordered
> > the "mov %gs:0x48,%rax" PCPU_GET to before the sched_pin()
> > increment?
> 
> Hmmm, I think it might be able to because they refer to different locations.
> 
> Note this rule in section 8.2.2 of Volume 3A:
> 
>   • Reads may be reordered with older writes to different locations but not
> with older writes to the same location.
> 
> It is certainly true that sparc64 could reorder with RMO.  I believe ia64 
> could reorder as well.  Since sched_pin/unpin are frequently used to provide 
> this sort of synchronization, we could use memory barriers in pin/unpin
> like so:
> 
> sched_pin()
> {
>   td->td_pinned = atomic_load_acq_int(&td->td_pinned) + 1;
> }
> 
> sched_unpin()
> {
>   atomic_store_rel_int(&td->td_pinned, td->td_pinned - 1);
> }
> 
> We could also just use atomic_add_acq_int() and atomic_sub_rel_int(), but 
> they 
> are slightly more heavyweight, though it would be more clear what is 
> happening 
> I think.

However, to actually get a race you'd have to have an interrupt fire and
migrate you so that the speculative read was from the other CPU.  However, I
don't think the speculative read would be preserved in that case.  The CPU
has to return to a specific PC when it returns from the interrupt and it has
no way of storing the state for what speculative reordering it might be
doing, so presumably it is thrown away?  I suppose it is possible that it
actually retires both instructions (but reordered) and then returns to the PC
value after the read of listlocks after the interrupt.  However, in that case
the scheduler would not migrate as it would see td_pinned != 0.  To get the
race you have to have the interrupt take effect prior to modifying td_pinned,
so I think the processor would have to discard the reordered read of
listlocks so it could safely resume execution at the 'incl' instruction.

The other nit there, on x86 at least, is that the incl instruction does
both a read and a write, and another rule in section 8.2.2 is this:

  • Reads are not reordered with other reads.

That would seem to prevent the read of listlocks from passing the read of
td_pinned in the incl instruction on x86.

-- 
John Baldwin



Re: sched_pin() versus PCPU_GET

2010-07-30 Thread mdf
2010/7/30 Kostik Belousov :
> On Thu, Jul 29, 2010 at 04:57:25PM -0700, m...@freebsd.org wrote:
>> On Thu, Jul 29, 2010 at 4:39 PM,   wrote:
>> > We've seen a few instances at work where witness_warn() in ast()
>> > indicates the sched lock is still held, but the place it claims it was
>> > held by is in fact sometimes not possible to keep the lock, like:
>> >
>> >        thread_lock(td);
>> >        td->td_flags &= ~TDF_SELECT;
>> >        thread_unlock(td);
>> >
>> > What I was wondering is, even though the assembly I see in objdump -S
>> > for witness_warn has the increment of td_pinned before the PCPU_GET:
>> >
>> > 802db210:       65 48 8b 1c 25 00 00    mov    %gs:0x0,%rbx
>> > 802db217:       00 00
>> > 802db219:       ff 83 04 01 00 00       incl   0x104(%rbx)
>> >         * Pin the thread in order to avoid problems with thread migration.
>> >         * Once that all verifies are passed about spinlocks ownership,
>> >         * the thread is in a safe path and it can be unpinned.
>> >         */
>> >        sched_pin();
>> >        lock_list = PCPU_GET(spinlocks);
>> > 802db21f:       65 48 8b 04 25 48 00    mov    %gs:0x48,%rax
>> > 802db226:       00 00
>> >        if (lock_list != NULL && lock_list->ll_count != 0) {
>> > 802db228:       48 85 c0                test   %rax,%rax
>> >         * Pin the thread in order to avoid problems with thread migration.
>> >         * Once that all verifies are passed about spinlocks ownership,
>> >         * the thread is in a safe path and it can be unpinned.
>> >         */
>> >        sched_pin();
>> >        lock_list = PCPU_GET(spinlocks);
>> > 802db22b:       48 89 85 f0 fe ff ff    mov    %rax,-0x110(%rbp)
>> > 802db232:       48 89 85 f8 fe ff ff    mov    %rax,-0x108(%rbp)
>> >        if (lock_list != NULL && lock_list->ll_count != 0) {
>> > 802db239:       0f 84 ff 00 00 00       je     802db33e
>> > 
>> > 802db23f:       44 8b 60 50             mov    0x50(%rax),%r12d
>> >
>> > is it possible for the hardware to do any re-ordering here?
>> >
>> > The reason I'm suspicious is not just that the code doesn't have a
>> > lock leak at the indicated point, but in one instance I can see in the
>> > dump that the lock_list local from witness_warn is from the pcpu
>> > structure for CPU 0 (and I was warned about sched lock 0), but the
>> > thread id in panic_cpu is 2.  So clearly the thread was being migrated
>> > right around panic time.
>> >
>> > This is the amd64 kernel on stable/7.  I'm not sure exactly what kind
>> > of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC.
>> >
>> > So... do we need some kind of barrier in the code for sched_pin() for
>> > it to really do what it claims?  Could the hardware have re-ordered
>> > the "mov    %gs:0x48,%rax" PCPU_GET to before the sched_pin()
>> > increment?
>>
>> So after some research, the answer I'm getting is "maybe".  What I'm
>> concerned about is whether the h/w reordered the read of PCPU_GET in
>> front of the previous store to increment td_pinned.  While not an
>> ultimate authority,
>> http://en.wikipedia.org/wiki/Memory_ordering#In_SMP_microprocessor_systems
>> implies that stores can be reordered after loads for both Intel and
>> amd64 chips, which would I believe account for the behavior seen here.
>
> Am I right that you suggest that in the sequence
>        mov     %gs:0x0,%rbx      [1]
>        incl    0x104(%rbx)       [2]
>        mov     %gs:0x48,%rax     [3]
> interrupt and preemption happen between points [2] and [3] ?
> And the %rax value after the thread was put back onto the (different) new
> CPU and executed [3] was still from the old cpu' pcpu area ?

Right, but I'm also asking if it's possible the hardware executed the
instructions as:

        mov     %gs:0x0,%rbx      [1]
        mov     %gs:0x48,%rax     [3]
        incl    0x104(%rbx)       [2]

On PowerPC this is definitely possible and I'd use an isync to prevent
the re-ordering.  I haven't been able to confirm that Intel/AMD
present such a strict ordering that no barrier is needed.

It's admittedly a very tight window, and we've only seen it twice, but
I have no other way to explain the symptom.  Unfortunately in the dump
gdb shows both %rax and %gs as 0, so I can't confirm that they had a
value I'd expect from another CPU.  The only thing I do have is
panic_cpu being different than the CPU at the time of
PCPU_GET(spinlock), but of course there's definitely a window there.

> I do not believe this is possible. CPU is always self-consistent. Context
> switch from the thread can only occur on the return from interrupt
> handler, in critical_exit() or such. This code is executing on the
> same processor, and thus should already see the effect of [2], that
> would prevent context switch.

Right, but if the hardware allowed reads to pass writes, then %rax
would have an incorrect value which would be saved at interrupt time,
and

Re: sched_pin() versus PCPU_GET

2010-07-30 Thread Kostik Belousov
On Thu, Jul 29, 2010 at 04:57:25PM -0700, m...@freebsd.org wrote:
> On Thu, Jul 29, 2010 at 4:39 PM,   wrote:
> > We've seen a few instances at work where witness_warn() in ast()
> > indicates the sched lock is still held, but the place it claims it was
> > held by is in fact sometimes not possible to keep the lock, like:
> >
> >        thread_lock(td);
> >        td->td_flags &= ~TDF_SELECT;
> >        thread_unlock(td);
> >
> > What I was wondering is, even though the assembly I see in objdump -S
> > for witness_warn has the increment of td_pinned before the PCPU_GET:
> >
> > 802db210:       65 48 8b 1c 25 00 00    mov    %gs:0x0,%rbx
> > 802db217:       00 00
> > 802db219:       ff 83 04 01 00 00       incl   0x104(%rbx)
> >         * Pin the thread in order to avoid problems with thread migration.
> >         * Once that all verifies are passed about spinlocks ownership,
> >         * the thread is in a safe path and it can be unpinned.
> >         */
> >        sched_pin();
> >        lock_list = PCPU_GET(spinlocks);
> > 802db21f:       65 48 8b 04 25 48 00    mov    %gs:0x48,%rax
> > 802db226:       00 00
> >        if (lock_list != NULL && lock_list->ll_count != 0) {
> > 802db228:       48 85 c0                test   %rax,%rax
> >         * Pin the thread in order to avoid problems with thread migration.
> >         * Once that all verifies are passed about spinlocks ownership,
> >         * the thread is in a safe path and it can be unpinned.
> >         */
> >        sched_pin();
> >        lock_list = PCPU_GET(spinlocks);
> > 802db22b:       48 89 85 f0 fe ff ff    mov    %rax,-0x110(%rbp)
> > 802db232:       48 89 85 f8 fe ff ff    mov    %rax,-0x108(%rbp)
> >        if (lock_list != NULL && lock_list->ll_count != 0) {
> > 802db239:       0f 84 ff 00 00 00       je     802db33e
> > 
> > 802db23f:       44 8b 60 50             mov    0x50(%rax),%r12d
> >
> > is it possible for the hardware to do any re-ordering here?
> >
> > The reason I'm suspicious is not just that the code doesn't have a
> > lock leak at the indicated point, but in one instance I can see in the
> > dump that the lock_list local from witness_warn is from the pcpu
> > structure for CPU 0 (and I was warned about sched lock 0), but the
> > thread id in panic_cpu is 2.  So clearly the thread was being migrated
> > right around panic time.
> >
> > This is the amd64 kernel on stable/7.  I'm not sure exactly what kind
> > of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC.
> >
> > So... do we need some kind of barrier in the code for sched_pin() for
> > it to really do what it claims?  Could the hardware have re-ordered
> > the "mov    %gs:0x48,%rax" PCPU_GET to before the sched_pin()
> > increment?
> 
> So after some research, the answer I'm getting is "maybe".  What I'm
> concerned about is whether the h/w reordered the read of PCPU_GET in
> front of the previous store to increment td_pinned.  While not an
> ultimate authority,
> http://en.wikipedia.org/wiki/Memory_ordering#In_SMP_microprocessor_systems
> implies that stores can be reordered after loads for both Intel and
> amd64 chips, which would I believe account for the behavior seen here.
> 

Am I right that you suggest that in the sequence
	mov     %gs:0x0,%rbx      [1]
	incl    0x104(%rbx)       [2]
	mov     %gs:0x48,%rax     [3]
an interrupt and preemption happen between points [2] and [3]?
And the %rax value, after the thread was put back onto the (different) new
CPU and executed [3], was still from the old CPU's pcpu area?

I do not believe this is possible. CPU is always self-consistent. Context
switch from the thread can only occur on the return from interrupt
handler, in critical_exit() or such. This code is executing on the
same processor, and thus should already see the effect of [2], that
would prevent context switch.

If interrupt happens between [1] and [2], then context saving code
should still see the consistent view of the register file state,
regardless of the processor issuing speculative read of
*%gs:0x48. Return from the interrupt is the serialization point due to
iret, causing read in [3] to be reissued.


