from:"Attilio Rao"

Re: locks under printf(9) and WITNESS = panic?

2013-07-11 Thread Attilio Rao

On Thu, Jul 11, 2013 at 1:21 PM, John Baldwin  wrote:
> On Saturday, June 29, 2013 9:19:24 pm Steven Hartland wrote:
>> when booting stable/9 under a debug kernel with WITNESS
>> enabled and verbose I get the following panic..
>>
>> It seems very much like the discussion from a year back on
>> current:
>> http://lists.freebsd.org/pipermail/freebsd-current/2012-January/031375.html
>>
>> Any ideas?
>
> Yeah, that lock needs to be MTX_RECURSE (the cnputs_mtx).  However, it
> only recurses under witness.  *sigh*

I have a patch that makes mtx_lock_flags() accept MTX_RECURSE. I will
commit it as soon as all the consumer code has been reviewed, which
should be any day now.
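
To illustrate why the flag matters, here is a minimal userspace sketch
of a mutex that either permits or refuses recursion by its owner. It is
only a model: struct model_mtx, model_lock() and model_unlock() are
hypothetical names, not the kernel's mtx(9) code, and contention between
threads is not modelled. It mirrors the case above, where the same
thread re-enters the console output path while it already holds the
lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a mutex; not the kernel's struct mtx. */
struct model_mtx {
	const void *owner;	/* current owner, NULL when unlocked */
	unsigned recurse;	/* extra acquisitions by the owner */
	bool allow_recurse;	/* models the MTX_RECURSE flag */
};

/* Returns true on success, false where a debug kernel would panic. */
static bool
model_lock(struct model_mtx *m, const void *self)
{
	if (m->owner == NULL) {
		m->owner = self;
		return (true);
	}
	if (m->owner == self && m->allow_recurse) {
		m->recurse++;
		return (true);
	}
	/* Recursing on a non-recursive lock (contention is not modelled). */
	return (false);
}

/* Undo one acquisition; the lock is free once the count unwinds. */
static void
model_unlock(struct model_mtx *m)
{
	if (m->recurse > 0)
		m->recurse--;
	else
		m->owner = NULL;
}
```

With allow_recurse set, a second model_lock() by the owner succeeds and
needs a matching model_unlock(); without it, the second call fails,
which is the model's stand-in for the panic.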

Attilio


--
Peace can only be achieved by understanding - A. Einstein
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"

Re: 9.1 coredump

2013-01-23 Thread Attilio Rao

On Wed, Jan 23, 2013 at 1:32 PM, Alexander Nikiforenko
 wrote:
>>hi, i ran ssh-keygen with output to a 32 GB USB 3.0 flash drive, and got this core
>
> sorry, i forgot.
> i mounted that flash drive via fusefs-exfat-0.9.8

Is this on stable/9?
If so, I will send you patches for the new fuse implementation shortly.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: lock violation in unionfs (9.0-STABLE r230270)

2012-11-02 Thread Attilio Rao

On Wed, Oct 31, 2012 at 11:11 AM, Harald Schmalzbauer
 wrote:
>  schrieb Attilio Rao am 29.10.2012 23:02 (localtime):
>> On Mon, Oct 29, 2012 at 7:37 PM, Harald Schmalzbauer
>>  wrote:
>>>  schrieb Attilio Rao am 27.10.2012 23:07 (localtime):
>>>> On Sat, Oct 27, 2012 at 9:46 PM, Attilio Rao  wrote:
>>>>> On Sat, Sep 8, 2012 at 12:48 AM, Attilio Rao  wrote:
>>>>>> On Thu, Sep 6, 2012 at 4:52 PM, Harald Schmalzbauer
>>>>>>  wrote:
>>>>>>>  schrieb Attilio Rao am 09.08.2012 20:26 (localtime):
>>>>>>>> On 8/8/12, Harald Schmalzbauer  wrote:
>>>>>>>>>  schrieb Pavel Polyakov am 06.03.2012 11:20 (localtime):
>>>>>>>>>>>> mount -t unionfs -o noatime /usr /mnt
>>>>>>>>>>>>
>>>>>>>>>>>> insmntque: mp-safe fs and non-locked vp: 0xfe01d96704f0 is not
>>>>>>>>>>>> exclusive locked but should be
>>>>>>>>>>>> KDB: enter: lock violation
>>>>>>>>>>> Pavel,
>>>>>>>>>>> can you give a spin to this patch?:
>>>>>>>>>>> http://www.freebsd.org/~attilio/unionfs_missing_insmntque_lock.patch
>>>>>>>>>>>
>>>>>>>>>>> I think that the unlocking is due at that point as the vnode lock can
>>>>>>>>>>> be switched later on.
>>>>>>>>>>>
>>>>>>>>>>> Let me know what you think about it and what the test does.
>>>>>>>>>> Thanks!
>>>>>>>>>> This patch fixes the problem with lock violation. Sorry I've tested it so
>>>>>>>>>> late.
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> this patch still applies cleanly to RELENG_9_1. Was there another fix
>>>>>>>>> for the issue or has it just not been PR-sent and thus forgotten?
>>>>>>>> Can you and Pavel try the attached patch? Unfortunately I had no time
>>>>>>>> to test it, I just made in 5 free mins from a non-FreeBSD workstation,
>>>>>>> Sorry, couldn't test earlier, but now I did:
>>>>>>> With this patch applied the machine hangs without debug kernel and the
>>>>>>> latter gives the following panic:
>>>>>>> System call nmount returning with the following locks held:
>>>>>>> exclusive lockmgr ufs (ufs) r = 0 (0xc5438278) locked @
>>>>>>> src/sys/fs/unionfs/union_vnops.c:1938
>>>>>>> panic: witness_warn
>>>>>>> cpuid = 0
>>>>>>> KDB: stack backtrace:
>>>>>>> db_trace_self_wrapper(c0a04f7f,c0c112c4,d1de3bb4,c097aa8c,fc,...) at
>>>>>>> db_trace_self_wrapper+0x26
>>>>>>> kdb_backtrace(c0a4965f,0,c09c2ede3c1c,0,...) at kdb_backtrace+0x2a
>>>>>>> witness_warn(2,0,c0a4ac34,c0a0990a,286,...) at witness_warn+0x1e4
>>>>>>> syscall(d1de3d08) at syscall+0x415
>>>>>>> Xint0x80_syscall() at Xint0x80_syscall+0x21
>>>>>>> --- syscall (0, FreeBSD ELF32, nosys), eip = 0x280b883f,esp =
>>>>>>> 0xbfbfe46c, ebp = 0xbfbfede8 ---
>>>>>>> KDB: enter: panic
>>>>>>> [ thread pid 86 tid 100054 ]
>>>>>>> Stopped at kdb_enter+0x3a: movl $0,kdb_why
>>>>>>> db> bt
>>>>>>> Tracing pid 86 tid 100054 td 0xc541b000
>>>>>>> kdb_enter(c0a00d16,c0a09130,0,0,0,...) at panic+0x190
>>>>>>> witness_warn(2,0,c0a4ac34,c0a0990a,286,...) at witness_warn+0x1e4
>>>>>>> syscall(d1de3d08) at syscall+0x415
>>>>>>> Xint0x80_syscall() at Xint0x80_syscall+0x21
>>>>>>>
>>>>>>> Hmm, I guess I forgot to install kernel debug symbols...
>>>>>>> Coming back if I have more
>>>>>> Unfortunately unionfs does very wrong things with the insmntque() locking.
>>>>>> It basically expects the vnode to be returned locked in the same way
>>>>>> requested by the preceding namei() (when that happens) but when you do
>>>>>> insmntque() you

Re: lock violation in unionfs (9.0-STABLE r230270)

2012-10-27 Thread Attilio Rao

On Sat, Oct 27, 2012 at 9:46 PM, Attilio Rao  wrote:
> On Sat, Sep 8, 2012 at 12:48 AM, Attilio Rao  wrote:
>> On Thu, Sep 6, 2012 at 4:52 PM, Harald Schmalzbauer
>>  wrote:
>>>  schrieb Attilio Rao am 09.08.2012 20:26 (localtime):
>>>> On 8/8/12, Harald Schmalzbauer  wrote:
>>>>>  schrieb Pavel Polyakov am 06.03.2012 11:20 (localtime):
>>>>>>>> mount -t unionfs -o noatime /usr /mnt
>>>>>>>>
>>>>>>>> insmntque: mp-safe fs and non-locked vp: 0xfe01d96704f0 is not
>>>>>>>> exclusive locked but should be
>>>>>>>> KDB: enter: lock violation
>>>>>>> Pavel,
>>>>>>> can you give a spin to this patch?:
>>>>>>> http://www.freebsd.org/~attilio/unionfs_missing_insmntque_lock.patch
>>>>>>>
>>>>>>> I think that the unlocking is due at that point as the vnode lock can
>>>>>>> be switched later on.
>>>>>>>
>>>>>>> Let me know what you think about it and what the test does.
>>>>>> Thanks!
>>>>>> This patch fixes the problem with lock violation. Sorry I've tested it so
>>>>>> late.
>>>>> Hello,
>>>>>
>>>>> this patch still applies cleanly to RELENG_9_1. Was there another fix
>>>>> for the issue or has it just not been PR-sent and thus forgotten?
>>>> Can you and Pavel try the attached patch? Unfortunately I had no time
>>>> to test it, I just made in 5 free mins from a non-FreeBSD workstation,
>>>
>>> Sorry, couldn't test earlier, but now I did:
>>> With this patch applied the machine hangs without debug kernel and the
>>> latter gives the following panic:
>>> System call nmount returning with the following locks held:
>>> exclusive lockmgr ufs (ufs) r = 0 (0xc5438278) locked @
>>> src/sys/fs/unionfs/union_vnops.c:1938
>>> panic: witness_warn
>>> cpuid = 0
>>> KDB: stack backtrace:
>>> db_trace_self_wrapper(c0a04f7f,c0c112c4,d1de3bb4,c097aa8c,fc,...) at
>>> db_trace_self_wrapper+0x26
>>> kdb_backtrace(c0a4965f,0,c09c2ede3c1c,0,...) at kdb_backtrace+0x2a
>>> witness_warn(2,0,c0a4ac34,c0a0990a,286,...) at witness_warn+0x1e4
>>> syscall(d1de3d08) at syscall+0x415
>>> Xint0x80_syscall() at Xint0x80_syscall+0x21
>>> --- syscall (0, FreeBSD ELF32, nosys), eip = 0x280b883f,esp =
>>> 0xbfbfe46c, ebp = 0xbfbfede8 ---
>>> KDB: enter: panic
>>> [ thread pid 86 tid 100054 ]
>>> Stopped at kdb_enter+0x3a: movl $0,kdb_why
>>> db> bt
>>> Tracing pid 86 tid 100054 td 0xc541b000
>>> kdb_enter(c0a00d16,c0a09130,0,0,0,...) at panic+0x190
>>> witness_warn(2,0,c0a4ac34,c0a0990a,286,...) at witness_warn+0x1e4
>>> syscall(d1de3d08) at syscall+0x415
>>> Xint0x80_syscall() at Xint0x80_syscall+0x21
>>>
>>> Hmm, I guess I forgot to install kernel debug symbols...
>>> Coming back if I have more
>>
>> Unfortunately unionfs does very wrong things with the insmntque() locking.
>> It basically expects the vnode to be returned locked in the same way
>> requested by the preceding namei() (when that happens), but when you call
>> insmntque() you can only hold an LK_EXCLUSIVE lock on the vnode.
>
> Hello,
> the following patch should work out the issues around unionfs_nodeget() a bit:
> http://www.freebsd.org/~attilio/unionfs_nodeget2.patch
>
> Unfortunately the unionfs code is rather messy about locking
> requirements in the lookup path, so following what needs to be done
> there is a bit difficult.
> I have no way to test this patch, so it is only compile-tested at the
> moment, and I would need you to also test the lookup path (directory
> "ls", find(1) on the whole unionfs volume, etc.) to validate it
> somehow.

On second thought, I think that locking in lookup (and in other
operations too) is so fragile and difficult to follow that it turns all
the vnops into real locking landmines.
I think that the following patch fixes the insmntque() insertion and
follows the old approach closely enough to be committed separately:
http://www.freebsd.org/~attilio/unionfs_nodeget3.patch

However, I strongly suggest that someone review and sweep all the
locking in nodeget and related functions: removing the tedious
lkflags conditionals, reinforcing and making explicit the locking rules
within functions, checking for races (I'm sure there are quite a few,
given that the vnode lock gets dropped indiscriminately in many places)
and possibly reviewing the prolific use of LK_RETRY, which I'm
sure is not always safe.

All these steps should really be carried out separately.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: lock violation in unionfs (9.0-STABLE r230270)

2012-10-27 Thread Attilio Rao

On Sat, Sep 8, 2012 at 12:48 AM, Attilio Rao  wrote:
> On Thu, Sep 6, 2012 at 4:52 PM, Harald Schmalzbauer
>  wrote:
>>  schrieb Attilio Rao am 09.08.2012 20:26 (localtime):
>>> On 8/8/12, Harald Schmalzbauer  wrote:
>>>>  schrieb Pavel Polyakov am 06.03.2012 11:20 (localtime):
>>>>>>> mount -t unionfs -o noatime /usr /mnt
>>>>>>>
>>>>>>> insmntque: mp-safe fs and non-locked vp: 0xfe01d96704f0 is not
>>>>>>> exclusive locked but should be
>>>>>>> KDB: enter: lock violation
>>>>>> Pavel,
>>>>>> can you give a spin to this patch?:
>>>>>> http://www.freebsd.org/~attilio/unionfs_missing_insmntque_lock.patch
>>>>>>
>>>>>> I think that the unlocking is due at that point as the vnode lock can
>>>>>> be switched later on.
>>>>>>
>>>>>> Let me know what you think about it and what the test does.
>>>>> Thanks!
>>>>> This patch fixes the problem with lock violation. Sorry I've tested it so
>>>>> late.
>>>> Hello,
>>>>
>>>> this patch still applies cleanly to RELENG_9_1. Was there another fix
>>>> for the issue or has it just not been PR-sent and thus forgotten?
>>> Can you and Pavel try the attached patch? Unfortunately I had no time
>>> to test it, I just made in 5 free mins from a non-FreeBSD workstation,
>>
>> Sorry, couldn't test earlier, but now I did:
>> With this patch applied the machine hangs without debug kernel and the
>> latter gives the following panic:
>> System call nmount returning with the following locks held:
>> exclusive lockmgr ufs (ufs) r = 0 (0xc5438278) locked @
>> src/sys/fs/unionfs/union_vnops.c:1938
>> panic: witness_warn
>> cpuid = 0
>> KDB: stack backtrace:
>> db_trace_self_wrapper(c0a04f7f,c0c112c4,d1de3bb4,c097aa8c,fc,...) at
>> db_trace_self_wrapper+0x26
>> kdb_backtrace(c0a4965f,0,c09c2ede3c1c,0,...) at kdb_backtrace+0x2a
>> witness_warn(2,0,c0a4ac34,c0a0990a,286,...) at witness_warn+0x1e4
>> syscall(d1de3d08) at syscall+0x415
>> Xint0x80_syscall() at Xint0x80_syscall+0x21
>> --- syscall (0, FreeBSD ELF32, nosys), eip = 0x280b883f,esp =
>> 0xbfbfe46c, ebp = 0xbfbfede8 ---
>> KDB: enter: panic
>> [ thread pid 86 tid 100054 ]
>> Stopped at kdb_enter+0x3a: movl $0,kdb_why
>> db> bt
>> Tracing pid 86 tid 100054 td 0xc541b000
>> kdb_enter(c0a00d16,c0a09130,0,0,0,...) at panic+0x190
>> witness_warn(2,0,c0a4ac34,c0a0990a,286,...) at witness_warn+0x1e4
>> syscall(d1de3d08) at syscall+0x415
>> Xint0x80_syscall() at Xint0x80_syscall+0x21
>>
>> Hmm, I guess I forgot to install kernel debug symbols...
>> Coming back if I have more
>
> Unfortunately unionfs does very wrong things with the insmntque() locking.
> It basically expects the vnode to be returned locked in the same way
> requested by the preceding namei() (when that happens), but when you call
> insmntque() you can only hold an LK_EXCLUSIVE lock on the vnode.

Hello,
the following patch should work out the issues around unionfs_nodeget() a bit:
http://www.freebsd.org/~attilio/unionfs_nodeget2.patch

Unfortunately the unionfs code is rather messy about locking
requirements in the lookup path, so following what needs to be done
there is a bit difficult.
I have no way to test this patch, so it is only compile-tested at the
moment, and I would need you to also test the lookup path (directory
"ls", find(1) on the whole unionfs volume, etc.) to validate it
somehow.

If it panics again, please provide the kernel.debug and the vmcore.X file.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: Panic with fusefs-ntfs on FreeBSD 9 RC1 amd64

2012-10-08 Thread Attilio Rao

On Fri, Sep 28, 2012 at 11:32 PM, Kevin Oberman  wrote:
> On Fri, Sep 28, 2012 at 7:20 AM, Attilio Rao  wrote:
>> On Mon, Sep 24, 2012 at 6:25 PM, Kevin Oberman  wrote:
>>> On Tue, Sep 18, 2012 at 7:55 PM, Attilio Rao  wrote:
>>>> On Tue, Sep 18, 2012 at 5:14 PM, Marcelo Gondim wrote:
>>>>> I installed the package ntfs-fusefs on two different servers and on both
>>>>> it causes a kernel panic when trying to copy anything.
>>>>> One server runs FreeBSD 9.0 STABLE amd64 and the other FreeBSD 9 RC1
>>>>> amd64.
>>>>> Is anyone else having the same problem?
>>>>
>>>> Hello Marcelo,
>>>> Do you think you can try fuse import explained here:
>>>> http://lists.freebsd.org/pipermail/freebsd-current/2012-September/036677.html
>>>>
>>>> The proposed patch is for HEAD@240684 but I'm sure it should apply
>>>> cleanly to RELENG_9_1 too.
>>>>
>>>> Please let me know if you have further questions.
>>>
>>> I tried patching 9-Stable with fuse_240684.patch. It applied cleanly,
>>> but the kernel build failed:
>>> cc -O2 -pipe -fno-strict-aliasing -Werror -D_KERNEL -DKLD_MODULE
>>> -nostdinc   -DHAVE_KERNEL_OPTION_HEADERS -include
>>> /usr/obj/usr/src/sys/GENERIC/opt_global.h -I. -I@ -I@/contrib/altq
>>> -finline-limit=8000 --param inline-unit-growth=100 --param
>>> large-function-growth=1000 -fno-common -g -fno-omit-frame-pointer
>>> -I/usr/obj/usr/src/sys/GENERIC  -mcmodel=kernel -mno-red-zone -mno-mmx
>>> -mno-sse -msoft-float  -fno-asynchronous-unwind-tables -ffreestanding
>>> -fstack-protector -std=iso9899:1999 -fstack-protector -Wall
>>> -Wredundant-decls -Wnested-externs -Wstrict-prototypes
>>> -Wmissing-prototypes -Wpointer-arith -Winline -Wcast-qual  -Wundef
>>> -Wno-pointer-sign -fformat-extensions  -Wmissing-include-dirs
>>> -fdiagnostics-show-option   -c
>>> /usr/src/sys/modules/fuse/../../fs/fuse/fuse_ipc.c
>>> cc1: warnings being treated as errors
>>> /usr/src/sys/modules/fuse/../../fs/fuse/fuse_node.c: In function
>>> 'fuse_vnode_setsize':
>>> /usr/src/sys/modules/fuse/../../fs/fuse/fuse_node.c:378: warning:
>>> passing argument 3 of 'vtruncbuf' makes pointer from integer without a
>>> cast
>>> /usr/src/sys/modules/fuse/../../fs/fuse/fuse_node.c:378: error: too
>>> few arguments to function 'vtruncbuf'
>>> *** [fuse_node.o] Error code 1
>>>
>>> Looks like something has changed between stable and current that won't
>>> work. Any suggestions for a quick fix?
>>
>> Please check this out:
>> http://lists.freebsd.org/pipermail/freebsd-current/2012-September/036862.html
>
> Attilio,
>
> stable/9 (r239879) patched and lightly tested. Seems to be working
> fine at this time. I still need to further study mount_fusefs and I am
> still using the old mount-fuse script until I can look at how HAL and
> Gnome will handle things.
>
> I'll try my rsync test, which reliably crashed the system with the old
> fusefs stuff, this weekend.

So, did you try this? Any news?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: Panic with fusefs-ntfs on FreeBSD 9 RC1 amd64

2012-09-28 Thread Attilio Rao

On Mon, Sep 24, 2012 at 6:25 PM, Kevin Oberman  wrote:
> On Tue, Sep 18, 2012 at 7:55 PM, Attilio Rao  wrote:
>> On Tue, Sep 18, 2012 at 5:14 PM, Marcelo Gondim wrote:
>>> I installed the package ntfs-fusefs on two different servers and on both
>>> it causes a kernel panic when trying to copy anything.
>>> One server runs FreeBSD 9.0 STABLE amd64 and the other FreeBSD 9 RC1
>>> amd64.
>>> Is anyone else having the same problem?
>>
>> Hello Marcelo,
>> Do you think you can try fuse import explained here:
>> http://lists.freebsd.org/pipermail/freebsd-current/2012-September/036677.html
>>
>> The proposed patch is for HEAD@240684 but I'm sure it should apply
>> cleanly to RELENG_9_1 too.
>>
>> Please let me know if you have further questions.
>
> I tried patching 9-Stable with fuse_240684.patch. It applied cleanly,
> but the kernel build failed:
> cc -O2 -pipe -fno-strict-aliasing -Werror -D_KERNEL -DKLD_MODULE
> -nostdinc   -DHAVE_KERNEL_OPTION_HEADERS -include
> /usr/obj/usr/src/sys/GENERIC/opt_global.h -I. -I@ -I@/contrib/altq
> -finline-limit=8000 --param inline-unit-growth=100 --param
> large-function-growth=1000 -fno-common -g -fno-omit-frame-pointer
> -I/usr/obj/usr/src/sys/GENERIC  -mcmodel=kernel -mno-red-zone -mno-mmx
> -mno-sse -msoft-float  -fno-asynchronous-unwind-tables -ffreestanding
> -fstack-protector -std=iso9899:1999 -fstack-protector -Wall
> -Wredundant-decls -Wnested-externs -Wstrict-prototypes
> -Wmissing-prototypes -Wpointer-arith -Winline -Wcast-qual  -Wundef
> -Wno-pointer-sign -fformat-extensions  -Wmissing-include-dirs
> -fdiagnostics-show-option   -c
> /usr/src/sys/modules/fuse/../../fs/fuse/fuse_ipc.c
> cc1: warnings being treated as errors
> /usr/src/sys/modules/fuse/../../fs/fuse/fuse_node.c: In function
> 'fuse_vnode_setsize':
> /usr/src/sys/modules/fuse/../../fs/fuse/fuse_node.c:378: warning:
> passing argument 3 of 'vtruncbuf' makes pointer from integer without a
> cast
> /usr/src/sys/modules/fuse/../../fs/fuse/fuse_node.c:378: error: too
> few arguments to function 'vtruncbuf'
> *** [fuse_node.o] Error code 1
>
> Looks like something has changed between stable and current that won't
> work. Any suggestions for a quick fix?

Please check this out:
http://lists.freebsd.org/pipermail/freebsd-current/2012-September/036862.html

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: Panic with fusefs-ntfs on FreeBSD 9 RC1 amd64

2012-09-18 Thread Attilio Rao

On Tue, Sep 18, 2012 at 5:14 PM, Marcelo Gondim  wrote:
> I installed the package ntfs-fusefs on two different servers and on both
> it causes a kernel panic when trying to copy anything.
> One server runs FreeBSD 9.0 STABLE amd64 and the other FreeBSD 9 RC1
> amd64.
> Is anyone else having the same problem?

Hello Marcelo,
Could you try the fuse import explained here:
http://lists.freebsd.org/pipermail/freebsd-current/2012-September/036677.html

The proposed patch is for HEAD@240684 but I expect it to apply
cleanly to RELENG_9_1 too.

Please let me know if you have further questions.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: lock violation in unionfs (9.0-STABLE r230270)

2012-09-07 Thread Attilio Rao

On Thu, Sep 6, 2012 at 4:52 PM, Harald Schmalzbauer
 wrote:
>  schrieb Attilio Rao am 09.08.2012 20:26 (localtime):
>> On 8/8/12, Harald Schmalzbauer  wrote:
>>>  schrieb Pavel Polyakov am 06.03.2012 11:20 (localtime):
>>>>>> mount -t unionfs -o noatime /usr /mnt
>>>>>>
>>>>>> insmntque: mp-safe fs and non-locked vp: 0xfe01d96704f0 is not
>>>>>> exclusive locked but should be
>>>>>> KDB: enter: lock violation
>>>>> Pavel,
>>>>> can you give a spin to this patch?:
>>>>> http://www.freebsd.org/~attilio/unionfs_missing_insmntque_lock.patch
>>>>>
>>>>> I think that the unlocking is due at that point as the vnode lock can
>>>>> be switched later on.
>>>>>
>>>>> Let me know what you think about it and what the test does.
>>>> Thanks!
>>>> This patch fixes the problem with lock violation. Sorry I've tested it so
>>>> late.
>>> Hello,
>>>
>>> this patch still applies cleanly to RELENG_9_1. Was there another fix
>>> for the issue or has it just not been PR-sent and thus forgotten?
>> Can you and Pavel try the attached patch? Unfortunately I had no time
>> to test it, I just made in 5 free mins from a non-FreeBSD workstation,
>
> Sorry, couldn't test earlier, but now I did:
> With this patch applied the machine hangs without debug kernel and the
> latter gives the following panic:
> System call nmount returning with the following locks held:
> exclusive lockmgr ufs (ufs) r = 0 (0xc5438278) locked @
> src/sys/fs/unionfs/union_vnops.c:1938
> panic: witness_warn
> cpuid = 0
> KDB: stack backtrace:
> db_trace_self_wrapper(c0a04f7f,c0c112c4,d1de3bb4,c097aa8c,fc,...) at
> db_trace_self_wrapper+0x26
> kdb_backtrace(c0a4965f,0,c09c2ede3c1c,0,...) at kdb_backtrace+0x2a
> witness_warn(2,0,c0a4ac34,c0a0990a,286,...) at witness_warn+0x1e4
> syscall(d1de3d08) at syscall+0x415
> Xint0x80_syscall() at Xint0x80_syscall+0x21
> --- syscall (0, FreeBSD ELF32, nosys), eip = 0x280b883f,esp =
> 0xbfbfe46c, ebp = 0xbfbfede8 ---
> KDB: enter: panic
> [ thread pid 86 tid 100054 ]
> Stopped at kdb_enter+0x3a: movl $0,kdb_why
> db> bt
> Tracing pid 86 tid 100054 td 0xc541b000
> kdb_enter(c0a00d16,c0a09130,0,0,0,...) at panic+0x190
> witness_warn(2,0,c0a4ac34,c0a0990a,286,...) at witness_warn+0x1e4
> syscall(d1de3d08) at syscall+0x415
> Xint0x80_syscall() at Xint0x80_syscall+0x21
>
> Hmm, I guess I forgot to install kernel debug symbols...
> Coming back if I have more

Unfortunately unionfs does very wrong things with the insmntque() locking.
It basically expects the vnode to be returned locked in the same way
requested by the preceding namei() (when that happens), but when you call
insmntque() you can only hold an LK_EXCLUSIVE lock on the vnode.
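
As a rough illustration of that rule, here is a minimal userspace
sketch. Everything in it is an assumption made for the example:
struct model_vnode, model_insmntque() and model_nodeget() are
hypothetical names, EPERM merely stands in for the lock violation a
debug kernel reports, and the ordering shown is one possible shape, not
what any of the patches in this thread do.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical lock states for a model vnode; not the kernel's lockmgr. */
enum model_lkstate { MODEL_UNLOCKED, MODEL_SHARED, MODEL_EXCLUSIVE };

struct model_vnode {
	enum model_lkstate lk;
	int on_mount_list;
};

/* Models the rule: insertion is only legal with the exclusive lock held. */
static int
model_insmntque(struct model_vnode *vp)
{
	if (vp->lk != MODEL_EXCLUSIVE)
		return (EPERM);	/* stand-in for the lock violation */
	vp->on_mount_list = 1;
	return (0);
}

/*
 * Models a nodeget-style constructor: take the exclusive lock first,
 * insert, and only then move to the lock type the caller asked for.
 */
static int
model_nodeget(struct model_vnode *vp, enum model_lkstate wanted)
{
	int error;

	vp->lk = MODEL_EXCLUSIVE;
	error = model_insmntque(vp);
	if (error != 0) {
		vp->lk = MODEL_UNLOCKED;
		return (error);
	}
	vp->lk = wanted;	/* downgrade or unlock after insertion */
	return (0);
}
```

The point of the sketch is only the ordering: whatever lock type the
caller asked for, the insertion itself happens under the exclusive lock.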

I still need some time to fix this but my bandwidth is basically 0 at
the moment; I'll try to get back to you with a patch as soon as
possible.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: lock violation in unionfs (9.0-STABLE r230270)

2012-08-09 Thread Attilio Rao

On 8/8/12, Harald Schmalzbauer  wrote:
>  schrieb Pavel Polyakov am 06.03.2012 11:20 (localtime):
>>>> mount -t unionfs -o noatime /usr /mnt
>>>>
>>>> insmntque: mp-safe fs and non-locked vp: 0xfe01d96704f0 is not
>>>> exclusive locked but should be
>>>> KDB: enter: lock violation
>>>
>>> Pavel,
>>> can you give a spin to this patch?:
>>> http://www.freebsd.org/~attilio/unionfs_missing_insmntque_lock.patch
>>>
>>> I think that the unlocking is due at that point as the vnode lock can
>>> be switched later on.
>>>
>>> Let me know what you think about it and what the test does.
>>
>> Thanks!
>> This patch fixes the problem with lock violation. Sorry I've tested it so
>> late.
>
> Hello,
>
> this patch still applies cleanly to RELENG_9_1. Was there another fix
> for the issue or has it just not been PR-sent and thus forgotten?

Can you and Pavel try the attached patch? Unfortunately I had no time
to test it: I wrote it in 5 free minutes on a non-FreeBSD workstation,
so you should be able to tell me whether it works, or even whether it
compiles, on RELENG_9_1.
Please try with the INVARIANTS option on.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein
Index: sys/fs/unionfs/union_subr.c
===
--- sys/fs/unionfs/union_subr.c	(revision 239152)
+++ sys/fs/unionfs/union_subr.c	(working copy)
@@ -237,7 +237,8 @@ unionfs_nodeget(struct mount *mp, struct vnode *up
 		if (vp != NULLVP) {
 			vref(vp);
 			*vpp = vp;
-			goto unionfs_nodeget_out;
+			lockmgr(vp->v_vnlock, LK_EXCLUSIVE, NULL);
+			return (0);
 		}
 	}
 
@@ -255,17 +256,19 @@ unionfs_nodeget(struct mount *mp, struct vnode *up
 	 */
 	unp = malloc(sizeof(struct unionfs_node),
 	M_UNIONFSNODE, M_WAITOK | M_ZERO);
+	if (path != NULL) {
+		unp->un_path = (char *)
+		malloc(cnp->cn_namelen +1, M_UNIONFSPATH, M_WAITOK|M_ZERO);
+		bcopy(cnp->cn_nameptr, unp->un_path, cnp->cn_namelen);
+		unp->un_path[cnp->cn_namelen] = '\0';
+	}
 
 	error = getnewvnode("unionfs", mp, &unionfs_vnodeops, &vp);
 	if (error != 0) {
+		free(unp->un_path, M_UNIONFSNODE);
 		free(unp, M_UNIONFSNODE);
 		return (error);
 	}
-	error = insmntque(vp, mp);	/* XXX: Too early for mpsafe fs */
-	if (error != 0) {
-		free(unp, M_UNIONFSNODE);
-		return (error);
-	}
 	if (dvp != NULLVP)
 		vref(dvp);
 	if (uppervp != NULLVP)
@@ -286,15 +289,22 @@ unionfs_nodeget(struct mount *mp, struct vnode *up
 	else
 		vp->v_vnlock = lowervp->v_vnlock;
 
-	if (path != NULL) {
-		unp->un_path = (char *)
-		malloc(cnp->cn_namelen +1, M_UNIONFSPATH, M_WAITOK|M_ZERO);
-		bcopy(cnp->cn_nameptr, unp->un_path, cnp->cn_namelen);
-		unp->un_path[cnp->cn_namelen] = '\0';
-	}
 	vp->v_type = vt;
 	vp->v_data = unp;
 
+	lockmgr(vp->v_vnlock, LK_EXCLUSIVE, NULL);
+	error = insmntque(vp, mp);
+	if (error != 0) {
+		if (dvp != NULLVP)
+			vrele(dvp);
+		if (uppervp != NULLVP)
+			vrele(uppervp);
+		if (lowervp != NULLVP)
+			vrele(lowervp);
+		free(unp->un_path, M_UNIONFSNODE);
+		free(unp, M_UNIONFSNODE);
+		return (error);
+	}
 	if ((uppervp != NULLVP && ump->um_uppervp == uppervp) &&
 	(lowervp != NULLVP && ump->um_lowervp == lowervp))
 		vp->v_vflag |= VV_ROOT;
@@ -317,11 +327,6 @@ unionfs_nodeget(struct mount *mp, struct vnode *up
 		vref(vp);
 	} else
 		*vpp = vp;
-
-unionfs_nodeget_out:
-	if (lkflags & LK_TYPE_MASK)
-		vn_lock(vp, lkflags | LK_RETRY);
-
 	return (0);
 }
 

Re: lock violation in unionfs (9.0-STABLE r230270)

2012-08-08 Thread Attilio Rao

On 8/8/12, Harald Schmalzbauer  wrote:
>  schrieb Pavel Polyakov am 06.03.2012 11:20 (localtime):
>>>> mount -t unionfs -o noatime /usr /mnt
>>>>
>>>> insmntque: mp-safe fs and non-locked vp: 0xfe01d96704f0 is not
>>>> exclusive locked but should be
>>>> KDB: enter: lock violation
>>>
>>> Pavel,
>>> can you give a spin to this patch?:
>>> http://www.freebsd.org/~attilio/unionfs_missing_insmntque_lock.patch
>>>
>>> I think that the unlocking is due at that point as the vnode lock can
>>> be switched later on.
>>>
>>> Let me know what you think about it and what the test does.
>>
>> Thanks!
>> This patch fixes the problem with lock violation. Sorry I've tested it so
>> late.
>
> Hello,
>
> this patch still applies cleanly to RELENG_9_1. Was there another fix
> for the issue or has it just not been PR-sent and thus forgotten?

There are more things to fix in inode instantiation for unionfs. I
hope to have a comprehensive patch ready for testing in a couple of days.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: [stable 9] panic on reboot: ipmi_wd_event()

2012-08-02 Thread Attilio Rao

On 8/2/12, John Baldwin  wrote:
> On Wednesday, August 01, 2012 6:48:48 pm Sean Bruno wrote:
>> On Wed, 2012-08-01 at 05:53 -0700, John Baldwin wrote:
>> > Index: vfs_subr.c
>> > ===
>> > --- vfs_subr.c  (revision 238969)
>> > +++ vfs_subr.c  (working copy)
>> > @@ -1868,8 +1868,11 @@ sched_sync(void)
>> > continue;
>> > }
>> >
>> > -   if (first_printf == 0)
>> > +   if (first_printf == 0) {
>> > +   mtx_unlock(&sync_mtx);
>> > wdog_kern_pat(WD_LASTVAL);
>> > +   mtx_lock(&sync_mtx);
>> > +   }
>> >
>> > }
>> > if (!LIST_EMPTY(gslp)) {
>> >
>> >
>> > --
>> > John Baldwin
>>
>> This definitely makes the panic go away on reboot.
>
> Attilio, does this change seem ok to you?

Thanks for asking me to review.

I think it is safe, because the next check re-evaluates the global
list with LIST_EMPTY() anyway.
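
The pattern in the diff above (drop the mutex around a call that may
sleep, then retake it) can be sketched in userspace as follows. This is
only a model under stated assumptions: the model_* names are
hypothetical, the "mutex" is a plain flag, and model_wdog_pat() just
counts how often it was entered with the flag set instead of actually
sleeping.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a non-sleepable mutex such as sync_mtx. */
static bool model_sync_mtx_held;
static int model_sleep_violations;

static void model_mtx_lock(void) { model_sync_mtx_held = true; }
static void model_mtx_unlock(void) { model_sync_mtx_held = false; }

/* Stand-in for a watchdog pat whose handler may sleep (e.g. IPMI). */
static void
model_wdog_pat(void)
{
	if (model_sync_mtx_held)
		model_sleep_violations++;	/* would sleep with the lock held */
}

/* Shape of the loop body before the change: pat with the mutex held. */
static void
model_pat_locked(void)
{
	model_mtx_lock();
	model_wdog_pat();
	model_mtx_unlock();
}

/* Shape after the change: drop the mutex around the pat, then retake it. */
static void
model_pat_unlocked(void)
{
	model_mtx_lock();
	model_mtx_unlock();
	model_wdog_pat();
	model_mtx_lock();
	/* State guarded by the mutex must be re-read here, not cached. */
	model_mtx_unlock();
}
```

The comment in model_pat_unlocked() is the part that corresponds to the
review remark: anything protected by the mutex has to be looked at
again once the lock is reacquired.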

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: [stable 9] panic on reboot: ipmi_wd_event()

2012-08-01 Thread Attilio Rao

On 8/1/12, Attilio Rao  wrote:
> On 8/1/12, John Baldwin  wrote:
>> On Tuesday, July 31, 2012 4:51:19 pm Attilio Rao wrote:
>>> On 7/31/12, John Baldwin  wrote:
>>> > On Thursday, July 19, 2012 7:58:14 pm Sean Bruno wrote:
>>> >> Working on the Dell R420 today, got most of it working, even the
>>> >> broadcom ethernet cards!  However, I get the following when I reboot
>>> >> the system:
>>> >>
>>> >> Syncing disks, vnodes remaining...4 Sleeping thread (tid 100107, pid 9)
>>> >> owns a non-sleepable lock
>>> >> KDB: stack backtrace of thread 100107:
>>> >> sched_switch() at sched_switch+0x19f
>>> >> mi_switch() at mi_switch+0x208
>>> >> sleepq_switch() at sleepq_switch+0xfc
>>> >> sleepq_wait() at sleepq_wait+0x4d
>>> >> _sleep() at _sleep+0x3f6
>>> >> ipmi_submit_driver_request() at ipmi_submit_driver_request+0x97
>>> >> ipmi_set_watchdog() at ipmi_set_watchdog+0xb1
>>> >> ipmi_wd_event() at ipmi_wd_event+0x8f
>>> >> kern_do_pat() at kern_do_pat+0x10f
>>> >> sched_sync() at sched_sync+0x1ea
>>> >> fork_exit() at fork_exit+0x135
>>> >> fork_trampoline() at fork_trampoline+0xe
>>> >
>>> > Hmmm, the watchdog pat should probably happen without holding locks if
>>> > possible.  This is related to the IPMI watchdog being special and
>>> > wanting to schedule a thread to work.
>>>
>>> The watchdog pat without the locks is not easy to do because we
>>> register the watchdog callbacks in eventhandlers, which are indeed
>>> locked (and you may also end up racing against watchdog detach, if you
>>> don't use any lock at all).
>>
>> No, eventhandlers go through several hoops to not hold any locks while
>> the eventhandler functions are running.  It seems in this case that a
>> lock is held in a higher layer (sched_sync()) and that is what I was
>> talking about.  Yes, it is the 'sync_mtx' that is held.
>> Something like this
>
> No, EVENTHANDLER_INVOKE() acquires eventhandler internal locks.
> Look at eventhandler_find_list() for details.

Oh, but I guess you misunderstood me. I didn't mean to say that
eventhandler callbacks run with the eventhandler lock held; I meant to
say that it would be nice if EVENTHANDLER_INVOKE() could run
lockless. That would have avoided some issues in special contexts (I
recall I had some issues at work years ago, but they may have
predated the STOP_SCHEDULER() patch and been in DDB).
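
The distinction can be sketched with a small userspace model, assuming
a simplified picture of the invoke path: the list lock is taken for
lookup and bookkeeping, and dropped while each callback runs. All names
here (struct model_ehlist, model_invoke(), model_callback()) are
hypothetical and the "lock" is a plain flag; this is not the code in
sys/eventhandler.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*model_handler_t)(void *);

/* Hypothetical model of one handler list; not struct eventhandler_list. */
struct model_ehlist {
	bool locked;		/* models the list's internal mutex */
	int runcount;		/* invocations currently in progress */
	model_handler_t handlers[4];
	size_t nhandlers;
};

static int model_calls_with_lock_held;

/* Example callback: records whether it ran under the list lock. */
static void
model_callback(void *arg)
{
	const struct model_ehlist *el = arg;

	if (el->locked)
		model_calls_with_lock_held++;
}

/* The invoke path still takes the lock, but not across the callbacks. */
static void
model_invoke(struct model_ehlist *el)
{
	size_t i;

	el->locked = true;		/* lookup and bookkeeping are locked */
	el->runcount++;
	for (i = 0; i < el->nhandlers; i++) {
		el->locked = false;	/* dropped while the handler runs */
		el->handlers[i](el);
		el->locked = true;
	}
	el->runcount--;
	el->locked = false;
}
```

In this model both statements in the exchange hold at once: callbacks
never observe the lock as held, yet model_invoke() itself is not
lockless.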

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: [stable 9] panic on reboot: ipmi_wd_event()

2012-08-01 Thread Attilio Rao

On 8/1/12, John Baldwin  wrote:
> On Tuesday, July 31, 2012 4:51:19 pm Attilio Rao wrote:
>> On 7/31/12, John Baldwin  wrote:
>> > On Thursday, July 19, 2012 7:58:14 pm Sean Bruno wrote:
>> >> Working on the Dell R420 today, got most of it working, even the
>> >> broadcom ethernet cards!  However, I get the following when I reboot
>> >> the
>> >> system:
>> >>
>> >> Syncing disks, vnodes remaining...4 Sleeping thread (tid 100107, pid
>> >> 9)
>> >> owns a non-sleepable lock
>> >> KDB: stack backtrace of thread 100107:
>> >> sched_switch() at sched_switch+0x19f
>> >> mi_switch() at mi_switch+0x208
>> >> sleepq_switch() at sleepq_switch+0xfc
>> >> sleepq_wait() at sleepq_wait+0x4d
>> >> _sleep() at _sleep+0x3f6
>> >> ipmi_submit_driver_request() at ipmi_submit_driver_request+0x97
>> >> ipmi_set_watchdog() at ipmi_set_watchdog+0xb1
>> >> ipmi_wd_event() at ipmi_wd_event+0x8f
>> >> kern_do_pat() at kern_do_pat+0x10f
>> >> sched_sync() at sched_sync+0x1ea
>> >> fork_exit() at fork_exit+0x135
>> >> fork_trampoline() at fork_trampoline+0xe
>> >
>> > Hmmm, the watchdog pat should probably happen without holding locks if
>> > possible.  This is related to the IPMI watchdog being special and
>> > wanting
>> > to schedule a thread to work.
>>
>> The watchdog pat without the locks is not easy to do because we
>> register the watchdog callbacks in eventhandlers, which are indeed
>> locked (and you may also end up racing against watchdog detach, if you
>> don't use any lock at all).
>
> No, eventhandlers go through several hoops to not hold any locks while
> the eventhandler functions are running.  It seems in this case that a
> lock is held in a higher layer (sched_sync()) and that is what I was
> talking about.  Yes, it is the 'sync_mtx' that is held.  Something like this

No, EVENTHANDLER_INVOKE() acquires eventhandler internal locks.
Look at eventhandler_find_list() for details.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: [stable 9] panic on reboot: ipmi_wd_event()

2012-07-31 Thread Attilio Rao

On 7/31/12, John Baldwin  wrote:
> On Thursday, July 19, 2012 7:58:14 pm Sean Bruno wrote:
>> Working on the Dell R420 today, got most of it working, even the
>> broadcom ethernet cards!  However, I get the following when I reboot the
>> system:
>>
>> Syncing disks, vnodes remaining...4 Sleeping thread (tid 100107, pid 9)
>> owns a non-sleepable lock
>> KDB: stack backtrace of thread 100107:
>> sched_switch() at sched_switch+0x19f
>> mi_switch() at mi_switch+0x208
>> sleepq_switch() at sleepq_switch+0xfc
>> sleepq_wait() at sleepq_wait+0x4d
>> _sleep() at _sleep+0x3f6
>> ipmi_submit_driver_request() at ipmi_submit_driver_request+0x97
>> ipmi_set_watchdog() at ipmi_set_watchdog+0xb1
>> ipmi_wd_event() at ipmi_wd_event+0x8f
>> kern_do_pat() at kern_do_pat+0x10f
>> sched_sync() at sched_sync+0x1ea
>> fork_exit() at fork_exit+0x135
>> fork_trampoline() at fork_trampoline+0xe
>
> Hmmm, the watchdog pat should probably happen without holding locks if
> possible.  This is related to the IPMI watchdog being special and wanting
> to schedule a thread to work.

The watchdog pat without the locks is not easy to do because we
register the watchdog callbacks in eventhandlers, which are indeed
locked (and you may also end up racing against watchdog detach, if you
don't use any lock at all).

There is a similar issue when you enter DDB or coredump, for example,
but this is somewhat collateral due to the "after-panic" nature of the
situation. We should seriously look into the requirements for watchdog
patting and possibly for DDB entry, outline the correct semantics, and
refactor the code to follow them.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: IPMI hardware watchdogs Re: dell r420/r320 stable/9

2012-07-27 Thread Attilio Rao

On Fri, Jul 27, 2012 at 3:55 PM, Andrew Boyer  wrote:
>
> On Jul 27, 2012, at 10:42 AM, Attilio Rao wrote:
>
>> On Fri, Jul 27, 2012 at 3:33 PM, Andrew Boyer  
>> wrote:
>>>
>>> On Jul 26, 2012, at 8:50 PM, Sean Bruno wrote:
>>>
>>>> For the time being I had to revert the following from my stable/9 tree.
>>>> Otherwise I would get a kernel panic on shutdown from ipmi(4).
>>>>
>>>> http://svnweb.freebsd.org/base?view=revision&revision=237839
>>>> http://svnweb.freebsd.org/base?view=revision&revision=221121
>>>>
>>>
>>> On a somewhat related note: We noticed recently that you can't pet or 
>>> disable the IPMI hardware watchdog once SCHEDULER_STOPPED() is true.  This 
>>> means it can fire unexpectedly while you're dumping core or rebooting, 
>>> depending on how long the timeout was on the pet before the panic.  The 
>>> ipmi driver will need to process the command differently if the scheduler 
>>> is stopped.  I haven't had time to look at a fix yet.
>>
>> I recall I fixed that internally for SV, but the key here is that we
>> need to find an unified (or a default policy).
>> More specifically, do we want the watchdog also covers the kernel dump
>> part (because of possible deadlocks when dumping). If the answer is
>> yes, we likely need pat the watchdog from within the dumping cycle
>> itself. If the answer is no, then we can just disable it when entering
>> the panic path. But anyway, we need to identify a default policy that
>> makes sense first.
>>
>> Attilio
>>
>
> For our use case, we need the system to reset if the dump hangs.

This means we will likely have to control the watchdog patting by hand
in the panic path; more specifically, I guess this reduces to patting
the watchdog from within the dumping cycle (there could be other
expensive points to consider, but nothing pops into my head right
now). Maybe Ryan can tell us whether SV can contribute the code for
that specific part back.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: IPMI hardware watchdogs Re: dell r420/r320 stable/9

2012-07-27 Thread Attilio Rao

On Fri, Jul 27, 2012 at 3:33 PM, Andrew Boyer  wrote:
>
> On Jul 26, 2012, at 8:50 PM, Sean Bruno wrote:
>
>> For the time being I had to revert the following from my stable/9 tree.
>> Otherwise I would get a kernel panic on shutdown from ipmi(4).
>>
>> http://svnweb.freebsd.org/base?view=revision&revision=237839
>> http://svnweb.freebsd.org/base?view=revision&revision=221121
>>
>
>
> On a somewhat related note: We noticed recently that you can't pet or disable 
> the IPMI hardware watchdog once SCHEDULER_STOPPED() is true.  This means it 
> can fire unexpectedly while you're dumping core or rebooting, depending on 
> how long the timeout was on the pet before the panic.  The ipmi driver will 
> need to process the command differently if the scheduler is stopped.  I 
> haven't had time to look at a fix yet.

I recall I fixed that internally for SV, but the key here is that we
need to find a unified (or default) policy.
More specifically, do we want the watchdog to also cover the kernel
dump part (because of possible deadlocks when dumping)? If the answer
is yes, we likely need to pat the watchdog from within the dumping
cycle itself. If the answer is no, then we can just disable it when
entering the panic path. But anyway, we need to identify a default
policy that makes sense first.
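
As a rough illustration of the policies, here is a small Python
simulation (a sketch, not ipmi(4) or kernel code). Watchdog, dump_core,
the tick-based timeout and the policy names are all hypothetical. The
point is only to show that patting from inside the dump loop keeps the
watchdog useful against a hung dump, while disabling it on panic entry
does not, and that doing neither lets it fire during a healthy dump.

```python
class Watchdog:
    """Toy countdown watchdog driven by explicit ticks (hypothetical)."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.remaining = timeout
        self.armed = True
        self.fired = False

    def pat(self):
        self.remaining = self.timeout

    def disable(self):
        self.armed = False

    def tick(self):
        if self.armed and not self.fired:
            self.remaining -= 1
            if self.remaining <= 0:
                self.fired = True   # a real watchdog would reset the box


def dump_core(wd, blocks, policy, hang_at=None, max_ticks=1000):
    """Simulate a dump; return 'done', 'reset' or 'hung'."""
    if policy == "disable":
        wd.disable()
    ticks = 0
    written = 0
    while written < blocks:
        if ticks >= max_ticks:
            return "hung"            # nothing will ever rescue us
        ticks += 1
        wd.tick()
        if wd.fired:
            return "reset"
        if hang_at is not None and written == hang_at:
            continue                 # stuck: no progress, no pat
        written += 1
        if policy == "pat":
            wd.pat()                 # pat from within the dump loop
    return "done"


healthy = dump_core(Watchdog(timeout=5), blocks=50, policy="pat")
hung_pat = dump_core(Watchdog(timeout=5), blocks=50, policy="pat", hang_at=10)
hung_off = dump_core(Watchdog(timeout=5), blocks=50, policy="disable", hang_at=10)
unpatted = dump_core(Watchdog(timeout=5), blocks=50, policy="none")
```

With these made-up numbers, the patted dump completes, the hung dump
under the "pat" policy ends in a reset, the hung dump under "disable"
stays hung, and the dump that is neither patted nor disabled is reset
even though it was making progress.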

Attilio

-- 
Peace can only be achieved by understanding - A. Einstein

Re: stable/9 sandybridge reboot panic

2012-05-25 Thread Attilio Rao

2012/5/25, Sean Bruno :
> Dell R620, getting pretty reliable panics here everytime I reboot.
>
> http://people.freebsd.org/~sbruno/sandybridge_reboot_panic.txt

I'm sure that if you drop hwpmc you will get rid of it.
However, it would be great if you could get something that Davide and
George can investigate and fix, in particular because it has already
been ported to STABLE_9.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: Complete hang on 9.0-RELEASE

2012-03-05 Thread Attilio Rao

2012/3/5, Arnaud Lacombe :
> Hi,
>
> On Wed, Feb 29, 2012 at 2:31 PM, Arnaud Lacombe  wrote:
>> Hi,
>>
>> On Wed, Feb 29, 2012 at 2:22 PM, Attilio Rao  wrote:
>>> 2012/2/29, Arnaud Lacombe :
>>>> Hi,
>>>>
>>>> On Wed, Feb 29, 2012 at 1:44 PM, Attilio Rao 
>>>> wrote:
>>>>> 2012/2/29, Arnaud Lacombe :
>>>>>> Hi,
>>>>>>
>>>>>> On Wed, Feb 29, 2012 at 12:59 PM, Arnaud Lacombe 
>>>>>> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Mon, Feb 27, 2012 at 12:48 PM, Arnaud Lacombe 
>>>>>>> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On Mon, Feb 27, 2012 at 10:36 AM, Attilio Rao 
>>>>>>>> wrote:
>>>>>>>>> 2012/2/27, Arnaud Lacombe :
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 14, 2012 at 11:41 AM, Arnaud Lacombe
>>>>>>>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>>> Hi folks,
>>>>>>>>>>>
>>>>>>>>>>> For the records, I was running some tests yesterday on top of a
>>>>>>>>>>> 9.0-RELEASE, amd64, kernel when the box hanged. At the time of
>>>>>>>>>>> the
>>>>>>>>>>> hang, the box was running a process with about 2800 threads with
>>>>>>>>>>> heavy
>>>>>>>>>>> IPC between 1400 writers and 1400 readers. The box was in single
>>>>>>>>>>> user
>>>>>>>>>>> mode (/bin/sh coming from FreeBSD 7.4-STABLE). Here is the
>>>>>>>>>>> beginning
>>>>>>>>>>> of the dmesg:
>>>>>>>>>>>
>>>>>>>>>> This happened a second time, now with FreeBSD 8.2-RELEASE.
>>>>>>>>>> Complete
>>>>>>>>>> machine hang. The machine was running about 4000 threads in a
>>>>>>>>>> single
>>>>>>>>>> process, all the other condition are the same.
>>>>>>>>>
>>>>>>>>> Arnaud,
>>>>>>>>> can you please break in your kernel via KDB, collect the following
>>>>>>>>> informations from the DDB prompt:
>>>>>>>>> - ps
>>>>>>>>> - alltrace
>>>>>>>>> - show allpcpu
>>>>>>>>> - possibly get a coredump with 'call doadump'
>>>>>>>>>
>>>>>>>> Will do, but I'll need to rebuild a kernel to include DDB.
>>>>>>>>
>>>>>>>>> and in the end provide all those along with kernel binary and
>>>>>>>>> possibly
>>>>>>>>> sources somewhere?
>>>>>>>>>
>>>>>>>> I'll be testing a bare `release/8.2.0' with the following patch:
>>>>>>>>
>>>>>>>> diff --git a/sys/amd64/conf/GENERIC b/sys/amd64/conf/GENERIC
>>>>>>>> index c3e0095..7bd997f 100644
>>>>>>>> --- a/sys/amd64/conf/GENERIC
>>>>>>>> +++ b/sys/amd64/conf/GENERIC
>>>>>>>> @@ -79,6 +79,10 @@ options  INCLUDE_CONFIG_FILE # Include this file in kernel
>>>>>>>>
>>>>>>>>  options        KDB                   # Kernel debugger related code
>>>>>>>>  options        KDB_TRACE             # Print a stack trace for a panic
>>>>>>>> +options        DDB
>>>>>>>> +options        BREAK_TO_DEBUGGER
>>>>>>>> +options        ALT_BREAK_TO_DEBUGGER
>>>>>>>>
>>>>>>>>  # Make an SMP-capable kernel by default
>>>>>>>>  options        SMP                   # Symmetric MultiProcessor Kernel
>>>>>>>>
>>>>>>> ok, it happened again after 2 days, the process was running about
>>>>>>> 3200
>>>>>>> threads. I'm trying to break into DDB and let you know, I'm not that
>>>>>>> successful for now...
>>>>>>>
>>>>>> No luck. None of BREAK or ALT_BREAK are responding. I will not touch
>>>>>> the system in the next few hours if you want me to test something on
>>>>>> it. In the event of 8.2-RELEASE or 9.0-RELEASE are  not meant to work
>>>>>> reliably on top of a 7.4-RELEASE userland, I will re-setup the test to
>>>>>> occurs on a clean 9.0-RELEASE system and re-try.
>>>>>
>>>>> We allow to break KBI when new releases happens, thus this may cause a
>>>>> breakage for you, even if a deadlock is really not something you want.
>>>>>
>>>>> Can you try enabling SW_WATCHDOG, DEADLKRES and possibly arm your
>>>>> ichwd?
>>>>> if the breakage involves clocks or interrupt sources there are still
>>>>> chances they will be able to catch it though.
>>>>>
>>>>> However, it doesn't seem you are setup with a proper serial console?
>>>> The serial console is working definitively fine. I can break into DDB
>>>> at will when the test is running. I did not test with ALT_BREAK
>>>> per-se, but BREAK does work.
>>>
>>> So if you try to break in DDB via serial break it doesn't work?
>>> That is definitively very bad...
>>>
>> just to be sure, I rebooted the system and I could break into DDB at
>> the first attempt with ALT_BREAK, BREAK was a bit more reluctant but
>> worked too. So yes, this does not taste good :/
>>
>>> Can you try with the options I mentioned earlier and see if something
>>> changes?
>>>
>> will do, but I will first attempt to reproduce this on 9.0-RELEASE.
>>
> 9.0-RELEASE (kernel + userland) hanged today while running 2000
> threads. Next step is to reproduce it with a watchdog+textdump enabled
> kernel.

And you were still unable to break into DDB, right?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: Complete hang on 9.0-RELEASE

2012-02-29 Thread Attilio Rao

2012/2/29, Arnaud Lacombe :
> Hi,
>
> On Wed, Feb 29, 2012 at 1:44 PM, Attilio Rao  wrote:
>> 2012/2/29, Arnaud Lacombe :
>>> Hi,
>>>
>>> On Wed, Feb 29, 2012 at 12:59 PM, Arnaud Lacombe 
>>> wrote:
>>>> Hi,
>>>>
>>>> On Mon, Feb 27, 2012 at 12:48 PM, Arnaud Lacombe 
>>>> wrote:
>>>>> Hi,
>>>>>
>>>>> On Mon, Feb 27, 2012 at 10:36 AM, Attilio Rao 
>>>>> wrote:
>>>>>> 2012/2/27, Arnaud Lacombe :
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Tue, Feb 14, 2012 at 11:41 AM, Arnaud Lacombe 
>>>>>>> wrote:
>>>>>>>> Hi folks,
>>>>>>>>
>>>>>>>> For the records, I was running some tests yesterday on top of a
>>>>>>>> 9.0-RELEASE, amd64, kernel when the box hanged. At the time of the
>>>>>>>> hang, the box was running a process with about 2800 threads with
>>>>>>>> heavy
>>>>>>>> IPC between 1400 writers and 1400 readers. The box was in single
>>>>>>>> user
>>>>>>>> mode (/bin/sh coming from FreeBSD 7.4-STABLE). Here is the beginning
>>>>>>>> of the dmesg:
>>>>>>>>
>>>>>>> This happened a second time, now with FreeBSD 8.2-RELEASE. Complete
>>>>>>> machine hang. The machine was running about 4000 threads in a single
>>>>>>> process, all the other condition are the same.
>>>>>>
>>>>>> Arnaud,
>>>>>> can you please break in your kernel via KDB, collect the following
>>>>>> informations from the DDB prompt:
>>>>>> - ps
>>>>>> - alltrace
>>>>>> - show allpcpu
>>>>>> - possibly get a coredump with 'call doadump'
>>>>>>
>>>>> Will do, but I'll need to rebuild a kernel to include DDB.
>>>>>
>>>>>> and in the end provide all those along with kernel binary and possibly
>>>>>> sources somewhere?
>>>>>>
>>>>> I'll be testing a bare `release/8.2.0' with the following patch:
>>>>>
>>>>> diff --git a/sys/amd64/conf/GENERIC b/sys/amd64/conf/GENERIC
>>>>> index c3e0095..7bd997f 100644
>>>>> --- a/sys/amd64/conf/GENERIC
>>>>> +++ b/sys/amd64/conf/GENERIC
>>>>> @@ -79,6 +79,10 @@ options  INCLUDE_CONFIG_FILE # Include this file in kernel
>>>>>
>>>>>  options        KDB                   # Kernel debugger related code
>>>>>  options        KDB_TRACE             # Print a stack trace for a panic
>>>>> +options        DDB
>>>>> +options        BREAK_TO_DEBUGGER
>>>>> +options        ALT_BREAK_TO_DEBUGGER
>>>>>
>>>>>  # Make an SMP-capable kernel by default
>>>>>  options        SMP                   # Symmetric MultiProcessor Kernel
>>>>>
>>>> ok, it happened again after 2 days, the process was running about 3200
>>>> threads. I'm trying to break into DDB and let you know, I'm not that
>>>> successful for now...
>>>>
>>> No luck. None of BREAK or ALT_BREAK are responding. I will not touch
>>> the system in the next few hours if you want me to test something on
>>> it. In the event of 8.2-RELEASE or 9.0-RELEASE are  not meant to work
>>> reliably on top of a 7.4-RELEASE userland, I will re-setup the test to
>>> occurs on a clean 9.0-RELEASE system and re-try.
>>
>> We allow to break KBI when new releases happens, thus this may cause a
>> breakage for you, even if a deadlock is really not something you want.
>>
>> Can you try enabling SW_WATCHDOG, DEADLKRES and possibly arm your ichwd?
>> if the breakage involves clocks or interrupt sources there are still
>> chances they will be able to catch it though.
>>
>> However, it doesn't seem you are setup with a proper serial console?
> The serial console is working definitively fine. I can break into DDB
> at will when the test is running. I did not test with ALT_BREAK
> per-se, but BREAK does work.

So if you try to break into DDB via a serial break, it doesn't work?
That is definitely very bad...

Can you try with the options I mentioned earlier and see if something changes?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: Complete hang on 9.0-RELEASE

2012-02-29 Thread Attilio Rao

2012/2/29, Arnaud Lacombe :
> Hi,
>
> On Wed, Feb 29, 2012 at 12:59 PM, Arnaud Lacombe  wrote:
>> Hi,
>>
>> On Mon, Feb 27, 2012 at 12:48 PM, Arnaud Lacombe 
>> wrote:
>>> Hi,
>>>
>>> On Mon, Feb 27, 2012 at 10:36 AM, Attilio Rao 
>>> wrote:
>>>> 2012/2/27, Arnaud Lacombe :
>>>>> Hi,
>>>>>
>>>>> On Tue, Feb 14, 2012 at 11:41 AM, Arnaud Lacombe 
>>>>> wrote:
>>>>>> Hi folks,
>>>>>>
>>>>>> For the records, I was running some tests yesterday on top of a
>>>>>> 9.0-RELEASE, amd64, kernel when the box hanged. At the time of the
>>>>>> hang, the box was running a process with about 2800 threads with heavy
>>>>>> IPC between 1400 writers and 1400 readers. The box was in single user
>>>>>> mode (/bin/sh coming from FreeBSD 7.4-STABLE). Here is the beginning
>>>>>> of the dmesg:
>>>>>>
>>>>> This happened a second time, now with FreeBSD 8.2-RELEASE. Complete
>>>>> machine hang. The machine was running about 4000 threads in a single
>>>>> process, all the other condition are the same.
>>>>
>>>> Arnaud,
>>>> can you please break in your kernel via KDB, collect the following
>>>> informations from the DDB prompt:
>>>> - ps
>>>> - alltrace
>>>> - show allpcpu
>>>> - possibly get a coredump with 'call doadump'
>>>>
>>> Will do, but I'll need to rebuild a kernel to include DDB.
>>>
>>>> and in the end provide all those along with kernel binary and possibly
>>>> sources somewhere?
>>>>
>>> I'll be testing a bare `release/8.2.0' with the following patch:
>>>
>>> diff --git a/sys/amd64/conf/GENERIC b/sys/amd64/conf/GENERIC
>>> index c3e0095..7bd997f 100644
>>> --- a/sys/amd64/conf/GENERIC
>>> +++ b/sys/amd64/conf/GENERIC
>>> @@ -79,6 +79,10 @@ options  INCLUDE_CONFIG_FILE # Include this file in kernel
>>>
>>>  options        KDB                   # Kernel debugger related code
>>>  options        KDB_TRACE             # Print a stack trace for a panic
>>> +options        DDB
>>> +options        BREAK_TO_DEBUGGER
>>> +options        ALT_BREAK_TO_DEBUGGER
>>>
>>>  # Make an SMP-capable kernel by default
>>>  options        SMP                   # Symmetric MultiProcessor Kernel
>>>
>> ok, it happened again after 2 days, the process was running about 3200
>> threads. I'm trying to break into DDB and let you know, I'm not that
>> successful for now...
>>
> No luck. None of BREAK or ALT_BREAK are responding. I will not touch
> the system in the next few hours if you want me to test something on
> it. In the event of 8.2-RELEASE or 9.0-RELEASE are  not meant to work
> reliably on top of a 7.4-RELEASE userland, I will re-setup the test to
> occurs on a clean 9.0-RELEASE system and re-try.

We allow KBI breakage when new releases happen, so this may cause
breakage for you, even if a deadlock is really not something you want.

Can you try enabling SW_WATCHDOG and DEADLKRES, and possibly arming
your ichwd? If the breakage involves clocks or interrupt sources,
there is still a chance they will be able to catch it, though.
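
For readers unfamiliar with the idea behind a deadlock resolver, here
is a toy Python sketch of the general technique: periodically scan the
threads and flag any that has been blocked or sleeping longer than a
threshold. This is not the DEADLKRES implementation; the ThreadInfo
record, the find_stuck helper, the state names and the threshold
values are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ThreadInfo:
    tid: int
    state: str          # "running", "blocked" or "sleeping"
    since: float        # time at which the thread entered this state


def find_stuck(threads, now, block_limit, sleep_limit):
    """Return tids blocked or sleeping longer than their limit."""
    limits = {"blocked": block_limit, "sleeping": sleep_limit}
    stuck = []
    for t in threads:
        limit = limits.get(t.state)
        # Running threads have no limit and are never flagged.
        if limit is not None and now - t.since > limit:
            stuck.append(t.tid)
    return stuck


threads = [
    ThreadInfo(tid=100107, state="blocked", since=0.0),
    ThreadInfo(tid=100108, state="sleeping", since=0.0),
    ThreadInfo(tid=100109, state="running", since=0.0),
    ThreadInfo(tid=100110, state="blocked", since=950.0),
]
stuck = find_stuck(threads, now=1000.0, block_limit=900.0, sleep_limit=1800.0)
```

A scan like this only helps if something can still run it, which is why
a hang that takes out clocks or interrupts may defeat it.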

However, it seems you are not set up with a proper serial console?
If this is the case, you need to go with a textdump in order to
collect the DDB output.
Or, if you have one, you might try sending a serial break, and the
kernel should break into DDB.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: Complete hang on 9.0-RELEASE

2012-02-27 Thread Attilio Rao

2012/2/27, Arnaud Lacombe :
> Hi,
>
> On Tue, Feb 14, 2012 at 11:41 AM, Arnaud Lacombe  wrote:
>> Hi folks,
>>
>> For the records, I was running some tests yesterday on top of a
>> 9.0-RELEASE, amd64, kernel when the box hanged. At the time of the
>> hang, the box was running a process with about 2800 threads with heavy
>> IPC between 1400 writers and 1400 readers. The box was in single user
>> mode (/bin/sh coming from FreeBSD 7.4-STABLE). Here is the beginning
>> of the dmesg:
>>
> This happened a second time, now with FreeBSD 8.2-RELEASE. Complete
> machine hang. The machine was running about 4000 threads in a single
> process, all the other condition are the same.

Arnaud,
can you please break into your kernel via KDB and collect the following
information from the DDB prompt:
- ps
- alltrace
- show allpcpu
- possibly get a coredump with 'call doadump'

and in the end provide all of that, along with the kernel binary and
possibly sources, somewhere?

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: lock violation in unionfs (9.0-STABLE r230270)

2012-02-15 Thread Attilio Rao

2012/2/13, Pavel Polyakov :
> http://www.freebsd.org/cgi/query-pr.cgi?pr=165087
>
> Occurs simply trying to use unionfs:
> mount -t unionfs -o noatime /usr /mnt
>
> insmntque: mp-safe fs and non-locked vp: 0xfe01d96704f0 is not
> exclusive locked but should be
> KDB: enter: lock violation

Pavel,
can you give this patch a spin?
http://www.freebsd.org/~attilio/unionfs_missing_insmntque_lock.patch

I think that the unlocking is due at that point as the vnode lock can
be switched later on.

Let me know what you think about it and how the test goes.
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: Custom kernel poll summary (was: Re: Reducing the need to compile a custom kernel)

2012-02-14 Thread Attilio Rao

2012/2/14, Alexander Leidinger :
> Quoting Alexander Leidinger  (from Fri, 10
> Feb 2012 14:56:04 +0100):
>
>> Such a kernel would cover situations where people compile their own
>> kernel because they want to get rid of some unused kernel code (and
>> maybe even need the memory this frees up).
>>
>> The question is, is this enough? Or asked differently, why are you
>> compiling a custom kernel in a production environment (so I rule out
>> debug options zhich are not enabled in GENERIC)? Are there options
>> which you add which you can not add as a module (SW_WATCHDOG comes
>> to my mind)? If yes, which ones and how important are they for you?
>
> Here is what I got, the first column is the number of requests, the
> second what is requested, and the 3rd my comments (basically it means,
> if there is a comment, it is not needed/possible to include in a
> modular kernel):

...
> 2 SW_WATCHDOG

This can become a module with very little effort I guess.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: SCHED_ULE should not be the default

2011-12-16 Thread Attilio Rao

2011/12/15 Steve Kargl :
> On Thu, Dec 15, 2011 at 05:25:51PM +0100, Attilio Rao wrote:
>>
>> I basically went through all the e-mail you just sent and identified 4
>> real report on which we could work on and summarizied in the attached
>> Excel file.
>> I'd like that George, Steve, Doug, Andrey and Mike possibly review the
>> few datas there and add more, if they want, or make more important
>> clarifications in particular about the Xorg presence (or rather not)
>> in their workload.
>
> Your summary of my observations appears correct.
>
> I have grabbed an up-to-date /usr/src, built and
> installed world, and built and installed a new
> kernel on one of the nodes in my cluster.  It
> has
>
> CPU: Dual Core AMD Opteron(tm) Processor 280 (2392.65-MHz K8-class CPU)
>  Origin = "AuthenticAMD"  Id = 0x20f12  Family = f  Model = 21  Stepping = 2
>  Features=0x178bfbff  MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
>  Features2=0x1
>  AMD Features=0xe2500800
>  AMD Features2=0x3
> real memory  = 17179869184 (16384 MB)
> avail memory = 16269832192 (15516 MB)
> FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
> FreeBSD/SMP: 2 package(s) x 2 core(s)
>
> I can perform new tests with both ULE and 4BSD, but you'll
> need to be precise in the information you want collected
> (and how to collect the data) due to the rather limited
> amount of time I currently have.

It seems a perfect environment; just please make sure you build a
debug-free userland (basically, setting MALLOC_PRODUCTION in jemalloc).

The first thing is, can you try reproducing your case? As far as I
understood, for you it was enough to run N + small_amount CPU-bound
threads to show the performance penalty, so I'd ask you to start by
using dnetc or just your preferred CPU-bound workload and verify that
you can reproduce the issue.
While it is happening, please monitor thread bouncing and CPU
utilization via 'top' (you don't need to be 100% precise, just get an
idea, and keep an eye on things like excessive thread migration,
obsessive thread binding, and low throughput on the CPUs).
One note: if your workload needs to do I/O, please use tmpfs or
memory-backed storage for it, in order to reduce I/O effects as much
as possible.
Also, verify this doesn't happen with the 4BSD scheduler, just in case.

Finally, if the problem is still in place, please recompile your
kernel by adding:
options KTR
options KTR_ENTRIES=262144
options KTR_COMPILE=(KTR_SCHED)
options KTR_MASK=(KTR_SCHED)

And reproduce the issue.
When you are in the middle of the scheduling issue go with:
# ktrdump -ctf > ktr-ule-problem-YOURNAME.out

and send it to the mailing list along with your dmesg and the
information on CPU utilization you gathered with top(1).

That should cover it all, but if you have further questions, please
just go ahead.

Thanks,
Attilio

-- 
Peace can only be achieved by understanding - A. Einstein

Re: Benchmark (Phoronix): FreeBSD 9.0-RC2 vs. Oracle Linux 6.1 Server

2011-12-16 Thread Attilio Rao

2011/12/16 Arnaud Lacombe :
> Hi,
>
> On Thu, Dec 15, 2011 at 2:32 AM, O. Hartmann
>  wrote:
>> Just saw this shot benchmark on Phoronix dot com today:
>>
>> http://www.phoronix.com/scan.php?page=news_item&px=MTAyNzA
>>
> it might be worth highlighting that despite Oracle Linux 6.1 Server is
> using a kernel + compiler almost 2 years old, it still manages to
> out-perform the bleeding edge FreeBSD :-)
>
> Now, from what I've read so far in this thread, it seems that a lot of
> people are still in abnegation...
>
> my 0.2c,
>  - Arnaud

That is really hilarious, however, coming from someone who really
thinks that passing __FILE__ and __LINE__ to a kernel function is going
to give a measurable performance penalty :)

It is crystal clear that you really don't understand how to make
reliable benchmarks (and likely you don't really have a grasp of the
contention points of today's machines), so why do you keep talking
about it? It would be more valuable for you, and for whatever project
you follow, if you spent your time coding and doing real benchmarking.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: SCHED_ULE should not be the default

2011-12-15 Thread Attilio Rao

2011/12/15 Mike Tancsa :
> On 12/15/2011 11:56 AM, Attilio Rao wrote:
>> So, as very first thing, can you try the following:
>> - Same codebase, etc. etc.
>> - Make the test 4 times, discard the first and ministat for the other 3
>> - Reboot
>> - Change the steal_thresh value
>> - Make the test 4 times, discard the first and ministat for the other 3
>>
>> Then report discarded values and the ministated one and we will have
>> more informations I guess
>> (also, I don't think devfs contention should play a role here, thus
>> nevermind about it for now).
>
>
> Results and data at
>
> http://www.tancsa.com/ule-bsd.html

I'm not totally sure: what does burnP6 do? Is it a CPU-bound workload?
Also, how many threads are spawned in your case for the parallel bzip2?

Also, it would be very good if you could arrange these tests against
a newer -CURRENT (with userland and kernel debugging off).
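
For the procedure quoted above (run the test four times, discard the
first run, summarize the other three), a minimal Python sketch of the
bookkeeping could look like this. It is not a replacement for
ministat(1); it only computes the mean and sample standard deviation of
the kept runs. The summarize helper is hypothetical and the timings are
invented placeholders, not measurements from this thread.

```python
import statistics


def summarize(runs, discard=1):
    """Drop the first `discard` warm-up runs and summarize the rest."""
    kept = runs[discard:]
    if len(kept) < 2:
        raise ValueError("need at least two kept runs for a standard deviation")
    mean = statistics.mean(kept)
    stdev = statistics.stdev(kept)          # sample standard deviation
    return {"n": len(kept), "mean": mean, "stdev": stdev,
            "rel_stdev": stdev / mean}


# Placeholder wall-clock times in seconds for four runs (made up).
before = summarize([130.0, 120.0, 122.0, 118.0])
after = summarize([128.0, 110.0, 111.0, 112.0])
```

If the relative standard deviation is large compared to the difference
between the two means, more runs are needed before drawing any
conclusion about the steal_thresh change.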

Thanks a lot for your hard work,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: SCHED_ULE should not be the default

2011-12-15 Thread Attilio Rao

2011/12/15 Jeremy Chadwick :
> On Thu, Dec 15, 2011 at 05:26:27PM +0100, Attilio Rao wrote:
>> 2011/12/13 Jeremy Chadwick :
>> > On Mon, Dec 12, 2011 at 02:47:57PM +0100, O. Hartmann wrote:
>> >> > Not fully right, boinc defaults to run on idprio 31 so this isn't an
>> >> > issue. And yes, there are cases where SCHED_ULE shows much better
>> >> > performance than SCHED_4BSD. [...]
>> >>
>> >> Do we have any proof at hand for such cases where SCHED_ULE performs
>> >> much better than SCHED_4BSD? Whenever the subject comes up, it is
>> >> mentioned, that SCHED_ULE has better performance on boxes with a ncpu >
>> >> 2. But in the end I see here contradictionary statements. People
>> >> complain about poor performance (especially in scientific environments),
>> >> and other give contra not being the case.
>> >>
>> >> Within our department, we developed a highly scalable code for planetary
>> >> science purposes on imagery. It utilizes present GPUs via OpenCL if
>> >> present. Otherwise it grabs as many cores as it can.
>> >> By the end of this year I'll get a new desktop box based on Intels new
>> >> Sandy Bridge-E architecture with plenty of memory. If the colleague who
>> >> developed the code is willing performing some benchmarks on the same
>> >> hardware platform, we'll benchmark bot FreeBSD 9.0/10.0 and the most
>> >> recent Suse. For FreeBSD I intent also to look for performance with both
>> >> different schedulers available.
>> >
>> > This is in no way shape or form the same kind of benchmark as what
>> > you're planning to do, but I thought I'd throw it out there for folks to
>> > take in as they see fit.
>> >
>> > I know folks were focused mainly on buildworld.
>> >
>> > I personally would find it interesting if someone with a higher-end
>> > system (e.g. 2 physical CPUs, with 6 or 8 cores per CPU) was to do the
>> > same test (changing -jX to -j{numofcores} of course).
>> >
>> > --
>> > | Jeremy Chadwick                                jdc at parodius.com |
>> > | Parodius Networking                       http://www.parodius.com/ |
>> > | UNIX Systems Administrator                   Mountain View, CA, US |
>> > | Making life hard for others since 1977.               PGP 4BD6C0CB |
>> >
>> >
>> > sched_ule
>> > ===
>> > - time make -j2 buildworld
>> >  1689.831u 229.328s 18:46.20 170.4% 6566+2051k 432+4264io 4565pf+0w
>> > - time make -j2 buildkernel
>> >  640.542u 87.737s 9:01.38 134.5% 6490+1920k 134+5968io 0pf+0w
>> >
>> >
>> > sched_4bsd
>> >
>> > - time make -j2 buildworld
>> >  1662.793u 206.908s 17:12.02 181.1% 6578+2054k 23750+4271io 6451pf+0w
>> > - time make -j2 buildkernel
>> >  638.717u 76.146s 8:34.90 138.8% 6530+1927k 6415+5903io 0pf+0w
>> >
>> >
>> > software
>> > ==
>> > * sched_ule test:  FreeBSD 8.2-STABLE, Thu Dec  1 04:37:29 PST 2011
>> > * sched_4bsd test: FreeBSD 8.2-STABLE, Mon Dec 12 22:42:54 PST 2011
>>
>> Hi Jeremy,
>> thanks for the time you spent on this.
>>
>> However, I wanted to ask about/point out 3 things:
>> 1) Did you use 2 different code bases for the test? (one updated on
>> December 1 and another one on December 12)
>
> No; src-all (/usr/src on this system) was not updated between December
> 1st and December 12th PST.  I do believe I updated it today (15th PST).
> I can/will obviously hold off so that we have a consistent code base for
> comparing numbers between schedulers during buildworld and/or
> buildkernel.
>
>> 2) Please note that you should have repeated this test several times
>> (basically until you get a standard deviation which is
>> acceptable with ministat) and reported the ministat output
>
> This is the first time I have heard of ministat(1).  I'm pretty sure I
> see what it's for and how it applies to this situation, but boy that man
> page could use some clarification (I have 3 people looking at this thing
> right now trying to figure out what means what in the graph :-) ).
> Anyway, graph or not, I see the point.
>
> Regarding multiple tests: yup, you're absolutely right, the only way to
> do it would be to run a sequence of tests repeatedly (probably 10

Re: SCHED_ULE should not be the default

2011-12-15 Thread Attilio Rao

2011/12/15 Mike Tancsa :
> On 12/15/2011 11:42 AM, Attilio Rao wrote:
>>
>> I'm now thinking of a better test case for this: can you try that on a
>> tmpfs volume?
>
> There is enough RAM in the box so that it should not touch the disk, and
> I was sending the output to /dev/null, so it was not writing to the disk.
>
>>
>> Also, what filesystem were you using?
>
> UFS
>
>> How many CPUs were in place?
>
> 4
>
>> Did you reboot before changing the steal_thresh value?
>
> No.

So, as a very first step, can you try the following:
- Same codebase, etc. etc.
- Run the test 4 times, discard the first run and feed the other 3 to ministat
- Reboot
- Change the steal_thresh value
- Run the test 4 times, discard the first run and feed the other 3 to ministat

Then report the discarded values and the ministat output and we will have
more information, I guess
(also, I don't think devfs contention should play a role here, so
never mind about it for now).
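
The procedure above could be sketched roughly as follows, assuming the
per-run wall-clock times have already been collected by hand. The
function name and the sample numbers are purely illustrative and do not
come from this thread; ministat(1) itself would still be the tool used
to compare the two sets.

```python
import statistics


def summarize_runs(times, discard=1):
    """Drop the first `discard` warm-up runs and summarize the rest."""
    kept = list(times[discard:])
    if len(kept) < 2:
        raise ValueError("need at least two runs after discarding warm-ups")
    return {
        "discarded": list(times[:discard]),  # reported, but not averaged
        "n": len(kept),
        "mean": statistics.mean(kept),
        "stddev": statistics.stdev(kept),    # sample standard deviation
    }


# Hypothetical timings in seconds (NOT measurements from this thread):
# four runs, the first one treated as cache warm-up.
summary = summarize_runs([41.0, 38.0, 38.5, 39.0])
print(summary)
```

Keeping the discarded value in the result matches the request to report
both the discarded runs and the summarized ones.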

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: SCHED_ULE should not be the default

2011-12-15 Thread Attilio Rao

2011/12/15 Mike Tancsa :
> On 12/15/2011 11:26 AM, Attilio Rao wrote:
>>
>> Hi Mike,
>> was that just the same codebase with the switch SCHED_4BSD/SCHED_ULE?
>
> Hi Attilio,
>        It was the same codebase.
>
>
>> Could you retry the bench checking CPU usage and possible thread
>> migration around for both cases?
>
> I can, but how do I do that ?

I'm now thinking of a better test case for this: can you try that on a
tmpfs volume?

Also, what filesystem were you using? How many CPUs were in place?
Did you reboot before changing the steal_thresh value?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: SCHED_ULE should not be the default

2011-12-15 Thread Attilio Rao

2011/12/13 Jeremy Chadwick :
> On Mon, Dec 12, 2011 at 02:47:57PM +0100, O. Hartmann wrote:
>> > Not fully right, boinc defaults to run on idprio 31 so this isn't an
>> > issue. And yes, there are cases where SCHED_ULE shows much better
>> > performance than SCHED_4BSD.  [...]
>>
>> Do we have any proof at hand for such cases where SCHED_ULE performs
>> much better than SCHED_4BSD? Whenever the subject comes up, it is
>> mentioned that SCHED_ULE has better performance on boxes with ncpu >
>> 2. But in the end I see contradictory statements here. People
>> complain about poor performance (especially in scientific environments),
>> and others counter that this is not the case.
>>
>> Within our department, we developed highly scalable code for planetary
>> science purposes on imagery. It utilizes GPUs via OpenCL if
>> present. Otherwise it grabs as many cores as it can.
>> By the end of this year I'll get a new desktop box based on Intel's new
>> Sandy Bridge-E architecture with plenty of memory. If the colleague who
>> developed the code is willing to perform some benchmarks on the same
>> hardware platform, we'll benchmark both FreeBSD 9.0/10.0 and the most
>> recent Suse. For FreeBSD I also intend to look at performance with both
>> available schedulers.
>
> This is in no way shape or form the same kind of benchmark as what
> you're planning to do, but I thought I'd throw it out there for folks to
> take in as they see fit.
>
> I know folks were focused mainly on buildworld.
>
> I personally would find it interesting if someone with a higher-end
> system (e.g. 2 physical CPUs, with 6 or 8 cores per CPU) was to do the
> same test (changing -jX to -j{numofcores} of course).
>
> --
> | Jeremy Chadwick                                jdc at parodius.com |
> | Parodius Networking                       http://www.parodius.com/ |
> | UNIX Systems Administrator                   Mountain View, CA, US |
> | Making life hard for others since 1977.               PGP 4BD6C0CB |
>
>
> sched_ule
> ===
> - time make -j2 buildworld
>  1689.831u 229.328s 18:46.20 170.4% 6566+2051k 432+4264io 4565pf+0w
> - time make -j2 buildkernel
>  640.542u 87.737s 9:01.38 134.5% 6490+1920k 134+5968io 0pf+0w
>
>
> sched_4bsd
> 
> - time make -j2 buildworld
>  1662.793u 206.908s 17:12.02 181.1% 6578+2054k 23750+4271io 6451pf+0w
> - time make -j2 buildkernel
>  638.717u 76.146s 8:34.90 138.8% 6530+1927k 6415+5903io 0pf+0w
>
>
> software
> ==
> * sched_ule test:  FreeBSD 8.2-STABLE, Thu Dec  1 04:37:29 PST 2011
> * sched_4bsd test: FreeBSD 8.2-STABLE, Mon Dec 12 22:42:54 PST 2011

Hi Jeremy,
thanks for the time you spent on this.

However, I wanted to ask about/point out 3 things:
1) Did you use 2 different code bases for the test? (one updated on
December 1 and another one on December 12)
2) Please note that you should have repeated this test several times
(basically until you get a standard deviation which is
acceptable with ministat) and reported the ministat output
3) The difference is less than 2%, which I suspect is not
statistically significant
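
As a rough check of point 3, the relative differences can be computed
from the buildworld lines quoted above. Which metric the 2% refers to is
not stated in the thread, so this sketch (helper names are made up for
illustration) simply computes both user CPU time and elapsed time:

```python
def to_seconds(clock):
    """Convert the elapsed field of csh time (m:ss.ss or h:mm:ss) to seconds."""
    seconds = 0.0
    for part in clock.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def pct_diff(a, b):
    """Relative difference of a over b, in percent."""
    return (a - b) / b * 100.0


# User seconds and elapsed time taken from the two buildworld lines above.
ule_user, ule_real = 1689.831, to_seconds("18:46.20")
bsd_user, bsd_real = 1662.793, to_seconds("17:12.02")

user_pct = pct_diff(ule_user, bsd_user)
real_pct = pct_diff(ule_real, bsd_real)
print(f"user: {user_pct:.2f}%  elapsed: {real_pct:.2f}%")
```

On these single runs the user-time gap comes out below 2%, while the
elapsed-time gap is larger; without repeated runs neither number says
much about significance.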

I'm not really even surprised ULE is not faster than 4BSD in this case
because buildworld/buildkernel tests are usually dominated
by I/O overhead rather than scheduler capacity. It would be
more interesting to analyze how buildworld does while another type of
workload is going on.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: SCHED_ULE should not be the default

2011-12-15 Thread Attilio Rao

2011/12/13 Daniel Kalchev :
>
>
> On 13.12.11 09:36, Jeremy Chadwick wrote:
>>
>> I personally would find it interesting if someone with a higher-end system
>> (e.g. 2 physical CPUs, with 6 or 8 cores per CPU) was to do the same test
>> (changing -jX to -j{numofcores} of course).
>
>
> Is 4 way 8 core Opteron ok? That is 32 cores, 64GB RAM.
>
> Testing with buildworld in my opinion is not adequate, as it involves way
> too much I/O. Any advice on proper testing methodology?

I'm sure that I/O and pmap subsystem contention (because of
buildworld) and TLB shootdown overhead (because of 32 CPUs) will be so
overwhelming that you are not really going to benchmark the scheduler
activity at all.

However, I still don't get what exactly you want to verify.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: SCHED_ULE should not be the default

2011-12-15 Thread Attilio Rao

2011/12/14 Mike Tancsa :
> On 12/13/2011 7:01 PM, m...@freebsd.org wrote:
>>
>> Has anyone experiencing problems tried to set sysctl 
>> kern.sched.steal_thresh=1 ?
>>
>> I don't remember what our specific problem at $WORK was, perhaps it
>> was just interrupt threads not getting serviced fast enough, but we've
>> hard-coded this to 1 and removed the code that sets it in
>> sched_initticks().  The same effect should be had by setting the
>> sysctl after a box is up.
>
> FWIW, this does impact the performance of pbzip2 on an i7. Using a 1.1G file
>
> pbzip2 -v -c big > /dev/null
>
> with burnP6 running in the background,
>
> sysctl kern.sched.steal_thresh=1
> vs
> sysctl kern.sched.steal_thresh=3
>
>
>
>    N           Min           Max        Median           Avg        Stddev
> x  10     38.005022      38.42238     38.194648     38.165052    0.15546188
> +   9     38.695417     40.595544     39.392127     39.435384    0.59814114
> Difference at 95.0% confidence
>        1.27033 +/- 0.412636
>        3.32852% +/- 1.08119%
>        (Student's t, pooled s = 0.425627)
>
> a value of 1 is *slightly* faster.

Hi Mike,
was that just the same codebase with the switch SCHED_4BSD/SCHED_ULE?

Also, the results here are in the 3% range for the average case,
which is not yet at the 'alarm level' but could still be an
indication.
I still suspect I/O plays a big role here, however, so the difference
could be determined by other factors.
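
For reference, the "Difference at 95.0% confidence" lines of the
ministat output quoted above can be reproduced from the two summary rows
alone. This is a minimal sketch of the pooled-variance Student's t
calculation; the critical value 2.110 (two-sided 95%, 17 degrees of
freedom) is taken from a standard t table and is an assumption of the
sketch, as is the function name.

```python
import math


def pooled_difference(n1, mean1, sd1, n2, mean2, sd2, t_crit):
    """Difference of means with a pooled-variance confidence half-width."""
    dof = n1 + n2 - 2
    pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / dof)
    diff = mean2 - mean1
    half_width = t_crit * pooled * math.sqrt(1.0 / n1 + 1.0 / n2)
    return diff, half_width, pooled


# N, Avg and Stddev from the "x" and "+" rows of the quoted output.
diff, half, pooled = pooled_difference(
    10, 38.165052, 0.15546188,
    9, 39.435384, 0.59814114,
    t_crit=2.110,  # assumed table value for 17 degrees of freedom
)
print(f"{diff:.5f} +/- {half:.6f} (pooled s = {pooled:.6f})")
print(f"{diff / 38.165052 * 100:.3f}% relative to the first data set")
```

The confidence interval does not include zero, which is why ministat
reports a difference at all; whether roughly 3% matters in practice is a
separate question.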

Could you rerun the benchmark, checking CPU usage and possible thread
migration in both cases?

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: SCHED_ULE should not be the default

2011-12-15 Thread Attilio Rao

2011/12/9 George Mitchell :
> dnetc is an open-source program from http://www.distributed.net/.  It
> tries a brute-force approach to cracking RC4 puzzles and also computes
> optimal Golomb rulers.  It starts up one process per CPU and runs at
> nice 20 and is, for all intents and purposes, 100% compute bound.

[Posting on the first message of the thread]

I basically went through all the e-mails you sent and identified 4
real reports that we could work on, which I summarized in the attached
Excel file.
I'd like George, Steve, Doug, Andrey and Mike to review the
data there and add more, if they want, or make any important
clarifications, in particular about whether Xorg is present (or not)
in their workload.

I've read a couple of messages in the thread pointing the finger at
Xorg for being excessively CPU-intensive and I think they are right; we
might try to find a solution for that at some point, but it is really
a very edge case.
George's and Steve's cases, instead, look very different from this and
I want to analyze them in detail.
George already provided schedgraph traces; as for the others, if they
cannot provide them directly, I'd really appreciate it if they would at
least describe the workload in detail so that I get a chance to
reproduce it.

If someone else thinks they have a specific problem that is not
covered by one of the cases above, please let me know and I will
put it in the chart.

Thanks for the hard work you guys put into pointing out ULE's problems. I
think we will get to the bottom of this if we keep sharing thoughts
and reports.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: SCHED_ULE should not be the default

2011-12-09 Thread Attilio Rao

2011/12/10 Eitan Adler :
> On Fri, Dec 9, 2011 at 8:15 PM, George Mitchell  wrote:
>> Hope the attached helps.                         -- George Mitchell
>
> You attached dmesg, not a patch.

This is what is needed for a schedgraph analysis, along with KTR
points collection.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: SCHED_ULE should not be the default

2011-12-09 Thread Attilio Rao

2011/12/10 George Mitchell :
> On 12/09/11 10:17, Attilio Rao wrote:
>>
>> [...]
>>
>> More precisely I'd be interested in KTR traces.
>> To be even more precise:
>> With a completely stable GENERIC configuration (or otherwise please
>> post your kernel config) please add the following:
>> options KTR
>> options KTR_ENTRIES=262144
>> options KTR_COMPILE=(KTR_SCHED)
>> options KTR_MASK=(KTR_SCHED)
>>
>> While you are in the middle of the slow-down (so once it is well
>> established) please do:
>> # sysctl debug.ktr.cpumask=""
>
>
> wonderland# sysctl debug.ktr.cpumask=""
> debug.ktr.cpumask: 
> sysctl: debug.ktr.cpumask: Invalid argument
>
>
>>
>> In the end go with:
>> # ktrdump -ctf>  ktr-ule-problem.out
>
>
> It's 44MB, so it's at http://www.m5p.com/~george/ktr-ule-problem.out

What svn revision did you use for it?
What are the CPU frequencies of the machines generating this?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: SCHED_ULE should not be the default

2011-12-09 Thread Attilio Rao

2011/12/9 George Mitchell :
> dnetc is an open-source program from http://www.distributed.net/.  It
> tries a brute-force approach to cracking RC4 puzzles and also computes
> optimal Golomb rulers.  It starts up one process per CPU and runs at
> nice 20 and is, for all intents and purposes, 100% compute bound.
>
> Here is what happens on my system, running 9.0-PRERELEASE, with and
> without dnetc running, with SCHED_ULE and SCHED-4BSD, when I run the
> command:
>
> time make buildkernel KERNCONF=WONDERLAND
>
> (I get similar results on 8.x as well.)
>
> SCHED_4BSD, dnetc not running:
> 1329.715u 123.739s 24:47.95 97.6%       6310+1987k 11233+11098io 419pf+0w
>
> SCHED_4BSD, dnetc running:
> 1329.364u 115.158s 26:14.83 91.7%       6325+1987k 10912+11060io 393pf+0w
>
> SCHED_ULE, dnetc not running:
> 1357.457u 121.526s 25:20.64 97.2%       6326+1990k 11234+11149io 419pf+0w
>
> SCHED_ULE, dnetc running:
> Still going after seven and a half hours of clock time, up to
> compiling netgraph/bluetooth.  (Completed in another five minutes
> after stopping dnetc so I could write this message in a reasonable
> amount of time.)
>
> Not everybody runs this sort of program, but there are plenty of
> similar projects out there, and people who try to participate in
> them will be mightily displeased with their FreeBSD systems when
> they do.  Is there some case where SCHED_ULE exhibits significantly
> better performance than SCHED_4BSD?  If not, I think SCHED-4BSD
> should remain the default GENERIC configuration until this is fixed.

Hi George,
are you interested in exploring the SCHED_ULE and dnetc case further?

More precisely I'd be interested in KTR traces.
To be even more precise:
With a completely stable GENERIC configuration (or otherwise please
post your kernel config) please add the following:
options KTR
options KTR_ENTRIES=262144
options KTR_COMPILE=(KTR_SCHED)
options KTR_MASK=(KTR_SCHED)

While you are in the middle of the slow-down (so once it is well
established) please do:
# sysctl debug.ktr.cpumask=""

In the end go with:
# ktrdump -ctf > ktr-ule-problem.out

and send the file to this mailing list.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: FreeBSD 9-Beta3 on X300 2 problems

2011-09-28 Thread Attilio Rao

2011/9/27 crsnet.pl :
>> Hi,
>
> Hello, thanks for the reply.
>>
>> Please try to do this without wlan loaded at all (not just down, but
>> build your wifi support as a module.)
>> Then try without X, see whether it's related to that or not.
>>
> First I ran kldunload if_iwn.
> When I try to suspend from X, Xorg closes, I see the console and the laptop
> suspends. When I resume it, I get the console (no key works); when I try
> ALT+F9 I get a black screen and a beep ;/
>
> But when I try to suspend from the console, I get:
> pci0: failed to set ACPI power state D2 \_SB_.PCI0_EXP0: AE_BAD_PARAMETER
> pci0: failed to set ACPI power state D2 \_SB_.PCI0_EXP1: AE_BAD_PARAMETER
> pci0: failed to set ACPI power state D2 \_SB_.PCI0_EXP2: AE_BAD_PARAMETER
> And the laptop suspends. When I resume it, it hangs; pressing any button
> does nothing. Then I see this on the console:
> ugen0.2:  ... disconnected
> ugen4.2:  ... disconnected
> ubt0: at uhub0 ... disconnected
> then I see the letters I pressed
> and
> acpi0: suspend request ignored (not ready yet) and the laptop hangs and beeps ;/
>
>> (And you haven't told us what your hardware is.)
>
> #dmesg (+WITNESS)
> Copyright (c) 1992-2011 The FreeBSD Project.
> Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
>        The Regents of the University of California. All rights reserved.
> FreeBSD is a registered trademark of The FreeBSD Foundation.
> FreeBSD 9.0-BETA3 #3: Tue Sep 27 10:47:57 CEST 2011
>    cr4sh@x300:/sys/amd64/compile/GENERIC amd64
> WARNING: WITNESS option enabled, expect reduced performance.
> CPU: Intel(R) Core(TM)2 Duo CPU     L7100  @ 1.20GHz (1197.03-MHz K8-class
> CPU)
>  Origin = "GenuineIntel"  Id = 0x6fb  Family = 6  Model = f  Stepping = 11
>  Features=0xbfebfbff
>                 BE>
>  Features2=0xe3bd
>  AMD Features=0x20100800
>  AMD Features2=0x1
>  TSC: P-state invariant, performance statistics
> real memory  = 2147483648 (2048 MB)
> avail memory = 2019139584 (1925 MB)
> Event timer "LAPIC" quality 400
> ACPI APIC Table: 
> FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
> FreeBSD/SMP: 1 package(s) x 2 core(s)
>  cpu0 (BSP): APIC ID:  0
>  cpu1 (AP): APIC ID:  1
> ACPI Warning: 32/64X length mismatch in Gpe1Block: 0/32
> (20110527/tbfadt-556)
> ACPI Warning: Optional field Gpe1Block has zero address or length:
> 0x102C/0x0 (20110527/tbfadt-586)
> ioapic0: Changing APIC ID to 1
> ioapic0  irqs 0-23 on motherboard
> kbd1 at kbdmux0
> acpi0:  on motherboard
> CPU0: local APIC error 0x40
> acpi_ec0:  port 0x62,0x66 on acpi0
> acpi0: Power Button (fixed)
> acpi0: reservation of 0, a (3) failed
> acpi0: reservation of 10, 7ef0 (3) failed
> Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
> acpi_timer0: <24-bit timer at 3.579545MHz> port 0x1008-0x100b on acpi0
> cpu0:  on acpi0
> cpu1:  on acpi0
> acpi_lid0:  on acpi0
> acpi_button0:  on acpi0
> pcib0:  port 0xcf8-0xcff on acpi0
> pci0:  on pcib0
> vgapci0:  port 0x1800-0x1807 mem
> 0xfa00-0xfa0f,0xe000-0xefff irq 16 at device 2.0 on pci0
> agp0:  on vgapci0
> agp0: aperture size is 256M, detected 7676k stolen memory
> vgapci1:  mem 0xfa10-0xfa1f at device 2.1 on
> pci0
> pci0:  at device 3.0 (no driver attached)
> atapci0:  port
> 0x1828-0x182f,0x180c-0x180f,0x1820-0x1827,0x1808-0x180b,0x1810-0x181f irq 18
> at device 3.2 on pci0
> ata2:  on atapci0
> ata3:  on atapci0
> pci0:  at device 3.3 (no driver attached)
> em0:  port 0x1840-0x185f mem
> 0xfa20-0xfa21,0xfa225000-0xfa225fff irq 20 at device 25.0 on pci0
> em0: Using an MSI interrupt
> acquiring duplicate lock of same type: "network driver"
>  1st &dev_spec->swflag_mutex @ dev/e1000/e1000_ich8lan.c:785
>  2nd &dev_spec->nvm_mutex @ dev/e1000/e1000_ich8lan.c:751

I think that MTX_NETWORK_LOCK is not suitable for this case as you
will have 2 different locks with the same name in softc.

I think that this patch should be good to go (and fixes the WITNESS warning):
http://www.freebsd.org/~attilio/e1000_mutex_init.patch
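
To illustrate why two mutexes sharing one type string trigger the
warning quoted above, here is a toy model in Python. It is not the
kernel's WITNESS implementation, only a sketch of the idea that a lock
checker keyed on type strings cannot tell such locks apart. The class
name and the distinct type strings are hypothetical, and what the patch
actually changes is not shown in this thread.

```python
class ToyWitness:
    """Toy checker: flags a second held lock of an already-held type."""

    def __init__(self):
        self.held = []       # (name, lock_type) in acquisition order
        self.warnings = []

    def acquire(self, name, lock_type):
        for held_name, held_type in self.held:
            if held_type == lock_type:
                self.warnings.append(
                    f'duplicate lock of same type "{lock_type}": '
                    f"{held_name} then {name}"
                )
        self.held.append((name, lock_type))

    def release(self, name):
        self.held = [entry for entry in self.held if entry[0] != name]


# Both mutexes share one type string: the toy checker complains.
shared = ToyWitness()
shared.acquire("swflag_mutex", "network driver")
shared.acquire("nvm_mutex", "network driver")

# Distinct type strings (hypothetical names): no complaint.
distinct = ToyWitness()
distinct.acquire("swflag_mutex", "e1000 swflag")
distinct.acquire("nvm_mutex", "e1000 nvm")

print(len(shared.warnings), len(distinct.warnings))
```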

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: panic: spin lock held too long (RELENG_8 from today)

2011-09-01 Thread Attilio Rao

2011/9/1 Trent Nelson :
>
> On Aug 19, 2011, at 7:53 PM, Attilio Rao wrote:
>
>> If nobody complains about it earlier, I'll propose the patch to re@ in 8 
>> hours.
>
> Just a friendly 'me too', for the records.  22 hours of heavy network/disk 
> I/O and no panic yet -- prior to the patch it was a panic orgy.
>
> Any response from re@ on the patch?  It didn't appear to be in stable/8 as of 
> yesterday:

It has been committed to STABLE_8 as r225288.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-19 Thread Attilio Rao

If nobody complains about it earlier, I'll propose the patch to re@ in 8 hours.

Attilio

2011/8/19 Mike Tancsa :
> On 8/18/2011 8:37 PM, Chip Camden wrote:
>
>>> st> Thanks, Attilio.  I've applied the patch and removed the extra debug
>>> st> options I had added (though keeping debug symbols).  I'll let you
>>> st> know if I experience any more panics.
>>>
>>>  No panic for 20 hours at this moment, FYI.  For my NFS server, I
>>>  think another 24 hours would be sufficient to confirm the stability.
>>>  I will see how it works...
>>>
>>> -- Hiroki
>>
>> Likewise:
>>
>> $ uptime
>>  5:37PM  up 21:45, 5 users, load averages: 0.68, 0.45, 0.63
>>
>> So far, so good (knocks on head).
>>
>
>
> 0(ns4)% uptime
>  8:55AM  up 22:39, 3 users, load averages: 0.01, 0.00, 0.00
> 0(ns4)%
>
>
> So far so good for me too
>
>        ---Mike
>
> --
> ---
> Mike Tancsa, tel +1 519 651 3400
> Sentex Communications, m...@sentex.net
> Providing Internet services since 1994 www.sentex.net
> Cambridge, Ontario Canada   http://www.tancsa.com/
>



-- 
Peace can only be achieved by understanding - A. Einstein

Re: USB/coredump hangs in 8 and 9

2011-08-19 Thread Attilio Rao

2011/8/12 Andrew Boyer :
> Re: panic: bufwrite: buffer is not busy??? (originally on freebsd-net)
> Re: debugging frequent kernel panics on 8.2-RELEASE (originally on 
> freebsd-stable)
> Re: System hang in USB umass module while processing panic  (originally on 
> freebsd-usb)
>
> Hello Andriy and Hans,
>
> Sorry for tying in so many discussions on this topic, but I think I have an 
> explanation for the problems we have been reporting* with hanging coredumps 
> on multicore systems on 8.2-RELEASE, and it has implications for Andriy's 
> proposed scheduler patch** and for USB.
>
> In today's 8.X and 9.X branches, nothing that I can find stops the other CPUs 
> when the kernel panics, but many parts of the locking code get disabled (grep 
> on 'panicstr').  The 'bufwrite: buffer is not busy???' panic is caused by the 
> syncer encountering an error.  If that happens when it's on the dumping CPU 
> everything hangs.  If it's running on a different CPU, it will be blocked and 
> hidden by the panic_cpu spinlock in panic(), and the dump continues, polling 
> every attached keyboard for a Ctl-C.
>
> But, the new 8.X USB stack relies on multithreading.  (The new stack is the 
> variable that broke coredumps for us in the 7.1->8.2 transition, I think.)  
> SVN 224223 fixes a hang that would happen when dumpsys() polls the USB 
> keyboard (IPMI KVM, in our case).  That helps, but it only gets as far as 
> usb_process(), where it hangs in a loop around a cv_wait() call.  This is 
> easy to reproduce by adding code to the watchdog to break into the debugger 
> if panicstr is set.
>
> I am experimenting with Andriy's patch** to stop the scheduler and it seems 
> to be most of the way there, stopping the CPUs and disabling the rest of 
> locking.  There are a few places that still reference panicstr, but that's 
> minor.  These are the changes I made to the patch:
>  * Changed ukbd_do_poll() to return immediately if SCHEDULER_STOPPED() is 
> true, so that we don't hang up in USB.  ukbd_yield()  locks up in 
> DROP_GIANT(), and if you skip ukbd_yield(), usbd_transfer_poll() locks up 
> trying to drop mutexes.
>  * Changed the call to spinlock_enter() back to critical_enter(), so that 
> interrupts stay enabled and the hardclock still functions.

Which spinlock_enter() are you referring to here?
I think that having fast interrupt handlers running during
panic/shutdown is something we should avoid like hell.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"

Re: debugging frequent kernel panics on 8.2-RELEASE

2011-08-18 Thread Attilio Rao

2011/8/18 Andriy Gapon :
> on 17/08/2011 23:21 Andriy Gapon said the following:
>>
>> It seems like everything starts with some kind of a race between
>> terminating processes in a jail and termination of the jail itself.  This
>> is where the details are very thin so far.  What we see is that a process
>> (http) is in exit(2) syscall, in exit1() function actually, and past the
>> place where P_WEXIT flag is set and even past the place where p_limit is
>> freed and reset to NULL.
>> At that place the thread calls prison_proc_free(), which calls
>> prison_deref().
>> Then, we see that in prison_deref() the thread gets a page fault because
>> of what seems like a NULL pointer dereference.  That's just the start of
>> the problem and its root cause.
>>
>> Then, trap_pfault() gets invoked and, because addresses close to NULL look
>> like userspace addresses, vm_fault/vm_fault_hold gets called, which in its
>> turn goes on to call vm_map_growstack.  First thing that vm_map_growstack
>> does is a call to lim_cur(), but because p_limit is already NULL, that
>> call results in a NULL pointer dereference and a page fault.  Goto the
>> beginning of this paragraph.
>>
>> So we get this recursion of sorts, which only ends when a stack is
>> exhausted and a CPU generates a double-fault.
>
> BTW, does anyone have an idea why the thread in question would "disappear"
> from kgdb's point of view?
>
> (kgdb) p cpuid_to_pcpu[2]->pc_curthread->td_tid
> $3 = 102057
> (kgdb) tid 102057
> invalid tid
>
> info threads also doesn't list the thread.
>
> Is it because the panic happened while the thread was somewhere in exit1()?
> is there an easy way to examine its stack in this case?

Yes, that is likely the reason.

The 'tid' command should look up the tid_to_thread() table (or similar
name), which returns NULL, meaning the thread has already passed the
point where it was removed from the lookup table.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-17 Thread Attilio Rao

2011/8/18 Hiroki Sato :
> Hiroki Sato  wrote
>  in <20110818.043332.27079545013461535@allbsd.org>:
>
> hr> Attilio Rao  wrote
> hr>   in :
> hr>
> hr> at> 2011/8/17 Hiroki Sato :
> hr> at> > Hi,
> hr> at> >
> hr> at> > Mike Tancsa  wrote
> hr> at> >  in <4e15a08c.6090...@sentex.net>:
> hr> at> >
> hr> at> > mi> On 7/7/2011 7:32 AM, Mike Tancsa wrote:
> hr> at> > mi> > On 7/7/2011 4:20 AM, Kostik Belousov wrote:
> hr> at> > mi> >>
> hr> at> > mi> >> BTW, we had a similar panic, "spinlock held too long", the
> hr> at> > mi> >> spinlock is the sched lock N, on busy 8-core box recently
> hr> at> > mi> >> upgraded to the stable/8. Unfortunately, machine hung dumping
> hr> at> > mi> >> core, so the stack trace for the owner thread was not
> hr> at> > mi> >> available.
> hr> at> > mi> >>
> hr> at> > mi> >> I was unable to make any conclusion from the data that was
> hr> at> > mi> >> present. If the situation is reproducible, you could try to
> hr> at> > mi> >> revert r221937. This is pure speculation, though.
> hr> at> > mi> >
> hr> at> > mi> > Another crash just now after 5hrs uptime. I will try and
> hr> at> > mi> > revert r221937 unless there is any extra debugging you want
> hr> at> > mi> > me to add to the kernel instead  ?
> hr> at> >
> hr> at> >  I am also suffering from a reproducible panic on an 8-STABLE box,
> hr> at> >  an NFS server with heavy I/O load.  I could not get a kernel dump
> hr> at> >  because this panic locked up the machine just after it occurred,
> hr> at> >  but according to the stack trace it was the same as posted one.
> hr> at> >  Switching to an 8.2R kernel can prevent this panic.
> hr> at> >
> hr> at> >  Any progress on the investigation?
> hr> at>
> hr> at> Hiroki,
> hr> at> how easily can you reproduce it?
> hr>
> hr>  It takes 5-10 hours.  I installed another kernel for debugging just
> hr>  now, so I think I will be able to collect more detail information in
> hr>  a couple of days.
> hr>
> hr> at> It would be important to have a DDB textdump with this information:
> hr> at> - bt
> hr> at> - ps
> hr> at> - show allpcpu
> hr> at> - alltrace
> hr> at>
> hr> at> Alternatively, a coredump which has the stop cpu patch which Andriy
> hr> at> can provide.
> hr>
> hr>  Okay, I will post them once I can get another panic.  Thanks!
>
>  I got the panic with a crash dump this time.  The result of bt, ps,
>  allpcpu, and traces can be found at the following URL:
>
>  http://people.allbsd.org/~hrs/FreeBSD/pool-panic_20110818-1.txt

Actually, I think I see the bug here.

In callout_cpu_switch() if a low priority thread is migrating the
callout and gets preempted after the outcoming cpu queue lock is left
(and scheduled much later) we get this problem.

To fix this bug a critical section could be enough, but I think this
should really be interrupt safe, thus I'd wrap the lock hand-off
with spinlock_enter()/spinlock_exit(). Fortunately
callout_cpu_switch() should be called rarely, and we already do
expensive locking operations in callout, thus we should not have
problems performance-wise.
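
To make the window concrete, here is a userland toy model of the hand-off
(not the patch itself). Everything with a _model suffix is made up for
illustration, the CPUBLOCK value is just a placeholder, and the "spinlock"
is only a counter standing in for "interrupts and preemption are off". The
protect flag shows the difference between the unprotected and the
protected hand-off:

```c
#include <assert.h>
#include <stdbool.h>

#define	CPUBLOCK	(-1)	/* model: "callout is being migrated" marker */

struct cc_model { bool locked; };	/* stand-in for one per-CPU callout queue */
struct callout_model { int c_cpu; };	/* stand-in for struct callout */

static int spin_nesting;	/* > 0 models "no interrupts, no preemption" */

static void
spinlock_enter_model(void)
{
	spin_nesting++;
}

static void
spinlock_exit_model(void)
{
	assert(spin_nesting > 0);
	spin_nesting--;
}

static void
cc_lock_model(struct cc_model *cc)
{
	assert(!cc->locked);
	cc->locked = true;
}

static void
cc_unlock_model(struct cc_model *cc)
{
	assert(cc->locked);
	cc->locked = false;
}

/*
 * Move callout c from queue cc to queue new_cc.  *window_open reports
 * whether the thread could have been preempted while c was marked as
 * migrating and no queue lock was held.
 */
static struct cc_model *
callout_cpu_switch_model(struct callout_model *c, struct cc_model *cc,
    struct cc_model *new_cc, int new_cpu, bool protect, bool *window_open)
{
	c->c_cpu = CPUBLOCK;
	if (protect)
		spinlock_enter_model();
	cc_unlock_model(cc);
	*window_open = (spin_nesting == 0);	/* the hazardous gap */
	cc_lock_model(new_cc);
	if (protect)
		spinlock_exit_model();
	c->c_cpu = new_cpu;
	return (new_cc);
}
```

With protect set to false the model reports the gap as open, which is the
case where a preempted low-priority thread can leave the callout marked as
migrating for a long time; with protect set to true the gap is closed.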

Can the people I CC'ed here try the following patch, with all the
initial kernel options that were leading you to the deadlock? (That is,
please revert any debugging patch/option you added, for the moment.)
http://www.freebsd.org/~attilio/callout-fixup.diff

Please note that this patch is for STABLE_8; if you can confirm the
good result I'll commit it to -CURRENT and then merge it back as soon as
possible.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-17 Thread Attilio Rao

2011/8/18 Hiroki Sato :
> Hiroki Sato  wrote
>  in <20110818.043332.27079545013461535@allbsd.org>:
>
> hr> Attilio Rao  wrote
> hr>   in :
> hr>
> hr> at> 2011/8/17 Hiroki Sato :
> hr> at> > Hi,
> hr> at> >
> hr> at> > Mike Tancsa  wrote
> hr> at> >  in <4e15a08c.6090...@sentex.net>:
> hr> at> >
> hr> at> > mi> On 7/7/2011 7:32 AM, Mike Tancsa wrote:
> hr> at> > mi> > On 7/7/2011 4:20 AM, Kostik Belousov wrote:
> hr> at> > mi> >>
> hr> at> > mi> >> BTW, we had a similar panic, "spinlock held too long", the spinlock
> hr> at> > mi> >> is the sched lock N, on busy 8-core box recently upgraded to the
> hr> at> > mi> >> stable/8. Unfortunately, machine hung dumping core, so the stack trace
> hr> at> > mi> >> for the owner thread was not available.
> hr> at> > mi> >>
> hr> at> > mi> >> I was unable to make any conclusion from the data that was present.
> hr> at> > mi> >> If the situation is reproducible, you could try to revert r221937. This
> hr> at> > mi> >> is pure speculation, though.
> hr> at> > mi> >
> hr> at> > mi> > Another crash just now after 5hrs uptime. I will try and revert r221937
> hr> at> > mi> > unless there is any extra debugging you want me to add to the kernel
> hr> at> > mi> > instead  ?
> hr> at> >
> hr> at> >  I am also suffering from a reproducible panic on an 8-STABLE box, an
> hr> at> >  NFS server with heavy I/O load.  I could not get a kernel dump
> hr> at> >  because this panic locked up the machine just after it occurred, but
> hr> at> >  according to the stack trace it was the same as posted one.
> hr> at> >  Switching to an 8.2R kernel can prevent this panic.
> hr> at> >
> hr> at> >  Any progress on the investigation?
> hr> at>
> hr> at> Hiroki,
> hr> at> how easily can you reproduce it?
> hr>
> hr>  It takes 5-10 hours.  I installed another kernel for debugging just
> hr>  now, so I think I will be able to collect more detail information in
> hr>  a couple of days.
> hr>
> hr> at> It would be important to have a DDB textdump with this information:
> hr> at> - bt
> hr> at> - ps
> hr> at> - show allpcpu
> hr> at> - alltrace
> hr> at>
> hr> at> Alternatively, a coredump which has the stop cpu patch which Andriy can provide.
> hr>
> hr>  Okay, I will post them once I can get another panic.  Thanks!
>
>  I got the panic with a crash dump this time.  The result of bt, ps,
>  allpcpu, and traces can be found at the following URL:
>
>  http://people.allbsd.org/~hrs/FreeBSD/pool-panic_20110818-1.txt

I'm not sure I understand it; is a corefile also available?
If yes, where could I get it? (With the relevant sources and kernel.debug.)

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-17 Thread Attilio Rao

2011/8/17 Hiroki Sato :
> Hi,
>
> Mike Tancsa  wrote
>  in <4e15a08c.6090...@sentex.net>:
>
> mi> On 7/7/2011 7:32 AM, Mike Tancsa wrote:
> mi> > On 7/7/2011 4:20 AM, Kostik Belousov wrote:
> mi> >>
> mi> >> BTW, we had a similar panic, "spinlock held too long", the spinlock
> mi> >> is the sched lock N, on busy 8-core box recently upgraded to the
> mi> >> stable/8. Unfortunately, machine hung dumping core, so the stack trace
> mi> >> for the owner thread was not available.
> mi> >>
> mi> >> I was unable to make any conclusion from the data that was present.
> mi> >> If the situation is reproducible, you could try to revert r221937. This
> mi> >> is pure speculation, though.
> mi> >
> mi> > Another crash just now after 5hrs uptime. I will try and revert r221937
> mi> > unless there is any extra debugging you want me to add to the kernel
> mi> > instead  ?
>
>  I am also suffering from a reproducible panic on an 8-STABLE box, an
>  NFS server with heavy I/O load.  I could not get a kernel dump
>  because this panic locked up the machine just after it occurred, but
>  according to the stack trace it was the same as posted one.
>  Switching to an 8.2R kernel can prevent this panic.
>
>  Any progress on the investigation?

Hiroki,
how easily can you reproduce it?

It would be important to have a DDB textdump with this information:
- bt
- ps
- show allpcpu
- alltrace

Alternatively, a coredump which has the stop cpu patch, which Andriy can provide.
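
For reference, a sketch of how such a textdump could be captured
automatically on panic. It assumes a kernel built with options KDB and
DDB, a configured dump device, and ddb_enable="YES" in /etc/rc.conf so
that /etc/ddb.conf is loaded at boot. The script line is my reading of
textdump(4) and ddb(8), so please check it against the man pages on your
release before relying on it:

```
# /etc/ddb.conf
script kdb.enter.panic=textdump set; capture on; bt; ps; show allpcpu; alltrace; capture off; call doadump; reset
```

After the next panic and reboot, savecore should leave a textdump archive
in the crash directory containing the captured DDB output.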

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: debugging frequent kernel panics on 8.2-RELEASE

2011-08-11 Thread Attilio Rao

2011/8/11 Jeremy Chadwick :
> On Thu, Aug 11, 2011 at 09:59:36AM +0100, Steven Hartland wrote:
>> That's not the issue as its happening across board over 130 machines :(
>
> Agreed, bad hardware sounds unlikely here.  I could believe some strange
> incompatibility (e.g. BIOS quirk or the like[1]) that might cause problems
> en masse across many servers, but hardware issues are unlikely in this
> situation.
>
> [1]: I mention this because we had something similar happen at my
> workplace.  For months we used a specific model of system from our
> vendor which worked reliably, zero issues.  Then we got a new shipment
> of boxes (same model as prior) which started acting very odd (often AHCI
> timeout issues or MCEs which when decoded would usually turn out to be
> nonsensical).  It took weeks to determine the cause given how slow the
> vendor was to respond: root cause turned out to be that the vendor
> decided, on a whim, to start shipping a newer BIOS version which wasn't
> "as compatible" with Solaris as previous BIOSes.  Downgrading all the
> systems to the older BIOS fixed the problem.

That falls in the "hw problem" category for me.

Anyway, we would really need much more information in order to take any
action.

Would it be possible to get access to one of the panicking machines? Is it
always the same panic, or does it vary (like: once a page fault, once a
fatal double fault, once a fatal trap, etc.)?

Whatever information you can provide may be valuable here.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: debugging frequent kernel panics on 8.2-RELEASE

2011-08-11 Thread Attilio Rao

I'd really point the finger at faulty hw.

Please run all the necessary diagnostic tools to catch it.

Attilio

2011/8/11 Andriy Gapon :
> on 10/08/2011 18:35 Steven Hartland said the following:
>> Fatal double fault
>> rip = 0x8052f6f1
>> rsp = 0xff86ce600fb0
>> rbp = 0xff86ce601210
>> cpuid = 0; apic id = 00
>> panic: double fault
>> cpuid = 0
>> KDB: stack backtrace:
>> #0 0x803af91e at kdb_backtrace+0x5e
>> #1 0x8037d817 at panic+0x187
>> #2 0x80574316 at dblfault_handler+0x96
>> #3 0x8055d06d at Xdblfault+0xad
> [snip]
>> #0  sched_switch (td=0x80830bc0, newtd=0xff000a73f8c0, 
>> flags=Variable
>> "flags" is not available.)
>>    at /usr/src/sys/kern/sched_ule.c:1858
>> 1858                    cpuid = PCPU_GET(cpuid);
>> (kgdb)
>> #0  sched_switch (td=0x80830bc0, newtd=0xff000a73f8c0, 
>> flags=Variable
>> "flags" is not available.)
>>    at /usr/src/sys/kern/sched_ule.c:1858
>> #1  0x80385c86 in mi_switch (flags=260, newtd=0x0)
>>    at /usr/src/sys/kern/kern_synch.c:449
>> #2  0x803b92d2 in sleepq_timedwait (wchan=0x80830760, pri=68)
>>    at /usr/src/sys/kern/subr_sleepqueue.c:644
>> #3  0x803861e1 in _sleep (ident=0x80830760, lock=0x0,
>>    priority=Variable "priority" is not available.
>> ) at /usr/src/sys/kern/kern_synch.c:230
>> #4  0x80532c29 in scheduler (dummy=Variable "dummy" is not available.
>> ) at /usr/src/sys/vm/vm_glue.c:807
>> #5  0x80335d67 in mi_startup () at /usr/src/sys/kern/init_main.c:254
>> #6  0x8016efac in btext () at /usr/src/sys/amd64/amd64/locore.S:81
>> #7  0x808556e0 in sleepq_chains ()
>> #8  0x8083b1e0 in cpu_top ()
>> #9  0x in ?? ()
>> #10 0x80830bc0 in proc0 ()
>> #11 0x80ba4b90 in ?? ()
>> #12 0x80ba4b38 in ?? ()
>> #13 0xff000a73f8c0 in ?? ()
>> #14 0x803a2cc9 in sched_switch (td=0x0, newtd=0x0, flags=Variable 
>> "flags"
>> is not available.
>> )
>>    at /usr/src/sys/kern/sched_ule.c:1852
>> Previous frame inner to this frame (corrupt stack?)
>> (kgdb)
>
> Looks like this is just the first thread in the kernel.
> Perhaps 'thread apply all bt' could help to find the culprit.
>
> --
> Andriy Gapon
>



-- 
Peace can only be achieved by understanding - A. Einstein

Re: [poll / rfc] kdb_stop_cpus

2011-06-04 Thread Attilio Rao

2011/6/4 Andriy Gapon :
> on 03/06/2011 20:57 Robert N. M. Watson said the following:
>>
>> On 3 Jun 2011, at 16:13, Andriy Gapon wrote:
>>
>>> I wonder if anybody uses kdb_stop_cpus with non-default value. If, yes, I
>>> am very interested to learn about your usecase for it.
>>
>> The issue that prompted the sysctl was non-NMI IPIs being used to enter the
>> debugger or reboot following a core hanging with interrupts disabled. With
>> the switch to NMI IPIs in some of those circumstances, life is better -- at
>> least, on hardware that supports non-maskable IPIs. I seem to recall sparc64
>> doesn't, however?
>
> Seems to be so as Nathan has also pointed out for PPC.
> For this I also plan the following change:
>
> commit 458ebd9aca7e91fc6e0825c727c7220ab9f61016
>
>    generic_stop_cpus: move timeout detection code from under DIAGNOSTIC
>
>    ... and also increase it a bit.
>    IMO it's better to detect and report the (rather serious) condition and
>    allow a system to proceed somehow rather than be stuck in an endless
>    loop.
>
> diff --git a/sys/kern/subr_smp.c b/sys/kern/subr_smp.c
> index ae52f4b..4bd766b 100644
> --- a/sys/kern/subr_smp.c
> +++ b/sys/kern/subr_smp.c
> @@ -232,12 +232,10 @@ generic_stop_cpus(cpumask_t map, u_int type)
>                /* spin */
>                cpu_spinwait();
>                i++;
> -#ifdef DIAGNOSTIC
> -               if (i == 10) {
> +               if (i == 1) {
>                        printf("timeout stopping cpus\n");
>                        break;
>                }
> -#endif
>        }
>
>        stopping_cpu = NOCPU;

I'd also add the ability, once the deadlock is detected, to break into
KDB, and put that under DIAGNOSTIC.
I had such a patch and I used it to debug some deadlocks in the shutdown
code, but now it seems I can't find it anymore.
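
As a rough userland illustration of the bounded wait being discussed (not
the actual subr_smp.c code): the function name, the bitmask representation
and the max_spins parameter are all hypothetical. The point is only that
the spin loop gives up after a fixed budget and tells the caller, which
could then decide to enter the debugger:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Spin until every bit in `want` shows up in *stopped, giving up after
 * max_spins iterations.  Returns true on success, false on timeout.
 */
static bool
wait_for_stopped_model(const volatile unsigned int *stopped,
    unsigned int want, unsigned long max_spins)
{
	unsigned long i = 0;

	assert(max_spins > 0);
	while ((*stopped & want) != want) {
		i++;
		if (i == max_spins) {
			printf("timeout stopping cpus\n");
			return (false);	/* caller may enter the debugger here */
		}
	}
	return (true);
}
```

Returning the result instead of just breaking out of the loop keeps the
policy (print and continue, or break into the debugger under DIAGNOSTIC)
in the caller.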

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: [poll / rfc] kdb_stop_cpus

2011-06-04 Thread Attilio Rao

2011/6/3 Nathan Whitehorn :
> On 06/03/11 10:13, Andriy Gapon wrote:
>>
>> I wonder if anybody uses kdb_stop_cpus with non-default value.
>> If, yes, I am very interested to learn about your usecase for it.
>>
>> I think that the default kdb behavior is the correct one, so it doesn't
>> make sense
>> to have a knob to turn on incorrect behavior.
>> But I may be missing something obvious.
>>
>> The comment in the code doesn't really satisfy me:
>> /*
>>  * Flag indicating whether or not to IPI the other CPUs to stop them on
>>  * entering the debugger.  Sometimes, this will result in a deadlock as
>>  * stop_cpus() waits for the other cpus to stop, so we allow it to be
>>  * disabled.  In order to maximize the chances of success, use a hard
>>  * stop for that.
>>  */
>>
>> The hard stop should be sufficiently mighty.
>> Yes, I am aware of supposedly extremely rare situations where a deadlock
>> could
>> happen even when using hard stop.  But I'd rather fix that than have this
>> switch.
>>
>> Oh, the commit message (from 2004) explains it:
>>>
>>> Add a new sysctl, debug.kdb.stop_cpus, which controls whether or not we
>>> attempt to IPI other cpus when entering the debugger in order to stop
>>> them while in the debugger.  The default remains to issue the stop;
>>> however, that can result in a hang if another cpu has interrupts disabled
>>> and is spinning, since the IPI won't be received and the KDB will wait
>>> indefinitely.  We probably need to add a timeout, but this is a useful
>>> stopgap in the mean time.
>>
>> But that was before we started using hard stop in this context (in 2009).
>
> Some non-x86 platforms (e.g. PPC) don't support real NMIs, and so this still
> applies.

Well, if I get Andriy's proposal right, he just wants to remove the
possibility of not stopping the CPUs on entering KDB. I'm not entirely
sure why there is a sysctl for disabling that and I really don't want
it.

Note that the lack of an NMI/privileged interrupt is not going to
be a factor in this request, unless you are very worried about the easy
deadlock that a normal stop operation may lead to.
If that is the case, I think that the upcoming work on skipping
locking while entering KDB/panic is going to help a lot for this
case. At that point removing the possibility to turn off CPU stopping
will be a good idea, IMHO.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: 8.2-PRERELEASE freezing on reboot (-current OK)

2010-12-14 Thread Attilio Rao

2010/12/10 Arno J. Klaassen :
>
> Hello,
>
> just FYI that on an 8-way Tyan S3992-E based box, a reboot under
> 8.2-PRERELEASE (in fact, 8-stable since quite a while) makes the box
> freeze, whilst the same thing under -current works OK.
>
> For info the end of console output in both cases as well as dmesg.boot
> for -current.
>
> Feel free to contact me for more info or test patches.

Hello Arno,
I'd need you to do the following things:
- Compile a new kernel including this patch:
http://www.freebsd.org/~attilio/diagno-stable8.diff

and including the kernel config options KDB, DDB, DIAGNOSTIC and WITNESS.
Please make sure that options WITNESS_SKIPSPIN and KDB_UNATTENDED are
not in your config file.

These options could make the deadlock not visible anymore, to some extent.
You may have to repeat the reboot many times in order to catch
something, but if you can't reproduce it just let me know.

- When the kernel deadlocks, this time, after a while it should be
able to break the deadlock on its own.
If that happens you will see the DDB prompt. At the ddb prompt type
the following commands:
db> ps
db> show allpcpu
db> allt
db> show alllocks

Note that this is quite a big output and you'd need a serial console to log it.
If you can't arrange a serial connection, I'll tell you which
information I specifically need to check, and you may note it down somehow
and reply to me (the full logs would be more valuable, but it is better
than nothing).
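
In case it helps, a minimal sketch of a kernel config for the first step.
The file name and ident DIAGNO are made up, and I'm assuming you start
from GENERIC; adapt it to whatever config you normally use:

```
# sys/<arch>/conf/DIAGNO (hypothetical name)
include		GENERIC
ident		DIAGNO

options 	KDB
options 	DDB
options 	DIAGNOSTIC
options 	WITNESS
# Do not add WITNESS_SKIPSPIN or KDB_UNATTENDED.
```

After applying the patch to the source tree, the kernel can then be built
and installed from /usr/src with "make buildkernel KERNCONF=DIAGNO"
followed by "make installkernel KERNCONF=DIAGNO".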

Are the instructions clear?

Let me know if you have any questions.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: [releng_8 tinderbox] failure on amd64/amd64

2010-10-26 Thread Attilio Rao

This issue should already be resolved by r214370; can someone else
validate this?

Thanks,
Attilio

2010/10/26 FreeBSD Tinderbox :
> TB --- 2010-10-26 06:20:40 - tinderbox 2.6 running on freebsd-current.sentex.ca
> TB --- 2010-10-26 06:20:40 - starting RELENG_8 tinderbox run for amd64/amd64
> TB --- 2010-10-26 06:20:40 - cleaning the object tree
> TB --- 2010-10-26 06:23:51 - cvsupping the source tree
> TB --- 2010-10-26 06:23:51 - /usr/bin/csup -z -r 3 -g -L 1 -h cvsup.sentex.ca /tinderbox/RELENG_8/amd64/amd64/supfile
> TB --- 2010-10-26 06:28:44 - building world
> TB --- 2010-10-26 06:28:44 - MAKEOBJDIRPREFIX=/obj
> TB --- 2010-10-26 06:28:44 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
> TB --- 2010-10-26 06:28:44 - TARGET=amd64
> TB --- 2010-10-26 06:28:44 - TARGET_ARCH=amd64
> TB --- 2010-10-26 06:28:44 - TZ=UTC
> TB --- 2010-10-26 06:28:44 - __MAKE_CONF=/dev/null
> TB --- 2010-10-26 06:28:44 - cd /src
> TB --- 2010-10-26 06:28:44 - /usr/bin/make -B buildworld
 World build started on Tue Oct 26 06:28:46 UTC 2010
 Rebuilding the temporary build tree
 stage 1.1: legacy release compatibility shims
 stage 1.2: bootstrap tools
 stage 2.1: cleaning up the object tree
 stage 2.2: rebuilding the object tree
 stage 2.3: build tools
 stage 3: cross tools
 stage 4.1: building includes
 stage 4.2: building libraries
 stage 4.3: make dependencies
 stage 4.4: building everything
 stage 5.1: building 32 bit shim libraries
 World build completed on Tue Oct 26 14:25:11 UTC 2010
> TB --- 2010-10-26 14:25:11 - generating LINT kernel config
> TB --- 2010-10-26 14:25:11 - cd /src/sys/amd64/conf
> TB --- 2010-10-26 14:25:11 - /usr/bin/make -B LINT
> TB --- 2010-10-26 14:25:12 - building LINT kernel
> TB --- 2010-10-26 14:25:12 - MAKEOBJDIRPREFIX=/obj
> TB --- 2010-10-26 14:25:12 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
> TB --- 2010-10-26 14:25:12 - TARGET=amd64
> TB --- 2010-10-26 14:25:12 - TARGET_ARCH=amd64
> TB --- 2010-10-26 14:25:12 - TZ=UTC
> TB --- 2010-10-26 14:25:12 - __MAKE_CONF=/dev/null
> TB --- 2010-10-26 14:25:12 - cd /src
> TB --- 2010-10-26 14:25:12 - /usr/bin/make -B buildkernel KERNCONF=LINT
 Kernel build for LINT started on Tue Oct 26 14:25:12 UTC 2010
 stage 1: configuring the kernel
 stage 2.1: cleaning up the object tree
 stage 2.2: rebuilding the object tree
 stage 2.3: build tools
 stage 3.1: making dependencies
> [...]
> awk -f /src/sys/tools/makeobjops.awk /src/sys/opencrypto/cryptodev_if.m -h
> awk -f /src/sys/tools/makeobjops.awk /src/sys/dev/acpica/acpi_if.m -h
> awk -f /src/sys/tools/makeobjops.awk /src/sys/dev/acpi_support/acpi_wmi_if.m -h
> rm -f .newdep
> /usr/bin/make -V CFILES -V SYSTEM_CFILES -V GEN_CFILES |  MKDEP_CPP="cc -E" 
> CC="cc" xargs mkdep -a -f .newdep -O2 -frename-registers -pipe 
> -fno-strict-aliasing  -std=c99  -Wall -Wredundant-decls -Wnested-externs 
> -Wstrict-prototypes  -Wmissing-prototypes -Wpointer-arith -Winline 
> -Wcast-qual  -Wundef -Wno-pointer-sign -fformat-extensions -nostdinc  -I. 
> -I/src/sys -I/src/sys/contrib/altq -I/src/sys/contrib/ipfilter 
> -I/src/sys/contrib/pf -I/src/sys/dev/ath -I/src/sys/dev/ath/ath_hal 
> -I/src/sys/contrib/ngatm -I/src/sys/dev/twa -I/src/sys/gnu/fs/xfs/FreeBSD 
> -I/src/sys/gnu/fs/xfs/FreeBSD/support -I/src/sys/gnu/fs/xfs 
> -I/src/sys/contrib/opensolaris/compat -I/src/sys/dev/cxgb -D_KERNEL 
> -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h -fno-common 
> -finline-limit=8000 --param inline-unit-growth=100 --param 
> large-function-growth=1000 -DGPROF -falign-functions=16 -DGPROF4 -DGUPROF 
> -fno-builtin -fno-omit-frame-pointer -mcmodel=kernel -mno-red-zone  
> -mfpmath=387 -mno-sse -mno-sse2 -mno-sse3 -mno-mmx -mno-3dnow  -msoft-float
> -fno-asynchronous-unwind-tables -ffreestanding -fstack-protector
> cc: /src/sys/libkern/inet_ntop.c: No such file or directory
> cc: /src/sys/libkern/inet_pton.c: No such file or directory
> mkdep: compile failed
> *** Error code 1
>
> Stop in /obj/src/sys/LINT.
> *** Error code 1
>
> Stop in /src.
> *** Error code 1
>
> Stop in /src.
> TB --- 2010-10-26 14:50:51 - WARNING: /usr/bin/make returned exit code  1
> TB --- 2010-10-26 14:50:51 - ERROR: failed to build lint kernel
> TB --- 2010-10-26 14:50:51 - 4255.74 user 16233.66 system 30610.94 real
>
>
> http://tinderbox.freebsd.org/tinderbox-releng_8-RELENG_8-amd64-amd64.full
>



-- 
Peace can only be achieved by understanding - A. Einstein

Re: stable GENERIC kernel build fails?

2010-10-26 Thread Attilio Rao

Sorry for the breakage, it should be fixed now.

Thanks,
Attilio

2010/10/26 Chip Camden :
> After a csup, building the GENERIC kernel on amd64 fails with:
>
> make -V CFILES -V SYSTEM_CFILES -V GEN_CFILES |  MKDEP_CPP="cc -E"
> CC="cc" xargs mkdep -a -f .newdep -O2 -frename-registers -pipe
> -fno-strict-aliasing  -std=c99 -g -Wall -Wredundant-decls
> -Wnested-externs -Wstrict-prototypes  -Wmissing-prototypes
> -Wpointer-arith -Winline -Wcast-qual  -Wundef -Wno-pointer-sign
> -fformat-extensions -nostdinc  -I. -I/usr/src/sys
> -I/usr/src/sys/contrib/altq -I/usr/src/sys/contrib/ipfilter
> -I/usr/src/sys/contrib/pf -I/usr/src/sys/dev/ath
> -I/usr/src/sys/dev/ath/ath_hal -I/usr/src/sys/contrib/ngatm
> -I/usr/src/sys/dev/twa -I/usr/src/sys/gnu/fs/xfs/FreeBSD
> -I/usr/src/sys/gnu/fs/xfs/FreeBSD/support -I/usr/src/sys/gnu/fs/xfs
> -I/usr/src/sys/contrib/opensolaris/compat -I/usr/src/sys/dev/cxgb
> -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h -fno-common
> -finline-limit=8000 --param inline-unit-growth=100 --param
> large-function-growth=1000  -fno-omit-frame-pointer -mcmodel=kernel
> -mno-red-zone  -mfpmath=387 -mno-sse -mno-sse2 -mno-sse3 -mno-mmx
> -mno-3dnow  -msoft-float -fno-asynchronous-unwind-tables -ffreestanding
> -fstack-protector
> cc: /usr/src/sys/libkern/inet_ntop.c: No such file or directory
> cc: /usr/src/sys/libkern/inet_pton.c: No such file or directory
> mkdep: compile failed
> *** Error code 1
>
> Stop in /usr/obj/usr/src/sys/GENERIC.
> *** Error code 1
>
> Stop in /usr/src.
> *** Error code 1
>
> Stop in /usr/src.
> libertas/usr/src# uname -a
> FreeBSD libertas.local.camdensoftware.com 8.1-STABLE FreeBSD 8.1-STABLE #81: 
> Sun Oct 24 11:46:14 PDT 2010     
> sterl...@libertas.local.camdensoftware.com:/usr/obj/usr/src/sys/LIBERTAS  
> amd64
>
> --
> Sterling (Chip) Camden    | sterl...@camdensoftware.com | 2048D/3A978E4F
> http://camdensoftware.com | http://chipstips.com        | 
> http://chipsquips.com
>



-- 
Peace can only be achieved by understanding - A. Einstein

Re: Kernel panic when unpluggin AC adaptor

2010-05-13 Thread Attilio Rao

2010/5/14 Giovanni Trematerra :
> On Thu, May 13, 2010 at 1:09 AM, Brandon Gooch
>  wrote:
>> On Wed, May 12, 2010 at 9:41 AM, Attilio Rao  wrote:
>>> 2010/5/12 David DEMELIER :
>>>> I remove the patch, and built the kernel (I updated the src this
>>>> morning) and it does not panic now. It's really odd. If it reappears
>>>> soon I will tell you.
>>>
>>> I looked at the code with Giovanni and I have the feeling that the
>>> race with the idle thread may still be fatal.
>>> We need to fix that.
>>>
>>> Attilio
>>>
>>
>> That seems to be the case, as my laptop shows about an 80-85 % chance
>> of experiencing a panic if left idle for long-ish periods of time (2
>> to 4 hours). I usually rebuild world or big ports overnight, and more
>> often than not I wake up to a panicked machine, same situation every
>> time:
>>
>> ...
>> rman_get_bushandle() at rman_get_bushandle+0x1
>> sched_idletd() at sched_idletd+0x123
>> fork_exit() at fork_exit+0x12a
>> fork_trampoline() at fork_trampoline+0xe
>> ...
>>
>> The kernel/userland is rebuilt, the ports are finished compiling --
>> it's in the time AFTER the completion of all tasks that the machine
>> gets bored and tries to kill itself :)
>>
>> I have seen the AC adapter plug/unplug "hang" in the past on this
>> laptop, but I never made the connection between the events, as
>> nowadays my laptop usually stays plugged in :(
>>
>> Attilio, I hope you can track this one down, let me know if I can do
>> anything to help or test...
>>
>
> Attilio and I came up with this patch. It seems ready for stress
> testing and review
> Please test and report back.

I still have to review it completely; I hope to do that ASAP.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: Kernel panic when unpluggin AC adaptor

2010-05-12 Thread Attilio Rao

2010/5/12 David DEMELIER :
> I remove the patch, and built the kernel (I updated the src this
> morning) and it does not panic now. It's really odd. If it reappears
> soon I will tell you.

I looked at the code with Giovanni and I have the feeling that the
race with the idle thread may still be fatal.
We need to fix that.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: CPU problems after 8.0-STABLE update

2010-04-09 Thread Attilio Rao

2010/4/9 Jakub Lach :
>
>
>
> Andriy Gapon wrote:
>>
>>
>> Really shooting in the dark here: are there any BIOS options about HPET
>> and RTC on
>> this system?  Can you try playing with them?
>>
>>
>
> Hello. I have similar problem. Once in few boots performance would be
> sluggish and
> top would be at 0%. It started on 4th April I think. After today's update,
> problem is persistent.
> Currently, as I type letters are appearing with considerable delay.
>
> I'm using HPET, 8-STABLE amd64 r206412

Ok, r206421 switches the default of the machdep.lapic_allclocks tunable
so that atrtc is used only if the tunable is explicitly turned off.
I will MFC it in a week.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: CPU problems after 8.0-STABLE update

2010-04-08 Thread Attilio Rao

2010/4/8 Andriy Gapon :
> on 08/04/2010 04:29 Akephalos said the following:
>> Attilio, I csup-dated several hours ago and rebuilt and installed the kernel
>> (and world, in case it matters).
>>
>> %uname -a FreeBSD free.bsd369441.org 8.0-STABLE FreeBSD 8.0-STABLE #0: Thu
>> Apr  8 03:01:13 EEST 2010
>> r...@free.bsd369441.org:/usr/obj/usr/src/sys/GENERIC  amd64
>>
>> The problem persists without the machdep trick, I see only one processor in
>> top with 0.0% CPU load.
>>
>
> Interesting, I couldn't see anything obviously wrong about your hardware.
> Could you please post a verbose dmesg from a problematic boot somewhere?
> Also, output of 'vmstat -i' and Interrupt request lines portion of
> 'devinfo -u' output.
> Thanks!

I looked again at the patch I committed to STABLE_8 and I can't find
anything wrong with it.
Also, the fact that setting machdep.lapic_allclocks=1 fixes this means
that this may be a problem with atrtc not working properly.
Maybe new Atom machines expose a problem with it?
I'm wondering whether we should switch this to an opt-in rather than an
opt-out feature.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: CPU problems after 8.0-STABLE update

2010-04-06 Thread Attilio Rao

2010/4/6 Akephalos Akephalos :
> On Sun, Apr 4, 2010 at 7:28 PM, Attilio Rao  wrote:
>>
>> What architecture is it?
>> May you try setting machdep.lapic_allclocks to 1 in /boot/loader.conf?
>> May you report #dmesg | grep atrtc
>>
>>
>> Thanks,
>> Attilio
>>
>>
>> --
>> Peace can only be achieved by understanding - A. Einstein
>
> # dmesg | grep -B 5 -A 5 -i rtc
> acpi_button0:  on acpi0
> acpi_button1:  on acpi0
> acpi_tz0:  on acpi0
> battery0:  on acpi0
> acpi_acad0:  on acpi0
> atrtc0:  port 0x70-0x71 irq 8 on acpi0
> atkbdc0:  port 0x60,0x64 irq 1 on acpi0
> atkbd0:  irq 1 on atkbdc0
> kbd0 at atkbd0
> atkbd0: [GIANT-LOCKED]
> atkbd0: [ITHREAD]
> ---
>
> I set machdep.lapic_allclocks to 1 at statup - top works now!! I can see
> both processors with top -P, btw, everything looks fine, although I get core
> dump for xfce4-taskmanager and can't test it (it's probably related to
> something else).
> ---
>
> I started powerd - it scales the frequencies correctly now.
>
> This seems to be the solution, is this a bug should I report or leave things
> like this?

Uhm, could you tell me which revision you updated to? Could you update
to the latest now, recompile your kernel, remove the
machdep.lapic_allclocks hint and report whether it works or not?

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: CPU problems after 8.0-STABLE update

2010-04-04 Thread Attilio Rao

2010/4/4 Akephalos Akephalos :
> Hey,
>
> I installed 8.0 release and used it very briefly until updating through
> cvsup to the latest stable source. I had no problems with the release (DVD)
> version, except that my wireless card wasn't detected, so updating was the
> natural thing to do. My hardware is an ASUS dual Turion laptop (K50AB), and
> my working setup was like this:
>
> - /boot/loader.conf:
> cpufreq_load="YES"
> hint.acpi_throttle.0.disabled="1"
> - /etc/rc.conf:
> powerd_enable="YES"
>
> It was working fine, the CPU frequency was scaling as expected, I checked it
> numerous times while working and idle with 'sysctl dev.cpu.0.freq'. Also,
> the load was displayed correctly in the taskmanager (I don't remember what
> was displayed in 'top', but I suppose it was ok).
>
> Now, after updating through buildworld, powerd doesn't scale the frequency
> anymore. My observations pointed out that the problem is that the CPU load
> is not detected correctly anymore:
> - I got three frequency steps: 575, 1150 and 2300 (correctly detected by
> dev.cpu.0.freq_levels while cpufreq module is loaded), but powerd scales
> down the frequency to the minimum, 575 then keeps it like that no matter of
> the load - dev.cpu.0.freq shows 575 and I got large build times because of
> it. To be able to use it fully, I have to kill powerd and set the frequency
> manually, or disable it at startup.
> - 'top -P' displays only one CPU and its load is 0% everything all the time,
> despite any load
> - I can't see anything in a taskmanager, the last time I tried with xfce and
> CURRENT (CURRENT had the same issue)
> - dev.cpu.0.cx_usage shows 100%.
> ---
>
> I'd like to find out the problem, why the CPU level is not detected
> correctly and how to fix this/report.

What architecture is it?
Could you try setting machdep.lapic_allclocks to 1 in /boot/loader.conf?
Could you also report the output of: dmesg | grep atrtc


Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: ZFS and sh(1) panic: spin lock [lock addr] (smp rendezvous) held by [sh(1) proc tid] too long

2010-02-20 Thread Attilio Rao

2010/1/27 Brandon Gooch :
> The machine, a Dell Optiplex 755, has been locking up recently. The
> situation usually occurs while using VirtualBox (running a 64-bit
> Windows 7 instance) and doing anything else in another xterm (such as
> rebuilding a port).  I've been unable to reliably reproduce it (I'm in
> an X session and the machine will not panic "properly").
>
> However, while rebuilding Xorg today at ttyv0 and running
> VBoxHeadless on ttyv1, I managed to trigger what I believe is the
> lockup.
>
> I've attached a textdump in hopes that someone may be able to take a
> look and provide clues or instruction on debugging this.

I think that jhb@ saw a similar problem while working on the nVidia
driver or the like.
I'm not sure whether he made any progress debugging it.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: run_interrupt_driven_hooks: still waiting... for xpt_config

2010-01-21 Thread Attilio Rao

2010/1/21 Willem Jan Withagen :
> Willem Jan Withagen wrote:
>>
>> I'm trying to revive an old dual Opteron Tyan Tomcat S2875 board. Even
>> upgraded it to the most recent BIOS. But still no go.
>> Both with 8.0 and 7.2 RELEASE.
>>
>> I've also disabled P1394 and all USB in the BIOS, that did not work
>> either.
>> Only thing that is "extra" in the box is an Areca 1120 controller.
>
> Moved the bootable disk to a default SATA port on the MB, and removed the
> Areca controller.
> That gets rid of the problem, but it also creates a new problem since I'd
> like to use the controller to handle a bunch of backup-disks.
>
> Suggestions on how to get the Areca controller past the xpt_config test
> are welcomed.

It is probably linked to sbp(4). Do you have it in your kernel?
If so, could you recompile the kernel without it?
It would also be interesting to try without ACPI and, if it still hangs,
to see which IRQ it is on (and which other sources share that IRQ).

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: [PATCH] Lockmgr deadlock on STABLE_8

2010-01-19 Thread Attilio Rao

2010/1/19 Pete French :
>> Could you post your kernel config?
>
> sure...
>
>        include         GENERIC
>        ident           DEBUG
>        options         KDB
>        options         DDB
>        options         WITNESS
>        options         INVARIANT_SUPPORT
>        options         INVARIANTS

OK then: remove the debugging options (WITNESS, INVARIANT*), leave KDB
and DDB in place, add GDB, and try to at least get a coredump when it
deadlocks.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: [PATCH] Lockmgr deadlock on STABLE_8

2010-01-18 Thread Attilio Rao

2010/1/18 Pete French :
>> You never know; try without WITNESS but otherwise the same setup.
>
> Well, I have been running like this for three days with no lockups,
> disappointingly. I just saw that you committed the lock patches, so
> am going to update to the latest STABLE and go back to GENERIC to see if
> that still locks up (as I can see a couple of other fixes in there).
> Will let you know what happens - at the moment it's frustrating
> as it won't lock up if I have anything diagnostic in the kernel, it
> seems!

Could you post your kernel config?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: [PATCH] Lockmgr deadlock on STABLE_8

2010-01-15 Thread Attilio Rao

2010/1/15 Pete French :
> Well, the machine has been running the WITNESS + INVARIANTS kernel
> for 20 hours now without locking up. This looks like what I
> saw before - compiling in WITNESS stops it locking up -(
>
> Is there any use in my running a kernel with just INVARIANTS to see if
> that will lock? I know it locks with KDB and DDB on their own, but
> am not sure how useful that is.

You never know; try without WITNESS but otherwise the same setup.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: [PATCH] Lockmgr deadlock on STABLE_8

2010-01-14 Thread Attilio Rao

2010/1/14 Pete French :
>> INVARIANTS requires INVARIANT_SUPPORT [sic] in the kernel config (see 
>> comments in GENERIC).
>
> Ah, right, that would explain it. Thanks!

INVARIANT_SUPPORT is a separate, mandatory option so that a kernel built
without INVARIANTS can still handle modules compiled with INVARIANTS.
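
A minimal sketch of that split in C, not the kernel's actual headers:
demo_lock, demo_lock_assert_owned() and DEMO_LOCK_ASSERT() are
hypothetical names. The backing function is compiled whenever either
option is set, while the macro only expands to a call when the caller
itself is built with INVARIANTS. Both options are defined at the top
purely so that the sketch exercises the check.

```c
#include <stdio.h>

/* Assumption: both options are set, as in the DEBUG config quoted in this thread. */
#define INVARIANT_SUPPORT 1
#define INVARIANTS 1

/* Hypothetical lock: records the id of the owning thread, 0 if unowned. */
struct demo_lock {
	int owner;
};

/* Counts failed checks; a real kernel would panic instead. */
static int demo_assert_failures;

#if defined(INVARIANTS) || defined(INVARIANT_SUPPORT)
/*
 * Backing function for the assertion. It is built whenever
 * INVARIANT_SUPPORT is set, so a module compiled with INVARIANTS can
 * resolve this symbol even if the kernel itself has INVARIANTS off.
 */
void
demo_lock_assert_owned(const struct demo_lock *lk, int self)
{
	if (lk->owner != self) {
		demo_assert_failures++;
		fprintf(stderr, "lock not owned by thread %d\n", self);
	}
}
#endif

#ifdef INVARIANTS
/* Callers emit the check only when they are built with INVARIANTS. */
#define DEMO_LOCK_ASSERT(lk, self) demo_lock_assert_owned((lk), (self))
#else
#define DEMO_LOCK_ASSERT(lk, self) ((void)0)
#endif
```

Dropping the INVARIANTS define from the sketch turns every
DEMO_LOCK_ASSERT() into a no-op while the backing function stays
available, which is the situation a non-INVARIANTS kernel with
INVARIANT_SUPPORT is in.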

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: [PATCH] Lockmgr deadlock on STABLE_8

2010-01-14 Thread Attilio Rao

2010/1/14 Pete French :
>> http://www.freebsd.org/~attilio/lockmgr_fix8.diff
>>
>> I'm looking for testers here.
>> Any report would be very much appreciated.
>
> I tested the patch on my machine which locks up, and I am afraid that it
> still locks, even with the patch applied. The last things on the console
> before the lock are:
>
> 1) A whole load of sshd errors for one of those flood attacks which try
> multiple usernames. This is not unusual, all my systems with an external
> ssh port see this.
>
> 2) Four "Watchdog timeout occurred, resetting!" messages from if_bce.c.
> These are new - without your patch I did not get these.
>
> I have tried running this machine with WITNESS in the kernel, but it
> will not deadlock then. Without WITNESS it will lock up in about
> twelve hours. I am going to try with just KDB and DDB to see if I can get
> it into a state where we can get some useful information out of it.

Also enable INVARIANTS.
With my patch applied, please set up textdump so that you can capture
the output of the following DDB commands (once it deadlocks, break into
DDB):
bt, show allpcpu, ps, alltrace, show alllocks

Also try to get a coredump. If you can't, report to us immediately and
try not to turn off the machine, so that you can apply further
instructions.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

[PATCH] Lockmgr deadlock on STABLE_8

2010-01-13 Thread Attilio Rao

As people following HEAD may have seen, about a month ago a fix to
lockmgr(9) was committed that should prevent a deadlock in that
primitive (the fix consists of r200447, 201703, 201709-201710).
While the approach chosen in HEAD is optimal, it unfortunately
introduces an ABI breakage.
To allow an MFC, a similar approach, slightly sub-optimal but not
breaking the ABI, has been prepared for STABLE_8:
http://www.freebsd.org/~attilio/lockmgr_fix8.diff
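
For readers unfamiliar with why a fix can break the ABI, here is a
purely illustrative C sketch. It does not reproduce the actual lockmgr
change: the struct names and the lk_extra field are hypothetical. It
only shows that growing a structure embedded in other structures moves
the members that follow it, so modules compiled against the old layout
would read the wrong offsets.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout that already-released modules were compiled against. */
struct demo_lock_old {
	uintptr_t lk_state;
	int lk_timo;
	int lk_pri;
};

/* Hypothetical fix that keeps its extra state in a new field. */
struct demo_lock_new {
	uintptr_t lk_state;
	int lk_timo;
	int lk_pri;
	int lk_extra;
};

/* A consumer (for example a filesystem node) that embeds the lock. */
struct demo_consumer_old {
	struct demo_lock_old lock;
	int data;
};

struct demo_consumer_new {
	struct demo_lock_new lock;
	int data;
};

/* Returns nonzero if old and new binaries disagree about the layout. */
static int
demo_abi_changed(void)
{
	return (sizeof(struct demo_lock_old) != sizeof(struct demo_lock_new) ||
	    offsetof(struct demo_consumer_old, data) !=
	    offsetof(struct demo_consumer_new, data));
}
```

A fix that stores the same information without changing the size or
member offsets of the structure avoids this problem, at the cost of a
less direct implementation.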

I'm looking for testers here.
Any report would be very much appreciated.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: Possible scheduler (SCHED_ULE) bug?

2009-11-08 Thread Attilio Rao

2009/10/23 Jaime Bozza :
> I believe I found a problem with the ULE scheduler - At least the fact that 
> there is a problem, but I'm not sure where to go from here.   The system 
> locks all processes, but doesn't panic, so I have no output to give.
>
> I was able to duplicate this on three different machines and solved it by
> switching the scheduler to 4BSD.
>
> Here's the environment:
>
> FreeBSD 7.2 i386, installed from bootonly ISO, Custom install, minimal, no 
> other changes other than setting timezone, changing root password, and 
> turning on sshd (allowing root and password connection).

Did you recompile your kernel? Can you show me the revision of
src/sys/kern/sched_ule.c you used?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: resource leak in fifo_vnops.c: 6.x/7.x/8.x

2009-11-06 Thread Attilio Rao

2009/11/6 Dorr H. Clark :
>
>
> We believe we have identified a significant resource leak
> present in 6.x, 7.x, and 8.x.  We believe this is a regression
> versus FreeBSD 4.x which appears to do the Right Thing (tm).
>
> We have a test program (see below) which will run the system
> out of sockets by repeated exercise of the failing code
> path in the kernel.
>
> Our proposed fix is applied to the file usr/src/sys/fs/fifofs/fifo_vnops.c
>
>
> @@ -237,6 +237,8 @@
>if (ap->a_mode & FWRITE) {
>if ((ap->a_mode & O_NONBLOCK) && fip->fi_readers == 0) {
>mtx_unlock(&fifo_mtx);
> +   /* Exclusive VOP lock is held - safe to clean */
> +   fifo_cleanup(vp);
>return (ENXIO);
>}
>fip->fi_writers++;

I think it should also check that fip->fi_writers == 0 (and possibly
the checks within fifo_cleanup() should just be assertions, but that's
somewhat orthogonal), and the comment is not needed.
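
The suggestion can be sketched as a small userland model in C. This is
an assumption-laden illustration, not the kernel code: demo_vnode,
demo_fifo_cleanup() and demo_fifo_open_write_nonblock() are hypothetical
stand-ins, locking is omitted, and only the fi_readers/fi_writers
counters mirror names from the quoted patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-vnode FIFO state. */
struct demo_fifoinfo {
	long fi_readers;
	long fi_writers;
};

/* Hypothetical stand-in for the vnode owning that state. */
struct demo_vnode {
	struct demo_fifoinfo *v_fifoinfo;
};

/* Releases the FIFO state; the caller must have checked both counters. */
static void
demo_fifo_cleanup(struct demo_vnode *vp)
{
	struct demo_fifoinfo *fip = vp->v_fifoinfo;

	/* The checks are assertions here, as floated in the reply above. */
	assert(fip != NULL);
	assert(fip->fi_readers == 0 && fip->fi_writers == 0);
	free(fip);
	vp->v_fifoinfo = NULL;
}

/* Models a non-blocking open for writing. Returns 0 or an errno value. */
static int
demo_fifo_open_write_nonblock(struct demo_vnode *vp)
{
	struct demo_fifoinfo *fip;

	if (vp->v_fifoinfo == NULL) {
		vp->v_fifoinfo = calloc(1, sizeof(*vp->v_fifoinfo));
		if (vp->v_fifoinfo == NULL)
			return (ENOMEM);
	}
	fip = vp->v_fifoinfo;
	if (fip->fi_readers == 0) {
		/* Only tear down when no other writer still uses the state. */
		if (fip->fi_writers == 0)
			demo_fifo_cleanup(vp);
		return (ENXIO);
	}
	fip->fi_writers++;
	return (0);
}
```

Without the fi_writers check, a failed open could free state that an
existing writer still references; without any cleanup, each failed open
on a fresh FIFO leaks the allocation, which is the leak reported above.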

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: interrupt threads CPU usage in FreeBSD 8.0

2009-11-03 Thread Attilio Rao

2009/10/21 Igor Sysoev :
> Hi,
>
> for some reason in 8.0 top always shows 0% CPU usage for intr kernel
> process and active interrupt thread, "irq19 bge0" in my case.
>
> 8-0 RC1 top -PS:
>
> CPU 0: 27.8% user,  0.0% nice,  7.1% system,  0.0% interrupt, 65.0% idle
> CPU 1:  3.0% user,  0.0% nice,  2.3% system,  7.1% interrupt, 87.6% idle
>
>   PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
>    11 root        2 171 ki31     0K    32K RUN     0 140.7H 152.54% idle
> 61371 nobody      1  69  -10   384M   289M kqread  0 105:56  17.77% nginx
> 61372 nobody      1  67  -10   384M   293M CPU0    0 106:15  16.99% nginx
>    12 root       15 -60    -     0K   240K WAIT    0  54:50   0.00% intr
>
> 8.0 RC1 top -PSH:
>
>   PID USERNAME  PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
>    11 root      171 ki31     0K    32K RUN     1  71.5H  81.05% {idle: cpu1}
>    11 root      171 ki31     0K    32K CPU0    0  69.3H  69.19% {idle: cpu0}
> 61372 nobody     68  -10   384M   294M kqread  0 107:06  18.99% nginx
> 61371 nobody     68  -10   384M   291M kqread  0 106:45  16.99% nginx
>    12 root      -68    -     0K   240K WAIT    1  50:48   0.00% {irq19: bge0}
>    17 root       44    -     0K    16K syncer  1   5:23   0.00% syncer
>    12 root      -32    -     0K   240K WAIT    1   3:06   0.00% {swi4: clock}
>
> 7.2-STABLE top -PS:
>
> CPU 0:  9.0% user,  0.0% nice,  7.9% system,  9.0% interrupt, 74.1% idle
> CPU 1: 23.3% user,  0.0% nice,  8.3% system,  0.0% interrupt, 68.4% idle
>
>   PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
>    12 root        1 171 ki31     0K    16K RUN     0 275.0H  83.59% idle: cpu0
>    11 root        1 171 ki31     0K    16K RUN     1 264.2H  76.27% idle: cpu1
> 16109 nobody      1  68  -10   376M   307M CPU0    1  28:05  21.97% nginx
> 16110 nobody      1   4  -10   376M   316M RUN     0  28:05  20.17% nginx
>    26 root        1 -68    -     0K    16K WAIT    0 902:39   6.69% irq19: bge0

How old is your 7.2-STABLE?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: 7.2-release/amd64: panic, spin lock held too long

2009-09-28 Thread Attilio Rao

2009/9/28 C. C. Tang :
> C. C. Tang wrote:
>>
>> Attilio Rao wrote:
>>>
>>> 2009/9/22 C. C. Tang :
>>>>>>>>>
>>>>>>> I have patched the sched_ule.c and did a make buildkernel & make
>>>>>>> installkernel (is buildworld and installworld necessary?), rebooted
>>>>>>> and
>>>>>>> the
>>>>>>> machine is running now.
>>>>>>> I will post here again if there is any update.
>>>>
>>>> My server has been up for 3.5 days now with HyperThreading & powerd enabled.
>>>> No panic has occurred yet.
>>>
>>> Usually how long did it take to panic?
>>>
>>> Attilio
>>>
>>>
>> It is rather random, but will usually panic within one week.
>> Anyway my server will keep running and I will report if it has any
>> problem.
>>
>> Thanks,
>> C.C.
>>
> My server is up for 9.5 days now. Seems working fine.

The patch has been committed to STABLE_7 as well.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: 8.0-RC1 panic attaching ppc

2009-09-24 Thread Attilio Rao

2009/9/24 Daniel O'Connor :
> On Wed, 23 Sep 2009, Attilio Rao wrote:
>> 2009/9/23 Daniel O'Connor :
>> > If I enable the parallel port on this Gigabyte MA7785GM-US2H I get
>> > a trap 12 when booting up.
>> >
>> > I forgot to take a picture of it at the time but I should be able
>> > to reproduce it tomorrow.
>> >
>> > Has anyone seen anything before? (a quick google showed nothing). I
>> > did not see it on 7.2(ish) on the same hardware.
>>
>> Are you able to enable KDB in your kernel config and return a
>> backtrace here?
>
> Yes, here it is..
>
> pmap_extract() at pmap_extract+0x13a
> isa_dmarangecheck() at isa_dmarangecheck+0x7a
> isa_dma_init() at isa_dma_init+0xda
> ppc_isa_attach() at ppc_isa_attach+0x40
> device_attach() at device_attach+0x69
> bus_generic_attach() at bus_generic_attach+0x1a
> acpi_attach() at acpi_attach+0x9f8
>
> (there's more but I imagine the above is probably sufficient).
>
> I took pictures, they are here
> http://www.gsoft.com.au/~doconnor/SNC00111.jpg
> http://www.gsoft.com.au/~doconnor/SNC00112.jpg
>
> If I put the parallel port in EPP mode then it works, I presume that's
> because it doesn't require a DMA channel whereas ECP does. I haven't
> enumerated the possibilities though :)

Can you try to get a kernel dump that we can analyze?
You would just need to recompile the kernel with options KDB, GDB and
debugging symbols.
Then we can dig further into it.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: 8.0-RC1 panic attaching ppc

2009-09-23 Thread Attilio Rao

2009/9/23 Daniel O'Connor :
> If I enable the parallel port on this Gigabyte MA7785GM-US2H I get a
> trap 12 when booting up.
>
> I forgot to take a picture of it at the time but I should be able to
> reproduce it tomorrow.
>
> Has anyone seen anything before? (a quick google showed nothing). I did
> not see it on 7.2(ish) on the same hardware.

Are you able to enable KDB in your kernel config and return a backtrace here?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: 7.2-release/amd64: panic, spin lock held too long

2009-09-21 Thread Attilio Rao

2009/9/22 C. C. Tang :
>>
>>
>>>> I have patched the sched_ule.c and did a make buildkernel & make
>>>> installkernel (is buildworld and installworld necessary?), rebooted and
>>>> the
>>>> machine is running now.
>>>> I will post here again if there is any update.
>
> My server has been up for 3.5 days now with HyperThreading & powerd enabled.
> No panic has occurred yet.

Usually how long did it take to panic?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: 7.2-release/amd64: panic, spin lock held too long

2009-09-19 Thread Attilio Rao

2009/9/19 Dan Naumov :
> On Fri, Sep 18, 2009 at 2:25 PM, C. C. Tang  wrote:
>>> Attilio Rao wrote:
>>>>
>>>> 2009/9/17 C. C. Tang :
>>>>>>>
>>>>>>> Dan, is that machine equipped with Hyperthreading?
>>>>>>>
>>>>>>> Attilio
>>>>>>
>>>>>> Yes. It's an Intel Atom 330, which is a dualcore CPU with HT (4 cores
>>>>>> visible in "top" as a result)
>>>>>
>>>>> Yes, mine is also Atom 330.
>>>>>
>>>>> I cannot test the patch because my machine is also in production now.
>>>>> But I
>>>>> have tested it with hyperthreading.
>>>>> powerd with HyperThreading -> spin lock hold too long
>>>>> powerd without HyperThreading -> no problem
>>>>> no powerd with/without HyperThreading -> no problem
>>>>
>>>> But are these results with the last patch I posted applied?
>>>> (specifically, for 7.2:
>>>> http://www.freebsd.org/~attilio/sched_ule.diff
>>>> )
>>>>
>>>> So with the patch in, powerd and hyperthreading on you still get a
>>>> deadlock?
>>>>
>>>> Attilio
>>>>
>>>>
>> I have patched the sched_ule.c and did a make buildkernel & make
>> installkernel (is buildworld and installworld necessary?), rebooted and the
>> machine is running now.
>> I will post here again if there is any update.
>
> Considering we are at RC1 right now, is there any chance this patch
> makes it into 8.0 release if the patch fixes the issue and doesn't
> cause any regressions? Unfortunately I can't test it myself right now,
> so I have to rely on other people experiencing the same issue to see
> if the patch fixes it.

I already committed it to STABLE_8, so it will certainly make it into the release.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: 7.2-release/amd64: panic, spin lock held too long

2009-09-17 Thread Attilio Rao

2009/9/17 C. C. Tang :
>>> Dan, is that machine equipped with Hyperthreading?
>>>
>>> Attilio
>>
>> Yes. It's an Intel Atom 330, which is a dualcore CPU with HT (4 cores
>> visible in "top" as a result)
>
> Yes, mine is also Atom 330.
>
> I cannot test the patch because my machine is also in production now. But I
> have tested it with hyperthreading.
> powerd with HyperThreading -> spin lock hold too long
> powerd without HyperThreading -> no problem
> no powerd with/without HyperThreading -> no problem

But are these results with the last patch I posted applied?
(specifically, for 7.2:
http://www.freebsd.org/~attilio/sched_ule.diff
)

So with the patch in, powerd and hyperthreading on you still get a deadlock?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: 7.2-release/amd64: panic, spin lock held too long

2009-09-14 Thread Attilio Rao

2009/7/23 C. C. Tang :
> Attilio Rao wrote:
>>
>> 2009/7/22 C. C. Tang :
>>>>
>>>> Could that one (on i386) be related?
>>>> http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/134584
>>>>
>>> I have no idea about it but I can tell the difference...
>>> My machine panic randomly rather than on shutdown and I remembered that
>>> it
>>> failed to write core dump. It also failed to reboot automatically..
>>
>> Is your problem on -CURRENT and amd64?
>> At some point there has been a problem with PAT support (and
>> tlb_shootdowns() could lead to a livelock hanging forever, leading to
>> such a bug) but I expect it is fixed now.
>> Can you try with a fresh new -CURRENT if any?
>
> My problem is on i386 version of 7.2-RELEASE-p2 on Intel Atom 330 CPU.
> And my system just panic randomly with "spin lock held too long".
> It didn't panic at reboot or shutdown so I think it the problem is somewhat
> different from that mentioned by Barbara's PR?
>
> Anyway I disabled powerd and it seems become stable now.
>
> And I am sorry that my system has been put into service so it would be hard
> for me to switch to -CURRENT...  :(

Can you re-enable powerd and try the following patch?
http://www.freebsd.org/~attilio/sched_ule.diff

The patch is against STABLE_7, but I think HEAD has the same bug.
Please try it and report to me.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: 7.2-release/amd64: panic, spin lock held too long

2009-09-12 Thread Attilio Rao

2009/7/7 Dan Naumov :
> I just got a panic followed by a reboot a few seconds after running
> "portsnap update", /var/log/messages shows the following:
>
> Jul  7 03:49:38 atom syslogd: kernel boot file is /boot/kernel/kernel
> Jul  7 03:49:38 atom kernel: spin lock 0x80b3edc0 (sched lock
> 1) held by 0xff00017d8370 (tid 100054) too long
> Jul  7 03:49:38 atom kernel: panic: spin lock held too long
>
> /var/crash looks empty. This is a system running official 7.2-p1
> binaries since I am using freebsd-update to keep up with the patches
> (just updated to -p2 after this panic) running with very low load,
> mostly serving files to my home network over Samba and running a few
> irssi instances in a screen. What do I need to do to catch more
> information if/when this happens again?

Dan, is that machine equipped with Hyperthreading?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: spinlock held too long on reboot

2009-08-04 Thread Attilio Rao

2009/7/29 Attilio Rao :
> 2009/5/23 Stefan Bethke :
>> I wrote:
>>
>>> Syncing disks, vnodes remaining...0 done
>>> All buffers synced.
>>> GEOM_MIRROR: Device diesel_root: provider mirror/diesel_root destroyed.
>>> Uptime: 6m32s
>>> GEOM_MIRROR: Device diesel_root destroyed.
>>> Rebooting...
>>> cpu_reset: Stopping other CPUs
>>> spin lock 0x8078c900 (sched lock 1) held by 0xff00014d4ab0
>>> (tid 12) too long
>>> panic: spin lock held too long
>>> cpuid = 0
>>> KDB: enter: panic
>>> [thread pid 77 tid 100090 ]
>>> Stopped at  kdb_enter+0x3d: movq$0,0x48bbd0(%rip)
>>> db> bt
>>> Tracing pid 77 tid 100090 td 0xff000457bab0
>>> kdb_enter() at kdb_enter+0x3d
>>> panic() at panic+0x17b
>>> _mtx_lock_spin_failed() at _mtx_lock_spin_failed+0x39
>>> _mtx_lock_spin() at _mtx_lock_spin+0x9e
>>> _mtx_lock_spin_flags() at _mtx_lock_spin_flags+0x72
>>> sched_balance_group() at sched_balance_group+0xc5
>>> sched_balance_group() at sched_balance_group+0x1f8
>>> sched_balance() at sched_balance+0xa2
>>> sched_clock() at sched_clock+0xf6
>>> statclock() at statclock+0xbd
>>> lapic_handle_timer() at lapic_handle_timer+0x197
>>> Xtimerint() at Xtimerint+0x8c
>>> --- interrupt, rip = 0x80541cc4, rsp = 0xff80771dba90, rbp =
>>> 0xff80771dbab0 ---
>>> DELAY() at DELAY+0x64
>>> cpu_reset() at cpu_reset+0xdd
>>> boot() at boot+0x2e6
>>> reboot() at reboot+0x42
>>> syscall() at syscall+0x1a5
>>> Xfast_syscall() at Xfast_syscall+0xd0
>>> --- syscall (55, FreeBSD ELF64, reboot), rip = 0x800788eec, rsp =
>>> 0x7fffeca8, rbp = 0 ---
>>
>>
>> I've only seen this once.  If I should encounter it again, is there
>> something you'd like me to look at?
>
> [ Sorry, trying to add anyone who already reported such a problem even
> if I know many of you experienced it on -STABLE]

If you are experiencing this problem, you may want to test rink@'s port
to 7.2 of the new version of the patch:
http://people.freebsd.org/~rink/tmp/ipi_7stable.diff

while the -CURRENT version, which is probably going to be committed
soon, is here:
http://www.freebsd.org/~attilio/stop_nmi2.diff

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: kern/134584: [panic] spin lock held too long

2009-07-27 Thread Attilio Rao

2009/7/26 barbara :
> It happened again, on shutdown.
> As the previous time, it happened after a high (for a desktop) uptime and, if 
> it could matter, after running net-p2p/transmission-gtk2 for several hours.
> I don't know if it's related, but often quitting transmission, doesn't 
> terminate the process. Sometimes it end after several minutes the gui exited, 
> sometimes it's still running after hours.
> I've noticed it as the destination folder is on a manually mounted device and 
> I can't umount it as fstat reports the device used by a transmission process.
> So I often have to kill it.
> This happened both the time I had this kind of panic.

What hw is that? How many CPUs does it have?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: kern/134584: [panic] spin lock held too long

2009-07-26 Thread Attilio Rao

2009/7/26 barbara :
> It happened again, on shutdown.
> As the previous time, it happened after a high (for a desktop) uptime and, if 
> it could matter, after running net-p2p/transmission-gtk2 for several hours.
> I don't know if it's related, but often quitting transmission, doesn't 
> terminate the process. Sometimes it end after several minutes the gui exited, 
> sometimes it's still running after hours.
> I've noticed it as the destination folder is on a manually mounted device and 
> I can't umount it as fstat reports the device used by a transmission process.
> So I often have to kill it.
> This happened both the time I had this kind of panic.

Can you try to reproduce it with WITNESS and *without*
WITNESS_SKIPSPIN? I would need to look at "show alllocks" and
possibly "ps", because it seems that the lock owner is preempted, which
should not happen while holding a spinlock (unless the acquired
spinlock is the one in the preempting path; in that case, though, it
should be dropped inside sched_switch(), and we can try to understand
why that doesn't happen).
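
To make the failure mode concrete, here is a minimal userland sketch in
C11 of a spin lock that gives up after a bounded number of attempts and
reports who holds it. It only illustrates the general technique behind
a "held too long" diagnostic; demo_spinlock, demo_spin_lock_bounded()
and the spin limit are hypothetical and do not reproduce the kernel's
mutex code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_UNOWNED 0u

/* Hypothetical spin lock: stores the owner's thread id, 0 if free. */
struct demo_spinlock {
	_Atomic uint32_t owner;
};

/*
 * Tries to take the lock for thread `self`, spinning at most
 * `max_spins` times. On failure, stores the id of the thread that
 * still holds the lock in *stuck_owner, so the caller can report it.
 */
static bool
demo_spin_lock_bounded(struct demo_spinlock *lk, uint32_t self,
    unsigned long max_spins, uint32_t *stuck_owner)
{
	for (unsigned long i = 0; i < max_spins; i++) {
		uint32_t expected = DEMO_UNOWNED;

		if (atomic_compare_exchange_strong_explicit(&lk->owner,
		    &expected, self, memory_order_acquire,
		    memory_order_relaxed))
			return (true);
	}
	*stuck_owner = atomic_load_explicit(&lk->owner, memory_order_relaxed);
	return (false);
}

/* Releases the lock so that other threads can acquire it. */
static void
demo_spin_unlock(struct demo_spinlock *lk)
{
	atomic_store_explicit(&lk->owner, DEMO_UNOWNED, memory_order_release);
}
```

If the owner never runs again while it holds the lock (for example
because it was preempted or its CPU was stopped), every other thread
exhausts its spin budget, which is why the owner's state in "ps" and
"show alllocks" matters for the diagnosis.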

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: 7.2-release/amd64: panic, spin lock held too long

2009-07-22 Thread Attilio Rao

2009/7/23 NAKAJI Hiroyuki :
>>>>>> In <4a667469.1080...@gmail.com>
>>>>>>   "C. C. Tang"  wrote:
>> > Could that one (on i386) be related?
>> > http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/134584
>> >
>
>> I have no idea about it but I can tell the difference...
>> My machine panic randomly rather than on shutdown and I remembered
>> that it failed to write core dump. It also failed to reboot
>> automatically..
>
> I also have trouble like yours.
> http://lists.freebsd.org/pipermail/freebsd-stable/2009-June/050526.html
>
> I've heard from Attilio Rao that he had found the problem and is working
> on it.

Your problem should be linked to a well-known deadlock in the VM. kib@
and jeff@ were already working on a patch for that, so I just passed
them the ball.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: 7.2-release/amd64: panic, spin lock held too long

2009-07-22 Thread Attilio Rao

2009/7/22 C. C. Tang :
>> Could that one (on i386) be related?
>> http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/134584
>>
>
> I have no idea about it but I can tell the difference...
> My machine panic randomly rather than on shutdown and I remembered that it
> failed to write core dump. It also failed to reboot automatically..

Is your problem on -CURRENT and amd64?
At some point there has been a problem with PAT support (and
tlb_shootdowns() could lead to a livelock hanging forever, leading to
such a bug) but I expect it is fixed now.
Can you try with a fresh -CURRENT, if possible?

Thanks,
Attilio



-- 
Peace can only be achieved by understanding - A. Einstein

Re: smbfs panic when lost connection or unmount --force

2009-07-10 Thread Attilio Rao

2009/7/11 Oliver Pinter :
> regs and vnodes:
>
> http://centaur.sch.bme.hu/~oliverp/freebsd/smbfs_panic/DSC01854.JPG
> http://centaur.sch.bme.hu/~oliverp/freebsd/smbfs_panic/DSC01855.JPG
> http://centaur.sch.bme.hu/~oliverp/freebsd/smbfs_panic/DSC01856.JPG
> http://centaur.sch.bme.hu/~oliverp/freebsd/smbfs_panic/DSC01857.JPG
> http://centaur.sch.bme.hu/~oliverp/freebsd/smbfs_panic/DSC01858.JPG
> http://centaur.sch.bme.hu/~oliverp/freebsd/smbfs_panic/DSC01859.JPG
> http://centaur.sch.bme.hu/~oliverp/freebsd/smbfs_panic/DSC01860.JPG

Sorry, maybe I wasn't clear: the command is spelled 'lockedvnods'.

Thanks,
Attilio



-- 
Peace can only be achieved by understanding - A. Einstein

Re: smbfs panic when lost connection or unmount --force

2009-07-10 Thread Attilio Rao

2009/7/10 Oliver Pinter :
> Hello!
>
> Here is the bt:
> http://centaur.sch.bme.hu/~oliverp/freebsd/smbfs_panic/DSC01845.JPG
> http://centaur.sch.bme.hu/~oliverp/freebsd/smbfs_panic/DSC01846.JPG
> http://centaur.sch.bme.hu/~oliverp/freebsd/smbfs_panic/DSC01847.JPG

Could you please add the register state and the locked vnodes to this information?

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: smbfs panic when lost connection or unmount --force

2009-07-09 Thread Attilio Rao

2009/7/10 Oliver Pinter :
> Hi all!
>
> I get a kernel panic when I force-unmount the smbfs volume or lose the
> connection with the samba server.
>
> --
> The OS is:
>
>
> kern.ostype: FreeBSD
> kern.osrelease: 7.2-STABLE
> kern.osrevision: 199506
> kern.version: FreeBSD 7.2-STABLE #4: Sat Jun 27 21:44:32 CEST 2009
>r...@oliverp:/usr/obj/usr/src/sys/stable
> kern.osreldate: 702103
>
> --
> make.conf:
>
>
> CPUTYPE?=core2
> CFLAGS= -O2 -fno-strict-aliasing -pipe
> MODULES_OVERRIDE=smbfs libiconv libmchain zfs opensolaris drm cd9660
> cd9660_iconv
>
> --
> panic message:
>
> Jul 10 01:58:39 oliverp syslogd: kernel boot file is /boot/kernel/kernel
> Jul 10 01:58:39 oliverp kernel: kernel trap 12 with interrupts disabled
> Jul 10 01:58:39 oliverp kernel:
> Jul 10 01:58:39 oliverp kernel:
> Jul 10 01:58:39 oliverp kernel: Fatal trap 12: page fault while in kernel mode
> Jul 10 01:58:39 oliverp kernel: cpuid = 2; apic id = 02
> Jul 10 01:58:39 oliverp kernel: fault virtual address   = 0x30
> Jul 10 01:58:39 oliverp kernel: fault code  = supervisor read 
> data,
> page not present
> Jul 10 01:58:39 oliverp kernel: instruction pointer = 
> 0x8:0x80327fd0
> Jul 10 01:58:39 oliverp kernel: stack pointer   = 
> 0x10:0xff8078360940
> Jul 10 01:58:39 oliverp kernel: frame pointer   = 
> 0x10:0xff0004c31390
> Jul 10 01:58:39 oliverp kernel: code segment= base 0x0, limit
> 0xf, type 0x1b
> Jul 10 01:58:39 oliverp kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
> Jul 10 01:58:39 oliverp kernel: processor eflags= resume, IOPL = 0
> Jul 10 01:58:39 oliverp kernel: current process = 60406 (smbiod0)
> Jul 10 01:58:39 oliverp kernel: trap number = 12
> Jul 10 01:58:39 oliverp kernel: panic: page fault
> Jul 10 01:58:39 oliverp kernel: cpuid = 2
> Jul 10 01:58:39 oliverp kernel: Uptime: 6h51m16s
> Jul 10 01:58:39 oliverp kernel: Physical memory: 4087 MB
> Jul 10 01:58:39 oliverp kernel: Dumping 2448 MB:Copyright (c)
> 1992-2009 The FreeBSD Project.

Can you at least produce a backtrace for that?

Thanks,
Attilio



Re: 7.2-release/amd64: panic, spin lock held too long

2009-07-08 Thread Attilio Rao

2009/7/8 Dan Naumov :
> On Wed, Jul 8, 2009 at 3:57 AM, Dan Naumov wrote:
>> On Tue, Jul 7, 2009 at 4:27 AM, Attilio Rao wrote:
>>> 2009/7/7 Dan Naumov :
>>>> On Tue, Jul 7, 2009 at 4:18 AM, Attilio Rao wrote:
>>>>> 2009/7/7 Dan Naumov :
>>>>>> I just got a panic following by a reboot a few seconds after running
>>>>>> "portsnap update", /var/log/messages shows the following:
>>>>>>
>>>>>> Jul  7 03:49:38 atom syslogd: kernel boot file is /boot/kernel/kernel
>>>>>> Jul  7 03:49:38 atom kernel: spin lock 0x80b3edc0 (sched lock
>>>>>> 1) held by 0xff00017d8370 (tid 100054) too long
>>>>>> Jul  7 03:49:38 atom kernel: panic: spin lock held too long
>>>>>
>>>>> That's a known bug, affecting -CURRENT as well.
>>>>> The cpustop IPI is handled though an NMI, which means it could
>>>>> interrupt a CPU in any moment, even while holding a spinlock,
>>>>> violating one well known FreeBSD rule.
>>>>> That means that the cpu can stop itself while the thread was holding
>>>>> the sched lock spinlock and not releasing it (there is no way, modulo
>>>>> highly hackish, to fix that).
>>>>> In the while hardclock() wants to schedule something else to run and
>>>>> got stuck on the thread lock.
>>>>>
>>>>> Ideal fix would involve not using a NMI for serving the cpustop while
>>>>> having a cheap way (not making the common path too hard) to tell
>>>>> hardclock() to avoid scheduling while cpustop is in flight.
>>>>>
>>>>> Thanks,
>>>>> Attilio
>>>>
>>>> Any idea if a fix is being worked on and how unlucky must one be to
>>>> run into this issue, should I expect it to happen again? Is it
>>>> basically completely random?
>>>
>>> I'd like to work on that issue before BETA3 (and backport to
>>> STABLE_7), I'm just time-constrained right now.
>>> it is completely random.
>>>
>>> Thanks,
>>> Attilio
>>
>> Ok, this is getting pretty bad, 23 hours later, I get the same kind of
>> panic, the only difference is that instead of "portsnap update", this
>> was triggered by "portsnap cron" which I have running between 3 and 4
>> am every day:
>>
>> Jul  8 03:03:49 atom kernel: ssppiinn  lloocckk
>> 00xx8800bb33eeddc400  ((sscchheedd  lloocck k1 )0 )h
>> ehledl db yb y 0x0xfff0f1081735339760e 0( t(itdi d
>> 1016070)5 )t otoo ol olnogng
>> Jul  8 03:03:49 atom kernel: p
>> Jul  8 03:03:49 atom kernel: anic: spin lock held too long
>> Jul  8 03:03:49 atom kernel: cpuid = 0
>> Jul  8 03:03:49 atom kernel: Uptime: 23h2m38s
>
> I have now tried repeating the problem by running "stress --cpu 8 --io
> 8 --vm 4 --vm-bytes 1024M --timeout 600s --verbose" which pushed
> system load into the 15.50 ballpark and simultaneously running
> "portsnap fetch" and "portsnap update" but I couldn't manually trigger
> the panic, it seems that this problem is indeed random (although it
> baffles me why is it specifically portsnap triggering it). I have now
> disabled powerd to check whether that makes any difference to system
> stability.

But is that happening at reboot time?

Thanks,
Attilio



Re: 7.2-release/amd64: panic, spin lock held too long

2009-07-06 Thread Attilio Rao

2009/7/7 Dan Naumov :
> On Tue, Jul 7, 2009 at 4:18 AM, Attilio Rao wrote:
>> 2009/7/7 Dan Naumov :
>>> I just got a panic following by a reboot a few seconds after running
>>> "portsnap update", /var/log/messages shows the following:
>>>
>>> Jul  7 03:49:38 atom syslogd: kernel boot file is /boot/kernel/kernel
>>> Jul  7 03:49:38 atom kernel: spin lock 0x80b3edc0 (sched lock
>>> 1) held by 0xff00017d8370 (tid 100054) too long
>>> Jul  7 03:49:38 atom kernel: panic: spin lock held too long
>>
>> That's a known bug, affecting -CURRENT as well.
>> The cpustop IPI is handled though an NMI, which means it could
>> interrupt a CPU in any moment, even while holding a spinlock,
>> violating one well known FreeBSD rule.
>> That means that the cpu can stop itself while the thread was holding
>> the sched lock spinlock and not releasing it (there is no way, modulo
>> highly hackish, to fix that).
>> In the while hardclock() wants to schedule something else to run and
>> got stuck on the thread lock.
>>
>> Ideal fix would involve not using a NMI for serving the cpustop while
>> having a cheap way (not making the common path too hard) to tell
>> hardclock() to avoid scheduling while cpustop is in flight.
>>
>> Thanks,
>> Attilio
>
> Any idea if a fix is being worked on and how unlucky must one be to
> run into this issue, should I expect it to happen again? Is it
> basically completely random?

I'd like to work on that issue before BETA3 (and backport the fix to
STABLE_7); I'm just time-constrained right now.
And yes, it is completely random.

Thanks,
Attilio



Re: 7.2-release/amd64: panic, spin lock held too long

2009-07-06 Thread Attilio Rao

2009/7/7 Dan Naumov :
> I just got a panic following by a reboot a few seconds after running
> "portsnap update", /var/log/messages shows the following:
>
> Jul  7 03:49:38 atom syslogd: kernel boot file is /boot/kernel/kernel
> Jul  7 03:49:38 atom kernel: spin lock 0x80b3edc0 (sched lock
> 1) held by 0xff00017d8370 (tid 100054) too long
> Jul  7 03:49:38 atom kernel: panic: spin lock held too long

That's a known bug, affecting -CURRENT as well.
The cpustop IPI is handled through an NMI, which means it can
interrupt a CPU at any moment, even while it is holding a spinlock,
violating a well-known FreeBSD rule.
That means that the CPU can stop itself while the thread is holding
the sched lock spinlock, without releasing it (there is no way, short
of something highly hackish, to fix that).
In the meantime hardclock() wants to schedule something else to run and
gets stuck on the thread lock.

The ideal fix would involve not using an NMI for serving the cpustop, while
having a cheap way (not making the common path too expensive) to tell
hardclock() to avoid scheduling while a cpustop is in flight.
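
The failure mode described above can be sketched in userland C. This is only
a toy model under stated assumptions: the names toy_spinlock, toy_spin_lock
and the spin budget are hypothetical and are not FreeBSD's mutex code. It
shows an owner that never releases a spin lock (as if it had been stopped by
an NMI while holding it) and a waiter that gives up after a bounded number of
attempts, which is roughly the situation behind "spin lock held too long".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy lock: owner is 0 when free, otherwise the owner's id. */
struct toy_spinlock {
	atomic_uintptr_t owner;
};

enum spin_result { SPIN_ACQUIRED, SPIN_HELD_TOO_LONG };

/* Try once to take the lock on behalf of thread `tid` (must be non-zero). */
static bool
toy_spin_trylock(struct toy_spinlock *l, uintptr_t tid)
{
	uintptr_t expected = 0;

	assert(tid != 0);
	return (atomic_compare_exchange_strong(&l->owner, &expected, tid));
}

static void
toy_spin_unlock(struct toy_spinlock *l)
{
	atomic_store(&l->owner, 0);
}

/*
 * Spin for at most `budget` attempts.  If the owner was stopped while
 * holding the lock it will never release it, so the waiter runs out of
 * budget and reports the condition instead of spinning forever.
 */
static enum spin_result
toy_spin_lock(struct toy_spinlock *l, uintptr_t tid, unsigned long budget)
{
	for (unsigned long i = 0; i < budget; i++) {
		if (toy_spin_trylock(l, tid))
			return (SPIN_ACQUIRED);
	}
	return (SPIN_HELD_TOO_LONG);
}
```

In the real kernel the waiter panics rather than returning an error code; the
sketch only illustrates why a lock holder that can be stopped at an arbitrary
point makes the waiter's timeout fire.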

Thanks,
Attilio



Re: [nfs] process locks in "bo_wwait" on 6.4

2009-06-29 Thread Attilio Rao

2009/6/29 Attilio Rao :
> 2009/6/29 pluknet :
>> 2009/6/29 Attilio Rao :
>>> 2009/6/29 pluknet :
>>>> 2009/6/29 Attilio Rao :
>>>>> 2009/6/29 pluknet :
>>>>>> 2009/6/29 Attilio Rao :
>>>>>>> 2009/6/29 pluknet :
>>>>>>>> 2009/6/29 Attilio Rao :
>>>>>>>>> 2009/6/29 pluknet :
>>>>>>>>>> 2009/6/26 pluknet :
>>>>>>>>>>> 2009/6/26 pluknet :
>>>>>>>>>>>> Hello.
>>>>>>>>>>>>
>>>>>>>>>>>> While building a module on nfs mounted /usr/src
>>>>>>>>>>>> I got an unkillable process waiting forever in bo_wwait.
>>>>>>>>>>>
>>>>>>>>>>> Small note: iface on NFS server has mtu changed from 1500 to 1450.
>>>>>>>>>>> Can this be a source of the problem?
>>>>>>>>>>
>>>>>>>>>> This is 100% reproducible. Lock in the same place. Any hints?
>>>>>>>>>
>>>>>>>>> Can you also show the value of ps?
>>>>>>>>> A precise map of what processes are doing would give an help.
>>>>>>>>> Also would be useful to printout traces for other threads and not only
>>>>>>>>> the stucked one.
>>>>>>>>>
>>>>>>>>
>>>>>>>> From another run:
>>>>>>>
>>>>>>> I'm unable to see who would be locking the buffer object in question.
>>>>>>> Do you have INVARIANT_SUPPORT/INVARIANTS on?
>>>>>>
>>>>>> Yes, I do both.
>>>>>>
>>>>>>> What revision of /usr/src/sys/kern/vfs_bio.c are you running with?
>>>>>>>
>>>>>>
>>>>>> As of 6.4-R: CVS rev 1.491.2.12.4.1 / SVN rev 183531.
>>>>>
>>>>> Please try this patch and report.
>>>>>
>>>>> Thanks,
>>>>> Attilio
>>>>>
>>>>> --- src/sys/nfsclient/nfs_vnops.c   2008/02/13 20:44:18 1.281
>>>>> +++ src/sys/nfsclient/nfs_vnops.c   2008/03/22 09:15:15 1.282
>>>>> @@ -33,7 +33,7 @@
>>>>>  */
>>>>>
>>>>>  #include 
>>>>> -__FBSDID("$FreeBSD:
>>>>> /usr/local/www/cvsroot/FreeBSD/src/sys/nfsclient/nfs_vnops.c,v 1.281
>>>>> 2008/02/13 20:44:18 attilio Exp $");
>>>>> +__FBSDID("$FreeBSD:
>>>>> /usr/local/www/cvsroot/FreeBSD/src/sys/nfsclient/nfs_vnops.c,v 1.282
>>>>> 2008/03/22 09:15:15 jeff Exp $");
>>>>>
>>>>
>>>> Do you refer to the whole svn r177493, or is its nfs part will be enough?
>>>> This only vfs_vnops.c diff seems not applicable without underneath
>>>> kernel part changes.
>>>>
>>>> I'll try. Thanks.
>>>
>>> The NFS part should be enough, though I don't understand why it
>>> doesn't trigger a panic on STABLE_6 as long as, at least in my
>>> revision, there is an assert for the buffer object lock to be held in
>>> bufobj_wwait(). What's your sys/kern/vfs_bio.c rev?
>>>
>>
>> As of 6.4-R.
>> $FreeBSD: src/sys/kern/vfs_bio.c,v 1.491.2.12.4.1 2008/10/02 02:57:24
>> kensmith Exp $
>
> That's it, the revision doesn't have the assert.
> If it does fix the problem for you, I will let you test a more
> comprehensive patch as there is also at least another fix I want to
> bring in along with this one (and the relative asserts).

Uhm, wait, after looking at the code more carefully I don't think this patch
can fix your problem.
I will let you know once I have had a bit more time to study the deadlock.

Thanks,
Attilio



Re: [nfs] process locks in "bo_wwait" on 6.4

2009-06-29 Thread Attilio Rao

2009/6/29 pluknet :
> 2009/6/29 Attilio Rao :
>> 2009/6/29 pluknet :
>>> 2009/6/29 Attilio Rao :
>>>> 2009/6/29 pluknet :
>>>>> 2009/6/29 Attilio Rao :
>>>>>> 2009/6/29 pluknet :
>>>>>>> 2009/6/29 Attilio Rao :
>>>>>>>> 2009/6/29 pluknet :
>>>>>>>>> 2009/6/26 pluknet :
>>>>>>>>>> 2009/6/26 pluknet :
>>>>>>>>>>> Hello.
>>>>>>>>>>>
>>>>>>>>>>> While building a module on nfs mounted /usr/src
>>>>>>>>>>> I got an unkillable process waiting forever in bo_wwait.
>>>>>>>>>>
>>>>>>>>>> Small note: iface on NFS server has mtu changed from 1500 to 1450.
>>>>>>>>>> Can this be a source of the problem?
>>>>>>>>>
>>>>>>>>> This is 100% reproducible. Lock in the same place. Any hints?
>>>>>>>>
>>>>>>>> Can you also show the value of ps?
>>>>>>>> A precise map of what processes are doing would give an help.
>>>>>>>> Also would be useful to printout traces for other threads and not only
>>>>>>>> the stucked one.
>>>>>>>>
>>>>>>>
>>>>>>> From another run:
>>>>>>
>>>>>> I'm unable to see who would be locking the buffer object in question.
>>>>>> Do you have INVARIANT_SUPPORT/INVARIANTS on?
>>>>>
>>>>> Yes, I do both.
>>>>>
>>>>>> What revision of /usr/src/sys/kern/vfs_bio.c are you running with?
>>>>>>
>>>>>
>>>>> As of 6.4-R: CVS rev 1.491.2.12.4.1 / SVN rev 183531.
>>>>
>>>> Please try this patch and report.
>>>>
>>>> Thanks,
>>>> Attilio
>>>>
>>>> --- src/sys/nfsclient/nfs_vnops.c   2008/02/13 20:44:18 1.281
>>>> +++ src/sys/nfsclient/nfs_vnops.c   2008/03/22 09:15:15 1.282
>>>> @@ -33,7 +33,7 @@
>>>>  */
>>>>
>>>>  #include 
>>>> -__FBSDID("$FreeBSD:
>>>> /usr/local/www/cvsroot/FreeBSD/src/sys/nfsclient/nfs_vnops.c,v 1.281
>>>> 2008/02/13 20:44:18 attilio Exp $");
>>>> +__FBSDID("$FreeBSD:
>>>> /usr/local/www/cvsroot/FreeBSD/src/sys/nfsclient/nfs_vnops.c,v 1.282
>>>> 2008/03/22 09:15:15 jeff Exp $");
>>>>
>>>
>>> Do you refer to the whole svn r177493, or is its nfs part will be enough?
>>> This only vfs_vnops.c diff seems not applicable without underneath
>>> kernel part changes.
>>>
>>> I'll try. Thanks.
>>
>> The NFS part should be enough, though I don't understand why it
>> doesn't trigger a panic on STABLE_6 as long as, at least in my
>> revision, there is an assert for the buffer object lock to be held in
>> bufobj_wwait(). What's your sys/kern/vfs_bio.c rev?
>>
>
> As of 6.4-R.
> $FreeBSD: src/sys/kern/vfs_bio.c,v 1.491.2.12.4.1 2008/10/02 02:57:24
> kensmith Exp $

That's it, that revision doesn't have the assert.
If it does fix the problem for you, I will let you test a more
comprehensive patch, as there is at least one other fix I want to
bring in along with this one (and the related asserts).

Thanks,
Attilio



Re: [nfs] process locks in "bo_wwait" on 6.4

2009-06-29 Thread Attilio Rao

2009/6/29 pluknet :
> 2009/6/29 Attilio Rao :
>> 2009/6/29 pluknet :
>>> 2009/6/29 Attilio Rao :
>>>> 2009/6/29 pluknet :
>>>>> 2009/6/29 Attilio Rao :
>>>>>> 2009/6/29 pluknet :
>>>>>>> 2009/6/26 pluknet :
>>>>>>>> 2009/6/26 pluknet :
>>>>>>>>> Hello.
>>>>>>>>>
>>>>>>>>> While building a module on nfs mounted /usr/src
>>>>>>>>> I got an unkillable process waiting forever in bo_wwait.
>>>>>>>>
>>>>>>>> Small note: iface on NFS server has mtu changed from 1500 to 1450.
>>>>>>>> Can this be a source of the problem?
>>>>>>>
>>>>>>> This is 100% reproducible. Lock in the same place. Any hints?
>>>>>>
>>>>>> Can you also show the value of ps?
>>>>>> A precise map of what processes are doing would give an help.
>>>>>> Also would be useful to printout traces for other threads and not only
>>>>>> the stucked one.
>>>>>>
>>>>>
>>>>> From another run:
>>>>
>>>> I'm unable to see who would be locking the buffer object in question.
>>>> Do you have INVARIANT_SUPPORT/INVARIANTS on?
>>>
>>> Yes, I do both.
>>>
>>>> What revision of /usr/src/sys/kern/vfs_bio.c are you running with?
>>>>
>>>
>>> As of 6.4-R: CVS rev 1.491.2.12.4.1 / SVN rev 183531.
>>
>> Please try this patch and report.
>>
>> Thanks,
>> Attilio
>>
>> --- src/sys/nfsclient/nfs_vnops.c   2008/02/13 20:44:18 1.281
>> +++ src/sys/nfsclient/nfs_vnops.c   2008/03/22 09:15:15 1.282
>> @@ -33,7 +33,7 @@
>>  */
>>
>>  #include 
>> -__FBSDID("$FreeBSD:
>> /usr/local/www/cvsroot/FreeBSD/src/sys/nfsclient/nfs_vnops.c,v 1.281
>> 2008/02/13 20:44:18 attilio Exp $");
>> +__FBSDID("$FreeBSD:
>> /usr/local/www/cvsroot/FreeBSD/src/sys/nfsclient/nfs_vnops.c,v 1.282
>> 2008/03/22 09:15:15 jeff Exp $");
>>
>
> Do you refer to the whole svn r177493, or is its nfs part will be enough?
> This only vfs_vnops.c diff seems not applicable without underneath
> kernel part changes.
>
> I'll try. Thanks.

The NFS part should be enough, though I don't understand why it
doesn't trigger a panic on STABLE_6 given that, at least in my
revision, there is an assert that the buffer object lock is held in
bufobj_wwait(). What's your sys/kern/vfs_bio.c rev?

Thanks,
Attilio



Re: [nfs] process locks in "bo_wwait" on 6.4

2009-06-29 Thread Attilio Rao

2009/6/29 pluknet :
> 2009/6/29 Attilio Rao :
>> 2009/6/29 pluknet :
>>> 2009/6/29 Attilio Rao :
>>>> 2009/6/29 pluknet :
>>>>> 2009/6/26 pluknet :
>>>>>> 2009/6/26 pluknet :
>>>>>>> Hello.
>>>>>>>
>>>>>>> While building a module on nfs mounted /usr/src
>>>>>>> I got an unkillable process waiting forever in bo_wwait.
>>>>>>
>>>>>> Small note: iface on NFS server has mtu changed from 1500 to 1450.
>>>>>> Can this be a source of the problem?
>>>>>
>>>>> This is 100% reproducible. Lock in the same place. Any hints?
>>>>
>>>> Can you also show the value of ps?
>>>> A precise map of what processes are doing would give an help.
>>>> Also would be useful to printout traces for other threads and not only
>>>> the stucked one.
>>>>
>>>
>>> From another run:
>>
>> I'm unable to see who would be locking the buffer object in question.
>> Do you have INVARIANT_SUPPORT/INVARIANTS on?
>
> Yes, I do both.
>
>> What revision of /usr/src/sys/kern/vfs_bio.c are you running with?
>>
>
> As of 6.4-R: CVS rev 1.491.2.12.4.1 / SVN rev 183531.

Please try this patch and report.

Thanks,
Attilio

--- src/sys/nfsclient/nfs_vnops.c   2008/02/13 20:44:18 1.281
+++ src/sys/nfsclient/nfs_vnops.c   2008/03/22 09:15:15 1.282
@@ -33,7 +33,7 @@
  */

 #include <sys/cdefs.h>
-__FBSDID("$FreeBSD: /usr/local/www/cvsroot/FreeBSD/src/sys/nfsclient/nfs_vnops.c,v 1.281 2008/02/13 20:44:18 attilio Exp $");
+__FBSDID("$FreeBSD: /usr/local/www/cvsroot/FreeBSD/src/sys/nfsclient/nfs_vnops.c,v 1.282 2008/03/22 09:15:15 jeff Exp $");

 /*
  * vnode op calls for Sun NFS version 2 and 3
@@ -2736,11 +2736,12 @@ nfs_flush(struct vnode *vp, int waitfor,
int i;
struct buf *nbp;
struct nfsmount *nmp = VFSTONFS(vp->v_mount);
-   int s, error = 0, slptimeo = 0, slpflag = 0, retv, bvecpos;
+   int error = 0, slptimeo = 0, slpflag = 0, retv, bvecpos;
int passone = 1;
u_quad_t off, endoff, toff;
struct ucred* wcred = NULL;
struct buf **bvec = NULL;
+   struct bufobj *bo;
 #ifndef NFS_COMMITBVECSIZ
 #define NFS_COMMITBVECSIZ  20
 #endif
@@ -2751,6 +2752,7 @@ nfs_flush(struct vnode *vp, int waitfor,
slpflag = PCATCH;
if (!commit)
passone = 0;
+   bo = &vp->v_bufobj;
/*
 * A b_flags == (B_DELWRI | B_NEEDCOMMIT) block has been written to the
 * server, but has not been committed to stable storage on the server
@@ -2763,15 +2765,14 @@ again:
endoff = 0;
bvecpos = 0;
if (NFS_ISV3(vp) && commit) {
-   s = splbio();
if (bvec != NULL && bvec != bvec_on_stack)
free(bvec, M_TEMP);
/*
 * Count up how many buffers waiting for a commit.
 */
bveccount = 0;
-   VI_LOCK(vp);
-   TAILQ_FOREACH_SAFE(bp, &vp->v_bufobj.bo_dirty.bv_hd, b_bobufs, nbp) {
+   BO_LOCK(bo);
+   TAILQ_FOREACH_SAFE(bp, &bo->bo_dirty.bv_hd, b_bobufs, nbp) {
if (!BUF_ISLOCKED(bp) &&
(bp->b_flags & (B_DELWRI | B_NEEDCOMMIT))
== (B_DELWRI | B_NEEDCOMMIT))
@@ -2788,11 +2789,11 @@ again:
 * Release the vnode interlock to avoid a lock
 * order reversal.
 */
-   VI_UNLOCK(vp);
+   BO_UNLOCK(bo);
bvec = (struct buf **)
malloc(bveccount * sizeof(struct buf *),
   M_TEMP, M_NOWAIT);
-   VI_LOCK(vp);
+   BO_LOCK(bo);
if (bvec == NULL) {
bvec = bvec_on_stack;
bvecsize = NFS_COMMITBVECSIZ;
@@ -2802,7 +2803,7 @@ again:
bvec = bvec_on_stack;
bvecsize = NFS_COMMITBVECSIZ;
}
-   TAILQ_FOREACH_SAFE(bp, &vp->v_bufobj.bo_dirty.bv_hd, b_bobufs, nbp) {
+   TAILQ_FOREACH_SAFE(bp, &bo->bo_dirty.bv_hd, b_bobufs, nbp) {
if (bvecpos >= bvecsize)
break;
if (BUF_LOCK(bp, LK_EXCLUSIVE | LK_NOWAIT, NULL)) {
@@ -2815,7 +2816,7 @@ again:
nbp = TAILQ_NEXT(bp, b_bobufs);
continue;
}
-
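
The hunks above mostly swap the vnode interlock (VI_LOCK) for the buffer
object lock (BO_LOCK), but they also show a pattern worth spelling out: count
the dirty buffers under the lock, drop the lock around the allocation, retake
it, and fall back to an on-stack vector if the allocation fails. A minimal
userland sketch of that pattern follows, assuming hypothetical names
(toy_bufobj, toy_collect, TOY_STACK_SLOTS) and a trivial atomic_flag lock in
place of the real bufobj lock; it is not the nfs_flush() code itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_STACK_SLOTS 4	/* plays the role of NFS_COMMITBVECSIZ */

/* Hypothetical stand-in for a bufobj: a lock plus a list of dirty items. */
struct toy_bufobj {
	atomic_flag lock;
	const int *dirty;
	size_t ndirty;
};

static void
toy_bo_lock(struct toy_bufobj *bo)
{
	while (atomic_flag_test_and_set(&bo->lock))
		;	/* spin until the flag was previously clear */
}

static void
toy_bo_unlock(struct toy_bufobj *bo)
{
	atomic_flag_clear(&bo->lock);
}

/*
 * Copy the dirty items into a vector and return how many were copied.
 * *vecp ends up pointing either at a heap vector (caller frees it) or at
 * `on_stack`.  `fail_alloc` is a test hook that simulates malloc failure.
 */
static size_t
toy_collect(struct toy_bufobj *bo, int *on_stack, int **vecp, int fail_alloc)
{
	int *vec = on_stack;
	size_t cap = TOY_STACK_SLOTS, count, n = 0;

	assert(bo != NULL && on_stack != NULL && vecp != NULL);
	toy_bo_lock(bo);
	count = bo->ndirty;		/* pass one: count under the lock */
	if (count > TOY_STACK_SLOTS) {
		toy_bo_unlock(bo);	/* drop the lock around the allocation */
		int *heap = fail_alloc ? NULL : malloc(count * sizeof(*heap));
		toy_bo_lock(bo);
		if (heap != NULL) {
			vec = heap;
			cap = count;
		}
	}
	/* Pass two: the list may have changed while unlocked, so stop at cap. */
	for (size_t i = 0; i < bo->ndirty && n < cap; i++)
		vec[n++] = bo->dirty[i];
	toy_bo_unlock(bo);
	*vecp = vec;
	return (n);
}
```

The capacity check in the second loop matters because the count taken in the
first pass is only a hint once the lock has been dropped.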

Re: [nfs] process locks in "bo_wwait" on 6.4

2009-06-29 Thread Attilio Rao

2009/6/29 pluknet :
> 2009/6/29 Attilio Rao :
>> 2009/6/29 pluknet :
>>> 2009/6/26 pluknet :
>>>> 2009/6/26 pluknet :
>>>>> Hello.
>>>>>
>>>>> While building a module on nfs mounted /usr/src
>>>>> I got an unkillable process waiting forever in bo_wwait.
>>>>
>>>> Small note: iface on NFS server has mtu changed from 1500 to 1450.
>>>> Can this be a source of the problem?
>>>
>>> This is 100% reproducible. Lock in the same place. Any hints?
>>
>> Can you also show the value of ps?
>> A precise map of what processes are doing would give an help.
>> Also would be useful to printout traces for other threads and not only
>> the stucked one.
>>
>
> From another run:

I'm unable to see who would be locking the buffer object in question.
Do you have INVARIANT_SUPPORT/INVARIANTS on?
What revision of /usr/src/sys/kern/vfs_bio.c are you running with?

Thanks,
Attilio



Re: [nfs] process locks in "bo_wwait" on 6.4

2009-06-29 Thread Attilio Rao

2009/6/29 pluknet :
> 2009/6/26 pluknet :
>> 2009/6/26 pluknet :
>>> Hello.
>>>
>>> While building a module on nfs mounted /usr/src
>>> I got an unkillable process waiting forever in bo_wwait.
>>
>> Small note: iface on NFS server has mtu changed from 1500 to 1450.
>> Can this be a source of the problem?
>
> This is 100% reproducible. Lock in the same place. Any hints?

Can you also show the output of ps?
A precise map of what the processes are doing would help.
It would also be useful to print traces for the other threads, not only
the stuck one.

Thanks,
Attilio



Re: Big problem still remains with 7.2-STABLE locking up

2009-06-09 Thread Attilio Rao

2009/6/10 NAKAJI Hiroyuki :
> Thanks Attilio,
>
> I set up dcons target/host pair. Target is 7.2-STABLE and host is
> 6.4-STABLE.
>
> Dcons session was recorded with script.
> http://www.heimat.gr.jp/localhost/dcons.log

I'm following up privately with the user, news to come hopefully.

Thanks,
Attilio



Re: Big problem still remains with 7.2-STABLE locking up

2009-06-06 Thread Attilio Rao

2009/6/6 NAKAJI Hiroyuki :
> Hi,
>
> I noticed, some months ago, frequent lockups on my RELENG_6 server with
> ECS PM800-M2, Celeron 2.6GHz (UP), 2GB ram, ATA HDDs and 3Com NIC(xl0),
> and then I gave up this old server.
>
> Last month, I replaced this 'unstable' server to the new one with
> 7.2-RELEASE which worked very well until I setup it as 'a server'. The
> problem began just after it started 'the services'.
>
> My story is very similar to Pete's.
> http://lists.freebsd.org/pipermail/freebsd-stable/2009-January/047487.html
>
> I followed some instructions in the list thread. But unfortunately, the
> big problem still remains. 7.2-STABLE server locks up frequently.
>
> Help! :-(
>
> The server is NEC Express5800 S70/SD.
>
> o CPU: Intel(R) Celeron(R) CPU 440 @ 2.00GHz (2280.25-MHz K8-class CPU)
> o 6GB RAM
> o ACPI APIC Table: 
> o 80GB and 250GB SATA HDDs
> o http://www.heimat.gr.jp/~nakaji/localhost/dmesg.boot
>
> The kernel configuration is:
>
> include GENERIC
> ident   HEIMAT
> options MSGBUF_SIZE=81920
> makeoptions     DEBUG=-g
> options KDB
> options DDB
> options BREAK_TO_DEBUGGER
> options QUOTA

Were you unmounting any of the QUOTA'ed filesystems?
I'm aware of a possible deadlock between the quota and unmount paths, which
is very difficult to trigger though.

Anyway, the only way we have to debug this is with some help
from the user.
1) Drop the option WITNESS_SKIPSPIN (as we would like to debug
spinlocks too) and LOCK_PROFILING (in order to create higher
contention and kill some barriers)
2) Once you get the deadlock, break into the DDB debugger
3) Once you are in DDB, the information that could be very useful is:
db> show allpcpu
db> show alllocks
db> show lockedvnods
db> ps
db> allthreads

Note that this is a lot of output, so you won't be able to collect
all of this information without a serial connection.
4) Dump the content so that we can look further at the state of the lock
structures once we identify something useful (ideally, keeping the machine
up in DDB for that would be very useful, but that is often not viable)

Let me know.
Attilio



Re: kern/130330: [mpt] [panic] Panic and reboot machine MPT ...

2009-05-21 Thread Attilio Rao

2009/5/21 Riccardo Torrini :
> On Wed, May 20, 2009 at 10:21:23AM -0400, John Baldwin wrote:
>
>> Try this.  It reverts the single-CCB part of the previous
>> commit while keeping the other fixes.  I missed that the
>> CCB might still be in flight when we schedule another rescan.
>
> Applied to mpt_raid.c,v 1.15.2.1 2008/07/28 17:05:09 jhb (it
> differ only for line position but adiacent lines are the same).
> Also redone a diff -u4 to verify, recompiled, installed, and...
>
> YOO-HOO.  Now it rebuild _without_ crashing.
>
> May 20 17:39:08 horse kernel: \
>mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
>mpt0:vol0(mpt0:0:0): Status ( Enabled Re-Syncing )
>mpt0:vol0(mpt0:0:0): Low Priority Re-Sync
>mpt0:vol0(mpt0:0:0): 64461754 of 71087625 blocks remaining
>
> Let me test against a 7.2-STABLE (and even to some -CURRENT)...
>
> [some times ahead]
>
> Bad news: I removed the second disk during rebuilding and it
> still crash.  I take a screen shapshot with camera because of
> too many messages for write down by hand  :)
>
> Image, src tarball and info here (about 2.2MB):
> ftp://ftp.torrini.org/pub/FreeBSD/mpt_crash_on_rebuild/

Please try the patch here:
http://www.freebsd.org/~attilio/notify.diff

I think this approach is perfectly fine because devctl_notify()
will also "silently" fail if no memory is available.
Note that this is more a CAM "bug" than a driver one.

Thanks,
Attilio



[HEADS UP] lockmgr needing of sys/lockmgr.h on thirdy part codes

2008-05-05 Thread Attilio Rao

Hello,
after the MFC of the use of LOCK_FILE and LOCK_LINE in lockmgr(9),
third-party code now needs to include sys/lock.h just before
sys/lockmgr.h.
Even if the patch doesn't break the ABI / KPI (so third-party KLDs
don't need to be recompiled), it is worth noting that new code needs
this extra care in order to be fully compliant.

Thanks,
Attilio



Re: Lock Order Reversal on 7.0-STABLE with pf and ipfw / dummynet (traces)

2008-03-27 Thread Attilio Rao

2008/3/25, Max Laier <[EMAIL PROTECTED]>:
> Hi Alex,
>
>  so it's basically back to square one.  We only have LORs between the pfil
>  R/W lock (read instance) and mutexes that don't have any lock order with
>  the pfil R/W lock (write instance) at all.  This means the deadlock can't
>  be explained by the LORs that are reported (unless there is something I'm
>  missing).  Unless somebody who is seeing these kind of deadlocks can
>  actually break into a debugger to identify the locks at play, everything
>  else is just speculation.
>
>  I will fix the fastroute LOR with the patch you have been testing,
>  eventhough it didn't fix your problem.  For the remaining issue, we need
>  more IPFW or lock primitives knowledge (extending CC-list).
>
>  Note that the first LOR features a recursive pickup of the pfil R/W lock.
>  I remember that Attilio committed a patch to forbid this for CURRENT.
>  Could this be the cause of a deadlock?  Would it make sense to MFC
>  rm_locks and try if they hold up under this scenario?

I decided not to commit this patch to CURRENT, based on Robert's
feedback that read recursion in the network stack is (will be?)
fundamental.
In any case, it likely does not explain the deadlock.
As you point out, the best thing would be to use a machine with stock
CVS + DDB + INVARIANTS and check the state of the threads and the state
of the locks.

Thanks,
Attilio



Re: kqueue LOR

2006-12-12 Thread Attilio Rao


2006/12/12, Suleiman Souhlal <[EMAIL PROTECTED]>:

Attilio Rao wrote:
> 2006/12/12, Kostik Belousov <[EMAIL PROTECTED]>:
>
>> On Tue, Dec 12, 2006 at 12:44:54AM -0800, Suleiman Souhlal wrote:
>> > Kostik Belousov wrote:
>> > >On Sun, Nov 26, 2006 at 09:30:39AM +0100, Václav Haisman wrote:
>> > >
>> > >>Hi,
>> > >>the attached lor.txt contains LOR I got this yesterday. It is
>> FreeBSD 6.1
>> > >>with relatively recent kernel, from last week or so.
>> > >>
>> > >>--
>> > >>VH
>> > >
>> > >
>> > >>+lock order reversal:
>> > >>+ 1st 0xc537f300 kqueue (kqueue) @
>> /usr/src/sys/kern/kern_event.c:1547
>> > >>+ 2nd 0xc45c22dc struct mount mtx (struct mount mtx) @
>> > >>/usr/src/sys/ufs/ufs/ufs_vnops.c:138
>> > >>+KDB: stack backtrace:
>> > >>+kdb_backtrace(c07f9879,c45c22dc,c07fd31c,c07fd31c,c080c7b2,...) at
>> > >>kdb_backtrace+0x2f
>> > >>+witness_checkorder(c45c22dc,9,c080c7b2,8a,c07fc6bd,...) at
>> > >>witness_checkorder+0x5fe
>> > >>+_mtx_lock_flags(c45c22dc,0,c080c7b2,8a,e790ba20,...) at
>> > >>_mtx_lock_flags+0x32
>> > >>+ufs_itimes(c47a0dd0,c47a0e90,e790ba78,c060e1cc,c47a0dd0,...) at
>> > >>ufs_itimes+0x6c
>> > >>+ufs_getattr(e790ba54,e790baec,c0622af6,c0896f40,e790ba54,...) at
>> > >>ufs_getattr+0x20
>> > >>+VOP_GETATTR_APV(c0896f40,e790ba54,c08a5760,c47a0dd0,e790ba74,...) at
>> > >>VOP_GETATTR_APV+0x3a
>> > >>+filt_vfsread(c4cf261c,6,c07f445e,60b,0,...) at filt_vfsread+0x75
>> > >>+knote(c4f57114,6,1,1f30c2af,1f30c2af,...) at knote+0x75
>> > >>+VOP_WRITE_APV(c0896f40,e790bbec,c47a0dd0,227,e790bcb4,...) at
>> > >>VOP_WRITE_APV+0x148
>> > >>+vn_write(c45d5120,e790bcb4,c5802a00,0,c4b73a80,...) at
>> vn_write+0x201
>> > >>+dofilewrite(c4b73a80,1b,c45d5120,e790bcb4,,...) at
>> > >>dofilewrite+0x84
>> > >>+kern_writev(c4b73a80,1b,e790bcb4,8220c71,0,...) at kern_writev+0x65
>> > >>+write(c4b73a80,e790bd04,c,c07d899c,3,...) at write+0x4f
>> > >>+syscall(3b,3b,bfbf003b,0,bfbfeae4,...) at syscall+0x295
>> > >>+Xint0x80_syscall() at Xint0x80_syscall+0x1f
>> > >>+--- syscall (4, FreeBSD ELF32, write), eip = 0x2831d727, esp =
>> > >>0xbfbfea1c, ebp = 0xbfbfea48 ---
>> > >
>> > >
>> > >Thank you for the report. The LOR is caused by my commit into
>> > >sys/ufs/ufs/ufs_vnops.c, rev. 1.280.
>> >
>> > Is the mount lock really required, if all we're doing is a single read
>> > of a single word (mnt_kern_flag)? (v_mount should be read-only for the
>> > whole lifetime of the vnode, I believe.) After all, reads of a single
>> > word are atomic on all our supported architectures.
>> > The only situation I see where there MIGHT be problems is forced
>> > unmounts, but I think there are bigger issues with those.
>> > Sorry for noticing this email only now.
>>
>> The problem is real with snapshotting. Ignoring
>> MNTK_SUSPEND/MNTK_SUSPENDED flags (in particular, reading stale value of
>> mnt_kern_flag) while setting IN_MODIFIED caused deadlock at ufs vnode
>> inactivation time. This was the big trouble with nfsd and snapshots. As
>> such, I think that precise value of mnt_kern_flag is critical there,
>> and mount interlock is needed.
>
>
> This can be avoided by using a memory barrier when setting the flags.
> Even if the use of memory barriers is not encouraged, some critical code
> should really use them in place of mutex semantics (if it is worth it).

Why is memory barrier usage not encouraged? As you said, they can be used to 
reduce the number of atomic (LOCKed) operations, in some cases.


Because they can lead to "errors": it is not so straightforward to
understand when a memory barrier is needed rather than an atomic
instruction, and so on.
(Even if this does not hold for ia32, for example, on other
architectures memory barriers could be more expensive than the atomic
instruction, without counting a possible error.)
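
As a rough illustration of the "barrier instead of mutex" idea above,
here is a minimal userland sketch in C11. It is not FreeBSD kernel code:
the struct, the DEMO_* bits and the demo_* functions are all hypothetical
names, and <stdatomic.h> stands in for the kernel primitives (atomic(9)
expresses the same idea with its _acq and _rel suffixed operations). It
only shows the ordering guarantee; whether that alone is enough for the
snapshot case Kostik describes is a separate question, since a reader can
still act on a value that changes right after the load.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits, loosely named after the ones in the thread. */
#define DEMO_SUSPEND   0x1u
#define DEMO_SUSPENDED 0x2u

struct demo_mount {
    unsigned int payload;    /* plain data written before the flag */
    atomic_uint  kern_flag;  /* flag word that readers check without a mutex */
};

/*
 * Writer: store the plain data first, then set the flag with release
 * semantics, so the payload store cannot be reordered after the flag.
 */
static void
demo_set_flag(struct demo_mount *mp, unsigned int payload, unsigned int flag)
{
    mp->payload = payload;
    atomic_fetch_or_explicit(&mp->kern_flag, flag, memory_order_release);
}

/*
 * Reader: load the flag word with acquire semantics. If the flag is seen
 * as set, everything the writer stored before setting it is visible too.
 */
static bool
demo_flag_is_set(struct demo_mount *mp, unsigned int flag)
{
    unsigned int v;

    v = atomic_load_explicit(&mp->kern_flag, memory_order_acquire);
    return ((v & flag) != 0);
}
```

The point of the pairing is that the reader never takes a lock, yet a
reader that observes the flag also observes the data published before it.
Getting the pairing wrong (for example, a relaxed load on the reader side)
still compiles and usually still "works", which is the kind of subtle
error mentioned above.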


FWIW, Linux has rmb() (load mem barrier), wmb() (store mem barrier),
mb() (load/store mem barrier), smp_rmb(), smp_wmb(), smp_mb() (mem
barriers only needed on SMP), and barrier() (GCC compiler barrier,
__asm __volatile("" ::: "memory")) macros that I've personally found
very useful.
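
For readers unfamiliar with this style of standalone barrier, a small
sketch in C11 of what such macros look like in use. These are not the
Linux definitions: the demo_* names are hypothetical, the hardware fences
are approximated with atomic_thread_fence(), and the mapping is loose (an
acquire fence is somewhat stronger than a pure load barrier). The
compiler-only barrier uses GCC/Clang extended asm.

```c
#include <assert.h>
#include <stdatomic.h>

/* Compiler-only barrier: stops the compiler from reordering memory
 * accesses across this point, emits no CPU instruction. */
#define demo_barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical stand-ins for standalone hardware barriers. */
#define demo_wmb() atomic_thread_fence(memory_order_release)
#define demo_rmb() atomic_thread_fence(memory_order_acquire)
#define demo_mb()  atomic_thread_fence(memory_order_seq_cst)

static int        demo_data;   /* plain payload */
static atomic_int demo_ready;  /* flag, accessed with relaxed atomics */

/* Producer: write the payload, fence, then raise the flag. */
static void
demo_publish(int v)
{
    demo_data = v;
    demo_wmb();
    atomic_store_explicit(&demo_ready, 1, memory_order_relaxed);
}

/* Consumer: check the flag, fence, then read the payload.
 * Returns 1 and fills *out if data was published, 0 otherwise. */
static int
demo_consume(int *out)
{
    if (atomic_load_explicit(&demo_ready, memory_order_relaxed) == 0)
        return (0);
    demo_rmb();
    *out = demo_data;
    return (1);
}
```

Here the barrier is a separate statement that the programmer has to place
correctly between two ordinary accesses, rather than a property attached
to one atomic operation.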


I think that our memory barriers reflect the way we use them in the
kernel, as the base for building synchronization primitives. From this
point of view our atomic operations (meant in the widest possible
sense, see man 9 atomic) are more suitable than having something like
Linux's smp_*().

Attilio


--
Peace can only be achieved by understanding - A. Einstein
