Re: NFS-exported ZFS instability

2013-02-03 Thread Andriy Gapon
on 30/01/2013 00:44 Andriy Gapon said the following:
 on 29/01/2013 23:44 Hiroki Sato said the following:
   http://people.allbsd.org/~hrs/FreeBSD/pool-20130130.txt
   http://people.allbsd.org/~hrs/FreeBSD/pool-20130130-info.txt
 
[snip]
 See tid 100153 (arc reclaim thread), tid 100105 (pagedaemon) and tid 100639
 (nfsd in kmem_back).
 

I decided to write a few more words about this issue.

I think that the root cause of the problem is that the ZFS ARC code performs
memory allocations with M_WAITOK while holding some ARC lock(s).

If a thread runs into such an allocation when a system is very low on memory
(even for a very short period of time), then the thread is going to be blocked
(put to sleep, to be exact) in VM_WAIT until a certain amount of memory is
freed - to be more precise, until v_free_count + v_cache_count goes above
v_free_min.
And quoting from the report:
db> show page
cnt.v_free_count: 8842
cnt.v_cache_count: 0
cnt.v_inactive_count: 0
cnt.v_active_count: 169
cnt.v_wire_count: 6081952
cnt.v_free_reserved: 7981
cnt.v_free_min: 38435
cnt.v_free_target: 161721
cnt.v_cache_min: 161721
cnt.v_inactive_target: 242581
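
To illustrate, here is a minimal C sketch of the wakeup condition described
above, evaluated with the numbers from the report.  This is not the actual
sys/vm code, and vm_wait_would_wake() is a made-up helper name:

/*
 * A thread sleeping in VM_WAIT is only woken once free plus cached pages
 * exceed v_free_min.
 */
static int
vm_wait_would_wake(u_int free_count, u_int cache_count, u_int free_min)
{
        return (free_count + cache_count > free_min);
}

/*
 * With the values above: 8842 + 0 > 38435 is false, so thread 100639 keeps
 * sleeping even though ~35MB (8842 pages of 4KB) are nominally free.
 */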

In this case tid 100639 is the thread:
Tracing command nfsd pid 961 tid 100639 td 0xfe0027038920
sched_switch() at sched_switch+0x17a/frame 0xff86ca5c9c80
mi_switch() at mi_switch+0x1f8/frame 0xff86ca5c9cd0
sleepq_switch() at sleepq_switch+0x123/frame 0xff86ca5c9d00
sleepq_wait() at sleepq_wait+0x4d/frame 0xff86ca5c9d30
_sleep() at _sleep+0x3d4/frame 0xff86ca5c9dc0
kmem_back() at kmem_back+0x1a3/frame 0xff86ca5c9e50
kmem_malloc() at kmem_malloc+0x1f8/frame 0xff86ca5c9ea0
uma_large_malloc() at uma_large_malloc+0x4a/frame 0xff86ca5c9ee0
malloc() at malloc+0x14d/frame 0xff86ca5c9f20
arc_get_data_buf() at arc_get_data_buf+0x1f4/frame 0xff86ca5c9f60
arc_read_nolock() at arc_read_nolock+0x208/frame 0xff86ca5ca010
arc_read() at arc_read+0x93/frame 0xff86ca5ca090
dbuf_read() at dbuf_read+0x452/frame 0xff86ca5ca150
dmu_buf_hold_array_by_dnode() at dmu_buf_hold_array_by_dnode+0x16a/frame
0xff86ca5ca1e0
dmu_buf_hold_array() at dmu_buf_hold_array+0x67/frame 0xff86ca5ca240
dmu_read_uio() at dmu_read_uio+0x3f/frame 0xff86ca5ca2a0
zfs_freebsd_read() at zfs_freebsd_read+0x3e9/frame 0xff86ca5ca3b0
nfsvno_read() at nfsvno_read+0x2db/frame 0xff86ca5ca490
nfsrvd_read() at nfsrvd_read+0x3ff/frame 0xff86ca5ca710
nfsrvd_dorpc() at nfsrvd_dorpc+0xc9/frame 0xff86ca5ca910
nfssvc_program() at nfssvc_program+0x5da/frame 0xff86ca5caaa0
svc_run_internal() at svc_run_internal+0x5fb/frame 0xff86ca5cabd0
svc_thread_start() at svc_thread_start+0xb/frame 0xff86ca5cabe0

Sleeping in VM_WAIT while holding the ARC lock(s) means that other ARC
operations may get blocked.  And pretty much all ZFS I/O goes through the ARC.
So that's why we see all those stuck nfsd threads.

Another factor greatly contributing to the problem is that currently the page
daemon blocks (sleeps) in arc_lowmem (a vm_lowmem hook) waiting for the ARC
reclaim thread to make a pass.  This happens before the page daemon makes its
own pageout pass.

But because tid 100639 holds the ARC lock(s), the ARC reclaim thread gets blocked
and cannot make any forward progress.  Thus the page daemon also gets blocked
and cannot free up any pages.


So, this situation is not a true deadlock.  For example, it is theoretically
possible that some other threads would free some memory of their own accord and
the condition would clear up.  But in practice this is highly unlikely.

Some possible resolutions that I can think of.

The best one is probably doing ARC memory allocations without holding any locks.

Also, maybe we should make a rule that no vm_lowmem hooks should sleep.  That
is, arc_lowmem should signal the ARC reclaim thread to do some work, but should
not wait on it.
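
A minimal sketch of what such a non-sleeping hook could look like (the flag,
lock and condition-variable names are placeholders, not the actual ZFS code):

/*
 * vm_lowmem hook that only requests a reclaim pass and wakes the ARC
 * reclaim thread, without waiting for the pass to complete.
 */
static void
arc_lowmem(void *arg __unused, int howmuch __unused)
{
        mutex_enter(&arc_reclaim_thr_lock);
        needfree = 1;                           /* ask for a reclaim pass */
        cv_signal(&arc_reclaim_thr_cv);         /* wake the reclaim thread */
        mutex_exit(&arc_reclaim_thr_lock);
        /* No cv_wait() here, so the page daemon never blocks on the ARC. */
}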

Perhaps we could also provide a mechanism to mark certain memory allocations as
special and use that mechanism for ARC allocations, so that VM_WAIT unblocks
sooner: in this case we had 8842 free pages (~35MB), but thread 100639 was not
woken up.

I think that ideally we should do something about all three directions.
But even one of them might turn out to be sufficient.
As I've said, the first one seems to be the most promising, but it would require
some tricky programming (flags and retries?) to move memory allocations out of
locked sections.
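
For illustration, a sketch of what such a retry loop could look like;
hash_lock, the b_data field and M_ARCDATA are placeholders rather than the
real ARC names:

/*
 * Drop the lock around the sleeping allocation and re-validate the state
 * after reacquiring it.
 */
static void
arc_ensure_data(arc_buf_t *buf, size_t size)
{
        void *data;

        mtx_lock(&hash_lock);
        while (buf->b_data == NULL) {
                mtx_unlock(&hash_lock);
                data = malloc(size, M_ARCDATA, M_WAITOK); /* may sleep safely */
                mtx_lock(&hash_lock);
                if (buf->b_data == NULL) {
                        buf->b_data = data;     /* install our buffer */
                } else {
                        /* Lost the race while the lock was dropped. */
                        mtx_unlock(&hash_lock);
                        free(data, M_ARCDATA);
                        mtx_lock(&hash_lock);
                }
        }
        mtx_unlock(&hash_lock);
}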
-- 
Andriy Gapon


Re: NFS-exported ZFS instability

2013-02-03 Thread Rick Macklem
Andriy Gapon wrote:
 on 30/01/2013 00:44 Andriy Gapon said the following:
  on 29/01/2013 23:44 Hiroki Sato said the following:
http://people.allbsd.org/~hrs/FreeBSD/pool-20130130.txt
http://people.allbsd.org/~hrs/FreeBSD/pool-20130130-info.txt
 
 [snip]
  See tid 100153 (arc reclaim thread), tid 100105 (pagedaemon) and tid
  100639
  (nfsd in kmem_back).
 
 
 I decided to write a few more words about this issue.
 
 I think that the root cause of the problem is that ZFS ARC code
 performs memory
 allocations with M_WAITOK while holding some ARC lock(s).
 
 If a thread runs into such an allocation when a system is very low on
 memory
 (even for a very short period of time), then the thread is going to be
 blocked
 (to sleep in more exact terms) in VM_WAIT until a certain amount of
 memory is
 freed. To be more precise until v_free_count + v_cache_count goes
 above v_free_min.
 And quoting from the report:
 db show page
 cnt.v_free_count: 8842
 cnt.v_cache_count: 0
 cnt.v_inactive_count: 0
 cnt.v_active_count: 169
 cnt.v_wire_count: 6081952
 cnt.v_free_reserved: 7981
 cnt.v_free_min: 38435
 cnt.v_free_target: 161721
 cnt.v_cache_min: 161721
 cnt.v_inactive_target: 242581
 
 In this case tid 100639 is the thread:
 Tracing command nfsd pid 961 tid 100639 td 0xfe0027038920
 sched_switch() at sched_switch+0x17a/frame 0xff86ca5c9c80
 mi_switch() at mi_switch+0x1f8/frame 0xff86ca5c9cd0
 sleepq_switch() at sleepq_switch+0x123/frame 0xff86ca5c9d00
 sleepq_wait() at sleepq_wait+0x4d/frame 0xff86ca5c9d30
 _sleep() at _sleep+0x3d4/frame 0xff86ca5c9dc0
 kmem_back() at kmem_back+0x1a3/frame 0xff86ca5c9e50
 kmem_malloc() at kmem_malloc+0x1f8/frame 0xff86ca5c9ea0
 uma_large_malloc() at uma_large_malloc+0x4a/frame 0xff86ca5c9ee0
 malloc() at malloc+0x14d/frame 0xff86ca5c9f20
 arc_get_data_buf() at arc_get_data_buf+0x1f4/frame 0xff86ca5c9f60
 arc_read_nolock() at arc_read_nolock+0x208/frame 0xff86ca5ca010
 arc_read() at arc_read+0x93/frame 0xff86ca5ca090
 dbuf_read() at dbuf_read+0x452/frame 0xff86ca5ca150
 dmu_buf_hold_array_by_dnode() at
 dmu_buf_hold_array_by_dnode+0x16a/frame
 0xff86ca5ca1e0
 dmu_buf_hold_array() at dmu_buf_hold_array+0x67/frame
 0xff86ca5ca240
 dmu_read_uio() at dmu_read_uio+0x3f/frame 0xff86ca5ca2a0
 zfs_freebsd_read() at zfs_freebsd_read+0x3e9/frame 0xff86ca5ca3b0
 nfsvno_read() at nfsvno_read+0x2db/frame 0xff86ca5ca490
 nfsrvd_read() at nfsrvd_read+0x3ff/frame 0xff86ca5ca710
 nfsrvd_dorpc() at nfsrvd_dorpc+0xc9/frame 0xff86ca5ca910
 nfssvc_program() at nfssvc_program+0x5da/frame 0xff86ca5caaa0
 svc_run_internal() at svc_run_internal+0x5fb/frame 0xff86ca5cabd0
 svc_thread_start() at svc_thread_start+0xb/frame 0xff86ca5cabe0
 
 Sleeping in VM_WAIT while holding the ARC lock(s) means that other ARC
 operations may get blocked. And pretty much all ZFS I/O goes through
 the ARC.
 So that's why we see all those stuck nfsd threads.
 
 Another factor greatly contributing to the problem is that currently
 the page
 daemon blocks (sleeps) in arc_lowmem (a vm_lowmem hook) waiting for
 the ARC
 reclaim thread to make a pass. This happens before the page daemon
 makes its
 own pageout pass.
 
 But because tid 100639 holds the ARC lock(s), ARC reclaim thread gets
 blocked
 and can not make any forward progress. Thus the page daemon also gets
 blocked.
 And thus the page daemon can not free up any pages.
 
 
 So, this situation is not a true deadlock. E.g. it is theoretically
 possible
 that some other threads would free some memory at their own will and
 the
 condition would clear up. But in practice this is highly unlikely.
 
 Some possible resolutions that I can think of.
 
 The best one is probably doing ARC memory allocations without holding
 any locks.
 
 Also, maybe we should make a rule that no vm_lowmem hooks should
 sleep. That
 is, arc_lowmem should signal the ARC reclaim thread to do some work,
 but should
 not wait on it.
 
 Perhaps we could also provide a mechanism to mark certain memory
 allocations as
 special and use that mechanism for ARC allocations. So that VM_WAIT
 unblocks
 sooner: in this case we had 8842 free pages (~35MB), but thread 100639
 was not
 woken up.
 
 I think that ideally we should do something about all three
 directions.
 But even one of them might turn out to be sufficient.
 As I've said, the first one seems to be the most promising, but it
 would require
 some tricky programming (flags and retries?) to move memory
 allocations out of
 locked sections.

For the NFSv4 stuff, I pre-allocate any structures that I might need
using malloc(..M_WAITOK) before going into the locked region. If I
don't need them, I just free() them at the end. (I assign the
allocation to newp and set newp to NULL if it is used. If newp != NULL
at the end, then free(newp..);)
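
In code, the pattern looks roughly like this (a sketch only; the structure,
lock and helper names are stand-ins, not the actual NFSv4 code):

struct entry *newp;

newp = malloc(sizeof(*newp), M_TEMP, M_WAITOK | M_ZERO); /* may sleep here */
mtx_lock(&state_lock);
if (need_new_entry()) {
        insert_entry(newp);     /* ownership passes to the list */
        newp = NULL;            /* mark it as used */
}
mtx_unlock(&state_lock);
if (newp != NULL)
        free(newp, M_TEMP);     /* allocated but not needed after all */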

This avoids all the "go back and retry after doing an allocation"
complexity. (One of the big names, maybe Dykstra, had a name 

Re: NFS-exported ZFS instability

2013-01-30 Thread Andriy Gapon
on 30/01/2013 01:06 Rick Macklem said the following:
 Andriy Gapon wrote:
 on 29/01/2013 23:44 Hiroki Sato said the following:
   http://people.allbsd.org/~hrs/FreeBSD/pool-20130130.txt
   http://people.allbsd.org/~hrs/FreeBSD/pool-20130130-info.txt

 I recognize here a ZFS ARC deadlock that should have been prevented by
 r241773
 and its MFCs (r242858 for 9, r242859 for 8).

 Unfortunately, pool-20130130-info.txt shows a kernel built from r244417,
 unless I somehow misread it.

You are right.  I slightly misdiagnosed the problem - it's not the same, but a
slightly different problem.  So it has almost the same cause, but r241773
didn't handle this situation.

Basically:
- a thread goes into ARC, acquires some ARC lock and then calls malloc(M_WAITOK)
- there is a page shortage, so the thread ends up in VM_WAIT() waiting on 
pagedaemon
- pagedaemon synchronously invokes lowmem hook
- the ARC hook sleeps waiting on ARC reclaim thread to make a pass
- ARC reclaim thread is blocked on the ARC lock held by the original thread

My conclusion: ARC lowmem hook should never wait on ARC reclaim thread.  At
least as long as the ARC code calls malloc(M_WAITOK) while holding locks.

Perhaps the root cause here is that we treat both KM_PUSHPAGE and KM_SLEEP as
M_WAITOK.  We do not seem to have an equivalent of KM_PUSHPAGE?
Perhaps resurrected M_USE_RESERVE could serve this role?

Quote:
A small pool of reserved memory is available to allow the system to progress
toward the goal of freeing additional memory while in a low memory situation.
The KM_PUSHPAGE flag enables use of this reserved memory pool on an allocation.
This flag can be used by drivers that implement strategy(9E) on memory
allocations associated with a single I/O operation. The driver guarantees that
the I/O operation will complete (or timeout) and, on completion, that the memory
will be returned. The KM_PUSHPAGE flag should be used only in kmem_cache_alloc()
calls. All allocations from a given cache should be consistent in their use of
the flag. A driver that adheres to these restrictions can guarantee progress in
a low memory situation without resorting to complex private allocation and
queuing schemes. If KM_PUSHPAGE is specified, KM_SLEEP can also be used without
causing deadlock.


But please note how the Solaris API allows using KM_PUSHPAGE together with
KM_SLEEP; I am not sure what's going on under the hood in that case.
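
For illustration, a sketch of how the flag translation could look if
M_USE_RESERVE were brought back for this purpose (this is not the current
opensolaris compat code, which maps both KM_SLEEP and KM_PUSHPAGE to M_WAITOK):

static inline int
kmem_flags_convert(int kmflags)
{
        int mflags;

        mflags = (kmflags & KM_NOSLEEP) ? M_NOWAIT : M_WAITOK;
        if (kmflags & KM_PUSHPAGE)
                mflags |= M_USE_RESERVE;        /* may dip into the reserve */
        return (mflags);
}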

 See tid 100153 (arc reclaim thread), tid 100105 (pagedaemon) and tid
 100639
 (nfsd in kmem_back).

 --
 Andriy Gapon


-- 
Andriy Gapon


Re: NFS-exported ZFS instability

2013-01-30 Thread Rick Macklem
Andriy Gapon wrote:
 on 30/01/2013 01:06 Rick Macklem said the following:
  Andriy Gapon wrote:
  on 29/01/2013 23:44 Hiroki Sato said the following:
http://people.allbsd.org/~hrs/FreeBSD/pool-20130130.txt
http://people.allbsd.org/~hrs/FreeBSD/pool-20130130-info.txt
 
  I recognize here a ZFS ARC deadlock that should have been prevented
  by
  r241773
  and its MFCs (r242858 for 9, r242859 for 8).
 
  Unfortunately, pool-20130130-info.txt shows a kernel built from
  r244417,
  unless I somehow misread it.
 
 You are right. I slightly misdiagnosed the problem - it's not the
 same, but a
 slightly different problem. So it has almost the same cause, but
 r241773
 didn't handle this situation.
 
 Basically:
 - a thread goes into ARC, acquires some ARC lock and then calls
 malloc(M_WAITOK)
 - there is a page shortage, so the thread ends up in VM_WAIT() waiting
 on pagedaemon
 - pagedaemon synchronously invokes lowmem hook
 - the ARC hook sleeps waiting on ARC reclaim thread to make a pass
 - ARC reclaim thread is blocked on the ARC lock held by the original
 thread
 
 My conclusion: ARC lowmem hook should never wait on ARC reclaim
 thread. At
 least as long as the ARC code calls malloc(M_WAITOK) while holding
 locks.
 
 Perhaps the root cause here is that we treat both KM_PUSHPAGE and
 KM_SLEEP as
 M_WAITOK. We do not seem to have an equivalent of KM_PUSHPAGE?
 Perhaps resurrected M_USE_RESERVE could serve this role?
 
Good work figuring this out! Obviously, better folk than I will
have to figure out how to fix this.

Good luck with it, rick
ps: Having some special place malloc() can go to for critical allocations
sounds like a good plan to me. Possibly have malloc() follow the
M_NOWAIT path and then go to this area when M_NOWAIT fails to allocate?
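
Something along these lines, perhaps (purely a sketch; reserve_pool_alloc()
is a hypothetical function, not an existing API):

void *
malloc_critical(size_t size, struct malloc_type *type)
{
        void *p;

        p = malloc(size, type, M_NOWAIT);       /* fast path, never sleeps */
        if (p == NULL)
                p = reserve_pool_alloc(size, type); /* hypothetical reserve pool */
        return (p);
}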

 Quote:
 A small pool of reserved memory is available to allow the system to
 progress
 toward the goal of freeing additional memory while in a low memory
 situation.
 The KM_PUSHPAGE flag enables use of this reserved memory pool on an
 allocation.
 This flag can be used by drivers that implement strategy(9E) on memory
 allocations associated with a single I/O operation. The driver
 guarantees that
 the I/O operation will complete (or timeout) and, on completion, that
 the memory
 will be returned. The KM_PUSHPAGE flag should be used only in
 kmem_cache_alloc()
 calls. All allocations from a given cache should be consistent in
 their use of
 the flag. A driver that adheres to these restrictions can guarantee
 progress in
 a low memory situation without resorting to complex private allocation
 and
 queuing schemes. If KM_PUSHPAGE is specified, KM_SLEEP can also be
 used without
 causing deadlock.
 
 
 But please note how the Solaris API allows to use KM_PUSHPAGE with
 KM_SLEEP, not
 sure what's going on under the hood in that case.
 
  See tid 100153 (arc reclaim thread), tid 100105 (pagedaemon) and
  tid
  100639
  (nfsd in kmem_back).
 
  --
  Andriy Gapon
 
 
 --
 Andriy Gapon


Re: NFS-exported ZFS instability

2013-01-29 Thread Hiroki Sato
Hiroki Sato h...@freebsd.org wrote
  in 20130104.023244.472910818423317661@allbsd.org:

hr Konstantin Belousov kostik...@gmail.com wrote
hr   in 20130102174044.gb82...@kib.kiev.ua:
hr
hr ko  I might take a closer look this evening and see if I can spot anything
hr ko  in the log, rick
hr ko  ps: I hope Alan and Kostik don't mind being added to the cc list.
hr ko
hr ko What I see in the log is that the lock cascade rooted in the thread
hr ko 100838, which owns system map mutex. I believe this prevents malloc(9)
hr ko from making a progress in other threads, which e.g. own the ZFS vnode
hr ko locks. As the result, the whole system wedged.
hr ko
hr ko Looking back at the thread 100838, we can see that it executes
hr ko smp_tlb_shootdown(). It is impossible to tell from the static dump,
hr ko is the appearance of the smp_tlb_shootdown() in the backtrace is
hr ko transient, or the thread is spinning there, waiting for other CPUs to
hr ko acknowledge the request. But, since the system wedged, most likely,
hr ko smp_tlb_shootdown spins.
hr ko
hr ko Taking this hypothesis, the situation can occur, most likely, due to
hr ko some other core running with the interrupts disabled. Inspection of the
hr ko backtraces of the processes running on all cores does not show any which
hr ko could legitimately own a spinlock or otherwise run with the interrupts
hr ko disabled.
hr ko
hr ko One thing you could try to do is to enable WITNESS for the spinlocks,
hr ko to try to catch the leaked spinlock. I very much doubt that this is
hr ko the case.
hr ko
hr ko Another thing to try is to switch the CPU idle method to something
hr ko else. Look at the machdep.idle* sysctls. It could be some CPU errata
hr ko which blocks wakeup due the interrupt in some conditions in C1 ?
hr
hr  Thank you.  It can take 1-2 weeks to reproduce this, so I set
hr  debug.witness.skipspin=0 and am keeping machdep.idle=acpi, and will see
hr  how it goes for a while.  I will report again if I can get another
hr  freeze.

 Hmm, I could reproduce the same freeze when debug.witness.skipspin=0,
 too.  DDB and crash dump outputs are the following:

  http://people.allbsd.org/~hrs/FreeBSD/pool-20130130.txt
  http://people.allbsd.org/~hrs/FreeBSD/pool-20130130-info.txt

 The value of machdep.idle was acpi.  I have seen this symptom on two
 boxes with the following CPUs, so I am guessing it is not specific to
 a CPU model:

  CPU: Intel(R) Pentium(R) D CPU 3.40GHz (3391.52-MHz K8-class CPU)
  CPU: Intel(R) Xeon(R) CPU X5650  @ 2.67GHz (2666.82-MHz K8-class CPU)

-- Hiroki




Re: NFS-exported ZFS instability

2013-01-29 Thread Andriy Gapon
on 29/01/2013 23:44 Hiroki Sato said the following:
   http://people.allbsd.org/~hrs/FreeBSD/pool-20130130.txt
   http://people.allbsd.org/~hrs/FreeBSD/pool-20130130-info.txt

I recognize here a ZFS ARC deadlock that should have been prevented by r241773
and its MFCs (r242858 for 9, r242859 for 8).

See tid 100153 (arc reclaim thread), tid 100105 (pagedaemon) and tid 100639
(nfsd in kmem_back).

-- 
Andriy Gapon


Re: NFS-exported ZFS instability

2013-01-29 Thread Rick Macklem
Andriy Gapon wrote:
 on 29/01/2013 23:44 Hiroki Sato said the following:
http://people.allbsd.org/~hrs/FreeBSD/pool-20130130.txt
http://people.allbsd.org/~hrs/FreeBSD/pool-20130130-info.txt
 
 I recognize here a ZFS ARC deadlock that should have been prevented by
 r241773
 and its MFCs (r242858 for 9, r242859 for 8).
 
Unfortunately, pool-20130130-info.txt shows a kernel built from r244417,
unless I somehow misread it.

rick

 See tid 100153 (arc reclaim thread), tid 100105 (pagedaemon) and tid
 100639
 (nfsd in kmem_back).
 
 --
 Andriy Gapon


Re: NFS-exported ZFS instability

2013-01-03 Thread Hiroki Sato
Rick Macklem rmack...@uoguelph.ca wrote
  in 1914428061.1617223.1357133079421.javamail.r...@erie.cs.uoguelph.ca:

rm Hiroki Sato wrote:
rm  Hello,
rm 
rm  I have been in a trouble about my NFS server for a long time. The
rm  symptom is that it stops working in one or two weeks after a boot. I
rm  could not track down the cause yet, but it is reproducible and only
rm  occurred under a very high I/O load.
rm 
rm  It did not panic, just stopped working---while it responded to ping,
rm  userland programs seemed not working. I could break it into DDB and
rm  get a kernel dump. The following URLs are a log of ps, trace, and
rm  etc.:
rm 
rm  http://people.allbsd.org/~hrs/FreeBSD/pool.log.20130102
rm  http://people.allbsd.org/~hrs/FreeBSD/pool.dmesg.20130102
rm 
rm  Does anyone see how to debug this? I guess this is due to a deadlock
rm  somewhere. I have suffered from this problem for almost two years.
rm  The above log is from stable/9 as of Dec 19, but this have persisted
rm  since 8.X.
rm 
rm Well, I took a quick glance at the log and there are a lot of processes
rm sleeping on pfault (in vm_waitpfault() in sys/vm/vm_page.c). I'm no
rm vm guy, so I'm not sure when/why that will happen. The comment on the
rm function suggests they are waiting for free pages.
rm
rm Maybe something as simple as running out of swap space or a problem
rm talking to the disk(s) that has the swap partition(s) or ???
rm (I'm talking through my hat here, because I'm not conversant with
rm  the vm side of things.)
rm
rm I might take a closer look this evening and see if I can spot anything
rm in the log, rick
rm ps: I hope Alan and Kostik don't mind being added to the cc list.

 Thank you.  This machine has 24GB RAM + 30GB swap.  16GB of that is
 used for the ZFS ARC, and I can see 1.5GB of free space on average.
 However, frequent swapouts happen on a regular basis even when the
 I/O load is low.  The amount of swap used was only 20-30MB
 regardless of the load.

 I checked vm.stats and the output of vmstat -z/-m every 10 seconds
 several times until the freeze, but vm.stats.vm.v_free_count was around
 300,000 (1GB) even just before the freeze.

-- Hiroki




Re: NFS-exported ZFS instability

2013-01-03 Thread Hiroki Sato
Konstantin Belousov kostik...@gmail.com wrote
  in 20130102174044.gb82...@kib.kiev.ua:

ko  I might take a closer look this evening and see if I can spot anything
ko  in the log, rick
ko  ps: I hope Alan and Kostik don't mind being added to the cc list.
ko
ko What I see in the log is that the lock cascade rooted in the thread
ko 100838, which owns system map mutex. I believe this prevents malloc(9)
ko from making a progress in other threads, which e.g. own the ZFS vnode
ko locks. As the result, the whole system wedged.
ko
ko Looking back at the thread 100838, we can see that it executes
ko smp_tlb_shootdown(). It is impossible to tell from the static dump,
ko is the appearance of the smp_tlb_shootdown() in the backtrace is
ko transient, or the thread is spinning there, waiting for other CPUs to
ko acknowledge the request. But, since the system wedged, most likely,
ko smp_tlb_shootdown spins.
ko
ko Taking this hypothesis, the situation can occur, most likely, due to
ko some other core running with the interrupts disabled. Inspection of the
ko backtraces of the processes running on all cores does not show any which
ko could legitimately own a spinlock or otherwise run with the interrupts
ko disabled.
ko
ko One thing you could try to do is to enable WITNESS for the spinlocks,
ko to try to catch the leaked spinlock. I very much doubt that this is
ko the case.
ko
ko Another thing to try is to switch the CPU idle method to something
ko else. Look at the machdep.idle* sysctls. It could be some CPU errata
ko which blocks wakeup due the interrupt in some conditions in C1 ?

 Thank you.  It can take 1-2 weeks to reproduce this, so I set
 debug.witness.skipspin=0 and am keeping machdep.idle=acpi, and will see
 how it goes for a while.  I will report again if I can get another
 freeze.

-- Hiroki




Re: NFS-exported ZFS instability

2013-01-02 Thread Rick Macklem
Hiroki Sato wrote:
 Hello,
 
 I have been in a trouble about my NFS server for a long time. The
 symptom is that it stops working in one or two weeks after a boot. I
 could not track down the cause yet, but it is reproducible and only
 occurred under a very high I/O load.
 
 It did not panic, just stopped working---while it responded to ping,
 userland programs seemed not working. I could break it into DDB and
 get a kernel dump. The following URLs are a log of ps, trace, and
 etc.:
 
 http://people.allbsd.org/~hrs/FreeBSD/pool.log.20130102
 http://people.allbsd.org/~hrs/FreeBSD/pool.dmesg.20130102
 
 Does anyone see how to debug this? I guess this is due to a deadlock
 somewhere. I have suffered from this problem for almost two years.
 The above log is from stable/9 as of Dec 19, but this have persisted
 since 8.X.
 
Well, I took a quick glance at the log and there are a lot of processes
sleeping on pfault (in vm_waitpfault() in sys/vm/vm_page.c). I'm no
vm guy, so I'm not sure when/why that will happen. The comment on the
function suggests they are waiting for free pages.

Maybe something as simple as running out of swap space or a problem
talking to the disk(s) that has the swap partition(s) or ???
(I'm talking through my hat here, because I'm not conversant with
 the vm side of things.)

I might take a closer look this evening and see if I can spot anything
in the log, rick
ps: I hope Alan and Kostik don't mind being added to the cc list.

 -- Hiroki


Re: NFS-exported ZFS instability

2013-01-02 Thread Konstantin Belousov
On Wed, Jan 02, 2013 at 08:24:39AM -0500, Rick Macklem wrote:
 Hiroki Sato wrote:
  Hello,
  
  I have been in a trouble about my NFS server for a long time. The
  symptom is that it stops working in one or two weeks after a boot. I
  could not track down the cause yet, but it is reproducible and only
  occurred under a very high I/O load.
  
  It did not panic, just stopped working---while it responded to ping,
  userland programs seemed not working. I could break it into DDB and
  get a kernel dump. The following URLs are a log of ps, trace, and
  etc.:
  
  http://people.allbsd.org/~hrs/FreeBSD/pool.log.20130102
  http://people.allbsd.org/~hrs/FreeBSD/pool.dmesg.20130102
  
  Does anyone see how to debug this? I guess this is due to a deadlock
  somewhere. I have suffered from this problem for almost two years.
  The above log is from stable/9 as of Dec 19, but this have persisted
  since 8.X.
  
 Well, I took a quick glance at the log and there are a lot of processes
 sleeping on pfault (in vm_waitpfault() in sys/vm/vm_page.c). I'm no
 vm guy, so I'm not sure when/why that will happen. The comment on the
 function suggests they are waiting for free pages.
 
 Maybe something as simple as running out of swap space or a problem
 talking to the disk(s) that has the swap partition(s) or ???
 (I'm talking through my hat here, because I'm not conversant with
  the vm side of things.)
 
 I might take a closer look this evening and see if I can spot anything
 in the log, rick
 ps: I hope Alan and Kostik don't mind being added to the cc list.

What I see in the log is a lock cascade rooted in thread 100838, which
owns the system map mutex. I believe this prevents malloc(9) from making
progress in other threads, which e.g. own the ZFS vnode locks. As a
result, the whole system wedged.

Looking back at thread 100838, we can see that it executes
smp_tlb_shootdown(). It is impossible to tell from the static dump
whether the appearance of smp_tlb_shootdown() in the backtrace is
transient, or whether the thread is spinning there, waiting for other
CPUs to acknowledge the request. But since the system wedged, most
likely smp_tlb_shootdown() spins.

Under this hypothesis, the situation most likely occurs due to some
other core running with interrupts disabled. Inspection of the
backtraces of the processes running on all cores does not show any that
could legitimately own a spinlock or otherwise run with interrupts
disabled.

One thing you could try is enabling WITNESS for the spinlocks, to try
to catch a leaked spinlock. I very much doubt that this is the case,
though.

Another thing to try is switching the CPU idle method to something
else; look at the machdep.idle* sysctls. It could be some CPU erratum
which blocks wakeup by an interrupt under some conditions in C1?




Re: NFS-exported ZFS instability

2013-01-01 Thread Perry Hutchison
Hiroki Sato h...@freebsd.org wrote:

  I have been in a trouble about my NFS server for a long time.
  The symptom is that it stops working in one or two weeks after
  a boot ...  It did not panic, just stopped working---while it
  responded to ping, userland programs seemed not working ...

  Does anyone see how to debug this?  I guess this is due to a
  deadlock somewhere ...

If you can afford the overhead, you could try running with some
of the kernel debug options enabled (e.g. WITNESS, INVARIANTS,
MUTEX_DEBUG).  See conf/NOTES for descriptions.
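
For reference, a kernel config fragment for the options mentioned above
(check sys/conf/NOTES on your branch for which of these are available and
what they cost):

options         WITNESS
options         INVARIANTS
options         INVARIANT_SUPPORT       # companion to INVARIANTS
options         MUTEX_DEBUG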