Re: NFS-exported ZFS instability

2013-02-03 Thread Rick Macklem
Andriy Gapon wrote:
> on 30/01/2013 00:44 Andriy Gapon said the following:
> > on 29/01/2013 23:44 Hiroki Sato said the following:
> >>   http://people.allbsd.org/~hrs/FreeBSD/pool-20130130.txt
> >>   http://people.allbsd.org/~hrs/FreeBSD/pool-20130130-info.txt
> >
> [snip]
> > See tid 100153 (arc reclaim thread), tid 100105 (pagedaemon) and tid
> > 100639
> > (nfsd in kmem_back).
> >
> 
> I decided to write a few more words about this issue.
> 
> I think that the root cause of the problem is that ZFS ARC code
> performs memory
> allocations with M_WAITOK while holding some ARC lock(s).
> 
> If a thread runs into such an allocation when a system is very low on
> memory
> (even for a very short period of time), then the thread is going to be
> blocked
> (to sleep in more exact terms) in VM_WAIT until a certain amount of
> memory is
> freed. To be more precise until v_free_count + v_cache_count goes
> above v_free_min.
> And quoting from the report:
> db> show page
> cnt.v_free_count: 8842
> cnt.v_cache_count: 0
> cnt.v_inactive_count: 0
> cnt.v_active_count: 169
> cnt.v_wire_count: 6081952
> cnt.v_free_reserved: 7981
> cnt.v_free_min: 38435
> cnt.v_free_target: 161721
> cnt.v_cache_min: 161721
> cnt.v_inactive_target: 242581
> 
> In this case tid 100639 is the thread:
> Tracing command nfsd pid 961 tid 100639 td 0xfe0027038920
> sched_switch() at sched_switch+0x17a/frame 0xff86ca5c9c80
> mi_switch() at mi_switch+0x1f8/frame 0xff86ca5c9cd0
> sleepq_switch() at sleepq_switch+0x123/frame 0xff86ca5c9d00
> sleepq_wait() at sleepq_wait+0x4d/frame 0xff86ca5c9d30
> _sleep() at _sleep+0x3d4/frame 0xff86ca5c9dc0
> kmem_back() at kmem_back+0x1a3/frame 0xff86ca5c9e50
> kmem_malloc() at kmem_malloc+0x1f8/frame 0xff86ca5c9ea0
> uma_large_malloc() at uma_large_malloc+0x4a/frame 0xff86ca5c9ee0
> malloc() at malloc+0x14d/frame 0xff86ca5c9f20
> arc_get_data_buf() at arc_get_data_buf+0x1f4/frame 0xff86ca5c9f60
> arc_read_nolock() at arc_read_nolock+0x208/frame 0xff86ca5ca010
> arc_read() at arc_read+0x93/frame 0xff86ca5ca090
> dbuf_read() at dbuf_read+0x452/frame 0xff86ca5ca150
> dmu_buf_hold_array_by_dnode() at
> dmu_buf_hold_array_by_dnode+0x16a/frame
> 0xff86ca5ca1e0
> dmu_buf_hold_array() at dmu_buf_hold_array+0x67/frame
> 0xff86ca5ca240
> dmu_read_uio() at dmu_read_uio+0x3f/frame 0xff86ca5ca2a0
> zfs_freebsd_read() at zfs_freebsd_read+0x3e9/frame 0xff86ca5ca3b0
> nfsvno_read() at nfsvno_read+0x2db/frame 0xff86ca5ca490
> nfsrvd_read() at nfsrvd_read+0x3ff/frame 0xff86ca5ca710
> nfsrvd_dorpc() at nfsrvd_dorpc+0xc9/frame 0xff86ca5ca910
> nfssvc_program() at nfssvc_program+0x5da/frame 0xff86ca5caaa0
> svc_run_internal() at svc_run_internal+0x5fb/frame 0xff86ca5cabd0
> svc_thread_start() at svc_thread_start+0xb/frame 0xff86ca5cabe0
> 
> Sleeping in VM_WAIT while holding the ARC lock(s) means that other ARC
> operations may get blocked. And pretty much all ZFS I/O goes through
> the ARC.
> So that's why we see all those stuck nfsd threads.
> 
> Another factor greatly contributing to the problem is that currently
> the page
> daemon blocks (sleeps) in arc_lowmem (a vm_lowmem hook) waiting for
> the ARC
> reclaim thread to make a pass. This happens before the page daemon
> makes its
> own pageout pass.
> 
> But because tid 100639 holds the ARC lock(s), ARC reclaim thread gets
> blocked
> and can not make any forward progress. Thus the page daemon also gets
> blocked.
> And thus the page daemon can not free up any pages.
> 
> 
> So, this situation is not a true deadlock. E.g. it is theoretically
> possible
> that some other threads would free some memory at their own will and
> the
> condition would clear up. But in practice this is highly unlikely.
> 
> Some possible resolutions that I can think of.
> 
> The best one is probably doing ARC memory allocations without holding
> any locks.
> 
> Also, maybe we should make a rule that no vm_lowmem hooks should
> sleep. That
> is, arc_lowmem should signal the ARC reclaim thread to do some work,
> but should
> not wait on it.
> 
> Perhaps we could also provide a mechanism to mark certain memory
> allocations as
> "special" and use that mechanism for ARC allocations. So that VM_WAIT
> unblocks
> sooner: in this case we had 8842 free pages (~35MB), but thread 100639
> was not
> woken up.
> 
> I think that ideally we should do something about all the three
> directions.
> But even one of them might turn out to be sufficient.
> As I've said, the first one seems to be the most promising, but it
> would require
> some tricky programming (flags and retries?) to move memory
> allocations out of
> locked sections.

For the NFSv4 stuff, I pre-allocate any structures that I might need
using malloc(..M_WAITOK) before going into the locked region. If I
don't need them, I just free() them at the end. (I assign the allocation
to "newp" and set "newp" to NULL if it is used. If "newp" != NULL
at the end, then free() it.)
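
A stripped-down sketch of that pattern (purely illustrative; the structure,
the lock macros and the helper are placeholders, not the actual NFSv4 server
code):

struct nfsfoo *newp;

/* Pre-allocate while it is still safe to sleep. */
newp = malloc(sizeof(*newp), M_TEMP, M_WAITOK);
NFSLOCK();
if (need_new_entry) {
	link_in_entry(newp);	/* hand over the pre-allocated structure */
	newp = NULL;		/* mark it as consumed */
}
NFSUNLOCK();
if (newp != NULL)
	free(newp, M_TEMP);	/* turned out not to be needed */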

Re: NFS-exported ZFS instability

2013-02-03 Thread Andriy Gapon
on 30/01/2013 00:44 Andriy Gapon said the following:
> on 29/01/2013 23:44 Hiroki Sato said the following:
>>   http://people.allbsd.org/~hrs/FreeBSD/pool-20130130.txt
>>   http://people.allbsd.org/~hrs/FreeBSD/pool-20130130-info.txt
> 
[snip]
> See tid 100153 (arc reclaim thread), tid 100105 (pagedaemon) and tid 100639
> (nfsd in kmem_back).
> 

I decided to write a few more words about this issue.

I think that the root cause of the problem is that ZFS ARC code performs memory
allocations with M_WAITOK while holding some ARC lock(s).
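
In other words, the hazardous pattern looks roughly like this (a simplified
sketch only; the lock name is made up and this is not the actual arc.c code):

mutex_enter(&arc_some_lock);		/* hypothetical ARC lock */
buf = kmem_alloc(size, KM_SLEEP);	/* maps to malloc(M_WAITOK): may sleep
					   in VM_WAIT with the lock still held */
/* ... use buf ... */
mutex_exit(&arc_some_lock);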

If a thread runs into such an allocation when a system is very low on memory
(even for a very short period of time), then the thread is going to be blocked
(put to sleep, in more exact terms) in VM_WAIT until a certain amount of memory is
freed.  To be more precise, until v_free_count + v_cache_count goes above
v_free_min.
And quoting from the report:
db> show page
cnt.v_free_count: 8842
cnt.v_cache_count: 0
cnt.v_inactive_count: 0
cnt.v_active_count: 169
cnt.v_wire_count: 6081952
cnt.v_free_reserved: 7981
cnt.v_free_min: 38435
cnt.v_free_target: 161721
cnt.v_cache_min: 161721
cnt.v_inactive_target: 242581
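
Plugging the numbers above into the wakeup condition described earlier (just a
sanity check, not the literal pageout code):

	v_free_count + v_cache_count = 8842 + 0 = 8842
	v_free_min                   = 38435

8842 < 38435, so a thread sleeping in VM_WAIT is not woken up even though
roughly 35MB are nominally free.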

In this case tid 100639 is the thread:
Tracing command nfsd pid 961 tid 100639 td 0xfe0027038920
sched_switch() at sched_switch+0x17a/frame 0xff86ca5c9c80
mi_switch() at mi_switch+0x1f8/frame 0xff86ca5c9cd0
sleepq_switch() at sleepq_switch+0x123/frame 0xff86ca5c9d00
sleepq_wait() at sleepq_wait+0x4d/frame 0xff86ca5c9d30
_sleep() at _sleep+0x3d4/frame 0xff86ca5c9dc0
kmem_back() at kmem_back+0x1a3/frame 0xff86ca5c9e50
kmem_malloc() at kmem_malloc+0x1f8/frame 0xff86ca5c9ea0
uma_large_malloc() at uma_large_malloc+0x4a/frame 0xff86ca5c9ee0
malloc() at malloc+0x14d/frame 0xff86ca5c9f20
arc_get_data_buf() at arc_get_data_buf+0x1f4/frame 0xff86ca5c9f60
arc_read_nolock() at arc_read_nolock+0x208/frame 0xff86ca5ca010
arc_read() at arc_read+0x93/frame 0xff86ca5ca090
dbuf_read() at dbuf_read+0x452/frame 0xff86ca5ca150
dmu_buf_hold_array_by_dnode() at dmu_buf_hold_array_by_dnode+0x16a/frame
0xff86ca5ca1e0
dmu_buf_hold_array() at dmu_buf_hold_array+0x67/frame 0xff86ca5ca240
dmu_read_uio() at dmu_read_uio+0x3f/frame 0xff86ca5ca2a0
zfs_freebsd_read() at zfs_freebsd_read+0x3e9/frame 0xff86ca5ca3b0
nfsvno_read() at nfsvno_read+0x2db/frame 0xff86ca5ca490
nfsrvd_read() at nfsrvd_read+0x3ff/frame 0xff86ca5ca710
nfsrvd_dorpc() at nfsrvd_dorpc+0xc9/frame 0xff86ca5ca910
nfssvc_program() at nfssvc_program+0x5da/frame 0xff86ca5caaa0
svc_run_internal() at svc_run_internal+0x5fb/frame 0xff86ca5cabd0
svc_thread_start() at svc_thread_start+0xb/frame 0xff86ca5cabe0

Sleeping in VM_WAIT while holding the ARC lock(s) means that other ARC
operations may get blocked.  And pretty much all ZFS I/O goes through the ARC.
So that's why we see all those stuck nfsd threads.

Another factor greatly contributing to the problem is that currently the page
daemon blocks (sleeps) in arc_lowmem (a vm_lowmem hook) waiting for the ARC
reclaim thread to make a pass.  This happens before the page daemon makes its
own pageout pass.

But because tid 100639 holds the ARC lock(s), the ARC reclaim thread gets blocked
and cannot make any forward progress.  Thus the page daemon also gets blocked,
and so it cannot free up any pages.


So, this situation is not a true deadlock.  For example, it is theoretically
possible that some other threads would free some memory of their own accord and
the condition would clear up.  But in practice this is highly unlikely.

Here are some possible resolutions that I can think of.

The best one is probably doing ARC memory allocations without holding any locks.

Also, maybe we should make a rule that no vm_lowmem hooks should sleep.  That
is, arc_lowmem should signal the ARC reclaim thread to do some work, but should
not wait on it.
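
A rough sketch of what a non-sleeping hook could look like (this assumes the
needfree flag and the arc_reclaim_thr_lock/arc_reclaim_thr_cv pair that the
FreeBSD port already uses; it is an illustration, not a tested patch):

static void
arc_lowmem(void *arg __unused, int howto __unused)
{
	mutex_enter(&arc_reclaim_thr_lock);
	needfree = 1;				/* note the shortage */
	cv_signal(&arc_reclaim_thr_cv);		/* poke the ARC reclaim thread */
	mutex_exit(&arc_reclaim_thr_lock);
	/* Return immediately: no cv_wait()/tsleep() here, so the page
	 * daemon can proceed with its own pageout pass. */
}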

Perhaps we could also provide a mechanism to mark certain memory allocations as
"special" and use that mechanism for ARC allocations.  So that VM_WAIT unblocks
sooner: in this case we had 8842 free pages (~35MB), but thread 100639 was not
woken up.

I think that ideally we should do something about all three directions.
But even one of them might turn out to be sufficient.
As I've said, the first one seems to be the most promising, but it would require
some tricky programming (flags and retries?) to move memory allocations out of
locked sections.
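
For illustration, the "flags and retries" idea could look roughly like this (a
sketch only; the lock, the malloc type and the revalidation step are
placeholders):

buf = NULL;
retry:
	mutex_enter(&arc_some_lock);
	if (buf == NULL) {
		/* Try an allocation that cannot sleep while the lock is held. */
		buf = malloc(size, M_SOLARIS, M_NOWAIT);
		if (buf == NULL) {
			/* Drop the lock, do the sleeping allocation, start over. */
			mutex_exit(&arc_some_lock);
			buf = malloc(size, M_SOLARIS, M_WAITOK);
			goto retry;
		}
	}
	/* ... revalidate the state guarded by the lock, since it may have
	 * changed while the lock was dropped, then proceed ... */
	mutex_exit(&arc_some_lock);
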
-- 
Andriy Gapon


Re: NFS-exported ZFS instability

2013-01-30 Thread Rick Macklem
Andriy Gapon wrote:
> on 30/01/2013 01:06 Rick Macklem said the following:
> > Andriy Gapon wrote:
> >> on 29/01/2013 23:44 Hiroki Sato said the following:
> >>>   http://people.allbsd.org/~hrs/FreeBSD/pool-20130130.txt
> >>>   http://people.allbsd.org/~hrs/FreeBSD/pool-20130130-info.txt
> >>
> >> I recognize here a ZFS ARC deadlock that should have been prevented
> >> by
> >> r241773
> >> and its MFCs (r242858 for 9, r242859 for 8).
> >>
> > Unfortunately, pool-20130130-info.txt shows a kernel built from
> > r244417,
> > unless I somehow misread it.
> 
> You are right. I slightly misdiagnosed the problem - it's not the
> same, but a
> slightly different problem. So it has "almost the same" cause, but
> r241773
> didn't handle this situation.
> 
> Basically:
> - a thread goes into ARC, acquires some ARC lock and then calls
> malloc(M_WAITOK)
> - there is a page shortage, so the thread ends up in VM_WAIT() waiting
> on pagedaemon
> - pagedaemon synchronously invokes lowmem hook
> - the ARC hook sleeps waiting on ARC reclaim thread to make a pass
> - ARC reclaim thread is blocked on the ARC lock held by the original
> thread
> 
> My conclusion: ARC lowmem hook should never wait on ARC reclaim
> thread. At
> least as long as the ARC code calls malloc(M_WAITOK) while holding
> locks.
> 
> Perhaps the root cause here is that we treat both KM_PUSHPAGE and
> KM_SLEEP as
> M_WAITOK. We do not seem to have an equivalent of KM_PUSHPAGE?
> Perhaps resurrected M_USE_RESERVE could serve this role?
> 
Good work figuring this out! Obviously, better folk than I will
have to figure out how to fix this.

Good luck with it, rick
ps: Having some "special" place malloc() can go for critical allocations
sounds like a good plan to me. Possibly have malloc() follow the
M_NOWAIT path and then go to this area when M_NOWAIT fails to allocate?
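
One way to read that suggestion is as a small helper on top of malloc(), rather
than a change to malloc() itself; a hedged sketch, assuming a resurrected
M_USE_RESERVE flag (mentioned above) lets a non-sleeping allocation dip into
the reserve:

static void *
malloc_critical(size_t size, struct malloc_type *type)
{
	void *p;

	p = malloc(size, type, M_NOWAIT);		/* cheap path first */
	if (p == NULL)
		p = malloc(size, type, M_NOWAIT | M_USE_RESERVE);
	if (p == NULL)
		p = malloc(size, type, M_WAITOK);	/* last resort: may sleep in VM_WAIT */
	return (p);
}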

> Quote:
> A small pool of reserved memory is available to allow the system to
> progress
> toward the goal of freeing additional memory while in a low memory
> situation.
> The KM_PUSHPAGE flag enables use of this reserved memory pool on an
> allocation.
> This flag can be used by drivers that implement strategy(9E) on memory
> allocations associated with a single I/O operation. The driver
> guarantees that
> the I/O operation will complete (or timeout) and, on completion, that
> the memory
> will be returned. The KM_PUSHPAGE flag should be used only in
> kmem_cache_alloc()
> calls. All allocations from a given cache should be consistent in
> their use of
> the flag. A driver that adheres to these restrictions can guarantee
> progress in
> a low memory situation without resorting to complex private allocation
> and
> queuing schemes. If KM_PUSHPAGE is specified, KM_SLEEP can also be
> used without
> causing deadlock.
> 
> 
> But please note how the Solaris API allows to use KM_PUSHPAGE with
> KM_SLEEP, not
> sure what's going on under the hood in that case.
> 
> >> See tid 100153 (arc reclaim thread), tid 100105 (pagedaemon) and
> >> tid
> >> 100639
> >> (nfsd in kmem_back).
> >>
> >> --
> >> Andriy Gapon
> 
> 
> --
> Andriy Gapon


Re: NFS-exported ZFS instability

2013-01-30 Thread Andriy Gapon
on 30/01/2013 01:06 Rick Macklem said the following:
> Andriy Gapon wrote:
>> on 29/01/2013 23:44 Hiroki Sato said the following:
>>>   http://people.allbsd.org/~hrs/FreeBSD/pool-20130130.txt
>>>   http://people.allbsd.org/~hrs/FreeBSD/pool-20130130-info.txt
>>
>> I recognize here a ZFS ARC deadlock that should have been prevented by
>> r241773
>> and its MFCs (r242858 for 9, r242859 for 8).
>>
> Unfortunately, pool-20130130-info.txt shows a kernel built from r244417,
> unless I somehow misread it.

You are right.  I slightly misdiagnosed the problem - it's not the same, but a
slightly different problem.  So it has "almost the same" cause, but r241773
didn't handle this situation.

Basically:
- a thread goes into ARC, acquires some ARC lock and then calls malloc(M_WAITOK)
- there is a page shortage, so the thread ends up in VM_WAIT() waiting on 
pagedaemon
- pagedaemon synchronously invokes lowmem hook
- the ARC hook sleeps waiting on ARC reclaim thread to make a pass
- ARC reclaim thread is blocked on the ARC lock held by the original thread

My conclusion: ARC lowmem hook should never wait on ARC reclaim thread.  At
least as long as the ARC code calls malloc(M_WAITOK) while holding locks.

Perhaps the root cause here is that we treat both KM_PUSHPAGE and KM_SLEEP as
M_WAITOK.  We do not seem to have an equivalent of KM_PUSHPAGE?
Perhaps resurrected M_USE_RESERVE could serve this role?

Quote:
A small pool of reserved memory is available to allow the system to progress
toward the goal of freeing additional memory while in a low memory situation.
The KM_PUSHPAGE flag enables use of this reserved memory pool on an allocation.
This flag can be used by drivers that implement strategy(9E) on memory
allocations associated with a single I/O operation. The driver guarantees that
the I/O operation will complete (or timeout) and, on completion, that the memory
will be returned. The KM_PUSHPAGE flag should be used only in kmem_cache_alloc()
calls. All allocations from a given cache should be consistent in their use of
the flag. A driver that adheres to these restrictions can guarantee progress in
a low memory situation without resorting to complex private allocation and
queuing schemes. If KM_PUSHPAGE is specified, KM_SLEEP can also be used without
causing deadlock.


But please note how the Solaris API allows KM_PUSHPAGE to be used together with
KM_SLEEP; I am not sure what's going on under the hood in that case.
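
If M_USE_RESERVE were brought back, one natural place to hook it up would be
the OpenSolaris kmem compatibility shim, so that KM_PUSHPAGE stops collapsing
into a plain M_WAITOK.  A rough sketch (the function name is hypothetical and
the real conversion in sys/cddl/compat/opensolaris may be structured
differently):

static int
kmem_flags_to_malloc(int kmflags)
{
	int mflags;

	mflags = (kmflags & KM_NOSLEEP) ? M_NOWAIT : M_WAITOK;
	if (kmflags & KM_PUSHPAGE)
		mflags |= M_USE_RESERVE;	/* allow use of the reserved pool */
	return (mflags);
}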

>> See tid 100153 (arc reclaim thread), tid 100105 (pagedaemon) and tid
>> 100639
>> (nfsd in kmem_back).
>>
>> --
>> Andriy Gapon


-- 
Andriy Gapon


Re: NFS-exported ZFS instability

2013-01-29 Thread Rick Macklem
Andriy Gapon wrote:
> on 29/01/2013 23:44 Hiroki Sato said the following:
> >   http://people.allbsd.org/~hrs/FreeBSD/pool-20130130.txt
> >   http://people.allbsd.org/~hrs/FreeBSD/pool-20130130-info.txt
> 
> I recognize here a ZFS ARC deadlock that should have been prevented by
> r241773
> and its MFCs (r242858 for 9, r242859 for 8).
> 
Unfortunately, pool-20130130-info.txt shows a kernel built from r244417,
unless I somehow misread it.

rick

> See tid 100153 (arc reclaim thread), tid 100105 (pagedaemon) and tid
> 100639
> (nfsd in kmem_back).
> 
> --
> Andriy Gapon


Re: NFS-exported ZFS instability

2013-01-29 Thread Andriy Gapon
on 29/01/2013 23:44 Hiroki Sato said the following:
>   http://people.allbsd.org/~hrs/FreeBSD/pool-20130130.txt
>   http://people.allbsd.org/~hrs/FreeBSD/pool-20130130-info.txt

I recognize here a ZFS ARC deadlock that should have been prevented by r241773
and its MFCs (r242858 for 9, r242859 for 8).

See tid 100153 (arc reclaim thread), tid 100105 (pagedaemon) and tid 100639
(nfsd in kmem_back).

-- 
Andriy Gapon


Re: NFS-exported ZFS instability

2013-01-29 Thread Hiroki Sato
Hiroki Sato  wrote
  in <20130104.023244.472910818423317661@allbsd.org>:

hr> Konstantin Belousov  wrote
hr>   in <20130102174044.gb82...@kib.kiev.ua>:
hr>
hr> ko> > I might take a closer look this evening and see if I can spot anything
hr> ko> > in the log, rick
hr> ko> > ps: I hope Alan and Kostik don't mind being added to the cc list.
hr> ko>
hr> ko> What I see in the log is that the lock cascade rooted in the thread
hr> ko> 100838, which owns system map mutex. I believe this prevents malloc(9)
hr> ko> from making a progress in other threads, which e.g. own the ZFS vnode
hr> ko> locks. As the result, the whole system wedged.
hr> ko>
hr> ko> Looking back at the thread 100838, we can see that it executes
hr> ko> smp_tlb_shootdown(). It is impossible to tell from the static dump,
hr> ko> is the appearance of the smp_tlb_shootdown() in the backtrace is
hr> ko> transient, or the thread is spinning there, waiting for other CPUs to
hr> ko> acknowledge the request. But, since the system wedged, most likely,
hr> ko> smp_tlb_shootdown spins.
hr> ko>
hr> ko> Taking this hypothesis, the situation can occur, most likely, due to
hr> ko> some other core running with the interrupts disabled. Inspection of the
hr> ko> backtraces of the processes running on all cores does not show any which
hr> ko> could legitimately own a spinlock or otherwise run with the interrupts
hr> ko> disabled.
hr> ko>
hr> ko> One thing you could try to do is to enable WITNESS for the spinlocks,
hr> ko> to try to catch the leaked spinlock. I very much doubt that this is
hr> ko> the case.
hr> ko>
hr> ko> Another thing to try is to switch the CPU idle method to something
hr> ko> else. Look at the machdep.idle* sysctls. It could be some CPU errata
hr> ko> which blocks wakeup due the interrupt in some conditions in C1 ?
hr>
hr>  Thank you.  It can take 1-2 weeks to reproduce this, so I set
hr>  debug.witness.skipspin=0, keeping machdep.idle=acpi, and will see
hr>  how it goes for a while.  I will report again if I can get another
hr>  freeze.

 Hmm, I could reproduce the same freeze when debug.witness.skipspin=0,
 too.  DDB and crash dump outputs are the following:

  http://people.allbsd.org/~hrs/FreeBSD/pool-20130130.txt
  http://people.allbsd.org/~hrs/FreeBSD/pool-20130130-info.txt

 The value of machdep.idle was acpi.  I have seen this symptom on two
 boxes with the following CPUs, so I am guessing it is not specific to
 a CPU model:

  CPU: Intel(R) Pentium(R) D CPU 3.40GHz (3391.52-MHz K8-class CPU)
  CPU: Intel(R) Xeon(R) CPU X5650  @ 2.67GHz (2666.82-MHz K8-class CPU)

-- Hiroki


Re: NFS-exported ZFS instability

2013-01-03 Thread Hiroki Sato
Konstantin Belousov  wrote
  in <20130102174044.gb82...@kib.kiev.ua>:

ko> > I might take a closer look this evening and see if I can spot anything
ko> > in the log, rick
ko> > ps: I hope Alan and Kostik don't mind being added to the cc list.
ko>
ko> What I see in the log is that the lock cascade rooted in the thread
ko> 100838, which owns system map mutex. I believe this prevents malloc(9)
ko> from making a progress in other threads, which e.g. own the ZFS vnode
ko> locks. As the result, the whole system wedged.
ko>
ko> Looking back at the thread 100838, we can see that it executes
ko> smp_tlb_shootdown(). It is impossible to tell from the static dump,
ko> is the appearance of the smp_tlb_shootdown() in the backtrace is
ko> transient, or the thread is spinning there, waiting for other CPUs to
ko> acknowledge the request. But, since the system wedged, most likely,
ko> smp_tlb_shootdown spins.
ko>
ko> Taking this hypothesis, the situation can occur, most likely, due to
ko> some other core running with the interrupts disabled. Inspection of the
ko> backtraces of the processes running on all cores does not show any which
ko> could legitimately own a spinlock or otherwise run with the interrupts
ko> disabled.
ko>
ko> One thing you could try to do is to enable WITNESS for the spinlocks,
ko> to try to catch the leaked spinlock. I very much doubt that this is
ko> the case.
ko>
ko> Another thing to try is to switch the CPU idle method to something
ko> else. Look at the machdep.idle* sysctls. It could be some CPU errata
ko> which blocks wakeup due the interrupt in some conditions in C1 ?

 Thank you.  It can take 1-2 weeks to reproduce this, so I set
 debug.witness.skipspin=0, keeping machdep.idle=acpi, and will see
 how it goes for a while.  I will report again if I can get another
 freeze.

-- Hiroki


Re: NFS-exported ZFS instability

2013-01-03 Thread Hiroki Sato
Rick Macklem  wrote
  in <1914428061.1617223.1357133079421.javamail.r...@erie.cs.uoguelph.ca>:

rm> Hiroki Sato wrote:
rm> > Hello,
rm> >
rm> > I have been having trouble with my NFS server for a long time. The
rm> > symptom is that it stops working in one or two weeks after a boot. I
rm> > could not track down the cause yet, but it is reproducible and only
rm> > occurred under a very high I/O load.
rm> >
rm> > It did not panic, just stopped working---while it responded to ping,
rm> > userland programs did not seem to be working. I could break it into DDB and
rm> > get a kernel dump. The following URLs are a log of ps, trace, and
rm> > etc.:
rm> >
rm> > http://people.allbsd.org/~hrs/FreeBSD/pool.log.20130102
rm> > http://people.allbsd.org/~hrs/FreeBSD/pool.dmesg.20130102
rm> >
rm> > Does anyone see how to debug this? I guess this is due to a deadlock
rm> > somewhere. I have suffered from this problem for almost two years.
rm> > The above log is from stable/9 as of Dec 19, but this has persisted
rm> > since 8.X.
rm> >
rm> Well, I took a quick glance at the log and there are a lot of processes
rm> sleeping on "pfault" (in vm_waitpfault() in sys/vm/vm_page.c). I'm no
rm> vm guy, so I'm not sure when/why that will happen. The comment on the
rm> function suggests they are waiting for free pages.
rm>
rm> Maybe something as simple as running out of swap space or a problem
rm> talking to the disk(s) that has the swap partition(s) or ???
rm> (I'm talking through my hat here, because I'm not conversant with
rm>  the vm side of things.)
rm>
rm> I might take a closer look this evening and see if I can spot anything
rm> in the log, rick
rm> ps: I hope Alan and Kostik don't mind being added to the cc list.

 Thank you.  This machine has 24GB RAM + 30GB swap.  16GB of them are
 used for ZFS ARC, and I can see 1.5GB free space on average.
 However, frequent swapouts happen on a regular basis even when the
 I/O load is low.  The amount of swap used was only 20-30MB,
 regardless of the load.

 I checked vm.stats and the outputs of vmstat -z/-m every 10 sec until
 the freeze several times but vm.stats.vm.v_free_count was around
 300,000 (>1GB) even just before the freeze.

-- Hiroki


Re: NFS-exported ZFS instability

2013-01-02 Thread Konstantin Belousov
On Wed, Jan 02, 2013 at 08:24:39AM -0500, Rick Macklem wrote:
> Hiroki Sato wrote:
> > Hello,
> > 
> > I have been having trouble with my NFS server for a long time. The
> > symptom is that it stops working in one or two weeks after a boot. I
> > could not track down the cause yet, but it is reproducible and only
> > occurred under a very high I/O load.
> > 
> > It did not panic, just stopped working---while it responded to ping,
> > userland programs did not seem to be working. I could break it into DDB and
> > get a kernel dump. The following URLs are a log of ps, trace, and
> > etc.:
> > 
> > http://people.allbsd.org/~hrs/FreeBSD/pool.log.20130102
> > http://people.allbsd.org/~hrs/FreeBSD/pool.dmesg.20130102
> > 
> > Does anyone see how to debug this? I guess this is due to a deadlock
> > somewhere. I have suffered from this problem for almost two years.
> > The above log is from stable/9 as of Dec 19, but this has persisted
> > since 8.X.
> > 
> Well, I took a quick glance at the log and there are a lot of processes
> sleeping on "pfault" (in vm_waitpfault() in sys/vm/vm_page.c). I'm no
> vm guy, so I'm not sure when/why that will happen. The comment on the
> function suggests they are waiting for free pages.
> 
> Maybe something as simple as running out of swap space or a problem
> talking to the disk(s) that has the swap partition(s) or ???
> (I'm talking through my hat here, because I'm not conversant with
>  the vm side of things.)
> 
> I might take a closer look this evening and see if I can spot anything
> in the log, rick
> ps: I hope Alan and Kostik don't mind being added to the cc list.

What I see in the log is that the lock cascade is rooted in thread
100838, which owns the system map mutex. I believe this prevents malloc(9)
from making progress in other threads, which e.g. own the ZFS vnode
locks. As a result, the whole system wedged.

Looking back at thread 100838, we can see that it executes
smp_tlb_shootdown(). It is impossible to tell from the static dump
whether the appearance of smp_tlb_shootdown() in the backtrace is
transient, or whether the thread is spinning there, waiting for other CPUs
to acknowledge the request. But, since the system wedged, most likely
smp_tlb_shootdown spins.

Under this hypothesis, the situation most likely occurs due to
some other core running with interrupts disabled. Inspection of the
backtraces of the processes running on all cores does not show any that
could legitimately own a spinlock or otherwise run with interrupts
disabled.

One thing you could try is to enable WITNESS for the spinlocks,
to try to catch a leaked spinlock. I very much doubt that this is
the case.

Another thing to try is to switch the CPU idle method to something
else; look at the machdep.idle* sysctls. It could be some CPU erratum
that blocks wakeup by an interrupt under some conditions in C1?


Re: NFS-exported ZFS instability

2013-01-02 Thread Rick Macklem
Hiroki Sato wrote:
> Hello,
> 
> I have been having trouble with my NFS server for a long time. The
> symptom is that it stops working in one or two weeks after a boot. I
> could not track down the cause yet, but it is reproducible and only
> occurred under a very high I/O load.
> 
> It did not panic, just stopped working---while it responded to ping,
> userland programs did not seem to be working. I could break it into DDB and
> get a kernel dump. The following URLs are a log of ps, trace, and
> etc.:
> 
> http://people.allbsd.org/~hrs/FreeBSD/pool.log.20130102
> http://people.allbsd.org/~hrs/FreeBSD/pool.dmesg.20130102
> 
> Does anyone see how to debug this? I guess this is due to a deadlock
> somewhere. I have suffered from this problem for almost two years.
> The above log is from stable/9 as of Dec 19, but this has persisted
> since 8.X.
> 
Well, I took a quick glance at the log and there are a lot of processes
sleeping on "pfault" (in vm_waitpfault() in sys/vm/vm_page.c). I'm no
vm guy, so I'm not sure when/why that will happen. The comment on the
function suggests they are waiting for free pages.

Maybe something as simple as running out of swap space or a problem
talking to the disk(s) that has the swap partition(s) or ???
(I'm talking through my hat here, because I'm not conversant with
 the vm side of things.)

I might take a closer look this evening and see if I can spot anything
in the log, rick
ps: I hope Alan and Kostik don't mind being added to the cc list.

> -- Hiroki


Re: NFS-exported ZFS instability

2013-01-01 Thread Perry Hutchison
Hiroki Sato  wrote:

>  I have been in a trouble about my NFS server for a long time.
>  The symptom is that it stops working in one or two weeks after
>  a boot ...  It did not panic, just stopped working---while it
>  responded to ping, userland programs seemed not working ...

>  Does anyone see how to debug this?  I guess this is due to a
>  deadlock somewhere ...

If you can afford the overhead, you could try running with some
of the kernel debug options enabled (e.g. WITNESS, INVARIANTS,
MUTEX_DEBUG).  See conf/NOTES for descriptions.