Re: Intermittent system hangs on 7.2-RELEASE-p1
Just to follow up, I've been doing some testing with masking for KTR_LOCK rather than KTR_SCHED. I'm having trouble with this because I have the KTR buffer size set to 1048576 entries, and with only KTR_LOCK enabled, that isn't enough for even a full second of tracing; the sample I'm working with now is just under 0.9s, averaging one entry every 2001 TSC ticks. That *seems* like a lot of locking activity, but some of the lock points are only a couple of lines apart, so maybe it's just incredibly verbose.

Since it's so much data and I'm still working on a way to correlate it (lockgraph.py?), all I've got so far is a list of which trace points come up the most:

  51927 src/sys/kern/kern_lock.c:215 (_lockmgr UNLOCK mtx_unlock() when flags & LK_INTERLOCK)
  48033 src/sys/kern/vfs_subr.c:2284 (vdropl UNLOCK)
  41548 src/sys/kern/vfs_subr.c:2187 (vput VI_LOCK)
  29359 src/sys/kern/vfs_subr.c:2067 (vget VI_LOCK)
  29358 src/sys/kern/vfs_subr.c:2079 (vget VI_UNLOCK)
  23799 src/sys/nfsclient/nfs_subs.c:755 (nfs_getattrcache mtx_lock)
  23460 src/sys/nfsclient/nfs_vnops.c:645 (nfs_getattr mtx_unlock)
  23460 src/sys/nfsclient/nfs_vnops.c:642 (nfs_getattr mtx_lock)
  23460 src/sys/nfsclient/nfs_subs.c:815 (nfs_getattrcache mtx_unlock)
  23138 src/sys/kern/vfs_cache.c:345 (cache_lookup CACHE_LOCK)

Unfortunately, it kind of sounds like I'm on my way to answering "why is this system slow?" even though it really isn't slow. (And I rush to point out that the Apache process in question doesn't at any point in its life touch NFS, though some of the other processes on the machine do.) For all this goobering around with NFS to be the cause of my Apache problem, it would have to be relatively infrequent but so intense that it shoves everything else out of the way. I'm skeptical, but I'm sure one of you can offer a more informed opinion.

The only other thing I can think of is that maybe all this is running me out of something I need (vnodes?), so everybody else blocks until it finishes and lets go of whatever finite resource it's using up. But that doesn't make a ton of sense either, because why would a lack of vnodes cause stalls in accept() or select() in unrelated processes? Not sure if I'm going in the right direction here or not.

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
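[Editor's note: the per-trace-point tally above can be produced with a few lines of Python. This is a hedged sketch, not the poster's actual script: it assumes ktrdump-style lines that each contain a source location of the form "file.c:line"; the real output format may need different parsing.]

```python
# Tally KTR trace records by the file:line location they mention.
# Assumes each record contains something like "src/sys/kern/kern_lock.c:215";
# the exact ktrdump output format may differ, so treat this as illustrative.
import re
from collections import Counter

LOC_RE = re.compile(r'(\S+\.c:\d+)')

def count_trace_points(lines):
    """Return a Counter mapping file:line locations to occurrence counts."""
    counts = Counter()
    for line in lines:
        m = LOC_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Invented sample records standing in for real ktrdump output:
sample = [
    "1000 0 123456789 src/sys/kern/vfs_subr.c:2284 UNLOCK vdropl",
    "1001 1 123456912 src/sys/kern/vfs_subr.c:2284 UNLOCK vdropl",
    "1002 0 123457100 src/sys/kern/kern_lock.c:215 UNLOCK _lockmgr",
]
for loc, n in count_trace_points(sample).most_common():
    print(n, loc)
```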
Re: Intermittent system hangs on 7.2-RELEASE-p1
On Thu, Sep 10, 2009 at 2:46 PM, Julian Elischer wrote:
> I've noticed that schedgraph tends to show the idle threads slightly
> skewed one way or the other. I think there is a cumulative rounding
> error in the way they are drawn due to the fact that they are run so
> often. Check the raw data and I think you will find that you just
> need to imagine the idle threads slightly to the left or right a bit.

No, there's no period anywhere in the trace where either idle thread didn't run for an entire second. I'm pretty sure schedgraph is throwing in some nonsense results. I did capture a second, larger dataset after a 2.1s stall, and schedgraph includes an httpd process that supposedly spent 58 seconds on the run queue. I don't know if that's a dropped record or a parsing error or what.

I do think on this second graph I can kind of see the *end* of the stall, because all of a sudden a ton of processes... everything from sshd to httpd to gmond to sh to vnlru to bufdaemon to fdc0... comes off of whatever it's waiting on and hits the run queue. The combined run queues for both processors spike up to 32 tasks at one point and then rapidly tail off as things return to normal.

That pretty much matches the behavior shown by ktrace in my initial post, where everything goes to sleep on something-or-other in the kernel, and then at the end of the stall everything wakes up at the same time. I think this means the problem is somehow related to locking, rather than scheduling.
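[Editor's note: the run-queue spike described above can be reconstructed mechanically from enqueue/dequeue events. A minimal sketch, with an invented event format (timestamp plus +1 for enqueue, -1 for dequeue); real KTR_SCHED records would need parsing first:]

```python
# Reconstruct combined run-queue depth over time from enqueue/dequeue events,
# to locate the wake-up spike at the end of a stall. The (ts, delta) event
# format is invented for this sketch; real KTR_SCHED output differs.
def runq_depth(events):
    """Yield (timestamp, depth) after applying each enqueue/dequeue event."""
    depth = 0
    for ts, delta in events:
        depth += delta
        yield ts, depth

# Toy event stream: three threads pile onto the queue, then drain off.
events = [(0, +1), (1, +1), (2, -1), (3, +1), (4, +1), (5, -1), (6, -1), (7, -1)]
peak_ts, peak = max(runq_depth(events), key=lambda p: p[1])
print(peak_ts, peak)  # → 4 3: the queue peaks at depth 3 at timestamp 4
```

Applied to a real trace, the timestamp of the peak marks the moment everything comes off its wait channel at once.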
Re: Intermittent system hangs on 7.2-RELEASE-p1
Linda Messerschmidt wrote:
> On Thu, Sep 10, 2009 at 12:57 PM, Ryan Stone wrote:
> > You should be able to run schedgraph.py on a windows machine with
> > python installed. It works just fine for me on XP.
>
> Don't have any of those either, but I *did* get it working on a Mac
> right out of the box. Should have thought of that sooner. :)
>
> The output looks pretty straightforward, but there are a couple of
> things I find odd. First, there's a point right around what I estimate
> to be the problem time where schedgraph.py indicates gmond (the Ganglia
> monitor) was running uninterrupted for a period of exactly 1 second.
> However, it also indicates that both CPUs' idle tasks were *also*
> running almost continuously during that time (subject to clock/net
> interrupts), and that the run queue on both CPUs was zero for most of
> that second while gmond was allegedly running.

I've noticed that schedgraph tends to show the idle threads slightly skewed one way or the other. I think there is a cumulative rounding error in the way they are drawn due to the fact that they are run so often. Check the raw data and I think you will find that you just need to imagine the idle threads slightly to the left or right a bit. The longer the trace and the further to the right you are looking, the more "out" the idle threads appear to be. I saw this on both Linux and Mac python implementations.

> Second, the interval I graphed was about nine seconds. During that
> time, the PHP command line script made a whole lot of requests: it
> usleeps 50ms between requests, and non-broken requests average about
> 1.4ms. So even with the stalled request chopping 2 seconds off the end,
> there should be somewhere in the neighborhood of 130 requests during
> the graphed period. But that php process doesn't appear in the
> schedgraph output at all. So that doesn't make a whole lot of sense to
> me. I'll try to get another trace and see if that happens the same way
> again.
Re: Intermittent system hangs on 7.2-RELEASE-p1
On Thu, Sep 10, 2009 at 12:57 PM, Ryan Stone wrote:
> You should be able to run schedgraph.py on a windows machine with python
> installed. It works just fine for me on XP.

Don't have any of those either, but I *did* get it working on a Mac right out of the box. Should have thought of that sooner. :)

The output looks pretty straightforward, but there are a couple of things I find odd. First, there's a point right around what I estimate to be the problem time where schedgraph.py indicates gmond (the Ganglia monitor) was running uninterrupted for a period of exactly 1 second. However, it also indicates that both CPUs' idle tasks were *also* running almost continuously during that time (subject to clock/net interrupts), and that the run queue on both CPUs was zero for most of that second while gmond was allegedly running.

Second, the interval I graphed was about nine seconds. During that time, the PHP command line script made a whole lot of requests: it usleeps 50ms between requests, and non-broken requests average about 1.4ms. So even with the stalled request chopping 2 seconds off the end, there should be somewhere in the neighborhood of 130 requests during the graphed period. But that php process doesn't appear in the schedgraph output at all. So that doesn't make a whole lot of sense to me. I'll try to get another trace and see if that happens the same way again.
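[Editor's note: the "neighborhood of 130" figure checks out as a back-of-envelope calculation, using the numbers given in the message:]

```python
# Sanity-check the expected request count in the graphed window, using the
# figures from the message: 9 s graphed, ~2 s lost to the stalled request,
# 50 ms usleep between requests plus ~1.4 ms per non-broken request.
graphed_s = 9.0
stall_s = 2.0
per_request_s = 0.050 + 0.0014

expected = (graphed_s - stall_s) / per_request_s
print(round(expected))  # → 136, i.e. roughly the ~130 estimated above
```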
Re: Intermittent system hangs on 7.2-RELEASE-p1
You should be able to run schedgraph.py on a windows machine with python installed. It works just fine for me on XP.
Re: Re: How to prevent other CPU from accessing a set of pages before calling pmap_remove_all function
On Sep 10, 2009 5:08am, Kostik Belousov wrote:
> On Wed, Sep 09, 2009 at 11:57:24PM -0700, MingyanGuo wrote:
> > On Wed, Sep 9, 2009 at 11:26 PM, MingyanGuo <guoming...@gmail.com> wrote:
> > > Hi all,
> > >
> > > I find that function pmap_remove_all for arch amd64 works with a time
> > > window between reading & clearing the PTE flags (access flag and
> > > dirty flag) and invalidating its TLB entry on other CPUs. After some
> > > discussion with Li Xin (cc'ed), I think all the processes that are
> > > using the PTE being removed should be blocked before calling
> > > pmap_remove_all, or another CPU may dirty the page but not set the
> > > dirty flag before the TLB entry is flushed. But I cannot find how to
> > > block them before calling the function. I read the function
> > > vm_pageout_scan in file vm/vm_pageout.c but cannot find the exact
> > > method it uses. Or have I just misunderstood the semantics of
> > > function pmap_remove_all?
> > >
> > > Thanks in advance.
> > >
> > > Regards,
> > > MingyanGuo
> >
> > Sorry for the noise. I understand the logic now. There is no time
> > window problem between reading & clearing the PTE and invalidating it
> > on other CPUs, even if another CPU is using the PTE. I misunderstood
> > the logic.
>
> Hmm. What would happen in the following scenario? Assume that the page m
> is mapped by a vm map active on CPU1, and that CPU1 has a cached TLB
> entry for some writable mapping of this page, but neither the TLB entry
> nor the PTE has the dirty bit set. Then, assume that the following
> sequence of events occurs:
>
>   CPU1:                                  CPU2:
>                                          call pmap_remove_all(m)
>                                          clear pte
>   write to the address mapped by m [*]
>                                          invalidate the TLB, possibly
>                                          making IPI to CPU1
>
> I assume that at the point marked [*], we can
> - either lose the dirty bit, while CPU1 (atomically) sets the dirty bit
>   in the cleared pte. Besides not properly tracking the modification
>   status of the page, this could also cause the page table page to be
>   modified, which would create a non-zero page with the PG_ZERO flag
>   set;
> - or CPU1 re-reads the PTE entry when setting the dirty bit, and
>   generates a #pf since the valid bit in the PTE is zero.
>
> Intel documentation mentions that dirty or accessed bit updates are done
> with a locked cycle, which definitely means that the PTE is re-read, but
> I cannot find whether the valid bit is rechecked.

I am not an architecture expert, but from a programmer's view, I *think* using the in-memory PTE structure for the first write through that PTE is more reasonable. To set the dirty bit, a CPU has to access memory with locked cycles anyway, so using the in-memory PTE structure should add little performance burden while being friendlier to software. However, this is just my guess; I am reading the manuals to see whether they describe it.

Regards,
MingyanGuo
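[Editor's note: the "lost dirty bit" arm of the scenario above can be made concrete with a scripted interleaving. This is plain Python, not kernel code; every structure and name is invented for the sketch, and it deliberately models only the first of the two possible hardware behaviors being debated.]

```python
# Illustrative sketch only: a scripted interleaving showing how a
# modification could go untracked if another CPU writes through a stale TLB
# entry between the PTE clear and the TLB shootdown.
pte = {"valid": True, "dirty": False}   # the in-memory PTE
tlb_cpu1 = dict(pte)                    # CPU1's cached (clean, writable) translation

# CPU2: pmap_remove_all() clears the PTE first...
pte["valid"] = False

# CPU1: writes through its stale TLB entry. Whether real hardware would set
# the dirty bit in the now-cleared PTE or take a fault is exactly the open
# question in the mail above; here we model the "lost dirty bit" arm.
tlb_cpu1["dirty"] = True

# CPU2: ...then invalidates CPU1's TLB entry. The dirty bit never reached
# the real PTE, so the page looks clean even though it was modified.
tlb_cpu1 = None
print(pte)  # → {'valid': False, 'dirty': False}
```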
Re: Intermittent system hangs on 7.2-RELEASE-p1
On Thu, Aug 27, 2009 at 5:29 PM, John Baldwin wrote:
> Ah, cool, what you want to do is use KTR with KTR_SCHED and then use
> schedgraph.py (src/tools/sched) to get a visual picture of what the box
> does during a hang. The timestamps in KTR are TSC cycle counts rather
> than an actual wall time which is why they look off. If you have a lot
> of events you may want to use a larger KTR_ENTRIES size btw (I use
> 1048576 (2 ^ 20) here at work to get large (multiple-second) traces).

I'm still working on this. I enabled KTR and set it up to log KTR_SCHED events. Then I wrote a script to exercise the HTTP server that actually ran on that machine, and set it to issue "sysctl debug.ktr.cpumask=0" and abort if a request took over 2 seconds. 28,613 requests later, it tripped over one that took 2007ms. (Just a refresher: this is a static file being served by an Apache process that has nothing else to do but serve this file, on a relatively unloaded machine.)

I don't have access to any machines that can run X, so I did the best I could to examine it from the shell. First, this machine has two CPUs, so I split up the KTR results on a per-CPU basis so I could look at each individually. With KTR_ENTRIES set to 1048576, I got about 53 seconds of data with just KTR_SCHED enabled. Since I was interested in a 2.007-second period of time right at the end, I hacked it down to the last 3.795 seconds.

In the 3.795 seconds captured in the trace period on CPU 0, which includes the entire 2.007-second stall, CPU 0 was idle for 3.175 seconds. In the same period, CPU 1 was idle for 3.2589 seconds. I did the best I could to manually page through all the scheduling activity on both CPUs during that 3.7-second window, and I didn't see anything really disruptive: mainly idle, with jumps into the clock and ethernet kernel threads, as well as httpd.

If I understand that correctly and have done everything right, that means that whatever happened, it wasn't related to CPU contention or scheduling issues of any sort. So, a couple of follow-up questions:

First, what else should I be looking at? I built the kernel with kind of a lot of KTR flags (KTR_LOCK|KTR_SCHED|KTR_PROC|KTR_INTR|KTR_CALLOUT|KTR_UMA|KTR_SYSC), but enabling them all produces enough output that even 1048576 entries doesn't always go back two seconds; the volume of data is all but unmanageable.

Second, is there any way to correlate the process address reported by the KTR scheduler entries back to a PID? It'd be nice to be able to view the scheduler activity just for the process I'm interested in, but I can't figure out which one it is. :)

Thanks!
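[Editor's note: the per-CPU idle accounting done by hand above can be sketched in a few lines. The record format here is a made-up simplification (tuples of CPU, TSC timestamp, and the name of the thread being switched in); a real version would first parse ktrdump output and use the machine's actual TSC frequency.]

```python
# Sum the time each CPU spends in its idle thread, given context-switch-in
# records sorted by timestamp. Record format is an invented simplification:
# (cpu, tsc_timestamp, incoming_thread_name).
from collections import defaultdict

def idle_time_per_cpu(records, tsc_hz):
    """Return seconds each CPU spent in a thread whose name starts with 'idle'."""
    idle = defaultdict(float)
    last = {}  # cpu -> (tsc, name) of the thread currently on that CPU
    for cpu, tsc, name in records:
        if cpu in last:
            prev_tsc, prev_name = last[cpu]
            if prev_name.startswith("idle"):
                idle[cpu] += (tsc - prev_tsc) / tsc_hz
        last[cpu] = (tsc, name)
    return dict(idle)

# Toy trace: CPU 0 idles for 2.0 s, runs httpd for 0.5 s, idles 1.5 s more.
records = [
    (0, 0,         "idle: cpu0"),
    (0, 2_000_000, "httpd"),
    (0, 2_500_000, "idle: cpu0"),
    (0, 4_000_000, "httpd"),
]
print(idle_time_per_cpu(records, tsc_hz=1_000_000))  # → {0: 3.5}
```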
Re: How to prevent other CPU from accessing a set of pages before calling pmap_remove_all function
On Wed, Sep 09, 2009 at 11:57:24PM -0700, MingyanGuo wrote:
> On Wed, Sep 9, 2009 at 11:26 PM, MingyanGuo wrote:
> > Hi all,
> >
> > I find that function pmap_remove_all for arch amd64 works with a time
> > window between reading & clearing the PTE flags (access flag and dirty
> > flag) and invalidating its TLB entry on other CPUs. After some
> > discussion with Li Xin (cc'ed), I think all the processes that are
> > using the PTE being removed should be blocked before calling
> > pmap_remove_all, or another CPU may dirty the page but not set the
> > dirty flag before the TLB entry is flushed. But I cannot find how to
> > block them before calling the function. I read the function
> > vm_pageout_scan in file vm/vm_pageout.c but cannot find the exact
> > method it uses. Or have I just misunderstood the semantics of function
> > pmap_remove_all?
> >
> > Thanks in advance.
> >
> > Regards,
> > MingyanGuo
>
> Sorry for the noise. I understand the logic now. There is no time window
> problem between reading & clearing the PTE and invalidating it on other
> CPUs, even if another CPU is using the PTE. I misunderstood the logic.

Hmm. What would happen in the following scenario? Assume that the page m is mapped by a vm map active on CPU1, and that CPU1 has a cached TLB entry for some writable mapping of this page, but neither the TLB entry nor the PTE has the dirty bit set. Then, assume that the following sequence of events occurs:

  CPU1:                                  CPU2:
                                         call pmap_remove_all(m)
                                         clear pte
  write to the address mapped by m [*]
                                         invalidate the TLB, possibly
                                         making IPI to CPU1

I assume that at the point marked [*], we can
- either lose the dirty bit, while CPU1 (atomically) sets the dirty bit in
  the cleared pte. Besides not properly tracking the modification status of
  the page, this could also cause the page table page to be modified, which
  would create a non-zero page with the PG_ZERO flag set;
- or CPU1 re-reads the PTE entry when setting the dirty bit, and generates
  a #pf since the valid bit in the PTE is zero.

Intel documentation mentions that dirty or accessed bit updates are done with a locked cycle, which definitely means that the PTE is re-read, but I cannot find whether the valid bit is rechecked.