Re: Intermittent system hangs on 7.2-RELEASE-p1

2009-09-10 Thread Linda Messerschmidt
Just to follow up, I've been doing some testing with masking for
KTR_LOCK rather than KTR_SCHED.

I'm having trouble with this because I have the KTR buffer size set to
1048576 entries, and with only KTR_LOCK enabled, this isn't enough for
even a full second of tracing; the sample I'm working with now is just
under 0.9s.  It's an average of one entry every 2001 TSC ticks.  That
*seems* like a lot of locking activity, but some of the lock points
are only a couple of lines apart, so maybe it's just incredibly
verbose.
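
(As a rough consistency check on those numbers: 1048576 entries covering
a bit under 0.9 seconds is on the order of 1.17 million lock trace
entries per second, and at 2001 TSC ticks per entry that implies a TSC
of roughly 2.3 GHz, so the figures at least hang together.)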

Since it's so much data and I'm still working on a way to correlate it
(lockgraph.py?), all I've got so far is a list of what trace points
are coming up the most:

51927 src/sys/kern/kern_lock.c:215  (_lockmgr UNLOCK mtx_unlock() when flags & LK_INTERLOCK)
48033 src/sys/kern/vfs_subr.c:2284  (vdropl UNLOCK)
41548 src/sys/kern/vfs_subr.c:2187  (vput VI_LOCK)
29359 src/sys/kern/vfs_subr.c:2067  (vget VI_LOCK)
29358 src/sys/kern/vfs_subr.c:2079  (vget VI_UNLOCK)
23799 src/sys/nfsclient/nfs_subs.c:755  (nfs_getattrcache mtx_lock)
23460 src/sys/nfsclient/nfs_vnops.c:645  (nfs_getattr mtx_unlock)
23460 src/sys/nfsclient/nfs_vnops.c:642  (nfs_getattr mtx_lock)
23460 src/sys/nfsclient/nfs_subs.c:815  (nfs_getattrcache mtx_unlock)
23138 src/sys/kern/vfs_cache.c:345  (cache_lookup CACHE_LOCK)
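
(In case anyone wants to reproduce this kind of tally from a ktrdump
listing, a few lines of script will do it.  This is only a sketch, not
the exact commands used here, and it assumes each dumped record carries
a file:line token, which ktrdump can be told to print:)

    # Sketch: count KTR trace points by file:line in a ktrdump listing.
    # Assumes each record contains a "something.c:NNN" token; adjust the
    # regex if your dump format differs.
    import re
    import sys
    from collections import Counter

    point = re.compile(r'\S+\.[ch]:\d+')
    counts = Counter()
    for line in sys.stdin:
        match = point.search(line)
        if match:
            counts[match.group(0)] += 1

    for file_line, n in counts.most_common(10):
        print(n, file_line)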

Unfortunately, it kind of sounds like I'm on my way to answering "why
is this system slow?" even though it really isn't slow.  (And I rush
to point out that the Apache process in question doesn't at any point
in its life touch NFS, though some of the other ones on the machine
do.)

In order to be the cause of my Apache problem, all this goobering
around with NFS would have to be relatively infrequent but so intense
that it shoves everything else out of the way.  I'm skeptical, but I'm
sure one of you guys can offer a more informed opinion.

The only other thing I can think of is that maybe all this is running
me out of something I need (vnodes?), so everybody else blocks until it
finishes and lets go of whatever finite resource it's using up.  But
that doesn't make a ton of sense either, because why would a lack of
vnodes cause stalls in accept() or select() in unrelated processes?

Not sure if I'm going in the right direction here or not.


Re: Intermittent system hangs on 7.2-RELEASE-p1

2009-09-10 Thread Linda Messerschmidt
On Thu, Sep 10, 2009 at 2:46 PM, Julian Elischer wrote:
> I've noticed that schedgraph tends to show the idle threads slightly
> skewed one way or the other.  I think there is a cumulative rounding
> error in the way they are drawn due to the fact that they are run so
> often.  Check the raw data and I think you will find that you just
> need to imagine the idle threads slightly to the left or right a bit.

No, there's no period anywhere in the trace where either idle thread
didn't run for an entire second.

I'm pretty sure schedgraph is throwing in some nonsense results.  I
did capture a second, larger, dataset after a 2.1s stall, and
schedgraph includes an httpd process that supposedly spent 58 seconds
on the run queue.  I don't know if it's a dropped record or a parsing
error or what.

I do think on this second graph I can kind of see the *end* of the
stall, because all of a sudden a ton of processes... everything from
sshd to httpd to gmond to sh to vnlru to bufdaemon to fdc0... comes
off of whatever it's waiting on and hits the run queue.  The combined
run queues for both processors spike up to 32 tasks at one point and
then rapidly tail off as things return to normal.

That pretty much matches the behavior shown by ktrace in my initial
post, where everything goes to sleep on something-or-other in the
kernel, and then at the end of the stall, everything wakes up at the
same time.

I think this means the problem is somehow related to locking, rather
than scheduling.


Re: Intermittent system hangs on 7.2-RELEASE-p1

2009-09-10 Thread Julian Elischer

Linda Messerschmidt wrote:

> On Thu, Sep 10, 2009 at 12:57 PM, Ryan Stone wrote:
> > You should be able to run schedgraph.py on a windows machine with python
> > installed.  It works just fine for me on XP.
>
> Don't have any of those either, but I *did* get it working on a Mac
> right out of the box.  Should have thought of that sooner. :)
>
> The output looks pretty straightforward, but there are a couple of
> things I find odd.
>
> First, there's a point right around what I estimate to be the problem
> time where schedgraph.py indicates gmond (the Ganglia monitor) was
> running uninterrupted for a period of exactly 1 second.  However, it
> also indicates that both CPUs' idle tasks were *also* running almost
> continuously during that time (subject to clock/net interrupts), and
> that the run queue on both CPUs was zero for most of that second
> while gmond was allegedly running.


I've noticed that schedgraph tends to show the idle threads slightly
skewed one way or the other.  I think there is a cumulative rounding
error in the way they are drawn due to the fact that they are run so
often.  Check the raw data and I think you will find that you just
need to imagine the idle threads slightly to the left or right a bit.
The longer the trace and the further to the right you are looking
the more "out" the idle threads appear to be.

I saw this on both Linux and Mac python implementations.



> Second, the interval I graphed was about nine seconds.  During that
> time, the PHP command line script made a whole lot of requests: it
> usleeps 50ms between requests, and non-broken requests average about
> 1.4ms.  So even with the stalled request chopping 2 seconds off the
> end, there should be somewhere in the neighborhood of 130 requests
> during the graphed period.  But that php process doesn't appear in the
> schedgraph output at all.
>
> So that doesn't make a whole lot of sense to me.
>
> I'll try to get another trace and see if that happens the same way again.


Re: Intermittent system hangs on 7.2-RELEASE-p1

2009-09-10 Thread Linda Messerschmidt
On Thu, Sep 10, 2009 at 12:57 PM, Ryan Stone wrote:
> You should be able to run schedgraph.py on a windows machine with python
> installed.  It works just fine for me on XP.

Don't have any of those either, but I *did* get it working on a Mac
right out of the box.  Should have thought of that sooner. :)

The output looks pretty straightforward, but there are a couple of
things I find odd.

First, there's a point right around what I estimate to be the problem
time where schedgraph.py indicates gmond (the Ganglia monitor) was
running uninterrupted for a period of exactly 1 second.  However, it
also indicates that both CPUs' idle tasks were *also* running almost
continuously during that time (subject to clock/net interrupts), and
that the run queue on both CPUs was zero for most of that second
while gmond was allegedly running.

Second, the interval I graphed was about nine seconds.  During that
time, the PHP command line script made a whole lot of requests: it
usleeps 50ms between requests, and non-broken requests average about
1.4ms.  So even with the stalled request chopping 2 seconds off the
end, there should be somewhere in the neighborhood of 130 requests
during the graphed period.  But that php process doesn't appear in the
schedgraph output at all.
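
(For what it's worth, the arithmetic checks out: roughly 7 seconds of
non-stalled time divided by about 51.4 ms per request cycle, that is
50 ms of sleep plus 1.4 ms of request, works out to about 136 requests.)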

So that doesn't make a whole lot of sense to me.

I'll try to get another trace and see if that happens the same way again.


Re: Intermittent system hangs on 7.2-RELEASE-p1

2009-09-10 Thread Ryan Stone
You should be able to run schedgraph.py on a windows machine with python
installed.  It works just fine for me on XP.


Re: Re: How to prevent other CPU from accessing a set of pages before calling pmap_remove_all function

2009-09-10 Thread guomingyan

On Sep 10, 2009 5:08am, Kostik Belousov wrote:
> On Wed, Sep 09, 2009 at 11:57:24PM -0700, MingyanGuo wrote:
> > On Wed, Sep 9, 2009 at 11:26 PM, MingyanGuo <guoming...@gmail.com> wrote:
> > > Hi all,
> > >
> > > I find that function pmap_remove_all for arch amd64 works with a time
> > > window between reading & clearing the PTE flags (access flag and dirty
> > > flag) and invalidating its TLB entry on other CPU.  After some
> > > discussion with Li Xin (cced), I think all the processes that are using
> > > the PTE being removed should be blocked before calling pmap_remove_all,
> > > or other CPU may dirty the page but does not set the dirty flag before
> > > the TLB entry is flushed.  But I can not find how to block them to call
> > > the function.  I read the function vm_pageout_scan in file
> > > vm/vm_pageout.c but can not find the exact method it used.  Or I just
> > > misunderstood the semantics of function pmap_remove_all?
> > >
> > > Thanks in advance.
> > >
> > > Regards,
> > > MingyanGuo
> >
> > Sorry for the noise.  I understand the logic now.  There is no time
> > window problem between reading & clearing the PTE and invalidating it
> > on other CPU, even if other CPU is using the PTE.  I misunderstood the
> > logic.
>
> Hmm.  What would happen in the following scenario?
>
> Assume that the page m is mapped by a vm map active on CPU1, and that
> CPU1 has a cached TLB entry for some writable mapping of this page,
> but neither the TLB entry nor the PTE has the dirty bit set.
>
> Then, assume that the following sequence of events occurs:
>
> CPU1:                              CPU2:
>                                    call pmap_remove_all(m)
>                                    clear pte
> write to the address mapped
> by m [*]
>                                    invalidate the TLB,
>                                    possibly making IPI to CPU1
>
> I assume that at the point marked [*], we can
> - either lose the dirty bit, while CPU1 (atomically) sets the dirty bit
>   in the cleared pte.
>   Besides not properly tracking the modification status of the page,
>   it could also cause the page table page to be modified, which would
>   create a non-zero page with the PG_ZERO flag set.
> - or CPU1 re-reads the PTE entry when setting the dirty bit, and generates
>   a #pf since the valid bit in the PTE is zero.
>
> Intel documentation mentions that dirty or accessed bit updates are done
> with a locked cycle, which definitely means that the PTE is re-read, but I
> cannot find whether the valid bit is rechecked.

I am not an architecture expert, but from a programmer's point of view,
I *think* using the 'in memory' PTE structure for the first write to
that PTE is the more reasonable behavior.  To set the dirty bit, a CPU
has to access memory with locked cycles anyway, so using the 'in memory'
PTE structure should add little performance overhead while being
friendlier to software.  However, this is just my guess; I am reading
the manuals to see whether they describe it.

Regards,
MingyanGuo


Re: Intermittent system hangs on 7.2-RELEASE-p1

2009-09-10 Thread Linda Messerschmidt
On Thu, Aug 27, 2009 at 5:29 PM, John Baldwin wrote:
> Ah, cool, what you want to do is use KTR with KTR_SCHED and then use
> schedgraph.py (src/tools/sched) to get a visual picture of what the box does
> during a hang.  The timestamps in KTR are TSC cycle counts rather than an
> actual wall time which is why they look off.  If you have a lot of events you
> may want to use a larger KTR_ENTRIES size btw (I use 1048576 (2 ^ 20) here at
> work to get large (multiple-second) traces).

I'm still working on this.

I enabled KTR and set it up to log KTR_SCHED events.  Then, I wrote a
script to exercise the HTTP server that actually ran on that machine,
and set it to issue "sysctl debug.ktr.cpumask=0" and abort if a
request took over 2 seconds.  28,613 requests later, it tripped over
one that took 2007ms.

(Just a refresher: this is a static file being served by an Apache
process that has nothing else to do but serve this file on a
relatively unloaded machine.)
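
(The probe was apparently the PHP command-line script mentioned
elsewhere in the thread; purely as an illustration of the loop described
above, a rough Python equivalent might look like the sketch below.  The
URL is a made-up placeholder.)

    # Rough sketch of the probe loop (not the original PHP script).
    # Hammer a static URL; if any request takes more than 2 seconds,
    # freeze the KTR buffer so the stall is still in it, then stop.
    import subprocess
    import time
    import urllib.request

    URL = "http://localhost/test.html"     # hypothetical static test file

    while True:
        start = time.time()
        urllib.request.urlopen(URL).read()
        elapsed = time.time() - start
        if elapsed > 2.0:
            subprocess.check_call(["sysctl", "debug.ktr.cpumask=0"])
            print("stalled request: %.0f ms" % (elapsed * 1000))
            break
        time.sleep(0.05)                   # 50 ms between requests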

I don't have access to any machines that can run X, so I did the best
I could to examine it from the shell.

First, this machine has two CPUs, so I split up the KTR results on a
per-CPU basis so I could look at each individually.
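
(The split itself is mechanical; a sketch in Python is below, assuming a
ktrdump listing in which the CPU number appears as one of the leading
columns.  Check the dump header first, since the exact column position
depends on the ktrdump options used.)

    # Sketch: split a ktrdump listing into one file per CPU.
    # Assumes the CPU number is the second whitespace-separated column;
    # adjust the index to match your dump's actual layout.
    import sys

    outputs = {}
    for line in sys.stdin:
        fields = line.split()
        if len(fields) < 2 or not fields[1].isdigit():
            continue                       # skip header/blank lines
        cpu = fields[1]
        if cpu not in outputs:
            outputs[cpu] = open("ktr.cpu%s.txt" % cpu, "w")
        outputs[cpu].write(line)

    for f in outputs.values():
        f.close()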

With KTR_ENTRIES set to 1048576, I got about 53 seconds of data with
just KTR_SCHED enabled.  Since I was interested in a 2.007 second
period of time right at the end, I hacked it down to the last 3.795
seconds.

In the 3.795 seconds captured in the trace period on CPU 0 that
includes the entire 2.007 second stall, CPU 0 was idle for 3.175
seconds.

In the same period, CPU 1 was idle for 3.2589 seconds.

I did the best I could to manually page through all the scheduling
activity on both CPUs during that 3.7 second time, and I didn't see
anything really disruptive.  Mainly idle, with jumps into the clock
and ethernet kernel threads, as well as httpd.

If I understand that correctly and have done everything right, that
means that whatever happened, it wasn't related to CPU contention or
scheduling issues of any sort.

So, a couple of follow-up questions:

First, what else should I be looking at?  I built the kernel with kind
of a lot of KTR flags
(KTR_LOCK|KTR_SCHED|KTR_PROC|KTR_INTR|KTR_CALLOUT|KTR_UMA|KTR_SYSC)
but enabling them all produces enough output that even 1048576 entries
doesn't always go back two seconds; the volume of data is all but
unmanageable.

Second, is there any way to correlate the process address reported by
the KTR scheduler entries back to a PID?  It'd be nice to be able to
view the scheduler activity just for the process I'm interested in,
but I can't figure out which one it is. :)
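
(A crude partial workaround, sketched here only as an assumption about
the record format: the sched trace records tend to print a thread or
process name near the pointer value, so scanning the dump for a given
address and collecting the parenthesized names on those lines will
usually at least recover the command name, if not the PID.)

    # Sketch: guess a thread's name from KTR records mentioning its address.
    # Assumes records print a name in parentheses near the pointer, e.g.
    # "0xc4a1b000(httpd)"; adjust the regex to your actual record format.
    import re
    import sys
    from collections import Counter

    addr = sys.argv[1]            # thread/proc address, e.g. 0xc4a1b000
    names = Counter()
    for line in open(sys.argv[2]):
        if addr in line:
            names.update(re.findall(r'\(([^)]+)\)', line))

    for name, n in names.most_common(5):
        print(n, name)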

Thanks!


Re: How to prevent other CPU from accessing a set of pages before calling pmap_remove_all function

2009-09-10 Thread Kostik Belousov
On Wed, Sep 09, 2009 at 11:57:24PM -0700, MingyanGuo wrote:
> On Wed, Sep 9, 2009 at 11:26 PM, MingyanGuo  wrote:
> 
> > Hi all,
> >
> > I find that function pmap_remove_all for arch amd64 works with a time
> > window between reading & clearing the PTE flags(access flag and dirty flag)
> > and invalidating its TLB entry on other CPU. After some discussion with Li
> > Xin(cced), I think all the processes that are using the PTE being removed
> > should be blocked before calling pmap_remove_all, or other CPU may dirty the
> > page but does not set the dirty flag before the TLB entry is flushed. But I
> > can not find how to block them to call the function. I read the function
> > vm_pageout_scan in file vm/vm_pageout.c but can not find the exact method it
> > used.  Or I just misunderstood the semantics of function pmap_remove_all ?
> >
> > Thanks in advance.
> >
> > Regards,
> > MingyanGuo
> >
> 
> Sorry for the noise. I understand the logic now. There is no time window
> problem between reading & clearing the PTE and invalidating it on other CPU,
> even if other CPU is using the PTE.  I misunderstood the logic.

Hmm.  What would happen in the following scenario?

Assume that the page m is mapped by a vm map active on CPU1, and that
CPU1 has a cached TLB entry for some writable mapping of this page,
but neither the TLB entry nor the PTE has the dirty bit set.

Then, assume that the following sequence of events occurs:

CPU1:                              CPU2:
                                   call pmap_remove_all(m)
                                   clear pte
write to the address mapped
by m [*]
                                   invalidate the TLB,
                                   possibly making IPI to CPU1

I assume that at the point marked [*], we can
- either lose the dirty bit, while CPU1 (atomically) sets the dirty bit
  in the cleared pte.
  Besides not properly tracking the modification status of the page,
  it could also cause the page table page to be modified, which would
  create a non-zero page with the PG_ZERO flag set.
- or CPU1 re-reads the PTE entry when setting the dirty bit, and generates
  a #pf since the valid bit in the PTE is zero.

Intel documentation mentions that dirty or accessed bit updates are done
with a locked cycle, which definitely means that the PTE is re-read, but I
cannot find whether the valid bit is rechecked.

