All:

We have thus far been unable to reproduce the following application
behavior when running outside of gdb inside emacs.  Therefore, I am
starting with this list for suggestions on how to proceed.



The high level description of the problem is this:  At times, when
running an application that (a) opens many sockets (b) receives
relatively high rates of traffic on those sockets (c) eventually has
many threads running [50+] and (d) mallocs several large blocks of
memory, some up to 500M or 1G ... the application will "hang" for long
periods inside memset(0) of one of those memory blocks.  It is not clear
that (a-c) are relevant, since the behavior is often exhibited in an
initialization thread ahead of starting the sockets.
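For concreteness, the pattern is roughly the following (a minimal
sketch only -- the 1G size, the single init thread, and the names are
hypothetical illustrations, not taken from the real application):

    /* Sketch of the pattern above: an initialization thread mallocs a
     * large block and zeroes it with memset().  Hypothetical sizes. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static void *init_thread(void *arg)
    {
        size_t len = (size_t)1 << 30;   /* ~1G block, as in (d) */
        void *block = malloc(len);

        (void)arg;
        if (block == NULL)
            return NULL;
        memset(block, 0, len);          /* the "hang" is observed here */
        return block;
    }

    int main(void)
    {
        pthread_t tid;
        pthread_create(&tid, NULL, init_thread, NULL);
        pthread_join(tid, NULL);
        puts("init memset done");
        return 0;
    }

(Built with gcc -pthread.)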

The slow memset() happens most often, or at least more often, when
running gdb inside emacs.  C-c C-c will take 30sec+ and sometimes up to
several minutes to return to the (gdb) prompt, and the machine will be
generally slow for some period of time during and after this.  We do
not see evidence of virtual memory paging, but we are not certain we
are looking in all the right places -- hints appreciated.
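One place we could look (a sketch only -- suggestions for better
instrumentation welcome) is the process's own fault counters via
getrusage(), sampled just before and after the slow memset: a large
jump in ru_majflt would indicate real paging, while ru_minflt alone
would just be the expected first-touch faults on freshly malloc'd pages:

    /* Sketch: measure major/minor page faults across a large memset.
     * Illustrative only; not code from the actual application. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/resource.h>

    int main(void)
    {
        size_t len = (size_t)1 << 30;      /* hypothetical 1G block */
        char *block = malloc(len);
        struct rusage before, after;

        if (block == NULL)
            return 1;
        getrusage(RUSAGE_SELF, &before);
        memset(block, 0, len);
        getrusage(RUSAGE_SELF, &after);
        printf("minor faults: %ld  major faults: %ld\n",
               after.ru_minflt - before.ru_minflt,
               after.ru_majflt - before.ru_majflt);
        free(block);
        return 0;
    }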

The problem occurs both with stock gdb and with a gdb patched with this
patch:
http://sourceware.org/ml/gdb-patches/2010-04/msg00466.html.  The version
strings are, respectively:

GNU gdb (Ubuntu/Linaro 7.2-1ubuntu11) 7.2
GNU gdb (GDB) 7.2

Invariably, when breaking via C-c C-c we end up with the program counter
*inside* memset, which leads me to suspect that some kind of overactive
page-fault situation is occurring and is drastically slowing the
machine.

We have seen references to problems with memset when crossing 2G
boundaries, but this is a 64-bit box with 33G of RAM, so I would have
thought this is not a problem.  Further, that reference seems to involve
pointers mapped into userland from a driver, which is not the case here:
http://lists.kernelnewbies.org/pipermail/kernelnewbies/2011-February/000760.html

Another reference, which implicates MCFG ACPI table problems
(http://lists.us.dell.com/pipermail/linux-precision/2011-February/001503.html
and https://bugzilla.redhat.com/show_bug.cgi?id=581933), also seems
unrelated, as booting with pci=nommconf does not help.

I am addressing gdb to start with because this intermittent problem does
not seem to occur outside of gdb.  My thought is that when running
inside both emacs and gdb we end up mapped into an area of physical
memory that exhibits a problem at the kernel level.  However, I am not
entirely sure whether gdb could be receiving any signals or be otherwise
interposed in the memory allocation and subsequent walk of those bytes
by memset().

I can say with a low degree of confidence that the problem occurs more
frequently when the system has higher incoming network load, though
there is no chance that all CPUs are pegged or that "lots" of active
threads are running by the time the slow call is made.  In fact, often,
the memset() occurs in an initialization thread that starts before other
threads.  Thus, gdb, emacs and bash are always the *only* running
processes besides stock Ubuntu processes -- i.e., there is no other
work going on that is starving the CPUs.  Finally, we have disabled the
"ondemand" frequency-scaling governor, so all cores are running at full
speed.

Another reason to suspect gdb is that we have run fairly thorough memory
tests (both BIOS-level and memtester within Linux) on the machine,
including writing an application that simply allocates huge chunks of
memory and memsets them.  The large-memset application was run inside
gdb and inside emacs and did not exhibit the behavior.  Could this be
some kind of code offset/alignment or symbol lookup problem exhibited
only by the problem executable when loaded by gdb?
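A sketch of that kind of stand-alone test (sizes and loop counts here
are illustrative, not the exact program), with each memset timed so
that an unusually slow call would stand out:

    /* Sketch: allocate several huge blocks and time each memset. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        const size_t len = (size_t)1 << 30;   /* ~1G per block */
        int i;

        for (i = 0; i < 8; i++) {
            char *block = malloc(len);
            struct timespec t0, t1;
            double secs;

            if (block == NULL) {
                perror("malloc");
                return 1;
            }
            clock_gettime(CLOCK_MONOTONIC, &t0);
            memset(block, 0, len);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            secs = (t1.tv_sec - t0.tv_sec)
                 + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            printf("block %d: memset of %zu bytes took %.3f s\n",
                   i, len, secs);
            free(block);
        }
        return 0;
    }

(May need -lrt for clock_gettime on older glibc.)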

Also, this does not appear to be a problem with emacs' prompt
parsing/gud-mode/etc., because the application definitely does not
proceed beyond the memset for a long period, and breaking puts us within
memset.  Under these conditions it is sometimes necessary to C-z out of
emacs and kill -9 both gdb and the application.

I am somewhat at a loss on how to debug this and do not have the
resources to run into too many dead ends.  Therefore, can anyone suggest
whether whole-system profiling such as oprofile would help catch
kmap/kunmap or other kernel / virtual-memory badness?  Would running
gdb inside gdb be fruitful, and if so, can anyone point me to functions
or areas I would look to breakpoint or otherwise monitor under such a
setup?

I am unable to reproduce the problem right now in order to provide a
stack trace of the slow thread, but will follow up when I can.  Details
of the hardware and OS are below.

Thanks in advance,
A.A.



Ubuntu system with 48 AMD cores

uname -a:  Linux  2.6.38-8-server #42-Ubuntu SMP Mon Apr 11 03:49:04 UTC
2011 x86_64 x86_64 x86_64 GNU/Linux

Last /proc/cpuinfo stanza:

processor    : 47
vendor_id    : AuthenticAMD
cpu family    : 16
model        : 9
model name    : AMD Opteron(tm) Processor 6180 SE
stepping    : 1
cpu MHz        : 2500.000
cache size    : 512 KB
physical id    : 3
siblings    : 12
core id        : 5
cpu cores    : 12
apicid        : 75
initial apicid    : 59
fpu        : yes
fpu_exception    : yes
cpuid level    : 5
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc
extd_apicid amd_dcm pni monitor cx16 popcnt lahf_lm cmp_legacy svm
extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit
wdt nodeid_msr npt lbrv svm_lock nrip_save pausefilter
bogomips    : 5000.18
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate



/proc/meminfo:

MemTotal:       33008392 kB
MemFree:        28624044 kB
Buffers:           59708 kB
Cached:          1302196 kB
SwapCached:         1764 kB
Active:          2700812 kB
Inactive:         369772 kB
Active(anon):    1673156 kB
Inactive(anon):    35572 kB
Active(file):    1027656 kB
Inactive(file):   334200 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      14865404 kB
SwapFree:       14861092 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:       1707772 kB
Mapped:            59296 kB
Shmem:                28 kB
Slab:             666848 kB
SReclaimable:      85424 kB
SUnreclaim:       581424 kB
KernelStack:        3128 kB
PageTables:        10576 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    31369600 kB
Committed_AS:    4772588 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      122392 kB
VmallocChunk:   34328955388 kB
HardwareCorrupted:     0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       65920 kB
DirectMap2M:     9369600 kB
DirectMap1G:    24117248 kB

