--- Begin Message ---
Package: portmap
Version: 5-9
Severity: important
I have a heterogeneous network of Un*x boxes, of which several are identical
Debian boxes, networked together using NIS (the server is a Solaris-8 box if
that matters) and mounting/exporting NFS via automount.
During heavy I/O, portmap dies, leaving no error messages in the log file.
This takes NFS down with it, cauing a loss of availability to the other
machines in the network.
By "heavy I/O", I mean the following: I can consistently kill portmap in a
matter of hours by running enough (six seems to do it) cp commands of large
(800+MB) files in infinite loops, while having other machines mount and
unmount directories served by this machine via NFS (which causes calls to
portmap). I need not copy the files over NFS -- merely copying them from one
directory to another on the internal hard drive is sufficient.
Specifically, on the local machine, the machine on which I intend to kill
portmap, I open up six xterms, and in each type the following:
while true; do cp <some huge file> <somewhere else> ; done
Where both source and destination are on the local drive. Once the local host
becomes good ans sluggish, I ssh into several remote machines and run the
following:
while true; do umount <some NFS dir> ; mount <some NFS dir> ; done
I don't know if all of that is actually *necessary* to reproduce the problem,
but it seems to work consistently. If I leave the machine be, portmap seems
to run forever. If I'm not doing heavy I/O, but merely having other hosts
mount/unmount NFS (causing calls to portmap), portmap runs forever. It only
seems to die *during* a call to portmap *while* the machine is doing I/O. (It
may only take 30 minutes, or it may take several hours, but during each of my
tests, I can begin testing when I leave work and have portmap broken by
morning.)
Running strace, I was able to find that portmap dies of a SIGPIPE while
writing to the network connection. (Snippets of strace output included
below.) I've performed these tests several (about 10) times and the output is
the same.
An example snippet from strace:
------------------------------------------------------------------------------
9553 15:17:58.333626 accept(4, {sa_family=AF_INET, sin_port=htons(46428),
sin_addr=inet_addr("127.0.0.1")}, [16]) = 5
9553 15:17:58.360418 poll([{fd=3,
events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}, {fd=4,
events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND, revents=POLLIN|POLLRDNORM}, {fd=5,
events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND, revents=POLLIN|POLLRDNORM},
{fd=-1}, {fd=-1}, {fd=-1}, {fd=-1}, {fd=-1}], 8, -1) = 2
9553 15:17:58.367775 accept(4, {sa_family=AF_INET, sin_port=htons(46429),
sin_addr=inet_addr("127.0.0.1")}, [16]) = 6
9553 15:17:58.367911 poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, 35000) =
1
9553 15:17:58.367974 read(5,
"\x80\x00\x00\x28\x7b\x9c\xf4\xf8\x00\x00\x00\x00\x00\x00"..., 400) = 44
9553 15:17:58.369828 write(5,
"\x00\x00\x01\x8c\x7b\x9c\xf4\xf8\x00\x00\x00\x01\x00\x00"..., 400) = 400
9553 15:17:58.375245 write(5,
"\x80\x00\x00\x20\x00\x00\x00\x11\x00\x00\x03\xd2\x00\x00"..., 36) = -1 EPIPE
(Broken pipe)
9553 15:17:58.377843 --- SIGPIPE (Broken pipe) @ 0 (0) ---
------------------------------------------------------------------------------
Another example snippet from strace:
------------------------------------------------------------------------------
17358 06:27:52.325025 accept(4, {sa_family=AF_INET, sin_port=htons(36450),
sin_addr=inet_addr("127.0.0.1")}, [16]) = 6
17358 06:27:52.325109 poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, 35000) =
1
17358 06:27:52.325148 read(5,
"\x80\x00\x00\x28\x77\xf5\xf2\x75\x00\x00\x00\x00\x00\x00"..., 400) = 44
17358 06:27:52.325211 write(5,
"\x00\x00\x01\x8c\x77\xf5\xf2\x75\x00\x00\x00\x01\x00\x00"..., 400) = 400
17358 06:27:52.325299 write(5,
"\x80\x00\x00\x20\x00\x00\x00\x11\x00\x00\x02\x7d\x00\x00"..., 36) = 36
17358 06:27:52.325431 poll([{fd=3, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND,
revents=POLLIN|POLLRDNORM}, {fd=4, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND,
revents=POLLIN|POLLRDNORM}, {fd=5, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND,
revents=POLLIN|POLLRDNORM}, {fd=6, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND,
revents=POLLIN|POLLRDNORM}, {fd=-1}, {fd=-1}, {fd=-1}, {fd=-1}, {fd=-1}], 9,
-1) = 4
17358 06:27:52.325500 recvmsg(3, {msg_name(16)={sa_family=AF_INET,
sin_port=htons(56537), sin_addr=inet_addr("www.xxx.yyy.zzz")},
msg_iov(1)=[{"\x43\x96\x90\x0b\x00\x00\x00\x00\x00\x00\x00\x02\x00\x01"...,
8800}], msg_controllen=24, {cmsg_len=24, cmsg_level=SOL_IP, cmsg_type=, ...},
msg_flags=0}, 0) = 56
17358 06:27:52.325590 getpid() = 17358
17358 06:27:52.325629 open("/etc/hosts.allow", O_RDONLY) = 7
17358 06:27:52.325675 fstat64(7, {st_mode=S_IFREG|0644, st_size=20926, ...}) = 0
17358 06:27:52.325733 mmap2(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4016b000
17358 06:27:52.325772 read(7, "################################"..., 4096) =
4096
17358 06:27:52.411990 close(7) = 0
17358 06:27:52.412034 munmap(0x4016b000, 4096) = 0
17358 06:27:52.412091 clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x40021b88)
= 18081
17358 06:27:52.412206 wait4(-1, <unfinished ...>
17358 06:28:03.860962 <... wait4 resumed> NULL, 0, NULL) = -1 ECHILD (No child
processes)
17358 06:28:03.861011 sendmsg(3, {msg_name(16)={sa_family=AF_INET,
sin_port=htons(56537), sin_addr=inet_addr("www.xxx.yyy.zzz")},
msg_iov(1)=[{"\x43\x96\x90\x0b\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00"...,
28}], msg_controllen=24, {cmsg_len=24, cmsg_level=SOL_IP, cmsg_type=, ...},
msg_flags=0}, 0) = 28
17358 06:28:03.861103 accept(4, {sa_family=AF_INET, sin_port=htons(36451),
sin_addr=inet_addr("127.0.0.1")}, [16]) = 7
17358 06:28:03.861188 poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, 35000) =
1
17358 06:28:03.861226 read(5, "", 400) = 0
17358 06:28:03.861266 close(5) = 0
17358 06:28:03.861332 poll([{fd=6, events=POLLIN, revents=POLLIN}], 1, 35000) =
1
17358 06:28:03.861367 read(6,
"\x80\x00\x00\x28\x44\x09\x28\x0d\x00\x00\x00\x00\x00\x00"..., 400) = 44
17358 06:28:03.861432 write(6,
"\x00\x00\x01\x8c\x44\x09\x28\x0d\x00\x00\x00\x01\x00\x00"..., 400) = 400
17358 06:28:03.861490 write(6,
"\x80\x00\x00\x20\x00\x00\x00\x11\x00\x00\x02\x7d\x00\x00"..., 36) = -1 EPIPE
(Broken pipe)
17358 06:28:03.861536 --- SIGPIPE (Broken pipe) @ 0 (0) ---
------------------------------------------------------------------------------
(Not being a guru, not understanding the internals of portmap, and not
understanding network programming, my opinions should probably be taken with a
grain of salt. But...) A few solutions that come to mind are:
1) If it's possible to check the network connection before writing, that would
prevent SIGPIPE.
2) Have portmap ignore SIGPIPE, assuming that portmap is capable of
recovering gracefully on it's own, or...
3) Installing an event handler to intercept SIGPIPE and allow portmap the
opportunity to recover gracefully.
Thanks for all your hard work. Debian rocks because of people like you.
Michael Peek
P.S. Full strace output can be made available (about 1-3k in size per file)
and I am available for testing patched packages.
-- System Information:
Debian Release: 3.1
Architecture: i386 (i686)
Kernel: Linux 2.6.8-2-686
Locale: LANG=en_US, LC_CTYPE=en_US (charmap=ISO-8859-1)
Versions of packages portmap depends on:
ii libc6 2.3.2.ds1-22 GNU C Library: Shared libraries an
ii libwrap0 7.6.dbs-8 Wietse Venema's TCP wrappers libra
-- no debconf information
--- End Message ---