On Tue, Jun 01, 2004 at 11:51:39AM +0200, Jim Page - emailsystems.com wrote:
> I believe I have a similar issue to that described by Steve in the post
> http://www.coda.cs.cmu.edu/maillists/codalist/codalist-2003/5878.html -
> which I don't think was ever resolved, so I'd like to take up the torch.
Well, I did find and fix what I believed to be the problem, an
exponential growth in the RPC2 lookup path. But that only affects
servers that are contacted by clients behind a masquerading firewall.
> The failure mode is always the same. Here is a typical entry the SrvLog:
>
> 07:34:34 ****** WARNING entry at 0x8122320 already has deqing set!
I remember seeing this message in my logs as well, but it never seemed
fatal. Just checked the source to see what is triggering this. It looks
like the server receives a COP2 (phase 2 commit) message for some
operation that is already being committed.
My quick guess is that the open() from the MTA was given a piggybacked
COP2 message to commit a recently performed operation. But we get stuck
trying to obtain a lock which we need to bump the version vector. At
some point some thread in the client checks for as yet uncommitted
operations and sends a ViceCOP2 message for the same operation which
triggers the deqing message we see.
It looks like a lot of the related debug messages are at least level 10
or higher which means that there will be a lot of noise, and since a
server slows down when logging it might affect the reproducability.
But I would be interested in your current log, and if it is reproducable
a log at level 10 (volutil setdebug 10), and level 30. Just bump that
loglevel somewhere after the onslaught has started and hopefully before
the lockup.
> here is where codasrv is at:
>
> (gdb) where
> #0 0x4024b44e in select () from /lib/libc.so.6
> #1 0x400812fc in __JCR_LIST__ () from /usr/lib/liblwp.so.2
> #2 0x4007d130 in IOMGR (dummy=0x0) at iomgr.c:354
> #3 0x4007ef16 in Create_Process_Part2 () at lwp.c:796
Because we have userspace threads the gdb backtrace isn't useful in this
case. There is a way to get traces from the threads although it tends to
crash the server halfway through. Running it a second time sometimes
manages to dump all the related threadstacks.
(gdb) p LWP_Print_Processes()
gdb intercepts a segfault or bus error
(gdb) p LWP_Print_Processes()
The information should all end up in the SrvErr file, but it only shows
return addresses and no arguments. I slightly modified a script that
came with dmalloc to do the trick of decoding (attached it).
Jan
#!/usr/bin/perl -w
#
# lwp_process -- processes lwp process dumps.
# based upon code from.
#
# dmalloc_summarize -- summarizes dmalloc log files
#
# Copyright 1997 by USC/ISI All rights reserved.
#
# Redistribution and use in source and binary forms are permitted
# provided that the above copyright notice and this paragraph are
# duplicated in all such forms and that any documentation, advertising
# materials, and other materials related to such distribution and use
# acknowledge that the software was developed by the University of
# Southern California, Information Sciences Institute. The name of
# the University may not be used to endorse or promote products
# derived from this software without specific prior written
# permission.
#
# THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
# WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
# MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
#
# $Id: dmalloc_summarize.pl,v 1.1 1997/07/07 08:13:52 gray Exp $
#
sub usage {
print STDERR <<END;
usage: $0 [executable] < logfile
Post-process a dmalloc logfile (read from standard input).
If an EXECUTABLE is specified, unknown symbols will be translated
according to that executable.
END
exit 1;
}
require 5.000;
use strict;
# needed for gdb communications
use IPC::Open2;
# process args
use Getopt::Long;
&usage if ($#ARGV >= 1 && $ARGV[0] eq '-?');
my($exe) = undef;
if ($#ARGV >= 0) {
$exe = $ARGV[0];
};
######################################################################
my($gdb_pid);
#
# Start a gdb pipe to resolve unknown return addresses.
#
sub start_gdb {
$SIG{'PIPE'} = sub { mydie("ERROR: premature end-of-data.\n"); };
$gdb_pid = open2('GDB_RDR', 'GDB_WTR', 'gdb', '-nx', '-q', $exe) ||
die "$0: cannot run gdb on $exe\n";
# tidy things up
# prompt becomes a magic number to look for
print GDB_WTR "set prompt (gdb)\\n\n";
print GDB_WTR "set print asm-demangle on\n";
print GDB_WTR "set height 0\n";
}
#
# Huh?
#
sub never_called {
<GDB_RDR>; <GDB_WTR>; # hack for warnings
}
#
# Lookup an unknown ra-address with gdb
#
sub interpret_name {
my($name) = $_[0] ;
return unless $name ;
start_gdb() if (!defined($gdb_pid));
print GDB_WTR "info line *($name)\n";
my($something) = undef;
my($file_line, $function);
while (<GDB_RDR>) {
if (/^\(gdb\)$/) {
last if ($something);
next; # skip prompts
};
$something = 1;
if (/^Line (\d+) of "([^\"]+)"/) {
$file_line = "$2:$1";
};
if (/\<(.*)\+\d+\>/) {
$function = $1;
};
};
my($n) = "";
$n .= "$function " if (defined($function));
$n .= "[$file_line] " if (defined($file_line));
$n = $name if ($n eq '');
return $n;
}
######################################################################
# read the data
sub read_data {
print "*************************************************************\n" ;
while (<STDIN>) {
chop ;
print "$_\n" ;
next unless /^[ \t]+Stack/;
s/[ \t]+Stack: .* - // ;
my $function = interpret_name($_);
print "\t $function\n" ;
};
}
read_data;
exit 0;