On 11.02.13 17.14, Karl Wright wrote:
> I've confirmed that there's a deadlock of some kind; it's occurring as
> a result of the hopcount depth tracking. The issue seems to be nested
> transactions; PostgreSQL doesn't appear to be dealing with these
> properly in all cases, and winds up getting a deadlock that it doesn't
> detect.
> I have to look carefully at the code to see if it can be restructured.
Perhaps I should wait till I start the job once again?
When I had the last version from trunk deployed on our test version, the
job ran for over three days without any problems. This deadlock occurred
only one hour after I started the job earlier today with version 1.1.1
RC0 installed. The only thing I did prior to the upgrade was to stop the
running job, stopping the Agent process and the Resin instance. I _did_
notice that the Agent process was still running, so I killed it (-9) and
cleaned locks thereafter. I doubt that this had some impact on PG, but
I'm mentioning this anyway in order to provide as much information as
possible. Looking at the bash history, I can confirm that I did everything
in the correct order: killed the process and then ran the lock clean
command class.
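
The order described above (force-kill the agent only if it is still alive, then clean the locks once nothing is running) could be sketched roughly as follows. This is an illustration only, not part of the attached script: the function names `run` and `force_stop_then_clean` and the `DRY_RUN` switch are made up for the example, and both the `executecommand.sh` path and the lock clean class name `org.apache.manifoldcf.core.LockClean` are assumptions that should be checked against the actual installation.

```shell
#!/bin/bash
# Sketch only: the helper names and DRY_RUN are illustrative assumptions.
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}
# Assumed wrapper location and assumed lock clean class name; verify both.
EXECUTECOMMAND=${EXECUTECOMMAND:-"${MCF_HOME:-.}/processes/executecommand.sh"}
LOCK_CLEAN_CLASS=${LOCK_CLEAN_CLASS:-org.apache.manifoldcf.core.LockClean}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Send SIGKILL to the given PID if it is still alive, then clean the locks.
force_stop_then_clean() {
    local pid=$1
    [ -z "$pid" ] && return 10
    if kill -0 "$pid" 2>/dev/null; then
        run kill -9 "$pid"
    fi
    # The lock clean step comes last, after the kill has been issued.
    run "$EXECUTECOMMAND" "$LOCK_CLEAN_CLASS"
}
```

Calling `force_stop_then_clean "$(cat "$PIDFILE")"` with the default `DRY_RUN=1` only prints the two steps, which makes it easy to compare the order against the bash history before running anything for real.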
Our agent control script is also attached in case we're doing something
wrong.
Erlend
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
#!/bin/bash
# ManifoldCF control script for the Agent process in production environment.
# Which user the script should run as.
RUN_AS=resin
# Pid file and log dir
PIDFILE=/www/var/run/mcf_agent/mcf_agent.pid
LOGDIR=/www/var/log/mcf/mcf-1/
# MCF_HOME
MCF_HOME=/www/var/data/mcf/mcf-1/conf
JAVA_HOME=/www/java/java1.6
export MCF_HOME
export JAVA_HOME
readonly MCF_HOME
cd "$MCF_HOME"
# Check user
if [ "$(id -nu)" != "$RUN_AS" ]; then
    echo "Error: this script should be run as $RUN_AS"
    exit 10
fi
# util funcs
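# Wait until process $1 exits, polling once per second for at most $2 seconds.
# Prints $3 as a progress marker; returns 1 on timeout, 10 on missing arguments.
waitpid() {
    local pid=$1 timeout=$2 progress=$3
    { [ -z "$pid" ] || [ -z "$timeout" ]; } && return 10
    local t=0
    while ps -p "$pid" 1>/dev/null 2>&1; do
        sleep 1
        t=$((t + 1))
        if [ "$t" -eq "$timeout" ]; then
            return 1
        fi
        echo -n "$progress"
    done
    return 0
}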
# Return 0 if the PID recorded in the PID file belongs to a live process.
is_running() {
    if ! [ -f "$PIDFILE" ]; then
        return 1
    fi
    local pid
    pid=$(cat "$PIDFILE")
    if ! ps -p "$pid" 1>/dev/null 2>&1; then
        echo "Warn: a stale PID file was detected, removing it."
        rm -f "$PIDFILE"
        return 1
    fi
    return 0
}
# Command funcs
cmd_start() {
    if is_running; then
        echo "Error: MCF Agent seems to be running already with PID $(cat "$PIDFILE")"
        return 1
    fi
    local pid
    echo "Starting MCF agent ..."
    # Note: cmd_stop calls processes/script/executecommand.sh instead;
    # verify which of the two paths exists.
    "$MCF_HOME/processes/executecommand.sh" \
        org.apache.manifoldcf.agents.AgentRun \
        1>>"$LOGDIR/mcf_agent.stdout.log" 2>>"$LOGDIR/mcf_agent.stderr.log" & pid=$!
    disown $pid
    echo "MCF Agent started with PID $pid"
    echo $pid > "$PIDFILE"
    return 0
}
cmd_stop() {
    if ! is_running; then
        echo "Error: MCF Agent does not seem to be running."
        return 1
    fi
    local pid
    pid=$(cat "$PIDFILE")
    echo -n "Stopping MCF Agent ..."
    # Ask the agent to shut down, then signal the process and wait for it.
    "$MCF_HOME/processes/script/executecommand.sh" \
        org.apache.manifoldcf.agents.AgentStop \
        1>>"$LOGDIR/mcf_agent.stdout.log" 2>>"$LOGDIR/mcf_agent.stderr.log"
    kill "$pid"
    if ! waitpid "$pid" 30 .; then
        echo
        echo "Warn: failed to stop MCF Agent in 30 seconds, sending SIGKILL"
        kill -9 "$pid"
        sleep 1
    fi
    echo "stopped."
    rm -f "$PIDFILE"
}
cmd_status() {
    if is_running; then
        echo "MCF Agent is running:"
        ps -o pid,cmd --width 5000 -p "$(cat "$PIDFILE")"
    else
        echo "MCF Agent is not running."
    fi
}
usage_exit() {
    echo "Usage: $0 <command>"
    echo "Available commands:"
    echo "start stop restart status"
    exit 2
}
COMMAND=$1
case $COMMAND in
    start)
        cmd_start
        ;;
    stop)
        cmd_stop
        ;;
    restart)
        if is_running; then
            cmd_stop
        fi
        cmd_start
        ;;
    status)
        cmd_status
        ;;
    *)
        usage_exit
        ;;
esac