On 11.02.13 17.14, Karl Wright wrote:
I've confirmed that there's a deadlock of some kind; it's occurring as
a result of the hopcount depth tracking.  The issue seems to be nested
transactions; PostgreSQL doesn't appear to be dealing with these
properly in all cases, and winds up getting a deadlock that it doesn't
detect.

I have to look carefully at the code to see if it can be restructured.

Perhaps I should wait till I start the job once again?

When I had the last version from trunk deployed on our test version, the job ran for over three days without any problems. This deadlock occurred only one hour after I started the job earlier today with version 1.1.1 RC0 installed. The only thing I did prior to the upgrade was to stop the running job, the Agent process and the Resin instance. I _did_ notice that the Agent process was still running, so I killed it (-9) and cleaned locks thereafter. I doubt that this had any impact on PG, but I'm mentioning it anyway in order to provide as much information as possible. Looking at the bash history, I can confirm that I did everything in the correct order: killed the process and then ran the lock clean command class.
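
The order described above (make sure the process is really gone, then clean locks) could be sketched roughly as follows. This is only an illustration: `stop_then_clean` is a hypothetical helper, not part of ManifoldCF or of the attached script, and the clean-up command is whatever the caller passes in.

```shell
# Hypothetical helper, not part of ManifoldCF or the attached control script.
# Usage: stop_then_clean PID TIMEOUT_SECONDS CLEAN_COMMAND [ARGS...]
stop_then_clean() {
    local pid=$1 timeout=$2
    shift 2
    local t=0

    # Ask the process to exit, then poll until it is gone or the timeout expires.
    kill "$pid" 2>/dev/null
    while kill -0 "$pid" 2>/dev/null && [ "$t" -lt "$timeout" ]; do
        sleep 1
        t=$((t + 1))
    done

    # Still alive after the timeout: send SIGKILL and wait a moment.
    if kill -0 "$pid" 2>/dev/null; then
        kill -9 "$pid" 2>/dev/null
        sleep 1
    fi

    # Refuse to clean locks while the process still exists.
    if kill -0 "$pid" 2>/dev/null; then
        echo "Error: process $pid is still alive, not cleaning locks" >&2
        return 1
    fi

    # Only now run the clean-up command supplied by the caller.
    "$@"
}
```

The trailing arguments would be the lock clean invocation, for example executecommand.sh with org.apache.manifoldcf.core.LockClean, assuming that is the class name used in this installation. The helper returns the exit status of that command, so a failed clean-up is visible to the caller.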

Our agent control script is also attached in case we're doing something wrong.

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
#!/bin/bash
# ManifoldCF control script for the Agent process in production environment.

# Which user the script should run as.
RUN_AS=resin

# Pid file and log dir
PIDFILE=/www/var/run/mcf_agent/mcf_agent.pid
LOGDIR=/www/var/log/mcf/mcf-1/

# MCF_HOME
MCF_HOME=/www/var/data/mcf/mcf-1/conf
JAVA_HOME=/www/java/java1.6

export MCF_HOME
export JAVA_HOME
readonly MCF_HOME

cd "$MCF_HOME"

# Check user
if [ "$(id -nu)" != "$RUN_AS" ]; then
    echo "Error: this script should be run as $RUN_AS"
    exit 10
fi

# util funcs
waitpid() {
    local pid=$1 timeout=$2 progress=$3
    { [ -z "$pid" ] || [ -z "$timeout" ]; } && return 10
    local t=0
    while ps -p $pid 1>/dev/null 2>&1; do
        sleep 1
        t=$((t + 1))
        if [ $t -eq $timeout ]; then
            return 1
        fi
        echo -n "$progress"
    done
    return 0
}

is_running() {
    if ! [ -f $PIDFILE ]; then
        return 1
    fi

    local pid
    pid=$(cat $PIDFILE)
    
    if ! ps -p $pid 1>/dev/null 2>&1; then
        echo "Warn: a stale PID file was detected, removing it."
        rm -f $PIDFILE
        return 1
    fi
    return 0        
}

# Command funcs
cmd_start() {
    if is_running; then
        echo "Error: MCF Agent seems to be running already with PID $(cat $PIDFILE)"
        return 1
    fi

    local pid
    echo "Starting MCF agent ..."
    $MCF_HOME/processes/executecommand.sh org.apache.manifoldcf.agents.AgentRun \
        1>>$LOGDIR/mcf_agent.stdout.log 2>>$LOGDIR/mcf_agent.stderr.log & pid=$!
    disown $pid
    echo "MCF Agent started with PID $pid"
    echo $pid > $PIDFILE
    return 0
}

cmd_stop() {
    if ! is_running; then
        echo "Error: MCF Agent does not seem to be running."
        return 1
    fi

    local pid
    pid=$(cat $PIDFILE)

    echo -n "Stopping MCF Agent ..."
    $MCF_HOME/processes/script/executecommand.sh org.apache.manifoldcf.agents.AgentStop \
        1>>$LOGDIR/mcf_agent.stdout.log 2>>$LOGDIR/mcf_agent.stderr.log
    kill $pid
    waitpid $pid 30 .
    if [ $? -ne 0 ]; then
        echo
        echo "Warn: failed to stop MCF Agent in 30 seconds, sending SIGKILL"
        kill -9 $pid
        sleep 1
    fi
    echo "stopped."
    rm -f $PIDFILE
}

cmd_status() {
    if is_running; then
        echo "MCF Agent is running:"
        ps -o pid,cmd --width 5000 -p $(cat $PIDFILE)
    else
        echo "MCF Agent is not running."
    fi
}

usage_exit() {
    echo "Usage: $0 <command>"
    echo "Available commands:"
    echo "start stop restart status"
    exit 2
}

COMMAND=$1

case $COMMAND in
    start)
        cmd_start
        ;;
    
    stop)
        cmd_stop
        ;;
    
    restart)
        if is_running; then
            cmd_stop
        fi
        cmd_start
        ;;
    
    status)
        cmd_status
        ;;
    *)
        usage_exit ;;
esac
