Re: [Linux-ha-dev] RFC: pidfile handling; current worst case: stop failure and node level fencing

Alan Robertson Mon, 20 Oct 2014 13:53:11 -0700

For the Assimilation code I use the full pathname of the binary from
/proc to tell if it's "one of mine".  That's not perfect if you're using
an interpreted language.  It works quite well for compiled languages.
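
For illustration, here is a minimal sketch of that kind of check. It is not the actual Assimilation code; pid_runs_binary is a hypothetical name, and it assumes Linux, where /proc/PID/exe is a symlink to the running binary. For an interpreted program the link points at the interpreter, which is the imperfection mentioned above.

```shell
# Hypothetical helper: succeed only if PID is currently running EXPECTED.
# Linux-specific: relies on the /proc/PID/exe symlink.
pid_runs_binary()
{
    local pid=$1 expected=$2 exe

    # readlink fails if the process is gone or we lack permission.
    exe=$(readlink "/proc/$pid/exe" 2>/dev/null) || return 1

    # Assumption: a binary replaced on disk (e.g. by a package upgrade)
    # shows up as "/path (deleted)" and should still count as ours.
    exe=${exe% (deleted)}

    [ "$exe" = "$expected" ]
}
```

Usage would be something like: pid_runs_binary "$pid" /usr/sbin/mydaemon || echo "pid was recycled", with /usr/sbin/mydaemon standing in for whatever the agent started.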



On 10/20/2014 01:17 PM, Lars Ellenberg wrote:
> Recent discussions with Dejan made me again more prominently aware of a
> few issues we probably all know about, but usually dismiss as not having
> much relevance in the real world.
>
> The facts:
>
>  * a pidfile typically only stores a pid
>  * a pidfile may be "stale", not properly cleaned up
>    when the pid it references died.
>  * pids are recycled
>
>    This is more an issue if kernel.pid_max is small
>    wrt the number of processes created per unit time,
>    for example on some embedded systems,
>    or on some very busy systems.
>
>    But it may be an issue on any system,
>    even a mostly idle one, given "bad luck^W timing",
>    see below.
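
To get a feel for how close a given system is to reusing a pid, the following rough sketch (Linux only) compares kernel.pid_max with the most recently allocated pid. The variable names are just for illustration, and the fifth field of /proc/loadavg is only a snapshot, so treat the result as an estimate.

```shell
# Linux-specific: how much PID space is left before allocation wraps around?
pid_max=$(cat /proc/sys/kernel/pid_max)

# The fifth field of /proc/loadavg is the most recently allocated PID.
read -r _ _ _ _ last_pid </proc/loadavg

echo "pid_max=$pid_max last_pid=$last_pid headroom=$(( pid_max - last_pid ))"
```
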
>
> A common idiom in resource agents is to
>
> kill_that_pid_and_wait_until_dead()
> {
>       local pid=$1
>       is_alive $pid || return 0
>       kill -TERM $pid
>       while is_alive $pid ; do sleep 1; done
>       return 0
> }
>
> The naïve implementation of is_alive() is
> is_alive() { kill -0 $1 ; }
>
> This is the main issue:
> -----------------------
>
> If the last-used-pid is just a bit smaller than $pid,
> during the sleep 1, $pid may die,
> and the OS may already have created a new process with that exact pid.
>
> Using the above "is_alive", kill_that_pid() will not notice that the
> to-be-killed pid has actually terminated while that new process runs.
> Which may be a very long time if that is some other long running daemon.
>
> This may result in stop failure and resulting node level fencing.
>
> The question is: what better way do we have to detect whether some pid died
> after we killed it. Or, related, and even better: how to detect if the
> process currently running with some pid is in fact still the process
> referenced by the pidfile.
>
> I have two suggestions.
>
> (I am trying to avoid bashisms in here.
>  But I may have overlooked some.
>  Also, the code is typed, not sourced from some working script,
>  so there may be logic bugs and typos.
>  My intent should be obvious enough, though.)
>
> using "cd /proc/$pid; stat ."
> -----------------------------
>
> # this is most likely linux specific
> kill_that_pid_and_wait_until_dead()
> {
>       local pid=$1
>       (
>               cd /proc/$pid || return 0
>               kill -TERM $pid
>               while stat . ; do sleep 1; done
>       )
>       return 0
> }
>
> Once pid dies, /proc/$pid will become stale (but not completely go away,
> because it is our cwd), and stat . will return "No such process".
>
> Variants:
>
> using test -ef
> --------------
>
>       exec 7</proc/$pid || return 0
>       kill -TERM $pid
>       while :; do
>               exec 8</proc/$pid || break
>               test /proc/self/fd/7 -ef /proc/self/fd/8 || break
>               sleep 1
>       done
>       exec 7<&- 8<&-
>
> using stat -c %Y /proc/$pid
> ---------------------------
>
>       mtime0=$(stat -c %Y /proc/$pid)
>       kill -TERM $pid
>       while mtime=$(stat -c %Y /proc/$pid) && [ "$mtime" = "$mtime0" ] ; do sleep 1; done
>
>
> Why not use the inode number, I hear you say?
> Because it is not stable. Sorry.
> Don't believe me? Don't want to read kernel source?
> Try it yourself:
>
>       sleep 120 & k=$!
>       stat /proc/$k
>       echo 3 > /proc/sys/vm/drop_caches
>       stat /proc/$k
>
> But that leads me to another proposal:
> store the starttime together with the pid in a pidfile.
>
> For linux that would be:
>
> (see proc(5) for /proc/pid/stat field meanings.
>  note that (comm) may contain both whitespace and ")",
>  which is the reason for my sed | cut below)
>
> spawn_create_exclusive_pid_starttime()
> {
>       local pidfile=$1
>       shift
>       local reset
>       case $- in *C*) reset=":";; *) set -C; reset="set +C";; esac
>       if ! exec 3>$pidfile ; then
>               $reset
>               return 1
>       fi
>
>       $reset
>       setsid sh -c '
>               read pid _ < /proc/self/stat
>               starttime=$(sed -e "s/^.*) //" /proc/$pid/stat | cut -d" " -f 20)
>               >&3 echo $pid $starttime
>               3>&- exec "$@"
>       ' -- "$@" &
>       return 0
> }
>
> It does not seem possible to cycle through all available pids
> within fractions of time smaller than the granularity of starttime,
> so "pid starttime" should be a unique tuple (until the next reboot --
> at least on linux, starttime is measured as strictly monotonic "uptime").
>
>
> If we have "pid starttime" in the pidfile,
> we can:
>
> get_proc_pid_starttime()
> {
>       proc_pid_starttime=$(sed -e 's/^.*) //' /proc/$pid/stat) || return 1
>       proc_pid_starttime=$(echo "$proc_pid_starttime" | cut -d' ' -f 20)
> }
>
> kill_using_pidfile()
> {
>       local pidfile=$1
>       local pid starttime proc_pid_starttime
>
>       test -e $pidfile                || return # already dead
>       read pid starttime <$pidfile    || return # unreadable
>
>       # check pid and starttime are both present, numeric only, ...
>       # I have a version that distinguishes 16 distinct error
>       # conditions; this is the short version only...
>
>       local i=0
>       while
>               get_proc_pid_starttime &&
>               [ "$starttime" = "$proc_pid_starttime" ]
>       do
>               : $(( i+=1 ))
>               [ $i =  1 ] && kill -TERM $pid
>               # MAYBE # [ $i = 30 ] && kill -KILL $pid
>               sleep 1
>       done
>
>       # it's not (anymore) the process we were looking for
>       # remove that pidfile.
>
>       rm -f "$pidfile"
> }
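
The same "pid starttime" tuple could also back a status check. Below is a minimal sketch under that assumption; pidfile_matches_process and its return codes are hypothetical, and the validation is deliberately small, nowhere near the 16-condition version mentioned above. It splits the stat line with shell parameter expansion rather than sed | cut, but picks the same field.

```shell
# Hypothetical helper, Linux-specific (reads /proc/PID/stat).
# Returns 0 if the pidfile names a live process with a matching starttime,
#         1 if that process is gone or the pid now belongs to something else,
#         2 if the pidfile is missing or malformed.
pidfile_matches_process()
{
    local pidfile=$1 pid= starttime= stat rest

    [ -r "$pidfile" ] || return 2
    # read fails on a missing trailing newline, so accept a non-empty pid.
    read -r pid starttime <"$pidfile" || [ -n "$pid" ] || return 2

    # Both fields must be present and purely numeric.
    case $pid in ''|*[!0-9]*) return 2;; esac
    case $starttime in ''|*[!0-9]*) return 2;; esac

    stat=$(cat "/proc/$pid/stat" 2>/dev/null) || return 1

    # Drop "pid (comm) "; comm may itself contain spaces and ")",
    # so strip up to the LAST ") " in the line.
    rest=${stat##*) }

    # Word-split the remaining fields: $1 is the state,
    # ${20} is starttime (field 22 in proc(5) numbering).
    set -- $rest
    [ "${20}" = "$starttime" ]
}
```

A monitor action could then treat 0 as running, 1 as cleanly stopped (and remove the stale pidfile), and 2 as an error to report.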
>
> In other OSes, ps may be able to give a good enough equivalent?
>
> Any comments?
>
> Thanks,
>       Lars
>
> _______________________________________________________
> Linux-HA-Dev: [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/
