Re: [Linux-ha-dev] RFC: pidfile handling; current worst case: stop failure and node level fencing

Dejan Muhamedagic Wed, 22 Oct 2014 02:34:21 -0700

Hi Alan,

On Mon, Oct 20, 2014 at 02:52:13PM -0600, Alan Robertson wrote:
> For the Assimilation code I use the full pathname of the binary from
> /proc to tell if it's "one of mine".  That's not perfect if you're using
> an interpreted language.  It works quite well for compiled languages.


Yes, though not perfect, that may be good enough. I suppose that
the probability of the very same program getting the same recycled
pid is rather low. (Or is it?)
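
A minimal sketch of such a check, assuming Linux and a compiled
daemon. pid_runs_binary is a name made up for this mail; it is not
taken from the Assimilation code or from resource-agents:

```shell
# Hypothetical helper: succeed only if $1 is a live pid whose executable,
# as reported by the /proc/<pid>/exe symlink, is exactly the path in $2.
pid_runs_binary()
{
    local pid="$1" expected="$2" actual
    # readlink fails if the pid is gone or the link is not readable
    actual=$(readlink "/proc/$pid/exe" 2>/dev/null) || return 1
    [ "$actual" = "$expected" ]
}
```

Reading /proc/<pid>/exe of another user's process needs sufficient
privileges, and the link target gets a " (deleted)" suffix once the
binary has been unlinked (for example by a package upgrade), so a
caller may want to handle that case separately.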

Cheers,

Dejan

> 
> On 10/20/2014 01:17 PM, Lars Ellenberg wrote:
> > Recent discussions with Dejan made me again more prominently aware of a
> > few issues we probably all know about, but usually dismiss as having not
> > much relevance in the real-world.
> >
> > The facts:
> >
> >  * a pidfile typically only stores a pid
> >  * a pidfile may go "stale", not properly cleaned up
> >    when the pid it references died.
> >  * pids are recycled
> >
> >    This is more an issue if kernel.pid_max is small
> >    wrt the number of processes created per unit time,
> >    for example on some embedded systems,
> >    or on some very busy systems.
> >
> >    But it may be an issue on any system,
> >    even a mostly idle one, given "bad luck^W timing",
> >    see below.
> >
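
To get a feel for how quickly pids can wrap on a given box, one can
compare kernel.pid_max with the current fork rate. A rough sketch,
Linux only: it relies on the "processes" line in /proc/stat (forks
since boot), and forks_since_boot is a name made up here. The
measurement itself forks a few times, so the rate is never quite zero:

```shell
# Hypothetical helper: print the total number of forks since boot.
forks_since_boot()
{
    awk '$1 == "processes" { print $2 }' /proc/stat
}

pid_max=$(cat /proc/sys/kernel/pid_max)

# Sample the fork counter twice, one second apart.
f0=$(forks_since_boot)
sleep 1
f1=$(forks_since_boot)
rate=$(( f1 - f0 ))

echo "pid_max=$pid_max forks_per_second=$rate"
if [ "$rate" -gt 0 ]; then
    # Very rough: assumes the rate stays constant and every fork takes a new pid.
    echo "pid space wraps after roughly $(( pid_max / rate )) seconds"
fi
```
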
> > A common idiom in resource agents is to
> >
> > kill_that_pid_and_wait_until_dead()
> > {
> >     local pid=$1
> >     is_alive $pid || return 0
> >     kill -TERM $pid
> >     while is_alive $pid ; do sleep 1; done
> >     return 0
> > }
> >
> > The naïve implementation of is_alive() is
> > is_alive() { kill -0 $1 ; }
> >
> > This is the main issue:
> > -----------------------
> >
> > If the last-used-pid is just a bit smaller than $pid,
> > during the sleep 1, $pid may die,
> > and the OS may already have created a new process with that exact pid.
> >
> > Using the above "is_alive", kill_that_pid() will not notice that the
> > to-be-killed pid has actually terminated while that new process runs.
> > Which may be a very long time if that is some other long running daemon.
> >
> > This may result in stop failure and resulting node level fencing.
> >
> > The question is, which better way do we have to detect if some pid died
> > after we killed it. Or, related, and even better: how to detect if the
> > process currently running with some pid is in fact still the process
> > referenced by the pidfile.
> >
> > I have two suggestions.
> >
> > (I am trying to avoid bashisms in here.
> >  But maybe I overlook some.
> >  Also, the code is typed, not sourced from some working script,
> >  so there may be logic bugs and typos.
> >  My intent should be obvious enough, though.)
> >
> > using "cd /proc/$pid; stat ."
> > -----------------------------
> >
> > # this is most likely linux specific
> > kill_that_pid_and_wait_until_dead()
> > {
> >     local pid=$1
> >     (
> >             cd /proc/$pid || return 0
> >             kill -TERM $pid
> >             while stat . ; do sleep 1; done
> >     )
> >     return 0
> > }
> >
> > Once pid dies, /proc/$pid will become stale (but not completely go away,
> > because it is our cwd), and stat . will return "No such process".
> >
> > Variants:
> >
> > using test -ef
> > --------------
> >
> >     exec 7</proc/$pid || return 0
> >     kill -TERM $pid
> >     while :; do
> >             exec 8</proc/$pid || break
> >             test /proc/self/fd/7 -ef /proc/self/fd/8 || break
> >             sleep 1
> >     done
> >     exec 7<&- 8<&-
> >
> > using stat -c %Y /proc/$pid
> > ---------------------------
> >
> >     ctime0=$(stat -c %Y /proc/$pid)
> >     kill -TERM $pid
> >     while ctime=$(stat -c %Y /proc/$pid) && [ $ctime = $ctime0 ] ; do sleep 
> > 1; done
> >
> >
> > Why not use the inode number I hear you say.
> > Because it is not stable. Sorry.
> > Don't believe me? Don't want to read kernel source?
> > Try it yourself:
> >
> >     sleep 120 & k=$!
> >     stat /proc/$k
> >     echo 3 > /proc/sys/vm/drop_caches
> >     stat /proc/$k
> >
> > But that leads me to another proposal:
> > store the starttime together with the pid in a pidfile.
> >
> > For linux that would be:
> >
> > (see proc(5) for /proc/pid/stat field meanings.
> >  note that (comm) may contain both whitespace and ")",
> >  which is the reason for my sed | cut below)
> >
> > spawn_create_exclusive_pid_starttime()
> > {
> >     local pidfile=$1
> >     shift
> >     local reset
> >     case $- in *C*) reset=":";; *) set -C; reset="set +C";; esac
> >     if ! exec 3>$pidfile ; then
> >             $reset
> >             return 1
> >     fi
> >
> >     $reset
> >     setsid sh -c '
> >             read pid _ < /proc/self/stat
> >             starttime=$(sed -e "s/^.*) //" /proc/$pid/stat | cut -d" " -f 20)
> >             >&3 echo $pid $starttime
> >             3>&- exec "$@"
> >     ' -- "$@" &
> >     return 0
> > }
> >
> > It does not seem possible to cycle through all available pids
> > within fractions of time smaller than the granularity of starttime,
> > so "pid starttime" should be a unique tuple (until the next reboot --
> > at least on linux, starttime is measured as strictly monotonic "uptime").
> >
> >
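
The sed | cut parsing quoted above can be checked without a live
process by feeding it a synthetic stat line. This is only a sketch:
starttime_from_stat_line is a made-up name, and the sample line and
all numbers in it are invented for illustration:

```shell
# Hypothetical helper: read one /proc/<pid>/stat line on stdin and print
# starttime (field 22 in proc(5)). Everything up to the LAST ") " is
# dropped first, so "state" becomes field 1 and starttime field 20.
starttime_from_stat_line()
{
    sed -e 's/^.*) //' | cut -d' ' -f 20
}

# Invented sample with an awkward comm: "a) b (c"
sample='4242 (a) b (c) S 1 4242 4242 0 -1 4194560 100 0 0 0 1 2 0 0 20 0 1 0 987654 1000000 50'

printf '%s\n' "$sample" | starttime_from_stat_line   # prints 987654
```

Counting fields on the raw line instead would be thrown off by the
whitespace inside (comm), which is why the prefix is stripped first.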
> > If we have "pid starttime" in the pidfile,
> > we can:
> >
> > get_proc_pid_starttime()
> > {
> >     proc_pid_starttime=$(sed -e 's/^.*) //' /proc/$pid/stat) || return 1
> >     proc_pid_starttime=$(echo "$proc_pid_starttime" | cut -d' ' -f 20)
> > }
> >
> > kill_using_pidfile()
> > {
> >     local pidfile=$1
> >     local pid starttime proc_pid_starttime
> >
> >     test -e $pidfile                || return # already dead
> >     read pid starttime <$pidfile    || return # unreadable
> >
> >     # check pid and starttime are both present, numeric only, ...
> >     # I have a version that distinguishes 16 distinct error
> >     # conditions; this is the short version only...
> >
> >     local i=0
> >     while
> >             get_proc_pid_starttime &&
> >             [ "$starttime" = "$proc_pid_starttime" ]
> >     do
> >             : $(( i+=1 ))
> >             [ $i =  1 ] && kill -TERM $pid
> >             # MAYBE # [ $i = 30 ] && kill -KILL $pid
> >             sleep 1
> >     done
> >
> >     # it's not (anymore) the process we were looking for
> >     # remove that pidfile.
> >
> >     rm -f "$pidfile"
> > }
> >
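
The "check pid and starttime are both present, numeric only" step
that is elided in kill_using_pidfile() could look roughly like this.
It is a sketch, not the 16-condition version mentioned in the
comment; is_decimal and read_pidfile_checked are made-up names, and
refusing pids 0 and 1 is an extra assumption on my side:

```shell
# Hypothetical helper: succeed only if $1 is a non-empty string of digits.
is_decimal()
{
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *)           return 0 ;;
    esac
}

# Hypothetical helper: read "pid starttime" from the pidfile $1 and
# validate both. On success, pid and starttime are set in the caller's scope.
read_pidfile_checked()
{
    local extra
    read pid starttime extra <"$1" || return 1
    [ -z "$extra" ] || return 1          # trailing garbage
    is_decimal "$pid" || return 1
    is_decimal "$starttime" || return 1
    # Assumption: never act on pid 0 (own process group) or pid 1 (init).
    [ "$pid" -gt 1 ]
}
```

A caller would run this right after the "test -e" check and bail out
on failure instead of signalling whatever the file happened to contain.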
> > In other OSes, ps may be able to give a good enough equivalent?
> >
> > Any comments?
> >
> > Thanks,
> >     Lars
> >
> > _______________________________________________________
> > Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> > Home Page: http://linux-ha.org/
> 
