Re: [Linux-ha-dev] RFC: pidfile handling; current worst case: stop failure and node level fencing

2014-10-22 Thread Dejan Muhamedagic
Hi Lars,

On Mon, Oct 20, 2014 at 09:17:29PM +0200, Lars Ellenberg wrote:
 
 Recent discussions with Dejan made me again more prominently aware of a
 few issues we probably all know about, but usually dismiss as having not
 much relevance in the real-world.
 
 The facts:
 
  * a pidfile typically only stores a pid
  * a pidfile may go stale, not properly cleaned up
when the pid it references died.
  * pids are recycled
 
This is more an issue if kernel.pid_max is small
wrt the number of processes created per unit time,
for example on some embedded systems,
or on some very busy systems.
 
But it may be an issue on any system,
even a mostly idle one, given bad luck^W timing,
see below.
 
 A common idiom in resource agents is to
 
 kill_that_pid_and_wait_until_dead()
 {
   local pid=$1
   is_alive $pid || return 0
   kill -TERM $pid
   while is_alive $pid ; do sleep 1; done
   return 0
 }
 
 The naïve implementation of is_alive() is
 is_alive() { kill -0 $1 ; }
 
 This is the main issue:
 ---
 
 If the last-used-pid is just a bit smaller than $pid,
 during the sleep 1, $pid may die,
 and the OS may already have created a new process with that exact pid.
 
 Using above is_alive, kill_that_pid() will not notice that the
 to-be-killed pid has actually terminated while that new process runs.
 Which may be a very long time if that is some other long running daemon.
 
 This may result in stop failure and resulting node level fencing.
 
 The question is, which better way do we have to detect if some pid died
 after we killed it. Or, related, and even better: how to detect if the
 process currently running with some pid is in fact still the process
 referenced by the pidfile.
 
 I have two suggestions.
 
 (I am trying to avoid bashisms in here.
  But maybe I overlook some.
  Also, the code is typed, not sourced from some working script,
  so there may be logic bugs and typos.
  My intent should be obvious enough, though.)
 
 using cd /proc/$pid; stat .
 -
 
 # this is most likely linux specific

Apparently not. According to Wikipedia at least, most UNIX
platforms (including BSD and Solaris) support /proc/$pid.

 kill_that_pid_and_wait_until_dead()
 {
   local pid=$1
   (
   cd /proc/$pid || return 0
   kill -TERM $pid
   while stat . ; do sleep 1; done

I'd rather test -d . (it's more common in shell scripts and
runs faster). BTW, on my laptop, test -d is so fast that the
process doesn't get removed before it runs and the while loop
always gets executed. In that respect, stat or ls -d performs
better.

   )
   return 0
 }
 
 Once pid dies, /proc/$pid will become stale (but not completely go away,
 because it is our cwd), and stat . will return No such process.

This seems to be a very elegant solution and I cannot find fault
with it. Short and easy to understand too.
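
A minimal sketch of the cd /proc/$pid idea with the loop syntax completed
and a timeout added. The function name kill_and_wait and the timeout
argument are illustrative and not from this thread. It assumes Linux /proc
semantics, where access through a working directory under /proc/$pid fails
once that process is gone:

```shell
# kill_and_wait PID [TIMEOUT]: send TERM, then wait until PID is gone.
# Returns 0 once the process has exited, 1 if TIMEOUT seconds pass first.
kill_and_wait()
{
	local pid=$1 timeout=${2:-30}
	(
		# Pin the /proc entry of this incarnation of the pid.
		cd "/proc/$pid" 2>/dev/null || exit 0	# already gone
		kill -TERM "$pid" 2>/dev/null
		i=0
		# stat fails once the pinned process is gone, even if the
		# pid number has been handed to a new process meanwhile.
		while stat . >/dev/null 2>&1; do
			[ "$i" -ge "$timeout" ] && exit 1
			i=$((i + 1))
			sleep 1
		done
		exit 0
	)
}
```

With a timeout, the caller can decide to escalate (for example to
kill -KILL) instead of looping forever on a process that ignores TERM.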

[... Skipping other proposals, some of which are quite exotic :) ]

 kill_using_pidfile()
 {
   local pidfile=$1
   local pid starttime proc_pid_starttime
 
   test -e $pidfile || return # already dead
   read pid starttime < $pidfile || return # unreadable

I'd assume that we (the caller) know what the process should
look like in the process table, i.e. its command and arguments.
We could also test for that, in case the process has
exited but the PID file somehow stayed behind.

   # check pid and starttime are both present, numeric only, ...
   # I have a version that distinguishes 16 distinct error

Wow!

   # conditions; this is the short version only...
 
   local i=0
   while
   get_proc_pid_starttime &&
   [ $starttime = $proc_pid_starttime ]
   do
   : $(( i+=1 ))
   [ $i =  1 ] && kill -TERM $pid
   # MAYBE # [ $i = 30 ] && kill -KILL $pid
   sleep 1
   done
 
   # it's not (anymore) the process we were looking for
   # remove that pidfile.
 
   rm -f $pidfile
 }
 
 In other OSes, ps may be able to give a good enough equivalent?
 
 Any comments?

I'd just go with the cd /proc/$pid thing. Perhaps add a test
for ps -o cmd $pid output.
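
One possible shape for such a ps test, as a sketch only. The helper name
pid_runs_command is made up for illustration, and it uses the POSIX
"comm" format specifier rather than the procps-specific "cmd". Note that
on Linux the comm value is truncated to 15 characters:

```shell
# pid_runs_command PID NAME: true if PID exists and its command name
# (as reported by ps) is exactly NAME.
pid_runs_command()
{
	local pid=$1 expected=$2 actual
	# "comm=" prints the command name without a header line
	actual=$(ps -o comm= -p "$pid" 2>/dev/null) || return 1
	[ "$actual" = "$expected" ]
}
```

This narrows the window for mistaking a recycled pid for the original
process, but it cannot tell two instances of the same program apart.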

And thanks for giving this such a thorough analysis!

Thanks,

Dejan

 Thanks,
   Lars
 
 ___
 Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
 Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] RFC: pidfile handling; current worst case: stop failure and node level fencing

2014-10-22 Thread Dejan Muhamedagic
Hi Alan,

On Mon, Oct 20, 2014 at 02:52:13PM -0600, Alan Robertson wrote:
 For the Assimilation code I use the full pathname of the binary from
 /proc to tell if it's one of mine.  That's not perfect if you're using
 an interpreted language.  It works quite well for compiled languages.

Yes, though not perfect, that may be good enough. I suppose that
the probability that the very same program gets the same recycled
pid is rather low. (Or is it?)
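
A rough sketch of the check described above, for Linux only. This is not
the Assimilation code; the helper name pid_runs_binary is hypothetical.
It compares the /proc/<pid>/exe symlink against an expected path:

```shell
# pid_runs_binary PID PATH: true if PID is currently executing PATH.
# PATH must be the fully resolved path of the binary.
pid_runs_binary()
{
	local exe
	# fails if the pid is gone or we lack permission to read the link
	exe=$(readlink "/proc/$1/exe" 2>/dev/null) || return 1
	[ "$exe" = "$2" ]
}
```

For scripts, the link names the interpreter rather than the script, which
is the limitation mentioned above. If the binary was replaced on disk
while the process runs, the link target carries a " (deleted)" suffix
and the comparison fails.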

Cheers,

Dejan

 
 On 10/20/2014 01:17 PM, Lars Ellenberg wrote:
  Recent discussions with Dejan made me again more prominently aware of a
  few issues we probably all know about, but usually dismiss as having not
  much relevance in the real-world.
 
  The facts:
 
   * a pidfile typically only stores a pid
   * a pidfile may go stale, not properly cleaned up
 when the pid it references died.
   * pids are recycled
 
 This is more an issue if kernel.pid_max is small
 wrt the number of processes created per unit time,
 for example on some embedded systems,
 or on some very busy systems.
 
 But it may be an issue on any system,
 even a mostly idle one, given bad luck^W timing,
 see below.
 
  A common idiom in resource agents is to
 
  kill_that_pid_and_wait_until_dead()
  {
  local pid=$1
  is_alive $pid || return 0
  kill -TERM $pid
  while is_alive $pid ; do sleep 1; done
  return 0
  }
 
  The naïve implementation of is_alive() is
  is_alive() { kill -0 $1 ; }
 
  This is the main issue:
  ---
 
  If the last-used-pid is just a bit smaller than $pid,
  during the sleep 1, $pid may die,
  and the OS may already have created a new process with that exact pid.
 
  Using above is_alive, kill_that_pid() will not notice that the
  to-be-killed pid has actually terminated while that new process runs.
  Which may be a very long time if that is some other long running daemon.
 
  This may result in stop failure and resulting node level fencing.
 
  The question is, which better way do we have to detect if some pid died
  after we killed it. Or, related, and even better: how to detect if the
  process currently running with some pid is in fact still the process
  referenced by the pidfile.
 
  I have two suggestions.
 
  (I am trying to avoid bashisms in here.
   But maybe I overlook some.
   Also, the code is typed, not sourced from some working script,
   so there may be logic bugs and typos.
   My intent should be obvious enough, though.)
 
  using cd /proc/$pid; stat .
  -
 
  # this is most likely linux specific
  kill_that_pid_and_wait_until_dead()
  {
  local pid=$1
  (
  cd /proc/$pid || return 0
  kill -TERM $pid
  while stat . ; do sleep 1; done
  )
  return 0
  }
 
  Once pid dies, /proc/$pid will become stale (but not completely go away,
  because it is our cwd), and stat . will return No such process.
 
  Variants:
 
  using test -ef
  --
 
  exec 7</proc/$pid || return 0
  kill -TERM $pid
  while :; do
  exec 8</proc/$pid || break
  test /proc/self/fd/7 -ef /proc/self/fd/8 || break
  sleep 1
  done
  exec 7<&- 8<&-
 
  using stat -c %Y /proc/$pid
  ---
 
  ctime0=$(stat -c %Y /proc/$pid)
  kill -TERM $pid
  while ctime=$(stat -c %Y /proc/$pid) && [ $ctime = $ctime0 ] ; do sleep 1; done
 
 
  Why not use the inode number I hear you say.
  Because it is not stable. Sorry.
  Don't believe me? Don't want to read kernel source?
  Try it yourself:
 
  sleep 120 & k=$!
  stat /proc/$k
  echo 3 > /proc/sys/vm/drop_caches
  stat /proc/$k
 
  But that leads me to another proposal:
  store the starttime together with the pid in a pidfile.
 
  For linux that would be:
 
  (see proc(5) for /proc/pid/stat field meanings.
   note that (comm) may contain both whitespace and ),
   which is the reason for my sed | cut below)
 
  spawn_create_exclusive_pid_starttime()
  {
  local pidfile=$1
  shift
  local reset
  # noclobber (set -C) makes the redirection below fail if the pidfile exists
  case $- in *C*) reset=:;; *) set -C; reset="set +C";; esac
  if ! exec 3>$pidfile ; then
  $reset
  return 1
  fi
 
  $reset
  # double quotes inside the single-quoted script, so the quoting nests
  setsid sh -c '
  read pid _ < /proc/self/stat
  starttime=$(sed -e "s/^.*) //" /proc/$pid/stat | cut -d" " -f 20)
  >&3 echo $pid $starttime
  3>&- exec "$@"
  ' -- "$@" &
  return 0
  }
 
  It does not seem possible to cycle through all available pids
  within fractions of time smaller than the granularity of starttime,
  so pid starttime should be a unique tuple (until the next reboot --
  at least on linux, starttime is measured as strictly monotonic uptime).
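
As an illustration of the (pid, starttime) tuple, here is a sketch that
reads starttime using only shell built-ins instead of sed | cut. The
helper names proc_starttime and is_same_process are hypothetical, and the
sketch assumes the Linux /proc/<pid>/stat layout from proc(5) and a comm
field without embedded newlines:

```shell
# proc_starttime PID: print the starttime field of /proc/PID/stat.
proc_starttime()
{
	local stat
	# read the first line; fails if the process is gone
	{ read -r stat < "/proc/$1/stat"; } 2>/dev/null || return 1
	# drop "pid (comm) "; comm may itself contain spaces and ")",
	# so strip up to the LAST ") "
	stat=${stat##*) }
	set -- $stat
	# starttime is field 22 overall, i.e. the 20th remaining field
	[ $# -ge 20 ] || return 1
	shift 19
	echo "$1"
}

# is_same_process PID STARTTIME: true if PID is alive and was started
# at STARTTIME, i.e. it is still the process recorded in the pidfile.
is_same_process()
{
	[ "$(proc_starttime "$1")" = "$2" ]
}
```

A stop loop could then poll is_same_process "$pid" "$starttime" instead
of kill -0, so a recycled pid is no longer mistaken for the old process.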
 
 
  If we have pid starttime in the pidfile,
  we can:
 
  get_proc_pid_starttime()
  {
  proc_pid_starttime=$(sed -e 's/^.*) //' /proc/$pid/stat) || return 1
  proc_pid_starttime=$(echo $proc_pid_starttime | 

Re: [Linux-ha-dev] RFC: pidfile handling; current worst case: stop failure and node level fencing

2014-10-22 Thread Alan Robertson
On 10/22/2014 03:33 AM, Dejan Muhamedagic wrote:
 Hi Alan,

 On Mon, Oct 20, 2014 at 02:52:13PM -0600, Alan Robertson wrote:
 For the Assimilation code I use the full pathname of the binary from
 /proc to tell if it's one of mine.  That's not perfect if you're using
 an interpreted language.  It works quite well for compiled languages.
 Yes, though not perfect, that may be good enough. I supposed that
 the probability that the very same program gets the same recycled
 pid is rather low. (Or is it?)
From my 'C' code I could touch the lock file to match the timestamp of
the /proc/pid/stat (or /proc/pid/exe) symlink -- and verify that they
match.  If there is no /proc/pid/stat, then you won't get that extra
safeguard.  But as you suggest, it decreases the probability by orders
of magnitude even without the

The /proc/pid/exe symlink appears to have the same timestamp as
/proc/pid/stat

Does anyone know which OSes have either or both of those /proc names?

-- AlanRobertson
   al...@unix.sh




Re: [Linux-ha-dev] RFC: pidfile handling; current worst case: stop failure and node level fencing

2014-10-22 Thread Tim Small
On 22/10/14 13:50, Alan Robertson wrote:
 Does anyone know which OSes have either or both of those /proc names?

Once again, can I recommend taking a look at the start-stop-daemon
source (see earlier posting), which does this stuff, and includes checks
for Linux/Hurd/Sun/OpenBSD/FreeBSD/NetBSD/DragonFly, and whilst I've
only ever used it on Linux, at the very least the BSD side seems to be
maintained:

http://anonscm.debian.org/cgit/dpkg/dpkg.git/tree/utils/start-stop-daemon.c

Tim.

-- 
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.  
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53  http://seoss.co.uk/ +44-(0)1273-808309



Re: [Linux-ha-dev] RFC: pidfile handling; current worst case: stop failure and node level fencing

2014-10-22 Thread Alan Robertson
On 10/22/2014 07:09 AM, Dejan Muhamedagic wrote:
 On Wed, Oct 22, 2014 at 06:50:37AM -0600, Alan Robertson wrote:
 On 10/22/2014 03:33 AM, Dejan Muhamedagic wrote:
 Hi Alan,

 On Mon, Oct 20, 2014 at 02:52:13PM -0600, Alan Robertson wrote:
 For the Assimilation code I use the full pathname of the binary from
 /proc to tell if it's one of mine.  That's not perfect if you're using
 an interpreted language.  It works quite well for compiled languages.
 Yes, though not perfect, that may be good enough. I supposed that
 the probability that the very same program gets the same recycled
 pid is rather low. (Or is it?)
 From my 'C' code I could touch the lock file to match the timestamp of
 the /proc/pid/stat (or /proc/pid/exe) symlink -- and verify that they
 match.  If there is no /proc/pid/stat, then you won't get that extra
 safeguard.  But as you suggest, it decreases the probability by orders
 of magnitude even without the

 The /proc/pid/exe symlink appears to have the same timestamp as
 /proc/pid/stat
 Hmm, not here:

 $ sudo ls -lt /proc/1
 ...
 lrwxrwxrwx 1 root root 0 Aug 27 13:51 exe -> /sbin/init
 dr-x------ 2 root root 0 Aug 27 13:51 fd
 -r--r--r-- 1 root root 0 Aug 27 13:20 cmdline
 -r--r--r-- 1 root root 0 Aug 27 13:18 stat

 And the process (init) has been running since July:

 $ ps auxw | grep -w [i]nit
 root 1  0.0  0.0  10540   780 ?Ss   Jul07   1:03 init [3]

 Interesting.
And a little worrisome for these strategies...

Here is what I see for timestamps that look to be about the time of
system boot:

-r-------- 1 root root 0 Oct 21 15:42 environ
lrwxrwxrwx 1 root root 0 Oct 21 15:42 root -> /
-r--r--r-- 1 root root 0 Oct 21 15:42 limits
dr-x------ 2 root root 0 Oct 21 15:42 fd
lrwxrwxrwx 1 root root 0 Oct 21 15:42 exe -> /sbin/init
-r--r--r-- 1 root root 0 Oct 21 15:42 stat
-r--r--r-- 1 root root 0 Oct 21 15:42 cgroup
-r--r--r-- 1 root root 0 Oct 21 15:42 cmdline

servidor:/proc/1 $ ls -l /var/log/boot.log
-rw-r--r-- 1 root root 5746 Oct 21 15:42 /var/log/boot.log

servidor:/proc/1 $ ls -ld .
dr-xr-xr-x 9 root root 0 Oct 21 15:42 .

So, you can open file descriptors (fd), change your environment and
cmdline and (soft) limits.  You can't change your exe, or root.  Cgroup
is new, and I suspect you can't change it.  I suspect that the directory
timestamp (/proc/<pid>/) won't change either.

I wonder if it will change on BSD or Solaris or AIX.

/proc info for AIX:
   
http://www-01.ibm.com/support/knowledgecenter/ssw_aix_61/com.ibm.aix.files/proc.htm
It doesn't say anything about file timestamps.
Solaris info is here:
http://docs.oracle.com/cd/E23824_01/html/821-1473/proc-4.html#scrolltoc
It also doesn't mention timestamps.
FreeBSD is here:
http://www.unix.com/man-page/freebsd/5/procfs/





Re: [Linux-ha-dev] RFC: pidfile handling; current worst case: stop failure and node level fencing

2014-10-22 Thread Alan Robertson
On 10/22/2014 07:11 AM, Tim Small wrote:
 On 22/10/14 13:50, Alan Robertson wrote:
 Does anyone know which OSes have either or both of those /proc names?
 Once again, can I recommend taking a look at the start-stop-daemon
 source (see earlier posting), which does this stuff, and includes checks
 for Linux/Hurd/Sun/OpenBSD/FreeBSD/NetBSD/DragonFly, and whilst I've
 only ever used it on Linux, at the very least the BSD side seems to be
 maintained:

 http://anonscm.debian.org/cgit/dpkg/dpkg.git/tree/utils/start-stop-daemon.c
According to how you described it earlier, it didn't seem to solve the
problems described in this thread. At best it does pretty much exactly
what my previously-implemented solution does.

This discussion has been a bit esoteric.  Although my method (and also
start-stop-daemon) are highly unlikely to err, they can make mistakes in
some circumstances.

-- Alan Robertson
   al...@unix.sh


Re: [Linux-HA] Remote node attributes support in crmsh

2014-10-22 Thread Dejan Muhamedagic
On Mon, Oct 20, 2014 at 07:12:23PM +0300, Vladislav Bogdanov wrote:
 20.10.2014 18:23, Dejan Muhamedagic wrote:
  Hi Vladislav,
 
 Hi Dejan!
 
  
  On Mon, Oct 20, 2014 at 09:03:40AM +0300, Vladislav Bogdanov wrote:
  Hi Kristoffer,
 
  do you plan to add support for recently added remote node attributes
  feature to crmsh?
 
  Currently (at least as of 2.1, and I do not see anything relevant in the
  git log) crmsh fails to update CIB if it contains node attributes for
  remote (bare-metal) node, complaining that duplicate element is found.
  
  No wonder :) The uname effectively doubles as an element id.
  
  But for bare-metal nodes it is natural to have ocf:pacemaker:remote
  resource with name equal to remote node uname (I doubt it can be
  configured differently).
  
  Is that required?
 
 Didn't look in code, but seems like yes, :remote resource name is the
 only place where pacemaker can obtain that node name.

I find it surprising that the id is used to carry information.
I'm not sure if we had a similar case (apart from attributes).

  If I comment check for 'obj_id in id_set', then it fails to update CIB
  because it inserts above primitive definition into the node section.
  
  Could you please show what the CIB would look like with such a
  remote resource (in crmsh notation)?
  
 
 
 node 1: node01
 node rnode001:remote \
   attributes attr=value
 primitive rnode001 ocf:pacemaker:remote \
 params server=192.168.168.20 \
 op monitor interval=10 \
 meta target-role=Started

What do you expect to happen when you reference rnode001, in say:

crm configure show rnode001

I'm still trying to digest having the hostname used to name some
other element. I wonder where else we will run into issues for
this reason.

Cheers,

Dejan

 Best,
 Vladislav
 
  Given that nodes are for the most part referenced by uname
  (instead of by id), do you think the user can efficiently
  handle a configuration where a primitive element is named
  the same as a node? (NB: No experience here with
  ocf:pacemaker:remote :)
 
 
 
  
  Cheers,
  
  Dejan
  
  
 
  Best,
  Vladislav
  ___
  Linux-HA mailing list
  Linux-HA@lists.linux-ha.org
  http://lists.linux-ha.org/mailman/listinfo/linux-ha
  See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Remote node attributes support in crmsh

2014-10-22 Thread Vladislav Bogdanov
22.10.2014 12:02, Dejan Muhamedagic wrote:
 On Mon, Oct 20, 2014 at 07:12:23PM +0300, Vladislav Bogdanov wrote:
 20.10.2014 18:23, Dejan Muhamedagic wrote:
 Hi Vladislav,

 Hi Dejan!


 On Mon, Oct 20, 2014 at 09:03:40AM +0300, Vladislav Bogdanov wrote:
 Hi Kristoffer,

 do you plan to add support for recently added remote node attributes
 feature to crmsh?

 Currently (at least as of 2.1, and I do not see anything relevant in the
 git log) crmsh fails to update CIB if it contains node attributes for
 remote (bare-metal) node, complaining that duplicate element is found.

 No wonder :) The uname effectively doubles as an element id.

 But for bare-metal nodes it is natural to have ocf:pacemaker:remote
 resource with name equal to remote node uname (I doubt it can be
 configured differently).

 Is that required?

 Didn't look in code, but seems like yes, :remote resource name is the
 only place where pacemaker can obtain that node name.
 
 I find it surprising that the id is used to carry information.
 I'm not sure if we had a similar case (apart from attributes).
 
 If I comment check for 'obj_id in id_set', then it fails to update CIB
 because it inserts above primitive definition into the node section.

 Could you please show what the CIB would look like with such a
 remote resource (in crmsh notation)?



 node 1: node01
 node rnode001:remote \
  attributes attr=value
 primitive rnode001 ocf:pacemaker:remote \
 params server=192.168.168.20 \
 op monitor interval=10 \
 meta target-role=Started
 
 What do you expect to happen when you reference rnode001, in say:

That is not me ;) I just want to be able to use crmsh to assign remote
node operational and utilization (?) attributes and to work with it
after that.

Probably that is not yet set in stone, and David may change that,
e.g. by allowing a new 'node_name' parameter of ocf:pacemaker:remote
to override the remote node name guessed from the primitive name.

David, could you comment please?

Best,
Vladislav

 
 crm configure show rnode001
 
 I'm still trying to digest having the hostname used to name some
 other element. I wonder where else we will run into issues for
 this reason.
 
 Cheers,
 
 Dejan
 
 Best,
 Vladislav

 Given that nodes are for the most part referenced by uname
 (instead of by id), do you think the user can efficiently
 handle a configuration where a primitive element is named
 the same as a node? (NB: No experience here with
 ocf:pacemaker:remote :)




 Cheers,

 Dejan



 Best,
 Vladislav