[ https://issues.apache.org/jira/browse/MESOS-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171017#comment-15171017 ]
Pradeep Chhetri commented on MESOS-1648:
----------------------------------------

Is anyone working on this? Otherwise, I would like to pick it up.

> Add a --pidfile option to master and agent binaries.
> ----------------------------------------------------
>
>                 Key: MESOS-1648
>                 URL: https://issues.apache.org/jira/browse/MESOS-1648
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master, slave
>            Reporter: Tobias Weingartner
>              Labels: newbie, twitter
>
> Right now we use a number of wrapper scripts to try and keep up a
> {{/var/run/mesos/mesos-slave.pid}} in order to be able to monitor the
> process. This has proven to be somewhat fragile due to the lack of locking
> and the possibility of races and stale data.
> By adding a {{--pidfile}}, we can obtain a lock on the file to prevent
> multiple binaries from starting, and to enable the tooling to validate that
> the lock is held before doing any signaling. We can also do a best effort
> unlink in the signal handler upon termination:
> {code}
> // Get exclusive access to the file.
> fd = open(O_CREAT ...)
> flock(fd, LOCK_EX)
> if not locked, abort
> ftruncate(fd, 0)
> // Write the pid.
> write(fd, "<pid>")
> // Inside signal handler..
> unlink(pidfile)
> {code}
> Digging around, it looks like the open, ftruncate, write pattern is pretty
> common:
> http://man7.org/tlpi/code/online/diff/filelock/create_pid_file.c.html
> The tooling around it could check that the file is locked by the pid inside
> it before taking any action (like signaling):
> *Case 1*: If the file does not exist or is not locked, then assume nothing is
> running. It's possible for something to be running and about to grab the
> lock, but we'll eventually read it correctly and converge on a single
> instance started correctly.
> *Case 2*: If the file is locked, and the pid doesn't match, then assume it is
> running but not as the pid in the file (.. yet). Treat this the same as (1):
> assume it's not running, and the next attempts to start will eventually
> converge on a single instance running.
> *Case 3*: If the file is locked, and the pid matches the locker process, then
> assume it is running as that pid. Note that it's still possible that in
> between matching the pid and taking an action (e.g. kill), the pid may become
> stale, but the recycling pattern of pids makes it unlikely to be re-used
> unless there is a large delay.
> It seems like some tools already do this signal wrapping (note the comment
> about fcntl and note the race from (3) in the BUGS section):
> http://manpages.ubuntu.com/manpages/natty/man8/ovs-kill.8.html

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
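
For illustration only, here is a rough Python sketch of the open / flock / ftruncate / write pattern from the issue description, plus a tooling-side probe for whether the lock is held. It is not Mesos code: the function names acquire_pidfile and is_locked are hypothetical, and it assumes Linux flock semantics on a local filesystem. It uses LOCK_NB so that a second instance fails immediately instead of blocking, which is one way to read the "if not locked, abort" step.

```python
import fcntl
import os


def acquire_pidfile(path):
    """Open, lock, truncate, and write our pid (hypothetical helper).

    Returns the open fd on success, or None if another open file
    description already holds the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB: fail right away rather than wait for the lock.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    # Only truncate after the lock is held, so we never clobber the
    # pid written by a live instance.
    os.ftruncate(fd, 0)
    os.write(fd, f"{os.getpid()}\n".encode())
    return fd  # caller must keep this open; closing it drops the lock


def is_locked(path):
    """Tooling-side probe (hypothetical): True if the lock is held."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except FileNotFoundError:
        return False  # Case 1: no file, assume nothing is running
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return True  # someone else holds the lock
    else:
        fcntl.flock(fd, fcntl.LOCK_UN)
        return False  # Case 1: file exists but is not locked
    finally:
        os.close(fd)
```

Note that flock only tells the probe that the file is locked, not which pid holds the lock. Telling Case 2 from Case 3 would need something extra, for example POSIX record locks queried with fcntl F_GETLK (which reports the pid of the lock holder), as the ovs-kill man page's fcntl comment hints.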