[ 
https://issues.apache.org/jira/browse/MESOS-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171017#comment-15171017
 ] 

Pradeep Chhetri commented on MESOS-1648:
----------------------------------------

Is anyone working on this? If not, I would like to pick it up.


> Add a --pidfile option to master and agent binaries.
> ----------------------------------------------------
>
>                 Key: MESOS-1648
>                 URL: https://issues.apache.org/jira/browse/MESOS-1648
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master, slave
>            Reporter: Tobias Weingartner
>              Labels: newbie, twitter
>
> Right now we use a number of wrapper scripts to try to maintain a 
> {{/var/run/mesos/mesos-slave.pid}} file in order to be able to monitor the 
> process.  This has proven to be somewhat fragile due to the lack of locking 
> and the possibility of races and stale data.
> By adding a {{--pidfile}}, we can obtain a lock on the file to prevent 
> multiple binaries from starting, and to enable the tooling to validate that 
> the lock is held before doing any signaling. We can also do a best effort 
> unlink in the signal handler upon termination:
> {code}
> // Get exclusive access to the file.
> fd = open(O_CREAT ...)
> flock(fd, LOCK_EX | LOCK_NB)  // Non-blocking, so a held lock fails instead of waiting.
> if not locked, abort
> ftruncate(fd, 0)
> // Write the pid.
> write(fd, "<pid>")
> // Inside signal handler..
> unlink(pidfile)
> {code}
> Digging around, looks like the open, ftruncate, write pattern is pretty 
> common:
> http://man7.org/tlpi/code/online/diff/filelock/create_pid_file.c.html
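
To make the open, lock, ftruncate, write sequence above concrete, here is a minimal sketch in Python, whose fcntl module mirrors the underlying calls. The function name create_pid_file and the 0o644 mode are assumptions for illustration only, not anything from the Mesos code base. The best-effort unlink on termination is left out.

```python
import fcntl
import os


def create_pid_file(path):
    """Open `path`, take an exclusive lock, and write our pid into it.

    Hypothetical helper. Returns the open file descriptor; the lock is held
    for as long as that descriptor stays open. Raises OSError if another
    process already holds the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes flock fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        raise
    # Truncate only after the lock is held, so a process that loses the
    # race never wipes the pid written by the winner.
    os.ftruncate(fd, 0)
    os.write(fd, (str(os.getpid()) + "\n").encode())
    return fd
```

The caller has to keep the returned descriptor open for the lifetime of the process, since closing it releases the lock.
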
> The tooling around it could verify that the file is locked by the pid inside 
> it before taking any action (like signaling):
> *Case 1*: If the file does not exist or is not locked, then assume nothing is 
> running. It's possible for something to be running and about to grab the 
> lock, but we'll eventually read it correctly and converge on a single 
> instance started correctly.
> *Case 2*: If the file is locked, and the pid doesn't match, then assume it is 
> running but not as the pid in the file (.. yet). Treat this the same as (1), 
> assume it's not running, and the next attempts to start will eventually 
> converge on a single instance running.
> *Case 3*: If the file is locked, and the pid matches the locker process, then 
> assume it is running as that pid. Note that it's still possible that in 
> between matching the pid and taking an action (e.g. kill), the pid may become 
> stale, but the recycling pattern of pids makes it unlikely to be re-used 
> unless there is a large delay.
> It seems like some tools already do this signal wrapping (note the comment 
> about fcntl and note the race from (3) in the BUGS section):
> http://manpages.ubuntu.com/manpages/natty/man8/ovs-kill.8.html
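
The tooling side of the three cases above could be sketched roughly as follows, again in Python. This is only a partial illustration: the name locked_pid is hypothetical, and as far as I know flock() offers no way to ask which process holds a lock, so this sketch can only tell whether the file is locked, not by whom. Telling Case 2 apart from Case 3 would need something like fcntl-style locks queried with F_GETLK, which is not shown here.

```python
import fcntl
import os


def locked_pid(path):
    """Return the pid recorded in `path` if the file is currently locked.

    Hypothetical helper. Returns None when the file is missing, unlocked,
    or locked but without a parseable pid; callers treat None as
    "not running".
    """
    try:
        fd = os.open(path, os.O_RDONLY)
    except FileNotFoundError:
        return None  # Case 1: no pid file at all.
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            pass  # Another process holds the lock; go on to read the pid.
        else:
            # Case 1: we obtained the lock, so nobody was holding it.
            fcntl.flock(fd, fcntl.LOCK_UN)
            return None
        data = os.read(fd, 32).strip()
        # Locked but no pid written yet: treat as not running.
        return int(data) if data.isdigit() else None
    finally:
        os.close(fd)
```

A wrapper would signal only when this returns a pid, and the race described in Case 3 still applies between the check and the signal. Note also that the probe briefly takes the lock itself when the file is unlocked, so a process starting at that exact moment could fail to acquire it and would have to retry.
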



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
