asekretenko edited a comment on pull request #388:
URL: https://github.com/apache/mesos/pull/388#issuecomment-850879962


   @cf-natali After looking at this code more carefully and doing some 
experiments, I'm wondering if it is possible to more reliably prevent the root 
cause of this issue (leaving cgroup in a wrong state) from occurring.
   
   If I'm not missing something, `TasksKiller::freeze()` essentially runs a 
retry loop, each iteration of which does the following:
   1. calls `cgroups::freezer::freeze(hierarchy, cgroup)` with a timeout 
set by `FREEZE_RETRY_INTERVAL`
   2. cancels the result of (1) if it timed out
   3. calls `TasksKiller::kill()`
   4. calls `TasksKiller::thaw()`, that is, `cgroups::freezer::thaw(hierarchy, 
cgroup)`
   
   and then the loop repeats, potentially indefinitely (at least, the 
iteration count is not limited by `TasksKiller` itself).
   
   The loop ends under any of three conditions:
   a) success of `cgroups::freezer::freeze()`
   b) failure at any stage
   c) process termination caused by a discard of `TasksKiller`'s result future
   
   That said, I'm not convinced that postponing c) is the best available 
option. 
   
   Maybe we should consider replacing c) with another termination mechanism 
that is **guaranteed** not to halt this loop in the middle of an 
iteration? For example, the discard callback could set a flag, and the loop 
could check that flag before each call to `cgroups::freezer::freeze()`.
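
   To make the flag idea concrete, here is a minimal standalone sketch. It 
does not use libprocess or the real `TasksKiller`; `Steps`, `Outcome`, 
`freezeLoop` and `discardRequested` are hypothetical names, and the failure 
path b) is left out for brevity. The only point is that the flag is read at 
the top of an iteration, so an iteration that has started always reaches its 
thaw step.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Hypothetical stand-ins for the three steps of one iteration.
// These are NOT Mesos APIs; they only model the control flow.
struct Steps {
  std::function<bool()> freeze;  // true if the cgroup froze within the timeout
  std::function<void()> kill;    // models TasksKiller::kill()
  std::function<void()> thaw;    // models TasksKiller::thaw()
};

enum class Outcome { FROZEN, DISCARDED };

// Repeats freeze/kill/thaw until freeze succeeds or a discard is observed.
// `discardRequested` would be set by the discard callback (assumption).
Outcome freezeLoop(const Steps& steps,
                   const std::atomic<bool>& discardRequested) {
  assert(steps.freeze && steps.kill && steps.thaw);

  while (true) {
    // The flag is checked only here, before freeze(), so the loop never
    // stops between a freeze attempt and the matching thaw.
    if (discardRequested.load()) {
      return Outcome::DISCARDED;
    }

    if (steps.freeze()) {
      return Outcome::FROZEN;  // condition a): freeze succeeded
    }

    // The freeze attempt timed out: kill what we can, then thaw and retry.
    steps.kill();
    steps.thaw();
  }
}
```

   With this shape, a discard that arrives in the middle of an iteration only 
takes effect after that iteration's `thaw` step has run, so the cgroup is not 
left frozen by the loop itself.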
   
   Btw, how are you testing the code that handles freeze/kill failures? 
   In my experience, the simplest way to reliably create a process in an 
uninterruptible sleep (D state) is to create a file on some FUSE-backed FS 
(say, sshfs), mmap that file, stop the program backing the FS with SIGSTOP, and 
repeatedly read/write the mmap-ed bytes until the process eventually hangs. 
That's rather complicated, takes 30 seconds of read/write on my kernel, and is 
not really usable in tests... I'm wondering if there is a simpler option.
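
   For reference, the read/write part of that procedure can be sketched 
roughly as below. This is an illustration, not the exact program I ran: 
`touchMappedFile` is a hypothetical helper, and the 4096-byte mapping size is 
an arbitrary assumption. Mounting the FUSE-backed FS and sending SIGSTOP to 
its backing program happen outside of this code; on an ordinary local file 
the function simply returns.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>

// Maps `path` (created if absent) and alternately writes and reads the
// mapped bytes `iterations` times. Returns the number of completed
// iterations, or -1 if the file could not be opened, sized or mapped.
long touchMappedFile(const char* path, long iterations) {
  assert(path != nullptr && iterations >= 0);

  const size_t length = 4096;  // arbitrary mapping size (assumption)

  int fd = open(path, O_RDWR | O_CREAT, 0600);
  if (fd < 0) {
    return -1;
  }

  if (ftruncate(fd, static_cast<off_t>(length)) != 0) {
    close(fd);
    return -1;
  }

  void* mapping =
    mmap(nullptr, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close(fd);  // the mapping stays valid after the descriptor is closed
  if (mapping == MAP_FAILED) {
    return -1;
  }

  // volatile keeps the compiler from optimizing the accesses away.
  volatile unsigned char* bytes =
    static_cast<volatile unsigned char*>(mapping);

  long done = 0;
  for (; done < iterations; ++done) {
    bytes[done % length] = static_cast<unsigned char>(done);  // write
    unsigned char sink = bytes[(done + 1) % length];          // read
    (void)sink;
  }

  munmap(mapping, length);
  return done;
}
```

   When the backing FS is stopped, a call like this is expected to block at 
some point instead of returning, which is what puts the calling process into 
the D state.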


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

