asekretenko commented on pull request #388: URL: https://github.com/apache/mesos/pull/388#issuecomment-850879962
@cf-natali After looking at this code more carefully and doing some experiments, I'm wondering whether it is possible to more reliably prevent the root cause of this issue (leaving the cgroup in a wrong state) from occurring.

If I'm not missing something, `TasksKiller::freeze()` essentially runs a retry loop, each iteration of which does the following:

1. calls `cgroups::freezer::freeze(hierarchy, cgroup)`, limited by a timeout whose length is determined by `FREEZE_RETRY_INTERVAL`
2. cancels the timed-out result of the previous stage
3. calls `TasksKiller::kill()`
4. calls `TasksKiller::thaw()`, that is, `cgroups::freezer::thaw(hierarchy, cgroup)`

and then the loop repeats.

There are three conditions upon which the loop ends:

a) success of `cgroups::freezer::freeze()`
b) failure at any stage
c) process termination caused by a discard of the `TasksKiller`'s result future

That said, I'm not convinced that postponing c) is the best available option. Maybe we should consider replacing c) with another termination mechanism that is **guaranteed** not to halt this loop in the middle of an iteration? Something like setting a flag in that discard callback and checking it before calling `cgroups::freezer::freeze()`?

Btw, how are you testing the code handling freeze/kill failures? In my experience, the simplest way to reliably create a process in an uninterruptible sleep (D state) is to create a file on some FUSE-backed FS (say, sshfs), mmap that file, stop the program backing the FS with SIGSTOP, and repeatedly read/write the mmap-ed bytes until the process eventually hangs. That's rather complicated, takes 30 seconds of read/write on my kernel, and is not really usable in tests... I'm wondering if there is a simpler option.
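
To make the flag idea a bit more concrete, here is a rough, synchronous sketch of the loop. It is not the actual libprocess-based code: `CgroupOps`, `Freeze`, `Outcome`, `runFreezeLoop` and `demoDiscardDuringKill` are hypothetical names made up for illustration, and the three callbacks merely stand in for `cgroups::freezer::freeze()`, `TasksKiller::kill()` and `cgroups::freezer::thaw()`. The only point it tries to show is that the discard flag is examined in exactly one place, before the freeze attempt, so an iteration that has started always runs through the thaw.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical result of one freeze attempt bounded by a timeout.
enum class Freeze { OK, TIMED_OUT, FAILED };

// Hypothetical final result of the whole retry loop.
enum class Outcome { FROZEN, FAILED, DISCARDED };

// Stand-ins for the real cgroup operations (assumption: synchronous calls).
struct CgroupOps {
  std::function<Freeze()> freeze;  // freeze attempt limited by a timeout
  std::function<bool()> kill;      // true on success
  std::function<bool()> thaw;      // true on success
};

// `discardRequested` would be set by the discard callback. It is checked
// only at the top of an iteration, never between freeze and thaw.
Outcome runFreezeLoop(
    const CgroupOps& ops,
    const std::atomic<bool>& discardRequested)
{
  assert(ops.freeze && ops.kill && ops.thaw);

  while (true) {
    // The cgroup is either untouched or already thawed at this point,
    // so stopping here cannot leave it frozen.
    if (discardRequested.load()) {
      return Outcome::DISCARDED;
    }

    switch (ops.freeze()) {
      case Freeze::OK:        return Outcome::FROZEN;  // condition a)
      case Freeze::FAILED:    return Outcome::FAILED;  // condition b)
      case Freeze::TIMED_OUT: break;                   // fall through to kill/thaw
    }

    // Cancelling the timed-out freeze is implicit in this synchronous model.
    if (!ops.kill()) return Outcome::FAILED;
    if (!ops.thaw()) return Outcome::FAILED;
  }
}

struct DemoResult {
  Outcome outcome;
  std::vector<std::string> calls;
};

// The discard arrives while kill() is running; thaw() is still executed
// before the loop notices the flag and stops.
DemoResult demoDiscardDuringKill()
{
  std::vector<std::string> calls;
  std::atomic<bool> discardRequested{false};

  CgroupOps ops;
  ops.freeze = [&] { calls.push_back("freeze"); return Freeze::TIMED_OUT; };
  ops.kill = [&] {
    calls.push_back("kill");
    discardRequested.store(true);  // simulated discard in mid-iteration
    return true;
  };
  ops.thaw = [&] { calls.push_back("thaw"); return true; };

  Outcome outcome = runFreezeLoop(ops, discardRequested);
  return DemoResult{outcome, calls};
}
```

In the real code the flag would presumably have to live in the `TasksKiller` process itself and be set from the discard callback instead of terminating the process, but the checkpoint placement would be the same.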