[jira] [Commented] (MESOS-6933) Executor does not respect grace period
[ https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141201#comment-16141201 ]

Deshi Xiao commented on MESOS-6933:
-----------------------------------

[~alexr] Hi, do you have cycles to shepherd me? I would like to try fixing this.

> Executor does not respect grace period
> --------------------------------------
>
>                 Key: MESOS-6933
>                 URL: https://issues.apache.org/jira/browse/MESOS-6933
>             Project: Mesos
>          Issue Type: Bug
>          Components: executor
>            Reporter: Tomasz Janiszewski
>         Attachments: 屏幕快照 2017-07-17 下午2.19.03.png
>
>
> The Mesos Command Executor tries to support a grace period with escalation, but
> unfortunately it does not work. It launches {{command}} by wrapping it in
> {{sh -c}}, which causes the process tree to look like this:
> {code}
> Received killTask
> Shutting down
> Sending SIGTERM to process tree at pid 18
> Sent SIGTERM to the following process trees:
> [
> -+- 18 sh -c cd offer-i18n-0.1.24 && LD_PRELOAD=../librealresources.so ./bin/offer-i18n -e prod -p $PORT0
>  \--- 19 command...
> ]
> Command terminated with signal Terminated (pid: 18)
> {code}
> This causes {{sh}} to exit immediately, and the executor with it, while the wrapped
> {{command}} might need some more time to finish. As a result, the executor thinks
> the command exited gracefully, so it won't
> [escalate|https://github.com/apache/mesos/blob/1.1.0/src/launcher/executor.cpp#L695]
> to SIGKILL.
> This causes leaks when the POSIX containerizer is used: if the command ignores
> SIGTERM, it will be attached to init and never get killed. Using
> pid/namespace only masks the problem, because the hanging process is captured
> before it can shut down gracefully.
> The fix for this is to send SIGTERM only to the children of {{sh}}. {{sh}} will exit
> when all child processes finish. If they do not, they will be killed by escalation
> to SIGKILL.
> All versions since 0.20 are affected.
> This test should pass:
> [src/tests/command_executor_tests.cpp:342|https://github.com/apache/mesos/blob/2c856178b59593ff8068ea8d6c6593943c33008c/src/tests/command_executor_tests.cpp#L342-L343]
> [Mailing list
> thread|https://lists.apache.org/thread.html/1025dca0cf4418aee50b14330711500af864f08b53eb82d10cd5c04c@%3Cuser.mesos.apache.org%3E]

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (MESOS-6933) Executor does not respect grace period
[ https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089848#comment-16089848 ]

Deshi Xiao commented on MESOS-6933:
-----------------------------------

[~janisz] Could you write a test to cover it? I have no clue where in the code to start fixing this.
[jira] [Commented] (MESOS-6933) Executor does not respect grace period
[ https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089547#comment-16089547 ]

Tomasz Janiszewski commented on MESOS-6933:
-------------------------------------------

I can reproduce it on the latest master:
https://github.com/apache/mesos/commit/400d3002d4aa82cbae4b55bced608e95225176e4
[jira] [Commented] (MESOS-6933) Executor does not respect grace period
[ https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089452#comment-16089452 ]

Deshi Xiao commented on MESOS-6933:
-----------------------------------

[~janisz] Yes, I built from the upstream Mesos code base. It is 1.4.
[jira] [Commented] (MESOS-6933) Executor does not respect grace period
[ https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089403#comment-16089403 ]

Tomasz Janiszewski commented on MESOS-6933:
-------------------------------------------

[~xds2000] On my setup it works differently. I'm on Mesos 1.3, and when I follow the steps described above I end up in a state where the task is reported as killed but is still running. In the logs you provided I see that the Mesos executor somehow determined that not all processes had exited and sent the KILL signal to them. In my case it ends at SIGTERM:
{code}
Sent SIGTERM to the following process trees:
[
-+- 18776 sh -c /tmp/script.sh
 \-+- 18790 /bin/sh /tmp/script.sh
   \--- 18832 sleep 1
]
Scheduling escalation to SIGKILL in 3secs from now
Terminated SIGNAL
Command terminated with signal Terminated (pid: 18776)
{code}
[jira] [Commented] (MESOS-6933) Executor does not respect grace period
[ https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088708#comment-16088708 ]

Tomasz Janiszewski commented on MESOS-6933:
-------------------------------------------

The log I mentioned is {{/tmp/date.txt}}. You should be able to see new entries after the task is killed.
[jira] [Commented] (MESOS-6933) Executor does not respect grace period
[ https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088603#comment-16088603 ]

Deshi Xiao commented on MESOS-6933:
-----------------------------------

[~janisz] I have followed the reproduction steps, but I am not sure how to check whether "the shell has exited but script is still running and producing output" actually happened. Could you please annotate the Mesos log with comments so that I can understand it? Sorry for the request.
[jira] [Commented] (MESOS-6933) Executor does not respect grace period
[ https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999352#comment-15999352 ]

Tomasz Janiszewski commented on MESOS-6933:
-------------------------------------------

This error is quite easy to reproduce.
1. Run a Mesos cluster with the default configuration (you can use {{./build/bin/mesos-local.sh}}). Do not enable any isolators, especially the namespaces/pid isolator, because it can mask this bug.
2. Create a script that runs in an infinite loop and ignores signals:
{code}
cat > /tmp/script.sh <<EOF
#!/bin/sh
# Assumption: the original lines between "<<" and ">>" were lost in extraction;
# the trap and date lines are reconstructed from the description and the logs.
trap "echo SIGNAL" TERM
while true; do
  date >> /tmp/date.txt
  sleep 1
done
EOF
{code}
3. Start the script on Mesos and kill it after it has been running for a couple of seconds. You can use any framework, e.g.
{{mesos-execute --kill_after=10secs --master=localhost:5050 --command="/tmp/script.sh" --name="graceful-kill-test"}}
4. Monitor the logs. You can see there that the script is signaled with SIGTERM and the shell has exited, but the script is still running and producing output.

The easiest solution would be to signal the tree and then wait for all processes in the tree to exit, not only the root.
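The idea of waiting for the whole tree rather than only the root could be sketched roughly as below. This is an illustration under stated assumptions, not the executor's actual code: `isAlive`, `survivors`, and `shouldEscalate` are hypothetical names, and liveness is probed with the POSIX null signal, `kill(pid, 0)`.

```cpp
#include <cerrno>
#include <functional>
#include <vector>

#include <signal.h>
#include <sys/types.h>

// Hypothetical helper: probes a pid with the null signal. kill(pid, 0)
// only performs the existence/permission check and delivers nothing.
// Note that a zombie that has not been reaped still counts as existing.
inline bool isAlive(pid_t pid)
{
  if (::kill(pid, 0) == 0) {
    return true;
  }
  // EPERM means the process exists but we are not allowed to signal it.
  return errno == EPERM;
}

// Hypothetical helper: returns the members of the signaled tree that are
// still running. The liveness check is injectable so that the logic can
// be exercised without real processes.
inline std::vector<pid_t> survivors(
    const std::vector<pid_t>& signaled,
    const std::function<bool(pid_t)>& alive = isAlive)
{
  std::vector<pid_t> result;
  for (pid_t pid : signaled) {
    if (alive(pid)) {
      result.push_back(pid);
    }
  }
  return result;
}

// Escalate to SIGKILL when the grace period is over and anything from
// the tree is left, even if the root `sh` has already exited.
inline bool shouldEscalate(
    const std::vector<pid_t>& signaled,
    bool gracePeriodExpired,
    const std::function<bool(pid_t)>& alive = isAlive)
{
  return gracePeriodExpired && !survivors(signaled, alive).empty();
}
```

With the tree from the issue description (pid 18 for `sh`, pid 19 for the command), `shouldEscalate({18, 19}, true, ...)` stays true as long as pid 19 is alive, regardless of whether pid 18 has exited.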
[jira] [Commented] (MESOS-6933) Executor does not respect grace period
[ https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999199#comment-15999199 ]

Deshi Xiao commented on MESOS-6933:
-----------------------------------

How can I reproduce this bug? I would like to understand where a patch could be made.
[jira] [Commented] (MESOS-6933) Executor does not respect grace period
[ https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982998#comment-15982998 ]

Alexander Rukletsov commented on MESOS-6933:
--------------------------------------------

[~janisz], this is, unfortunately, a known issue that has been here for a while (I linked the original ticket). Surprisingly, we haven't seen a lot of requests to fix it (do folks avoid wrapping their tasks in {{sh}}?) and never got to work on it. Do you want to suggest a patch? I'll be happy to shepherd.
[jira] [Commented] (MESOS-6933) Executor does not respect grace period
[ https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927809#comment-15927809 ]

Tomasz Janiszewski commented on MESOS-6933:
-------------------------------------------

I think it's related to the following [~benjaminhindman] comment:
{code}
// TODO(benh): Allow excluding the root pid from stopping, killing,
// and continuing so as to provide a means for expressing "kill all of
// my children". This is non-trivial because of the current
// implementation.
{code}
https://github.com/apache/mesos/blob/22d3f56ce10cf61b6a1f06614bd63e0943a8b769/3rdparty/stout/include/stout/os/posix/killtree.hpp#L54-L57
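One way to read "kill all of my children" is to collect every descendant of the root pid while leaving the root itself out of the result. The sketch below does that over a plain (pid, parent) table that is assumed to form a tree. `ProcessEntry` and `descendantsOf` are hypothetical names used only for illustration and are not part of stout's `killtree` API.

```cpp
#include <queue>
#include <vector>

#include <sys/types.h>

// Hypothetical record: one row of a process table snapshot.
struct ProcessEntry
{
  pid_t pid;
  pid_t parent;
};

// Returns all descendants of `root` in breadth-first order, excluding
// `root` itself, so that a caller could signal the children of `sh`
// without touching `sh`.
inline std::vector<pid_t> descendantsOf(
    pid_t root,
    const std::vector<ProcessEntry>& table)
{
  std::vector<pid_t> result;
  std::queue<pid_t> pending;
  pending.push(root);

  while (!pending.empty()) {
    const pid_t current = pending.front();
    pending.pop();

    for (const ProcessEntry& entry : table) {
      // Skip rows that list themselves as their own parent.
      if (entry.parent == current && entry.pid != current) {
        result.push_back(entry.pid);
        pending.push(entry.pid);
      }
    }
  }

  return result;
}
```

For a tree such as `sh -c /tmp/script.sh` (18776), `/bin/sh /tmp/script.sh` (18790), and `sleep 1` (18832), calling `descendantsOf(18776, table)` yields 18790 and 18832 and leaves 18776 out.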
[jira] [Commented] (MESOS-6933) Executor does not respect grace period
[ https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839526#comment-15839526 ]

Tomasz Janiszewski commented on MESOS-6933:
-------------------------------------------

I think ??{{/bin/sh}} doesn't forward signals to any child processes?? is not the problem: {{killTree}} delivers the signal to every child of {{sh}}. The problem is that {{sh}} terminates quickly, while its children may need some time to shut down gracefully.
[jira] [Commented] (MESOS-6933) Executor does not respect grace period
[ https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839421#comment-15839421 ]

haosdent commented on MESOS-6933:
---------------------------------

[~klueska][~janisz] This is an {{sh}} problem rather than a Mesos bug, because {{/bin/sh}} doesn't forward signals to any child processes. Docker has a similar problem with graceful exit when you use {{sh}} to launch commands; refer to https://www.ctl.io/developers/blog/post/gracefully-stopping-docker-containers/ for the details. So the correct way to implement graceful exit in Docker, Mesos, and other applications is to avoid using {{sh}}. More precisely, users should set {{CommandInfo.shell}} to false and use the {{exec}} form to launch tasks if they want their tasks to exit gracefully. Does that make sense?
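To make the difference between the two launch modes concrete, the sketch below builds the call a launcher would hand to an `exec`-family function in each case. `CommandSpec`, `ExecCall`, and `buildExecCall` are hypothetical stand-ins for illustration: they only mirror the `shell`, `value`, and `arguments` fields of `CommandInfo` and are not Mesos code.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in mirroring the relevant CommandInfo fields.
struct CommandSpec
{
  bool shell;
  std::string value;
  std::vector<std::string> arguments;
};

// Hypothetical description of an exec call: the file to execute and the
// argument vector passed to it.
struct ExecCall
{
  std::string file;
  std::vector<std::string> argv;
};

inline ExecCall buildExecCall(const CommandSpec& command)
{
  if (command.shell) {
    // Shell form: the launched process is `sh`, and the task runs as its
    // child. `arguments` is not used in this mode.
    return ExecCall{"sh", {"sh", "-c", command.value}};
  }

  // Exec form: `value` is the executable and `arguments` is its argv, so
  // the launched process is the task itself.
  return ExecCall{command.value, command.arguments};
}
```

In the exec form there is no intermediate `sh` in the process tree, so the pid that receives SIGTERM is also the pid whose exit is being waited on.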
[jira] [Commented] (MESOS-6933) Executor does not respect grace period
[ https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828586#comment-15828586 ]

Kevin Klues commented on MESOS-6933:
------------------------------------

I assume you are referring to the "Command Executor", not the "Default Executor" (the default executor is new as of the 1.1 release and deals with launching task groups).

[~jieyu] [~vinodkone] [~bmahler] Who is the best person to take a look at this bug?

> Executor does not respect grace period
> --------------------------------------
>
>                 Key: MESOS-6933
>                 URL: https://issues.apache.org/jira/browse/MESOS-6933
>             Project: Mesos
>          Issue Type: Bug
>          Components: executor
>            Reporter: Tomasz Janiszewski
>
> The Mesos Default Executor tries to support a grace period with escalation, but
> unfortunately it does not work. It launches {{command}} by wrapping it in
> {{sh -c}}, which causes the process tree to look like this:
> {code}
> Received killTask
> Shutting down
> Sending SIGTERM to process tree at pid 18
> Sent SIGTERM to the following process trees:
> [
> -+- 18 sh -c cd offer-i18n-0.1.24 && LD_PRELOAD=../librealresources.so ./bin/offer-i18n -e prod -p $PORT0
>  \--- 19 command...
> ]
> Command terminated with signal Terminated (pid: 18)
> {code}
> This causes {{sh}} to exit immediately, and the executor with it, while the wrapped
> {{command}} might need some more time to finish. As a result, the executor thinks
> the command exited gracefully, so it won't
> [escalate|https://github.com/apache/mesos/blob/1.1.0/src/launcher/executor.cpp#L695]
> to SIGKILL.
> This causes leaks when the POSIX containerizer is used: if the command ignores
> SIGTERM, it will be attached to init and never get killed. Using pid/namespace
> only masks the problem, because the hanging process is captured before it can
> shut down gracefully.
> The fix for this is to send SIGTERM only to the children of {{sh}}. {{sh}} will exit
> when all subprocesses finish. If they do not, they will be killed by escalation to
> SIGKILL.
> All versions since 0.20 are affected.
> This test should pass:
> [src/tests/command_executor_tests.cpp:342|https://github.com/apache/mesos/blob/2c856178b59593ff8068ea8d6c6593943c33008c/src/tests/command_executor_tests.cpp#L342-L343]
> [Mailing list
> thread|https://lists.apache.org/thread.html/1025dca0cf4418aee50b14330711500af864f08b53eb82d10cd5c04c@%3Cuser.mesos.apache.org%3E]