[ https://issues.apache.org/jira/browse/MESOS-4869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209666#comment-15209666 ]
Benjamin Mahler commented on MESOS-4869:
----------------------------------------

Hi [~scalp42], thank you for reporting this! [~jieyu] and I took a look at the health check program and found a memory leak in some of the library code it uses; I filed MESOS-5021 to follow up on it. We leak the environment variables each time a health check process is run, which is likely the leak you've observed.

> /usr/libexec/mesos/mesos-health-check using/leaking a lot of memory
> -------------------------------------------------------------------
>
>                 Key: MESOS-4869
>                 URL: https://issues.apache.org/jira/browse/MESOS-4869
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.27.1
>            Reporter: Anthony Scalisi
>            Priority: Critical
>
> We switched our health checks in Marathon from HTTP to COMMAND:
> {noformat}
> "healthChecks": [
>   {
>     "protocol": "COMMAND",
>     "path": "/ops/ping",
>     "command": { "value": "curl --silent -f -X GET http://$HOST:$PORT0/ops/ping > /dev/null" },
>     "gracePeriodSeconds": 90,
>     "intervalSeconds": 2,
>     "portIndex": 0,
>     "timeoutSeconds": 5,
>     "maxConsecutiveFailures": 3
>   }
> ]
> {noformat}
> All our applications have the same health check (and the same /ops/ping endpoint).
> Even though we have the issue on all our Mesos slaves, I'm going to focus on a particular one: *mesos-slave-i-e3a9c724*.
> The slave has 16 gigs of memory, with about 12 gigs allocated to 8 tasks:
> !https://i.imgur.com/gbRf804.png!
> Here is a *docker ps* on it:
> {noformat}
> root@mesos-slave-i-e3a9c724 # docker ps
> CONTAINER ID   IMAGE    COMMAND                  CREATED        STATUS        PORTS                     NAMES
> 4f7c0aa8d03a   java:8   "/bin/sh -c 'JAVA_OPT"   6 hours ago    Up 6 hours    0.0.0.0:31926->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.3dbb1004-5bb8-432f-8fd8-b863bd29341d
> 66f2fc8f8056   java:8   "/bin/sh -c 'JAVA_OPT"   6 hours ago    Up 6 hours    0.0.0.0:31939->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.60972150-b2b1-45d8-8a55-d63e81b8372a
> f7382f241fce   java:8   "/bin/sh -c 'JAVA_OPT"   6 hours ago    Up 6 hours    0.0.0.0:31656->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.39731a2f-d29e-48d1-9927-34ab8c5f557d
> 880934c0049e   java:8   "/bin/sh -c 'JAVA_OPT"   24 hours ago   Up 24 hours   0.0.0.0:31371->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.23dfe408-ab8f-40be-bf6f-ce27fe885ee0
> 5eab1f8dac4a   java:8   "/bin/sh -c 'JAVA_OPT"   46 hours ago   Up 46 hours   0.0.0.0:31500->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.5ac75198-283f-4349-a220-9e9645b313e7
> b63740fe56e7   java:8   "/bin/sh -c 'JAVA_OPT"   46 hours ago   Up 46 hours   0.0.0.0:31382->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.5d417f16-df24-49d5-a5b0-38a7966460fe
> 5c7a9ea77b0e   java:8   "/bin/sh -c 'JAVA_OPT"   2 days ago     Up 2 days     0.0.0.0:31186->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.b05043c5-44fc-40bf-aea2-10354e8f5ab4
> 53065e7a31ad   java:8   "/bin/sh -c 'JAVA_OPT"   2 days ago     Up 2 days     0.0.0.0:31839->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.f0a3f4c5-ecdb-4f97-bede-d744feda670c
> {noformat}
> Here is a *docker stats* on it:
> {noformat}
> root@mesos-slave-i-e3a9c724 # docker stats
> CONTAINER      CPU %    MEM USAGE / LIMIT     MEM %    NET I/O               BLOCK I/O
> 4f7c0aa8d03a   2.93%    797.3 MB / 1.611 GB   49.50%   1.277 GB / 1.189 GB   155.6 kB / 151.6 kB
> 53065e7a31ad   8.30%    738.9 MB / 1.611 GB   45.88%   419.6 MB / 554.3 MB   98.3 kB / 61.44 kB
> 5c7a9ea77b0e   4.91%    1.081 GB / 1.611 GB   67.10%   423 MB / 526.5 MB     3.219 MB / 61.44 kB
> 5eab1f8dac4a   3.13%    1.007 GB / 1.611 GB   62.53%   2.737 GB / 2.564 GB   6.566 MB / 118.8 kB
> 66f2fc8f8056   3.15%    768.1 MB / 1.611 GB   47.69%   258.5 MB / 252.8 MB   1.86 MB / 151.6 kB
> 880934c0049e   10.07%   735.1 MB / 1.611 GB   45.64%   1.451 GB / 1.399 GB   573.4 kB / 94.21 kB
> b63740fe56e7   12.04%   629 MB / 1.611 GB     39.06%   10.29 GB / 9.344 GB   8.102 MB / 61.44 kB
> f7382f241fce   6.21%    505 MB / 1.611 GB     31.36%   153.4 MB / 151.9 MB   5.837 MB / 94.21 kB
> {noformat}
> Not much else is running on the slave, yet the used memory doesn't map to the tasks' memory:
> {noformat}
> Mem:16047M used:13340M buffers:1139M cache:776M
> {noformat}
> If I exec into a container (*java:8* image), I can see the shell calls that run the curl specified in the health check; they execute and exit correctly, as expected.
> The only change that coincided with the memory usage woes was moving the health checks from Marathon to Mesos, so I decided to take a look:
> {noformat}
> root@mesos-slave-i-e3a9c724 # ps awwx | grep health_check | grep -v grep
>  2504 ?  Sl  47:33  /usr/libexec/mesos/mesos-health-check --executor=(1)@10.92.32.63:53432 --health_check_json={"command":{"shell":true,"value":"docker exec mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.f0a3f4c5-ecdb-4f97-bede-d744feda670c sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > \/dev\/null \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0} --task_id=prod_talkk_email-green.b086206a-e000-11e5-a617-02429957d388
>  4220 ?  Sl  47:26  /usr/libexec/mesos/mesos-health-check --executor=(1)@10.92.32.63:54982 --health_check_json={"command":{"shell":true,"value":"docker exec mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.b05043c5-44fc-40bf-aea2-10354e8f5ab4 sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > \/dev\/null \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0} --task_id=prod_talkk_chat-green.ed53ec41-e000-11e5-a617-02429957d388
>  7444 ?  Sl   1:31  /usr/libexec/mesos/mesos-health-check --executor=(1)@10.92.32.63:59422 --health_check_json={"command":{"shell":true,"value":"docker exec mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.60972150-b2b1-45d8-8a55-d63e81b8372a sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > \/dev\/null \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0} --task_id=prod_talkk_identity-green.aeb2ef3b-e219-11e5-a617-02429957d388
> 10368 ?  Sl   1:30  /usr/libexec/mesos/mesos-health-check --executor=(1)@10.92.32.63:40981 --health_check_json={"command":{"shell":true,"value":"docker exec mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.3dbb1004-5bb8-432f-8fd8-b863bd29341d sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > \/dev\/null \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0} --task_id=prod_talkk_channel-green.c6fbd2ac-e219-11e5-a617-02429957d388
> 12399 ?  Sl   9:45  /usr/libexec/mesos/mesos-health-check --executor=(1)@10.92.32.63:44815 --health_check_json={"command":{"shell":true,"value":"docker exec mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.23dfe408-ab8f-40be-bf6f-ce27fe885ee0 sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > \/dev\/null \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0} --task_id=prod_talkk_integration-green.143865d5-e17d-11e5-a617-02429957d388
> 13538 ?  Sl  24:54  /usr/libexec/mesos/mesos-health-check --executor=(1)@10.92.32.63:56598 --health_check_json={"command":{"shell":true,"value":"docker exec mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.5d417f16-df24-49d5-a5b0-38a7966460fe sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > \/dev\/null \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0} --task_id=prod_talkk_metric-green.75296986-e0c7-11e5-a617-02429957d388
> 32034 ?  Sl   1:31  /usr/libexec/mesos/mesos-health-check --executor=(1)@10.92.32.63:48119 --health_check_json={"command":{"shell":true,"value":"docker exec mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.39731a2f-d29e-48d1-9927-34ab8c5f557d sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > \/dev\/null \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0} --task_id=prod_talkk_push-green.601337e6-e219-11e5-a617-02429957d388
> {noformat}
> The memory usage is really bad:
> {noformat}
> root@mesos-slave-i-e3a9c724 # ps -eo size,pid,user,command --sort -size | grep health_check | awk '{ hr=$1/1024 ; printf("%13.2f Mb ",hr) } { for ( x=4 ; x<=NF ; x++ ) { printf("%s ",$x) } print "" }'
>       2185.39 Mb /usr/libexec/mesos/mesos-health-check --executor=(1)@10.92.32.63:53432 --health_check_json={"command":{"shell":true,"value":"docker exec mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.f0a3f4c5-ecdb-4f97-bede-d744feda670c sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > \/dev\/null \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0} --task_id=prod_talkk_email-green.b086206a-e000-11e5-a617-02429957d388
>       2185.39 Mb /usr/libexec/mesos/mesos-health-check --executor=(1)@10.92.32.63:54982 --health_check_json={"command":{"shell":true,"value":"docker exec mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.b05043c5-44fc-40bf-aea2-10354e8f5ab4 sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > \/dev\/null \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0} --task_id=prod_talkk_chat-green.ed53ec41-e000-11e5-a617-02429957d388
>       1673.39 Mb /usr/libexec/mesos/mesos-health-check --executor=(1)@10.92.32.63:56598 --health_check_json={"command":{"shell":true,"value":"docker exec mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.5d417f16-df24-49d5-a5b0-38a7966460fe sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > \/dev\/null \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0} --task_id=prod_talkk_metric-green.75296986-e0c7-11e5-a617-02429957d388
>       1161.39 Mb /usr/libexec/mesos/mesos-health-check --executor=(1)@10.92.32.63:44815 --health_check_json={"command":{"shell":true,"value":"docker exec mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.23dfe408-ab8f-40be-bf6f-ce27fe885ee0 sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > \/dev\/null \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0} --task_id=prod_talkk_integration-green.143865d5-e17d-11e5-a617-02429957d388
>        649.39 Mb /usr/libexec/mesos/mesos-health-check --executor=(1)@10.92.32.63:59422 --health_check_json={"command":{"shell":true,"value":"docker exec mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.60972150-b2b1-45d8-8a55-d63e81b8372a sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > \/dev\/null \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0} --task_id=prod_talkk_identity-green.aeb2ef3b-e219-11e5-a617-02429957d388
>        649.39 Mb /usr/libexec/mesos/mesos-health-check --executor=(1)@10.92.32.63:40981 --health_check_json={"command":{"shell":true,"value":"docker exec mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.3dbb1004-5bb8-432f-8fd8-b863bd29341d sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > \/dev\/null \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0} --task_id=prod_talkk_channel-green.c6fbd2ac-e219-11e5-a617-02429957d388
>        649.39 Mb /usr/libexec/mesos/mesos-health-check --executor=(1)@10.92.32.63:48119 --health_check_json={"command":{"shell":true,"value":"docker exec mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.39731a2f-d29e-48d1-9927-34ab8c5f557d sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > \/dev\/null \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0} --task_id=prod_talkk_push-green.601337e6-e219-11e5-a617-02429957d388
>          0.32 Mb grep --color=auto health_check
> {noformat}
> Killing the *mesos-health-check* process for each container fixes our memory issues (but I'm assuming health checks won't be reported anymore or something):
> {noformat}
> root@mesos-slave-i-e3a9c724 # date ; free -m ; ps awwx | grep health_check | grep -v grep | awk '{print $1}' | xargs -I% -P1 kill % ; date ; free -m
> Fri Mar  4 21:20:55 UTC 2016
>              total       used       free     shared    buffers     cached
> Mem:         16047      13538       2508          0       1140        774
> -/+ buffers/cache:      11623       4423
> Swap:            0          0          0
> Fri Mar  4 21:20:56 UTC 2016
>              total       used       free     shared    buffers     cached
> Mem:         16047       9101       6945          0       1140        774
> -/+ buffers/cache:       7186       8860
> Swap:            0          0          0
> {noformat}
> We're reverting to Marathon doing the health checks for now, but would like to emphasize that it's happening across all our slaves (not an isolated issue).
> Thanks for looking into it :)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
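[Editor's note] As an aside on diagnosing this kind of growth: the per-process memory that the report measures with the ps/awk pipeline can also be sampled from /proc, which makes it easy to re-run periodically and watch whether a process keeps climbing. A minimal sketch, assuming Linux and the `mesos-health-check` process name from the report; the helper name and default pattern are illustrative, not part of Mesos:

```shell
#!/bin/sh
# Print PID and resident set size (VmRSS, in kB) for every process whose
# command line matches a pattern. Re-run periodically (or wrap in `watch`)
# to see whether memory keeps growing between health-check runs.

rss_kb() {
    # VmRSS is the second field of the "VmRSS:" line in /proc/PID/status.
    awk '/^VmRSS:/ {print $2}' "/proc/$1/status" 2>/dev/null
}

pattern="${1:-mesos-health-check}"   # hypothetical default; pass your own
for pid in $(pgrep -f "$pattern"); do
    printf '%s\t%s kB\n' "$pid" "$(rss_kb "$pid")"
done
```

Note that `ps -eo size` (used in the report) and VmRSS measure different things: `size` approximates writable virtual memory, while VmRSS is resident memory; for a leak like the one described here, both keep growing over time.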