[jira] [Comment Edited] (MESOS-8038) Launching GPU task sporadically fails.
[ https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17089005#comment-17089005 ] Charles Natali edited comment on MESOS-8038 at 4/21/20, 7:46 PM: - [~bmahler] I have a way to reproduce it systematically, albeit very contrived: using syscall fault injection. Basically I just continuously start tasks which allocate 1 GPU and just do "exit 0" (see attached Python framework). Then, I run the following - inject a few seconds of delay into all rmdir syscalls made by the agent: {noformat} # strace -p $(pgrep -f mesos-agent) -f -e inject=rmdir:delay_enter=300 -o /dev/null {noformat} After less than a minute, tasks start failing with this error: {noformat} Failed to launch container: Requested 1 gpus but only 0 available{noformat} I'll try to see if I can find a simpler reproducer, but this seems to fail systematically for me. was (Author: cf.natali): [~bmahler] I have a way to reproduce it systematically, albeit very contrived: using syscall fault injection. Basically I just continuously start tasks which allocate 1 GPU and just do "exit 0" (see attached Python framework). Then, I run the following - inject a few seconds of delay into all rmdir syscalls made by the agent: {noformat} # strace -p $(pgrep -f mesos-agent) -f -e inject=rmdir:delay_enter=300 -o /dev/null {noformat} After a few minutes, tasks start failing with this error: {noformat} Failed to launch container: Requested 1 gpus but only 0 available{noformat} I'll try to see if I can find a simpler reproducer, but this seems to fail systematically for me. > Launching GPU task sporadically fails. 
> -- > > Key: MESOS-8038 > URL: https://issues.apache.org/jira/browse/MESOS-8038 > Project: Mesos > Issue Type: Bug > Components: containerization, gpu >Affects Versions: 1.4.0 >Reporter: Sai Teja Ranuva >Assignee: Zhitao Li >Priority: Critical > Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt, > mesos-slave.INFO.log, start_short_tasks_gpu.py > > > I was running a job which uses GPUs. It runs fine most of the time. > But occasionally I see the following message in the mesos log. > "Collect failed: Requested 1 but only 0 available" > Followed by the executor getting killed and the tasks getting lost. This happens > even before the job starts. A little search in the code base points me to > something related to the GPU resource being the probable cause. > There is no deterministic way that this can be reproduced. It happens > occasionally. > I have attached the slave log for the issue. > Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-8038) Launching GPU task sporadically fails.
[ https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17089005#comment-17089005 ] Charles Natali edited comment on MESOS-8038 at 4/21/20, 7:32 PM: - [~bmahler] I have a way to reproduce it systematically, albeit very contrived: using syscall fault injection. Basically I just continuously start tasks which allocate 1 GPU and just do "exit 0" (see attached Python framework). Then, I run the following - inject a few seconds of delay into all rmdir syscalls made by the agent: {noformat} # strace -p $(pgrep -f mesos-agent) -f -e inject=rmdir:delay_enter=300 -o /dev/null {noformat} After a few minutes, tasks start failing with this error: {noformat} Failed to launch container: Requested 1 gpus but only 0 available{noformat} I'll try to see if I can find a simpler reproducer, but this seems to fail systematically for me. was (Author: cf.natali): [~bmahler] I have a way to reproduce it systematically, albeit very contrived: using syscall fault injection. Basically I just continuously start tasks which allocate 1 GPU and just do "exit 0" (see attached Python framework). Then, I run the following - inject a few seconds of delay into all rmdir syscalls made by the agent: {noformat} # strace -p $(pgrep -f mesos-agent) -f -e inject=rmdir:delay_enter=300 -o /dev/null {noformat} After a few minutes, tasks start failing with this error: Failed to launch container: Requested 1 gpus but only 0 available I'll try to see if I can find a simpler reproducer, but this seems to fail systematically for me. > Launching GPU task sporadically fails. 
[jira] [Commented] (MESOS-8038) Launching GPU task sporadically fails.
[ https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17089005#comment-17089005 ] Charles Natali commented on MESOS-8038: --- [~bmahler] I have a way to reproduce it systematically, albeit very contrived: using syscall fault injection. Basically I just continuously start tasks which allocate 1 GPU and just do "exit 0" (see attached Python framework). Then, I run the following - inject a few seconds of delay into all rmdir syscalls made by the agent: {noformat} # strace -p $(pgrep -f mesos-agent) -f -e inject=rmdir:delay_enter=300 -o /dev/null {noformat} After a few minutes, tasks start failing with this error: Failed to launch container: Requested 1 gpus but only 0 available I'll try to see if I can find a simpler reproducer, but this seems to fail systematically for me. > Launching GPU task sporadically fails. > -- > > Key: MESOS-8038 > URL: https://issues.apache.org/jira/browse/MESOS-8038 > Project: Mesos > Issue Type: Bug > Components: containerization, gpu >Affects Versions: 1.4.0 >Reporter: Sai Teja Ranuva >Assignee: Zhitao Li >Priority: Critical > Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt, > mesos-slave.INFO.log, start_short_tasks_gpu.py > > > I was running a job which uses GPUs. It runs fine most of the time. > But occasionally I see the following message in the mesos log. > "Collect failed: Requested 1 but only 0 available" > Followed by the executor getting killed and the tasks getting lost. This happens > even before the job starts. A little search in the code base points me to > something related to the GPU resource being the probable cause. > There is no deterministic way that this can be reproduced. It happens > occasionally. > I have attached the slave log for the issue. > Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10119) failure to destroy container can cause the agent to "leak" a GPU
[ https://issues.apache.org/jira/browse/MESOS-10119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17088973#comment-17088973 ] Charles Natali commented on MESOS-10119: So, the good news: I couldn't reproduce it - it turned out to be a bug in one of our legacy systems which caused it to remove the agent's cgroups... However I did observe this particular failure as a consequence of the now fixed https://issues.apache.org/jira/browse/MESOS-10107 > Marking as a duplicate of MESOS-8038. Ah, let's close this one then. > failure to destroy container can cause the agent to "leak" a GPU > > > Key: MESOS-10119 > URL: https://issues.apache.org/jira/browse/MESOS-10119 > Project: Mesos > Issue Type: Task > Components: agent, containerization >Reporter: Charles Natali >Priority: Major > > At work we hit the following problem: > # cgroup for a task using the GPU isolation failed to be destroyed on OOM > # the agent continued advertising the GPU as available > # all subsequent attempts to start tasks using a GPU fail with "Requested 1 > gpus but only 0 available" > Problem 1 looks like https://issues.apache.org/jira/browse/MESOS-9950, so it can > be tackled separately, however the fact that the agent basically leaks the > GPU is pretty bad, because it basically turns into /dev/null, failing all > subsequent tasks requesting a GPU. 
[jira] [Assigned] (MESOS-10113) OpenSSLSocketImpl with 'support_downgrade' waits for incoming bytes before accepting new connection.
[ https://issues.apache.org/jira/browse/MESOS-10113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-10113: --- Assignee: Benjamin Mahler https://reviews.apache.org/r/72352/ > OpenSSLSocketImpl with 'support_downgrade' waits for incoming bytes before > accepting new connection. > > > Key: MESOS-10113 > URL: https://issues.apache.org/jira/browse/MESOS-10113 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Critical > > With {{support_downgrade}} enabled, the accept loop in OpenSSLSocketImpl > waits for incoming bytes on the accepted socket before allowing > another socket to be accepted. This will lead to significant throughput > issues for accepting new connections (e.g. during a master failover), or may > block entirely if a client doesn't send any data for whatever reason. > Marking as a bug due to the potential for blocking incoming connections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
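The failure mode described above can be made concrete with a minimal Python sketch (an assumption: this is a simplified model of the accept loop, not libprocess code). The server waits for bytes on each accepted socket before accepting the next one, so a second client stays queued in the backlog until the first client sends data:

```python
import socket
import threading
import time

served = []

def serve(listener):
    # Accept one connection, then wait for bytes on it before accepting
    # the next one -- a simplified model of peeking at a connection's
    # first bytes before returning to accept().
    conn, _ = listener.accept()
    conn.recv(1)                      # blocks until the first client sends
    conn.close()
    conn2, _ = listener.accept()      # only now is the second client served
    served.append(True)
    conn2.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(5)
port = listener.getsockname()[1]

t = threading.Thread(target=serve, args=(listener,))
t.start()

c1 = socket.create_connection(("127.0.0.1", port))
c2 = socket.create_connection(("127.0.0.1", port))  # sits in the backlog
time.sleep(0.2)
print("second client served before first sent data:", bool(served))
c1.sendall(b"x")                     # unblocks the accept loop
t.join(2)
print("second client served after first sent data:", bool(served))
c1.close(); c2.close(); listener.close()
```

A silent client (one that connects but never writes) stalls the loop indefinitely, which matches the "may block entirely" concern in the ticket.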
[jira] [Assigned] (MESOS-10124) OpenSSLSocketImpl on Windows with 'support_downgrade' is incorrectly polling for read readiness.
[ https://issues.apache.org/jira/browse/MESOS-10124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-10124: --- Assignee: Benjamin Mahler > OpenSSLSocketImpl on Windows with 'support_downgrade' is incorrectly polling > for read readiness. > > > Key: MESOS-10124 > URL: https://issues.apache.org/jira/browse/MESOS-10124 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: windows > > OpenSSLSocket is currently using the zero byte read trick on Windows to poll > for read readiness when peeking at the data to determine whether the incoming > connection is performing an SSL handshake. However, io::read is designed to > provide consistent semantics for a zero byte read across posix and windows, > which is to return immediately. > To fix this, we can either: > (1) Have different semantics for zero byte io::read on posix / windows, where > we just let it fall through to the system calls. This might be confusing for > users, but it's unlikely that a caller would perform a zero byte read in > typical code so the confusion is probably avoided. > (2) Implement io::poll for reads on windows. This would make the caller code > consistent and is probably less confusing to users. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10124) OpenSSLSocketImpl on Windows with 'support_downgrade' is incorrectly polling for read readiness.
Benjamin Mahler created MESOS-10124: --- Summary: OpenSSLSocketImpl on Windows with 'support_downgrade' is incorrectly polling for read readiness. Key: MESOS-10124 URL: https://issues.apache.org/jira/browse/MESOS-10124 Project: Mesos Issue Type: Bug Components: libprocess Reporter: Benjamin Mahler OpenSSLSocket is currently using the zero byte read trick on Windows to poll for read readiness when peeking at the data to determine whether the incoming connection is performing an SSL handshake. However, io::read is designed to provide consistent semantics for a zero byte read across posix and windows, which is to return immediately. To fix this, we can either: (1) Have different semantics for zero byte io::read on posix / windows, where we just let it fall through to the system calls. This might be confusing for users, but it's unlikely that a caller would perform a zero byte read in typical code so the confusion is probably avoided. (2) Implement io::poll for reads on windows. This would make the caller code consistent and is probably less confusing to users. -- This message was sent by Atlassian Jira (v8.3.4#803005)
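The "return immediately" semantics that make the zero byte read useless for readiness polling can be seen directly on the POSIX side (Python here only to illustrate the underlying read(2) behaviour; this is not Mesos code):

```python
import os

# A zero-byte read returns immediately with no data, even though nothing
# has been written to the pipe yet -- it does not wait for readiness.
# This is why a zero-byte io::read cannot stand in for a poll.
r, w = os.pipe()
data = os.read(r, 0)   # does not block despite the empty pipe
assert data == b""
os.close(r)
os.close(w)
print("zero-byte read returned immediately")
```

Option (2) in the ticket would keep this behaviour for io::read and give callers an explicit io::poll instead.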
[jira] [Created] (MESOS-10123) Windows overlapped IO discard handling can drop data.
Benjamin Mahler created MESOS-10123: --- Summary: Windows overlapped IO discard handling can drop data. Key: MESOS-10123 URL: https://issues.apache.org/jira/browse/MESOS-10123 Project: Mesos Issue Type: Bug Components: libprocess Reporter: Benjamin Mahler Assignee: Benjamin Mahler When getting a discard request for an io operation on windows, a cancellation is requested [1] and when the io operation completes we check whether the future had a discard request to decide whether to discard it [2]: {code}
template <typename T>
static void set_io_promise(Promise<T>* promise, const T& data, DWORD error)
{
  if (promise->future().hasDiscard()) {
    promise->discard();
  } else if (error == ERROR_SUCCESS) {
    promise->set(data);
  } else {
    promise->fail("IO failed with error code: " + WindowsError(error).message);
  }
}
{code} However, it's possible the operation completed successfully, in which case we did not succeed at canceling it. We need to check for {{ERROR_OPERATION_ABORTED}} [3]: {code}
template <typename T>
static void set_io_promise(Promise<T>* promise, const T& data, DWORD error)
{
  if (promise->future().hasDiscard() && error == ERROR_OPERATION_ABORTED) {
    promise->discard();
  } else if (error == ERROR_SUCCESS) {
    promise->set(data);
  } else {
    promise->fail("IO failed with error code: " + WindowsError(error).message);
  }
}
{code} I don't think there are currently any major consequences to this issue, since most callers tend to be discarding only when they're essentially abandoning the entire process of reading or writing. [1] https://github.com/apache/mesos/blob/1.9.0/3rdparty/libprocess/src/windows/libwinio.cpp#L448 [2] https://github.com/apache/mesos/blob/1.9.0/3rdparty/libprocess/src/windows/libwinio.cpp#L141-L151 [3] https://docs.microsoft.com/en-us/windows/win32/fileio/cancelioex-func -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10122) cmake+MSBuild is incapable of building all Mesos sources in parallel.
Andrei Sekretenko created MESOS-10122: - Summary: cmake+MSBuild is incapable of building all Mesos sources in parallel. Key: MESOS-10122 URL: https://issues.apache.org/jira/browse/MESOS-10122 Project: Mesos Issue Type: Bug Components: build Environment: When a library (in cmake's sense) contains several sources with different paths but the same filename (for example, slave/validation.cpp and resource_provider/validation.cpp), the build generated by CMake for MSVC does not allow for building those files in parallel (presumably, because the .obj files will be located in the same directory). This has been observed with both cmake 3.9 and 3.17, with the "Visual Studio 15 2017 Win64" generator. It seems to be a known behaviour - see https://stackoverflow.com/questions/7033855/msvc10-mp-builds-not-multicore-across-folders-in-a-project. Two options for fixing this in a way that will work with these cmake/MSVC configurations are: - splitting the build into small static libraries (a library per directory) - introducing an intermediate code-generation-like step optionally flattening the directory structure (slave/validation.cpp -> slave_validation.cpp) Both options have their drawbacks: - The first will result in changing the layout of the static build artifacts (mesos.lib will be replaced with a ton of smaller libraries), which will pose integration challenges and potentially result in worse parallelism. - The second will result in being unable to use #include without a path (right now there are three or four such #include's in the whole Mesos/libprocess code buildable on Windows) and in a changed value of the __FILE__ macro (as a consequence, in the example above, `validation.cpp` in logs will be replaced either with `slave_validation.cpp` or with `resource_provider_validation.cpp`) Note that the second approach will need to deal with potential collisions when the source tree has filenames with underscores. 
If, for example, we had both slave/validation.cpp and slave_validation.cpp, then either some additional escaping will be needed or, alternatively, such a layout could simply be forbidden (and made to fail the build). Preliminary testing shows that on an 8-core AWS instance, flattening the source trees of libprocess, mesos-protobufs and libmesos reduces a clean build from 54 minutes to 33 minutes. Reporter: Andrei Sekretenko Assignee: Andrei Sekretenko -- This message was sent by Atlassian Jira (v8.3.4#803005)
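The flattening step and the underscore-collision concern can be sketched as follows (hypothetical helper names; a real implementation would live in the build tooling):

```python
from collections import Counter

def flatten(path):
    # slave/validation.cpp -> slave_validation.cpp
    return path.replace("/", "_")

def underscore_collisions(sources):
    # Names that become identical after flattening -- e.g. a pre-existing
    # slave_validation.cpp colliding with flattened slave/validation.cpp.
    counts = Counter(flatten(p) for p in sources)
    return sorted(name for name, n in counts.items() if n > 1)

sources = [
    "slave/validation.cpp",
    "resource_provider/validation.cpp",  # same basename, different path
    "slave_validation.cpp",              # pre-existing underscore name
]

# The two validation.cpp files no longer share a name after flattening...
print(flatten(sources[0]))  # slave_validation.cpp
print(flatten(sources[1]))  # resource_provider_validation.cpp

# ...but the first flattened name collides with the pre-existing file,
# which is the case the ticket proposes to escape or forbid outright:
print(underscore_collisions(sources))  # ['slave_validation.cpp']
```

Failing the build on a non-empty collision list is the simpler of the two options discussed above.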
[jira] [Created] (MESOS-10121) stdout/stderr not rotating
Evgeny created MESOS-10121: -- Summary: stdout/stderr not rotating Key: MESOS-10121 URL: https://issues.apache.org/jira/browse/MESOS-10121 Project: Mesos Issue Type: Task Reporter: Evgeny Hello, I am trying to rotate my stdout/stderr files in Mesos containers. Starting mesos-slave: {code} docker run -d mesos-slave:1.4 --container_logger=org_apache_mesos_LogrotateContainerLogger --modules=/var/tmp/mesos/mesos-slave-modules.json {code} config for rotation: {code} cat /var/tmp/mesos/mesos-slave-modules.json: { "libraries": [{ "file": "/usr/lib/liblogrotate_container_logger.so", "modules": [{ "name": "org_apache_mesos_LogrotateContainerLogger", "parameters": [{ "key": "launcher_dir", "value": "/usr/libexec/mesos" }, { "key": "logrotate_path", "value": "/usr/sbin/logrotate" }, { "key": "max_stdout_size", "value": "10240MB" }, { "key": "max_stderr_size", "value": "10240MB" }, { "key": "logrotate_stdout_options", "value": "rotate 5\nmissingok\nnotifempty\ncompress\nnomail\n" }, { "key": "logrotate_stderr_options", "value": "rotate 5\nmissingok\nnotifempty\ncompress\nnomail\n" }] }] }] } {code} mesos-logrotate-logger is running for all containers: {code} root 733 0.1 0.0 841724 25020 ? Ssl Apr19 4:02 mesos-logrotate-logger --help=false --log_filename=/var/tmp/mesos/slaves/c931deec-e65a-4362-9c6f-d4b278f52f5b-S0/frameworks/cb0d4342-fcf5-4e6d-abf2-764b8c5b8cf3-/executors/gateway.478dac1a-8276-11ea-b32f-0242ac110002/runs/60bc8179-3051-4214-8902-7d9747f1713e/stdout --logrotate_options=rotate 5 missingok notifempty compress nomail --logrotate_path=/usr/sbin/logrotate --max_size=10GB --user=root ... 
{code} The 10GB limit is exceeded: {code} ls -lh /var/tmp/mesos/slaves/c931deec-e65a-4362-9c6f-d4b278f52f5b-S0/frameworks/cb0d4342-fcf5-4e6d-abf2-764b8c5b8cf3-/executors/gateway.478dac1a-8276-11ea-b32f-0242ac110002/runs/60bc8179-3051-4214-8902-7d9747f1713e/stdout -rw-r--r-- 1 root root 12G Apr 21 10:57 /var/tmp/mesos/slaves/c931deec-e65a-4362-9c6f-d4b278f52f5b-S0/frameworks/cb0d4342-fcf5-4e6d-abf2-764b8c5b8cf3-/executors/gateway.478dac1a-8276-11ea-b32f-0242ac110002/runs/60bc8179-3051-4214-8902-7d9747f1713e/stdout {code} logrotate is available (--logrotate_path=/usr/sbin/logrotate), but rotation does not occur. Can't find the problem. Any tips, please? -- This message was sent by Atlassian Jira (v8.3.4#803005)
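One way to narrow this down (a sketch, with a hypothetical path standing in for the long sandbox path above) is to run logrotate by hand in debug mode against a standalone config equivalent to the logrotate_stdout_options; -d prints what logrotate would do without actually rotating anything:

```shell
# Hypothetical test config; the size and options mirror the module
# parameters from the ticket. Replace /var/log/test-stdout with the real
# sandbox stdout path when debugging on the agent.
cat > /tmp/test-stdout.conf <<'EOF'
/var/log/test-stdout {
  size 10240M
  rotate 5
  missingok
  notifempty
  compress
  nomail
}
EOF

# -d: debug mode, report decisions without touching any files;
# -s: use a scratch state file instead of the system-wide one.
if command -v logrotate >/dev/null 2>&1; then
    logrotate -d -s /tmp/test-logrotate.state /tmp/test-stdout.conf
else
    echo "logrotate not installed"
fi
```

If logrotate reports it would rotate here but the sandbox file still grows, the problem is more likely in how the logger invokes logrotate (or in its generated per-sandbox config) than in logrotate itself.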
[jira] [Comment Edited] (MESOS-10054) Update Docker containerizer to set Docker container’s resource limits and `oom_score_adj`
[ https://issues.apache.org/jira/browse/MESOS-10054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17087845#comment-17087845 ] Qian Zhang edited comment on MESOS-10054 at 4/21/20, 1:22 PM: -- RR: [https://reviews.apache.org/r/72401/] [https://reviews.apache.org/r/72391/] was (Author: qianzhang): RR: [https://reviews.apache.org/r/72391/] > Update Docker containerizer to set Docker container’s resource limits and > `oom_score_adj` > - > > Key: MESOS-10054 > URL: https://issues.apache.org/jira/browse/MESOS-10054 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > > This is to set resource limits for the executor, which will run as a Docker > container. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10119) failure to destroy container can cause the agent to "leak" a GPU
[ https://issues.apache.org/jira/browse/MESOS-10119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17088568#comment-17088568 ] Andrei Budnik commented on MESOS-10119: --- Could you reproduce the cgroups destruction problem consistently? What are the kernel and systemd versions installed on your agents? > failure to destroy container can cause the agent to "leak" a GPU > > > Key: MESOS-10119 > URL: https://issues.apache.org/jira/browse/MESOS-10119 > Project: Mesos > Issue Type: Task > Components: agent, containerization >Reporter: Charles Natali >Priority: Major > > At work we hit the following problem: > # cgroup for a task using the GPU isolation failed to be destroyed on OOM > # the agent continued advertising the GPU as available > # all subsequent attempts to start tasks using a GPU fail with "Requested 1 > gpus but only 0 available" > Problem 1 looks like https://issues.apache.org/jira/browse/MESOS-9950, so it can > be tackled separately, however the fact that the agent basically leaks the > GPU is pretty bad, because it basically turns into /dev/null, failing all > subsequent tasks requesting a GPU. 
> > See the logs: > > > {noformat} > Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874277 2138 > memory.cpp:665] Failed to read 'memory.limit_in_bytes': No such file or > directory > Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874305 2138 > memory.cpp:674] Failed to read 'memory.max_usage_in_bytes': No such file or > directory > Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874315 2138 > memory.cpp:686] Failed to read 'memory.stat': No such file or directory > Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874701 2136 > memory.cpp:665] Failed to read 'memory.limit_in_bytes': No such file or > directory > Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874734 2136 > memory.cpp:674] Failed to read 'memory.max_usage_in_bytes': No such file or > directory > Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874747 2136 > memory.cpp:686] Failed to read 'memory.stat': No such file or directory > Apr 17 17:00:05 engpuc006 mesos-slave[2068]: E0417 17:00:05.062358 2152 > slave.cpp:6994] Termination of executor > 'task_0:067b0963-134f-a917-4503-89b6a2a630ac' of framework > c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed: Failed to clean up an > isolator when destroying container: Failed to destroy cgroups: Failed to get > nested cgroups: Failed to determine canonical path of > '/sys/fs/cgroup/memory/mesos/8ef00748-b640-4620-97dc-f719e9775e88': No such > file or directory > Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.063295 2150 > containerizer.cpp:2567] Skipping status for container > 8ef00748-b640-4620-97dc-f719e9775e88 because: Container does not exist > Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.063429 2137 > containerizer.cpp:2428] Ignoring update for currently being destroyed > container 8ef00748-b640-4620-97dc-f719e9775e88 > Apr 17 17:00:05 engpuc006 mesos-slave[2068]: E0417 17:00:05.079169 2150 > slave.cpp:6994] Termination of executor > 
'task_1:a00165a1-123a-db09-6b1a-b6c4054b0acd' of framework > c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed: Failed to kill all > processes in the container: Failed to remove cgroup > 'mesos/5c1418f0-1d4d-47cd-a188-0f4b87e394f2': Failed to remove cgroup > '/sys/fs/cgroup/freezer/mesos/5c1418f0-1d4d-47cd-a188-0f4b87e394f2': Device > or resource busy > Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.079537 2140 > containerizer.cpp:2567] Skipping status for container > 5c1418f0-1d4d-47cd-a188-0f4b87e394f2 because: Container does not exist > Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.079670 2136 > containerizer.cpp:2428] Ignoring update for currently being destroyed > container 5c1418f0-1d4d-47cd-a188-0f4b87e394f2 > Apr 17 17:00:07 engpuc006 mesos-slave[2068]: E0417 17:00:07.956969 2136 > slave.cpp:6889] Container '87253521-8d39-47ea-b4d1-febe527d230c' for executor > 'task_2:8b129d24-70d2-2cab-b2df-c73911954ec3' of framework > c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed to start: Requested 1 gpus > but only 0 available > Apr 17 17:00:07 engpuc006 mesos-slave[2068]: E0417 17:00:07.957670 2149 > memory.cpp:637] Listening on OOM events failed for container > 87253521-8d39-47ea-b4d1-febe527d230c: Event listener is terminating > Apr 17 17:00:07 engpuc006 mesos-slave[2068]: W0417 17:00:07.966552 2150 > containerizer.cpp:2421] Ignoring update for unknown container > 87253521-8d39-47ea-b4d1-febe527d230c > Apr 17 17:00:08 engpuc006 mesos-slave[2068]: W0417 17:00:08.109067 2154 > process.cpp:1480] Failed to link to '172.16.22.201:34059', connect: Failed >