[jira] [Commented] (MESOS-4875) overlayfs does not work when launching tasks
[ https://issues.apache.org/jira/browse/MESOS-4875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182026#comment-15182026 ]

Guangya Liu commented on MESOS-4875:
------------------------------------

This is a documentation issue and is being handled here: https://reviews.apache.org/r/44391/

> overlayfs does not work when launching tasks
> ---------------------------------------------
>
>                 Key: MESOS-4875
>                 URL: https://issues.apache.org/jira/browse/MESOS-4875
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Guangya Liu
>            Assignee: Guangya Liu
>
> Enable the overlay backend and launch a task; the task fails to start. Checking the executor log shows the following:
> {code}
> Failed to create sandbox mount point at
> '/tmp/mesos/slaves/bbc41bda-747a-420e-88d2-cf100fa8b6d5-S1/frameworks/bbc41bda-747a-420e-88d2-cf100fa8b6d5-0001/executors/test_mesos/runs/3736fb2a-de7a-4aba-9b08-25c73be7879f/.rootfs/mnt/mesos/sandbox':
> Read-only file system
> {code}
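The failing step is easy to reproduce outside of Mesos: creating a directory inside a read-only rootfs fails with EROFS. A minimal Python sketch of that failure mode (not Mesos code; the function name is illustrative and the error string merely mirrors the executor log above):

{code}
import errno
import os

def ensure_sandbox_mount_point(path):
    """Create the sandbox mount point, surfacing EROFS explicitly.

    Mirrors the failing executor step above: mkdir on a path inside a
    read-only rootfs raises OSError with errno set to EROFS.
    """
    try:
        os.makedirs(path)
    except OSError as e:
        if e.errno == errno.EROFS:
            raise RuntimeError(
                "Failed to create sandbox mount point at '%s': "
                "Read-only file system" % path) from e
        raise
{code}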
[jira] [Commented] (MESOS-4869) /usr/libexec/mesos/mesos-health-check using/leaking a lot of memory
[ https://issues.apache.org/jira/browse/MESOS-4869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181914#comment-15181914 ]

Timothy Chen commented on MESOS-4869:
-------------------------------------

Btw, what's your slave memory usage?

> /usr/libexec/mesos/mesos-health-check using/leaking a lot of memory
> -------------------------------------------------------------------
>
>                 Key: MESOS-4869
>                 URL: https://issues.apache.org/jira/browse/MESOS-4869
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.27.1
>            Reporter: Anthony Scalisi
>            Priority: Critical
>
> We switched our health checks in Marathon from HTTP to COMMAND:
> {noformat}
> "healthChecks": [
>   {
>     "protocol": "COMMAND",
>     "path": "/ops/ping",
>     "command": { "value": "curl --silent -f -X GET http://$HOST:$PORT0/ops/ping > /dev/null" },
>     "gracePeriodSeconds": 90,
>     "intervalSeconds": 2,
>     "portIndex": 0,
>     "timeoutSeconds": 5,
>     "maxConsecutiveFailures": 3
>   }
> ]
> {noformat}
> All our applications have the same health check (and /ops/ping endpoint).
> Even though we have the issue on all our Mesos slaves, I'm going to focus on a particular one: *mesos-slave-i-e3a9c724*.
> The slave has 16 gigs of memory, with about 12 gigs allocated for 8 tasks:
> !https://i.imgur.com/gbRf804.png!
> Here is a *docker ps* on it:
> {noformat}
> root@mesos-slave-i-e3a9c724 # docker ps
> CONTAINER ID   IMAGE    COMMAND                  CREATED        STATUS        PORTS                     NAMES
> 4f7c0aa8d03a   java:8   "/bin/sh -c 'JAVA_OPT"   6 hours ago    Up 6 hours    0.0.0.0:31926->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.3dbb1004-5bb8-432f-8fd8-b863bd29341d
> 66f2fc8f8056   java:8   "/bin/sh -c 'JAVA_OPT"   6 hours ago    Up 6 hours    0.0.0.0:31939->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.60972150-b2b1-45d8-8a55-d63e81b8372a
> f7382f241fce   java:8   "/bin/sh -c 'JAVA_OPT"   6 hours ago    Up 6 hours    0.0.0.0:31656->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.39731a2f-d29e-48d1-9927-34ab8c5f557d
> 880934c0049e   java:8   "/bin/sh -c 'JAVA_OPT"   24 hours ago   Up 24 hours   0.0.0.0:31371->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.23dfe408-ab8f-40be-bf6f-ce27fe885ee0
> 5eab1f8dac4a   java:8   "/bin/sh -c 'JAVA_OPT"   46 hours ago   Up 46 hours   0.0.0.0:31500->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.5ac75198-283f-4349-a220-9e9645b313e7
> b63740fe56e7   java:8   "/bin/sh -c 'JAVA_OPT"   46 hours ago   Up 46 hours   0.0.0.0:31382->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.5d417f16-df24-49d5-a5b0-38a7966460fe
> 5c7a9ea77b0e   java:8   "/bin/sh -c 'JAVA_OPT"   2 days ago     Up 2 days     0.0.0.0:31186->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.b05043c5-44fc-40bf-aea2-10354e8f5ab4
> 53065e7a31ad   java:8   "/bin/sh -c 'JAVA_OPT"   2 days ago     Up 2 days     0.0.0.0:31839->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.f0a3f4c5-ecdb-4f97-bede-d744feda670c
> {noformat}
> Here is a *docker stats* on it:
> {noformat}
> root@mesos-slave-i-e3a9c724 # docker stats
> CONTAINER      CPU %    MEM USAGE / LIMIT     MEM %    NET I/O               BLOCK I/O
> 4f7c0aa8d03a   2.93%    797.3 MB / 1.611 GB   49.50%   1.277 GB / 1.189 GB   155.6 kB / 151.6 kB
> 53065e7a31ad   8.30%    738.9 MB / 1.611 GB   45.88%   419.6 MB / 554.3 MB   98.3 kB / 61.44 kB
> 5c7a9ea77b0e   4.91%    1.081 GB / 1.611 GB   67.10%   423 MB / 526.5 MB     3.219 MB / 61.44 kB
> 5eab1f8dac4a   3.13%    1.007 GB / 1.611 GB   62.53%   2.737 GB / 2.564 GB   6.566 MB / 118.8 kB
> 66f2fc8f8056   3.15%    768.1 MB / 1.611 GB   47.69%   258.5 MB / 252.8 MB   1.86 MB / 151.6 kB
> 880934c0049e   10.07%   735.1 MB / 1.611 GB   45.64%   1.451 GB / 1.399 GB   573.4 kB / 94.21 kB
> b63740fe56e7   12.04%   629 MB / 1.611 GB     39.06%   10.29 GB / 9.344 GB   8.102 MB / 61.44 kB
> f7382f241fce   6.21%    505 MB / 1.611 GB     31.36%   153.4 MB / 151.9 MB   5.837 MB / 94.21 kB
> {noformat}
> Not much else is running on the slave, yet the used memory doesn't match the tasks' memory:
> {noformat}
> Mem: 16047M  used: 13340M  buffers: 1139M  cache: 776M
> {noformat}
> If I exec into the container (*java:8* image), I can correctly see the shell calls to exe
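One way to quantify the gap the report describes is to sum the per-container memory from *docker stats* and compare it with the host's used memory; whatever is left over lives outside the containers (for example, the curl/mesos-health-check processes forked for COMMAND health checks). A rough diagnostic sketch in Python, assuming a Docker client new enough to support {{docker stats --format}} (that flag is an assumption and may not exist on the version in the report):

{code}
import re
import subprocess

# Docker prints sizes with either SI or binary units depending on version.
UNITS = {"B": 1, "kB": 1e3, "MB": 1e6, "GB": 1e9,
         "KiB": 2**10, "MiB": 2**20, "GiB": 2**30}

def parse_size(text):
    """Parse a size like '797.3 MB' or '1.611 GB' into bytes."""
    value, unit = re.match(r"([\d.]+)\s*(\w+)", text.strip()).groups()
    return float(value) * UNITS[unit]

def total_container_memory():
    """Sum the MEM USAGE column across all running containers."""
    out = subprocess.check_output(
        ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}"],
        text=True)
    # Each line looks like '797.3 MB / 1.611 GB'; keep the usage half.
    return sum(parse_size(line.split("/")[0])
               for line in out.splitlines() if line.strip())

if __name__ == "__main__":
    print("containers use %.2f GiB" % (total_container_memory() / 2**30))
{code}

Against the *docker stats* table above this sums to roughly 6.3 GB, versus ~13 GB used on the host, which is exactly the discrepancy the report points at.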
[jira] [Commented] (MESOS-4427) Ensure ip_address in state.json (from NetworkInfo) is valid
[ https://issues.apache.org/jira/browse/MESOS-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181855#comment-15181855 ]

Martin Evgeniev commented on MESOS-4427:
----------------------------------------

Oops, sorry...

> Ensure ip_address in state.json (from NetworkInfo) is valid
> ------------------------------------------------------------
>
>                 Key: MESOS-4427
>                 URL: https://issues.apache.org/jira/browse/MESOS-4427
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Sargun Dhillon
>            Priority: Critical
>              Labels: mesosphere
>             Fix For: 0.28.0
>
>
> We have seen a master state.json with a field that looks similar to:
> ---REDACTED---
> {code:json}
> {
>   "container": {
>     "docker": {
>       "force_pull_image": false,
>       "image": "REDACTED",
>       "network": "HOST",
>       "privileged": false
>     },
>     "type": "DOCKER"
>   },
>   "executor_id": "",
>   "framework_id": "9f0e50ea-54b0-44e3-a451-c69e0c1a58fb-",
>   "id": "ping-as-a-service.c2d1c17a-be22-11e5-b053-002590e56e25",
>   "name": "ping-as-a-service",
>   "resources": {
>     "cpus": 0.1,
>     "disk": 0,
>     "mem": 64,
>     "ports": "[7907-7907]"
>   },
>   "slave_id": "9f0e50ea-54b0-44e3-a451-c69e0c1a58fb-S76043",
>   "state": "TASK_RUNNING",
>   "statuses": [
>     {
>       "container_status": {
>         "network_infos": [
>           {
>             "ip_address": "",
>             "ip_addresses": [
>               {
>                 "ip_address": ""
>               }
>             ]
>           }
>         ]
>       },
>       "labels": [
>         {
>           "key": "Docker.NetworkSettings.IPAddress",
>           "value": ""
>         }
>       ],
>       "state": "TASK_RUNNING",
>       "timestamp": 1453149270.95511
>     }
>   ]
> }
> {code}
> ---REDACTED---
> This is invalid, and mesos-core should filter it.
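For illustration, the filtering the report asks for amounts to dropping empty {{ip_address}} strings before they reach state.json consumers. A minimal Python sketch operating on the task-status structure shown above (the real fix would live in Mesos's C++ state-serving code; the function name here is hypothetical):

{code}
def scrub_network_infos(task_status):
    """Drop empty ip_address entries from a task status dict.

    A sketch of the validation MESOS-4427 asks mesos-core to perform,
    written against the state.json shape shown in the report.
    """
    container = task_status.get("container_status", {})
    for info in container.get("network_infos", []):
        # Remove empty entries from the ip_addresses list.
        info["ip_addresses"] = [
            a for a in info.get("ip_addresses", []) if a.get("ip_address")
        ]
        # Remove an empty top-level ip_address field entirely.
        if not info.get("ip_address"):
            info.pop("ip_address", None)
    return task_status
{code}

Applied to the status above, both the empty top-level {{ip_address}} and the empty entry inside {{ip_addresses}} would be removed.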
[jira] [Commented] (MESOS-2043) framework auth fail with timeout error and never get authenticated
[ https://issues.apache.org/jira/browse/MESOS-2043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181854#comment-15181854 ]

Kevin Cox commented on MESOS-2043:
----------------------------------

I'm having this same issue on Mesos 0.27.1. It occurs when both mesos-slave and Marathon attempt to connect.

> framework auth fail with timeout error and never get authenticated
> -------------------------------------------------------------------
>
>                 Key: MESOS-2043
>                 URL: https://issues.apache.org/jira/browse/MESOS-2043
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.21.0
>            Reporter: Bhuvan Arumugam
>              Labels: security
>         Attachments: aurora-scheduler.20141104-1606-1706.log, mesos-master.20141104-1606-1706.log
>
>
> I'm facing this issue in master as of https://github.com/apache/mesos/commit/74ea59e144d131814c66972fb0cc14784d3503d4
> As [~adam-mesos] mentioned in IRC, this sounds similar to MESOS-1866. I'm running 1 master and 1 scheduler (aurora). The framework authentication fails due to a timeout.
> Error on the mesos master:
> {code}
> I1104 19:37:17.741449  8329 master.cpp:3874] Authenticating scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083
> I1104 19:37:17.741585  8329 master.cpp:3885] Using default CRAM-MD5 authenticator
> I1104 19:37:17.742106  8336 authenticator.hpp:169] Creating new server SASL connection
> W1104 19:37:22.742959  8329 master.cpp:3953] Authentication timed out
> W1104 19:37:22.743548  8329 master.cpp:3930] Failed to authenticate scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083: Authentication discarded
> {code}
> Scheduler error:
> {code}
> I1104 19:38:57.885486 49012 sched.cpp:283] Authenticating with master master@MASTER_IP:PORT
> I1104 19:38:57.885928 49002 authenticatee.hpp:133] Creating new client SASL connection
> I1104 19:38:57.890581 49007 authenticatee.hpp:224] Received SASL authentication mechanisms: CRAM-MD5
> I1104 19:38:57.890656 49007 authenticatee.hpp:250] Attempting to authenticate with mechanism 'CRAM-MD5'
> W1104 19:39:02.891196 49005 sched.cpp:378] Authentication timed out
> I1104 19:39:02.891850 49018 sched.cpp:338] Failed to authenticate with master master@MASTER_IP:PORT: Authentication discarded
> {code}
> It looks like two instances of the same framework, {{scheduler-20f88a53-5945-4977-b5af-28f6c52d3c94}} and {{scheduler-d2d4437b-d375-4467-a583-362152fe065a}}, are trying to authenticate and failing:
> {code}
> W1104 19:36:30.769420  8319 master.cpp:3930] Failed to authenticate scheduler-20f88a53-5945-4977-b5af-28f6c52d3c94@SCHEDULER_IP:8083: Failed to communicate with authenticatee
> I1104 19:36:42.701441  8328 master.cpp:3860] Queuing up authentication request from scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083 because authentication is still in progress
> {code}
> Restarting the master and scheduler didn't fix it.
> This particular issue happens with 1 master and 1 scheduler after MESOS-1866 is fixed.
[jira] [Created] (MESOS-4876) bind backend does not work when launching tasks
Guangya Liu created MESOS-4876:
-----------------------------------

             Summary: bind backend does not work when launching tasks
                 Key: MESOS-4876
                 URL: https://issues.apache.org/jira/browse/MESOS-4876
             Project: Mesos
          Issue Type: Bug
            Reporter: Guangya Liu
            Assignee: Guangya Liu

Enable the bind backend and launch a task; the task fails to start. Checking the executor log shows the following:

{code}
Failed to create sandbox mount point at '/tmp/mesos/slaves/f7011c10-5098-4f13-96d5-f5ba2699203e-S0/frameworks/f7011c10-5098-4f13-96d5-f5ba2699203e-/executors/test_mesos/runs/5c89c8e1-77b4-4f06-aaf6-9c4ceaf262a3/.rootfs/mnt/mesos/sandbox': Read-only file system
{code}
[jira] [Commented] (MESOS-4870) As a developer I WANT Mesos to provide a channel for richly structured error messages to surface from events like TASK_FAILED
[ https://issues.apache.org/jira/browse/MESOS-4870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181724#comment-15181724 ]

Guangya Liu commented on MESOS-4870:
------------------------------------

Can the {{stderr}} in the sandbox help? The {{stderr}} usually gives a very detailed message for why the task failed.

> As a developer I WANT Mesos to provide a channel for richly structured error messages to surface from events like TASK_FAILED
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-4870
>                 URL: https://issues.apache.org/jira/browse/MESOS-4870
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: James DeFelice
>              Labels: external-volumes, mesosphere
>
> For example, a storage module attempts to mount a volume into my task's container. The mount operation fails because the file system driver required by the volume type isn't available on the host. Mesos generates a TASK_FAILED event and passes along the failure message generated by the module.
> If I'm LUCKY, the module populates the failure message with some text that explains the nature of the problem, and the rich Mesos console that I'm using surfaces the nicely formatted text message.
> If I'm UNLUCKY, the module populates the failure message with something cryptic that doesn't help me understand what went wrong at all. I'm left with little context with which to troubleshoot the problem, and my rich Mesos console can't help because there's very little additional context shipped with the TASK_FAILED event.
> What I WANT is additional context so that my rich Mesos console can offer features like the following (see the sketch after this list):
> a) tell me which subsystem/module failed (subsystem="storage", modulename="libfoobaz") and subsystem-specific details (storageprovider="foo", providerversion=0.1)
> b) provide OS process details:
>    i) the OS command line that failed
>    ii) the UID of the process that failed
>    iii) the GID of the process that failed
>    iv) the environment of the command line that failed
>    v) the error code that the process exited with
> c) how many times this type of error has happened, for this (or other) frameworks, and when
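As a concrete sketch of what the (a)-(c) wishlist could look like on the wire, here is a hypothetical structured failure payload expressed in Python; none of these field names exist in Mesos today, they only restate the request above in concrete form:

{code}
# Hypothetical structured TASK_FAILED context; every field name here is
# illustrative, not an existing Mesos API.
structured_failure = {
    # (a) which subsystem/module failed, plus subsystem-specific details.
    "subsystem": "storage",
    "module_name": "libfoobaz",
    "details": {"storage_provider": "foo", "provider_version": "0.1"},
    # (b) OS process details for the command that failed.
    "process": {
        "command_line": "mount.foofs /dev/foo /mnt/volume",  # i
        "uid": 0,                                            # ii
        "gid": 0,                                            # iii
        "environment": {"FOOFS_OPTS": "rw"},                 # iv
        "exit_code": 32,                                     # v
    },
    # (c) how often this class of error has occurred, and when.
    "occurrences": {
        "this_framework": 3,
        "all_frameworks": 11,
        "last_seen": "2016-03-06T19:37:17Z",
    },
}
{code}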
[jira] [Created] (MESOS-4875) overlayfs does not work when launching tasks
Guangya Liu created MESOS-4875:
-----------------------------------

             Summary: overlayfs does not work when launching tasks
                 Key: MESOS-4875
                 URL: https://issues.apache.org/jira/browse/MESOS-4875
             Project: Mesos
          Issue Type: Bug
            Reporter: Guangya Liu
            Assignee: Guangya Liu

Enable the overlay backend and launch a task; the task fails to start. Checking the executor log shows the following:

{code}
Failed to create sandbox mount point at '/tmp/mesos/slaves/bbc41bda-747a-420e-88d2-cf100fa8b6d5-S1/frameworks/bbc41bda-747a-420e-88d2-cf100fa8b6d5-0001/executors/test_mesos/runs/3736fb2a-de7a-4aba-9b08-25c73be7879f/.rootfs/mnt/mesos/sandbox': Read-only file system
{code}