[jira] [Commented] (MESOS-4875) overlayfs does not work when launching tasks

2016-03-05 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182026#comment-15182026
 ] 

Guangya Liu commented on MESOS-4875:


This is a documentation issue and is being handled here: 
https://reviews.apache.org/r/44391/
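
For context, the error quoted below is a {{mkdir}} of the sandbox mount point failing with "Read-only file system", which suggests that when the backend provisions a read-only rootfs, the image needs to ship {{/mnt/mesos/sandbox}} already. A minimal pre-flight check along those lines (a sketch only; the rootfs path is a hypothetical placeholder, and this is not what the review above contains):

{code}
# Sketch: check whether an image's provisioned rootfs already contains the
# sandbox mount point, since it cannot be created on a read-only rootfs.
# The rootfs path below is a hypothetical placeholder.
import os

rootfs = "/tmp/mesos/provisioner/CONTAINER_ID/backends/overlay/rootfses/ROOTFS_ID"
sandbox_mount = os.path.join(rootfs, "mnt/mesos/sandbox")

if not os.path.isdir(sandbox_mount):
    print("Image does not provide /mnt/mesos/sandbox; creating it on a "
          "read-only rootfs fails with 'Read-only file system'.")
{code}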

> overlayfs does not work when launching tasks
> ---
>
> Key: MESOS-4875
> URL: https://issues.apache.org/jira/browse/MESOS-4875
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Assignee: Guangya Liu
>
> Enabled the overlay backend and launched a task; the task failed to start. Checking 
> the executor log, I found the following:
> {code}
> Failed to create sandbox mount point  at 
> '/tmp/mesos/slaves/bbc41bda-747a-420e-88d2-cf100fa8b6d5-S1/frameworks/bbc41bda-747a-420e-88d2-cf100fa8b6d5-0001/executors/test_mesos/runs/3736fb2a-de7a-4aba-9b08-25c73be7879f/.rootfs/mnt/mesos/sandbox':
>  Read-only file system
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4869) /usr/libexec/mesos/mesos-health-check using/leaking a lot of memory

2016-03-05 Thread Timothy Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181914#comment-15181914
 ] 

Timothy Chen commented on MESOS-4869:
-

Btw what's your slave memory usage?

> /usr/libexec/mesos/mesos-health-check using/leaking a lot of memory
> ---
>
> Key: MESOS-4869
> URL: https://issues.apache.org/jira/browse/MESOS-4869
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.27.1
>Reporter: Anthony Scalisi
>Priority: Critical
>
> We switched our health checks in Marathon from HTTP to COMMAND:
> {noformat}
> "healthChecks": [
> {
>   "protocol": "COMMAND",
>   "path": "/ops/ping",
>   "command": { "value": "curl --silent -f -X GET 
> http://$HOST:$PORT0/ops/ping > /dev/null" },
>   "gracePeriodSeconds": 90,
>   "intervalSeconds": 2,
>   "portIndex": 0,
>   "timeoutSeconds": 5,
>   "maxConsecutiveFailures": 3
> }
>   ]
> {noformat}
> All our applications have the same health check (and /ops/ping endpoint).
> Even though we have the issue on all our Mesos slaves, I'm going to focus on a 
> particular one: *mesos-slave-i-e3a9c724*.
> The slave has 16 gigs of memory, with about 12 gigs allocated for 8 tasks:
> !https://i.imgur.com/gbRf804.png!
> Here is a *docker ps* on it:
> {noformat}
> root@mesos-slave-i-e3a9c724 # docker ps
> CONTAINER IDIMAGE   COMMAND  CREATED  
>STATUS  PORTS NAMES
> 4f7c0aa8d03ajava:8  "/bin/sh -c 'JAVA_OPT"   6 hours ago  
>Up 6 hours  0.0.0.0:31926->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.3dbb1004-5bb8-432f-8fd8-b863bd29341d
> 66f2fc8f8056java:8  "/bin/sh -c 'JAVA_OPT"   6 hours ago  
>Up 6 hours  0.0.0.0:31939->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.60972150-b2b1-45d8-8a55-d63e81b8372a
> f7382f241fcejava:8  "/bin/sh -c 'JAVA_OPT"   6 hours ago  
>Up 6 hours  0.0.0.0:31656->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.39731a2f-d29e-48d1-9927-34ab8c5f557d
> 880934c0049ejava:8  "/bin/sh -c 'JAVA_OPT"   24 hours ago 
>Up 24 hours 0.0.0.0:31371->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.23dfe408-ab8f-40be-bf6f-ce27fe885ee0
> 5eab1f8dac4ajava:8  "/bin/sh -c 'JAVA_OPT"   46 hours ago 
>Up 46 hours 0.0.0.0:31500->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.5ac75198-283f-4349-a220-9e9645b313e7
> b63740fe56e7java:8  "/bin/sh -c 'JAVA_OPT"   46 hours ago 
>Up 46 hours 0.0.0.0:31382->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.5d417f16-df24-49d5-a5b0-38a7966460fe
> 5c7a9ea77b0ejava:8  "/bin/sh -c 'JAVA_OPT"   2 days ago   
>Up 2 days   0.0.0.0:31186->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.b05043c5-44fc-40bf-aea2-10354e8f5ab4
> 53065e7a31adjava:8  "/bin/sh -c 'JAVA_OPT"   2 days ago   
>Up 2 days   0.0.0.0:31839->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.f0a3f4c5-ecdb-4f97-bede-d744feda670c
> {noformat}
> Here is a *docker stats* on it:
> {noformat}
> root@mesos-slave-i-e3a9c724  # docker stats
> CONTAINER   CPU %   MEM USAGE / LIMIT MEM %   
> NET I/O   BLOCK I/O
> 4f7c0aa8d03a2.93%   797.3 MB / 1.611 GB   49.50%  
> 1.277 GB / 1.189 GB   155.6 kB / 151.6 kB
> 53065e7a31ad8.30%   738.9 MB / 1.611 GB   45.88%  
> 419.6 MB / 554.3 MB   98.3 kB / 61.44 kB
> 5c7a9ea77b0e4.91%   1.081 GB / 1.611 GB   67.10%  
> 423 MB / 526.5 MB 3.219 MB / 61.44 kB
> 5eab1f8dac4a3.13%   1.007 GB / 1.611 GB   62.53%  
> 2.737 GB / 2.564 GB   6.566 MB / 118.8 kB
> 66f2fc8f80563.15%   768.1 MB / 1.611 GB   47.69%  
> 258.5 MB / 252.8 MB   1.86 MB / 151.6 kB
> 880934c0049e10.07%  735.1 MB / 1.611 GB   45.64%  
> 1.451 GB / 1.399 GB   573.4 kB / 94.21 kB
> b63740fe56e712.04%  629 MB / 1.611 GB 39.06%  
> 10.29 GB / 9.344 GB   8.102 MB / 61.44 kB
> f7382f241fce6.21%   505 MB / 1.611 GB 31.36%  
> 153.4 MB / 151.9 MB   5.837 MB / 94.21 kB
> {noformat}
> Not much else is running on the slave, yet the used memory doesn't map to the 
> tasks' memory:
> {noformat}
> Mem:16047M used:13340M buffers:1139M cache:776M
> {noformat}
> If I exec into the container (*java:8* image), I can correctly see the shell 
> calls to exe
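
A rough cross-check of the discrepancy described above, using only the MEM USAGE values from the quoted {{docker stats}} output and the host's {{used}} figure (a sketch; units are taken as printed, so this is approximate):

{code}
# Sketch: sum the per-container MEM USAGE values quoted above and compare
# with the host's reported used memory. Values are copied from the report;
# units are taken as printed (decimal MB), so this is only approximate.
container_mem_mb = [797.3, 738.9, 1081, 1007, 768.1, 735.1, 629, 505]
host_used_mb = 13340  # from "Mem:16047M used:13340M" above

total_mb = sum(container_mem_mb)
print(f"containers:  {total_mb / 1000:.1f} GB")                    # ~6.3 GB
print(f"host used:   {host_used_mb / 1000:.1f} GB")                # ~13.3 GB
print(f"unaccounted: {(host_used_mb - total_mb) / 1000:.1f} GB")   # ~7.1 GB
{code}

Roughly 7 GB of used memory sits outside the task containers, which is consistent with the report that the health-check processes are consuming it.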

[jira] [Commented] (MESOS-4427) Ensure ip_address in state.json (from NetworkInfo) is valid

2016-03-05 Thread Martin Evgeniev (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181855#comment-15181855
 ] 

Martin Evgeniev commented on MESOS-4427:


Oops, sorry...

> Ensure ip_address in state.json (from NetworkInfo) is valid
> ---
>
> Key: MESOS-4427
> URL: https://issues.apache.org/jira/browse/MESOS-4427
> Project: Mesos
>  Issue Type: Bug
>Reporter: Sargun Dhillon
>Priority: Critical
>  Labels: mesosphere
> Fix For: 0.28.0
>
>
> We have seen a master state.json with a field that looks 
> similar to:
> ---REDACTED---
> {code:json}
> {
> "container": {
> "docker": {
> "force_pull_image": false,
> "image": "REDACTED",
> "network": "HOST",
> "privileged": false
> },
> "type": "DOCKER"
> },
> "executor_id": "",
> "framework_id": "9f0e50ea-54b0-44e3-a451-c69e0c1a58fb-",
> "id": "ping-as-a-service.c2d1c17a-be22-11e5-b053-002590e56e25",
> "name": "ping-as-a-service",
> "resources": {
> "cpus": 0.1,
> "disk": 0,
> "mem": 64,
> "ports": "[7907-7907]"
> },
> "slave_id": "9f0e50ea-54b0-44e3-a451-c69e0c1a58fb-S76043",
> "state": "TASK_RUNNING",
> "statuses": [
> {
> "container_status": {
> "network_infos": [
> {
> "ip_address": "",
> "ip_addresses": [
> {
> "ip_address": ""
> }
> ]
> }
> ]
> },
> "labels": [
> {
> "key": "Docker.NetworkSettings.IPAddress",
> "value": ""
> }
> ],
> "state": "TASK_RUNNING",
> "timestamp": 1453149270.95511
> }
> ]
> }
> {code}
> ---REDACTED---
> This is invalid, and mesos-core should filter it. 
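
Until mesos-core does filter these, a consumer of {{state.json}} can drop the empty values itself. A minimal sketch (field names taken from the JSON above; this is client-side filtering, not the fix the ticket asks for):

{code}
# Sketch: drop empty ip_address entries from one element of a task's
# "statuses" list (shape taken from the state.json excerpt above), so
# downstream consumers never see "" as an address.
def valid_network_infos(status):
    infos = status.get("container_status", {}).get("network_infos", [])
    cleaned = []
    for info in infos:
        addresses = [a for a in info.get("ip_addresses", [])
                     if a.get("ip_address")]      # drops "" and missing
        if addresses or info.get("ip_address"):
            cleaned.append({**info, "ip_addresses": addresses})
    return cleaned

status = {"container_status": {"network_infos": [
    {"ip_address": "", "ip_addresses": [{"ip_address": ""}]}]}}
print(valid_network_infos(status))  # [] -- nothing valid to report
{code}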



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2043) framework auth fail with timeout error and never get authenticated

2016-03-05 Thread Kevin Cox (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181854#comment-15181854
 ] 

Kevin Cox commented on MESOS-2043:
--

I'm having this same issue on Mesos 0.27.1. It occurs both when mesos-slave and 
Marathon attempt to connect.

> framework auth fail with timeout error and never get authenticated
> --
>
> Key: MESOS-2043
> URL: https://issues.apache.org/jira/browse/MESOS-2043
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.21.0
>Reporter: Bhuvan Arumugam
>  Labels: security
> Attachments: aurora-scheduler.20141104-1606-1706.log, 
> mesos-master.20141104-1606-1706.log
>
>
> I'm facing this issue in master as of 
> https://github.com/apache/mesos/commit/74ea59e144d131814c66972fb0cc14784d3503d4
> As [~adam-mesos] mentioned in IRC, this sounds similar to MESOS-1866. I'm 
> running 1 master and 1 scheduler (Aurora). The framework authentication fails 
> due to a timeout.
> Error on the Mesos master:
> {code}
> I1104 19:37:17.741449  8329 master.cpp:3874] Authenticating 
> scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083
> I1104 19:37:17.741585  8329 master.cpp:3885] Using default CRAM-MD5 
> authenticator
> I1104 19:37:17.742106  8336 authenticator.hpp:169] Creating new server SASL 
> connection
> W1104 19:37:22.742959  8329 master.cpp:3953] Authentication timed out
> W1104 19:37:22.743548  8329 master.cpp:3930] Failed to authenticate 
> scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083: 
> Authentication discarded
> {code}
> Scheduler error:
> {code}
> I1104 19:38:57.885486 49012 sched.cpp:283] Authenticating with master 
> master@MASTER_IP:PORT
> I1104 19:38:57.885928 49002 authenticatee.hpp:133] Creating new client SASL 
> connection
> I1104 19:38:57.890581 49007 authenticatee.hpp:224] Received SASL 
> authentication mechanisms: CRAM-MD5
> I1104 19:38:57.890656 49007 authenticatee.hpp:250] Attempting to authenticate 
> with mechanism 'CRAM-MD5'
> W1104 19:39:02.891196 49005 sched.cpp:378] Authentication timed out
> I1104 19:39:02.891850 49018 sched.cpp:338] Failed to authenticate with master 
> master@MASTER_IP:PORT: Authentication discarded
> {code}
> It looks like two instances, {{scheduler-20f88a53-5945-4977-b5af-28f6c52d3c94}} & 
> {{scheduler-d2d4437b-d375-4467-a583-362152fe065a}}, of the same framework are 
> trying to authenticate and failing.
> {code}
> W1104 19:36:30.769420  8319 master.cpp:3930] Failed to authenticate 
> scheduler-20f88a53-5945-4977-b5af-28f6c52d3c94@SCHEDULER_IP:8083: Failed to 
> communicate with authenticatee
> I1104 19:36:42.701441  8328 master.cpp:3860] Queuing up authentication 
> request from scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083 
> because authentication is still in progress
> {code}
> Restarting the master and scheduler didn't fix it. 
> This particular issue happens with 1 master and 1 scheduler after MESOS-1866 
> was fixed.
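
One detail worth noting from both log excerpts above: the gap between the SASL connection being created and the timeout is a fixed five seconds. A quick check with the master-side timestamps copied from the log (a sketch; the date portion is omitted since the log only prints time of day):

{code}
# Sketch: elapsed time between "Creating new server SASL connection" and
# "Authentication timed out", timestamps copied from the master log above.
from datetime import datetime

started   = datetime.strptime("19:37:17.742106", "%H:%M:%S.%f")
timed_out = datetime.strptime("19:37:22.742959", "%H:%M:%S.%f")
print((timed_out - started).total_seconds())  # ~5.0 s; the scheduler-side
                                              # excerpt shows the same gap
{code}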



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4876) bind backend does not work when launching tasks

2016-03-05 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-4876:
--

 Summary: bind backend does not work when launching tasks
 Key: MESOS-4876
 URL: https://issues.apache.org/jira/browse/MESOS-4876
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu
Assignee: Guangya Liu


Enabled the bind backend and launched a task; the task failed to start. Checking 
the executor log, I found the following:

{code}
Failed to create sandbox mount point  at 
'/tmp/mesos/slaves/f7011c10-5098-4f13-96d5-f5ba2699203e-S0/frameworks/f7011c10-5098-4f13-96d5-f5ba2699203e-/executors/test_mesos/runs/5c89c8e1-77b4-4f06-aaf6-9c4ceaf262a3/.rootfs/mnt/mesos/sandbox':
 Read-only file system
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4870) As a developer I WANT Mesos to provide a channel for richly structured error messages to surface from events like TASK_FAILED

2016-03-05 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181724#comment-15181724
 ] 

Guangya Liu commented on MESOS-4870:


Can the {{stderr}} in the sandbox help? The {{stderr}} can give a very 
detailed message about why the task failed.

> As a developer I WANT Mesos to provide a channel for richly structured error 
> messages to surface from events like TASK_FAILED
> -
>
> Key: MESOS-4870
> URL: https://issues.apache.org/jira/browse/MESOS-4870
> Project: Mesos
>  Issue Type: Improvement
>Reporter: James DeFelice
>  Labels: external-volumes, mesosphere
>
> For example, a storage module attempts to mount a volume into my task's 
> container. The mount operation fails because the file system driver required 
> by the volume type isn't available on the host. Mesos generates a TASK_FAILED 
> event and passes along the failure message generated by the module.
> If I'm LUCKY then the module populates the failure message with some text 
> that explains the nature of the problem and the rich Mesos console that I'm 
> using surfaces the nicely formatted text message.
> If I'm UNLUCKY then the module populates the failure message with something 
> cryptic that doesn't help me understand what went wrong at all. I'm left with 
> little context with which to troubleshoot the problem and my rich Mesos 
> console can't help because there's very little additional context that 
> shipped with the TASK_FAILED event.
> What I WANT is additional context so that my rich Mesos console can offer 
> features like:
> a) tell me which subsystem/module failed (subsystem="storage", 
> modulename="libfoobaz") and subsystem-specific details (storageprovider="foo" 
> providerversion=0.1)
> b) provide OS process details:
> i) the OS command line that failed
> ii) the UID of the process that failed
> iii) the GID of the process that failed
> iv) the environment of the command line that failed
> v) the error code that the process exited with
> c) how many times this type of error has happened, for this (or other) 
> frameworks, and when
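
As a concrete illustration of the context asked for in (a), (b), and (c) above, a structured failure payload could look roughly like this (a sketch only; every field name and value here is an illustrative assumption, not an existing Mesos message):

{code}
# Sketch: the kind of structured context a TASK_FAILED event could carry,
# serialized as JSON. All field names and values are illustrative only.
import json

failure_context = {
    "subsystem": "storage",
    "module_name": "libfoobaz",
    "provider": {"name": "foo", "version": "0.1"},
    "process": {
        "command_line": "mount -t foofs /dev/foo /mnt/volume",
        "uid": 0,
        "gid": 0,
        "environment": {"FOO_DRIVER": "foofs"},
        "exit_code": 32,
    },
    "occurrences": {"this_framework": 3, "all_frameworks": 17},
}

print(json.dumps(failure_context, indent=2))
{code}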



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4875) overlayfs does not work when launching tasks

2016-03-05 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-4875:
--

 Summary: overlayfs does not work when launching tasks
 Key: MESOS-4875
 URL: https://issues.apache.org/jira/browse/MESOS-4875
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu
Assignee: Guangya Liu


Enabled the overlay backend and launched a task; the task failed to start. Checking 
the executor log, I found the following:

{code}
Failed to create sandbox mount point  at 
'/tmp/mesos/slaves/bbc41bda-747a-420e-88d2-cf100fa8b6d5-S1/frameworks/bbc41bda-747a-420e-88d2-cf100fa8b6d5-0001/executors/test_mesos/runs/3736fb2a-de7a-4aba-9b08-25c73be7879f/.rootfs/mnt/mesos/sandbox':
 Read-only file system
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)