[jira] [Commented] (MESOS-3367) Mesos fetcher does not extract archives for URI with parameters

2016-04-17 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245208#comment-15245208
 ] 

haosdent commented on MESOS-3367:
-

Got it. I think MESOS-4735 is a better approach, so let me close this. Feel free 
to reopen it if you think it is still necessary.

> Mesos fetcher does not extract archives for URI with parameters
> ---
>
> Key: MESOS-3367
> URL: https://issues.apache.org/jira/browse/MESOS-3367
> Project: Mesos
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 0.22.1, 0.23.0
> Environment: DCOS 1.1
>Reporter: Renat Zubairov
>Assignee: haosdent
>Priority: Minor
>  Labels: mesosphere
>
> I'm deploying applications using Marathon, with sources served from S3. I'm 
> using a signed URL to give only temporary access to the S3 resources, so the 
> URL of the resource has some query parameters.
> So the URI is 'https://foo.com/file.tgz?hasi' and the fetcher stores it in a 
> file named 'file.tgz?hasi'. It then decides that the extension 'hasi' is not 
> tgz, so extraction is skipped, despite the fact that the MIME type of the HTTP 
> resource is 'application/x-tar'.
> Workaround: add an additional parameter like '&workaround=.tgz'
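As an illustration of the filename issue (a minimal sketch, not the fetcher's 
actual code): deriving the local file name by stripping the query string before 
checking the extension would make 'https://foo.com/file.tgz?hasi' land as 
'file.tgz' and extract normally.
{code}
#include <iostream>
#include <string>

// Derive the basename the fetcher would use, dropping any query string
// or fragment so that extension detection sees "file.tgz", not "file.tgz?hasi".
std::string archiveBasename(const std::string& uri)
{
  // Cut off '?query' and '#fragment' first.
  std::string path = uri.substr(0, uri.find_first_of("?#"));

  // Keep everything after the last '/'.
  size_t slash = path.find_last_of('/');
  return (slash == std::string::npos) ? path : path.substr(slash + 1);
}

int main()
{
  std::cout << archiveBasename("https://foo.com/file.tgz?hasi") << std::endl;
  // Prints "file.tgz", whose ".tgz" suffix would trigger extraction.
  return 0;
}
{code}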





[jira] [Commented] (MESOS-5224) buffer overflow error in slave upon processing status update from executor v1 http API

2016-04-17 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245206#comment-15245206
 ] 

James DeFelice commented on MESOS-5224:
---

Here's the JSON of an update that's "rejected" by the slave. I don't know whether 
this is THE update that's crashing the slave, but it seems likely, since the 
connection is dropped and I see an EOF on the executor. All of the updates are 
generated the exact same way (via 
https://github.com/mesos/mesos-go/blob/executor_proto/cmd/example-executor/main.go#L208).
{code}
{
  "executor_id": {"value": "default"},
  "framework_id": {"value": "ad9e5972-8b5e-4042-b97f-ecc36f2c046f-0011"},
  "type": "UPDATE",
  "update": {
    "status": {
      "task_id": {"value": "1"},
      "state": "TASK_RUNNING",
      "source": "SOURCE_EXECUTOR",
      "executor_id": {"value": "default"},
      "uuid": "ZTZlZTRlNmMtNzE0Ni00NTAwLWJkZWYtNDc0Yzk2MWNmNGU4" // base64-decoded: e6ee4e6c-7146-4500-bdef-474c961cf4e8
    }
  }
}
{code}

> buffer overflow error in slave upon processing status update from executor v1 
> http API
> --
>
> Key: MESOS-5224
> URL: https://issues.apache.org/jira/browse/MESOS-5224
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.28.0
> Environment: {code}
> $ dpkg -l|grep -e mesos
> ii  mesos   0.28.0-2.0.16.ubuntu1404 
> amd64Cluster resource manager with efficient resource isolation
> $ uname -a
> Linux node-3 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 
> x86_64 x86_64 x86_64 GNU/Linux
> {code}
>Reporter: James DeFelice
>Assignee: Klaus Ma
>  Labels: mesosphere
>
> I'm implementing support for the executor HTTP v1 API in mesos-go:next, and my 
> executor can't send status updates because the slave dies upon receiving 
> them. The protobufs were generated from 0.28.1.
> From syslog:
> {code}
> Apr 17 17:53:53 node-1 mesos-slave[4462]: I0417 17:53:53.121467  4489 
> http.cpp:190] HTTP POST for /slave(1)/api/v1/executor from 10.2.0.5:51800 
> with User-Agent='Go-http-client/1.1'
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** buffer overflow detected ***: 
> /usr/sbin/mesos-slave terminated
> Apr 17 17:53:53 node-1 mesos-slave[4462]: === Backtrace: =
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7fc53064e38f]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7fc5306e5c9c]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(+0x109b60)[0x7fc5306e4b60]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internallsERSoRKNS0_12StatusUpdateE+0x16a)[0x7fc531cc617a]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internal5slave5Slave12statusUpdateENS0_12StatusUpdateERK6OptionIN7process4UPIDEE+0xe7)[0x7fc531d71837]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZNK5mesos8internal5slave5Slave4Http8executorERKN7process4http7RequestE+0xb52)[0x7fc531d302a2]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0xc754a3)[0x7fc531d4d4a3]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0x1295aa8)[0x7fc53236daa8]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN7process14ProcessManager6resumeEPNS_11ProcessBaseE+0x2d1)[0x7fc532375a71]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0x129dd77)[0x7fc532375d77]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1bf0)[0x7fc530e85bf0]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fc5309a8182]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc5306d547d]
> ...
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** Aborted at 1460915633 (unix 
> time) try "date -d @1460915633" if you are using GNU date ***
> Apr 17 17:53:53 node-1 mesos-slave[4462]: PC: @ 0x7fc530611cc9 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** SIGABRT (@0x116e) received by 
> PID 4462 (TID 0x7fc5275f5700) from PID 4462; stack trace: ***
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5309b0340 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc530611cc9 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306150d8 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53064e394 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e5c9c (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e4b60 (unknown)

[jira] [Commented] (MESOS-4705) Slave failed to sample container with perf event

2016-04-17 Thread Fan Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245167#comment-15245167
 ] 

Fan Du commented on MESOS-4705:
---

[~haosd...@gmail.com] [~bmahler] I have elaborated more on the comments; 
please review again:

https://reviews.apache.org/r/44379/

Thanks a lot!

> Slave failed to sample container with perf event
> 
>
> Key: MESOS-4705
> URL: https://issues.apache.org/jira/browse/MESOS-4705
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups, isolation
>Affects Versions: 0.27.1
>Reporter: Fan Du
>Assignee: Fan Du
>
> When sampling a container with perf event on CentOS 7 with kernel 
> 3.10.0-123.el7.x86_64, the slave complained with the error below:
> {code}
> E0218 16:32:00.591181  8376 perf_event.cpp:408] Failed to get perf sample: 
> Failed to parse perf sample: Failed to parse perf sample line 
> '25871993253,,cycles,mesos/5f23ffca-87ed-4ff6-84f2-6ec3d4098ab8,10059827422,100.00':
>  Unexpected number of fields
> {code}
> It's caused by the current perf format [assumption | 
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob;f=src/linux/perf.cpp;h=1c113a2b3f57877e132bbd65e01fb2f045132128;hb=HEAD#l430], 
> which holds only for kernel versions below 3.12.
> On the 3.10.0-123.el7.x86_64 kernel, the output has 6 tokens, as below:
> value,unit,event,cgroup,running,ratio
> A local modification fixed this error on my test bed; please review this 
> ticket.
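For illustration, a minimal sketch of tokenizing such a perf line (assuming only 
what the report states about the 6-token 'value,unit,event,cgroup,running,ratio' 
layout; this is not the patch under review):
{code}
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Split one perf CSV line into its comma-separated fields; empty fields
// (such as the missing "unit" column) are kept as empty strings.
std::vector<std::string> tokenize(const std::string& line)
{
  std::vector<std::string> tokens;
  std::stringstream stream(line);
  std::string token;
  while (std::getline(stream, token, ',')) {
    tokens.push_back(token);
  }
  return tokens;
}

int main()
{
  const std::string line =
    "25871993253,,cycles,mesos/5f23ffca-87ed-4ff6-84f2-6ec3d4098ab8,"
    "10059827422,100.00";

  std::vector<std::string> t = tokenize(line);

  // 6-token layout on 3.10.0-123.el7: value,unit,event,cgroup,running,ratio.
  if (t.size() == 6) {
    std::cout << "value=" << t[0] << " event=" << t[2]
              << " cgroup=" << t[3] << std::endl;
  } else {
    std::cerr << "Unexpected number of fields: " << t.size() << std::endl;
  }
  return 0;
}
{code}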





[jira] [Commented] (MESOS-4735) CommandInfo.URI should allow specifying target filename

2016-04-17 Thread Erik Weathers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245149#comment-15245149
 ] 

Erik Weathers commented on MESOS-4735:
--

[~mrbrowning] & [~vinodkone]:  awesome, thanks so much for fixing this!

> CommandInfo.URI should allow specifying target filename
> ---
>
> Key: MESOS-4735
> URL: https://issues.apache.org/jira/browse/MESOS-4735
> Project: Mesos
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Erik Weathers
>Assignee: Michael Browning
>Priority: Minor
> Fix For: 0.29.0
>
>
> The {{CommandInfo.URI}} message should allow explicitly choosing the 
> downloaded file's name, to better mimic functionality present in tools like 
> {{wget}} and {{curl}}.
> This relates to issues that arise when the {{CommandInfo.URI}} points to a URL 
> with query parameters at the end of the path, resulting in a downloaded 
> filename that includes those elements. This also prevents extraction of such 
> files, since the extraction logic simply looks at the file's suffix. See 
> MESOS-3367, MESOS-1686, and MESOS-1509 for more info. If this issue were 
> fixed, I could work around the other issues not being fixed by modifying 
> my framework's scheduler to set the target filename.
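As a rough sketch of how a scheduler could use such a field once it lands 
(assuming it is exposed as something like {{output_file}} on {{CommandInfo.URI}}; 
the exact field name comes from the patch, not from this ticket):
{code}
#include <mesos/mesos.pb.h>

// Hypothetical sketch: ask the fetcher to save the signed URL under a fixed
// name so the ".tgz" suffix survives and extraction kicks in. The field name
// "output_file" is an assumption here.
mesos::CommandInfo makeCommand()
{
  mesos::CommandInfo command;
  command.set_value("./start.sh");

  mesos::CommandInfo::URI* uri = command.add_uris();
  uri->set_value("https://foo.com/file.tgz?hasi");
  uri->set_extract(true);
  uri->set_output_file("file.tgz");  // assumed field from MESOS-4735

  return command;
}
{code}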





[jira] [Commented] (MESOS-3367) Mesos fetcher does not extract archives for URI with parameters

2016-04-17 Thread Erik Weathers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245147#comment-15245147
 ] 

Erik Weathers commented on MESOS-3367:
--

[~haosd...@gmail.com]: yup, that looks good to me!  It should suffice from my 
perspective.  Not sure about [~bernd-mesos] & [~zubairov] though.

> Mesos fetcher does not extract archives for URI with parameters
> ---
>
> Key: MESOS-3367
> URL: https://issues.apache.org/jira/browse/MESOS-3367
> Project: Mesos
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 0.22.1, 0.23.0
> Environment: DCOS 1.1
>Reporter: Renat Zubairov
>Assignee: haosdent
>Priority: Minor
>  Labels: mesosphere
>
> I'm deploying applications using Marathon, with sources served from S3. I'm 
> using a signed URL to give only temporary access to the S3 resources, so the 
> URL of the resource has some query parameters.
> So the URI is 'https://foo.com/file.tgz?hasi' and the fetcher stores it in a 
> file named 'file.tgz?hasi'. It then decides that the extension 'hasi' is not 
> tgz, so extraction is skipped, despite the fact that the MIME type of the HTTP 
> resource is 'application/x-tar'.
> Workaround: add an additional parameter like '&workaround=.tgz'





[jira] [Assigned] (MESOS-5225) Command executor can not start when joining a CNI network

2016-04-17 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan reassigned MESOS-5225:


Assignee: Avinash Sridharan  (was: Qian Zhang)

> Command executor can not start when joining a CNI network
> -
>
> Key: MESOS-5225
> URL: https://issues.apache.org/jira/browse/MESOS-5225
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Reporter: Qian Zhang
>Assignee: Avinash Sridharan
>
> Reproduce steps:
> 1. Start master
> {code}
> sudo ./bin/mesos-master.sh --work_dir=/tmp
> {code}
>  
> 2. Start agent
> {code}
> sudo ./bin/mesos-slave.sh --master=192.168.122.171:5050 
> --containerizers=mesos --image_providers=docker 
> --isolation=filesystem/linux,docker/runtime,network/cni 
> --network_cni_config_dir=/opt/cni/net_configs 
> --network_cni_plugins_dir=/opt/cni/plugins
> {code}
>  
> 3. Launch a command task with mesos-execute, and it will join a CNI network 
> {{net1}}.
> {code}
> sudo src/mesos-execute --master=192.168.122.171:5050 --name=test 
> --docker_image=library/busybox --networks=net1 --command="sleep 10" 
> --shell=true
> I0418 08:25:35.746758 24923 scheduler.cpp:177] Version: 0.29.0
> Subscribed with ID '3c4796f0-eee7-4939-a036-7c6387c370eb-'
> Submitted task 'test' to agent 'b74535d8-276f-4e09-ab47-53e3721ab271-S0'
> Received status update TASK_FAILED for task 'test'
>   message: 'Executor terminated'
>   source: SOURCE_AGENT
>   reason: REASON_EXECUTOR_TERMINATED
> {code}
> So the task failed with the reason "executor terminated". Here is the agent 
> log:
> {code}
> I0418 08:25:35.804873 24911 slave.cpp:1514] Got assigned task test for 
> framework 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.807937 24911 slave.cpp:1633] Launching task test for framework 
> 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.812503 24911 paths.cpp:528] Trying to chown 
> '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/t
> est/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' to user 'root'
> I0418 08:25:35.820339 24911 slave.cpp:5620] Launching executor test of 
> framework 3c4796f0-eee7-4939-a036-7c6387c370eb- with resources 
> cpus(*):0.1; mem(*):32 in work directory '/t
> mp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3'
> I0418 08:25:35.822576 24914 containerizer.cpp:698] Starting container 
> '2b29d6d6-b314-477f-b734-7771d07d41e3' for executor 'test' of framework 
> '3c4796f0-eee7-4939-a036-7c6387c370eb-00
> 00'
> I0418 08:25:35.825996 24911 slave.cpp:1851] Queuing task 'test' for executor 
> 'test' of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.832348 24911 provisioner.cpp:285] Provisioning image rootfs 
> '/tmp/mesos/provisioner/containers/2b29d6d6-b314-477f-b734-7771d07d41e3/backends/copy/rootfses/d219ec3a-ea3
> 1-45f6-b578-a62cd02392e7' for container 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:36.061249 24913 linux_launcher.cpp:281] Cloning child process 
> with flags = CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWNS
> I0418 08:25:36.071208 24915 cni.cpp:643] Bind mounted '/proc/24950/ns/net' to 
> '/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' 
> for container 2b29d6d6-b314-4
> 77f-b734-7771d07d41e3
> I0418 08:25:36.250573 24916 cni.cpp:962] Got assigned IPv4 address 
> '192.168.1.2/24' from CNI network 'net1' for container 
> 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:36.252002 24917 cni.cpp:765] Unable to find DNS nameservers for 
> container 2b29d6d6-b314-477f-b734-7771d07d41e3. Using host '/etc/resolv.conf'
> I0418 08:25:37.663487 24916 containerizer.cpp:1696] Executor for container 
> '2b29d6d6-b314-477f-b734-7771d07d41e3' has exited
> I0418 08:25:37.663745 24916 containerizer.cpp:1461] Destroying container 
> '2b29d6d6-b314-477f-b734-7771d07d41e3'
> I0418 08:25:37.670574 24915 cgroups.cpp:2676] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:37.676864 24912 cgroups.cpp:1409] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 after 
> 6.061056ms
> I0418 08:25:37.680552 24913 cgroups.cpp:2694] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:37.683346 24913 cgroups.cpp:1438] Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 after 
> 2.46016ms
> I0418 08:25:37.874023 24914 cni.cpp:1121] Unmounted the network namespace 
> handle 
> '/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' 
> for container 2b29d6d6-b31
> 4-477f-b734-7771d07d41e3
> I0418 08:2

[jira] [Comment Edited] (MESOS-5225) Command executor can not start when joining a CNI network

2016-04-17 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245085#comment-15245085
 ] 

Avinash Sridharan edited comment on MESOS-5225 at 4/18/16 4:44 AM:
---

Thanks Qian!!
This does seem like a bug.

The odd part is that we do set the rootfs on which we are going to bind mount the 
network files by checking the `ContainerConfig`:
https://github.com/apache/mesos/blob/0845ec04395faeb05a518a81c89c87b726dc8711/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L566

However, my suspicion is that for command tasks the rootfs in `ContainerConfig` 
is set to the actual rootfs of the container. (Need to confirm this.)

This does raise a question though: even if we bind mount the files onto the 
corresponding files in the host file system, we still need to bind mount the 
same files into the container file system as well. The reason is that after 
`pivot_root` the process will start treating the container as the root 
filesystem, and if the network files are not bind mounted into the rootfs of the 
container, we will start seeing the same failure.

I am thinking that the fix should be to bind mount the files both to the rootfs 
of the container and to the rootfs of the host file system. These mount points 
will get destroyed anyway when the mnt namespace is destroyed (when the 
container dies).


was (Author: avin...@mesosphere.io):
Thanks Qian !!
This does seem like a bug. 

Odd part is we do set the rootfs on which we are going to bind mount the 
network files by checking the `ContainerConfig` 
https://github.com/apache/mesos/blob/0845ec04395faeb05a518a81c89c87b726dc8711/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L566

However, my suspicion is that for command tasks the rootfs in `ContainerConfig` 
is set to the actual rootfs of the container. (Need to confirm this). 

This does raise a question though, even we bind mount the files to the 
corresponding files in the host file system, we still need to bind mount the 
same files to the container file system as well. Reason being, that after 
`pivot_root` the process will start treating the container as the root 
filesystem and if the network files are not bind mounted into the rootfs of the 
container, we will start seeing the same failure.

I am thinking that the fix should be to bind mount the files to the rootfs of 
the container and the rootfs of the host file system. These mount points will 
get destroyed anyway when the mnt namespace is destroyed (container dies).

> Command executor can not start when joining a CNI network
> -
>
> Key: MESOS-5225
> URL: https://issues.apache.org/jira/browse/MESOS-5225
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>
> Reproduce steps:
> 1. Start master
> {code}
> sudo ./bin/mesos-master.sh --work_dir=/tmp
> {code}
>  
> 2. Start agent
> {code}
> sudo ./bin/mesos-slave.sh --master=192.168.122.171:5050 
> --containerizers=mesos --image_providers=docker 
> --isolation=filesystem/linux,docker/runtime,network/cni 
> --network_cni_config_dir=/opt/cni/net_configs 
> --network_cni_plugins_dir=/opt/cni/plugins
> {code}
>  
> 3. Launch a command task with mesos-execute, and it will join a CNI network 
> {{net1}}.
> {code}
> sudo src/mesos-execute --master=192.168.122.171:5050 --name=test 
> --docker_image=library/busybox --networks=net1 --command="sleep 10" 
> --shell=true
> I0418 08:25:35.746758 24923 scheduler.cpp:177] Version: 0.29.0
> Subscribed with ID '3c4796f0-eee7-4939-a036-7c6387c370eb-'
> Submitted task 'test' to agent 'b74535d8-276f-4e09-ab47-53e3721ab271-S0'
> Received status update TASK_FAILED for task 'test'
>   message: 'Executor terminated'
>   source: SOURCE_AGENT
>   reason: REASON_EXECUTOR_TERMINATED
> {code}
> So the task failed with the reason "executor terminated". Here is the agent 
> log:
> {code}
> I0418 08:25:35.804873 24911 slave.cpp:1514] Got assigned task test for 
> framework 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.807937 24911 slave.cpp:1633] Launching task test for framework 
> 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.812503 24911 paths.cpp:528] Trying to chown 
> '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/t
> est/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' to user 'root'
> I0418 08:25:35.820339 24911 slave.cpp:5620] Launching executor test of 
> framework 3c4796f0-eee7-4939-a036-7c6387c370eb- with resources 
> cpus(*):0.1; mem(*):32 in work directory '/t
> mp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3'
> I0418 08:25:35.8

[jira] [Commented] (MESOS-5225) Command executor can not start when joining a CNI network

2016-04-17 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245085#comment-15245085
 ] 

Avinash Sridharan commented on MESOS-5225:
--

Thanks Qian !!
This does seem like a bug. 

Odd part is we do set the rootfs on which we are going to bind mount the 
network files by checking the `ContainerConfig` 
https://github.com/apache/mesos/blob/0845ec04395faeb05a518a81c89c87b726dc8711/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L566

However, my suspicion is that for command tasks the rootfs in `ContainerConfig` 
is set to the actual rootfs of the container. (Need to figure this out). 

This does raise a question though, even we bind mount the files to the 
corresponding files in the host file system, we still need to bind mount the 
same files to the container file system as well. Reason being, that after 
`pivot_root` the process will start treating the container as the root 
filesystem and if the network files are not bind mounted into the rootfs of the 
container, we will start seeing the same failure.

I am thinking that the fix should be to bind mount the files to the rootfs of 
the container and the rootfs of the host file system. These mount points will 
get destroyed anyway when the mnt namespace is destroyed (container dies).
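A minimal sketch of the bind-mount idea described above (simplified, with the 
target paths and error handling assumed; this is not the actual isolator code):
{code}
#include <sys/mount.h>

#include <cstdio>
#include <string>

// Bind mount the CNI-generated /etc/hosts over both the host path and the
// same path inside the container's rootfs, so the file is visible before
// and after pivot_root. Simplified sketch; no cleanup or error recovery.
int bindMountHostsFile(const std::string& source, const std::string& rootfs)
{
  if (::mount(source.c_str(), "/etc/hosts", nullptr, MS_BIND, nullptr) != 0) {
    perror("bind mount onto host /etc/hosts failed");
    return -1;
  }

  const std::string target = rootfs + "/etc/hosts";
  if (::mount(source.c_str(), target.c_str(), nullptr, MS_BIND, nullptr) != 0) {
    perror("bind mount into container rootfs failed");
    return -1;
  }

  return 0;
}
{code}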

> Command executor can not start when joining a CNI network
> -
>
> Key: MESOS-5225
> URL: https://issues.apache.org/jira/browse/MESOS-5225
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>
> Reproduce steps:
> 1. Start master
> {code}
> sudo ./bin/mesos-master.sh --work_dir=/tmp
> {code}
>  
> 2. Start agent
> {code}
> sudo ./bin/mesos-slave.sh --master=192.168.122.171:5050 
> --containerizers=mesos --image_providers=docker 
> --isolation=filesystem/linux,docker/runtime,network/cni 
> --network_cni_config_dir=/opt/cni/net_configs 
> --network_cni_plugins_dir=/opt/cni/plugins
> {code}
>  
> 3. Launch a command task with mesos-execute, and it will join a CNI network 
> {{net1}}.
> {code}
> sudo src/mesos-execute --master=192.168.122.171:5050 --name=test 
> --docker_image=library/busybox --networks=net1 --command="sleep 10" 
> --shell=true
> I0418 08:25:35.746758 24923 scheduler.cpp:177] Version: 0.29.0
> Subscribed with ID '3c4796f0-eee7-4939-a036-7c6387c370eb-'
> Submitted task 'test' to agent 'b74535d8-276f-4e09-ab47-53e3721ab271-S0'
> Received status update TASK_FAILED for task 'test'
>   message: 'Executor terminated'
>   source: SOURCE_AGENT
>   reason: REASON_EXECUTOR_TERMINATED
> {code}
> So the task failed with the reason "executor terminated". Here is the agent 
> log:
> {code}
> I0418 08:25:35.804873 24911 slave.cpp:1514] Got assigned task test for 
> framework 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.807937 24911 slave.cpp:1633] Launching task test for framework 
> 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.812503 24911 paths.cpp:528] Trying to chown 
> '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/t
> est/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' to user 'root'
> I0418 08:25:35.820339 24911 slave.cpp:5620] Launching executor test of 
> framework 3c4796f0-eee7-4939-a036-7c6387c370eb- with resources 
> cpus(*):0.1; mem(*):32 in work directory '/t
> mp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3'
> I0418 08:25:35.822576 24914 containerizer.cpp:698] Starting container 
> '2b29d6d6-b314-477f-b734-7771d07d41e3' for executor 'test' of framework 
> '3c4796f0-eee7-4939-a036-7c6387c370eb-00
> 00'
> I0418 08:25:35.825996 24911 slave.cpp:1851] Queuing task 'test' for executor 
> 'test' of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.832348 24911 provisioner.cpp:285] Provisioning image rootfs 
> '/tmp/mesos/provisioner/containers/2b29d6d6-b314-477f-b734-7771d07d41e3/backends/copy/rootfses/d219ec3a-ea3
> 1-45f6-b578-a62cd02392e7' for container 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:36.061249 24913 linux_launcher.cpp:281] Cloning child process 
> with flags = CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWNS
> I0418 08:25:36.071208 24915 cni.cpp:643] Bind mounted '/proc/24950/ns/net' to 
> '/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' 
> for container 2b29d6d6-b314-4
> 77f-b734-7771d07d41e3
> I0418 08:25:36.250573 24916 cni.cpp:962] Got assigned IPv4 address 
> '192.168.1.2/24' from CNI network 'net1' for container 
> 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:36.252002 24917 cni.cpp:765] Unable to find DNS nameservers for 
> container 2b29d6d6-b314-477

[jira] [Comment Edited] (MESOS-5225) Command executor can not start when joining a CNI network

2016-04-17 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245085#comment-15245085
 ] 

Avinash Sridharan edited comment on MESOS-5225 at 4/18/16 3:13 AM:
---

Thanks Qian !!
This does seem like a bug. 

Odd part is we do set the rootfs on which we are going to bind mount the 
network files by checking the `ContainerConfig` 
https://github.com/apache/mesos/blob/0845ec04395faeb05a518a81c89c87b726dc8711/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L566

However, my suspicion is that for command tasks the rootfs in `ContainerConfig` 
is set to the actual rootfs of the container. (Need to confirm this). 

This does raise a question though, even we bind mount the files to the 
corresponding files in the host file system, we still need to bind mount the 
same files to the container file system as well. Reason being, that after 
`pivot_root` the process will start treating the container as the root 
filesystem and if the network files are not bind mounted into the rootfs of the 
container, we will start seeing the same failure.

I am thinking that the fix should be to bind mount the files to the rootfs of 
the container and the rootfs of the host file system. These mount points will 
get destroyed anyway when the mnt namespace is destroyed (container dies).


was (Author: avin...@mesosphere.io):
Thanks Qian !!
This does seem like a bug. 

Odd part is we do set the rootfs on which we are going to bind mount the 
network files by checking the `ContainerConfig` 
https://github.com/apache/mesos/blob/0845ec04395faeb05a518a81c89c87b726dc8711/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L566

However, my suspicion is that for command tasks the rootfs in `ContainerConfig` 
is set to the actual rootfs of the container. (Need to figure this out). 

This does raise a question though, even we bind mount the files to the 
corresponding files in the host file system, we still need to bind mount the 
same files to the container file system as well. Reason being, that after 
`pivot_root` the process will start treating the container as the root 
filesystem and if the network files are not bind mounted into the rootfs of the 
container, we will start seeing the same failure.

I am thinking that the fix should be to bind mount the files to the rootfs of 
the container and the rootfs of the host file system. These mount points will 
get destroyed anyway when the mnt namespace is destroyed (container dies).

> Command executor can not start when joining a CNI network
> -
>
> Key: MESOS-5225
> URL: https://issues.apache.org/jira/browse/MESOS-5225
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>
> Reproduce steps:
> 1. Start master
> {code}
> sudo ./bin/mesos-master.sh --work_dir=/tmp
> {code}
>  
> 2. Start agent
> {code}
> sudo ./bin/mesos-slave.sh --master=192.168.122.171:5050 
> --containerizers=mesos --image_providers=docker 
> --isolation=filesystem/linux,docker/runtime,network/cni 
> --network_cni_config_dir=/opt/cni/net_configs 
> --network_cni_plugins_dir=/opt/cni/plugins
> {code}
>  
> 3. Launch a command task with mesos-execute, and it will join a CNI network 
> {{net1}}.
> {code}
> sudo src/mesos-execute --master=192.168.122.171:5050 --name=test 
> --docker_image=library/busybox --networks=net1 --command="sleep 10" 
> --shell=true
> I0418 08:25:35.746758 24923 scheduler.cpp:177] Version: 0.29.0
> Subscribed with ID '3c4796f0-eee7-4939-a036-7c6387c370eb-'
> Submitted task 'test' to agent 'b74535d8-276f-4e09-ab47-53e3721ab271-S0'
> Received status update TASK_FAILED for task 'test'
>   message: 'Executor terminated'
>   source: SOURCE_AGENT
>   reason: REASON_EXECUTOR_TERMINATED
> {code}
> So the task failed with the reason "executor terminated". Here is the agent 
> log:
> {code}
> I0418 08:25:35.804873 24911 slave.cpp:1514] Got assigned task test for 
> framework 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.807937 24911 slave.cpp:1633] Launching task test for framework 
> 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.812503 24911 paths.cpp:528] Trying to chown 
> '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/t
> est/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' to user 'root'
> I0418 08:25:35.820339 24911 slave.cpp:5620] Launching executor test of 
> framework 3c4796f0-eee7-4939-a036-7c6387c370eb- with resources 
> cpus(*):0.1; mem(*):32 in work directory '/t
> mp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3'
> I0418 08:25:35.8

[jira] [Commented] (MESOS-5123) Docker task may fail if path to agent work_dir is relative.

2016-04-17 Thread Klaus Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245073#comment-15245073
 ] 

Klaus Ma commented on MESOS-5123:
-

cc [~jieyu]/[~alexr] :).

> Docker task may fail if path to agent work_dir is relative. 
> 
>
> Key: MESOS-5123
> URL: https://issues.apache.org/jira/browse/MESOS-5123
> Project: Mesos
>  Issue Type: Improvement
>  Components: docker
>Affects Versions: 0.28.0, 0.29.0
>Reporter: Alexander Rukletsov
>Assignee: Klaus Ma
>  Labels: docker, documentation, mesosphere
> Fix For: 0.29.0
>
>
> When a relative path is specified for the agent’s {{\-\-work_dir}} (e.g., 
> {{\-\-work_dir=w/s}}), docker complains that there are forbidden symbols in a 
> *local* volume name. Specifying an absolute path (e.g., 
> {{\-\-work_dir=/tmp}}) solves the problem.
> Docker error observed:
> {noformat}
> docker: Error response from daemon: create 
> w/s/slaves/33b8fe47-e9e0-468a-83a6-98c1e3537e59-S1/frameworks/33b8fe47-e9e0-468a-83a6-98c1e3537e59-0001/executors/docker-test/runs/3cc5cb04-d0a9-490e-94d5-d446b66c97cc:
>  volume name invalid: 
> "w/s/slaves/33b8fe47-e9e0-468a-83a6-98c1e3537e59-S1/frameworks/33b8fe47-e9e0-468a-83a6-98c1e3537e59-0001/executors/docker-test/runs/3cc5cb04-d0a9-490e-94d5-d446b66c97cc"
>  includes invalid characters for a local volume name, only 
> "[a-zA-Z0-9][a-zA-Z0-9_.-]" are allowed.
> {noformat}
> First off, it is not obvious that Mesos always creates a volume for the 
> sandbox. We may want to document it.
> Second, it's hard to understand that a relative {{work_dir}} can trigger a 
> forbidden-symbols error in docker. Does it make sense to check this during 
> agent launch if the docker containerizer is enabled? Or to reject docker tasks 
> during task validation?
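A hypothetical sketch of the kind of startup check the last paragraph suggests 
(not the agent's actual flag validation):
{code}
#include <cstdlib>
#include <iostream>
#include <string>

// Hypothetical startup validation: docker derives a volume name from the
// sandbox path, so a relative work_dir (e.g. "w/s") would later be rejected
// by docker. Fail fast at agent launch instead.
void validateWorkDir(const std::string& workDir, bool dockerEnabled)
{
  if (dockerEnabled && (workDir.empty() || workDir[0] != '/')) {
    std::cerr << "--work_dir must be an absolute path when the docker "
              << "containerizer is enabled, got '" << workDir << "'"
              << std::endl;
    std::exit(EXIT_FAILURE);
  }
}
{code}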





[jira] [Assigned] (MESOS-5224) buffer overflow error in slave upon processing status update from executor v1 http API

2016-04-17 Thread Klaus Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Klaus Ma reassigned MESOS-5224:
---

Assignee: Klaus Ma

> buffer overflow error in slave upon processing status update from executor v1 
> http API
> --
>
> Key: MESOS-5224
> URL: https://issues.apache.org/jira/browse/MESOS-5224
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.28.0
> Environment: {code}
> $ dpkg -l|grep -e mesos
> ii  mesos   0.28.0-2.0.16.ubuntu1404 
> amd64Cluster resource manager with efficient resource isolation
> $ uname -a
> Linux node-3 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 
> x86_64 x86_64 x86_64 GNU/Linux
> {code}
>Reporter: James DeFelice
>Assignee: Klaus Ma
>  Labels: mesosphere
>
> I'm implementing support for the executor HTTP v1 API in mesos-go:next, and my 
> executor can't send status updates because the slave dies upon receiving 
> them. The protobufs were generated from 0.28.1.
> From syslog:
> {code}
> Apr 17 17:53:53 node-1 mesos-slave[4462]: I0417 17:53:53.121467  4489 
> http.cpp:190] HTTP POST for /slave(1)/api/v1/executor from 10.2.0.5:51800 
> with User-Agent='Go-http-client/1.1'
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** buffer overflow detected ***: 
> /usr/sbin/mesos-slave terminated
> Apr 17 17:53:53 node-1 mesos-slave[4462]: === Backtrace: =
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7fc53064e38f]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7fc5306e5c9c]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(+0x109b60)[0x7fc5306e4b60]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internallsERSoRKNS0_12StatusUpdateE+0x16a)[0x7fc531cc617a]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internal5slave5Slave12statusUpdateENS0_12StatusUpdateERK6OptionIN7process4UPIDEE+0xe7)[0x7fc531d71837]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZNK5mesos8internal5slave5Slave4Http8executorERKN7process4http7RequestE+0xb52)[0x7fc531d302a2]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0xc754a3)[0x7fc531d4d4a3]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0x1295aa8)[0x7fc53236daa8]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN7process14ProcessManager6resumeEPNS_11ProcessBaseE+0x2d1)[0x7fc532375a71]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0x129dd77)[0x7fc532375d77]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1bf0)[0x7fc530e85bf0]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fc5309a8182]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc5306d547d]
> ...
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** Aborted at 1460915633 (unix 
> time) try "date -d @1460915633" if you are using GNU date ***
> Apr 17 17:53:53 node-1 mesos-slave[4462]: PC: @ 0x7fc530611cc9 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** SIGABRT (@0x116e) received by 
> PID 4462 (TID 0x7fc5275f5700) from PID 4462; stack trace: ***
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5309b0340 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc530611cc9 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306150d8 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53064e394 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e5c9c (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e4b60 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531cc617a 
> mesos::internal::operator<<()
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d71837 
> mesos::internal::slave::Slave::statusUpdate()
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d302a2 
> mesos::internal::slave::Slave::Http::executor()
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d4d4a3 
> _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestEEZN5mesos8internal5slave5Slave10initializeEvEUlS7_E19_E9_M_invokeERKSt9_Any_dataS7_
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53236daa8 
> _ZZN7process11ProcessBase5visitERKNS_9HttpEventEENKUlRKNS_6FutureI6OptionINS_4http14authentication20AuthenticationResultE0_clESC_
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc

[jira] [Commented] (MESOS-5224) buffer overflow error in slave upon processing status update from executor v1 http API

2016-04-17 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245070#comment-15245070
 ] 

haosdent commented on MESOS-5224:
-

If the overflow were in {{UUID::fromBytes()}}, I think the stack should look like
{code}
UUID::fromBytes()
statusUpdate()
...
{code}

{{.framework_id()}} and {{.status()}} are the parameters the user passed in here.
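For reference, the mangled frames in the backtrace can be demangled to confirm 
which functions they correspond to (the same names also appear demangled further 
down in the glog stack trace); a small sketch using {{abi::__cxa_demangle}}:
{code}
#include <cxxabi.h>

#include <cstdlib>
#include <iostream>

// Demangle a frame such as "_ZN5mesos8internallsERSoRKNS0_12StatusUpdateE"
// to confirm which function it is (here: the StatusUpdate operator<<).
int main()
{
  const char* mangled = "_ZN5mesos8internallsERSoRKNS0_12StatusUpdateE";

  int status = 0;
  char* demangled = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);

  if (status == 0 && demangled != nullptr) {
    std::cout << demangled << std::endl;
    std::free(demangled);
  } else {
    std::cerr << "demangling failed: " << status << std::endl;
  }
  return 0;
}
{code}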

> buffer overflow error in slave upon processing status update from executor v1 
> http API
> --
>
> Key: MESOS-5224
> URL: https://issues.apache.org/jira/browse/MESOS-5224
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.28.0
> Environment: {code}
> $ dpkg -l|grep -e mesos
> ii  mesos   0.28.0-2.0.16.ubuntu1404 
> amd64Cluster resource manager with efficient resource isolation
> $ uname -a
> Linux node-3 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 
> x86_64 x86_64 x86_64 GNU/Linux
> {code}
>Reporter: James DeFelice
>  Labels: mesosphere
>
> I'm implementing support for the executor HTTP v1 API in mesos-go:next, and my 
> executor can't send status updates because the slave dies upon receiving 
> them. The protobufs were generated from 0.28.1.
> From syslog:
> {code}
> Apr 17 17:53:53 node-1 mesos-slave[4462]: I0417 17:53:53.121467  4489 
> http.cpp:190] HTTP POST for /slave(1)/api/v1/executor from 10.2.0.5:51800 
> with User-Agent='Go-http-client/1.1'
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** buffer overflow detected ***: 
> /usr/sbin/mesos-slave terminated
> Apr 17 17:53:53 node-1 mesos-slave[4462]: === Backtrace: =
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7fc53064e38f]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7fc5306e5c9c]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(+0x109b60)[0x7fc5306e4b60]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internallsERSoRKNS0_12StatusUpdateE+0x16a)[0x7fc531cc617a]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internal5slave5Slave12statusUpdateENS0_12StatusUpdateERK6OptionIN7process4UPIDEE+0xe7)[0x7fc531d71837]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZNK5mesos8internal5slave5Slave4Http8executorERKN7process4http7RequestE+0xb52)[0x7fc531d302a2]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0xc754a3)[0x7fc531d4d4a3]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0x1295aa8)[0x7fc53236daa8]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN7process14ProcessManager6resumeEPNS_11ProcessBaseE+0x2d1)[0x7fc532375a71]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0x129dd77)[0x7fc532375d77]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1bf0)[0x7fc530e85bf0]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fc5309a8182]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc5306d547d]
> ...
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** Aborted at 1460915633 (unix 
> time) try "date -d @1460915633" if you are using GNU date ***
> Apr 17 17:53:53 node-1 mesos-slave[4462]: PC: @ 0x7fc530611cc9 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** SIGABRT (@0x116e) received by 
> PID 4462 (TID 0x7fc5275f5700) from PID 4462; stack trace: ***
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5309b0340 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc530611cc9 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306150d8 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53064e394 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e5c9c (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e4b60 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531cc617a 
> mesos::internal::operator<<()
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d71837 
> mesos::internal::slave::Slave::statusUpdate()
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d302a2 
> mesos::internal::slave::Slave::Http::executor()
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d4d4a3 
> _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestEEZN5mesos8internal5slave5Slave10initializeEvEUlS7_E19_E9_M_invokeERKSt9_Any_dataS7_
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53236da

[jira] [Comment Edited] (MESOS-5224) buffer overflow error in slave upon processing status update from executor v1 http API

2016-04-17 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245070#comment-15245070
 ] 

haosdent edited comment on MESOS-5224 at 4/18/16 2:46 AM:
--

If the overflow were in {{UUID::fromBytes()}}, I think the stack should look like
{code}
UUID::fromBytes()
statusUpdate()
...
{code}

{{.framework_id()}} and {{.status()}} are the parameters the user passed in here.


was (Author: haosd...@gmail.com):
If overflow in {{UUID:fromBytes()}}, I think the status should looks like
{code}
UUID::fromBytes()
statusUpdat()
...
{code}

{{.framework_id()}} and {{.status()}} are the parameters user passed in here.

> buffer overflow error in slave upon processing status update from executor v1 
> http API
> --
>
> Key: MESOS-5224
> URL: https://issues.apache.org/jira/browse/MESOS-5224
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.28.0
> Environment: {code}
> $ dpkg -l|grep -e mesos
> ii  mesos   0.28.0-2.0.16.ubuntu1404 
> amd64Cluster resource manager with efficient resource isolation
> $ uname -a
> Linux node-3 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 
> x86_64 x86_64 x86_64 GNU/Linux
> {code}
>Reporter: James DeFelice
>  Labels: mesosphere
>
> I'm implementing support for the executor HTTP v1 API in mesos-go:next, and my 
> executor can't send status updates because the slave dies upon receiving 
> them. The protobufs were generated from 0.28.1.
> From syslog:
> {code}
> Apr 17 17:53:53 node-1 mesos-slave[4462]: I0417 17:53:53.121467  4489 
> http.cpp:190] HTTP POST for /slave(1)/api/v1/executor from 10.2.0.5:51800 
> with User-Agent='Go-http-client/1.1'
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** buffer overflow detected ***: 
> /usr/sbin/mesos-slave terminated
> Apr 17 17:53:53 node-1 mesos-slave[4462]: === Backtrace: =
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7fc53064e38f]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7fc5306e5c9c]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(+0x109b60)[0x7fc5306e4b60]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internallsERSoRKNS0_12StatusUpdateE+0x16a)[0x7fc531cc617a]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internal5slave5Slave12statusUpdateENS0_12StatusUpdateERK6OptionIN7process4UPIDEE+0xe7)[0x7fc531d71837]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZNK5mesos8internal5slave5Slave4Http8executorERKN7process4http7RequestE+0xb52)[0x7fc531d302a2]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0xc754a3)[0x7fc531d4d4a3]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0x1295aa8)[0x7fc53236daa8]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN7process14ProcessManager6resumeEPNS_11ProcessBaseE+0x2d1)[0x7fc532375a71]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0x129dd77)[0x7fc532375d77]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1bf0)[0x7fc530e85bf0]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fc5309a8182]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc5306d547d]
> ...
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** Aborted at 1460915633 (unix 
> time) try "date -d @1460915633" if you are using GNU date ***
> Apr 17 17:53:53 node-1 mesos-slave[4462]: PC: @ 0x7fc530611cc9 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** SIGABRT (@0x116e) received by 
> PID 4462 (TID 0x7fc5275f5700) from PID 4462; stack trace: ***
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5309b0340 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc530611cc9 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306150d8 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53064e394 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e5c9c (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e4b60 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531cc617a 
> mesos::internal::operator<<()
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d71837 
> mesos::internal::slave::Slave::statusUpdate()
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d302a2 
> mesos::internal::slave::Slave::Http::executor()
> Apr 17 17:5

[jira] [Updated] (MESOS-5056) Replace Master/Slave Terminology Phase I - Update strings in the shell scripts outputs

2016-04-17 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-5056:
--
Shepherd: Vinod Kone

> Replace Master/Slave Terminology Phase I - Update strings in the shell 
> scripts outputs
> --
>
> Key: MESOS-5056
> URL: https://issues.apache.org/jira/browse/MESOS-5056
> Project: Mesos
>  Issue Type: Task
>Reporter: zhou xing
>Assignee: zhou xing
>
> This is a sub-ticket of MESOS-3780. In this ticket, we will rename "slave" to 
> "agent" in the shell script outputs.





[jira] [Updated] (MESOS-5057) Replace Master/Slave Terminology Phase I - Update strings in error messages and other strings

2016-04-17 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-5057:
--
Shepherd: Vinod Kone
  Sprint: Mesosphere Sprint 33
Story Points: 3

> Replace Master/Slave Terminology Phase I - Update strings in error messages 
> and other strings
> -
>
> Key: MESOS-5057
> URL: https://issues.apache.org/jira/browse/MESOS-5057
> Project: Mesos
>  Issue Type: Task
>Reporter: zhou xing
>Assignee: zhou xing
> Fix For: 0.29.0
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> This is a sub-ticket of MESOS-3780. In this ticket, we will update all 
> occurrences of "slave" to "agent" in the error messages and other strings in 
> the code.





[jira] [Comment Edited] (MESOS-5057) Replace Master/Slave Terminology Phase I - Update strings in error messages and other strings

2016-04-17 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245066#comment-15245066
 ] 

Vinod Kone edited comment on MESOS-5057 at 4/18/16 2:35 AM:


Please transition the issue to "Reviewable" when you post a review. I will do 
it for this one.


was (Author: vinodkone):
Transition the review to "Reviewable" when you post a review please. I will do 
it for this one.

> Replace Master/Slave Terminology Phase I - Update strings in error messages 
> and other strings
> -
>
> Key: MESOS-5057
> URL: https://issues.apache.org/jira/browse/MESOS-5057
> Project: Mesos
>  Issue Type: Task
>Reporter: zhou xing
>Assignee: zhou xing
> Fix For: 0.29.0
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> This is a sub-ticket of MESOS-3780. In this ticket, we will update all 
> occurrences of "slave" to "agent" in the error messages and other strings in 
> the code.





[jira] [Commented] (MESOS-5226) The image-less task launched by mesos-execute can not join CNI network

2016-04-17 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245052#comment-15245052
 ] 

Qian Zhang commented on MESOS-5226:
---

The root cause of this bug: in {{CommandScheduler::getContainerInfo()}}, we do 
not return a {{ContainerInfo}} when no image is specified, even if a CNI network 
is specified; we just return {{None()}} in that case. And in 
{{NetworkCniIsolatorProcess::prepare()}}, we simply ignore any container that has 
no {{ContainerInfo}}, so no CNI-related logic is applied to the executor, which 
therefore stays in the agent's host network namespace.
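A rough sketch of the direction this implies (not the actual patch): build a 
{{ContainerInfo}} whenever a CNI network is requested, even without an image, so 
the network/cni isolator no longer skips the container.
{code}
#include <string>
#include <vector>

#include <mesos/mesos.pb.h>

// Sketch of the idea behind the fix (not the actual CommandScheduler code):
// return a ContainerInfo whenever either an image or a CNI network is
// requested. Returning false is analogous to returning None().
bool getContainerInfo(
    const std::string& image,                  // may be empty
    const std::vector<std::string>& networks,  // e.g. {"net1"}
    mesos::ContainerInfo* result)
{
  if (image.empty() && networks.empty()) {
    return false;  // nothing to configure
  }

  result->set_type(mesos::ContainerInfo::MESOS);

  for (const std::string& name : networks) {
    result->add_network_infos()->set_name(name);
  }

  if (!image.empty()) {
    mesos::Image* dockerImage = result->mutable_mesos()->mutable_image();
    dockerImage->set_type(mesos::Image::DOCKER);
    dockerImage->mutable_docker()->set_name(image);
  }

  return true;
}
{code}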

> The image-less task launched by mesos-execute can not join CNI network
> --
>
> Key: MESOS-5226
> URL: https://issues.apache.org/jira/browse/MESOS-5226
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>
> With {{mesos-execute}}, if we launch a task which wants to join a CNI 
> network but has no image specified, like:
> {code}
> sudo src/mesos-execute --master=192.168.122.171:5050 --name=test 
> --networks=net1 --command="ifconfig" --shell=true
> {code}
> The corresponding command executor actually will not join the specified CNI 
> network; instead it stays in the agent's host network namespace.





[jira] [Created] (MESOS-5226) The image-less task launched by mesos-execute can not join CNI network

2016-04-17 Thread Qian Zhang (JIRA)
Qian Zhang created MESOS-5226:
-

 Summary: The image-less task launched by mesos-execute can not 
join CNI network
 Key: MESOS-5226
 URL: https://issues.apache.org/jira/browse/MESOS-5226
 Project: Mesos
  Issue Type: Bug
  Components: isolation
Reporter: Qian Zhang
Assignee: Qian Zhang


With {{mesos-execute}}, if we launch a task which wants to join a CNI network 
but has no image specified, like:
{code}
sudo src/mesos-execute --master=192.168.122.171:5050 --name=test 
--networks=net1 --command="ifconfig" --shell=true
{code}

The corresponding command executor actually will not join the specified CNI 
network; instead it stays in the agent's host network namespace.





[jira] [Commented] (MESOS-5224) buffer overflow error in slave upon processing status update from executor v1 http API

2016-04-17 Thread Klaus Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245050#comment-15245050
 ] 

Klaus Ma commented on MESOS-5224:
-

[~jdef], would you share your example? I'd like to reproduce it first :).

> buffer overflow error in slave upon processing status update from executor v1 
> http API
> --
>
> Key: MESOS-5224
> URL: https://issues.apache.org/jira/browse/MESOS-5224
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.28.0
> Environment: {code}
> $ dpkg -l|grep -e mesos
> ii  mesos   0.28.0-2.0.16.ubuntu1404 
> amd64Cluster resource manager with efficient resource isolation
> $ uname -a
> Linux node-3 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 
> x86_64 x86_64 x86_64 GNU/Linux
> {code}
>Reporter: James DeFelice
>  Labels: mesosphere
>
> I'm implementing support for the executor HTTP v1 API in mesos-go:next, and my 
> executor can't send status updates because the slave dies upon receiving 
> them. The protobufs were generated from 0.28.1.
> From syslog:
> {code}
> Apr 17 17:53:53 node-1 mesos-slave[4462]: I0417 17:53:53.121467  4489 
> http.cpp:190] HTTP POST for /slave(1)/api/v1/executor from 10.2.0.5:51800 
> with User-Agent='Go-http-client/1.1'
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** buffer overflow detected ***: 
> /usr/sbin/mesos-slave terminated
> Apr 17 17:53:53 node-1 mesos-slave[4462]: === Backtrace: =
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7fc53064e38f]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7fc5306e5c9c]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(+0x109b60)[0x7fc5306e4b60]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internallsERSoRKNS0_12StatusUpdateE+0x16a)[0x7fc531cc617a]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internal5slave5Slave12statusUpdateENS0_12StatusUpdateERK6OptionIN7process4UPIDEE+0xe7)[0x7fc531d71837]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZNK5mesos8internal5slave5Slave4Http8executorERKN7process4http7RequestE+0xb52)[0x7fc531d302a2]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0xc754a3)[0x7fc531d4d4a3]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0x1295aa8)[0x7fc53236daa8]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN7process14ProcessManager6resumeEPNS_11ProcessBaseE+0x2d1)[0x7fc532375a71]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0x129dd77)[0x7fc532375d77]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1bf0)[0x7fc530e85bf0]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fc5309a8182]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc5306d547d]
> ...
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** Aborted at 1460915633 (unix 
> time) try "date -d @1460915633" if you are using GNU date ***
> Apr 17 17:53:53 node-1 mesos-slave[4462]: PC: @ 0x7fc530611cc9 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** SIGABRT (@0x116e) received by 
> PID 4462 (TID 0x7fc5275f5700) from PID 4462; stack trace: ***
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5309b0340 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc530611cc9 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306150d8 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53064e394 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e5c9c (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e4b60 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531cc617a 
> mesos::internal::operator<<()
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d71837 
> mesos::internal::slave::Slave::statusUpdate()
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d302a2 
> mesos::internal::slave::Slave::Http::executor()
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d4d4a3 
> _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestEEZN5mesos8internal5slave5Slave10initializeEvEUlS7_E19_E9_M_invokeERKSt9_Any_dataS7_
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53236daa8 
> _ZZN7process11ProcessBase5visitERKNS_9HttpEventEENKUlRKNS_6FutureI6OptionINS_4http14authentication20AuthenticationResultEE

[jira] [Comment Edited] (MESOS-5225) Command executor can not start when joining a CNI network

2016-04-17 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245044#comment-15245044
 ] 

Qian Zhang edited comment on MESOS-5225 at 4/18/16 1:47 AM:


The root cause of this bug is that, before the command executor (mesos-executor) is 
started by mesos-containerizer, we bind mount {{/etc/hosts}}, 
{{/etc/hostname}} and {{/etc/resolv.conf}} into the container's rootfs (see 
{{NetworkCniIsolatorSetup::execute()}} for details). However, for the command 
executor we will NOT do the {{chroot}} before launching it (see 
{{LinuxFilesystemIsolatorProcess::prepare()}}: we only set the rootfs in 
{{ContainerLaunchInfo}} if it is not a command task); instead the command 
executor does the {{chroot}} itself when launching the task 
(https://github.com/apache/mesos/blob/0.28.0/src/launcher/executor.cpp#L369).

So when the command executor is launched, it is still using the agent host FS, 
which means the bind mounts we made do not take effect for it. Obviously the 
host's {{/etc/hosts}} does not contain the container's hostname/IP pair, so the 
hostname lookup in libprocess fails.


was (Author: qianzhang):
The root cause of this bug is, before the command executor (mesos-executor) is 
started by mesos-containerizer, we bind mount {{/etc/hosts}}, 
{{/etc/hostname}}, {{/etc/resolv.conf}} in the container's rootfs (see 
{{NetworkCniIsolatorSetup::execute()}} for details), but for command executor, 
we will NOT do the {{chroot}} before launching it (see 
{{LinuxFilesystemIsolatorProcess::prepare()}}, we will only set rootfs in 
{{ContainerLaunchInfo}} for if it is not a command task), instead the command 
executor will do the {{chroot}} itself when launching the task 
(https://github.com/apache/mesos/blob/0.28.0/src/launcher/executor.cpp#L369).

So when the command executor is launched, it is still using agent host FS, that 
means the bind mounts that we do will not take effect for it. Obviously in 
agent host FS, the {{/etc/hosts}} does not have the pair of container's 
hostname and IP, so the hostname lookup in libprocess will fail.

> Command executor can not start when joining a CNI network
> -
>
> Key: MESOS-5225
> URL: https://issues.apache.org/jira/browse/MESOS-5225
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>
> Reproduce steps:
> 1. Start master
> {code}
> sudo ./bin/mesos-master.sh --work_dir=/tmp
> {code}
>  
> 2. Start agent
> {code}
> sudo ./bin/mesos-slave.sh --master=192.168.122.171:5050 
> --containerizers=mesos --image_providers=docker 
> --isolation=filesystem/linux,docker/runtime,network/cni 
> --network_cni_config_dir=/opt/cni/net_configs 
> --network_cni_plugins_dir=/opt/cni/plugins}}
> {code}
>  
> 3. Launch a command task with mesos-execute, and it will join a CNI network 
> {{net1}}.
> {code}
> sudo src/mesos-execute --master=192.168.122.171:5050 --name=test 
> --docker_image=library/busybox --networks=net1 --command="sleep 10" 
> --shell=true
> I0418 08:25:35.746758 24923 scheduler.cpp:177] Version: 0.29.0
> Subscribed with ID '3c4796f0-eee7-4939-a036-7c6387c370eb-'
> Submitted task 'test' to agent 'b74535d8-276f-4e09-ab47-53e3721ab271-S0'
> Received status update TASK_FAILED for task 'test'
>   message: 'Executor terminated'
>   source: SOURCE_AGENT
>   reason: REASON_EXECUTOR_TERMINATED
> {code}
> So the task failed with the reason "executor terminated". Here is the agent 
> log:
> {code}
> I0418 08:25:35.804873 24911 slave.cpp:1514] Got assigned task test for 
> framework 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.807937 24911 slave.cpp:1633] Launching task test for framework 
> 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.812503 24911 paths.cpp:528] Trying to chown 
> '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/t
> est/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' to user 'root'
> I0418 08:25:35.820339 24911 slave.cpp:5620] Launching executor test of 
> framework 3c4796f0-eee7-4939-a036-7c6387c370eb- with resources 
> cpus(*):0.1; mem(*):32 in work directory '/t
> mp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3'
> I0418 08:25:35.822576 24914 containerizer.cpp:698] Starting container 
> '2b29d6d6-b314-477f-b734-7771d07d41e3' for executor 'test' of framework 
> '3c4796f0-eee7-4939-a036-7c6387c370eb-00
> 00'
> I0418 08:25:35.825996 24911 slave.cpp:1851] Queuing task 'test' for executor 
> 'test' of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.832348 24911 provisioner.cpp:285] Provisioning image rootfs

[jira] [Commented] (MESOS-5225) Command executor can not start when joining a CNI network

2016-04-17 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245044#comment-15245044
 ] 

Qian Zhang commented on MESOS-5225:
---

The root cause of this bug is that, before the command executor (mesos-executor) is 
started by mesos-containerizer, we bind mount {{/etc/hosts}}, 
{{/etc/hostname}} and {{/etc/resolv.conf}} into the container's rootfs (see 
{{NetworkCniIsolatorSetup::execute()}} for details). However, for the command 
executor we will NOT do the {{chroot}} before launching it (see 
{{LinuxFilesystemIsolatorProcess::prepare()}}: we only set the rootfs in 
{{ContainerLaunchInfo}} if it is not a command task); instead the command 
executor does the {{chroot}} itself when launching the task 
(https://github.com/apache/mesos/blob/0.28.0/src/launcher/executor.cpp#L369).

So when the command executor is launched, it is still using the agent host FS, 
which means the bind mounts we made do not take effect for it. Obviously the 
host's {{/etc/hosts}} does not contain the container's hostname/IP pair, so the 
hostname lookup in libprocess fails.
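
A minimal sketch of the failing lookup (an illustration only, not Mesos code), 
assuming libprocess roughly does a {{gethostname()}} + {{getaddrinfo()}} at 
startup: inside the container's UTS namespace {{gethostname()}} returns the 
container hostname, but because no chroot has happened yet the process reads 
the host's {{/etc/hosts}}, which has no entry for that name.

{code}
// Minimal sketch, not Mesos code: roughly the hostname resolution that
// libprocess performs at startup. If the hostname returned by gethostname()
// (the container's hostname) has no entry in the /etc/hosts visible to the
// process (the host's, since the command executor has not chroot'ed yet),
// getaddrinfo() fails and the executor cannot initialize.
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#include <cstdio>

int main()
{
  char hostname[256];
  if (gethostname(hostname, sizeof(hostname)) != 0) {
    perror("gethostname");
    return 1;
  }

  addrinfo hints = {};
  hints.ai_family = AF_INET;
  addrinfo* result = nullptr;

  int error = getaddrinfo(hostname, nullptr, &hints, &result);
  if (error != 0) {
    // This is the failure mode described above: the container hostname is
    // not resolvable through the host's /etc/hosts.
    fprintf(stderr, "getaddrinfo(%s): %s\n", hostname, gai_strerror(error));
    return 1;
  }

  freeaddrinfo(result);
  printf("resolved %s\n", hostname);
  return 0;
}
{code}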

> Command executor can not start when joining a CNI network
> -
>
> Key: MESOS-5225
> URL: https://issues.apache.org/jira/browse/MESOS-5225
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>
> Reproduce steps:
> 1. Start master
> {code}
> sudo ./bin/mesos-master.sh --work_dir=/tmp
> {code}
>  
> 2. Start agent
> {code}
> sudo ./bin/mesos-slave.sh --master=192.168.122.171:5050 
> --containerizers=mesos --image_providers=docker 
> --isolation=filesystem/linux,docker/runtime,network/cni 
> --network_cni_config_dir=/opt/cni/net_configs 
> --network_cni_plugins_dir=/opt/cni/plugins}}
> {code}
>  
> 3. Launch a command task with mesos-execute, and it will join a CNI network 
> {{net1}}.
> {code}
> sudo src/mesos-execute --master=192.168.122.171:5050 --name=test 
> --docker_image=library/busybox --networks=net1 --command="sleep 10" 
> --shell=true
> I0418 08:25:35.746758 24923 scheduler.cpp:177] Version: 0.29.0
> Subscribed with ID '3c4796f0-eee7-4939-a036-7c6387c370eb-'
> Submitted task 'test' to agent 'b74535d8-276f-4e09-ab47-53e3721ab271-S0'
> Received status update TASK_FAILED for task 'test'
>   message: 'Executor terminated'
>   source: SOURCE_AGENT
>   reason: REASON_EXECUTOR_TERMINATED
> {code}
> So the task failed with the reason "executor terminated". Here is the agent 
> log:
> {code}
> I0418 08:25:35.804873 24911 slave.cpp:1514] Got assigned task test for 
> framework 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.807937 24911 slave.cpp:1633] Launching task test for framework 
> 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.812503 24911 paths.cpp:528] Trying to chown 
> '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/t
> est/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' to user 'root'
> I0418 08:25:35.820339 24911 slave.cpp:5620] Launching executor test of 
> framework 3c4796f0-eee7-4939-a036-7c6387c370eb- with resources 
> cpus(*):0.1; mem(*):32 in work directory '/t
> mp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3'
> I0418 08:25:35.822576 24914 containerizer.cpp:698] Starting container 
> '2b29d6d6-b314-477f-b734-7771d07d41e3' for executor 'test' of framework 
> '3c4796f0-eee7-4939-a036-7c6387c370eb-00
> 00'
> I0418 08:25:35.825996 24911 slave.cpp:1851] Queuing task 'test' for executor 
> 'test' of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.832348 24911 provisioner.cpp:285] Provisioning image rootfs 
> '/tmp/mesos/provisioner/containers/2b29d6d6-b314-477f-b734-7771d07d41e3/backends/copy/rootfses/d219ec3a-ea3
> 1-45f6-b578-a62cd02392e7' for container 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:36.061249 24913 linux_launcher.cpp:281] Cloning child process 
> with flags = CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWNS
> I0418 08:25:36.071208 24915 cni.cpp:643] Bind mounted '/proc/24950/ns/net' to 
> '/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' 
> for container 2b29d6d6-b314-4
> 77f-b734-7771d07d41e3
> I0418 08:25:36.250573 24916 cni.cpp:962] Got assigned IPv4 address 
> '192.168.1.2/24' from CNI network 'net1' for container 
> 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:36.252002 24917 cni.cpp:765] Unable to find DNS nameservers for 
> container 2b29d6d6-b314-477f-b734-7771d07d41e3. Using host '/etc/resolv.conf'
> I0418 08:25:37.663487 24916 containerizer.cpp:1696] Executor for container 
> '2b29d6d6-b314-477f-b734-7771d07d41e3' has exited
> I0418 08:25:37.66

[jira] [Created] (MESOS-5225) Command executor can not start when joining a CNI network

2016-04-17 Thread Qian Zhang (JIRA)
Qian Zhang created MESOS-5225:
-

 Summary: Command executor can not start when joining a CNI network
 Key: MESOS-5225
 URL: https://issues.apache.org/jira/browse/MESOS-5225
 Project: Mesos
  Issue Type: Bug
  Components: isolation
Reporter: Qian Zhang
Assignee: Qian Zhang


Reproduce steps:
1. Start master
{code}
sudo ./bin/mesos-master.sh --work_dir=/tmp
{code}
 
2. Start agent
{code}
sudo ./bin/mesos-slave.sh --master=192.168.122.171:5050 --containerizers=mesos 
--image_providers=docker 
--isolation=filesystem/linux,docker/runtime,network/cni 
--network_cni_config_dir=/opt/cni/net_configs 
--network_cni_plugins_dir=/opt/cni/plugins}}
{code}
 
3. Launch a command task with mesos-execute, and it will join a CNI network 
{{net1}}.
{code}
sudo src/mesos-execute --master=192.168.122.171:5050 --name=test 
--docker_image=library/busybox --networks=net1 --command="sleep 10" --shell=true
I0418 08:25:35.746758 24923 scheduler.cpp:177] Version: 0.29.0
Subscribed with ID '3c4796f0-eee7-4939-a036-7c6387c370eb-'
Submitted task 'test' to agent 'b74535d8-276f-4e09-ab47-53e3721ab271-S0'
Received status update TASK_FAILED for task 'test'
  message: 'Executor terminated'
  source: SOURCE_AGENT
  reason: REASON_EXECUTOR_TERMINATED
{code}

So the task failed with the reason "executor terminated". Here is the agent log:
{code}
I0418 08:25:35.804873 24911 slave.cpp:1514] Got assigned task test for 
framework 3c4796f0-eee7-4939-a036-7c6387c370eb-
I0418 08:25:35.807937 24911 slave.cpp:1633] Launching task test for framework 
3c4796f0-eee7-4939-a036-7c6387c370eb-
I0418 08:25:35.812503 24911 paths.cpp:528] Trying to chown 
'/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/t
est/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' to user 'root'
I0418 08:25:35.820339 24911 slave.cpp:5620] Launching executor test of 
framework 3c4796f0-eee7-4939-a036-7c6387c370eb- with resources cpus(*):0.1; 
mem(*):32 in work directory '/t
mp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3'
I0418 08:25:35.822576 24914 containerizer.cpp:698] Starting container 
'2b29d6d6-b314-477f-b734-7771d07d41e3' for executor 'test' of framework 
'3c4796f0-eee7-4939-a036-7c6387c370eb-00
00'
I0418 08:25:35.825996 24911 slave.cpp:1851] Queuing task 'test' for executor 
'test' of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-
I0418 08:25:35.832348 24911 provisioner.cpp:285] Provisioning image rootfs 
'/tmp/mesos/provisioner/containers/2b29d6d6-b314-477f-b734-7771d07d41e3/backends/copy/rootfses/d219ec3a-ea3
1-45f6-b578-a62cd02392e7' for container 2b29d6d6-b314-477f-b734-7771d07d41e3
I0418 08:25:36.061249 24913 linux_launcher.cpp:281] Cloning child process with 
flags = CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWNS
I0418 08:25:36.071208 24915 cni.cpp:643] Bind mounted '/proc/24950/ns/net' to 
'/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' for 
container 2b29d6d6-b314-4
77f-b734-7771d07d41e3
I0418 08:25:36.250573 24916 cni.cpp:962] Got assigned IPv4 address 
'192.168.1.2/24' from CNI network 'net1' for container 
2b29d6d6-b314-477f-b734-7771d07d41e3
I0418 08:25:36.252002 24917 cni.cpp:765] Unable to find DNS nameservers for 
container 2b29d6d6-b314-477f-b734-7771d07d41e3. Using host '/etc/resolv.conf'
I0418 08:25:37.663487 24916 containerizer.cpp:1696] Executor for container 
'2b29d6d6-b314-477f-b734-7771d07d41e3' has exited
I0418 08:25:37.663745 24916 containerizer.cpp:1461] Destroying container 
'2b29d6d6-b314-477f-b734-7771d07d41e3'
I0418 08:25:37.670574 24915 cgroups.cpp:2676] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3
I0418 08:25:37.676864 24912 cgroups.cpp:1409] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 after 
6.061056ms
I0418 08:25:37.680552 24913 cgroups.cpp:2694] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3
I0418 08:25:37.683346 24913 cgroups.cpp:1438] Successfully thawed cgroup 
/sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 after 
2.46016ms
I0418 08:25:37.874023 24914 cni.cpp:1121] Unmounted the network namespace 
handle 
'/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' for 
container 2b29d6d6-b31
4-477f-b734-7771d07d41e3
I0418 08:25:37.874194 24914 cni.cpp:1132] Removed the container directory 
'/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3'
I0418 08:25:37.877306 24912 linux.cpp:814] Ignoring unmounting sandbox/work 
directory for container 2b29d6d6-b314-477f-b734-7771d07d41e3
I0418 08:25:37.879295 24912 provisioner.cpp:338] Destroying container rootfs at 
'/tmp/mesos/provisioner/containers/2b29d6d6-b314-477f-b734-7771d

[jira] [Commented] (MESOS-5224) buffer overflow error in slave upon processing status update from executor v1 http API

2016-04-17 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244803#comment-15244803
 ] 

Vinod Kone commented on MESOS-5224:
---

Interesting. Looks like the buffer overflow happened inside 
Slave::statusUpdate() when logging the update message?

{code}
Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531cc617a 
mesos::internal::operator<<()
Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d71837 
mesos::internal::slave::Slave::statusUpdate()
{code}

The code for the output stream operator for status update looks like so
{code}
ostream& operator<<(ostream& stream, const StatusUpdate& update)
{
  stream << update.status().state();

  if (update.has_uuid()) {
stream << " (UUID: " << stringify(UUID::fromBytes(update.uuid())) << ")";
  }

  stream << " for task " << update.status().task_id();

  if (update.status().has_healthy()) {
stream << " in health state "
   << (update.status().healthy() ? "healthy" : "unhealthy");
  }

  return stream << " of framework " << update.framework_id();
}
{code}

The one thing that could cause an issue is `UUID::fromBytes()`. How is the UUID 
being set by the HTTP executor?
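
For illustration, one way such an overflow can happen is if the executor sets 
the {{uuid}} field to the textual form of the UUID (36 characters) instead of 
the 16 raw bytes. A minimal sketch of that size mismatch follows; it is not the 
stout implementation, and it assumes a fromBytes-style helper that copies its 
input into a fixed 16-byte buffer (the example UUID string is a placeholder).

{code}
// Minimal sketch, not the stout implementation: shows why passing the
// 36-character textual UUID to a helper that expects 16 raw bytes is a
// problem.
#include <cstdio>
#include <cstring>
#include <string>

int main()
{
  const size_t RAW_UUID_SIZE = 16;

  // Hypothetical example: the textual representation of a UUID (36 chars),
  // not the 16 raw bytes the agent expects.
  const std::string uuid = "00000000-0000-0000-0000-000000000000";

  char raw[RAW_UUID_SIZE];

  if (uuid.size() != RAW_UUID_SIZE) {
    // A fromBytes-style helper that memcpy()s uuid.size() bytes into a
    // 16-byte buffer without this check overflows the buffer, which is the
    // kind of error _FORTIFY_SOURCE aborts on.
    fprintf(stderr,
            "size mismatch: got %zu bytes, expected %zu\n",
            uuid.size(),
            RAW_UUID_SIZE);
    return 1;
  }

  memcpy(raw, uuid.data(), RAW_UUID_SIZE);
  printf("copied %zu raw bytes\n", RAW_UUID_SIZE);
  return 0;
}
{code}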

> buffer overflow error in slave upon processing status update from executor v1 
> http API
> --
>
> Key: MESOS-5224
> URL: https://issues.apache.org/jira/browse/MESOS-5224
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.28.0
> Environment: {code}
> $ dpkg -l|grep -e mesos
> ii  mesos   0.28.0-2.0.16.ubuntu1404 
> amd64Cluster resource manager with efficient resource isolation
> $ uname -a
> Linux node-3 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 
> x86_64 x86_64 x86_64 GNU/Linux
> {code}
>Reporter: James DeFelice
>  Labels: mesosphere
>
> implementing support for executor HTTP v1 API in mesos-go:next and my 
> executor can't send status updates because the slave dies upon receiving 
> them. protobufs generated from 0.28.1
> from syslog:
> {code}
> Apr 17 17:53:53 node-1 mesos-slave[4462]: I0417 17:53:53.121467  4489 
> http.cpp:190] HTTP POST for /slave(1)/api/v1/executor from 10.2.0.5:51800 
> with User-Agent='Go-http-client/1.1'
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** buffer overflow detected ***: 
> /usr/sbin/mesos-slave terminated
> Apr 17 17:53:53 node-1 mesos-slave[4462]: === Backtrace: =
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7fc53064e38f]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7fc5306e5c9c]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(+0x109b60)[0x7fc5306e4b60]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internallsERSoRKNS0_12StatusUpdateE+0x16a)[0x7fc531cc617a]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internal5slave5Slave12statusUpdateENS0_12StatusUpdateERK6OptionIN7process4UPIDEE+0xe7)[0x7fc531d71837]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZNK5mesos8internal5slave5Slave4Http8executorERKN7process4http7RequestE+0xb52)[0x7fc531d302a2]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0xc754a3)[0x7fc531d4d4a3]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0x1295aa8)[0x7fc53236daa8]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN7process14ProcessManager6resumeEPNS_11ProcessBaseE+0x2d1)[0x7fc532375a71]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0x129dd77)[0x7fc532375d77]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1bf0)[0x7fc530e85bf0]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fc5309a8182]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc5306d547d]
> ...
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** Aborted at 1460915633 (unix 
> time) try "date -d @1460915633" if you are using GNU date ***
> Apr 17 17:53:53 node-1 mesos-slave[4462]: PC: @ 0x7fc530611cc9 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** SIGABRT (@0x116e) received by 
> PID 4462 (TID 0x7fc5275f5700) from PID 4462; stack trace: ***
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5309b0340 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc530611cc9 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306150d8 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @  

[jira] [Created] (MESOS-5224) buffer overflow error in slave upon processing status update from executor v1 http API

2016-04-17 Thread James DeFelice (JIRA)
James DeFelice created MESOS-5224:
-

 Summary: buffer overflow error in slave upon processing status 
update from executor v1 http API
 Key: MESOS-5224
 URL: https://issues.apache.org/jira/browse/MESOS-5224
 Project: Mesos
  Issue Type: Bug
  Components: slave
Affects Versions: 0.28.0
 Environment: {code}
$ dpkg -l|grep -e mesos
ii  mesos   0.28.0-2.0.16.ubuntu1404 amd64  
  Cluster resource manager with efficient resource isolation
$ uname -a
Linux node-3 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 
x86_64 x86_64 x86_64 GNU/Linux
{code}
Reporter: James DeFelice


implementing support for executor HTTP v1 API in mesos-go:next and my executor 
can't send status updates because the slave dies upon receiving them. protobufs 
generated from 0.28.1

from syslog:
{code}
Apr 17 17:53:53 node-1 mesos-slave[4462]: I0417 17:53:53.121467  4489 
http.cpp:190] HTTP POST for /slave(1)/api/v1/executor from 10.2.0.5:51800 with 
User-Agent='Go-http-client/1.1'
Apr 17 17:53:53 node-1 mesos-slave[4462]: *** buffer overflow detected ***: 
/usr/sbin/mesos-slave terminated
Apr 17 17:53:53 node-1 mesos-slave[4462]: === Backtrace: =
Apr 17 17:53:53 node-1 mesos-slave[4462]: 
/lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7fc53064e38f]
Apr 17 17:53:53 node-1 mesos-slave[4462]: 
/lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7fc5306e5c9c]
Apr 17 17:53:53 node-1 mesos-slave[4462]: 
/lib/x86_64-linux-gnu/libc.so.6(+0x109b60)[0x7fc5306e4b60]
Apr 17 17:53:53 node-1 mesos-slave[4462]: 
/usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internallsERSoRKNS0_12StatusUpdateE+0x16a)[0x7fc531cc617a]
Apr 17 17:53:53 node-1 mesos-slave[4462]: 
/usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internal5slave5Slave12statusUpdateENS0_12StatusUpdateERK6OptionIN7process4UPIDEE+0xe7)[0x7fc531d71837]
Apr 17 17:53:53 node-1 mesos-slave[4462]: 
/usr/local/lib/libmesos-0.28.0.so(_ZNK5mesos8internal5slave5Slave4Http8executorERKN7process4http7RequestE+0xb52)[0x7fc531d302a2]
Apr 17 17:53:53 node-1 mesos-slave[4462]: 
/usr/local/lib/libmesos-0.28.0.so(+0xc754a3)[0x7fc531d4d4a3]
Apr 17 17:53:53 node-1 mesos-slave[4462]: 
/usr/local/lib/libmesos-0.28.0.so(+0x1295aa8)[0x7fc53236daa8]
Apr 17 17:53:53 node-1 mesos-slave[4462]: 
/usr/local/lib/libmesos-0.28.0.so(_ZN7process14ProcessManager6resumeEPNS_11ProcessBaseE+0x2d1)[0x7fc532375a71]
Apr 17 17:53:53 node-1 mesos-slave[4462]: 
/usr/local/lib/libmesos-0.28.0.so(+0x129dd77)[0x7fc532375d77]
Apr 17 17:53:53 node-1 mesos-slave[4462]: 
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1bf0)[0x7fc530e85bf0]
Apr 17 17:53:53 node-1 mesos-slave[4462]: 
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fc5309a8182]
Apr 17 17:53:53 node-1 mesos-slave[4462]: 
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc5306d547d]
...
Apr 17 17:53:53 node-1 mesos-slave[4462]: *** Aborted at 1460915633 (unix time) 
try "date -d @1460915633" if you are using GNU date ***
Apr 17 17:53:53 node-1 mesos-slave[4462]: PC: @ 0x7fc530611cc9 (unknown)
Apr 17 17:53:53 node-1 mesos-slave[4462]: *** SIGABRT (@0x116e) received by PID 
4462 (TID 0x7fc5275f5700) from PID 4462; stack trace: ***
Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5309b0340 (unknown)
Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc530611cc9 (unknown)
Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306150d8 (unknown)
Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53064e394 (unknown)
Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e5c9c (unknown)
Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e4b60 (unknown)
Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531cc617a 
mesos::internal::operator<<()
Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d71837 
mesos::internal::slave::Slave::statusUpdate()
Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d302a2 
mesos::internal::slave::Slave::Http::executor()
Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d4d4a3 
_ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestEEZN5mesos8internal5slave5Slave10initializeEvEUlS7_E19_E9_M_invokeERKSt9_Any_dataS7_
Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53236daa8 
_ZZN7process11ProcessBase5visitERKNS_9HttpEventEENKUlRKNS_6FutureI6OptionINS_4http14authentication20AuthenticationResultE0_clESC_
Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc532375a71 
process::ProcessManager::resume()
Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc532375d77 
_ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc530e85bf0 (unknown)
Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5309a8182 start_thread

[jira] [Comment Edited] (MESOS-1653) HealthCheckTest.GracePeriod is flaky.

2016-04-17 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244718#comment-15244718
 ] 

haosdent edited comment on MESOS-1653 at 4/17/16 4:41 PM:
--

[~tnachen] After looking at the log [~xujyan] posted: the second {{statusUpdate}} 
arrives nearly 5 seconds after {{14:46:23}}.

{code}
I0909 14:46:23.633633   944 hierarchical_allocator_process.hpp:659] Performed 
allocation for 1 slaves in 61631ns
I0909 14:46:27.799932   947 hierarchical_allocator_process.hpp:659] Performed 
allocation for 1 slaves in 95512ns
I0909 14:46:27.800237   947 master.cpp:120] No whitelist given. Advertising 
offers for all slaves
I0909 14:46:27.800612   947 slave.cpp:2329] Received ping from 
slave-observer(2)@127.0.1.1:47396
tests/health_check_tests.cpp:557: Failure
Failed to wait 10secs for statusHealth
tests/health_check_tests.cpp:539: Failure
Actual function call count doesn't match EXPECT_CALL(sched, 
statusUpdate(&driver, _))...
 Expected: to be called at least twice
   Actual: called once - unsatisfied and active
I0909 14:46:27.815444   928 master.cpp:650] Master terminating
I0909 14:46:27.815640   928 master.hpp:851] Removing task 1 with resources 
cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] on slave 
20140909-144617-16842879-47396-928-0 (lucid)
W0909 14:46:27.815795   928 master.cpp:4419] Removing task 1 of framework 
20140909-144617-16842879-47396-928- and slave 
20140909-144617-16842879-47396-928-0 in non-terminal state TASK_RUNNING
I0909 14:46:27.823565   943 slave.cpp:2361] master@127.0.1.1:47396 exited
W0909 14:46:27.823611   943 slave.cpp:2364] Master disconnected! Waiting for a 
new master to be elected
I0909 14:46:27.828475   943 slave.cpp:2093] Handling status update TASK_RUNNING 
(UUID: 5f53830d-cd08-4c57-be42-33be367d3f01) for task 1 in health state 
unhealthy of framework 20140909-144617-16842879-47396-928- from 
executor(1)@127.0.1.1:52801
{code}

I think we need to add
{code}
@@ -1053,6 +1053,9 @@ TEST_F(HealthCheckTest, DISABLED_GracePeriod)

   driver.launchTasks(offers.get()[0].id(), tasks);

+  AWAIT_READY(statusRunning);
+  EXPECT_EQ(TASK_RUNNING, statusRunning.get().state());
+
   Clock::pause();
{code}
before advancing the clock. Do you think it is OK to add this and re-enable the 
test case?


was (Author: haosd...@gmail.com):
[~tnachen] After saw the log [~xujyan] posted. The second {statusUpdate} is 
nearly 5 seconds delay after {14:46:23}.

{code}
I0909 14:46:23.633633   944 hierarchical_allocator_process.hpp:659] Performed 
allocation for 1 slaves in 61631ns
I0909 14:46:27.799932   947 hierarchical_allocator_process.hpp:659] Performed 
allocation for 1 slaves in 95512ns
I0909 14:46:27.800237   947 master.cpp:120] No whitelist given. Advertising 
offers for all slaves
I0909 14:46:27.800612   947 slave.cpp:2329] Received ping from 
slave-observer(2)@127.0.1.1:47396
tests/health_check_tests.cpp:557: Failure
Failed to wait 10secs for statusHealth
tests/health_check_tests.cpp:539: Failure
Actual function call count doesn't match EXPECT_CALL(sched, 
statusUpdate(&driver, _))...
 Expected: to be called at least twice
   Actual: called once - unsatisfied and active
I0909 14:46:27.815444   928 master.cpp:650] Master terminating
I0909 14:46:27.815640   928 master.hpp:851] Removing task 1 with resources 
cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] on slave 
20140909-144617-16842879-47396-928-0 (lucid)
W0909 14:46:27.815795   928 master.cpp:4419] Removing task 1 of framework 
20140909-144617-16842879-47396-928- and slave 
20140909-144617-16842879-47396-928-0 in non-terminal state TASK_RUNNING
I0909 14:46:27.823565   943 slave.cpp:2361] master@127.0.1.1:47396 exited
W0909 14:46:27.823611   943 slave.cpp:2364] Master disconnected! Waiting for a 
new master to be elected
I0909 14:46:27.828475   943 slave.cpp:2093] Handling status update TASK_RUNNING 
(UUID: 5f53830d-cd08-4c57-be42-33be367d3f01) for task 1 in health state 
unhealthy of framework 20140909-144617-16842879-47396-928- from 
executor(1)@127.0.1.1:52801
{code}

I think we need add a {AWAIT_READY(statusRunning);} before advance clock. Do 
you think it is OK to add this and reenable the test case?

> HealthCheckTest.GracePeriod is flaky.
> -
>
> Key: MESOS-1653
> URL: https://issues.apache.org/jira/browse/MESOS-1653
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Mahler
>Assignee: Timothy Chen
>  Labels: flaky, health-check, mesosphere
>
> {noformat}
> [--] 3 tests from HealthCheckTest
> [ RUN  ] HealthCheckTest.GracePeriod
> Using temporary directory '/tmp/HealthCheckTest_GracePeriod_d7zCPr'
> I0729 17:10:10.484951  1176 leveldb.cpp:176] Opened db in 28.883552ms
> I0729 17:10:10

[jira] [Commented] (MESOS-1653) HealthCheckTest.GracePeriod is flaky.

2016-04-17 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244718#comment-15244718
 ] 

haosdent commented on MESOS-1653:
-

[~tnachen] After looking at the log [~xujyan] posted: the second {{statusUpdate}} 
arrives nearly 5 seconds after {{14:46:23}}.

{code}
I0909 14:46:23.633633   944 hierarchical_allocator_process.hpp:659] Performed 
allocation for 1 slaves in 61631ns
I0909 14:46:27.799932   947 hierarchical_allocator_process.hpp:659] Performed 
allocation for 1 slaves in 95512ns
I0909 14:46:27.800237   947 master.cpp:120] No whitelist given. Advertising 
offers for all slaves
I0909 14:46:27.800612   947 slave.cpp:2329] Received ping from 
slave-observer(2)@127.0.1.1:47396
tests/health_check_tests.cpp:557: Failure
Failed to wait 10secs for statusHealth
tests/health_check_tests.cpp:539: Failure
Actual function call count doesn't match EXPECT_CALL(sched, 
statusUpdate(&driver, _))...
 Expected: to be called at least twice
   Actual: called once - unsatisfied and active
I0909 14:46:27.815444   928 master.cpp:650] Master terminating
I0909 14:46:27.815640   928 master.hpp:851] Removing task 1 with resources 
cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] on slave 
20140909-144617-16842879-47396-928-0 (lucid)
W0909 14:46:27.815795   928 master.cpp:4419] Removing task 1 of framework 
20140909-144617-16842879-47396-928- and slave 
20140909-144617-16842879-47396-928-0 in non-terminal state TASK_RUNNING
I0909 14:46:27.823565   943 slave.cpp:2361] master@127.0.1.1:47396 exited
W0909 14:46:27.823611   943 slave.cpp:2364] Master disconnected! Waiting for a 
new master to be elected
I0909 14:46:27.828475   943 slave.cpp:2093] Handling status update TASK_RUNNING 
(UUID: 5f53830d-cd08-4c57-be42-33be367d3f01) for task 1 in health state 
unhealthy of framework 20140909-144617-16842879-47396-928- from 
executor(1)@127.0.1.1:52801
{code}

I think we need to add an {{AWAIT_READY(statusRunning);}} before advancing the 
clock. Do you think it is OK to add this and re-enable the test case?

> HealthCheckTest.GracePeriod is flaky.
> -
>
> Key: MESOS-1653
> URL: https://issues.apache.org/jira/browse/MESOS-1653
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Mahler
>Assignee: Timothy Chen
>  Labels: flaky, health-check, mesosphere
>
> {noformat}
> [--] 3 tests from HealthCheckTest
> [ RUN  ] HealthCheckTest.GracePeriod
> Using temporary directory '/tmp/HealthCheckTest_GracePeriod_d7zCPr'
> I0729 17:10:10.484951  1176 leveldb.cpp:176] Opened db in 28.883552ms
> I0729 17:10:10.499487  1176 leveldb.cpp:183] Compacted db in 13.674118ms
> I0729 17:10:10.500200  1176 leveldb.cpp:198] Created db iterator in 7394ns
> I0729 17:10:10.500692  1176 leveldb.cpp:204] Seeked to beginning of db in 
> 2317ns
> I0729 17:10:10.501113  1176 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 1367ns
> I0729 17:10:10.501535  1176 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0729 17:10:10.502233  1212 recover.cpp:425] Starting replica recovery
> I0729 17:10:10.502295  1212 recover.cpp:451] Replica is in EMPTY status
> I0729 17:10:10.502825  1212 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0729 17:10:10.502877  1212 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0729 17:10:10.502980  1212 recover.cpp:542] Updating replica status to 
> STARTING
> I0729 17:10:10.508482  1213 master.cpp:289] Master 
> 20140729-171010-16842879-54701-1176 (trusty) started on 127.0.1.1:54701
> I0729 17:10:10.508607  1213 master.cpp:326] Master only allowing 
> authenticated frameworks to register
> I0729 17:10:10.508632  1213 master.cpp:331] Master only allowing 
> authenticated slaves to register
> I0729 17:10:10.508656  1213 credentials.hpp:36] Loading credentials for 
> authentication from '/tmp/HealthCheckTest_GracePeriod_d7zCPr/credentials'
> I0729 17:10:10.509407  1213 master.cpp:360] Authorization enabled
> I0729 17:10:10.510030  1207 hierarchical_allocator_process.hpp:301] 
> Initializing hierarchical allocator process with master : 
> master@127.0.1.1:54701
> I0729 17:10:10.510113  1207 master.cpp:123] No whitelist given. Advertising 
> offers for all slaves
> I0729 17:10:10.511699  1213 master.cpp:1129] The newly elected leader is 
> master@127.0.1.1:54701 with id 20140729-171010-16842879-54701-1176
> I0729 17:10:10.512230  1213 master.cpp:1142] Elected as the leading master!
> I0729 17:10:10.512692  1213 master.cpp:960] Recovering from registrar
> I0729 17:10:10.513226  1210 registrar.cpp:313] Recovering registrar
> I0729 17:10:10.516006  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 12.946461ms
> I0729 17:10

[jira] [Assigned] (MESOS-1802) HealthCheckTest.HealthStatusChange is flaky on jenkins.

2016-04-17 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent reassigned MESOS-1802:
---

Assignee: haosdent

> HealthCheckTest.HealthStatusChange is flaky on jenkins.
> ---
>
> Key: MESOS-1802
> URL: https://issues.apache.org/jira/browse/MESOS-1802
> Project: Mesos
>  Issue Type: Bug
>  Components: test, tests
>Affects Versions: 0.26.0
>Reporter: Benjamin Mahler
>Assignee: haosdent
>  Labels: flaky, health-check, mesosphere
> Attachments: health_check_flaky_test_log.txt
>
>
> https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2374/consoleFull
> {noformat}
> [ RUN  ] HealthCheckTest.HealthStatusChange
> Using temporary directory '/tmp/HealthCheckTest_HealthStatusChange_IYnlu2'
> I0916 22:56:14.034612 21026 leveldb.cpp:176] Opened db in 2.155713ms
> I0916 22:56:14.034965 21026 leveldb.cpp:183] Compacted db in 332489ns
> I0916 22:56:14.034984 21026 leveldb.cpp:198] Created db iterator in 3710ns
> I0916 22:56:14.034996 21026 leveldb.cpp:204] Seeked to beginning of db in 
> 642ns
> I0916 22:56:14.035006 21026 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 343ns
> I0916 22:56:14.035023 21026 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0916 22:56:14.035200 21054 recover.cpp:425] Starting replica recovery
> I0916 22:56:14.035403 21041 recover.cpp:451] Replica is in EMPTY status
> I0916 22:56:14.035888 21045 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0916 22:56:14.035969 21052 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0916 22:56:14.036118 21042 recover.cpp:542] Updating replica status to 
> STARTING
> I0916 22:56:14.036603 21046 master.cpp:286] Master 
> 20140916-225614-3125920579-47865-21026 (penates.apache.org) started on 
> 67.195.81.186:47865
> I0916 22:56:14.036634 21046 master.cpp:332] Master only allowing 
> authenticated frameworks to register
> I0916 22:56:14.036648 21046 master.cpp:337] Master only allowing 
> authenticated slaves to register
> I0916 22:56:14.036659 21046 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/HealthCheckTest_HealthStatusChange_IYnlu2/credentials'
> I0916 22:56:14.036686 21045 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 480322ns
> I0916 22:56:14.036700 21045 replica.cpp:320] Persisted replica status to 
> STARTING
> I0916 22:56:14.036769 21046 master.cpp:366] Authorization enabled
> I0916 22:56:14.036826 21045 recover.cpp:451] Replica is in STARTING status
> I0916 22:56:14.036944 21052 master.cpp:120] No whitelist given. Advertising 
> offers for all slaves
> I0916 22:56:14.036968 21049 hierarchical_allocator_process.hpp:299] 
> Initializing hierarchical allocator process with master : 
> master@67.195.81.186:47865
> I0916 22:56:14.037284 21054 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0916 22:56:14.037312 21046 master.cpp:1212] The newly elected leader is 
> master@67.195.81.186:47865 with id 20140916-225614-3125920579-47865-21026
> I0916 22:56:14.037333 21046 master.cpp:1225] Elected as the leading master!
> I0916 22:56:14.037345 21046 master.cpp:1043] Recovering from registrar
> I0916 22:56:14.037504 21040 registrar.cpp:313] Recovering registrar
> I0916 22:56:14.037505 21053 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0916 22:56:14.037681 21047 recover.cpp:542] Updating replica status to VOTING
> I0916 22:56:14.038072 21052 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 330251ns
> I0916 22:56:14.038087 21052 replica.cpp:320] Persisted replica status to 
> VOTING
> I0916 22:56:14.038127 21053 recover.cpp:556] Successfully joined the Paxos 
> group
> I0916 22:56:14.038202 21053 recover.cpp:440] Recover process terminated
> I0916 22:56:14.038364 21048 log.cpp:656] Attempting to start the writer
> I0916 22:56:14.038812 21053 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0916 22:56:14.038925 21053 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 92623ns
> I0916 22:56:14.038944 21053 replica.cpp:342] Persisted promised to 1
> I0916 22:56:14.039201 21052 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0916 22:56:14.039676 21047 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0916 22:56:14.039836 21047 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 144215ns
> I0916 22:56:14.039850 21047 replica.cpp:676] Persisted action at 0
> I0916 22:56:14.040243 21047 replica.cpp:508] Replica received write request 
> for position 0
> I091

[jira] [Commented] (MESOS-2331) MasterSlaveReconciliationTest.ReconcileRace is flaky

2016-04-17 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244665#comment-15244665
 ] 

haosdent commented on MESOS-2331:
-

I think we could add a {{settle()}} before {{DROP_PROTOBUFS}}

{code}
diff --git a/src/tests/master_slave_reconciliation_tests.cpp 
b/src/tests/master_slave_reconciliation_tests.cpp
index 71fb78a..833c3c0 100644
--- a/src/tests/master_slave_reconciliation_tests.cpp
+++ b/src/tests/master_slave_reconciliation_tests.cpp
@@ -295,6 +295,11 @@ TEST_F(MasterSlaveReconciliationTest, ReconcileRace)

   driver.start();

+  // Make sure all `SlaveRegisteredMessage` have been handled by agent.
+  Clock::pause();
+  Clock::settle();
+  Clock::resume();
+
   // Trigger a re-registration of the slave and capture the message
   // so that we can spoof a race with a launch task message.
   DROP_PROTOBUFS(ReregisterSlaveMessage(), slave.get()->pid, 
master.get()->pid);
{code}

However, I could not reproduce it in my environment and have only seen it in 
ReviewBot, so I am not sure whether this approach works.

> MasterSlaveReconciliationTest.ReconcileRace is flaky
> 
>
> Key: MESOS-2331
> URL: https://issues.apache.org/jira/browse/MESOS-2331
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.22.0
>Reporter: Yan Xu
>Assignee: Qian Zhang
>  Labels: flaky
>
> {noformat:title=}
> [ RUN  ] MasterSlaveReconciliationTest.ReconcileRace
> Using temporary directory 
> '/tmp/MasterSlaveReconciliationTest_ReconcileRace_NE9nhV'
> I0206 19:09:44.196542 32362 leveldb.cpp:175] Opened db in 38.230192ms
> I0206 19:09:44.206826 32362 leveldb.cpp:182] Compacted db in 9.988493ms
> I0206 19:09:44.207164 32362 leveldb.cpp:197] Created db iterator in 29979ns
> I0206 19:09:44.207641 32362 leveldb.cpp:203] Seeked to beginning of db in 
> 4478ns
> I0206 19:09:44.207929 32362 leveldb.cpp:272] Iterated through 0 keys in the 
> db in 737ns
> I0206 19:09:44.208222 32362 replica.cpp:743] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0206 19:09:44.209132 32384 recover.cpp:448] Starting replica recovery
> I0206 19:09:44.209524 32384 recover.cpp:474] Replica is in EMPTY status
> I0206 19:09:44.211094 32384 replica.cpp:640] Replica in EMPTY status received 
> a broadcasted recover request
> I0206 19:09:44.211385 32384 recover.cpp:194] Received a recover response from 
> a replica in EMPTY status
> I0206 19:09:44.211902 32384 recover.cpp:565] Updating replica status to 
> STARTING
> I0206 19:09:44.236177 32381 master.cpp:344] Master 
> 20150206-190944-16842879-36452-32362 (lucid) started on 127.0.1.1:36452
> I0206 19:09:44.236291 32381 master.cpp:390] Master only allowing 
> authenticated frameworks to register
> I0206 19:09:44.236305 32381 master.cpp:395] Master only allowing 
> authenticated slaves to register
> I0206 19:09:44.236327 32381 credentials.hpp:35] Loading credentials for 
> authentication from 
> '/tmp/MasterSlaveReconciliationTest_ReconcileRace_NE9nhV/credentials'
> I0206 19:09:44.236601 32381 master.cpp:439] Authorization enabled
> I0206 19:09:44.238539 32381 hierarchical_allocator_process.hpp:284] 
> Initialized hierarchical allocator process
> I0206 19:09:44.238662 32381 whitelist_watcher.cpp:64] No whitelist given
> I0206 19:09:44.239364 32381 master.cpp:1350] The newly elected leader is 
> master@127.0.1.1:36452 with id 20150206-190944-16842879-36452-32362
> I0206 19:09:44.239392 32381 master.cpp:1363] Elected as the leading master!
> I0206 19:09:44.239413 32381 master.cpp:1181] Recovering from registrar
> I0206 19:09:44.239645 32381 registrar.cpp:312] Recovering registrar
> I0206 19:09:44.241142 32384 leveldb.cpp:305] Persisting metadata (8 bytes) to 
> leveldb took 29.029117ms
> I0206 19:09:44.241189 32384 replica.cpp:322] Persisted replica status to 
> STARTING
> I0206 19:09:44.241478 32384 recover.cpp:474] Replica is in STARTING status
> I0206 19:09:44.243075 32384 replica.cpp:640] Replica in STARTING status 
> received a broadcasted recover request
> I0206 19:09:44.243398 32384 recover.cpp:194] Received a recover response from 
> a replica in STARTING status
> I0206 19:09:44.243964 32384 recover.cpp:565] Updating replica status to VOTING
> I0206 19:09:44.255692 32384 leveldb.cpp:305] Persisting metadata (8 bytes) to 
> leveldb took 11.502759ms
> I0206 19:09:44.255765 32384 replica.cpp:322] Persisted replica status to 
> VOTING
> I0206 19:09:44.256009 32384 recover.cpp:579] Successfully joined the Paxos 
> group
> I0206 19:09:44.256253 32384 recover.cpp:463] Recover process terminated
> I0206 19:09:44.257669 32384 log.cpp:659] Attempting to start the writer
> I0206 19:09:44.259944 32377 replica.cpp:476] Replica received implicit 
> promise request with proposal 1
> I0206 19:09:44.268805 32377

[jira] [Commented] (MESOS-2331) MasterSlaveReconciliationTest.ReconcileRace is flaky

2016-04-17 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244619#comment-15244619
 ] 

haosdent commented on MESOS-2331:
-

Compared to a normal run's log, the cause of the flakiness is:
{code}
I0417 08:09:37.556551 31925 master.cpp:4580] Registered agent 
07f7917f-63d1-40d4-b983-4f0eb5c18f3d-S0 at slave(141)@172.17.0.1:35480 
(95302125b116) with cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000]
I0417 08:09:37.557147 31925 master.cpp:4482] Agent 
07f7917f-63d1-40d4-b983-4f0eb5c18f3d-S0 at slave(141)@172.17.0.1:35480 
(95302125b116) already registered, resending acknowledgement
{code}

The Mesos master resends {{SlaveRegisteredMessage}}, which causes the Mesos agent 
to complete registration successfully, so it never needs to re-register.

> MasterSlaveReconciliationTest.ReconcileRace is flaky
> 
>
> Key: MESOS-2331
> URL: https://issues.apache.org/jira/browse/MESOS-2331
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.22.0
>Reporter: Yan Xu
>Assignee: Qian Zhang
>  Labels: flaky
>
> {noformat:title=}
> [ RUN  ] MasterSlaveReconciliationTest.ReconcileRace
> Using temporary directory 
> '/tmp/MasterSlaveReconciliationTest_ReconcileRace_NE9nhV'
> I0206 19:09:44.196542 32362 leveldb.cpp:175] Opened db in 38.230192ms
> I0206 19:09:44.206826 32362 leveldb.cpp:182] Compacted db in 9.988493ms
> I0206 19:09:44.207164 32362 leveldb.cpp:197] Created db iterator in 29979ns
> I0206 19:09:44.207641 32362 leveldb.cpp:203] Seeked to beginning of db in 
> 4478ns
> I0206 19:09:44.207929 32362 leveldb.cpp:272] Iterated through 0 keys in the 
> db in 737ns
> I0206 19:09:44.208222 32362 replica.cpp:743] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0206 19:09:44.209132 32384 recover.cpp:448] Starting replica recovery
> I0206 19:09:44.209524 32384 recover.cpp:474] Replica is in EMPTY status
> I0206 19:09:44.211094 32384 replica.cpp:640] Replica in EMPTY status received 
> a broadcasted recover request
> I0206 19:09:44.211385 32384 recover.cpp:194] Received a recover response from 
> a replica in EMPTY status
> I0206 19:09:44.211902 32384 recover.cpp:565] Updating replica status to 
> STARTING
> I0206 19:09:44.236177 32381 master.cpp:344] Master 
> 20150206-190944-16842879-36452-32362 (lucid) started on 127.0.1.1:36452
> I0206 19:09:44.236291 32381 master.cpp:390] Master only allowing 
> authenticated frameworks to register
> I0206 19:09:44.236305 32381 master.cpp:395] Master only allowing 
> authenticated slaves to register
> I0206 19:09:44.236327 32381 credentials.hpp:35] Loading credentials for 
> authentication from 
> '/tmp/MasterSlaveReconciliationTest_ReconcileRace_NE9nhV/credentials'
> I0206 19:09:44.236601 32381 master.cpp:439] Authorization enabled
> I0206 19:09:44.238539 32381 hierarchical_allocator_process.hpp:284] 
> Initialized hierarchical allocator process
> I0206 19:09:44.238662 32381 whitelist_watcher.cpp:64] No whitelist given
> I0206 19:09:44.239364 32381 master.cpp:1350] The newly elected leader is 
> master@127.0.1.1:36452 with id 20150206-190944-16842879-36452-32362
> I0206 19:09:44.239392 32381 master.cpp:1363] Elected as the leading master!
> I0206 19:09:44.239413 32381 master.cpp:1181] Recovering from registrar
> I0206 19:09:44.239645 32381 registrar.cpp:312] Recovering registrar
> I0206 19:09:44.241142 32384 leveldb.cpp:305] Persisting metadata (8 bytes) to 
> leveldb took 29.029117ms
> I0206 19:09:44.241189 32384 replica.cpp:322] Persisted replica status to 
> STARTING
> I0206 19:09:44.241478 32384 recover.cpp:474] Replica is in STARTING status
> I0206 19:09:44.243075 32384 replica.cpp:640] Replica in STARTING status 
> received a broadcasted recover request
> I0206 19:09:44.243398 32384 recover.cpp:194] Received a recover response from 
> a replica in STARTING status
> I0206 19:09:44.243964 32384 recover.cpp:565] Updating replica status to VOTING
> I0206 19:09:44.255692 32384 leveldb.cpp:305] Persisting metadata (8 bytes) to 
> leveldb took 11.502759ms
> I0206 19:09:44.255765 32384 replica.cpp:322] Persisted replica status to 
> VOTING
> I0206 19:09:44.256009 32384 recover.cpp:579] Successfully joined the Paxos 
> group
> I0206 19:09:44.256253 32384 recover.cpp:463] Recover process terminated
> I0206 19:09:44.257669 32384 log.cpp:659] Attempting to start the writer
> I0206 19:09:44.259944 32377 replica.cpp:476] Replica received implicit 
> promise request with proposal 1
> I0206 19:09:44.268805 32377 leveldb.cpp:305] Persisting metadata (8 bytes) to 
> leveldb took 8.45858ms
> I0206 19:09:44.269067 32377 replica.cpp:344] Persisted promised to 1
> I0206 19:09:44.277974 32383 coordinator.cpp:229] Coordinator attemping to 
> fill missing position
> I0206 19:09:44.279767 32383 replica.cpp:377] Repli

[jira] [Commented] (MESOS-2331) MasterSlaveReconciliationTest.ReconcileRace is flaky

2016-04-17 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244596#comment-15244596
 ] 

haosdent commented on MESOS-2331:
-

Saw this again:
{code}
[ RUN  ] MasterSlaveReconciliationTest.ReconcileRace
I0417 08:09:37.011265 31901 cluster.cpp:149] Creating default 'local' authorizer
I0417 08:09:37.086580 31901 leveldb.cpp:174] Opened db in 74.882317ms
I0417 08:09:37.103621 31901 leveldb.cpp:181] Compacted db in 16.92606ms
I0417 08:09:37.103744 31901 leveldb.cpp:196] Created db iterator in 32846ns
I0417 08:09:37.103762 31901 leveldb.cpp:202] Seeked to beginning of db in 3615ns
I0417 08:09:37.103775 31901 leveldb.cpp:271] Iterated through 0 keys in the db 
in 250ns
I0417 08:09:37.103832 31901 replica.cpp:779] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I0417 08:09:37.104671 31931 recover.cpp:447] Starting replica recovery
I0417 08:09:37.105304 31931 recover.cpp:473] Replica is in EMPTY status
I0417 08:09:37.106678 31934 replica.cpp:673] Replica in EMPTY status received a 
broadcasted recover request from (6118)@172.17.0.1:35480
I0417 08:09:37.107188 31929 recover.cpp:193] Received a recover response from a 
replica in EMPTY status
I0417 08:09:37.108885 31934 recover.cpp:564] Updating replica status to STARTING
I0417 08:09:37.111217 31922 master.cpp:382] Master 
07f7917f-63d1-40d4-b983-4f0eb5c18f3d (95302125b116) started on 172.17.0.1:35480
I0417 08:09:37.111249 31922 master.cpp:384] Flags at startup: --acls="" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate="true" --authenticate_http="true" 
--authenticate_http_frameworks="true" --authenticate_slaves="true" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/Wdw9Iq/credentials" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_slave_ping_timeouts="5" --quiet="false" 
--recovery_slave_removal_limit="100%" --registry="replicated_log" 
--registry_fetch_timeout="1mins" --registry_store_timeout="100secs" 
--registry_strict="true" --root_submissions="true" 
--slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" 
--user_sorter="drf" --version="false" 
--webui_dir="/mesos/mesos-0.29.0/_inst/share/mesos/webui" 
--work_dir="/tmp/Wdw9Iq/master" --zk_session_timeout="10secs"
I0417 08:09:37.111726 31922 master.cpp:433] Master only allowing authenticated 
frameworks to register
I0417 08:09:37.111738 31922 master.cpp:439] Master only allowing authenticated 
agents to register
I0417 08:09:37.111747 31922 master.cpp:445] Master only allowing authenticated 
HTTP frameworks to register
I0417 08:09:37.111755 31922 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/Wdw9Iq/credentials'
I0417 08:09:37.112149 31922 master.cpp:489] Using default 'crammd5' 
authenticator
I0417 08:09:37.112300 31922 master.cpp:560] Using default 'basic' HTTP 
authenticator
I0417 08:09:37.112460 31922 master.cpp:640] Using default 'basic' HTTP 
framework authenticator
I0417 08:09:37.112573 31922 master.cpp:687] Authorization enabled
I0417 08:09:37.112798 31931 hierarchical.cpp:142] Initialized hierarchical 
allocator process
I0417 08:09:37.112861 31931 whitelist_watcher.cpp:77] No whitelist given
I0417 08:09:37.122642 31921 master.cpp:1932] The newly elected leader is 
master@172.17.0.1:35480 with id 07f7917f-63d1-40d4-b983-4f0eb5c18f3d
I0417 08:09:37.122709 31921 master.cpp:1945] Elected as the leading master!
I0417 08:09:37.122732 31921 master.cpp:1632] Recovering from registrar
I0417 08:09:37.123011 31921 registrar.cpp:331] Recovering registrar
I0417 08:09:37.137696 31929 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 28.65638ms
I0417 08:09:37.137791 31929 replica.cpp:320] Persisted replica status to 
STARTING
I0417 08:09:37.138139 31921 recover.cpp:473] Replica is in STARTING status
I0417 08:09:37.139683 31929 replica.cpp:673] Replica in STARTING status 
received a broadcasted recover request from (6121)@172.17.0.1:35480
I0417 08:09:37.139957 31935 recover.cpp:193] Received a recover response from a 
replica in STARTING status
I0417 08:09:37.140836 31928 recover.cpp:564] Updating replica status to VOTING
I0417 08:09:37.161991 31928 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 20.949493ms
I0417 08:09:37.162083 31928 replica.cpp:320] Persisted replica status to VOTING
I0417 08:09:37.162320 31935 recover.cpp:578] Successfully joined the Paxos group
I0417 08:09:37.162582 31935 recover.cpp:462] Recover process terminated
I0417 08:09:37.163247 31923 log.cpp:659] Attempting to start the writer
I0417 08:09:37.165011 31923 replica.cpp:493] Replica received implicit promise 
req

[jira] [Updated] (MESOS-3567) Support TCP checks in Mesos health check program

2016-04-17 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-3567:

Labels: Mesosphere health-check  (was: Mesosphere)

> Support TCP checks in Mesos health check program
> 
>
> Key: MESOS-3567
> URL: https://issues.apache.org/jira/browse/MESOS-3567
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Matthias Veit
>Assignee: haosdent
>  Labels: Mesosphere, health-check
>
> In Marathon we have the ability to specify Health Checks for:
> - Command (Mesos supports this)
> - HTTP (see progress in MESOS-2533)
> - TCP missing
> See here for reference: 
> https://mesosphere.github.io/marathon/docs/health-checks.html
> Since we made good experiences with those 3 options in Marathon, I see a lot 
> of value, if Mesos would also support them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2533) Support HTTP checks in Mesos health check program

2016-04-17 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244589#comment-15244589
 ] 

haosdent commented on MESOS-2533:
-

[~alexr] I have updated https://reviews.apache.org/r/36816/, would you please 
review it at your convenience? Thank you in advance.

> Support HTTP checks in Mesos health check program
> -
>
> Key: MESOS-2533
> URL: https://issues.apache.org/jira/browse/MESOS-2533
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Niklas Quarfot Nielsen
>Assignee: haosdent
>  Labels: health-check, mesosphere
>
> Currently, only commands are supported but our health check protobuf enables 
> users to encode HTTP checks as well. We should wire up this in the health 
> check program or remove the http field from the protobuf.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)