[jira] [Commented] (MESOS-3367) Mesos fetcher does not extract archives for URI with parameters
[ https://issues.apache.org/jira/browse/MESOS-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245208#comment-15245208 ]

haosdent commented on MESOS-3367:
---------------------------------

Got it. I think MESOS-4735 is a better approach; let me close this. Feel free to reopen it if you think it is still necessary.

> Mesos fetcher does not extract archives for URI with parameters
> ---------------------------------------------------------------
>
>                 Key: MESOS-3367
>                 URL: https://issues.apache.org/jira/browse/MESOS-3367
>             Project: Mesos
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.22.1, 0.23.0
>        Environment: DCOS 1.1
>            Reporter: Renat Zubairov
>            Assignee: haosdent
>            Priority: Minor
>              Labels: mesosphere
>
> I'm deploying Marathon applications with sources served from S3. I'm using a
> signed URL to give only temporary access to the S3 resources, so the URL of
> the resource has some query parameters.
> The URI is therefore 'https://foo.com/file.tgz?hasi', and the fetcher stores it
> in a file named 'file.tgz?hasi'. It then concludes that the extension 'hasi' is
> not tgz, so extraction is skipped, despite the MIME type of the HTTP resource
> being 'application/x-tar'.
> Workaround: add an additional parameter like '&workaround=.tgz'

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
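The failure mode in the ticket above is that the query string becomes part of the saved filename, so the suffix-based extraction check misses. A minimal sketch of deriving the filename from the URI's path component only (Python for illustration; the Mesos fetcher itself is C++, and the suffix list here is an assumption, not the fetcher's actual list):

```python
from os.path import basename
from urllib.parse import urlsplit

def fetch_filename(uri: str) -> str:
    """Derive the local filename from the URI's path, ignoring the query string."""
    return basename(urlsplit(uri).path)

def looks_extractable(filename: str) -> bool:
    """Suffix-based check, similar in spirit to the fetcher's extraction logic."""
    return filename.endswith((".tgz", ".tar.gz", ".tbz2", ".tar.bz2", ".zip"))

# 'file.tgz?hasi' fails the suffix check; the query-stripped name passes.
name = fetch_filename("https://foo.com/file.tgz?hasi")
print(name)                      # file.tgz
print(looks_extractable(name))   # True
```

With the query string excluded, the signed-URL workaround (`&workaround=.tgz`) would no longer be needed.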
[jira] [Commented] (MESOS-5224) buffer overflow error in slave upon processing status update from executor v1 http API
[ https://issues.apache.org/jira/browse/MESOS-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245206#comment-15245206 ]

James DeFelice commented on MESOS-5224:
---------------------------------------

Here's the JSON of an update that's "rejected" by the slave. I don't know if this is THE update that's crashing the slave, but it seems likely, since the connection is dropped and I see an EOF on the executor. All of the updates are generated the exact same way (via https://github.com/mesos/mesos-go/blob/executor_proto/cmd/example-executor/main.go#L208).

{code}
{
  "executor_id": {"value": "default"},
  "framework_id": {"value": "ad9e5972-8b5e-4042-b97f-ecc36f2c046f-0011"},
  "type": "UPDATE",
  "update": {
    "status": {
      "task_id": {"value": "1"},
      "state": "TASK_RUNNING",
      "source": "SOURCE_EXECUTOR",
      "executor_id": {"value": "default"},
      "uuid": "ZTZlZTRlNmMtNzE0Ni00NTAwLWJkZWYtNDc0Yzk2MWNmNGU4" // base64-decoded: e6ee4e6c-7146-4500-bdef-474c961cf4e8
    }
  }
}
{code}

> buffer overflow error in slave upon processing status update from executor v1
> http API
> ------------------------------------------------------------------------------
>
>                 Key: MESOS-5224
>                 URL: https://issues.apache.org/jira/browse/MESOS-5224
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.28.0
>        Environment: {code}
> $ dpkg -l | grep -e mesos
> ii mesos 0.28.0-2.0.16.ubuntu1404 amd64 Cluster resource manager with efficient resource isolation
> $ uname -a
> Linux node-3 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
> {code}
>            Reporter: James DeFelice
>            Assignee: Klaus Ma
>              Labels: mesosphere
>
> Implementing support for the executor HTTP v1 API in mesos-go:next; my
> executor can't send status updates because the slave dies upon receiving
> them.
protobufs generated from 0.28.1 > from syslog: > {code} > Apr 17 17:53:53 node-1 mesos-slave[4462]: I0417 17:53:53.121467 4489 > http.cpp:190] HTTP POST for /slave(1)/api/v1/executor from 10.2.0.5:51800 > with User-Agent='Go-http-client/1.1' > Apr 17 17:53:53 node-1 mesos-slave[4462]: *** buffer overflow detected ***: > /usr/sbin/mesos-slave terminated > Apr 17 17:53:53 node-1 mesos-slave[4462]: === Backtrace: = > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7fc53064e38f] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7fc5306e5c9c] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(+0x109b60)[0x7fc5306e4b60] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internallsERSoRKNS0_12StatusUpdateE+0x16a)[0x7fc531cc617a] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internal5slave5Slave12statusUpdateENS0_12StatusUpdateERK6OptionIN7process4UPIDEE+0xe7)[0x7fc531d71837] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(_ZNK5mesos8internal5slave5Slave4Http8executorERKN7process4http7RequestE+0xb52)[0x7fc531d302a2] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(+0xc754a3)[0x7fc531d4d4a3] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(+0x1295aa8)[0x7fc53236daa8] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(_ZN7process14ProcessManager6resumeEPNS_11ProcessBaseE+0x2d1)[0x7fc532375a71] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(+0x129dd77)[0x7fc532375d77] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1bf0)[0x7fc530e85bf0] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fc5309a8182] > Apr 17 17:53:53 node-1 
mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc5306d547d] > ... > Apr 17 17:53:53 node-1 mesos-slave[4462]: *** Aborted at 1460915633 (unix > time) try "date -d @1460915633" if you are using GNU date *** > Apr 17 17:53:53 node-1 mesos-slave[4462]: PC: @ 0x7fc530611cc9 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: *** SIGABRT (@0x116e) received by > PID 4462 (TID 0x7fc5275f5700) from PID 4462; stack trace: *** > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5309b0340 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc530611cc9 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306150d8 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53064e394 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e5c9c (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e4b60 (unknown)
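The `uuid` field in the JSON above can be checked directly: decoding the base64 value yields the textual, 36-character form of the UUID rather than the 16-byte binary form. Whether that size mismatch is the actual trigger of the overflow is not established in the ticket; this sketch only reproduces the decode noted in the comment:

```python
import base64

# "uuid" value from the status update JSON in the comment above.
encoded = "ZTZlZTRlNmMtNzE0Ni00NTAwLWJkZWYtNDc0Yzk2MWNmNGU4"
decoded = base64.b64decode(encoded)

print(decoded.decode("ascii"))  # e6ee4e6c-7146-4500-bdef-474c961cf4e8
print(len(decoded))             # 36 -- textual UUID, not a 16-byte binary UUID
```

A parser that assumes a fixed 16-byte UUID payload would mis-handle a 36-byte value like this one.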
[jira] [Commented] (MESOS-4705) Slave failed to sample container with perf event
[ https://issues.apache.org/jira/browse/MESOS-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245167#comment-15245167 ]

Fan Du commented on MESOS-4705:
-------------------------------

[~haosd...@gmail.com] [~bmahler] I have elaborated on the comments; please review again: https://reviews.apache.org/r/44379/ Thanks a lot!

> Slave failed to sample container with perf event
> ------------------------------------------------
>
>                 Key: MESOS-4705
>                 URL: https://issues.apache.org/jira/browse/MESOS-4705
>             Project: Mesos
>          Issue Type: Bug
>          Components: cgroups, isolation
>    Affects Versions: 0.27.1
>            Reporter: Fan Du
>            Assignee: Fan Du
>
> When sampling a container with perf event on CentOS 7 with kernel
> 3.10.0-123.el7.x86_64, the slave complained with the error below:
> {code}
> E0218 16:32:00.591181 8376 perf_event.cpp:408] Failed to get perf sample:
> Failed to parse perf sample: Failed to parse perf sample line
> '25871993253,,cycles,mesos/5f23ffca-87ed-4ff6-84f2-6ec3d4098ab8,10059827422,100.00':
> Unexpected number of fields
> {code}
> It is caused by the current perf format [assumption |
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob;f=src/linux/perf.cpp;h=1c113a2b3f57877e132bbd65e01fb2f045132128;hb=HEAD#l430],
> which does not hold for kernel versions below 3.12.
> On the 3.10.0-123.el7.x86_64 kernel, the format has 6 tokens, as below:
> value,unit,event,cgroup,running,ratio
> A local modification fixed this error on my test bed; please review this
> ticket.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
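The failing line in the ticket above has six comma-separated fields, whose meanings (value, unit, event, cgroup, running, ratio) are taken from the ticket itself. An illustrative tolerant parser in Python (not the perf.cpp implementation; the three-field fallback is an assumption about the older format, used here only to show the shape of a fix):

```python
def parse_perf_line(line):
    """Parse one CSV line of 'perf stat -x,' output, accepting both field counts."""
    fields = line.split(",")
    if len(fields) == 6:
        # Format observed on kernel 3.10.0-123.el7.x86_64 (per the ticket).
        value, unit, event, cgroup, running, ratio = fields
        return {"value": value, "unit": unit, "event": event,
                "cgroup": cgroup, "running": running, "ratio": ratio}
    if len(fields) == 3:
        # Hypothetical shorter format: value,event,cgroup.
        value, event, cgroup = fields
        return {"value": value, "event": event, "cgroup": cgroup}
    raise ValueError("Unexpected number of fields: %d" % len(fields))

sample = "25871993253,,cycles,mesos/5f23ffca-87ed-4ff6-84f2-6ec3d4098ab8,10059827422,100.00"
print(parse_perf_line(sample)["event"])  # cycles
```

Dispatching on the field count, rather than asserting one fixed count, is what keeps the parser working across kernel variants.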
[jira] [Commented] (MESOS-4735) CommandInfo.URI should allow specifying target filename
[ https://issues.apache.org/jira/browse/MESOS-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245149#comment-15245149 ]

Erik Weathers commented on MESOS-4735:
--------------------------------------

[~mrbrowning] & [~vinodkone]: awesome, thanks so much for fixing this!

> CommandInfo.URI should allow specifying target filename
> -------------------------------------------------------
>
>                 Key: MESOS-4735
>                 URL: https://issues.apache.org/jira/browse/MESOS-4735
>             Project: Mesos
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Erik Weathers
>            Assignee: Michael Browning
>            Priority: Minor
>             Fix For: 0.29.0
>
> The {{CommandInfo.URI}} message should allow explicitly choosing the
> downloaded file's name, to better mimic functionality present in tools like
> {{wget}} and {{curl}}.
> This relates to issues where the {{CommandInfo.URI}} points to a URL that
> has query parameters at the end of the path, resulting in a downloaded
> filename containing those elements. This also prevents extraction of such
> files, since the extraction logic simply looks at the file's suffix. See
> MESOS-3367, MESOS-1686, and MESOS-1509 for more info. If this issue were
> fixed, I could work around the other issues not being fixed by modifying
> my framework's scheduler to set the target filename.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
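The semantics requested above can be sketched in a few lines: an explicit target filename, when present, overrides the name derived from the URI tail, which otherwise drags the query string along. This is an illustration of the intended behavior, not the Mesos fetcher code, and the `output_file` parameter name is only a stand-in for whatever field the actual patch adds:

```python
from os.path import basename

def effective_filename(uri, output_file=None):
    """Filename the fetcher would save to: an explicit target (the feature this
    ticket requests) wins; otherwise the name comes from the URI tail, query
    string and all (the behavior MESOS-3367 complains about)."""
    if output_file is not None:
        return output_file
    return basename(uri)  # naive: keeps '?hasi'

print(effective_filename("https://foo.com/file.tgz?hasi"))              # file.tgz?hasi
print(effective_filename("https://foo.com/file.tgz?hasi", "file.tgz"))  # file.tgz
```

With the override in place, a scheduler can always hand the extraction logic a clean suffix, regardless of signed-URL query parameters.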
[jira] [Commented] (MESOS-3367) Mesos fetcher does not extract archives for URI with parameters
[ https://issues.apache.org/jira/browse/MESOS-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245147#comment-15245147 ]

Erik Weathers commented on MESOS-3367:
--------------------------------------

[~haosd...@gmail.com]: yup, that looks good to me! It should suffice from my perspective. Not sure about [~bernd-mesos] & [~zubairov], though.
[jira] [Assigned] (MESOS-5225) Command executor can not start when joining a CNI network
[ https://issues.apache.org/jira/browse/MESOS-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Avinash Sridharan reassigned MESOS-5225: Assignee: Avinash Sridharan (was: Qian Zhang) > Command executor can not start when joining a CNI network > - > > Key: MESOS-5225 > URL: https://issues.apache.org/jira/browse/MESOS-5225 > Project: Mesos > Issue Type: Bug > Components: isolation >Reporter: Qian Zhang >Assignee: Avinash Sridharan > > Reproduce steps: > 1. Start master > {code} > sudo ./bin/mesos-master.sh --work_dir=/tmp > {code} > > 2. Start agent > {code} > sudo ./bin/mesos-slave.sh --master=192.168.122.171:5050 > --containerizers=mesos --image_providers=docker > --isolation=filesystem/linux,docker/runtime,network/cni > --network_cni_config_dir=/opt/cni/net_configs > --network_cni_plugins_dir=/opt/cni/plugins}} > {code} > > 3. Launch a command task with mesos-execute, and it will join a CNI network > {{net1}}. > {code} > sudo src/mesos-execute --master=192.168.122.171:5050 --name=test > --docker_image=library/busybox --networks=net1 --command="sleep 10" > --shell=true > I0418 08:25:35.746758 24923 scheduler.cpp:177] Version: 0.29.0 > Subscribed with ID '3c4796f0-eee7-4939-a036-7c6387c370eb-' > Submitted task 'test' to agent 'b74535d8-276f-4e09-ab47-53e3721ab271-S0' > Received status update TASK_FAILED for task 'test' > message: 'Executor terminated' > source: SOURCE_AGENT > reason: REASON_EXECUTOR_TERMINATED > {code} > So the task failed with the reason "executor terminated". 
Here is the agent > log: > {code} > I0418 08:25:35.804873 24911 slave.cpp:1514] Got assigned task test for > framework 3c4796f0-eee7-4939-a036-7c6387c370eb- > I0418 08:25:35.807937 24911 slave.cpp:1633] Launching task test for framework > 3c4796f0-eee7-4939-a036-7c6387c370eb- > I0418 08:25:35.812503 24911 paths.cpp:528] Trying to chown > '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/t > est/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' to user 'root' > I0418 08:25:35.820339 24911 slave.cpp:5620] Launching executor test of > framework 3c4796f0-eee7-4939-a036-7c6387c370eb- with resources > cpus(*):0.1; mem(*):32 in work directory '/t > mp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' > I0418 08:25:35.822576 24914 containerizer.cpp:698] Starting container > '2b29d6d6-b314-477f-b734-7771d07d41e3' for executor 'test' of framework > '3c4796f0-eee7-4939-a036-7c6387c370eb-00 > 00' > I0418 08:25:35.825996 24911 slave.cpp:1851] Queuing task 'test' for executor > 'test' of framework 3c4796f0-eee7-4939-a036-7c6387c370eb- > I0418 08:25:35.832348 24911 provisioner.cpp:285] Provisioning image rootfs > '/tmp/mesos/provisioner/containers/2b29d6d6-b314-477f-b734-7771d07d41e3/backends/copy/rootfses/d219ec3a-ea3 > 1-45f6-b578-a62cd02392e7' for container 2b29d6d6-b314-477f-b734-7771d07d41e3 > I0418 08:25:36.061249 24913 linux_launcher.cpp:281] Cloning child process > with flags = CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWNS > I0418 08:25:36.071208 24915 cni.cpp:643] Bind mounted '/proc/24950/ns/net' to > '/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' > for container 2b29d6d6-b314-4 > 77f-b734-7771d07d41e3 > I0418 08:25:36.250573 24916 cni.cpp:962] Got assigned IPv4 address > '192.168.1.2/24' from CNI network 'net1' for container > 2b29d6d6-b314-477f-b734-7771d07d41e3 > I0418 
08:25:36.252002 24917 cni.cpp:765] Unable to find DNS nameservers for > container 2b29d6d6-b314-477f-b734-7771d07d41e3. Using host '/etc/resolv.conf' > I0418 08:25:37.663487 24916 containerizer.cpp:1696] Executor for container > '2b29d6d6-b314-477f-b734-7771d07d41e3' has exited > I0418 08:25:37.663745 24916 containerizer.cpp:1461] Destroying container > '2b29d6d6-b314-477f-b734-7771d07d41e3' > I0418 08:25:37.670574 24915 cgroups.cpp:2676] Freezing cgroup > /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 > I0418 08:25:37.676864 24912 cgroups.cpp:1409] Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 after > 6.061056ms > I0418 08:25:37.680552 24913 cgroups.cpp:2694] Thawing cgroup > /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 > I0418 08:25:37.683346 24913 cgroups.cpp:1438] Successfully thawed cgroup > /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 after > 2.46016ms > I0418 08:25:37.874023 24914 cni.cpp:1121] Unmounted the network namespace > handle > '/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' > for container 2b29d6d6-b31 > 4-477f-b734-7771d07d41e3 > I0418 08:2
[jira] [Comment Edited] (MESOS-5225) Command executor can not start when joining a CNI network
[ https://issues.apache.org/jira/browse/MESOS-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245085#comment-15245085 ]

Avinash Sridharan edited comment on MESOS-5225 at 4/18/16 4:44 AM:
-------------------------------------------------------------------

Thanks Qian!! This does seem like a bug. The odd part is that we do set the rootfs on which we are going to bind mount the network files by checking the `ContainerConfig`:
https://github.com/apache/mesos/blob/0845ec04395faeb05a518a81c89c87b726dc8711/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L566
However, my suspicion is that for command tasks the rootfs in `ContainerConfig` is set to the actual rootfs of the container. (Need to confirm this.)

This does raise a question, though: even if we bind mount the files to the corresponding files in the host file system, we still need to bind mount the same files into the container file system as well. The reason is that after `pivot_root` the process will start treating the container as the root filesystem, and if the network files are not bind mounted into the rootfs of the container, we will keep seeing the same failure. I am thinking that the fix should be to bind mount the files both to the rootfs of the container and to the rootfs of the host file system. These mount points will get destroyed anyway when the mnt namespace is destroyed (when the container dies).

was (Author: avin...@mesosphere.io):
Thanks Qian !! This does seem like a bug. Odd part is we do set the rootfs on which we are going to bind mount the network files by checking the `ContainerConfig` https://github.com/apache/mesos/blob/0845ec04395faeb05a518a81c89c87b726dc8711/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L566 However, my suspicion is that for command tasks the rootfs in `ContainerConfig` is set to the actual rootfs of the container. (Need to confirm this).
This does raise a question though, even we bind mount the files to the corresponding files in the host file system, we still need to bind mount the same files to the container file system as well. Reason being, that after `pivot_root` the process will start treating the container as the root filesystem and if the network files are not bind mounted into the rootfs of the container, we will start seeing the same failure. I am thinking that the fix should be to bind mount the files to the rootfs of the container and the rootfs of the host file system. These mount points will get destroyed anyway when the mnt namespace is destroyed (container dies). > Command executor can not start when joining a CNI network > - > > Key: MESOS-5225 > URL: https://issues.apache.org/jira/browse/MESOS-5225 > Project: Mesos > Issue Type: Bug > Components: isolation >Reporter: Qian Zhang >Assignee: Qian Zhang > > Reproduce steps: > 1. Start master > {code} > sudo ./bin/mesos-master.sh --work_dir=/tmp > {code} > > 2. Start agent > {code} > sudo ./bin/mesos-slave.sh --master=192.168.122.171:5050 > --containerizers=mesos --image_providers=docker > --isolation=filesystem/linux,docker/runtime,network/cni > --network_cni_config_dir=/opt/cni/net_configs > --network_cni_plugins_dir=/opt/cni/plugins}} > {code} > > 3. Launch a command task with mesos-execute, and it will join a CNI network > {{net1}}. > {code} > sudo src/mesos-execute --master=192.168.122.171:5050 --name=test > --docker_image=library/busybox --networks=net1 --command="sleep 10" > --shell=true > I0418 08:25:35.746758 24923 scheduler.cpp:177] Version: 0.29.0 > Subscribed with ID '3c4796f0-eee7-4939-a036-7c6387c370eb-' > Submitted task 'test' to agent 'b74535d8-276f-4e09-ab47-53e3721ab271-S0' > Received status update TASK_FAILED for task 'test' > message: 'Executor terminated' > source: SOURCE_AGENT > reason: REASON_EXECUTOR_TERMINATED > {code} > So the task failed with the reason "executor terminated". 
Here is the agent > log: > {code} > I0418 08:25:35.804873 24911 slave.cpp:1514] Got assigned task test for > framework 3c4796f0-eee7-4939-a036-7c6387c370eb- > I0418 08:25:35.807937 24911 slave.cpp:1633] Launching task test for framework > 3c4796f0-eee7-4939-a036-7c6387c370eb- > I0418 08:25:35.812503 24911 paths.cpp:528] Trying to chown > '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/t > est/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' to user 'root' > I0418 08:25:35.820339 24911 slave.cpp:5620] Launching executor test of > framework 3c4796f0-eee7-4939-a036-7c6387c370eb- with resources > cpus(*):0.1; mem(*):32 in work directory '/t > mp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' > I0418 08:25:35.8
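The fix proposed in the comment above is to bind-mount each network file (e.g. /etc/hosts, /etc/hostname, /etc/resolv.conf) at two places: the host-filesystem view of the container's rootfs, and the path the task will see once `pivot_root` makes that rootfs `/`. A path-only sketch of that idea (illustrative names, no actual mount(2) calls, and not the cni.cpp implementation):

```python
import os

NETWORK_FILES = ["/etc/hosts", "/etc/hostname", "/etc/resolv.conf"]

def bind_targets(rootfs, network_file):
    """Return both locations where a network file must be visible: the host
    view of the container rootfs (valid before pivot_root) and the in-container
    path (what the task sees after pivot_root). Mounting itself is elided."""
    in_rootfs = os.path.join(rootfs, network_file.lstrip("/"))
    in_container = network_file
    return in_rootfs, in_container

for f in NETWORK_FILES:
    host_view, container_view = bind_targets("/var/lib/mesos/rootfs", f)
    print(host_view, "->", container_view)
```

Because both targets live inside the container's mount namespace, the comment's observation holds: the bind mounts disappear automatically when that namespace is destroyed.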
[jira] [Commented] (MESOS-5123) Docker task may fail if path to agent work_dir is relative.
[ https://issues.apache.org/jira/browse/MESOS-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245073#comment-15245073 ]

Klaus Ma commented on MESOS-5123:
---------------------------------

cc [~jieyu]/[~alexr] :).

> Docker task may fail if path to agent work_dir is relative.
> -----------------------------------------------------------
>
>                 Key: MESOS-5123
>                 URL: https://issues.apache.org/jira/browse/MESOS-5123
>             Project: Mesos
>          Issue Type: Improvement
>          Components: docker
>    Affects Versions: 0.28.0, 0.29.0
>            Reporter: Alexander Rukletsov
>            Assignee: Klaus Ma
>              Labels: docker, documentation, mesosphere
>             Fix For: 0.29.0
>
> When a relative path is specified for the agent's {{\-\-work_dir}} (e.g.,
> {{\-\-work_dir=w/s}}), docker complains that there are forbidden symbols in a
> *local* volume name. Specifying an absolute path (e.g.,
> {{\-\-work_dir=/tmp}}) solves the problem.
> Docker error observed:
> {noformat}
> docker: Error response from daemon: create
> w/s/slaves/33b8fe47-e9e0-468a-83a6-98c1e3537e59-S1/frameworks/33b8fe47-e9e0-468a-83a6-98c1e3537e59-0001/executors/docker-test/runs/3cc5cb04-d0a9-490e-94d5-d446b66c97cc:
> volume name invalid:
> "w/s/slaves/33b8fe47-e9e0-468a-83a6-98c1e3537e59-S1/frameworks/33b8fe47-e9e0-468a-83a6-98c1e3537e59-0001/executors/docker-test/runs/3cc5cb04-d0a9-490e-94d5-d446b66c97cc"
> includes invalid characters for a local volume name, only
> "[a-zA-Z0-9][a-zA-Z0-9_.-]" are allowed.
> {noformat}
> First off, it is not obvious that Mesos always creates a volume for the
> sandbox. We may want to document it.
> Second, it's hard to see that a relative {{work_dir}} can trigger a
> forbidden-symbols error in docker. Does it make sense to check this during
> agent launch if the docker containerizer is enabled? Or to reject docker
> tasks during task validation?

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
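The error above follows from how docker interprets the host side of a `-v` mapping: an absolute path becomes a bind mount, while anything else is parsed as a named volume, whose name must match the `[a-zA-Z0-9][a-zA-Z0-9_.-]` rule quoted in the error. A sandbox path derived from a relative `--work_dir` is neither. A sketch of the launch-time check the ticket suggests (illustrative only, not the Mesos implementation):

```python
import os
import re

# Local volume name rule quoted in the docker error message above.
VOLUME_NAME = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9_.-]*$")

def validate_work_dir(work_dir):
    """Return None if docker can handle sandbox paths under work_dir, else an
    error message suitable for failing fast at agent launch."""
    if os.path.isabs(work_dir):
        return None  # absolute sandbox paths become docker bind mounts
    # Relative work_dir: derived sandbox paths contain '/', which is not a
    # valid volume-name character, so docker rejects them.
    sandbox = os.path.join(work_dir, "slaves/S1/frameworks/F1/executors/E1/runs/R1")
    assert not VOLUME_NAME.match(sandbox)
    return ("--work_dir=%s is relative; docker would parse sandbox paths as "
            "invalid local volume names, so use an absolute path" % work_dir)

print(validate_work_dir("w/s") is None)   # False
print(validate_work_dir("/tmp") is None)  # True
```

Failing fast at agent launch turns an obscure per-task docker error into an immediate, actionable configuration error.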
[jira] [Assigned] (MESOS-5224) buffer overflow error in slave upon processing status update from executor v1 http API
[ https://issues.apache.org/jira/browse/MESOS-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Klaus Ma reassigned MESOS-5224: --- Assignee: Klaus Ma > buffer overflow error in slave upon processing status update from executor v1 > http API > -- > > Key: MESOS-5224 > URL: https://issues.apache.org/jira/browse/MESOS-5224 > Project: Mesos > Issue Type: Bug > Components: slave >Affects Versions: 0.28.0 > Environment: {code} > $ dpkg -l|grep -e mesos > ii mesos 0.28.0-2.0.16.ubuntu1404 > amd64Cluster resource manager with efficient resource isolation > $ uname -a > Linux node-3 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 > x86_64 x86_64 x86_64 GNU/Linux > {code} >Reporter: James DeFelice >Assignee: Klaus Ma > Labels: mesosphere > > implementing support for executor HTTP v1 API in mesos-go:next and my > executor can't send status updates because the slave dies upon receiving > them. protobufs generated from 0.28.1 > from syslog: > {code} > Apr 17 17:53:53 node-1 mesos-slave[4462]: I0417 17:53:53.121467 4489 > http.cpp:190] HTTP POST for /slave(1)/api/v1/executor from 10.2.0.5:51800 > with User-Agent='Go-http-client/1.1' > Apr 17 17:53:53 node-1 mesos-slave[4462]: *** buffer overflow detected ***: > /usr/sbin/mesos-slave terminated > Apr 17 17:53:53 node-1 mesos-slave[4462]: === Backtrace: = > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7fc53064e38f] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7fc5306e5c9c] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(+0x109b60)[0x7fc5306e4b60] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internallsERSoRKNS0_12StatusUpdateE+0x16a)[0x7fc531cc617a] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > 
/usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internal5slave5Slave12statusUpdateENS0_12StatusUpdateERK6OptionIN7process4UPIDEE+0xe7)[0x7fc531d71837] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(_ZNK5mesos8internal5slave5Slave4Http8executorERKN7process4http7RequestE+0xb52)[0x7fc531d302a2] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(+0xc754a3)[0x7fc531d4d4a3] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(+0x1295aa8)[0x7fc53236daa8] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(_ZN7process14ProcessManager6resumeEPNS_11ProcessBaseE+0x2d1)[0x7fc532375a71] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(+0x129dd77)[0x7fc532375d77] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1bf0)[0x7fc530e85bf0] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fc5309a8182] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc5306d547d] > ... 
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** Aborted at 1460915633 (unix > time) try "date -d @1460915633" if you are using GNU date *** > Apr 17 17:53:53 node-1 mesos-slave[4462]: PC: @ 0x7fc530611cc9 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: *** SIGABRT (@0x116e) received by > PID 4462 (TID 0x7fc5275f5700) from PID 4462; stack trace: *** > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5309b0340 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc530611cc9 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306150d8 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53064e394 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e5c9c (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e4b60 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531cc617a > mesos::internal::operator<<() > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d71837 > mesos::internal::slave::Slave::statusUpdate() > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d302a2 > mesos::internal::slave::Slave::Http::executor() > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d4d4a3 > _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestEEZN5mesos8internal5slave5Slave10initializeEvEUlS7_E19_E9_M_invokeERKSt9_Any_dataS7_ > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53236daa8 > _ZZN7process11ProcessBase5visitERKNS_9HttpEventEENKUlRKNS_6FutureI6OptionINS_4http14authentication20AuthenticationResultE0_clESC_ > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc
[jira] [Commented] (MESOS-5224) buffer overflow error in slave upon processing status update from executor v1 http API
[ https://issues.apache.org/jira/browse/MESOS-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245070#comment-15245070 ] haosdent commented on MESOS-5224: - If overflow in {{UUID:fromBytes()}}, I think the status should looks like {code} UUID::fromBytes() statusUpdat() ... {code} {{.framework_id()}} and {{.status()}} are the parameters user passed in here. > buffer overflow error in slave upon processing status update from executor v1 > http API > -- > > Key: MESOS-5224 > URL: https://issues.apache.org/jira/browse/MESOS-5224 > Project: Mesos > Issue Type: Bug > Components: slave >Affects Versions: 0.28.0 > Environment: {code} > $ dpkg -l|grep -e mesos > ii mesos 0.28.0-2.0.16.ubuntu1404 > amd64Cluster resource manager with efficient resource isolation > $ uname -a > Linux node-3 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 > x86_64 x86_64 x86_64 GNU/Linux > {code} >Reporter: James DeFelice > Labels: mesosphere > > implementing support for executor HTTP v1 API in mesos-go:next and my > executor can't send status updates because the slave dies upon receiving > them. 
protobufs generated from 0.28.1 > from syslog: > {code} > Apr 17 17:53:53 node-1 mesos-slave[4462]: I0417 17:53:53.121467 4489 > http.cpp:190] HTTP POST for /slave(1)/api/v1/executor from 10.2.0.5:51800 > with User-Agent='Go-http-client/1.1' > Apr 17 17:53:53 node-1 mesos-slave[4462]: *** buffer overflow detected ***: > /usr/sbin/mesos-slave terminated > Apr 17 17:53:53 node-1 mesos-slave[4462]: === Backtrace: = > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7fc53064e38f] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7fc5306e5c9c] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(+0x109b60)[0x7fc5306e4b60] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internallsERSoRKNS0_12StatusUpdateE+0x16a)[0x7fc531cc617a] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internal5slave5Slave12statusUpdateENS0_12StatusUpdateERK6OptionIN7process4UPIDEE+0xe7)[0x7fc531d71837] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(_ZNK5mesos8internal5slave5Slave4Http8executorERKN7process4http7RequestE+0xb52)[0x7fc531d302a2] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(+0xc754a3)[0x7fc531d4d4a3] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(+0x1295aa8)[0x7fc53236daa8] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(_ZN7process14ProcessManager6resumeEPNS_11ProcessBaseE+0x2d1)[0x7fc532375a71] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(+0x129dd77)[0x7fc532375d77] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1bf0)[0x7fc530e85bf0] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fc5309a8182] > Apr 17 17:53:53 node-1 
mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc5306d547d] > ... > Apr 17 17:53:53 node-1 mesos-slave[4462]: *** Aborted at 1460915633 (unix > time) try "date -d @1460915633" if you are using GNU date *** > Apr 17 17:53:53 node-1 mesos-slave[4462]: PC: @ 0x7fc530611cc9 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: *** SIGABRT (@0x116e) received by > PID 4462 (TID 0x7fc5275f5700) from PID 4462; stack trace: *** > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5309b0340 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc530611cc9 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306150d8 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53064e394 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e5c9c (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e4b60 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531cc617a > mesos::internal::operator<<() > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d71837 > mesos::internal::slave::Slave::statusUpdate() > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d302a2 > mesos::internal::slave::Slave::Http::executor() > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d4d4a3 > _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestEEZN5mesos8internal5slave5Slave10initializeEvEUlS7_E19_E9_M_invokeERKSt9_Any_dataS7_ > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53236da
[jira] [Comment Edited] (MESOS-5224) buffer overflow error in slave upon processing status update from executor v1 http API
[ https://issues.apache.org/jira/browse/MESOS-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245070#comment-15245070 ] haosdent edited comment on MESOS-5224 at 4/18/16 2:46 AM: -- If overflow in {{UUID:fromBytes()}}, I think the stack should looks like {code} UUID::fromBytes() statusUpdate() ... {code} {{.framework_id()}} and {{.status()}} are the parameters user passed in here. was (Author: haosd...@gmail.com): If overflow in {{UUID:fromBytes()}}, I think the status should looks like {code} UUID::fromBytes() statusUpdat() ... {code} {{.framework_id()}} and {{.status()}} are the parameters user passed in here. > buffer overflow error in slave upon processing status update from executor v1 > http API > -- > > Key: MESOS-5224 > URL: https://issues.apache.org/jira/browse/MESOS-5224 > Project: Mesos > Issue Type: Bug > Components: slave >Affects Versions: 0.28.0 > Environment: {code} > $ dpkg -l|grep -e mesos > ii mesos 0.28.0-2.0.16.ubuntu1404 > amd64Cluster resource manager with efficient resource isolation > $ uname -a > Linux node-3 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 > x86_64 x86_64 x86_64 GNU/Linux > {code} >Reporter: James DeFelice > Labels: mesosphere > > implementing support for executor HTTP v1 API in mesos-go:next and my > executor can't send status updates because the slave dies upon receiving > them. 
protobufs generated from 0.28.1 > from syslog: > {code} > Apr 17 17:53:53 node-1 mesos-slave[4462]: I0417 17:53:53.121467 4489 > http.cpp:190] HTTP POST for /slave(1)/api/v1/executor from 10.2.0.5:51800 > with User-Agent='Go-http-client/1.1' > Apr 17 17:53:53 node-1 mesos-slave[4462]: *** buffer overflow detected ***: > /usr/sbin/mesos-slave terminated > Apr 17 17:53:53 node-1 mesos-slave[4462]: === Backtrace: = > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7fc53064e38f] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7fc5306e5c9c] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(+0x109b60)[0x7fc5306e4b60] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internallsERSoRKNS0_12StatusUpdateE+0x16a)[0x7fc531cc617a] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internal5slave5Slave12statusUpdateENS0_12StatusUpdateERK6OptionIN7process4UPIDEE+0xe7)[0x7fc531d71837] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(_ZNK5mesos8internal5slave5Slave4Http8executorERKN7process4http7RequestE+0xb52)[0x7fc531d302a2] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(+0xc754a3)[0x7fc531d4d4a3] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(+0x1295aa8)[0x7fc53236daa8] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(_ZN7process14ProcessManager6resumeEPNS_11ProcessBaseE+0x2d1)[0x7fc532375a71] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(+0x129dd77)[0x7fc532375d77] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1bf0)[0x7fc530e85bf0] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fc5309a8182] > Apr 17 17:53:53 node-1 
mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc5306d547d] > ... > Apr 17 17:53:53 node-1 mesos-slave[4462]: *** Aborted at 1460915633 (unix > time) try "date -d @1460915633" if you are using GNU date *** > Apr 17 17:53:53 node-1 mesos-slave[4462]: PC: @ 0x7fc530611cc9 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: *** SIGABRT (@0x116e) received by > PID 4462 (TID 0x7fc5275f5700) from PID 4462; stack trace: *** > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5309b0340 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc530611cc9 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306150d8 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53064e394 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e5c9c (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e4b60 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531cc617a > mesos::internal::operator<<() > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d71837 > mesos::internal::slave::Slave::statusUpdate() > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d302a2 > mesos::internal::slave::Slave::Http::executor() > Apr 17 17:5
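A quick sanity check on the suspected trigger: the status update JSON attached to this issue carries uuid "ZTZlZTRlNmMtNzE0Ni00NTAwLWJkZWYtNDc0Yzk2MWNmNGU4", while {{UUID::fromBytes()}} expects 16 raw bytes. The sketch below (Python, illustrative only; it has not been confirmed that the overflow comes from this path) shows that the field actually encodes the 36-character textual UUID:

```python
import base64
import uuid

# uuid field from the rejected status update attached to this issue.
encoded = "ZTZlZTRlNmMtNzE0Ni00NTAwLWJkZWYtNDc0Yzk2MWNmNGU4"
decoded = base64.b64decode(encoded)

# stout's UUID::fromBytes() expects exactly 16 raw bytes, but this
# payload decodes to the 36-character *textual* UUID form -- a
# plausible (unconfirmed) trigger for the reported overflow.
assert len(decoded) == 36
assert decoded.decode() == "e6ee4e6c-7146-4500-bdef-474c961cf4e8"

# What the executor presumably should have sent: the 16 raw bytes.
raw = uuid.UUID(decoded.decode()).bytes
assert len(raw) == 16
```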
[jira] [Updated] (MESOS-5056) Replace Master/Slave Terminology Phase I - Update strings in the shell scripts outputs
[ https://issues.apache.org/jira/browse/MESOS-5056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-5056: -- Shepherd: Vinod Kone > Replace Master/Slave Terminology Phase I - Update strings in the shell > scripts outputs > -- > > Key: MESOS-5056 > URL: https://issues.apache.org/jira/browse/MESOS-5056 > Project: Mesos > Issue Type: Task >Reporter: zhou xing >Assignee: zhou xing > > This is a sub ticket of MESOS-3780. In this ticket, we will rename slave to > agent in the shell script outputs -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5057) Replace Master/Slave Terminology Phase I - Update strings in error messages and other strings
[ https://issues.apache.org/jira/browse/MESOS-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-5057: -- Shepherd: Vinod Kone Sprint: Mesosphere Sprint 33 Story Points: 3 > Replace Master/Slave Terminology Phase I - Update strings in error messages > and other strings > - > > Key: MESOS-5057 > URL: https://issues.apache.org/jira/browse/MESOS-5057 > Project: Mesos > Issue Type: Task >Reporter: zhou xing >Assignee: zhou xing > Fix For: 0.29.0 > > Original Estimate: 1m > Remaining Estimate: 1m > > This is a sub ticket of MESOS-3780. In this ticket, we will update all the > slave to agent in the error messages and other strings in the code -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-5057) Replace Master/Slave Terminology Phase I - Update strings in error messages and other strings
[ https://issues.apache.org/jira/browse/MESOS-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245066#comment-15245066 ] Vinod Kone edited comment on MESOS-5057 at 4/18/16 2:35 AM: Transition the issue to "Reviewable" when you post a review please. I will do it for this one. was (Author: vinodkone): Transition the review to "Reviewable" when you post a review please. I will do it for this one. > Replace Master/Slave Terminology Phase I - Update strings in error messages > and other strings > - > > Key: MESOS-5057 > URL: https://issues.apache.org/jira/browse/MESOS-5057 > Project: Mesos > Issue Type: Task >Reporter: zhou xing >Assignee: zhou xing > Fix For: 0.29.0 > > Original Estimate: 1m > Remaining Estimate: 1m > > This is a sub ticket of MESOS-3780. In this ticket, we will update all the > slave to agent in the error messages and other strings in the code -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5226) The image-less task launched by mesos-execute can not join CNI network
[ https://issues.apache.org/jira/browse/MESOS-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245052#comment-15245052 ] Qian Zhang commented on MESOS-5226: --- The root cause of this bug is that in {{CommandScheduler::getContainerInfo()}} we do not return a {{ContainerInfo}} when no image is specified, even if a CNI network is specified; we just return {{None()}} in that case. And in {{NetworkCniIsolatorProcess::prepare()}}, we ignore any container that has no {{ContainerInfo}}, so no CNI-related logic is applied to the executor, which therefore stays in the agent host network namespace. > The image-less task launched by mesos-execute can not join CNI network > -- > > Key: MESOS-5226 > URL: https://issues.apache.org/jira/browse/MESOS-5226 > Project: Mesos > Issue Type: Bug > Components: isolation >Reporter: Qian Zhang >Assignee: Qian Zhang > > With {{mesos-execute}}, if we launch a task that wants to join a CNI > network but has no image specified, like: > {code} > sudo src/mesos-execute --master=192.168.122.171:5050 --name=test > --networks=net1 --command="ifconfig" --shell=true > {code} > The corresponding command executor will not actually join the specified CNI > network; instead it stays in the agent host network namespace. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
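The decision described in the root-cause comment above can be sketched as follows (Python pseudocode; the function name mirrors {{CommandScheduler::getContainerInfo()}} but the dict shapes are illustrative, not the real {{ContainerInfo}} protobuf). The reported bug returns nothing whenever the image is absent; the branch below instead builds a {{ContainerInfo}} when only networks are requested:

```python
def get_container_info(image, networks):
    """Sketch of the getContainerInfo() decision. The buggy behavior
    returned None whenever image was None, silently dropping any
    requested CNI networks; here a ContainerInfo is still built when
    only networks are given (illustrative fix, not Mesos source)."""
    if image is None and not networks:
        return None
    info = {"type": "MESOS",
            "network_infos": [{"name": name} for name in networks]}
    if image is not None:
        info["mesos"] = {"image": image}
    return info
```

With this shape, an image-less task that requests {{net1}} still yields a {{ContainerInfo}} carrying one network info, so the CNI isolator no longer skips it.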
[jira] [Created] (MESOS-5226) The image-less task launched by mesos-execute can not join CNI network
Qian Zhang created MESOS-5226: - Summary: The image-less task launched by mesos-execute can not join CNI network Key: MESOS-5226 URL: https://issues.apache.org/jira/browse/MESOS-5226 Project: Mesos Issue Type: Bug Components: isolation Reporter: Qian Zhang Assignee: Qian Zhang With {{mesos-execute}}, if we launch a task that wants to join a CNI network but has no image specified, like: {code} sudo src/mesos-execute --master=192.168.122.171:5050 --name=test --networks=net1 --command="ifconfig" --shell=true {code} The corresponding command executor will not actually join the specified CNI network; instead it stays in the agent host network namespace. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5224) buffer overflow error in slave upon processing status update from executor v1 http API
[ https://issues.apache.org/jira/browse/MESOS-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245050#comment-15245050 ] Klaus Ma commented on MESOS-5224: - [~jdef], would you share your example? I'd like to reproduce it first :). > buffer overflow error in slave upon processing status update from executor v1 > http API > -- > > Key: MESOS-5224 > URL: https://issues.apache.org/jira/browse/MESOS-5224 > Project: Mesos > Issue Type: Bug > Components: slave >Affects Versions: 0.28.0 > Environment: {code} > $ dpkg -l|grep -e mesos > ii mesos 0.28.0-2.0.16.ubuntu1404 > amd64Cluster resource manager with efficient resource isolation > $ uname -a > Linux node-3 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 > x86_64 x86_64 x86_64 GNU/Linux > {code} >Reporter: James DeFelice > Labels: mesosphere > > implementing support for executor HTTP v1 API in mesos-go:next and my > executor can't send status updates because the slave dies upon receiving > them. 
protobufs generated from 0.28.1 > from syslog: > {code} > Apr 17 17:53:53 node-1 mesos-slave[4462]: I0417 17:53:53.121467 4489 > http.cpp:190] HTTP POST for /slave(1)/api/v1/executor from 10.2.0.5:51800 > with User-Agent='Go-http-client/1.1' > Apr 17 17:53:53 node-1 mesos-slave[4462]: *** buffer overflow detected ***: > /usr/sbin/mesos-slave terminated > Apr 17 17:53:53 node-1 mesos-slave[4462]: === Backtrace: = > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7fc53064e38f] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7fc5306e5c9c] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(+0x109b60)[0x7fc5306e4b60] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internallsERSoRKNS0_12StatusUpdateE+0x16a)[0x7fc531cc617a] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internal5slave5Slave12statusUpdateENS0_12StatusUpdateERK6OptionIN7process4UPIDEE+0xe7)[0x7fc531d71837] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(_ZNK5mesos8internal5slave5Slave4Http8executorERKN7process4http7RequestE+0xb52)[0x7fc531d302a2] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(+0xc754a3)[0x7fc531d4d4a3] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(+0x1295aa8)[0x7fc53236daa8] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(_ZN7process14ProcessManager6resumeEPNS_11ProcessBaseE+0x2d1)[0x7fc532375a71] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(+0x129dd77)[0x7fc532375d77] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1bf0)[0x7fc530e85bf0] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fc5309a8182] > Apr 17 17:53:53 node-1 
mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc5306d547d] > ... > Apr 17 17:53:53 node-1 mesos-slave[4462]: *** Aborted at 1460915633 (unix > time) try "date -d @1460915633" if you are using GNU date *** > Apr 17 17:53:53 node-1 mesos-slave[4462]: PC: @ 0x7fc530611cc9 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: *** SIGABRT (@0x116e) received by > PID 4462 (TID 0x7fc5275f5700) from PID 4462; stack trace: *** > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5309b0340 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc530611cc9 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306150d8 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53064e394 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e5c9c (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e4b60 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531cc617a > mesos::internal::operator<<() > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d71837 > mesos::internal::slave::Slave::statusUpdate() > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d302a2 > mesos::internal::slave::Slave::Http::executor() > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d4d4a3 > _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestEEZN5mesos8internal5slave5Slave10initializeEvEUlS7_E19_E9_M_invokeERKSt9_Any_dataS7_ > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53236daa8 > _ZZN7process11ProcessBase5visitERKNS_9HttpEventEENKUlRKNS_6FutureI6OptionINS_4http14authentication20AuthenticationResultEE
[jira] [Comment Edited] (MESOS-5225) Command executor can not start when joining a CNI network
[ https://issues.apache.org/jira/browse/MESOS-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245044#comment-15245044 ] Qian Zhang edited comment on MESOS-5225 at 4/18/16 1:47 AM: The root cause of this bug is, before the command executor (mesos-executor) is started by mesos-containerizer, we bind mount {{/etc/hosts}}, {{/etc/hostname}}, {{/etc/resolv.conf}} in the container's rootfs (see {{NetworkCniIsolatorSetup::execute()}} for details), but for command executor, we will NOT do the {{chroot}} before launching it (see {{LinuxFilesystemIsolatorProcess::prepare()}}, we will only set rootfs in {{ContainerLaunchInfo}} if it is not a command task), instead the command executor will do the {{chroot}} itself when launching the task (https://github.com/apache/mesos/blob/0.28.0/src/launcher/executor.cpp#L369). So when the command executor is launched, it is still using agent host FS, that means the bind mounts that we do will not take effect for it. Obviously in agent host FS, the {{/etc/hosts}} does not have the pair of container's hostname and IP, so the hostname lookup in libprocess will fail. was (Author: qianzhang): The root cause of this bug is, before the command executor (mesos-executor) is started by mesos-containerizer, we bind mount {{/etc/hosts}}, {{/etc/hostname}}, {{/etc/resolv.conf}} in the container's rootfs (see {{NetworkCniIsolatorSetup::execute()}} for details), but for command executor, we will NOT do the {{chroot}} before launching it (see {{LinuxFilesystemIsolatorProcess::prepare()}}, we will only set rootfs in {{ContainerLaunchInfo}} for if it is not a command task), instead the command executor will do the {{chroot}} itself when launching the task (https://github.com/apache/mesos/blob/0.28.0/src/launcher/executor.cpp#L369). So when the command executor is launched, it is still using agent host FS, that means the bind mounts that we do will not take effect for it. 
Obviously in agent host FS, the {{/etc/hosts}} does not have the pair of container's hostname and IP, so the hostname lookup in libprocess will fail. > Command executor can not start when joining a CNI network > - > > Key: MESOS-5225 > URL: https://issues.apache.org/jira/browse/MESOS-5225 > Project: Mesos > Issue Type: Bug > Components: isolation >Reporter: Qian Zhang >Assignee: Qian Zhang > > Reproduce steps: > 1. Start master > {code} > sudo ./bin/mesos-master.sh --work_dir=/tmp > {code} > > 2. Start agent > {code} > sudo ./bin/mesos-slave.sh --master=192.168.122.171:5050 > --containerizers=mesos --image_providers=docker > --isolation=filesystem/linux,docker/runtime,network/cni > --network_cni_config_dir=/opt/cni/net_configs > --network_cni_plugins_dir=/opt/cni/plugins}} > {code} > > 3. Launch a command task with mesos-execute, and it will join a CNI network > {{net1}}. > {code} > sudo src/mesos-execute --master=192.168.122.171:5050 --name=test > --docker_image=library/busybox --networks=net1 --command="sleep 10" > --shell=true > I0418 08:25:35.746758 24923 scheduler.cpp:177] Version: 0.29.0 > Subscribed with ID '3c4796f0-eee7-4939-a036-7c6387c370eb-' > Submitted task 'test' to agent 'b74535d8-276f-4e09-ab47-53e3721ab271-S0' > Received status update TASK_FAILED for task 'test' > message: 'Executor terminated' > source: SOURCE_AGENT > reason: REASON_EXECUTOR_TERMINATED > {code} > So the task failed with the reason "executor terminated". 
Here is the agent > log: > {code} > I0418 08:25:35.804873 24911 slave.cpp:1514] Got assigned task test for > framework 3c4796f0-eee7-4939-a036-7c6387c370eb- > I0418 08:25:35.807937 24911 slave.cpp:1633] Launching task test for framework > 3c4796f0-eee7-4939-a036-7c6387c370eb- > I0418 08:25:35.812503 24911 paths.cpp:528] Trying to chown > '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/t > est/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' to user 'root' > I0418 08:25:35.820339 24911 slave.cpp:5620] Launching executor test of > framework 3c4796f0-eee7-4939-a036-7c6387c370eb- with resources > cpus(*):0.1; mem(*):32 in work directory '/t > mp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' > I0418 08:25:35.822576 24914 containerizer.cpp:698] Starting container > '2b29d6d6-b314-477f-b734-7771d07d41e3' for executor 'test' of framework > '3c4796f0-eee7-4939-a036-7c6387c370eb-00 > 00' > I0418 08:25:35.825996 24911 slave.cpp:1851] Queuing task 'test' for executor > 'test' of framework 3c4796f0-eee7-4939-a036-7c6387c370eb- > I0418 08:25:35.832348 24911 provisioner.cpp:285] Provisioning image rootfs
[jira] [Commented] (MESOS-5225) Command executor can not start when joining a CNI network
[ https://issues.apache.org/jira/browse/MESOS-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245044#comment-15245044 ] Qian Zhang commented on MESOS-5225: --- The root cause of this bug is, before the command executor (mesos-executor) is started by mesos-containerizer, we bind mount {{/etc/hosts}}, {{/etc/hostname}}, {{/etc/resolv.conf}} in the container's rootfs (see {{NetworkCniIsolatorSetup::execute()}} for details), but for command executor, we will NOT do the {{chroot}} before launching it (see {{LinuxFilesystemIsolatorProcess::prepare()}}, we will only set rootfs in {{ContainerLaunchInfo}} for if it is not a command task), instead the command executor will do the {{chroot}} itself when launching the task (https://github.com/apache/mesos/blob/0.28.0/src/launcher/executor.cpp#L369). So when the command executor is launched, it is still using agent host FS, that means the bind mounts that we do will not take effect for it. Obviously in agent host FS, the {{/etc/hosts}} does not have the pair of container's hostname and IP, so the hostname lookup in libprocess will fail. > Command executor can not start when joining a CNI network > - > > Key: MESOS-5225 > URL: https://issues.apache.org/jira/browse/MESOS-5225 > Project: Mesos > Issue Type: Bug > Components: isolation >Reporter: Qian Zhang >Assignee: Qian Zhang > > Reproduce steps: > 1. Start master > {code} > sudo ./bin/mesos-master.sh --work_dir=/tmp > {code} > > 2. Start agent > {code} > sudo ./bin/mesos-slave.sh --master=192.168.122.171:5050 > --containerizers=mesos --image_providers=docker > --isolation=filesystem/linux,docker/runtime,network/cni > --network_cni_config_dir=/opt/cni/net_configs > --network_cni_plugins_dir=/opt/cni/plugins}} > {code} > > 3. Launch a command task with mesos-execute, and it will join a CNI network > {{net1}}. 
> {code} > sudo src/mesos-execute --master=192.168.122.171:5050 --name=test > --docker_image=library/busybox --networks=net1 --command="sleep 10" > --shell=true > I0418 08:25:35.746758 24923 scheduler.cpp:177] Version: 0.29.0 > Subscribed with ID '3c4796f0-eee7-4939-a036-7c6387c370eb-' > Submitted task 'test' to agent 'b74535d8-276f-4e09-ab47-53e3721ab271-S0' > Received status update TASK_FAILED for task 'test' > message: 'Executor terminated' > source: SOURCE_AGENT > reason: REASON_EXECUTOR_TERMINATED > {code} > So the task failed with the reason "executor terminated". Here is the agent > log: > {code} > I0418 08:25:35.804873 24911 slave.cpp:1514] Got assigned task test for > framework 3c4796f0-eee7-4939-a036-7c6387c370eb- > I0418 08:25:35.807937 24911 slave.cpp:1633] Launching task test for framework > 3c4796f0-eee7-4939-a036-7c6387c370eb- > I0418 08:25:35.812503 24911 paths.cpp:528] Trying to chown > '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/t > est/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' to user 'root' > I0418 08:25:35.820339 24911 slave.cpp:5620] Launching executor test of > framework 3c4796f0-eee7-4939-a036-7c6387c370eb- with resources > cpus(*):0.1; mem(*):32 in work directory '/t > mp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' > I0418 08:25:35.822576 24914 containerizer.cpp:698] Starting container > '2b29d6d6-b314-477f-b734-7771d07d41e3' for executor 'test' of framework > '3c4796f0-eee7-4939-a036-7c6387c370eb-00 > 00' > I0418 08:25:35.825996 24911 slave.cpp:1851] Queuing task 'test' for executor > 'test' of framework 3c4796f0-eee7-4939-a036-7c6387c370eb- > I0418 08:25:35.832348 24911 provisioner.cpp:285] Provisioning image rootfs > '/tmp/mesos/provisioner/containers/2b29d6d6-b314-477f-b734-7771d07d41e3/backends/copy/rootfses/d219ec3a-ea3 > 1-45f6-b578-a62cd02392e7' 
for container 2b29d6d6-b314-477f-b734-7771d07d41e3 > I0418 08:25:36.061249 24913 linux_launcher.cpp:281] Cloning child process > with flags = CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWNS > I0418 08:25:36.071208 24915 cni.cpp:643] Bind mounted '/proc/24950/ns/net' to > '/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' > for container 2b29d6d6-b314-4 > 77f-b734-7771d07d41e3 > I0418 08:25:36.250573 24916 cni.cpp:962] Got assigned IPv4 address > '192.168.1.2/24' from CNI network 'net1' for container > 2b29d6d6-b314-477f-b734-7771d07d41e3 > I0418 08:25:36.252002 24917 cni.cpp:765] Unable to find DNS nameservers for > container 2b29d6d6-b314-477f-b734-7771d07d41e3. Using host '/etc/resolv.conf' > I0418 08:25:37.663487 24916 containerizer.cpp:1696] Executor for container > '2b29d6d6-b314-477f-b734-7771d07d41e3' has exited > I0418 08:25:37.66
[jira] [Created] (MESOS-5225) Command executor can not start when joining a CNI network
Qian Zhang created MESOS-5225: - Summary: Command executor can not start when joining a CNI network Key: MESOS-5225 URL: https://issues.apache.org/jira/browse/MESOS-5225 Project: Mesos Issue Type: Bug Components: isolation Reporter: Qian Zhang Assignee: Qian Zhang Reproduce steps: 1. Start master {code} sudo ./bin/mesos-master.sh --work_dir=/tmp {code} 2. Start agent {code} sudo ./bin/mesos-slave.sh --master=192.168.122.171:5050 --containerizers=mesos --image_providers=docker --isolation=filesystem/linux,docker/runtime,network/cni --network_cni_config_dir=/opt/cni/net_configs --network_cni_plugins_dir=/opt/cni/plugins {code} 3. Launch a command task with mesos-execute, and it will join a CNI network {{net1}}. {code} sudo src/mesos-execute --master=192.168.122.171:5050 --name=test --docker_image=library/busybox --networks=net1 --command="sleep 10" --shell=true I0418 08:25:35.746758 24923 scheduler.cpp:177] Version: 0.29.0 Subscribed with ID '3c4796f0-eee7-4939-a036-7c6387c370eb-' Submitted task 'test' to agent 'b74535d8-276f-4e09-ab47-53e3721ab271-S0' Received status update TASK_FAILED for task 'test' message: 'Executor terminated' source: SOURCE_AGENT reason: REASON_EXECUTOR_TERMINATED {code} So the task failed with the reason "executor terminated". 
Here is the agent log: {code} I0418 08:25:35.804873 24911 slave.cpp:1514] Got assigned task test for framework 3c4796f0-eee7-4939-a036-7c6387c370eb- I0418 08:25:35.807937 24911 slave.cpp:1633] Launching task test for framework 3c4796f0-eee7-4939-a036-7c6387c370eb- I0418 08:25:35.812503 24911 paths.cpp:528] Trying to chown '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' to user 'root' I0418 08:25:35.820339 24911 slave.cpp:5620] Launching executor test of framework 3c4796f0-eee7-4939-a036-7c6387c370eb- with resources cpus(*):0.1; mem(*):32 in work directory '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' I0418 08:25:35.822576 24914 containerizer.cpp:698] Starting container '2b29d6d6-b314-477f-b734-7771d07d41e3' for executor 'test' of framework '3c4796f0-eee7-4939-a036-7c6387c370eb-0000' I0418 08:25:35.825996 24911 slave.cpp:1851] Queuing task 'test' for executor 'test' of framework 3c4796f0-eee7-4939-a036-7c6387c370eb- I0418 08:25:35.832348 24911 provisioner.cpp:285] Provisioning image rootfs '/tmp/mesos/provisioner/containers/2b29d6d6-b314-477f-b734-7771d07d41e3/backends/copy/rootfses/d219ec3a-ea31-45f6-b578-a62cd02392e7' for container 2b29d6d6-b314-477f-b734-7771d07d41e3 I0418 08:25:36.061249 24913 linux_launcher.cpp:281] Cloning child process with flags = CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWNS I0418 08:25:36.071208 24915 cni.cpp:643] Bind mounted '/proc/24950/ns/net' to '/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' for container 2b29d6d6-b314-477f-b734-7771d07d41e3 I0418 08:25:36.250573 24916 cni.cpp:962] Got assigned IPv4 address '192.168.1.2/24' from CNI network 'net1' for container 2b29d6d6-b314-477f-b734-7771d07d41e3 I0418 08:25:36.252002 24917 cni.cpp:765] Unable to find DNS nameservers for 
container 2b29d6d6-b314-477f-b734-7771d07d41e3. Using host '/etc/resolv.conf' I0418 08:25:37.663487 24916 containerizer.cpp:1696] Executor for container '2b29d6d6-b314-477f-b734-7771d07d41e3' has exited I0418 08:25:37.663745 24916 containerizer.cpp:1461] Destroying container '2b29d6d6-b314-477f-b734-7771d07d41e3' I0418 08:25:37.670574 24915 cgroups.cpp:2676] Freezing cgroup /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 I0418 08:25:37.676864 24912 cgroups.cpp:1409] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 after 6.061056ms I0418 08:25:37.680552 24913 cgroups.cpp:2694] Thawing cgroup /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 I0418 08:25:37.683346 24913 cgroups.cpp:1438] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 after 2.46016ms I0418 08:25:37.874023 24914 cni.cpp:1121] Unmounted the network namespace handle '/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' for container 2b29d6d6-b314-477f-b734-7771d07d41e3 I0418 08:25:37.874194 24914 cni.cpp:1132] Removed the container directory '/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3' I0418 08:25:37.877306 24912 linux.cpp:814] Ignoring unmounting sandbox/work directory for container 2b29d6d6-b314-477f-b734-7771d07d41e3 I0418 08:25:37.879295 24912 provisioner.cpp:338] Destroying container rootfs at '/tmp/mesos/provisioner/containers/2b29d6d6-b314-477f-b734-7771d
[jira] [Commented] (MESOS-5224) buffer overflow error in slave upon processing status update from executor v1 http API
[ https://issues.apache.org/jira/browse/MESOS-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244803#comment-15244803 ] Vinod Kone commented on MESOS-5224: --- Interesting. Looks like the buffer overflow happened inside Slave::statusUpdate() when logging the update message? {code} Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531cc617a mesos::internal::operator<<() Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d71837 mesos::internal::slave::Slave::statusUpdate() {code} The code for the output stream operator for status update looks like so {code} ostream& operator<<(ostream& stream, const StatusUpdate& update) { stream << update.status().state(); if (update.has_uuid()) { stream << " (UUID: " << stringify(UUID::fromBytes(update.uuid())) << ")"; } stream << " for task " << update.status().task_id(); if (update.status().has_healthy()) { stream << " in health state " << (update.status().healthy() ? "healthy" : "unhealthy"); } return stream << " of framework " << update.framework_id(); } {code} The one thing that could cause an issue is `UUID::fromBytes()`. How is the UUID being set by the HTTP executor? > buffer overflow error in slave upon processing status update from executor v1 > http API > -- > > Key: MESOS-5224 > URL: https://issues.apache.org/jira/browse/MESOS-5224 > Project: Mesos > Issue Type: Bug > Components: slave >Affects Versions: 0.28.0 > Environment: {code} > $ dpkg -l|grep -e mesos > ii mesos 0.28.0-2.0.16.ubuntu1404 > amd64Cluster resource manager with efficient resource isolation > $ uname -a > Linux node-3 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 > x86_64 x86_64 x86_64 GNU/Linux > {code} >Reporter: James DeFelice > Labels: mesosphere > > implementing support for executor HTTP v1 API in mesos-go:next and my > executor can't send status updates because the slave dies upon receiving > them. 
protobufs generated from 0.28.1 > from syslog: > {code} > Apr 17 17:53:53 node-1 mesos-slave[4462]: I0417 17:53:53.121467 4489 > http.cpp:190] HTTP POST for /slave(1)/api/v1/executor from 10.2.0.5:51800 > with User-Agent='Go-http-client/1.1' > Apr 17 17:53:53 node-1 mesos-slave[4462]: *** buffer overflow detected ***: > /usr/sbin/mesos-slave terminated > Apr 17 17:53:53 node-1 mesos-slave[4462]: === Backtrace: = > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7fc53064e38f] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7fc5306e5c9c] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(+0x109b60)[0x7fc5306e4b60] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internallsERSoRKNS0_12StatusUpdateE+0x16a)[0x7fc531cc617a] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internal5slave5Slave12statusUpdateENS0_12StatusUpdateERK6OptionIN7process4UPIDEE+0xe7)[0x7fc531d71837] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(_ZNK5mesos8internal5slave5Slave4Http8executorERKN7process4http7RequestE+0xb52)[0x7fc531d302a2] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(+0xc754a3)[0x7fc531d4d4a3] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(+0x1295aa8)[0x7fc53236daa8] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(_ZN7process14ProcessManager6resumeEPNS_11ProcessBaseE+0x2d1)[0x7fc532375a71] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/local/lib/libmesos-0.28.0.so(+0x129dd77)[0x7fc532375d77] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1bf0)[0x7fc530e85bf0] > Apr 17 17:53:53 node-1 mesos-slave[4462]: > /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fc5309a8182] > Apr 17 17:53:53 node-1 
mesos-slave[4462]: > /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc5306d547d] > ... > Apr 17 17:53:53 node-1 mesos-slave[4462]: *** Aborted at 1460915633 (unix > time) try "date -d @1460915633" if you are using GNU date *** > Apr 17 17:53:53 node-1 mesos-slave[4462]: PC: @ 0x7fc530611cc9 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: *** SIGABRT (@0x116e) received by > PID 4462 (TID 0x7fc5275f5700) from PID 4462; stack trace: *** > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5309b0340 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc530611cc9 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306150d8 (unknown) > Apr 17 17:53:53 node-1 mesos-slave[4462]: @
[jira] [Created] (MESOS-5224) buffer overflow error in slave upon processing status update from executor v1 http API
James DeFelice created MESOS-5224: - Summary: buffer overflow error in slave upon processing status update from executor v1 http API Key: MESOS-5224 URL: https://issues.apache.org/jira/browse/MESOS-5224 Project: Mesos Issue Type: Bug Components: slave Affects Versions: 0.28.0 Environment: {code} $ dpkg -l|grep -e mesos ii mesos 0.28.0-2.0.16.ubuntu1404 amd64 Cluster resource manager with efficient resource isolation $ uname -a Linux node-3 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux {code} Reporter: James DeFelice implementing support for executor HTTP v1 API in mesos-go:next and my executor can't send status updates because the slave dies upon receiving them. protobufs generated from 0.28.1 from syslog: {code} Apr 17 17:53:53 node-1 mesos-slave[4462]: I0417 17:53:53.121467 4489 http.cpp:190] HTTP POST for /slave(1)/api/v1/executor from 10.2.0.5:51800 with User-Agent='Go-http-client/1.1' Apr 17 17:53:53 node-1 mesos-slave[4462]: *** buffer overflow detected ***: /usr/sbin/mesos-slave terminated Apr 17 17:53:53 node-1 mesos-slave[4462]: === Backtrace: = Apr 17 17:53:53 node-1 mesos-slave[4462]: /lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7fc53064e38f] Apr 17 17:53:53 node-1 mesos-slave[4462]: /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7fc5306e5c9c] Apr 17 17:53:53 node-1 mesos-slave[4462]: /lib/x86_64-linux-gnu/libc.so.6(+0x109b60)[0x7fc5306e4b60] Apr 17 17:53:53 node-1 mesos-slave[4462]: /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internallsERSoRKNS0_12StatusUpdateE+0x16a)[0x7fc531cc617a] Apr 17 17:53:53 node-1 mesos-slave[4462]: /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internal5slave5Slave12statusUpdateENS0_12StatusUpdateERK6OptionIN7process4UPIDEE+0xe7)[0x7fc531d71837] Apr 17 17:53:53 node-1 mesos-slave[4462]: /usr/local/lib/libmesos-0.28.0.so(_ZNK5mesos8internal5slave5Slave4Http8executorERKN7process4http7RequestE+0xb52)[0x7fc531d302a2] Apr 17 17:53:53 node-1 mesos-slave[4462]: 
/usr/local/lib/libmesos-0.28.0.so(+0xc754a3)[0x7fc531d4d4a3] Apr 17 17:53:53 node-1 mesos-slave[4462]: /usr/local/lib/libmesos-0.28.0.so(+0x1295aa8)[0x7fc53236daa8] Apr 17 17:53:53 node-1 mesos-slave[4462]: /usr/local/lib/libmesos-0.28.0.so(_ZN7process14ProcessManager6resumeEPNS_11ProcessBaseE+0x2d1)[0x7fc532375a71] Apr 17 17:53:53 node-1 mesos-slave[4462]: /usr/local/lib/libmesos-0.28.0.so(+0x129dd77)[0x7fc532375d77] Apr 17 17:53:53 node-1 mesos-slave[4462]: /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1bf0)[0x7fc530e85bf0] Apr 17 17:53:53 node-1 mesos-slave[4462]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fc5309a8182] Apr 17 17:53:53 node-1 mesos-slave[4462]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc5306d547d] ... Apr 17 17:53:53 node-1 mesos-slave[4462]: *** Aborted at 1460915633 (unix time) try "date -d @1460915633" if you are using GNU date *** Apr 17 17:53:53 node-1 mesos-slave[4462]: PC: @ 0x7fc530611cc9 (unknown) Apr 17 17:53:53 node-1 mesos-slave[4462]: *** SIGABRT (@0x116e) received by PID 4462 (TID 0x7fc5275f5700) from PID 4462; stack trace: *** Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5309b0340 (unknown) Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc530611cc9 (unknown) Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306150d8 (unknown) Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53064e394 (unknown) Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e5c9c (unknown) Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e4b60 (unknown) Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531cc617a mesos::internal::operator<<() Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d71837 mesos::internal::slave::Slave::statusUpdate() Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d302a2 mesos::internal::slave::Slave::Http::executor() Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d4d4a3 
_ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestEEZN5mesos8internal5slave5Slave10initializeEvEUlS7_E19_E9_M_invokeERKSt9_Any_dataS7_ Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53236daa8 _ZZN7process11ProcessBase5visitERKNS_9HttpEventEENKUlRKNS_6FutureI6OptionINS_4http14authentication20AuthenticationResultE0_clESC_ Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc532375a71 process::ProcessManager::resume() Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc532375d77 _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc530e85bf0 (unknown) Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5309a8182 start_thread
[jira] [Comment Edited] (MESOS-1653) HealthCheckTest.GracePeriod is flaky.
[ https://issues.apache.org/jira/browse/MESOS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244718#comment-15244718 ] haosdent edited comment on MESOS-1653 at 4/17/16 4:41 PM: -- [~tnachen] After looking at the log [~xujyan] posted: the second {{statusUpdate}} arrives nearly 5 seconds after {{14:46:23}}. {code} I0909 14:46:23.633633 944 hierarchical_allocator_process.hpp:659] Performed allocation for 1 slaves in 61631ns I0909 14:46:27.799932 947 hierarchical_allocator_process.hpp:659] Performed allocation for 1 slaves in 95512ns I0909 14:46:27.800237 947 master.cpp:120] No whitelist given. Advertising offers for all slaves I0909 14:46:27.800612 947 slave.cpp:2329] Received ping from slave-observer(2)@127.0.1.1:47396 tests/health_check_tests.cpp:557: Failure Failed to wait 10secs for statusHealth tests/health_check_tests.cpp:539: Failure Actual function call count doesn't match EXPECT_CALL(sched, statusUpdate(&driver, _))... Expected: to be called at least twice Actual: called once - unsatisfied and active I0909 14:46:27.815444 928 master.cpp:650] Master terminating I0909 14:46:27.815640 928 master.hpp:851] Removing task 1 with resources cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] on slave 20140909-144617-16842879-47396-928-0 (lucid) W0909 14:46:27.815795 928 master.cpp:4419] Removing task 1 of framework 20140909-144617-16842879-47396-928- and slave 20140909-144617-16842879-47396-928-0 in non-terminal state TASK_RUNNING I0909 14:46:27.823565 943 slave.cpp:2361] master@127.0.1.1:47396 exited W0909 14:46:27.823611 943 slave.cpp:2364] Master disconnected! 
Waiting for a new master to be elected I0909 14:46:27.828475 943 slave.cpp:2093] Handling status update TASK_RUNNING (UUID: 5f53830d-cd08-4c57-be42-33be367d3f01) for task 1 in health state unhealthy of framework 20140909-144617-16842879-47396-928- from executor(1)@127.0.1.1:52801 {code} I think we need to add {code} @@ -1053,6 +1053,9 @@ TEST_F(HealthCheckTest, DISABLED_GracePeriod) driver.launchTasks(offers.get()[0].id(), tasks); + AWAIT_READY(statusRunning); + EXPECT_EQ(TASK_RUNNING, statusRunning.get().state()); + Clock::pause(); {code} before advancing the clock. Do you think it is OK to add this and reenable the test case? was (Author: haosd...@gmail.com): [~tnachen] After saw the log [~xujyan] posted. The second {statusUpdate} is nearly 5 seconds delay after {14:46:23}. {code} I0909 14:46:23.633633 944 hierarchical_allocator_process.hpp:659] Performed allocation for 1 slaves in 61631ns I0909 14:46:27.799932 947 hierarchical_allocator_process.hpp:659] Performed allocation for 1 slaves in 95512ns I0909 14:46:27.800237 947 master.cpp:120] No whitelist given. Advertising offers for all slaves I0909 14:46:27.800612 947 slave.cpp:2329] Received ping from slave-observer(2)@127.0.1.1:47396 tests/health_check_tests.cpp:557: Failure Failed to wait 10secs for statusHealth tests/health_check_tests.cpp:539: Failure Actual function call count doesn't match EXPECT_CALL(sched, statusUpdate(&driver, _))... 
Expected: to be called at least twice Actual: called once - unsatisfied and active I0909 14:46:27.815444 928 master.cpp:650] Master terminating I0909 14:46:27.815640 928 master.hpp:851] Removing task 1 with resources cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] on slave 20140909-144617-16842879-47396-928-0 (lucid) W0909 14:46:27.815795 928 master.cpp:4419] Removing task 1 of framework 20140909-144617-16842879-47396-928- and slave 20140909-144617-16842879-47396-928-0 in non-terminal state TASK_RUNNING I0909 14:46:27.823565 943 slave.cpp:2361] master@127.0.1.1:47396 exited W0909 14:46:27.823611 943 slave.cpp:2364] Master disconnected! Waiting for a new master to be elected I0909 14:46:27.828475 943 slave.cpp:2093] Handling status update TASK_RUNNING (UUID: 5f53830d-cd08-4c57-be42-33be367d3f01) for task 1 in health state unhealthy of framework 20140909-144617-16842879-47396-928- from executor(1)@127.0.1.1:52801 {code} I think we need add a {AWAIT_READY(statusRunning);} before advance clock. Do you think it is OK to add this and reenable the test case? > HealthCheckTest.GracePeriod is flaky. > - > > Key: MESOS-1653 > URL: https://issues.apache.org/jira/browse/MESOS-1653 > Project: Mesos > Issue Type: Bug > Components: test >Reporter: Benjamin Mahler >Assignee: Timothy Chen > Labels: flaky, health-check, mesosphere > > {noformat} > [--] 3 tests from HealthCheckTest > [ RUN ] HealthCheckTest.GracePeriod > Using temporary directory '/tmp/HealthCheckTest_GracePeriod_d7zCPr' > I0729 17:10:10.484951 1176 leveldb.cpp:176] Opened db in 28.883552ms > I0729 17:10:10
[jira] [Commented] (MESOS-1653) HealthCheckTest.GracePeriod is flaky.
[ https://issues.apache.org/jira/browse/MESOS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244718#comment-15244718 ] haosdent commented on MESOS-1653: - [~tnachen] After looking at the log [~xujyan] posted: the second {{statusUpdate}} arrives nearly 5 seconds after {{14:46:23}}. {code} I0909 14:46:23.633633 944 hierarchical_allocator_process.hpp:659] Performed allocation for 1 slaves in 61631ns I0909 14:46:27.799932 947 hierarchical_allocator_process.hpp:659] Performed allocation for 1 slaves in 95512ns I0909 14:46:27.800237 947 master.cpp:120] No whitelist given. Advertising offers for all slaves I0909 14:46:27.800612 947 slave.cpp:2329] Received ping from slave-observer(2)@127.0.1.1:47396 tests/health_check_tests.cpp:557: Failure Failed to wait 10secs for statusHealth tests/health_check_tests.cpp:539: Failure Actual function call count doesn't match EXPECT_CALL(sched, statusUpdate(&driver, _))... Expected: to be called at least twice Actual: called once - unsatisfied and active I0909 14:46:27.815444 928 master.cpp:650] Master terminating I0909 14:46:27.815640 928 master.hpp:851] Removing task 1 with resources cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] on slave 20140909-144617-16842879-47396-928-0 (lucid) W0909 14:46:27.815795 928 master.cpp:4419] Removing task 1 of framework 20140909-144617-16842879-47396-928- and slave 20140909-144617-16842879-47396-928-0 in non-terminal state TASK_RUNNING I0909 14:46:27.823565 943 slave.cpp:2361] master@127.0.1.1:47396 exited W0909 14:46:27.823611 943 slave.cpp:2364] Master disconnected! Waiting for a new master to be elected I0909 14:46:27.828475 943 slave.cpp:2093] Handling status update TASK_RUNNING (UUID: 5f53830d-cd08-4c57-be42-33be367d3f01) for task 1 in health state unhealthy of framework 20140909-144617-16842879-47396-928- from executor(1)@127.0.1.1:52801 {code} I think we need to add an {{AWAIT_READY(statusRunning);}} before advancing the clock. 
Do you think it is OK to add this and reenable the test case? > HealthCheckTest.GracePeriod is flaky. > - > > Key: MESOS-1653 > URL: https://issues.apache.org/jira/browse/MESOS-1653 > Project: Mesos > Issue Type: Bug > Components: test >Reporter: Benjamin Mahler >Assignee: Timothy Chen > Labels: flaky, health-check, mesosphere > > {noformat} > [--] 3 tests from HealthCheckTest > [ RUN ] HealthCheckTest.GracePeriod > Using temporary directory '/tmp/HealthCheckTest_GracePeriod_d7zCPr' > I0729 17:10:10.484951 1176 leveldb.cpp:176] Opened db in 28.883552ms > I0729 17:10:10.499487 1176 leveldb.cpp:183] Compacted db in 13.674118ms > I0729 17:10:10.500200 1176 leveldb.cpp:198] Created db iterator in 7394ns > I0729 17:10:10.500692 1176 leveldb.cpp:204] Seeked to beginning of db in > 2317ns > I0729 17:10:10.501113 1176 leveldb.cpp:273] Iterated through 0 keys in the > db in 1367ns > I0729 17:10:10.501535 1176 replica.cpp:741] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0729 17:10:10.502233 1212 recover.cpp:425] Starting replica recovery > I0729 17:10:10.502295 1212 recover.cpp:451] Replica is in EMPTY status > I0729 17:10:10.502825 1212 replica.cpp:638] Replica in EMPTY status received > a broadcasted recover request > I0729 17:10:10.502877 1212 recover.cpp:188] Received a recover response from > a replica in EMPTY status > I0729 17:10:10.502980 1212 recover.cpp:542] Updating replica status to > STARTING > I0729 17:10:10.508482 1213 master.cpp:289] Master > 20140729-171010-16842879-54701-1176 (trusty) started on 127.0.1.1:54701 > I0729 17:10:10.508607 1213 master.cpp:326] Master only allowing > authenticated frameworks to register > I0729 17:10:10.508632 1213 master.cpp:331] Master only allowing > authenticated slaves to register > I0729 17:10:10.508656 1213 credentials.hpp:36] Loading credentials for > authentication from '/tmp/HealthCheckTest_GracePeriod_d7zCPr/credentials' > I0729 17:10:10.509407 1213 master.cpp:360] Authorization 
enabled > I0729 17:10:10.510030 1207 hierarchical_allocator_process.hpp:301] > Initializing hierarchical allocator process with master : > master@127.0.1.1:54701 > I0729 17:10:10.510113 1207 master.cpp:123] No whitelist given. Advertising > offers for all slaves > I0729 17:10:10.511699 1213 master.cpp:1129] The newly elected leader is > master@127.0.1.1:54701 with id 20140729-171010-16842879-54701-1176 > I0729 17:10:10.512230 1213 master.cpp:1142] Elected as the leading master! > I0729 17:10:10.512692 1213 master.cpp:960] Recovering from registrar > I0729 17:10:10.513226 1210 registrar.cpp:313] Recovering registrar > I0729 17:10:10.516006 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 12.946461ms > I0729 17:10
[jira] [Assigned] (MESOS-1802) HealthCheckTest.HealthStatusChange is flaky on jenkins.
[ https://issues.apache.org/jira/browse/MESOS-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] haosdent reassigned MESOS-1802: --- Assignee: haosdent > HealthCheckTest.HealthStatusChange is flaky on jenkins. > --- > > Key: MESOS-1802 > URL: https://issues.apache.org/jira/browse/MESOS-1802 > Project: Mesos > Issue Type: Bug > Components: test, tests >Affects Versions: 0.26.0 >Reporter: Benjamin Mahler >Assignee: haosdent > Labels: flaky, health-check, mesosphere > Attachments: health_check_flaky_test_log.txt > > > https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2374/consoleFull > {noformat} > [ RUN ] HealthCheckTest.HealthStatusChange > Using temporary directory '/tmp/HealthCheckTest_HealthStatusChange_IYnlu2' > I0916 22:56:14.034612 21026 leveldb.cpp:176] Opened db in 2.155713ms > I0916 22:56:14.034965 21026 leveldb.cpp:183] Compacted db in 332489ns > I0916 22:56:14.034984 21026 leveldb.cpp:198] Created db iterator in 3710ns > I0916 22:56:14.034996 21026 leveldb.cpp:204] Seeked to beginning of db in > 642ns > I0916 22:56:14.035006 21026 leveldb.cpp:273] Iterated through 0 keys in the > db in 343ns > I0916 22:56:14.035023 21026 replica.cpp:741] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0916 22:56:14.035200 21054 recover.cpp:425] Starting replica recovery > I0916 22:56:14.035403 21041 recover.cpp:451] Replica is in EMPTY status > I0916 22:56:14.035888 21045 replica.cpp:638] Replica in EMPTY status received > a broadcasted recover request > I0916 22:56:14.035969 21052 recover.cpp:188] Received a recover response from > a replica in EMPTY status > I0916 22:56:14.036118 21042 recover.cpp:542] Updating replica status to > STARTING > I0916 22:56:14.036603 21046 master.cpp:286] Master > 20140916-225614-3125920579-47865-21026 (penates.apache.org) started on > 67.195.81.186:47865 > I0916 22:56:14.036634 21046 master.cpp:332] Master only allowing > 
authenticated frameworks to register > I0916 22:56:14.036648 21046 master.cpp:337] Master only allowing > authenticated slaves to register > I0916 22:56:14.036659 21046 credentials.hpp:36] Loading credentials for > authentication from > '/tmp/HealthCheckTest_HealthStatusChange_IYnlu2/credentials' > I0916 22:56:14.036686 21045 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 480322ns > I0916 22:56:14.036700 21045 replica.cpp:320] Persisted replica status to > STARTING > I0916 22:56:14.036769 21046 master.cpp:366] Authorization enabled > I0916 22:56:14.036826 21045 recover.cpp:451] Replica is in STARTING status > I0916 22:56:14.036944 21052 master.cpp:120] No whitelist given. Advertising > offers for all slaves > I0916 22:56:14.036968 21049 hierarchical_allocator_process.hpp:299] > Initializing hierarchical allocator process with master : > master@67.195.81.186:47865 > I0916 22:56:14.037284 21054 replica.cpp:638] Replica in STARTING status > received a broadcasted recover request > I0916 22:56:14.037312 21046 master.cpp:1212] The newly elected leader is > master@67.195.81.186:47865 with id 20140916-225614-3125920579-47865-21026 > I0916 22:56:14.037333 21046 master.cpp:1225] Elected as the leading master! 
> I0916 22:56:14.037345 21046 master.cpp:1043] Recovering from registrar > I0916 22:56:14.037504 21040 registrar.cpp:313] Recovering registrar > I0916 22:56:14.037505 21053 recover.cpp:188] Received a recover response from > a replica in STARTING status > I0916 22:56:14.037681 21047 recover.cpp:542] Updating replica status to VOTING > I0916 22:56:14.038072 21052 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 330251ns > I0916 22:56:14.038087 21052 replica.cpp:320] Persisted replica status to > VOTING > I0916 22:56:14.038127 21053 recover.cpp:556] Successfully joined the Paxos > group > I0916 22:56:14.038202 21053 recover.cpp:440] Recover process terminated > I0916 22:56:14.038364 21048 log.cpp:656] Attempting to start the writer > I0916 22:56:14.038812 21053 replica.cpp:474] Replica received implicit > promise request with proposal 1 > I0916 22:56:14.038925 21053 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 92623ns > I0916 22:56:14.038944 21053 replica.cpp:342] Persisted promised to 1 > I0916 22:56:14.039201 21052 coordinator.cpp:230] Coordinator attemping to > fill missing position > I0916 22:56:14.039676 21047 replica.cpp:375] Replica received explicit > promise request for position 0 with proposal 2 > I0916 22:56:14.039836 21047 leveldb.cpp:343] Persisting action (8 bytes) to > leveldb took 144215ns > I0916 22:56:14.039850 21047 replica.cpp:676] Persisted action at 0 > I0916 22:56:14.040243 21047 replica.cpp:508] Replica received write request > for position 0 > I091
[jira] [Commented] (MESOS-2331) MasterSlaveReconciliationTest.ReconcileRace is flaky
[ https://issues.apache.org/jira/browse/MESOS-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244665#comment-15244665 ] haosdent commented on MESOS-2331: - I think we could add a {{settle()}} before {{DROP_PROTOBUFS}}: {code} diff --git a/src/tests/master_slave_reconciliation_tests.cpp b/src/tests/master_slave_reconciliation_tests.cpp index 71fb78a..833c3c0 100644 --- a/src/tests/master_slave_reconciliation_tests.cpp +++ b/src/tests/master_slave_reconciliation_tests.cpp @@ -295,6 +295,11 @@ TEST_F(MasterSlaveReconciliationTest, ReconcileRace) driver.start(); + // Make sure all `SlaveRegisteredMessage` have been handled by agent. + Clock::pause(); + Clock::settle(); + Clock::resume(); + // Trigger a re-registration of the slave and capture the message // so that we can spoof a race with a launch task message. DROP_PROTOBUFS(ReregisterSlaveMessage(), slave.get()->pid, master.get()->pid); {code} However, I could not reproduce it in my environment and have only seen it on ReviewBot, so I am not sure whether this approach works. 
> MasterSlaveReconciliationTest.ReconcileRace is flaky > > > Key: MESOS-2331 > URL: https://issues.apache.org/jira/browse/MESOS-2331 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.22.0 >Reporter: Yan Xu >Assignee: Qian Zhang > Labels: flaky > > {noformat:title=} > [ RUN ] MasterSlaveReconciliationTest.ReconcileRace > Using temporary directory > '/tmp/MasterSlaveReconciliationTest_ReconcileRace_NE9nhV' > I0206 19:09:44.196542 32362 leveldb.cpp:175] Opened db in 38.230192ms > I0206 19:09:44.206826 32362 leveldb.cpp:182] Compacted db in 9.988493ms > I0206 19:09:44.207164 32362 leveldb.cpp:197] Created db iterator in 29979ns > I0206 19:09:44.207641 32362 leveldb.cpp:203] Seeked to beginning of db in > 4478ns > I0206 19:09:44.207929 32362 leveldb.cpp:272] Iterated through 0 keys in the > db in 737ns > I0206 19:09:44.208222 32362 replica.cpp:743] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0206 19:09:44.209132 32384 recover.cpp:448] Starting replica recovery > I0206 19:09:44.209524 32384 recover.cpp:474] Replica is in EMPTY status > I0206 19:09:44.211094 32384 replica.cpp:640] Replica in EMPTY status received > a broadcasted recover request > I0206 19:09:44.211385 32384 recover.cpp:194] Received a recover response from > a replica in EMPTY status > I0206 19:09:44.211902 32384 recover.cpp:565] Updating replica status to > STARTING > I0206 19:09:44.236177 32381 master.cpp:344] Master > 20150206-190944-16842879-36452-32362 (lucid) started on 127.0.1.1:36452 > I0206 19:09:44.236291 32381 master.cpp:390] Master only allowing > authenticated frameworks to register > I0206 19:09:44.236305 32381 master.cpp:395] Master only allowing > authenticated slaves to register > I0206 19:09:44.236327 32381 credentials.hpp:35] Loading credentials for > authentication from > '/tmp/MasterSlaveReconciliationTest_ReconcileRace_NE9nhV/credentials' > I0206 19:09:44.236601 32381 master.cpp:439] Authorization enabled > I0206 
19:09:44.238539 32381 hierarchical_allocator_process.hpp:284] > Initialized hierarchical allocator process > I0206 19:09:44.238662 32381 whitelist_watcher.cpp:64] No whitelist given > I0206 19:09:44.239364 32381 master.cpp:1350] The newly elected leader is > master@127.0.1.1:36452 with id 20150206-190944-16842879-36452-32362 > I0206 19:09:44.239392 32381 master.cpp:1363] Elected as the leading master! > I0206 19:09:44.239413 32381 master.cpp:1181] Recovering from registrar > I0206 19:09:44.239645 32381 registrar.cpp:312] Recovering registrar > I0206 19:09:44.241142 32384 leveldb.cpp:305] Persisting metadata (8 bytes) to > leveldb took 29.029117ms > I0206 19:09:44.241189 32384 replica.cpp:322] Persisted replica status to > STARTING > I0206 19:09:44.241478 32384 recover.cpp:474] Replica is in STARTING status > I0206 19:09:44.243075 32384 replica.cpp:640] Replica in STARTING status > received a broadcasted recover request > I0206 19:09:44.243398 32384 recover.cpp:194] Received a recover response from > a replica in STARTING status > I0206 19:09:44.243964 32384 recover.cpp:565] Updating replica status to VOTING > I0206 19:09:44.255692 32384 leveldb.cpp:305] Persisting metadata (8 bytes) to > leveldb took 11.502759ms > I0206 19:09:44.255765 32384 replica.cpp:322] Persisted replica status to > VOTING > I0206 19:09:44.256009 32384 recover.cpp:579] Successfully joined the Paxos > group > I0206 19:09:44.256253 32384 recover.cpp:463] Recover process terminated > I0206 19:09:44.257669 32384 log.cpp:659] Attempting to start the writer > I0206 19:09:44.259944 32377 replica.cpp:476] Replica received implicit > promise request with proposal 1 > I0206 19:09:44.268805 32377
[jira] [Commented] (MESOS-2331) MasterSlaveReconciliationTest.ReconcileRace is flaky
[ https://issues.apache.org/jira/browse/MESOS-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244619#comment-15244619 ] haosdent commented on MESOS-2331: - Compared to the normal log, the cause of the flakiness is {code} I0417 08:09:37.556551 31925 master.cpp:4580] Registered agent 07f7917f-63d1-40d4-b983-4f0eb5c18f3d-S0 at slave(141)@172.17.0.1:35480 (95302125b116) with cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] I0417 08:09:37.557147 31925 master.cpp:4482] Agent 07f7917f-63d1-40d4-b983-4f0eb5c18f3d-S0 at slave(141)@172.17.0.1:35480 (95302125b116) already registered, resending acknowledgement {code} The Mesos master resends {{SlaveRegisteredMessage}}, which causes the Mesos agent to register successfully, so it does not need to re-register again. > MasterSlaveReconciliationTest.ReconcileRace is flaky > > > Key: MESOS-2331 > URL: https://issues.apache.org/jira/browse/MESOS-2331 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.22.0 >Reporter: Yan Xu >Assignee: Qian Zhang > Labels: flaky > > {noformat:title=} > [ RUN ] MasterSlaveReconciliationTest.ReconcileRace > Using temporary directory > '/tmp/MasterSlaveReconciliationTest_ReconcileRace_NE9nhV' > I0206 19:09:44.196542 32362 leveldb.cpp:175] Opened db in 38.230192ms > I0206 19:09:44.206826 32362 leveldb.cpp:182] Compacted db in 9.988493ms > I0206 19:09:44.207164 32362 leveldb.cpp:197] Created db iterator in 29979ns > I0206 19:09:44.207641 32362 leveldb.cpp:203] Seeked to beginning of db in > 4478ns > I0206 19:09:44.207929 32362 leveldb.cpp:272] Iterated through 0 keys in the > db in 737ns > I0206 19:09:44.208222 32362 replica.cpp:743] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0206 19:09:44.209132 32384 recover.cpp:448] Starting replica recovery > I0206 19:09:44.209524 32384 recover.cpp:474] Replica is in EMPTY status > I0206 19:09:44.211094 32384 replica.cpp:640] Replica in EMPTY status received > a broadcasted recover 
request > I0206 19:09:44.211385 32384 recover.cpp:194] Received a recover response from > a replica in EMPTY status > I0206 19:09:44.211902 32384 recover.cpp:565] Updating replica status to > STARTING > I0206 19:09:44.236177 32381 master.cpp:344] Master > 20150206-190944-16842879-36452-32362 (lucid) started on 127.0.1.1:36452 > I0206 19:09:44.236291 32381 master.cpp:390] Master only allowing > authenticated frameworks to register > I0206 19:09:44.236305 32381 master.cpp:395] Master only allowing > authenticated slaves to register > I0206 19:09:44.236327 32381 credentials.hpp:35] Loading credentials for > authentication from > '/tmp/MasterSlaveReconciliationTest_ReconcileRace_NE9nhV/credentials' > I0206 19:09:44.236601 32381 master.cpp:439] Authorization enabled > I0206 19:09:44.238539 32381 hierarchical_allocator_process.hpp:284] > Initialized hierarchical allocator process > I0206 19:09:44.238662 32381 whitelist_watcher.cpp:64] No whitelist given > I0206 19:09:44.239364 32381 master.cpp:1350] The newly elected leader is > master@127.0.1.1:36452 with id 20150206-190944-16842879-36452-32362 > I0206 19:09:44.239392 32381 master.cpp:1363] Elected as the leading master! 
> I0206 19:09:44.239413 32381 master.cpp:1181] Recovering from registrar > I0206 19:09:44.239645 32381 registrar.cpp:312] Recovering registrar > I0206 19:09:44.241142 32384 leveldb.cpp:305] Persisting metadata (8 bytes) to > leveldb took 29.029117ms > I0206 19:09:44.241189 32384 replica.cpp:322] Persisted replica status to > STARTING > I0206 19:09:44.241478 32384 recover.cpp:474] Replica is in STARTING status > I0206 19:09:44.243075 32384 replica.cpp:640] Replica in STARTING status > received a broadcasted recover request > I0206 19:09:44.243398 32384 recover.cpp:194] Received a recover response from > a replica in STARTING status > I0206 19:09:44.243964 32384 recover.cpp:565] Updating replica status to VOTING > I0206 19:09:44.255692 32384 leveldb.cpp:305] Persisting metadata (8 bytes) to > leveldb took 11.502759ms > I0206 19:09:44.255765 32384 replica.cpp:322] Persisted replica status to > VOTING > I0206 19:09:44.256009 32384 recover.cpp:579] Successfully joined the Paxos > group > I0206 19:09:44.256253 32384 recover.cpp:463] Recover process terminated > I0206 19:09:44.257669 32384 log.cpp:659] Attempting to start the writer > I0206 19:09:44.259944 32377 replica.cpp:476] Replica received implicit > promise request with proposal 1 > I0206 19:09:44.268805 32377 leveldb.cpp:305] Persisting metadata (8 bytes) to > leveldb took 8.45858ms > I0206 19:09:44.269067 32377 replica.cpp:344] Persisted promised to 1 > I0206 19:09:44.277974 32383 coordinator.cpp:229] Coordinator attemping to > fill missing position > I0206 19:09:44.279767 32383 replica.cpp:377] Repli
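The diagnosis in the comment above (the master resends the registration acknowledgement when a duplicate registration arrives, so the agent considers itself registered and never re-registers) can be sketched as a tiny simulation. This is a hypothetical illustration only; the class and message names are stand-ins, not actual Mesos code.

```python
# Hypothetical sketch of the duplicate-registration path: on a retried
# registration for an agent that is already registered, the master resends
# the acknowledgement rather than registering the agent a second time.
class Master:
    def __init__(self):
        self.registered = set()  # agent ids the master already knows about

    def handle_register(self, agent_id):
        if agent_id in self.registered:
            # Already registered: just resend the acknowledgement. The agent
            # then treats itself as registered and skips re-registration,
            # which is what races with the test's expectation.
            return "SlaveRegisteredMessage (resent)"
        self.registered.add(agent_id)
        return "SlaveRegisteredMessage"

master = Master()
first = master.handle_register("S0")   # initial registration
retry = master.handle_register("S0")   # retried registration during the race
```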
[jira] [Commented] (MESOS-2331) MasterSlaveReconciliationTest.ReconcileRace is flaky
[ https://issues.apache.org/jira/browse/MESOS-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244596#comment-15244596 ] haosdent commented on MESOS-2331: - Saw this again {code} [ RUN ] MasterSlaveReconciliationTest.ReconcileRace I0417 08:09:37.011265 31901 cluster.cpp:149] Creating default 'local' authorizer I0417 08:09:37.086580 31901 leveldb.cpp:174] Opened db in 74.882317ms I0417 08:09:37.103621 31901 leveldb.cpp:181] Compacted db in 16.92606ms I0417 08:09:37.103744 31901 leveldb.cpp:196] Created db iterator in 32846ns I0417 08:09:37.103762 31901 leveldb.cpp:202] Seeked to beginning of db in 3615ns I0417 08:09:37.103775 31901 leveldb.cpp:271] Iterated through 0 keys in the db in 250ns I0417 08:09:37.103832 31901 replica.cpp:779] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned I0417 08:09:37.104671 31931 recover.cpp:447] Starting replica recovery I0417 08:09:37.105304 31931 recover.cpp:473] Replica is in EMPTY status I0417 08:09:37.106678 31934 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (6118)@172.17.0.1:35480 I0417 08:09:37.107188 31929 recover.cpp:193] Received a recover response from a replica in EMPTY status I0417 08:09:37.108885 31934 recover.cpp:564] Updating replica status to STARTING I0417 08:09:37.111217 31922 master.cpp:382] Master 07f7917f-63d1-40d4-b983-4f0eb5c18f3d (95302125b116) started on 172.17.0.1:35480 I0417 08:09:37.111249 31922 master.cpp:384] Flags at startup: --acls="" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="true" --authenticate_http="true" --authenticate_http_frameworks="true" --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/Wdw9Iq/credentials" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" --quiet="false" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" --registry_strict="true" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/mesos/mesos-0.29.0/_inst/share/mesos/webui" --work_dir="/tmp/Wdw9Iq/master" --zk_session_timeout="10secs" I0417 08:09:37.111726 31922 master.cpp:433] Master only allowing authenticated frameworks to register I0417 08:09:37.111738 31922 master.cpp:439] Master only allowing authenticated agents to register I0417 08:09:37.111747 31922 master.cpp:445] Master only allowing authenticated HTTP frameworks to register I0417 08:09:37.111755 31922 credentials.hpp:37] Loading credentials for authentication from '/tmp/Wdw9Iq/credentials' I0417 08:09:37.112149 31922 master.cpp:489] Using default 'crammd5' authenticator I0417 08:09:37.112300 31922 master.cpp:560] Using default 'basic' HTTP authenticator I0417 08:09:37.112460 31922 master.cpp:640] Using default 'basic' HTTP framework authenticator I0417 08:09:37.112573 31922 master.cpp:687] Authorization enabled I0417 08:09:37.112798 31931 hierarchical.cpp:142] Initialized hierarchical allocator process I0417 08:09:37.112861 31931 whitelist_watcher.cpp:77] No whitelist given I0417 08:09:37.122642 31921 master.cpp:1932] The newly elected leader is master@172.17.0.1:35480 with id 07f7917f-63d1-40d4-b983-4f0eb5c18f3d I0417 08:09:37.122709 31921 master.cpp:1945] Elected as the leading master! 
I0417 08:09:37.122732 31921 master.cpp:1632] Recovering from registrar I0417 08:09:37.123011 31921 registrar.cpp:331] Recovering registrar I0417 08:09:37.137696 31929 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 28.65638ms I0417 08:09:37.137791 31929 replica.cpp:320] Persisted replica status to STARTING I0417 08:09:37.138139 31921 recover.cpp:473] Replica is in STARTING status I0417 08:09:37.139683 31929 replica.cpp:673] Replica in STARTING status received a broadcasted recover request from (6121)@172.17.0.1:35480 I0417 08:09:37.139957 31935 recover.cpp:193] Received a recover response from a replica in STARTING status I0417 08:09:37.140836 31928 recover.cpp:564] Updating replica status to VOTING I0417 08:09:37.161991 31928 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 20.949493ms I0417 08:09:37.162083 31928 replica.cpp:320] Persisted replica status to VOTING I0417 08:09:37.162320 31935 recover.cpp:578] Successfully joined the Paxos group I0417 08:09:37.162582 31935 recover.cpp:462] Recover process terminated I0417 08:09:37.163247 31923 log.cpp:659] Attempting to start the writer I0417 08:09:37.165011 31923 replica.cpp:493] Replica received implicit promise req
[jira] [Updated] (MESOS-3567) Support TCP checks in Mesos health check program
[ https://issues.apache.org/jira/browse/MESOS-3567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] haosdent updated MESOS-3567: Labels: Mesosphere health-check (was: Mesosphere) > Support TCP checks in Mesos health check program > > > Key: MESOS-3567 > URL: https://issues.apache.org/jira/browse/MESOS-3567 > Project: Mesos > Issue Type: Improvement >Reporter: Matthias Veit >Assignee: haosdent > Labels: Mesosphere, health-check > > In Marathon we have the ability to specify Health Checks for: > - Command (Mesos supports this) > - HTTP (see progress in MESOS-2533) > - TCP (missing) > See here for reference: > https://mesosphere.github.io/marathon/docs/health-checks.html > Since we have had good experiences with those 3 options in Marathon, I see a lot > of value in Mesos also supporting them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
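The TCP variant requested in this ticket boils down to attempting a connection and treating success as healthy. A minimal sketch of that idea (not the actual Mesos health-check program; the function name and timeout default are assumptions):

```python
import socket

def tcp_health_check(host, port, timeout=1.0):
    """Report healthy if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the full TCP handshake, so a successful
        # return means something is actually accepting connections there.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```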
[jira] [Commented] (MESOS-2533) Support HTTP checks in Mesos health check program
[ https://issues.apache.org/jira/browse/MESOS-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244589#comment-15244589 ] haosdent commented on MESOS-2533: - [~alexr] I updated https://reviews.apache.org/r/36816/ , would you please review it at your convenience? Thank you in advance. > Support HTTP checks in Mesos health check program > - > > Key: MESOS-2533 > URL: https://issues.apache.org/jira/browse/MESOS-2533 > Project: Mesos > Issue Type: Improvement >Reporter: Niklas Quarfot Nielsen >Assignee: haosdent > Labels: health-check, mesosphere > > Currently, only commands are supported but our health check protobuf enables > users to encode HTTP checks as well. We should wire this up in the health > check program or remove the http field from the protobuf. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
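The HTTP check this ticket asks for amounts to issuing a GET and treating a 2xx/3xx response as healthy. A minimal sketch under that assumption (not the implementation in the review above; the function name and status-range choice are illustrative):

```python
from urllib.request import urlopen
from urllib.error import URLError

def http_health_check(url, timeout=1.0):
    """Report healthy if a GET on url returns a 2xx/3xx status within timeout."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.getcode() < 400
    except (URLError, OSError):
        # Connection refused, DNS failure, timeout, etc. count as unhealthy.
        return False
```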