[jira] [Created] (MESOS-7986) ExecutorHttpApiTest.ValidJsonButInvalidProtobuf fails in parallel test execution
Benjamin Bannier created MESOS-7986: --- Summary: ExecutorHttpApiTest.ValidJsonButInvalidProtobuf fails in parallel test execution Key: MESOS-7986 URL: https://issues.apache.org/jira/browse/MESOS-7986 Project: Mesos Issue Type: Bug Components: test Affects Versions: 1.5.0 Reporter: Benjamin Bannier When running the cmake-built Mesos tests in parallel, I reliably encounter a failing {{ExecutorHttpApiTest.ValidJsonButInvalidProtobuf}}, {noformat} $ ../support/mesos-gtest-runner.py ./src/mesos-tests -j10 [ RUN ] ExecutorHttpApiTest.ValidJsonButInvalidProtobuf ../src/tests/executor_http_api_tests.cpp:197: Failure Value of: (response).get().status Actual: "401 Unauthorized" Expected: BadRequest().status Which is: "400 Bad Request" [ FAILED ] ExecutorHttpApiTest.ValidJsonButInvalidProtobuf (17 ms) {noformat}. The machine has 16 physical cores. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7986) ExecutorHttpApiTest.ValidJsonButInvalidProtobuf fails in parallel test execution
[ https://issues.apache.org/jira/browse/MESOS-7986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-7986: Attachment: test.log > ExecutorHttpApiTest.ValidJsonButInvalidProtobuf fails in parallel test > execution > > > Key: MESOS-7986 > URL: https://issues.apache.org/jira/browse/MESOS-7986 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.5.0 >Reporter: Benjamin Bannier > Labels: mesosphere > Attachments: test.log > > > When running the cmake-built Mesos tests in parallel, I reliably encounter a > failing {{ExecutorHttpApiTest.ValidJsonButInvalidProtobuf}}, > {noformat} > $ ../support/mesos-gtest-runner.py ./src/mesos-tests -j10 > [ RUN ] ExecutorHttpApiTest.ValidJsonButInvalidProtobuf > ../src/tests/executor_http_api_tests.cpp:197: Failure > Value of: (response).get().status > Actual: "401 Unauthorized" > Expected: BadRequest().status > Which is: "400 Bad Request" > [ FAILED ] ExecutorHttpApiTest.ValidJsonButInvalidProtobuf (17 ms) > {noformat}. > The machine has 16 physical cores. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (MESOS-4812) Mesos fails to escape command health checks
[ https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrei Budnik reassigned MESOS-4812: Assignee: Andrei Budnik (was: haosdent) Reworked Haosdent's patch: https://reviews.apache.org/r/62381/ > Mesos fails to escape command health checks > --- > > Key: MESOS-4812 > URL: https://issues.apache.org/jira/browse/MESOS-4812 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.25.0 >Reporter: Lukas Loesche >Assignee: Andrei Budnik > Labels: health-check, mesosphere, tech-debt > Attachments: health_task.gif > > > As described in https://github.com/mesosphere/marathon/issues/ > I would like to run a command health check > {noformat} > /bin/bash -c " {noformat} > The health check fails because Mesos, while running the command inside double > quotes of a sh -c "", doesn't escape the double quotes in the command. > If I escape the double quotes myself, the command health check succeeds. But > this would mean that the user needs intimate knowledge of how Mesos executes > his commands, which can't be right. > I was told this is not a Marathon but a Mesos issue, so I am opening this JIRA. > I don't know if this only affects the command health check. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
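For illustration, a small sketch of the quoting problem described above; this is not the Mesos code path, just a hypothetical example of wrapping a user command in /bin/sh -c "..." with and without escaping embedded double quotes (the escape helper below is made up for the example):

{code:cpp}
// Hypothetical illustration: a health-check command that itself contains
// double quotes gets truncated when embedded verbatim inside sh -c "...",
// but survives if the quotes (and other special characters) are escaped.
#include <iostream>
#include <string>

// Naive escaper for text placed inside double quotes (illustrative only).
std::string escapeForDoubleQuotes(const std::string& command)
{
  std::string escaped;
  for (char c : command) {
    if (c == '"' || c == '\\' || c == '$' || c == '`') {
      escaped += '\\';
    }
    escaped += c;
  }
  return escaped;
}

int main()
{
  const std::string command =
    "curl -f -H \"Host: example.com\" http://127.0.0.1/health";

  // Broken: the embedded quotes terminate the -c argument early.
  std::cout << "/bin/sh -c \"" << command << "\"" << std::endl;

  // Working: quotes inside the command are escaped before wrapping.
  std::cout << "/bin/sh -c \"" << escapeForDoubleQuotes(command) << "\""
            << std::endl;

  return 0;
}
{code}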
[jira] [Created] (MESOS-7987) Initialize Google Mock rather than Google Test.
James Peach created MESOS-7987: -- Summary: Initialize Google Mock rather than Google Test. Key: MESOS-7987 URL: https://issues.apache.org/jira/browse/MESOS-7987 Project: Mesos Issue Type: Improvement Components: test Reporter: James Peach Assignee: James Peach We should initialize Google Mock rather than Google Test. The Google Mock initializer also calls the Google Test initializer, so it is functionally a superset. If we do this, the {{\-\-gmock_verbose}} option works. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
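For reference, a minimal test main() illustrating the change; this is a generic sketch of the Google Mock API, not Mesos' actual test main (which also registers its own test environments and flags):

{code:cpp}
// Minimal sketch: initialize Google Mock instead of Google Test.
// InitGoogleMock() also performs Google Test initialization, so all gtest
// flags keep working and the --gmock_verbose flag becomes available too.
#include <gmock/gmock.h>

int main(int argc, char** argv)
{
  // Before: ::testing::InitGoogleTest(&argc, argv);
  ::testing::InitGoogleMock(&argc, argv);
  return RUN_ALL_TESTS();
}
{code}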
[jira] [Created] (MESOS-7988) Mesos attempts to open handle for the system idle process
Andrew Schwartzmeyer created MESOS-7988: --- Summary: Mesos attempts to open handle for the system idle process Key: MESOS-7988 URL: https://issues.apache.org/jira/browse/MESOS-7988 Project: Mesos Issue Type: Bug Components: stout Environment: Windows 10 Reporter: Andrew Schwartzmeyer Assignee: Andrew Schwartzmeyer While running {{mesos-tests}} under Application Verifier, I found that we were inadvertently attempting to get a handle for the System Idle Process. This is not permitted by the OS, and so the {{OpenProcess}} system call was failing. I further found that we were incorrectly checking the failure condition of {{OpenProcess}}. We were attempting to open this handle when opening handles for all PIDs returned by {{os::pids}}, and the Windows API {{EnumProcesses}} includes PID 0 (System Idle Process). As this PID is not useful, we can safely remove it from the {{os::pids}} API. Attempting to do _anything_ with PID 0 will likely result in failure, as it is a special process on Windows, and so we can help to prevent these errors by filtering out PID 0. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
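A rough sketch of the fix being described, using plain Win32 calls rather than the actual stout code (the function names below are illustrative, not the real {{os::pids}} implementation):

{code:cpp}
// Illustrative sketch only. Shows why PID 0 (the System Idle Process) should
// be filtered out of the PID enumeration, and how OpenProcess() failure is
// detected: it returns NULL on failure, not INVALID_HANDLE_VALUE.
#include <windows.h>
#include <psapi.h>

#include <vector>

std::vector<DWORD> enumeratePids()
{
  DWORD buffer[4096];
  DWORD bytes = 0;
  std::vector<DWORD> result;

  if (EnumProcesses(buffer, sizeof(buffer), &bytes)) {
    for (DWORD i = 0; i < bytes / sizeof(DWORD); i++) {
      // PID 0 is the System Idle Process; opening a handle to it always
      // fails, so drop it from the result up front.
      if (buffer[i] != 0) {
        result.push_back(buffer[i]);
      }
    }
  }

  return result;
}

int main()
{
  for (DWORD pid : enumeratePids()) {
    HANDLE handle =
      OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION, FALSE, pid);

    if (handle == nullptr) {
      // NULL (not INVALID_HANDLE_VALUE) signals failure, e.g. access denied.
      continue;
    }

    // ... inspect the process here ...

    CloseHandle(handle);
  }

  return 0;
}
{code}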
[jira] [Commented] (MESOS-7963) Task groups can lose the container limitation status.
[ https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16170890#comment-16170890 ] James Peach commented on MESOS-7963: /cc [~jieyu] This covers the executor container limitation we discussed on the Slack channel. > Task groups can lose the container limitation status. > - > > Key: MESOS-7963 > URL: https://issues.apache.org/jira/browse/MESOS-7963 > Project: Mesos > Issue Type: Bug > Components: containerization, executor >Reporter: James Peach > > If you run a single task in a task group and that task fails with a container > limitation, that status update can be lost and only the executor failure will > be reported to the framework. > {noformat} > exec /opt/mesos/bin/mesos-execute --content_type=json > --master=jpeach.apple.com:5050 '--task_group={ > "tasks": > [ > { > "name": "7f141aca-55fe-4bb0-af4b-87f5ee26986a", > "task_id": {"value" : "2866368d-7279-4657-b8eb-bf1d968e8ebf"}, > "agent_id": {"value" : ""}, > "resources": [{ > "name": "cpus", > "type": "SCALAR", > "scalar": { > "value": 0.2 > } > }, { > "name": "mem", > "type": "SCALAR", > "scalar": { > "value": 32 > } > }, { > "name": "disk", > "type": "SCALAR", > "scalar": { > "value": 2 > } > } > ], > "command": { > "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M > count=64 ; sleep 1" > } > } > ] > }' > I0911 11:48:01.480689 7340 scheduler.cpp:184] Version: 1.5.0 > I0911 11:48:01.488868 7339 scheduler.cpp:470] New master detected at > master@17.228.224.108:5050 > Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 > Submitted task group with tasks [ 2866368d-7279-4657-b8eb-bf1d968e8ebf ] to > agent 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0' > Received status update TASK_RUNNING for task > '2866368d-7279-4657-b8eb-bf1d968e8ebf' > source: SOURCE_EXECUTOR > Received status update TASK_FAILED for task > '2866368d-7279-4657-b8eb-bf1d968e8ebf' > message: 'Command terminated with signal Killed' > source: SOURCE_EXECUTOR > {noformat} > However, the agent logs show that this failed with a memory limitation: > {noformat} > I0911 11:48:02.235818 7012 http.cpp:532] Processing call > WAIT_NESTED_CONTAINER > I0911 11:48:02.236395 7013 status_update_manager.cpp:323] Received status > update TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task > 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework > aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 > I0911 11:48:02.237083 7016 slave.cpp:4875] Forwarding the update > TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task > 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework > aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 to master@17.228.224.108:5050 > I0911 11:48:02.283661 7007 status_update_manager.cpp:395] Received status > update acknowledgement (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task > 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework > aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 > I0911 11:48:04.771455 7014 memory.cpp:516] OOM detected for container > 474388fe-43c3-4372-b903-eaca22740996 > I0911 11:48:04.776445 7014 memory.cpp:556] Memory limit exceeded: Requested: > 64MB Maximum Used: 64MB > ... 
> I0911 11:48:04.776943 7012 containerizer.cpp:2681] Container > 474388fe-43c3-4372-b903-eaca22740996 has reached its limit for resource > [{"name":"mem","scalar":{"value":64.0},"type":"SCALAR"}] and will be > terminated > {noformat} > The following {{mesos-execute}} task will show the container limitation > correctly: > {noformat} > exec /opt/mesos/bin/mesos-execute --content_type=json > --master=jpeach.apple.com:5050 '--task_group={ > "tasks": > [ > { > "name": "37db08f6-4f0f-4ef6-97ee-b10a5c5cc211", > "task_id": {"value" : "1372b2e2-c501-4e80-bcbd-1a5c5194e206"}, > "agent_id": {"value" : ""}, > "resources": [{ > "name": "cpus", > "type": "SCALAR", > "scalar": { > "value": 0.2 > } > }, > { > "name": "mem", > "type": "SCALAR", > "scalar": { > "value": 32 > } > }], > "command": { > "value": "sleep 600" > } > }, { > "name": "7247643c-5e4d-4b01-9839-e38db49f7f
[jira] [Commented] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171058#comment-16171058 ] adaibee commented on MESOS-7966: {code:shell} // # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave We had a loop doing maintenance job in three mesos-slaves: 1.mesos-maintenance-schedule 2.machine-down 3.machine-up 4.mesos-maintenance-schedule-cancel But in the loop, we found one of mesos-master crashed and other mesos-master was elected. We found something in mesos.slave.FATAL: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() And in mesos.slave.INFO: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() 2017-09-12 16:39:07.394 err mesos-master[254491]: *** Check failure stack trace: *** 2017-09-12 16:39:07.402 err mesos-master[254491]: @ 0x7f4cf356fba6 google::LogMessage::Fail() 2017-09-12 16:39:07.413 err mesos-master[254491]: @ 0x7f4cf356fb05 google::LogMessage::SendToLog() 2017-09-12 16:39:07.420 err mesos-master[254491]: @ 0x7f4cf356f516 google::LogMessage::Flush() 2017-09-12 16:39:07.424 err mesos-master[254491]: @ 0x7f4cf357224a google::LogMessageFatal::~LogMessageFatal() 2017-09-12 16:39:07.429 err mesos-master[254491]: @ 0x7f4cf2344a32 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::updateInverseOffer() 2017-09-12 16:39:07.435 err mesos-master[254491]: @ 0x7f4cf1f8d9f9 _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_7SlaveIDERKNS1_11FrameworkIDERK6OptionINS1_20UnavailableResourcesEERKSC_INS1_9allocator18InverseOfferStatusEERKSC_INS1_7FiltersEES6_S9_SE_SJ_SN_EEvRKNS_3PIDIT_EEMSR_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES18_ 2017-09-12 16:39:07.445 err mesos-master[254491]: @ 0x7f4cf1f938bb _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_7SlaveIDERKNS5_11FrameworkIDERK6OptionINS5_20UnavailableResourcesEERKSG_INS5_9allocator18InverseOfferStatusEERKSG_INS5_7FiltersEESA_SD_SI_SN_SR_EEvRKNS0_3PIDIT_EEMSV_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ 2017-09-12 16:39:07.455 err mesos-master[254491]: @ 0x7f4cf34dd049 std::function<>::operator()() 2017-09-12 16:39:07.460 err mesos-master[254491]: @ 0x7f4cf34c1285 process::ProcessBase::visit() 2017-09-12 16:39:07.464 err mesos-master[254491]: @ 0x7f4cf34cc58a process::DispatchEvent::visit() 2017-09-12 16:39:07.465 err mesos-master[254491]: @ 0x7f4cf4e4ad4e process::ProcessBase::serve() 2017-09-12 16:39:07.469 err mesos-master[254491]: @ 0x7f4cf34bd281 process::ProcessManager::resume() 2017-09-12 16:39:07.471 err mesos-master[254491]: @ 0x7f4cf34b9a2c _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv 2017-09-12 16:39:07.473 err mesos-master[254491]: @ 0x7f4cf34cbbf2 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE 2017-09-12 16:39:07.475 err mesos-master[254491]: @ 0x7f4cf34cbb36 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv 2017-09-12 16:39:07.477 err mesos-master[254491]: @ 0x7f4cf34cbac0 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv 2017-09-12 16:39:07.478 err mesos-master[254491]: @ 0x7f4ced3ba1e0 (unknown) 
2017-09-12 16:39:07.478 err mesos-master[254491]: @ 0x7f4ced613dc5 start_thread 2017-09-12 16:39:07.479 err mesos-master[254491]: @ 0x7f4cecb21ced __clone 2017-09-12 16:39:07.486 notice systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT This case can be reproduced by running `for i in {1..8}; do python call.py; done` (call.py gist: https://gist.github.com/athlum/e2cd04bfb9f81a790d31643606252a49 ). It looks like something goes wrong when /maintenance/schedule is called concurrently. We hit this case because we wrote an Ansible-based service that manages the Mesos cluster and creates tasks that update slave configs on a certain number of workers at a time. Roughly: 1. call schedule for 3 machines: a, b, c. 2. once machine a is done, the maintenance window updates to: b, c. 3. if another machine "d" is assigned immediately after a, the window updates to: b, c, d. These updates sometimes happen within a very short interval, and then we see the fatal log quoted in Bayou's mail. What's the right way to update the maintenance window? Thanks for any reply. > check for maintenance on agent causes fatal error > - > > Key: MESOS-7966 > URL: https://issues.apache.org/jira/browse/MESOS-7966 > Project: Mesos > Issue Type: Bug > Components: master >
[jira] [Comment Edited] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171058#comment-16171058 ] adaibee edited comment on MESOS-7966 at 9/19/17 3:17 AM: - {code:none} # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave We had a loop doing maintenance job in three mesos-slaves: 1.mesos-maintenance-schedule 2.machine-down 3.machine-up 4.mesos-maintenance-schedule-cancel But in the loop, we found one of mesos-master crashed and other mesos-master was elected. We found something in mesos.slave.FATAL: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() And in mesos.slave.INFO: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() 2017-09-12 16:39:07.394 err mesos-master[254491]: *** Check failure stack trace: *** 2017-09-12 16:39:07.402 err mesos-master[254491]: @ 0x7f4cf356fba6 google::LogMessage::Fail() 2017-09-12 16:39:07.413 err mesos-master[254491]: @ 0x7f4cf356fb05 google::LogMessage::SendToLog() 2017-09-12 16:39:07.420 err mesos-master[254491]: @ 0x7f4cf356f516 google::LogMessage::Flush() 2017-09-12 16:39:07.424 err mesos-master[254491]: @ 0x7f4cf357224a google::LogMessageFatal::~LogMessageFatal() 2017-09-12 16:39:07.429 err mesos-master[254491]: @ 0x7f4cf2344a32 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::updateInverseOffer() 2017-09-12 16:39:07.435 err mesos-master[254491]: @ 0x7f4cf1f8d9f9 _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_7SlaveIDERKNS1_11FrameworkIDERK6OptionINS1_20UnavailableResourcesEERKSC_INS1_9allocator18InverseOfferStatusEERKSC_INS1_7FiltersEES6_S9_SE_SJ_SN_EEvRKNS_3PIDIT_EEMSR_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES18_ 2017-09-12 16:39:07.445 err mesos-master[254491]: @ 0x7f4cf1f938bb _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_7SlaveIDERKNS5_11FrameworkIDERK6OptionINS5_20UnavailableResourcesEERKSG_INS5_9allocator18InverseOfferStatusEERKSG_INS5_7FiltersEESA_SD_SI_SN_SR_EEvRKNS0_3PIDIT_EEMSV_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ 2017-09-12 16:39:07.455 err mesos-master[254491]: @ 0x7f4cf34dd049 std::function<>::operator()() 2017-09-12 16:39:07.460 err mesos-master[254491]: @ 0x7f4cf34c1285 process::ProcessBase::visit() 2017-09-12 16:39:07.464 err mesos-master[254491]: @ 0x7f4cf34cc58a process::DispatchEvent::visit() 2017-09-12 16:39:07.465 err mesos-master[254491]: @ 0x7f4cf4e4ad4e process::ProcessBase::serve() 2017-09-12 16:39:07.469 err mesos-master[254491]: @ 0x7f4cf34bd281 process::ProcessManager::resume() 2017-09-12 16:39:07.471 err mesos-master[254491]: @ 0x7f4cf34b9a2c _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv 2017-09-12 16:39:07.473 err mesos-master[254491]: @ 0x7f4cf34cbbf2 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE 2017-09-12 16:39:07.475 err mesos-master[254491]: @ 0x7f4cf34cbb36 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv 2017-09-12 16:39:07.477 err mesos-master[254491]: @ 0x7f4cf34cbac0 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv 2017-09-12 16:39:07.478 err mesos-master[254491]: @ 
0x7f4ced3ba1e0 (unknown) 2017-09-12 16:39:07.478 err mesos-master[254491]: @ 0x7f4ced613dc5 start_thread 2017-09-12 16:39:07.479 err mesos-master[254491]: @ 0x7f4cecb21ced __clone 2017-09-12 16:39:07.486 notice systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT This case could be reproduced by calling `for i in {1..8}; do python call.py; done` (call.py gist: https://gist.github.com/athlum/e2cd04bfb9f81a790d31643606252a49 ). Looks like there is something wrong when call /maintenance/schedule concurrently. We met this case because we use wrote a service base on ansible that manage the mesos cluster. When we create a task to update slave configs with a certain number of workers. Just like: 1. call schedule for 3 machine: a,b,c. 2. as machine a was done, maintenance window updates to: b,c 3. as an other machine "d" assigned after a immediately, windows will update to: b,c,d This change sometimes happen in little interval. Then we find the fatal log just in Bayou's mail. What's the right way to update maintanence window? Thanks to any reply. was (Author: adaibee): {code:none} // # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave We had a loop doing maintenance job in three mesos-slaves: 1.mesos-maintenance-schedule 2.machine-down 3.ma
[jira] [Comment Edited] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171058#comment-16171058 ] adaibee edited comment on MESOS-7966 at 9/19/17 3:17 AM: - {code:none} // # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave We had a loop doing maintenance job in three mesos-slaves: 1.mesos-maintenance-schedule 2.machine-down 3.machine-up 4.mesos-maintenance-schedule-cancel But in the loop, we found one of mesos-master crashed and other mesos-master was elected. We found something in mesos.slave.FATAL: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() And in mesos.slave.INFO: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() 2017-09-12 16:39:07.394 err mesos-master[254491]: *** Check failure stack trace: *** 2017-09-12 16:39:07.402 err mesos-master[254491]: @ 0x7f4cf356fba6 google::LogMessage::Fail() 2017-09-12 16:39:07.413 err mesos-master[254491]: @ 0x7f4cf356fb05 google::LogMessage::SendToLog() 2017-09-12 16:39:07.420 err mesos-master[254491]: @ 0x7f4cf356f516 google::LogMessage::Flush() 2017-09-12 16:39:07.424 err mesos-master[254491]: @ 0x7f4cf357224a google::LogMessageFatal::~LogMessageFatal() 2017-09-12 16:39:07.429 err mesos-master[254491]: @ 0x7f4cf2344a32 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::updateInverseOffer() 2017-09-12 16:39:07.435 err mesos-master[254491]: @ 0x7f4cf1f8d9f9 _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_7SlaveIDERKNS1_11FrameworkIDERK6OptionINS1_20UnavailableResourcesEERKSC_INS1_9allocator18InverseOfferStatusEERKSC_INS1_7FiltersEES6_S9_SE_SJ_SN_EEvRKNS_3PIDIT_EEMSR_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES18_ 2017-09-12 16:39:07.445 err mesos-master[254491]: @ 0x7f4cf1f938bb _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_7SlaveIDERKNS5_11FrameworkIDERK6OptionINS5_20UnavailableResourcesEERKSG_INS5_9allocator18InverseOfferStatusEERKSG_INS5_7FiltersEESA_SD_SI_SN_SR_EEvRKNS0_3PIDIT_EEMSV_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ 2017-09-12 16:39:07.455 err mesos-master[254491]: @ 0x7f4cf34dd049 std::function<>::operator()() 2017-09-12 16:39:07.460 err mesos-master[254491]: @ 0x7f4cf34c1285 process::ProcessBase::visit() 2017-09-12 16:39:07.464 err mesos-master[254491]: @ 0x7f4cf34cc58a process::DispatchEvent::visit() 2017-09-12 16:39:07.465 err mesos-master[254491]: @ 0x7f4cf4e4ad4e process::ProcessBase::serve() 2017-09-12 16:39:07.469 err mesos-master[254491]: @ 0x7f4cf34bd281 process::ProcessManager::resume() 2017-09-12 16:39:07.471 err mesos-master[254491]: @ 0x7f4cf34b9a2c _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv 2017-09-12 16:39:07.473 err mesos-master[254491]: @ 0x7f4cf34cbbf2 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE 2017-09-12 16:39:07.475 err mesos-master[254491]: @ 0x7f4cf34cbb36 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv 2017-09-12 16:39:07.477 err mesos-master[254491]: @ 0x7f4cf34cbac0 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv 2017-09-12 16:39:07.478 err mesos-master[254491]: @ 
0x7f4ced3ba1e0 (unknown) 2017-09-12 16:39:07.478 err mesos-master[254491]: @ 0x7f4ced613dc5 start_thread 2017-09-12 16:39:07.479 err mesos-master[254491]: @ 0x7f4cecb21ced __clone 2017-09-12 16:39:07.486 notice systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT This case could be reproduced by calling `for i in {1..8}; do python call.py; done` (call.py gist: https://gist.github.com/athlum/e2cd04bfb9f81a790d31643606252a49 ). Looks like there is something wrong when call /maintenance/schedule concurrently. We met this case because we use wrote a service base on ansible that manage the mesos cluster. When we create a task to update slave configs with a certain number of workers. Just like: 1. call schedule for 3 machine: a,b,c. 2. as machine a was done, maintenance window updates to: b,c 3. as an other machine "d" assigned after a immediately, windows will update to: b,c,d This change sometimes happen in little interval. Then we find the fatal log just in Bayou's mail. What's the right way to update maintanence window? Thanks to any reply. was (Author: adaibee): {code:shell} // # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave We had a loop doing maintenance job in three mesos-slaves: 1.mesos-maintenance-schedule 2.machine-down
[jira] [Comment Edited] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171058#comment-16171058 ] adaibee edited comment on MESOS-7966 at 9/19/17 3:18 AM: - h4. Mesos version: {code:none} # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave We had a loop doing maintenance job in three mesos-slaves: 1.mesos-maintenance-schedule 2.machine-down 3.machine-up 4.mesos-maintenance-schedule-cancel But in the loop, we found one of mesos-master crashed and other mesos-master was elected. We found something in mesos.slave.FATAL: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() And in mesos.slave.INFO: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() 2017-09-12 16:39:07.394 err mesos-master[254491]: *** Check failure stack trace: *** 2017-09-12 16:39:07.402 err mesos-master[254491]: @ 0x7f4cf356fba6 google::LogMessage::Fail() 2017-09-12 16:39:07.413 err mesos-master[254491]: @ 0x7f4cf356fb05 google::LogMessage::SendToLog() 2017-09-12 16:39:07.420 err mesos-master[254491]: @ 0x7f4cf356f516 google::LogMessage::Flush() 2017-09-12 16:39:07.424 err mesos-master[254491]: @ 0x7f4cf357224a google::LogMessageFatal::~LogMessageFatal() 2017-09-12 16:39:07.429 err mesos-master[254491]: @ 0x7f4cf2344a32 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::updateInverseOffer() 2017-09-12 16:39:07.435 err mesos-master[254491]: @ 0x7f4cf1f8d9f9 _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_7SlaveIDERKNS1_11FrameworkIDERK6OptionINS1_20UnavailableResourcesEERKSC_INS1_9allocator18InverseOfferStatusEERKSC_INS1_7FiltersEES6_S9_SE_SJ_SN_EEvRKNS_3PIDIT_EEMSR_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES18_ 2017-09-12 16:39:07.445 err mesos-master[254491]: @ 0x7f4cf1f938bb _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_7SlaveIDERKNS5_11FrameworkIDERK6OptionINS5_20UnavailableResourcesEERKSG_INS5_9allocator18InverseOfferStatusEERKSG_INS5_7FiltersEESA_SD_SI_SN_SR_EEvRKNS0_3PIDIT_EEMSV_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ 2017-09-12 16:39:07.455 err mesos-master[254491]: @ 0x7f4cf34dd049 std::function<>::operator()() 2017-09-12 16:39:07.460 err mesos-master[254491]: @ 0x7f4cf34c1285 process::ProcessBase::visit() 2017-09-12 16:39:07.464 err mesos-master[254491]: @ 0x7f4cf34cc58a process::DispatchEvent::visit() 2017-09-12 16:39:07.465 err mesos-master[254491]: @ 0x7f4cf4e4ad4e process::ProcessBase::serve() 2017-09-12 16:39:07.469 err mesos-master[254491]: @ 0x7f4cf34bd281 process::ProcessManager::resume() 2017-09-12 16:39:07.471 err mesos-master[254491]: @ 0x7f4cf34b9a2c _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv 2017-09-12 16:39:07.473 err mesos-master[254491]: @ 0x7f4cf34cbbf2 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE 2017-09-12 16:39:07.475 err mesos-master[254491]: @ 0x7f4cf34cbb36 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv 2017-09-12 16:39:07.477 err mesos-master[254491]: @ 0x7f4cf34cbac0 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv 2017-09-12 16:39:07.478 err 
mesos-master[254491]: @ 0x7f4ced3ba1e0 (unknown) 2017-09-12 16:39:07.478 err mesos-master[254491]: @ 0x7f4ced613dc5 start_thread 2017-09-12 16:39:07.479 err mesos-master[254491]: @ 0x7f4cecb21ced __clone 2017-09-12 16:39:07.486 notice systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT This case could be reproduced by calling `for i in {1..8}; do python call.py; done` (call.py gist: https://gist.github.com/athlum/e2cd04bfb9f81a790d31643606252a49 ). Looks like there is something wrong when call /maintenance/schedule concurrently. We met this case because we use wrote a service base on ansible that manage the mesos cluster. When we create a task to update slave configs with a certain number of workers. Just like: 1. call schedule for 3 machine: a,b,c. 2. as machine a was done, maintenance window updates to: b,c 3. as an other machine "d" assigned after a immediately, windows will update to: b,c,d This change sometimes happen in little interval. Then we find the fatal log just in Bayou's mail. What's the right way to update maintanence window? Thanks to any reply. was (Author: adaibee): {code:none} # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave We had a loop doing maintenance job in three mesos-slaves: 1.mesos-maintenance-schedule 2.ma
[jira] [Comment Edited] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171058#comment-16171058 ] adaibee edited comment on MESOS-7966 at 9/19/17 3:19 AM: - h4. Mesos version: {code:none} # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} h4. Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave *We had a loop doing maintenance job in three mesos-slaves:* 1.mesos-maintenance-schedule 2.machine-down 3.machine-up 4.mesos-maintenance-schedule-cancel But in the loop, we found one of mesos-master crashed and other mesos-master was elected. We found something in mesos.slave.FATAL: {code:none} 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() And in mesos.slave.INFO: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() 2017-09-12 16:39:07.394 err mesos-master[254491]: *** Check failure stack trace: *** 2017-09-12 16:39:07.402 err mesos-master[254491]: @ 0x7f4cf356fba6 google::LogMessage::Fail() 2017-09-12 16:39:07.413 err mesos-master[254491]: @ 0x7f4cf356fb05 google::LogMessage::SendToLog() 2017-09-12 16:39:07.420 err mesos-master[254491]: @ 0x7f4cf356f516 google::LogMessage::Flush() 2017-09-12 16:39:07.424 err mesos-master[254491]: @ 0x7f4cf357224a google::LogMessageFatal::~LogMessageFatal() 2017-09-12 16:39:07.429 err mesos-master[254491]: @ 0x7f4cf2344a32 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::updateInverseOffer() 2017-09-12 16:39:07.435 err mesos-master[254491]: @ 0x7f4cf1f8d9f9 _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_7SlaveIDERKNS1_11FrameworkIDERK6OptionINS1_20UnavailableResourcesEERKSC_INS1_9allocator18InverseOfferStatusEERKSC_INS1_7FiltersEES6_S9_SE_SJ_SN_EEvRKNS_3PIDIT_EEMSR_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES18_ 2017-09-12 16:39:07.445 err mesos-master[254491]: @ 0x7f4cf1f938bb _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_7SlaveIDERKNS5_11FrameworkIDERK6OptionINS5_20UnavailableResourcesEERKSG_INS5_9allocator18InverseOfferStatusEERKSG_INS5_7FiltersEESA_SD_SI_SN_SR_EEvRKNS0_3PIDIT_EEMSV_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ 2017-09-12 16:39:07.455 err mesos-master[254491]: @ 0x7f4cf34dd049 std::function<>::operator()() 2017-09-12 16:39:07.460 err mesos-master[254491]: @ 0x7f4cf34c1285 process::ProcessBase::visit() 2017-09-12 16:39:07.464 err mesos-master[254491]: @ 0x7f4cf34cc58a process::DispatchEvent::visit() 2017-09-12 16:39:07.465 err mesos-master[254491]: @ 0x7f4cf4e4ad4e process::ProcessBase::serve() 2017-09-12 16:39:07.469 err mesos-master[254491]: @ 0x7f4cf34bd281 process::ProcessManager::resume() 2017-09-12 16:39:07.471 err mesos-master[254491]: @ 0x7f4cf34b9a2c _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv 2017-09-12 16:39:07.473 err mesos-master[254491]: @ 0x7f4cf34cbbf2 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE 2017-09-12 16:39:07.475 err mesos-master[254491]: @ 0x7f4cf34cbb36 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv 2017-09-12 16:39:07.477 err mesos-master[254491]: @ 0x7f4cf34cbac0 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv 2017-09-12 
16:39:07.478 err mesos-master[254491]: @ 0x7f4ced3ba1e0 (unknown) 2017-09-12 16:39:07.478 err mesos-master[254491]: @ 0x7f4ced613dc5 start_thread 2017-09-12 16:39:07.479 err mesos-master[254491]: @ 0x7f4cecb21ced __clone 2017-09-12 16:39:07.486 notice systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT {code} This case could be reproduced by calling `for i in {1..8}; do python call.py; done` (call.py gist: https://gist.github.com/athlum/e2cd04bfb9f81a790d31643606252a49 ). Looks like there is something wrong when call /maintenance/schedule concurrently. We met this case because we use wrote a service base on ansible that manage the mesos cluster. When we create a task to update slave configs with a certain number of workers. Just like: 1. call schedule for 3 machine: a,b,c. 2. as machine a was done, maintenance window updates to: b,c 3. as an other machine "d" assigned after a immediately, windows will update to: b,c,d This change sometimes happen in little interval. Then we find the fatal log just in Bayou's mail. What's the right way to update maintanence window? Thanks to any reply. was (Author: adaibee): h4. Mesos version: {code:none} # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave We had a loop doing maintenance job in three mesos-
[jira] [Comment Edited] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171058#comment-16171058 ] adaibee edited comment on MESOS-7966 at 9/19/17 3:20 AM: - h4. Mesos version: {code:none} # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} h4. Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave *We had a loop doing maintenance job in three mesos-slaves:* 1.mesos-maintenance-schedule 2.machine-down 3.machine-up 4.mesos-maintenance-schedule-cancel But in the loop, we found one of mesos-master crashed and other mesos-master was elected. We found something in mesos.slave.FATAL: {code:none} 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() And in mesos.slave.INFO: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() 2017-09-12 16:39:07.394 err mesos-master[254491]: *** Check failure stack trace: *** 2017-09-12 16:39:07.402 err mesos-master[254491]: @ 0x7f4cf356fba6 google::LogMessage::Fail() 2017-09-12 16:39:07.413 err mesos-master[254491]: @ 0x7f4cf356fb05 google::LogMessage::SendToLog() 2017-09-12 16:39:07.420 err mesos-master[254491]: @ 0x7f4cf356f516 google::LogMessage::Flush() 2017-09-12 16:39:07.424 err mesos-master[254491]: @ 0x7f4cf357224a google::LogMessageFatal::~LogMessageFatal() 2017-09-12 16:39:07.429 err mesos-master[254491]: @ 0x7f4cf2344a32 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::updateInverseOffer() 2017-09-12 16:39:07.435 err mesos-master[254491]: @ 0x7f4cf1f8d9f9 _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_7SlaveIDERKNS1_11FrameworkIDERK6OptionINS1_20UnavailableResourcesEERKSC_INS1_9allocator18InverseOfferStatusEERKSC_INS1_7FiltersEES6_S9_SE_SJ_SN_EEvRKNS_3PIDIT_EEMSR_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES18_ 2017-09-12 16:39:07.445 err mesos-master[254491]: @ 0x7f4cf1f938bb _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_7SlaveIDERKNS5_11FrameworkIDERK6OptionINS5_20UnavailableResourcesEERKSG_INS5_9allocator18InverseOfferStatusEERKSG_INS5_7FiltersEESA_SD_SI_SN_SR_EEvRKNS0_3PIDIT_EEMSV_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ 2017-09-12 16:39:07.455 err mesos-master[254491]: @ 0x7f4cf34dd049 std::function<>::operator()() 2017-09-12 16:39:07.460 err mesos-master[254491]: @ 0x7f4cf34c1285 process::ProcessBase::visit() 2017-09-12 16:39:07.464 err mesos-master[254491]: @ 0x7f4cf34cc58a process::DispatchEvent::visit() 2017-09-12 16:39:07.465 err mesos-master[254491]: @ 0x7f4cf4e4ad4e process::ProcessBase::serve() 2017-09-12 16:39:07.469 err mesos-master[254491]: @ 0x7f4cf34bd281 process::ProcessManager::resume() 2017-09-12 16:39:07.471 err mesos-master[254491]: @ 0x7f4cf34b9a2c _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv 2017-09-12 16:39:07.473 err mesos-master[254491]: @ 0x7f4cf34cbbf2 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE 2017-09-12 16:39:07.475 err mesos-master[254491]: @ 0x7f4cf34cbb36 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv 2017-09-12 16:39:07.477 err mesos-master[254491]: @ 0x7f4cf34cbac0 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv 2017-09-12 
16:39:07.478 err mesos-master[254491]: @ 0x7f4ced3ba1e0 (unknown) 2017-09-12 16:39:07.478 err mesos-master[254491]: @ 0x7f4ced613dc5 start_thread 2017-09-12 16:39:07.479 err mesos-master[254491]: @ 0x7f4cecb21ced __clone 2017-09-12 16:39:07.486 notice systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT {code} This case could be reproduced by calling `for i in {1..8}; do python call.py; done` (call.py gist: https://gist.github.com/athlum/e2cd04bfb9f81a790d31643606252a49 ). Looks like there is something wrong when call /maintenance/schedule concurrently. We met this case because we use wrote a service base on ansible that manage the mesos cluster. When we create a task to update slave configs with a certain number of workers. Just like: 1. call schedule for 3 machine: a,b,c. 2. as machine a was done, maintenance window updates to: b,c 3. as an other machine "d" assigned after a immediately, windows will update to: b,c,d This change sometimes happen in little interval.What's the best practice to update maintanence window? was (Author: adaibee): h4. Mesos version: {code:none} # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} h4. Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave *We had a loop doing maintenance job in three mesos-slaves:* 1.mesos-maintenance-schedule 2.machine-down 3.machine-up
[jira] [Comment Edited] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171058#comment-16171058 ] adaibee edited comment on MESOS-7966 at 9/19/17 3:25 AM: - h4. Mesos version: {code:none} # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} h4. Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave *We had a loop doing maintenance job in three mesos-slaves:* 1.mesos-maintenance-schedule 2.machine-down 3.machine-up 4.mesos-maintenance-schedule-cancel But in the loop, we found one of mesos-master crashed and other mesos-master was elected. We found something in mesos.slave.FATAL: {code:none} 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() And in mesos.slave.INFO: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() 2017-09-12 16:39:07.394 err mesos-master[254491]: *** Check failure stack trace: *** 2017-09-12 16:39:07.402 err mesos-master[254491]: @ 0x7f4cf356fba6 google::LogMessage::Fail() 2017-09-12 16:39:07.413 err mesos-master[254491]: @ 0x7f4cf356fb05 google::LogMessage::SendToLog() 2017-09-12 16:39:07.420 err mesos-master[254491]: @ 0x7f4cf356f516 google::LogMessage::Flush() 2017-09-12 16:39:07.424 err mesos-master[254491]: @ 0x7f4cf357224a google::LogMessageFatal::~LogMessageFatal() 2017-09-12 16:39:07.429 err mesos-master[254491]: @ 0x7f4cf2344a32 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::updateInverseOffer() 2017-09-12 16:39:07.435 err mesos-master[254491]: @ 0x7f4cf1f8d9f9 _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_7SlaveIDERKNS1_11FrameworkIDERK6OptionINS1_20UnavailableResourcesEERKSC_INS1_9allocator18InverseOfferStatusEERKSC_INS1_7FiltersEES6_S9_SE_SJ_SN_EEvRKNS_3PIDIT_EEMSR_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES18_ 2017-09-12 16:39:07.445 err mesos-master[254491]: @ 0x7f4cf1f938bb _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_7SlaveIDERKNS5_11FrameworkIDERK6OptionINS5_20UnavailableResourcesEERKSG_INS5_9allocator18InverseOfferStatusEERKSG_INS5_7FiltersEESA_SD_SI_SN_SR_EEvRKNS0_3PIDIT_EEMSV_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ 2017-09-12 16:39:07.455 err mesos-master[254491]: @ 0x7f4cf34dd049 std::function<>::operator()() 2017-09-12 16:39:07.460 err mesos-master[254491]: @ 0x7f4cf34c1285 process::ProcessBase::visit() 2017-09-12 16:39:07.464 err mesos-master[254491]: @ 0x7f4cf34cc58a process::DispatchEvent::visit() 2017-09-12 16:39:07.465 err mesos-master[254491]: @ 0x7f4cf4e4ad4e process::ProcessBase::serve() 2017-09-12 16:39:07.469 err mesos-master[254491]: @ 0x7f4cf34bd281 process::ProcessManager::resume() 2017-09-12 16:39:07.471 err mesos-master[254491]: @ 0x7f4cf34b9a2c _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv 2017-09-12 16:39:07.473 err mesos-master[254491]: @ 0x7f4cf34cbbf2 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE 2017-09-12 16:39:07.475 err mesos-master[254491]: @ 0x7f4cf34cbb36 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv 2017-09-12 16:39:07.477 err mesos-master[254491]: @ 0x7f4cf34cbac0 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv 2017-09-12 
16:39:07.478 err mesos-master[254491]: @ 0x7f4ced3ba1e0 (unknown) 2017-09-12 16:39:07.478 err mesos-master[254491]: @ 0x7f4ced613dc5 start_thread 2017-09-12 16:39:07.479 err mesos-master[254491]: @ 0x7f4cecb21ced __clone 2017-09-12 16:39:07.486 notice systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT {code} This case could be reproduced by calling {code:none} for i in {1..8}; do python call.py; done {code}(_call.py _gist: https://gist.github.com/athlum/e2cd04bfb9f81a790d31643606252a49 ). Looks like there is something wrong when call /maintenance/schedule concurrently. We met this case because we use wrote a service base on ansible that manage the mesos cluster. When we create a task to update slave configs with a certain number of workers. Just like: 1. call schedule for 3 machine: a,b,c. 2. as machine a was done, maintenance window updates to: b,c 3. as an other machine "d" assigned after a immediately, windows will update to: b,c,d This change sometimes happen in little interval.What's the best practice to update maintanence window? was (Author: adaibee): h4. Mesos version: {code:none} # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} h4. Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave *We had a loop doing maintenance job in three mesos-slaves:* 1.mesos-maintenance-schedule 2.machine-
[jira] [Comment Edited] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171058#comment-16171058 ] adaibee edited comment on MESOS-7966 at 9/19/17 3:26 AM: - h4. Mesos version: {code:none} # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} h4. Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave *We had a loop doing maintenance job in three mesos-slaves:* 1.mesos-maintenance-schedule 2.machine-down 3.machine-up 4.mesos-maintenance-schedule-cancel But in the loop, we found one of mesos-master crashed and other mesos-master was elected. We found something in mesos.slave.FATAL: {code:none} 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() And in mesos.slave.INFO: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() 2017-09-12 16:39:07.394 err mesos-master[254491]: *** Check failure stack trace: *** 2017-09-12 16:39:07.402 err mesos-master[254491]: @ 0x7f4cf356fba6 google::LogMessage::Fail() 2017-09-12 16:39:07.413 err mesos-master[254491]: @ 0x7f4cf356fb05 google::LogMessage::SendToLog() 2017-09-12 16:39:07.420 err mesos-master[254491]: @ 0x7f4cf356f516 google::LogMessage::Flush() 2017-09-12 16:39:07.424 err mesos-master[254491]: @ 0x7f4cf357224a google::LogMessageFatal::~LogMessageFatal() 2017-09-12 16:39:07.429 err mesos-master[254491]: @ 0x7f4cf2344a32 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::updateInverseOffer() 2017-09-12 16:39:07.435 err mesos-master[254491]: @ 0x7f4cf1f8d9f9 _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_7SlaveIDERKNS1_11FrameworkIDERK6OptionINS1_20UnavailableResourcesEERKSC_INS1_9allocator18InverseOfferStatusEERKSC_INS1_7FiltersEES6_S9_SE_SJ_SN_EEvRKNS_3PIDIT_EEMSR_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES18_ 2017-09-12 16:39:07.445 err mesos-master[254491]: @ 0x7f4cf1f938bb _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_7SlaveIDERKNS5_11FrameworkIDERK6OptionINS5_20UnavailableResourcesEERKSG_INS5_9allocator18InverseOfferStatusEERKSG_INS5_7FiltersEESA_SD_SI_SN_SR_EEvRKNS0_3PIDIT_EEMSV_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ 2017-09-12 16:39:07.455 err mesos-master[254491]: @ 0x7f4cf34dd049 std::function<>::operator()() 2017-09-12 16:39:07.460 err mesos-master[254491]: @ 0x7f4cf34c1285 process::ProcessBase::visit() 2017-09-12 16:39:07.464 err mesos-master[254491]: @ 0x7f4cf34cc58a process::DispatchEvent::visit() 2017-09-12 16:39:07.465 err mesos-master[254491]: @ 0x7f4cf4e4ad4e process::ProcessBase::serve() 2017-09-12 16:39:07.469 err mesos-master[254491]: @ 0x7f4cf34bd281 process::ProcessManager::resume() 2017-09-12 16:39:07.471 err mesos-master[254491]: @ 0x7f4cf34b9a2c _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv 2017-09-12 16:39:07.473 err mesos-master[254491]: @ 0x7f4cf34cbbf2 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE 2017-09-12 16:39:07.475 err mesos-master[254491]: @ 0x7f4cf34cbb36 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv 2017-09-12 16:39:07.477 err mesos-master[254491]: @ 0x7f4cf34cbac0 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv 2017-09-12 
16:39:07.478 err mesos-master[254491]: @ 0x7f4ced3ba1e0 (unknown) 2017-09-12 16:39:07.478 err mesos-master[254491]: @ 0x7f4ced613dc5 start_thread 2017-09-12 16:39:07.479 err mesos-master[254491]: @ 0x7f4cecb21ced __clone 2017-09-12 16:39:07.486 notice systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT {code} This case can be reproduced by running {code:none} for i in {1..8}; do python call.py; done {code}(call.py gist: https://gist.github.com/athlum/e2cd04bfb9f81a790d31643606252a49 ). It looks like something goes wrong when /maintenance/schedule is called concurrently. We hit this case because we wrote an Ansible-based service that manages the Mesos cluster and creates tasks that update slave configs on a certain number of workers at a time. Roughly: 1. call schedule for 3 machines: a, b, c. 2. once machine a is done, the maintenance window updates to: b, c. 3. if another machine "d" is assigned immediately after a, the window updates to: b, c, d. These updates sometimes happen within a very short interval. What's the best practice for updating the maintenance window? was (Author: adaibee): h4. Mesos version: {code:none} # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} h4. Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave *We had a loop doing maintenance job in three mesos-slaves:* 1.mesos-maintenance-schedule 2.machine-do
[jira] [Commented] (MESOS-7963) Task groups can lose the container limitation status.
[ https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171182#comment-16171182 ] Qian Zhang commented on MESOS-7963: --- [~jpe...@apache.org] In your second example, a task group with two tasks was launched; when the disk/du isolator raised a limitation for the root container, the Mesos containerizer tried to destroy the root container, but before that it had to destroy the two nested containers first. So when the first nested container was destroyed, the default executor knew about it (since it was still alive at that moment) and sent a {{TASK_FAILED}} for the first task with source {{SOURCE_EXECUTOR}}. For the second task, I think the default executor itself was destroyed by the Mesos containerizer before it got a chance to send a status update, which is why we see a {{TASK_FAILED}} for the second task with source {{SOURCE_AGENT}}. In your first example, the task group has only one task, so I think it follows the same flow as the first task in your second example, i.e., the default executor sent a {{TASK_FAILED}} for the task, and then the default executor itself was destroyed (or maybe self-terminated). Currently both the cgroups isolator (memory subsystem) and the disk/du isolator raise the limitation for the root container rather than the nested container. I think we may need to change them to raise the limitation for the nested container, and enhance the implementation of waitNestedContainer() so that it propagates the reason and message of the container termination to the default executor; the default executor can then send a status update with that reason and message for the nested container to the scheduler. > Task groups can lose the container limitation status. > - > > Key: MESOS-7963 > URL: https://issues.apache.org/jira/browse/MESOS-7963 > Project: Mesos > Issue Type: Bug > Components: containerization, executor >Reporter: James Peach > > If you run a single task in a task group and that task fails with a container > limitation, that status update can be lost and only the executor failure will > be reported to the framework.
> {noformat} > exec /opt/mesos/bin/mesos-execute --content_type=json > --master=jpeach.apple.com:5050 '--task_group={ > "tasks": > [ > { > "name": "7f141aca-55fe-4bb0-af4b-87f5ee26986a", > "task_id": {"value" : "2866368d-7279-4657-b8eb-bf1d968e8ebf"}, > "agent_id": {"value" : ""}, > "resources": [{ > "name": "cpus", > "type": "SCALAR", > "scalar": { > "value": 0.2 > } > }, { > "name": "mem", > "type": "SCALAR", > "scalar": { > "value": 32 > } > }, { > "name": "disk", > "type": "SCALAR", > "scalar": { > "value": 2 > } > } > ], > "command": { > "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M > count=64 ; sleep 1" > } > } > ] > }' > I0911 11:48:01.480689 7340 scheduler.cpp:184] Version: 1.5.0 > I0911 11:48:01.488868 7339 scheduler.cpp:470] New master detected at > master@17.228.224.108:5050 > Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 > Submitted task group with tasks [ 2866368d-7279-4657-b8eb-bf1d968e8ebf ] to > agent 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0' > Received status update TASK_RUNNING for task > '2866368d-7279-4657-b8eb-bf1d968e8ebf' > source: SOURCE_EXECUTOR > Received status update TASK_FAILED for task > '2866368d-7279-4657-b8eb-bf1d968e8ebf' > message: 'Command terminated with signal Killed' > source: SOURCE_EXECUTOR > {noformat} > However, the agent logs show that this failed with a memory limitation: > {noformat} > I0911 11:48:02.235818 7012 http.cpp:532] Processing call > WAIT_NESTED_CONTAINER > I0911 11:48:02.236395 7013 status_update_manager.cpp:323] Received status > update TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task > 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework > aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 > I0911 11:48:02.237083 7016 slave.cpp:4875] Forwarding the update > TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task > 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework > aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 to master@17.228.224.108:5050 > I0911 11:48:02.283661 7007 status_update_manager.cpp:395] Received status > update acknowledgement (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task > 2866368d-7279-4657-b8eb-bf1d968e8eb