[jira] [Created] (MESOS-7986) ExecutorHttpApiTest.ValidJsonButInvalidProtobuf fails in parallel test execution
Benjamin Bannier created MESOS-7986: --- Summary: ExecutorHttpApiTest.ValidJsonButInvalidProtobuf fails in parallel test execution Key: MESOS-7986 URL: https://issues.apache.org/jira/browse/MESOS-7986 Project: Mesos Issue Type: Bug Components: test Affects Versions: 1.5.0 Reporter: Benjamin Bannier When running the cmake-built Mesos tests in parallel, I reliably encounter a failing {{ExecutorHttpApiTest.ValidJsonButInvalidProtobuf}}, {noformat} $ ../support/mesos-gtest-runner.py ./src/mesos-tests -j10 [ RUN ] ExecutorHttpApiTest.ValidJsonButInvalidProtobuf ../src/tests/executor_http_api_tests.cpp:197: Failure Value of: (response).get().status Actual: "401 Unauthorized" Expected: BadRequest().status Which is: "400 Bad Request" [ FAILED ] ExecutorHttpApiTest.ValidJsonButInvalidProtobuf (17 ms) {noformat}. The machine has 16 physical cores. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7986) ExecutorHttpApiTest.ValidJsonButInvalidProtobuf fails in parallel test execution
[ https://issues.apache.org/jira/browse/MESOS-7986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-7986: Attachment: test.log > ExecutorHttpApiTest.ValidJsonButInvalidProtobuf fails in parallel test > execution > > > Key: MESOS-7986 > URL: https://issues.apache.org/jira/browse/MESOS-7986 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.5.0 >Reporter: Benjamin Bannier > Labels: mesosphere > Attachments: test.log > > > When running the cmake-built Mesos tests in parallel, I reliably encounter a > failing {{ExecutorHttpApiTest.ValidJsonButInvalidProtobuf}}, > {noformat} > $ ../support/mesos-gtest-runner.py ./src/mesos-tests -j10 > [ RUN ] ExecutorHttpApiTest.ValidJsonButInvalidProtobuf > ../src/tests/executor_http_api_tests.cpp:197: Failure > Value of: (response).get().status > Actual: "401 Unauthorized" > Expected: BadRequest().status > Which is: "400 Bad Request" > [ FAILED ] ExecutorHttpApiTest.ValidJsonButInvalidProtobuf (17 ms) > {noformat}. > The machine has 16 physical cores. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (MESOS-4812) Mesos fails to escape command health checks
[ https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrei Budnik reassigned MESOS-4812: Assignee: Andrei Budnik (was: haosdent) Reworked Haosdent's patch: https://reviews.apache.org/r/62381/ > Mesos fails to escape command health checks > --- > > Key: MESOS-4812 > URL: https://issues.apache.org/jira/browse/MESOS-4812 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.25.0 >Reporter: Lukas Loesche >Assignee: Andrei Budnik > Labels: health-check, mesosphere, tech-debt > Attachments: health_task.gif > > > As described in https://github.com/mesosphere/marathon/issues/ > I would like to run a command health check > {noformat} > /bin/bash -c " {noformat} > The health check fails because Mesos, while running the command inside double > quotes of a sh -c "", doesn't escape the double quotes in the command. > If I escape the double quotes myself, the command health check succeeds. But > this would mean that the user needs intimate knowledge of how Mesos executes > his commands, which can't be right. > I was told this is not a Marathon but a Mesos issue, so I am opening this JIRA. > I don't know if this only affects the command health check. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
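For illustration, a small sketch of the quoting problem described above; this is not the Mesos code path, just a hypothetical example of wrapping a user command in /bin/sh -c "..." with and without escaping embedded double quotes (the escape helper below is made up for the example):

{code:cpp}
// Hypothetical illustration: a health-check command that itself contains
// double quotes gets truncated when embedded verbatim inside sh -c "...",
// but survives if the quotes (and other special characters) are escaped.
#include <iostream>
#include <string>

// Naive escaper for text placed inside double quotes (illustrative only).
std::string escapeForDoubleQuotes(const std::string& command)
{
  std::string escaped;
  for (char c : command) {
    if (c == '"' || c == '\\' || c == '$' || c == '`') {
      escaped += '\\';
    }
    escaped += c;
  }
  return escaped;
}

int main()
{
  const std::string command =
    "curl -f -H \"Host: example.com\" http://127.0.0.1/health";

  // Broken: the embedded quotes terminate the -c argument early.
  std::cout << "/bin/sh -c \"" << command << "\"" << std::endl;

  // Working: quotes inside the command are escaped before wrapping.
  std::cout << "/bin/sh -c \"" << escapeForDoubleQuotes(command) << "\""
            << std::endl;

  return 0;
}
{code}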
[jira] [Created] (MESOS-7987) Initialize Google Mock rather than Google Test.
James Peach created MESOS-7987: -- Summary: Initialize Google Mock rather than Google Test. Key: MESOS-7987 URL: https://issues.apache.org/jira/browse/MESOS-7987 Project: Mesos Issue Type: Improvement Components: test Reporter: James Peach Assignee: James Peach We should initialize Google Mock rather than Google Test. The Google Mock initializer also calls the Google Test initializer, so it is functionally a superset. If we do this, the {{\-\-gmock_verbose}} option works. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
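For reference, a minimal test main() illustrating the change; this is a generic sketch of the Google Mock API, not Mesos' actual test main (which also registers its own test environments and flags):

{code:cpp}
// Minimal sketch: initialize Google Mock instead of Google Test.
// InitGoogleMock() also performs Google Test initialization, so all gtest
// flags keep working and the --gmock_verbose flag becomes available too.
#include <gmock/gmock.h>

int main(int argc, char** argv)
{
  // Before: ::testing::InitGoogleTest(&argc, argv);
  ::testing::InitGoogleMock(&argc, argv);
  return RUN_ALL_TESTS();
}
{code}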
[jira] [Created] (MESOS-7988) Mesos attempts to open handle for the system idle process
Andrew Schwartzmeyer created MESOS-7988: --- Summary: Mesos attempts to open handle for the system idle process Key: MESOS-7988 URL: https://issues.apache.org/jira/browse/MESOS-7988 Project: Mesos Issue Type: Bug Components: stout Environment: Windows 10 Reporter: Andrew Schwartzmeyer Assignee: Andrew Schwartzmeyer While running {{mesos-tests}} under Application Verifier, I found that we were inadvertently attempting to get a handle for the System Idle Process. This is not permitted by the OS, and so the {{OpenProcess}} system call was failing. I further found that we were incorrectly checking the failure condition of {{OpenProcess}}. We were attempting to open this handle when opening handles for all PIDs returned by {{os::pids}}, and the Windows API {{EnumProcesses}} includes PID 0 (System Idle Process). As this PID is not useful, we can safely remove it from the {{os::pids}} API. Attempting to do _anything_ with PID 0 will likely result in failure, as it is a special process on Windows, and so we can help to prevent these errors by filtering out PID 0. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
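A rough sketch of the fix being described, using plain Win32 calls rather than the actual stout code (the function names below are illustrative, not the real {{os::pids}} implementation):

{code:cpp}
// Illustrative sketch only. Shows why PID 0 (the System Idle Process) should
// be filtered out of the PID enumeration, and how OpenProcess() failure is
// detected: it returns NULL on failure, not INVALID_HANDLE_VALUE.
#include <windows.h>
#include <psapi.h>

#include <vector>

std::vector<DWORD> enumeratePids()
{
  DWORD buffer[4096];
  DWORD bytes = 0;
  std::vector<DWORD> result;

  if (EnumProcesses(buffer, sizeof(buffer), &bytes)) {
    for (DWORD i = 0; i < bytes / sizeof(DWORD); i++) {
      // PID 0 is the System Idle Process; opening a handle to it always
      // fails, so drop it from the result up front.
      if (buffer[i] != 0) {
        result.push_back(buffer[i]);
      }
    }
  }

  return result;
}

int main()
{
  for (DWORD pid : enumeratePids()) {
    HANDLE handle =
      OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION, FALSE, pid);

    if (handle == nullptr) {
      // NULL (not INVALID_HANDLE_VALUE) signals failure, e.g. access denied.
      continue;
    }

    // ... inspect the process here ...

    CloseHandle(handle);
  }

  return 0;
}
{code}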
[jira] [Commented] (MESOS-7963) Task groups can lose the container limitation status.
[ https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16170890#comment-16170890 ] James Peach commented on MESOS-7963: /cc [~jieyu] This covers the executor container limitation we discussed on the Slack channel. > Task groups can lose the container limitation status. > - > > Key: MESOS-7963 > URL: https://issues.apache.org/jira/browse/MESOS-7963 > Project: Mesos > Issue Type: Bug > Components: containerization, executor >Reporter: James Peach > > If you run a single task in a task group and that task fails with a container > limitation, that status update can be lost and only the executor failure will > be reported to the framework. > {noformat} > exec /opt/mesos/bin/mesos-execute --content_type=json > --master=jpeach.apple.com:5050 '--task_group={ > "tasks": > [ > { > "name": "7f141aca-55fe-4bb0-af4b-87f5ee26986a", > "task_id": {"value" : "2866368d-7279-4657-b8eb-bf1d968e8ebf"}, > "agent_id": {"value" : ""}, > "resources": [{ > "name": "cpus", > "type": "SCALAR", > "scalar": { > "value": 0.2 > } > }, { > "name": "mem", > "type": "SCALAR", > "scalar": { > "value": 32 > } > }, { > "name": "disk", > "type": "SCALAR", > "scalar": { > "value": 2 > } > } > ], > "command": { > "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M > count=64 ; sleep 1" > } > } > ] > }' > I0911 11:48:01.480689 7340 scheduler.cpp:184] Version: 1.5.0 > I0911 11:48:01.488868 7339 scheduler.cpp:470] New master detected at > master@17.228.224.108:5050 > Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 > Submitted task group with tasks [ 2866368d-7279-4657-b8eb-bf1d968e8ebf ] to > agent 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0' > Received status update TASK_RUNNING for task > '2866368d-7279-4657-b8eb-bf1d968e8ebf' > source: SOURCE_EXECUTOR > Received status update TASK_FAILED for task > '2866368d-7279-4657-b8eb-bf1d968e8ebf' > message: 'Command terminated with signal Killed' > source: SOURCE_EXECUTOR > {noformat} > However, the agent logs show that this failed with a memory limitation: > {noformat} > I0911 11:48:02.235818 7012 http.cpp:532] Processing call > WAIT_NESTED_CONTAINER > I0911 11:48:02.236395 7013 status_update_manager.cpp:323] Received status > update TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task > 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework > aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 > I0911 11:48:02.237083 7016 slave.cpp:4875] Forwarding the update > TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task > 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework > aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 to master@17.228.224.108:5050 > I0911 11:48:02.283661 7007 status_update_manager.cpp:395] Received status > update acknowledgement (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task > 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework > aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 > I0911 11:48:04.771455 7014 memory.cpp:516] OOM detected for container > 474388fe-43c3-4372-b903-eaca22740996 > I0911 11:48:04.776445 7014 memory.cpp:556] Memory limit exceeded: Requested: > 64MB Maximum Used: 64MB > ... 
> I0911 11:48:04.776943 7012 containerizer.cpp:2681] Container > 474388fe-43c3-4372-b903-eaca22740996 has reached its limit for resource > [{"name":"mem","scalar":{"value":64.0},"type":"SCALAR"}] and will be > terminated > {noformat} > The following {{mesos-execute}} task will show the container limitation > correctly: > {noformat} > exec /opt/mesos/bin/mesos-execute --content_type=json > --master=jpeach.apple.com:5050 '--task_group={ > "tasks": > [ > { > "name": "37db08f6-4f0f-4ef6-97ee-b10a5c5cc211", > "task_id": {"value" : "1372b2e2-c501-4e80-bcbd-1a5c5194e206"}, > "agent_id": {"value" : ""}, > "resources": [{ > "name": "cpus", > "type": "SCALAR", > "scalar": { > "value": 0.2 > } > }, > { > "name": "mem", > "type": "SCALAR", > "scalar": { > "value": 32 > } > }], > "command": { > "value": "sleep 600" > } > }, { > "name": "7247643c-5e4d-4b01-9839-e38db49f7f
[jira] [Commented] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171058#comment-16171058 ] adaibee commented on MESOS-7966: {code:shell} // # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave We had a loop doing maintenance job in three mesos-slaves: 1.mesos-maintenance-schedule 2.machine-down 3.machine-up 4.mesos-maintenance-schedule-cancel But in the loop, we found one of mesos-master crashed and other mesos-master was elected. We found something in mesos.slave.FATAL: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() And in mesos.slave.INFO: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() 2017-09-12 16:39:07.394 err mesos-master[254491]: *** Check failure stack trace: *** 2017-09-12 16:39:07.402 err mesos-master[254491]: @ 0x7f4cf356fba6 google::LogMessage::Fail() 2017-09-12 16:39:07.413 err mesos-master[254491]: @ 0x7f4cf356fb05 google::LogMessage::SendToLog() 2017-09-12 16:39:07.420 err mesos-master[254491]: @ 0x7f4cf356f516 google::LogMessage::Flush() 2017-09-12 16:39:07.424 err mesos-master[254491]: @ 0x7f4cf357224a google::LogMessageFatal::~LogMessageFatal() 2017-09-12 16:39:07.429 err mesos-master[254491]: @ 0x7f4cf2344a32 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::updateInverseOffer() 2017-09-12 16:39:07.435 err mesos-master[254491]: @ 0x7f4cf1f8d9f9 _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_7SlaveIDERKNS1_11FrameworkIDERK6OptionINS1_20UnavailableResourcesEERKSC_INS1_9allocator18InverseOfferStatusEERKSC_INS1_7FiltersEES6_S9_SE_SJ_SN_EEvRKNS_3PIDIT_EEMSR_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES18_ 2017-09-12 16:39:07.445 err mesos-master[254491]: @ 0x7f4cf1f938bb _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_7SlaveIDERKNS5_11FrameworkIDERK6OptionINS5_20UnavailableResourcesEERKSG_INS5_9allocator18InverseOfferStatusEERKSG_INS5_7FiltersEESA_SD_SI_SN_SR_EEvRKNS0_3PIDIT_EEMSV_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ 2017-09-12 16:39:07.455 err mesos-master[254491]: @ 0x7f4cf34dd049 std::function<>::operator()() 2017-09-12 16:39:07.460 err mesos-master[254491]: @ 0x7f4cf34c1285 process::ProcessBase::visit() 2017-09-12 16:39:07.464 err mesos-master[254491]: @ 0x7f4cf34cc58a process::DispatchEvent::visit() 2017-09-12 16:39:07.465 err mesos-master[254491]: @ 0x7f4cf4e4ad4e process::ProcessBase::serve() 2017-09-12 16:39:07.469 err mesos-master[254491]: @ 0x7f4cf34bd281 process::ProcessManager::resume() 2017-09-12 16:39:07.471 err mesos-master[254491]: @ 0x7f4cf34b9a2c _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv 2017-09-12 16:39:07.473 err mesos-master[254491]: @ 0x7f4cf34cbbf2 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE 2017-09-12 16:39:07.475 err mesos-master[254491]: @ 0x7f4cf34cbb36 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv 2017-09-12 16:39:07.477 err mesos-master[254491]: @ 0x7f4cf34cbac0 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv 2017-09-12 16:39:07.478 err mesos-master[254491]: @ 0x7f4ced3ba1e0 (unknown) 
2017-09-12 16:39:07.478 err mesos-master[254491]: @ 0x7f4ced613dc5 start_thread 2017-09-12 16:39:07.479 err mesos-master[254491]: @ 0x7f4cecb21ced __clone 2017-09-12 16:39:07.486 notice systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT This case can be reproduced by running `for i in {1..8}; do python call.py; done` (call.py gist: https://gist.github.com/athlum/e2cd04bfb9f81a790d31643606252a49 ). It looks like something goes wrong when /maintenance/schedule is called concurrently. We hit this case because we wrote an Ansible-based service that manages the Mesos cluster and creates tasks that update slave configs on a certain number of workers at a time. Roughly: 1. call schedule for 3 machines: a, b, c. 2. once machine a is done, the maintenance window updates to: b, c. 3. if another machine "d" is assigned immediately after a, the window updates to: b, c, d. These updates sometimes happen within a very short interval, and then we see the fatal log quoted in Bayou's mail. What's the right way to update the maintenance window? Thanks for any reply. > check for maintenance on agent causes fatal error > - > > Key: MESOS-7966 > URL: https://issues.apache.org/jira/browse/MESOS-7966 > Project: Mesos > Issue Type: Bug > Components: master >
[jira] [Comment Edited] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171058#comment-16171058 ] adaibee edited comment on MESOS-7966 at 9/19/17 3:17 AM: - {code:none} # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave We had a loop doing maintenance job in three mesos-slaves: 1.mesos-maintenance-schedule 2.machine-down 3.machine-up 4.mesos-maintenance-schedule-cancel But in the loop, we found one of mesos-master crashed and other mesos-master was elected. We found something in mesos.slave.FATAL: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() And in mesos.slave.INFO: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() 2017-09-12 16:39:07.394 err mesos-master[254491]: *** Check failure stack trace: *** 2017-09-12 16:39:07.402 err mesos-master[254491]: @ 0x7f4cf356fba6 google::LogMessage::Fail() 2017-09-12 16:39:07.413 err mesos-master[254491]: @ 0x7f4cf356fb05 google::LogMessage::SendToLog() 2017-09-12 16:39:07.420 err mesos-master[254491]: @ 0x7f4cf356f516 google::LogMessage::Flush() 2017-09-12 16:39:07.424 err mesos-master[254491]: @ 0x7f4cf357224a google::LogMessageFatal::~LogMessageFatal() 2017-09-12 16:39:07.429 err mesos-master[254491]: @ 0x7f4cf2344a32 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::updateInverseOffer() 2017-09-12 16:39:07.435 err mesos-master[254491]: @ 0x7f4cf1f8d9f9 _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_7SlaveIDERKNS1_11FrameworkIDERK6OptionINS1_20UnavailableResourcesEERKSC_INS1_9allocator18InverseOfferStatusEERKSC_INS1_7FiltersEES6_S9_SE_SJ_SN_EEvRKNS_3PIDIT_EEMSR_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES18_ 2017-09-12 16:39:07.445 err mesos-master[254491]: @ 0x7f4cf1f938bb _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_7SlaveIDERKNS5_11FrameworkIDERK6OptionINS5_20UnavailableResourcesEERKSG_INS5_9allocator18InverseOfferStatusEERKSG_INS5_7FiltersEESA_SD_SI_SN_SR_EEvRKNS0_3PIDIT_EEMSV_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ 2017-09-12 16:39:07.455 err mesos-master[254491]: @ 0x7f4cf34dd049 std::function<>::operator()() 2017-09-12 16:39:07.460 err mesos-master[254491]: @ 0x7f4cf34c1285 process::ProcessBase::visit() 2017-09-12 16:39:07.464 err mesos-master[254491]: @ 0x7f4cf34cc58a process::DispatchEvent::visit() 2017-09-12 16:39:07.465 err mesos-master[254491]: @ 0x7f4cf4e4ad4e process::ProcessBase::serve() 2017-09-12 16:39:07.469 err mesos-master[254491]: @ 0x7f4cf34bd281 process::ProcessManager::resume() 2017-09-12 16:39:07.471 err mesos-master[254491]: @ 0x7f4cf34b9a2c _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv 2017-09-12 16:39:07.473 err mesos-master[254491]: @ 0x7f4cf34cbbf2 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE 2017-09-12 16:39:07.475 err mesos-master[254491]: @ 0x7f4cf34cbb36 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv 2017-09-12 16:39:07.477 err mesos-master[254491]: @ 0x7f4cf34cbac0 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv 2017-09-12 16:39:07.478 err mesos-master[254491]: @ 
0x7f4ced3ba1e0 (unknown) 2017-09-12 16:39:07.478 err mesos-master[254491]: @ 0x7f4ced613dc5 start_thread 2017-09-12 16:39:07.479 err mesos-master[254491]: @ 0x7f4cecb21ced __clone 2017-09-12 16:39:07.486 notice systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT This case could be reproduced by calling `for i in {1..8}; do python call.py; done` (call.py gist: https://gist.github.com/athlum/e2cd04bfb9f81a790d31643606252a49 ). Looks like there is something wrong when call /maintenance/schedule concurrently. We met this case because we use wrote a service base on ansible that manage the mesos cluster. When we create a task to update slave configs with a certain number of workers. Just like: 1. call schedule for 3 machine: a,b,c. 2. as machine a was done, maintenance window updates to: b,c 3. as an other machine "d" assigned after a immediately, windows will update to: b,c,d This change sometimes happen in little interval. Then we find the fatal log just in Bayou's mail. What's the right way to update maintanence window? Thanks to any reply. was (Author: adaibee): {code:none} // # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave We had a loop doing maintenance job in three mesos-slaves: 1.mesos-maintenance-schedule 2.machine-down 3.ma
[jira] [Comment Edited] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171058#comment-16171058 ] adaibee edited comment on MESOS-7966 at 9/19/17 3:17 AM: - {code:none} // # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave We had a loop doing maintenance job in three mesos-slaves: 1.mesos-maintenance-schedule 2.machine-down 3.machine-up 4.mesos-maintenance-schedule-cancel But in the loop, we found one of mesos-master crashed and other mesos-master was elected. We found something in mesos.slave.FATAL: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() And in mesos.slave.INFO: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() 2017-09-12 16:39:07.394 err mesos-master[254491]: *** Check failure stack trace: *** 2017-09-12 16:39:07.402 err mesos-master[254491]: @ 0x7f4cf356fba6 google::LogMessage::Fail() 2017-09-12 16:39:07.413 err mesos-master[254491]: @ 0x7f4cf356fb05 google::LogMessage::SendToLog() 2017-09-12 16:39:07.420 err mesos-master[254491]: @ 0x7f4cf356f516 google::LogMessage::Flush() 2017-09-12 16:39:07.424 err mesos-master[254491]: @ 0x7f4cf357224a google::LogMessageFatal::~LogMessageFatal() 2017-09-12 16:39:07.429 err mesos-master[254491]: @ 0x7f4cf2344a32 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::updateInverseOffer() 2017-09-12 16:39:07.435 err mesos-master[254491]: @ 0x7f4cf1f8d9f9 _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_7SlaveIDERKNS1_11FrameworkIDERK6OptionINS1_20UnavailableResourcesEERKSC_INS1_9allocator18InverseOfferStatusEERKSC_INS1_7FiltersEES6_S9_SE_SJ_SN_EEvRKNS_3PIDIT_EEMSR_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES18_ 2017-09-12 16:39:07.445 err mesos-master[254491]: @ 0x7f4cf1f938bb _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_7SlaveIDERKNS5_11FrameworkIDERK6OptionINS5_20UnavailableResourcesEERKSG_INS5_9allocator18InverseOfferStatusEERKSG_INS5_7FiltersEESA_SD_SI_SN_SR_EEvRKNS0_3PIDIT_EEMSV_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ 2017-09-12 16:39:07.455 err mesos-master[254491]: @ 0x7f4cf34dd049 std::function<>::operator()() 2017-09-12 16:39:07.460 err mesos-master[254491]: @ 0x7f4cf34c1285 process::ProcessBase::visit() 2017-09-12 16:39:07.464 err mesos-master[254491]: @ 0x7f4cf34cc58a process::DispatchEvent::visit() 2017-09-12 16:39:07.465 err mesos-master[254491]: @ 0x7f4cf4e4ad4e process::ProcessBase::serve() 2017-09-12 16:39:07.469 err mesos-master[254491]: @ 0x7f4cf34bd281 process::ProcessManager::resume() 2017-09-12 16:39:07.471 err mesos-master[254491]: @ 0x7f4cf34b9a2c _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv 2017-09-12 16:39:07.473 err mesos-master[254491]: @ 0x7f4cf34cbbf2 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE 2017-09-12 16:39:07.475 err mesos-master[254491]: @ 0x7f4cf34cbb36 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv 2017-09-12 16:39:07.477 err mesos-master[254491]: @ 0x7f4cf34cbac0 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv 2017-09-12 16:39:07.478 err mesos-master[254491]: @ 
0x7f4ced3ba1e0 (unknown) 2017-09-12 16:39:07.478 err mesos-master[254491]: @ 0x7f4ced613dc5 start_thread 2017-09-12 16:39:07.479 err mesos-master[254491]: @ 0x7f4cecb21ced __clone 2017-09-12 16:39:07.486 notice systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT This case could be reproduced by calling `for i in {1..8}; do python call.py; done` (call.py gist: https://gist.github.com/athlum/e2cd04bfb9f81a790d31643606252a49 ). Looks like there is something wrong when call /maintenance/schedule concurrently. We met this case because we use wrote a service base on ansible that manage the mesos cluster. When we create a task to update slave configs with a certain number of workers. Just like: 1. call schedule for 3 machine: a,b,c. 2. as machine a was done, maintenance window updates to: b,c 3. as an other machine "d" assigned after a immediately, windows will update to: b,c,d This change sometimes happen in little interval. Then we find the fatal log just in Bayou's mail. What's the right way to update maintanence window? Thanks to any reply. was (Author: adaibee): {code:shell} // # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave We had a loop doing maintenance job in three mesos-slaves: 1.mesos-maintenance-schedule 2.machine-down
[jira] [Comment Edited] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171058#comment-16171058 ] adaibee edited comment on MESOS-7966 at 9/19/17 3:18 AM: - h4. Mesos version: {code:none} # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave We had a loop doing maintenance job in three mesos-slaves: 1.mesos-maintenance-schedule 2.machine-down 3.machine-up 4.mesos-maintenance-schedule-cancel But in the loop, we found one of mesos-master crashed and other mesos-master was elected. We found something in mesos.slave.FATAL: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() And in mesos.slave.INFO: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() 2017-09-12 16:39:07.394 err mesos-master[254491]: *** Check failure stack trace: *** 2017-09-12 16:39:07.402 err mesos-master[254491]: @ 0x7f4cf356fba6 google::LogMessage::Fail() 2017-09-12 16:39:07.413 err mesos-master[254491]: @ 0x7f4cf356fb05 google::LogMessage::SendToLog() 2017-09-12 16:39:07.420 err mesos-master[254491]: @ 0x7f4cf356f516 google::LogMessage::Flush() 2017-09-12 16:39:07.424 err mesos-master[254491]: @ 0x7f4cf357224a google::LogMessageFatal::~LogMessageFatal() 2017-09-12 16:39:07.429 err mesos-master[254491]: @ 0x7f4cf2344a32 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::updateInverseOffer() 2017-09-12 16:39:07.435 err mesos-master[254491]: @ 0x7f4cf1f8d9f9 _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_7SlaveIDERKNS1_11FrameworkIDERK6OptionINS1_20UnavailableResourcesEERKSC_INS1_9allocator18InverseOfferStatusEERKSC_INS1_7FiltersEES6_S9_SE_SJ_SN_EEvRKNS_3PIDIT_EEMSR_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES18_ 2017-09-12 16:39:07.445 err mesos-master[254491]: @ 0x7f4cf1f938bb _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_7SlaveIDERKNS5_11FrameworkIDERK6OptionINS5_20UnavailableResourcesEERKSG_INS5_9allocator18InverseOfferStatusEERKSG_INS5_7FiltersEESA_SD_SI_SN_SR_EEvRKNS0_3PIDIT_EEMSV_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ 2017-09-12 16:39:07.455 err mesos-master[254491]: @ 0x7f4cf34dd049 std::function<>::operator()() 2017-09-12 16:39:07.460 err mesos-master[254491]: @ 0x7f4cf34c1285 process::ProcessBase::visit() 2017-09-12 16:39:07.464 err mesos-master[254491]: @ 0x7f4cf34cc58a process::DispatchEvent::visit() 2017-09-12 16:39:07.465 err mesos-master[254491]: @ 0x7f4cf4e4ad4e process::ProcessBase::serve() 2017-09-12 16:39:07.469 err mesos-master[254491]: @ 0x7f4cf34bd281 process::ProcessManager::resume() 2017-09-12 16:39:07.471 err mesos-master[254491]: @ 0x7f4cf34b9a2c _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv 2017-09-12 16:39:07.473 err mesos-master[254491]: @ 0x7f4cf34cbbf2 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE 2017-09-12 16:39:07.475 err mesos-master[254491]: @ 0x7f4cf34cbb36 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv 2017-09-12 16:39:07.477 err mesos-master[254491]: @ 0x7f4cf34cbac0 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv 2017-09-12 16:39:07.478 err 
mesos-master[254491]: @ 0x7f4ced3ba1e0 (unknown) 2017-09-12 16:39:07.478 err mesos-master[254491]: @ 0x7f4ced613dc5 start_thread 2017-09-12 16:39:07.479 err mesos-master[254491]: @ 0x7f4cecb21ced __clone 2017-09-12 16:39:07.486 notice systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT This case could be reproduced by calling `for i in {1..8}; do python call.py; done` (call.py gist: https://gist.github.com/athlum/e2cd04bfb9f81a790d31643606252a49 ). Looks like there is something wrong when call /maintenance/schedule concurrently. We met this case because we use wrote a service base on ansible that manage the mesos cluster. When we create a task to update slave configs with a certain number of workers. Just like: 1. call schedule for 3 machine: a,b,c. 2. as machine a was done, maintenance window updates to: b,c 3. as an other machine "d" assigned after a immediately, windows will update to: b,c,d This change sometimes happen in little interval. Then we find the fatal log just in Bayou's mail. What's the right way to update maintanence window? Thanks to any reply. was (Author: adaibee): {code:none} # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave We had a loop doing maintenance job in three mesos-slaves: 1.mesos-maintenance-schedule 2.ma
[jira] [Comment Edited] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171058#comment-16171058 ] adaibee edited comment on MESOS-7966 at 9/19/17 3:19 AM: - h4. Mesos version: {code:none} # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} h4. Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave *We had a loop doing maintenance job in three mesos-slaves:* 1.mesos-maintenance-schedule 2.machine-down 3.machine-up 4.mesos-maintenance-schedule-cancel But in the loop, we found one of mesos-master crashed and other mesos-master was elected. We found something in mesos.slave.FATAL: {code:none} 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() And in mesos.slave.INFO: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() 2017-09-12 16:39:07.394 err mesos-master[254491]: *** Check failure stack trace: *** 2017-09-12 16:39:07.402 err mesos-master[254491]: @ 0x7f4cf356fba6 google::LogMessage::Fail() 2017-09-12 16:39:07.413 err mesos-master[254491]: @ 0x7f4cf356fb05 google::LogMessage::SendToLog() 2017-09-12 16:39:07.420 err mesos-master[254491]: @ 0x7f4cf356f516 google::LogMessage::Flush() 2017-09-12 16:39:07.424 err mesos-master[254491]: @ 0x7f4cf357224a google::LogMessageFatal::~LogMessageFatal() 2017-09-12 16:39:07.429 err mesos-master[254491]: @ 0x7f4cf2344a32 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::updateInverseOffer() 2017-09-12 16:39:07.435 err mesos-master[254491]: @ 0x7f4cf1f8d9f9 _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_7SlaveIDERKNS1_11FrameworkIDERK6OptionINS1_20UnavailableResourcesEERKSC_INS1_9allocator18InverseOfferStatusEERKSC_INS1_7FiltersEES6_S9_SE_SJ_SN_EEvRKNS_3PIDIT_EEMSR_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES18_ 2017-09-12 16:39:07.445 err mesos-master[254491]: @ 0x7f4cf1f938bb _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_7SlaveIDERKNS5_11FrameworkIDERK6OptionINS5_20UnavailableResourcesEERKSG_INS5_9allocator18InverseOfferStatusEERKSG_INS5_7FiltersEESA_SD_SI_SN_SR_EEvRKNS0_3PIDIT_EEMSV_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ 2017-09-12 16:39:07.455 err mesos-master[254491]: @ 0x7f4cf34dd049 std::function<>::operator()() 2017-09-12 16:39:07.460 err mesos-master[254491]: @ 0x7f4cf34c1285 process::ProcessBase::visit() 2017-09-12 16:39:07.464 err mesos-master[254491]: @ 0x7f4cf34cc58a process::DispatchEvent::visit() 2017-09-12 16:39:07.465 err mesos-master[254491]: @ 0x7f4cf4e4ad4e process::ProcessBase::serve() 2017-09-12 16:39:07.469 err mesos-master[254491]: @ 0x7f4cf34bd281 process::ProcessManager::resume() 2017-09-12 16:39:07.471 err mesos-master[254491]: @ 0x7f4cf34b9a2c _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv 2017-09-12 16:39:07.473 err mesos-master[254491]: @ 0x7f4cf34cbbf2 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE 2017-09-12 16:39:07.475 err mesos-master[254491]: @ 0x7f4cf34cbb36 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv 2017-09-12 16:39:07.477 err mesos-master[254491]: @ 0x7f4cf34cbac0 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv 2017-09-12 
16:39:07.478 err mesos-master[254491]: @ 0x7f4ced3ba1e0 (unknown) 2017-09-12 16:39:07.478 err mesos-master[254491]: @ 0x7f4ced613dc5 start_thread 2017-09-12 16:39:07.479 err mesos-master[254491]: @ 0x7f4cecb21ced __clone 2017-09-12 16:39:07.486 notice systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT {code} This case could be reproduced by calling `for i in {1..8}; do python call.py; done` (call.py gist: https://gist.github.com/athlum/e2cd04bfb9f81a790d31643606252a49 ). Looks like there is something wrong when call /maintenance/schedule concurrently. We met this case because we use wrote a service base on ansible that manage the mesos cluster. When we create a task to update slave configs with a certain number of workers. Just like: 1. call schedule for 3 machine: a,b,c. 2. as machine a was done, maintenance window updates to: b,c 3. as an other machine "d" assigned after a immediately, windows will update to: b,c,d This change sometimes happen in little interval. Then we find the fatal log just in Bayou's mail. What's the right way to update maintanence window? Thanks to any reply. was (Author: adaibee): h4. Mesos version: {code:none} # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave We had a loop doing maintenance job in three mesos-
[jira] [Comment Edited] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171058#comment-16171058 ] adaibee edited comment on MESOS-7966 at 9/19/17 3:20 AM: - h4. Mesos version: {code:none} # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} h4. Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave *We had a loop doing maintenance job in three mesos-slaves:* 1.mesos-maintenance-schedule 2.machine-down 3.machine-up 4.mesos-maintenance-schedule-cancel But in the loop, we found one of mesos-master crashed and other mesos-master was elected. We found something in mesos.slave.FATAL: {code:none} 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() And in mesos.slave.INFO: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() 2017-09-12 16:39:07.394 err mesos-master[254491]: *** Check failure stack trace: *** 2017-09-12 16:39:07.402 err mesos-master[254491]: @ 0x7f4cf356fba6 google::LogMessage::Fail() 2017-09-12 16:39:07.413 err mesos-master[254491]: @ 0x7f4cf356fb05 google::LogMessage::SendToLog() 2017-09-12 16:39:07.420 err mesos-master[254491]: @ 0x7f4cf356f516 google::LogMessage::Flush() 2017-09-12 16:39:07.424 err mesos-master[254491]: @ 0x7f4cf357224a google::LogMessageFatal::~LogMessageFatal() 2017-09-12 16:39:07.429 err mesos-master[254491]: @ 0x7f4cf2344a32 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::updateInverseOffer() 2017-09-12 16:39:07.435 err mesos-master[254491]: @ 0x7f4cf1f8d9f9 _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_7SlaveIDERKNS1_11FrameworkIDERK6OptionINS1_20UnavailableResourcesEERKSC_INS1_9allocator18InverseOfferStatusEERKSC_INS1_7FiltersEES6_S9_SE_SJ_SN_EEvRKNS_3PIDIT_EEMSR_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES18_ 2017-09-12 16:39:07.445 err mesos-master[254491]: @ 0x7f4cf1f938bb _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_7SlaveIDERKNS5_11FrameworkIDERK6OptionINS5_20UnavailableResourcesEERKSG_INS5_9allocator18InverseOfferStatusEERKSG_INS5_7FiltersEESA_SD_SI_SN_SR_EEvRKNS0_3PIDIT_EEMSV_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ 2017-09-12 16:39:07.455 err mesos-master[254491]: @ 0x7f4cf34dd049 std::function<>::operator()() 2017-09-12 16:39:07.460 err mesos-master[254491]: @ 0x7f4cf34c1285 process::ProcessBase::visit() 2017-09-12 16:39:07.464 err mesos-master[254491]: @ 0x7f4cf34cc58a process::DispatchEvent::visit() 2017-09-12 16:39:07.465 err mesos-master[254491]: @ 0x7f4cf4e4ad4e process::ProcessBase::serve() 2017-09-12 16:39:07.469 err mesos-master[254491]: @ 0x7f4cf34bd281 process::ProcessManager::resume() 2017-09-12 16:39:07.471 err mesos-master[254491]: @ 0x7f4cf34b9a2c _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv 2017-09-12 16:39:07.473 err mesos-master[254491]: @ 0x7f4cf34cbbf2 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE 2017-09-12 16:39:07.475 err mesos-master[254491]: @ 0x7f4cf34cbb36 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv 2017-09-12 16:39:07.477 err mesos-master[254491]: @ 0x7f4cf34cbac0 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv 2017-09-12 
16:39:07.478 err mesos-master[254491]: @ 0x7f4ced3ba1e0 (unknown) 2017-09-12 16:39:07.478 err mesos-master[254491]: @ 0x7f4ced613dc5 start_thread 2017-09-12 16:39:07.479 err mesos-master[254491]: @ 0x7f4cecb21ced __clone 2017-09-12 16:39:07.486 notice systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT {code} This case could be reproduced by calling `for i in {1..8}; do python call.py; done` (call.py gist: https://gist.github.com/athlum/e2cd04bfb9f81a790d31643606252a49 ). Looks like there is something wrong when call /maintenance/schedule concurrently. We met this case because we use wrote a service base on ansible that manage the mesos cluster. When we create a task to update slave configs with a certain number of workers. Just like: 1. call schedule for 3 machine: a,b,c. 2. as machine a was done, maintenance window updates to: b,c 3. as an other machine "d" assigned after a immediately, windows will update to: b,c,d This change sometimes happen in little interval.What's the best practice to update maintanence window? was (Author: adaibee): h4. Mesos version: {code:none} # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} h4. Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave *We had a loop doing maintenance job in three mesos-slaves:* 1.mesos-maintenance-schedule 2.machine-down 3.machine-up
[jira] [Comment Edited] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171058#comment-16171058 ] adaibee edited comment on MESOS-7966 at 9/19/17 3:25 AM: - h4. Mesos version: {code:none} # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} h4. Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave *We had a loop doing maintenance job in three mesos-slaves:* 1.mesos-maintenance-schedule 2.machine-down 3.machine-up 4.mesos-maintenance-schedule-cancel But in the loop, we found one of mesos-master crashed and other mesos-master was elected. We found something in mesos.slave.FATAL: {code:none} 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() And in mesos.slave.INFO: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() 2017-09-12 16:39:07.394 err mesos-master[254491]: *** Check failure stack trace: *** 2017-09-12 16:39:07.402 err mesos-master[254491]: @ 0x7f4cf356fba6 google::LogMessage::Fail() 2017-09-12 16:39:07.413 err mesos-master[254491]: @ 0x7f4cf356fb05 google::LogMessage::SendToLog() 2017-09-12 16:39:07.420 err mesos-master[254491]: @ 0x7f4cf356f516 google::LogMessage::Flush() 2017-09-12 16:39:07.424 err mesos-master[254491]: @ 0x7f4cf357224a google::LogMessageFatal::~LogMessageFatal() 2017-09-12 16:39:07.429 err mesos-master[254491]: @ 0x7f4cf2344a32 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::updateInverseOffer() 2017-09-12 16:39:07.435 err mesos-master[254491]: @ 0x7f4cf1f8d9f9 _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_7SlaveIDERKNS1_11FrameworkIDERK6OptionINS1_20UnavailableResourcesEERKSC_INS1_9allocator18InverseOfferStatusEERKSC_INS1_7FiltersEES6_S9_SE_SJ_SN_EEvRKNS_3PIDIT_EEMSR_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES18_ 2017-09-12 16:39:07.445 err mesos-master[254491]: @ 0x7f4cf1f938bb _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_7SlaveIDERKNS5_11FrameworkIDERK6OptionINS5_20UnavailableResourcesEERKSG_INS5_9allocator18InverseOfferStatusEERKSG_INS5_7FiltersEESA_SD_SI_SN_SR_EEvRKNS0_3PIDIT_EEMSV_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ 2017-09-12 16:39:07.455 err mesos-master[254491]: @ 0x7f4cf34dd049 std::function<>::operator()() 2017-09-12 16:39:07.460 err mesos-master[254491]: @ 0x7f4cf34c1285 process::ProcessBase::visit() 2017-09-12 16:39:07.464 err mesos-master[254491]: @ 0x7f4cf34cc58a process::DispatchEvent::visit() 2017-09-12 16:39:07.465 err mesos-master[254491]: @ 0x7f4cf4e4ad4e process::ProcessBase::serve() 2017-09-12 16:39:07.469 err mesos-master[254491]: @ 0x7f4cf34bd281 process::ProcessManager::resume() 2017-09-12 16:39:07.471 err mesos-master[254491]: @ 0x7f4cf34b9a2c _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv 2017-09-12 16:39:07.473 err mesos-master[254491]: @ 0x7f4cf34cbbf2 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE 2017-09-12 16:39:07.475 err mesos-master[254491]: @ 0x7f4cf34cbb36 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv 2017-09-12 16:39:07.477 err mesos-master[254491]: @ 0x7f4cf34cbac0 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv 2017-09-12 
16:39:07.478 err mesos-master[254491]: @ 0x7f4ced3ba1e0 (unknown) 2017-09-12 16:39:07.478 err mesos-master[254491]: @ 0x7f4ced613dc5 start_thread 2017-09-12 16:39:07.479 err mesos-master[254491]: @ 0x7f4cecb21ced __clone 2017-09-12 16:39:07.486 notice systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT {code} This case could be reproduced by calling {code:none} for i in {1..8}; do python call.py; done {code}(_call.py _gist: https://gist.github.com/athlum/e2cd04bfb9f81a790d31643606252a49 ). Looks like there is something wrong when call /maintenance/schedule concurrently. We met this case because we use wrote a service base on ansible that manage the mesos cluster. When we create a task to update slave configs with a certain number of workers. Just like: 1. call schedule for 3 machine: a,b,c. 2. as machine a was done, maintenance window updates to: b,c 3. as an other machine "d" assigned after a immediately, windows will update to: b,c,d This change sometimes happen in little interval.What's the best practice to update maintanence window? was (Author: adaibee): h4. Mesos version: {code:none} # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} h4. Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave *We had a loop doing maintenance job in three mesos-slaves:* 1.mesos-maintenance-schedule 2.machine-
[jira] [Comment Edited] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171058#comment-16171058 ] adaibee edited comment on MESOS-7966 at 9/19/17 3:26 AM: - h4. Mesos version: {code:none} # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} h4. Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave *We had a loop doing maintenance job in three mesos-slaves:* 1.mesos-maintenance-schedule 2.machine-down 3.machine-up 4.mesos-maintenance-schedule-cancel But in the loop, we found one of mesos-master crashed and other mesos-master was elected. We found something in mesos.slave.FATAL: {code:none} 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() And in mesos.slave.INFO: 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome() 2017-09-12 16:39:07.394 err mesos-master[254491]: *** Check failure stack trace: *** 2017-09-12 16:39:07.402 err mesos-master[254491]: @ 0x7f4cf356fba6 google::LogMessage::Fail() 2017-09-12 16:39:07.413 err mesos-master[254491]: @ 0x7f4cf356fb05 google::LogMessage::SendToLog() 2017-09-12 16:39:07.420 err mesos-master[254491]: @ 0x7f4cf356f516 google::LogMessage::Flush() 2017-09-12 16:39:07.424 err mesos-master[254491]: @ 0x7f4cf357224a google::LogMessageFatal::~LogMessageFatal() 2017-09-12 16:39:07.429 err mesos-master[254491]: @ 0x7f4cf2344a32 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::updateInverseOffer() 2017-09-12 16:39:07.435 err mesos-master[254491]: @ 0x7f4cf1f8d9f9 _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_7SlaveIDERKNS1_11FrameworkIDERK6OptionINS1_20UnavailableResourcesEERKSC_INS1_9allocator18InverseOfferStatusEERKSC_INS1_7FiltersEES6_S9_SE_SJ_SN_EEvRKNS_3PIDIT_EEMSR_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES18_ 2017-09-12 16:39:07.445 err mesos-master[254491]: @ 0x7f4cf1f938bb _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_7SlaveIDERKNS5_11FrameworkIDERK6OptionINS5_20UnavailableResourcesEERKSG_INS5_9allocator18InverseOfferStatusEERKSG_INS5_7FiltersEESA_SD_SI_SN_SR_EEvRKNS0_3PIDIT_EEMSV_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ 2017-09-12 16:39:07.455 err mesos-master[254491]: @ 0x7f4cf34dd049 std::function<>::operator()() 2017-09-12 16:39:07.460 err mesos-master[254491]: @ 0x7f4cf34c1285 process::ProcessBase::visit() 2017-09-12 16:39:07.464 err mesos-master[254491]: @ 0x7f4cf34cc58a process::DispatchEvent::visit() 2017-09-12 16:39:07.465 err mesos-master[254491]: @ 0x7f4cf4e4ad4e process::ProcessBase::serve() 2017-09-12 16:39:07.469 err mesos-master[254491]: @ 0x7f4cf34bd281 process::ProcessManager::resume() 2017-09-12 16:39:07.471 err mesos-master[254491]: @ 0x7f4cf34b9a2c _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv 2017-09-12 16:39:07.473 err mesos-master[254491]: @ 0x7f4cf34cbbf2 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE 2017-09-12 16:39:07.475 err mesos-master[254491]: @ 0x7f4cf34cbb36 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv 2017-09-12 16:39:07.477 err mesos-master[254491]: @ 0x7f4cf34cbac0 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv 2017-09-12 
16:39:07.478 err mesos-master[254491]: @ 0x7f4ced3ba1e0 (unknown) 2017-09-12 16:39:07.478 err mesos-master[254491]: @ 0x7f4ced613dc5 start_thread 2017-09-12 16:39:07.479 err mesos-master[254491]: @ 0x7f4cecb21ced __clone 2017-09-12 16:39:07.486 notice systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT {code} This case can be reproduced by running {code:none} for i in {1..8}; do python call.py; done {code}(call.py gist: https://gist.github.com/athlum/e2cd04bfb9f81a790d31643606252a49 ). It looks like something goes wrong when /maintenance/schedule is called concurrently. We hit this case because we wrote an Ansible-based service that manages the Mesos cluster and creates tasks that update slave configs on a certain number of workers at a time. Roughly: 1. call schedule for 3 machines: a, b, c. 2. once machine a is done, the maintenance window updates to: b, c. 3. if another machine "d" is assigned immediately after a, the window updates to: b, c, d. These updates sometimes happen within a very short interval. What's the best practice for updating the maintenance window? was (Author: adaibee): h4. Mesos version: {code:none} # rpm -qa mesos mesos-1.2.0-1.2.0.x86_64 {code} h4. Cluster info: 3 mesos-master(mesos_quorum=1) 3 mesos-slave *We had a loop doing maintenance job in three mesos-slaves:* 1.mesos-maintenance-schedule 2.machine-do
[jira] [Commented] (MESOS-7963) Task groups can lose the container limitation status.
[ https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171182#comment-16171182 ] Qian Zhang commented on MESOS-7963: --- [~jpe...@apache.org] In your second example, a task group with two tasks was launched; when the disk/du isolator raised a limitation for the root container, the Mesos containerizer tried to destroy the root container, but before that it had to destroy the two nested containers first. So when the first nested container was destroyed, the default executor knew about it (since it was still alive at that moment) and sent a {{TASK_FAILED}} for the first task with source {{SOURCE_EXECUTOR}}. For the second task, I think the default executor itself was destroyed by the Mesos containerizer before it got a chance to send a status update, which is why we see a {{TASK_FAILED}} for the second task with source {{SOURCE_AGENT}}. In your first example, the task group has only one task, so I think it follows the same flow as the first task in your second example, i.e., the default executor sent a {{TASK_FAILED}} for the task, and then the default executor itself was destroyed (or maybe self-terminated). Currently both the cgroups isolator (memory subsystem) and the disk/du isolator raise the limitation for the root container rather than the nested container. I think we may need to change them to raise the limitation for the nested container, and enhance the implementation of waitNestedContainer() so that it propagates the reason and message of the container termination to the default executor; the default executor can then send a status update with that reason and message for the nested container to the scheduler. > Task groups can lose the container limitation status. > - > > Key: MESOS-7963 > URL: https://issues.apache.org/jira/browse/MESOS-7963 > Project: Mesos > Issue Type: Bug > Components: containerization, executor >Reporter: James Peach > > If you run a single task in a task group and that task fails with a container > limitation, that status update can be lost and only the executor failure will > be reported to the framework.
> {noformat} > exec /opt/mesos/bin/mesos-execute --content_type=json > --master=jpeach.apple.com:5050 '--task_group={ > "tasks": > [ > { > "name": "7f141aca-55fe-4bb0-af4b-87f5ee26986a", > "task_id": {"value" : "2866368d-7279-4657-b8eb-bf1d968e8ebf"}, > "agent_id": {"value" : ""}, > "resources": [{ > "name": "cpus", > "type": "SCALAR", > "scalar": { > "value": 0.2 > } > }, { > "name": "mem", > "type": "SCALAR", > "scalar": { > "value": 32 > } > }, { > "name": "disk", > "type": "SCALAR", > "scalar": { > "value": 2 > } > } > ], > "command": { > "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M > count=64 ; sleep 1" > } > } > ] > }' > I0911 11:48:01.480689 7340 scheduler.cpp:184] Version: 1.5.0 > I0911 11:48:01.488868 7339 scheduler.cpp:470] New master detected at > master@17.228.224.108:5050 > Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 > Submitted task group with tasks [ 2866368d-7279-4657-b8eb-bf1d968e8ebf ] to > agent 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0' > Received status update TASK_RUNNING for task > '2866368d-7279-4657-b8eb-bf1d968e8ebf' > source: SOURCE_EXECUTOR > Received status update TASK_FAILED for task > '2866368d-7279-4657-b8eb-bf1d968e8ebf' > message: 'Command terminated with signal Killed' > source: SOURCE_EXECUTOR > {noformat} > However, the agent logs show that this failed with a memory limitation: > {noformat} > I0911 11:48:02.235818 7012 http.cpp:532] Processing call > WAIT_NESTED_CONTAINER > I0911 11:48:02.236395 7013 status_update_manager.cpp:323] Received status > update TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task > 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework > aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 > I0911 11:48:02.237083 7016 slave.cpp:4875] Forwarding the update > TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task > 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework > aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 to master@17.228.224.108:5050 > I0911 11:48:02.283661 7007 status_update_manager.cpp:395] Received status > update acknowledgement (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task > 2866368d-7279-4657-b8eb-bf1d968e8eb