Yes, as I said at the outset, the agents are on the same host, with different IPs, hostnames, and work_dirs. If having separate work_dirs is not sufficient to keep containers separated by agent, what additionally is required?
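For concreteness, here is roughly what the launch looks like today, plus a guess at what else might have to be unique per agent. This is a sketch, not a verified fix: --ip, --hostname, --port, and --work_dir are what I am already varying; the --runtime_dir and --cgroups_root values are assumptions on my part, prompted by the linux_launcher lines quoted below, since the launcher seems to discover containers via the cgroup hierarchy rather than via the work_dir:

```
# Sketch only: two agents on one host. The --runtime_dir and
# --cgroups_root values here are hypothetical; the idea is that every
# on-host namespace the agent touches would need to be distinct, not
# just --work_dir.
mesos-agent --master=127.0.0.1:5050 \
    --ip=127.1.1.1 --hostname=agent1 --port=5051 \
    --work_dir=/var/lib/mesos/agent1 \
    --runtime_dir=/var/run/mesos/agent1 \
    --cgroups_root=mesos_agent1

mesos-agent --master=127.0.0.1:5050 \
    --ip=127.1.1.2 --hostname=agent2 --port=5052 \
    --work_dir=/var/lib/mesos/agent2 \
    --runtime_dir=/var/run/mesos/agent2 \
    --cgroups_root=mesos_agent2
```

Is that the right set of flags, or is running multiple agents on one host simply unsupported?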
On Wed, Nov 15, 2017 at 11:13 AM, Vinod Kone <vinodk...@apache.org> wrote:

> How is agent2 able to see agent1's containers? Are they running on the
> same box!? Are they somehow sharing the filesystem? If yes, that's not
> supported.
>
> On Wed, Nov 15, 2017 at 8:07 AM, Dan Leary <d...@touchplan.io> wrote:
>
>> Sure, master log and agent logs are attached.
>>
>> Synopsis: In the master log, tasks t000001 and t000002 are running...
>>
>> > I1114 17:08:15.972033 5443 master.cpp:6841] Status update TASK_RUNNING (UUID: 9686a6b8-b04d-4bc5-9d26-32d50c7b0f74) for task t000001 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
>> > I1114 17:08:19.142276 5448 master.cpp:6841] Status update TASK_RUNNING (UUID: a6c72f31-2e47-4003-b707-9e8c4fb24f05) for task t000002 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
>>
>> Operator starts up agent2 around 17:08:50ish. Executor1 and its tasks
>> are terminated...
>>
>> > I1114 17:08:54.835841 5447 master.cpp:6964] Executor 'executor1' of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 on agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1): terminated with signal Killed
>> > I1114 17:08:54.835959 5447 master.cpp:9051] Removing executor 'executor1' with resources [] of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 on agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
>> > I1114 17:08:54.837419 5436 master.cpp:6841] Status update TASK_FAILED (UUID: d6697064-6639-4d50-b88e-65b3eead182d) for task t000001 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
>> > I1114 17:08:54.837497 5436 master.cpp:6903] Forwarding status update TASK_FAILED (UUID: d6697064-6639-4d50-b88e-65b3eead182d) for task t000001 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001
>> > I1114 17:08:54.837896 5436 master.cpp:8928] Updating the state of task t000001 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 (latest state: TASK_FAILED, status update state: TASK_FAILED)
>> > I1114 17:08:54.839159 5436 master.cpp:6841] Status update TASK_FAILED (UUID: 7e7f2078-3455-468b-9529-23aa14f7a7e0) for task t000002 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
>> > I1114 17:08:54.839221 5436 master.cpp:6903] Forwarding status update TASK_FAILED (UUID: 7e7f2078-3455-468b-9529-23aa14f7a7e0) for task t000002 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001
>> > I1114 17:08:54.839493 5436 master.cpp:8928] Updating the state of task t000002 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 (latest state: TASK_FAILED, status update state: TASK_FAILED)
>>
>> But agent2 doesn't register until later...
>>
>> > I1114 17:08:55.588762 5442 master.cpp:5714] Received register agent message from slave(1)@127.1.1.2:5052 (agent2)
>>
>> Meanwhile in the agent1 log, the termination of executor1 appears to be
>> the result of the destruction of its container...
>>
>> > I1114 17:08:54.810638 5468 containerizer.cpp:2612] Container cbcf6992-3094-4d0f-8482-4d68f68eae84 has exited
>> > I1114 17:08:54.810732 5468 containerizer.cpp:2166] Destroying container cbcf6992-3094-4d0f-8482-4d68f68eae84 in RUNNING state
>> > I1114 17:08:54.810761 5468 containerizer.cpp:2712] Transitioning the state of container cbcf6992-3094-4d0f-8482-4d68f68eae84 from RUNNING to DESTROYING
>>
>> Apparently because agent2 decided to "recover" the very same container...
>>
>> > I1114 17:08:54.775907 6041 linux_launcher.cpp:373] cbcf6992-3094-4d0f-8482-4d68f68eae84 is a known orphaned container
>> > I1114 17:08:54.779634 6037 containerizer.cpp:966] Cleaning up orphan container cbcf6992-3094-4d0f-8482-4d68f68eae84
>> > I1114 17:08:54.779705 6037 containerizer.cpp:2166] Destroying container cbcf6992-3094-4d0f-8482-4d68f68eae84 in RUNNING state
>> > I1114 17:08:54.779737 6037 containerizer.cpp:2712] Transitioning the state of container cbcf6992-3094-4d0f-8482-4d68f68eae84 from RUNNING to DESTROYING
>> > I1114 17:08:54.780740 6041 linux_launcher.cpp:505] Asked to destroy container cbcf6992-3094-4d0f-8482-4d68f68eae84
>>
>> Seems like an issue with the containerizer?
>>
>> On Tue, Nov 14, 2017 at 4:46 PM, Vinod Kone <vinodk...@apache.org> wrote:
>>
>>> That seems weird then. A new agent coming up on a new ip and host
>>> shouldn't affect other agents running on different hosts. Can you share
>>> master logs that surface the issue?
>>>
>>> On Tue, Nov 14, 2017 at 12:51 PM, Dan Leary <d...@touchplan.io> wrote:
>>>
>>>> Just one mesos-master (no zookeeper) with --ip=127.0.0.1
>>>> --hostname=localhost. In /etc/hosts are
>>>>
>>>>     127.1.1.1 agent1
>>>>     127.1.1.2 agent2
>>>>
>>>> etc., and mesos-agent gets passed --ip=127.1.1.1 --hostname=agent1 etc.
>>>>
>>>> On Tue, Nov 14, 2017 at 3:41 PM, Vinod Kone <vinodk...@apache.org> wrote:
>>>>
>>>>> ```Experiments thus far are with a cluster all on a single host,
>>>>> master on 127.0.0.1, agents have their own ip's and hostnames and
>>>>> ports.```
>>>>>
>>>>> What does this mean? How are all your masters and agents on the same
>>>>> host but still get different ips and hostnames?
>>>>>
>>>>> On Tue, Nov 14, 2017 at 12:22 PM, Dan Leary <d...@touchplan.io> wrote:
>>>>>
>>>>>> So I have a bespoke framework that runs under 1.4.0 using the v1 HTTP
>>>>>> API, custom executor, checkpointing disabled.
>>>>>> When the framework is running happily and a new agent is added to the
>>>>>> cluster, all the existing executors immediately get terminated.
>>>>>> The scheduler is told of the lost executors and tasks and then
>>>>>> receives offers about agents old and new and carries on normally.
>>>>>>
>>>>>> I would expect, however, that the existing executors should keep
>>>>>> running and the scheduler should just receive offers about the new agent.
>>>>>> It's as if agent recovery is being performed when the new agent is
>>>>>> launched even though no old agent has exited.
>>>>>> Experiments thus far are with a cluster all on a single host: master
>>>>>> on 127.0.0.1; agents have their own IPs, hostnames, and ports.
>>>>>>
>>>>>> Am I missing a configuration parameter? Or is this correct behavior?
>>>>>>
>>>>>> -Dan
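P.S. For anyone trying to reproduce this: assuming cgroups v1 and the default --cgroups_root=mesos (both are assumptions about this setup, not something confirmed in the logs above), the container set the linux launcher walks during recovery can be inspected directly, which would show why agent2 sees agent1's container as an orphan:

```
# Hypothetical check: any container launched under the shared freezer
# hierarchy is visible to every agent using the same --cgroups_root,
# regardless of that agent's --work_dir.
ls /sys/fs/cgroup/freezer/mesos/
# cbcf6992-3094-4d0f-8482-4d68f68eae84   <- agent1's container
```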