Yes, as I said at the outset, the agents are on the same host, with different IPs, hostnames, and work_dirs. If having separate work_dirs is not sufficient to keep containers separated by agent, what additionally is required?
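For concreteness, here is roughly what the launch looks like today, plus a guess at what else might have to be unique per agent. This is a sketch, not a verified fix: --ip, --hostname, --port, and --work_dir are what I am already varying; the --runtime_dir and --cgroups_root values are assumptions on my part, prompted by the linux_launcher lines quoted below, since the launcher seems to discover containers via the cgroup hierarchy rather than via the work_dir:

```
# Sketch only: two agents on one host. The --runtime_dir and
# --cgroups_root values here are hypothetical; the idea is that every
# on-host namespace the agent touches would need to be distinct, not
# just --work_dir.
mesos-agent --master=127.0.0.1:5050 \
    --ip=127.1.1.1 --hostname=agent1 --port=5051 \
    --work_dir=/var/lib/mesos/agent1 \
    --runtime_dir=/var/run/mesos/agent1 \
    --cgroups_root=mesos_agent1

mesos-agent --master=127.0.0.1:5050 \
    --ip=127.1.1.2 --hostname=agent2 --port=5052 \
    --work_dir=/var/lib/mesos/agent2 \
    --runtime_dir=/var/run/mesos/agent2 \
    --cgroups_root=mesos_agent2
```

Is that the right set of flags, or is running multiple agents on one host simply unsupported?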
On Wed, Nov 15, 2017 at 11:13 AM, Vinod Kone <vinodk...@apache.org> wrote:

> How is agent2 able to see agent1's containers? Are they running on the
> same box!? Are they somehow sharing the filesystem? If yes, that's not
> supported.
>
> On Wed, Nov 15, 2017 at 8:07 AM, Dan Leary <d...@touchplan.io> wrote:
>
>> Sure, master log and agent logs are attached.
>>
>> Synopsis: In the master log, tasks t000001 and t000002 are running...
>>
>> > I1114 17:08:15.972033 5443 master.cpp:6841] Status update TASK_RUNNING (UUID: 9686a6b8-b04d-4bc5-9d26-32d50c7b0f74) for task t000001 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
>> > I1114 17:08:19.142276 5448 master.cpp:6841] Status update TASK_RUNNING (UUID: a6c72f31-2e47-4003-b707-9e8c4fb24f05) for task t000002 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
>>
>> Operator starts up agent2 around 17:08:50ish. Executor1 and its tasks
>> are terminated...
>>
>> > I1114 17:08:54.835841 5447 master.cpp:6964] Executor 'executor1' of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 on agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1): terminated with signal Killed
>> > I1114 17:08:54.835959 5447 master.cpp:9051] Removing executor 'executor1' with resources [] of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 on agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
>> > I1114 17:08:54.837419 5436 master.cpp:6841] Status update TASK_FAILED (UUID: d6697064-6639-4d50-b88e-65b3eead182d) for task t000001 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
>> > I1114 17:08:54.837497 5436 master.cpp:6903] Forwarding status update TASK_FAILED (UUID: d6697064-6639-4d50-b88e-65b3eead182d) for task t000001 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001
>> > I1114 17:08:54.837896 5436 master.cpp:8928] Updating the state of task t000001 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 (latest state: TASK_FAILED, status update state: TASK_FAILED)
>> > I1114 17:08:54.839159 5436 master.cpp:6841] Status update TASK_FAILED (UUID: 7e7f2078-3455-468b-9529-23aa14f7a7e0) for task t000002 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
>> > I1114 17:08:54.839221 5436 master.cpp:6903] Forwarding status update TASK_FAILED (UUID: 7e7f2078-3455-468b-9529-23aa14f7a7e0) for task t000002 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001
>> > I1114 17:08:54.839493 5436 master.cpp:8928] Updating the state of task t000002 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 (latest state: TASK_FAILED, status update state: TASK_FAILED)
>>
>> But agent2 doesn't register until later...
>>
>> > I1114 17:08:55.588762 5442 master.cpp:5714] Received register agent message from slave(1)@127.1.1.2:5052 (agent2)
>>
>> Meanwhile in the agent1 log, the termination of executor1 appears to be
>> the result of the destruction of its container...
>>
>> > I1114 17:08:54.810638 5468 containerizer.cpp:2612] Container cbcf6992-3094-4d0f-8482-4d68f68eae84 has exited
>> > I1114 17:08:54.810732 5468 containerizer.cpp:2166] Destroying container cbcf6992-3094-4d0f-8482-4d68f68eae84 in RUNNING state
>> > I1114 17:08:54.810761 5468 containerizer.cpp:2712] Transitioning the state of container cbcf6992-3094-4d0f-8482-4d68f68eae84 from RUNNING to DESTROYING
>>
>> Apparently because agent2 decided to "recover" the very same container...
>>
>> > I1114 17:08:54.775907 6041 linux_launcher.cpp:373] cbcf6992-3094-4d0f-8482-4d68f68eae84 is a known orphaned container
>> > I1114 17:08:54.779634 6037 containerizer.cpp:966] Cleaning up orphan container cbcf6992-3094-4d0f-8482-4d68f68eae84
>> > I1114 17:08:54.779705 6037 containerizer.cpp:2166] Destroying container cbcf6992-3094-4d0f-8482-4d68f68eae84 in RUNNING state
>> > I1114 17:08:54.779737 6037 containerizer.cpp:2712] Transitioning the state of container cbcf6992-3094-4d0f-8482-4d68f68eae84 from RUNNING to DESTROYING
>> > I1114 17:08:54.780740 6041 linux_launcher.cpp:505] Asked to destroy container cbcf6992-3094-4d0f-8482-4d68f68eae84
>>
>> Seems like an issue with the containerizer?
>>
>> On Tue, Nov 14, 2017 at 4:46 PM, Vinod Kone <vinodk...@apache.org> wrote:
>>
>>> That seems weird then. A new agent coming up on a new ip and host
>>> shouldn't affect other agents running on different hosts. Can you share
>>> master logs that surface the issue?
>>>
>>> On Tue, Nov 14, 2017 at 12:51 PM, Dan Leary <d...@touchplan.io> wrote:
>>>
>>>> Just one mesos-master (no zookeeper) with --ip=127.0.0.1
>>>> --hostname=localhost. In /etc/hosts are
>>>>
>>>>     127.1.1.1 agent1
>>>>     127.1.1.2 agent2
>>>>
>>>> etc., and mesos-agent gets passed --ip=127.1.1.1 --hostname=agent1 etc.
>>>>
>>>> On Tue, Nov 14, 2017 at 3:41 PM, Vinod Kone <vinodk...@apache.org> wrote:
>>>>
>>>>> ```Experiments thus far are with a cluster all on a single host,
>>>>> master on 127.0.0.1, agents have their own ip's and hostnames and
>>>>> ports.```
>>>>>
>>>>> What does this mean? How are all your masters and agents on the same
>>>>> host but still get different ips and hostnames?
>>>>>
>>>>> On Tue, Nov 14, 2017 at 12:22 PM, Dan Leary <d...@touchplan.io> wrote:
>>>>>
>>>>>> So I have a bespoke framework that runs under 1.4.0 using the v1 HTTP
>>>>>> API, custom executor, checkpointing disabled.
>>>>>> When the framework is running happily and a new agent is added to the
>>>>>> cluster, all the existing executors immediately get terminated.
>>>>>> The scheduler is told of the lost executors and tasks and then
>>>>>> receives offers about agents old and new and carries on normally.
>>>>>>
>>>>>> I would expect, however, that the existing executors should keep
>>>>>> running and the scheduler should just receive offers about the new agent.
>>>>>> It's as if agent recovery is being performed when the new agent is
>>>>>> launched even though no old agent has exited.
>>>>>> Experiments thus far are with a cluster all on a single host: master
>>>>>> on 127.0.0.1; agents have their own IPs, hostnames, and ports.
>>>>>>
>>>>>> Am I missing a configuration parameter? Or is this correct behavior?
>>>>>>
>>>>>> -Dan
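P.S. For anyone trying to reproduce this: assuming cgroups v1 and the default --cgroups_root=mesos (both are assumptions about this setup, not something confirmed in the logs above), the container set the linux launcher walks during recovery can be inspected directly, which would show why agent2 sees agent1's container as an orphan:

```
# Hypothetical check: any container launched under the shared freezer
# hierarchy is visible to every agent using the same --cgroups_root,
# regardless of that agent's --work_dir.
ls /sys/fs/cgroup/freezer/mesos/
# cbcf6992-3094-4d0f-8482-4d68f68eae84   <- agent1's container
```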