> sometimes Mesos agent is launched but master doesn’t show them.

It sounds like the Mesos master could not connect to your agents. Would you mind pasting your Mesos master log? Does anything in it show that the Mesos agents were disconnected?

On Sat, Dec 17, 2016 at 4:08 AM, Kiril Menshikov <kmenshi...@gmail.com> wrote:

> I have my own framework. Sometimes I get a TASK_LOST status with message
> slave lost during health check.
>
> I also found that sometimes a Mesos agent is launched but the master doesn’t
> show it. From the agent I can see that it found the master and connected.
> After an agent restart it starts working.
>
> -Kiril
>
>
> On Dec 16, 2016, at 21:58, Zameer Manji <zma...@apache.org> wrote:
>
> Hey,
>
> Could you detail what you mean by "delays and health check problems"?
> Are you using your own framework or an existing one? How are you launching
> the tasks?
>
> Could you share logs from Mesos that show timeouts to ZK?
>
> For reference, I operate a large Mesos cluster and I have never
> encountered problems when running 1k tasks concurrently, so I think sharing
> data would help everyone debug this problem.
>
> On Fri, Dec 16, 2016 at 6:05 AM, Kiril Menshikov <kmenshi...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Has anybody tried running Mesos on AWS instances? Can you give me
>> recommendations?
>>
>> I am developing an elastic Mesos cluster (scaling AWS instances on demand).
>> Currently I have 3 master instances. I run about 1000 tasks simultaneously.
>> I see delays and health check problems.
>>
>> ~400 tasks fit on one m4.10xlarge instance (160GB RAM, 40 CPU).
>>
>> At the moment I have increased the timeout in the ZooKeeper cluster. What
>> can I do to reduce timeouts?
>>
>> Also, how can I increase performance? The main bottleneck is that I have
>> a large number of tasks (running simultaneously) for an hour, after which
>> I shut them down or restart them (depending on how well they perform).
>>
>> -Kiril
>>
>> --
>> Zameer Manji
>>
>
>

--
Best Regards,
Haosdent Huang