Hi,

I run my Mesos cluster in AWS, with between 40 and 100 m4.2xlarge instances at any time, running between 200 and 1500 jobs. The slaves run as spot instances.
So, the only time I get a TASK_LOST is when I lose a spot instance due to being outbid. I guess you may also lose instances due to an AWS autoscaler scale-in procedure: for example, if it decides the cluster is underutilised, it can kill any instance in your cluster, not necessarily the least used one. That's the reason we decided to develop our own customised autoscaler that detects and kills specific instances based on our own rules.

So, are you using spot fleets or spot instances? Have you set up your scale-in procedures correctly?

Also, if you are running fine-grained tiny jobs (400 jobs in a 10xlarge means 0.1 CPUs and 400MB RAM each), I recommend you avoid an m4.10xlarge instance and run xlarge instances instead. Ten xlarge instances cost the same as one 10xlarge, and if you lose one you lose only 1/10th of your jobs.

Good luck!

----------------------------------------
From: "haosdent" <haosd...@gmail.com>
Sent: Saturday, December 17, 2016 6:12 PM
To: "user" <user@mesos.apache.org>
Subject: Re: Mesos on AWS

> sometimes Mesos agent is launched but master doesn't show them.

It sounds like the Master could not connect to your Agents. Would you mind pasting your Mesos Master log? Does anything in it show that the Mesos agents are disconnected?

On Sat, Dec 17, 2016 at 4:08 AM, Kiril Menshikov <kmenshi...@gmail.com> wrote:

I have my own framework. Sometimes I get a TASK_LOST status with the message slave lost during health check.

Also, I found that sometimes a Mesos agent is launched but the master doesn't show it. From the agent I can see that it found the master and connected. After an agent restart it starts working.

-Kiril

On Dec 16, 2016, at 21:58, Zameer Manji <zma...@apache.org> wrote:

Hey,

Could you elaborate on what you mean by "delays and health check problems"? Are you using your own framework or an existing one? How are you launching the tasks? Could you share logs from Mesos that show timeouts to ZK?
For reference, I operate a large Mesos cluster and I have never encountered problems when running 1k tasks concurrently, so I think sharing data would help everyone debug this problem.

On Fri, Dec 16, 2016 at 6:05 AM, Kiril Menshikov <kmenshi...@gmail.com> wrote:

Hi,

Has anybody tried running Mesos on AWS instances? Can you give me recommendations?

I am developing an elastic Mesos cluster (it scales AWS instances on demand). Currently I have 3 master instances. I run about 1000 tasks simultaneously. I see delays and health check problems.

~400 tasks fit in one m4.10xlarge instance (160GB RAM, 40 CPU).

At the moment I have increased the timeout in the ZooKeeper cluster. What can I do to decrease timeouts?

Also, how can I increase performance? The main bottleneck is that I have a large number of tasks running simultaneously for an hour, after which I shut them down or restart them (depending on how well they perform).

-Kiril

--
Zameer Manji

--
Best Regards,
Haosdent Huang
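
The sizing arithmetic from the first reply in this thread (400 tasks on one m4.10xlarge, versus the same tasks spread over ten xlarge instances) can be checked with a few lines of Python. This is only a sketch: the m4.10xlarge figures (40 CPUs, 160GB RAM) are the ones quoted in the thread, the xlarge figures are assumed to be exactly one tenth of that, and the names `InstanceType`, `per_task_share` and `tasks_lost_per_instance` are made up for illustration. It also assumes tasks are spread evenly across instances, which a real scheduler does not guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    cpus: int
    ram_mb: int


# Figures quoted in the thread for m4.10xlarge: 40 CPUs, 160GB RAM.
M4_10XLARGE = InstanceType("m4.10xlarge", cpus=40, ram_mb=160_000)
# Assumption: an xlarge is one tenth of the 10xlarge figures above.
M4_XLARGE = InstanceType("m4.xlarge", cpus=4, ram_mb=16_000)


def per_task_share(instance, tasks_on_instance):
    """Return (CPUs, MB of RAM) available to each task on one instance."""
    return (
        instance.cpus / tasks_on_instance,
        instance.ram_mb / tasks_on_instance,
    )


def tasks_lost_per_instance(total_tasks, instance_count):
    """Tasks lost when one instance disappears, assuming an even spread."""
    return total_tasks / instance_count


if __name__ == "__main__":
    total_tasks = 400

    cpu, ram = per_task_share(M4_10XLARGE, total_tasks)
    print(f"per task on one {M4_10XLARGE.name}: {cpu} CPUs, {ram} MB RAM")

    # Compare losing the single big instance with losing one of ten small ones.
    for instance, count in ((M4_10XLARGE, 1), (M4_XLARGE, 10)):
        lost = tasks_lost_per_instance(total_tasks, count)
        print(f"{count} x {instance.name}: losing one instance loses {lost:.0f} tasks")
```

The per-task resources are the same in both layouts; only the number of tasks lost with a single spot instance changes.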