Hey,

Sorry for the delayed response. I reinstalled my AWS infrastructure. I now install everything on Red Hat Linux; before, I used Amazon Linux.
I tested with a single master (m4.large). Everything works perfectly. I am not sure whether the cause was Amazon Linux or my old configuration.

Thanks,
-Kirils

On 18 December 2016 at 14:03, Guillermo Rodriguez <gu...@spritekin.com> wrote:

> Hi,
> I run my Mesos cluster in AWS, between 40 and 100 m4.2xlarge instances at
> any time, with between 200 and 1500 jobs at any time. Slaves run as spot
> instances.
>
> So, the only moment I get a TASK_LOST is when I lose a spot instance due
> to being outbid.
>
> I guess you may also lose instances due to an AWS autoscaler scale-in
> procedure; for example, if it decides the cluster is underutilised then it
> can kill any instance in your cluster, not necessarily the least used one.
> That's the reason we decided to develop our customised autoscaler that
> detects and kills specific instances based on our own rules.
>
> So, are you using spot fleets or spot instances? Have you set up your
> scale-in procedures correctly?
>
> Also, if you are running fine-grained tiny jobs (400 jobs in a 10xlarge
> means 0.1 CPUs and 400MB RAM each), I recommend you avoid an m4.10xlarge
> instance and run xlarge instances instead. Same price and if you lose one
> you just lose 1/10th of your jobs.
>
> Luck!
>
> ------------------------------
> *From*: "haosdent" <haosd...@gmail.com>
> *Sent*: Saturday, December 17, 2016 6:12 PM
> *To*: "user" <user@mesos.apache.org>
> *Subject*: Re: Mesos on AWS
>
> > sometimes Mesos agent is launched but master doesn’t show them.
> It sounds like the Master could not connect to your Agents. Would you
> mind pasting your Mesos Master log? Does anything in it show that the
> Mesos agents are disconnected?
>
> On Sat, Dec 17, 2016 at 4:08 AM, Kiril Menshikov <kmenshi...@gmail.com>
> wrote:
>>
>> I have my own framework. Sometimes I get a TASK_LOST status with the
>> message "slave lost during health check".
>>
>> I also found that sometimes a Mesos agent is launched but the master
>> doesn’t show it. From the agent I see that it found the master and
>> connected. After an agent restart it starts working.
>>
>> -Kiril
>>
>>
>> On Dec 16, 2016, at 21:58, Zameer Manji <zma...@apache.org> wrote:
>>
>> Hey,
>>
>> Could you detail what you mean by "delays and health check problems"?
>> Are you using your own framework or an existing one? How are you launching
>> the tasks?
>>
>> Could you share logs from Mesos that show timeouts to ZK?
>>
>> For reference, I operate a large Mesos cluster and I have never
>> encountered problems when running 1k tasks concurrently, so I think sharing
>> data would help everyone debug this problem.
>>
>> On Fri, Dec 16, 2016 at 6:05 AM, Kiril Menshikov <kmenshi...@gmail.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> Has anybody tried to run Mesos on AWS instances? Can you give me
>>> recommendations?
>>>
>>> I am developing an elastic Mesos cluster (scaling AWS instances on demand).
>>> Currently I have 3 master instances. I run about 1000 tasks simultaneously.
>>> I see delays and health check problems.
>>>
>>> ~400 tasks fit in one m4.10xlarge instance (160GB RAM, 40 CPU).
>>>
>>> At the moment I have increased the timeout in the ZooKeeper cluster. What
>>> can I do to decrease timeouts?
>>>
>>> Also, how can I increase performance? The main bottleneck is that I have
>>> a large number of tasks running simultaneously for an hour, after which I
>>> shut them down or restart them (depending on how well they perform).
>>>
>>> -Kiril
>>>
>>> --
>>> Zameer Manji
>>>
>>
>
> --
> Best Regards,
> Haosdent Huang
>

--
Thanks,
-Kiril
Phone +37126409291
Riga, Latvia
Skype perimetr122
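
For reference, the sizing arithmetic in Guillermo's reply (400 jobs on one m4.10xlarge, versus spreading them over xlarge instances) can be checked with a small sketch. The m4.10xlarge figures (40 CPU, 160GB RAM) come from the thread; the xlarge figures are an assumption, taken as one tenth of those, and the class and function names are purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    cpus: int
    ram_gb: int


def per_task_share(instance, tasks):
    """Return (milli-CPUs, MB of RAM) per task when `tasks` tasks
    share one instance evenly. Uses 1 GB = 1000 MB, as the thread does."""
    return instance.cpus * 1000 // tasks, instance.ram_gb * 1000 // tasks


def tasks_that_fit(instance, cpu_millis, ram_mb):
    """Number of tasks of the given size that fit on one instance,
    limited by whichever resource runs out first."""
    return min(instance.cpus * 1000 // cpu_millis,
               instance.ram_gb * 1000 // ram_mb)


# Figures quoted in the thread.
big = InstanceType("m4.10xlarge", cpus=40, ram_gb=160)
# Assumption: an xlarge is one tenth of the 10xlarge.
small = InstanceType("m4.xlarge", cpus=4, ram_gb=16)

cpu_millis, ram_mb = per_task_share(big, tasks=400)
lost_big = tasks_that_fit(big, cpu_millis, ram_mb)
lost_small = tasks_that_fit(small, cpu_millis, ram_mb)

print(f"per task: {cpu_millis / 1000} CPU, {ram_mb} MB RAM")
print(f"tasks lost with one {big.name}: {lost_big}")
print(f"tasks lost with one {small.name}: {lost_small}")
```

Under these assumptions, losing one xlarge instance takes out a tenth of the tasks that losing one 10xlarge would, which is the point of the recommendation.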