Kiril, from what you described it does not sound like the problem is the Linux distribution; it may be your AWS configuration. However, if a combination of health checks and a heavily loaded agent leads to agent termination, I would like to investigate this issue. Please come back (with logs!) if you see the issue again.
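In the meantime, it may be worth confirming how aggressively the master is health-checking your agents. Below is a rough, untested sketch (assumptions: a reachable leading master on port 5050, the standard slave_ping_timeout / max_slave_ping_timeouts flag names, and the master/slave_removals/* metrics; adjust names to your Mesos version) that dumps the relevant flags and agent-removal counters over the master's HTTP endpoints:

    # Rough sketch (untested): read the master's flags and agent-removal
    # metrics over HTTP to see why/how often agents are being removed.
    # MASTER is a placeholder; point it at your leading master.
    import json
    import urllib.request

    MASTER = "http://10.0.0.1:5050"

    def fetch(path):
        with urllib.request.urlopen(MASTER + path) as resp:
            return json.load(resp)

    flags = fetch("/flags")["flags"]
    # The master marks an agent lost after max_slave_ping_timeouts missed
    # pings, each slave_ping_timeout apart (defaults: 5 x 15secs).
    print("slave_ping_timeout:      ", flags.get("slave_ping_timeout"))
    print("max_slave_ping_timeouts: ", flags.get("max_slave_ping_timeouts"))

    metrics = fetch("/metrics/snapshot")
    # A climbing "reason_unhealthy" counter means agents were removed for
    # failing health checks rather than unregistering cleanly.
    for key, value in sorted(metrics.items()):
        if key.startswith("master/slave_removals"):
            print(key, value)

If the unhealthy-removal counter keeps growing while the agents are under heavy load, raising those two ping-timeout flags on the master (and possibly --zk_session_timeout on masters and agents) is usually the first knob to try; but again, the logs will tell us more.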
On Tue, Dec 20, 2016 at 3:46 PM, Kiril Menshikov <kmenshi...@gmail.com> wrote:
> Hey,
>
> Sorry for the delayed response. I reinstalled my AWS infrastructure. Now I install everything on Red Hat Linux; before, I used Amazon Linux.
>
> I tested with a single master (m4.large). Everything works perfectly. I am not sure if it was Amazon Linux or my old configuration.
>
> Thanks,
> -Kirils
>
> On 18 December 2016 at 14:03, Guillermo Rodriguez <gu...@spritekin.com> wrote:
>
>> Hi,
>> I run my Mesos cluster in AWS, between 40 and 100 m4.2xlarge instances at any time, with between 200 and 1500 jobs at any time. Slaves run as spot instances.
>>
>> So, the only time I get a TASK_LOST is when I lose a spot instance due to being outbid.
>>
>> I guess you may also lose instances due to an AWS autoscaler scale-in procedure; for example, if it decides the cluster is underutilised it can kill any instance in your cluster, not necessarily the least used one. That's the reason we decided to develop our own customised autoscaler that detects and kills specific instances based on our own rules.
>>
>> So, are you using spot fleets or spot instances? Have you set up your scale-in procedures correctly?
>>
>> Also, if you are running fine-grained tiny jobs (400 jobs on a 10xlarge means 0.1 CPUs and 400MB RAM each), I recommend you avoid an m4.10xlarge instance and run xlarge instances instead. Same price, and if you lose one you only lose 1/10th of your jobs.
>>
>> Luck!
>>
>> ------------------------------
>> *From*: "haosdent" <haosd...@gmail.com>
>> *Sent*: Saturday, December 17, 2016 6:12 PM
>> *To*: "user" <user@mesos.apache.org>
>> *Subject*: Re: Mesos on AWS
>>
>> > sometimes Mesos agent is launched but master doesn’t show them.
>> It sounds like the Mesos master could not connect to your agents. Would you mind pasting your Mesos master log? Is there any information in it showing that Mesos agents are disconnected?
>>
>> On Sat, Dec 17, 2016 at 4:08 AM, Kiril Menshikov <kmenshi...@gmail.com> wrote:
>>>
>>> I have my own framework. Sometimes I get a TASK_LOST status with the message "slave lost during health check".
>>>
>>> Also, I found that sometimes a Mesos agent is launched but the master doesn't show it. From the agent I can see that it found the master and connected. After an agent restart it starts working.
>>>
>>> -Kiril
>>>
>>> On Dec 16, 2016, at 21:58, Zameer Manji <zma...@apache.org> wrote:
>>>
>>> Hey,
>>>
>>> Could you elaborate on what you mean by "delays and health check problems"? Are you using your own framework or an existing one? How are you launching the tasks?
>>>
>>> Could you share logs from Mesos that show timeouts to ZK?
>>>
>>> For reference, I operate a large Mesos cluster and I have never encountered problems when running 1k tasks concurrently, so I think sharing data would help everyone debug this problem.
>>>
>>> On Fri, Dec 16, 2016 at 6:05 AM, Kiril Menshikov <kmenshi...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Has anybody tried to run Mesos on AWS instances? Can you give me recommendations?
>>>>
>>>> I am developing an elastic (scaling AWS instances on demand) Mesos cluster. Currently I have 3 master instances and run about 1000 tasks simultaneously. I see delays and health check problems.
>>>>
>>>> ~400 tasks fit in one m4.10xlarge instance (160GB RAM, 40 CPUs).
>>>>
>>>> At the moment I increase the timeout in the ZooKeeper cluster. What can I do to decrease timeouts?
>>>>
>>>> Also, how can I increase performance? The main bottleneck is that I run a large number of tasks simultaneously for an hour and then shut them down or restart them (depending on how well they perform).
>>>>
>>>> -Kiril
>>>>
>>>> --
>>>> Zameer Manji
>>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>
>
> --
> Thanks,
> -Kiril
> Phone +37126409291
> Riga, Latvia
> Skype perimetr122