Hi,
 I run my Mesos cluster in AWS, with between 40 and 100 m4.2xlarge instances at any 
time, and between 200 and 1500 jobs at any time. Slaves run as spot instances.

 So, the only time I get a TASK_LOST is when I lose a spot instance due to 
being outbid.

 I guess you may also lose instances due to an AWS autoscaler scale-in 
procedure. For example, if it decides the cluster is underutilised, it can 
kill any instance in your cluster, not necessarily the least used one. That's 
the reason we decided to develop our own customised autoscaler that detects and 
kills specific instances based on our own rules.
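 To illustrate the idea (this is a hypothetical sketch, not our actual 
autoscaler; the function name, utilisation metric, and tuple shape are all 
made up for the example), a custom scale-in can rank agents itself and 
terminate the least-loaded ones, instead of letting AWS pick an arbitrary 
instance:

```python
def pick_scale_in_victims(instances, count):
    """Pick the `count` least-utilised instances to terminate.

    `instances` is a list of (instance_id, cpu_utilisation) tuples, where
    utilisation is a fraction between 0.0 and 1.0. Returns the instance ids
    to kill, least-utilised first, so scale-in removes the cheapest-to-lose
    agents rather than an arbitrary one.
    """
    ranked = sorted(instances, key=lambda pair: pair[1])
    return [instance_id for instance_id, _ in ranked[:count]]

# Example: three agents, scaling in by one -- the idlest agent is chosen.
agents = [("i-aaa", 0.80), ("i-bbb", 0.05), ("i-ccc", 0.40)]
print(pick_scale_in_victims(agents, 1))  # ['i-bbb']
```

 The ids returned by something like this would then be passed to the EC2 
terminate call, after draining the tasks on those agents.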

 So, are you using spot fleets or spot instances? Have you set up your scale-in 
procedures correctly?

 Also, if you are running fine-grained tiny jobs (400 jobs on a 10xlarge means 
0.1 CPUs and 400MB RAM each), I recommend you avoid an m4.10xlarge instance and 
run xlarge instances instead. The price is the same, and if you lose one you only 
lose 1/10th of your jobs.
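 The arithmetic behind that recommendation, as a quick sketch (figures taken 
from this thread: an m4.10xlarge has 40 vCPUs and 160GB RAM, and I'm assuming 
it costs roughly the same as ten m4.xlarge instances):

```python
# Per-job resource share on one m4.10xlarge running 400 tasks.
jobs = 400
big_cpus, big_ram_gb = 40, 160

cpu_per_job = big_cpus / jobs               # 0.1 vCPU per job
ram_per_job_mb = big_ram_gb * 1024 / jobs   # ~410 MB, i.e. the ~400MB above

# Blast radius when a spot instance is reclaimed:
jobs_lost_on_10xlarge = jobs      # one big box lost -> all 400 jobs lost
jobs_lost_on_xlarge = jobs // 10  # one of ten small boxes lost -> 40 jobs

print(cpu_per_job, ram_per_job_mb, jobs_lost_on_10xlarge, jobs_lost_on_xlarge)
```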

 Good luck!


----------------------------------------
 From: "haosdent" <haosd...@gmail.com>
Sent: Saturday, December 17, 2016 6:12 PM
To: "user" <user@mesos.apache.org>
Subject: Re: Mesos on AWS
  >  sometimes Mesos agent is launched but master doesn't show them.
It sounds like your Mesos Master could not connect to your Agents. Would you mind 
pasting your Mesos Master log? Is there any information in it showing that the 
agents are disconnected?

   On Sat, Dec 17, 2016 at 4:08 AM, Kiril Menshikov <kmenshi...@gmail.com> 
wrote:    I have my own framework. Sometimes I get a TASK_LOST status with the 
message "slave lost" during a health check.

 Also, I have found that sometimes a Mesos agent is launched but the master 
doesn't show it. From the agent I can see that it found the master and connected. 
After an agent restart it starts working.

 -Kiril

     On Dec 16, 2016, at 21:58, Zameer Manji <zma...@apache.org> wrote:
    Hey,
 Could you detail what you mean by "delays and health check problems"? Are 
you using your own framework or an existing one? How are you launching the 
tasks?

 Could you share logs from Mesos that show timeouts to ZK?

 For reference, I operate a large Mesos cluster and I have never encountered 
problems when running 1k tasks concurrently, so I think sharing data would help 
everyone debug this problem.

   On Fri, Dec 16, 2016 at 6:05 AM, Kiril Menshikov <kmenshi...@gmail.com> 
wrote:    Hi,

 Has anybody tried running Mesos on AWS instances? Can you give me 
recommendations?

 I am developing elastic (scale aws instances on demand) Mesos cluster. 
Currently I have 3 master instances. I run about 1000 tasks simultaneously. I 
see delays and health check problems.

 ~400 tasks fit on one m4.10xlarge instance (160GB RAM, 40 CPUs).

 At the moment I have increased the timeout in the ZooKeeper cluster. What can I 
do to decrease timeouts?

 Also, how can I increase performance? The main bottleneck is that I have a 
large number of tasks running simultaneously for an hour, after which I shut 
them down or restart them (depending on how well they perform).

 -Kiril
--     Zameer Manji


--  Best Regards, Haosdent Huang
