Kiril, from what you described it does not sound like the problem is the Linux distribution; it may be your AWS configuration. However, if a combination of health checks and a heavily loaded agent leads to agent termination, I would like to investigate this issue. Please come back (with logs!) if you see the issue again.
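In the meantime, it may be worth confirming how aggressively the master is health-checking your agents. Below is a rough, untested sketch (assumptions: a reachable leading master on port 5050, the standard slave_ping_timeout / max_slave_ping_timeouts flag names, and the master/slave_removals/* metrics; adjust names to your Mesos version) that dumps the relevant flags and agent-removal counters over the master's HTTP endpoints:

    # Rough sketch (untested): read the master's flags and agent-removal
    # metrics over HTTP to see why/how often agents are being removed.
    # MASTER is a placeholder; point it at your leading master.
    import json
    import urllib.request

    MASTER = "http://10.0.0.1:5050"

    def fetch(path):
        with urllib.request.urlopen(MASTER + path) as resp:
            return json.load(resp)

    flags = fetch("/flags")["flags"]
    # The master marks an agent lost after max_slave_ping_timeouts missed
    # pings, each slave_ping_timeout apart (defaults: 5 x 15secs).
    print("slave_ping_timeout:      ", flags.get("slave_ping_timeout"))
    print("max_slave_ping_timeouts: ", flags.get("max_slave_ping_timeouts"))

    metrics = fetch("/metrics/snapshot")
    # A climbing "reason_unhealthy" counter means agents were removed for
    # failing health checks rather than unregistering cleanly.
    for key, value in sorted(metrics.items()):
        if key.startswith("master/slave_removals"):
            print(key, value)

If the unhealthy-removal counter keeps growing while the agents are under heavy load, raising those two ping-timeout flags on the master (and possibly --zk_session_timeout on masters and agents) is usually the first knob to try; but again, the logs will tell us more.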
On Tue, Dec 20, 2016 at 3:46 PM, Kiril Menshikov <kmenshi...@gmail.com> wrote:
> Hey,
>
> Sorry for the delayed response. I reinstalled my AWS infrastructure. Now I install everything on Red Hat Linux; before, I used Amazon Linux.
>
> I tested with a single master (m4.large). Everything works perfectly. I am not sure if it was Amazon Linux or my old configuration.
>
> Thanks,
> -Kirils
>
> On 18 December 2016 at 14:03, Guillermo Rodriguez <gu...@spritekin.com> wrote:
>
>> Hi,
>> I run my Mesos cluster in AWS, between 40 and 100 m4.2xlarge instances at any time, with between 200 and 1500 jobs at any time. Slaves run as spot instances.
>>
>> So, the only time I get a TASK_LOST is when I lose a spot instance due to being outbid.
>>
>> I guess you may also lose instances due to an AWS autoscaler scale-in procedure; for example, if it decides the cluster is underutilised it can kill any instance in your cluster, not necessarily the least used one. That's the reason we decided to develop our own customised autoscaler that detects and kills specific instances based on our own rules.
>>
>> So, are you using spot fleets or spot instances? Have you set up your scale-in procedures correctly?
>>
>> Also, if you are running fine-grained tiny jobs (400 jobs on a 10xlarge means 0.1 CPUs and 400MB RAM each), I recommend you avoid an m4.10xlarge instance and run xlarge instances instead. Same price, and if you lose one you only lose 1/10th of your jobs.
>>
>> Luck!
>>
>> ------------------------------
>> *From*: "haosdent" <haosd...@gmail.com>
>> *Sent*: Saturday, December 17, 2016 6:12 PM
>> *To*: "user" <user@mesos.apache.org>
>> *Subject*: Re: Mesos on AWS
>>
>> > sometimes Mesos agent is launched but master doesn’t show them.
>> It sounds like the Mesos master could not connect to your agents. Would you mind pasting your Mesos master log? Is there any information in it showing that Mesos agents are disconnected?
>>
>> On Sat, Dec 17, 2016 at 4:08 AM, Kiril Menshikov <kmenshi...@gmail.com> wrote:
>>>
>>> I have my own framework. Sometimes I get a TASK_LOST status with the message "slave lost during health check".
>>>
>>> Also, I found that sometimes a Mesos agent is launched but the master doesn't show it. From the agent I can see that it found the master and connected. After an agent restart it starts working.
>>>
>>> -Kiril
>>>
>>> On Dec 16, 2016, at 21:58, Zameer Manji <zma...@apache.org> wrote:
>>>
>>> Hey,
>>>
>>> Could you elaborate on what you mean by "delays and health check problems"? Are you using your own framework or an existing one? How are you launching the tasks?
>>>
>>> Could you share logs from Mesos that show timeouts to ZK?
>>>
>>> For reference, I operate a large Mesos cluster and I have never encountered problems when running 1k tasks concurrently, so I think sharing data would help everyone debug this problem.
>>>
>>> On Fri, Dec 16, 2016 at 6:05 AM, Kiril Menshikov <kmenshi...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Has anybody tried to run Mesos on AWS instances? Can you give me recommendations?
>>>>
>>>> I am developing an elastic (scaling AWS instances on demand) Mesos cluster. Currently I have 3 master instances and run about 1000 tasks simultaneously. I see delays and health check problems.
>>>>
>>>> ~400 tasks fit in one m4.10xlarge instance (160GB RAM, 40 CPUs).
>>>>
>>>> At the moment I increase the timeout in the ZooKeeper cluster. What can I do to decrease timeouts?
>>>>
>>>> Also, how can I increase performance? The main bottleneck is that I run a large number of tasks simultaneously for an hour and then shut them down or restart them (depending on how well they perform).
>>>>
>>>> -Kiril
>>>>
>>>> --
>>>> Zameer Manji
>>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>
>
> --
> Thanks,
> -Kiril
> Phone +37126409291
> Riga, Latvia
> Skype perimetr122