Marathon stability and use-case

Antonin Kral Mon, 19 Jan 2015 04:08:24 -0800

Hi all,

first of all, thank you for all the hard work on Mesos and related stuff.
We are running a fairly small Mesos/Marathon cluster (3 masters + 9
slaves + 3 ZK nodes). All servers are hosted at http://www.hetzner.de/ .
This means that we sometimes face network issues, frequently caused by
DDoS attacks running against other servers in the datacenter.


When that happens, we face huge problems with our Marathon installation.
The typical behavior is that Marathon abandons its tasks: it reports
that fewer tasks are running (frequently 0) than were requested when
scaling. It then tries to scale up, which fails because the workers are
still occupied by the previous jobs, which are correctly reported in
Mesos.
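
To make the mismatch concrete, here is a minimal sketch of how the two
views could be compared. It is an illustration only, not something we
run in production. It assumes the JSON has already been fetched from
Marathon's /v2/apps endpoint and from the Mesos master's state.json, and
that the fields "instances", "tasksRunning", "frameworks", "tasks" and
"state" look the way they do in our versions. The function names and the
"marathon" framework-name prefix are my own assumptions.

```python
def count_running_in_marathon(marathon_apps):
    """Sum the running-task counts that Marathon reports for all apps.

    `marathon_apps` is assumed to be the parsed JSON of Marathon's
    /v2/apps response: {"apps": [{"id": ..., "tasksRunning": ...}, ...]}.
    """
    return sum(app.get("tasksRunning", 0)
               for app in marathon_apps.get("apps", []))


def count_running_in_mesos(mesos_state, framework_prefix="marathon"):
    """Count TASK_RUNNING tasks of Marathon's framework as seen by Mesos.

    `mesos_state` is assumed to be the parsed JSON of the master's
    state.json. Matching the framework by name prefix is an assumption;
    adjust it to whatever name Marathon registers with in your cluster.
    """
    total = 0
    for framework in mesos_state.get("frameworks", []):
        if not framework.get("name", "").startswith(framework_prefix):
            continue
        total += sum(1 for task in framework.get("tasks", [])
                     if task.get("state") == "TASK_RUNNING")
    return total


def under_scaled_apps(marathon_apps):
    """List (app id, requested, running) where Marathon sees too few tasks."""
    return [(app.get("id"), app.get("instances", 0), app.get("tasksRunning", 0))
            for app in marathon_apps.get("apps", [])
            if app.get("tasksRunning", 0) < app.get("instances", 0)]
```

If count_running_in_mesos() is clearly larger than
count_running_in_marathon() while under_scaled_apps() is non-empty, that
would match what we see: Mesos still runs the tasks, but Marathon no
longer counts them.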

We have not been able to pinpoint anything helpful in the Marathon log
files. We have tried running with 1 master as well as with 3 masters.
The 3-master setup actually seemed a bit worse.

The only working solution so far is to stop everything, wipe ZK, kill
all jobs on Mesos, and then start all components again.

So I would like to ask a couple of questions:

  - What is the actual use-case for Marathon?

    Is it expected to run a larger number of apps/jobs (right now we
    have about 50 apps), or rather something like 5 of them, which are
    themselves Mesos frameworks?

  - Is there a way to tell Marathon to take ownership of currently
    running jobs?

    Honestly, I am not really sure how this could work, as I possibly
    don't have any state information about them.

  - What command line should we use to collect helpful information so
    that you guys can debug the problem next time?

    As you can see, the difficulty is that the problems are quite
    random. We didn't have any problems during December, but already had
    about 3 total breakdowns last week.

Thanks a lot,

    Antonin
