Hi all,

first of all, thank you for all the hard work on Mesos and related projects. We are running a fairly small Mesos/Marathon cluster (3 masters + 9 slaves + 3 ZK nodes). All servers are hosted at http://www.hetzner.de/. This means that we sometimes face network issues, frequently caused by DDoS attacks running against other servers in the datacenters.
When that happens, we run into serious problems with our Marathon installation. Typically, Marathon abandons its tasks: it reports fewer running tasks (frequently 0) than the number requested by scaling. It then tries to scale up, which fails because the workers are still occupied by the previous jobs, which Mesos reports correctly. We have not been able to pinpoint anything helpful in the Marathon log files. We have tried running in 1-master as well as 3-master mode; the 3-node mode actually seemed a bit worse. The only working solution so far is to stop everything, wipe ZK, kill all jobs on Mesos, and then start all components again.

So I would like to ask a couple of questions:

- What is the actual use case for Marathon? Is it expected to run a larger number of apps/jobs (right now we have something like 50 apps), or rather just around 5 of them, which are Mesos frameworks?
- Is there a way to tell Marathon to take ownership of currently running jobs? Honestly, I am not really sure how this could work, as I possibly don't have any state information about them.
- What command line should we use to collect helpful information for you guys to debug the problem next time?

As you can see, the difficulty is that the failures are quite random. We didn't have any problems during December, but we already had 3 total breakdowns last week.

Thanks a lot,
Antonin