Re: resource fragmentation/mesos-master offers too low

2015-10-02 Thread Eren Güven
Hey Vinod,

Mesos: 0.23.0-1.0.ubuntu1404
Marathon: 0.10.1-1.0.416.ubuntu1404
Chronos: 2.4.0-0.1.20150828104228.ubuntu1404

Single master, 16 slaves, each with 6cpu 6mem, many (perhaps not all) had
more resources than mentioned in the logs (mesos offers with 6m mem etc..)
or what was required for the marathon tasks (.25cpu 128mem).

# cat /etc/mesos-master/offer_timeout
30secs

Marathon is running 200-400 tasks at any given time, chronos <20.

There aren't too much interesting in the logs, prior to this many lines of
Marathon app definition and then the 'Not all basic resources satisfied'
spam starts, something like


Sep 29 09:35:35 lab-mesos-master1 marathon[60609]: [INFO] [09/29/2015
09:35:35.938] [marathon-akka.actor.default-dispatcher-1396]
[akka://marathon/user/MarathonScheduler/$a/DeploymentManager/6d87463d-38
d6-4f20-86ba-9caefce12ad7/$a] Successfully started 0 instances of 
Sep 29 09:35:36 lab-mesos-master1 mesos-master[60442]: I0929
09:35:36.444304 60462 master.cpp:4290] Sending 1 offers to framework
20150928-140031-2581335306-5050-60442- (marathon) at scheduler-3cced
735-b1d8-4739-adae-b479a0411593@127.0.1.1:50185
Sep 29 09:35:36 lab-mesos-master1 mesos-master[60442]: I0929
09:35:36.445269 60462 master.cpp:4290] Sending 3 offers to framework
20150928-104536-2581335306-5050-25731- (chronos-2.4.0) at scheduler-
16cd75ef-d383-4c4c-8239-89a0c1fac536@127.0.1.1:57205
Sep 29 09:35:36 lab-mesos-master1 chronos[60791]: [2015-09-29 09:35:36,447]
INFO Received resource offers
(org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:82)
Sep 29 09:35:36 lab-mesos-master1 chronos[60791]: [2015-09-29 09:35:36,447]
INFO No tasks scheduled or next task has been disabled.
Sep 29 09:35:36 lab-mesos-master1 chronos[60791]:
 (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:131)
Sep 29 09:35:36 lab-mesos-master1 chronos[60791]: [2015-09-29 09:35:36,447]
INFO Declining unused offers.
(org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:89)
Sep 29 09:35:36 lab-mesos-master1 chronos[60791]: [2015-09-29 09:35:36,448]
INFO Declined unused offers with filter refuseSeconds=5.0 (use
--decline_offer_duration to reconfigure) (org.apache.mesos.chro
nos.scheduler.mesos.MesosJobFramework:97)
Sep 29 09:35:36 lab-mesos-master1 mesos-master[60442]: I0929
09:35:36.450083 60465 master.cpp:2884] Processing DECLINE call for offers:
[ 20150928-140031-2581335306-5050-60442-O436519 ] for framework 20
150928-104536-2581335306-5050-25731- (chronos-2.4.0) at
scheduler-16cd75ef-d383-4c4c-8239-89a0c1fac536@127.0.1.1:57205
Sep 29 09:35:36 lab-mesos-master1 mesos-master[60442]: I0929
09:35:36.451763 60465 master.cpp:2884] Processing DECLINE call for offers:
[ 20150928-140031-2581335306-5050-60442-O436520 ] for framework 20
150928-104536-2581335306-5050-25731- (chronos-2.4.0) at
scheduler-16cd75ef-d383-4c4c-8239-89a0c1fac536@127.0.1.1:57205
Sep 29 09:35:36 lab-mesos-master1 mesos-master[60442]: I0929
09:35:36.453340 60465 master.cpp:2884] Processing DECLINE call for offers:
[ 20150928-140031-2581335306-5050-60442-O436521 ] for framework 20
150928-104536-2581335306-5050-25731- (chronos-2.4.0) at
scheduler-16cd75ef-d383-4c4c-8239-89a0c1fac536@127.0.1.1:57205
Sep 29 09:35:36 lab-mesos-master1 mesos-master[60442]: I0929
09:35:36.457660 60467 hierarchical.hpp:761] Recovered cpus(*):0.05;
mem(*):12; disk(*):41211; ports(*):[31000-31005, 31007-31014, 31016-31063
, 31065-31088, 31090-31146, 31148-31176, 31178-31193, 31195-31226,
31228-31239, 31241-31321, 31323-31367, 31369-31524, 31526-31544,
31546-31562, 31564-31564, 31566-31598, 31600-31667, 31669-31709, 31711
-31711, 31713-31732, 31734-31750, 31752-31850, 31852-31864, 31866-31910,
31912-31992, 31994-32000] (total: cpus(*):6; mem(*):2922; disk(*):41211;
ports(*):[31000-32000], allocated: cpus(*):5.95; mem(*):
2910; ports(*):[31006-31006, 31015-31015, 31064-31064, 31089-31089,
31147-31147, 31177-31177, 31194-31194, 31227-31227, 31240-31240,
31322-31322, 31368-31368, 31525-31525, 31545-31545, 31563-31563, 3156
5-31565, 31599-31599, 31668-31668, 31710-31710, 31712-31712, 31733-31733,
31751-31751, 31851-31851, 31865-31865, 31911-31911, 31993-31993]) on slave
20150827-122131-2581335306-5050-34525-S274 from frame
work 20150928-104536-2581335306-5050-25731-
Sep 29 09:35:36 lab-mesos-master1 mesos-master[60442]: I0929
09:35:36.459028 60467 hierarchical.hpp:761] Recovered cpus(*):1.45;
mem(*):70; disk(*):41211; ports(*):[31000-31236, 31238-31242, 31244-31256
, 31258-31262, 31264-31267, 31269-31307, 31309-31313, 31315-31450,
31452-31486, 31488-31618, 31620-31622, 31624-31634, 31636-31653,
31655-31751, 31753-31826, 31828-31837, 31839-31896, 31898-31962, 31964
-31967, 31969-31983, 31985-32000] (total: cpus(*):6; mem(*):2922;
disk(*):41211; ports(*):[31000-32000], allocated: cpus(*):4.55;
mem(*):2852; ports(*):[31237-31237, 31243-31243, 31257-31257, 31263-3126
3, 31268-31268, 31308-31308, 31314-31314, 31451-31451, 31487-31487,
31619-31619, 31623-31623, 31635-31635, 316

Re: resource fragmentation/mesos-master offers too low

2015-09-29 Thread Vinod Kone
Can you provide more details about your setup and master logs when this
started to happen? Typically, when a framework declines an offer it sets a
timeout filter which gives mesos a chance to aggregate free resources on a
slave into one big offer.

On Tue, Sep 29, 2015 at 6:31 AM, Eren Güven  wrote:

> Hello,
>
> I had an incident where mesos kept offering marathon insufficient
> resources which resulted in lots of forever WAITING tasks in marathon.
> Below are some example logs I saw. It kept repeating the same insufficient
> offers even though the cluster had a lot of free resources like ~80cpus and
> 45G mem. A master restart solved the issue but I was wondering what can
> result in this situation or if anyone had any pointers on where to dig
> further.
>
> Thanks!
>
> cpu SATISFIED (0.25 <= 1.3514), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 34.0)
> cpu SATISFIED (0.25 <= 1.3514), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (160.0 > 34.0)
> cpu SATISFIED (0.25 <= 1.3514), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 34.0)
> cpu SATISFIED (0.25 <= 1.3514), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 34.0)
> cpu SATISFIED (0.25 <= 1.3514), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 34.0)
> cpu SATISFIED (0.25 <= 1.3514), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 34.0)
> cpu SATISFIED (0.25 <= 1.3514), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 34.0)
> cpu SATISFIED (0.25 <= 1.3514), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 34.0)
> cpu SATISFIED (0.25 <= 1.3514), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 34.0)
> cpu SATISFIED (0.25 <= 1.3514), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 34.0)
> cpu SATISFIED (0.25 <= 1.1504), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 62.0)
> cpu SATISFIED (0.25 <= 1.1504), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (160.0 > 62.0)
> cpu SATISFIED (0.25 <= 1.1504), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 62.0)
> cpu SATISFIED (0.25 <= 1.1504), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 62.0)
> cpu SATISFIED (0.25 <= 1.1504), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 62.0)
> cpu SATISFIED (0.25 <= 1.1504), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 62.0)
> cpu SATISFIED (0.25 <= 1.1504), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 62.0)
> cpu SATISFIED (0.25 <= 1.1504), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 62.0)
> cpu SATISFIED (0.25 <= 1.1504), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 62.0)
> cpu SATISFIED (0.25 <= 1.1504), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 62.0)
> cpu SATISFIED (0.25 <= 0.7998), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 6.0)
> cpu SATISFIED (0.25 <= 0.7998), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (160.0 > 6.0)
> cpu SATISFIED (0.25 <= 0.7998), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 6.0)
> cpu SATISFIED (0.25 <= 0.7998), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 6.0)
> cpu SATISFIED (0.25 <= 0.7998), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 6.0)
> cpu SATISFIED (0.25 <= 0.7998), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 6.0)
> cpu SATISFIED (0.25 <= 0.7998), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 6.0)
> cpu SATISFIED (0.25 <= 0.7998), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 6.0)
> cpu SATISFIED (0.25 <= 0.7998), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 6.0)
> cpu SATISFIED (0.25 <= 0.7998), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 6.0)
> cpu SATISFIED (0.25 <= 0.6504), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 6.0)
> cpu SATISFIED (0.25 <= 0.6504), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (160.0 > 6.0)
> cpu SATISFIED (0.25 <= 0.6504), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 6.0)
> cpu SATISFIED (0.25 <= 0.6504), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 6.0)
> cpu SATISFIED (0.25 <= 0.6504), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 6.0)
> cpu SATISFIED (0.25 <= 0.6504), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 6.0)
> cpu SATISFIED (0.25 <= 0.6504), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 6.0)
> cpu SATISFIED (0.25 <= 0.6504), disk SATISFIED (0.0 <= 0.0),
> mem NOT SATISFIED (128.0 > 6.0)
> cpu SATISFIED (0.25 <= 0.6504), disk SATIS

resource fragmentation/mesos-master offers too low

2015-09-29 Thread Eren Güven
Hello,

I had an incident where mesos kept offering marathon insufficient resources
which resulted in lots of forever WAITING tasks in marathon. Below are some
example logs I saw. It kept repeating the same insufficient offers even
though the cluster had a lot of free resources like ~80cpus and 45G mem. A
master restart solved the issue but I was wondering what can result in this
situation or if anyone had any pointers on where to dig further.

Thanks!

cpu SATISFIED (0.25 <= 1.3514), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 34.0)
cpu SATISFIED (0.25 <= 1.3514), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (160.0 > 34.0)
cpu SATISFIED (0.25 <= 1.3514), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 34.0)
cpu SATISFIED (0.25 <= 1.3514), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 34.0)
cpu SATISFIED (0.25 <= 1.3514), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 34.0)
cpu SATISFIED (0.25 <= 1.3514), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 34.0)
cpu SATISFIED (0.25 <= 1.3514), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 34.0)
cpu SATISFIED (0.25 <= 1.3514), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 34.0)
cpu SATISFIED (0.25 <= 1.3514), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 34.0)
cpu SATISFIED (0.25 <= 1.3514), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 34.0)
cpu SATISFIED (0.25 <= 1.1504), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 62.0)
cpu SATISFIED (0.25 <= 1.1504), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (160.0 > 62.0)
cpu SATISFIED (0.25 <= 1.1504), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 62.0)
cpu SATISFIED (0.25 <= 1.1504), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 62.0)
cpu SATISFIED (0.25 <= 1.1504), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 62.0)
cpu SATISFIED (0.25 <= 1.1504), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 62.0)
cpu SATISFIED (0.25 <= 1.1504), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 62.0)
cpu SATISFIED (0.25 <= 1.1504), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 62.0)
cpu SATISFIED (0.25 <= 1.1504), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 62.0)
cpu SATISFIED (0.25 <= 1.1504), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 62.0)
cpu SATISFIED (0.25 <= 0.7998), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 6.0)
cpu SATISFIED (0.25 <= 0.7998), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (160.0 > 6.0)
cpu SATISFIED (0.25 <= 0.7998), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 6.0)
cpu SATISFIED (0.25 <= 0.7998), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 6.0)
cpu SATISFIED (0.25 <= 0.7998), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 6.0)
cpu SATISFIED (0.25 <= 0.7998), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 6.0)
cpu SATISFIED (0.25 <= 0.7998), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 6.0)
cpu SATISFIED (0.25 <= 0.7998), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 6.0)
cpu SATISFIED (0.25 <= 0.7998), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 6.0)
cpu SATISFIED (0.25 <= 0.7998), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 6.0)
cpu SATISFIED (0.25 <= 0.6504), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 6.0)
cpu SATISFIED (0.25 <= 0.6504), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (160.0 > 6.0)
cpu SATISFIED (0.25 <= 0.6504), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 6.0)
cpu SATISFIED (0.25 <= 0.6504), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 6.0)
cpu SATISFIED (0.25 <= 0.6504), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 6.0)
cpu SATISFIED (0.25 <= 0.6504), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 6.0)
cpu SATISFIED (0.25 <= 0.6504), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 6.0)
cpu SATISFIED (0.25 <= 0.6504), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 6.0)
cpu SATISFIED (0.25 <= 0.6504), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 6.0)
cpu SATISFIED (0.25 <= 0.6504), disk SATISFIED (0.0 <= 0.0),
mem NOT SATISFIED (128.0 > 6.0)
cpu SATISFIED (0.25 <= 1.25), disk SATISFIED (0.0 <= 0.0), mem NOT
SATISFIED (128.0 > 42.0)
cpu SATISFIED (0.25 <= 1.25), disk SATISFIED (0.0 <= 0.0), mem NOT
SATISFIED (160.0 > 42.0)
cpu SATISFIED (0.25 <= 1.25), disk SATISFIED (0.0 <= 0.0), mem NOT
SATISFIED (128.0 > 42.0)
cpu SATISFIED (0.25 <= 1.25), disk SATISFIED