I was running a load test on a Mesos cluster and observed that when Mesos is running lots of frameworks, offer starvation occurs for certain frameworks, i.e. only a subset of the frameworks registered with Mesos gets offers. Let me describe the scenario below:
First phase: At the beginning, there's only one framework registered with Mesos, which is Marathon. The load generator uses Marathon's API to launch, let's say, 50 Jenkins masters with mesos-plugin installed. Once all 50 masters are launched, the Mesos cluster has 51 frameworks registered in total, because the mesos-plugin on each Jenkins master registers itself with mesos-master as a framework.

Second phase: Now the load generator triggers a couple of build jobs on each Jenkins master. Each framework's scheduler will now have, let's say, 2 items in its build queue. Once a framework gets a resource offer from the master, its scheduler can run the build tasks, provided the offer matches the resource constraints specified by mesos-plugin.

What I observed was that at the start of the second phase, some frameworks (Jenkins masters) got offers and got their tasks scheduled to run. But the rest of the frameworks didn't get resource offers from mesos-master, and the build jobs queued on those were starved. Tailing the Jenkins logs on these masters never showed 'Received offers'. Also, according to the Mesos master logs, Mesos was sending offers to only a handful of frameworks.
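To check this across a whole master log rather than by eye, something like the following could tally offers per framework. It is only a rough sketch: it assumes the master log lines have the form `Sending N offers to framework <id>`, and the function names `count_offers` and `starved` are made up for illustration.

```python
import re
from collections import Counter

# Assumed line format, e.g.:
#   I0310 17:56:44.703126 1156 master.cpp:2250] Sending 24 offers to framework <id>
OFFER_RE = re.compile(r"Sending (\d+) offers? to framework (\S+)")


def count_offers(lines):
    """Return a Counter mapping framework id -> number of offer batches sent."""
    batches = Counter()
    for line in lines:
        match = OFFER_RE.search(line)
        if match:
            batches[match.group(2)] += 1
    return batches


def starved(registered_ids, batches):
    """Return registered framework ids that never appear in the offer log."""
    return sorted(set(registered_ids) - set(batches))
```

The list of registered framework ids passed to `starved` has to be collected separately; the script does not discover it from the log.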
The logs below show the messages from one minute, but I saw similar behavior at other times. I have added a line break after each group of frameworks getting offers:

I0310 17:56:44.703126 1156 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0364
I0310 17:56:45.722951 1156 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0371
I0310 17:56:46.744184 1159 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0377
I0310 17:56:47.768546 1158 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0380
I0310 17:56:48.794517 1156 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0396

I0310 17:56:49.813484 1157 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0364
I0310 17:56:50.833155 1159 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0371
I0310 17:56:51.859712 1158 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0377
I0310 17:56:52.879678 1153 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0380
I0310 17:56:53.904261 1156 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0396

I0310 17:56:54.929472 1155 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0364
I0310 17:56:55.947387 1153 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0371
I0310 17:56:56.975060 1157 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0377
I0310 17:56:57.996995 1159 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0380
I0310 17:56:59.022555 1156 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0396

Couple of questions:

1. Does running multiple frameworks (say more than 10) have an impact on the resource allocation strategy?
2. If a registered framework keeps declining Mesos offers for a while, does Mesos take that into account when sending offers?

Links:
1. https://github.com/mesosphere/marathon
2. https://github.com/jenkinsci/mesos-plugin

-- Mohit