On the master node, I see this printed over and over in the mesos-master.WARNING log file:

W0615 06:06:51.211262 8672 hierarchical_allocator_process.hpp:589] Using the default value of 'refuse_seconds' to create the refused resources filter because the input value is negative
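As far as I understand it, that warning is the allocator falling back to its default because the offer reply it got from some framework carried a negative refuse_seconds in its Filters. Just to illustrate the mechanism (this is not Spark's or Chronos' actual code, and the 10-second value is made up), a decline with an explicit non-negative refuse_seconds in the Mesos Java API would look roughly like:

    import org.apache.mesos.Protos;
    import org.apache.mesos.SchedulerDriver;

    // Inside Scheduler.resourceOffers(driver, offers): decline offers we can't
    // use, asking the master not to re-offer these resources for the next 10
    // seconds. A negative refuse_seconds here is what makes the allocator log
    // the WARNING above and fall back to its default.
    Protos.Filters filters = Protos.Filters.newBuilder()
            .setRefuseSeconds(10)
            .build();
    for (Protos.Offer offer : offers) {
        driver.declineOffer(offer.getId(), filters);
    }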
Here's what I see in the master INFO file:

I0616 12:10:55.040024 8674 http.cpp:478] HTTP request for '/master/state.json'
I0616 12:10:55.425833 8669 master.cpp:3843] Sending 1 offers to framework 20150511-140547-189138442-5051-8667-0831 (Savings) at scheduler-5a5e99d4-5e16-4a48-94d5-86f751615a04@10.6.71.203:47979
I0616 12:10:55.438303 8669 master.cpp:3843] Sending 1 offers to framework 20150304-134212-222692874-5051-2300-0054 (chronos-2.3.2_mesos-0.20.1-SNAPSHOT) at scheduler-c8f2acc2-d16e-44d5-b54f-7f88d3ab39a2@10.6.70.11:57549
I0616 12:10:55.441295 8669 master.cpp:3843] Sending 1 offers to framework 20150511-140547-189138442-5051-8667-0838 (Savings) at scheduler-8b4389df-109e-49f5-8064-dd263fbec9fe@10.6.71.202:53346
I0616 12:10:55.442204 8669 master.cpp:2344] Processing reply for offers: [ 20150511-140547-189138442-5051-8667-O9282037 ] on slave 20150511-140547-189138442-5051-8667-S4 at slave(1)@10.6.71.203:5151 (secasdb01-2) for framework 20150511-140547-189138442-5051-8667-0831 (Savings) at scheduler-5a5e99d4-5e16-4a48-94d5-86f751615a04@10.6.71.203:47979
I0616 12:10:55.443111 8669 master.cpp:2344] Processing reply for offers: [ 20150511-140547-189138442-5051-8667-O9282038 ] on slave 20150304-134111-205915658-5051-1595-S0 at slave(1)@10.6.71.206:5151 (secasdb01-3) for framework 20150304-134212-222692874-5051-2300-0054 (chronos-2.3.2_mesos-0.20.1-SNAPSHOT) at scheduler-c8f2acc2-d16e-44d5-b54f-7f88d3ab39a2@10.6.70.11:57549
I0616 12:10:55.444875 8671 hierarchical_allocator_process.hpp:563] Recovered mem(*):5305; disk(*):4744; ports(*):[25001-30000] (total allocatable: mem(*):5305; disk(*):4744; ports(*):[25001-30000]) on slave 20150511-140547-189138442-5051-8667-S4 from framework 20150511-140547-189138442-5051-8667-0831
I0616 12:10:55.445121 8669 master.cpp:2344] Processing reply for offers: [ 20150511-140547-189138442-5051-8667-O9282039 ] on slave 20150511-140547-189138442-5051-8667-S2 at slave(1)@10.6.71.202:5151 (secasdb01-1) for framework 20150511-140547-189138442-5051-8667-0838 (Savings) at scheduler-8b4389df-109e-49f5-8064-dd263fbec9fe@10.6.71.202:53346
I0616 12:10:55.445971 8670 hierarchical_allocator_process.hpp:563] Recovered mem(*):6329; disk(*):5000; ports(*):[25001-30000] (total allocatable: mem(*):6329; disk(*):5000; ports(*):[25001-30000]) on slave 20150304-134111-205915658-5051-1595-S0 from framework 20150304-134212-222692874-5051-2300-0054
I0616 12:10:55.446185 8674 hierarchical_allocator_process.hpp:563] Recovered mem(*):4672; disk(*):4488; ports(*):[25001-25667, 25669-30000] (total allocatable: mem(*):4672; disk(*):4488; ports(*):[25001-25667, 25669-30000]) on slave 20150511-140547-189138442-5051-8667-S2 from framework 20150511-140547-189138442-5051-8667-0838

There are two Savings jobs and one Weather job, and they're all hung right now (all started from Chronos).
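For reference, here's a throwaway sketch that pulls the same /master/state.json the UI (and that HTTP log line) hits, so you can eyeball each framework's "name" and "resources" fields and see who is actually holding the CPU. The host and port are taken from the frameworks-tab links below and are an assumption about this setup; exact field names can vary a bit between Mesos versions.

    import java.io.InputStream;
    import java.net.URL;
    import java.util.Scanner;

    public class MasterState {
        public static void main(String[] args) throws Exception {
            // Same endpoint the Mesos web UI polls; adjust host/port for your master.
            URL url = new URL("http://intmesosmaster01:5051/master/state.json");
            try (InputStream in = url.openStream();
                 Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
                // Each entry under "frameworks" reports its name and the resources
                // currently charged to it, which should show whether Chronos is
                // sitting on offers it never releases.
                System.out.println(s.hasNext() ? s.next() : "");
            }
        }
    }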
Here's what the frameworks tab looks like in Mesos:

ID | Host | User | Name | Active Tasks | CPUs | Mem | Max Share | Registered | Re-Registered
…5051-8667-0840 <http://intmesosmaster01:5051/#/frameworks/20150511-140547-189138442-5051-8667-0840> | secasdb01-1 | mesos | Weather | 0 | 0 | 0 B | 0% | 4 hours ago | -
…5051-8667-0838 <http://intmesosmaster01:5051/#/frameworks/20150511-140547-189138442-5051-8667-0838> | secasdb01-1 | mesos | Savings | 0 | 0 | 0 B | 0% | 4 hours ago | -
…5051-8667-0831 <http://intmesosmaster01:5051/#/frameworks/20150511-140547-189138442-5051-8667-0831> | secasdb01-2 | mesos | Savings | 0 | 0 | 0 B | 0% | 7 hours ago | -
…5051-8667-0804 <http://intmesosmaster01:5051/#/frameworks/20150511-140547-189138442-5051-8667-0804> | secasdb01-1 | mesos | AlertConsumer | 1 | 3 | 1.0 GB | 50% | 20 hours ago | -
…5051-2300-0090 <http://intmesosmaster01:5051/#/frameworks/20150304-134212-222692874-5051-2300-0090> | intMesosMaster02 | mesos | marathon | 1 | 0.5 | 128 MB | 8.333% | a month ago | a month ago
…5051-2300-0054 <http://intmesosmaster01:5051/#/frameworks/20150304-134212-222692874-5051-2300-0054> | intMesosMaster01 | root | chronos-2.3.2_mesos-0.20.1-SNAPSHOT | 3 | 2.5 | 3.0 GB | 41.667% | a month ago | a month ago

It seems that the Chronos framework has reserved all the remaining CPU in the cluster but not given it to the jobs that need it (Savings and Weather). AlertConsumer is a Marathon job that's always running and is working fine.

On 16 June 2015 at 04:32, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> Did you look inside all logs? Mesos logs and executor logs?
>
> Thanks
> Best Regards
>
> On Mon, Jun 15, 2015 at 7:09 PM, Gary Ogden <gog...@gmail.com> wrote:
>
>> My Mesos cluster has 1.5 CPU and 17GB free. If I set:
>>
>> conf.set("spark.mesos.coarse", "true");
>> conf.set("spark.cores.max", "1");
>>
>> in the SparkConf object, the job will run in the mesos cluster fine.
>>
>> But if I comment out those settings above so that it defaults to fine
>> grained, the task never finishes. It just shows as 0 for everything in the
>> mesos frameworks (# of tasks, cpu, memory are all 0). There's nothing in
>> the log files anywhere as to what's going on.
>>
>> Thanks
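For completeness, the fine-grained run I described in the quoted message is just the same SparkConf without those two settings, roughly like below. The master URL and app name here are placeholders, not the real values from my jobs.

    import org.apache.spark.SparkConf;

    // Without spark.mesos.coarse / spark.cores.max, Spark falls back to Mesos
    // fine-grained mode and launches each Spark task as its own Mesos task out
    // of the offers the framework accepts.
    SparkConf conf = new SparkConf()
            .setAppName("Savings")                        // placeholder app name
            .setMaster("mesos://intmesosmaster01:5051");  // placeholder master URL
    // The coarse-grained variant (the one that works for me) just adds:
    // conf.set("spark.mesos.coarse", "true");
    // conf.set("spark.cores.max", "1");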