On the master node, I see this printed over and over in the
mesos-master.WARNING log file:
W0615 06:06:51.211262 8672 hierarchical_allocator_process.hpp:589] Using
the default value of 'refuse_seconds' to create the refused resources
filter because the input value is negative
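For context, that warning fires when a framework replies to an offer with a negative refuse_seconds in its Filters message, in which case the master just substitutes the default. On the scheduler side the value comes from the Filters passed when declining an offer; a minimal sketch using the Mesos Java API (the surrounding scheduler code and the 10-second value are just for illustration, not from this setup):

    import org.apache.mesos.Protos;
    import org.apache.mesos.SchedulerDriver;

    // Called from a Scheduler.resourceOffers() callback: decline an offer
    // the framework can't use, asking the master not to re-offer those
    // resources for 10 seconds. Passing a negative refuse_seconds here is
    // exactly what triggers the WARNING above.
    class OfferDecliner {
        static void declineUnusable(SchedulerDriver driver, Protos.Offer offer) {
            Protos.Filters filters = Protos.Filters.newBuilder()
                    .setRefuseSeconds(10.0)  // must be non-negative
                    .build();
            driver.declineOffer(offer.getId(), filters);
        }
    }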
Here's what I see in the master INFO file:
I0616 12:10:55.040024 8674 http.cpp:478] HTTP request for
'/master/state.json'
I0616 12:10:55.425833 8669 master.cpp:3843] Sending 1 offers to framework
20150511-140547-189138442-5051-8667-0831 (Savings) at
scheduler-5a5e99d4-5e16-4a48-94d5-86f751615a04@10.6.71.203:47979
I0616 12:10:55.438303 8669 master.cpp:3843] Sending 1 offers to framework
20150304-134212-222692874-5051-2300-0054
(chronos-2.3.2_mesos-0.20.1-SNAPSHOT) at
scheduler-c8f2acc2-d16e-44d5-b54f-7f88d3ab39a2@10.6.70.11:57549
I0616 12:10:55.441295 8669 master.cpp:3843] Sending 1 offers to framework
20150511-140547-189138442-5051-8667-0838 (Savings) at
scheduler-8b4389df-109e-49f5-8064-dd263fbec9fe@10.6.71.202:53346
I0616 12:10:55.442204 8669 master.cpp:2344] Processing reply for offers: [
20150511-140547-189138442-5051-8667-O9282037 ] on slave
20150511-140547-189138442-5051-8667-S4 at slave(1)@10.6.71.203:5151
(secasdb01-2) for framework 20150511-140547-189138442-5051-8667-0831
(Savings) at
scheduler-5a5e99d4-5e16-4a48-94d5-86f751615a04@10.6.71.203:47979
I0616 12:10:55.443111 8669 master.cpp:2344] Processing reply for offers: [
20150511-140547-189138442-5051-8667-O9282038 ] on slave
20150304-134111-205915658-5051-1595-S0 at slave(1)@10.6.71.206:5151
(secasdb01-3) for framework 20150304-134212-222692874-5051-2300-0054
(chronos-2.3.2_mesos-0.20.1-SNAPSHOT) at
scheduler-c8f2acc2-d16e-44d5-b54f-7f88d3ab39a2@10.6.70.11:57549
I0616 12:10:55.444875 8671 hierarchical_allocator_process.hpp:563]
Recovered mem(*):5305; disk(*):4744; ports(*):[25001-3] (total
allocatable: mem(*):5305; disk(*):4744; ports(*):[25001-3]) on slave
20150511-140547-189138442-5051-8667-S4 from framework
20150511-140547-189138442-5051-8667-0831
I0616 12:10:55.445121 8669 master.cpp:2344] Processing reply for offers: [
20150511-140547-189138442-5051-8667-O9282039 ] on slave
20150511-140547-189138442-5051-8667-S2 at slave(1)@10.6.71.202:5151
(secasdb01-1) for framework 20150511-140547-189138442-5051-8667-0838
(Savings) at
scheduler-8b4389df-109e-49f5-8064-dd263fbec9fe@10.6.71.202:53346
I0616 12:10:55.445971 8670 hierarchical_allocator_process.hpp:563]
Recovered mem(*):6329; disk(*):5000; ports(*):[25001-3] (total
allocatable: mem(*):6329; disk(*):5000; ports(*):[25001-3]) on slave
20150304-134111-205915658-5051-1595-S0 from framework
20150304-134212-222692874-5051-2300-0054
I0616 12:10:55.446185 8674 hierarchical_allocator_process.hpp:563]
Recovered mem(*):4672; disk(*):4488; ports(*):[25001-25667, 25669-3]
(total allocatable: mem(*):4672; disk(*):4488; ports(*):[25001-25667,
25669-3]) on slave 20150511-140547-189138442-5051-8667-S2 from
framework 20150511-140547-189138442-5051-8667-0838
There are two Savings jobs and one Weather job, and they're all hung right
now (all started from Chronos).
Here's what the frameworks tab looks like in the Mesos UI:
ID                                        Host              User   Name                                 Active Tasks  CPUs  Mem     Max Share  Registered    Re-Registered
20150511-140547-189138442-5051-8667-0840  secasdb01-1       mesos  Weather                              0             0     0 B     0%         4 hours ago   -
20150511-140547-189138442-5051-8667-0838  secasdb01-1       mesos  Savings                              0             0     0 B     0%         4 hours ago   -
20150511-140547-189138442-5051-8667-0831  secasdb01-2       mesos  Savings                              0             0     0 B     0%         7 hours ago   -
20150511-140547-189138442-5051-8667-0804  secasdb01-1       mesos  AlertConsumer                        1             3     1.0 GB  50%        20 hours ago  -
20150304-134212-222692874-5051-2300-0090  intMesosMaster02  mesos  marathon                             1             0.5   128 MB  8.333%     a month ago   a month ago
20150304-134212-222692874-5051-2300-0054  intMesosMaster01  root   chronos-2.3.2_mesos-0.20.1-SNAPSHOT  3             2.5   3.0 GB  41.667%    a month ago   a month ago
It seems that the Chronos framework has reserved all the remaining CPU in
the cluster but hasn't given it to the jobs that need it (Savings and
Weather); consistent with that, the offers being declined above contain
mem, disk and ports but no cpus at all.
AlertConsumer is a Marathon job that's always running and is working fine.
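One way to confirm how much CPU is actually left per slave is to pull the
master's /master/state.json (the same endpoint being polled in the INFO log
above) and diff resources against used_resources. A rough sketch, assuming
the org.json library, the intmesosmaster01:5051 address from the UI links,
and a Mesos version that already reports used_resources per slave:

    import java.io.InputStream;
    import java.net.URL;
    import java.util.Scanner;
    import org.json.JSONArray;
    import org.json.JSONObject;

    // Prints free vs. total cpus for every slave known to the master.
    public class FreeCpus {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://intmesosmaster01:5051/master/state.json");
            try (InputStream in = url.openStream();
                 Scanner s = new Scanner(in).useDelimiter("\\A")) {
                JSONObject state = new JSONObject(s.next());
                JSONArray slaves = state.getJSONArray("slaves");
                for (int i = 0; i < slaves.length(); i++) {
                    JSONObject slave = slaves.getJSONObject(i);
                    double total = slave.getJSONObject("resources").getDouble("cpus");
                    double used = slave.getJSONObject("used_resources").getDouble("cpus");
                    System.out.printf("%s: %.1f of %.1f cpus free%n",
                            slave.getString("hostname"), total - used, total);
                }
            }
        }
    }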
On 16 June 2015 at 04:32, Akhil Das ak...@sigmoidanalytics.com wrote:
Did you look inside all logs? Mesos logs and executor logs?
Thanks
Best Regards
On Mon, Jun 15, 2015 at 7:09 PM, Gary Ogden gog...@gmail.com wrote:
My Mesos cluster has 1.5 CPUs and 17 GB free. If I set:
conf.set("spark.mesos.coarse", "true");
conf.set("spark.cores.max", "1");
in the SparkConf object, the job will run fine in the Mesos cluster.
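(For reference, the full driver setup looks roughly like this; the app
name, master URL and the toy job are placeholders, not my real code:)

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CoarseGrainedJob {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("Savings")                        // placeholder name
                    .setMaster("mesos://intmesosmaster01:5050");  // placeholder; use your master's host:port or a zk:// URL
            conf.set("spark.mesos.coarse", "true");  // one long-lived Mesos executor per slave
            conf.set("spark.cores.max", "1");        // cap the total cores the app may hold
            // Leaving spark.mesos.coarse unset (or "false") gives the default
            // fine-grained mode, where each Spark task runs as its own Mesos task.
            JavaSparkContext sc = new JavaSparkContext(conf);
            long n = sc.parallelize(Arrays.asList(1, 2, 3)).count();
            System.out.println("count = " + n);
            sc.stop();
        }
    }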
But if I comment out those settings above so that it defaults to
fine-grained mode, the task never finishes. It just shows 0 for everything
in the Mesos frameworks tab (# of active tasks, CPUs and memory).