Steven Schlansker created MESOS-7085:
----------------------------------------

             Summary: Consider reducing processing of DECLINE calls log from 
info to debug
                 Key: MESOS-7085
                 URL: https://issues.apache.org/jira/browse/MESOS-7085
             Project: Mesos
          Issue Type: Improvement
          Components: master
    Affects Versions: 1.0.1
            Reporter: Steven Schlansker


The Mesos master gets resource decline messages as a normal matter of course.

It repeatedly logs the offers declined from schedulers.  This is critical 
diagnostics information, but unless your scheduler is broken or buggy, usually 
uninteresting.

In our production environment this ended up being a significant fraction of all 
logging.  One of our operators got paged:


> Checking to see what I can delete.
> 90% of the 1.6GB mesos log file is taken up by by these ( + we are also 
> outputting this to syslog ) :

> I0208 15:54:41.032714 10833 master.cpp:3951] Processing DECLINE call for 
> offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488245 ] for framework 
> Singularity (Singularity) at 
> scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844
> I0208 15:54:41.032871 10833 master.cpp:3951] Processing DECLINE call for 
> offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488246 ] for framework 
> Singularity (Singularity) at 
> scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844
> I0208 15:54:41.033025 10833 master.cpp:3951] Processing DECLINE call for 
> offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488247 ] for framework 
> Singularity (Singularity) at 
> scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844

➢  wc -l mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
6796024 mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
➢ grep -c DECLINE 
mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
5846770

It seems that this line looks scary ("DECLINE") is a scary word to an operator, 
is a huge percentage of log output, and is part of normal operation.
Should it be reduced to DEBUG?  Or could Mesos print it out in a time based 
manner?  ("654 offers declined in last 1 minute")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to