[ https://issues.apache.org/jira/browse/MESOS-7085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Schlansker updated MESOS-7085:
-------------------------------------
    Description: 
The Mesos master receives resource decline messages as a matter of course.

It repeatedly logs the offers declined by schedulers.  This can be critical 
diagnostic information, but unless your scheduler is broken or buggy, it is 
usually uninteresting.

In our production environment this ended up being a significant fraction of all 
logging.  One of our operators got paged:


> Checking to see what I can delete.
> 90% of the 1.6GB mesos log file is taken up by these (we are also 
> outputting this to syslog):

> I0208 15:54:41.032714 10833 master.cpp:3951] Processing DECLINE call for offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488245 ] for framework Singularity (Singularity) at scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844
> I0208 15:54:41.032871 10833 master.cpp:3951] Processing DECLINE call for offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488246 ] for framework Singularity (Singularity) at scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844
> I0208 15:54:41.033025 10833 master.cpp:3951] Processing DECLINE call for offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488247 ] for framework Singularity (Singularity) at scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844

➢ wc -l mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
6796024 mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
➢ grep -c DECLINE mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
5846770
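
That is 5,846,770 DECLINE lines out of 6,796,024 total, roughly 86% of the 
log by line count.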

This line looks scary ("DECLINE" is a scary word to an operator), accounts 
for a huge percentage of log output, and is part of normal operation.
Should it be reduced to DEBUG?  Or could Mesos log it in a time-based, 
aggregated manner?  ("654 offers declined in last 1 minute")
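
For illustration, a minimal sketch of both options using the glog macros the 
master already logs with.  processDecline() and its parameters are 
hypothetical stand-ins, not the actual master.cpp code:

#include <glog/logging.h>

#include <string>

// Hypothetical stand-in for the DECLINE handling path in master.cpp.
void processDecline(const std::string& offerId, const std::string& framework)
{
  // Option 1: demote the per-offer line to a verbose level. It is then
  // only emitted when the master runs at higher verbosity (e.g. GLOG_v=1).
  VLOG(1) << "Processing DECLINE call for offers: [ " << offerId
          << " ] for framework " << framework;

  // Option 2: sample the line at INFO. LOG_EVERY_N emits only every Nth
  // occurrence; google::COUNTER is the running occurrence count.
  LOG_EVERY_N(INFO, 1000)
      << "Processed " << google::COUNTER << " DECLINE calls; latest offer "
      << offerId << " from framework " << framework;
}

A true time-based summary along the lines of "654 offers declined in last 
1 minute" would need a small counter flushed by a timer, since LOG_EVERY_N 
counts occurrences rather than elapsed time.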



> Consider reducing processing of DECLINE calls log from info to debug
> --------------------------------------------------------------------
>
>                 Key: MESOS-7085
>                 URL: https://issues.apache.org/jira/browse/MESOS-7085
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 1.0.1
>            Reporter: Steven Schlansker
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
