[ https://issues.apache.org/jira/browse/MESOS-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cody Maloney updated MESOS-4233: -------------------------------- Attachment: giant_port_range_logging > Logging is too verbose for sysadmins / syslog > --------------------------------------------- > > Key: MESOS-4233 > URL: https://issues.apache.org/jira/browse/MESOS-4233 > Project: Mesos > Issue Type: Epic > Reporter: Cody Maloney > Labels: mesosphere > Attachments: giant_port_range_logging > > > Currently mesos logs a lot. When launching a thousand tasks in the space of > 10 seconds it will print tens of thousands of log lines, overwhelming syslog > (there is a max rate at which a process can send stuff over a unix socket) > and not giving useful information to a sysadmin who cares about just the > high-level activity and when something goes wrong. > Note mesos also blocks writing to its log locations, so when writing a lot of > log messages, it can fill up the write buffer in the kernel, and be suspended > until the syslog agent catches up reading from the socket (GLOG does a > blocking fwrite to stderr). GLOG also has a big mutex around logging so only > one thing logs at a time. > While for "internal debugging" it is useful to see things like "message went > from internal compoent x to internal component y", from a sysadmin > perspective I only care about the high level actions taken (launched task for > framework x), sent offer to framework y, got task failed from host z. Note > those are what I'd expect at the "INFO" level. At the "WARNING" level I'd > expect very little to be logged / almost nothing in normal operation. Just > things like "WARN: Repliacted log write took longer than expected". WARN > would also get things like backtraces on crashes and abnormal exits / abort. > When trying to launch 3k+ tasks inside a second, mesos logging currently > overwhelms syslog with 100k+ messages, many of which are thousands of bytes. > Sysadmins expect to be able to use syslog to monitor basic events in their > system. This is too much. > We can keep logging the messages to files, but the logging to stderr needs to > be reduced significantly (stderr gets picked up and forwarded to syslog / > central aggregation). > What I would like is if I can set the stderr logging level to be different / > independent from the file logging level (Syslog giving the "sysadmin" > aggregated overview, files useful for debugging in depth what happened in a > cluster). A lot of what mesos currently logs at info is really debugging info > / should show up as debug log level. > Some samples of mesos logging a lot more than a sysadmin would want / expect > are attached, and some are below: > Every task gets printed multiple times for a basic launch: > {noformat} > There are also things like every task gets printed multiple times when > launched (Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal > mesos-master[1311]: I1215 22:58:29.382644 1315 master.cpp:3248] Launching > task envy.5b19a713-a37f-11e5-8b3e-0251692d6109 of framework > 5178f46d-71d6-422f-922c-5bbe82dff9cc-0000 (marathon) > Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: > I1215 22:58:29.382925 1315 master.hpp:176] Adding task > envy.5b1958f2-a37f-11e5-8b3e-0251692d6109 with resources cpus(*):0.0001; > mem(*):16; ports(*):[14047-14047] > {noformat} > Every task status update prints many log lines, successful ones are part of > normal operation and maybe should be logged at info / debug levels, but not > to a sysadmin (Just show when things fail, and maybe aggregate counters to > tell of the volume of working) > No log messagse should be really big / more than 1k characters (Would prevent > the giant port list attached, make that easily discoverable / bug filable / > fixable) -- This message was sent by Atlassian JIRA (v6.3.4#6332)