Re: [jira] Commented: (ZOOKEEPER-816) Detecting and diagnosing elusive bugs and faults in Zookeeper

Ivan Kelly Fri, 16 Jul 2010 08:25:39 -0700

Zookeeper's traces (i.e., logs in TRACE level) provide some information that can be helpful to understand what happened. For instance, they contain information about the clients that are connected, the operations issued, etc. However, in real deployments with many clients (say, hundreds), traces are typically turned off to avoid the high overhead that they cause. Furthermore, the data in the traces is probably not enough for our purposes because it does not include, e.g., the replies to operations or the data values.

As far as I've seen, this overhead comes in two forms, CPU and disk. CPU overhead is mostly due to formatting. Disk overhead, obviously, because tracing will fill your disk fairly quickly. Perhaps something could be done to combat both of these. To fix the formatting problem we could use a binary log format. I've seen this done in C++ but not in Java. The basic idea is that if you have TRACE("operation %x happened to %s %p", obj1, obj2, obj3); a preprocessor replaces this with TRACE(0x1234, obj1, obj2, obj3) where 0x1234 is an identifier for the trace. Then when the trace occurs a binary blob [0x1234, value of obj1, value of obj2, value of obj3] is logged. Then when the logs are pulled off the machine you run a post processor to do all the formatting and you get your full trace.
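A rough Java sketch of the binary-trace idea, just to make it concrete. Everything here is hypothetical: BinaryTrace, FORMATS, trace() and decode() are made-up names, not ZooKeeper code. It also assumes every record carries exactly one long and one string (Java has no %p, so the third argument is dropped), and the id table is written by hand where a preprocessor would generate it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a binary trace log; not part of ZooKeeper.
class BinaryTrace {
    // Table a preprocessor would generate: trace id -> original format string.
    static final Map<Integer, String> FORMATS = new HashMap<>();
    static {
        FORMATS.put(0x1234, "operation %x happened to %s");
    }

    // Hot path: append the trace id and the raw argument values.
    // No string formatting happens here.
    static void trace(ByteBuffer log, int id, long op, String target) {
        byte[] t = target.getBytes(StandardCharsets.UTF_8);
        log.putInt(id).putLong(op).putInt(t.length).put(t);
    }

    // Post-processor: read one record back and do the formatting offline.
    static String decode(ByteBuffer log) {
        int id = log.getInt();
        long op = log.getLong();
        byte[] t = new byte[log.getInt()];
        log.get(t);
        return String.format(FORMATS.get(id), op, new String(t, StandardCharsets.UTF_8));
    }
}
```

To try it: allocate a ByteBuffer, call trace() a few times, flip() the buffer, then call decode() once per record. A real version would need a per-id argument schema so that records with different argument types can be decoded.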

Regarding the disk overhead, traces are usually only interesting in the run up to a failure. We could have a ring buffer in memory that is constantly traced to, old traces being overwritten when the ring buffer reaches its limit. These traces should only be dumped to the filesystem when an error or fatal level event occurs, thereby giving you a trace of what was happening before you fell over.
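The ring buffer could look something like the sketch below. Again, this is only an illustration under assumptions: TraceRing and its methods are made-up names, entries are plain strings for readability (in practice they would be the binary records described earlier), and the caller is assumed to write the result of dump() to disk when it sees an error or fatal event.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory trace ring; not part of ZooKeeper.
class TraceRing {
    private final String[] slots;
    private long next = 0; // total number of traces recorded so far

    TraceRing(int capacity) {
        slots = new String[capacity];
    }

    // Record a trace, overwriting the oldest entry once the ring is full.
    synchronized void trace(String msg) {
        slots[(int) (next % slots.length)] = msg;
        next++;
    }

    // Snapshot of the surviving traces, oldest first. Intended to be
    // called (and the result written out) on an error or fatal event.
    synchronized List<String> dump() {
        List<String> out = new ArrayList<>();
        long start = Math.max(0, next - slots.length);
        for (long i = start; i < next; i++) {
            out.add(slots[(int) (i % slots.length)]);
        }
        return out;
    }
}
```

With a capacity of 3 and five traces recorded, dump() returns only the last three, in the order they were recorded. A single synchronized ring is the simplest choice; per-thread rings would avoid lock contention at the cost of merging the traces at dump time.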




-Ivan
