Re: Logging braindump
I did some rough benchmarking (N entries over M threads) on the current code vs wip-log2 (my naive implementation), which:
- allocates an Entry struct on the heap, with 80 chars preallocated
- writes into a streambuf over that buffer, overflowing into a std::string on the heap
- takes a mutex and adds it to a linked list, with an async thread to write it out

For 1M lines, on a 16-core system, -O2:

threads   old code   new (total)   new (queue)   no queue
      1        4.7          ~7.5          1.4        1.3
     10        8.0          ~8            1.54       1.0
     16        7.8          ~8            1.62       1.1
    100        5.1          ~8            1.75       1.34

A few notes:
- The 'new (queue)' column is just the time to queue 1M events; the total includes a call to log.flush(). Normally that would happen in another thread, of course, so it's only interesting as a rough measure of total work.
- Commenting out the mutex on submit_entry with one thread shaves off .1 seconds or so... about the same as the 'no queue' case.
- Preallocating entries instead of calling new only saves about .1 seconds.
- Using PreallocatedStreambuf instead of an ostringstream saves about .4 seconds in the single-thread case (if we fit).
- Overflowing the buffer takes about .1 seconds longer... still better than the ostringstream. It depends a bit on _how_ much we overflow, though: it grows by powers of two over the original prealloc size, so we may have to realloc multiple times. This is probably what ostringstream is doing anyway.

My thoughts:
- I don't think we care too much about the concurrency issue (yet).
- I think we only care about queueing time, since that is what slows down running code. Having an async thread write to disk will incur some background load, but in reality we won't be writing 1M records in 10 seconds.
- That is especially true since we will probably collect lots of log data that we never write.
- Most of the work so far adds infrastructure to control dual log levels and to clean up the logging interface, making it easy to implement different functionality. I think it's worth merging as-is.
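The PreallocatedStreambuf idea above could look roughly like this (the name and the 80-byte inline size come from the mail; the implementation details here are an illustrative sketch, not the wip-log2 code): format into a fixed inline buffer, and only fall back to a heap std::string when the message does not fit.

```cpp
#include <ostream>
#include <streambuf>
#include <string>

// Sketch: a streambuf that writes into a preallocated 80-char buffer,
// spilling into a heap std::string only on overflow, so the common
// 40-80 byte message never allocates.
class PreallocStreambuf : public std::streambuf {
public:
  PreallocStreambuf() { setp(buf_, buf_ + sizeof(buf_)); }

  // Reassemble the full message (spilled prefix + current put area).
  std::string str() const {
    return spill_ + std::string(pbase(), pptr() - pbase());
  }

protected:
  // Called by the ostream machinery when the inline buffer is full.
  int_type overflow(int_type c) override {
    // Move everything written so far into the heap string (which grows
    // by its own doubling policy), then reuse the inline space.
    spill_.append(pbase(), pptr() - pbase());
    if (c != traits_type::eof())
      spill_.push_back(traits_type::to_char_type(c));
    setp(buf_, buf_ + sizeof(buf_));
    return traits_type::not_eof(c);
  }

private:
  char buf_[80];      // tune this: covers the common short message
  std::string spill_; // only touched on overflow
};
```

A call site would wrap it in a std::ostream and stream into it as usual; str() yields the rendered entry whether or not it spilled.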
- We'll want to tune the amount of space preallocated in each Entry.

Missing pieces:
- Need to throttle when writeout gets behind to cap total memory usage.

Pushing to wip-log branch for review.

sage

On Thu, 22 Mar 2012, Sage Weil wrote:
On Thu, 22 Mar 2012, Mark Kampe wrote:
On 03/22/12 09:38, Colin McCabe wrote:
On Mon, Mar 19, 2012 at 1:53 PM, Tommi Virtanen tommi.virta...@dreamhost.com wrote:
[mmap'ed buffer discussion]

I always thought mmap'ed circular buffers were an elegant approach for getting data that survived a process crash, but without paying the overhead of write(2) and read(2). The main problem is that you need special tools to read the circular buffer files off of the disk. As Sage commented, that is probably undesirable for many users.

(a) I actually favor not simply mmaping the circular buffer, but having a program that pulls the data out of memory and writes it to disk (a la Varnish). In addition to doing huge writes (greatly reducing the write overhead), it can filter what it processes, so that we have extensive logging for the last few seconds, and more manageable logs on disk extending farther back in time (modulo log rotation).

The code in wip-log does this (dumps high-level messages unless/until there is a reason to dump it all). Well, more importantly, it sets up the infrastructure to control those logging levels (the actual logging code is very short and simple). I don't see the value in putting this in a separate process, though (as opposed to a separate thread); that just seems like it would complicate ops...

(b) The most interesting logs are probably the ones in coredumps (that didn't make it out to disk), for which we want a crawler/extractor anyway. It probably isn't very hard to make the program that extracts logs from memory also be able to pick the pockets of dead bodies (put a big self-identifying header on the front of each buffer).
Note also that having the ability to extract the logs from a coredump pretty much eliminates any motivation to flush log entries out to disk promptly/expensively. If the process exits cleanly, we'll get the logs. If the process produces a coredump, we'll still get the logs.

This mostly seems like an argument to keep the in-memory layout simple, to keep the post-mortem code simple. But at worst it's a complexity/performance tradeoff... and, if we're lucky, there's no difficult choice to make. Honestly, I don't think the overhead is where we think it is. But we need to do some real testing to see.

(c) I have always loved text logs
Re: Logging braindump
This is well-timed... I just started playing with this code on the flight over here. There are a couple pieces to what I've done so far:
- clearly segmented logging code into a new set of classes in src/log, without dependencies on everything else
- set up config infrastructure to control 'log' and 'gather' levels for different subsystems, where gather controls what messages are generated, and log controls which of those reach disk (in the non-crash situation)
- made a relatively naive implementation that works (messages on heap, separate thread to write out, dump recent history on crash)
- a lot of the weird interaction/bootstrapping stuff with contexts and config now goes away entirely if we just rely on cerr for errors during config parsing, etc. (like that simple spinlock Greg was complaining about)

MY RECOMMENDATIONS [biased, as always ;-]
- bundle the MessagePack library

I'm worried about requiring tools to read logs. And I think most of the current overhead is the sync writes (which are easily addressed). But string serialization is also a significant chunk, I think, so this may be worth it... need to do some real tests.

- in thread that calls log: serialize as MessagePack onto stack, allocate needed bytes from ringbuffer, copy event to ringbuffer
- write to disk is now very simple, could even be done in a different process (mmap header+ringbuffer)

My naive implementation puts entries on the heap and links them into a list: we can't easily extract them from a core, but otherwise it's equivalent to a ring of pointers. My limited tests showed std::string assignment had a huge overhead, though, so I suspect it's worth writing a stringbuf implementation that writes into cheaply preallocated buffers (maybe allocated off a large preallocated ring buffer), optimized for the common case (40-80 byte message). But I need to read more of the Disruptor stuff to see what exactly they're doing and why. In any case, the new structure makes it easier to test implementations, given things are better contained.
- let disk files be named after the timestamp they were started at, start new ones based on time/size (no .0 -> .1 -> .2 renaming needed)
- make it really simple to process+delete chunks of log, feeding them into Brisk or Graylog, then deleting from the node (perhaps after a delay, so the last 24h is locally browseable) (and don't remove things that haven't been processed)

Currently that's done with rename, SIGHUP, wait for the new log to appear, then process+delete. Are there problems there that make it worth taking a non-standard approach to the log file naming?

Anyway, pushing what I have now to wip-log. The exposed interface is just the macro in common/dout.h now. Alternative approaches will probably change log/Entry.h and probably the allocation steps, but otherwise be pretty simple to swap in/out.

The main piece that is still missing is an interface for logging structured content. I suspect the way to do that is by putting stuff in the ostream stream (a la std::setiosflags(foo) and their ilk) so that you can easily stick key/value pairs in the code (... log::kv(foo, bar) ... or something)? Hopefully that can reduce to something efficient, at least, so we don't have to replace the current dout logging sites with something completely different.

sage
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
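The log::kv idea floated above could be sketched like this (hypothetical code; log::kv does not exist in the tree, and kv_pair is a made-up name): a small function returning a streamable object, so existing dout-style sites can embed key/value fields inline without changing the macro.

```cpp
#include <ostream>
#include <sstream>
#include <string>

namespace log {

// A rendered key/value pair. A structured backend could intercept this
// type instead of formatting text; the plain-text path just prints it.
struct kv_pair {
  std::string key;
  std::string val;
};

// log::kv(key, value) -- matches the call syntax floated in the mail.
template <typename V>
kv_pair kv(const std::string& key, const V& value) {
  std::ostringstream ss;
  ss << value;  // render the value with the usual ostream machinery
  return kv_pair{key, ss.str()};
}

inline std::ostream& operator<<(std::ostream& os, const kv_pair& p) {
  return os << p.key << '=' << p.val;
}

}  // namespace log
```

A call site would then read `... << "pg " << log::kv("epoch", e) << ...`, and the text output degrades gracefully to key=value.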
Re: Logging braindump
On Mon, Mar 19, 2012 at 1:53 PM, Tommi Virtanen tommi.virta...@dreamhost.com wrote:
[mmap'ed buffer discussion]

I always thought mmap'ed circular buffers were an elegant approach for getting data that survived a process crash, but without paying the overhead of write(2) and read(2). The main problem is that you need special tools to read the circular buffer files off of the disk. As Sage commented, that is probably undesirable for many users.

An mmap'ed buffer, even a lockless one, is a simple beast. Do you really need a whole library just for that? Maybe I'm just old-fashioned.

DISK FORMAT
- GELF: compressed JSON with specific fields: https://github.com/Graylog2/graylog2-docs/wiki/GELF
- Google Protocol Buffers: considered clumsy these days (code generation from IDL etc); only Google has significant investment in the format

Hadoop also has a significant investment in Google Protocol Buffers. It's almost the only thing we send over the wire these days. I don't think this affects your decision about logging at all, though.

- Thrift: considered clumsy these days (code generation from IDL etc); only Facebook has significant investment in the format
- BSON: sort of close to a binary encoding of JSON + extra data types, not a huge improvement in speed/space... http://bsonspec.org/
- Avro: Apache-sponsored data format, nicely self-describing, apparently slow? http://avro.apache.org/
- MessagePack: binary encoding for JSON, claims to beat the others in speed... http://msgpack.org/

I'm not sure why you consider code generation from a schema file clumsy. To me, the whole point of structured logging is that:
1. it is more efficient in terms of space/speed, partly because it doesn't need to be self-describing, and
2. it is a stable ABI that binds the system together

The sticking point for most programmers is point #2. They don't want to commit to a log format ahead of time because it slows down development. (Pretty much the same reason Linus doesn't want a stable in-kernel ABI.)
In my opinion, that's the reason why structured logging has not made much headway in the world at large. Journald is an interesting experiment -- we'll see how many programmers carefully design structured journald log messages, and how many just output a single field called "text" or something similar. :)

If you just want better efficiency, there are a lot of ways to get that which don't involve changing the Ceph logging APIs. Probably the easiest one is just cutting down on hugely verbose (as in more than 80 character) log messages. There's no shortage of those.

For an unrelated project, I have been considering a system where all logs are generated and go to a circular buffer, but are not sent to permanent storage until/unless there is a crash, or the particular log message type is enabled. It seems like something like this is a good compromise between too much logging and not enough.

Colin

And all of these can be compressed with e.g. Snappy as they flow to disk. http://code.google.com/p/snappy/

Downside of all but JSON: we'd need to bundle the library -- distro support just isn't there yet.

Should the disk format be binary? That makes it less friendly to the admin. I'm not sure which way to go. JSON is simpler and friendlier; e.g. MessagePack has an identical data model but is faster and takes less space. Some options:
a. make it configurable so simple installations don't need to suffer binary logs
b.
just pick one and stick with it

QUERYING / ANALYSIS
- use a format from above that is mapreduce-friendly, or can be 1:1 imported into another storage system
- software like Graylog may be of use, but I fear we'll overwhelm it with events: http://graylog2.org/
- Cassandra's Brisk is a really easy way to run SQL-like Hive queries over structured data, and has a design that'll ingest any amount of data, Just Add Hardware(tm): http://www.datastax.com/docs/0.8/brisk/index
- the standards process is churning out things like CEE, but I'm not holding my breath: http://cee.mitre.org/

MY RECOMMENDATIONS [biased, as always ;-]
- bundle the MessagePack library
- in the thread that calls log: serialize as MessagePack onto the stack, allocate the needed bytes from the ringbuffer, copy the event to the ringbuffer
- write to disk is now very simple, could even be done in a different process (mmap header+ringbuffer)
- let disk files be named after the timestamp they were started at, start new ones based on time/size (no .0 -> .1 -> .2 renaming needed)
- make it really simple to process+delete chunks of log, feeding them into Brisk or Graylog, then deleting from the node (perhaps after a delay, so the last 24h is locally browseable) (and don't remove things that haven't been processed)

Hope that made sense. Let's talk more, especially if it didn't ;)
Re: Logging braindump
On 03/22/12 09:38, Colin McCabe wrote:
On Mon, Mar 19, 2012 at 1:53 PM, Tommi Virtanen tommi.virta...@dreamhost.com wrote:
[mmap'ed buffer discussion]

I always thought mmap'ed circular buffers were an elegant approach for getting data that survived a process crash, but without paying the overhead of write(2) and read(2). The main problem is that you need special tools to read the circular buffer files off of the disk. As Sage commented, that is probably undesirable for many users.

(a) I actually favor not simply mmaping the circular buffer, but having a program that pulls the data out of memory and writes it to disk (a la Varnish). In addition to doing huge writes (greatly reducing the write overhead), it can filter what it processes, so that we have extensive logging for the last few seconds, and more manageable logs on disk extending farther back in time (modulo log rotation).

(b) The most interesting logs are probably the ones in coredumps (that didn't make it out to disk), for which we want a crawler/extractor anyway. It probably isn't very hard to make the program that extracts logs from memory also be able to pick the pockets of dead bodies (put a big self-identifying header on the front of each buffer). Note also that having the ability to extract the logs from a coredump pretty much eliminates any motivation to flush log entries out to disk promptly/expensively. If the process exits cleanly, we'll get the logs. If the process produces a coredump, we'll still get the logs.

(c) I have always loved text logs that I can directly view. Their immediate and effortless accessibility encourages their use, which encourages work in optimizing their content (lots of the stuff you need, and little else). But binary logs are less than half the size (cheaper to take and keep twice as much info), and a program that formats them can take arguments about which records/fields you want and how you want them formatted ...
and getting the output the way you want it (whether for browsing or subsequent reprocessing) is a huge win. You get used to running the log processing command quickly, but the benefits ...

(d) If somebody really wants text logs for archival, it is completely trivial to run the output of the log-extractor through the formatter before writing it to disk ... so the in-memory format need not be tied to the on-disk format. The rotation code won't care.

An mmap'ed buffer, even a lockless one, is a simple beast. Do you really need a whole library just for that? Maybe I'm just old-fashioned.

IMHO, surprisingly few things involving large numbers of performance-critical threads turn out to be simple :-) For example:

If we are logging a lot, buffer management has the potential to become a bottleneck ... so we need to be able to allocate a record of the required size from the circular buffer with atomic instructions (at least in non-wrap situations).

But if records are allocated and then filled, we have to consider how to handle the case where the filling is delayed, and the reader catches up with an incomplete log record (e.g. skip it, wait how long, ???).

And while we hope this will never happen, we have to deal with what happens when the writer catches up with the reader, or worse, an incomplete log block ... where we might have to determine whether or not the owner is deceased (making it safe to break his record lock) ... or should we simply take down the service at that point (on the assumption that something has gone very wrong)?

If we are going to use multiple buffers, we may have to do a transaction dance (the last guy in has to close this buffer to new writes, start a new one, and somebody has to wait for pending additions to complete, queue this one for delivery, or perhaps even flush it to disk if we don't have some other thread/process doing this).
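The allocate-then-fill hazard described here is usually handled with a per-record commit flag: a writer claims a region with one compare-and-swap, fills it, then publishes it; a reader that reaches an unpublished record knows the writer is still working. A minimal sketch (hypothetical code; wrap-around, alignment padding, and the reader's skip/wait policy are deliberately omitted):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <new>

// Each record starts with its size and a "committed" flag the reader
// can poll: 0 = still being filled, 1 = complete.
struct RecordHeader {
  uint32_t size;                    // payload bytes following the header
  std::atomic<uint32_t> committed;
};

class RingLog {
public:
  RingLog(char* mem, size_t bytes) : buf_(mem), size_(bytes) {}

  // Claim space with a single CAS on the head offset. Returns nullptr
  // when the buffer would wrap (the hard case discussed in the mail).
  RecordHeader* claim(uint32_t payload) {
    uint32_t need = sizeof(RecordHeader) + payload;  // padding omitted
    uint64_t old_head = head_.load(std::memory_order_relaxed);
    for (;;) {
      if (old_head + need > size_) return nullptr;   // wrap: not handled
      if (head_.compare_exchange_weak(old_head, old_head + need,
                                      std::memory_order_acq_rel))
        break;  // old_head now names our region exclusively
    }
    return new (buf_ + old_head) RecordHeader{payload, {0}};
  }

  // Publish after the payload is filled; release order makes the
  // payload visible to any reader that observes committed == 1.
  static void commit(RecordHeader* h) {
    h->committed.store(1, std::memory_order_release);
  }

private:
  char* buf_;
  size_t size_;
  std::atomic<uint64_t> head_{0};
};
```

The reader's policy question from the mail (skip, wait, declare the owner dead) then reduces to what to do when committed stays 0 too long.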
Re: Logging braindump
On Thu, Mar 22, 2012 at 11:17, Mark Kampe mark.ka...@dreamhost.com wrote:

(a) I actually favor not simply mmaping the circular buffer, but having a program that pulls the data out of memory and writes it to disk (a la Varnish). In addition to doing huge writes (greatly reducing the write overhead), it can filter what it processes, so that we have extensive logging for the last few seconds, and more manageable logs on disk extending farther back in time (modulo log rotation).

varnishlog communicates via an mmap'ed ringbuffer. These are not in conflict.

(b) The most interesting logs are probably the ones in coredumps (that didn't make it out to disk), for which we want a crawler/extractor anyway. It probably isn't very hard to make the program that extracts logs from memory also be able to pick the pockets of dead bodies (put a big self-identifying header on the front of each buffer).

A single memory area for this makes it easier... that's why I like the ringbuffer.

If we are logging a lot, buffer management has the potential to become a bottleneck ... so we need to be able to allocate a record of the required size from the circular buffer with atomic instructions (at least in non-wrap situations). But if records are allocated and then filled, we have to consider how to handle the case where the filling is delayed, and the reader catches up with an incomplete log record (e.g. skip it, wait how long, ???). And while we hope this will never happen, we have to deal with what happens when the writer catches up with the reader, or worse, an incomplete log block ... where we might have to determine whether or not the owner is deceased (making it safe to break his record lock) ... or should we simply take down the service at that point (on the assumption that something has gone very wrong)?

The Disruptor design handles all of these, is simple in the sense of "that's what I would have built", and looks good.
Re: Logging braindump
On Thu, 22 Mar 2012, Tommi Virtanen wrote:

If we are logging a lot, buffer management has the potential to become a bottleneck ... so we need to be able to allocate a record of the required size from the circular buffer with atomic instructions (at least in non-wrap situations). But if records are allocated and then filled, we have to consider how to handle the case where the filling is delayed, and the reader catches up with an incomplete log record (e.g. skip it, wait how long, ???). And while we hope this will never happen, we have to deal with what happens when the writer catches up with the reader, or worse, an incomplete log block ... where we might have to determine whether or not the owner is deceased (making it safe to break his record lock) ... or should we simply take down the service at that point (on the assumption that something has gone very wrong)?

The Disruptor design handles all of these, is simple in the sense of "that's what I would have built", and looks good.

My only problem with the Disruptor stuff was that, as I was reading it, it was very much like "yeah, given the limitations of Java, that's what you would do", but we're in a slightly different boat. I.e., they use a ringbuffer of pointers to preallocated objects. My guess is that the best bet would be preallocated Entry objects (either in a flat buffer or on the heap) with a preallocated per-entry buffer (say, 80 chars) that will spill over into something slow/simple when necessary, and something Disruptor-like to claim slots in the ringbuffer.

But in any case, I think the key is first measuring how much time we spend:
- rendering the current entries
- queueing each entry under varying levels of concurrency

With the current code, for instance, I think most time is spent converting crap into strings and waiting for a blocking flush. We aren't logging millions of items, only hundreds.

Ignoring the nitty gritty of the log queueing, though... does the basic framework make sense?
That is:
- a set of predefined subsystems, each with their own log levels
- a level to control which entries are gathered/rendered, with a fast conditional check (in, say, the dout macro)
- a level to control which entries are logged
- a dump on crash (or other event) of everything we have

?

sage
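The dual-level framework sketched above could look roughly like this (all names here -- Subsys, SubsysLevels, should_gather, should_log -- are made up for illustration; the real interface in wip-log is the macro in common/dout.h): the hot-path check is one relaxed atomic load and a compare, so disabled messages cost almost nothing.

```cpp
#include <atomic>

// Per-subsystem dual levels: 'gather' controls what gets rendered into
// the in-memory ring, 'log' controls what reaches disk (non-crash case).
enum Subsys { SUBSYS_OSD, SUBSYS_MON, SUBSYS_MAX };

struct SubsysLevels {
  std::atomic<int> gather{5};  // generate entries at or below this level
  std::atomic<int> log{1};     // write entries at or below this to disk
};

static SubsysLevels g_levels[SUBSYS_MAX];

// Fast conditional: call sites skip all formatting work when the
// message's level is above the subsystem's gather level.
inline bool should_gather(Subsys s, int level) {
  return level <= g_levels[s].gather.load(std::memory_order_relaxed);
}

// The writeout thread applies the second filter when draining the ring;
// on crash, everything still in the ring can be dumped regardless.
inline bool should_log(Subsys s, int level) {
  return level <= g_levels[s].log.load(std::memory_order_relaxed);
}
```

A dout-style macro would wrap should_gather() around the streaming expression, so that levels can be raised at runtime without restarting the daemon.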
Re: Logging braindump
On Thu, Mar 22, 2012 at 11:17 AM, Mark Kampe mark.ka...@dreamhost.com wrote:
On 03/22/12 09:38, Colin McCabe wrote:

An mmap'ed buffer, even a lockless one, is a simple beast. Do you really need a whole library just for that? Maybe I'm just old-fashioned.

IMHO, surprisingly few things involving large numbers of performance-critical threads turn out to be simple :-) For example: if we are logging a lot, buffer management has the potential to become a bottleneck ... so we need to be able to allocate a record of the required size from the circular buffer with atomic instructions (at least in non-wrap situations). But if records are allocated and then filled, we have to consider how to handle the case where the filling is delayed, and the reader catches up with an incomplete log record (e.g. skip it, wait how long, ???).

The easiest solution is to store pointers in the circular buffer rather than directly writing to it. Another solution would be epoch numbers that increment each time the log buffer wraps around.

C.
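The epoch-number suggestion can be sketched as follows (a hypothetical interpretation, not code from the thread): each fixed-size slot is stamped with the wrap count at which it was filled, so a reader can tell lock-free whether a slot holds the record it expects or has been lapped/is mid-fill.

```cpp
#include <atomic>
#include <cstdint>

const uint32_t SLOTS = 1024;  // power of two keeps the math cheap

struct Slot {
  std::atomic<uint64_t> epoch{0};  // wrap count + 1 when filled; 0 = never
  char payload[64];
};

struct EpochRing {
  Slot slots[SLOTS];
  std::atomic<uint64_t> seq{0};  // next global sequence number

  // Writer: claim a sequence number; the slot index and expected epoch
  // both derive from it, so no separate bookkeeping is needed.
  Slot* claim(uint64_t* out_seq) {
    uint64_t s = seq.fetch_add(1, std::memory_order_acq_rel);
    *out_seq = s;
    return &slots[s % SLOTS];
  }

  // Writer: publish after filling the payload.
  void publish(Slot* slot, uint64_t s) {
    slot->epoch.store(s / SLOTS + 1, std::memory_order_release);
  }

  // Reader: the slot holds sequence s only while its epoch matches;
  // a stale or zero epoch means mid-fill or already overwritten.
  bool valid(uint64_t s) const {
    return slots[s % SLOTS].epoch.load(std::memory_order_acquire) ==
           s / SLOTS + 1;
  }
};
```

This trades the variable-size records of the byte-offset scheme for trivially simple wrap detection.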
Logging braindump
So, we've been talking about an in-memory buffer that would contain debug-level messages, and a separate thread/process [1] that would write a subset of these to disk. Thus, on crash, we'd have a lot of detail available (the ring buffer entries not yet overwritten), without paying the IO/serialization/storage cost of dumping it all out to disk.

[1] e.g. Varnish uses an mmap'ed file for a ringbuffer and consumes it from a different process; that way, the log consumer cannot corrupt the server process memory. Of course, this means log messages cannot contain pointers to indirect data. And that means the buffer is not made of constant-size entries, like Disruptor tends to assume... though I think you could view Disruptor sequence numbers as byte offsets, if you wanted.

RING BUFFER

For the ring buffer, we've been looking at the Disruptor[2]-style "consumer tries to catch up with a sequence number from producer" design. As we have multiple producers (threads creating log entries), the cost of creating a log entry would be a single cmpxchg op, and then whatever work is needed to lay out the event in the ringbuffer.

[2] http://martinfowler.com/articles/lmax.html
    http://code.google.com/p/disruptor/

The in-memory data format could just use whatever data format is most convenient. The ringbuffer could be an array of tiny structs with the base fields like thread id embedded there, and pointers to separately allocated data for items that aren't always present. But this means we need to be very careful about memory management; we want the data pointed to to stay alive and unmodified until the producer loops around the ringbuffer.

Alternatively, interpret Disruptor sequence numbers as byte offsets: serialize the message first, then allocate that much space from the ring buffer (still just one cmpxchg). This pushes more of the work to the producer of log messages, but avoids having an intermediate data format that needs to be converted to another format, and simplifies memory management tremendously.
DISK FORMAT

The process writing the events to disk should be more focused on usefulness and longevity of the data. If the ring buffer is just arrays with pointers, here we should take the performance hit to convert to one of the known formats.

I feel strongly in favor of structured logging, as parsing a bazillion log entries is slow, and maintaining the parsing rules is actually harder than structured logging in the first place. The status quo is hoping to improve syslog, but there's so much Enterprise in this stuff that I'm not holding my breath... http://lwn.net/Articles/484731/ . Work that has come out includes the structured syslog format below, and CEE querying further down.

Some candidates:
- Scribe and Flume are pre-existing log collectors that emphasize a DAG of log flow, lots of Java everywhere... I'm not thrilled. https://github.com/facebook/scribe https://cwiki.apache.org/FLUME/
- journald: I'm just going to pretend it doesn't exist, at least for 2 years: http://blog.gerhards.net/2011/11/journald-and-rsyslog.html
- syslog's structured logging extension: http://tools.ietf.org/html/rfc5424#section-6.5 essentially, [key=value key2=val2] MESSAGE

  <165>1 2003-10-11T22:14:15.003Z mymachine.example.com evntslog - ID47 [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"] BOMAn application event log entry...

- JSON: http://json.org/

  {"key": "value", "key2": "val2", "message": "MESSAGE"}

  the convention is lines of JSON, separated by newlines; each line is a full event. The biggest downsides are that numbers are always floats (need to stringify large numbers), and binary data must be encoded somehow into a UTF-8 string (base64 is most common).
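A concrete illustration of the lines-of-JSON convention and the stringify-large-numbers caveat (hypothetical helper functions, not code from the thread): integers above 2^53 lose precision in parsers that map every JSON number to a double, so they are emitted as strings.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Minimal escaping for the illustration; a real emitter would also
// handle control characters.
inline std::string json_escape(const std::string& s) {
  std::string out;
  for (char c : s) {
    if (c == '"' || c == '\\') out += '\\';
    out += c;
  }
  return out;
}

// One complete event per line; the 64-bit field is stringified so a
// float-only JSON parser cannot mangle it.
inline std::string json_event(const std::string& msg, uint64_t offset) {
  std::ostringstream ss;
  ss << "{\"message\": \"" << json_escape(msg) << "\","
     << " \"offset\": \"" << offset << "\"}";  // string, not a number
  return ss.str();
}
```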
- GELF: compressed JSON with specific fields: https://github.com/Graylog2/graylog2-docs/wiki/GELF
- Google Protocol Buffers: considered clumsy these days (code generation from IDL etc); only Google has significant investment in the format
- Thrift: considered clumsy these days (code generation from IDL etc); only Facebook has significant investment in the format
- BSON: sort of close to a binary encoding of JSON + extra data types, not a huge improvement in speed/space... http://bsonspec.org/
- Avro: Apache-sponsored data format, nicely self-describing, apparently slow? http://avro.apache.org/
- MessagePack: binary encoding for JSON, claims to beat the others in speed... http://msgpack.org/

And all of these can be compressed with e.g. Snappy as they flow to disk. http://code.google.com/p/snappy/

Downside of all but JSON: we'd need to bundle the library -- distro support just isn't there yet.

Should the disk format be binary? That makes it less friendly to the admin. I'm not sure which way to go. JSON is simpler and friendlier; e.g. MessagePack has an identical data model but is faster and takes less space. Some options:
a. make it configurable so simple installations don't need to suffer binary logs
b. just pick one and stick with it

QUERYING / ANALYSIS
- use a format