Re: Logging braindump

2012-03-27 Thread Sage Weil
I did some rough benchmarking (N entries over M threads) on the current 
code vs wip-log2 (my naive implementation), which

 - allocates an Entry struct on the heap, with 80 chars preallocated
 - you write through a streambuf into that, overflowing into a std::string on 
   the heap
 - takes a mutex and adds it to a linked list

with an async thread to write it out.
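As a sketch of that structure (illustrative names and shapes, not the actual wip-log2 code):

```cpp
#include <cassert>
#include <cstring>
#include <mutex>
#include <string>

// Sketch: each Entry preallocates 80 chars and spills into a heap
// std::string only when the message is longer than that.
struct Entry {
  char prealloc[80];
  size_t len = 0;
  std::string overflow;   // engaged only past 80 chars
  Entry* next = nullptr;  // intrusive link for the pending queue

  void write(const char* s, size_t n) {
    size_t fit = sizeof(prealloc) - len;
    size_t take = n < fit ? n : fit;
    memcpy(prealloc + len, s, take);
    len += take;
    overflow.append(s + take, n - take);  // spill the rest, if any
  }
};

// Submission takes a mutex and appends to a linked list; a separate
// async thread (not shown) would pop entries and write them out.
struct Log {
  std::mutex lock;
  Entry* head = nullptr;
  Entry* tail = nullptr;

  void submit_entry(Entry* e) {
    std::lock_guard<std::mutex> l(lock);
    if (tail) tail->next = e; else head = e;
    tail = e;
  }
};
```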

For 1M lines, on a 16-core system, -O2, 

threads   old code   new (total)   new (queue)   no queue
1         4.7        ~7.5          1.4           1.3
10        8.0        ~8            1.54          1.0
16        7.8        ~8            1.62          1.1
100       5.1        ~8            1.75          1.34

A few notes:
 - The new (queue) is just the time to queue 1M events, and the total 
   includes a call to log.flush().  Normally that would happen in another 
   thread, of course, so it's only interesting as a rough measure of 
   total work. 
 - Commenting out the mutex on submit_entry with one thread shaves off .1 
   seconds or so... about the same as the 'no queue' case.
 - Preallocating entries instead of calling new only saves about .1 
   seconds.
 - Using PreallocatedStreambuf instead of a ostringstream saves about .4 
   seconds in the single thread case (if we fit).
 - Overflowing the buffer takes about .1 seconds longer.. still better
   than the ostringstream.  It depends a bit on _how_ much we overflow,
   though.. it does powers of two over the original prealloc size, so we 
   may have to realloc multiple times.  This is probably what 
   ostringstream is doing anyway.
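A streambuf along those lines might look like this (a minimal sketch under my own assumptions; the real PreallocatedStreambuf may differ):

```cpp
#include <cassert>
#include <cstring>
#include <ostream>
#include <streambuf>
#include <string>
#include <vector>

// Sketch: writes go into a caller-supplied (e.g. stack or Entry-embedded)
// buffer; on overflow we reallocate at powers of two over the original
// size, so a badly undersized prealloc may realloc several times.
class PreallocatedStreambuf : public std::streambuf {
  char* buf;
  size_t size;
  std::vector<char> heap;   // engaged only after overflow
public:
  PreallocatedStreambuf(char* b, size_t s) : buf(b), size(s) {
    setp(buf, buf + size);
  }
  int_type overflow(int_type c) override {
    size_t used = pptr() - pbase();
    size_t newsize = heap.empty() ? size * 2 : heap.size() * 2;
    std::vector<char> bigger(newsize);
    memcpy(bigger.data(), pbase(), used);   // copy what we have so far
    heap.swap(bigger);
    setp(heap.data(), heap.data() + heap.size());
    pbump(static_cast<int>(used));
    if (c != traits_type::eof()) sputc(static_cast<char>(c));
    return c;
  }
  std::string str() const { return std::string(pbase(), pptr() - pbase()); }
};
```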

My thoughts:
 - I don't think we care too much about the concurrency issue (yet).
 - I think we only care about queuing time, since that is what slows down
   running code.  Having an async thread write to disk will incur some 
   background load, but in reality we won't be writing 1M records in 10 
   seconds.
 - That is especially true since we will probably collect lots of log data 
   that we never write.
 - Most of the work so far adds infrastructure to control dual log levels 
   and to clean up the logging interface, making it easy to implement
   different functionality.  I think it's worth merging as-is.
 - We'll want to tune the amount of space preallocated in each Entry.  

Missing pieces:
 - Need to throttle when writeout gets behind to cap total memory usage.

Pushing to wip-log branch for review.

sage



On Thu, 22 Mar 2012, Sage Weil wrote:

 On Thu, 22 Mar 2012, Mark Kampe wrote:
  On 03/22/12 09:38, Colin McCabe wrote:
   On Mon, Mar 19, 2012 at 1:53 PM, Tommi Virtanen
   tommi.virta...@dreamhost.com  wrote:
[mmap'ed buffer discussion]
   
   I always thought mmap'ed circular buffers were an elegant approach for
   getting data that survived a process crash, but not paying the
   overhead of write(2) and read(2).  The main problem is that you need
   special tools to read the circular buffer files off of the disk.  As
   Sage commented, that is probably undesirable for many users.
  
  (a) I actually favor not simply mmaping the circular buffer,
  but having a program that pulls the data out of memory
  and writes it to disk (ala Varnish).  In addition to doing
  huge writes (greatly reducing the write overhead), it can
  filter what it processes, so that we have extensive logging
  for the last few seconds, and more manageable logs on disk
  extending farther back in time (modulo log rotation).
 
 The code in wip-log does this (dumps high level messages unless/until 
 there is a reason to dump it all).  Well, more importantly, it sets up the 
 infrastructure to control those logging levels (the actual logging code 
 is very short and simple).  I don't see the value in putting this in a 
 separate process, though (as opposed to a separate thread); that just 
 seems like it would complicate ops...
 
  (b) The most interesting logs are probably the ones in coredumps
  (that didn't make it out to disk) for which we want a
  crawler/extractor anyway.  It probably isn't very hard to
  make the program that extracts logs from memory also be
  able to pick the pockets of dead bodies (put a big self
  identifying header on the front of each buffer).
  
  Note also that having the ability to extract the logs from
  a coredump pretty much eliminates any motivations to flush
  log entries out to disk promptly/expensively.  If the process
  exits cleanly, we'll get the logs.  If the process produces
  a coredump, we'll still get the logs.
 
 This mostly seems like an argument to keep the in-memory layout simple, to 
 keep the post-mortem code simple.  But at worst it's a 
 complexity/performance tradeoff... and, if we're lucky, there's no 
 difficult choice to make.  Honestly, I don't think the overhead is where 
 we think it is.  But we need to do some real testing to see.
 
  (c) I have always loved text logs 

Re: Logging braindump

2012-03-22 Thread Sage Weil
This is well-timed.. I just started playing with this code on the flight 
over here.  

There are a couple pieces to what I've done so far:

 - clearly segmented logging code into a new set of classes in src/log, 
   without dependencies on everything else
 - set up config infrastructure to control 'log' and 'gather' levels for 
   different subsystems, where gather controls what messages are 
   generated, and log controls which of those reach disk (in non-crash 
   situation)
 - made a relatively naive implementation that works (messages on heap, 
   separate thread to write out, dump recent history on crash).
 - a lot of the weird interaction/bootstrapping stuff with contexts and 
   config now go away entirely if we just rely on cerr for errors during
   config parsing, etc. (like that simple spinlock Greg was complaining 
   about).
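The gather/log split can be sketched like this (hypothetical class and field names, not the actual src/log classes):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Sketch: 'gather' controls which messages are generated (kept in
// memory for a crash dump); 'log' controls which of those reach disk
// in the non-crash case.
struct Subsystem {
  std::string name;
  int log_level;      // written to disk at or below this level
  int gather_level;   // generated/kept in memory at or below this
};

struct SubsystemMap {
  std::vector<Subsystem> subs;

  bool should_gather(int sub, int level) const {
    // anything loggable must also be gathered, hence the max
    return level <= std::max(subs[sub].log_level, subs[sub].gather_level);
  }
  bool should_log(int sub, int level) const {
    return level <= subs[sub].log_level;
  }
};
```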

 MY RECOMMENDATIONS [biased, as always ;-]
 
 - bundle the MessagePack library

I'm worried about requiring tools to read logs.  And I think most of the 
current overhead is the sync writes (which are easily addressed).  But 
string serialization is also a significant chunk, I think, so this may be 
worth it... need to do some real tests.

 - in thread that calls log: serialize as MessagePack onto stack,
 allocate needed bytes from ringbuffer, copy event to ringbuffer
 - write to disk is now very simple, could even be done in a different
 process (mmap header+ringbuffer)

My naive implementation puts entries on heap and links them into a list: 
can't easily extract from a core, but otherwise equivalent to a ring of 
pointers.  My limited tests showed std::string assignment had a huge 
overhead, though, so I suspect a stringbuf implementation that writes 
into cheaply preallocated buffers (maybe allocated off a large 
preallocated ring buffer), optimized for the common case (40-80 byte 
message), would be a win.  
But I need to read more of the Disruptor stuff to see what exactly they're 
doing and why.

In any case, the new structure makes it easier to test implementations 
given things are better contained.

 - let disk files be named after timestamp they were started at, start
 new ones based on time & size (no .0 -> .1 -> .2 renaming needed)
 - make it really simple to process+delete chunks of log, feeding them
 into Brisk or Graylog, then deleting from the node (perhaps after a
 delay, so last 24h is locally browseable)
   (and don't remove things that haven't been processed)

Currently that's done with rename, sighup, wait for new log to 
appear, then process+delete.  Are there problems there that make it worth 
taking a non-standard approach to the log file naming?


Anyway, pushing what I have now to wip-log.  The exposed interface is just 
the macro in common/dout.h now.  Alternative approaches will probably 
change log/Entry.h and probably the allocation steps, but otherwise be 
pretty simple to swap in/out. 

The main piece that is still missing is an interface for logging 
structured content.  I suspect the way to do that is by putting stuff in 
the ostream (a la std::setiosflags(foo) and its ilk) so that you 
can easily stick key/value pairs in the code (... << log::kv(foo, bar) 
<< ... or something)?  Hopefully that can reduce to something efficient, 
at least, so we don't have to replace the current dout logging sites with 
something completely different.

sage
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Logging braindump

2012-03-22 Thread Colin McCabe
On Mon, Mar 19, 2012 at 1:53 PM, Tommi Virtanen
tommi.virta...@dreamhost.com wrote:
 [mmap'ed buffer discussion]

I always thought mmap'ed circular buffers were an elegant approach for
getting data that survived a process crash, but not paying the
overhead of write(2) and read(2).  The main problem is that you need
special tools to read the circular buffer files off of the disk.  As
Sage commented, that is probably undesirable for many users.

An mmap'ed buffer, even a lockless one, is a simple beast.  Do you
really need a whole library just for that?  Maybe I'm just
old-fashioned.
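A minimal single-producer sketch of such an mmap'ed ring (the layout, magic value, and helper names are made up for illustration):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Sketch: a header with head/tail offsets followed by the data area,
// all in one file-backed mapping so the contents survive a process
// crash.  Single producer, single consumer; not a production layout.
struct RingHeader {
  uint32_t magic;      // self-identifying header, for post-mortem tools
  uint32_t size;       // bytes in the data area
  uint32_t head;       // next write offset
  uint32_t tail;       // next read offset
};

inline char* ring_map(const char* path, uint32_t data_size) {
  int fd = open(path, O_RDWR | O_CREAT, 0644);
  if (fd < 0) return nullptr;
  size_t total = sizeof(RingHeader) + data_size;
  if (ftruncate(fd, total) < 0) { close(fd); return nullptr; }
  void* p = mmap(nullptr, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close(fd);                       // the mapping keeps the file alive
  if (p == MAP_FAILED) return nullptr;
  RingHeader* h = static_cast<RingHeader*>(p);
  if (h->magic != 0x52494e47) {    // "RING": initialize a fresh file
    h->magic = 0x52494e47;
    h->size = data_size;
    h->head = h->tail = 0;
  }
  return static_cast<char*>(p);
}

inline bool ring_put(char* base, const char* msg, uint32_t len) {
  RingHeader* h = reinterpret_cast<RingHeader*>(base);
  char* data = base + sizeof(RingHeader);
  if (len > h->size) return false;
  for (uint32_t i = 0; i < len; i++)   // wrap byte by byte (sketch only)
    data[(h->head + i) % h->size] = msg[i];
  h->head = (h->head + len) % h->size;
  return true;
}
```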

 DISK FORMAT
 - GELF: compressed JSON with specific fields:
 https://github.com/Graylog2/graylog2-docs/wiki/GELF
 - Google Protocol Buffers: considered clumsy these days (code
 generation from IDL etc); only Google has significant investment in
 the format

Hadoop also has a significant investment in Google Protocol Buffers.
It's almost the only thing we send over the wire these days.  I don't
think this affects your decision about logging at all, though.

 - Thrift: considered clumsy these days (code generation from IDL etc);
 only Facebook has significant investment in the format
 - BSON: sort of close to binary encoding of JSON + extra data types,
 not a huge improvement in speed/space.. http://bsonspec.org/
 - Avro: Apache-sponsored data format, nicely self-describing,
 apparently slow? http://avro.apache.org/
 - MessagePack: binary encoding for JSON, claims to beat others in
 speed.. http://msgpack.org/

I'm not sure why you consider code generation from a schema file clumsy.
To me, the whole point of structured logging is that
1. it is more efficient in terms of space/speed, partly because it
doesn't need to be self-describing, and
2. it is a stable ABI that binds the system together

The sticking point for most programmers is point #2.  They don't want
to commit to a log format ahead of time because it slows down
development.  (Pretty much the same reason Linus doesn't want a
stable in-kernel ABI.)  In my opinion, that's the reason why
structured logging has not made much headway in the world at large.
Journald is an interesting experiment-- we'll see how many programmers
carefully design structured journald log messages, and how many just
output a single field called "text" or something similar. :)

If you just want better efficiency, there's a lot of ways to get
that which don't involve changing the Ceph logging APIs.  Probably the
easiest one is just cutting down on hugely verbose (as in more than 80
character) log messages.  There's no shortage of those.

For an unrelated project, I have been considering a system where all
logs are generated and go to a circular buffer, but are not sent to
permanent storage until/unless there is a crash, or the particular log
message type is enabled.  It seems like something like this is a good
compromise between too much logging and not enough.

Colin



 And all of these can be compressed with e.g. Snappy as they flow to
 disk.  http://code.google.com/p/snappy/

 Downside of all but JSON: we'd need to bundle the library --
 distro support just isn't there yet.

 Should the disk format be binary? That makes it less friendly to the
 admin. I'm not sure which way to go. JSON is simpler and friendlier,
 e.g. MessagePack has identical data model but is faster and takes less
 space. Some options:
  a. make configurable so simple installations don't need to suffer binary logs
  b. just pick one and stick with it



 QUERYING / ANALYSIS

 - use a format from above that is mapreduce-friendly, or can be 1:1
 imported into another storage system
 - software like Graylog may be of use, but I fear we'll overwhelm it
 with events: http://graylog2.org/
 - Cassandra's Brisk is a really easy way to run SQL-like Hive queries
 over structured data, and has a design that'll ingest any amount of
 data, Just Add Hardware(tm):
 http://www.datastax.com/docs/0.8/brisk/index
 - the standards process is churning out things like CEE, but I'm not
 holding my breath: http://cee.mitre.org/



 MY RECOMMENDATIONS [biased, as always ;-]

 - bundle the MessagePack library
 - in thread that calls log: serialize as MessagePack onto stack,
 allocate needed bytes from ringbuffer, copy event to ringbuffer
 - write to disk is now very simple, could even be done in a different
 process (mmap header+ringbuffer)
 - let disk files be named after timestamp they were started at, start
 new ones based on time & size (no .0 -> .1 -> .2 renaming needed)
 - make it really simple to process+delete chunks of log, feeding them
 into Brisk or Graylog, then deleting from the node (perhaps after a
 delay, so last 24h is locally browseable)
  (and don't remove things that haven't been processed)



 Hope that made sense. Let's talk more, especially if it didn't ;)

Re: Logging braindump

2012-03-22 Thread Mark Kampe

On 03/22/12 09:38, Colin McCabe wrote:

On Mon, Mar 19, 2012 at 1:53 PM, Tommi Virtanen
tommi.virta...@dreamhost.com  wrote:

[mmap'ed buffer discussion]


I always thought mmap'ed circular buffers were an elegant approach for
getting data that survived a process crash, but not paying the
overhead of write(2) and read(2).  The main problem is that you need
special tools to read the circular buffer files off of the disk.  As
Sage commented, that is probably undesirable for many users.


(a) I actually favor not simply mmaping the circular buffer,
but having a program that pulls the data out of memory
and writes it to disk (ala Varnish).  In addition to doing
huge writes (greatly reducing the write overhead), it can
filter what it processes, so that we have extensive logging
for the last few seconds, and more manageable logs on disk
extending farther back in time (modulo log rotation).

(b) The most interesting logs are probably the ones in coredumps
(that didn't make it out to disk) for which we want a
crawler/extractor anyway.  It probably isn't very hard to
make the program that extracts logs from memory also be
able to pick the pockets of dead bodies (put a big self
identifying header on the front of each buffer).

Note also that having the ability to extract the logs from
a coredump pretty much eliminates any motivations to flush
log entries out to disk promptly/expensively.  If the process
exits cleanly, we'll get the logs.  If the process produces
a coredump, we'll still get the logs.

(c) I have always loved text logs that I can directly view.
Their immediate and effortless accessibility encourages
their use, which encourages work in optimizing their content
(lots of the stuff you need, and little else).

But binary logs are less than half the size (cheaper to
take and keep twice as much info), and a program that
formats them can take arguments about which records/fields
you want and how you want them formatted ... and getting
the output the way you want it (whether for browsing or
subsequent reprocessing) is a huge win.  You get used to
running the log processing command quickly, but the benefits

(d) If somebody really wants text logs for archival, it is completely
trivial to run the output of the log-extractor through the
formatter before writing it to disk ... so the in memory
format need not be tied to the on-disk format.  The rotation
code won't care.


An mmap'ed buffer, even a lockless one, is a simple beast.  Do you
really need a whole library just for that?  Maybe I'm just
old-fashioned.


IMHO, surprisingly few things involving large numbers of performance
critical threads turn out to be simple :-)  For example:

If we are logging a lot, buffer management has the potential
to become a bottle-neck ... so we need to be able to allocate
a record of the required size from the circular buffer
with atomic instructions (at least in non-wrap situations).

But if records are allocated and then filled, we have to
consider how to handle the case where the filling is
delayed, and the reader catches up with an incomplete
log record (e.g. skip it, wait how long, ???).

And while we hope this will never happen, we have to deal
with what happens when the writer catches up with the
reader, or worse, an incomplete log block ... where we might
have to determine whether or not the owner is deceased (making
it safe to break his record lock) ... or should we simply take
down the service at that point (on the assumption that something
has gone very wrong).

If we are going to use multiple buffers, we may have to
do a transaction dance (last guy in has to close this
buffer to new writes, start a new one, and somebody has
to wait for pending additions to complete, queue this
one for delivery or perhaps even flush it to disk if we
don't have some other thread/process doing this).



Re: Logging braindump

2012-03-22 Thread Tommi Virtanen
On Thu, Mar 22, 2012 at 11:17, Mark Kampe mark.ka...@dreamhost.com wrote:
 (a) I actually favor not simply mmaping the circular buffer,
    but having a program that pulls the data out of memory
    and writes it to disk (ala Varnish).  In addition to doing
    huge writes (greatly reducing the write overhead), it can
    filter what it processes, so that we have extensive logging
    for the last few seconds, and more manageable logs on disk
    extending farther back in time (modulo log rotation).

varnishlog communicates via a mmap'ed ringbuffer. These are not in conflict.

 (b) The most interesting logs are probably the ones in coredumps
    (that didn't make it out to disk) for which we want a
    crawler/extractor anyway.  It probably isn't very hard to
    make the program that extracts logs from memory also be
    able to pick the pockets of dead bodies (put a big self
    identifying header on the front of each buffer).

A single memory area for this makes it easier.. that's why I like the
ringbuffer.

        If we are logging a lot, buffer management has the potential
        to become a bottle-neck ... so we need to be able to allocate
        a record of the required size from the circular buffer
        with atomic instructions (at least in non-wrap situations).

        But if records are allocated and then filled, we have to
        consider how to handle the case where the filling is
        delayed, and the reader catches up with an incomplete
        log record (e.g. skip it, wait how long, ???).

        And while we hope this will never happen, we have to deal
        with what happens when the writer catches up with the
        reader, or worse, an incomplete log block ... where we might
        have to determine whether or not the owner is deceased (making
        it safe to break his record lock) ... or should we simply take
        down the service at that point (on the assumption that something
        has gone very wrong).

The Disruptor design handles all these, is simple in the sense of
that's what I would have built, and looks good.


Re: Logging braindump

2012-03-22 Thread Sage Weil
On Thu, 22 Mar 2012, Tommi Virtanen wrote:
         If we are logging a lot, buffer management has the potential
         to become a bottle-neck ... so we need to be able to allocate
         a record of the required size from the circular buffer
         with atomic instructions (at least in non-wrap situations).
 
         But if records are allocated and then filled, we have to
         consider how to handle the case where the filling is
         delayed, and the reader catches up with an incomplete
         log record (e.g. skip it, wait how long, ???).
 
         And while we hope this will never happen, we have to deal
         with what happens when the writer catches up with the
         reader, or worse, an incomplete log block ... where we might
         have to determine whether or not the owner is deceased (making
         it safe to break his record lock) ... or should we simply take
         down the service at that point (on the assumption that something
         has gone very wrong).
 
 The Disruptor design handles all these, is simple in the sense of
 that's what I would have built, and looks good.

My only problem with the Disruptor stuff was that, as I was reading it, it 
was very much like "yeah, given the limitations of Java, that's what you 
would do," but we're in a slightly different boat.  i.e., they use a 
ringbuffer of pointers to preallocated objects.

My guess is that the best bet would be preallocated Entry objects (either 
in a flat buffer or on the heap) with a preallocated per-entry buffer 
(say, 80 chars) that will spill over into something slow/simple when 
necessary.  And something Disruptor-like to claim slots in the ringbuffer.

But in any case, I think the key is first measuring how much time we spend

 - rendering the current entries
 - queueing each entry

under varying levels of concurrency.  With the current code, for instance, 
I think most time is spent converting crap into strings and waiting for a 
blocking flush.  We aren't logging millions of items, only hundreds.

--

Ignoring the nitty gritty of the log queueing, though... does the basic 
framework make sense?  That is,

 - a set of predefined subsystems, each with their own log levels
 - a level to control which entries are gathered/rendered, with a fast 
   conditional check (in, say, the dout macro)
 - a level to control which entries are logged
 - a dump on crash (or other event) of everything we have

?
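That fast conditional check in a dout-style macro might be sketched as follows (the global level array, submit_line, and the macro name are illustrative, not the actual dout machinery):

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

static int g_gather_level[8] = {0};           // per-subsystem gather levels
static std::vector<std::string> g_gathered;   // stands in for the in-memory log

static void submit_line(int sub, int lvl, const std::string& s) {
  (void)sub; (void)lvl;
  g_gathered.push_back(s);
}

// The level test runs before any stream object is constructed, so a
// disabled message costs one array load and one integer compare.
#define ldout(sub, lvl, msg)                      \
  do {                                            \
    if ((lvl) <= g_gather_level[(sub)]) {         \
      std::ostringstream _ss;                     \
      _ss << msg;                                 \
      submit_line((sub), (lvl), _ss.str());       \
    }                                             \
  } while (0)
```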

sage

Re: Logging braindump

2012-03-22 Thread Colin McCabe
On Thu, Mar 22, 2012 at 11:17 AM, Mark Kampe mark.ka...@dreamhost.com wrote:
 On 03/22/12 09:38, Colin McCabe wrote:
 An mmap'ed buffer, even a lockless one, is a simple beast.  Do you
 really need a whole library just for that?  Maybe I'm just
 old-fashioned.


 IMHO, surprisingly few things involving large numbers of performance
 critical threads turn out to be simple :-)  For example:

        If we are logging a lot, buffer management has the potential
        to become a bottle-neck ... so we need to be able to allocate
        a record of the required size from the circular buffer
        with atomic instructions (at least in non-wrap situations).

        But if records are allocated and then filled, we have to
        consider how to handle the case where the filling is
        delayed, and the reader catches up with an incomplete
        log record (e.g. skip it, wait how long, ???).

The easiest solution is to store pointers in the circular buffer
rather than directly writing to it.   Another solution would be epoch
numbers that increment each time the log buffer wraps around.
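The epoch-number idea can be sketched like so (fixed-size slots, single-threaded, purely illustrative names):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Sketch: each slot carries the epoch (wrap count) it was written in,
// so a reader can tell an overwritten slot from a valid one.
struct Slot {
  uint64_t epoch = 0;      // 0 means never written
  char msg[56] = {};
};

struct EpochRing {
  std::vector<Slot> slots;
  uint64_t next = 0;       // monotonic write counter
  explicit EpochRing(size_t n) : slots(n) {}

  void put(const char* s) {
    Slot& sl = slots[next % slots.size()];
    sl.epoch = next / slots.size() + 1;   // wrap count, starting at 1
    strncpy(sl.msg, s, sizeof(sl.msg) - 1);
    next++;
  }

  // A record with sequence number seq is still intact iff its slot has
  // not been reused by a later wrap.
  bool read(uint64_t seq, const char** out) const {
    const Slot& sl = slots[seq % slots.size()];
    if (sl.epoch != seq / slots.size() + 1) return false;  // overwritten
    *out = sl.msg;
    return true;
  }
};
```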

C.


Logging braindump

2012-03-19 Thread Tommi Virtanen
So, we've been talking about an in-memory buffer that would contain
debug-level messages, and a separate thread/process [1] that would
write a subset of these to disk. Thus, on crash, we'd have a lot of
detail available (the ring buffer entries not yet overwritten),
without paying the IO/serialization/storage cost of dumping it all out
to disk.

[1] e.g. varnish uses an mmaped file for a ringbuffer, consumes it
from a different process; that way, the log consumer cannot corrupt
the server process memory. Of course, this means log messages cannot
contain pointers to indirect data. And that means the buffer is not
made of constant size entries, like Disruptor tends to assume.. though
I think you could view Disruptor sequence numbers as byte offsets, if
you wanted.



RING BUFFER

For the ring buffer, we've been looking at Disruptor[2]-style
consumer tries to catch up with a sequence number from producer
design. As we have multiple producers (threads creating log entries),
the cost of creating a log entry would be a single cmpxchg op, and
then whatever work is needed to lay out the event in the ringbuffer.

http://martinfowler.com/articles/lmax.html
http://code.google.com/p/disruptor/

The in-memory data format could just use whatever data format is most
convenient. The ringbuffer could be an array of tiny structs with the
base fields like thread id embedded there, and pointers to separately
allocated data for items that aren't always present. But this means we
need to be very careful about memory management; we want the data
pointed to to stay alive and unmodified until the producer loops
around the ringbuffer.

Alternatively, interpret Disruptor sequence numbers as byte offsets,
serialize message first, allocate that much space from ring buffer
(still just one cmpxchg). This pushes more of the work to the producer
of log messages, but avoids having an intermediate data format that
needs to be converted to another format, and simplifies memory
management tremendously.
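The claim step with byte offsets can be sketched as follows (consumer side and the wrapping of the actual writes omitted; names are illustrative):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Sketch: producers claim a variable-sized span of the ring with one
// compare-exchange on a monotonic head byte offset; the consumer's
// position bounds how far the head may advance.
struct ByteRing {
  std::atomic<uint64_t> head{0};   // next free byte (monotonic)
  uint64_t tail = 0;               // consumer position (monotonic)
  uint64_t size;
  explicit ByteRing(uint64_t s) : size(s) {}

  // Returns the claimed offset, or UINT64_MAX if the ring is full.
  uint64_t claim(uint64_t len) {
    uint64_t h = head.load(std::memory_order_relaxed);
    do {
      if (h + len - tail > size)   // would overwrite unread data
        return UINT64_MAX;
    } while (!head.compare_exchange_weak(h, h + len,
                                         std::memory_order_acq_rel));
    return h;  // producer now serializes its event at offset h % size
  }
};
```

A full design would also need a way to mark the claimed span complete so the consumer doesn't read half-written events, which is exactly the incomplete-record problem discussed above.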


DISK FORMAT

The process writing the events to disk should be more focused on
usefulness and longevity of the data. If the ring buffer is just
arrays with pointers, here we should take the performance hit to
convert to one of the known formats. I feel strongly in favor of
structured logging, as parsing bazillion log entries is slow, and
maintaining the parsing rules is actually harder than structured
logging in the first place.

The status quo is hoping to improve syslog, but there's so much
Enterprise in this stuff that I'm not holding my breath..
http://lwn.net/Articles/484731/ . Work that has come out includes the
structured syslog format below, CEE querying further down.

Some candidates:

- Scribe and Flume are pre-existing log collectors that emphasize a
DAG of log flow, lots of Java everywhere.. I'm not thrilled.
  https://github.com/facebook/scribe  https://cwiki.apache.org/FLUME/

- journald: I'm just going to pretend it doesn't exist, at least for 2
years: http://blog.gerhards.net/2011/11/journald-and-rsyslog.html

- syslog's structured logging extension:
http://tools.ietf.org/html/rfc5424#section-6.5

  essentially, [key=value key2=val2] MESSAGE

 <165>1 2003-10-11T22:14:15.003Z mymachine.example.com
   evntslog - ID47 [exampleSDID@32473 iut="3" eventSource=
   "Application" eventID="1011"] BOMAn application
   event log entry...

- JSON: http://json.org/

  {"key": "value", "key2": "val2", "message": "MESSAGE"}

  the convention is lines of json, separated by newlines, each line is
a full event

  the biggest downsides are that numbers are always floats (large
numbers need to be stringified), and binary data must be encoded
somehow into a UTF-8 string (base64 is most common).

- GELF: compressed JSON with specific fields:
https://github.com/Graylog2/graylog2-docs/wiki/GELF
- Google Protocol Buffers: considered clumsy these days (code
generation from IDL etc); only Google has significant investment in
the format
- Thrift: considered clumsy these days (code generation from IDL etc);
only Facebook has significant investment in the format
- BSON: sort of close to binary encoding of JSON + extra data types,
not a huge improvement in speed/space.. http://bsonspec.org/
- Avro: Apache-sponsored data format, nicely self-describing,
apparently slow? http://avro.apache.org/
- MessagePack: binary encoding for JSON, claims to beat others in
speed.. http://msgpack.org/
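As a concrete illustration of the newline-delimited JSON convention, two hypothetical events (the field names here are made up, not a proposed schema; "data" carries base64-encoded binary):

```json
{"ts": "2012-03-19T20:53:01.123Z", "thread": "0x7fa3c1fd", "sub": "osd", "level": 10, "message": "handling op", "data": "aGVsbG8="}
{"ts": "2012-03-19T20:53:01.124Z", "thread": "0x7fa3c1fd", "sub": "osd", "level": 20, "message": "op done"}
```

Each line is a complete event, so a consumer can tail the file and parse line by line; MessagePack would carry the same data model in binary.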

And all of these can be compressed with e.g. Snappy as they flow to
disk.  http://code.google.com/p/snappy/

Downside of all but JSON: we'd need to bundle the library --
distro support just isn't there yet.

Should the disk format be binary? That makes it less friendly to the
admin. I'm not sure which way to go. JSON is simpler and friendlier,
e.g. MessagePack has identical data model but is faster and takes less
space. Some options:
  a. make configurable so simple installations don't need to suffer binary logs
  b. just pick one and stick with it



QUERYING / ANALYSIS

- use a format