On Wed, 31 Oct 2012, Rainer Gerhards wrote:
Hi all,
There is the dangling issue that rsyslog has grown out of its current queue subsystem. I
am currently considering a refactoring or a complete redesign. I initially wanted to
write a large blog post with all details and ideas, but have now opted to split this in a
couple of parts - both because I have problems to find time to do the "big one"
at once; and also it probably is smarter to get feedback asap.
So here is the initial part:
http://blog.gerhards.net/2012/10/rsyslog-disk-queues-refactor-or.html
This will get anyone interested in the queue subsystem a broad understanding of
how it works - and why. Please share any concerns you have about the current
system as well as wishes/suggestions on what should improve. Deeply technical
information is fine, actually appreciated.
I intend to let the discussion run and write the other parts of the blog series when "events
warrant it" ;) Due to other projects, I can probably not discuss 10 hours a day, but will try
to be as active as possible (which hopefully means "much"). The intent is to come up with
a solution that will be good for the next five years to come...
Thinking a bit more about the disk format.
We have two competing requirements:
1. making it as fast as possible for rsyslog to read and write the data
2. making it human readable so that it can be salvaged by a person if
something goes horribly wrong.
For the former, binary data structures are desirable.
For the latter, you want everything in text.
For rsyslog, this is greatly simplified by the fact that everything we are
processing is text, and does not have any embedded newlines.
One approach to consider is to not store anything in the file that can be
re-calculated (i.e. store the rawmessage and a little extra metadata, then
run it through the parsing stack when you dequeue the message).
This costs a significant amount of CPU, and runs the risk that the parsing
may not end up being the same (for example when processing a queue file
after a restart).
In addition, with version 7 and the ability to set variables and fields in
structures, the data in a queue file tied to a specific action may have
been manipulated significantly since the message arrived.
So I don't think this is the way we want to go.
Another approach is to define everything as text fields (i.e.
name=value\n) and then parse it when you read it in.
This is also pretty expensive in CPU.
One trick that we can pull to greatly speed up the processing is to play
pointer games.
If you take a line of text, you can very cheaply walk through it and
record a pointer to the beginning of each word, replacing spaces with null
characters. This is FAR faster than copying the data to new memory
locations, and it lets you treat the resulting words as standard C
strings.
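To make the pointer trick concrete, here is a minimal sketch. It is in Python rather than C, so byte offsets into a bytearray stand in for the C pointers, and the helper names (split_in_place, word_at) are made up for illustration. The point is only that the line is modified in place and nothing is copied until a word is actually read.

```python
def split_in_place(buf: bytearray) -> list[int]:
    """Replace each space with NUL and return the start offset of every word."""
    starts = []
    in_word = False
    for i, b in enumerate(buf):
        if b == 0x20:           # space: terminate the current word in place
            buf[i] = 0
            in_word = False
        elif not in_word:       # first byte of a new word
            starts.append(i)
            in_word = True
    return starts


def word_at(buf: bytearray, start: int) -> bytes:
    """Read the NUL-terminated string beginning at start, like a C string."""
    end = buf.find(b"\0", start)
    return bytes(buf[start:end if end != -1 else len(buf)])


line = bytearray(b"Oct 31 host app: hello")
offsets = split_in_place(line)
words = [word_at(line, o) for o in offsets]
```

In C the recorded offsets would be char pointers into the same buffer, and each one could be handed directly to any function that expects a C string.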
I would suggest a variation of this.
Store everything as name=value<null> (doing a 'strings file' will return
name=value\n).
Add a header to each message, something along the lines of:
RSYSLOG_HEADER Size=###### <base64 encoded data><null>
where Size is the size of the base64 encoded data.
The binary data encoded in the base64 blob would be along the lines of:
<offset to rawmessage><offset to timestamp><offset to received time>...
for the standard properties, followed by
<offset to name><offset to value> for all the dynamically generated data,
where each standard-property offset points to the start of the value field
in its name=value<null> 'line'.
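A rough sketch of what writing such a record could look like. Several details are my assumptions and not part of the proposal: offsets are 32-bit big-endian integers, they are counted from the first byte after the header's null, Size is zero-padded to six digits, and the three standard property names and their order are placeholders. build_record is a hypothetical helper, not an rsyslog function.

```python
import base64
import struct

# Hypothetical fixed order of the standard properties (assumption).
STANDARD = ("rawmessage", "timestamp", "received")


def build_record(standard: dict, dynamic: dict) -> bytes:
    body = bytearray()
    offsets = []
    for name in STANDARD:
        body += name.encode() + b"="
        offsets.append(len(body))          # offset to the value only
        body += standard[name].encode() + b"\0"
    for name, value in dynamic.items():
        offsets.append(len(body))          # offset to the name
        body += name.encode() + b"="
        offsets.append(len(body))          # offset to the value
        body += value.encode() + b"\0"
    # Pack the offsets as binary, then base64 them so the header stays text.
    blob = base64.b64encode(struct.pack(f">{len(offsets)}I", *offsets))
    header = b"RSYSLOG_HEADER Size=%06d " % len(blob) + blob + b"\0"
    return header + bytes(body)


record = build_record(
    {"rawmessage": "<13>hello",
     "timestamp": "2012-10-31T12:00:00",
     "received": "2012-10-31T12:00:01"},
    {"tag": "x"},
)
```

Because the offsets are base64 encoded and every field ends in a null, the whole record still consists of printable runs separated by nulls, which is what keeps 'strings' usable on the file.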
This would allow rsyslog to _very_ quickly know where everything is, and
copy the standard properties into the queue record memory structure. For
dynamic properties it would be a smidge slower, as you can't just do a
strcpy of the values from a 'known' location to the location in the
message structure. You would have to read the name by its length in
bytes (<offset to value> - <offset to name> - 1 bytes worth of text), and
then set up the location for it before copying the value into place. But
this should still be faster than parsing arbitrary text (and if not, just
have the dynamically generated data fields be parsed when they are read;
so far they aren't that common, so this cost won't dominate).
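The read side could then look roughly like the sketch below. As before, the details are assumptions of mine: 32-bit big-endian offsets counted from the first byte after the header's null, three standard properties in a fixed order, and made-up helper names (cstr, read_record). The sample record is built by hand so the sketch stands on its own.

```python
import base64
import struct

N_STANDARD = 3  # rawmessage, timestamp, received time (assumed fixed order)
PREFIX = b"RSYSLOG_HEADER Size="


def cstr(buf: bytes, start: int) -> bytes:
    """Return the NUL-terminated string that starts at the given offset."""
    return buf[start:buf.index(b"\0", start)]


def read_record(record: bytes):
    if not record.startswith(PREFIX):
        raise ValueError("not a record header")
    size_end = record.index(b" ", len(PREFIX))
    size = int(record[len(PREFIX):size_end])
    blob = record[size_end + 1:size_end + 1 + size]
    body = record[size_end + 1 + size + 1:]        # skip the header's NUL
    raw = base64.b64decode(blob)
    offsets = struct.unpack(f">{len(raw) // 4}I", raw)
    # Standard properties: jump straight to each value, no scanning for '='.
    standard = [cstr(body, o) for o in offsets[:N_STANDARD]]
    # Dynamic properties: the name length is value_off - name_off - 1.
    dynamic = {}
    rest = offsets[N_STANDARD:]
    for name_off, value_off in zip(rest[0::2], rest[1::2]):
        dynamic[body[name_off:value_off - 1]] = cstr(body, value_off)
    return standard, dynamic


body = b"rawmessage=<13>hi\0timestamp=t1\0received=t2\0tag=x\0"
blob = base64.b64encode(struct.pack(">5I", 11, 28, 40, 43, 47))
record = PREFIX + b"%06d " % len(blob) + blob + b"\0" + body
standard, dynamic = read_record(record)
```

In C the standard properties would need no searching beyond finding each terminating null, which is the speed advantage over parsing name=value text.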
For data recovery purposes (where a person needs to manually tweak the
file to recover from problems), do
strings queuefile | sed 's/RSYSLOG_HEADER Size=.*$/RSYSLOG_HEADER Size=0/'
to clear the header. rsyslog can then have a fallback mode (or an external
repair tool) that does the slow parsing of everything (if it's an external
tool, it can create a new header line for each record).
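The slow path could be as simple as the sketch below, which takes the text produced by the strings/sed step and rebuilds the records by splitting each line on its first '='. slow_parse is a hypothetical name, and the sample input is invented. One caveat worth keeping in mind: by default 'strings' only prints runs of at least four printable characters, so a very short name=value field could be dropped unless the minimum length is lowered with -n.

```python
def slow_parse(text: str) -> list[dict]:
    """Rebuild records from cleared text, one dict per RSYSLOG_HEADER line."""
    records = []
    for line in text.splitlines():
        if line.startswith("RSYSLOG_HEADER "):
            records.append({})              # a header line starts a new record
        elif records and "=" in line:
            # Split on the first '=' only, so values may contain '='.
            name, value = line.split("=", 1)
            records[-1][name] = value
    return records


cleared = (
    "RSYSLOG_HEADER Size=0\n"
    "rawmessage=<13>hi\n"
    "tag=a=b\n"
    "RSYSLOG_HEADER Size=0\n"
    "rawmessage=<13>yo\n"
)
records = slow_parse(cleared)
```

An external repair tool would do the same parse and then write each record back out with a freshly computed header.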
David Lang