On Wed, 31 Oct 2012, Rainer Gerhards wrote:

Hi all,

There is the dangling issue that rsyslog has outgrown its current queue subsystem. I 
am currently considering a refactoring or a complete redesign. I initially wanted to 
write a large blog post with all details and ideas, but have now opted to split this into a 
couple of parts, both because I have trouble finding time to do the "big one" 
at once and because it is probably smarter to get feedback asap.

So here is the initial part:

http://blog.gerhards.net/2012/10/rsyslog-disk-queues-refactor-or.html

This will get anyone interested in the queue subsystem a broad understanding of 
how it works - and why. Please share any concerns you have about the current 
system as well as wishes/suggestions on what should improve. Deeply technical 
information is fine, actually appreciated.

I intend to let the discussion run and write the other parts of the blog series when "events 
warrant it" ;) Due to other projects, I can probably not discuss 10 hours a day, but will try 
to be as active as possible (which hopefully means "much"). The intent is to come up with 
a solution that will be good for the next five years to come...

Thinking a bit more about the disk format.

We have two competing requirements:

1. Making it as fast as possible for rsyslog to read and write the data.

2. Making it human readable, so that it can be salvaged by a person if something goes horribly wrong.

For the former, binary data structures are desirable.

For the latter, you want everything in text.

For rsyslog, this is greatly simplified by the fact that everything we are processing is text, and does not have any embedded newlines.


One approach to consider is to not store anything in the file that can be re-calculated (i.e. store the rawmessage, a little extra metadata and then run it through the parsing stack when you dequeue the message)

This costs a significant amount of CPU, and runs the risk that the parsing may not end up being the same (processing a queue file after a restart)

In addition, with version 7 and the ability to set variables and fields in structures, the data in a queue file tied to a specific action may have been manipulated significantly since the message arrived.

So I don't think this is the way we want to go.


Another approach is to define everything as text fields (i.e. name=value\n) and then parse it when you read it in.

This is also pretty expensive in CPU.


One trick that we can pull to greatly speed up the processing is to play pointer games.

If you take a line of text, you can very cheaply walk through it and record a pointer to the beginning of each word, replacing spaces with null characters. This is FAR faster than copying the data to new memory locations, and it lets you treat the resulting words as standard C strings.
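To make the pointer trick concrete, here is a minimal sketch in C. The function name split_in_place and its signature are made up for illustration; this is not code from rsyslog.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: split `line` in place on spaces.
 * Every space is overwritten with '\0' and a pointer to the start of each
 * word is stored in `words`. Returns the number of words recorded (at most
 * `max_words`). No message bytes are copied. */
static size_t split_in_place(char *line, char **words, size_t max_words)
{
    size_t n = 0;
    char *p = line;

    assert(line != NULL && words != NULL);
    while (*p != '\0') {
        while (*p == ' ')           /* terminate previous word, skip runs */
            *p++ = '\0';
        if (*p == '\0' || n == max_words)
            break;                  /* end of line, or pointer table full */
        words[n++] = p;             /* remember where this word starts */
        while (*p != '\0' && *p != ' ')
            p++;                    /* advance to the end of the word */
    }
    return n;
}
```

After the call, each words[i] is an ordinary NUL-terminated C string that still lives inside the original buffer, so the buffer must stay allocated for as long as the pointers are in use.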

I would suggest a variation of this.

store everything as name=value<null> (doing a 'strings file' will return name=value\n)

add a header to each message, something along the lines of:

RSYSLOG_HEADER Size=###### <base64 encoded data><null>

where size is the size of the base64 encoded stuff

The binary data encoded in the base64 blob would be along the lines of:
<offset to rawmessage><offset to timestamp><offset to received time>...
for the standard properties, followed by
<offset to name><offset to value> for all the dynamically generated data

where the offset is to the start of the value field in each name=value<null> 'line' for the standard properties.
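As a rough sketch of how a reader might recognize the header line described above: the code below assumes Size is written as plain decimal digits and that one space separates it from the base64 text. Both are guesses at details the proposal leaves open, and parse_header is a hypothetical name. Decoding the base64 blob itself is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define HDR_PREFIX "RSYSLOG_HEADER Size="

/* Hypothetical helper: parse "RSYSLOG_HEADER Size=<n> <base64>" from a
 * NUL-terminated line. On success returns 0, stores n in *size and a
 * pointer to the base64 text in *blob. Returns -1 if the line is not a
 * header or the stated size does not match the blob length. */
static int parse_header(const char *line, unsigned long *size, const char **blob)
{
    const size_t plen = sizeof(HDR_PREFIX) - 1;
    char *end;

    assert(line != NULL && size != NULL && blob != NULL);
    if (strncmp(line, HDR_PREFIX, plen) != 0)
        return -1;
    if (line[plen] < '0' || line[plen] > '9')
        return -1;              /* strtoul alone would accept signs/spaces */
    errno = 0;
    *size = strtoul(line + plen, &end, 10);
    if (errno != 0)
        return -1;
    if (*end == ' ')
        end++;                  /* skip the separator before the blob */
    else if (*end != '\0')
        return -1;
    *blob = end;
    return strlen(end) == *size ? 0 : -1;
}
```

A header of "RSYSLOG_HEADER Size=0" with nothing after it is accepted with an empty blob, which a reader could take as the signal to fall back to slow parsing of the fields.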

This would allow rsyslog to _very_ quickly know where everything is, and copy the standard properties into the queue record memory structure. For dynamic properties it would be a smidge slower, as you can't just strcpy the values from a 'known' location to a fixed location in the message structure: you would have to read the name (<offset to value> - <offset to name> - 1 bytes of text), and then set up the location for it before copying the value into place. But this should still be faster than parsing arbitrary text. If it is not, just have the dynamically generated data fields be parsed when they are read; so far they aren't that common, so this cost won't dominate.
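A small sketch of the offset idea, in C. The property set, struct layouts, and function names here are all invented for illustration, the offsets are kept as plain uint32_t values, and the base64 step is skipped entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical set of standard properties. */
enum { PROP_RAWMSG, PROP_TIMESTAMP, PROP_RCVTIME, PROP_COUNT };

struct record_index {
    uint32_t value_off[PROP_COUNT];   /* where each value starts in the body */
};

struct dyn_entry {
    uint32_t name_off, value_off;     /* offsets for one dynamic property */
};

/* Writer side: append "name=value\0" at body[used] and note where the
 * value starts. Returns the new number of bytes used. */
static size_t append_field(char *body, size_t cap, size_t used,
                           const char *name, const char *value,
                           uint32_t *value_off)
{
    size_t nlen = strlen(name), vlen = strlen(value);

    assert(used + nlen + 1 + vlen + 1 <= cap);
    memcpy(body + used, name, nlen);
    body[used + nlen] = '=';
    *value_off = (uint32_t)(used + nlen + 1);
    memcpy(body + *value_off, value, vlen + 1);   /* copies the '\0' too */
    return used + nlen + 1 + vlen + 1;
}

/* Reader side: a standard property is one pointer addition away, and the
 * result is already a NUL-terminated C string. */
static const char *get_prop(const char *body, const struct record_index *idx,
                            int prop)
{
    assert(prop >= 0 && prop < PROP_COUNT);
    return body + idx->value_off[prop];
}

/* Name length of a dynamic property: value offset minus name offset,
 * minus one for the '=' separator. */
static size_t dyn_name_len(const struct dyn_entry *e)
{
    return (size_t)(e->value_off - e->name_off - 1);
}
```

The reader never searches for '=' or for the end of a field when it fetches a standard property; that scanning cost is paid once, by the writer.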


For data recovery purposes (where a person needs to manually tweak the file to recover from problems), do

strings queuefile | sed 's/RSYSLOG_HEADER Size=.*$/RSYSLOG_HEADER Size=0/'

to clear the header, and rsyslog can have a fallback mode (or external repair tool) that does the slow parsing of everything (if it's an external tool, it can create a new header line for each record)
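The slow fallback could look roughly like the following C sketch, which ignores the header and simply walks the NUL-terminated name=value fields. The name scan_fields and the callback type are hypothetical, and how malformed fields should really be handled is an open question; here they are just skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback invoked once per well-formed field. */
typedef void (*field_cb)(const char *name, size_t name_len,
                         const char *value, void *ctx);

/* Walk `len` bytes of NUL-terminated "name=value" fields without using any
 * offset table. Fields with no '=' or with an empty name are skipped, and a
 * final field that lacks its terminating NUL is ignored. Returns the number
 * of well-formed fields seen. `cb` may be NULL to only count. */
static size_t scan_fields(const char *buf, size_t len, field_cb cb, void *ctx)
{
    size_t count = 0, pos = 0;

    assert(buf != NULL);
    while (pos < len) {
        const char *field = buf + pos;
        const char *end = memchr(field, '\0', len - pos);
        const char *eq;

        if (end == NULL)
            break;                              /* truncated last field */
        eq = memchr(field, '=', (size_t)(end - field));
        if (eq != NULL && eq != field) {
            if (cb != NULL)
                cb(field, (size_t)(eq - field), eq + 1, ctx);
            count++;
        }
        pos = (size_t)(end - buf) + 1;          /* step past the NUL */
    }
    return count;
}
```

An external repair tool could use the same walk to recompute the offsets and write a fresh header for each record.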

David Lang
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
