[rsyslog] property nesting & templates & mongo -was: Regex logging to MongoDB

Rainer Gerhards Fri, 24 Aug 2012 01:22:41 -0700

Hi all,

sorry for the silence; I really had to finish a couple of things before I could
turn to this discussion. And, btw, thanks for the good points. I am using this
reply as a kind of wrap-up, to save us the work of going through multiple replies.

First off: this is a complicated problem, and one I already asked for advice on
on the mailing list a couple of weeks ago. So I am very happy we have finally
come to a point where solutions pop up :-)

> -----Original Message-----
> From: [email protected] [mailto:rsyslog-
> [email protected]] On Behalf Of [email protected]
> Sent: Thursday, August 23, 2012 8:39 PM
> To: rsyslog-users
> Subject: Re: [rsyslog] Regex logging to MongoDB
> 
> On Thu, 23 Aug 2012, Miloslav Trmac wrote:
> 
> > ----- Original Message -----
> >> On Wed, 22 Aug 2012, Miloslav Trmac wrote:
> >>
> >>> ----- Original Message -----
> >>>> Why have your own template engine instead of using the normal
> rsyslog
> >>>> template engine?

I think this is a core misunderstanding in this thread: Miloslav (Mirek) is 
actually working on EXTENDING THE CURRENT PROPERTY ENGINE. He is NOT doing a 
new module. David, not sure if you looked at his code. If not, you should.
The "interesting head" is here:

http://fedorapeople.org/cgit/mitr/public_git/rsyslog.git/?h=ommongodb

and this commit is kind of the outline of the current work:

http://fedorapeople.org/cgit/mitr/public_git/rsyslog.git/commit/?h=ommongodb&id=da801d28ef7427897fadc1d9de88c8999413e836

You can grasp the idea very quickly.

> >>>
> >>> Primarily I was considering the use case of modifying $!all-json -

$!all-json was and is a very, very quick and dirty hack. I created it because
it was needed for a couple of reasons, but the necessary base plumbing was not
present - mostly because at that time things (in CEE) were too much of a moving
target. Note that it is still undecided whether we have a flat or a hierarchical
field set (or was it decided while I was on vacation - at least I saw no
notification).

On the other hand, for lumberjack we all seem to agree that hierarchical is the 
way to go. However, I did not finally decide how to internally represent the 
hierarchy, having the choice between multiple libraries. During my vacation I 
had some good time to think about this and will now most probably go with the 
cjson native representation, which seems to be sufficient and quite efficient. 
But having such a library finally decided IMHO is a core requirement to get 
serious about the formatter.

[snip]

> > It just seems to me that text-based templates don't fit the case
> where we actually want to work with a deeper structure very well - is

Jupp, that's exactly the point. We do not yet have a deep structure
*implemented*. Libee works on CEE semantics, which at the time libee was
written were flat (and the spec possibly still says so).

[snip]

> > I can see a fairly reasonable alternative - mostly give up on
> templates
> > for ommongodb (e.g. only support an equivalent of %$!all-json%, not
> > arbitrary templates), and create a separate message modification
> module
> > that could be used for arbitrary field editing.

A message modification module is the wrong choice. An enhanced template 
processor is the right way (think about the many capabilities like substrings, 
regex extraction, ... the template processor provides).

[snip]

> > In more detail, I think this would happen with text-based templates:
> > 1. Lines of text are received using any network protocol
> > 2. mmjsonparse extracts field values from @cee-marked messages:
> >   2a. a JSON parser converts text into a JSON parser data structure
> >   2b. ... which is converted into libee data structure
> 
> one thing you are missing is how rsyslog does this internally, it does
> it
> by making one copy of the string and then walking through the string,
> replacing spaces with nulls and keeping pointers to the start of each
> substring. As a result this is a very efficient process.

Be careful: Mirek is talking about mmjsonparse, NOT the normalizer. 
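
For readers unfamiliar with the technique David describes, here is a minimal
sketch of it in Python: one copy of the line, spaces overwritten with NUL
bytes, and a list of start offsets standing in for C pointers. The function
names are made up for the illustration; this is not rsyslog code, and as noted
above it is not what mmjsonparse does.

```python
def split_in_place(line: bytes):
    """Copy the line once, turn spaces into NULs, remember where fields start."""
    buf = bytearray(line)          # the single copy of the input
    starts = [0]                   # offsets play the role of C pointers
    for i, byte in enumerate(buf):
        if byte == 0x20:           # ASCII space
            buf[i] = 0             # terminate the previous field
            starts.append(i + 1)   # the next field begins right after it
    return buf, starts


def field_at(buf: bytearray, start: int) -> bytes:
    """Read one NUL-terminated field, the way C code would read a string."""
    end = buf.find(b"\x00", start)
    if end == -1:                  # last field runs to the end of the buffer
        end = len(buf)
    return bytes(buf[start:end])


buf, starts = split_in_place(b"sshd login failed")
fields = [field_at(buf, s) for s in starts]
```

No per-field copy is made until a field is actually read, which is where the
efficiency of this approach comes from.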

> 
> > 3. Rsyslog core processes a template with field values and other
> > properties pasted, to create a text line:
> >   3a: libee data structures are repeatedly searched for relevant
> fields
> >   3b: each of the fields and other properties is converted into
> partial strings
> 
> these fields are all text to start with, they don't have to be
> converted
> into partial strings.

That's not 100% true: if the JSON representation is "a=5.0", then this of 
course is text, but the json spec says it is a double because that syntax is 
used! So we do have types at this point (again: talking about JSON format 
coming in and being parsed). The current implementation deliberately discards 
this information and re-formats non-strings to strings.
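
As a small sketch of that point (in Python, and with a proper JSON object
instead of the shorthand above): a JSON parser already hands back typed values,
and it is a later formatting step that flattens them to strings. The
`stringify` helper is hypothetical and only mimics the behaviour described
here; it is not the rsyslog implementation.

```python
import json

# Parse a JSON object: the parser assigns types based on the syntax used.
parsed = json.loads('{"a": 5.0, "b": 5, "c": "5.0", "d": true}')
types = {name: type(value).__name__ for name, value in parsed.items()}


def stringify(fields):
    """Hypothetical stand-in for 're-format non-strings to strings'."""
    return {
        name: value if isinstance(value, str) else json.dumps(value)
        for name, value in fields.items()
    }


flat = stringify(parsed)
# After flattening, the double "a" and the string "c" can no longer be told apart.
same_after_flattening = flat["a"] == flat["c"]
```

Once everything is text, a consumer that wants the original types back has to
guess them.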

> 
> >   3c: these strings are concatenated to create a single text line
> > (hoping that the user got JSON escaping exactly right)
> 
> 
> 
> 
> > 4. ommongodb receives the template-formed string, and acts on it:
> >   4a: a JSON parser converts the string into a JSON parser data
> structure
> >   4b: ... which is converted into a BSON data structure
> >   4c: ... which is converted into a BSON byte stream, and finally
> sent to the MongoDB server.
> 
> This seems like a fairly inefficient way of doing things, why not
> convert
> the JSON string directly to a BSON byte stream?

That's a good question. But think about it: that would mean that the rsyslog 
*core* would need to understand BSON. That in turn means that the core would 
have a dependency on the BSON (mongodb) library. And *that* is a really bad 
thing. Remember that a core reason for creating the plugin system was to prevent
such dependencies. One could, of course, create some template plugin, but this 
plugin would be mongo-specific. This doesn't sound like the right thing to me.
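
To make the dependency argument a bit more concrete, here is a rough Python
sketch of the split: the "core" part only produces a format-neutral list of
named fields, and everything format-specific sits behind a plugin interface.
All names (`OutputPlugin`, `core_dispatch`, `JsonTextPlugin`) are invented for
this illustration, and the JSON encoder merely stands in for a BSON one so the
sketch needs no MongoDB library.

```python
import json
from typing import List, Protocol, Tuple

Field = Tuple[str, object]        # (name, value) pair, format-neutral


class OutputPlugin(Protocol):
    """The only thing the core knows about an output: it accepts fields."""

    def emit(self, fields: List[Field]) -> bytes: ...


def core_dispatch(fields: List[Field], plugin: OutputPlugin) -> bytes:
    """'Core' side: knows fields and the plugin interface, nothing else."""
    return plugin.emit(fields)


class JsonTextPlugin:
    """Plugin side: the one place that knows the wire format."""

    def emit(self, fields: List[Field]) -> bytes:
        return json.dumps(dict(fields), separators=(",", ":")).encode("utf-8")


wire = core_dispatch([("host", "web1"), ("a", 5.0)], JsonTextPlugin())
```

Swapping in a different output format then means writing another plugin class;
`core_dispatch` and its imports stay untouched.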

> 
> > With the field-based templates, 3. and 4. is:
> > 3. Rsyslog core processes a template with field values and other
> properties pasted, to create a list of named fields
> >   3a: libee data structures are repeatedly searched for relevant
> fields
> >   3b': each of the fields and other properties is individually
> converted into text
> >   (3c missing)
> 
> they are text already, there's no conversion needed in step 3b
> 
> > 4. ommongodb receives the template-formed list of fields, and acts on
> it:
> >   (4a missing)
> >   4b': The field list is converted into a BSON data structure
> >   4c: ... which is converted into a BSON byte stream, and finally
> sent to the MongoDB server.
> 
> you are missing that ommongodb gets the list of fields and then
> "repeatedly searches for the relevant fields" in the data structure.
> 
> There is some value in the idea of creating a new module interface that
> sends a data structure instead of a string, but this is a fairly
> significant change to the core of rsyslog. And even if it turns out to
> be
> a good idea

I think it is, but IMHO we should start with looking at the end-user interface 
first, decide how this may look and then decide on an implementation.

David is also right that this is a quite big change to rsyslog core (especially 
with deep structure being added). V6 was meant to provide the new config 
format. I am seriously thinking about kicking off v7 for these changes, so that
users who just want to use the new config system can stick to v6.

> to have a different interface for some modules, it's still
> a
> bad idea to have the manipulation of this list (adding fields, etc) be
> done in the output module. It should be done in rsyslog.
> 
> 
> > i.e. the result can be done with a little less effort.  That's not a
> decisive factor for me, though.

The performance of the solution is very important to me, because we have many
high-end users who really see a difference for things that are 5 to 10% slower 
than they need to be. This is what really annoys me with the current 
JSON-to-BSON translation that's needed for mongo. I think this was also the 
root cause that I did NOT implement it that way but stuck to a fixed schema as 
an interim solution.
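
The extra step can be shown in a few lines. This is only a Python sketch of the
two routes discussed in this thread, with made-up field values; it says nothing
about actual rsyslog timings. The text route serializes the fields to a JSON
string and parses that string again on the output side, while the field route
hands the fields over as they are. The last part shows what happens when the
JSON text is built by plain string concatenation and the escaping is not right.

```python
import json

fields = [("msg", 'user "bob" logged in'), ("a", 5.0)]

# Text route: render a JSON string, pass it on, parse it again.
rendered = json.dumps(dict(fields))
doc_from_text = json.loads(rendered)

# Field route: pass the fields themselves, no text in between.
doc_from_fields = dict(fields)

# Naive template-style concatenation without escaping the embedded quotes.
naive = '{"msg": "' + fields[0][1] + '"}'
try:
    json.loads(naive)
    naive_is_valid = True
except json.JSONDecodeError:
    naive_is_valid = False
```

Both routes end with the same document; the text route just does a serialize
and a parse to get there, and only works if the escaping is done correctly.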

> >
> >
> >>> Using the raw "sequence of fields without any formatting" format is
> not
> >>> great, I agree - but then pretending that the template can be an
> >>> arbitrary JSON format and we parse it intelligently is not great
> >>> either.
> >>> However that's definitely open to a change.
> >>
> >> I thought I saw you saying that you wanted to send JSON to the
> database.
> >> If that is the case, then let the sysadmin create the JSON and
> insert
> >> that.
> >
> > The MongoDB command format, unlike most SQL databases, does not send
> a
> > text command, but something pre-parsed.  It is not possible to just
> send
> > what the sysadmin created as-is.  (It is, of course, possible to just
> > parse it into JSON and send that with zero semantic modification).

I'd rephrase that slightly (and Mirek already said it): the goal is to keep the
sysadmin in charge, but the current tooling needs to be enhanced to do this 
decently.

[snip]

Bottom line for me: in general, I like Mirek's approach and appreciate his work 
very much. He finally seems to get this discussion going :-)

As David said (in a later mail), we need to define a new interface (and Mirek 
has done so). We are looking at implementation right now. I would suggest
focusing, for a moment, on the end user perspective: what do we want to specify?
How can this be done inside templates?

We do not need to limit ourselves to the current template syntax; after all, we can extend
it at will (and without harming existing knowledge). We should also assume that 
rsyslog already supports nested structures, like

.cee.subbranch.field
.audit.field
.userdefined.field

and so on...
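
Purely as an illustration of what such names could resolve to, here is a small
Python sketch that treats the event as nested dictionaries and walks a dotted
path. The `lookup` function and the sample values are assumptions made for the
example; nothing here reflects an existing rsyslog interface.

```python
def lookup(tree, path):
    """Resolve a dotted name like '.cee.subbranch.field' in nested dicts."""
    node = tree
    for key in path.lstrip(".").split("."):
        if not isinstance(node, dict) or key not in node:
            return None            # unknown name: nothing to return
        node = node[key]
    return node


event = {
    "cee": {"subbranch": {"field": "x"}},
    "audit": {"field": 42},
    "userdefined": {"field": True},
}

single = lookup(event, ".cee.subbranch.field")   # one leaf value
branch = lookup(event, ".cee")                   # a whole sub-branch
missing = lookup(event, ".nosuch.field")         # name that does not exist
```

The same lookup serves both a single field and a full sub-branch; which one
comes back depends only on how deep the name goes.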

We want to keep the ability to permit the rsyslog core to reformat properties 
and we also need the ability to include constant text/fields (a detail, but 
important to many users). But what do we want to do with that and how?

For sure we want to access single fields. I also assume we want to access full 
sub-branches (like "include all under .cee"). For the latter, I assume we want 
the capability to remove and add fields to the subbranch set. Do we need more? 
Again, how do we configure that? The config file format question actually seems 
to be more important to me than the implementation side (maybe because I have a 
very hard time trying to get a good idea ;)). IMHO once we have the config 
format, the implementation follows easily.
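
To have something concrete to argue about, here is one possible shape of the
"sub-branch plus remove/add" operation, sketched in Python. The function name,
its parameters and the sample fields are all hypothetical; this is about the
semantics only, not a proposal for the config syntax.

```python
import copy


def select_branch(tree, branch, remove=(), add=None):
    """Return a copy of one sub-branch with some fields removed and added."""
    result = copy.deepcopy(tree.get(branch, {}))   # never modify the event
    for name in remove:
        result.pop(name, None)                     # ignore unknown names
    result.update(add or {})                       # constant fields go here
    return result


event = {
    "cee": {"user": "bob", "password": "secret", "action": "login"},
    "audit": {"id": 7},
}

out = select_branch(
    event,
    "cee",
    remove=["password"],
    add={"source": "rsyslog"},
)
```

Working on a copy keeps the original event intact, so several templates can
select different views of the same message.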

Rainer 

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
