Hi David,

Thanks a lot for your reply! I will add my comments inline.

2013/12/4 David Lang <da...@lang.hm>

> On Wed, 4 Dec 2013, Radu Gheorghe wrote:
>
>  Hi list :)
>>
>> I'm trying to understand if mmnormalize is a good fit for parsing a high
>> volume of logs, given the fact that events are really heterogeneous
>> (think log4j logs, apache logs, whatever logs are commonly produced).
>>
>> My only frame of reference is Logstash's grok filter
>> <http://logstash.net/docs/1.2.2/filters/grok>,
>>
>> which allows you to tag regular expressions in a dictionary, and then use
>> those tags to match fields from logs, and put them in a structured event.
>> Much like how you'd build a liblognorm rulebase.
>>
>> If I got it right, the advantage of mmnormalize seems to be performance,
>> because it avoids using regular expressions. Not sure how this
>> actually works, though. Practically, it sounds like this comes at the
>> expense of flexibility: if I need to add a new "pattern" in liblognorm
>> (say, a new date format) I'd have to patch the library itself, no?
>>
>
> For a completely new type of data you would have to modify the library, but you
> seldom need to do that because when you are processing the logs, all you
> really care about is that this string of characters is the date, you aren't
> parsing the date so that you can do calculations on it.
>

So you're basically saying that if I just want to "copy-paste" a new date,
I can simply say "word" or "char-to" and it should work. If I need to parse
an SQL date and send it over, for example as an ISO date, I need a new type
and therefore liblognorm needs patching. Right?
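
To make the two cases concrete, here is a minimal Python sketch. It is a toy
model of what "word" and "char-to" style fields do, not liblognorm's actual
code; the function names and the sample log line are made up for illustration.

```python
from datetime import datetime


def take_word(text, pos):
    """Toy 'word' field: everything from pos up to the next space."""
    end = text.find(" ", pos)
    if end == -1:
        end = len(text)
    return text[pos:end], end


def take_char_to(text, pos, stop):
    """Toy 'char-to' field: everything from pos up to the stop character."""
    end = text.find(stop, pos)
    if end == -1:
        raise ValueError("stop character not found")
    return text[pos:end], end


# Hypothetical log line with an SQL-style timestamp.
line = "2013-12-04 10:15:42,INFO,user logged in"

# Case 1: "copy-paste" the date. The parser never looks inside the string.
raw_date, pos = take_char_to(line, 0, ",")

# Case 2: interpret the date and re-emit it as ISO 8601. This needs code
# that understands the format, which is where a new field type would come in.
iso_date = datetime.strptime(raw_date, "%Y-%m-%d %H:%M:%S").isoformat()

print(raw_date)
print(iso_date)
```

In the first case only the field boundary matters; in the second the parser
has to know the date format itself.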

If so, this means that I can either make do with the field types that exist, or
patch liblognorm. That was my initial assumption, which leaves me a bit
undecided. On one hand, the current set of field types looks like it would
suit 99.9999999999999% of the logs out there. On the other hand, you don't
really know until you try. I tried to use mmnormalize a few months
ago in my setup and failed because it didn't have a field type to match the
string until the end of the line. Now it has one, so I'm going to give it a
second shot. But God knows what will come up next. So it would be nice
to have an easy way to define new field types.

I'm guessing this is a design thing. You need to have those "specific"
types if you want to have the awesome performance. Right?


>
> As long as you can say 'this string of characters is what I care about,
> and I'm going to label it "date"' you are in good shape.
>
> mmnormalize is far better than regex engines for a couple of reasons.
>
> 1. full regex support requires supporting some very expensive types of
> expressions, even if you don't plan to use them. This costs.
>
> 2. regex engines almost always go down the list, does regex1 match, if not
> does regex2 match, if not does regex3 match, ....
>
> mmnormalize in comparison compiles your config into a parse tree, so it
> can walk down the log message a character at a time, looking that character
> up in the parse tree, and when it comes to the end of the line it knows it
> has the correct match. So instead of being O(N) in the number of
> rules, it's O(1) in the number of rules: the cost depends only on the
> (relatively) short length of the lines.


Thanks for the explanation. This makes a lot of sense. So it should really
be A LOT faster, which would make a lot of difference at scale.
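
As I understand the explanation above, the difference can be sketched in
Python like this. It is a simplified model, not how liblognorm or grok are
actually implemented: the rules are plain literal prefixes (no fields), and
the rule names and sample line are invented.

```python
import re

# Hypothetical rules: rule name -> literal prefix that identifies the message.
RULES = {
    "sshd_accept": "Accepted password for ",
    "sshd_fail": "Failed password for ",
    "cron_start": "CRON session opened for ",
}

END = ""  # marker key; cannot collide with a single-character key


def match_sequential(line):
    """Try each pattern in turn; the work grows with the number of rules."""
    attempts = 0
    for name, prefix in RULES.items():
        attempts += 1
        if re.match(re.escape(prefix), line):
            return name, attempts
    return None, attempts


def build_tree(rules):
    """Merge all prefixes into a single character tree (a trie)."""
    root = {}
    for name, prefix in rules.items():
        node = root
        for ch in prefix:
            node = node.setdefault(ch, {})
        node[END] = name
    return root


def match_tree(tree, line):
    """Walk the line once; the work depends on line length, not rule count."""
    node = tree
    steps = 0
    for ch in line:
        if END in node:
            break
        if ch not in node:
            return None, steps
        node = node[ch]
        steps += 1
    return node.get(END), steps


tree = build_tree(RULES)
sample = "CRON session opened for root"
print(match_sequential(sample))
print(match_tree(tree, sample))
```

Adding a thousand more rules makes the sequential loop longer in the worst
case, while the tree walk still consumes at most one node per character of
the line.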


>
>
>  Speaking of scope, can liblognorm be enhanced to support parsing multiline
>> messages? This seems to be possible in grok:
>> https://logstash.jira.com/browse/LOGSTASH-692
>>
>
> multiline logs cause all sorts of problems. In general you should avoid
> them or collapse the multiline logs into a single line when you get them into
> your logging system, because too many things will break a multiline log into
> multiple logs. In some cases you can carefully configure everything to
> handle multiline logs, but it's very fragile and prevents you from using
> many tools and transport mechanisms.


Yeah, I know these tend to be a pain. But I have to deal with them.
Collapsing sounds like a hack to me because I need to be aware of what I'm
doing down the pipeline. For example, something else that works with the
log, like a UI, would need to know that the strange character is actually
a newline. I'll probably also have to escape it... The whole thing sounds
more complicated (and hackier) than dealing with the newline itself.
Especially since, right now at least, from rsyslog my events go to
Elasticsearch (probably something else in future, like HDFS) and then
Kibana and some other UI. All these have no problem handling multi-line
events, so if rsyslog works with them, too, I'll be good.
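
Here is a rough Python sketch of what I mean by collapsing being extra work
down the pipeline. The escaping scheme (backslash-n) and the function names
are just assumptions for illustration; nothing here is an rsyslog feature.

```python
def collapse(event):
    """Encode a multi-line event as a single line.

    Backslashes are escaped first, so a literal backslash followed by 'n'
    in the message stays distinguishable from an encoded newline.
    """
    return event.replace("\\", "\\\\").replace("\n", "\\n")


def expand(line):
    """Reverse collapse(). Every downstream consumer needs this step."""
    out = []
    i = 0
    while i < len(line):
        if line[i] == "\\" and i + 1 < len(line):
            nxt = line[i + 1]
            out.append("\n" if nxt == "n" else nxt)
            i += 2
        else:
            out.append(line[i])
            i += 1
    return "".join(out)


# Hypothetical two-line stack trace that also contains a real backslash.
event = "RuntimeException: bad path C:\\new\n\tat Foo.bar(Foo.java:1)"

one_line = collapse(event)
print(one_line)
print(expand(one_line) == event)
```

The round trip only works if the UI (or whatever reads the event later)
knows about the escaping and applies the reverse step.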


>
>
>  For me, it's important to understand whether I should put effort in
>> working with mmnormalize and sponsor needed enhancements, or would
>> sponsoring a new "mmgrok" module be a better idea for my use-case.
>> Because it looks like grok is available as a C library as well:
>> https://github.com/jordansissel/grok
>>
>
> It's not clear what enhancements you are thinking that you need (other
> than the multiline support, which as I say is problematic)


To be honest, it's not clear to me either, because I haven't started working
with it yet. It should be clear in less than a month, though. Expect the
list to be spammed with mmnormalize questions :)

My question for now is basically "what's the scope of mmnormalize?". Is it
very hard to add a new type? If such additions should be rare and take lots
of time, maybe mmgrok makes more sense. Is it very hard or unacceptable to
add multi-line support? These are more about the design than about the
current functionality, and I need to understand if enhancing mmnormalize is
the way to go for a scenario like mine or whether I should go for something like
mmgrok.

Lots of people send logs from rsyslog to Elasticsearch or Solr via stuff
like Logstash or Flume because of grok. I'm thinking that if I had
grok-like capabilities in rsyslog, I'd be able to skip a step and have an
easier and faster setup. If mmnormalize can do that, it sounds like it
would be MUCH faster.

This is not to say that grok is the only reason one would use
Logstash/Flume. With Logstash, for example, you have lots of stuff to modify
your events (like a geoip filter
<http://logstash.net/docs/1.2.2/filters/geoip>), and it's
trivial to add new ones (I've recently committed a Solr
output plugin <https://github.com/logstash/logstash/pull/675> and I'm a
noob at Ruby). I don't think rsyslog can (or should?) have all these
features. But if you can do the bulk of processing in rsyslog, I bet
there will be much more interest in it when it comes to large-scale log
processing, because of how fast it is. In my mind, that should draw more
testing, more contributions, more sponsoring and hopefully make everyone
happy.

Best regards,
Radu