On Fri, 21 Aug 2015, Radu Gheorghe wrote:
Hello rsyslog users :)
We've seen a problem that is similar to the one reported here:
http://www.gossamer-threads.com/lists/rsyslog/users/17550 While that looks
like a bug, ours seems like a design issue.
Basically we see bulks of one document all over the place. I'm not 100% sure
what the root cause is, but I'm thinking: if you have many machines with
rsyslog installed that send logs to Elasticsearch, but most of them send few
logs, they would never get enough messages in the queue to push in large
batches. Unless you add a slowdown, in which case you restrict rsyslog's
ability to push data when it's under load.
If you have all your systems send to a central aggregation point, rather than
into ES directly, that aggregation point gets the combined traffic, and is much
more likely to have data available to send.
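A minimal sketch of that layout in rsyslog's RainerScript syntax (hostnames, port, queue size, and any index settings are placeholders, not values from this thread):

```
# On each client: forward everything to the aggregator
action(type="omfwd" target="aggregator.example.com" port="514" protocol="tcp")

# On the aggregator: bulk-insert the combined stream into ES
module(load="omelasticsearch")
action(type="omelasticsearch"
       server="es.example.com"
       bulkmode="on"
       queue.type="linkedlist"
       queue.dequeuebatchsize="1024")
```

With the combined traffic of all clients arriving at one box, the aggregator's queue is far more likely to have many messages waiting each time a send completes.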
If you have 10K docs/s coming in 1 doc batches (say, from 10K machines),
there's a lot of unnecessary load on ES. Sure, if ES is overloaded things
will get better (as documents will add up in queues, resulting in bigger
batches) but even then I'd imagine things will look quite inefficient.
Plus, I'd like to avoid ES being overloaded in the first place.
The solution, in my mind, was to add two options:
- one that says "if you don't have at least N items in the bulk, wait a bit
until you have"
- one that overrides it saying "if M seconds passed since the last bulk,
send the bulk anyway"
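For reference, the proposed min-batch/max-wait policy could be sketched like this. This is a hypothetical illustration, not rsyslog code; the parameter names `min_batch` and `max_wait` are made up for the sketch, and `None` on the queue is used as a shutdown sentinel:

```python
import queue
import time

def batched_flush(q, flush, min_batch=50, max_wait=2.0):
    # Sketch of the proposed policy: flush once min_batch messages are
    # buffered, or once max_wait seconds pass with a non-empty buffer,
    # whichever comes first. A None on the queue means "drain and stop".
    buf = []
    deadline = None                      # armed when the buffer becomes non-empty
    while True:
        timeout = None if deadline is None else max(0.0, deadline - time.monotonic())
        try:
            msg = q.get(timeout=timeout)
        except queue.Empty:
            msg = queue.Empty            # the max_wait timer fired
        if msg is None:                  # sentinel: drain and stop
            if buf:
                flush(buf)
            return
        if msg is not queue.Empty:
            buf.append(msg)
            if deadline is None:
                deadline = time.monotonic() + max_wait
        if buf and (len(buf) >= min_batch or time.monotonic() >= deadline):
            flush(buf)
            buf, deadline = [], None
```

Note that even this small sketch needs a timer armed and re-armed on every flush, which is exactly the complexity discussed below.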
This sort of logic tends to be rather fragile (setting timers, checking how
long it's been, etc. ends up really hurting you when you are under load). It's
also the sort of thing that is routinely misconfigured to really hurt you.
The approach that rsyslog takes is to send something as soon as it's available,
let things queue up while that's being processed, and then send what's queued up
(with a max limit).
This has the advantage of simplicity and performance. There are no timers to
set up, no timestamps to check, and the latency in message delivery is the
minimum possible.
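A sketch of that send-then-drain loop, as I understand the description above (illustrative only, not rsyslog's actual implementation; `None` again serves as a shutdown sentinel):

```python
import queue

def adaptive_batches(q, send, max_batch=1024):
    # Send the first available message immediately; everything that
    # arrived while the previous send was in progress becomes the next
    # batch (capped at max_batch). No timers or timestamps involved.
    while True:
        first = q.get()                  # block until something arrives
        if first is None:
            return
        batch = [first]
        while len(batch) < max_batch:    # grab whatever queued up meanwhile
            try:
                item = q.get_nowait()
            except queue.Empty:
                break
            if item is None:
                send(batch)
                return
            batch.append(item)
        send(batch)
```

The batch size adapts automatically: when the output is slow or traffic is heavy, more messages pile up during each send and batches grow on their own.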
As a result, the sort of change you are looking for will almost certainly not
go into the core. I believe that the ES module has its own buffer of messages
that it's sending, so it could go there (IIRC the omelasticsearch module was
contributed).
Now, where does this help?
When traffic is really slow, this won't help; everything will still be
singletons.
When traffic is heavy (just above the minimum batch size), this won't help
either; everything will be sent the same way with either set of logic.
There is a middle ground where fewer, but larger, batches are being sent, and
things will flow more efficiently. How much of a difference does this make?
I don't think it will make much difference, but I can be convinced by numbers.
Let's investigate what the best-case situation is (I don't have the numbers for
this, so we'll have to do some research).
The best case is where, without this setting, rsyslog would send singleton
messages, but with this setting, it would batch up exactly minbatch messages
and send them.
What sort of setting are you thinking of for your 'minimum size' batch?
On the sender side, each batch sent has a fairly small overhead; the request
being sent doesn't add much beyond the messages being inserted. There is going
to be some amount of additional RAM used to hold on to these logs, but the
system is idle, so it really shouldn't hurt. What I think is more likely to
hurt is that when things go wrong, more data will be lost.
On the receiver's side, how much of a performance benefit is there? (This
depends on the internals of ES.)
Batch mode was created in rsyslog because my testing was showing that on
low-end hardware I could insert ~1000 records into postgres as a batch in the
same time that it took to insert two records individually.
Can we get someone who has an ES setup to run a test? Force the batch size to 1
and hammer it until you reach the max rate, then set the batch size really
large and keep increasing the dequeue delay time until the total rate of
inserts drops back to the same level, and report how large the delay time needs
to be for them to even out. Also report the load on the ES server under the
'many small' vs 'few large' cases (vmstat and iostat output, and possibly
/proc/meminfo, so that we can see disk, RAM, and CPU utilization).
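One way to run the many-small vs few-large comparison without touching rsyslog at all is to hammer ES's `_bulk` endpoint directly with different batch sizes. A minimal sketch of the payload builder (the index name and document shape are placeholders, not values from this thread; the driver loop that POSTs it and times the runs is left out):

```python
import json

def bulk_body(docs, index="rsyslog-test"):
    # Build an Elasticsearch _bulk request body (NDJSON): one action
    # line plus one source line per document, newline-terminated.
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"
```

Varying the number of docs per call while keeping total docs/s constant would give exactly the 'many small' vs 'few large' numbers being asked for here.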
The recent request to add compression to the ES transaction will matter here as
well: a larger batch will compress better.
For example, if you are talking 1 vs 5 messages/batch, will that really make a
difference on the ES server? If so, how big a difference? If it's a 500%
improvement, but the 'bad' situation is only using 20% of a CPU on ES, do we
care? If ES does the insert into a data structure in RAM, and then pushes it
out to disk and updates its indexes to make things visible only every several
seconds, then it may be that there is no noticeable difference between the two
modes for quite a while. If we push the single-item rate until the server can't
keep up, we will be hitting some limit.
There's also the question of the value of what's being saved. Depending on
which resource on the ES server ends up being the bottleneck that larger
batches relieve, it may be that it's not something that would really make the
ES server noticeably better if it weren't being used. It's also possible that
we will find that it's a really critical resource and would make a huge
difference.
Also, should the minimum queue size be based on the number of messages, or the
size of the data being sent?
You could also test this by having a program that inserts into ES, reads from
stdin, and is set up via omprog, caching everything up until the minimum batch
size, possibly with a signal that forces it to flush its cache _now_. That way
you can experiment with timing by changing the rate at which you send signals
from an external script, and the sending code doesn't need to have any of the
clock logic in it.
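A sketch of that omprog-side helper, assuming SIGUSR1 as the flush signal (the class name and callback are invented for illustration; `ship()` is whatever performs the actual ES insert):

```python
import signal
import sys

class StdinBatcher:
    # Buffer lines read from stdin until min_batch are collected, and
    # let an external script force an early flush by sending SIGUSR1.
    # The timing policy thus lives entirely outside this program.
    def __init__(self, ship, min_batch=100):
        self.ship = ship
        self.min_batch = min_batch
        self.buf = []
        signal.signal(signal.SIGUSR1, self._on_signal)

    def _on_signal(self, signum, frame):
        self.flush()                     # SIGUSR1 forces a flush _now_

    def flush(self):
        if self.buf:
            self.ship(self.buf[:])
            self.buf.clear()

    def feed(self, line):
        self.buf.append(line.rstrip("\n"))
        if len(self.buf) >= self.min_batch:
            self.flush()

    def run(self, stream=sys.stdin):
        for line in stream:
            self.feed(line)
        self.flush()                     # ship any remainder at EOF
```

An external script can then `kill -USR1` this process at whatever interval is being tested, sweeping the flush period without recompiling or restarting the sender.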
We know the cost to rsyslog of doing something like this, but we don't know the
benefits.
Now the big questions:
- is this possible? where would one apply such a change?
- would it have a significant impact on the performance of outputs that
work well with the current design? Like omfwd, where I imagine the receiving
end wouldn't care how many docs it receives
- if it does have a significant impact, can we restrict such a change to
omelasticsearch, or does it have to go into rsyslog's core (in the way it
handles queues)?
- do you see better solutions?
I think the answer is that it would hurt in the general case, would be very
invasive, and would not be the right thing for many outputs, but it may be the
right thing for some outputs, so let's test and see.
David Lang
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE
THAT.