Johan, Henrik, thanks for the details. I created an issue on GitHub and will investigate.
https://github.com/Graylog2/graylog2-server/issues/999

Regards,
Bernd

On 25 February 2015 at 17:48, Henrik Johansen <h...@myunix.dk> wrote:
> Bernd,
>
> Correct - that issue started after 0.92.x.
>
> We are still seeing elevated CPU utilisation, but we are attributing that
> to the fact that 0.92 was losing messages in our setup.
>
>> On 25 Feb 2015, at 17:37, Bernd Ahlers <be...@graylog.com> wrote:
>>
>> Henrik,
>>
>> Uh, okay. I suppose it worked for you in 0.92 as well?
>>
>> I will create an issue on GitHub for that.
>>
>> Bernd
>>
>> On 25 February 2015 at 17:14, Henrik Johansen <h...@myunix.dk> wrote:
>>> Bernd,
>>>
>>> We saw the exact same issue - here is a graph of the CPU idle
>>> percentage across a few of the cluster nodes during the upgrade:
>>>
>>> http://5.9.37.177/graylog_cluster_cpu_idle.png
>>>
>>> We went from ~20% CPU utilisation to ~100% CPU utilisation across
>>> ~200 cores, and things only settled down after disabling force_rdns.
>>>
>>> On 25 Feb 2015, at 11:55, Bernd Ahlers <be...@graylog.com> wrote:
>>>
>>> Johan,
>>>
>>> The only thing that changed from 0.92 to 1.0 is that the DNS lookup is
>>> now done when the messages are read from the journal rather than in the
>>> input path where the messages are received. Otherwise, nothing has
>>> changed in that regard.
>>>
>>> We do not do any manual caching of the DNS lookups, but the JVM caches
>>> them by default. See
>>> http://docs.oracle.com/javase/7/docs/technotes/guides/net/properties.html
>>> for networkaddress.cache.ttl and networkaddress.cache.negative.ttl.
>>>
>>> Regards,
>>> Bernd
>>>
>>> On 25 February 2015 at 08:56, <sun...@sunner.com> wrote:
>>>
>>> This is strange. I went through all of the settings for my reply, and
>>> we are indeed using rdns, and it seems to be the culprit. The strange
>>> part is that it works fine on the old servers, even though they're on
>>> the same networks and using the same DNS servers and resolver settings.
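[Editor's note] The JVM-level caching Bernd refers to can be exercised with a few lines of plain Java. This is an illustrative sketch, not Graylog code; the class name, method, and TTL values are made up for the example:

```java
import java.net.InetAddress;
import java.security.Security;

public class RdnsCacheDemo {

    // A reverse (PTR) lookup of the kind force_rdns triggers per message.
    // getCanonicalHostName() falls back to the textual IP when no PTR
    // record is resolvable, so it never returns null or an empty string.
    static String reverse(String ip) {
        try {
            return InetAddress.getByName(ip).getCanonicalHostName();
        } catch (java.net.UnknownHostException e) {
            return ip; // a literal IP address does not normally throw
        }
    }

    public static void main(String[] args) {
        // The JVM caches successful lookups for networkaddress.cache.ttl
        // seconds and failed ones for networkaddress.cache.negative.ttl
        // seconds. These are *security* properties, so they must be set
        // before the first lookup (or in the JRE's java.security file).
        // The TTL values below are illustrative.
        Security.setProperty("networkaddress.cache.ttl", "300");
        Security.setProperty("networkaddress.cache.negative.ttl", "10");

        System.out.println(reverse("127.0.0.1"));
    }
}
```

The networking-properties page linked above also documents sun.net.inetaddr.ttl as a system-property fallback for installations where the security properties file cannot be edited.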
>>> Did something regarding reverse DNS change between 0.92 and 1.0? I'm
>>> thinking perhaps the server is trying to do one lookup per message
>>> instead of caching reverse lookups; caching would result in very
>>> little DNS traffic, since most of the logs come from a small number
>>> of hosts.
>>>
>>> Regards,
>>> Johan
>>>
>>> On Tuesday, February 24, 2015 at 5:08:54 PM UTC+1, Bernd Ahlers wrote:
>>>
>>> Johan,
>>>
>>> This sounds very strange indeed. Can you provide us with some more
>>> details?
>>>
>>> - What kind of messages are you pouring into Graylog via UDP? (GELF,
>>>   raw, syslog?)
>>> - Do you have any extractors or grok filters running for the messages
>>>   coming in via UDP?
>>> - Any other differences between the TCP and UDP messages?
>>> - Can you show us your input configuration?
>>> - Are you using reverse DNS lookups?
>>>
>>> Thank you!
>>>
>>> Regards,
>>> Bernd
>>>
>>> On 24 February 2015 at 16:45, <sun...@sunner.com> wrote:
>>>
>>> Well, that could be a suspect if it weren't for the fact that the old
>>> nodes, running on old hardware, handle it just fine, along with the
>>> fact that the traffic does reach the nodes (it actually fills the
>>> journal up, and the input buffer never breaks a sweat). And it's
>>> really not that much traffic: even spread across four nodes, those
>>> ~1000 messages per second will cause this, whereas the old nodes are
>>> just two and handle it just fine.
>>>
>>> About disk tuning, I haven't done much of that, and I realize I forgot
>>> to mention that the Elasticsearch cluster is on separate physical
>>> hardware, so there's a minuscule amount of disk I/O happening on the
>>> Graylog nodes.
>>>
>>> It's really very strange, since it seems like UDP itself isn't to
>>> blame; after all, the messages get into Graylog just fine and fill up
>>> the journal rapidly.
>>> The screenshot I linked was from after I had stopped sending logs,
>>> i.e. there was no longer any ingress traffic, so the Graylog process
>>> had nothing to do except empty its journal; it should all be internal
>>> processing and egress traffic to Elasticsearch. And as can be seen in
>>> the screenshot, it seems to be doing that in small bursts.
>>>
>>> In the exact same scenario (i.e. when I just streamed a large file
>>> into the system as fast as it could receive it) but with the logs
>>> having come over TCP, it will still store up a sizable number of
>>> messages in the journal, but the processing of the journaled messages
>>> is both more even and vastly faster.
>>>
>>> So, in short, it doesn't appear to be the communication itself, but
>>> something happening "inside" the Graylog process, and that only
>>> happens when the messages have been delivered over UDP.
>>>
>>> Regards,
>>> Johan
>>>
>>> On Tuesday, February 24, 2015 at 3:07:47 PM UTC+1, Henrik Johansen
>>> wrote:
>>>
>>> Could this simply be because TCP avoids (or tries to avoid) congestion
>>> while UDP does not?
>>>
>>> /HJ
>>>
>>> On 24 Feb 2015, at 13:50, sun...@sunner.com wrote:
>>>
>>> Hello,
>>>
>>> With the release of 1.0 we've started moving towards a new cluster of
>>> Graylog hosts. These are working very well, with one exception.
>>> For some reason, any reasonably significant UDP traffic will choke the
>>> message processor, fill up the process buffers on all four hosts, and
>>> effectively choke up all other message processing as well.
>>> Normally we do around 2k messages per second, split roughly 50/50
>>> between TCP and UDP. Sending the entire TCP load to one host doesn't
>>> present a problem; it doesn't break a sweat.
>>>
>>> I've also experimented a little with sending a large text file using
>>> rsyslog's imfile module. Sending it via TCP will bottleneck us at the
>>> Elasticsearch side of things and cause the disk journal to fill up
>>> fairly rapidly, but it still works at ~9k messages per second, so
>>> that's fine. Sending it via UDP just causes Graylog to choke again,
>>> fill the journal up to a certain point, and then process the journal
>>> very slowly, in little bursts of a few thousand messages followed by
>>> several seconds of apparent sleeping (i.e. pretty much no CPU usage).
>>>
>>> During all of this, the input buffer never fills beyond single-digit
>>> percentages. Using TCP, the output buffer sometimes moves up to
>>> 20-30%; with UDP it never moves at all. It's all in the process
>>> buffer. Sending a large burst of messages and then stopping doesn't
>>> seem to affect this behavior either; even after the inbound messages
>>> stop, it still takes a long time to process the messages that are
>>> already in the journal and process buffer.
>>> I'm using VisualVM to look at the CPU and memory usage; this is a
>>> screenshot of a UDP session:
>>> http://i59.tinypic.com/x23xfl.png
>>>
>>> I've tried mucking around with various knobs (processbuffer_processors,
>>> JVM settings, etc.) with no results whatsoever, good or bad.
>>> There's nothing to suggest a problem in either the Graylog or system
>>> logs.
>>>
>>> Pertinent specs and settings:
>>> ring_size = 16384 (CPUs have 20 MB L3)
>>> processbuffer_processors = 5
>>>
>>> Java 8u31
>>> Using G1GC with StringDeduplication; I've tried without the latter and
>>> just using CMS as well, no difference.
>>> 4 GB Xmx/Xms.
>>> Linux 3.16.0
>>> net.core.rmem_max = 8388608
>>>
>>> These are virtual machines (VMware), 8 GB RAM / 8 vCPUs, on Xeon
>>> E5-2690s.
>>>
>>> Software-wise, the old nodes are running more or less the same setup,
>>> except for kernel 3.2.0; same JVM, G1GC, etc.
>>> Hardware-wise, they're physical boxes, old Dell 2950s with dual
>>> quad-core E5440s. That's Core 2 era, so quite a bit slower.
>>>
>>> Any ideas?
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "graylog2" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to graylog2+u...@googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.

--
Developer

Tel.: +49 (0)40 609 452 077
Fax.: +49 (0)40 609 452 078

TORCH GmbH - A Graylog company
Steckelhörn 11
20457 Hamburg
Germany

Commercial Reg. (Registergericht): Amtsgericht Hamburg, HRB 125175
Geschäftsführer: Lennart Koopmann (CEO)

--
You received this message because you are subscribed to the Google Groups
"graylog2" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to graylog2+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
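[Editor's note] The net.core.rmem_max value from Johan's spec list only raises the ceiling a socket may request with SO_RCVBUF; it does not by itself enlarge any socket's buffer. A sketch of the corresponding sysctl fragment, where the rmem_default line is an illustrative addition not taken from this thread:

```
# /etc/sysctl.d/ fragment -- sketch; rmem_max is the value from the thread,
# rmem_default is an illustrative assumption.
net.core.rmem_max = 8388608      # ceiling for SO_RCVBUF requests
net.core.rmem_default = 8388608  # size for sockets that never set SO_RCVBUF
```

Whether the Graylog UDP input actually requests a buffer that large depends on the input's own receive-buffer setting, so raising the ceiling alone may change nothing.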