Well, that would be a suspect if it weren't for the fact that the old nodes 
running on much older hardware handle the same load without trouble, and 
that the traffic clearly reaches the new nodes (i.e. it fills the journal 
up rapidly and the input buffer never breaks a sweat). It's also not that 
much traffic: spread across four nodes, those ~1000 messages per second are 
enough to cause this, whereas the old cluster is just two nodes and copes 
fine.
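
(For completeness, a quick way to double-check that the kernel isn't 
silently dropping datagrams is to watch the UDP counters during a burst, 
roughly:

  netstat -su    # watch "packet receive errors" / "receive buffer errors"

but given how quickly the journal fills up, drops at that level don't look 
like the problem.)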

As for disk tuning, I haven't done much of that, and I realize I forgot to 
mention that the Elasticsearch cluster runs on separate physical hardware, 
so there's only a minuscule amount of disk I/O happening on the Graylog 
nodes.
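
(That's based on just eyeballing iostat on the Graylog nodes, roughly:

  iostat -xd 5    # %util/await on the journal disk stay negligible

so the journal disk itself doesn't look like the bottleneck either.)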

It's really very strange, since UDP itself doesn't seem to be to blame; 
after all, the messages get into Graylog fine and fill up the journal 
rapidly. The screenshot I linked was taken after I had stopped sending 
logs, i.e. there was no longer any ingress traffic, so the Graylog process 
had nothing to do except empty its journal; it should all be internal 
processing and egress traffic to Elasticsearch. And as can be seen in the 
screenshot, it does that in small bursts.

In the exact same scenario (i.e. streaming a large file into the system as 
fast as it can receive it) but with the logs arriving over TCP, it still 
stores up a sizable number of messages in the journal, but processing of 
the journaled messages is both more even and vastly faster.

So in short, it doesn't appear to be the transport itself, but something 
happening "inside" the Graylog process, and it only happens when the 
messages have been delivered over UDP.
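
One thing I might try next is grabbing a few thread dumps while it's in 
one of those quiet spells, something along the lines of

  jstack -l <graylog-server-pid> > threads.$(date +%s).txt

to see what the processing threads are actually blocked or waiting on 
during the pauses.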

Regards
Johan


On Tuesday, February 24, 2015 at 3:07:47 PM UTC+1, Henrik Johansen wrote:
>
> Could this simply be because TCP avoids (or tries to avoid) congestion 
> while UDP does not?
>
> /HJ
>
> On 24 Feb 2015, at 13:50, sun...@sunner.com wrote:
>
> Hello,
>
> With the release of 1.0 we've started moving towards a new cluster of GL 
> hosts. These are working very well, with one exception.
> For some reason, any reasonably significant amount of UDP traffic will 
> choke the message processor, fill up the process buffers on all four 
> hosts, and effectively stall all other message processing as well.
> Normally we do around 2k messages per second, split roughly 50/50 between 
> TCP and UDP. Sending the entire TCP load to one host doesn't present a 
> problem; it doesn't break a sweat.
>
> I've also experimented a little with sending a large text file using 
> rsyslog's imfile module. Sending it via TCP will bottleneck us on the ES 
> side of things and cause the disk journal to fill up fairly rapidly, but 
> it's still working at ~9k messages per second, so that's fine. Sending it 
> via UDP just causes GL to choke again: it fills the journal to a certain 
> point and then slowly works through it in little bursts of a few thousand 
> messages followed by several seconds of apparent sleeping (i.e. pretty 
> much no CPU usage).
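>
> (For reference, the test config is roughly the following; hostname, port 
> and file path are placeholders:
>
>   $ModLoad imfile
>   $InputFileName /var/log/bigtest.log
>   $InputFileTag bigtest:
>   $InputFileStateFile stat-bigtest
>   $InputRunFileMonitor
>   # TCP forwarding; for the UDP runs, a single @ instead of @@
>   *.* @@graylog01.example.com:5140
>
> Nothing fancy, just whatever gets the file out the door as fast as 
> possible.)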
>
> During all of this the input buffer never fills beyond single-digit 
> percentages. With TCP the output buffer sometimes climbs to 20-30%; with 
> UDP it never moves at all. It's all in the process buffer.
> Sending a large burst of messages and then stopping doesn't seem to affect 
> this behavior either; even after the inbound messages stop, it still takes 
> a long time to work through the messages that are already in the journal 
> and process buffer.
> I'm using VisualVM to look at CPU and memory usage; this is a screenshot 
> of a UDP session:
> http://i59.tinypic.com/x23xfl.png
>
> I've tried mucking around with various knobs (processbuffer_processors, 
> JVM settings, etc.) with no results whatsoever, good or bad.
> There's nothing to suggest a problem in either the Graylog or system 
> logs.
>
> Pertinent specs and settings:
> ring_size = 16384 (CPUs have 20 MB L3)
> processbuffer_processors = 5
>
> Java 8u31
> Using G1GC with StringDeduplication; I've also tried without the latter 
> and with plain CMS instead, no difference.
> 4 GB Xmx/Xms.
> Linux 3.16.0
> net.core.rmem_max = 8388608
>
> These are virtual machines on VMware, 8 GB RAM / 8 vCPUs, Xeon E5-2690s.
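>
> In terms of actual flags and sysctl, that boils down to roughly this 
> (give or take the exact GC tuning):
>
>   # JVM options for graylog-server
>   -Xms4g -Xmx4g -XX:+UseG1GC -XX:+UseStringDeduplication
>
>   # sysctl
>   net.core.rmem_max = 8388608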
>
> Software-wise the old nodes are running more or less the same setup, 
> except for kernel 3.2.0; same JVM, G1GC, etc. Hardware-wise they're 
> physical boxes, old Dell 2950s with dual quad-core E5440s. That's Core 2 
> era, so quite a bit slower.
>
> Any ideas?
>
