Johan, Henrik, thanks for the details. I created an issue on GitHub and will investigate.
https://github.com/Graylog2/graylog2-server/issues/999

Regards,
Bernd

On 25 February 2015 at 17:48, Henrik Johansen <h...@myunix.dk> wrote:
> Bernd,
>
> Correct - that issue started after 0.92.x.
>
> We are still seeing elevated CPU utilisation, but we are attributing that
> to the fact that 0.92 was losing messages in our setup.
>
>> On 25 Feb 2015, at 17:37, Bernd Ahlers <be...@graylog.com> wrote:
>>
>> Henrik,
>>
>> Uh, okay. I suppose it worked for you in 0.92 as well?
>>
>> I will create an issue on GitHub for that.
>>
>> Bernd
>>
>> On 25 February 2015 at 17:14, Henrik Johansen <h...@myunix.dk> wrote:
>>> Bernd,
>>>
>>> We saw the exact same issue - here is a graph of the CPU idle
>>> percentage across a few of the cluster nodes during the upgrade:
>>>
>>> http://5.9.37.177/graylog_cluster_cpu_idle.png
>>>
>>> We went from ~20% CPU utilisation to ~100% CPU utilisation across
>>> ~200 cores, and things only settled down after disabling force_rdns.
>>>
>>> On 25 Feb 2015, at 11:55, Bernd Ahlers <be...@graylog.com> wrote:
>>>
>>> Johan,
>>>
>>> The only thing that changed from 0.92 to 1.0 is that the DNS lookup is
>>> now done when the messages are read from the journal rather than in the
>>> input path where the messages are received. Otherwise, nothing has
>>> changed in that regard.
>>>
>>> We do not do any manual caching of the DNS lookups, but the JVM caches
>>> them by default. See
>>> http://docs.oracle.com/javase/7/docs/technotes/guides/net/properties.html
>>> for networkaddress.cache.ttl and networkaddress.cache.negative.ttl.
>>>
>>> Regards,
>>> Bernd
>>>
>>> On 25 February 2015 at 08:56, <sun...@sunner.com> wrote:
>>>
>>> This is strange. I went through all of the settings for my reply, and
>>> we are indeed using rdns, and it seems to be the culprit. The strange
>>> part is that it works fine on the old servers, even though they're on
>>> the same networks and using the same DNS servers and resolver settings.
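[Editor's note] The JVM-level caching Bernd refers to can be exercised with a few lines of plain Java. This is an illustrative sketch, not Graylog code; the class name, method, and TTL values are made up for the example:

```java
import java.net.InetAddress;
import java.security.Security;

public class RdnsCacheDemo {

    // A reverse (PTR) lookup of the kind force_rdns triggers per message.
    // getCanonicalHostName() falls back to the textual IP when no PTR
    // record is resolvable, so it never returns null or an empty string.
    static String reverse(String ip) {
        try {
            return InetAddress.getByName(ip).getCanonicalHostName();
        } catch (java.net.UnknownHostException e) {
            return ip; // a literal IP address does not normally throw
        }
    }

    public static void main(String[] args) {
        // The JVM caches successful lookups for networkaddress.cache.ttl
        // seconds and failed ones for networkaddress.cache.negative.ttl
        // seconds. These are *security* properties, so they must be set
        // before the first lookup (or in the JRE's java.security file).
        // The TTL values below are illustrative.
        Security.setProperty("networkaddress.cache.ttl", "300");
        Security.setProperty("networkaddress.cache.negative.ttl", "10");

        System.out.println(reverse("127.0.0.1"));
    }
}
```

The networking-properties page linked above also documents sun.net.inetaddr.ttl as a system-property fallback for installations where the security properties file cannot be edited.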
>>> Did something regarding reverse DNS change between 0.92 and 1.0? I'm
>>> thinking perhaps the server is trying to do one lookup per message
>>> instead of caching reverse lookups; caching would result in very
>>> little DNS traffic, since most of the logs come from a small number
>>> of hosts.
>>>
>>> Regards,
>>> Johan
>>>
>>> On Tuesday, February 24, 2015 at 5:08:54 PM UTC+1, Bernd Ahlers wrote:
>>>
>>> Johan,
>>>
>>> This sounds very strange indeed. Can you provide us with some more
>>> details?
>>>
>>> - What kind of messages are you pouring into Graylog via UDP? (GELF,
>>>   raw, syslog?)
>>> - Do you have any extractors or grok filters running for the messages
>>>   coming in via UDP?
>>> - Any other differences between the TCP and UDP messages?
>>> - Can you show us your input configuration?
>>> - Are you using reverse DNS lookups?
>>>
>>> Thank you!
>>>
>>> Regards,
>>> Bernd
>>>
>>> On 24 February 2015 at 16:45, <sun...@sunner.com> wrote:
>>>
>>> Well, that could be a suspect if it weren't for the fact that the old
>>> nodes, running on old hardware, handle it just fine, along with the
>>> fact that the traffic does reach the nodes (it actually fills the
>>> journal up, and the input buffer never breaks a sweat). And it's
>>> really not that much traffic: even spread across four nodes, those
>>> ~1000 messages per second will cause this, whereas the old nodes are
>>> just two and handle it just fine.
>>>
>>> About disk tuning, I haven't done much of that, and I realize I forgot
>>> to mention that the Elasticsearch cluster is on separate physical
>>> hardware, so there's a minuscule amount of disk I/O happening on the
>>> Graylog nodes.
>>>
>>> It's really very strange, since it seems like UDP itself isn't to
>>> blame; after all, the messages get into Graylog just fine and fill up
>>> the journal rapidly.
>>> The screenshot I linked was from after I had stopped sending logs,
>>> i.e. there was no longer any ingress traffic, so the Graylog process
>>> had nothing to do except empty its journal; it should all be internal
>>> processing and egress traffic to Elasticsearch. And as can be seen in
>>> the screenshot, it seems to be doing that in small bursts.
>>>
>>> In the exact same scenario (i.e. when I just streamed a large file
>>> into the system as fast as it could receive it) but with the logs
>>> having come over TCP, it will still store up a sizable number of
>>> messages in the journal, but the processing of the journaled messages
>>> is both more even and vastly faster.
>>>
>>> So, in short, it doesn't appear to be the communication itself, but
>>> something happening "inside" the Graylog process, and that only
>>> happens when the messages have been delivered over UDP.
>>>
>>> Regards,
>>> Johan
>>>
>>> On Tuesday, February 24, 2015 at 3:07:47 PM UTC+1, Henrik Johansen
>>> wrote:
>>>
>>> Could this simply be because TCP avoids (or tries to avoid) congestion
>>> while UDP does not?
>>>
>>> /HJ
>>>
>>> On 24 Feb 2015, at 13:50, sun...@sunner.com wrote:
>>>
>>> Hello,
>>>
>>> With the release of 1.0 we've started moving towards a new cluster of
>>> Graylog hosts. These are working very well, with one exception.
>>> For some reason, any reasonably significant UDP traffic will choke the
>>> message processor, fill up the process buffers on all four hosts, and
>>> effectively choke up all other message processing as well.
>>> Normally we do around 2k messages per second, split roughly 50/50
>>> between TCP and UDP. Sending the entire TCP load to one host doesn't
>>> present a problem; it doesn't break a sweat.
>>>
>>> I've also experimented a little with sending a large text file using
>>> rsyslog's imfile module. Sending it via TCP will bottleneck us at the
>>> Elasticsearch side of things and cause the disk journal to fill up
>>> fairly rapidly, but it still works at ~9k messages per second, so
>>> that's fine. Sending it via UDP just causes Graylog to choke again,
>>> fill the journal up to a certain point, and then process the journal
>>> very slowly, in little bursts of a few thousand messages followed by
>>> several seconds of apparent sleeping (i.e. pretty much no CPU usage).
>>>
>>> During all of this, the input buffer never fills beyond single-digit
>>> percentages. Using TCP, the output buffer sometimes moves up to
>>> 20-30%; with UDP it never moves at all. It's all in the process
>>> buffer. Sending a large burst of messages and then stopping doesn't
>>> seem to affect this behavior either; even after the inbound messages
>>> stop, it still takes a long time to process the messages that are
>>> already in the journal and process buffer.
>>> I'm using VisualVM to look at the CPU and memory usage; this is a
>>> screenshot of a UDP session:
>>> http://i59.tinypic.com/x23xfl.png
>>>
>>> I've tried mucking around with various knobs (processbuffer_processors,
>>> JVM settings, etc.) with no results whatsoever, good or bad.
>>> There's nothing to suggest a problem in either the Graylog or system
>>> logs.
>>>
>>> Pertinent specs and settings:
>>> ring_size = 16384 (CPUs have 20 MB L3)
>>> processbuffer_processors = 5
>>>
>>> Java 8u31
>>> Using G1GC with StringDeduplication; I've tried without the latter and
>>> just using CMS as well, no difference.
>>> 4 GB Xmx/Xms.
>>> Linux 3.16.0
>>> net.core.rmem_max = 8388608
>>>
>>> These are virtual machines (VMware), 8 GB RAM / 8 vCPUs, on Xeon
>>> E5-2690s.
>>>
>>> Software-wise, the old nodes are running more or less the same setup,
>>> except for kernel 3.2.0; same JVM, G1GC, etc.
>>> Hardware-wise, they're physical boxes, old Dell 2950s with dual
>>> quad-core E5440s. That's Core 2 era, so quite a bit slower.
>>>
>>> Any ideas?
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "graylog2" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to graylog2+u...@googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.

--
Developer

Tel.: +49 (0)40 609 452 077
Fax.: +49 (0)40 609 452 078

TORCH GmbH - A Graylog company
Steckelhörn 11
20457 Hamburg
Germany

Commercial Reg. (Registergericht): Amtsgericht Hamburg, HRB 125175
Geschäftsführer: Lennart Koopmann (CEO)

--
You received this message because you are subscribed to the Google Groups
"graylog2" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to graylog2+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
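[Editor's note] The net.core.rmem_max value from Johan's spec list only raises the ceiling a socket may request with SO_RCVBUF; it does not by itself enlarge any socket's buffer. A sketch of the corresponding sysctl fragment, where the rmem_default line is an illustrative addition not taken from this thread:

```
# /etc/sysctl.d/ fragment -- sketch; rmem_max is the value from the thread,
# rmem_default is an illustrative assumption.
net.core.rmem_max = 8388608      # ceiling for SO_RCVBUF requests
net.core.rmem_default = 8388608  # size for sockets that never set SO_RCVBUF
```

Whether the Graylog UDP input actually requests a buffer that large depends on the input's own receive-buffer setting, so raising the ceiling alone may change nothing.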