Johan, Henrik,

I tried to track this problem down. The problem is that the JVM does
not cache reverse DNS lookups. The available JVM DNS cache settings
like "networkaddress.cache.ttl" only affect forward DNS lookups.
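
For reference, those settings are Java security properties, so they can
be set in the JRE's java.security file or programmatically. Below is a
minimal sketch of the programmatic variant. The class name and the TTL
values are made up for illustration, and as far as I know the values
have to be set before the JVM performs its first lookup. Again, this
only tunes the forward lookup cache.

```java
import java.security.Security;

/** Hypothetical helper, not part of Graylog. */
final class DnsCacheSettings {
    private DnsCacheSettings() {}

    /**
     * Sets the JVM-wide TTLs (in seconds) for successful and failed
     * forward lookups. Call this early, before any InetAddress lookup.
     */
    static void apply(int positiveTtlSeconds, int negativeTtlSeconds) {
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }
}
```

For example, DnsCacheSettings.apply(60, 10) would cache successful
forward lookups for a minute and failed ones for ten seconds.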

The code for doing the reverse lookups in Graylog has not changed in a
long time, so this problem is not new in 1.0.

In my test setup, enabling "force_rdns" for a syslog input reduced the
throughput from around 7000 msg/s to 300 msg/s. This was without a
local DNS cache. Once I installed a DNS cache on the Graylog server,
the throughput went up to around 3000 msg/s.

We will investigate whether there is a sane way to cache the reverse
lookups ourselves. In the meantime, I suggest either testing with a DNS
cache installed on the Graylog server nodes to see if that helps, or
disabling the "force_rdns" setting.
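
To give an idea of what caching the reverse lookups ourselves could
look like, here is a rough sketch. It is not Graylog code: the
ReverseDnsCache class, its method names and the TTL handling are all
hypothetical. The map is unbounded, which only seems acceptable as long
as the logs come from a small number of hosts.

```java
import java.net.InetAddress;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical TTL cache for reverse DNS lookups. */
final class ReverseDnsCache {
    private static final class Entry {
        final String hostName;
        final long expiresAtNanos;

        Entry(String hostName, long expiresAtNanos) {
            this.hostName = hostName;
            this.expiresAtNanos = expiresAtNanos;
        }
    }

    private final ConcurrentMap<InetAddress, Entry> entries = new ConcurrentHashMap<>();
    private final Function<InetAddress, String> resolver;
    private final long ttlNanos;
    private final LongSupplier nanoClock;

    ReverseDnsCache(Function<InetAddress, String> resolver, long ttlMillis) {
        this(resolver, ttlMillis, System::nanoTime);
    }

    /** The clock is injectable so that expiry can be tested without sleeping. */
    ReverseDnsCache(Function<InetAddress, String> resolver, long ttlMillis,
                    LongSupplier nanoClock) {
        this.resolver = resolver;
        this.ttlNanos = TimeUnit.MILLISECONDS.toNanos(ttlMillis);
        this.nanoClock = nanoClock;
    }

    /** Cache in front of the JVM's own reverse lookup. */
    static ReverseDnsCache usingJvmResolver(long ttlMillis) {
        return new ReverseDnsCache(InetAddress::getCanonicalHostName, ttlMillis);
    }

    String hostNameFor(InetAddress address) {
        long now = nanoClock.getAsLong();
        Entry cached = entries.get(address);
        // Subtraction keeps the comparison correct if nanoTime wraps around.
        if (cached != null && now - cached.expiresAtNanos < 0) {
            return cached.hostName;
        }
        // Resolve outside any map lock so a slow DNS server does not block
        // lookups for other addresses. Concurrent misses for the same
        // address may resolve twice; the last result wins.
        String resolved = resolver.apply(address);
        entries.put(address, new Entry(resolved, now + ttlNanos));
        return resolved;
    }
}
```

With something like ReverseDnsCache.usingJvmResolver(60000), each
source address would trigger at most roughly one reverse lookup per
minute instead of one per message.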

Regards,
Bernd

On 25 February 2015 at 18:00, Bernd Ahlers <be...@graylog.com> wrote:
> Johan, Henrik,
>
> thanks for the details. I created an issue on GitHub and will investigate.
>
> https://github.com/Graylog2/graylog2-server/issues/999
>
> Regards,
> Bernd
>
> On 25 February 2015 at 17:48, Henrik Johansen <h...@myunix.dk> wrote:
>> Bernd,
>>
>> Correct - that issue started after 0.92.x.
>>
>> We are still seeing elevated CPU utilisation but we are attributing that
>> to the fact that 0.92 was losing messages in our setup.
>>
>>
>>> On 25 Feb 2015, at 17:37, Bernd Ahlers <be...@graylog.com> wrote:
>>>
>>> Henrik,
>>>
>>> uh, okay. I suppose it worked for you in 0.92 as well?
>>>
>>> I will create an issue on GitHub for that.
>>>
>>> Bernd
>>>
>>> On 25 February 2015 at 17:14, Henrik Johansen <h...@myunix.dk> wrote:
>>>> Bernd,
>>>>
>>>> We saw the exact same issue - here is a graph over the CPU idle
>>>> percentage across a few of the cluster nodes during the upgrade:
>>>>
>>>> http://5.9.37.177/graylog_cluster_cpu_idle.png
>>>>
>>>> We went from ~20% CPU utilisation to ~100% CPU utilisation across
>>>> ~200 cores and things only settled down after disabling force_rdns.
>>>>
>>>>
>>>> On 25 Feb 2015, at 11:55, Bernd Ahlers <be...@graylog.com> wrote:
>>>>
>>>> Johan,
>>>>
>>>> the only thing that changed from 0.92 to 1.0 is that the DNS lookup is
>>>> now done when the messages are read from the journal and not in the
>>>> input path where the messages are received. Otherwise, nothing has
>>>> changed in that regard.
>>>>
>>>> We do not do any manual caching of the DNS lookups, but the JVM caches
>>>> them by default. Check
>>>> http://docs.oracle.com/javase/7/docs/technotes/guides/net/properties.html
>>>> for networkaddress.cache.ttl and networkaddress.cache.negative.ttl.
>>>>
>>>> Regards,
>>>> Bernd
>>>>
>>>> On 25 February 2015 at 08:56,  <sun...@sunner.com> wrote:
>>>>
>>>> This is strange. I went through all of the settings for my reply, and we
>>>> are indeed using rdns, and it seems to be the culprit. The strangeness is
>>>> that it works fine on the old servers even though they're on the same
>>>> networks and use the same DNS servers and resolver settings.
>>>> Did something regarding reverse DNS change between 0.92 and 1.0? I'm
>>>> thinking perhaps the server is trying to do one lookup per message instead
>>>> of caching reverse lookups, seeing as the latter would result in very
>>>> little DNS traffic since most of the logs will be coming from a small
>>>> number of hosts.
>>>>
>>>> Regards
>>>> Johan
>>>>
>>>> On Tuesday, February 24, 2015 at 5:08:54 PM UTC+1, Bernd Ahlers wrote:
>>>>
>>>>
>>>> Johan,
>>>>
>>>> this sounds very strange indeed. Can you provide us with some more
>>>> details?
>>>>
>>>> - What kind of messages are you pouring into Graylog via UDP? (GELF,
>>>> raw, syslog?)
>>>> - Do you have any extractors or grok filters running for the messages
>>>> coming in via UDP?
>>>> - Any other differences between the TCP and UDP messages?
>>>> - Can you show us your input configuration?
>>>> - Are you using reverse DNS lookups?
>>>>
>>>> Thank you!
>>>>
>>>> Regards,
>>>> Bernd
>>>>
>>>> On 24 February 2015 at 16:45,  <sun...@sunner.com> wrote:
>>>>
>>>> Well, that could be a suspect if it weren't for the fact that the old
>>>> nodes running on old hardware handle it just fine, along with the fact
>>>> that the traffic seems to reach the nodes just fine (i.e. it actually
>>>> fills the journal up just fine, and the input buffer never breaks a
>>>> sweat). And it's really not that much traffic: even spread across four
>>>> nodes, those ~1000 messages per second will cause this, whereas the old
>>>> nodes are just two and can handle it just fine.
>>>>
>>>> About disk tuning, I haven't done much of that, and I realize I forgot
>>>> to mention that the Elasticsearch cluster is on separate physical
>>>> hardware, so there's a minuscule amount of disk I/O happening on the
>>>> Graylog nodes.
>>>>
>>>> It's really very strange since it seems like UDP itself isn't to blame;
>>>> after all, the messages get into Graylog just fine and fill up the
>>>> journal rapidly. The screenshot I linked was from after I had stopped
>>>> sending logs, i.e. there was no longer any ingress traffic, so the
>>>> Graylog process had nothing to do except empty its journal. It should
>>>> all be internal processing and egress traffic to Elasticsearch. And as
>>>> can be seen in the screenshot, it seems to be doing it in small bursts.
>>>>
>>>> In the exact same scenario (i.e. when I just streamed a large file into
>>>> the system as fast as it could receive it) but with the logs having come
>>>> over TCP, it'll still store up a sizable number of messages in the
>>>> journal, but the processing of the journaled messages is both more even
>>>> and vastly faster.
>>>>
>>>> So in short, it doesn't appear to be the communication itself but
>>>> something happening "inside" the Graylog process, and only when the
>>>> messages have been delivered over UDP.
>>>>
>>>> Regards
>>>> Johan
>>>>
>>>>
>>>> On Tuesday, February 24, 2015 at 3:07:47 PM UTC+1, Henrik Johansen wrote:
>>>>
>>>>
>>>> Could this simply be because TCP avoids (or tries to avoid) congestion
>>>> while UDP does not?
>>>>
>>>> /HJ
>>>>
>>>> On 24 Feb 2015, at 13:50, sun...@sunner.com wrote:
>>>>
>>>> Hello,
>>>>
>>>> With the release of 1.0 we've started moving towards a new cluster of GL
>>>> hosts. These are working very well, with one exception.
>>>> For some reason any reasonably significant UDP traffic will choke the
>>>> message processor, fill up the process buffers on all four hosts, and
>>>> effectively choke up all other message processing as well.
>>>> Normally we do around 2k messages per second, split roughly 50/50
>>>> between TCP and UDP. Sending the entire TCP load to one host doesn't
>>>> present a problem; it doesn't break a sweat.
>>>>
>>>> I've also experimented a little with sending a large text file using
>>>> rsyslog's imfile module. Sending it via TCP will bottleneck us at the ES
>>>> side of things and cause the disk journal to fill up fairly rapidly, but
>>>> it's still working at ~9k messages per second, so that's fine. Sending it
>>>> via UDP just causes GL to choke again, fill up the journal to a certain
>>>> point, and then very slowly process the journal in little bursts of a few
>>>> thousand messages followed by several seconds of apparent sleeping (i.e.
>>>> pretty much no CPU usage).
>>>>
>>>> During all of this the input buffer never fills up more than single-digit
>>>> percentages. Using TCP, the output buffer sometimes moves up to 20-30%;
>>>> with UDP it never moves at all. It's all in the process buffer.
>>>> Sending a large burst of messages and then stopping doesn't seem to
>>>> affect this behavior either: even after the inbound messages stop, it
>>>> still takes a long time to process the messages that are already in the
>>>> journal and process buffer.
>>>> I'm using VisualVM to look at the CPU and memory usage. This is a
>>>> screenshot of a UDP session:
>>>> http://i59.tinypic.com/x23xfl.png
>>>>
>>>> I've tried mucking around with various knobs, processbuffer_processors,
>>>> JVM settings, etc., with no results whatsoever, good or bad.
>>>> There's nothing to suggest a problem in either the Graylog or system
>>>> logs.
>>>>
>>>> Pertinent specs and settings:
>>>> ring_size = 16384 (CPUs have 20 MB L3)
>>>> processbuffer_processors = 5
>>>>
>>>> Java 8u31
>>>> Using G1GC with StringDeduplication. I've tried without the latter and
>>>> just using CMS as well, no difference.
>>>> 4 GB Xmx/Xms.
>>>> Linux 3.16.0
>>>> net.core.rmem_max = 8388608
>>>>
>>>> These are virtual machines, VMware, 8 GB / 8 vCPUs, Xeon E5-2690s.
>>>>
>>>> Software-wise the old nodes are running more or less the same setup,
>>>> except kernel 3.2.0: same JVM, G1GC, etc. Hardware-wise, they're physical
>>>> boxes, old Dell 2950s with dual quad-core E5440s. That's Core2 era, so
>>>> quite a bit slower.
>>>>
>>>> Any ideas?
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups
>>>> "graylog2" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an
>>>> email to graylog2+u...@googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>>
>>>> --
>>>> Developer
>>>>
>>>> Tel.: +49 (0)40 609 452 077
>>>> Fax.: +49 (0)40 609 452 078
>>>>
>>>> TORCH GmbH - A Graylog company
>>>> Steckelhörn 11
>>>> 20457 Hamburg
>>>> Germany
>>>>
>>>> Commercial Reg. (Registergericht): Amtsgericht Hamburg, HRB 125175
>>>> Geschäftsführer: Lennart Koopmann (CEO)



