I installed unbound locally and used it as the resolver, and that seems to have resolved
the issue. It's odd that the old server didn't show this behavior, but I'm
happy enough that it's resolved anyway. :)
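For reference, by "installed unbound locally" I just mean a minimal caching resolver on the Graylog node itself, with the system resolver pointed at it. Roughly something like the sketch below (illustrative only; exact options and paths will differ per distribution):

    # /etc/unbound/unbound.conf (sketch of a minimal local caching resolver)
    server:
        interface: 127.0.0.1
        access-control: 127.0.0.0/8 allow
        # keep answers around for at least a minute so per-message lookups hit the cache
        cache-min-ttl: 60
    # Afterwards, point /etc/resolv.conf at 127.0.0.1 so the JVM's lookups go through it.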
Regards
Johan
On Friday, February 27, 2015 at 2:02:08 PM UTC+1, Bernd Ahlers wrote:
Johan, Henrik,
I tried to track this problem down. The problem is that the JVM does
not cache reverse DNS lookups. The available JVM DNS cache settings
like networkaddress.cache.ttl only affect forward DNS lookups.
The code for doing the reverse lookups in Graylog has not changed in a
long time, so this problem is not new in 1.0.
In my test setup, enabling force_rdns for a syslog input reduced the
throughput from around 7000 msg/s to 300 msg/s. This was without a
local DNS cache. Once I installed a DNS cache on the Graylog server,
the throughput went up to around 3000 msg/s.
We will investigate whether there is a sane way to cache the reverse
lookups ourselves. In the meantime, I suggest testing with a DNS cache
installed on the Graylog server nodes to see if that helps, or
disabling the force_rdns setting.
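To sketch what caching the reverse lookups ourselves could look like (this is only an illustration, not what Graylog currently does), the idea would be to memoize the result of getCanonicalHostName() per source IP; a real implementation would also need a TTL and a size bound:

    import java.net.InetAddress;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Hypothetical sketch: memoize reverse (PTR) lookups per source IP.
    public class ReverseDnsCache {
        private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();

        public String lookup(InetAddress address) {
            // getCanonicalHostName() triggers a reverse lookup on every call,
            // so cache the result keyed by the IP address string.
            return cache.computeIfAbsent(address.getHostAddress(),
                    ip -> address.getCanonicalHostName());
        }
    }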
Regards,
Bernd
On 25 February 2015 at 18:00, Bernd Ahlers be...@graylog.com wrote:
Johan, Henrik,
thanks for the details. I created an issue on GitHub and will
investigate.
https://github.com/Graylog2/graylog2-server/issues/999
Regards,
Bernd
On 25 February 2015 at 17:48, Henrik Johansen h...@myunix.dk wrote:
Bernd,
Correct - that issue started after 0.92.x.
We are still seeing elevated CPU utilisation, but we are attributing
that to the fact that 0.92 was losing messages in our setup.
On 25 Feb 2015, at 17:37, Bernd Ahlers be...@graylog.com wrote:
Henrik,
uh, okay. I suppose it worked for you in 0.92 as well?
I will create an issue on GitHub for that.
Bernd
On 25 February 2015 at 17:14, Henrik Johansen h...@myunix.dk wrote:
Bernd,
We saw the exact same issue - here is a graph of the CPU idle
percentage across a few of the cluster nodes during the upgrade:
http://5.9.37.177/graylog_cluster_cpu_idle.png
We went from ~20% CPU utilisation to ~100% CPU utilisation across
~200 cores and things only settled down after disabling force_rdns.
On 25 Feb 2015, at 11:55, Bernd Ahlers be...@graylog.com wrote:
Johan,
the only thing that changed from 0.92 to 1.0 is that the DNS lookup is
now done when the messages are read from the journal, and not in the
input path where the messages are received. Otherwise, nothing has
changed in that regard.
We do not do any manual caching of the DNS lookups, but the JVM caches
them by default. Check
http://docs.oracle.com/javase/7/docs/technotes/guides/net/properties.html
for networkaddress.cache.ttl and networkaddress.cache.negative.ttl.
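For completeness: those two properties control how long forward lookup results (and failures) are cached, and they have to be set before the JVM does its first lookup, e.g. in the JVM's java.security file or programmatically. Per the behaviour described above, a reverse lookup via getCanonicalHostName() is not covered by them. A small sketch, with example values only:

    import java.net.InetAddress;
    import java.security.Security;

    public class DnsCacheTtlExample {
        public static void main(String[] args) throws Exception {
            // Cache successful forward lookups for 60s and failures for 10s.
            // These must be set before the JVM performs its first lookup.
            Security.setProperty("networkaddress.cache.ttl", "60");
            Security.setProperty("networkaddress.cache.negative.ttl", "10");

            InetAddress addr = InetAddress.getByName("example.org"); // forward lookup, cached
            String host = addr.getCanonicalHostName();               // reverse lookup, not cached
            System.out.println(addr.getHostAddress() + " -> " + host);
        }
    }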
Regards,
Bernd
On 25 February 2015 at 08:56, sun...@sunner.com wrote:
This is strange. I went through all of the settings for my reply, and we
are indeed using rDNS, and it seems to be the culprit. The strange part is
that it works fine on the old servers even though they're on the same
networks and using the same DNS servers and resolver settings.
Did something regarding reverse DNS change between 0.92 and 1.0? I'm
thinking perhaps the server is trying to do one lookup per message
instead of caching reverse lookups, seeing as the latter would result in
very little DNS traffic since most of the logs will be coming from a
small number of hosts.
Regards
Johan
On Tuesday, February 24, 2015 at 5:08:54 PM UTC+1, Bernd Ahlers wrote:
Johan,
this sounds very strange indeed. Can you provide us with some more
details?
- What kind of messages are you pouring into Graylog via UDP? (GELF, raw, syslog?)
- Do you have any extractors or grok filters running for the messages coming in via UDP?
- Any other differences between the TCP and UDP messages?
- Can you show us your input configuration?
- Are you using reverse DNS lookups?
Thank you!
Regards,
Bernd
On 24 February 2015 at 16:45, sun...@sunner.com wrote:
Well, that could be a suspect if it wasn't for the fact that the old
nodes, running on old hardware, handle it just fine, along with the fact
that the traffic does seem to reach the nodes (i.e. it actually fills the
journal up, and the input buffer never breaks a sweat). And it's really
not that much traffic: even spread across four nodes, those ~1000
messages per second will cause this, whereas the old nodes are just two
and can handle it just fine.
About disk tuning, I haven't done much of that, and I realize I forgot
to mention that the Elasticsearch cluster is on separate physical
hardware so