Re: [graylog2] Graylog 1.0 UDP process buffer performance

2015-03-03 Thread Bernd Ahlers
Thanks for the feedback! :)

Bernd

On 2 March 2015 at 11:26,  sun...@sunner.com wrote:
 I installed unbound locally and used this, and it seems to have resolved the
 issue. It's odd that the old server didn't show this behavior, but I'm happy
 enough that it's resolved anyway. :)

 Regards
 Johan

 On Friday, February 27, 2015 at 2:02:08 PM UTC+1, Bernd Ahlers wrote:

 Johan, Henrik,

 I tried to track this problem down. The problem is that the JVM does
 not cache reverse DNS lookups. The available JVM DNS cache settings
 like networkaddress.cache.ttl only affect forward DNS lookups.

 The code for doing the reverse lookups in Graylog has not changed in a
 long time, so this problem is not new in 1.0.

 In my test setup, enabling force_rdns for a syslog input reduced the
 throughput from around 7000 msg/s to 300 msg/s. This was without a
 local DNS cache. Once I installed a DNS cache on the Graylog server,
 the throughput went up to around 3000 msg/s.

 We will investigate whether there is a sane way to cache the reverse
 lookups ourselves. In the meantime, I suggest testing with a DNS cache
 installed on the Graylog server nodes to see if that helps, or
 disabling the force_rdns setting.

 Regards,
 Bernd

 On 25 February 2015 at 18:00, Bernd Ahlers be...@graylog.com wrote:
  Johan, Henrik,
 
  thanks for the details. I created an issue on GitHub and will
  investigate.
 
  https://github.com/Graylog2/graylog2-server/issues/999
 
  Regards,
  Bernd
 
  On 25 February 2015 at 17:48, Henrik Johansen h...@myunix.dk wrote:
  Bernd,
 
  Correct - that issue started after 0.92.x.
 
  We are still seeing elevated CPU utilisation but we are attributing that
  to the fact that 0.92 was losing messages in our setup.
 
 
  On 25 Feb 2015, at 17:37, Bernd Ahlers be...@graylog.com wrote:
 
  Henrik,
 
  uh, okay. I suppose it worked for you in 0.92 as well?
 
  I will create an issue on GitHub for that.
 
  Bernd
 
  On 25 February 2015 at 17:14, Henrik Johansen h...@myunix.dk wrote:
  Bernd,
 
  We saw the exact same issue - here is a graph over the CPU idle
  percentage across a few of the cluster nodes during the upgrade :
 
  http://5.9.37.177/graylog_cluster_cpu_idle.png
 
  We went from ~20% CPU utilisation to ~100% CPU utilisation across
  ~200 cores and things only settled down after disabling force_rdns.
 
 
  On 25 Feb 2015, at 11:55, Bernd Ahlers be...@graylog.com wrote:
 
  Johan,
 
  the only thing that changed from 0.92 to 1.0 is that the DNS lookup is
  now done when the messages are read from the journal and not in the
  input path where the messages are received. Otherwise, nothing has
  changed in that regard.

  We do not do any manual caching of the DNS lookups, but the JVM caches
  them by default. Check
  http://docs.oracle.com/javase/7/docs/technotes/guides/net/properties.html
  for networkaddress.cache.ttl and networkaddress.cache.negative.ttl.
 
  Regards,
  Bernd
 
  On 25 February 2015 at 08:56,  sun...@sunner.com wrote:
 
  This is strange. I went through all of the settings for my reply, and we are
  indeed using rdns, and it seems to be the culprit. The strangeness is that
  it works fine on the old servers even though they're on the same networks,
  and using the same DNS servers and resolver settings.
  Did something regarding reverse DNS change between 0.92 and 1.0? I'm
  thinking perhaps the server is trying to do one lookup per message instead
  of caching reverse lookups, seeing as the latter would result in very little
  DNS traffic since most of the logs will be coming from a small number of
  hosts.
 
  Regards
  Johan
 
  On Tuesday, February 24, 2015 at 5:08:54 PM UTC+1, Bernd Ahlers
  wrote:
 
 
  Johan,
 
  this sounds very strange indeed. Can you provide us with some more
  details?
 
  - What kind of messages are you pouring into Graylog via UDP? (GELF,
  raw, syslog?)
  - Do you have any extractors or grok filters running for the messages
  coming in via UDP?
  - Any other differences between the TCP and UDP messages?
  - Can you show us your input configuration?
  - Are you using reverse DNS lookups?
 
  Thank you!
 
  Regards,
  Bernd
 
  On 24 February 2015 at 16:45,  sun...@sunner.com wrote:
 
  Well, that could be a suspect if it wasn't for the fact that the old nodes
  running on old hardware handle it just fine, along with the fact that the
  traffic seems to reach the nodes just fine (i.e. it actually fills the journal
  up just fine, and the input buffer never breaks a sweat). And it's really
  not that much traffic; even spread across four nodes those ~1000 messages
  per second will cause this, whereas the old nodes are just two and can
  handle it just fine.

  About disk tuning, I haven't done much of that, and I realize I forgot to
  mention that the Elasticsearch cluster is on separate physical hardware, so
  there's a minuscule amount of disk I/O happening on the Graylog nodes.

  It's really very strange since it seems like UDP 

Re: [graylog2] Graylog 1.0 UDP process buffer performance

2015-03-02 Thread sunner
I installed unbound locally and used this, and it seems to have resolved 
the issue. It's odd that the old server didn't show this behavior, but I'm 
happy enough that it's resolved anyway. :)
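
For reference, a minimal local caching-resolver setup along the lines of what
I ended up with (the forwarder address below is just a placeholder, use your
own site resolvers), with /etc/resolv.conf pointed at 127.0.0.1:

    # /etc/unbound/unbound.conf -- small caching resolver in front of the site DNS
    server:
        interface: 127.0.0.1              # only serve this Graylog node
        access-control: 127.0.0.0/8 allow
        cache-min-ttl: 60                 # keep (PTR) answers for at least a minute
        prefetch: yes                     # refresh popular entries before expiry

    forward-zone:
        name: "."
        forward-addr: 192.0.2.53          # placeholder: replace with your resolver

The exact values aren't important; the point is just to take the repeated PTR
lookups off the upstream servers.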

Regards
Johan

On Friday, February 27, 2015 at 2:02:08 PM UTC+1, Bernd Ahlers wrote:

 Johan, Henrik, 

 I tried to track this problem down. The problem is that the JVM does
 not cache reverse DNS lookups. The available JVM DNS cache settings
 like networkaddress.cache.ttl only affect forward DNS lookups.

 The code for doing the reverse lookups in Graylog has not changed in a
 long time, so this problem is not new in 1.0.

 In my test setup, enabling force_rdns for a syslog input reduced the
 throughput from around 7000 msg/s to 300 msg/s. This was without a
 local DNS cache. Once I installed a DNS cache on the Graylog server,
 the throughput went up to around 3000 msg/s.

 We will investigate whether there is a sane way to cache the reverse
 lookups ourselves. In the meantime, I suggest testing with a DNS cache
 installed on the Graylog server nodes to see if that helps, or
 disabling the force_rdns setting.

 Regards, 
 Bernd 

 On 25 February 2015 at 18:00, Bernd Ahlers be...@graylog.com wrote:
  Johan, Henrik, 
  
  thanks for the details. I created an issue on GitHub and will 
 investigate. 
  
  https://github.com/Graylog2/graylog2-server/issues/999 
  
  Regards, 
  Bernd 
  
  On 25 February 2015 at 17:48, Henrik Johansen h...@myunix.dk wrote:
  Bernd, 
  
  Correct - that issue started after 0.92.x. 
  
  We are still seeing elevated CPU utilisation but we are attributing that
  to the fact that 0.92 was losing messages in our setup.
  
  
  On 25 Feb 2015, at 17:37, Bernd Ahlers be...@graylog.com wrote:
  
  Henrik, 
  
  uh, okay. I suppose it worked for you in 0.92 as well? 
  
  I will create an issue on GitHub for that. 
  
  Bernd 
  
  On 25 February 2015 at 17:14, Henrik Johansen h...@myunix.dk wrote:
  Bernd, 
  
  We saw the exact same issue - here is a graph over the CPU idle 
  percentage across a few of the cluster nodes during the upgrade : 
  
  http://5.9.37.177/graylog_cluster_cpu_idle.png 
  
  We went from ~20% CPU utilisation to ~100% CPU utilisation across 
  ~200 cores and things only settled down after disabling force_rdns. 
  
  
  On 25 Feb 2015, at 11:55, Bernd Ahlers be...@graylog.com wrote:
  
  Johan, 
  
  the only thing that changed from 0.92 to 1.0 is that the DNS lookup is
  now done when the messages are read from the journal and not in the
  input path where the messages are received. Otherwise, nothing has
  changed in that regard.

  We do not do any manual caching of the DNS lookups, but the JVM caches
  them by default. Check
  http://docs.oracle.com/javase/7/docs/technotes/guides/net/properties.html
  for networkaddress.cache.ttl and networkaddress.cache.negative.ttl.
  
  Regards, 
  Bernd 
  
  On 25 February 2015 at 08:56,  sun...@sunner.com wrote:
  
  This is strange. I went through all of the settings for my reply, and we are
  indeed using rdns, and it seems to be the culprit. The strangeness is that
  it works fine on the old servers even though they're on the same networks,
  and using the same DNS servers and resolver settings.
  Did something regarding reverse DNS change between 0.92 and 1.0? I'm
  thinking perhaps the server is trying to do one lookup per message instead
  of caching reverse lookups, seeing as the latter would result in very little
  DNS traffic since most of the logs will be coming from a small number of
  hosts.
  
  Regards 
  Johan 
  
  On Tuesday, February 24, 2015 at 5:08:54 PM UTC+1, Bernd Ahlers 
 wrote: 
  
  
  Johan, 
  
  this sounds very strange indeed. Can you provide us with some more 
  details? 
  
  - What kind of messages are you pouring into Graylog via UDP? (GELF, 
  raw, syslog?) 
  - Do you have any extractors or grok filters running for the messages 
  coming in via UDP? 
  - Any other differences between the TCP and UDP messages? 
  - Can you show us your input configuration? 
  - Are you using reverse DNS lookups? 
  
  Thank you! 
  
  Regards, 
  Bernd 
  
  On 24 February 2015 at 16:45,  sun...@sunner.com wrote: 
  
  Well, that could be a suspect if it wasn't for the fact that the old nodes
  running on old hardware handle it just fine, along with the fact that the
  traffic seems to reach the nodes just fine (i.e. it actually fills the journal
  up just fine, and the input buffer never breaks a sweat). And it's really
  not that much traffic; even spread across four nodes those ~1000 messages
  per second will cause this, whereas the old nodes are just two and can
  handle it just fine.

  About disk tuning, I haven't done much of that, and I realize I forgot to
  mention that the Elasticsearch cluster is on separate physical hardware, so

Re: [graylog2] Graylog 1.0 UDP process buffer performance

2015-02-27 Thread Bernd Ahlers
Johan, Henrik,

I tried to track this problem down. The problem is that the JVM does
not cache reverse DNS lookups. The available JVM DNS cache settings
like networkaddress.cache.ttl only affect forward DNS lookups.

The code for doing the reverse lookups in Graylog has not changed in a
long time, so this problem is not new in 1.0.

In my test setup, enabling force_rdns for a syslog input reduced the
throughput from around 7000 msg/s to 300 msg/s. This was without a
local DNS cache. Once I installed a DNS cache on the Graylog server,
the throughput went up to around 3000 msg/s.

We will investigate whether there is a sane way to cache the reverse
lookups ourselves. In the meantime, I suggest testing with a DNS cache
installed on the Graylog server nodes to see if that helps, or
disabling the force_rdns setting.
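
To illustrate the idea (a rough sketch only, not actual Graylog code; the
class name and TTL are made up), an application-level cache would memoize
PTR results so that only the first message from each source pays for the
lookup:

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Hypothetical sketch: cache reverse lookups in the application, since
    // networkaddress.cache.ttl only covers forward lookups.
    public class RdnsCache {
        private static final long TTL_MILLIS = 60_000L; // assumed acceptable staleness

        private static class Entry {
            final String hostname;
            final long expiresAt;
            Entry(String hostname, long expiresAt) {
                this.hostname = hostname;
                this.expiresAt = expiresAt;
            }
        }

        private final ConcurrentMap<String, Entry> cache = new ConcurrentHashMap<>();

        public String lookup(String ip) throws UnknownHostException {
            final long now = System.currentTimeMillis();
            final Entry cached = cache.get(ip);
            if (cached != null && cached.expiresAt > now) {
                return cached.hostname; // cache hit: no DNS traffic at all
            }
            // Miss or expired entry: getCanonicalHostName() does the slow PTR query.
            final String hostname = InetAddress.getByName(ip).getCanonicalHostName();
            cache.put(ip, new Entry(hostname, now + TTL_MILLIS));
            return hostname;
        }
    }

A real implementation would also need to bound the map and cache negative
results, but with most logs coming from a small number of hosts even this
would cut the lookup rate drastically.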

Regards,
Bernd

On 25 February 2015 at 18:00, Bernd Ahlers be...@graylog.com wrote:
 Johan, Henrik,

 thanks for the details. I created an issue on GitHub and will investigate.

 https://github.com/Graylog2/graylog2-server/issues/999

 Regards,
 Bernd

 On 25 February 2015 at 17:48, Henrik Johansen h...@myunix.dk wrote:
 Bernd,

 Correct - that issue started after 0.92.x.

 We are still seeing elevated CPU utilisation but we are attributing that
 to the fact that 0.92 was losing messages in our setup.


 On 25 Feb 2015, at 17:37, Bernd Ahlers be...@graylog.com wrote:

 Henrik,

 uh, okay. I suppose it worked for you in 0.92 as well?

 I will create an issue on GitHub for that.

 Bernd

 On 25 February 2015 at 17:14, Henrik Johansen h...@myunix.dk wrote:
 Bernd,

 We saw the exact same issue - here is a graph over the CPU idle
 percentage across a few of the cluster nodes during the upgrade :

 http://5.9.37.177/graylog_cluster_cpu_idle.png

 We went from ~20% CPU utilisation to ~100% CPU utilisation across
 ~200 cores and things only settled down after disabling force_rdns.


 On 25 Feb 2015, at 11:55, Bernd Ahlers be...@graylog.com wrote:

 Johan,

 the only thing that changed from 0.92 to 1.0 is that the DNS lookup is
 now done when the messages are read from the journal and not in the
 input path where the messages are received. Otherwise, nothing has
 changed in that regard.

 We do not do any manual caching of the DNS lookups, but the JVM caches
 them by default. Check
 http://docs.oracle.com/javase/7/docs/technotes/guides/net/properties.html
 for networkaddress.cache.ttl and networkaddress.cache.negative.ttl.

 Regards,
 Bernd

 On 25 February 2015 at 08:56,  sun...@sunner.com wrote:

 This is strange. I went through all of the settings for my reply, and we are
 indeed using rdns, and it seems to be the culprit. The strangeness is that
 it works fine on the old servers even though they're on the same networks,
 and using the same DNS servers and resolver settings.
 Did something regarding reverse DNS change between 0.92 and 1.0? I'm
 thinking perhaps the server is trying to do one lookup per message instead
 of caching reverse lookups, seeing as the latter would result in very little
 DNS traffic since most of the logs will be coming from a small number of
 hosts.

 Regards
 Johan

 On Tuesday, February 24, 2015 at 5:08:54 PM UTC+1, Bernd Ahlers wrote:


 Johan,

 this sounds very strange indeed. Can you provide us with some more
 details?

 - What kind of messages are you pouring into Graylog via UDP? (GELF,
 raw, syslog?)
 - Do you have any extractors or grok filters running for the messages
 coming in via UDP?
 - Any other differences between the TCP and UDP messages?
 - Can you show us your input configuration?
 - Are you using reverse DNS lookups?

 Thank you!

 Regards,
 Bernd

 On 24 February 2015 at 16:45,  sun...@sunner.com wrote:

 Well, that could be a suspect if it wasn't for the fact that the old nodes
 running on old hardware handle it just fine, along with the fact that the
 traffic seems to reach the nodes just fine (i.e. it actually fills the journal
 up just fine, and the input buffer never breaks a sweat). And it's really
 not that much traffic; even spread across four nodes those ~1000 messages
 per second will cause this, whereas the old nodes are just two and can
 handle it just fine.

 About disk tuning, I haven't done much of that, and I realize I forgot to
 mention that the Elasticsearch cluster is on separate physical hardware, so
 there's a minuscule amount of disk I/O happening on the Graylog nodes.

 It's really very strange since it seems like UDP itself isn't to blame;
 after all, the messages get into Graylog just fine and fill up the journal
 rapidly. The screenshot I linked was from after I had stopped sending
 logs, i.e. there was no longer any ingress traffic, so the Graylog process had
 nothing to do except emptying its journal, so it should all be internal
 processing and egress traffic to Elasticsearch. And as can be seen in the
 screenshot, it seems like it's doing it in small bursts.

 In the exact same scenario (i.e. when I just streamed a large file 

Re: [graylog2] Graylog 1.0 UDP process buffer performance

2015-02-25 Thread Henrik Johansen
Bernd,

We saw the exact same issue - here is a graph over the CPU idle 
percentage across a few of the cluster nodes during the upgrade :

http://5.9.37.177/graylog_cluster_cpu_idle.png

We went from ~20% CPU utilisation to ~100% CPU utilisation across
~200 cores and things only settled down after disabling force_rdns.


 On 25 Feb 2015, at 11:55, Bernd Ahlers be...@graylog.com wrote:
 
 Johan,
 
 the only thing that changed from 0.92 to 1.0 is that the DNS lookup is
 now done when the messages are read from the journal and not in the
 input path where the messages are received. Otherwise, nothing has
 changed in that regard.
 
 We do not do any manual caching of the DNS lookups, but the JVM caches
 them by default. Check
 http://docs.oracle.com/javase/7/docs/technotes/guides/net/properties.html
 for networkaddress.cache.ttl and networkaddress.cache.negative.ttl.
 
 Regards,
 Bernd
 
 On 25 February 2015 at 08:56,  sun...@sunner.com wrote:
 This is strange. I went through all of the settings for my reply, and we are
 indeed using rdns, and it seems to be the culprit. The strangeness is that
 it works fine on the old servers even though they're on the same networks,
 and using the same DNS servers and resolver settings.
 Did something regarding reverse DNS change between 0.92 and 1.0? I'm
 thinking perhaps the server is trying to do one lookup per message instead
 of caching reverse lookups, seeing as the latter would result in very little
 DNS traffic since most of the logs will be coming from a small number of
 hosts.
 
 Regards
 Johan
 
 On Tuesday, February 24, 2015 at 5:08:54 PM UTC+1, Bernd Ahlers wrote:
 
 Johan,
 
 this sounds very strange indeed. Can you provide us with some more
 details?
 
 - What kind of messages are you pouring into Graylog via UDP? (GELF,
 raw, syslog?)
 - Do you have any extractors or grok filters running for the messages
 coming in via UDP?
 - Any other differences between the TCP and UDP messages?
 - Can you show us your input configuration?
 - Are you using reverse DNS lookups?
 
 Thank you!
 
 Regards,
 Bernd
 
 On 24 February 2015 at 16:45,  sun...@sunner.com wrote:
 Well, that could be a suspect if it wasn't for the fact that the old nodes
 running on old hardware handle it just fine, along with the fact that the
 traffic seems to reach the nodes just fine (i.e. it actually fills the journal
 up just fine, and the input buffer never breaks a sweat). And it's really
 not that much traffic; even spread across four nodes those ~1000 messages
 per second will cause this, whereas the old nodes are just two and can
 handle it just fine.

 About disk tuning, I haven't done much of that, and I realize I forgot to
 mention that the Elasticsearch cluster is on separate physical hardware, so
 there's a minuscule amount of disk I/O happening on the Graylog nodes.

 It's really very strange since it seems like UDP itself isn't to blame;
 after all, the messages get into Graylog just fine and fill up the journal
 rapidly. The screenshot I linked was from after I had stopped sending
 logs, i.e. there was no longer any ingress traffic, so the Graylog process had
 nothing to do except emptying its journal, so it should all be internal
 processing and egress traffic to Elasticsearch. And as can be seen in the
 screenshot, it seems like it's doing it in small bursts.

 In the exact same scenario (i.e. when I just streamed a large file into the
 system as fast as it could receive it) but with the logs having come over
 TCP, it'll still store up a sizable number of messages in the journal, but
 the processing of the journaled messages is both more even and vastly
 faster.

 So in short, it doesn't appear to be the communication itself but something
 happening inside the Graylog process, and that only happens when the
 messages have been delivered over UDP.
 
 Regards
 Johan
 
 
 On Tuesday, February 24, 2015 at 3:07:47 PM UTC+1, Henrik Johansen
 wrote:
 
 Could this simply be because TCP avoids (or tries to avoid) congestion
 while UDP does not?
 
 /HJ
 
 On 24 Feb 2015, at 13:50, sun...@sunner.com wrote:
 
 Hello,
 
 With the release of 1.0 we've started moving towards a new cluster of GL
 hosts. These are working very well, with one exception.
 For some reason any reasonably significant UDP traffic will choke the
 message processor, fill up the process buffers on all four hosts, and
 effectively choke up all other message processing as well.
 Normally we do around 2k messages per second, split roughly 50/50 between
 TCP and UDP. Sending the entire TCP load to one host doesn't present a
 problem; it doesn't break a sweat.
 
 I've also experimented a little with sending a large text file using
 rsyslog's imfile module. Sending it via TCP will bottleneck us at the ES
 side of things and cause the disk journal to fill up fairly rapidly, but it's
 still working at ~9k messages per second, so that's fine. Sending it via
 UDP just 

Re: [graylog2] Graylog 1.0 UDP process buffer performance

2015-02-25 Thread Bernd Ahlers
Henrik,

uh, okay. I suppose it worked for you in 0.92 as well?

I will create an issue on GitHub for that.

Bernd

On 25 February 2015 at 17:14, Henrik Johansen h...@myunix.dk wrote:
 Bernd,

 We saw the exact same issue - here is a graph over the CPU idle
 percentage across a few of the cluster nodes during the upgrade :

 http://5.9.37.177/graylog_cluster_cpu_idle.png

 We went from ~20% CPU utilisation to ~100% CPU utilisation across
 ~200 cores and things only settled down after disabling force_rdns.


 On 25 Feb 2015, at 11:55, Bernd Ahlers be...@graylog.com wrote:

 Johan,

 the only thing that changed from 0.92 to 1.0 is that the DNS lookup is
 now done when the messages are read from the journal and not in the
 input path where the messages are received. Otherwise, nothing has
 changed in that regard.

 We do not do any manual caching of the DNS lookups, but the JVM caches
 them by default. Check
 http://docs.oracle.com/javase/7/docs/technotes/guides/net/properties.html
 for networkaddress.cache.ttl and networkaddress.cache.negative.ttl.

 Regards,
 Bernd

 On 25 February 2015 at 08:56,  sun...@sunner.com wrote:

 This is strange. I went through all of the settings for my reply, and we are
 indeed using rdns, and it seems to be the culprit. The strangeness is that
 it works fine on the old servers even though they're on the same networks,
 and using the same DNS servers and resolver settings.
 Did something regarding reverse DNS change between 0.92 and 1.0? I'm
 thinking perhaps the server is trying to do one lookup per message instead
 of caching reverse lookups, seeing as the latter would result in very little
 DNS traffic since most of the logs will be coming from a small number of
 hosts.

 Regards
 Johan

 On Tuesday, February 24, 2015 at 5:08:54 PM UTC+1, Bernd Ahlers wrote:


 Johan,

 this sounds very strange indeed. Can you provide us with some more
 details?

 - What kind of messages are you pouring into Graylog via UDP? (GELF,
 raw, syslog?)
 - Do you have any extractors or grok filters running for the messages
 coming in via UDP?
 - Any other differences between the TCP and UDP messages?
 - Can you show us your input configuration?
 - Are you using reverse DNS lookups?

 Thank you!

 Regards,
 Bernd

 On 24 February 2015 at 16:45,  sun...@sunner.com wrote:

 Well, that could be a suspect if it wasn't for the fact that the old nodes
 running on old hardware handle it just fine, along with the fact that the
 traffic seems to reach the nodes just fine (i.e. it actually fills the journal
 up just fine, and the input buffer never breaks a sweat). And it's really
 not that much traffic; even spread across four nodes those ~1000 messages
 per second will cause this, whereas the old nodes are just two and can
 handle it just fine.

 About disk tuning, I haven't done much of that, and I realize I forgot to
 mention that the Elasticsearch cluster is on separate physical hardware, so
 there's a minuscule amount of disk I/O happening on the Graylog nodes.

 It's really very strange since it seems like UDP itself isn't to blame;
 after all, the messages get into Graylog just fine and fill up the journal
 rapidly. The screenshot I linked was from after I had stopped sending
 logs, i.e. there was no longer any ingress traffic, so the Graylog process had
 nothing to do except emptying its journal, so it should all be internal
 processing and egress traffic to Elasticsearch. And as can be seen in the
 screenshot, it seems like it's doing it in small bursts.

 In the exact same scenario (i.e. when I just streamed a large file into the
 system as fast as it could receive it) but with the logs having come over
 TCP, it'll still store up a sizable number of messages in the journal, but
 the processing of the journaled messages is both more even and vastly
 faster.

 So in short, it doesn't appear to be the communication itself but something
 happening inside the Graylog process, and that only happens when the
 messages have been delivered over UDP.

 Regards
 Johan


 On Tuesday, February 24, 2015 at 3:07:47 PM UTC+1, Henrik Johansen
 wrote:


 Could this simply be because TCP avoids (or tries to avoid) congestion
 while UDP does not?

 /HJ

 On 24 Feb 2015, at 13:50, sun...@sunner.com wrote:

 Hello,

 With the release of 1.0 we've started moving towards a new cluster of GL
 hosts. These are working very well, with one exception.
 For some reason any reasonably significant UDP traffic will choke the
 message processor, fill up the process buffers on all four hosts, and
 effectively choke up all other message processing as well.
 Normally we do around 2k messages per second, split roughly 50/50 between
 TCP and UDP. Sending the entire TCP load to one host doesn't present a
 problem; it doesn't break a sweat.

 I've also experimented a little with sending a large text file using
 rsyslog's imfile module. Sending it via TCP will bottleneck us at the ES
 side of things and cause the disk journal to fill 

Re: [graylog2] Graylog 1.0 UDP process buffer performance

2015-02-25 Thread Bernd Ahlers
Johan, Henrik,

thanks for the details. I created an issue on GitHub and will investigate.

https://github.com/Graylog2/graylog2-server/issues/999

Regards,
Bernd

On 25 February 2015 at 17:48, Henrik Johansen h...@myunix.dk wrote:
 Bernd,

 Correct - that issue started after 0.92.x.

 We are still seeing elevated CPU utilisation but we are attributing that
 to the fact that 0.92 was losing messages in our setup.


 On 25 Feb 2015, at 17:37, Bernd Ahlers be...@graylog.com wrote:

 Henrik,

 uh, okay. I suppose it worked for you in 0.92 as well?

 I will create an issue on GitHub for that.

 Bernd

 On 25 February 2015 at 17:14, Henrik Johansen h...@myunix.dk wrote:
 Bernd,

 We saw the exact same issue - here is a graph over the CPU idle
 percentage across a few of the cluster nodes during the upgrade :

 http://5.9.37.177/graylog_cluster_cpu_idle.png

 We went from ~20% CPU utilisation to ~100% CPU utilisation across
 ~200 cores and things only settled down after disabling force_rdns.


 On 25 Feb 2015, at 11:55, Bernd Ahlers be...@graylog.com wrote:

 Johan,

 the only thing that changed from 0.92 to 1.0 is that the DNS lookup is
 now done when the messages are read from the journal and not in the
 input path where the messages are received. Otherwise, nothing has
 changed in that regard.

 We do not do any manual caching of the DNS lookups, but the JVM caches
 them by default. Check
 http://docs.oracle.com/javase/7/docs/technotes/guides/net/properties.html
 for networkaddress.cache.ttl and networkaddress.cache.negative.ttl.

 Regards,
 Bernd

 On 25 February 2015 at 08:56,  sun...@sunner.com wrote:

 This is strange. I went through all of the settings for my reply, and we are
 indeed using rdns, and it seems to be the culprit. The strangeness is that
 it works fine on the old servers even though they're on the same networks,
 and using the same DNS servers and resolver settings.
 Did something regarding reverse DNS change between 0.92 and 1.0? I'm
 thinking perhaps the server is trying to do one lookup per message instead
 of caching reverse lookups, seeing as the latter would result in very little
 DNS traffic since most of the logs will be coming from a small number of
 hosts.

 Regards
 Johan

 On Tuesday, February 24, 2015 at 5:08:54 PM UTC+1, Bernd Ahlers wrote:


 Johan,

 this sounds very strange indeed. Can you provide us with some more
 details?

 - What kind of messages are you pouring into Graylog via UDP? (GELF,
 raw, syslog?)
 - Do you have any extractors or grok filters running for the messages
 coming in via UDP?
 - Any other differences between the TCP and UDP messages?
 - Can you show us your input configuration?
 - Are you using reverse DNS lookups?

 Thank you!

 Regards,
 Bernd

 On 24 February 2015 at 16:45,  sun...@sunner.com wrote:

 Well, that could be a suspect if it wasn't for the fact that the old nodes
 running on old hardware handle it just fine, along with the fact that the
 traffic seems to reach the nodes just fine (i.e. it actually fills the journal
 up just fine, and the input buffer never breaks a sweat). And it's really
 not that much traffic; even spread across four nodes those ~1000 messages
 per second will cause this, whereas the old nodes are just two and can
 handle it just fine.

 About disk tuning, I haven't done much of that, and I realize I forgot to
 mention that the Elasticsearch cluster is on separate physical hardware, so
 there's a minuscule amount of disk I/O happening on the Graylog nodes.

 It's really very strange since it seems like UDP itself isn't to blame;
 after all, the messages get into Graylog just fine and fill up the journal
 rapidly. The screenshot I linked was from after I had stopped sending
 logs, i.e. there was no longer any ingress traffic, so the Graylog process had
 nothing to do except emptying its journal, so it should all be internal
 processing and egress traffic to Elasticsearch. And as can be seen in the
 screenshot, it seems like it's doing it in small bursts.

 In the exact same scenario (i.e. when I just streamed a large file into the
 system as fast as it could receive it) but with the logs having come over
 TCP, it'll still store up a sizable number of messages in the journal, but
 the processing of the journaled messages is both more even and vastly
 faster.

 So in short, it doesn't appear to be the communication itself but something
 happening inside the Graylog process, and that only happens when the
 messages have been delivered over UDP.

 Regards
 Johan


 On Tuesday, February 24, 2015 at 3:07:47 PM UTC+1, Henrik Johansen
 wrote:


 Could this simply be because TCP avoids (or tries to avoid) congestion
 while UDP does not?

 /HJ

 On 24 Feb 2015, at 13:50, sun...@sunner.com wrote:

 Hello,

 With the release of 1.0 we've started moving towards a new cluster of GL
 hosts. These are working very well, with one exception.
 For some reason any reasonably significant UDP traffic will choke the
 message 

Re: [graylog2] Graylog 1.0 UDP process buffer performance

2015-02-25 Thread Bernd Ahlers
Johan,

the only thing that changed from 0.92 to 1.0 is that the DNS lookup is
now done when the messages are read from the journal and not in the
input path where the messages are received. Otherwise, nothing has
changed in that regard.

We do not do any manual caching of the DNS lookups, but the JVM caches
them by default. Check
http://docs.oracle.com/javase/7/docs/technotes/guides/net/properties.html
for networkaddress.cache.ttl and networkaddress.cache.negative.ttl.
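
For completeness, those two are Java security properties, so they are set in
the JVM's java.security file or programmatically via java.security.Security
(the values below are just examples). Note that, as said above, they only
affect forward lookups, not the reverse lookups done for force_rdns:

    import java.net.InetAddress;
    import java.security.Security;

    public class DnsCacheTuning {
        public static void main(String[] args) throws Exception {
            // These need to be set before the first lookup is performed.
            Security.setProperty("networkaddress.cache.ttl", "300");         // successful lookups: 5 min
            Security.setProperty("networkaddress.cache.negative.ttl", "10"); // failed lookups: 10 s

            // Subsequent forward lookups use the TTLs above; PTR lookups stay uncached.
            System.out.println(InetAddress.getByName("example.com"));
        }
    }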

Regards,
Bernd

On 25 February 2015 at 08:56,  sun...@sunner.com wrote:
 This is strange. I went through all of the settings for my reply, and we are
 indeed using rdns, and it seems to be the culprit. The strangeness is that
 it works fine on the old servers even though they're on the same networks,
 and using the same DNS servers and resolver settings.
 Did something regarding reverse DNS change between 0.92 and 1.0? I'm
 thinking perhaps the server is trying to do one lookup per message instead
 of caching reverse lookups, seeing as the latter would result in very little
 DNS traffic since most of the logs will be coming from a small number of
 hosts.

 Regards
 Johan

 On Tuesday, February 24, 2015 at 5:08:54 PM UTC+1, Bernd Ahlers wrote:

 Johan,

 this sounds very strange indeed. Can you provide us with some more
 details?

 - What kind of messages are you pouring into Graylog via UDP? (GELF,
 raw, syslog?)
 - Do you have any extractors or grok filters running for the messages
 coming in via UDP?
 - Any other differences between the TCP and UDP messages?
 - Can you show us your input configuration?
 - Are you using reverse DNS lookups?

 Thank you!

 Regards,
 Bernd

 On 24 February 2015 at 16:45,  sun...@sunner.com wrote:
  Well, that could be a suspect if it wasn't for the fact that the old nodes
  running on old hardware handle it just fine, along with the fact that the
  traffic seems to reach the nodes just fine (i.e. it actually fills the journal
  up just fine, and the input buffer never breaks a sweat). And it's really
  not that much traffic; even spread across four nodes those ~1000 messages
  per second will cause this, whereas the old nodes are just two and can
  handle it just fine.

  About disk tuning, I haven't done much of that, and I realize I forgot to
  mention that the Elasticsearch cluster is on separate physical hardware, so
  there's a minuscule amount of disk I/O happening on the Graylog nodes.

  It's really very strange since it seems like UDP itself isn't to blame;
  after all, the messages get into Graylog just fine and fill up the journal
  rapidly. The screenshot I linked was from after I had stopped sending
  logs, i.e. there was no longer any ingress traffic, so the Graylog process had
  nothing to do except emptying its journal, so it should all be internal
  processing and egress traffic to Elasticsearch. And as can be seen in the
  screenshot, it seems like it's doing it in small bursts.

  In the exact same scenario (i.e. when I just streamed a large file into the
  system as fast as it could receive it) but with the logs having come over
  TCP, it'll still store up a sizable number of messages in the journal, but
  the processing of the journaled messages is both more even and vastly
  faster.

  So in short, it doesn't appear to be the communication itself but something
  happening inside the Graylog process, and that only happens when the
  messages have been delivered over UDP.
 
  Regards
  Johan
 
 
  On Tuesday, February 24, 2015 at 3:07:47 PM UTC+1, Henrik Johansen
  wrote:
 
  Could this simply be because TCP avoids (or tries to avoid) congestion
  while UDP does not?
 
  /HJ
 
  On 24 Feb 2015, at 13:50, sun...@sunner.com wrote:
 
  Hello,
 
  With the release of 1.0 we've started moving towards a new cluster of GL
  hosts. These are working very well, with one exception.
  For some reason any reasonably significant UDP traffic will choke the
  message processor, fill up the process buffers on all four hosts, and
  effectively choke up all other message processing as well.
  Normally we do around 2k messages per second, split roughly 50/50 between
  TCP and UDP. Sending the entire TCP load to one host doesn't present a
  problem; it doesn't break a sweat.
 
  I've also experimented a little with sending a large text file using
  rsyslog's imfile module. Sending it via TCP will bottleneck us at the ES
  side of things and cause the disk journal to fill up fairly rapidly, but it's
  still working at ~9k messages per second, so that's fine. Sending it via
  UDP just causes GL to choke again, fill up the journal to a certain point,
  and slowly process the journal in little bursts of a few thousand
  messages followed by several seconds of apparent sleeping (i.e. pretty
  much no CPU usage).
 
  During all of this the input buffer never fills up more than at most
  single-digit percentages; using TCP, the output buffer sometimes moves
  up to