Re: [graylog2] Graylog 1.0 UDP process buffer performance
Thanks for the feedback! :)

Bernd
Re: [graylog2] Graylog 1.0 UDP process buffer performance
I installed unbound locally and used this, and it seems to have resolved the issue. It's odd that the old server didn't show this behavior, but I'm happy enough that it's resolved anyway. :)

Regards
Johan
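As a quick way to check whether a local caching resolver such as unbound is actually being hit, a small timing probe can be run on a Graylog node. This is not from the thread, just an illustrative sketch; the IP address is a placeholder, and with a warm local cache the second lookup should come back noticeably faster than the first.

    import java.net.InetAddress;

    // Rough timing of two consecutive reverse lookups for the same address.
    public class RdnsTiming {
        public static void main(String[] args) throws Exception {
            String ip = args.length > 0 ? args[0] : "192.0.2.10"; // placeholder address
            for (int i = 0; i < 2; i++) {
                long start = System.nanoTime();
                // getCanonicalHostName() triggers the reverse (PTR) lookup.
                String name = InetAddress.getByName(ip).getCanonicalHostName();
                long micros = (System.nanoTime() - start) / 1_000;
                System.out.println("lookup " + (i + 1) + ": " + name + " (" + micros + " us)");
            }
        }
    }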
Re: [graylog2] Graylog 1.0 UDP process buffer performance
Johan, Henrik,

I tried to track this problem down. The problem is that the JVM does not cache reverse DNS lookups. The available JVM DNS cache settings like networkaddress.cache.ttl only affect forward DNS lookups. The code for doing the reverse lookups in Graylog has not changed in a long time, so this problem is not new in 1.0.

In my test setup, enabling force_rdns for a syslog input reduced the throughput from around 7000 msg/s to 300 msg/s. This was without a local DNS cache. Once I installed a DNS cache on the Graylog server, the throughput went up to around 3000 msg/s.

We will investigate whether there is a sane way to cache the reverse lookups ourselves. In the meantime I suggest testing with a DNS cache installed on the Graylog server nodes to see if that helps, or disabling the force_rdns setting.

Regards,
Bernd
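To illustrate the kind of application-level cache Bernd says the team will investigate, here is a minimal sketch in Java. It is not Graylog's actual implementation; the class and method names are hypothetical, and a production version would also want a size bound and a TTL so stale PTR records eventually expire.

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Hypothetical helper, not Graylog code: memoize reverse lookups per source
    // address so repeated messages from the same host do not each trigger a PTR query.
    public final class RdnsCache {
        private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();

        public String reverseLookup(String ipAddress) {
            return cache.computeIfAbsent(ipAddress, ip -> {
                try {
                    // getCanonicalHostName() performs the reverse (PTR) lookup and
                    // falls back to the textual IP if resolution fails.
                    return InetAddress.getByName(ip).getCanonicalHostName();
                } catch (UnknownHostException e) {
                    return ip; // keep the raw address if the literal cannot be parsed
                }
            });
        }
    }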
Re: [graylog2] Graylog 1.0 UDP process buffer performance
Bernd,

We saw the exact same issue - here is a graph of the CPU idle percentage across a few of the cluster nodes during the upgrade: http://5.9.37.177/graylog_cluster_cpu_idle.png

We went from ~20% CPU utilisation to ~100% CPU utilisation across ~200 cores, and things only settled down after disabling force_rdns.
Re: [graylog2] Graylog 1.0 UDP process buffer performance
Henrik,

uh, okay. I suppose it worked for you in 0.92 as well? I will create an issue on GitHub for that.

Bernd
Re: [graylog2] Graylog 1.0 UDP process buffer performance
Johan, Henrik,

thanks for the details. I created an issue on GitHub and will investigate. https://github.com/Graylog2/graylog2-server/issues/999

Regards,
Bernd

On 25 February 2015 at 17:48, Henrik Johansen h...@myunix.dk wrote:

Bernd,

Correct - that issue started after 0.92.x. We are still seeing elevated CPU utilisation, but we are attributing that to the fact that 0.92 was losing messages in our setup.
Re: [graylog2] Graylog 1.0 UDP process buffer performance
Johan,

the only thing that changed from 0.92 to 1.0 is that the DNS lookup is now done when the messages are read from the journal and not in the input path where the messages are received. Otherwise, nothing has changed in that regard. We do not do any manual caching of the DNS lookups, but the JVM caches them by default. Check http://docs.oracle.com/javase/7/docs/technotes/guides/net/properties.html for networkaddress.cache.ttl and networkaddress.cache.negative.ttl.

Regards,
Bernd

On 25 February 2015 at 08:56, sun...@sunner.com wrote:

This is strange. I went through all of the settings for my reply, and we are indeed using rdns, and it seems to be the culprit. The strange thing is that it works fine on the old servers even though they're on the same networks and using the same DNS servers and resolver settings. Did something regarding reverse DNS change between 0.92 and 1.0? I'm thinking perhaps the server is trying to do one lookup per message instead of caching reverse lookups, seeing as the latter would result in very little DNS traffic since most of the logs will be coming from a small number of hosts.

Regards
Johan

On Tuesday, February 24, 2015 at 5:08:54 PM UTC+1, Bernd Ahlers wrote:

Johan,

this sounds very strange indeed. Can you provide us with some more details?

- What kind of messages are you pouring into Graylog via UDP? (GELF, raw, syslog?)
- Do you have any extractors or grok filters running for the messages coming in via UDP?
- Any other differences between the TCP and UDP messages?
- Can you show us your input configuration?
- Are you using reverse DNS lookups?

Thank you!

Regards,
Bernd

On 24 February 2015 at 16:45, sun...@sunner.com wrote:

Well, that could be a suspect if it wasn't for the fact that the old nodes running on old hardware handle it just fine, along with the fact that the traffic seems to reach the nodes just fine (i.e. it actually fills the journal up, and the input buffer never breaks a sweat). And it's really not that much traffic; even spread across four nodes those ~1000 messages per second will cause this, whereas the old nodes are just two and can handle it just fine.

About disk tuning, I haven't done much of that, and I realize I forgot to mention that the Elasticsearch cluster is on separate physical hardware, so there's a minuscule amount of disk I/O happening on the Graylog nodes.

It's really very strange since it seems like UDP itself isn't to blame; after all, the messages get into Graylog just fine and fill up the journal rapidly. The screenshot I linked was from after I had stopped sending logs, i.e. there was no longer any ingress traffic, so the Graylog process had nothing to do except empty its journal; it should all be internal processing and egress traffic to Elasticsearch. And as can be seen in the screenshot, it seems to be doing that in small bursts.

In the exact same scenario (i.e. when I just streamed a large file into the system as fast as it could receive it) but with the logs having come over TCP, it'll still store up a sizable number of messages in the journal, but the processing of the journaled messages is both more even and vastly faster. So in short, it doesn't appear to be the communication itself, but something happening inside the Graylog process, and it only happens when the messages have been delivered over UDP.

Regards
Johan

On Tuesday, February 24, 2015 at 3:07:47 PM UTC+1, Henrik Johansen wrote:

Could this simply be because TCP avoids (or tries to avoid) congestion while UDP does not?

/HJ

On 24 Feb 2015, at 13:50, sun...@sunner.com wrote:

Hello,

With the release of 1.0 we've started moving towards a new cluster of GL hosts. These are working very well, with one exception. For some reason any reasonably significant UDP traffic will choke the message processor, fill up the process buffers on all four hosts, and effectively choke up all other message processing as well. Normally we do around 2k messages per second, split roughly 50/50 between TCP and UDP. Sending the entire TCP load to one host doesn't present a problem; it doesn't break a sweat.

I've also experimented a little with sending a large text file using rsyslog's imfile module. Sending it via TCP will bottleneck us at the ES side of things and cause the disk journal to fill up fairly rapidly, but it's still working at ~9k messages per second, so that's fine. Sending it via UDP just causes GL to choke again, fill up the journal to a certain point, and slowly process the journal in little bursts of a few thousand messages followed by several seconds of apparent sleeping (i.e. pretty much no CPU usage). During all of this the input buffer never fills up more than single-digit percentages; using TCP the output buffer sometimes moves up to
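For reference, the JVM-level DNS cache settings Bernd points to above can be set either in $JAVA_HOME/lib/security/java.security or programmatically before the first lookup. A minimal sketch follows; note that, per Bernd's later follow-up, these properties only govern forward (name-to-address) lookups, so they would not have helped with the force_rdns reverse lookups discussed in this thread. The hostname used is a placeholder.

    import java.net.InetAddress;
    import java.security.Security;

    // Sketch of the JVM forward-lookup cache settings referenced above.
    public class JvmDnsCacheSettings {
        public static void main(String[] args) throws Exception {
            // Cache successful forward lookups for 300 seconds and failed ones for 10.
            // These must be set before the first InetAddress lookup in the process.
            Security.setProperty("networkaddress.cache.ttl", "300");
            Security.setProperty("networkaddress.cache.negative.ttl", "10");

            // Forward lookup, cached according to the TTLs above
            // ("graylog.example.org" is a placeholder hostname).
            System.out.println(InetAddress.getByName("graylog.example.org").getHostAddress());
        }
    }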