Thanks very much Arie, I will check these tomorrow and report back.

One thing I can confirm is that the heap size is configured correctly.

Cheers, Pete

> On 14 May 2015, at 05:35, Arie <satyava...@gmail.com> wrote:
> 
> Let's try some more options.
> 
> I see you are running your stuff virtualised, so for CentOS 6 you can consider 
> the following.
> 
> In your kernel boot configuration (/etc/grub.conf) you can add the following 
> options:
> 
>   nohz=off (for highly CPU-intensive systems)
>   elevator=noop (disk scheduling is done by the virtual layer, so disable it 
> in the guest)
>   cgroup_disable=memory (if cgroups are not used, this frees up some memory 
> and allocation overhead)
>   
> If you use the pvscsi device, add the following:
>   vmw_pvscsi.cmd_per_lun=254
>   vmw_pvscsi.ring_pages=32
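> 
> As a rough sketch only (the kernel version, root device and any existing 
> options below are placeholders, not taken from your systems), the kernel line 
> in /etc/grub.conf ends up with all of these appended on the one line, e.g.:
> 
>   kernel /vmlinuz-2.6.32-504.el6.x86_64 ro root=/dev/mapper/vg_sys-lv_root nohz=off elevator=noop cgroup_disable=memory vmw_pvscsi.cmd_per_lun=254 vmw_pvscsi.ring_pages=32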
> 
>  Check disk buffers on the virtual layer too; see VMware KB 2053145:
> http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=2053145&sliceId=1&docTypeID=DT_KB_1_1&dialogID=621755330&stateId=1%200%20593866502
> 
>  Optimize your disks for performance (gains of up to 30% are possible):
> 
>  For the filesystems where Graylog and/or Elasticsearch data is located, add 
> the following to /etc/fstab.
> 
> example:
> /dev/mapper/vg_nagios-lv_root /  ext4 
> defaults,noatime,nobarrier,data=writeback 1 1
> and if you want to be safer:
> /dev/mapper/vg_nagios-lv_root /  ext4 defaults,noatime,nobarrier 1 1    
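> 
> A side note on applying this (generic ext4 behaviour, nothing specific to 
> these hosts): noatime and nobarrier can usually be switched on without a 
> reboot, e.g.
> 
>   mount -o remount,noatime,nobarrier /
> 
> whereas data=writeback cannot be changed on a remount, so that one needs the 
> fstab entry (and for the root filesystem possibly rootflags=data=writeback on 
> the kernel line) plus a reboot.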
> 
> Is ES_HEAP_SIZE configured in the correct place? (I got that wrong at first.) 
> On CentOS it is in /etc/sysconfig/elasticsearch.
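> 
> For example, with the 16GB heap mentioned elsewhere in this thread, 
> /etc/sysconfig/elasticsearch should contain something like:
> 
>   ES_HEAP_SIZE=16g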
> 
> 
> All these options together can hugely improve system performance, especially 
> when the systems are virtual.
> 
> PS: did you correctly change your file descriptor limits?
> 
> /etc/sysctl.conf
> 
> fs.file-max = 65536
> 
>  /etc/security/limits.conf
> 
> *          soft     nproc       65535
> *          hard     nproc       65535
> *          soft     nofile      65535
> *          hard     nofile      65535
>  
>  /etc/security/limits.d/90-nproc.conf
> 
> *          soft     nproc       65535
> *          hard     nproc       65535
> *          soft     nofile      65535
> *          hard     nofile      65535
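> 
> A quick, generic way to verify the new limits actually apply to the running 
> processes (user and pid below are placeholders):
> 
>   sysctl fs.file-max
>   su - <service user> -c 'ulimit -n'
>   cat /proc/<java pid>/limits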
> 
> Check filesystem performance with "iotop -a" to see how it is doing.
> 
> HTH,
> 
> Arie
> 
> 
> On Tuesday 12 May 2015 at 23:52:19 UTC+2, Pete GS wrote:
>> 
>> No further input on this?
>> 
>> The Graylog master node now also seems to regularly drop out with the "Did 
>> not find meta info of this node. Re-registering." message, even though it is 
>> under no load as our load balancer doesn't direct any input messages to it.
>> 
>> Cheers, Pete
>> 
>>> On Thursday, 7 May 2015 07:44:41 UTC+10, Pete GS wrote:
>>> I've come back to the office this morning and discovered we had an 
>>> ElasticSearch issue last night which has resulted in lots of unprocessed 
>>> messages in the journal.
>>> 
>>> All the Graylog nodes are busy processing these and it seems to be slowly 
>>> crunching through them.
>>> 
>>> Load average (using htop) varies across the four nodes, but I'm seeing a 
>>> minimum of 13.59 / 11.80 and a maximum of 24.81 / 24.64.
>>> 
>>> Interestingly enough, the process buffer is only full on one of the nodes; 
>>> the other three appear to be 10% full or less.
>>> 
>>> The output buffers are all empty.
>>> 
>>> The issue with Elasticsearch was that it ran out of disk space, which I've 
>>> resolved for the moment, but my business case for new hardware should solve 
>>> that permanently.
>>> 
>>> What other info can I give you guys to help me look in the right direction?
>>> 
>>> Cheers, Pete
>>> 
>>>> On Wednesday, 6 May 2015 07:33:31 UTC+10, Pete GS wrote:
>>>> Thanks for the replies guys. I'm away from the office today but will check 
>>>> these things tomorrow.
>>>> 
>>>> Mathieu, I will check the load average but from memory the 5 minute 
>>>> average was around 12 or 18. I will confirm this tomorrow though.
>>>> 
>>>> As for the "co stop" metric, I haven't used esxtop on these hosts but I 
>>>> have looked at the CPU Ready metric and it seems to be OK (sub 5% 
>>>> sustained). One of the physical hosts has exactly the same number of CPUs 
>>>> allocated as the VMs running on it, but the other two physical hosts have 
>>>> no over-subscription of CPUs at all. There is no memory over-subscription 
>>>> on any hosts either.
>>>> 
>>>> For the moment I have simply increased the CPUs on the existing nodes as 
>>>> well as adding the two new ones. I am putting together a business case for 
>>>> new hardware for the Elasticsearch cluster, and if this goes ahead I will 
>>>> move to a model of more Graylog nodes with fewer CPUs and less memory per 
>>>> node, as I think that will scale better.
>>>> 
>>>> Arie, I will increase the output buffer processors tomorrow to see what 
>>>> happens, but I do know that the process buffer gets quite full at times 
>>>> while the output buffer is usually almost empty.
>>>> 
>>>>> On Wed, May 6, 2015 at 3:05 AM, Mathieu Grzybek <mathieu...@gmail.com> 
>>>>> wrote:
>>>>> Also check the "co stop" metric on VMware. I am sure you have too many 
>>>>> vCPUs.
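>>>>> 
>>>>> (General pointer, not specific to this environment: in esxtop, "c" opens 
>>>>> the CPU view, where %CSTP is the co-stop percentage and %RDY the ready 
>>>>> time; sustained %CSTP of more than a few percent usually means the VM has 
>>>>> more vCPUs than the host can comfortably co-schedule.)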
>>>>> 
>>>>>> On 5 May 2015 at 16:21, Arie <satya...@gmail.com> wrote:
>>>>>> 
>>>>>> What happens when you raise "outputbuffer_processors = 5" to 
>>>>>> "outputbuffer_processors = 10" ?
>>>>>> 
>>>>>> On Tuesday 5 May 2015 at 02:23:37 UTC+2, Pete GS wrote:
>>>>>>> 
>>>>>>> Yesterday I did a yum update on all Graylog and MongoDB nodes, and since 
>>>>>>> doing that and rebooting them all (there was a kernel update) it seems 
>>>>>>> that there are no longer issues connecting to the Mongo database.
>>>>>>> 
>>>>>>> However, I'm still seeing excessively high CPU usage on the Graylog 
>>>>>>> nodes, where all vCPUs are regularly exceeding 95%.
>>>>>>> 
>>>>>>> What can contribute to this? I'm a little stumped at present.
>>>>>>> 
>>>>>>> I would say our average messages/second is around 5,000 to 6,000 with 
>>>>>>> peaks up to about 12,000.
>>>>>>> 
>>>>>>> Cheers, Pete
>>>>>>> 
>>>>>>>> On Friday, 1 May 2015 08:20:35 UTC+10, Pete GS wrote:
>>>>>>>> Does anyone have any thoughts on this?
>>>>>>>> 
>>>>>>>> Even a pointer to some scenarios that would cause high CPU on Graylog 
>>>>>>>> servers, or to the circumstances in which Graylog would have trouble 
>>>>>>>> contacting the MongoDB servers, would be a help.
>>>>>>>> 
>>>>>>>> Cheers, Pete
>>>>>>>> 
>>>>>>>>> On Wednesday, 29 April 2015 10:34:28 UTC+10, Pete GS wrote:
>>>>>>>>> Hi all,
>>>>>>>>> 
>>>>>>>>> We acquired a company a while ago and last week we added all of their 
>>>>>>>>> logs to our Graylog environment which all come in from their Syslog 
>>>>>>>>> server via UDP.
>>>>>>>>> 
>>>>>>>>> After this, I noticed that the Graylog servers were maxing out their 
>>>>>>>>> CPUs, so to alleviate this I increased CPU resources on the existing 
>>>>>>>>> servers and added two new servers.
>>>>>>>>> 
>>>>>>>>> I'm still seeing generally high CPU usage with peaks of 100% on all 
>>>>>>>>> four of the Graylog servers, but now they also seem to have issues 
>>>>>>>>> connecting to MongoDB.
>>>>>>>>> 
>>>>>>>>> I see lots of "[NodePingThread] Did not find meta info of this node. 
>>>>>>>>> Re-registering." streaming through the log files but it only seems to 
>>>>>>>>> happen when I have more than two Graylog servers running.
>>>>>>>>> 
>>>>>>>>> I have verified NTP is installed and configured, and that all servers, 
>>>>>>>>> including the MongoDB and Elasticsearch servers, are syncing with the 
>>>>>>>>> same NTP servers.
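>>>>>>>>> 
>>>>>>>>> (For anyone wanting to double-check the same thing: "ntpq -p" on each 
>>>>>>>>> node should list the same upstream servers with small offset and 
>>>>>>>>> jitter values.)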
>>>>>>>>> 
>>>>>>>>> We're doing less than 10,000 messages per second so with the 
>>>>>>>>> resources I've allocated I would have expected no issues whatsoever.
>>>>>>>>> 
>>>>>>>>> I have seen this link: 
>>>>>>>>> https://groups.google.com/forum/?hl=en#!topic/graylog2/bW2glCdBIUI 
>>>>>>>>> but I don't believe it is our issue.
>>>>>>>>> 
>>>>>>>>> If it truly is being caused by doing lots of reverse DNS lookups, I 
>>>>>>>>> would expect tcpdump to show me that traffic to our DNS servers, but 
>>>>>>>>> I see almost no DNS lookups at all.
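>>>>>>>>> 
>>>>>>>>> (For reference, a capture along the lines of "tcpdump -ni eth0 port 53" 
>>>>>>>>> is enough to see this; the interface name here is just an example.)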
>>>>>>>>> 
>>>>>>>>> We have 6 inputs in total but only one receives the bulk of the 
>>>>>>>>> Syslog UDP messages. Most of the other inputs are GELF UDP inputs.
>>>>>>>>> 
>>>>>>>>> We also have 11 streams; however, pausing these streams seems to have 
>>>>>>>>> little to no impact on the CPU usage.
>>>>>>>>> 
>>>>>>>>> All the Graylog servers are virtualised on top of vSphere 5.5 Update 
>>>>>>>>> 2 with plenty of physical hardware available to service the workload 
>>>>>>>>> (little to no contention).
>>>>>>>>> 
>>>>>>>>> The original two have 20 vCPUs and 32GB RAM, the additional two have 
>>>>>>>>> 16 vCPUs and 32GB RAM.
>>>>>>>>> 
>>>>>>>>> Java heap on all is set to 16GB.
>>>>>>>>> 
>>>>>>>>> This is all running on CentOS 6.
>>>>>>>>> 
>>>>>>>>> Any input would be greatly appreciated as I'm a bit stumped on how to 
>>>>>>>>> get this resolved at present.
>>>>>>>>> 
>>>>>>>>> Here is the config file I'm using (censored where appropriate):
>>>>>>>>> 
>>>>>>>>> is_master = false
>>>>>>>>> node_id_file = /etc/graylog2/server/node-id
>>>>>>>>> password_secret = <Censored>
>>>>>>>>> root_username = <Censored>
>>>>>>>>> root_password_sha2 = <Censored>
>>>>>>>>> plugin_dir = /usr/share/graylog2-server/plugin
>>>>>>>>> rest_listen_uri = http://172.22.20.66:12900/
>>>>>>>>> 
>>>>>>>>> elasticsearch_max_docs_per_index = 20000000
>>>>>>>>> elasticsearch_max_number_of_indices = 999
>>>>>>>>> retention_strategy = close
>>>>>>>>> elasticsearch_shards = 4
>>>>>>>>> elasticsearch_replicas = 1
>>>>>>>>> elasticsearch_index_prefix = graylog2
>>>>>>>>> allow_leading_wildcard_searches = true
>>>>>>>>> allow_highlighting = true
>>>>>>>>> elasticsearch_cluster_name = graylog2
>>>>>>>>> elasticsearch_node_name = bne3-0002las
>>>>>>>>> elasticsearch_node_master = false
>>>>>>>>> elasticsearch_node_data = false
>>>>>>>>> elasticsearch_discovery_zen_ping_multicast_enabled = false
>>>>>>>>> elasticsearch_discovery_zen_ping_unicast_hosts = 
>>>>>>>>> bne3-0001lai.server-web.com:9300,bne3-0002lai.server-web.com:9300,bne3-0003lai.server-web.com:9300,bne3-0004lai.server-web.com:9300,bne3-0005lai.server-web.com:9300,bne3-0006lai.server-web.com:9300,bne3-0007lai.server-web.com:9300,bne3-0008lai.server-web.com:9300,bne3-0009lai.server-web.com:9300
>>>>>>>>> elasticsearch_cluster_discovery_timeout = 5000
>>>>>>>>> elasticsearch_discovery_initial_state_timeout = 3s
>>>>>>>>> elasticsearch_analyzer = standard
>>>>>>>>> 
>>>>>>>>> output_batch_size = 5000
>>>>>>>>> output_flush_interval = 1
>>>>>>>>> processbuffer_processors = 20
>>>>>>>>> outputbuffer_processors = 5
>>>>>>>>> #outputbuffer_processor_keep_alive_time = 5000
>>>>>>>>> #outputbuffer_processor_threads_core_pool_size = 3
>>>>>>>>> #outputbuffer_processor_threads_max_pool_size = 30
>>>>>>>>> #udp_recvbuffer_sizes = 1048576
>>>>>>>>> processor_wait_strategy = blocking
>>>>>>>>> ring_size = 65536
>>>>>>>>> 
>>>>>>>>> inputbuffer_ring_size = 65536
>>>>>>>>> inputbuffer_processors = 2
>>>>>>>>> inputbuffer_wait_strategy = blocking
>>>>>>>>> 
>>>>>>>>> message_journal_enabled = true
>>>>>>>>> message_journal_dir = /var/lib/graylog-server/journal
>>>>>>>>> message_journal_max_age = 24h
>>>>>>>>> message_journal_max_size = 150gb
>>>>>>>>> message_journal_flush_age = 1m
>>>>>>>>> message_journal_flush_interval = 1000000
>>>>>>>>> message_journal_segment_age = 1h
>>>>>>>>> message_journal_segment_size = 1gb
>>>>>>>>> 
>>>>>>>>> dead_letters_enabled = false
>>>>>>>>> lb_recognition_period_seconds = 3
>>>>>>>>> 
>>>>>>>>> mongodb_useauth = true
>>>>>>>>> mongodb_user = <Censored>
>>>>>>>>> mongodb_password = <Censored>
>>>>>>>>> mongodb_replica_set = 
>>>>>>>>> bne3-0001ladb.server-web.com:27017,bne3-0002ladb.server-web.com:27017
>>>>>>>>> mongodb_database = graylog2
>>>>>>>>> mongodb_max_connections = 200
>>>>>>>>> mongodb_threads_allowed_to_block_multiplier = 5
>>>>>>>>> 
>>>>>>>>> #rules_file = /etc/graylog2.drl
>>>>>>>>> 
>>>>>>>>> # Email transport
>>>>>>>>> transport_email_enabled = true
>>>>>>>>> transport_email_hostname = <Censored>
>>>>>>>>> transport_email_port = 25
>>>>>>>>> transport_email_use_auth = false
>>>>>>>>> transport_email_use_tls = false
>>>>>>>>> transport_email_use_ssl = false
>>>>>>>>> transport_email_auth_username = y...@example.com
>>>>>>>>> transport_email_auth_password = secret
>>>>>>>>> transport_email_subject_prefix = [graylog2]
>>>>>>>>> transport_email_from_email = <Censored>
>>>>>>>>> transport_email_web_interface_url = <Censored>
>>>>>>>>> 
>>>>>>>>> message_cache_off_heap = false
>>>>>>>>> message_cache_spool_dir = /var/lib/graylog2-server/message-cache-spool
>>>>>>>>> #message_cache_commit_interval = 1000
>>>>>>>>> #input_cache_max_size = 0
>>>>>>>>> 
>>>>>>>>> #ldap_connection_timeout = 2000
>>>>>>>>> 
>>>>>>>>> versionchecks = false
>>>>>>>>> 
>>>>>>>>> #enable_metrics_collection = false
>>>>>> 
>>>>>> 
> 
