I was running the Wazuh 2.8.1 agent on "most" systems, with the Wazuh OSSEC Docker container as the master server.
I upgraded to 2.8.3 to try to resolve this problem, with no luck. Out of about 160 machines, 4-5 of them will reliably wedge themselves after some amount of time, with messages akin to:

    2017 Feb 28 15:35:34 <server> NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ossec-syscheckd:12608]
    2017 Feb 28 15:36:02 <server> NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ossec-syscheckd:12608]
    2017 Feb 28 15:36:34 <server> NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [ossec-syscheckd:12608]

If this continues long enough, the entire system grinds to a halt and requires Big Red Button service.

I finally managed to attach an strace to one today, but I may not have gotten it right:

    # strace -e trace=read,write -p 12608

This displayed an awful lot of noise (I'd just done a clean reinstall of ossec-hids-agent) of the format:

    read(7, "ST6=m\nCONFIG_CRYPTO_CAST6_AVX_X8"..., 1024) = 1024
    read(7, "TO_DEV_QAT=m\nCONFIG_CRYPTO_DEV_Q"..., 1024) = 1024
    read(7, "G_PERCPU_RWSEM=y\nCONFIG_ARCH_USE"..., 1024) = 1024
    read(7, "NFIG_TEXTSEARCH_BM=m\nCONFIG_TEXT"..., 1024) = 479
    read(7, "", 1024) = 0

before going into this loop:

    SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13344, si_status=0, si_utime=0, si_stime=0}
    SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13346, si_status=0, si_utime=0, si_stime=0}
    SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13355, si_status=0, si_utime=0, si_stime=0}
    SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13357, si_status=0, si_utime=0, si_stime=0}
    SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13359, si_status=0, si_utime=0, si_stime=0}
    SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13361, si_status=0, si_utime=0, si_stime=0}
    SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13363, si_status=0, si_utime=0, si_stime=0}
    .......
    SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13660, si_status=0, si_utime=0, si_stime=0}

and then the soft lockup messages started -- and no, I didn't think to attach an strace to pid 13660 until after I'd rebooted. It's a production server, and while it's not heavily used, it's used enough that we don't want it off during production hours.

Some information about the server:

    kernelrelease => 3.10.0-514.6.1.el7.x86_64
    lsbdistdescription => Red Hat Enterprise Linux Server release 7.3 (Maipo)

It's a VM under VMware ESX, 2 cores, 2 GB of memory, ext4 / LVM. All of the affected systems appear to be Red Hat 7, all patched within the last 30 days.

Any suggestions on where to look next? Thanks in advance!

--John
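P.S. Since the SIGCHLD flood suggests ossec-syscheckd is spawning short-lived children, next time one wedges I'll try to grab a fuller trace that follows the forks too. Something like this (a sketch -- the pid is whatever ossec-syscheckd is running as at the time; -f, -tt, -e trace=process, and -o are standard strace options):

    # -f            follow forked children, so the processes producing the
    #               SIGCHLDs get traced, not just the parent
    # -tt           wall-clock timestamps, to correlate with the soft-lockup
    #               messages in the kernel log
    # -e trace=...  process-management syscalls (fork/exec/wait/exit) plus
    #               the read/write noise seen before
    # -o            write the trace to a file instead of the terminal
    strace -f -tt -e trace=process,read,write -o /tmp/syscheckd.trace -p 12608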