Replied off-list by accident. Repeating here...
On 1/8/2018 9:52 AM, Dave Jones wrote:
On 01/08/2018 08:06 AM, Kevin A. McGrail wrote:
On 1/8/2018 8:01 AM, Dave Jones wrote:
I know the nightly rules promotion script hits the ruleqa site to
make sure there have been 3 days of successful masschecks, so if we
did add a captcha, that script would have to be excluded.
I have the web logs from the ruleqa site going to their own files
now, with awstats set up to give a quick overview of what is going
on. I was planning on letting this ride for a bit to see what kind
of activity the ruleqa site normally gets. I don't think it gets hit
much now that the bots are taken care of.
Also, I installed NRPE on the box and am monitoring it more closely
via Icinga. I will have graphs of memory usage and get alerts when
the memory is being exhausted again. Hopefully, if this happens
again, I can get into the box before the OOM killer starts whacking
processes.
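
For illustration, here is a minimal sketch of the kind of memory check
that could sit behind NRPE. It is not the check actually deployed on the
box. The function names and the 80%/90% thresholds are assumptions made
for this example; only the /proc/meminfo fields and the usual
Nagios-style return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are
standard.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def check_memory(meminfo, warn_pct=80.0, crit_pct=90.0):
    """Return (status_code, message) using Nagios-style status codes.

    The thresholds are hypothetical defaults, not values from this thread.
    """
    try:
        total = meminfo["MemTotal"]
        available = meminfo["MemAvailable"]  # present on Linux 3.14 and later
    except KeyError as exc:
        return 3, "UNKNOWN - missing field %s" % exc
    if total <= 0:
        return 3, "UNKNOWN - MemTotal is not positive"
    used_pct = 100.0 * (total - available) / total
    if used_pct >= crit_pct:
        return 2, "CRITICAL - %.1f%% memory used" % used_pct
    if used_pct >= warn_pct:
        return 1, "WARNING - %.1f%% memory used" % used_pct
    return 0, "OK - %.1f%% memory used" % used_pct


def main():
    """Read the live /proc/meminfo, print the status line, return the code."""
    with open("/proc/meminfo") as handle:
        code, message = check_memory(parse_meminfo(handle.read()))
    print(message)
    return code
```

To use it as a plugin, the script would end with sys.exit(main()) so
that the monitoring system sees the status code as the exit status.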
Thanks. Out of interest, is the lack of swap what is completely
killing the box? I'm used to DDoSes, but I don't see why this is
sending the whole box into a spiral.
The lack of swap is not the direct problem, but it certainly is making
it hard to troubleshoot the actual problem. I really think it's odd
for infra not to set up swap space. I know they said it was bad for
their SAN, and I suppose it would be if it were on SSDs and VMs
were constantly swapping.
We normally monitor our VMs so that we get alerts when memory is
getting near exhaustion and swap is being used so it doesn't impact
our SAN. At least this gives us some time to get into the box, see
what is going on, and restart processes before the whole system is
unresponsive.
We build our VMs with swap space and it runs on our Compellent
SAN with SSDs without a problem. Not sure why infra doesn't. You
just need to make sure to mount everything with the "discard" option
to play nice with most SANs' virtual block allocation and freeing.
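
As a rough sketch of how one might audit that, the following parses
fstab-style text and lists the entries that lack a "discard" option.
The function names and the set of filesystem types checked are
assumptions for this example. Whether online discard is the right
choice also depends on the storage backend, and some setups prefer a
periodic fstrim instead.

```python
def has_discard(options):
    """True if a comma-separated option string enables discard."""
    return any(
        opt == "discard" or opt.startswith("discard=")
        for opt in options.split(",")
    )


def entries_without_discard(fstab_text,
                            fstypes=("ext4", "xfs", "btrfs", "swap")):
    """Return (device, mount_point) pairs that lack a discard option.

    Only entries whose filesystem type is in `fstypes` are considered;
    that default tuple is an illustrative assumption.
    """
    missing = []
    for line in fstab_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        device, mount_point, fstype, options = fields[:4]
        if fstype in fstypes and not has_discard(options):
            missing.append((device, mount_point))
    return missing
```

Feeding it the contents of /etc/fstab gives a quick list of volumes to
review before changing any mount options.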
My Icinga memory graphs show used RAM hovering around 6 GB so
far. It dropped to around 3.0 GB last night when the hourly
ruleqa updates weren't happening in cron and the masscheck was
running. We seem to be wasting a lot of RAM in that VM right now
after getting Apache HTTPD under control by blocking the bad bots. We
will have more informative graphs after more time has passed.
Yeah, I was confused about the swap space as well, but they give us more
RAM easily.
Let me know in a week or so and we can ask them to lower the RAM if you
want.
Regards,
KAM