On 07/03/2017 20:26, Alex Rousskov wrote:
> These stuck disker responses probably explain why your disks do not
> receive any traffic. It is potentially important that both disker
> responses shown in your logs got stuck at approximately the same
> absolute time ~13 days ago (around 2017-02-22, give or take a day;
> subtract 1136930911 milliseconds from 15:53:05.255 in your Squid time
> zone to know the "exact" time when those stuck requests were queued).
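That subtraction works out to roughly 13.16 days. A quick sketch of the arithmetic (the date 2017-03-07 is an assumption taken from this thread's date; only the time 15:53:05.255 is quoted above):

```python
from datetime import datetime, timedelta

# Assumption: the stuck log line was printed on 2017-03-07 (the date of
# this thread); only the time-of-day 15:53:05.255 appears in the quote.
logged = datetime(2017, 3, 7, 15, 53, 5, 255000)
queued = logged - timedelta(milliseconds=1136930911)  # ~13.16 days earlier
print(queued)  # lands on 2017-02-22, matching the estimate above
```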
> How can a disker response get stuck? Most likely, something unusual
> happened ~13 days ago. This could be a Squid bug and/or a kid restart.
>
> * Do all currently running Squid kid processes have about the same
>   start time? [1]
> * Do you see ipcIo6.381049w7 or ipcIo6.153009r8 mentioned in any old
>   non-debugging messages/warnings?
I searched the log files from those days and found nothing unusual;
"grep" returns nothing for ipcIo6.381049w7 or ipcIo6.153009r8.

At the time I couldn't check whether the kids still had the same uptime:
I had reformatted the /cache2, /cache3 and /cache4 partitions and
started fresh with squid -z. But looking at the ps output right now, I
think I can answer that question:
root@proxy:~# ps auxw | grep squid-
proxy 10225  0.0  0.0 13964224   21708 ? S Mar10  0:10 (squid-coord-10) -s
proxy 10226  0.1 12.5 14737524 8268056 ? S Mar10  7:14 (squid-disk-9) -s
proxy 10227  0.0 11.6 14737524 7686564 ? S Mar10  3:08 (squid-disk-8) -s
proxy 10228  0.1 14.9 14737540 9863652 ? S Mar10  7:30 (squid-disk-7) -s
proxy 18348  3.5 10.3 17157560 6859904 ? S Mar13 48:44 (squid-6) -s
proxy 18604  2.8  9.0 16903948 5977728 ? S Mar13 37:28 (squid-4) -s
proxy 18637  1.7 10.8 16836872 7163392 ? R Mar13 23:03 (squid-1) -s
proxy 20831 15.3 10.3 17226652 6838372 ? S 08:50 39:51 (squid-2) -s
proxy 21189  5.3  2.8 16538064 1871788 ? S 12:29  2:12 (squid-5) -s
proxy 21214  3.8  1.5 16448972 1012720 ? S 12:43  1:03 (squid-3) -s
The diskers aren't dying, but the workers are, a lot, with that
"assertion failed: client_side_reply.cc:1167:
http->storeEntry()->objectLen() >= headers_sz" message.

Looking at df and iostat, it seems /cache3 is no longer being accessed
at all. (I believe that is squid-disk-8 above; note its lower CPU time.)
Another odd thing: lots of timeouts and overflows happen during
off-peak hours. From midnight to 7 AM we have maybe 1-2% of the clients
we have from 8 AM to 5 PM (business hours).
2017/03/14 00:26:50 kid3| WARNING: abandoning 23 /cache4/rock I/Os after at least 7.00s timeout
2017/03/14 00:26:53 kid1| WARNING: abandoning 1 /cache4/rock I/Os after at least 7.00s timeout
2017/03/14 02:14:48 kid5| ERROR: worker I/O push queue for /cache4/rock overflow: ipcIo5.68259w9
2017/03/14 06:33:43 kid3| ERROR: worker I/O push queue for /cache4/rock overflow: ipcIo3.55919w9
2017/03/14 06:57:53 kid3| ERROR: worker I/O push queue for /cache4/rock overflow: ipcIo3.58130w9
The /cache4 partition is where the largest objects are stored:
maximum_object_size 4 GB
cache_dir rock /cache2 110000 min-size=0 max-size=65536 max-swap-rate=150 swap-timeout=360
cache_dir rock /cache3 110000 min-size=65537 max-size=262144 max-swap-rate=150 swap-timeout=380
cache_dir rock /cache4 110000 min-size=262145 max-swap-rate=150 swap-timeout=500
</cache_dir>
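For reference, the size tiering those cache_dir lines set up can be sketched as a simple lookup (a sketch only; boundaries are copied from the config above, and the 4 GB ceiling comes from maximum_object_size):

```python
# Size-based routing implied by the cache_dir min-size/max-size options above.
MAX_OBJECT_SIZE = 4 * 1024 ** 3  # maximum_object_size 4 GB

def cache_dir_for(size_bytes):
    """Return the rock cache_dir an object of this size would land in."""
    if size_bytes <= 65536:            # min-size=0, max-size=65536
        return "/cache2"
    if size_bytes <= 262144:           # min-size=65537, max-size=262144
        return "/cache3"
    if size_bytes <= MAX_OBJECT_SIZE:  # min-size=262145, no max-size
        return "/cache4"
    return None                        # larger than maximum_object_size

print(cache_dir_for(300_000))  # -> /cache4
```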
I still don't understand why /cache3 stopped being used while /cache4
remains active, even with all those warnings and errors.. :/
--
Atenciosamente / Best Regards,
Heiler Bemerguy
Network Manager - CINBESA
55 91 98151-4894/3184-1751
_______________________________________________
squid-dev mailing list
squid-dev@lists.squid-cache.org
http://lists.squid-cache.org/listinfo/squid-dev