Re: [squid-dev] Rock store stopped accessing discs

2017-03-17 Thread Alex Rousskov
On 03/17/2017 09:48 AM, Heiler Bemerguy wrote:
> Sadly the same thing occurred again. It seems the hole is deeper lol..

Most likely, it is the same hole. However, the more we panic and jump to
conclusions, the deeper that hole below us may look...

> I couldn't find any previous messages that could give a clue why this is 
> happening..

> 2017/03/15 19:01:29 kid2| WARNING: communication with /cache2/rock may be too 
> slow or disrupted for about 7.00s; rescued 1 out of 1 I/Os
> 2017/03/15 19:15:05 kid2| WARNING: communication with /cache2/rock may be too 
> slow or disrupted for about 7.00s; rescued 1 out of 1 I/Os
> 2017/03/15 19:18:02 kid2| WARNING: communication with /cache2/rock may be too 
> slow or disrupted for about 7.00s; rescued 1 out of 1 I/Os
> 2017/03/15 19:22:27 kid2| WARNING: communication with /cache2/rock may be too 
> slow or disrupted for about 7.00s; rescued 1 out of 1 I/Os
> 2017/03/15 19:26:19 kid2| WARNING: communication with /cache2/rock may be too 
> slow or disrupted for about 7.00s; rescued 1 out of 1 I/Os

In general, such semi-periodic problems without any other serious error
messages could be due to missing max-swap-rate which is designed to
prevent disk overload and associated process blockage by the OS. I doubt
that is what happening in your particular case because I would expect
more I/Os to be affected (but you can validate that theory by following
my original recommendation).

Can you reproduce this problem if you increase logging levels? If yes, I
suggest configuring each kid to log to its own cache-N.log file and then
using "-k debug" (or equivalent) to collect detailed logs containing a few
of these WARNINGs.
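A sketch of this setup, assuming the SMP ${process_number} macro is available in your squid.conf (the path is an example, not from the thread):

```
# squid.conf: give each kid its own log so messages are not interleaved
cache_log /var/log/squid/cache-${process_number}.log
```

After reconfiguring, "squid -k debug" toggles maximum verbosity at runtime; toggle it again once a few of the WARNINGs have been captured.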


Cheers,

Alex.

___
squid-dev mailing list
squid-dev@lists.squid-cache.org
http://lists.squid-cache.org/listinfo/squid-dev


Re: [squid-dev] Rock store stopped accessing discs

2017-03-17 Thread Heiler Bemerguy



On 15/03/2017, Alex Rousskov wrote:

> I have not checked the code, but I suspect that
>
> * You see no timeout errors because you disabled timeouts.
>
> * You see no I/O delay warnings because you disabled I/O delays.
>
> That said, I would not be surprised if disabling max-swap-rate reduces
> the probability of an I/O message getting stuck in the queue, because
> without artificial message delays, the probability of a kid dying with a
> pending message is naturally lower.
>
> There are probably some bugs in the code even when no kids are dying,
> but there is currently no information warranting an investigation, AFAICT.


Sadly the same thing occurred again. It seems the hole is deeper lol.. 
max-swap-rate/swap-timeout are still disabled...
I couldn't find any previous messages that could give a clue why this is 
happening.. :/


2017/03/15 16:18:07 kid6| urlParse: Illegal hostname 'static..unocdn.com'
2017/03/15 16:18:07 kid4| urlParse: Illegal hostname 'static..unocdn.com'
2017/03/15 16:19:07 kid6| WARNING: Closing client connection due to 
lifetime timeout

2017/03/15 16:19:07 kid6| http://177.105.192.70:9000/oliberalcbn.mp3
2017/03/15 16:19:58 kid1| Weak ETags are not allowed in If-Range: 
"8083-1488088507000" ? "8083-1488088507000"
2017/03/15 16:20:02 kid1| WARNING: Closing client connection due to 
lifetime timeout

2017/03/15 16:20:02 kid1|   http://45.43.200.26:27388/;
2017/03/15 16:20:51 kid6| WARNING: Closing client connection due to 
lifetime timeout

2017/03/15 16:20:51 kid6|   http://45.43.200.26:27388/;
2017/03/15 16:22:43 kid1| ctx: enter level  0: 
'http://storage.mais.uol.com.br/11848044.flv'

2017/03/15 16:22:43 kid1| WARNING: Malformed Content-Length[1]=-
2017/03/15 16:23:03 kid1| ctx: exit level  0
2017/03/15 16:23:03 kid1| ctx: enter level  0: 
'http://storage.mais.uol.com.br/11848044.flv'

2017/03/15 16:23:03 kid1| WARNING: Malformed Content-Length[1]=-
2017/03/15 16:28:45 kid1| ctx: exit level  0
2017/03/15 16:28:45 kid1| SECURITY ALERT: Missing hostname in URL 
'http://'. see access.log for details.
2017/03/15 16:33:26 kid2| SECURITY ALERT: Missing hostname in URL 
'http://'. see access.log for details.
2017/03/15 16:33:27 kid3| SECURITY ALERT: Missing hostname in URL 
'http://'. see access.log for details.
2017/03/15 16:37:44 kid4| ipcacheParse No Address records in response to 
'a.sitemeter.com'
2017/03/15 16:37:44 kid4| ipcacheParse No Address records in response to 
'a.sitemeter.com'
2017/03/15 16:43:17 kid3| WARNING: Closing client connection due to 
lifetime timeout
2017/03/15 16:43:17 kid3| 
http://www.superclipping.com.br/Uploads/video/15.03.2017/f17f8022f44148f4bc53d75db766b154.mp4
2017/03/15 16:46:44 kid4| local=10.1.10.9:3080 remote=10.1.3.242:54705 
FD 1958 flags=1: read/write failure: (113) No route to host
2017/03/15 16:46:54 kid1| local=10.1.10.9:3080 remote=10.1.3.242:49770 
FD 539 flags=1: read/write failure: (113) No route to host
2017/03/15 16:47:01 kid6| local=10.1.10.9:3080 remote=10.1.3.242:62607 
FD 2158 flags=1: read/write failure: (113) No route to host
2017/03/15 16:47:47 kid6| local=10.1.10.9:3080 remote=10.1.3.242:62605 
FD 2141 flags=1: read/write failure: (113) No route to host

2017/03/15 16:52:52 kid6| Starting new store_id helpers...
2017/03/15 16:52:52 kid6| helperOpenServers: Starting 1/50 'store-id.pl' 
processes

2017/03/15 16:52:52 kid6| Starting new store_id helpers...
2017/03/15 16:52:52 kid6| helperOpenServers: Starting 1/50 'store-id.pl' 
processes

2017/03/15 16:54:21 kid1| Weak ETags are not allowed in If-Range: "0" ? "0"
2017/03/15 16:55:31 kid1| Weak ETags are not allowed in If-Range: 
"54133-1387455809000" ? "54133-1387455809000"
2017/03/15 16:57:21 kid1| SECURITY ALERT: Missing hostname in URL 
'http://'. see access.log for details.
2017/03/15 17:03:15 kid4| Weak ETags are not allowed in If-Range: 
"5878dd6c-4363" ? "5878dd6c-4363"
2017/03/15 17:30:03 kid1| SECURITY ALERT: Missing hostname in URL 
'http://'. see access.log for details.
2017/03/15 17:57:31 kid1| WARNING: Closing client connection due to 
lifetime timeout
2017/03/15 17:57:31 kid1| 
http://liveupdate.gocyberlink.com/bigbang/ap/query_xml.jsp
2017/03/15 18:09:30 kid1| WARNING: Closing client connection due to 
lifetime timeout
2017/03/15 18:09:30 kid1| 
http://cloudav.downloads.pandasecurity.com/nano/pavsignano/nano_1/av_1/plug_01/plug_01_20170315_014235.rar

2017/03/15 18:20:49 kid1| urlParse: URL too large (10319 bytes)
2017/03/15 18:35:57 kid1| Weak ETags are not allowed in If-Range: "0" ? "0"
2017/03/15 18:36:50 kid4| WARNING: Closing client connection due to 
lifetime timeout
2017/03/15 18:36:50 kid4| 
http://au.ff.avast.com/android/avast-android-vps-v4-release.apk
2017/03/15 18:50:07 kid1| ipcacheParse No Address records in response to 
'www.portaldoartesanato.com.br'
2017/03/15 18:50:07 kid1| ipcacheParse No Address records in response to 
'www.portaldoartesanato.com.br'
2017/03/15 18:50:07 kid1| ipcacheParse No Address records in response to 
'www.portaldoartesanato.com.br'
2017/03/15 18:50:07 kid

Re: [squid-dev] Rock store stopped accessing discs

2017-03-15 Thread Alex Rousskov
On 03/15/2017 11:57 AM, Heiler Bemerguy wrote:

> Commented out all the max-swap-rate and swap-timeout. Guess what? No
> more errors on cache.log..

> It seems the code that "shapes" the rockstore I/Os is kinda buggy.

I have not checked the code, but I suspect that

* You see no timeout errors because you disabled timeouts.

* You see no I/O delay warnings because you disabled I/O delays.

That said, I would not be surprised if disabling max-swap-rate reduces
the probability of an I/O message getting stuck in the queue, because
without artificial message delays, the probability of a kid dying with a
pending message is naturally lower.

There are probably some bugs in the code even when no kids are dying,
but there is currently no information warranting an investigation, AFAICT.

Alex.



Re: [squid-dev] Rock store stopped accessing discs

2017-03-15 Thread Heiler Bemerguy


Hello list,

I've made a simple test here.

cache_dir rock /cache2 11 min-size=0 max-size=65536
#max-swap-rate=150 swap-timeout=360
cache_dir rock /cache3 11 min-size=65537 max-size=262144
#max-swap-rate=150 swap-timeout=380
cache_dir rock /cache4 11 min-size=262145
#max-swap-rate=150 swap-timeout=500

Commented out all the max-swap-rate and swap-timeout. Guess what? No 
more errors on cache.log..


before:
root@proxy:/var/log/squid# cat cache.log.2 |grep overflow |wc -l
53

now:
root@proxy:/var/log/squid# cat cache.log |grep overflow |wc -l
1

before:
root@proxy:/var/log/squid# cat cache.log.2 |grep "7.00s timeout" |wc -l
86

now:
root@proxy:/var/log/squid# cat cache.log |grep "7.00s timeout" |wc -l
0

It seems the code that "shapes" the rockstore I/Os is kinda buggy. 
Version 4.0.18


--
Atenciosamente / Best Regards,

Heiler Bemerguy
Network Manager - CINBESA
55 91 98151-4894/3184-1751


On 07/03/2017 20:26, Alex Rousskov wrote:

On 03/07/2017 01:08 PM, Heiler Bemerguy wrote:


Some log from right now...

Here is my analysis:


15:53:05.255| ipc/Queue.h findOldest: peeking from 7 to 6 at 1

Squid worker (kid6) is looking at the queue of disker (kid7) responses.
There is just one response in the queue.



IpcIoFile.cc canWait: cannot wait: 1136930911 oldest: ipcIo6.381049w7

Squid worker is trying to estimate whether it has enough time to queue
(and eventually perform) more disk I/O. The expected wait is
1'136'930'911 milliseconds or ~13 days. That is longer than the
configured cache_dir swap-timeout (a few hundred milliseconds) so Squid
refuses to queue this disk I/O request:


store_io.cc storeCreate: no swapdirs for e:=sw1p2RDV/0x206e17d0*4


The same story happens in your other log snippets AFAICT:


15:53:05.302| ipc/Queue.h findOldest: peeking from 8 to 6 at 1
IpcIoFile.cc canWait: cannot wait: 1136930958 oldest: ipcIo6.153009r8
store_io.cc storeCreate: no swapdirs for e:=msw1DV/0x3a4d5870*2

and


15:53:05.318| ipc/Queue.h findOldest: peeking from 7 to 6 at 1
IpcIoFile.cc canWait: cannot wait: 1136930974 oldest: ipcIo6.381049w7
store_io.cc storeCreate: no swapdirs for e:m381432=w1p2DV/0x3ace6740*4

and


15:53:05.793| ipc/Queue.h findOldest: peeking from 7 to 6 at 1
IpcIoFile.cc canWait: cannot wait: 1136931448 oldest: ipcIo6.381049w7
store_io.cc storeCreate: no swapdirs for e:=sw1V/0x206e17d0*1


Focusing on one disker (kid7), we can see that the oldest response does
not change: It is always ipcIo6.381049w7. This stuck response results in
gradual increment of the expected wait time with every check, matching
wall clock time increment:


15:53:05.255| cannot wait: 1136930911
15:53:05.318| cannot wait: 1136930974
15:53:05.793| cannot wait: 1136931448

These stuck disker responses probably explain why your disks do not
receive any traffic. It is potentially important that both disker
responses shown in your logs got stuck at approximately the same
absolute time ~13 days ago (around 2017-02-22, give or take a day;
subtract 1136930911 milliseconds from 15:53:05.255 in your Squid time
zone to know the "exact" time when those stuck requests were queued).

How can a disker response get stuck? Most likely, something unusual
happened ~13 days ago. This could be a Squid bug and/or a kid restart.

* Do all currently running Squid kid processes have about the same start
time? [1]

* Do you see ipcIo6.381049w7 or ipcIo6.153009r8 mentioned in any old
non-debugging messages/warnings?

[1]
http://stackoverflow.com/questions/5731234/how-to-get-the-start-time-of-a-long-running-linux-process


Thank you,

Alex.





Re: [squid-dev] Rock store stopped accessing discs

2017-03-14 Thread Alex Rousskov
On 03/14/2017 10:43 AM, Heiler Bemerguy wrote:
> On 07/03/2017 20:26, Alex Rousskov wrote:
>> How can a disker response get stuck? Most likely, something unusual
>> happened ~13 days ago. This could be a Squid bug and/or a kid restart.


> root@proxy:~# ps auxw |grep squid-
> proxy10225  0.0  0.0 13964224 21708 ?  SMar10   0:10 
> (squid-coord-10) -s
> proxy10226  0.1 12.5 14737524 8268056 ?SMar10   7:14 
> (squid-disk-9) -s
> proxy10227  0.0 11.6 14737524 7686564 ?SMar10   3:08 
> (squid-disk-8) -s
> proxy10228  0.1 14.9 14737540 9863652 ?SMar10   7:30 
> (squid-disk-7) -s
> proxy18348  3.5 10.3 17157560 6859904 ?SMar13  48:44 (squid-6) -s
> proxy18604  2.8  9.0 16903948 5977728 ?SMar13  37:28 (squid-4) -s
> proxy18637  1.7 10.8 16836872 7163392 ?RMar13  23:03 (squid-1) -s
> proxy20831 15.3 10.3 17226652 6838372 ?S08:50  39:51 (squid-2) -s
> proxy21189  5.3  2.8 16538064 1871788 ?S12:29   2:12 (squid-5) -s
> proxy21214  3.8  1.5 16448972 1012720 ?S12:43   1:03 (squid-3) -s

> Diskers aren't dying but workers are, a lot.. 

I suspect that worker deaths may cause SMP queues to get stuck, but I
have not validated that theory. We probably need to add more code to SMP
queues so that they can recover from untimely kid deaths.


> Another weird thing: lots of timeouts and overflows are happening on
> non-active hours.. From 0h to 7h we have like 1-2% of the clients we
> usually have from 8h to 17h.. (commercial time)

If a queue is stuck, you will see these errors and warnings as long as
there is some need for disk I/O. The volume is not important.


> Still don't know how /cache3 stopped and /cache4 is still active, even
> with all those warnings and errors.. :/

Do you expect Squid to function well in the presence of assertions and
to explain what went wrong while asserting? Unfortunately, we are very
far from that kind of robustness and self-diagnosis nirvana!

I have not studied your error messages in detail, but it is possible
that there are not-yet-stuck queues that feed cache4 while all cache3
queues are stuck. There is one SMP queue for each worker:disker pair.
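To put numbers on that: the ps output above shows six workers and three diskers, so (under the one-queue-per-pair statement) a quick sanity check of the fan-out:

```shell
# 6 workers (kid1-6) and 3 diskers (kid7-9), per the ps output in this
# thread; "one independent queue per worker:disker pair" is the statement
# above, so one stuck pair need not affect the other pairs.
workers=6; diskers=3
echo $((workers * diskers))   # independent worker:disker queue pairs
```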

Alex.



Re: [squid-dev] Rock store stopped accessing discs

2017-03-14 Thread Heiler Bemerguy


On 07/03/2017 20:26, Alex Rousskov wrote:

These stuck disker responses probably explain why your disks do not
receive any traffic. It is potentially important that both disker
responses shown in your logs got stuck at approximately the same
absolute time ~13 days ago (around 2017-02-22, give or take a day;
subtract 1136930911 milliseconds from 15:53:05.255 in your Squid time
zone to know the "exact" time when those stuck requests were queued).

How can a disker response get stuck? Most likely, something unusual
happened ~13 days ago. This could be a Squid bug and/or a kid restart.

* Do all currently running Squid kid processes have about the same start
time? [1]

* Do you see ipcIo6.381049w7 or ipcIo6.153009r8 mentioned in any old
non-debugging messages/warnings?


I searched the log files from those days, nothing unusual, "grep" 
returns nothing for ipcIo6.381049w7 or ipcIo6.153009r8.


On that day I couldn't verify whether the kids still had the same 
uptime; I had reformatted the /cache2, /cache3 and /cache4 partitions 
and started fresh with squid -z. But looking at the ps output right now, 
I feel I can answer that question:


root@proxy:~# ps auxw |grep squid-
proxy10225  0.0  0.0 13964224 21708 ?  SMar10   0:10 
(squid-coord-10) -s
proxy10226  0.1 12.5 14737524 8268056 ?SMar10   7:14 
(squid-disk-9) -s
proxy10227  0.0 11.6 14737524 7686564 ?SMar10   3:08 
(squid-disk-8) -s
proxy10228  0.1 14.9 14737540 9863652 ?SMar10   7:30 
(squid-disk-7) -s
proxy18348  3.5 10.3 17157560 6859904 ?SMar13  48:44 
(squid-6) -s
proxy18604  2.8  9.0 16903948 5977728 ?SMar13  37:28 
(squid-4) -s
proxy18637  1.7 10.8 16836872 7163392 ?RMar13  23:03 
(squid-1) -s
proxy20831 15.3 10.3 17226652 6838372 ?S08:50  39:51 
(squid-2) -s
proxy21189  5.3  2.8 16538064 1871788 ?S12:29   2:12 
(squid-5) -s
proxy21214  3.8  1.5 16448972 1012720 ?S12:43   1:03 
(squid-3) -s


Diskers aren't dying, but workers are, a lot.. with that "assertion 
failed: client_side_reply.cc:1167: http->storeEntry()->objectLen() >= 
headers_sz" thing.


Viewing df and iostat, it seems /cache3 isn't being accessed anymore 
right now. (I think it is disk-8 above; look at the CPU time usage..)


Another weird thing: lots of timeouts and overflows are happening during 
non-active hours.. From 0h to 7h we have like 1-2% of the clients we 
usually have from 8h to 17h.. (business hours)


2017/03/14 00:26:50 kid3| WARNING: abandoning 23 /cache4/rock I/Os after 
at least 7.00s timeout
2017/03/14 00:26:53 kid1| WARNING: abandoning 1 /cache4/rock I/Os after 
at least 7.00s timeout
2017/03/14 02:14:48 kid5| ERROR: worker I/O push queue for /cache4/rock 
overflow: ipcIo5.68259w9
2017/03/14 06:33:43 kid3| ERROR: worker I/O push queue for /cache4/rock 
overflow: ipcIo3.55919w9
2017/03/14 06:57:53 kid3| ERROR: worker I/O push queue for /cache4/rock 
overflow: ipcIo3.58130w9


This cache4 partition is where huge files would be stored:
maximum_object_size 4 GB
cache_dir rock /cache2 11 min-size=0 max-size=65536 
max-swap-rate=150 swap-timeout=360
cache_dir rock /cache3 11 min-size=65537 max-size=262144 
max-swap-rate=150 swap-timeout=380
cache_dir rock /cache4 11 min-size=262145 max-swap-rate=150 
swap-timeout=500


Still don't know how /cache3 stopped and /cache4 is still active, even 
with all those warnings and errors.. :/


--
Atenciosamente / Best Regards,

Heiler Bemerguy
Network Manager - CINBESA
55 91 98151-4894/3184-1751



Re: [squid-dev] Rock store stopped accessing discs

2017-03-07 Thread Alex Rousskov
On 03/07/2017 01:08 PM, Heiler Bemerguy wrote:

> Some log from right now...

Here is my analysis:

> 15:53:05.255| ipc/Queue.h findOldest: peeking from 7 to 6 at 1

Squid worker (kid6) is looking at the queue of disker (kid7) responses.
There is just one response in the queue.


> IpcIoFile.cc canWait: cannot wait: 1136930911 oldest: ipcIo6.381049w7

Squid worker is trying to estimate whether it has enough time to queue
(and eventually perform) more disk I/O. The expected wait is
1'136'930'911 milliseconds or ~13 days. That is longer than the
configured cache_dir swap-timeout (a few hundred milliseconds) so Squid
refuses to queue this disk I/O request:

> store_io.cc storeCreate: no swapdirs for e:=sw1p2RDV/0x206e17d0*4
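The refusal quoted above reduces to a timeout comparison; a minimal sketch with the values from this thread (the simplification is mine, not the actual IpcIoFile.cc code):

```shell
# Squid estimates how long a new I/O would wait behind the oldest queued
# response and compares that with the cache_dir swap-timeout (both in ms).
# Values are from the canWait log line and Heiler's /cache4 config.
swap_timeout_ms=500
expected_wait_ms=1136930911
if [ "$expected_wait_ms" -le "$swap_timeout_ms" ]; then
  echo "can wait: queue the I/O"
else
  echo "cannot wait: storeCreate gets no swapdirs"
fi
```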


The same story happens in your other log snippets AFAICT:

> 15:53:05.302| ipc/Queue.h findOldest: peeking from 8 to 6 at 1
> IpcIoFile.cc canWait: cannot wait: 1136930958 oldest: ipcIo6.153009r8
> store_io.cc storeCreate: no swapdirs for e:=msw1DV/0x3a4d5870*2

and

> 15:53:05.318| ipc/Queue.h findOldest: peeking from 7 to 6 at 1
> IpcIoFile.cc canWait: cannot wait: 1136930974 oldest: ipcIo6.381049w7
> store_io.cc storeCreate: no swapdirs for e:m381432=w1p2DV/0x3ace6740*4

and

> 15:53:05.793| ipc/Queue.h findOldest: peeking from 7 to 6 at 1
> IpcIoFile.cc canWait: cannot wait: 1136931448 oldest: ipcIo6.381049w7
> store_io.cc storeCreate: no swapdirs for e:=sw1V/0x206e17d0*1


Focusing on one disker (kid7), we can see that the oldest response does
not change: It is always ipcIo6.381049w7. This stuck response results in
gradual increment of the expected wait time with every check, matching
wall clock time increment:

> 15:53:05.255| cannot wait: 1136930911
> 15:53:05.318| cannot wait: 1136930974
> 15:53:05.793| cannot wait: 1136931448

These stuck disker responses probably explain why your disks do not
receive any traffic. It is potentially important that both disker
responses shown in your logs got stuck at approximately the same
absolute time ~13 days ago (around 2017-02-22, give or take a day;
subtract 1136930911 milliseconds from 15:53:05.255 in your Squid time
zone to know the "exact" time when those stuck requests were queued).
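The subtraction described above can be done directly, e.g. with GNU date (assumed available; milliseconds are dropped for simplicity):

```shell
# Subtract the reported wait from the log timestamp to recover roughly
# when the stuck response was queued. 1136930911 ms ~= 1136930 s.
wait_s=$(( 1136930911 / 1000 ))
date -u -d "2017-03-07 15:53:05 UTC - ${wait_s} seconds" '+%F %T'
# falls on 2017-02-22, matching the ~13 days estimate
```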

How can a disker response get stuck? Most likely, something unusual
happened ~13 days ago. This could be a Squid bug and/or a kid restart.

* Do all currently running Squid kid processes have about the same start
time? [1]

* Do you see ipcIo6.381049w7 or ipcIo6.153009r8 mentioned in any old
non-debugging messages/warnings?

[1]
http://stackoverflow.com/questions/5731234/how-to-get-the-start-time-of-a-long-running-linux-process
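For the first question, a procps-style `ps` can print each kid's full start time directly (the helper name here is mine):

```shell
# Print the start time of a process; for Squid, substitute the kid PIDs
# from `ps auxw | grep squid-`. A restarted kid will stand out with a
# much later start time than its siblings.
start_of() { ps -o lstart= -p "$1"; }

start_of "$$"   # demo: start time of the current shell
```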


Thank you,

Alex.



Re: [squid-dev] Rock store stopped accessing discs

2017-03-07 Thread Heiler Bemerguy



On 07/03/2017 15:22, Alex Rousskov wrote:



2017/02/18 13:48:19 kid3| ERROR: worker I/O push queue for /cache4/rock 
overflow: ipcIo3.9082w9
2017/02/18 13:48:42 kid4| ERROR: worker I/O push queue for /cache4/rock 
overflow: ipcIo4.3371w9
2017/02/18 14:06:01 kid9| WARNING: /cache4/rock delays I/O requests for 9.97 
seconds to obey 200/sec rate limit
2017/02/18 14:06:34 kid9| WARNING: /cache4/rock delays I/O requests for 21.82 
seconds to obey 200/sec rate limit
2017/02/18 14:06:42 kid4| WARNING: abandoning 1 /cache4/rock I/Os after at 
least 7.00s timeout
2017/02/18 14:06:47 kid3| WARNING: abandoning 1 /cache4/rock I/Os after at 
least 7.00s timeout
2017/02/18 14:06:48 kid1| WARNING: abandoning 1 /cache4/rock I/Os after at 
least 7.00s timeout
2017/02/18 14:06:49 kid4| WARNING: abandoning 4 /cache4/rock I/Os after at 
least 7.00s timeout
2017/02/18 14:06:54 kid3| WARNING: abandoning 2 /cache4/rock I/Os after at 
least 7.00s timeout
2017/02/18 14:07:55 kid9| WARNING: /cache4/rock delays I/O requests for 68.64 
seconds to obey 200/sec rate limit
2017/02/18 14:08:03 kid5| WARNING: abandoning 511 /cache4/rock I/Os after at 
least 7.00s timeout
2017/02/18 14:08:47 kid2| WARNING: abandoning 20 /cache4/rock I/Os after at 
least 7.00s timeout
2017/02/18 14:08:51 kid3| WARNING: abandoning 41 /cache4/rock I/Os after at 
least 7.00s timeout
2017/02/18 14:08:54 kid1| WARNING: abandoning 41 /cache4/rock I/Os after at 
least 7.00s timeout
2017/02/18 15:26:35 kid5| ERROR: worker I/O push queue for /cache4/rock 
overflow: ipcIo5.31404w9
2017/02/18 15:29:00 kid9| WARNING: /cache4/rock delays I/O requests for 9.92 
seconds to obey 200/sec rate limit
2017/02/18 15:29:13 kid9| WARNING: /cache4/rock delays I/O requests for 8.23 
seconds to obey 200/sec rate limit
2017/02/18 15:29:45 kid9| WARNING: /cache4/rock delays I/O requests for 8.86 
seconds to obey 200/sec rate limit
2017/02/18 15:30:06 kid9| WARNING: /cache4/rock delays I/O requests for 7.34 
seconds to obey 200/sec rate limit
2017/02/18 15:30:27 kid9| WARNING: /cache4/rock delays I/O requests for 7.65 
seconds to obey 200/sec rate limit
2017/02/18 15:30:48 kid9| WARNING: /cache4/rock delays I/O requests for 8.97 
seconds to obey 200/sec rate limit
2017/02/18 15:31:09 kid9| WARNING: /cache4/rock delays I/O requests for 8.52 
seconds to obey 200/sec rate limit
2017/02/18 15:31:22 kid9| WARNING: /cache4/rock delays I/O requests for 10.61 
seconds to obey 200/sec rate limit
2017/02/18 17:19:40 kid9| WARNING: /cache4/rock delays I/O requests for 10.22 
seconds to obey 200/sec rate limit

There is a Squid bug and/or your cache disks could not keep up with the
I/O load. Please note that I/O load during initial cache index rebuild
is higher.

There are several ways to proceed IMO:

1. Figure out why your Squid is not using the disk cache _now_. This
will require enabling debugging, at least for a few seconds, and then
analyzing cache.log. I recommend enabling debugging in one worker kid
only (i.e., sending its process the right signal instead of running
"squid -k debug"). See Squid wiki for debugging details. Do not restart
or reconfigure Squid until you collect those debugging logs! Please note
that the logs may contain private user info.


Some log from right now...

2017/03/07 15:53:05.255 kid6| 19,4| MemObject.cc(431) isContiguous: 
MemObject::isContiguous: Returning true
2017/03/07 15:53:05.255 kid6| 73,3| HttpRequest.cc(658) storeId: sent 
back effectiveRequestUrl: http://www.agenciabelem.com.br/signalr/hubs
2017/03/07 15:53:05.255 kid6| 20,3| store_swapmeta.cc(52) 
storeSwapMetaBuild: storeSwapMetaBuild URL: 
http://www.agenciabelem.com.br/signalr/hubs
2017/03/07 15:53:05.255 kid6| 54,2| ../../../src/ipc/Queue.h(490) 
findOldest: peeking from 7 to 6 at 1
2017/03/07 15:53:05.255 kid6| 47,2| IpcIoFile.cc(415) canWait: cannot 
wait: 1136930911 oldest: ipcIo6.381049w7
2017/03/07 15:53:05.255 kid6| 20,2| store_io.cc(38) storeCreate: 
storeCreate: no swapdirs for e:=sw1p2RDV/0x206e17d0*4
2017/03/07 15:53:05.255 kid6| 90,3| store_client.cc(729) invokeHandlers: 
InvokeHandlers: 69538C9E111C077214004DE73D0F257B
2017/03/07 15:53:05.255 kid6| 90,3| store_client.cc(735) invokeHandlers: 
StoreEntry::InvokeHandlers: checking client #0
2017/03/07 15:53:05.255 kid6| 90,3| store_client.cc(755) 
storePendingNClients: storePendingNClients: returning 1
2017/03/07 15:53:05.255 kid6| 20,3| store.cc(494) unlock: Client 
unlocking key 69538C9E111C077214004DE73D0F257B e:=sp2RDV/0x206e17d0*4
2017/03/07 15:53:05.255 kid6| 17,3| FwdState.cc(274) ~FwdState: FwdState 
destructor start

--
2017/03/07 15:53:05.302 kid6| 90,3| store_client.cc(661) 
storeUnregister: storeUnregister: called for 
'C8303E098834088226809BF615D12C51'
2017/03/07 15:53:05.302 kid6| 19,4| MemObject.cc(431) isContiguous: 
MemObject::isContiguous: Returning true
2017/03/07 15:53:05.302 kid6| 20,3| store_swapmeta.cc(52) 
storeSwapMetaBuild: storeSwapMetaBuild URL: 
http://www.ormnews.com.br/library/bootstrap/css/bootstrap.m

Re: [squid-dev] Rock store stopped accessing discs

2017-03-07 Thread Alex Rousskov
On 03/07/2017 10:58 AM, Heiler Bemerguy wrote:

> I used iostat to check if "right now" the hds were being accessed. A
> lot of minutes passed and all writes/reads remained Zero. 

Understood.


> With a 80mbit/s traffic going on, how could nothing be written nor read from
> disc? 

I can come up with several possible explanations, but more information
is needed to minimize guessing. See below for some suggestions.


> How can I check IPC RAM? I've never tweaked it.

Search the Squid wiki for SMP troubleshooting hints and/or ask your local
sysadmin. If you cannot find anything relevant, ask here again and,
hopefully, somebody will be able to guide you.


> I've just noticed that squid was running since feb/18 (Start Time:
> Sat, 18 Feb 2017 15:38:44 GMT) and since the beginning there were a lot
> of warnings on cache.log.. (The logs I pasted on the earlier email was
> from today's usage only..)
> I think since then, it stopped using the cache stores..

I assume "then" means "today's usage" not "feb/18".

> 2017/02/18 13:48:19 kid3| ERROR: worker I/O push queue for /cache4/rock 
> overflow: ipcIo3.9082w9
> 2017/02/18 13:48:42 kid4| ERROR: worker I/O push queue for /cache4/rock 
> overflow: ipcIo4.3371w9
> 2017/02/18 14:06:01 kid9| WARNING: /cache4/rock delays I/O requests for 9.97 
> seconds to obey 200/sec rate limit
> 2017/02/18 14:06:34 kid9| WARNING: /cache4/rock delays I/O requests for 21.82 
> seconds to obey 200/sec rate limit
> 2017/02/18 14:06:42 kid4| WARNING: abandoning 1 /cache4/rock I/Os after at 
> least 7.00s timeout
> 2017/02/18 14:06:47 kid3| WARNING: abandoning 1 /cache4/rock I/Os after at 
> least 7.00s timeout
> 2017/02/18 14:06:48 kid1| WARNING: abandoning 1 /cache4/rock I/Os after at 
> least 7.00s timeout
> 2017/02/18 14:06:49 kid4| WARNING: abandoning 4 /cache4/rock I/Os after at 
> least 7.00s timeout
> 2017/02/18 14:06:54 kid3| WARNING: abandoning 2 /cache4/rock I/Os after at 
> least 7.00s timeout
> 2017/02/18 14:07:55 kid9| WARNING: /cache4/rock delays I/O requests for 68.64 
> seconds to obey 200/sec rate limit
> 2017/02/18 14:08:03 kid5| WARNING: abandoning 511 /cache4/rock I/Os after at 
> least 7.00s timeout
> 2017/02/18 14:08:47 kid2| WARNING: abandoning 20 /cache4/rock I/Os after at 
> least 7.00s timeout
> 2017/02/18 14:08:51 kid3| WARNING: abandoning 41 /cache4/rock I/Os after at 
> least 7.00s timeout
> 2017/02/18 14:08:54 kid1| WARNING: abandoning 41 /cache4/rock I/Os after at 
> least 7.00s timeout
> 2017/02/18 15:26:35 kid5| ERROR: worker I/O push queue for /cache4/rock 
> overflow: ipcIo5.31404w9
> 2017/02/18 15:29:00 kid9| WARNING: /cache4/rock delays I/O requests for 9.92 
> seconds to obey 200/sec rate limit
> 2017/02/18 15:29:13 kid9| WARNING: /cache4/rock delays I/O requests for 8.23 
> seconds to obey 200/sec rate limit
> 2017/02/18 15:29:45 kid9| WARNING: /cache4/rock delays I/O requests for 8.86 
> seconds to obey 200/sec rate limit
> 2017/02/18 15:30:06 kid9| WARNING: /cache4/rock delays I/O requests for 7.34 
> seconds to obey 200/sec rate limit
> 2017/02/18 15:30:27 kid9| WARNING: /cache4/rock delays I/O requests for 7.65 
> seconds to obey 200/sec rate limit
> 2017/02/18 15:30:48 kid9| WARNING: /cache4/rock delays I/O requests for 8.97 
> seconds to obey 200/sec rate limit
> 2017/02/18 15:31:09 kid9| WARNING: /cache4/rock delays I/O requests for 8.52 
> seconds to obey 200/sec rate limit
> 2017/02/18 15:31:22 kid9| WARNING: /cache4/rock delays I/O requests for 10.61 
> seconds to obey 200/sec rate limit
> 2017/02/18 17:19:40 kid9| WARNING: /cache4/rock delays I/O requests for 10.22 
> seconds to obey 200/sec rate limit

There is a Squid bug and/or your cache disks could not keep up with the
I/O load. Please note that I/O load during initial cache index rebuild
is higher.

There are several ways to proceed IMO:

1. Figure out why your Squid is not using the disk cache _now_. This
will require enabling debugging, at least for a few seconds, and then
analyzing cache.log. I recommend enabling debugging in one worker kid
only (i.e., sending its process the right signal instead of running
"squid -k debug"). See Squid wiki for debugging details. Do not restart
or reconfigure Squid until you collect those debugging logs! Please note
that the logs may contain private user info.

2. Reduce max-swap-rate from 200 to, say, 20. If your disks cannot
keep up _and_ there is a Squid bug that screws something up when your
disks cannot keep up, then this blind configuration change may avoid
triggering that bug.
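In squid.conf terms, option 2 is a one-value change to the cache_dir lines from Heiler's configuration, e.g. for /cache4:

```
# squid.conf sketch: same cache_dir, with max-swap-rate lowered 200 -> 20
cache_dir rock /cache4 11 min-size=262145 max-swap-rate=20 swap-timeout=500
```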

3. Collect enough iostat 5-second outputs (or equivalent) to correlate
system performance with cache.log messages. I would also collect other
system activity during those hours. The "atop" tool may be useful for
collecting everything in one place. You will probably want to restart
Squid for a clean experiment/collection.
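For option 3, one simple way (the `stamp` helper is mine, not a Squid or sysstat tool) to make iostat samples line up with cache.log timestamps:

```shell
# Prefix every line of a stream with a wall-clock timestamp so iostat
# samples can later be matched against cache.log entries.
stamp() { while IFS= read -r line; do
            printf '%s %s\n' "$(date '+%Y/%m/%d %H:%M:%S')" "$line"
          done; }

# Real use (not run here):
#   iostat -x 5 | stamp >> /var/log/squid/iostat-stamped.log
printf 'sda 0.00 0.25\n' | stamp   # demo
```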


HTH,

Alex.


Re: [squid-dev] Rock store stopped accessing discs

2017-03-07 Thread Heiler Bemerguy


On 07/03/2017 13:14, Alex Rousskov wrote:

On 03/07/2017 08:40 AM, Heiler Bemerguy wrote:

I'm using squid 4.0.18

And noticed something: (iostat -x 5)

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda   0,00 0,000,000,25 0,00 28,00 224,00 0,00  
  8,000,008,00   8,00   0,20
sdc   0,00 0,000,000,00 0,00 0,000,00 0,00  
  0,000,000,00   0,00   0,00
sdb   0,00 0,000,000,00 0,00 0,000,00 0,00  
  0,000,000,00   0,00   0,00
sdd   0,00 0,000,000,00 0,00 0,000,00 0,00  
  0,000,000,00   0,00   0,00

No hds are being accessed, only the main (SDA) one (which logs are
saved). Btw squid is sending 80mbit/s to the network, as iftop told me.

cache.log:

2017/03/07 05:23:59 kid5| ERROR: worker I/O push queue for /cache4/rock 
overflow: ipcIo5.206991w9
2017/03/07 05:24:10 kid5| WARNING: communication with /cache4/rock may be too 
slow or disrupted for about 7.00s; rescued 304 out of 304 I/Os
2017/03/07 08:00:30 kid5| WARNING: abandoning 1 /cache2/rock I/Os after at 
least 7.00s timeout
2017/03/07 10:50:45 kid5| WARNING: abandoning 1 /cache2/rock I/Os after at 
least 7.00s timeout

I presume your iostat output covers 5 seconds. The cache.log output
spans 5 hours. Was there no cache disk traffic during those 5 hours? Do
those iostat 5 seconds match the timestamp of any single cache.log WARNING?


No. I used iostat to check whether the HDs were being accessed "right
now". Many minutes passed and all writes/reads remained zero. With
80 Mbit/s of traffic going on, how could nothing be written to or read
from disk? It's like Squid stopped using the cache_dirs for some reason,
so I grepped cache.log for the word "rock", and that is what it output:



squid.conf:

cache_dir rock /cache2 11 min-size=0 max-size=65536 max-swap-rate=200 swap-timeout=360
cache_dir rock /cache3 11 min-size=65537 max-size=262144 max-swap-rate=200 swap-timeout=380
cache_dir rock /cache4 11 min-size=262145 max-swap-rate=200 swap-timeout=500

Should I raise any values? tweak something?

Yes, but it is not yet clear what. If you suspect that your disks cannot
handle the load, decrease max-swap-rate. However, there is currently no
firm evidence that your disks cannot handle the load. It could be
something else like insufficient IPC RAM or Squid bugs.
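For intuition about why decreasing max-swap-rate can help, here is a rough back-of-the-envelope sketch. This is not Squid's actual algorithm, just the arithmetic behind the rate cap and timeout options: a backlog that the capped rate cannot drain within swap-timeout makes the disk effectively unusable for new I/Os.

```python
# Rough intuition sketch (not Squid's real code): with disk I/Os capped
# at max-swap-rate per second, a queued backlog implies an expected wait;
# once that wait exceeds swap-timeout (milliseconds), attempting the I/O
# is pointless, and the disk effectively stops being used.
def expected_wait_ms(queued_ios, max_swap_rate):
    """Expected queue drain time, in ms, at the configured rate cap."""
    return queued_ios / max_swap_rate * 1000.0

def worth_attempting(queued_ios, max_swap_rate, swap_timeout_ms):
    return expected_wait_ms(queued_ios, max_swap_rate) <= swap_timeout_ms

# With max-swap-rate=200 and swap-timeout=360 (values from the config
# above), a backlog of just 73 queued I/Os already exceeds the timeout.
```

Under this arithmetic, a lower max-swap-rate shrinks the backlog a healthy disk ever accumulates, at the cost of skipping more swap-outs during bursts.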


How can I check IPC RAM? I've never tweaked it.
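As one starting point (an assumption about a typical Linux setup, not an authoritative answer): SMP Squid keeps its worker/disker IPC queues in shared memory, which on Linux is usually backed by the tmpfs at /dev/shm, so checking that it is not exhausted is a cheap thing to rule out.

```python
# Sketch: report free space on the tmpfs that typically backs shared
# memory on Linux (/dev/shm). SMP Squid's IPC queues live in shared
# memory, so an exhausted tmpfs is one thing worth ruling out.
import os

def free_bytes(path="/dev/shm"):
    """Free bytes available to unprivileged users on the given mount."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize
```

Listing /dev/shm itself (look for squid-* segments) shows how much of that space Squid's own IPC structures occupy.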


Any Squid kid crashes? How many Squid workers do you use?

Can you collect enough iostat 5-second outputs to correlate with
long-term cache.log messages? I would also collect other system activity
during those hours. The "atop" tool may be useful for collecting
everything in one place.

Couldn't find any crashes in the log file. 6 workers and 3 cache_dirs.
I've just noticed that Squid has been running since Feb 18 (Start Time:
Sat, 18 Feb 2017 15:38:44 GMT), and since the beginning there have been
a lot of warnings in cache.log. (The logs I pasted in the earlier email
were from today's usage only.)

I think since then, it stopped using the cache stores..

2017/02/18 13:48:19 kid3| ERROR: worker I/O push queue for /cache4/rock overflow: ipcIo3.9082w9
2017/02/18 13:48:42 kid4| ERROR: worker I/O push queue for /cache4/rock overflow: ipcIo4.3371w9
2017/02/18 14:06:01 kid9| WARNING: /cache4/rock delays I/O requests for 9.97 seconds to obey 200/sec rate limit
2017/02/18 14:06:34 kid9| WARNING: /cache4/rock delays I/O requests for 21.82 seconds to obey 200/sec rate limit
2017/02/18 14:06:42 kid4| WARNING: abandoning 1 /cache4/rock I/Os after at least 7.00s timeout
2017/02/18 14:06:47 kid3| WARNING: abandoning 1 /cache4/rock I/Os after at least 7.00s timeout
2017/02/18 14:06:48 kid1| WARNING: abandoning 1 /cache4/rock I/Os after at least 7.00s timeout
2017/02/18 14:06:49 kid4| WARNING: abandoning 4 /cache4/rock I/Os after at least 7.00s timeout
2017/02/18 14:06:54 kid3| WARNING: abandoning 2 /cache4/rock I/Os after at least 7.00s timeout
2017/02/18 14:07:55 kid9| WARNING: /cache4/rock delays I/O requests for 68.64 seconds to obey 200/sec rate limit
2017/02/18 14:08:03 kid5| WARNING: abandoning 511 /cache4/rock I/Os after at least 7.00s timeout
2017/02/18 14:08:47 kid2| WARNING: abandoning 20 /cache4/rock I/Os after at least 7.00s timeout
2017/02/18 14:08:51 kid3| WARNING: abandoning 41 /cache4/rock I/Os after at least 7.00s timeout
2017/02/18 14:08:54 kid1| WARNING: abandoning 41 /cache4/rock I/Os after at least 7.00s timeout
2017/02/18 15:26:35 kid5| ERROR: worker I/O push queue for /cache4/rock overflow: ipcIo5.31404w9
2017/02/18 15:29:00 kid9| WARNING: /cache4/rock delays I/O requests for 9.92 seconds to obey 200/sec rate limit
2017/02/18 15:29:13 kid9| WARNING: /cache4/rock delays I/O requests for 8.23 seconds to obey 200/sec rate limit
2017/02/18 15:29:45 kid9| WARNING: /cache4/

Re: [squid-dev] Rock store stopped accessing discs

2017-03-07 Thread Alex Rousskov
On 03/07/2017 08:40 AM, Heiler Bemerguy wrote:
> I'm using squid 4.0.18
> 
> And noticed something: (iostat -x 5)
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0,00     0,00    0,00    0,25     0,00    28,00   224,00     0,00    8,00    0,00    8,00   8,00   0,20
> sdc               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00    0,00    0,00   0,00   0,00
> sdb               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00    0,00    0,00   0,00   0,00
> sdd               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00    0,00    0,00   0,00   0,00
> 
> No hds are being accessed, only the main (SDA) one (which logs are
> saved). Btw squid is sending 80mbit/s to the network, as iftop told me.
> 
> cache.log:
> 
> 2017/03/07 05:23:59 kid5| ERROR: worker I/O push queue for /cache4/rock overflow: ipcIo5.206991w9
> 2017/03/07 05:24:10 kid5| WARNING: communication with /cache4/rock may be too slow or disrupted for about 7.00s; rescued 304 out of 304 I/Os
> 2017/03/07 08:00:30 kid5| WARNING: abandoning 1 /cache2/rock I/Os after at least 7.00s timeout
> 2017/03/07 10:50:45 kid5| WARNING: abandoning 1 /cache2/rock I/Os after at least 7.00s timeout

I presume your iostat output covers 5 seconds. The cache.log output
spans 5 hours. Was there no cache disk traffic during those 5 hours? Do
those iostat 5 seconds match the timestamp of any single cache.log WARNING?


> squid.conf:
> 
> cache_dir rock /cache2 11 min-size=0 max-size=65536 max-swap-rate=200 swap-timeout=360
> cache_dir rock /cache3 11 min-size=65537 max-size=262144 max-swap-rate=200 swap-timeout=380
> cache_dir rock /cache4 11 min-size=262145 max-swap-rate=200 swap-timeout=500
> 
> Should I raise any values? tweak something?

Yes, but it is not yet clear what. If you suspect that your disks cannot
handle the load, decrease max-swap-rate. However, there is currently no
firm evidence that your disks cannot handle the load. It could be
something else like insufficient IPC RAM or Squid bugs.

Any Squid kid crashes? How many Squid workers do you use?

Can you collect enough iostat 5-second outputs to correlate with
long-term cache.log messages? I would also collect other system activity
during those hours. The "atop" tool may be useful for collecting
everything in one place.

Alex.

___
squid-dev mailing list
squid-dev@lists.squid-cache.org
http://lists.squid-cache.org/listinfo/squid-dev