On 2014-02-21 06:10, Simon Beale wrote:
I've got a problem at the moment with our general Squid proxies where
requests occasionally take far longer than they should (i.e. 5+ seconds,
or a timeout, instead of milliseconds).

This is most common on our proxies doing 100 reqs/sec, but it happens
overnight too when they're running at 10 reqs/sec. I've seen it with both
v3.4.2 and with a box I've downgraded back to v3.1.10. For v3.4.2, it
happens in both multiple-worker and single-worker modes.



What sort of CPU load do you have at ~100 req/sec?
Is that at or near your local installation's req/sec capacity?

NP:
* slow-down at peak capacity is normal, as the proxy is busy servicing other traffic.

* slow-down at only a few req/sec is normal, as Squid spends a lot of its time in artificial I/O wait delays to avoid reading/writing individual bytes off the network. Nothing is worse for the network than ~71 bytes of packet overhead for every 2 bytes of data transferred.

* slow-down randomly all the time could be network congestion, window scaling, ECN or MTU related; even ICMP related (ICMP is *not* optional, though many admins block it).

 * then there are bugs.
- 3.1 had a few IPv6 bugs (some major) which caused TCP retry delays in certain circumstances. Since you are seeing it only randomly, I would suspect remote network(s) somewhere with those issues being a transit hop occasionally, though this is unlikely given that 3.4 still shows it. (A small connect-timing probe is sketched after this list.)

- There is a fix in the 3.4.3 release regarding connection IP failover that may help if that is part of the issue (or it may not).
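
One way to test the IPv6 retry theory is to time a TCP connect to every
address a slow destination resolves to. Here is a minimal sketch; the
hostname is a placeholder for one of your slow destinations, and the 5s
timeout is an assumption tuned to the delays you described:

  #!/usr/bin/env python
  # Time TCP connects to each A/AAAA address of a destination, so an
  # unreachable IPv6 address eating the full timeout stands out.
  import socket
  import time

  HOST, PORT = "www.example.com", 80   # placeholder destination

  for family, _, _, _, sockaddr in socket.getaddrinfo(HOST, PORT, 0,
                                                      socket.SOCK_STREAM):
      s = socket.socket(family, socket.SOCK_STREAM)
      s.settimeout(5.0)                # near-5s results suggest retry delays
      start = time.time()
      try:
          s.connect(sockaddr)
          result = "connected"
      except socket.error as e:
          result = "failed: %s" % e
      finally:
          s.close()
      print("%-40s %s %.3fs %s" % (sockaddr[0],
            "IPv6" if family == socket.AF_INET6 else "IPv4",
            time.time() - start, result))

If the IPv6 addresses consistently eat the full timeout while IPv4
connects in milliseconds, the delay is likely the proxy waiting out an
unreachable address before failing over, which is the behaviour the
3.4.3 fix touches.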


Sadly, the problem is not reproducible on demand, but I've got a cronjob
running on localhost on these boxes testing access times to various URLs
covering: HTTPS, non-HTTPS static content, requests by IP rather than
hostname over both HTTP and HTTPS, and a URL on the same VLAN as the
proxies. All of these test cases hit it occasionally, but not
repeatedly/reliably.

Some ideas:
 * DNS lookup delays?
 * Random TCP connection setup delays?
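
To separate those two, the cronjob probe could split each sample into
DNS, TCP-connect and first-byte phases. A minimal sketch, assuming the
default Squid listener on 127.0.0.1:3128 and a placeholder test URL;
note that Squid does the real lookup for proxied requests, so the DNS
timing here only approximates it, assuming the box and Squid share
resolvers:

  #!/usr/bin/env python
  # Split one proxied request into DNS, connect, and first-byte phases
  # so a slow sample points at the phase that stalled.
  import socket
  import time

  PROXY_HOST, PROXY_PORT = "127.0.0.1", 3128   # assumed Squid listener
  URL_HOST, URL = "www.example.com", "http://www.example.com/"  # placeholder

  t0 = time.time()
  socket.getaddrinfo(URL_HOST, 80)             # DNS as this box sees it
  t1 = time.time()

  s = socket.create_connection((PROXY_HOST, PROXY_PORT), timeout=30)
  t2 = time.time()                             # TCP setup to the proxy

  req = ("GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n"
         % (URL, URL_HOST))
  s.sendall(req.encode("ascii"))
  s.recv(1)                                    # block until first response byte
  t3 = time.time()
  s.close()

  print("dns=%.3fs connect=%.3fs first-byte=%.3fs"
        % (t1 - t0, t2 - t1, t3 - t2))

Logging those three numbers per sample should show whether the 5s+
outliers are lookup stalls, connection setup stalls, or the proxy/origin
being slow to respond.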


Different boxes are running either Trend's IWSVA as a cache_peer for its
antivirus, or C-ICAP/clamd as an ICAP service. Both setups show the problem
(as does the case where I disabled the antivirus).

 * object size related? i.e. scanning time in the AV.
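
One quick way to check that from data you already have is to bucket the
access.log entries by response size and compare elapsed times. A minimal
sketch, assuming the default native log format (second field = elapsed
ms, fifth field = bytes) and the default log path; adjust the indexes if
your logformat is customised:

  #!/usr/bin/env python
  # Bucket access.log entries by response size to see whether slow
  # requests correlate with object size (e.g. AV scanning time).
  import math
  from collections import defaultdict

  LOG = "/var/log/squid/access.log"   # assumed default path

  buckets = defaultdict(list)          # log10(size) -> elapsed times (ms)
  with open(LOG) as f:
      for line in f:
          fields = line.split()
          try:
              elapsed, size = int(fields[1]), int(fields[4])
          except (ValueError, IndexError):
              continue
          buckets[int(math.log10(size)) if size > 0 else 0].append(elapsed)

  for b in sorted(buckets):
      times = buckets[b]
      print("size ~10^%d bytes: n=%-6d mean=%6.0fms max=%6dms"
            % (b, len(times), sum(times) / float(len(times)), max(times)))

If the slow outliers cluster in the large-object buckets, scanning time
in the AV is the likely culprit; if they are spread evenly across sizes,
it points back at the network-level causes above.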

The servers are all running CentOS 6.4 on HP Gen8 blades with 48 GB RAM.

Has anyone seen anything like this, or got any suggestions as to what
might be causing it that I can investigate further?

Simon

Lots of people see it for all sorts of reasons.


Amos
