Hi all,

I've been having some problems with squid in reverse proxy mode. I'll preface this by saying that I realize I'm pushing squid to its limits, and I've been following the recent discussions on performance. I'll try to be as clear as possible about the crux of my problem, but please bear with me, as the explanation is rather long.

Hardware:
Dual-core Opteron 270, 1800MHz
8GB RAM
2x73GB 10k RPM SCSI drives (no raid)

Software:
RedHat Enterprise Linux 4 (64-bit) (kernel 2.6.9-22.0.2.ELsmp) (some RedHat frankenkernel)
squid-2.5-patch12
added the epoll patch (I verified that epoll is actually being used; I had to tweak the Makefile and a constant or two). Configured with:

./configure --with-maxfd=32768 --disable-poll --disable-select --enable-epoll --enable-async-io=48 --prefix=/usr/local/encap/squid-2.5.12-64bit-epoll
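
One easy way to confirm that epoll is actually in effect is to watch the syscalls of the running process for a few seconds (the PID below is a placeholder):

strace -p <squid_pid> -e trace=epoll_wait,epoll_ctl,poll,select

If epoll_wait/epoll_ctl show up instead of poll/select, the patch took.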

I have two AUFS cache dirs:
cache_dir aufs /cache1 24000 8 256
cache_dir aufs /mnt/drive2/cache2 24000 8 256

I've overridden the refresh:
refresh_pattern . 525600 20% 525600 override-expire override-lastmod ignore-reload

This machine basically only serves a huge set of images, which do not change over time, so refresh isn't really relevant.

OK, so now for the good news. I have about 17,000 active connections (keep-alive is set to 30 seconds) doing 1700 requests/sec. Disk I/O is negligible: the cache is partitioned, and essentially all reads come from the page cache. There is CPU to spare - a vmstat over 60 seconds shows plenty of idle time, 0 iowait, and an even split between user and sys CPU time. All this yields about 43Mbps of traffic on a 100Mbps NIC. Most files are ~4k.
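
For anyone comparing numbers, nothing fancy is involved - something along these lines (the 60-second interval matches what I quoted above; the netstat line is just a rough cross-check on the connection count):

vmstat 60 2
netstat -tan | grep -c ESTABLISHED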

This would be fantastic, except that the machines "fall over" after several hours. I have 4 machines, each configured identically. They last a few hours: squid slowly consumes more and more CPU, all in user space, until it starts affecting the median HTTP response time, and then throughput drops precipitously.

If I kill the process and start a new one, everything returns to normal - much lower CPU load, response times drop, and throughput comes back. The store log reveals nothing - just the usual RELEASE messages. Running strace on a good and a bad machine simultaneously doesn't reveal anything obviously wrong. There aren't any "queue congestion" messages, and I'm not seeing SO_FAIL messages (I upped the number of threads to 48 after seeing those reported on an earlier machine). Additionally, the cache manager's general info doesn't show any particularly distressed metric.
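
In case it helps to reproduce the comparison, the checks above are roughly the following (the PID is a placeholder; adjust squidclient's port if your cache manager isn't on the default):

strace -c -p <squid_pid>    # attach for a while, then Ctrl-C for the summary table
squidclient mgr:info        # cache manager general runtime info
squidclient mgr:5min        # 5-minute averages, including median service times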

Basically, something in squid is chewing up more and more user-space CPU over time, and I can't figure out what. I'm pretty new to squid, and I'm pretty much stalled.

As I recall, turning down persistent_request_timeout did not seem to have any effect. The relevant settings are:
request_timeout 30 seconds
persistent_request_timeout 30 seconds
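
These can be changed without a full restart, by the way - after editing squid.conf, a plain reload picks up the new values, which makes it easy to test different timeouts against a live process:

squid -k reconfigure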

I've included a bit more data below in case it helps.

I'd really appreciate any advice - squid is by far the best thing I've tried, and if I can just make it stable, I'd be done. :-)

Thanks,

-Mike




Sample data from the cache manager:

File descriptor usage for squid:
        Maximum number of file descriptors:   32768
        Largest file desc currently in use:   18675
        Number of file desc currently in use: 17869
        Files queued for open:                   4
        Available number of file descriptors: 14895
        Reserved number of file descriptors:   100
        Store Disk files open:                  15
Internal Data Structures:
        885561 StoreEntries
          2916 StoreEntries with MemObjects
          2891 Hot Object Cache Items
        884513 on-disk objects


Results of cumulative strace:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 37.45   56.910200          13   4332990         3 futex
 14.76   22.434863          11   1973577           epoll_ctl
 14.30   21.724293          26    846843       550 write
 13.02   19.786709          12   1642753    830428 read
 11.45   17.393603           9   1920158           fcntl
  6.63   10.075447          12    829458           close
  1.15    1.751153         126     13919           epoll_wait
  0.82    1.244980           9    139200           accept
  0.30    0.461178           3    139200           getsockname
  0.08    0.114513           4     27840           gettimeofday
  0.01    0.019765          29       693       693 connect
  0.01    0.013483          19       693           socket
  0.01    0.008909          49       180           getdents64
  0.00    0.005713          13       430           stat
  0.00    0.005326           8       693           bind
  0.00    0.003456           5       693           getsockopt
  0.00    0.003367           5       693           setsockopt
  0.00    0.000936          20        48           open
  0.00    0.000347           3       103           lseek
  0.00    0.000241           5        48           fstat
  0.00    0.000110          22         5           brk
  0.00    0.000067           5        13           getrusage
------ ----------- ----------- --------- --------- ----------------
100.00  151.958659              11870230    831674 total
