Thanks Damon. I'll put the cache size down. On the other varnish server cache size is equal to physical ram and we haven't had this problem. I have read Kristian's blog and was using malloc before. It was on a server with 32 GB of ram. I set malloc to 28 GB but for some reason it ended up using ~ 15 GB of swap space. From what I understand varnish performs better using mmap then swap so that's why I changed to the file storage type. I'll also try adjusting the thread configuration on all our varnish nodes.
Sent from my iPhone On 2011-09-22, at 9:47 AM, "Damon Snyder" <[email protected]<mailto:[email protected]>> wrote: Hi Matt, We had some instability when running varnish using a file backed cache that was larger than RAM. This was a year or more ago, and I don't recall the exact details of the issue. We resolved it by reducing the size of the file cache to be smaller than memory and the problems went away. It looks like you can also increase your thread pool size. We are using something like -w 100,1600,120 at startup and have a similar connection rate. I would also suggest moving to malloc storage type (we are migrating from file to malloc). Kristian is quoted as saying: If you can not contain your cache in memory, first consider if you really need that big of a cache. Then consider buying more memory. Then sleep on it. <http://kristianlyng.wordpress.com/2010/01/26/varnish-best-practices/>http://kristianlyng.wordpress.com/2010/01/26/varnish-best-practices/ If you do switch to malloc on this box, try something like -smalloc,40G. You don't want to allocate up to the amount of RAM as varnish needs some additional space for other data and you need to make sure your OS and other processes has some margin as well. For my part, I wish there was an interface into the size of the objects being cached so that you can better estimate the size of the cache you need. I understand varnishsizes can do something like but it doesn't tell you what is currently stored in aggregate or allow you to traverse the objects and their sizes. In any case, you can monitor your hit rate and see what impact dropping the extra 12 GB has. I suppose you could argue that your hit rate is moot if you are not responding to connections so just getting rid of the no-response issue should be net positive. Hope this helps, Damon On Thu, Sep 22, 2011 at 9:14 AM, Matt Schurenko <<mailto:[email protected]>[email protected]<mailto:[email protected]>> wrote: I posted yesterday regarding this issue. It has happened again; however this time I have not restarted the problem varnish node. I am using the default configuration with regards to threads. It seems that once thread_pool_max hits 500 that the server stops responding to any requests. I stopped all connections from the load balancer to the server; however the threads do not decrease. They remain stuck at ~ 509: [root@mvp14 16328]# grep Threads /proc/`pgrep varnishd|tail -n1`/status Threads: 509 The server has been idle for ~ 45 minutes now and there are only a couple of established connections: [root@mvp14 16328]# netstat -ant | grep -w .*:80 | grep EST tcp 254 0 204.92.101.119:80<http://204.92.101.119:80> 192.168.105.32:37554<http://192.168.105.32:37554> ESTABLISHED tcp 532 0 204.92.101.119:80<http://204.92.101.119:80> 192.168.100.153:57722<http://192.168.100.153:57722> ESTABLISHED tcp 0 0 192.168.100.56:38818<http://192.168.100.56:38818> 204.92.101.124:80<http://204.92.101.124:80> ESTABLISHED There are however quite a number of connections in CLOSE_WAIT: [root@mvp14 16328]# netstat -ant | grep -w .*:80 | grep CLOSE_WAIT | wc -l 1118 Here is my varnishd version: [root@mvp14 16328]# varnishd -V varnishd (varnish-2.1.5 SVN ) Copyright (c) 2006-2009 Linpro AS / Verdens Gang ASrsion: Here are the system limits for varnishd: [root@mvp14 16328]# cat /proc/`pgrep varnishd|tail -n1`/limits Limit Soft Limit Hard Limit Units Max cpu time unlimited unlimited seconds Max file size unlimited unlimited bytes Max data size unlimited unlimited bytes Max stack size 10485760 unlimited bytes Max core file size 0 unlimited bytes Max resident set unlimited unlimited bytes Max processes 397312 397312 processes Max open files 131072 131072 files Max locked memory 32768 32768 bytes Max address space unlimited unlimited bytes Max file locks unlimited unlimited locks Max pending signals 397312 397312 signals Max msgqueue size 819200 819200 bytes Max nice priority 0 0 Max realtime priority 0 0 Here is some memory info on the server: [root@mvp14 16328]# free -m total used free shared buffers cached Mem: 48299 48170 129 0 92 42815 -/+ buffers/cache: 5262 43037 Swap: 15147 0 15147 Here is my varnishd command line: /usr/local/sbin/varnishd -s file,/tmp/varnish-cache,60G -T 127.0.0.1:2000<http://127.0.0.1:2000> -a 0.0.0.0:80<http://0.0.0.0:80> -t 604800 -f /usr/local/etc/varnish/default.vcl -p http_headers 384 -p connect_timeout 4.0 Here is the output from 'varnishstat -1': client_conn 32772547 265.09 Client connections accepted client_drop 13103 0.11 Connection dropped, no sess/wrk client_req 32531681 263.14 Client requests received cache_hit 27525134 222.64 Cache hits cache_hitpass 0 0.00 Cache hits for pass cache_miss 5005404 40.49 Cache misses backend_conn 4954451 40.07 Backend conn. success backend_unhealthy 0 0.00 Backend conn. not attempted backend_busy 0 0.00 Backend conn. too many backend_fail 853 0.01 Backend conn. failures backend_reuse 51728 0.42 Backend conn. reuses backend_toolate 13 0.00 Backend conn. was closed backend_recycle 51742 0.42 Backend conn. recycles backend_unused 0 0.00 Backend conn. unused fetch_head 5 0.00 Fetch head fetch_length 81316 0.66 Fetch with Length fetch_chunked 4924086 39.83 Fetch chunked fetch_eof 0 0.00 Fetch EOF fetch_bad 0 0.00 Fetch had bad headers fetch_close 186 0.00 Fetch wanted close fetch_oldhttp 0 0.00 Fetch pre HTTP/1.1 closed fetch_zero 0 0.00 Fetch zero len fetch_failed 0 0.00 Fetch failed n_sess_mem 1268 . N struct sess_mem n_sess 1174 . N struct sess n_object 4922732 . N struct object n_vampireobject 0 . N unresurrected objects n_objectcore 4923144 . N struct objectcore n_objecthead 4642001 . N struct objecthead n_smf 9639146 . N struct smf n_smf_frag 394705 . N small free smf n_smf_large 0 . N large free smf n_vbe_conn 501 . N struct vbe_conn n_wrk 500 . N worker threads n_wrk_create 3622 0.03 N worker threads created n_wrk_failed 0 0.00 N worker threads not created n_wrk_max 4079 0.03 N worker threads limited n_wrk_queue 502 0.00 N queued work requests n_wrk_overflow 65305 0.53 N overflowed work requests n_wrk_drop 13102 0.11 N dropped work requests n_backend 2 . N backends n_expired 1347 . N expired objects n_lru_nuked 381454 . N LRU nuked objects n_lru_saved 0 . N LRU saved objects n_lru_moved 23327252 . N LRU moved objects n_deathrow 0 . N objects on deathrow losthdr 0 0.00 HTTP header overflows n_objsendfile 0 0.00 Objects sent with sendfile n_objwrite 31912510 258.13 Objects sent with write n_objoverflow 0 0.00 Objects overflowing workspace s_sess 32758443 264.97 Total Sessions s_req 32531681 263.14 Total Requests s_pipe 0 0.00 Total pipe s_pass 1134 0.01 Total pass s_fetch 5005593 40.49 Total fetch s_hdrbytes 10659824012 86223.60 Total header bytes s_bodybytes 129812627152 1050009.12 Total body bytes sess_closed 29276120 236.80 Session Closed sess_pipeline 17 0.00 Session Pipeline sess_readahead 32 0.00 Session Read Ahead sess_linger 3510104 28.39 Session Linger sess_herd 3554241 28.75 Session herd shm_records 1725429324 13956.40 SHM records shm_writes 144491896 1168.74 SHM writes shm_flushes 750 0.01 SHM flushes due to overflow shm_cont 494654 4.00 SHM MTX contention shm_cycles 794 0.01 SHM cycles through buffer sm_nreq 10391973 84.06 allocator requests sm_nobj 9244441 . outstanding allocations sm_balloc 41184530432 . bytes allocated sm_bfree 23239979008 . bytes free sma_nreq 0 0.00 SMA allocator requests sma_nobj 0 . SMA outstanding allocations sma_nbytes 0 . SMA outstanding bytes sma_balloc 0 . SMA bytes allocated sma_bfree 0 . SMA bytes free sms_nreq 945 0.01 SMS allocator requests sms_nobj 0 . SMS outstanding allocations sms_nbytes 0 . SMS outstanding bytes sms_balloc 395010 . SMS bytes allocated sms_bfree 395010 . SMS bytes freed backend_req 5006185 40.49 Backend requests made n_vcl 1 0.00 N vcl total n_vcl_avail 1 0.00 N vcl available n_vcl_discard 0 0.00 N vcl discarded n_purge 1 . N total active purges n_purge_add 1 0.00 N new purges added n_purge_retire 0 0.00 N old purges deleted n_purge_obj_test 0 0.00 N objects tested n_purge_re_test 0 0.00 N regexps tested against n_purge_dups 0 0.00 N duplicate purges removed hcb_nolock 32531033 263.13 HCB Lookups without lock hcb_lock 5005369 40.49 HCB Lookups with lock hcb_insert 5005363 40.49 HCB Inserts esi_parse 0 0.00 Objects ESI parsed (unlock) esi_errors 0 0.00 ESI parse errors (unlock) accept_fail 0 0.00 Accept failures client_drop_late 1 0.00 Connection dropped late uptime 123630 1.00 Client uptime backend_retry 0 0.00 Backend conn. retry dir_dns_lookups 0 0.00 DNS director lookups dir_dns_failed 0 0.00 DNS director failed lookups dir_dns_hit 0 0.00 DNS director cached lookups hit dir_dns_cache_full 0 0.00 DNS director full dnscache fetch_1xx 0 0.00 Fetch no body (1xx) fetch_204 0 0.00 Fetch no body (204) fetch_304 0 0.00 Fetch no body (304) Here is some output from varnishlog (Is this normal?): 1442 TTL - 216036249 RFC 604800 1316705246 0 0 0 0 1442 VCL_call - fetch 1442 VCL_return - deliver 1442 ObjProtocol - HTTP/1.1 1442 ObjStatus - 200 1442 ObjResponse - OK 1442 ObjHeader - Date: Thu, 22 Sep 2011 15:27:25 GMT 1442 ObjHeader - Server: Apache/1.3.41 (Unix) mod_perl/1.30 1442 ObjHeader - x-airg-hasbinary: 1 1442 ObjHeader - x-airg-return-contentType: image%2Fjpeg 1442 ObjHeader - x-airg-interfacestatus: 200 1442 ObjHeader - Content-Type: image/jpeg 0 ExpKill - 184966508 LRU 0 ExpKill - 184967370 LRU 0 ExpKill - 184969553 LRU 0 ExpKill - 184970764 LRU 0 ExpKill - 184971732 LRU 0 ExpKill - 184976538 LRU 0 ExpKill - 184977972 LRU 0 ExpKill - 184988825 LRU 0 ExpKill - 184917719 LRU 0 ExpKill - 184997163 LRU 0 ExpKill - 184940621 LRU 0 ExpKill - 185000270 LRU 0 ExpKill - 185001314 LRU 0 ExpKill - 185003793 LRU 0 ExpKill - 185004913 LRU 0 ExpKill - 183651304 LRU 0 ExpKill - 185010145 LRU 0 ExpKill - 185012162 LRU I also noticed in MRTG that when this happens there is a sudden spike in lru nuked activity. It looks like it went from 0 nukes/sec to ~ 200. Do I have something misconfigured? Is varnish running into some kind of resource limitation (memory, file descriptors) which is causing it to hang? Did I set the cache size to large compared to the amount of physcial RAM I have? I am running the same version of varnish on the exact same server and this has not happened. The only difference is that I am using '-s file,/tmp/varnish-cache,48G ' instead of '-s file,/tmp/varnish-cache,60G'. Any help here would be most appreciated. This is in production right now. Matt Schurenko Systems Administrator airG share your world Suite 710, 1133 Melville Street Vancouver, BC V6E 4E5 P: +1.604.408.2228 F: +1.866.874.8136 E: <mailto:[email protected]> [email protected]<mailto:[email protected]> W: <http://www.airg.com> www.airg.com<http://www.airg.com> _______________________________________________ varnish-misc mailing list <mailto:[email protected]>[email protected]<mailto:[email protected]> <https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc>https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc _______________________________________________ varnish-misc mailing list [email protected] https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
