Re: Breaking Varnish
In message , Tim Kientzle writes:

>It also appears that Varnish eventually exits completely
>if placed under high load.  I'm okay with that as long as it's
>intentional behavior;

It is not intentional.  The entire point of the two-process trick is to never throw in the towel if we can avoid it.

That said, there are classes of bugs for which we have no hope: if, for instance, the manager process cannot fork or allocate memory, then we are hosed top and bottom.

>Of course,
>I understand that killing the child and starting a new one
>will also lose the cache, which is obviously not particularly
>desirable under heavy load.

Persistent storage coming up in version 2.1 :-)

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
p...@freebsd.org        | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.

___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Breaking Varnish
On Jan 28, 2009, at 1:54 AM, Poul-Henning Kamp wrote:

> In message <20090123222947.gb28...@digdug.corp.631h.metaweb.com>,
> Niall O'Higgins writes:
>
>>> Can I get you to take -trunk for a spin ?
>>>
>>> At least the second of the problems you pasted I'm pretty sure I
>>> have nailed recently and the first one could easily be the same one
>>> in a different disguise.
>>
>> I've re-run the load test against varnish-trunk.  Trunk is better
>> behaved, but I now get output like this over and over:
>>
>> child (19731) Started
>> Child (19731) said Closed fds: 4 7 8 10 11
>> Child (19731) said Child starts
>> Child (19731) said managed to mmap 49929912320 bytes of 49929912320
>> Child (19731) said Ready
>> Child (19731) not responding to ping, killing it.
>
> This is a typical indication of raw overload, what levels of traffic
> are you hitting it with ?

Pretty heavy.  We put together a test workload that saturated Squid at around 1500 req/s on a dual-core dev system.  The symptoms above appeared somewhere above 6000 req/s on the same hardware and workload.

The test has two goals:

1) To try to find bugs in Varnish that might prevent us from switching to Varnish from Squid.

2) To understand how Varnish behaves when it becomes saturated.

When testing Squid in this fashion, we found no bugs.  Under heavy load, Squid did become very slow, but it recovered cleanly and went back into normal operation as soon as the load was removed.

Varnish didn't fare quite so well.  We did find bugs, as you know.  Fortunately, those seem to be fixed in trunk.  (When do you expect the next point release?)

It also appears that Varnish eventually exits completely if placed under high load.  I'm okay with that as long as it's intentional behavior; we have a standard nanny that we use in production to restart crashed services anyway.  Of course, I understand that killing the child and starting a new one will also lose the cache, which is obviously not particularly desirable under heavy load.
Cheers,

Tim
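[Editor's note] Tim's "standard nanny" above is just an external restart loop around the service. A minimal sketch of such a nanny in shell, assuming only that the nanny reruns a command whenever it exits (the restart cap and messages are illustrative, not Metaweb's actual tool):

```shell
#!/bin/sh
# Hypothetical nanny sketch: rerun the supervised command whenever it
# exits, up to MAX_RESTARTS times, then give up.
MAX_RESTARTS=3

nanny() {
    restarts=0
    while [ "$restarts" -lt "$MAX_RESTARTS" ]; do
        "$@"    # run the service in the foreground; returns when it dies
        echo "nanny: service exited (status $?), restarting" >&2
        restarts=$((restarts + 1))
        sleep 1
    done
    echo "nanny: giving up after $MAX_RESTARTS restarts" >&2
}
```

Used as, e.g., `nanny sbin/varnishd -F -f etc/varnish/default.vcl`, this keeps the whole varnishd (manager plus child) coming back even when the manager itself dies, complementing Varnish's internal manager/child restart described by Poul-Henning below.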
Re: [+] Re: Breaking Varnish
In message <20090128183618.ge28...@digdug.corp.631h.metaweb.com>, Niall O'Higgins writes:

>On Wed, Jan 28, 2009 at 10:18:48AM -0800, Michael S. Fischer wrote:
>> On Jan 28, 2009, at 10:04 AM, Niall O'Higgins wrote:
>
>Varnish is running on a dual CPU (amd64) Linux 2.6 machine.  We have
>pushed it up to 6701 t/sec with multiple load-generation machines.  We
>see the same child-restart behaviour whether we use a single
>load-generation machine, or three.

As I said, increase the cli_timeout parameter; it is probably too short for that kind of scenario.

Also, you should probably set srcaddr_ttl to zero, to disable the (effectively unused) per-source-IP statistics.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
p...@freebsd.org        | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
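[Editor's note] Both knobs Poul-Henning names here are varnishd runtime parameters, so they can be set with `-p` on the command line. A hedged invocation sketch, assuming the file paths, listen address, and the 20-second value (the thread recommends raising cli_timeout but does not give a number):

```shell
# Raise the CLI ping timeout so a heavily loaded child is not declared
# dead prematurely, and disable per-source-IP statistics.
# Paths, address, and timeout value are illustrative assumptions.
varnishd -f /etc/varnish/default.vcl -F -a 0.0.0.0:8101 \
    -p cli_timeout=20 \
    -p srcaddr_ttl=0
```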
Re: [+] Re: Breaking Varnish
On Wed, Jan 28, 2009 at 10:18:48AM -0800, Michael S. Fischer wrote:
> On Jan 28, 2009, at 10:04 AM, Niall O'Higgins wrote:
>>> This is a typical indication of raw overload, what levels of traffic
>>> are you hitting it with ?
>>
>> This kind of thing:
>>
>> Transaction rate:  3776.65 trans/sec
>> Throughput:        1.68 MB/sec
>> Concurrency:       28.28
>
> That doesn't seem that high.  What OS/# CPUs are you using?
>
> --Michael

Varnish is running on a dual CPU (amd64) Linux 2.6 machine.  We have pushed it up to 6701 t/sec with multiple load-generation machines.  We see the same child-restart behaviour whether we use a single load-generation machine, or three.

-- 
Niall O'Higgins
Software Engineer
Metaweb Technologies, Inc.
Re: [+] Re: Breaking Varnish
On Jan 28, 2009, at 10:04 AM, Niall O'Higgins wrote:

>> This is a typical indication of raw overload, what levels of traffic
>> are you hitting it with ?
>
> This kind of thing:
>
> Transaction rate:  3776.65 trans/sec
> Throughput:        1.68 MB/sec
> Concurrency:       28.28

That doesn't seem that high.  What OS/# CPUs are you using?

--Michael
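[Editor's note] The siege numbers quoted above are internally consistent: by Little's law, concurrency ≈ rate × mean response time, so the implied per-request latency can be recovered with plain awk arithmetic (nothing Varnish-specific):

```shell
# Little's law: L = lambda * W, so W = L / lambda.
rate=3776.65        # lambda: transactions per second
concurrency=28.28   # L: mean requests in flight
awk -v l="$concurrency" -v r="$rate" \
    'BEGIN { printf "mean response time: %.1f ms\n", 1000 * l / r }'
# prints "mean response time: 7.5 ms"
```

About 7.5 ms per request, which supports Michael's point that the load itself is not extreme for this hardware.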
Re: [+] Re: Breaking Varnish
In message <20090128180448.gd28...@digdug.corp.631h.metaweb.com>, Niall O'Higgins writes:

>Transaction rate:  3776.65 trans/sec
>Throughput:        1.68 MB/sec
>Concurrency:       28.28
>
>Does the parent process give up on restarting the child after a
>certain number of failures?  I was surprised by the eventual complete
>exit of varnishd with the message:
>
>Pushing vcls failed: CLI communication error

It shouldn't do that; it should be able to restart it forever.

>Also, Varnish seems to be able to handle up to about double that load
>for a while (we got up to 6701 t/sec), then it dies as above.  Seems
>like it takes around 2-3 hours for the varnishd parent process
>to die.

Once you get to that level of load, the ability of the scheduler to not do something stupid is paramount for survival.

Try to increase the "cli_timeout" parameter; it is probably set a bit on the aggressive side by default.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
p...@freebsd.org        | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Re: [+] Re: Breaking Varnish
On Wed, Jan 28, 2009 at 09:54:26AM +, Poul-Henning Kamp wrote:

>> I've re-run the load test against varnish-trunk.  Trunk is better
>> behaved, but I now get output like this over and over:
>>
>> child (19731) Started
>> Child (19731) said Closed fds: 4 7 8 10 11
>> Child (19731) said Child starts
>> Child (19731) said managed to mmap 49929912320 bytes of 49929912320
>> Child (19731) said Ready
>> Child (19731) not responding to ping, killing it.
>
> This is a typical indication of raw overload, what levels of traffic
> are you hitting it with ?

This kind of thing:

Transaction rate:  3776.65 trans/sec
Throughput:        1.68 MB/sec
Concurrency:       28.28

Does the parent process give up on restarting the child after a certain number of failures?  I was surprised by the eventual complete exit of varnishd with the message:

Pushing vcls failed: CLI communication error

Also, Varnish seems to be able to handle up to about double that load for a while (we got up to 6701 t/sec), then it dies as above.  It seems to take around 2-3 hours for the varnishd parent process to die.

> -- 
> Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
> p...@freebsd.org        | TCP/IP since RFC 956
> FreeBSD committer       | BSD since 4.3-tahoe
> Never attribute to malice what can adequately be explained by incompetence.

-- 
Niall O'Higgins
Software Engineer
Metaweb Technologies, Inc.
Re: Breaking Varnish
In message <20090123222947.gb28...@digdug.corp.631h.metaweb.com>, Niall O'Higgins writes:

>>> Hi Tim,
>>>
>>> Can I get you to take -trunk for a spin ?
>>>
>>> At least the second of the problems you pasted I'm pretty sure I
>>> have nailed recently and the first one could easily be the same one
>>> in a different disguise.
>
>I've re-run the load test against varnish-trunk.  Trunk is better
>behaved, but I now get output like this over and over:
>
>child (19731) Started
>Child (19731) said Closed fds: 4 7 8 10 11
>Child (19731) said Child starts
>Child (19731) said managed to mmap 49929912320 bytes of 49929912320
>Child (19731) said Ready
>Child (19731) not responding to ping, killing it.

This is a typical indication of raw overload, what levels of traffic are you hitting it with ?

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
p...@freebsd.org        | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Re: Breaking Varnish
Hi,

On Wed, Jan 21, 2009 at 02:05:55PM -0800, Tim Kientzle wrote:
> On Jan 21, 2009, at 2:02 PM, Poul-Henning Kamp wrote:
>>> Under heavy load, we're seeing a lot of segfaults and
>>> assertion failures.  I've pasted an excerpt below of
>>> two of the issues we've seen using Varnish 2.0.2 on Linux
>>> 2.6.21 kernel with the default VCL (using command-line options
>>> to set the listen address and the addresses of the two back-end
>>> servers).
>>
>> Hi Tim,
>>
>> Can I get you to take -trunk for a spin ?
>>
>> At least the second of the problems you pasted I'm pretty sure I
>> have nailed recently and the first one could easily be the same one
>> in a different disguise.

I've re-run the load test against varnish-trunk.  Trunk is better behaved, but I now get output like this over and over:

child (19731) Started
Child (19731) said Closed fds: 4 7 8 10 11
Child (19731) said Child starts
Child (19731) said managed to mmap 49929912320 bytes of 49929912320
Child (19731) said Ready
Child (19731) not responding to ping, killing it.
Child (19731) not responding to ping, killing it.
Child (19731) not responding to ping, killing it.
Child (19731) died signal=3
Child cleanup complete

And varnish eventually exits with this message:

child (19773) Started
Pushing vcls failed: CLI communication error

I am running varnishd like so:

sbin/varnishd -f etc/varnish/default.vcl -F -a'0.0.0.0:8101'

My configuration file contains:

director www_director round-robin {
    { .backend = { .host = "appserver1"; .port = "8105"; } }
    { .backend = { .host = "appserver2"; .port = "8105"; } }
}

sub vcl_recv {
    if (req.http.host ~ "^varnishserver$") {
        set req.backend = www_director;
    }
}

If there are other details which might help diagnose this, let me know and I'll try to provide them.

-- 
Niall O'Higgins
Software Engineer
Metaweb Technologies, Inc.
Re: Breaking Varnish
Dual-core AMD processor using the x86_64 kernel.  Uname shows:

Linux 2.6.21.5 #9 SMP Thu Aug 16 17:21:29 UTC 2007 x86_64 AMD Opteron(tm) Processor 248 AuthenticAMD GNU/Linux

On Jan 21, 2009, at 2:01 PM, Iliya Sharov wrote:

> amd64 or i386 architecture?
>
> Tim Kientzle writes:
>> We're evaluating Varnish as a possible replacement for our
>> installed Squid servers.  Performance-wise, Varnish is very
>> impressive, and we're pretty pleased with the configuration
>> flexibility.
>>
>> But...
>>
>> Under heavy load, we're seeing a lot of segfaults and
>> assertion failures.  I've pasted an excerpt below of
>> two of the issues we've seen using Varnish 2.0.2 on Linux
>> 2.6.21 kernel with the default VCL (using command-line options
>> to set the listen address and the addresses of the two back-end
>> servers).
>>
>> We're going to repeat these tests and see if we can get
>> more detail, possibly including core dumps.  What other
>> information would be useful in diagnosing and fixing
>> these issues?
>>
>> Cheers,
>>
>> Tim Kientzle
>>
>> ==
>>
>> 1) Varnish repeatedly died due to SIGSEGV:
>>
>> child (2816) Started
>> Child (2816) said Closed fds: 4 7 8 10 11
>> Child (2816) said Child starts
>> Child (2816) said managed to mmap 49392648192 bytes of 49392648192
>> Child (2816) said Ready
>> Child (2816) died signal=11
>> Child cleanup complete
>>
>> 2) Varnish repeatedly died due to SIGABRT:
>>
>> child (3017) Started
>> Child (3017) said Closed fds: 4 7 8 10 11
>> Child (3017) said Child starts
>> Child (3017) said managed to mmap 49392648192 bytes of 49392648192
>> Child (3017) said Ready
>> Child (3017) died signal=6
>> Child (3017) Panic message: Assert error in cnt_lookup(), cache_center.c line 625:
>>   Condition(sp->objhead != NULL) not true.
>> thread = (cache-worker)sp = 0x2afee0fb3008 {
>>   fd = -1, id = 15, xid = 0,
>>   client = 10.2.8.27:45430,
>>   step = STP_DONE,
>>   handling = DELIVER,
>>   ws = 0x2afee0fb3078 {
>>     id = "sess",
>>     {s,f,r,e} = {0x2afee0fb37b0,,+587,(nil),+8192},
>>   },
>> },
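[Editor's note] For readers decoding the "died signal=N" lines in these reports: the numbers are standard POSIX signal numbers, which `kill -l` translates to names (a small shell aside, not part of the original report):

```shell
# Translate the signal numbers from the child-death log lines.
kill -l 11   # SEGV: "died signal=11" was a segmentation fault
kill -l 6    # ABRT: "died signal=6" came from abort(), i.e. a failed assert
```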
Re: Breaking Varnish
In message <6545783f-b1a7-4fda-94d8-8439a2d13...@metaweb.com>, Tim Kientzle writes:

>Under heavy load, we're seeing a lot of segfaults and
>assertion failures.  I've pasted an excerpt below of
>two of the issues we've seen using Varnish 2.0.2 on Linux
>2.6.21 kernel with the default VCL (using command-line options
>to set the listen address and the addresses of the two back-end
>servers).

Hi Tim,

Can I get you to take -trunk for a spin ?

At least the second of the problems you pasted I'm pretty sure I have nailed recently and the first one could easily be the same one in a different disguise.

Poul-Henning

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
p...@freebsd.org        | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Re: Breaking Varnish
amd64 or i386 architecture?

Tim Kientzle writes:
> We're evaluating Varnish as a possible replacement for our
> installed Squid servers.  Performance-wise, Varnish is very
> impressive, and we're pretty pleased with the configuration
> flexibility.
>
> But...
>
> Under heavy load, we're seeing a lot of segfaults and
> assertion failures.  I've pasted an excerpt below of
> two of the issues we've seen using Varnish 2.0.2 on Linux
> 2.6.21 kernel with the default VCL (using command-line options
> to set the listen address and the addresses of the two back-end
> servers).
>
> We're going to repeat these tests and see if we can get
> more detail, possibly including core dumps.  What other
> information would be useful in diagnosing and fixing
> these issues?
>
> Cheers,
>
> Tim Kientzle
>
> ==
>
> 1) Varnish repeatedly died due to SIGSEGV:
>
> child (2816) Started
> Child (2816) said Closed fds: 4 7 8 10 11
> Child (2816) said Child starts
> Child (2816) said managed to mmap 49392648192 bytes of 49392648192
> Child (2816) said Ready
> Child (2816) died signal=11
> Child cleanup complete
>
> 2) Varnish repeatedly died due to SIGABRT:
>
> child (3017) Started
> Child (3017) said Closed fds: 4 7 8 10 11
> Child (3017) said Child starts
> Child (3017) said managed to mmap 49392648192 bytes of 49392648192
> Child (3017) said Ready
> Child (3017) died signal=6
> Child (3017) Panic message: Assert error in cnt_lookup(), cache_center.c line 625:
>   Condition(sp->objhead != NULL) not true.
> thread = (cache-worker)sp = 0x2afee0fb3008 {
>   fd = -1, id = 15, xid = 0,
>   client = 10.2.8.27:45430,
>   step = STP_DONE,
>   handling = DELIVER,
>   ws = 0x2afee0fb3078 {
>     id = "sess",
>     {s,f,r,e} = {0x2afee0fb37b0,,+587,(nil),+8192},
>   },
> },
Breaking Varnish
We're evaluating Varnish as a possible replacement for our installed Squid servers.  Performance-wise, Varnish is very impressive, and we're pretty pleased with the configuration flexibility.

But...

Under heavy load, we're seeing a lot of segfaults and assertion failures.  I've pasted an excerpt below of two of the issues we've seen using Varnish 2.0.2 on Linux 2.6.21 kernel with the default VCL (using command-line options to set the listen address and the addresses of the two back-end servers).

We're going to repeat these tests and see if we can get more detail, possibly including core dumps.  What other information would be useful in diagnosing and fixing these issues?

Cheers,

Tim Kientzle

==

1) Varnish repeatedly died due to SIGSEGV:

child (2816) Started
Child (2816) said Closed fds: 4 7 8 10 11
Child (2816) said Child starts
Child (2816) said managed to mmap 49392648192 bytes of 49392648192
Child (2816) said Ready
Child (2816) died signal=11
Child cleanup complete

2) Varnish repeatedly died due to SIGABRT:

child (3017) Started
Child (3017) said Closed fds: 4 7 8 10 11
Child (3017) said Child starts
Child (3017) said managed to mmap 49392648192 bytes of 49392648192
Child (3017) said Ready
Child (3017) died signal=6
Child (3017) Panic message: Assert error in cnt_lookup(), cache_center.c line 625:
  Condition(sp->objhead != NULL) not true.
thread = (cache-worker)sp = 0x2afee0fb3008 {
  fd = -1, id = 15, xid = 0,
  client = 10.2.8.27:45430,
  step = STP_DONE,
  handling = DELIVER,
  ws = 0x2afee0fb3078 {
    id = "sess",
    {s,f,r,e} = {0x2afee0fb37b0,,+587,(nil),+8192},
  },
},