Re: Breaking Varnish

2009-01-28 Thread Poul-Henning Kamp
In message , Tim Kientzle writes:

>It also appears that Varnish eventually exits completely
>if placed under high load.  I'm okay with that as long as it's
>intentional behavior; 

It is not intentional.

The entire point about the two-process trick is to not ever throw
in the towel if we can avoid it.

That said, there are classes of bugs for which we have no hope,
if for instance the manager process cannot fork or allocate
memory, then we are hosed top and bottom.
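
As a rough illustration of the two-process idea (not Varnish's actual
code), the sketch below has a manager loop start a child, wait for it to
exit, and start another.  The `sh -c 'exit 1'` stand-in for a crashing
child and the `max_restarts` cap are assumptions made so the demo
terminates; per the message above, the real manager is meant to keep
restarting for as long as it can.

```shell
#!/bin/sh
# Manager/child sketch: restart the child every time it exits.
# Illustrative only; this is not how varnishd is implemented.
restarts=0
max_restarts=3   # hypothetical cap so the demo terminates

while [ "$restarts" -lt "$max_restarts" ]; do
    sh -c 'exit 1' &          # stand-in for a cache child that crashes
    child=$!
    echo "child ($child) Started"
    wait "$child"             # manager blocks until the child exits
    status=$?
    restarts=$((restarts + 1))
    echo "child ($child) exited with status $status; restart $restarts"
done
```

If the manager itself cannot fork, a loop like this fails as well, which
is the class of failure the message describes as hopeless.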

>Of course,
>I understand that killing the child and starting a new one
>will also lose the cache, which is obviously not particularly
>desirable under heavy load.

Persistent storage coming up in version 2.1 :-)

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Breaking Varnish

2009-01-28 Thread Tim Kientzle
On Jan 28, 2009, at 1:54 AM, Poul-Henning Kamp wrote:
> In message <20090123222947.gb28...@digdug.corp.631h.metaweb.com>,
> Niall O'Higgins writes:
>>>> Can I get you to take -trunk for a spin ?
>>>>
>>>> At least the second of the problems you pasted I'm pretty sure I
>>>> have nailed recently and the first one could easily be the same one
>>>> in a different disguise.
>>
>> I've re-run the load test against varnish-trunk.  Trunk is better
>> behaved, but I now get output like this over and over:
>>
>> child (19731) Started
>> Child (19731) said Closed fds: 4 7 8 10 11
>> Child (19731) said Child starts
>> Child (19731) said managed to mmap 49929912320 bytes of 49929912320
>> Child (19731) said Ready
>> Child (19731) not responding to ping, killing it.
>
> This is a typical indication of raw overload, what levels of traffic
> are you hitting it with ?

Pretty heavy.  We put together a test workload that saturated
Squid at around 1500 req/s on a dual-core dev system.  The symptoms
above appeared somewhere above 6000 req/s on the same hardware
and workload.

The test has two goals:
  1) To try to find bugs in Varnish that might prevent us from
 switching to Varnish from Squid.
  2) To understand how Varnish behaves when it becomes
 saturated.

When testing Squid in this fashion, we found no bugs.  Under
heavy load, Squid did become very slow but recovered cleanly
and went back into normal operation as soon as the load was
removed.

Varnish didn't fare quite so well.  We did find bugs, as you
know.  Fortunately, those seem to be fixed in trunk.  (When do
you expect the next point release?)

It also appears that Varnish eventually exits completely
if placed under high load.  I'm okay with that as long as it's
intentional behavior; we have a standard nanny that we use in
production to restart crashed services anyway.  Of course,
I understand that killing the child and starting a new one
will also lose the cache, which is obviously not particularly
desirable under heavy load.

Cheers,

Tim



Re: [+] Re: Breaking Varnish

2009-01-28 Thread Poul-Henning Kamp
In message <20090128183618.ge28...@digdug.corp.631h.metaweb.com>, Niall O'Higgins writes:
>On Wed, Jan 28, 2009 at 10:18:48AM -0800, Michael S. Fischer wrote:
>> On Jan 28, 2009, at 10:04 AM, Niall O'Higgins wrote:
>
>Varnish is running on a dual CPU (amd64) Linux 2.6 machine.  We have
>pushed it up to 6701 t/sec with multiple load-generation machines.  We
>see the same child-restart behaviour whether we use a single
>load-generation machine, or three.

As I said, increase the cli_timeout parameter; it is probably
too short for that kind of scenario.

Also, you should probably set srcaddr_ttl to zero, to disable
the (effectively unused) per source-IP statistics.
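
A sketch of how both suggestions could be applied at startup, reusing
the varnishd command line posted elsewhere in this thread.  `-p
param=value` is varnishd's option for setting a run-time parameter.  The
30-second cli_timeout is an assumption chosen for illustration, not a
value recommended in the thread.  The script only prints the command
rather than running it.

```shell
#!/bin/sh
# Build a varnishd command line with the two parameter changes.
# CLI_TIMEOUT is a hypothetical value; tune it for your own workload.
CLI_TIMEOUT=30   # seconds

VARNISHD_CMD="sbin/varnishd -f etc/varnish/default.vcl -F -a 0.0.0.0:8101 -p cli_timeout=${CLI_TIMEOUT} -p srcaddr_ttl=0"

# Dry run: print the command so it can be reviewed before use.
echo "$VARNISHD_CMD"
```

Run the printed command by hand once the values look right.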


-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.


Re: [+] Re: Breaking Varnish

2009-01-28 Thread Niall O'Higgins
On Wed, Jan 28, 2009 at 10:18:48AM -0800, Michael S. Fischer wrote:
> On Jan 28, 2009, at 10:04 AM, Niall O'Higgins wrote:
>>> This is a typical indication of raw overload, what levels of traffic
>>> are you hitting it with ?
>>
>> This kind of thing:
>>
>> Transaction rate:    3776.65 trans/sec
>> Throughput:             1.68 MB/sec
>> Concurrency:           28.28
>
> That doesn't seem that high.  What OS/# CPUs are you using?

Varnish is running on a dual CPU (amd64) Linux 2.6 machine.  We have
pushed it up to 6701 t/sec with multiple load-generation machines.  We
see the same child-restart behaviour whether we use a single
load-generation machine, or three.

>
> --Michael
>

-- 
Niall O'Higgins
Software Engineer
Metaweb Technologies, Inc.


Re: [+] Re: Breaking Varnish

2009-01-28 Thread Michael S. Fischer
On Jan 28, 2009, at 10:04 AM, Niall O'Higgins wrote:
>> This is a typical indication of raw overload, what levels of traffic
>> are you hitting it with ?
>
> This kind of thing:
>
> Transaction rate:    3776.65 trans/sec
> Throughput:             1.68 MB/sec
> Concurrency:           28.28

That doesn't seem that high.  What OS/# CPUs are you using?

--Michael


Re: [+] Re: Breaking Varnish

2009-01-28 Thread Poul-Henning Kamp
In message <20090128180448.gd28...@digdug.corp.631h.metaweb.com>, Niall O'Higgins writes:

>Transaction rate:    3776.65 trans/sec
>Throughput:             1.68 MB/sec
>Concurrency:           28.28
>
>Does the parent process give up on restarting the child after a
>certain number of failures?  I was surprised by the eventual complete
>exit of varnishd with the message:
>
>Pushing vcls failed: CLI communication error

It shouldn't do that, it should be able to restart it forever.

>Also, Varnish seems to be able to handle up to about double that load
>for a while (we got up to 6701 t/sec), then it dies as above.  Seems
>like it takes around 2-3 hours for the varnishd parent process
>to die.

Once you get to that level of load, the ability of the scheduler
to not do something stupid is paramount for survival.

Try to increase the "cli_timeout" parameter; it is probably set a
bit on the aggressive side by default.

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.


Re: [+] Re: Breaking Varnish

2009-01-28 Thread Niall O'Higgins
On Wed, Jan 28, 2009 at 09:54:26AM +, Poul-Henning Kamp wrote:
> >I've re-run the load test against varnish-trunk.  Trunk is better
> >behaved, but I now get output like this over and over:
> >
> >child (19731) Started
> >Child (19731) said Closed fds: 4 7 8 10 11
> >Child (19731) said Child starts
> >Child (19731) said managed to mmap 49929912320 bytes of 49929912320
> >Child (19731) said Ready
> >Child (19731) not responding to ping, killing it.
> 
> This is a typical indication of raw overload, what levels of traffic
> are you hitting it with ?

This kind of thing:

Transaction rate:    3776.65 trans/sec
Throughput:             1.68 MB/sec
Concurrency:           28.28
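
To put those figures in perspective, here is a small worked calculation.
It assumes the Concurrency figure is the average number of requests in
flight, so that Little's law applies (mean response time = concurrency /
transaction rate), and it assumes 1 MB = 1,000,000 bytes, which the
report does not state.

```shell
#!/bin/sh
# Figures copied from the load-test report above.
rate=3776.65        # transactions per second
throughput=1.68     # MB per second
concurrency=28.28   # assumed: average number of requests in flight

# Little's law: mean time in system = mean number in system / arrival rate.
resp_ms=$(awk -v c="$concurrency" -v r="$rate" \
    'BEGIN { printf "%.2f", c / r * 1000 }')

# Assumes decimal megabytes; with 2^20-byte megabytes the result is larger.
bytes=$(awk -v t="$throughput" -v r="$rate" \
    'BEGIN { printf "%.0f", t * 1000000 / r }')

echo "mean response time: ${resp_ms} ms"
echo "mean response size: ${bytes} bytes"
```

Under those assumptions the run works out to about 7.5 ms per request
and roughly 445 bytes per response, i.e. many small, fast transactions.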

Does the parent process give up on restarting the child after a
certain number of failures?  I was surprised by the eventual complete
exit of varnishd with the message:

Pushing vcls failed: CLI communication error

Also, Varnish seems to be able to handle up to about double that load
for a while (we got up to 6701 t/sec), then it dies as above.  Seems
like it takes around 2-3 hours for the varnishd parent process
to die.


-- 
Niall O'Higgins
Software Engineer
Metaweb Technologies, Inc.


Re: Breaking Varnish

2009-01-28 Thread Poul-Henning Kamp
In message <20090123222947.gb28...@digdug.corp.631h.metaweb.com>, Niall O'Higgins writes:

>>> Hi Tim,
>>>
>>> Can I get you to take -trunk for a spin ?
>>>
>>> At least the second of the problems you pasted I'm pretty sure I
>>> have nailed recently and the first one could easily be the same one
>>> in a different disguise.
>
>I've re-run the load test against varnish-trunk.  Trunk is better
>behaved, but I now get output like this over and over:
>
>child (19731) Started
>Child (19731) said Closed fds: 4 7 8 10 11
>Child (19731) said Child starts
>Child (19731) said managed to mmap 49929912320 bytes of 49929912320
>Child (19731) said Ready
>Child (19731) not responding to ping, killing it.

This is a typical indication of raw overload, what levels of traffic
are you hitting it with ?


-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.


Re: Breaking Varnish

2009-01-23 Thread Niall O'Higgins
Hi,

Regarding:

On Wed, Jan 21, 2009 at 02:05:55PM -0800, Tim Kientzle wrote:
> On Jan 21, 2009, at 2:02 PM, Poul-Henning Kamp wrote:
>>> Under heavy load, we're seeing a lot of segfaults and
>>> assertion failures.  I've pasted an excerpt below of
>>> two of the issues we've seen using Varnish 2.0.2 on Linux
>>> 2.6.21 kernel with the default VCL (using command-line options
>>> to set the listen address and the addresses of the two back-end
>>> servers).
>>
>> Hi Tim,
>>
>> Can I get you to take -trunk for a spin ?
>>
>> At least the second of the problems you pasted I'm pretty sure I
>> have nailed recently and the first one could easily be the same one
>> in a different disguise.

I've re-run the load test against varnish-trunk.  Trunk is better
behaved, but I now get output like this over and over:

child (19731) Started
Child (19731) said Closed fds: 4 7 8 10 11
Child (19731) said Child starts
Child (19731) said managed to mmap 49929912320 bytes of 49929912320
Child (19731) said Ready
Child (19731) not responding to ping, killing it.
Child (19731) not responding to ping, killing it.
Child (19731) not responding to ping, killing it.
Child (19731) died signal=3
Child cleanup complete

And varnish eventually exits with this message:

child (19773) Started
Pushing vcls failed: CLI communication error

I am running varnishd like so:

sbin/varnishd -f etc/varnish/default.vcl -F -a'0.0.0.0:8101'

My configuration file contains:

director www_director round-robin {
    { .backend = { .host = "appserver1"; .port = "8105"; } }
    { .backend = { .host = "appserver2"; .port = "8105"; } }
}

sub vcl_recv {
    if (req.http.host ~ "^varnishserver$") {
        set req.backend = www_director;
    }
}

If there are other details which might help diagnose this, let me know and I'll
try to provide them.

-- 
Niall O'Higgins
Software Engineer
Metaweb Technologies, Inc.


Re: Breaking Varnish

2009-01-21 Thread Tim Kientzle
Dual-core AMD processor using the x86_64 kernel.  Uname shows:

Linux 2.6.21.5 #9 SMP Thu Aug 16 17:21:29 UTC 2007 x86_64 AMD Opteron(tm) Processor 248 AuthenticAMD GNU/Linux


On Jan 21, 2009, at 2:01 PM, Iliya Sharov wrote:
> amd64 or i386 architecture?
>
> Tim Kientzle wrote:
>> We're evaluating Varnish as a possible replacement for our
>> installed Squid servers.  Performance-wise, Varnish is very
>> impressive, and we're pretty pleased with the configuration
>> flexibility.
>>
>> But...
>>
>> Under heavy load, we're seeing a lot of segfaults and
>> assertion failures.  I've pasted an excerpt below of
>> two of the issues we've seen using Varnish 2.0.2 on Linux
>> 2.6.21 kernel with the default VCL (using command-line options
>> to set the listen address and the addresses of the two back-end
>> servers).
>>
>> We're going to repeat these tests and see if we can get
>> more detail, possibly including core dumps.  What other
>> information would be useful in diagnosing and fixing
>> these issues?
>>
>> Cheers,
>>
>> Tim Kientzle
>>
>> ==
>>
>> 1) Varnish repeatedly died due to SIGSEGV:
>>
>> child (2816) Started
>> Child (2816) said Closed fds: 4 7 8 10 11
>> Child (2816) said Child starts
>> Child (2816) said managed to mmap 49392648192 bytes of 49392648192
>> Child (2816) said Ready
>> Child (2816) died signal=11
>> Child cleanup complete
>>
>> 2) Varnish repeatedly died due to SIGABRT:
>>
>> child (3017) Started
>> Child (3017) said Closed fds: 4 7 8 10 11
>> Child (3017) said Child starts
>> Child (3017) said managed to mmap 49392648192 bytes of 49392648192
>> Child (3017) said Ready
>> Child (3017) died signal=6
>> Child (3017) Panic message: Assert error in cnt_lookup(),
>> cache_center.c line 625:
>>   Condition(sp->objhead != NULL) not true. thread = (cache-worker)sp
>> = 0x2afee0fb3008 {
>>   fd = -1, id = 15, xid = 0,
>>   client = 10.2.8.27:45430,
>>   step = STP_DONE,
>>   handling = DELIVER,
>>   ws = 0x2afee0fb3078 {
>> id = "sess",
>> {s,f,r,e} = {0x2afee0fb37b0,,+587,(nil),+8192},
>>   },
>> },


Re: Breaking Varnish

2009-01-21 Thread Poul-Henning Kamp
In message <6545783f-b1a7-4fda-94d8-8439a2d13...@metaweb.com>, Tim Kientzle writes:

>Under heavy load, we're seeing a lot of segfaults and
>assertion failures.  I've pasted an excerpt below of
>two of the issues we've seen using Varnish 2.0.2 on Linux
>2.6.21 kernel with the default VCL (using command-line options
>to set the listen address and the addresses of the two back-end
>servers).

Hi Tim,

Can I get you to take -trunk for a spin ?

At least the second of the problems you pasted I'm pretty sure I
have nailed recently and the first one could easily be the same one
in a different disguise.

Poul-Henning

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.


Re: Breaking Varnish

2009-01-21 Thread Iliya Sharov
amd64 or i386 architecture?

Tim Kientzle wrote:
> We're evaluating Varnish as a possible replacement for our
> installed Squid servers.  Performance-wise, Varnish is very
> impressive, and we're pretty pleased with the configuration
> flexibility.
>
> But...
>
> Under heavy load, we're seeing a lot of segfaults and
> assertion failures.  I've pasted an excerpt below of
> two of the issues we've seen using Varnish 2.0.2 on Linux
> 2.6.21 kernel with the default VCL (using command-line options
> to set the listen address and the addresses of the two back-end
> servers).
>
> We're going to repeat these tests and see if we can get
> more detail, possibly including core dumps.  What other
> information would be useful in diagnosing and fixing
> these issues?
>
> Cheers,
>
> Tim Kientzle
>
> ==
>
> 1) Varnish repeatedly died due to SIGSEGV:
>
> child (2816) Started
> Child (2816) said Closed fds: 4 7 8 10 11
> Child (2816) said Child starts
> Child (2816) said managed to mmap 49392648192 bytes of 49392648192
> Child (2816) said Ready
> Child (2816) died signal=11
> Child cleanup complete
>
> 2) Varnish repeatedly died due to SIGABRT:
>
> child (3017) Started
> Child (3017) said Closed fds: 4 7 8 10 11
> Child (3017) said Child starts
> Child (3017) said managed to mmap 49392648192 bytes of 49392648192
> Child (3017) said Ready
> Child (3017) died signal=6
> Child (3017) Panic message: Assert error in cnt_lookup(),  
> cache_center.c line 625:
>Condition(sp->objhead != NULL) not true. thread = (cache-worker)sp  
> = 0x2afee0fb3008 {
>fd = -1, id = 15, xid = 0,
>client = 10.2.8.27:45430,
>step = STP_DONE,
>handling = DELIVER,
>ws = 0x2afee0fb3078 {
>  id = "sess",
>  {s,f,r,e} = {0x2afee0fb37b0,,+587,(nil),+8192},
>},
> }, 


Breaking Varnish

2009-01-21 Thread Tim Kientzle
We're evaluating Varnish as a possible replacement for our
installed Squid servers.  Performance-wise, Varnish is very
impressive, and we're pretty pleased with the configuration
flexibility.

But...

Under heavy load, we're seeing a lot of segfaults and
assertion failures.  I've pasted an excerpt below of
two of the issues we've seen using Varnish 2.0.2 on Linux
2.6.21 kernel with the default VCL (using command-line options
to set the listen address and the addresses of the two back-end
servers).

We're going to repeat these tests and see if we can get
more detail, possibly including core dumps.  What other
information would be useful in diagnosing and fixing
these issues?

Cheers,

Tim Kientzle

==

1) Varnish repeatedly died due to SIGSEGV:

child (2816) Started
Child (2816) said Closed fds: 4 7 8 10 11
Child (2816) said Child starts
Child (2816) said managed to mmap 49392648192 bytes of 49392648192
Child (2816) said Ready
Child (2816) died signal=11
Child cleanup complete

2) Varnish repeatedly died due to SIGABRT:

child (3017) Started
Child (3017) said Closed fds: 4 7 8 10 11
Child (3017) said Child starts
Child (3017) said managed to mmap 49392648192 bytes of 49392648192
Child (3017) said Ready
Child (3017) died signal=6
Child (3017) Panic message: Assert error in cnt_lookup(), cache_center.c line 625:
   Condition(sp->objhead != NULL) not true. thread = (cache-worker)sp = 0x2afee0fb3008 {
   fd = -1, id = 15, xid = 0,
   client = 10.2.8.27:45430,
   step = STP_DONE,
   handling = DELIVER,
   ws = 0x2afee0fb3078 {
     id = "sess",
     {s,f,r,e} = {0x2afee0fb37b0,,+587,(nil),+8192},
   },
},