Re: Varnish 2.0.6 nuking all my objects?

2010-02-25 Thread Barry Abrahamson

On Feb 25, 2010, at 2:56 PM, Barry Abrahamson wrote:

> In my case, Varnish took a cache of 1 million objects, purged 920k of them.  
> When there were 80k objects left the child restarted, thus dumping the 
> remaining 80k :)  

Happened again - here is the backtrace info:

Child (7222) died signal=6
Child (7222) Panic message: Assert error in STV_alloc(), stevedore.c line 71:
  Condition((st) != NULL) not true.
thread = (cache-worker)
Backtrace:
  0x41d655: pan_ic+85
  0x433815: STV_alloc+a5
  0x416ca4: Fetch+684
  0x41131f: cnt_fetch+cf
  0x4125a5: CNT_Session+3a5
  0x41f616: wrk_do_cnt_sess+86
  0x41eb90: wrk_thread+1b0
  0x7f79f61e0fc7: _end+7f79f5b7a147
  0x7f79f5abb59d: _end+7f79f545471d
sp = 0x7f542e45a008 {
  fd = 9, id = 9, xid = 116896,
  client = 10.2.255.5:22276,
  step = STP_FETCH,
  handling = discard,
  restarts = 0, esis = 0
  ws = 0x7f542e45a080 {
id = "sess",
{s,f,r,e} = {0x7f542e45a820,+347,(nil),+16384},
  },

The request information shows that it was apparently fetching a 1GB file from 
the backend and trying to insert it into the cache.
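
One thing we may try as a stopgap (untested, sketched against 2.0-era VCL where the fetched object is "obj" in vcl_fetch): never insert very large responses in the first place, something like:

sub vcl_fetch {
  # Hypothetical guard: anything with a 9+ digit Content-Length
  # (>= 100MB) is delivered to the client but not stored, so a
  # single huge fetch can't blow through the storage file.
  if (obj.http.Content-Length ~ "^[0-9]{9,}") {
    pass;
  }
}
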
--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com





Re: Varnish 2.0.6 nuking all my objects?

2010-02-25 Thread Barry Abrahamson

On Feb 25, 2010, at 12:47 PM, David Birdsong wrote:

> On Thu, Feb 25, 2010 at 8:41 AM, Barry Abrahamson wrote:
>> 
>> On Feb 25, 2010, at 2:26 AM, David Birdsong wrote:
>> 
>>> I have seen this happen.
>>> 
>>> I have a similar hardware setup, though I changed the multi-ssd raid
>>> into 3 separate cache file arguments.
>> 
>> Did you try RAID and switch to the separate cache files because performance 
>> was better?
> seemingly so.
> 
> for some reason enabling block_dump showed that kswapd was always
> writing to those devices despite there not being any swap space on
> them.
> 
> i searched around fruitlessly to try to understand the overhead of
> software raid to explain this, but once i discovered varnish could
> take on multiple cache files, i saw no reason for the software raid
> and just abandoned it.

Interesting - I will try it out!  Thanks for the info.
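
For reference, I assume the multi-file setup would look something like this on the varnishd command line (paths and sizes are made up):

varnishd -a 0.0.0.0:80 -f /etc/varnish/default.vcl \
  -s file,/ssd1/varnish_storage.bin,80G \
  -s file,/ssd2/varnish_storage.bin,80G \
  -s file,/ssd3/varnish_storage.bin,80G
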


>>> We had roughly 240GB storage space total, after about 2-3 weeks and
>>> sm_bfree reached ~20GB. lru_nuked started incrementing, sm_bfree
>>> climbed to ~60GB, but lru_nuking never stopped.
>> 
>> How did you fix it?
> i haven't yet.
> 
> i'm changing up how i cache content, such that lru_nuking can be
> better tolerated.

In my case, Varnish took a cache of 1 million objects, purged 920k of them.  
When there were 80k objects left the child restarted, thus dumping the 
remaining 80k :)  


--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com





Re: Varnish 2.0.6 nuking all my objects?

2010-02-25 Thread Barry Abrahamson

On Feb 25, 2010, at 3:54 AM, Poul-Henning Kamp wrote:

> In message , David Birdsong writes:
> 
>> We had roughly 240GB storage space total, after about 2-3 weeks and
>> sm_bfree reached ~20GB. lru_nuked started incrementing, sm_bfree
>> climbed to ~60GB, but lru_nuking never stopped.
> 
> We had a bug where we would nuke from one stevedore, but try to allocate
> from another.  Not sure if the fix made it into any of the 2.0 releases,
> it will be in 2.1

Thanks for the info - are the fixes in -trunk now?

--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com





Re: Varnish 2.0.6 nuking all my objects?

2010-02-25 Thread Barry Abrahamson

On Feb 25, 2010, at 2:26 AM, David Birdsong wrote:

> I have seen this happen.
> 
> I have a similar hardware setup, though I changed the multi-ssd raid
> into 3 separate cache file arguments.

Did you try RAID and switch to the separate cache files because performance was 
better?

> We had roughly 240GB storage space total, after about 2-3 weeks and
> sm_bfree reached ~20GB. lru_nuked started incrementing, sm_bfree
> climbed to ~60GB, but lru_nuking never stopped.

How did you fix it?


--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com





Varnish 2.0.6 nuking all my objects?

2010-02-24 Thread Barry Abrahamson
  4519.58 SHM records
shm_writes               50380066       373.27 SHM writes
shm_flushes                  9387         0.07 SHM flushes due to overflow
shm_cont                    47763         0.35 SHM MTX contention
shm_cycles                    226         0.00 SHM cycles through buffer
sm_nreq                   4449213        32.96 allocator requests
sm_nobj                    341160            . outstanding allocations
sm_balloc             11373072384            . bytes allocated
sm_bfree             116602589184            . bytes free
sma_nreq                        0         0.00 SMA allocator requests
sma_nobj                        0            . SMA outstanding allocations
sma_nbytes                      0            . SMA outstanding bytes
sma_balloc                      0            . SMA bytes allocated
sma_bfree                       0            . SMA bytes free
sms_nreq                    63997         0.47 SMS allocator requests
sms_nobj                        0            . SMS outstanding allocations
sms_nbytes   18446744073709548694            . SMS outstanding bytes
sms_balloc               31161028            . SMS bytes allocated
sms_bfree                31162489            . SMS bytes freed
backend_req               1821961        13.50 Backend requests made
n_vcl                           1         0.00 N vcl total
n_vcl_avail                     1         0.00 N vcl available
n_vcl_discard                   0         0.00 N vcl discarded
n_purge                         1            . N total active purges
n_purge_add                     1         0.00 N new purges added
n_purge_retire                  0         0.00 N old purges deleted
n_purge_obj_test                0         0.00 N objects tested
n_purge_re_test                 0         0.00 N regexps tested against
n_purge_dups                    0         0.00 N duplicate purges removed
hcb_nolock                      0         0.00 HCB Lookups without lock
hcb_lock                        0         0.00 HCB Lookups with lock
hcb_insert                      0         0.00 HCB Inserts
esi_parse                       0         0.00 Objects ESI parsed (unlock)
esi_errors                      0         0.00 ESI parse errors (unlock)

Obviously, this isn't good for my cache hit rate :)  It is also using a lot of 
CPU.  Has anyone seen this happen before?

--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com





Re: is 2.0.2 not as efficient as 1.1.2 was?

2009-03-12 Thread Barry Abrahamson

On Mar 11, 2009, at 2:28 AM, Alex Lines wrote:

> Barry, Demitrious - did you ever find a solution here?

Nope.  For now, we are just using more hardware to hide the problem :(

--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com





Re: is 2.0.2 not as efficient as 1.1.2 was?

2009-02-10 Thread Barry Abrahamson

On Feb 4, 2009, at 1:53 PM, Poul-Henning Kamp wrote:

> In message <37eadde4-a23a-4204-b04a-46d47d348...@automattic.com>, Barry Abrahamson writes:
>
>>> This week we upgraded to 2.0.2 and are using varnish's back end &
>>> director configuration for the same work.  What we are seeing is  
>>> that
>>> 2.0.2 holds about 60% of the objects in the same amount of cache  
>>> space
>>> as 1.1.2 did (we tried tcmalloc, jemalloc, and mmap.)
>
> Your description does not make it obvious to me what is causing this
> but one candidate could be the stored hash-string, in particular if
> your URLS are long.
>
> The new purge code (likely included in 2.0.3, but already available
> in -trunk) dispenses with the need to store the hash-string, so the
> theory could be tested.

Upgraded to trunk, didn't help.

--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com








Re: is 2.0.2 not as efficient as 1.1.2 was?

2009-02-04 Thread Barry Abrahamson
On Nov 25, 2008, at 5:37 PM, Demitrious Kelly wrote:

> Hello,
>
> We run Gravatar.com and use varnish to cache avatar responses.  There
> are a ton of very small objects and lots of requests per second. Last
> week we were using 1.1.2 compiled against tcmalloc (-t 600 -w 1,4000,5
> -h classic,59 -p thread_pools 10 -p listen_depth 4096 -s
> malloc,16G). This used an nginx load balancer on a separate host as  
> its
> back end which distributed varnish's requests to our pool of webs.   
> All
> was well.
>
> This week we upgraded to 2.0.2 and are using varnish's back end &
> director configuration for the same work.  What we are seeing is that
> 2.0.2 holds about 60% of the objects in the same amount of cache space
> as 1.1.2 did (we tried tcmalloc, jemalloc, and mmap.)  This caused us
> quite a few problems after the upgrade as varnish would start spiking
> the load on the boxes into the hundreds.  We attempted tuning the
> lru_interval (up) and obj_workspace (down) but we couldn't get varnish
> to hold the same data that it used to on the same machines.
>
> Right now we've reduced the time that we keep cached objects
> drastically, bringing our cache hit rate down to 92% from 96% which
> roughly doubled the requests (and load) on the web servers.  It is,
> however, stable at this point.  Obviously the idea of not keeping up
> with the latest versions of varnish is not what we want to do, however
> effectively doubling requirements for scaling the service is just as
> unappealing.
>
> So, what we're asking is... how do we get varnish 2 to be as efficient
> as varnish 1 was?  We're glad to try things...  It takes a while to  
> fill
> up the cache to the point that it can cause problems so testing and
> reporting back will take some time, but we'd like this fixed and will
> put in some work. We're currently running the following cli options:
>
> -a 0.0.0.0:80 -f ... -P ... -T 10.1.94.43:6969 -t 600 -w 1,4000,5 -h
> classic,59 -p thread_pools 10 -p listen_depth 4096 -s malloc,16G
>
> And our VCL looks like this (with most of the webs taken out for  
> brevity
> since they're repeated verbatim with only numbers changed)
>
> backend web11 { .host = "xxx"; .port = "8088"; .probe =
>{ .url = "xxx"; .timeout = 50 ms; .interval = 5s;
> .window = 2; .threshold = 1; }
> }
> backend web12 { .host = "xxx"; .port = "8088"; .probe =
>{ .url = "xxx"; .timeout = 50 ms; .interval = 5s;
> .window = 2; .threshold = 1; }
> }
>
> director default random {
>.retries = 3;
>{ .backend = web11; .weight = 1; }
>{ .backend = web12; .weight = 1; }
> }
>
> sub vcl_recv {
>  set req.backend = default;
>  set req.grace = 30s;
>  if ( req.url ~ "^/(avatar|userimage)" && req.http.cookie )  {
>lookup;
>  }
> }
>
> sub vcl_fetch {
>  if (obj.ttl < 600s) {
>set obj.ttl = 600s;
>  }
>  if (obj.status == 404) {
>set obj.ttl = 30s;
>  }
>  if (obj.status == 500 || obj.status == 503 ) {
>pass;
>  }
>  set obj.grace = 30s;
>  deliver;
> }
>
> sub vcl_deliver {
>  remove resp.http.Expires;
>  remove resp.http.Cache-Control;
>  set resp.http.Cache-Control = "public, max-age=600, proxy- 
> revalidate";
>  deliver;
> }


Bump :)  Is anyone else seeing the same thing?  I think it may be because
a lot of the cached responses are just headers (302 redirects) with no
actual body.  That is the only reason I can think of why we would see
this issue and others wouldn't.  I suspect most people using varnish
don't have stats that look like this:

  10094887744    960644.65    847668.80 Total header bytes
  22230934332   2174908.58   1866733.93 Total body bytes

I don't really want to revert to 1.1.2 because I like the general  
stability and features of 2.x, but I don't have any real ideas on how  
to troubleshoot why this would be happening.  Any ideas would be  
appreciated.

--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com





Re: Random director in varnish trunk doesn't work?

2008-07-10 Thread Barry Abrahamson
On Jul 10, 2008, at 2:10 PM, Poul-Henning Kamp wrote:

> In message <[EMAIL PROTECTED]>, Barry Abrahamson writes:
>> Is anyone successfully using the random director in varnish trunk
>> (r2917)?
>
> Try #2919, I have fixed an off-by one bug I introduced recently.

Worked.  Thanks.

--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com








Random director in varnish trunk doesn't work?

2008-07-10 Thread Barry Abrahamson
Is anyone successfully using the random director in varnish trunk  
(r2917)?

I have something like this in my config:

backend web1 { .host = "10.0.1.1"; .port = "8080"; }
backend web2 { .host = "10.0.1.2"; .port = "8080"; }

director default random {
{
.backend = web1;
.weight = 1;
}
{
.backend = web2;
.weight = 1;
}
}

sub vcl_recv {
set req.backend = default;

.

Varnish won't serve any requests -- it looks like the child just dies.  
Strace shows:

read(11, "Assert error in vdi_random_choos"..., 8191) = 102
write(2, "Child (30484) said Assert error "..., 84) = 84

Using a normal backend (not random director) works fine.

Details: Debian Etch amd64, kernel 2.6.18-6.

I am not sure if it is an unhandled config/syntax error in my vcl or  
something else.


--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com








Re: Strategy for large cache sets

2008-07-02 Thread Barry Abrahamson
On Jul 1, 2008, at 2:16 PM, Skye Poier Nott wrote:

> I want to deploy Varnish with very large cache sizes (200GB or more)
> for large, long lived file sets.

We are doing this also and running into performance problems.  We have  
tried files and swap.  Running on Linux (Debian).  When the cache
starts to get large, we start to see huge load spikes (loads of 200+)  
caused by IO wait but no corresponding spikes in request rates or any  
of the varnish metrics (except threads running which I am pretty sure  
is a result of the load spike and not the cause).   We are currently  
running 1.1.2 but have tried with trunk and the same thing happens.   
If you find anything useful in your testing, I would love to hear
about it.

--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com








Re: Conditional caching question

2008-06-04 Thread Barry Abrahamson
On Jun 2, 2008, at 8:41 AM, Poul-Henning Kamp wrote:

> In message <[EMAIL PROTECTED]>, David Pratt writes:
>
>> Hi. In most cases, I want a request to be passed to a backend where it
>> will be handled by the server. If frequency is high, however, I want to
>> add the object to the varnish cache and have varnish handle it. I am not
>> worried about a mechanism for keeping track of the frequency of requests.
>> The question is: what is available to me to add an object/path to the
>> varnish cache if it was originally passed?
>
> I wouldn't say that your way of using varnish is backwards relative
> to the design objectives, but you do come close, since we assumed
> caching by default, and pass as exception, rather than the other
> way around.

We do this on WordPress.com to avoid filling our caches with  
infrequently requested data.  The way we handle it is when an object  
reaches a certain req/sec threshold, we send a header from the backend  
and then have varnish configured to only insert objects into the cache  
which contain this custom header.  Based on phk's reply, I guess we  
are using varnish in a somewhat backwards manner as well, since we  
assume pass as the default, insert as the exception.
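
As a rough sketch of our setup (X-Cache-Me is a stand-in name for our custom header):

sub vcl_fetch {
  # Only insert objects the backend has flagged as hot enough
  # to cache; deliver everything else without storing it.
  if (obj.http.X-Cache-Me) {
    deliver;
  }
  pass;
}
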

This used to work in 1.0.3.  I have started to look into upgrading to  
trunk, and it doesn't seem to work so well anymore.  It looks like the  
first time the URL is requested, if it is passed because it hasn't  
reached that threshold and the header hasn't been set, all subsequent  
requests are automatically passed.  These show up as "Cache hits
for pass" in varnishstat.  Any way around this?


--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com








Re: Multiple varnish instances per server?

2008-06-01 Thread Barry Abrahamson
On Jun 1, 2008, at 1:38 PM, Michael S. Fischer wrote:

> Why are you using Varnish to serve primarily images?  Modern  
> webservers serve static files very efficiently off the filesystem.

Because we have about 6TB of content and are using Varnish as the  
"hot" cache and S3 as the "cold" store.

--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com








Multiple varnish instances per server?

2008-06-01 Thread Barry Abrahamson
Hi,

Is anyone running multiple varnish instances per server (one per disk  
or similar?)

We are currently running a single varnish instance per server using  
the file backend.  Machines are Dual Opteron 2218, 4GB RAM, and 2  
250GB SATA drives.  We have the cache file on a software RAID 0  
array.  Our cache size is set to 300GB, but once we get to 100GB or  
so, IO starts to get very spiky, causing loads to spike into the 100  
range.  Our expires are rather long (1-2 weeks).  My initial thoughts  
were that this was caused by cache file fragmentation, but we are  
seeing similar issues when using the malloc backend.  We were thinking  
that running 2 instances per server with smaller cache files (one per  
physical disk) might improve our IO problems.  Is there any performance
benefit/detriment to running multiple varnish instances per server?   
Is there a performance hit for having a large cache?
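
Something like this is what we had in mind (hypothetical paths; -n gives each instance its own name and working directory):

varnishd -n cache1 -a 0.0.0.0:8081 -f /etc/varnish/default.vcl \
  -s file,/disk1/varnish_storage.bin,140G
varnishd -n cache2 -a 0.0.0.0:8082 -f /etc/varnish/default.vcl \
  -s file,/disk2/varnish_storage.bin,140G
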

Request rates aren't that high (50-150/sec), but the cached files are  
all images, some of which can be rather big (3MB).

Also, is anyone else seeing similar issues under similar workloads?
--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com





