Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

2013-03-11 Thread paulm
Excuse me, David,

What are the ab parameters that you use to test against squid?

Thanks, Paul





Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

2013-03-11 Thread Amos Jeffries

On 12/03/2013 8:11 a.m., paulm wrote:

Excuse me, David,

What are the ab parameters that you use to test against squid?


-n for request count
-c for concurrency level

SMP in Squid shares a listening port so -c 1 will still test both 
workers. But the results are more interesting as you vary client count 
versus request count.


For a worst-case traffic scenario, test with a guaranteed MISS response; for 
the best case, test with a small HIT response.


Other than that, whatever you like. Using an FQDN you host yourself is polite.
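
For example, something along these lines (the hostname, counts, and proxy 
address are placeholders, not the exact commands from the original tests):

  # low vs. high concurrency against the same URL, via the proxy
  ab -n 50000 -c 1   -X proxyhost:3128 http://test.example.com/small.html
  ab -n 50000 -c 100 -X proxyhost:3128 http://test.example.com/small.html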

Amos


RE: [squid-users] squid 3.2.0.5 smp scaling issues

2011-06-12 Thread Jenny Lee

On Sat, Jun 11, 2011 at 9:40 PM, Jenny Lee bodycar...@live.com wrote:

I'd like to know how you are able to do 13000 requests/sec.
tcp_fin_timeout defaults to 60 seconds on all *NIXes, and the available ephemeral 
port range is 64K.
I can't do more than 1K requests/sec even with tcp_tw_reuse/tcp_tw_recycle with 
ab. I get commBind errors due to connections in TIME_WAIT.
Any tuning options suggested for RHEL6 x64?
Jenny

I would have a concern using both of those at the same time: reuse and recycle. 
Reuse a socket, but also recycle it? I've seen issues when testing my own Linux 
distros with both of these settings. Right or wrong, that was my experience.
fin_timeout: if you have a good connection, there should be no reason a 
system takes 60 seconds to send out a FIN. Cut that in half, if not by two thirds.
And what is your limitation at 1K requests/sec: load (if so, look at I/O) or 
network saturation? Maybe I missed an earlier thread, and I too would tilt my 
head at 13K requests/sec!
Tory
---
 
 
As I mentioned, my limitation is the ephemeral ports tied up in TIME_WAIT.  
The TIME_WAIT issue is a known factor when you are doing testing.
 
When you are tuning, you apply options one at a time. tw_reuse/tw_recycle were 
not used together, and I had a 10 sec fin_timeout, which made no difference.
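 
(For reference, these are the knobs in question; the values are illustrative, 
not a recommendation:)
 
  # RHEL6-era sysctls, applied one at a time while testing
  sysctl -w net.ipv4.tcp_tw_reuse=1       # reuse TIME_WAIT sockets for new outbound connections
  sysctl -w net.ipv4.tcp_tw_recycle=1     # aggressive recycling; known to misbehave behind NAT
  sysctl -w net.ipv4.tcp_fin_timeout=10   # shortens FIN-WAIT-2, not the TIME_WAIT period itself
  sysctl -w net.ipv4.ip_local_port_range="1024 65535"   # widen the ephemeral port pool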
 
Jenny

 
NB: I still don't know how to do indenting/quoting with this Hotmail... after 10 
years.
  

Re: [squid-users] squid 3.2.0.5 smp scaling issues

2011-06-12 Thread Amos Jeffries

On 12/06/11 18:46, Jenny Lee wrote:


[snip - full quote of Jenny's message above]



A couple of things to note.
Firstly, this was an ab (Apache Bench) reported figure. It calculates 
the software limitation based on the speed of transactions done, not 
necessarily accounting for things like TIME_WAIT; particularly if it 
was extrapolated from, say, 50K requests, which would not hit that OS limit.


He also mentioned using a local IP address. If that was on the lo 
interface, it would not be subject to things like TIME_WAIT or RTT lag.



The test was also specific to the very long lists of non-matching regex 
ACLs he apparently used. Once those were eliminated the test showed much 
faster numbers, but a similar worker pattern.


Overall, useful info for us regarding worker load sharing, and a bit of 
a warning for people writing long lists of regex ACLs. But the ACL issue 
was not really surprising.


HTH

Amos
--
Please be using
  Current Stable Squid 2.7.STABLE9 or 3.1.12
  Beta testers wanted for 3.2.0.8 and 3.1.12.2


RE: [squid-users] squid 3.2.0.5 smp scaling issues

2011-06-12 Thread Jenny Lee




 Date: Sun, 12 Jun 2011 19:54:10 +1200
 From: squ...@treenet.co.nz
 To: squid-users@squid-cache.org
 Subject: Re: [squid-users] squid 3.2.0.5 smp scaling issues

 [snip - quoted thread trimmed]

 A couple of things to note.
 Firstly, this was an ab (Apache Bench) reported figure. It calculates the
 software limitation based on the speed of transactions done, not necessarily
 accounting for things like TIME_WAIT; particularly if it was extrapolated
 from, say, 50K requests, which would not hit that OS limit.
 
ab counts 200-OK responses, and TIME_WAITs cause squid to issue 500s. Of 
course if you send in 50K it would not be subject to this, but I usually send 
a couple of 10+ million request runs to simulate load, at least for a while.
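 
(The relevant lines of ab's summary; the numbers here are only illustrative, 
and the Non-2xx line is printed only when the count is non-zero:)
 
  Complete requests:      10000000
  Failed requests:        0
  Non-2xx responses:      1234          <- any 500s from squid show up here
  Requests per second:    980.12 [#/sec] (mean)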

 
 He also mentioned using a local IP address. If that was on the lo
 interface, it would not be subject to things like TIME_WAIT or RTT lag.
 
When I was running my benches on loopback, I had tons of TIME_WAITs for 
127.0.0.1 and squid would bail out with: commBind: Cannot bind socket...
 
Of course, I might be doing things wrong.
 
I am interested in what to optimize at the RHEL6 OS level to achieve higher 
requests per second.
 
Jenny

RE: [squid-users] squid 3.2.0.5 smp scaling issues

2011-06-12 Thread david

On Sun, 12 Jun 2011, Jenny Lee wrote:


[snip - quoted thread trimmed]

I am interested in what to optimize at the RHEL6 OS level to achieve higher 
requests per second.

Jenny


I'll post my configs when I get back to the office, but one thing is that 
if you send requests faster than they can be serviced, the pending requests 
build up until you start getting timeouts. So I have to tinker with the 
number of requests that can be sent in parallel to keep the request rate 
below this point.
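
(A sketch of that kind of sweep; the hosts, counts, and URL are placeholders:)

  # walk the concurrency up until throughput stops improving or timeouts appear
  for c in 10 50 100 200 400; do
    ab -n 100000 -c $c -X squidbox:8000 http://webserver/test.html | grep 'Requests per second'
  done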


Note that when I removed the long list of ACLs I was able to get this 13K 
requests/sec rate going from machine A, to squid on machine B, to apache on 
machine C, so it's not a localhost thing.


Getting up to the 13K rate on apache does require some tuning and 
tweaking of apache; stock configs that include dozens of dynamically 
loaded modules just can't achieve these speeds. These are also fairly 
beefy boxes: dual quad-core Opterons with 64G RAM and 1G Ethernet 
(multiple cards, but I haven't tried trunking them yet).


David Lang


RE: [squid-users] squid 3.2.0.5 smp scaling issues

2011-06-12 Thread Jenny Lee

 Date: Sun, 12 Jun 2011 03:02:23 -0700
 From: da...@lang.hm
 To: bodycar...@live.com
 CC: squ...@treenet.co.nz; squid-users@squid-cache.org
 Subject: RE: [squid-users] squid 3.2.0.5 smp scaling issues
 
 [snip - full quote of David's message above]


OK, I am assuming that persistent connections are on. This doesn't simulate 
any real-life scenario.

I would like to know if anyone can do more than 500 reqs/sec with persistent 
connections off.
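
(For what it's worth: ab only uses persistent connections when -k is given, 
so both modes are easy to compare; host and counts are placeholders:)

  ab -n 100000 -c 100    http://webserver/test.html   # new TCP connection per request
  ab -n 100000 -c 100 -k http://webserver/test.html   # HTTP keep-alive (persistent)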

Jenny 

RE: [squid-users] squid 3.2.0.5 smp scaling issues

2011-06-12 Thread david

On Sun, 12 Jun 2011, Jenny Lee wrote:


[snip - quoted thread trimmed]



OK, I am assuming that persistent connections are on. This doesn't simulate 
any real-life scenario.

I would like to know if anyone can do more than 500 reqs/sec with persistent 
connections off.


I'm not using persistent connections. I do this same sort of testing to 
validate various proxies that don't support persistent connections.


I'm remembering the theoretical max of the TCP stack (from one source IP 
to one destination IP) as being ~16K requests/sec, but I don't have 
references to point to at the moment.
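
(Back-of-the-envelope, treating the ~64K ephemeral ports recycled once per 
TIME_WAIT period as the only limit between one source and one destination:)

  echo $(( 64000 / 60 ))   # ~1066 conns/sec at the default 60s TIME_WAIT
  echo $(( 64000 / 12 ))   # ~5333 conns/sec if the wait could be cut to 12s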


David Lang


Re: [squid-users] squid 3.2.0.5 smp scaling issues

2011-06-12 Thread Amos Jeffries

On 12/06/11 22:20, Jenny Lee wrote:



[snip - quoted thread trimmed]



OK, I am assuming that persistent connections are on. This doesn't simulate 
any real-life scenario.


What do you mean by that? It is the basic requirement for access to the 
major HTTP/1.1 performance features. ON is the default.





I would like to know if anyone can do more than 500 reqs/sec with persistent 
connections off.

Jenny   


Good question. Anyone?

These are our collected reports:
  http://wiki.squid-cache.org/KnowledgeBase/Benchmarks

They are all actual production network traffic rates. The actual 
benchmark tests like David's have been kept out, since we have no standard 
test set to make them comparable.


Amos
--
Please be using
  Current Stable Squid 2.7.STABLE9 or 3.1.12
  Beta testers wanted for 3.2.0.8 and 3.1.12.2


Re: [squid-users] squid 3.2.0.5 smp scaling issues

2011-06-12 Thread david

On Sun, 12 Jun 2011, Amos Jeffries wrote:


[snip - quoted thread trimmed]



OK, I am assuming that persistent connections are on. This doesn't simulate 
any real-life scenario.


What do you mean by that? It is the basic requirement for access to the major 
HTTP/1.1 performance features. ON is the default.


Some of the proxies that I've been testing don't support this (and don't 
support HTTP/1.1), so I am sure that my tests are not using persistent 
connections.


Using the old firewall toolkit's http-gw (which forks a new process for every 
incoming connection and doesn't even support all HTTP/1.0 features), I've 
seen 4000 requests/sec.


I've got systems in production that routinely top 1000 connections/sec 
between one source and one destination.


David Lang



RE: [squid-users] squid 3.2.0.5 smp scaling issues

2011-06-12 Thread Jenny Lee


 Date: Sun, 12 Jun 2011 22:47:25 +1200
 From: squ...@treenet.co.nz
 To: squid-users@squid-cache.org
 Subject: Re: [squid-users] squid 3.2.0.5 smp scaling issues

 [snip - quoted thread trimmed]
  OK, I am assuming that persistent connections are on. This doesn't simulate 
  any real-life scenario.
 
 What do you mean by that? It is the basic requirement for access to the
 major HTTP/1.1 performance features. ON is the default.
 
 
 
First of all, this breaks tcp_outgoing_address in squid, so it is definitely 
off for me.
 
The above issue also makes persistent connections unusable for peers.
 
Second, when you have many users going to many destinations, persistent 
connections are of no use. Even though I have persistent connections on for 
the client side, I am still bitten by the ephemeral ports.
 
These are my scenarios.
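 
(The sort of setup I mean; the addresses and ACL name are made up for 
illustration, not taken from a real config:)
 
  # squid.conf fragment: per-ACL outgoing address. Server-side persistent
  # connections are disabled so a pooled connection can't carry the wrong
  # source address for a request matching a different ACL.
  acl group1 src 10.0.1.0/24
  tcp_outgoing_address 192.0.2.10 group1
  server_persistent_connections off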
 
 

  I would like to know if anyone can do more than 500 reqs/sec with 
  persistent connections off.
 
  Jenny

 Good question. Anyone?
 
I can do 450 reqs/sec under constant load. But no more. And I have tried all 
available TCP tuning options.
 
Jenny 

RE: [squid-users] squid 3.2.0.5 smp scaling issues

2011-06-12 Thread Jenny Lee




 Date: Sun, 12 Jun 2011 03:35:28 -0700
 From: da...@lang.hm
 To: bodycar...@live.com
 CC: squid-users@squid-cache.org
 Subject: RE: [squid-users] squid 3.2.0.5 smp scaling issues

 [snip - quoted thread trimmed]

 I'm not using persistent connections. I do this same sort of testing to
 validate various proxies that don't support persistent connections.
 
 I'm remembering the theoretical max of the TCP stack (from one source IP
 to one destination IP) as being ~16K requests/sec, but I don't have
 references to point to at the moment.

 David Lang
 
 
With tcp_fin_timeout set at the theoretical minimum of 12 secs, we can do 5K 
req/s with 64K ports.
 
Setting tcp_fin_timeout had no effect for me. Apparently there is 
conflicting/outdated information everywhere, and I could not lower TIME_WAIT 
from its default of 60 secs, which is hardcoded into include/net/tcp.h. But I 
doubt this would have any effect when you are constantly loading the machine.
 
Making localhost-to-localhost connections didn't help either.
 
I am not a network guru, so of course I am probably doing things wrong. But no 
matter how wrong you do stuff, they cannot escape brute-forcing :) And I have 
tried everything!
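 
(The hardcoded constant is easy to confirm against a kernel tree of that era; 
the exact comment text may vary by version:)
 
  grep TCP_TIMEWAIT_LEN include/net/tcp.h
  # -> #define TCP_TIMEWAIT_LEN (60*HZ)  /* how long to wait to destroy TIME-WAIT state */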

RE: [squid-users] squid 3.2.0.5 smp scaling issues

2011-06-12 Thread david

On Sun, 12 Jun 2011, Jenny Lee wrote:


[snip - repeated from Jenny's message above]

I can't do more than 450-470 reqs/sec even with 200K in /proc/sys/net/netfilter/nf_conntrack_max 
and /sys/module/nf_conntrack/parameters/hashsize. That lets me bypass conntrack 
table full issues, but my ports run out.

Could you be kind enough to specify which OS you are using, and whether you 
run the benches for extended periods of time?
 
Any TCP tuning options you apply would also be very useful; of course, when 
you are back in the office.
 
As I mentioned, we find your work on ACLs and workers valuable.


I'm running Debian with custom-built kernels.

In the testing that I have done over the years, I have had tests at 6000+ 
connections/sec through forking proxies (which only log when they get a new 
connection, with connection rates calculated from the proxy's logs, so I 
know that they aren't using persistent or keep-alive connections).


Unfortunately the machine in my lab with squid on it is unplugged right 
now. I can get at the machines running ab and apache remotely, so I can 
hopefully get logged in and give you the kernel settings in the next 
couple of days (things are _extremely_ hectic through most of Monday, so 
it'll probably be Monday night or Tuesday before I get a chance).


David Lang


[squid-users] squid 3.2.0.5 smp scaling issues

2011-06-11 Thread Jenny Lee

I'd like to know how you are able to do 13000 requests/sec.
 
tcp_fin_timeout defaults to 60 seconds on all *NIXes, and the available ephemeral 
port range is 64K.
 
I can't do more than 1K requests/sec even with tcp_tw_reuse/tcp_tw_recycle with 
ab. I get commBind errors due to connections in TIME_WAIT.
 
Any tuning options suggested for RHEL6 x64?
 
Jenny
 
 
 
 
---
Test setup:
box A running apache and ab; against its local IP address it does 13000 requests/sec
box B running squid: 8 x 2.3 GHz Opteron cores with 16G RAM
The non-ACL/cache-peer related lines in the config are (retyped by hand, so 
typos are mine):
http_port 8000
icp_port 0
visible_hostname gromit1
cache_effective_user proxy
cache_effective_group proxy
append_domain .invalid.server.name
pid_filename /var/run/squid.pid
cache_dir null /tmp
client_db off
cache_access_log syslog squid
cache_log /var/log/squid/cache.log
cache_store_log none
coredump_dir none
no_cache deny all
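
(The exact ab invocation wasn't recorded in the thread; something of this 
shape, with placeholder names and counts, matches the setup described:)

  # from box A, through squid on box B (http_port 8000), to a small page
  ab -n 500000 -c 200 -X boxB:8000 http://boxA/short.html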

Results when requesting a short html page:
  squid 3.0.STABLE12: 4200 requests/sec
  squid 3.1.11: 2100 requests/sec
  squid 3.2.0.5, 1 worker: 1400 requests/sec
  squid 3.2.0.5, 2 workers: 2100 requests/sec
  squid 3.2.0.5, 3 workers: 2500 requests/sec
  squid 3.2.0.5, 4 workers: 2900 requests/sec
  squid 3.2.0.5, 5 workers: 2900 requests/sec
  squid 3.2.0.5, 6 workers: 2500 requests/sec
  squid 3.2.0.5, 7 workers: 2000 requests/sec
  squid 3.2.0.5, 8 workers: 1900 requests/sec
In all these tests the squid process was using 100% of the CPU.

I also tried pulling a larger file (100K instead of 50 bytes) on the thought 
that the test may be bottlenecked on accepting connections, and that with 
something that took more time to service it could do better. However, what I 
found is that with 8 workers, all 8 were using 50% of the CPU at 1000 
requests/sec:
  local machine to itself: 7000 requests/sec
  1 worker: 500 requests/sec
  2 workers: 957 requests/sec
From there it remained at about 1000 requests/sec, with the CPU utilization 
slowly dropping off (but not dropping as fast as it should given the number 
of cores available).

So it looks like there is some significant bottleneck in version 3.2 that 
makes the SMP support fairly ineffective.

In reading the wiki page at wiki.squid-cache.org/Features/SmpScale I see 
you worrying about fairness between workers. If you have put in code to 
try to ensure fairness, you may want to remove it and see what happens to 
performance. What you are describing on that page in terms of fairness is 
what I would expect from a 'first-come-first-served' approach to multiple 
processes grabbing new connections. The worker that last ran is hot in the 
cache and so has an 'unfair' advantage in noticing and processing the new 
request; but as that worker gets busier, it will spend more time servicing 
requests and the other processes will get more of a chance to grab the new 
connection. So it will appear unfair under light load, but become more fair 
under heavy load.
David Lang

Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

2011-05-04 Thread david

ping,

anything new on this issue? (including any patches for me to test?)

David Lang

On Mon, 25 Apr 2011, da...@lang.hm wrote:


Date: Mon, 25 Apr 2011 17:14:52 -0700 (PDT)
From: da...@lang.hm
To: Alex Rousskov rouss...@measurement-factory.com
Cc: Marcos mczue...@yahoo.com.br, squid-users@squid-cache.org,
squid-...@squid-cache.org
Subject: Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

On Mon, 25 Apr 2011, Alex Rousskov wrote:


On 04/25/2011 05:31 PM, da...@lang.hm wrote:

On Mon, 25 Apr 2011, da...@lang.hm wrote:

On Mon, 25 Apr 2011, Alex Rousskov wrote:

On 04/14/2011 09:06 PM, da...@lang.hm wrote:


In addition, there seems to be some sort of locking between the multiple
worker processes in 3.2 when checking the ACLs


There are pretty much no locks in the current official SMP code. This
will change as we start adding shared caches in a week or so, but even
then the ACLs will remain lock-free. There could be some internal
locking in the 3rd-party libraries used by ACLs (regex and such), but I
do not know much about them.


what are the 3rd party libraries that I would be using?


See "ldd squid". Here is a sample based on a randomly picked Squid:

   libnsl, libresolv, libstdc++, libgcc_s, libm, libc, libz, libepol

Please note that I am not saying that any of these have problems in SMP
environment. I am only saying that Squid itself does not lock anything
runtime so if our suspect is SMP-related locks, they would have to
reside elsewhere. The other possibility is that we should suspect
something else, of course. IMHO, it is more likely to be something else:
after all, Squid does not use threads, where such problems are expected.




BTW, do you see more-or-less even load across CPU cores? If not, you may
need a patch that we find useful on older Linux kernels. It is discussed
in the "Will similar workers receive similar amount of work?" section of
http://wiki.squid-cache.org/Features/SmpScale


the load is pretty even across all workers.

With the problems described on that page, I would expect uneven utilization 
at low loads, but at high loads (with the workers busy servicing requests 
rather than waiting for new connections), I would expect the work to even out 
(and the types of hacks described in that section to end up costing 
performance, but not in a way that would scale with the ACL processing load).



One thought I had is that this could be locking on name lookups. How
hard would it be to create a quick patch that bypasses the name
lookups entirely and only does the lookups by IP?


I did not realize your ACLs use DNS lookups. Squid internal DNS code
does not have any runtime SMP locks. However, the presence of DNS
lookups increases the number of suspects.


They don't; everything in my test environment is by IP. But I've seen other 
software that still runs everything through name lookups, even if what's 
presented to the software (both in what's requested and in the ACLs) is all 
done by IPs. It's an easy way to bullet-proof the input (if it's a name it 
gets resolved; if it's an IP, the IP comes back as-is; and it works for IPv4 
and IPv6, with no need for logic that looks at the value and tries to figure 
out whether the user intended to type a name or an IP). I don't know how 
squid works internally (it's a pretty large codebase, and I haven't tried to 
really dive into it), so I don't know if squid does this or not.



A patch you propose does not sound difficult to me, but since I cannot
contribute such a patch soon, it is probably better to test with ACLs
that do not require any DNS lookups instead.



If that regains the speed and/or scalability, it would point fingers
fairly conclusively at the DNS components.

This is the only thing that I can think of that should be shared between
multiple workers processing ACLs.


but it is _not_ currently shared from Squid point of view.


OK, I was assuming from the description of things that there would be one DNS 
process that all the workers would be accessing. From the way it's described 
in the documentation it sounds as if it's already a separate process, so I 
was thinking it was possible that, if each ACL IP address is being put 
through a single DNS process, I could be running into contention on that 
process (and having to do name lookups for both IPv6 and then falling back to 
IPv4 would explain the severe performance hit far more than the difference 
between IPs being 128-bit values instead of 32-bit values).


David Lang




Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

2011-05-04 Thread Alex Rousskov
On 05/04/2011 11:41 AM, da...@lang.hm wrote:

 anything new on this issue? (including any patches for me to test?)

If you mean the "ACLs do not scale well" issue, then I do not have any
free cycles to work on it right now. I was happy to clarify the new SMP
architecture and suggest ways to triage the issue further. Let's hope
somebody else can volunteer to do the required legwork.

Alex.


 On Mon, 25 Apr 2011, da...@lang.hm wrote:
 
 Date: Mon, 25 Apr 2011 17:14:52 -0700 (PDT)
 From: da...@lang.hm
 To: Alex Rousskov rouss...@measurement-factory.com
 Cc: Marcos mczue...@yahoo.com.br, squid-users@squid-cache.org,
 squid-...@squid-cache.org
 Subject: Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

 On Mon, 25 Apr 2011, Alex Rousskov wrote:

 On 04/25/2011 05:31 PM, da...@lang.hm wrote:
 On Mon, 25 Apr 2011, da...@lang.hm wrote:
 On Mon, 25 Apr 2011, Alex Rousskov wrote:
 On 04/14/2011 09:06 PM, da...@lang.hm wrote:

 In addition, there seems to be some sort of locking betwen the
 multiple
 worker processes in 3.2 when checking the ACLs

 There are pretty much no locks in the current official SMP code. This
 will change as we start adding shared caches in a week or so, but
 even
 then the ACLs will remain lock-free. There could be some internal
 locking in the 3rd-party libraries used by ACLs (regex and such),
 but I
 do not know much about them.

 what are the 3rd party libraries that I would be using?

 See ldd squid. Here is a sample based on a randomly picked Squid:

libnsl, libresolv, libstdc++, libgcc_s, libm, libc, libz, libepol

 Please note that I am not saying that any of these have problems in SMP
 environment. I am only saying that Squid itself does not lock anything
 runtime so if our suspect is SMP-related locks, they would have to
 reside elsewhere. The other possibility is that we should suspect
 something else, of course. IMHO, it is more likely to be something else:
 after all, Squid does not use threads, where such problems are expected.


 BTW, do you see more-or-less even load across CPU cores? If not, you may
 need a patch that we find useful on older Linux kernels. It is discussed
 in the Will similar workers receive similar amount of work? section of
 http://wiki.squid-cache.org/Features/SmpScale

 the load is pretty even across all workers.

 with the problems descripted on that page, I would expect uneven
 utilization at low loads, but at high loads (with the workers busy
 serviceing requests rather than waiting for new connections), I would
 expect the work to even out (and the types of hacks described in that
 section to end up costing performance, but not in a way that would
 scale with the ACL processing load)

 one thought I had is that this could be locking on name lookups. how
 hard would it be to create a quick patch that would bypass the name
 lookups entirely and only do the lookups by IP.

 I did not realize your ACLs use DNS lookups. Squid internal DNS code
 does not have any runtime SMP locks. However, the presence of DNS
 lookups increases the number of suspects.

 they don't, everything in my test environment is by IP. But I've seen
 other software that still runs everything through name lookups, even
 if what's presented to the software (both in what's requested and in
 the ACLs) is all done by IPs. It's a easy way to bullet-proof the
 input (if it's a name it gets resolved, if it's an IP, the IP comes
 back as-is, and it works for IPv4 and IPv6, no need to have logic that
 looks at the value and tries to figure out if the user intended to
 type a name or an IP). I don't know how squid is working internally
 (it's a pretty large codebase, and I haven't tried to really dive into
 it) so I don't know if squid does this or not.

 A patch you propose does not sound difficult to me, but since I cannot
 contribute such a patch soon, it is probably better to test with ACLs
 that do not require any DNS lookups instead.


 if that regains the speed and/or scalability it would point fingers
 fairly conclusively at the DNS components.

 this is the only think that I can think of that should be shared
 between
 multiple workers processing ACLs

 but it is _not_ currently shared from Squid point of view.

 Ok, I was assuming from the description of things that there would be
 one DNS process that all the workers would be accessing. from the way
 it's described in the documentation it sounds as if it's already a
 separate process, so I was thinking that it was possible that if each
 ACL IP address is being put through a single DNS process, I could be
 running into contention on that process (and having to do name lookups
 for both IPv6 and then falling back to IPv4 would explain the severe
 performance hit far more than the difference between IPs being 128 bit
 values instead of 32 bit values)

 David Lang





Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

2011-05-04 Thread david
I don't know how many developers are working on squid, so I don't know if 
you are the only person who can do this sort of work or not.


Do you think that I should join the squid-dev list?

David Lang

On Wed, 4 May 2011, Alex Rousskov wrote:


On 05/04/2011 11:41 AM, da...@lang.hm wrote:


anything new on this issue? (including any patches for me to test?)


If you mean the "ACLs do not scale well" issue, then I do not have any
free cycles to work on it right now. I was happy to clarify the new SMP
architecture and suggest ways to triage the issue further. Let's hope
somebody else can volunteer to do the required legwork.

Alex.



[snip - quoted thread trimmed]

Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

2011-05-04 Thread Alex Rousskov
On 05/04/2011 12:49 PM, da...@lang.hm wrote:

 I don't know how many developers are working on squid, so I don't know
 if you are the only person who can do this sort of work or not.

I am sure there are others who can do this. The question is whether you
can quickly find somebody interested enough to spend their time on your
problem. In general, folks work on issues that are important to them or
to their customers. Most active developers donate a lot of free time,
but it still tends to revolve around issues they care about for one
reason or another. We all have to prioritize.


 do you think that I should join the squid-dev list?

I believe your messages are posted to squid-dev so you are not going to
reach a wider audience if you do. If you want to write Squid code,
joining is a good idea!


IMHO, you can maximize your chances of getting free help by isolating
the problem better. For example, perhaps you can try to reproduce it
with different kinds of fast ACLs (the simpler the better!). This will
help clarify whether the problem is specific to IPv6, IP, or ACLs in
general. Test different numbers of ACLs: does the problem happen only
when the number of simple ACLs is huge? Make the problem easier to
reproduce by posting configuration files (including Polygraph workloads
or options for some other benchmarking tool you use).
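
For example (a made-up minimal test config; addresses are placeholders,
not anybody's real rules), growing the number of never-matching entries
before the one that matches isolates the per-rule cost:

    acl filler1 src 10.9.0.1
    acl filler2 src 10.9.0.2
    acl filler3 src 10.9.0.3
    # ... repeat up to fillerN ...
    acl tester src 10.1.1.1
    http_access allow filler1
    http_access allow filler2
    http_access allow filler3
    http_access allow tester
    http_access deny all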

This is not a guarantee that somebody will jump and help you, but fixing
a well-triaged issue is often much easier.


HTH,

Alex.


 On Wed, 4 May 2011, Alex Rousskov wrote:
 
 On 05/04/2011 11:41 AM, da...@lang.hm wrote:

 anything new on this issue? (including any patches for me to test?)

 If you mean the "ACLs do not scale well" issue, then I do not have any
 free cycles to work on it right now.  I was happy to clarify the new SMP
 architecture and suggest ways to triage the issue further. Let's hope
 somebody else can volunteer to do the required legwork.

 Alex.


 On Mon, 25 Apr 2011, da...@lang.hm wrote:

 Date: Mon, 25 Apr 2011 17:14:52 -0700 (PDT)
 From: da...@lang.hm
 To: Alex Rousskov rouss...@measurement-factory.com
 Cc: Marcos mczue...@yahoo.com.br, squid-users@squid-cache.org,
 squid-...@squid-cache.org
 Subject: Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

 On Mon, 25 Apr 2011, Alex Rousskov wrote:

 On 04/25/2011 05:31 PM, da...@lang.hm wrote:
 On Mon, 25 Apr 2011, da...@lang.hm wrote:
 On Mon, 25 Apr 2011, Alex Rousskov wrote:
 On 04/14/2011 09:06 PM, da...@lang.hm wrote:

 In addition, there seems to be some sort of locking between the
 multiple
 worker processes in 3.2 when checking the ACLs

 There are pretty much no locks in the current official SMP code. This
 will change as we start adding shared caches in a week or so, but even
 then the ACLs will remain lock-free. There could be some internal
 locking in the 3rd-party libraries used by ACLs (regex and such), but I
 do not know much about them.

 what are the 3rd party libraries that I would be using?

 See "ldd squid". Here is a sample based on a randomly picked Squid:

libnsl, libresolv, libstdc++, libgcc_s, libm, libc, libz, libepol

 Please note that I am not saying that any of these have problems in an
 SMP environment. I am only saying that Squid itself does not lock
 anything at runtime, so if our suspect is SMP-related locks, they would
 have to reside elsewhere. The other possibility is that we should
 suspect something else, of course. IMHO, it is more likely to be
 something else: after all, Squid does not use threads, where such
 problems are expected.


 BTW, do you see more-or-less even load across CPU cores? If not,
 you may
 need a patch that we find useful on older Linux kernels. It is
 discussed
 in the "Will similar workers receive similar amount of work?"
 section of
 http://wiki.squid-cache.org/Features/SmpScale

 the load is pretty even across all workers.

 with the problems described on that page, I would expect uneven
 utilization at low loads, but at high loads (with the workers busy
 servicing requests rather than waiting for new connections), I would
 expect the work to even out (and the types of hacks described in that
 section to end up costing performance, but not in a way that would
 scale with the ACL processing load)

 one thought I had is that this could be locking on name lookups. how
 hard would it be to create a quick patch that would bypass the name
 lookups entirely and only do the lookups by IP.

 I did not realize your ACLs use DNS lookups. Squid internal DNS code
 does not have any runtime SMP locks. However, the presence of DNS
 lookups increases the number of suspects.

 they don't, everything in my test environment is by IP. But I've seen
 other software that still runs everything through name lookups, even
 if what's presented to the software (both in what's requested and in
 the ACLs) is all done by IPs. It's an easy way to bullet-proof the
 input (if it's a name it gets resolved, if it's an IP, the IP comes
 back as-is, and it works for IPv4

Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

2011-05-04 Thread Amos Jeffries

On Wed, 4 May 2011 11:49:01 -0700 (PDT), da...@lang.hm wrote:

I don't know how many developers are working on squid, so I don't
know if you are the only person who can do this sort of work or not.


4 part-timers and a few others focused on specific areas.



do you think that I should join the squid-dev list?


I thought you had; if you are intending to follow this for long, it 
could be a good idea anyway.
If you have any time to spare on tinkering with optimizations even 
better.


Amos



Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

2011-05-04 Thread david

On Wed, 4 May 2011, Alex Rousskov wrote:


On 05/04/2011 12:49 PM, da...@lang.hm wrote:


I don't know how many developers are working on squid, so I don't know
if you are the only person who can do this sort of work or not.


I am sure there are others who can do this. The question is whether you
can quickly find somebody interested enough to spend their time on your
problem. In general, folks work on issues that are important to them or
to their customers. Most active developers donate a lot of free time,
but it still tends to revolve around issues they care about for one
reason or another. We all have to prioritize.


I do understand this.


do you think that I should join the squid-dev list?


I believe your messages are posted to squid-dev so you are not going to
reach a wider audience if you do. If you want to write Squid code,
joining is a good idea!


I don't really have the time to do coding on this project


IMHO, you can maximize your chances of getting free help by isolating
the problem better. For example, perhaps you can try to reproduce it
with different kinds of fast ACLs (the simpler the better!). This will
help clarify whether the problem is specific to IPv6, IP, or ACLs in
general. Test different numbers of ACLs: does the problem happen only
when the number of simple ACLs is huge? Make the problem easier to
reproduce by posting configuration files (including Polygraph workloads
or options for some other benchmarking tool you use).

This is not a guarantee that somebody will jump and help you, but fixing
a well-triaged issue is often much easier.


that's why I'm speaking up. I just have not known what to test.

are there other types of ACLs that I should be testing?

I'll set up some tests with different numbers of ACLs. Since I've already 
verified that the number of ACLs defined isn't the significant factor, 
only the number tested before one succeeds (by moving the ACL that allows 
my access from the end of the file to the beginning of the file, keeping 
everything else the same), I'll see if the slowdown seems proportional to 
the number of rules, or if there is something else going on.


any other types of testing I should do?

David Lang



HTH,

Alex.



On Wed, 4 May 2011, Alex Rousskov wrote:


On 05/04/2011 11:41 AM, da...@lang.hm wrote:


anything new on this issue? (including any patches for me to test?)


If you mean the "ACLs do not scale well" issue, then I do not have any
free cycles to work on it right now.  I was happy to clarify the new SMP
architecture and suggest ways to triage the issue further. Let's hope
somebody else can volunteer to do the required legwork.

Alex.



On Mon, 25 Apr 2011, da...@lang.hm wrote:


Date: Mon, 25 Apr 2011 17:14:52 -0700 (PDT)
From: da...@lang.hm
To: Alex Rousskov rouss...@measurement-factory.com
Cc: Marcos mczue...@yahoo.com.br, squid-users@squid-cache.org,
squid-...@squid-cache.org
Subject: Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

On Mon, 25 Apr 2011, Alex Rousskov wrote:


On 04/25/2011 05:31 PM, da...@lang.hm wrote:

On Mon, 25 Apr 2011, da...@lang.hm wrote:

On Mon, 25 Apr 2011, Alex Rousskov wrote:

On 04/14/2011 09:06 PM, da...@lang.hm wrote:


In addition, there seems to be some sort of locking between the
multiple
worker processes in 3.2 when checking the ACLs


There are pretty much no locks in the current official SMP code. This
will change as we start adding shared caches in a week or so, but even
then the ACLs will remain lock-free. There could be some internal
locking in the 3rd-party libraries used by ACLs (regex and such), but I
do not know much about them.


what are the 3rd party libraries that I would be using?


See "ldd squid". Here is a sample based on a randomly picked Squid:

   libnsl, libresolv, libstdc++, libgcc_s, libm, libc, libz, libepol

Please note that I am not saying that any of these have problems in an
SMP environment. I am only saying that Squid itself does not lock
anything at runtime, so if our suspect is SMP-related locks, they would
have to reside elsewhere. The other possibility is that we should
suspect something else, of course. IMHO, it is more likely to be
something else: after all, Squid does not use threads, where such
problems are expected.




BTW, do you see more-or-less even load across CPU cores? If not,
you may
need a patch that we find useful on older Linux kernels. It is
discussed
in the "Will similar workers receive similar amount of work?"
section of
http://wiki.squid-cache.org/Features/SmpScale


the load is pretty even across all workers.

with the problems described on that page, I would expect uneven
utilization at low loads, but at high loads (with the workers busy
servicing requests rather than waiting for new connections), I would
expect the work to even out (and the types of hacks described in that
section to end up costing performance, but not in a way that would
scale with the ACL processing load)


one thought I had is that this could be locking on name

Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

2011-05-04 Thread Amos Jeffries

On Wed, 4 May 2011 16:36:08 -0700 (PDT), da...@lang.hm wrote:

On Wed, 4 May 2011, Alex Rousskov wrote:


On 05/04/2011 12:49 PM, da...@lang.hm wrote:


snip
IMHO, you can maximize your chances of getting free help by isolating
the problem better. For example, perhaps you can try to reproduce it
with different kinds of fast ACLs (the simpler the better!). This will
help clarify whether the problem is specific to IPv6, IP, or ACLs in
general. Test different numbers of ACLs: does the problem happen only
when the number of simple ACLs is huge? Make the problem easier to
reproduce by posting configuration files (including Polygraph workloads
or options for some other benchmarking tool you use).

This is not a guarantee that somebody will jump and help you, but
fixing a well-triaged issue is often much easier.


that's why I'm speaking up. I just have not known what to test.

are there other types of ACLs that I should be testing?


We can't answer that without having seen your config file and knowing 
which ACLs are in use now.


The list of all available ACL are at 
http://wiki.squid-cache.org/SquidFaq/SquidAcl and 
http://www.squid-cache.org/Doc/config/acl/




I'll set up some tests with different numbers of ACLs. Since I've
already verified that the number of ACLs defined isn't the significant
factor, only the number tested before one succeeds (by moving the ACL
that allows my access from the end of the file to the beginning of the
file, keeping everything else the same), I'll see if the slowdown
seems proportional to the number of rules, or if there is something
else going on.

any other types of testing I should do?


The above looks like a good benchmark *provided* all the ACLs have the 
same type with consistent content counts. Mixing types makes the result 
non-comparable with other tests.


If you have time (and want to), we kind of need that type of 
benchmarking done for each ACL type. Prioritising by popularity: src/dst 
by IP, port, domain and regex variants. Then proxy_auth, external (the 
fake helpers can help here). Then the others; i.e. browser, proto, 
method, header matching.
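
Something like one rule per type, with made-up values, keeps the runs 
comparable (a sketch only, not a recommended config):

    acl by_src  src       10.1.2.3
    acl by_dst  dst       192.0.2.9
    acl by_port port      8080
    acl by_dom  dstdomain .example.com
    acl by_rex  url_regex -i /secret
    http_access allow by_src by_port
    http_access deny all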


We know general fuzzy details like, for example, a port test is faster 
than a domain test. One with details presented up front by the client is 
also faster than one where a lookup is needed. But we have no deeper info 
to say if a dstdomain test is faster or slower than a src (IP) test.


Way down my TODO list is the dream of micro-benchmarking the ACLs in 
their unit-tests.



Amos


Res: Res: [squid-users] squid 3.2.0.5 smp scaling issues

2011-04-25 Thread Marcos
thanks for your answer David.

I'm seeing too many features being included in squid 3.x, and it's getting 
slower as new features are added.
I think squid 3.2 with 1 worker should be as fast as 2.7, but it's getting 
slower and hungrier.


Marcos


- Original Message -
From: da...@lang.hm da...@lang.hm
To: Marcos mczue...@yahoo.com.br
Cc: Amos Jeffries squ...@treenet.co.nz; squid-users@squid-cache.org; 
squid-...@squid-cache.org
Sent: Friday, 22 April 2011 15:10:44
Subject: Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

ping, I haven't seen a response to this additional information that I sent out 
last week.

squid 3.1 and 3.2 are a significant regression in performance from squid 2.7 or 
3.0

David Lang

On Thu, 14 Apr 2011, da...@lang.hm wrote:

 Subject: Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues
 
 Ok, I finally got a chance to test 2.7STABLE9
 
 it performs about the same as squid 3.0, possibly a little better.
 
 with my somewhat stripped down config (smaller regex patterns, replacing CIDR 
blocks and names that would need to be looked up in /etc/hosts with individual 
IP addresses)
 
 2.7 gives ~4800 requests/sec
 3.0 gives ~4600 requests/sec
 3.2.0.6 with 1 worker gives ~1300 requests/sec
 3.2.0.6 with 5 workers gives ~2800 requests/sec
 
 the numbers for 3.0 are slightly better than what I was getting with the full 
ruleset, but the numbers for 3.2.0.6 are pretty much exactly what I got from 
the last round of tests (with either the full or simplified ruleset)
 
 so 3.1 and 3.2 are a very significant regression from 2.7 or 3.0, and the 
ability to use multiple worker processes in 3.2 doesn't make up for this.
 
 the time taken seems to almost all be in the ACL evaluation as eliminating all 
the ACLs takes 1 worker with 3.2 up to 4200 requests/sec.
 
 one theory is that even though I have IPv6 disabled on this build, the added 
space and more expensive checks needed to compare IPv6 addresses instead of IPv4 
addresses accounts for the single worker drop of ~66%. that seems rather 
expensive, even though there are 293 http_access lines (and one of them uses 
external file contents in its ACLs, so it's a total of ~2400 source/destination 
pairs, however due to the ability to shortcut the comparison the number of tests 
that need to be done should be 400)
 
 
 
 In addition, there seems to be some sort of locking between the multiple worker 
processes in 3.2 when checking the ACLs as the test with almost no ACLs scales 
close to 100% per worker while with the ACLs it scales much more slowly, and 
above 4-5 workers actually drops off dramatically (to the point where with 8 
workers the throughput is down to about what you get with 1-2 workers). I don't 
see any conceptual reason why the ACL checks of the different worker threads 
should impact each other in any way, let alone in a way that limits scalability 
to ~4 workers before adding more workers is a net loss.
 
 David Lang
 
 
 On Wed, 13 Apr 2011, Marcos wrote:
 
 Hi David,
 
 could you run and publish your benchmark with squid 2.7 ???
 I'd like to know if there is any regression between the 2.7 and 3.x series.
 
 thanks.
 
 Marcos
 
 
 - Original Message -
 From: da...@lang.hm da...@lang.hm
 To: Amos Jeffries squ...@treenet.co.nz
 Cc: squid-users@squid-cache.org; squid-...@squid-cache.org
 Sent: Saturday, 9 April 2011 12:56:12
 Subject: Re: [squid-users] squid 3.2.0.5 smp scaling issues
 
 On Sat, 9 Apr 2011, Amos Jeffries wrote:
 
 On 09/04/11 14:27, da...@lang.hm wrote:
 A couple more things about the ACLs used in my test
 
 all of them are allow ACLs (no deny rules to worry about precedence of)
 except for a deny-all at the bottom
 
 the ACL line that permits the test source to the test destination has
 zero overlap with the rest of the rules
 
 every rule has an IP based restriction (even the ones with url_regex are
 source - URL regex)
 
 I moved the ACL that allows my test from the bottom of the ruleset to
 the top and the resulting performance numbers were up as if the other
 ACLs didn't exist. As such it is very clear that 3.2 is evaluating every
 rule.
 
 I changed one of the url_regex rules to just match one line rather than
 a file containing 307 lines to see if that made a difference, and it
 made no significant difference. So this indicates to me that it's not
 having to fully evaluate every rule (it's able to skip doing the regex
 if the IP match doesn't work)
 
 I then changed all the acl lines that used hostnames to have IP
 addresses in them, and this also made no significant difference
 
 I then changed all subnet matches to single IP address (just nuked /##
 throughout the config file) and this also made no significant difference.
 
 
 Squid has always worked this way. It will *test* every rule from the top down 
to the one that matches. Also testing each line left-to-right until one fails or 
the whole line matches.
 
 
 so why are the address matches so

Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

2011-04-25 Thread Alex Rousskov
On 04/14/2011 09:06 PM, da...@lang.hm wrote:
 Ok, I finally got a chance to test 2.7STABLE9
 
 it performs about the same as squid 3.0, possibly a little better.
 
 with my somewhat stripped down config (smaller regex patterns, replacing
 CIDR blocks and names that would need to be looked up in /etc/hosts with
 individual IP addresses)
 
 2.7 gives ~4800 requests/sec
 3.0 gives ~4600 requests/sec
 3.2.0.6 with 1 worker gives ~1300 requests/sec
 3.2.0.6 with 5 workers gives ~2800 requests/sec

Glad you did not see a significant regression between v2.7 and v3.0. We
have heard rather different stories. Every environment is different, and
many lab tests are misguided, of course, but it is still good to hear
positive reports.

The difference between v3.2 and v3.0 is known and has been discussed on
squid-dev. A few specific culprits are also known, but more need to be
identified. We are working on identifying these performance bugs and
reducing that difference.

As for 1 versus 5 worker difference, it seems to be specific to your
environment (as discussed below).


 the numbers for 3.0 are slightly better than what I was getting with the
 full ruleset, but the numbers for 3.2.0.6 are pretty much exactly what I
 got from the last round of tests (with either the full or simplified
 ruleset)
 
 so 3.1 and 3.2 are a very significant regression from 2.7 or 3.0, and
 the ability to use multiple worker processes in 3.2 doesn't make up for
 this.
 
 the time taken seems to almost all be in the ACL evaluation as
 eliminating all the ACLs takes 1 worker with 3.2 up to 4200 requests/sec.

If ACLs are the major culprit in your environment, then this is most
likely not a problem in Squid source code. AFAIK, there are no locks or
other synchronization primitives/overheads when it comes to Squid ACLs.
The solution may lie in optimizing some 3rd-party libraries (used by
ACLs) or in optimizing how they are used by Squid, depending on what
ACLs you use. As far as Squid-specific code is concerned, you should see
nearly linear ACL scale with the number of workers.


 one theory is that even though I have IPv6 disabled on this build, the
 added space and more expensive checks needed to compare IPv6 addresses
 instead of IPv4 addresses accounts for the single worker drop of ~66%.
 that seems rather expensive, even though there are 293 http_access lines
 (and one of them uses external file contents in its ACLs, so it's a
 total of ~2400 source/destination pairs, however due to the ability to
 shortcut the comparison the number of tests that need to be done should
 be 400)

Yes, IPv6 is one of the known major performance regression culprits, but
IPv6 ACLs should still scale linearly with the number of workers, AFAICT.

Please note that I am not an ACL expert. I am just talking from the
overall Squid SMP design point of view and from our testing/deployment
experience point of view.


 In addition, there seems to be some sort of locking between the multiple
 worker processes in 3.2 when checking the ACLs

There are pretty much no locks in the current official SMP code. This
will change as we start adding shared caches in a week or so, but even
then the ACLs will remain lock-free. There could be some internal
locking in the 3rd-party libraries used by ACLs (regex and such), but I
do not know much about them.


HTH,

Alex.


 On Wed, 13 Apr 2011, Marcos wrote:

 Hi David,

 could you run and publish your benchmark with squid 2.7 ???
 I'd like to know if there is any regression between the 2.7 and 3.x series.

 thanks.

 Marcos


 - Original Message -
 From: da...@lang.hm da...@lang.hm
 To: Amos Jeffries squ...@treenet.co.nz
 Cc: squid-users@squid-cache.org; squid-...@squid-cache.org
 Sent: Saturday, 9 April 2011 12:56:12
 Subject: Re: [squid-users] squid 3.2.0.5 smp scaling issues

 On Sat, 9 Apr 2011, Amos Jeffries wrote:

 On 09/04/11 14:27, da...@lang.hm wrote:
 A couple more things about the ACLs used in my test

 all of them are allow ACLs (no deny rules to worry about precedence of)
 except for a deny-all at the bottom

 the ACL line that permits the test source to the test destination has
 zero overlap with the rest of the rules

 every rule has an IP based restriction (even the ones with url_regex are
 source - URL regex)

 I moved the ACL that allows my test from the bottom of the ruleset to
 the top and the resulting performance numbers were up as if the other
 ACLs didn't exist. As such it is very clear that 3.2 is evaluating every
 rule.

 I changed one of the url_regex rules to just match one line rather than
 a file containing 307 lines to see if that made a difference, and it
 made no significant difference. So this indicates to me that it's not
 having to fully evaluate every rule (it's able to skip doing the regex
 if the IP match doesn't work)

 I then changed all the acl lines that used hostnames to have IP
 addresses in them, and this also made no significant difference

 I then changed all subnet

Re: Res: Res: [squid-users] squid 3.2.0.5 smp scaling issues

2011-04-25 Thread david

On Mon, 25 Apr 2011, Marcos wrote:


thanks for your answer David.

I'm seeing too many features being included in squid 3.x, and it's getting 
slower as new features are added.


that's unfortunately fairly normal.

I think squid 3.2 with 1 worker should be as fast as 2.7, but it's getting 
slower and hungrier.


that's one major problem, but the fact that the ACL matching isn't scaling 
with more workers I think is what's killing us.


1 3.2 worker is ~1/3 the speed of 2.7, but with the easy availability of 8+ 
real cores (not hyperthreaded 'fake' cores), you should still be able to 
get ~3x the performance of 2.7 by using 3.2.


unfortunately that's not what's happening, and we end up topping out 
around 1/2-2/3 the performance of 2.7


David Lang



Marcos


- Original Message -
From: da...@lang.hm da...@lang.hm
To: Marcos mczue...@yahoo.com.br
Cc: Amos Jeffries squ...@treenet.co.nz; squid-users@squid-cache.org; 
squid-...@squid-cache.org

Sent: Friday, 22 April 2011 15:10:44
Subject: Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

ping, I haven't seen a response to this additional information that I sent out 
last week.


squid 3.1 and 3.2 are a significant regression in performance from squid 2.7 or 
3.0


David Lang

On Thu, 14 Apr 2011, da...@lang.hm wrote:


Subject: Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

Ok, I finally got a chance to test 2.7STABLE9

it performs about the same as squid 3.0, possibly a little better.

with my somewhat stripped down config (smaller regex patterns, replacing CIDR 
blocks and names that would need to be looked up in /etc/hosts with individual 
IP addresses)


2.7 gives ~4800 requests/sec
3.0 gives ~4600 requests/sec
3.2.0.6 with 1 worker gives ~1300 requests/sec
3.2.0.6 with 5 workers gives ~2800 requests/sec

the numbers for 3.0 are slightly better than what I was getting with the full 
ruleset, but the numbers for 3.2.0.6 are pretty much exactly what I got from the 
last round of tests (with either the full or simplified ruleset)


so 3.1 and 3.2 are a very significant regression from 2.7 or 3.0, and the 
ability to use multiple worker processes in 3.2 doesn't make up for this.


the time taken seems to almost all be in the ACL evaluation as eliminating all 
the ACLs takes 1 worker with 3.2 up to 4200 requests/sec.


one theory is that even though I have IPv6 disabled on this build, the added 
space and more expensive checks needed to compare IPv6 addresses instead of IPv4 
addresses accounts for the single worker drop of ~66%. that seems rather 
expensive, even though there are 293 http_access lines (and one of them uses 
external file contents in its ACLs, so it's a total of ~2400 source/destination 
pairs, however due to the ability to shortcut the comparison the number of tests 
that need to be done should be 400)




In addition, there seems to be some sort of locking between the multiple worker 
processes in 3.2 when checking the ACLs as the test with almost no ACLs scales 
close to 100% per worker while with the ACLs it scales much more slowly, and 
above 4-5 workers actually drops off dramatically (to the point where with 8 
workers the throughput is down to about what you get with 1-2 workers) I don't 
see any conceptual reason why the ACL checks of the different worker threads 
should impact each other in any way, let alone in a way that limits scalability 
to ~4 workers before adding more workers is a net loss.


David Lang



On Wed, 13 Apr 2011, Marcos wrote:


Hi David,

could you run and publish your benchmark with squid 2.7 ???
I'd like to know if there is any regression between the 2.7 and 3.x series.

thanks.

Marcos


- Original Message -
From: da...@lang.hm da...@lang.hm
To: Amos Jeffries squ...@treenet.co.nz
Cc: squid-users@squid-cache.org; squid-...@squid-cache.org
Sent: Saturday, 9 April 2011 12:56:12
Subject: Re: [squid-users] squid 3.2.0.5 smp scaling issues

On Sat, 9 Apr 2011, Amos Jeffries wrote:


On 09/04/11 14:27, da...@lang.hm wrote:

A couple more things about the ACLs used in my test

all of them are allow ACLs (no deny rules to worry about precedence of)
except for a deny-all at the bottom

the ACL line that permits the test source to the test destination has
zero overlap with the rest of the rules

every rule has an IP based restriction (even the ones with url_regex are
source - URL regex)

I moved the ACL that allows my test from the bottom of the ruleset to
the top and the resulting performance numbers were up as if the other
ACLs didn't exist. As such it is very clear that 3.2 is evaluating every
rule.

I changed one of the url_regex rules to just match one line rather than
a file containing 307 lines to see if that made a difference, and it
made no significant difference. So this indicates to me that it's not
having to fully evaluate every rule (it's able to skip doing the regex
if the IP match doesn't work)

I then changed all the acl lines that used hostnames

Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

2011-04-25 Thread david

On Mon, 25 Apr 2011, Alex Rousskov wrote:


On 04/14/2011 09:06 PM, da...@lang.hm wrote:

Ok, I finally got a chance to test 2.7STABLE9

it performs about the same as squid 3.0, possibly a little better.

with my somewhat stripped down config (smaller regex patterns, replacing
CIDR blocks and names that would need to be looked up in /etc/hosts with
individual IP addresses)

2.7 gives ~4800 requests/sec
3.0 gives ~4600 requests/sec
3.2.0.6 with 1 worker gives ~1300 requests/sec
3.2.0.6 with 5 workers gives ~2800 requests/sec


Glad you did not see a significant regression between v2.7 and v3.0. We
have heard rather different stories. Every environment is different, and
many lab tests are misguided, of course, but it is still good to hear
positive reports.

The difference between v3.2 and v3.0 is known and has been discussed on
squid-dev. A few specific culprits are also known, but more need to be
identified. We are working on identifying these performance bugs and
reducing that difference.


let me know if there are any tests that I can run that will help you.


As for 1 versus 5 worker difference, it seems to be specific to your
environment (as discussed below).



the numbers for 3.0 are slightly better than what I was getting with the
full ruleset, but the numbers for 3.2.0.6 are pretty much exactly what I
got from the last round of tests (with either the full or simplified
ruleset)

so 3.1 and 3.2 are a very significant regression from 2.7 or 3.0, and
the ability to use multiple worker processes in 3.2 doesn't make up for
this.

the time taken seems to almost all be in the ACL evaluation as
eliminating all the ACLs takes 1 worker with 3.2 up to 4200 requests/sec.


If ACLs are the major culprit in your environment, then this is most
likely not a problem in Squid source code. AFAIK, there are no locks or
other synchronization primitives/overheads when it comes to Squid ACLs.
The solution may lie in optimizing some 3rd-party libraries (used by
ACLs) or in optimizing how they are used by Squid, depending on what
ACLs you use. As far as Squid-specific code is concerned, you should see
nearly linear ACL scale with the number of workers.


given that my ACLs are IP/port matches or regex matches (and I've tested 
replacing the regex matches with IP matches with no significant change in 
performance), what components would be used?





one theory is that even though I have IPv6 disabled on this build, the
added space and more expensive checks needed to compare IPv6 addresses
instead of IPv4 addresses accounts for the single worker drop of ~66%.
that seems rather expensive, even though there are 293 http_access lines
(and one of them uses external file contents in its ACLs, so it's a
total of ~2400 source/destination pairs, however due to the ability to
shortcut the comparison the number of tests that need to be done should
be 400)


Yes, IPv6 is one of the known major performance regression culprits, but
IPv6 ACLs should still scale linearly with the number of workers, AFAICT.

Please note that I am not an ACL expert. I am just talking from the
overall Squid SMP design point of view and from our testing/deployment
experience point of view.


that makes sense and is what I would have expected, but in my case (lots 
of ACLs) I am seeing a definite problem with more workers not completing 
more work, and beyond about 5 workers I am seeing the total work being 
completed drop. I can't think of any reason besides locking that this may 
be the case.



In addition, there seems to be some sort of locking between the multiple
worker processes in 3.2 when checking the ACLs


There are pretty much no locks in the current official SMP code. This
will change as we start adding shared caches in a week or so, but even
then the ACLs will remain lock-free. There could be some internal
locking in the 3rd-party libraries used by ACLs (regex and such), but I
do not know much about them.


what are the 3rd party libraries that I would be using?

David Lang



HTH,

Alex.



On Wed, 13 Apr 2011, Marcos wrote:


Hi David,

could you run and publish your benchmark with squid 2.7 ???
I'd like to know if there is any regression between the 2.7 and 3.x series.

thanks.

Marcos


- Original Message -
From: da...@lang.hm da...@lang.hm
To: Amos Jeffries squ...@treenet.co.nz
Cc: squid-users@squid-cache.org; squid-...@squid-cache.org
Sent: Saturday, 9 April 2011 12:56:12
Subject: Re: [squid-users] squid 3.2.0.5 smp scaling issues

On Sat, 9 Apr 2011, Amos Jeffries wrote:


On 09/04/11 14:27, da...@lang.hm wrote:

A couple more things about the ACLs used in my test

all of them are allow ACLs (no deny rules to worry about precedence of)
except for a deny-all at the bottom

the ACL line that permits the test source to the test destination has
zero overlap with the rest of the rules

every rule has an IP based restriction (even the ones with url_regex are
source - URL regex)

I moved the ACL that allows my test from the bottom

Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

2011-04-25 Thread david

On Mon, 25 Apr 2011, da...@lang.hm wrote:


On Mon, 25 Apr 2011, Alex Rousskov wrote:


On 04/14/2011 09:06 PM, da...@lang.hm wrote:


In addition, there seems to be some sort of locking between the multiple
worker processes in 3.2 when checking the ACLs


There are pretty much no locks in the current official SMP code. This
will change as we start adding shared caches in a week or so, but even
then the ACLs will remain lock-free. There could be some internal
locking in the 3rd-party libraries used by ACLs (regex and such), but I
do not know much about them.


what are the 3rd party libraries that I would be using?


one thought I had is that this could be locking on name lookups. how hard 
would it be to create a quick patch that would bypass the name lookups 
entirely and only do the lookups by IP.


if that regains the speed and/or scalability it would point fingers fairly 
conclusively at the DNS components.


this is the only thing that I can think of that should be shared between 
multiple workers processing ACLs


David Lang


Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

2011-04-25 Thread Alex Rousskov
On 04/25/2011 05:31 PM, da...@lang.hm wrote:
 On Mon, 25 Apr 2011, da...@lang.hm wrote: 
 On Mon, 25 Apr 2011, Alex Rousskov wrote:
 On 04/14/2011 09:06 PM, da...@lang.hm wrote:

 In addition, there seems to be some sort of locking between the multiple
 worker processes in 3.2 when checking the ACLs

 There are pretty much no locks in the current official SMP code. This
 will change as we start adding shared caches in a week or so, but even
 then the ACLs will remain lock-free. There could be some internal
 locking in the 3rd-party libraries used by ACLs (regex and such), but I
 do not know much about them.

 what are the 3rd party libraries that I would be using?

See "ldd squid". Here is a sample based on a randomly picked Squid:

libnsl, libresolv, libstdc++, libgcc_s, libm, libc, libz, libepol

Please note that I am not saying that any of these have problems in an SMP
environment. I am only saying that Squid itself does not lock anything at
runtime, so if our suspect is SMP-related locks, they would have to
reside elsewhere. The other possibility is that we should suspect
something else, of course. IMHO, it is more likely to be something else:
after all, Squid does not use threads, where such problems are expected.

BTW, do you see more-or-less even load across CPU cores? If not, you may
need a patch that we find useful on older Linux kernels. It is discussed
in the "Will similar workers receive similar amount of work?" section of
http://wiki.squid-cache.org/Features/SmpScale


 one thought I had is that this could be locking on name lookups. how
 hard would it be to create a quick patch that would bypass the name
 lookups entirely and only do the lookups by IP.

I did not realize your ACLs use DNS lookups. Squid internal DNS code
does not have any runtime SMP locks. However, the presence of DNS
lookups increases the number of suspects.

A patch you propose does not sound difficult to me, but since I cannot
contribute such a patch soon, it is probably better to test with ACLs
that do not require any DNS lookups instead.


 if that regains the speed and/or scalability it would point fingers
 fairly conclusively at the DNS components.
 
 this is the only thing that I can think of that should be shared between
 multiple workers processing ACLs

but it is _not_ currently shared from Squid point of view.


Cheers,

Alex.


Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

2011-04-25 Thread david

On Mon, 25 Apr 2011, Alex Rousskov wrote:


On 04/25/2011 05:31 PM, da...@lang.hm wrote:

On Mon, 25 Apr 2011, da...@lang.hm wrote:

On Mon, 25 Apr 2011, Alex Rousskov wrote:

On 04/14/2011 09:06 PM, da...@lang.hm wrote:


In addition, there seems to be some sort of locking between the multiple
worker processes in 3.2 when checking the ACLs


There are pretty much no locks in the current official SMP code. This
will change as we start adding shared caches in a week or so, but even
then the ACLs will remain lock-free. There could be some internal
locking in the 3rd-party libraries used by ACLs (regex and such), but I
do not know much about them.


what are the 3rd party libraries that I would be using?


See "ldd squid". Here is a sample based on a randomly picked Squid:

   libnsl, libresolv, libstdc++, libgcc_s, libm, libc, libz, libepol

Please note that I am not saying that any of these have problems in an SMP
environment. I am only saying that Squid itself does not lock anything at
runtime, so if our suspect is SMP-related locks, they would have to
reside elsewhere. The other possibility is that we should suspect
something else, of course. IMHO, it is more likely to be something else:
after all, Squid does not use threads, where such problems are expected.




BTW, do you see more-or-less even load across CPU cores? If not, you may
need a patch that we find useful on older Linux kernels. It is discussed
in the "Will similar workers receive similar amount of work?" section of
http://wiki.squid-cache.org/Features/SmpScale


the load is pretty even across all workers.

with the problems described on that page, I would expect uneven 
utilization at low loads, but at high loads (with the workers busy 
servicing requests rather than waiting for new connections), I would 
expect the work to even out (and the types of hacks described in that 
section to end up costing performance, but not in a way that would scale 
with the ACL processing load)



one thought I had is that this could be locking on name lookups. how
hard would it be to create a quick patch that would bypass the name
lookups entirely and only do the lookups by IP.


I did not realize your ACLs use DNS lookups. Squid internal DNS code
does not have any runtime SMP locks. However, the presence of DNS
lookups increases the number of suspects.


they don't, everything in my test environment is by IP. But I've seen 
other software that still runs everything through name lookups, even if 
what's presented to the software (both in what's requested and in the 
ACLs) is all done by IPs. It's an easy way to bullet-proof the input (if 
it's a name it gets resolved, if it's an IP, the IP comes back as-is, and 
it works for IPv4 and IPv6, no need to have logic that looks at the value 
and tries to figure out if the user intended to type a name or an IP). I 
don't know how squid is working internally (it's a pretty large codebase, 
and I haven't tried to really dive into it) so I don't know if squid does 
this or not.



A patch you propose does not sound difficult to me, but since I cannot
contribute such a patch soon, it is probably better to test with ACLs
that do not require any DNS lookups instead.



if that regains the speed and/or scalability it would point fingers
fairly conclusively at the DNS components.

this is the only thing that I can think of that should be shared between
multiple workers processing ACLs


but it is _not_ currently shared from Squid point of view.


Ok, I was assuming from the description of things that there would be one 
DNS process that all the workers would be accessing. from the way it's 
described in the documentation it sounds as if it's already a separate 
process, so I was thinking that it was possible that if each ACL IP 
address is being put through a single DNS process, I could be running into 
contention on that process (and having to do name lookups for both IPv6 
and then falling back to IPv4 would explain the severe performance hit far 
more than the difference between IPs being 128 bit values instead of 32 
bit values)


David Lang



Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

2011-04-25 Thread Alex Rousskov
On 04/25/2011 06:14 PM, da...@lang.hm wrote:
 if that regains the speed and/or scalability it would point fingers
 fairly conclusively at the DNS components.

 this is the only thing that I can think of that should be shared between
 multiple workers processing ACLs

 but it is _not_ currently shared from Squid point of view.
 
 Ok, I was assuming from the description of things that there would be
 one DNS process that all the workers would be accessing. from the way
 it's described in the documentation it sounds as if it's already a
 separate process

I would like to fix that documentation, but I cannot find what phrase
led you to the above conclusion. The SmpScale wiki page says:

 Currently, Squid workers do not share and do not synchronize other
 resources or services, including:
 
 * DNS caches (ipcache and fqdncache);

So that seems to be correct and clear. Which documentation are you
referring to?


Thank you,

Alex.


Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

2011-04-25 Thread david

On Mon, 25 Apr 2011, Alex Rousskov wrote:


On 04/25/2011 06:14 PM, da...@lang.hm wrote:

if that regains the speed and/or scalability it would point fingers
fairly conclusively at the DNS components.

this is the only thing that I can think of that should be shared between
multiple workers processing ACLs


but it is _not_ currently shared from Squid point of view.


Ok, I was assuming from the description of things that there would be
one DNS process that all the workers would be accessing. from the way
it's described in the documentation it sounds as if it's already a
separate process


I would like to fix that documentation, but I cannot find what phrase
led you to the above conclusion. The SmpScale wiki page says:


Currently, Squid workers do not share and do not synchronize other
resources or services, including:

* DNS caches (ipcache and fqdncache);


So that seems to be correct and clear. Which documentation are you
referring to?


ahh, I missed that, I was going by the description of the config options 
that configure and disable the DNS cache (they don't say anything about 
the SMP mode, but I read them to imply that the squid-internal DNS cache 
was a separate thread/process)
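
(the options I mean are presumably the per-process cache knobs, e.g. 
something like this in squid.conf -- values here are arbitrary, just 
for illustration:

    ipcache_size 4096
    fqdncache_size 4096

nothing in their descriptions says anything about how they behave 
across workers in SMP mode)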


David Lang


Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

2011-04-22 Thread david
ping, I haven't seen a response to this additional information that I sent 
out last week.


squid 3.1 and 3.2 are a significant regression in performance from squid 
2.7 or 3.0


David Lang

On Thu, 14 Apr 2011, da...@lang.hm wrote:


Subject: Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

Ok, I finally got a chance to test 2.7STABLE9

it performs about the same as squid 3.0, possibly a little better.

with my somewhat stripped down config (smaller regex patterns, replacing CIDR 
blocks and names that would need to be looked up in /etc/hosts with 
individual IP addresses)


2.7 gives ~4800 requests/sec
3.0 gives ~4600 requests/sec
3.2.0.6 with 1 worker gives ~1300 requests/sec
3.2.0.6 with 5 workers gives ~2800 requests/sec

the numbers for 3.0 are slightly better than what I was getting with the full 
ruleset, but the numbers for 3.2.0.6 are pretty much exactly what I got from 
the last round of tests (with either the full or simplified ruleset)


so 3.1 and 3.2 are a very significant regression from 2.7 or 3.0, and the 
ability to use multiple worker processes in 3.2 doesn't make up for this.


the time taken seems to almost all be in the ACL evaluation as eliminating 
all the ACLs takes 1 worker with 3.2 up to 4200 requests/sec.


one theory is that even though I have IPv6 disabled on this build, the added 
space and more expensive checks needed to compare IPv6 addresses instead of 
IPv4 addresses accounts for the single worker drop of ~66%. that seems rather 
expensive, even though there are 293 http_access lines (and one of them uses 
external file contents in its ACLs, so it's a total of ~2400 
source/destination pairs, however due to the ability to shortcut the 
comparison the number of tests that need to be done should be 400)




In addition, there seems to be some sort of locking between the multiple 
worker processes in 3.2 when checking the ACLs as the test with almost no 
ACLs scales close to 100% per worker while with the ACLs it scales much more 
slowly, and above 4-5 workers actually drops off dramatically (to the point 
where with 8 workers the throughput is down to about what you get with 1-2 
workers) I don't see any conceptual reason why the ACL checks of the 
different worker threads should impact each other in any way, let alone in a 
way that limits scalability to ~4 workers before adding more workers is a net 
loss.


David Lang



On Wed, 13 Apr 2011, Marcos wrote:


Hi David,

could you run and publish your benchmark with squid 2.7 ???
I'd like to know if there is any regression between the 2.7 and 3.x series.

thanks.

Marcos


- Original Message -
From: da...@lang.hm da...@lang.hm
To: Amos Jeffries squ...@treenet.co.nz
Cc: squid-users@squid-cache.org; squid-...@squid-cache.org
Sent: Saturday, 9 April 2011 12:56:12
Subject: Re: [squid-users] squid 3.2.0.5 smp scaling issues

On Sat, 9 Apr 2011, Amos Jeffries wrote:


On 09/04/11 14:27, da...@lang.hm wrote:

A couple more things about the ACLs used in my test

all of them are allow ACLs (no deny rules to worry about precedence of)
except for a deny-all at the bottom

the ACL line that permits the test source to the test destination has
zero overlap with the rest of the rules

every rule has an IP based restriction (even the ones with url_regex are
source - URL regex)

I moved the ACL that allows my test from the bottom of the ruleset to
the top and the resulting performance numbers were up as if the other
ACLs didn't exist. As such it is very clear that 3.2 is evaluating every
rule.

I changed one of the url_regex rules to just match one line rather than
a file containing 307 lines to see if that made a difference, and it
made no significant difference. So this indicates to me that it's not
having to fully evaluate every rule (it's able to skip doing the regex
if the IP match doesn't work)

I then changed all the acl lines that used hostnames to have IP
addresses in them, and this also made no significant difference

I then changed all subnet matches to single IP address (just nuked /##
throughout the config file) and this also made no significant 
difference.




Squid has always worked this way. It will *test* every rule from the top 
down to the one that matches. Also testing each line left-to-right until 
one fails or the whole line matches.
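
For example, with two hypothetical rules:

    # left-to-right: if "lan" fails to match, "safe_ports" is never tested
    http_access allow lan safe_ports
    # top-down: reached only when the whole line above failed to match
    http_access deny all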




so why are the address matches so expensive



3.0 and older IP address is a 32-bit comparison.
3.1 and newer IP address is a 128-bit comparison with memcmp().

If something like a word-wise comparison can be implemented faster than 
memcmp() we would welcome it.


I wonder if there should be a different version that's used when IPv6 is 
disabled. this is a pretty large hit.


if the data is aligned properly, on a 64 bit system this should still only 
be 2 compares. do you do any alignment on the data now?
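
something like this sketch is what I have in mind (hypothetical code, 
making no assumptions about how squid actually stores addresses -- 
memcpy keeps it safe even for unaligned storage, and compilers turn 
these into plain 64-bit loads):

    #include <stdint.h>
    #include <string.h>

    /* sketch: compare two 128-bit addresses as two 64-bit words
     * instead of calling memcmp(); illustrative only */
    static inline int addr128_equal(const unsigned char a[16],
                                    const unsigned char b[16])
    {
        uint64_t a0, a1, b0, b1;
        memcpy(&a0, a,     8);
        memcpy(&a1, a + 8, 8);
        memcpy(&b0, b,     8);
        memcpy(&b1, b + 8, 8);
        return a0 == b0 && a1 == b1;
    }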



and as noted in the e-mail below, why do these checks not scale nicely
with the number of worker processes? If they did, the fact that one 3.2
process

Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

2011-04-14 Thread david

Ok, I finally got a chance to test 2.7STABLE9

it performs about the same as squid 3.0, possibly a little better.

with my somewhat stripped down config (smaller regex patterns, replacing 
CIDR blocks and names that would need to be looked up in /etc/hosts with 
individual IP addresses)


2.7 gives ~4800 requests/sec
3.0 gives ~4600 requests/sec
3.2.0.6 with 1 worker gives ~1300 requests/sec
3.2.0.6 with 5 workers gives ~2800 requests/sec

the numbers for 3.0 are slightly better than what I was getting with the 
full ruleset, but the numbers for 3.2.0.6 are pretty much exactly what I 
got from the last round of tests (with either the full or simplified 
ruleset)


so 3.1 and 3.2 are a very significant regression from 2.7 or 3.0, and the 
ability to use multiple worker processes in 3.2 doesn't make up for this.


the time taken seems to almost all be in the ACL evaluation as eliminating 
all the ACLs takes 1 worker with 3.2 up to 4200 requests/sec.


one theory is that even though I have IPv6 disabled on this build, the 
added space and more expensive checks needed to compare IPv6 addresses 
instead of IPv4 addresses accounts for the single worker drop of ~66%. 
that seems rather expensive, even though there are 293 http_access lines 
(and one of them uses external file contents in its ACLs, so it's a total 
of ~2400 source/destination pairs, however due to the ability to shortcut 
the comparison the number of tests that need to be done should be 400)




In addition, there seems to be some sort of locking between the multiple 
worker processes in 3.2 when checking the ACLs as the test with almost no 
ACLs scales close to 100% per worker while with the ACLs it scales much 
more slowly, and above 4-5 workers actually drops off dramatically (to the 
point where with 8 workers the throughput is down to about what you get 
with 1-2 workers) I don't see any conceptual reason why the ACL checks of 
the different worker threads should impact each other in any way, let 
alone in a way that limits scalability to ~4 workers before adding more 
workers is a net loss.


David Lang



On Wed, 13 Apr 2011, Marcos wrote:


Hi David,

could you run and publish your benchmark with squid 2.7 ???
I'd like to know if there is any regression between the 2.7 and 3.x series.

thanks.

Marcos


- Original Message -
From: da...@lang.hm da...@lang.hm
To: Amos Jeffries squ...@treenet.co.nz
Cc: squid-users@squid-cache.org; squid-...@squid-cache.org
Sent: Saturday, 9 April 2011 12:56:12
Subject: Re: [squid-users] squid 3.2.0.5 smp scaling issues

On Sat, 9 Apr 2011, Amos Jeffries wrote:


On 09/04/11 14:27, da...@lang.hm wrote:

A couple more things about the ACLs used in my test

all of them are allow ACLs (no deny rules to worry about precedence of)
except for a deny-all at the bottom

the ACL line that permits the test source to the test destination has
zero overlap with the rest of the rules

every rule has an IP based restriction (even the ones with url_regex are
source - URL regex)

I moved the ACL that allows my test from the bottom of the ruleset to
the top and the resulting performance numbers were up as if the other
ACLs didn't exist. As such it is very clear that 3.2 is evaluating every
rule.

I changed one of the url_regex rules to just match one line rather than
a file containing 307 lines to see if that made a difference, and it
made no significant difference. So this indicates to me that it's not
having to fully evaluate every rule (it's able to skip doing the regex
if the IP match doesn't work)

I then changed all the acl lines that used hostnames to have IP
addresses in them, and this also made no significant difference

I then changed all subnet matches to single IP address (just nuked /##
throughout the config file) and this also made no significant difference.



Squid has always worked this way. It will *test* every rule from the top 
down to the one that matches. Also testing each line left-to-right until 
one fails or the whole line matches.




so why are the address matches so expensive



3.0 and older IP address is a 32-bit comparison.
3.1 and newer IP address is a 128-bit comparison with memcmp().

If something like a word-wise comparison can be implemented faster than 
memcmp() we would welcome it.


I wonder if there should be a different version that's used when IPv6 is 
disabled. this is a pretty large hit.


if the data is aligned properly, on a 64 bit system this should still only 
be 2 compares. do you do any alignment on the data now?



and as noted in the e-mail below, why do these checks not scale nicely
with the number of worker processes? If they did, the fact that one 3.2
process is about 1/3 the speed of a 3.0 process in checking the acls
wouldn't matter nearly as much when it's so easy to get an 8+ core 
system.




There you have the unknown.


I think this is a fairly critical thing to figure out.


Res: [squid-users] squid 3.2.0.5 smp scaling issues

2011-04-13 Thread Marcos
Hi David,

could you run and publish your benchmark with squid 2.7 ???
I'd like to know if there is any regression between the 2.7 and 3.x series.

thanks.

Marcos


- Original Message -
From: da...@lang.hm da...@lang.hm
To: Amos Jeffries squ...@treenet.co.nz
Cc: squid-users@squid-cache.org; squid-...@squid-cache.org
Sent: Saturday, 9 April 2011 12:56:12
Subject: Re: [squid-users] squid 3.2.0.5 smp scaling issues

On Sat, 9 Apr 2011, Amos Jeffries wrote:

 On 09/04/11 14:27, da...@lang.hm wrote:
 A couple more things about the ACLs used in my test
 
 all of them are allow ACLs (no deny rules to worry about precedence of)
 except for a deny-all at the bottom
 
 the ACL line that permits the test source to the test destination has
 zero overlap with the rest of the rules
 
 every rule has an IP based restriction (even the ones with url_regex are
 source - URL regex)
 
 I moved the ACL that allows my test from the bottom of the ruleset to
 the top and the resulting performance numbers were up as if the other
 ACLs didn't exist. As such it is very clear that 3.2 is evaluating every
 rule.
 
 I changed one of the url_regex rules to just match one line rather than
 a file containing 307 lines to see if that made a difference, and it
 made no significant difference. So this indicates to me that it's not
 having to fully evaluate every rule (it's able to skip doing the regex
 if the IP match doesn't work)
 
 I then changed all the acl lines that used hostnames to have IP
 addresses in them, and this also made no significant difference
 
 I then changed all subnet matches to single IP address (just nuked /##
 throughout the config file) and this also made no significant difference.
 
 
 Squid has always worked this way. It will *test* every rule from the top down 
to the one that matches. Also testing each line left-to-right until one fails or 
the whole line matches.
 
 
 so why are the address matches so expensive
 
 
 3.0 and older IP address is a 32-bit comparison.
 3.1 and newer IP address is a 128-bit comparison with memcmp().
 
 If something like a word-wise comparison can be implemented faster than 
memcmp() we would welcome it.

I wonder if there should be a different version that's used when IPv6 is 
disabled. this is a pretty large hit.

if the data is aligned properly, on a 64 bit system this should still only be 2 
compares. do you do any alignment on the data now?

 and as noted in the e-mail below, why do these checks not scale nicely
 with the number of worker processes? If they did, the fact that one 3.2
 process is about 1/3 the speed of a 3.0 process in checking the acls
 wouldn't matter nearly as much when it's so easy to get an 8+ core system.
 
 
 There you have the unknown.

I think this is a fairly critical thing to figure out.

 
 it seems to me that all accept/deny rules in a set should be able to be
 combined into a tree to make searching them very fast.
 
 so for example if you have
 
 accept 1
 accept 2
 deny 3
 deny 4
 accept 5
 
 you need to create three trees (one with accept 1 and accept 2, one with
 deny3 and deny4, and one with accept 5) and then check each tree to see
 if you have a match.
 
 the types of match could be done in order of increasing cost, so if you
 
 The config file is a specific structure configured by the admin under 
guaranteed rules of operation for access lines (top-down, left-to-right, 
first-match-wins) to perform boolean-logic calculations using ACL sets.
 Sorting access line rules is not an option.
 Sorting ACL values and tree-forming them is already done (regex being the one 
exception AFAIK).
 Sorting position-wise on a single access line is also ruled out by 
interactions with deny_info, auth and external ACL types.

It would seem that as long as you don't cross boundaries between the different 
types, you should be able to optimize within a group.

using my example above, you couldn't combine the 'accept 5' with any of the 
other accepts, but you could combine accept 1 and 2 and combine deny 3 and 4 
together.

now, I know that I don't fully understand all the possible ACL types, so this 
may not work for some of them, but I believe that a fairly common use case is 
to have either a lot of allow rules, or a lot of deny rules as a block (either 
a list of sites you are allowed to access, or a list of sites that are 
blocked), so an ability to optimize these use cases may be well worth it.
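
As a rough sketch of this grouping idea, assuming a run of rules that differ 
only in the source address they match (the types and names here are 
illustrative, not Squid's):

#include <cstdint>
#include <set>
#include <utility>
#include <vector>

enum class Action { Allow, Deny };

struct RuleGroup {
    Action action;
    std::set<std::uint32_t> srcAddrs;   // sorted: membership test is O(log n)
};

// Merge consecutive same-action rules into one group; checking groups in
// build order preserves the first-match-wins semantics of the rule list.
std::vector<RuleGroup> buildGroups(
    const std::vector<std::pair<Action, std::uint32_t>> &rules)
{
    std::vector<RuleGroup> groups;
    for (const auto &r : rules) {
        if (groups.empty() || groups.back().action != r.first)
            groups.push_back(RuleGroup{r.first, {}});
        groups.back().srcAddrs.insert(r.second);
    }
    return groups;
}

Action check(const std::vector<RuleGroup> &groups, std::uint32_t src)
{
    for (const auto &g : groups)
        if (g.srcAddrs.count(src))
            return g.action;
    return Action::Deny;                // the deny-all at the bottom
}

With the accept 1 / accept 2 / deny 3 / deny 4 / accept 5 example this yields 
exactly the three groups described, checked in order.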

 have acl entries of type port, src, dst, and url regex, organize the
 tree so that you check ports first, then src, then dst, then only if all
 that matches do you need to do the regex. This would be very similar to
 the shortcut logic that you use today with a single rule where you bail
 out when you don't find a match.
 
 you could go with a complex tree structure, but since this only needs to
 be changed at boot time,
 
 Um, boot/startup time and arbitrary -k reconfigure times.
 With a reverse-configuration display dump on any cache manager request

Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

2011-04-13 Thread david
sorry, haven't had time to do that yet. I will try and get this done 
today.


David Lang

On Wed, 13 Apr 2011, Marcos wrote:


Date: Wed, 13 Apr 2011 04:11:09 -0700 (PDT)
From: Marcos mczue...@yahoo.com.br
To: da...@lang.hm, Amos Jeffries squ...@treenet.co.nz
Cc: squid-users@squid-cache.org, squid-...@squid-cache.org
Subject: Res: [squid-users] squid 3.2.0.5 smp scaling issues

Hi David,

could you run and publish your benchmark with squid 2.7?
I'd like to know if there is any regression between the 2.7 and 3.x series.

thanks.

Marcos


- Original Message 
From: da...@lang.hm da...@lang.hm
To: Amos Jeffries squ...@treenet.co.nz
Cc: squid-users@squid-cache.org; squid-...@squid-cache.org
Sent: Saturday, 9 April 2011 12:56:12
Subject: Re: [squid-users] squid 3.2.0.5 smp scaling issues

On Sat, 9 Apr 2011, Amos Jeffries wrote:


On 09/04/11 14:27, da...@lang.hm wrote:

A couple more things about the ACLs used in my test

all of them are allow ACLs (no deny rules to worry about precedence of)
except for a deny-all at the bottom

the ACL line that permits the test source to the test destination has
zero overlap with the rest of the rules

every rule has an IP based restriction (even the ones with url_regex are
source - URL regex)

I moved the ACL that allows my test from the bottom of the ruleset to
the top and the resulting performance numbers were up as if the other
ACLs didn't exist. As such it is very clear that 3.2 is evaluating every
rule.

I changed one of the url_regex rules to just match one line rather than
a file containing 307 lines to see if that made a difference, and it
made no significant difference. So this indicates to me that it's not
having to fully evaluate every rule (it's able to skip doing the regex
if the IP match doesn't work)

I then changed all the acl lines that used hostnames to have IP
addresses in them, and this also made no significant difference

I then changed all subnet matches to single IP address (just nuked /##
throughout the config file) and this also made no significant difference.



Squid has always worked this way. It will *test* every rule from the top down 
to the one that matches. Also testing each line left-to-right until one fails or 
the whole line matches.




so why are the address matches so expensive



3.0 and older IP address is a 32-bit comparison.
3.1 and newer IP address is a 128-bit comparison with memcmp().

If something like a word-wise comparison can be implemented faster than 
memcmp() we would welcome it.


I wonder if there should be a different version that's used when IPv6 is 
disabled. This is a pretty large hit.


if the data is aligned properly, on a 64 bit system this should still only be 2 
compares. do you do any alignment on the data now?



and as noted in the e-mail below, why do these checks not scale nicely
with the number of worker processes? If they did, the fact that one 3.2
process is about 1/3 the speed of a 3.0 process in checking the acls
wouldn't matter nearly as much when it's so easy to get an 8+ core system.



There you have the unknown.


I think this is a fairly critical thing to figure out.



it seems to me that all accept/deny rules in a set should be able to be
combined into a tree to make searching them very fast.

so for example if you have

accept 1
accept 2
deny 3
deny 4
accept 5

you need to create three trees (one with accept 1 and accept 2, one with
deny3 and deny4, and one with accept 5) and then check each tree to see
if you have a match.

the types of match could be done in order of increasing cost, so if you


The config file is a specific structure configured by the admin under guaranteed 
rules of operation for access lines (top-down, left-to-right, first-match-wins) 
to perform boolean-logic calculations using ACL sets.

Sorting access line rules is not an option.
Sorting ACL values and tree-forming them is already done (regex being the one 
exception AFAIK).
Sorting position-wise on a single access line is also ruled out by interactions 
with deny_info, auth and external ACL types.


It would seem that as long as you don't cross boundaries between the different 
types, you should be able to optimize within a group.


using my example above, you couldn't combine the 'accept 5' with any of the 
other accepts, but you could combine accept 1 and 2 and combine deny 3 and 4 
together.


now, I know that I don't fully understand all the possible ACL types, so this 
may not work for some of them, but I believe that a fairly common use case is to 
have either a lot of allow rules, or a lot of deny rules as a block (either a 
list of sites you are allowed to access, or a list of sites that are blocked), 
so an ability to optimize these use cases may be well worth it.



have acl entries of type port, src, dst, and url regex, organize the
tree so that you check ports first, then src, then dst, then only if all
that matches do you need to do the regex. This would be very similar to
the shortcut logic

Re: [squid-users] squid 3.2.0.5 smp scaling issues

2011-04-09 Thread Amos Jeffries

On 09/04/11 14:27, da...@lang.hm wrote:

A couple more things about the ACLs used in my test

all of them are allow ACLs (no deny rules to worry about precedence of)
except for a deny-all at the bottom

the ACL line that permits the test source to the test destination has
zero overlap with the rest of the rules

every rule has an IP based restriction (even the ones with url_regex are
source - URL regex)

I moved the ACL that allows my test from the bottom of the ruleset to
the top and the resulting performance numbers were up as if the other
ACLs didn't exist. As such it is very clear that 3.2 is evaluating every
rule.

I changed one of the url_regex rules to just match one line rather than
a file containing 307 lines to see if that made a difference, and it
made no significant difference. So this indicates to me that it's not
having to fully evaluate every rule (it's able to skip doing the regex
if the IP match doesn't work)

I then changed all the acl lines that used hostnames to have IP
addresses in them, and this also made no significant difference

I then changed all subnet matches to single IP address (just nuked /##
throughout the config file) and this also made no significant difference.



Squid has always worked this way. It will *test* every rule from the top 
down to the one that matches. Also testing each line left-to-right until 
one fails or the whole line matches.




so why are the address matches so expensive



3.0 and older IP address is a 32-bit comparison.
3.1 and newer IP address is a 128-bit comparison with memcmp().

If something like a word-wise comparison can be implemented faster than 
memcmp() we would welcome it.




and as noted in the e-mail below, why do these checks not scale nicely
with the number of worker processes? If they did, the fact that one 3.2
process is about 1/3 the speed of a 3.0 process in checking the acls
wouldn't matter nearly as much when it's so easy to get an 8+ core system.



There you have the unknown.



it seems to me that all accept/deny rules in a set should be able to be
combined into a tree to make searching them very fast.

so for example if you have

accept 1
accept 2
deny 3
deny 4
accept 5

you need to create three trees (one with accept 1 and accept 2, one with
deny3 and deny4, and one with accept 5) and then check each tree to see
if you have a match.

the types of match could be done in order of increasing cost, so if you


The config file is a specific structure configured by the admin under 
guaranteed rules of operation for access lines (top-down, left-to-right, 
first-match-wins) to perform boolean-logic calculations using ACL sets.

 Sorting access line rules is not an option.
 Sorting ACL values and tree-forming them is already done (regex being 
the one exception AFAIK).
 Sorting position-wise on a single access line is also ruled out by 
interactions with deny_info, auth and external ACL types.




have acl entries of type port, src, dst, and url regex, organize the
tree so that you check ports first, then src, then dst, then only if all
that matches do you need to do the regex. This would be very similar to
the shortcut logic that you use today with a single rule where you bail
out when you don't find a match.

you could go with a complex tree structure, but since this only needs to
be changed at boot time,


Um, boot/startup time and arbitrary -k reconfigure times.
With a reverse-configuration display dump on any cache manager request.


it seems to me that a simple array that you can
do a binary search on will work for the port, src, and dst trees. The
url regex is probably easiest to initially create by just doing a list
of regex strings to match and working down that list, but eventually it


This is already how we do these. But with a splay tree instead of binary.


may be best to create a parse tree so that you only have to walk down
the string once to see if you have a match.


That would be nice. Care to implement?
 You just have to get the regex library to adjust its pre-compiled 
patterns with OR into (existing|new) whenever a new pattern string is 
added to an ACL.
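
As a minimal sketch of that (existing|new) folding, using std::regex purely 
for illustration (Squid's actual regex handling is not shown here):

#include <regex>
#include <string>
#include <vector>

// Fold many url_regex patterns into one alternation, so each URL is
// scanned once by a single pre-compiled pattern instead of once per
// pattern in a list.
std::regex combine(const std::vector<std::string> &patterns)
{
    std::string joined;
    for (const auto &p : patterns) {
        if (!joined.empty())
            joined += '|';
        joined += '(' + p + ')';        // parenthesize before OR-ing
    }
    return std::regex(joined, std::regex::extended);
}

// e.g. combine({"\\.exe$", "ads\\.", "tracker"}) compiles
// (\.exe$)|(ads\.)|(tracker), and std::regex_search() then walks each
// URL only once.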




you wouldn't quite be able to get this fast as you would have to
actually do two checks, one if you have a match on that level and one
for the rules that don't specify something in the current tree (one
check for if the http_access line specifies a port number and one for if
it doesn't for example)


We get around this problem by using C++ types. ACLChecklist walks the 
tree holding the current location, expected result, and all details 
available about the transaction. Each node in the tree has a match() 
function which gets called at most once per walk. Each ACL data type 
provides its own match() algorithm.


That is why the following config is invalid:
 acl foo src 1.2.3.4
 acl foo port 80
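
A bare-bones sketch of that shape, with hypothetical node types (Squid's real 
ACLChecklist and ACL classes carry far more state than this):

#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Transaction {            // the details available about the transaction
    std::string srcIp;
    int port;
};

struct AclNode {                // each ACL data type supplies its own match()
    virtual ~AclNode() = default;
    virtual bool match(const Transaction &) const = 0;
};

struct SrcAcl : AclNode {
    std::string addr;
    explicit SrcAcl(std::string a) : addr(std::move(a)) {}
    bool match(const Transaction &t) const override { return t.srcIp == addr; }
};

struct PortAcl : AclNode {
    int port;
    explicit PortAcl(int p) : port(p) {}
    bool match(const Transaction &t) const override { return t.port == port; }
};

// One access line: walked left-to-right, each node consulted at most once,
// bailing out at the first node that fails to match.
bool lineMatches(const std::vector<std::unique_ptr<AclNode>> &line,
                 const Transaction &t)
{
    for (const auto &node : line)
        if (!node->match(t))
            return false;
    return true;
}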



this sort of acl structure would reduce a complex ruleset down to ~O(log
n) instead of the current O(n) (a really complex ruleset would be log n
of each tree added 

Re: [squid-users] squid 3.2.0.5 smp scaling issues

2011-04-09 Thread david

On Sat, 9 Apr 2011, Amos Jeffries wrote:


On 09/04/11 14:27, da...@lang.hm wrote:

A couple more things about the ACLs used in my test

all of them are allow ACLs (no deny rules to worry about precedence of)
except for a deny-all at the bottom

the ACL line that permits the test source to the test destination has
zero overlap with the rest of the rules

every rule has an IP based restriction (even the ones with url_regex are
source - URL regex)

I moved the ACL that allows my test from the bottom of the ruleset to
the top and the resulting performance numbers were up as if the other
ACLs didn't exist. As such it is very clear that 3.2 is evaluating every
rule.

I changed one of the url_regex rules to just match one line rather than
a file containing 307 lines to see if that made a difference, and it
made no significant difference. So this indicates to me that it's not
having to fully evaluate every rule (it's able to skip doing the regex
if the IP match doesn't work)

I then changed all the acl lines that used hostnames to have IP
addresses in them, and this also made no significant difference

I then changed all subnet matches to single IP address (just nuked /##
throughout the config file) and this also made no significant difference.



Squid has always worked this way. It will *test* every rule from the top down 
to the one that matches. Also testing each line left-to-right until one fails 
or the whole line matches.




so why are the address matches so expensive



3.0 and older IP address is a 32-bit comparison.
3.1 and newer IP address is a 128-bit comparison with memcmp().

If something like a word-wise comparison can be implemented faster than 
memcmp() we would welcome it.


I wonder if there should be a different version that's used when IPv6 is 
disabled. This is a pretty large hit.


if the data is aligned properly, on a 64 bit system this should still only 
be 2 compares. do you do any alignment on the data now?



and as noted in the e-mail below, why do these checks not scale nicely
with the number of worker processes? If they did, the fact that one 3.2
process is about 1/3 the speed of a 3.0 process in checking the acls
wouldn't matter nearly as much when it's so easy to get an 8+ core system.



There you have the unknown.


I think this is a fairly critical thing to figure out.



it seems to me that all accept/deny rules in a set should be able to be
combined into a tree to make searching them very fast.

so for example if you have

accept 1
accept 2
deny 3
deny 4
accept 5

you need to create three trees (one with accept 1 and accept 2, one with
deny3 and deny4, and one with accept 5) and then check each tree to see
if you have a match.

the types of match could be done in order of increasing cost, so if you


The config file is a specific structure configured by the admin under guaranteed 
rules of operation for access lines (top-down, left-to-right, 
first-match-wins) to perform boolean-logic calculations using ACL sets.

Sorting access line rules is not an option.
Sorting ACL values and tree-forming them is already done (regex being the 
one exception AFAIK).
Sorting position-wise on a single access line is also ruled out by 
interactions with deny_info, auth and external ACL types.


It would seem that as long as you don't cross boundaries between the 
different types, you should be able to optimize within a group.


using my example above, you couldn't combine the 'accept 5' with any of 
the other accepts, but you could combine accept 1 and 2 and combine deny 3 
and 4 together.


now, I know that I don't fully understand all the possible ACL types, so 
this may not work for some of them, but I believe that a fairly common use 
case is to have either a lot of allow rules, or a lot of deny rules as a 
block (either a list of sites you are allowed to access, or a list of 
sites that are blocked), so an ability to optimize these use cases may be 
well worth it.



have acl entries of type port, src, dst, and url regex, organize the
tree so that you check ports first, then src, then dst, then only if all
that matches do you need to do the regex. This would be very similar to
the shortcut logic that you use today with a single rule where you bail
out when you don't find a match.

you could go with a complex tree structure, but since this only needs to
be changed at boot time,


Um, boot/startup time and arbitrary -k reconfigure times.
With a reverse-configuration display dump on any cache manager request.


still a pretty rare case, and one where you can build a completely new 
ruleset and swap it out. My point was that this isn't something that you 
have to be able to update dynamically.



it seems to me that a simple array that you can
do a binary search on will work for the port, src, and dst trees. The
url regex is probably easiest to initially create by just doing a list
of regex strings to match and working down that list, but eventually it


This is already how we do these. But 

Re: [squid-users] squid 3.2.0.5 smp scaling issues

2011-04-08 Thread david
/sec
3.2.0.6 with 5 workers gets 15,800 requests/sec
3.2.0.6 with 6 workers gets 16,400 requests/sec

David Lang



On Fri, 8 Apr 2011, Amos Jeffries wrote:


Date: Fri, 08 Apr 2011 15:37:24 +1200
From: Amos Jeffries squ...@treenet.co.nz
To: squid-users@squid-cache.org
Subject: Re: [squid-users] squid 3.2.0.5 smp scaling issues

On 08/04/11 14:32, da...@lang.hm wrote:

sorry for the delay. I got a chance to do some more testing (slightly
different environment on the apache server, so these numbers are a
little lower for the same versions than the last ones I posted)

results when requesting short html page


squid 3.0.STABLE12 4000 requests/sec
squid 3.1.11 1500 requests/sec
squid 3.1.12 1530 requests/sec
squid 3.2.0.5 1 worker 1300 requests/sec
squid 3.2.0.5 2 workers 2050 requests/sec
squid 3.2.0.5 3 workers 2700 requests/sec
squid 3.2.0.5 4 workers 2950 requests/sec
squid 3.2.0.5 5 workers 2900 requests/sec
squid 3.2.0.5 6 workers 2530 requests/sec
squid 3.2.0.6 1 worker 1400 requests/sec
squid 3.2.0.6 2 workers 2050 requests/sec
squid 3.2.0.6 3 workers 2730 requests/sec
squid 3.2.0.6 4 workers 2950 requests/sec
squid 3.2.0.6 5 workers 2830 requests/sec
squid 3.2.0.6 6 workers 2530 requests/sec
squid 3.2.0.6 7 workers 2160 requests/sec; instead of all processes being
at 100%, several were at 99%
squid 3.2.0.6 8 workers 1950 requests/sec; instead of all processes being
at 100%, some were as low as 92%

so the new versions are really about the same

moving to large requests cut these numbers by about 1/3, but the squid
processes were not maxing out the CPU

one issue I saw, I had to reduce the number of concurrent connections or
I would have requests time out (3.2 vs earlier versions), on 3.2 I had
to have -c on ab at ~100-150 where I could go significantly higher on
3.1 and 3.0

David Lang



Thank you.
So with small files, a 2% gain on 3.1 and ~7% on 3.2 with a single worker. But 
under 1% on multiple 3.2 workers.

And overloading/flooding the I/O bandwidth on large files.

NP: when overloading I/O one cannot compare to runs with different sizes. 
Only with runs of the same traffic. Also only the CPU max load is reliable 
there, since requests/sec bottlenecks behind the I/O.

So... your measure that CPU dropped is a good sign for large files.

Amos





Re: [squid-users] squid 3.2.0.5 smp scaling issues

2011-04-07 Thread david
sorry for the delay. I got a chance to do some more testing (slightly 
different environment on the apache server, so these numbers are a 
little lower for the same versions than the last ones I posted)


results when requesting short html page


squid 3.0.STABLE12 4000 requests/sec
squid 3.1.11 1500 requests/sec
squid 3.1.12 1530 requests/sec
squid 3.2.0.5 1 worker 1300 requests/sec
squid 3.2.0.5 2 workers 2050 requests/sec
squid 3.2.0.5 3 workers 2700 requests/sec
squid 3.2.0.5 4 workers 2950 requests/sec
squid 3.2.0.5 5 workers 2900 requests/sec
squid 3.2.0.5 6 workers 2530 requests/sec
squid 3.2.0.6 1 worker 1400 requests/sec
squid 3.2.0.6 2 workers 2050 requests/sec
squid 3.2.0.6 3 workers 2730 requests/sec
squid 3.2.0.6 4 workers 2950 requests/sec
squid 3.2.0.6 5 workers 2830 requests/sec
squid 3.2.0.6 6 workers 2530 requests/sec
squid 3.2.0.6 7 workers 2160 requests/sec; instead of all processes being at 
100%, several were at 99%
squid 3.2.0.6 8 workers 1950 requests/sec; instead of all processes being at 
100%, some were as low as 92%

so the new versions are really about the same

moving to large requests cut these numbers by about 1/3, but the squid 
processes were not maxing out the CPU


one issue I saw, I had to reduce the number of concurrent connections or I 
would have requests time out (3.2 vs earlier versions), on 3.2 I had to 
have -c on ab at ~100-150 where I could go significantly higher on 3.1 and 
3.0


David Lang


On Mon, 4 Apr 2011, da...@lang.hm wrote:


On Mon, 4 Apr 2011, Amos Jeffries wrote:


On 03/04/11 12:52, da...@lang.hm wrote:

still no response from anyone.

Is there any interest in investigating this issue? Or should I just
write off squid for future use due to its performance degrading?


It is a very ambiguous issue..
* We have your report with some nice rate benchmarks indicating regression
* We have two others saying me-too with less details
* We have an independent report indicating that 3.1 is faster than 2.7. 
With benchmarks to prove it.
* We have several independent reports indicating that 3.2 is faster than 
3.1. One like yours with benchmark proof.
* We have someone responding to your report saying the CPU type affects 
things in a large way (likely due to SMP using CPU-level features)
* We have our own internal testing which also shows a mix of results with 
the variance being dependent on which component of Squid is tested.


Your test in particular is testing both the large object pass-thru (proxy 
only) capacity and the parser CPU ceiling.


Could you try your test on 3.2.0.6 and 3.1.12 please? They both now have a 
server-facing buffer change which should directly affect your test results 
in a good way.


thanks for the response, part of my frustration was just not hearing anything 
back.


I'll do the tests on the new version shortly (hopefully on monday)

if there are other tests that people would like me to perform on the hardware 
I have available, please let me know.


right now I am just testing proxy/ACL with no caching, but I am testing four 
traffic types


1. small static files
2. large static files
3. small dynamic files (returning the exact same data as 1, but only after a 
fixed delay)

4. large dynamic files.

while I see a dramatic difference in the performance on the different tests, 
so far the ratios between the different versions have been consistent across 
all four scenarios.


David Lang



Re: [squid-users] squid 3.2.0.5 smp scaling issues

2011-04-07 Thread Amos Jeffries

On 08/04/11 14:32, da...@lang.hm wrote:

sorry for the delay. I got a chance to do some more testing (slightly
different environment on the apache server, so these numbers are a
little lower for the same versions than the last ones I posted)

results when requesting short html page


squid 3.0.STABLE12 4000 requests/sec
squid 3.1.11 1500 requests/sec
squid 3.1.12 1530 requests/sec
squid 3.2.0.5 1 worker 1300 requests/sec
squid 3.2.0.5 2 workers 2050 requests/sec
squid 3.2.0.5 3 workers 2700 requests/sec
squid 3.2.0.5 4 workers 2950 requests/sec
squid 3.2.0.5 5 workers 2900 requests/sec
squid 3.2.0.5 6 workers 2530 requests/sec
squid 3.2.0.6 1 worker 1400 requests/sec
squid 3.2.0.6 2 workers 2050 requests/sec
squid 3.2.0.6 3 workers 2730 requests/sec
squid 3.2.0.6 4 workers 2950 requests/sec
squid 3.2.0.6 5 workers 2830 requests/sec
squid 3.2.0.6 6 workers 2530 requests/sec
squid 3.2.0.6 7 workers 2160 requests/sec; instead of all processes being
at 100%, several were at 99%
squid 3.2.0.6 8 workers 1950 requests/sec; instead of all processes being
at 100%, some were as low as 92%

so the new versions are really about the same

moving to large requests cut these numbers by about 1/3, but the squid
processes were not maxing out the CPU

one issue I saw, I had to reduce the number of concurrent connections or
I would have requests time out (3.2 vs earlier versions), on 3.2 I had
to have -c on ab at ~100-150 where I could go significantly higher on
3.1 and 3.0

David Lang



Thank you.
 So with small files, a 2% gain on 3.1 and ~7% on 3.2 with a single worker. But 
under 1% on multiple 3.2 workers.

 And overloading/flooding the I/O bandwidth on large files.

NP: when overloading I/O one cannot compare to runs with different 
sizes. Only with runs of the same traffic. Also only the CPU max load is 
reliable there, since requests/sec bottlenecks behind the I/O.

 So... your measure that CPU dropped is a good sign for large files.

Amos
--
Please be using
  Current Stable Squid 2.7.STABLE9 or 3.1.12
  Beta testers wanted for 3.2.0.6


Re: [squid-users] squid 3.2.0.5 smp scaling issues

2011-04-04 Thread david

On Mon, 4 Apr 2011, Amos Jeffries wrote:


On 03/04/11 12:52, da...@lang.hm wrote:

still no response from anyone.

Is there any interest in investigating this issue? Or should I just
write off squid for future use due to its performance degrading?


It is a very ambiguous issue..
* We have your report with some nice rate benchmarks indicating regression
* We have two others saying me-too with less details
* We have an independent report indicating that 3.1 is faster than 2.7. With 
benchmarks to prove it.
* We have several independent reports indicating that 3.2 is faster than 
3.1. One like yours with benchmark proof.
* We have someone responding to your report saying the CPU type affects 
things in a large way (likely due to SMP using CPU-level features)
* We have our own internal testing which also shows a mix of results with 
the variance being dependent on which component of Squid is tested.


Your test in particular is testing both the large object pass-thru (proxy 
only) capacity and the parser CPU ceiling.


Could you try your test on 3.2.0.6 and 3.1.12 please? They both now have a 
server-facing buffer change which should directly affect your test results in 
a good way.


thanks for the response, part of my frustration was just not hearing 
anything back.


I'll do the tests on the new version shortly (hopefully on monday)

if there are other tests that people would like me to perform on the 
hardware I have available, please let me know.


right now I am just testing proxy/ACL with no caching, but I am testing 
four traffic types


1. small static files
2. large static files
3. small dynamic files (returning the exact same data as 1, but only after 
a fixed delay)

4. large dynamic files.

while I see a dramatic difference in the performance on the different 
tests, so far the ratios between the different versions have been 
consistent across all four scenarios.


David Lang


Re: [squid-users] squid 3.2.0.5 smp scaling issues

2011-04-03 Thread Amos Jeffries

On 03/04/11 12:52, da...@lang.hm wrote:

still no response from anyone.

Is there any interest in investigating this issue? Or should I just
write off squid for future use due to its performance degrading?


It is a very ambiguous issue..
 * We have your report with some nice rate benchmarks indicating regression
 * We have two others saying me-too with less details
 * We have an independent report indicating that 3.1 is faster than 
2.7. With benchmarks to prove it.
 * We have several independent reports indicating that 3.2 is faster 
than 3.1. One like yours with benchmark proof.
 * We have someone responding to your report saying the CPU type 
affects things in a large way (likely due to SMP using CPU-level features)
 * We have our own internal testing which also shows a mix of results 
with the variance being dependent on which component of Squid is tested.


Your test in particular is testing both the large object pass-thru 
(proxy only) capacity and the parser CPU ceiling.


Could you try your test on 3.2.0.6 and 3.1.12 please? They both now have 
a server-facing buffer change which should directly affect your test 
results in a good way.


Amos
--
Please be using
  Current Stable Squid 2.7.STABLE9 or 3.1.12
  Beta testers wanted for 3.2.0.6


Re: [squid-users] squid 3.2.0.5 smp scaling issues

2011-04-02 Thread david

still no response from anyone.

Is there any interest in investigating this issue? Or should I just write 
off squid for future use due to its performance degrading?


David Lang

On Sat, 26 Mar 2011, da...@lang.hm wrote:


Subject: Re: [squid-users] squid 3.2.0.5 smp scaling issues

re-sending and adding -dev list

performance drops going from 3.0 - 3.1 - 3.2 and in addition squid 3.2 
scales poorly (only goes up to 2x single-threaded performance going up to 4 
cores and drops off again after that)


this makes it so that I actually get better performance on 3.0 than on 3.2, 
even with multiple workers


David Lang

On Mon, 21 Mar 2011, da...@lang.hm wrote:


Date: Mon, 21 Mar 2011 19:26:38 -0700 (PDT)
From: da...@lang.hm
To: squid-users@squid-cache.org
Subject: [squid-users] squid 3.2.0.5 smp scaling issues

test setup

box A running apache and ab

test against local IP address: 13000 requests/sec

box B running squid, 8 2.3 GHz Opteron cores with 16G ram

non acl/cache-peer related lines in the config are (including typos from me 
manually entering this)


http_port 8000
icp_port 0
visible_hostname gromit1
cache_effective_user proxy
cache_effective_group proxy
append_domain .invalid.server.name
pid_filename /var/run/squid.pid
cache_dir null /tmp
client_db off
cache_access_log syslog squid
cache_log /var/log/squid/cache.log
cache_store_log none
coredump_dir none
no_cache deny all
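
(For context: a test like this is typically driven from box A with an 
ApacheBench command of this general shape, where -n is the total request 
count and -c the concurrency level; the URL and counts here are illustrative, 
not taken from the thread:

ab -n 100000 -c 100 http://box-b:8000/test.html

with box-b:8000 standing in for the squid host and the http_port above.)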


results when requesting short html page
squid 3.0.STABLE12 4200 requests/sec

squid 3.1.11 2100 requests/sec
squid 3.2.0.5 1 worker 1400 requests/sec
squid 3.2.0.5 2 workers 2100 requests/sec
squid 3.2.0.5 3 workers 2500 requests/sec
squid 3.2.0.5 4 workers 2900 requests/sec
squid 3.2.0.5 5 workers 2900 requests/sec
squid 3.2.0.5 6 workers 2500 requests/sec
squid 3.2.0.5 7 workers 2000 requests/sec
squid 3.2.0.5 8 workers 1900 requests/sec

in all these tests the squid process was using 100% of the cpu

I tried it pulling a large file (100K instead of 50 bytes) on the thought 
that this may be bottlenecking on accepting the connections, and that with 
something that took more time to service the connections it could do better; 
however, what I found is that with 8 workers all 8 were using 50% of the 
CPU at 1000 requests/sec


local machine would do 7000 requests/sec to itself

1 worker 500 requests/sec
2 workers 957 requests/sec

from there it remained about 1000 requests/sec with the cpu utilization 
slowly dropping off (but not dropping as fast as it should with the number 
of cores available)


so it looks like there is some significant bottleneck in version 3.2 that 
makes the SMP support fairly ineffective.



in reading the wiki page at wiki.squid-cache.org/Features/SmpScale I see 
you worrying about fairness between workers. If you have put in code to try 
and ensure fairness, you may want to remove it and see what happens to 
performance. what you are describing on that page in terms of fairness is 
what I would expect from a 'first-come-first-served' approach to multiple 
processes grabbing new connections. The worker that last ran is hot in the 
cache and so has an 'unfair' advantage in noticing and processing the new 
request, but as that worker gets busier, it will be spending more time 
servicing the request and the other processes will get more of a chance to 
grab the new connection, so it will appear unfair under light load, but 
become more fair under heavy load.
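
That first-come-first-served behaviour is what the classic pre-fork accept 
loop gives you; here is a minimal POSIX sketch (error handling omitted, four 
workers and port 8000 assumed for illustration) where the kernel hands each 
new connection to whichever blocked worker it wakes first:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main()
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in sa{};
    sa.sin_family = AF_INET;
    sa.sin_port = htons(8000);
    sa.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(fd, reinterpret_cast<sockaddr *>(&sa), sizeof(sa));
    listen(fd, 128);

    for (int i = 0; i < 4; ++i)         // four workers, one shared socket
        if (fork() == 0)
            for (;;) {
                // whichever worker the kernel wakes first wins the race
                int c = accept(fd, nullptr, nullptr);
                if (c >= 0)
                    close(c);           // ... service the request here ...
            }
    pause();                            // parent idles
}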


David Lang





Re: [squid-users] squid 3.2.0.5 smp scaling issues

2011-03-27 Thread david

re-sending and adding -dev list

performance drops going from 3.0 - 3.1 - 3.2 and in addition squid 3.2 
scales poorly (only goes up to 2x single-threaded performance going up to 
4 cores and drops off again after that)


this makes it so that I actually get better performance on 3.0 than on 
3.2, even with multiple workers


David Lang

On Mon, 21 Mar 2011, da...@lang.hm wrote:


Date: Mon, 21 Mar 2011 19:26:38 -0700 (PDT)
From: da...@lang.hm
To: squid-users@squid-cache.org
Subject: [squid-users] squid 3.2.0.5 smp scaling issues

test setup

box A running apache and ab

test against local IP address: 13000 requests/sec

box B running squid, 8 2.3 GHz Opteron cores with 16G ram

non acl/cache-peer related lines in the config are (including typos from me 
manually entering this)


http_port 8000
icp_port 0
visible_hostname gromit1
cache_effective_user proxy
cache_effective_group proxy
append_domain .invalid.server.name
pid_filename /var/run/squid.pid
cache_dir null /tmp
client_db off
cache_access_log syslog squid
cache_log /var/log/squid/cache.log
cache_store_log none
coredump_dir none
no_cache deny all


results when requesting short html page
squid 3.0.STABLE12 4200 requests/sec
squid 3.1.11 2100 requests/sec
squid 3.2.0.5 1 worker 1400 requests/sec
squid 3.2.0.5 2 workers 2100 requests/sec
squid 3.2.0.5 3 workers 2500 requests/sec
squid 3.2.0.5 4 workers 2900 requests/sec
squid 3.2.0.5 5 workers 2900 requests/sec
squid 3.2.0.5 6 workers 2500 requests/sec
squid 3.2.0.5 7 workers 2000 requests/sec
squid 3.2.0.5 8 workers 1900 requests/sec

in all these tests the squid process was using 100% of the cpu

I tried it pulling a large file (100K instead of 50 bytes) on the thought 
that this may be bottlenecking on accepting the connections, and that with 
something that took more time to service the connections it could do better; 
however, what I found is that with 8 workers all 8 were using 50% of the CPU 
at 1000 requests/sec


local machine would do 7000 requests/sec to itself

1 worker 500 requests/sec
2 workers 957 requests/sec

from there it remained about 1000 requests/sec with the cpu utilization 
slowly dropping off (but not dropping as fast as it should with the number of 
cores available)


so it looks like there is some significant bottleneck in version 3.2 that 
makes the SMP support fairly ineffective.



in reading the wiki page at wiki.squid-cache.org/Features/SmpScale I see you 
worrying about fairness between workers. If you have put in code to try and 
ensure fairness, you may want to remove it and see what happens to 
performance. what you are describing on that page in terms of fairness is 
what I would expect from a 'first-come-first-served' approach to multiple 
processes grabbing new connections. The worker that last ran is hot in the 
cache and so has an 'unfair' advantage in noticing and processing the new 
request, but as that worker gets busier, it will be spending more time 
servicing the request and the other processes will get more of a chance to 
grab the new connection, so it will appear unfair under light load, but 
become more fair under heavy load.


David Lang



[squid-users] squid 3.2.0.5 smp scaling issues

2011-03-21 Thread david

test setup

box A running apache and ab

test against local IP address: 13000 requests/sec

box B running squid, 8 2.3 GHz Opteron cores with 16G ram

non acl/cache-peer related lines in the config are (including typos from 
me manually entering this)


http_port 8000
icp_port 0
visible_hostname gromit1
cache_effective_user proxy
cache_effective_group proxy
append_domain .invalid.server.name
pid_filename /var/run/squid.pid
cache_dir null /tmp
client_db off
cache_access_log syslog squid
cache_log /var/log/squid/cache.log
cache_store_log none
coredump_dir none
no_cache deny all


results when requesting short html page 
squid 3.0.STABLE12 4200 requests/sec

squid 3.1.11 2100 requests/sec
squid 3.2.0.5 1 worker 1400 requests/sec
squid 3.2.0.5 2 workers 2100 requests/sec
squid 3.2.0.5 3 workers 2500 requests/sec
squid 3.2.0.5 4 workers 2900 requests/sec
squid 3.2.0.5 5 workers 2900 requests/sec
squid 3.2.0.5 6 workers 2500 requests/sec
squid 3.2.0.5 7 workers 2000 requests/sec
squid 3.2.0.5 8 workers 1900 requests/sec

in all these tests the squid process was using 100% of the cpu

I tried it pulling a large file (100K instead of 50 bytes) on the thought 
that this may be bottlenecking on accepting the connections, and that with 
something that took more time to service the connections it could do 
better; however, what I found is that with 8 workers all 8 were using 50% 
of the CPU at 1000 requests/sec


local machine would do 7000 requests/sec to itself

1 worker 500 requests/sec
2 workers 957 requests/sec

from there it remained about 1000 requests/sec with the cpu 
utilization slowly dropping off (but not dropping as fast as it should 
with the number of cores available)


so it looks like there is some significant bottleneck in version 3.2 that 
makes the SMP support fairly ineffective.



in reading the wiki page at wiki.squid-cache.org/Features/SmpScale I see 
you worrying about fairness between workers. If you have put in code to 
try and ensure fairness, you may want to remove it and see what happens to 
performance. what you are describing on that page in terms of fairness is 
what I would expect from a 'first-come-first-served' approach to multiple 
processes grabbing new connections. The worker that last ran is hot in the 
cache and so has an 'unfair' advantage in noticing and processing the new 
request, but as that worker gets busier, it will be spending more time 
servicing the request and the other processes will get more of a chance to 
grab the new connection, so it will appear unfair under light load, but 
become more fair under heavy load.


David Lang