RE: [pfSense Support] Network Device pooling
> Also I wrote when stall happens I can't telnet to port 80 on web server host - which means it is not just the program causing the stall.

Are you trying this from the same host as the benchmark program? I wonder if a 2nd host would have the same problem.

-----Original Message-----
From: Peter Zaitsev [mailto:[EMAIL PROTECTED]]
Sent: Monday, October 31, 2005 3:53 PM
To: support@pfsense.com
Subject: Re: [pfSense Support] Network Device pooling

On Mon, 2005-10-31 at 16:31 -0500, Scott Ullrich wrote:
> Are we absolutely sure this program works as intended? Personally I wouldn't trust anything like this but Smartbits.

Well... It works if filtering is disabled on pfSense - this is what worries me. If the program were broken, it should not work in either case. Also, as I wrote, when the stall happens I can't telnet to port 80 on the web server host - which means it is not just the program causing the stall. If it is protection on the FreeBSD side against too much activity from the same IP (i.e. the way it limits responses to flood pings), that would be good to know. I hope this problem is actually something like that - I know there are a lot of FreeBSD-based routers out there; if it were broken for real workloads, someone would have screamed already.

One more interesting thing I noticed:

Percentage of the requests served within a certain time (ms)
  50%     32
  66%     33
  75%     33
  80%     33
  90%     44
  95%    295
  98%    324
  99%    330
 100%  21285 (longest request)

Even when apache benchmark does not time out, it often shows overly long response times (21 seconds in this case). What I've noticed is that it can be 3, 9 or 21 seconds - these really look like the times at which SYN packets are resent by TCP/IP stacks if no reply to the previous one arrives. Doing more experiments, I also discovered I can increase the chance of passing the benchmark (still not to 100%) if I reduce tcp_fin_timeout and increase the ip_local_port_range variables on my test driver host.
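The 3, 9 and 21 second figures match classic SYN retransmission backoff: an initial retransmission timeout that doubles on each retry. A quick sketch of the cumulative timings (the 3-second initial RTO is an assumption based on BSD/Linux defaults of that era, not something measured in this thread):

```shell
# Cumulative time at which each SYN retransmit fires, assuming an
# initial RTO of 3 seconds that doubles on every retry (3s, 6s, 12s...).
rto=3
total=0
for retry in 1 2 3; do
  total=$((total + rto))   # retransmit fires once the previous RTO expires
  echo "SYN retry $retry at ${total}s"
  rto=$((rto * 2))
done
# Retries land at 3s, 9s and 21s -- the stall lengths observed above.
```

A request that only completes after one of these retries shows up in the ab percentiles exactly as a 3, 9 or 21 second outlier.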
This still leaves the question of why the behavior differs with filtering and without it, but it makes me worry less :)

Scott

On 10/31/05, Peter Zaitsev [EMAIL PROTECTED] wrote:
On Mon, 2005-10-31 at 16:25 -0500, Scott Ullrich wrote:
> apr_poll: The timeout specified has expired (70007)

What is the above from? Your benchmark testing box?

Yes. This is output from the apache benchmark program:

Benchmarking 111.111.111.158 (be patient)
Completed 1 requests
Completed 2 requests
Completed 3 requests
apr_poll: The timeout specified has expired (70007)
Total of 30517 requests completed

On 10/31/05, Peter Zaitsev [EMAIL PROTECTED] wrote:
On Mon, 2005-10-31 at 15:48 -0500, Scott Ullrich wrote:
> Are you viewing the traffic queue status? This would be normal if you are...

Heh, yes, good guess. Those were running in the other window. So here is the output for the stalled case:

# pfctl -ss | wc -l
51898

I have the number of states set to 100,000 in the advanced page, so that is not the peak number. Note what really surprises me is the number of requests when it fails:

apr_poll: The timeout specified has expired (70007)
Total of 28217 requests completed

This number of 28217 is seen so often... Sometimes it is a bit more or less, but it is very frequently within +/- 100 of it. I was asked if I can connect to the remote box when this problem happens - yes. I can SSH to the same box which runs Apache, but I can't connect to port 80 when this problem happens. So it looks like it does not like to see all these states corresponding to the same target port number.

Scott

On 10/31/05, Peter Zaitsev [EMAIL PROTECTED] wrote:
On Mon, 2005-10-31 at 14:39 -0500, Scott Ullrich wrote:
On 10/31/05, Fleming, John (ZeroChaos) [EMAIL PROTECTED] wrote:
> I wonder if part of the problem is that PF isn't seeing the TCP teardown. It seems a little odd that the max gets hit and nothing else gets through.
> I guess it could be that the benchmark isn't shutting down the session right after it's done transferring data, but I would think having 10K(ish) open TCP sessions would kill the benchmark client.

One way to determine this would be to run pfctl -ss | wc -l once pfSense stops responding?

Very interesting. I tried running this before the problems, and it looks strange already:

# pfctl -ss | wc -l
4893
Killed
# pfctl -ss | wc -l
23245
Killed

There is nothing in dmesg or the system logs.

---
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
RE: [pfSense Support] Network Device pooling
On Tue, 2005-11-01 at 10:43 -0600, Fleming, John (ZeroChaos) wrote:
> Also I wrote when stall happens I can't telnet to port 80 on web server host - which means it is not just the program causing the stall.
>
> Are you trying this from the same host as the benchmark program? I wonder if a 2nd host would have the same problem.

I did not have an extra host for the test. I've finally figured it out: it looks like the client is running out of local ports, as increasing ip_local_port_range allowed it to get to a different point. Two things confused me here:

1) For some reason it does not fail if the firewall is disabled. Probably something is different about connection closure.

2) The error code reported by ab is a connect timeout; for this kind of error it should be "Can't assign requested address" or something similar. I guess the Apache runtime abstraction library does not report this error well enough.
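The recurring ~28,217-connection ceiling fits this explanation almost exactly: the Linux default ip_local_port_range of the era was roughly 32768-61000, and once every port in that range is tied up in a lingering TIME_WAIT/FIN_WAIT socket, the client cannot open new connections. A back-of-the-envelope check (the range and timeout values are assumed defaults, not taken from the thread):

```shell
# Usable client-side ephemeral ports under the assumed Linux default range.
lo=32768
hi=61000
ports=$((hi - lo))
echo "ephemeral ports available: $ports"   # 28232 -- within ~100 of the 28217 ceiling

# With each closed socket lingering for tcp_fin_timeout seconds (60s
# assumed), the sustainable rate of new connections without port reuse is:
fin_timeout=60
echo "max sustained rate: $((ports / fin_timeout)) conn/s"
```

This also explains why raising ip_local_port_range and lowering tcp_fin_timeout on the test driver moved the failure point.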
Re: [pfSense Support] Network Device pooling
Can we please let this thread die already? I'm tired of hearing about benchmarking the *WRONG* way.

Scott
RE: [pfSense Support] Network Device pooling
I think the first rule of testing applies: start at the beginning and work your way backwards. Pleased you solved your problems.
Re: [pfSense Support] Network Device pooling
At 01:31 PM 11/1/2005, you wrote:
> Can we please let this thread die already? I'm tired of hearing about benchmarking the *WRONG* way.

Must. Control. The. Fist. Of. Death.
Re: [pfSense Support] Network Device pooling
Please describe the hardware you're using fully: NICs, etc. This is not normal behavior.

On 10/31/05, Peter Zaitsev [EMAIL PROTECTED] wrote:
On Sun, 2005-10-30 at 23:14 +0100, Espen Johansen wrote:
> Hi Peter, I have seen you have done a lot of testing with apache benchmarking. I find it a little strange to use this as a test. Basically you will hit the roof of standing I/O operations because you introduce latency with pfSense. The lower the latency, the more finished tasks/connections per time unit. Most people don't take this into consideration when they tune apache, although it is one of the most important aspects of web-server tuning.

Espen,

If you look at my set of emails, you will see that the growing latency with network polling is not my concern, nor is the dropping throughput with pfSense in the middle - that is all understandable. What is NOT OK, however, is the stall (20+ seconds) during which CPU usage on pfSense drops almost to zero and no traffic flows over the connections. Sometimes it causes apache benchmark to abort; sometimes it just shows crazy response times. This does not happen in a direct benchmark (no pfSense in the middle) or with pfSense with the firewall disabled. Why did I use apache benchmark? Well, it is a simple stress test which generates a lot of traffic and a lot of states in the state table.

> This is the scenario: a client with low bandwidth and high latency will generate a standing I/O because of the way apache is designed. So if a client with 100ms latency asks for a file of 100KB and has a 3KB/s transfer rate, he will generate a standing I/O operation for latency + transfer time, and the I/O operation will not be finished until the transfer completes. Basically you do the same: because you change the amount of time a request takes to process, you will have more standing I/O operations than if pfSense does routing only (which is faster than routing and filtering). So let's say you increase latency from 0.4 ms to 2 ms: your standing I/O lasts 5x as long, so in turn your ability to serve connections at 2 ms will be 1/5 of what it was at 0.4 ms.

Well... This would be the case in a real-life scenario - slow clients blowing up the number of apache children. But it is not the case in a synthetic apache benchmark test. There you set a fixed concurrency, and I obviously set it low enough for my Apache box to handle. Furthermore, pfSense locks up even with a single connection (and this is independent of whether device polling is enabled).

> The ones listed below seem to be the ones that have the most effect on polling and performance. You will have to play around with these settings to find out what works best on your HW, as I can't seem to find a common setting that works well for all kinds of HW.
>
> kern.polling.each_burst=80
> kern.polling.burst_max=1000
> kern.polling.user_frac=50

Thanks.
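For reference, the tunables Espen lists could be applied from a FreeBSD shell roughly like this (a sketch using the values from his message; a kernel with DEVICE_POLLING support is assumed, and the right numbers are hardware-dependent):

```shell
# Apply the suggested polling tunables (values from the message above;
# tune per-hardware).
sysctl kern.polling.each_burst=80
sysctl kern.polling.burst_max=1000
sysctl kern.polling.user_frac=50

# Inspect the whole polling subtree to confirm the settings took effect:
sysctl kern.polling
```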
Re: [pfSense Support] Network Device pooling
On 10/31/05, Peter Zaitsev [EMAIL PROTECTED] wrote:
On Mon, 2005-10-31 at 12:03 -0500, Scott Ullrich wrote:
> Please describe the hardware you're using fully: NICs, etc. This is not normal behavior.

Sure. It is a Dell PowerEdge 750: 512MB RAM, SATA150 disk, Celeron 2.4GHz.

ACPI APIC Table: DELL PE750
Timecounter i8254 frequency 1193182 Hz quality 0
CPU: Intel(R) Celeron(R) CPU 2.40GHz (2400.10-MHz 686-class CPU)
Origin = GenuineIntel Id = 0xf29 Stepping = 9
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Features2=0x4400<CNTX-ID,b14>
real memory = 536608768 (511 MB)
avail memory = 515547136 (491 MB)

The NICs are built-in Intel 10/100/1000:

em0: Intel(R) PRO/1000 Network Connection, Version - 2.1.7 port 0xece0-0xecff mem 0xfe1e-0xfe1f irq 18 at device 1.0 on pci1
em0: Ethernet address: 00:14:22:0a:64:4c
em0: Speed:N/A Duplex:N/A

It does not look like a hardware issue to me, as it works fine if I disable the firewall. I tried turning off scrub and it does not change anything. Still a timeout after a few requests.

And when this timeout occurs, do you see anything in the system logs? Can you still telnet into the apache server behind pfSense? This really doesn't make a lot of sense; it should be able to stand up to this.

Scott
RE: [pfSense Support] Network Device pooling
Send the output.txt of...

date >> /tmp/output.txt
netstat -m >> /tmp/output.txt
netstat -in >> /tmp/output.txt
sysctl hw.em0.stats=1 >> /tmp/output.txt
sysctl hw.em1.stats=1 >> /tmp/output.txt
sysctl hw.em2.stats=1 >> /tmp/output.txt

Can you send these while the machine is normal and when the machine is choking? (Send the output.txt file, btw.) Are you able to try this test using routing versus bridging?
RE: [pfSense Support] Network Device pooling
On Mon, 2005-10-31 at 13:26 -0600, Fleming, John (ZeroChaos) wrote:
> Benchmarking 111.111.111.158 (be patient) Completed 1 requests - isn't 10,000 the default limit of the state table? That sure would explain a lot.

I boosted it to 100,000, of course.
RE: [pfSense Support] Network Device pooling
I wonder if part of the problem is that PF isn't seeing the TCP teardown. It seems a little odd that the max gets hit and nothing else gets through. I guess it could be that the benchmark isn't shutting down the session right after it's done transferring data, but I would think having 10K(ish) open TCP sessions would kill the benchmark client.

-----Original Message-----
From: Scott Ullrich [mailto:[EMAIL PROTECTED]]
Sent: Monday, October 31, 2005 1:28 PM
To: support@pfsense.com
Subject: Re: [pfSense Support] Network Device pooling

On 10/31/05, Fleming, John (ZeroChaos) [EMAIL PROTECTED] wrote:
> Benchmarking 111.111.111.158 (be patient) Completed 1 requests - isn't 10,000 the default limit of the state table? That sure would explain a lot.

Yep. 10K is the default, and it is adjustable from the System -> Advanced screen.

Scott
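One way to check whether pf is seeing the teardown would be to break the pfctl -ss output down by TCP state pair rather than just counting lines; lots of entries stuck in ESTABLISHED would point at missed FINs. A sketch (the summarize_states helper is hypothetical; pf prints the state pair as the last field of each tcp entry):

```shell
# Tally pf states by their TCP state pair, most common first (sketch).
# `pfctl -ss` tcp lines end in a pair such as ESTABLISHED:ESTABLISHED
# or FIN_WAIT_2:FIN_WAIT_2; the last field is all we need here.
summarize_states() {
  awk '{ print $NF }' | sort | uniq -c | sort -rn
}

# On the firewall (as root): pfctl -ss | summarize_states
```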
Re: [pfSense Support] Network Device pooling
On 10/31/05, Fleming, John (ZeroChaos) [EMAIL PROTECTED] wrote:
> I wonder if part of the problem is that PF isn't seeing the TCP teardown. It seems a little odd that the max gets hit and nothing else gets through. I guess it could be that the benchmark isn't shutting down the session right after it's done transferring data, but I would think having 10K(ish) open TCP sessions would kill the benchmark client.

One way to determine this would be to run pfctl -ss | wc -l once pfSense stops responding?

Scott
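Building on that, a tiny helper can flag when the count approaches the configured maximum, since pf drops new connections once the state table is full. A sketch (near_limit and the 10000 default are illustrative; adjust LIMIT to whatever is set under System -> Advanced):

```shell
# Warn when the pf state count is within 10% of the configured limit (sketch).
LIMIT=10000   # pf's default maximum number of states, per the thread

near_limit() {
  # $1 = current state count; succeeds when the count is >= 90% of LIMIT
  [ "$1" -ge $((LIMIT * 9 / 10)) ]
}

# On the pfSense box (as root), something like:
#   states=$(pfctl -ss | wc -l)
#   near_limit "$states" && echo "WARNING: $states states, near limit of $LIMIT"
```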
Re: [pfSense Support] Network Device pooling
On Mon, 2005-10-31 at 14:39 -0500, Scott Ullrich wrote:
> One way to determine this would be to run pfctl -ss | wc -l once pfSense stops responding?

Very interesting. I tried running this before the problems, and it looks strange already:

# pfctl -ss | wc -l
4893
Killed
# pfctl -ss | wc -l
23245
Killed

There is nothing in dmesg or the system logs.
Re: [pfSense Support] Network Device pooling
Are you viewing the traffic queue status? This would be normal if you are...

Scott
RE: [pfSense Support] Network Device pooling
On Mon, 2005-10-31 at 13:25 -0600, Fleming, John (ZeroChaos) wrote:
> Can you send these while the machine is normal and when the machine is choking? (Send the output.txt file, btw.)

Normal:

# cat /tmp/output.txt
Mon Oct 31 07:50:52 PST 2005
564/336/900 mbufs in use (current/cache/total)
555/269/824/17088 mbuf clusters in use (current/cache/total/max)
0/3/4528 sfbufs in use (current/peak/max)
1253K/622K/1875K bytes allocated to network (current/cache/total)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines
Name   Mtu    Network          Address             Ipkts  Ierrs    Opkts  Oerrs  Coll
em0    1500   Link#1           00:14:22:0a:64:4c 2200575      0  2004248      0     0
em0    1500   fe80:1::214:2    fe80:1::214:22ff:       0      -        4      -     -
em0    1500   111.111.111.152  111.111.111.154      3395      -        0      -     -
em1    1500   Link#2           00:14:22:0a:64:4d 2003036      0  2195974      0     0
em1    1500   fe80:2::214:2    fe80:2::214:22ff:       0      -        4      -     -
em1    1500   111.111.111.152  111.111.111.154         0      -     6162      -     -
pfsyn  2020   Link#3                                   0      0        0      0     0
lo0    16384  Link#4                                   0      0        0      0     0
lo0    16384  127              127.0.0.1               0      -        0      -     -
lo0    16384  ::1/128          ::1                     0      -        0      -     -
lo0    16384  fe80:4::1/64     fe80:4::1               0      -        0      -     -
pflog  33208  Link#5                                   0      0        0      0     0
bridg  1500   Link#6           ac:de:48:e1:dd:5f 4197981      0  4200265      0     0

Choking:

Mon Oct 31 07:48:44 PST 2005
515/385/900 mbufs in use (current/cache/total)
514/310/824/17088 mbuf clusters in use (current/cache/total/max)
0/3/4528 sfbufs in use (current/peak/max)
1156K/716K/1873K bytes allocated to network (current/cache/total)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines
Name   Mtu    Network          Address             Ipkts  Ierrs    Opkts  Oerrs  Coll
em0    1500   Link#1           00:14:22:0a:64:4c 2011449      0  1838611      0     0
em0    1500   fe80:1::214:2    fe80:1::214:22ff:       0      -        4      -     -
em0    1500   111.111.111.152  111.111.111.154      2644      -        0      -     -
em1    1500   Link#2           00:14:22:0a:64:4d 1835313      0  2007595      0     0
em1    1500   fe80:2::214:2    fe80:2::214:22ff:       0      -        4      -     -
em1    1500   111.111.111.152  111.111.111.154         0      -     5336      -     -
pfsyn  2020   Link#3                                   0      0        0      0     0
lo0    16384  Link#4                                   0      0        0      0     0
lo0    16384  127              127.0.0.1               0      -        0      -     -
lo0    16384  ::1/128          ::1                     0      -        0      -     -
lo0    16384  fe80:4::1/64     fe80:4::1               0      -        0      -     -
pflog  33208  Link#5                                   0      0        0      0     0
bridg  1500   Link#6           ac:de:48:e1:dd:5f 3841883      0  3846209      0     0

Some of the commands you suggested fail:

# sysctl hw.em0.stats=1
sysctl: unknown oid 'hw.em0.stats'
# sysctl hw.em1.stats=1
sysctl: unknown oid 'hw.em1.stats'
# sysctl hw.em2.stats=1
sysctl: unknown oid 'hw.em2.stats'

> Are you able to try this test using routing versus bridging?

I did not try with routing, as this is not what I'm going to use. I did, however, try with the firewall disabled and bridging enabled, which seems to show that it is not bridging itself, at least.
Re: [pfSense Support] Network Device pooling
apr_poll: The timeout specified has expired (70007)

What is the above from? Your benchmark testing box?

On 10/31/05, Peter Zaitsev [EMAIL PROTECTED] wrote:
On Mon, 2005-10-31 at 15:48 -0500, Scott Ullrich wrote: Are you viewing the traffic queue status? This would be normal if you are...

Heh, yes, good guess. Those were running in the other window. So here is the output for the stalled case:

# pfctl -ss | wc -l
51898

I have the number of states set to 100,000 in the advanced page, so it is not at the limit. Note what really surprises me is the number of requests when it fails:

apr_poll: The timeout specified has expired (70007)
Total of 28217 requests completed

This number of 28217 is seen so often... Sometimes it is a bit more or less, but it is very frequently within +/- 100 of it. I was asked if I can connect to the remote box when this problem happens - yes. I can SSH to the same box which runs Apache, but I can't connect to port 80 when this problem happens. So it looks like it does not like to see all these states corresponding to the same target port number.

Scott

On 10/31/05, Peter Zaitsev [EMAIL PROTECTED] wrote:
On Mon, 2005-10-31 at 14:39 -0500, Scott Ullrich wrote:
On 10/31/05, Fleming, John (ZeroChaos) [EMAIL PROTECTED] wrote: I wonder if part of the problem is PF isn't seeing the TCP tear down. It seems a little odd that the max gets hit and nothing else gets through. I guess it could be that the benchmark isn't shutting down the session right after it's done transferring data, but I would think it would kill the benchmark client to have 10K(ish) open TCP sessions. One way to determine this would be to run pfctl -ss | wc -l once pfSense stops responding?

Very interesting. I tried running this before the problems, but it looks strange already:

# pfctl -ss | wc -l
4893
Killed
# pfctl -ss | wc -l
23245
Killed

There is nothing in dmesg or system logs.
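The recurring ~28217 figure is suspiciously close to the number of usable ephemeral ports under Linux's stock `ip_local_port_range` of 32768-61000 (an assumption about the test-driver host's defaults, not something confirmed in the thread). A quick sanity check of that arithmetic:

```shell
# Count the local ports available to the benchmark client under the
# assumed stock Linux ephemeral range 32768-61000.  If each ab
# request burns one client port and old ports linger in
# TIME_WAIT/FIN_WAIT, the client runs dry after roughly this many
# requests.
lo=32768
hi=61000
echo "ports available: $((hi - lo + 1))"
```

This prints `ports available: 28233`, within ~0.1% of the observed 28217, which would fit the later observation that widening ip_local_port_range lets the benchmark run longer.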
- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [pfSense Support] Network Device pooling
Have you seen this? https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=110887

Looks like an apachebench problem to me.

Scott

On 10/31/05, Scott Ullrich [EMAIL PROTECTED] wrote: Are we absolutely sure this program works as intended? Personally I wouldn't trust anything like this but smartbits. Scott

On 10/31/05, Peter Zaitsev [EMAIL PROTECTED] wrote:
On Mon, 2005-10-31 at 16:25 -0500, Scott Ullrich wrote: apr_poll: The timeout specified has expired (70007) What is the above from? Your benchmark testing box?

Yes. This is output from the apache benchmark program:

Benchmarking 111.111.111.158 (be patient)
Completed 1 requests
Completed 2 requests
Completed 3 requests
apr_poll: The timeout specified has expired (70007)
Total of 30517 requests completed
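One way to test the "all states on the same target port" hunch quoted above is to group the state table by endpoint instead of just counting lines. A sketch using canned sample lines (the state-line format and addresses here are illustrative and abbreviated; on the pfSense box you would pipe `pfctl -ss` straight into the awk stage):

```shell
# Count pf states per destination endpoint.  The sample lines stand
# in for real pfctl -ss output (hypothetical client addresses,
# abbreviated format); a pile-up on one ip:port pair would show up
# at the top of the sorted output.
sample='all tcp 111.111.111.158:80 <- 10.0.0.5:34211 TIME_WAIT:TIME_WAIT
all tcp 111.111.111.158:80 <- 10.0.0.5:34212 FIN_WAIT_2:FIN_WAIT_2
all tcp 111.111.111.158:22 <- 10.0.0.5:40000 ESTABLISHED:ESTABLISHED'
echo "$sample" | awk '{n[$3]++} END {for (k in n) print n[k], k}' | sort -rn
```

On the sample input this prints `2 111.111.111.158:80` first: two of the three states target port 80 on the web server.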
Re: [pfSense Support] Network Device pooling
On Mon, 2005-10-31 at 16:31 -0500, Scott Ullrich wrote: Are we absolutely sure this program works as intended? Personally I wouldn't trust anything like this but smartbits.

Well... It works if filtering is disabled on pfSense - this is what worries me. If the program were broken, it should not work in both cases. Also, as I wrote, when the stall happens I can't telnet to port 80 on the web server host - which means it is not just the program causing the stall.

If it is protection on the FreeBSD side against too much activity from the same IP (i.e., the way it limits responses to a flood ping), this would be good to know. I hope this problem is actually something like that - I know there are a lot of FreeBSD-based routers out there; if it were broken for real workloads, someone would have screamed already.

One more interesting thing I noticed:

Percentage of the requests served within a certain time (ms)
  50%     32
  66%     33
  75%     33
  80%     33
  90%     44
  95%    295
  98%    324
  99%    330
 100%  21285 (longest request)

Even when apache benchmark does not time out, it often shows too long a response time (21 sec in this case). What I've noticed is that it can be 3, 9, or 21 secs - these really look like the times at which SYN packets are resent by TCP/IP stacks if no reply to the previous one arrives.

Doing more experiments, I also discovered I can increase the chance of passing the benchmark (still not to 100%) if I reduce tcp_fin_timeout and increase the ip_local_port_range variables on my test driver host. This still leaves the question of why the behavior differs with filtering and without, but it makes me worry less :)

Scott

On 10/31/05, Peter Zaitsev [EMAIL PROTECTED] wrote:
On Mon, 2005-10-31 at 16:25 -0500, Scott Ullrich wrote: apr_poll: The timeout specified has expired (70007) What is the above from? Your benchmark testing box?

Yes. This is output from the apache benchmark program:

Benchmarking 111.111.111.158 (be patient)
Completed 1 requests
Completed 2 requests
Completed 3 requests
apr_poll: The timeout specified has expired (70007)
Total of 30517 requests completed
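The 3/9/21-second outliers line up with cumulative SYN retransmission delays under a 3-second initial retransmission timeout that doubles on each retry - a common default in BSD and Linux stacks of this era, assumed here rather than measured on these boxes:

```shell
# Cumulative time at which each SYN retransmit fires, assuming a
# 3-second initial retransmission timeout with exponential backoff.
# A request "served" in 3, 9 or 21 seconds therefore means its first
# one, two or three SYNs went unanswered.
rto=3
total=0
for try in 1 2 3; do
  total=$((total + rto))
  echo "retry $try fires at ${total}s"
  rto=$((rto * 2))
done
```

This prints retries at 3s, 9s and 21s, matching the observed latencies, which would be consistent with the initial SYNs being dropped (for lack of a free state or client port) rather than with slow responses from the server.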