Hello,

I've been running pf_ring (via pfdnacluster_master) with Bro on a couple of dual-socket Dell R720xds, trying to figure out the optimum workload each can handle. The systems seem to be oversubscribed: with 3-4 Gbps and 500 kpps of traffic going to each, I'm frequently seeing 10-25% packet loss via the stats available in Bro (capture-loss script, etc.). I am running 5.6.2, but I had similar loss with 5.6.1.

It was pointed out during troubleshooting that the DNA-based pf_ring entries in /proc/net/pf_ring don't seem to have the packet-loss / dropped-packet statistics that are present when running pf_ring without DNA. Naturally I would like to know whether any loss occurs before the data gets handed off to Bro. Is there a counter for this somewhere? I'd also like to check whether my system is configured properly, as I've observed other people getting much better performance out of similar, or sometimes even less powerful, hardware.
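For what it's worth, here is the sort of thing I've been trying in order to read per-ring counters: a small sketch that walks the /proc/net/pf_ring entries and pulls out the packet/loss fields. The field names ("Tot Packets", "Tot Pkt Lost") are what I see in the non-DNA output, so treat them as assumptions; with DNA rings they may simply be absent, which is exactly my question.

```shell
#!/bin/sh
# Sketch: sum per-ring packet/drop counters from /proc/net/pf_ring.
# Field names are assumptions taken from the non-DNA proc output.

parse_ring_stats() {
    # Print the numeric value of the given field from ring stats on stdin.
    field="$1"
    awk -F: -v f="$field" '$1 ~ f { gsub(/[ \t]/, "", $2); print $2 }'
}

for ring in /proc/net/pf_ring/[0-9]*; do
    [ -f "$ring" ] || continue
    pkts=$(parse_ring_stats "Tot Packets" < "$ring")
    lost=$(parse_ring_stats "Tot Pkt Lost" < "$ring")
    echo "$ring: packets=${pkts:-n/a} lost=${lost:-n/a}"
done

# NIC-level hardware drop counters may still be visible outside pf_ring:
#   ethtool -S eth4 | grep -i -E 'miss|drop'
```

The ethtool counters (e.g. rx_missed_errors on ixgbe) would at least show drops at the adapter, though I'm unsure whether they are maintained once DNA owns the device.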

Regarding this latter point, someone from the Bro community pointed out an article from a couple of years back on the NTOP blog, "Not All Servers Are Alike (With DNA) <http://www.ntop.org/pf_ring/not-all-servers-are-alike-with-dna/>", which seemed like it might relate to my experience. One of the performance tests used was to run numademo, which I did, and I quickly discovered that my memory results were closer to the worst-case results of the tested systems. After a quick jaunt through the BIOS/UEFI I was able to see limited improvements in some tests by not running my RAM in low-voltage mode and verifying that everything was using performance-optimized settings (some already were). I'm currently running:

2 Dell R720xd servers, each with:
64 GB RAM (8 banks with up to 3 DIMMs per bank, 4 banks per socket, one 8 GB dual-rank 1600 MHz RDIMM per bank)
2x Intel Xeon E5-2670 v1 (2.60 GHz, 20M cache, 8.0 GT/s QPI, Turbo, 8 physical / 16 logical cores each)
Intel X520 DP 10Gb DA/SFP+ NIC
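(As a sanity check on the layout above, I've been confirming the DIMM population and NUMA topology from the OS. A small sketch, assuming numactl and dmidecode are installed; both are guarded so it's a no-op where they aren't.)

```shell
#!/bin/sh
# Sketch: confirm DIMM population and NUMA layout from the running system.
show_topology() {
    if command -v numactl >/dev/null 2>&1; then
        numactl --hardware        # nodes, CPU->node map, per-node memory
    fi
    if command -v dmidecode >/dev/null 2>&1; then
        # Needs root; shows which slots/channels are actually populated.
        dmidecode -t memory 2>/dev/null | grep -E 'Locator|Size|Speed'
    fi
    return 0
}
show_topology
```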

Before tweaking BIOS settings my results looked like this:

numademo 128M memcpy
2 nodes available
memory with no policy memcpy Avg 6247.16 MB/s Max 6796.87 MB/s Min 3669.56 MB/s
local memory memcpy Avg 6722.95 MB/s Max 6770.47 MB/s Min 6623.46 MB/s
memory interleaved on all nodes memcpy Avg 4960.33 MB/s Max 4971.40 MB/s Min 4943.93 MB/s
memory on node 0 memcpy Avg 6775.29 MB/s Max 6793.43 MB/s Min 6726.02 MB/s
memory on node 1 memcpy Avg 3947.07 MB/s Max 3957.24 MB/s Min 3937.39 MB/s
memory interleaved on 0 1 memcpy Avg 4974.14 MB/s Max 4986.17 MB/s Min 4940.65 MB/s
setting preferred node to 0
memory without policy memcpy Avg 6734.39 MB/s Max 6779.70 MB/s Min 6637.87 MB/s
setting preferred node to 1
memory without policy memcpy Avg 3938.87 MB/s Max 3944.91 MB/s Min 3926.56 MB/s
manual interleaving to all nodes memcpy Avg 4936.62 MB/s Max 4965.14 MB/s Min 4851.89 MB/s
manual interleaving on node 0/1 memcpy Avg 4958.69 MB/s Max 4975.08 MB/s Min 4943.38 MB/s
current interleave node 1
running on node 0, preferred node 0
local memory memcpy Avg 6733.07 MB/s Max 6796.52 MB/s Min 6652.35 MB/s
memory interleaved on all nodes memcpy Avg 4915.64 MB/s Max 4959.82 MB/s Min 4671.37 MB/s
memory interleaved on node 0/1 memcpy Avg 4951.08 MB/s Max 4958.36 MB/s Min 4931.21 MB/s
alloc on node 1 memcpy Avg 3923.04 MB/s Max 3942.94 MB/s Min 3890.59 MB/s
local allocation memcpy Avg 6759.52 MB/s Max 6782.44 MB/s Min 6726.69 MB/s
setting wrong preferred node memcpy Avg 3923.52 MB/s Max 3946.65 MB/s Min 3880.47 MB/s
setting correct preferred node memcpy Avg 6793.87 MB/s Max 6821.05 MB/s Min 6757.51 MB/s
running on node 1, preferred node 0
local memory memcpy Avg 6886.99 MB/s Max 7038.16 MB/s Min 5890.62 MB/s
memory interleaved on all nodes memcpy Avg 5191.57 MB/s Max 5203.04 MB/s Min 5165.80 MB/s
memory interleaved on node 0/1 memcpy Avg 5187.48 MB/s Max 5198.61 MB/s Min 5172.57 MB/s
alloc on node 0 memcpy Avg 4070.32 MB/s Max 4073.13 MB/s Min 4067.82 MB/s
local allocation memcpy Avg 7037.24 MB/s Max 7049.99 MB/s Min 7028.95 MB/s
setting wrong preferred node memcpy Avg 4062.93 MB/s Max 4075.11 MB/s Min 4049.17 MB/s
setting correct preferred node memcpy Avg 7037.02 MB/s Max 7045.18 MB/s Min 7026.00 MB/s

After tweaking (RAM voltage raised from 1.35 V to 1.5 V, as the DIMMs support both standard and low-voltage modes; CPU set to maximize performance; memory was already set to optimize for performance):
numademo 128M memcpy
2 nodes available
memory with no policy memcpy Avg 7174.50 MB/s Max 7190.88 MB/s Min 7159.42 MB/s
local memory memcpy Avg 7169.29 MB/s Max 7186.26 MB/s Min 7144.18 MB/s
memory interleaved on all nodes memcpy Avg 5223.29 MB/s Max 5228.58 MB/s Min 5214.97 MB/s
memory on node 0 memcpy Avg 4104.18 MB/s Max 4111.68 MB/s Min 4097.63 MB/s
memory on node 1 memcpy Avg 7171.44 MB/s Max 7182.80 MB/s Min 7156.75 MB/s
memory interleaved on 0 1 memcpy Avg 5225.20 MB/s Max 5244.31 MB/s Min 5215.58 MB/s
setting preferred node to 0
memory without policy memcpy Avg 4104.44 MB/s Max 4111.68 MB/s Min 4099.13 MB/s
setting preferred node to 1
memory without policy memcpy Avg 7171.36 MB/s Max 7182.80 MB/s Min 7149.51 MB/s
manual interleaving to all nodes memcpy Avg 5227.46 MB/s Max 5241.04 MB/s Min 5217.81 MB/s
manual interleaving on node 0/1 memcpy Avg 5224.47 MB/s Max 5232.86 MB/s Min 5218.62 MB/s
current interleave node 1
running on node 0, preferred node 0
local memory memcpy Avg 7216.01 MB/s Max 7232.73 MB/s Min 7199.75 MB/s
memory interleaved on all nodes memcpy Avg 5198.67 MB/s Max 5206.07 MB/s Min 5181.75 MB/s
memory interleaved on node 0/1 memcpy Avg 5202.42 MB/s Max 5215.99 MB/s Min 5190.97 MB/s
alloc on node 1 memcpy Avg 4102.75 MB/s Max 4120.90 MB/s Min 4096.13 MB/s
local allocation memcpy Avg 7217.29 MB/s Max 7229.61 MB/s Min 7190.88 MB/s
setting wrong preferred node memcpy Avg 4100.52 MB/s Max 4105.27 MB/s Min 4095.25 MB/s
setting correct preferred node memcpy Avg 7217.71 MB/s Max 7223.00 MB/s Min 7207.09 MB/s
running on node 1, preferred node 0
local memory memcpy Avg 7175.62 MB/s Max 7184.72 MB/s Min 7165.54 MB/s
memory interleaved on all nodes memcpy Avg 5227.73 MB/s Max 5238.17 MB/s Min 5215.99 MB/s
memory interleaved on node 0/1 memcpy Avg 5224.31 MB/s Max 5236.74 MB/s Min 5213.55 MB/s
alloc on node 0 memcpy Avg 4099.95 MB/s Max 4106.15 MB/s Min 4093.63 MB/s
local allocation memcpy Avg 7179.80 MB/s Max 7195.89 MB/s Min 7162.10 MB/s
setting wrong preferred node memcpy Avg 4099.20 MB/s Max 4104.64 MB/s Min 4093.50 MB/s
setting correct preferred node memcpy Avg 7173.35 MB/s Max 7187.80 MB/s Min 7163.63 MB/s

I suspect the improvements here are due to more aggressive memory timings from not running the RAM in low-voltage mode, but I'm wondering if I'd see memory-bandwidth improvements by running two DIMMs per bank/channel instead of one. I'm also wondering whether there is any benefit to running two NICs, with each NIC in a PCIe slot attached to a different CPU socket. Can I effectively ensure that I'm always processing traffic from NIC 1 on the socket 1 CPU, while accessing RAM attached to that CPU, and likewise for NIC 2 / CPU 2? Is it worth paying attention to? The two articles left me thinking these may be concerns, but I'm not sure how significant a role they play. One thing not mentioned in the articles (maybe I missed it) was whether hugepages were configured, and what role, if any, hugepages might play in performance.
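To make the NIC/socket affinity question concrete, this is the kind of thing I had in mind (the interface names eth4/eth5 are placeholders, and the pfdnacluster_master invocation is only how I understand the flags, not a tested recipe):

```shell
#!/bin/sh
# Sketch: find which NUMA node each NIC hangs off, then pin the capture
# process and its memory allocations to that node with numactl.

node_of() {
    # Print the NIC's NUMA node from sysfs; -1 means unknown/no NUMA info.
    f="/sys/class/net/$1/device/numa_node"
    if [ -r "$f" ]; then cat "$f"; else echo "-1"; fi
}

for dev in eth4 eth5; do
    echo "$dev: NUMA node $(node_of "$dev")"
done

# Then, per NIC, something like:
#   numactl --cpunodebind=0 --membind=0 pfdnacluster_master -i eth4 ...
#   numactl --cpunodebind=1 --membind=1 pfdnacluster_master -i eth5 ...
# with the Bro workers reading each cluster pinned the same way.
```

That would at least keep each capture path on the socket local to its NIC, if the per-socket split is worth doing at all.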

Regards,

--
Gary Faulkner


_______________________________________________
Ntop-misc mailing list
[email protected]
http://listgateway.unipi.it/mailman/listinfo/ntop-misc
