Gary,
I have written
http://www.ntop.org/pf_ring/not-all-servers-are-alike-with-pf_ring-zcdna-part-3/
which describes some common misconfigurations that might also apply to
your case.
In particular, I think that:

1. you need to allocate memory properly
2. you need to verify which NUMA node your NICs are attached to
3. you need to make sure Bro is bound to the correct node *before* you start
it: if it opens PF_RING on the wrong node and only then sets its affinity to
the right node, it may already be too late
4. hugepages can help, but in your case the performance degradation is so
large that I suspect the cause is something else
5. when you aggregate many ports with the cluster master you do incur cache
misses, but again the performance you are seeing is too low for that alone,
and I think the problem is somewhere else
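Points 2 and 3 can be checked from the shell. A minimal sketch (the helper name nic_numa_node is mine, and the interface name dna0, node number 1, and the pfdnacluster_master invocation are examples taken from this thread, not values for your system):

```shell
#!/bin/sh
# Report which NUMA node a NIC's PCI slot is attached to.
# Prints -1 when the kernel does not report a node (or when the
# interface has no PCI device behind it, e.g. loopback).
nic_numa_node() {
    cat "/sys/class/net/$1/device/numa_node" 2>/dev/null || echo -1
}

nic_numa_node dna0

# If the NIC turns out to be on node 1, start the capture process
# already bound to that node, so its memory is local from the very
# first allocation (setting affinity after startup may be too late):
#   numactl --cpunodebind=1 --membind=1 \
#       pfdnacluster_master -i dna0 -d -n 14,1 -c 21
```

Cross-check the core-to-node mapping with `numactl --hardware` before picking the cores to pin on.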

Good luck,
Luca



On 01 May 2014, at 18:43, Gary Faulkner <[email protected]> wrote:

> On 5/1/2014 1:59 AM, Luca Deri wrote:
>> Gary,
>> in order to answer your questions I need to know
>> 0. How did you insmod the DNA driver, and what command-line options did
>> you use, so that we can figure out on which node your memory was
>> allocated.
> I use the load_dna_driver.sh script with the following insmod lines:
> 
> insmod /nsm/modules/pf_ring.ko min_num_slots=65534 transparent_mode=2 
> enable_tx_capture=0
> insmod /nsm/modules/ixgbe.ko RSS=1,1,1,1, InterruptThrottleRate=16000 
> mtu=9000 adapters_to_enable=MAC1,MAC2,MAC3,MAC4
> 
> *MAC1 etc would have the licensed MACs listed for each host.
> 
>> 1. how you started the cluster master (command line). Did you use
>> hugepages, BTW?
> I'm not using hugepages at present, but I have noticed this is the default in 
> ZC. Is this a significant improvement? I don't have any experience with 
> hugepages, but it looks like the load_dna_driver.sh script that ships with 
> 6.0/ZC includes commands for loading hugepages. Would these be OK to use with 
> the 5.6.2 load_dna_driver.sh script assuming I adjust the memory size and 
> such to match my system?
> 
> I'm currently loading it as follows:
> 
> pfdnacluster_master -i dna0,dna1,dna2,dna3 -d -n 14,1 -c 21 
> 
> I've also tried using just a single port, as I'm really not sending anywhere
> near 10 Gbit/s to each host:
> 
> pfdnacluster_master -i dna0 -d -n 14,1 -c 21
> 
> The servers were handed to me with two X520s and 4 SFP+s each, but I don't
> think I'd ever need more than a single 10 Gbit port. I was considering buying
> new servers to expand the cluster with a single 10 Gbit port each, unless
> there is a need or benefit to running one NIC per CPU in the appropriate
> PCIe slots.
> 
>> 2. the hardware topology of your system, to figure out which NUMA node
>> your PCI NICs are connected to. As you are using a dual-port NIC in a
>> single PCIe slot, the card is bound to one CPU, and thus you should *not*
>> run any code or bind anything on the CPU that is not directly attached to
>> your NIC, as this will result in poor performance
> I've attached the spec sheet and technical guide for the 720xd from Dell. The 
> system diagram is Figure 20 on page 65 of the technical guide and shows how 
> the PCIe slots and memory are attached. The NICs are in slot 1 & slot 2, 
> which appear to be bound to CPU2/node1. All 8 "Dimm 1" slots are populated 
> with matching 8GB dual-rank DIMMs.    
> 
>> 3. on what cores are you running (binding) your apps
> In Bro:
> 
> pin_cpus=2,3,4,5,6,7,8,9,10,11,12,13,14,15 (16-31 are on the same physical
> cores as 0-15 as far as I can tell, so I avoid binding to those, but I
> haven't gone so far as to disable hyper-threading. I leave 0 and 1 free for
> other system tasks.)
> 
>> 4. please report which core IDs are attached to which node (numactl --hardware)
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
> node 0 size: 32722 MB
> node 0 free: 12261 MB
> node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
> node 1 size: 32768 MB
> node 1 free: 15785 MB
> node distances:
> node   0   1 
>   0:  10  20 
>   1:  20  10 
> 
> Regards,
> Gary
> 
>> 
>> 
>> I cannot help you until all the above questions are answered, but I believe
>> that the problem might be due to a non-optimal system configuration, as
>> your system should be able to handle 3-4 Gbit with no drops whatsoever
>> 
>> Cheers Luca
>> 
>> 
>> 
>> On 30 Apr 2014, at 22:23, Gary Faulkner <[email protected]> wrote:
>> 
>>> Hello,
>>> 
>>> I've been running pf_ring with pfdnacluster_master and Bro on a couple of
>>> dual-socket Dell 720xds, trying to figure out the optimum workload each
>>> can handle. The systems seem to be oversubscribed at 3-4 Gbps and 500 kpps
>>> of traffic each, and I'm frequently seeing 10-25% packet loss via the
>>> stats available in Bro (capture loss script etc.). I am running 5.6.2, but
>>> I had similar loss with 5.6.1. It was pointed out during troubleshooting
>>> that the DNA-based pf_ring processes in /proc/net/pf_ring don't seem to
>>> have the packet loss / dropped packet statistics that are present when
>>> running pf_ring without DNA. Naturally I would like to know if there is
>>> any loss before the data gets handed off to Bro. Is there a counter for
>>> this somewhere?
>>> I'd also like to verify that my system is configured properly, as other
>>> people appear to be getting much better performance out of similar, or
>>> sometimes less powerful, hardware.
>>> 
>>> Regarding this latter point, someone from the Bro community pointed out
>>> an article from a couple of years back on the NTOP blog, "Not All Servers
>>> Are Alike (With DNA)", which seemed like it might relate to my experience.
>>> One of the performance tests used was to run numademo, which I did, and I
>>> quickly discovered that my memory results were closer to the worst-case
>>> results of the tested systems. After a quick jaunt through the BIOS/UEFI I
>>> saw limited improvements in some tests by not running my RAM in
>>> low-voltage mode and by verifying that everything was using
>>> performance-optimized settings (some already were). I'm currently running:
>>> 
>>> 2 Dell 720xd servers each with:
>>> 64 GB RAM (8 banks with up to 3 DIMMs per bank, 4 banks per socket, 1 8GB 
>>> module per bank, Dual-rank, 1600Mhz RDIMMs)
>>> 2 Intel Xeon E5-2670v1 (2.60GHz, 20M Cache, 8.0GT/s QPI, Turbo, 8 Physical 
>>> Cores/16 Logical Cores each)
>>> Intel X520 DP 10Gb DA/SFP+ NIC
>>> 
>>> Before tweaking BIOS settings my results looked like this:
>>> 
>>> numademo 128M memcpy
>>> 2 nodes available
>>> memory with no policy memcpy              Avg 6247.16 MB/s Max 6796.87 MB/s 
>>> Min 3669.56 MB/s
>>> local memory memcpy                       Avg 6722.95 MB/s Max 6770.47 MB/s 
>>> Min 6623.46 MB/s
>>> memory interleaved on all nodes memcpy    Avg 4960.33 MB/s Max 4971.40 MB/s 
>>> Min 4943.93 MB/s
>>> memory on node 0 memcpy                   Avg 6775.29 MB/s Max 6793.43 MB/s 
>>> Min 6726.02 MB/s
>>> memory on node 1 memcpy                   Avg 3947.07 MB/s Max 3957.24 MB/s 
>>> Min 3937.39 MB/s
>>> memory interleaved on 0 1 memcpy          Avg 4974.14 MB/s Max 4986.17 MB/s 
>>> Min 4940.65 MB/s
>>> setting preferred node to 0
>>> memory without policy memcpy              Avg 6734.39 MB/s Max 6779.70 MB/s 
>>> Min 6637.87 MB/s
>>> setting preferred node to 1
>>> memory without policy memcpy              Avg 3938.87 MB/s Max 3944.91 MB/s 
>>> Min 3926.56 MB/s
>>> manual interleaving to all nodes memcpy   Avg 4936.62 MB/s Max 4965.14 MB/s 
>>> Min 4851.89 MB/s
>>> manual interleaving on node 0/1 memcpy    Avg 4958.69 MB/s Max 4975.08 MB/s 
>>> Min 4943.38 MB/s
>>> current interleave node 1
>>> running on node 0, preferred node 0
>>> local memory memcpy                       Avg 6733.07 MB/s Max 6796.52 MB/s 
>>> Min 6652.35 MB/s
>>> memory interleaved on all nodes memcpy    Avg 4915.64 MB/s Max 4959.82 MB/s 
>>> Min 4671.37 MB/s
>>> memory interleaved on node 0/1 memcpy     Avg 4951.08 MB/s Max 4958.36 MB/s 
>>> Min 4931.21 MB/s
>>> alloc on node 1 memcpy                    Avg 3923.04 MB/s Max 3942.94 MB/s 
>>> Min 3890.59 MB/s
>>> local allocation memcpy                   Avg 6759.52 MB/s Max 6782.44 MB/s 
>>> Min 6726.69 MB/s
>>> setting wrong preferred node memcpy       Avg 3923.52 MB/s Max 3946.65 MB/s 
>>> Min 3880.47 MB/s
>>> setting correct preferred node memcpy     Avg 6793.87 MB/s Max 6821.05 MB/s 
>>> Min 6757.51 MB/s
>>> running on node 1, preferred node 0
>>> local memory memcpy                       Avg 6886.99 MB/s Max 7038.16 MB/s 
>>> Min 5890.62 MB/s
>>> memory interleaved on all nodes memcpy    Avg 5191.57 MB/s Max 5203.04 MB/s 
>>> Min 5165.80 MB/s
>>> memory interleaved on node 0/1 memcpy     Avg 5187.48 MB/s Max 5198.61 MB/s 
>>> Min 5172.57 MB/s
>>> alloc on node 0 memcpy                    Avg 4070.32 MB/s Max 4073.13 MB/s 
>>> Min 4067.82 MB/s
>>> local allocation memcpy                   Avg 7037.24 MB/s Max 7049.99 MB/s 
>>> Min 7028.95 MB/s
>>> setting wrong preferred node memcpy       Avg 4062.93 MB/s Max 4075.11 MB/s 
>>> Min 4049.17 MB/s
>>> setting correct preferred node memcpy     Avg 7037.02 MB/s Max 7045.18 MB/s 
>>> Min 7026.00 MB/s
>>> 
>>> After tweaking (voltage raised from 1.35V to 1.5V, since the RAM supports
>>> both standard and low-voltage modes; CPU set to maximize performance;
>>> memory was already set to optimize for performance):
>>> numademo 128M memcpy
>>> 2 nodes available
>>> memory with no policy memcpy              Avg 7174.50 MB/s Max 7190.88 MB/s 
>>> Min 7159.42 MB/s
>>> local memory memcpy                       Avg 7169.29 MB/s Max 7186.26 MB/s 
>>> Min 7144.18 MB/s
>>> memory interleaved on all nodes memcpy    Avg 5223.29 MB/s Max 5228.58 MB/s 
>>> Min 5214.97 MB/s
>>> memory on node 0 memcpy                   Avg 4104.18 MB/s Max 4111.68 MB/s 
>>> Min 4097.63 MB/s
>>> memory on node 1 memcpy                   Avg 7171.44 MB/s Max 7182.80 MB/s 
>>> Min 7156.75 MB/s
>>> memory interleaved on 0 1 memcpy          Avg 5225.20 MB/s Max 5244.31 MB/s 
>>> Min 5215.58 MB/s
>>> setting preferred node to 0
>>> memory without policy memcpy              Avg 4104.44 MB/s Max 4111.68 MB/s 
>>> Min 4099.13 MB/s
>>> setting preferred node to 1
>>> memory without policy memcpy              Avg 7171.36 MB/s Max 7182.80 MB/s 
>>> Min 7149.51 MB/s
>>> manual interleaving to all nodes memcpy   Avg 5227.46 MB/s Max 5241.04 MB/s 
>>> Min 5217.81 MB/s
>>> manual interleaving on node 0/1 memcpy    Avg 5224.47 MB/s Max 5232.86 MB/s 
>>> Min 5218.62 MB/s
>>> current interleave node 1
>>> running on node 0, preferred node 0
>>> local memory memcpy                       Avg 7216.01 MB/s Max 7232.73 MB/s 
>>> Min 7199.75 MB/s
>>> memory interleaved on all nodes memcpy    Avg 5198.67 MB/s Max 5206.07 MB/s 
>>> Min 5181.75 MB/s
>>> memory interleaved on node 0/1 memcpy     Avg 5202.42 MB/s Max 5215.99 MB/s 
>>> Min 5190.97 MB/s
>>> alloc on node 1 memcpy                    Avg 4102.75 MB/s Max 4120.90 MB/s 
>>> Min 4096.13 MB/s
>>> local allocation memcpy                   Avg 7217.29 MB/s Max 7229.61 MB/s 
>>> Min 7190.88 MB/s
>>> setting wrong preferred node memcpy       Avg 4100.52 MB/s Max 4105.27 MB/s 
>>> Min 4095.25 MB/s
>>> setting correct preferred node memcpy     Avg 7217.71 MB/s Max 7223.00 MB/s 
>>> Min 7207.09 MB/s
>>> running on node 1, preferred node 0
>>> local memory memcpy                       Avg 7175.62 MB/s Max 7184.72 MB/s 
>>> Min 7165.54 MB/s
>>> memory interleaved on all nodes memcpy    Avg 5227.73 MB/s Max 5238.17 MB/s 
>>> Min 5215.99 MB/s
>>> memory interleaved on node 0/1 memcpy     Avg 5224.31 MB/s Max 5236.74 MB/s 
>>> Min 5213.55 MB/s
>>> alloc on node 0 memcpy                    Avg 4099.95 MB/s Max 4106.15 MB/s 
>>> Min 4093.63 MB/s
>>> local allocation memcpy                   Avg 7179.80 MB/s Max 7195.89 MB/s 
>>> Min 7162.10 MB/s
>>> setting wrong preferred node memcpy       Avg 4099.20 MB/s Max 4104.64 MB/s 
>>> Min 4093.50 MB/s
>>> setting correct preferred node memcpy     Avg 7173.35 MB/s Max 7187.80 MB/s 
>>> Min 7163.63 MB/s
>>> 
>>> I suspect any improvements here are due to the more aggressive memory
>>> timings that come from not running the RAM in low-voltage mode, but I'm
>>> wondering if I'd see memory bandwidth improvements by running two DIMMs
>>> per bank/channel instead of one. I'm also wondering if there is any
>>> benefit to running two NICs, with each NIC in a PCIe slot connected to a
>>> different CPU socket. Can I effectively ensure that I'm always processing
>>> traffic from NIC 1 on the socket 1 CPU while accessing RAM attached to
>>> that CPU, and likewise for NIC 2/CPU 2? Is it worth paying attention to?
>>> The two articles left me thinking that these may be concerns, but I'm not
>>> sure how significant a role they play. One thing not mentioned in the
>>> articles (maybe I missed it) was whether hugepages were configured and
>>> what role, if any, hugepages might play in performance.
>>> 
>>> Regards, 
>>> -- 
>>> Gary Faulkner
>>> _______________________________________________
>>> Ntop-misc mailing list
>>> [email protected]
>>> http://listgateway.unipi.it/mailman/listinfo/ntop-misc
>> 
>> 
>> 
> 
> <Dell-PowerEdge-R720xd-Spec-Sheet.pdf><dell-poweredge-r720-r720xd-technical-guide.pdf>

