Dear Rubina,

Excellent, thank you very much! The change is in master now.
Note that to keep the default memory footprint the same, I have temporarily
halved the default upper limit on sessions (since we now create two bihash
entries per session instead of one). FYI, I plan to do some more work on
session management/reuse before the 18.07 release.

--a

> On 2 Jun 2018, at 07:48, Rubina Bianchi <r_bian...@outlook.com> wrote:
>
> Dear Andrew
>
> Sorry for the delayed response. I checked your second patch and here is my
> test result:
>
> The best case is still the best: vpp throughput is at the maximum
> (18.5 Gbps) in my scenario.
> The worst case is better than before: I never see the deadlock again, and
> throughput increased from 50 Mbps to 5.5 Gbps. I have also added my T-Rex
> result.
>
> -Per port stats table
>       ports |            0 |            1
> -----------------------------------------------
>    opackets |   1119818503 |   1065627562
>      obytes | 490687253990 | 471065675962
>    ipackets |    274437415 |    391504529
>      ibytes | 120020261974 | 170214837563
>     ierrors |            0 |            0
>     oerrors |            0 |            0
>       Tx Bw |    9.48 Gbps |    9.08 Gbps
>
> -Global stats enabled
> Cpu Utilization : 88.4 %  7.0 Gb/core
> Platform_factor : 1.0
> Total-Tx        :  18.56 Gbps
> Total-Rx        :   5.78 Gbps
> Total-PPS       :   5.27 Mpps
> Total-CPS       :  79.51 Kcps
>
> Expected-PPS    :   9.02 Mpps
> Expected-CPS    : 135.31 Kcps
> Expected-BPS    :  31.77 Gbps
>
> Active-flows   :    88840  Clients :   252  Socket-util : 0.5598 %
> Open-flows     : 33973880  Servers : 65532  Socket      :   88840
> Socket/Clients : 352.5
> drop-rate      : 12.79 Gbps
> current time   : 423.4 sec
> test duration  : 99576.6 sec
>
> One point that I missed, and which may be helpful: I run T-Rex with the
> '-p' parameter:
> ./t-rex-64 -c 6 -d 100000 -f cap2/sfr.yaml --cfg cfg/trex_cfg.yaml -m 30 -p
>
> Thanks,
> Sincerely
>
> From: Andrew 👽 Yourtchenko <ayour...@gmail.com>
> Sent: Wednesday, May 30, 2018 12:08 PM
> To: Rubina Bianchi
> Cc: vpp-dev@lists.fd.io
> Subject: Re: [vpp-dev] Rx stuck to 0 after a while
>
> Dear Rubina,
>
> Thanks for checking it!
> yeah, actually that patch was leaking the sessions in the session reuse
> path. I got the setup in the lab locally yesterday and am working on a
> better way to do it...
>
> Will get back to you when I am happy with the way the code works..
>
> --a
>
> On 5/29/18, Rubina Bianchi <r_bian...@outlook.com> wrote:
> > Dear Andrew
> >
> > I cleaned everything and created new deb packages with your patch once
> > again. With your patch I never see the deadlock again, but I still have
> > a throughput problem in my scenario.
> >
> > -Per port stats table
> >       ports |            0 |            1
> > -----------------------------------------------
> >    opackets |    474826597 |    452028770
> >      obytes | 207843848531 | 199591809555
> >    ipackets |     71010677 |     72028456
> >      ibytes |  31441646551 |  31687562468
> >     ierrors |            0 |            0
> >     oerrors |            0 |            0
> >       Tx Bw |    9.56 Gbps |    9.16 Gbps
> >
> > -Global stats enabled
> > Cpu Utilization : 88.4 %  7.1 Gb/core
> > Platform_factor : 1.0
> > Total-Tx        :  18.72 Gbps
> > Total-Rx        :  59.30 Mbps
> > Total-PPS       :   5.31 Mpps
> > Total-CPS       :  79.79 Kcps
> >
> > Expected-PPS    :   9.02 Mpps
> > Expected-CPS    : 135.31 Kcps
> > Expected-BPS    :  31.77 Gbps
> >
> > Active-flows   :    88837  Clients :   252  Socket-util : 0.5598 %
> > Open-flows     : 14708455  Servers : 65532  Socket      :   88837
> > Socket/Clients : 352.5
> > Total_queue_full : 328355248
> > drop-rate      : 18.66 Gbps
> > current time   : 180.9 sec
> > test duration  : 99819.1 sec
> >
> > In the best case (4 interfaces on one NUMA node, with ACLs on only 2 of
> > them) my device (HP DL380 G9) reaches maximum throughput (18.72 Gbps),
> > but in the worst case (4 interfaces on one NUMA node, with ACLs on all
> > of them) the throughput drops from the maximum to around 60 Mbps. So the
> > patch just prevents the deadlock in my case, but throughput is the same
> > as before.
> >
> > ________________________________
> > From: Andrew 👽 Yourtchenko <ayour...@gmail.com>
> > Sent: Tuesday, May 29, 2018 10:11 AM
> > To: Rubina Bianchi
> > Cc: vpp-dev@lists.fd.io
> > Subject: Re: [vpp-dev] Rx stuck to 0 after a while
> >
> > Dear Rubina,
> >
> > thank you for quickly checking it!
> >
> > Judging by the logs, VPP quits, so I would say there should be a core
> > file; could you check?
> >
> > If you find it (double-check by the timestamps that it is indeed the
> > fresh one), you can load it in gdb (using gdb 'path-to-vpp-binary'
> > 'path-to-core') and then get the backtrace using 'bt'; this will give
> > a better idea of what is going on.
> >
> > --a
> >
> > On 5/29/18, Rubina Bianchi <r_bian...@outlook.com> wrote:
> >> Dear Andrew
> >>
> >> I tested your patch and my problem still exists, but my service status
> >> changed, and now there isn't any information about the deadlock problem.
> >> Do you have any idea how I can provide you more information?
> >>
> >> root@MYRB:~# service vpp status
> >> * vpp.service - vector packet processing engine
> >>    Loaded: loaded (/lib/systemd/system/vpp.service; disabled; vendor preset: enabled)
> >>    Active: inactive (dead)
> >>
> >> May 29 09:27:06 MYRB /usr/bin/vpp[30805]: load_one_vat_plugin:67: Loaded plugin: udp_ping_test_plugin.so
> >> May 29 09:27:06 MYRB /usr/bin/vpp[30805]: load_one_vat_plugin:67: Loaded plugin: stn_test_plugin.so
> >> May 29 09:27:06 MYRB vpp[30805]: /usr/bin/vpp[30805]: dpdk: EAL init args: -c 1ff -n 4 --huge-dir /run/vpp/hugepages --file-prefix vpp -w 0000:08:00.0 -w 0000:08:00.1 -w 0000:08
> >> May 29 09:27:06 MYRB /usr/bin/vpp[30805]: dpdk: EAL init args: -c 1ff -n 4 --huge-dir /run/vpp/hugepages --file-prefix vpp -w 0000:08:00.0 -w 0000:08:00.1 -w 0000:08:00.2 -w 000
> >> May 29 09:27:07 MYRB vnet[30805]: dpdk_ipsec_process:1012: not enough DPDK crypto resources, default to OpenSSL
> >> May 29 09:27:13 MYRB vnet[30805]: unix_signal_handler:124: received signal SIGCONT, PC 0x7fa535dfbac0
> >> May 29 09:27:13 MYRB vnet[30805]: received SIGTERM, exiting...
> >> May 29 09:27:13 MYRB systemd[1]: Stopping vector packet processing engine...
> >> May 29 09:27:13 MYRB vnet[30805]: unix_signal_handler:124: received signal SIGTERM, PC 0x7fa534121867
> >> May 29 09:27:13 MYRB systemd[1]: Stopped vector packet processing engine.
> >>
> >>
> >> ________________________________
> >> From: Andrew 👽 Yourtchenko <ayour...@gmail.com>
> >> Sent: Monday, May 28, 2018 5:58 PM
> >> To: Rubina Bianchi
> >> Cc: vpp-dev@lists.fd.io
> >> Subject: Re: [vpp-dev] Rx stuck to 0 after a while
> >>
> >> Dear Rubina,
> >>
> >> Thanks for catching and reporting this!
> >>
> >> I suspect what might be happening is that my recent change of using two
> >> unidirectional sessions in the bihash vs. the single one triggered a
> >> race, whereby as the owning worker is deleting the session, the
> >> non-owning worker is trying to update it. That would logically explain
> >> the "BUG: ..." line (since you don't change the interfaces nor move the
> >> traffic around, the 5-tuples should not collide), and as well the later
> >> stop.
> >>
> >> To take care of this issue, I think I will split the deletion of the
> >> session into two stages:
> >> 1) deactivation of the bihash entries that steer the traffic
> >> 2) freeing up the per-worker session structure
> >>
> >> and have a little pause in between these two stages so that the
> >> workers-in-progress can finish updating the structures.
> >>
> >> The below gerrit is the first cut:
> >>
> >> https://gerrit.fd.io/r/#/c/12770/
> >>
> >> It passes make test right now, but I did not kick its tires too much
> >> yet; will do tomorrow.
> >>
> >> You can try this change out in your test setup as well and tell me how
> >> it feels.
> >> --a
> >>
> >> On 5/28/18, Rubina Bianchi <r_bian...@outlook.com> wrote:
> >>> Hi
> >>>
> >>> I ran vpp v18.07-rc0~237-g525c9d0f with only 2 interfaces in a
> >>> stateful acl (permit+reflect) and generated sfr traffic using trex
> >>> v2.27. My rx drops to 0 after a short while, about 300 sec on my
> >>> machine. Here is the vpp status:
> >>>
> >>> root@MYRB:~# service vpp status
> >>> * vpp.service - vector packet processing engine
> >>>    Loaded: loaded (/lib/systemd/system/vpp.service; disabled; vendor preset: enabled)
> >>>    Active: failed (Result: signal) since Mon 2018-05-28 11:35:03 +0130; 37s ago
> >>>   Process: 32838 ExecStopPost=/bin/rm -f /dev/shm/db /dev/shm/global_vm /dev/shm/vpe-api (code=exited, status=0/SUCCESS)
> >>>   Process: 31754 ExecStart=/usr/bin/vpp -c /etc/vpp/startup.conf (code=killed, signal=ABRT)
> >>>   Process: 31750 ExecStartPre=/sbin/modprobe uio_pci_generic (code=exited, status=0/SUCCESS)
> >>>   Process: 31747 ExecStartPre=/bin/rm -f /dev/shm/db /dev/shm/global_vm /dev/shm/vpe-api (code=exited, status=0/SUCCESS)
> >>>  Main PID: 31754 (code=killed, signal=ABRT)
> >>>
> >>> May 28 16:32:47 MYRB vnet[31754]: acl_fa_node_fn:210: BUG: session LSB16(sw_if_index) and 5-tuple collision!
> >>> May 28 16:35:02 MYRB vnet[31754]: unix_signal_handler:124: received signal SIGCONT, PC 0x7f1fb591cac0
> >>> May 28 16:35:02 MYRB vnet[31754]: received SIGTERM, exiting...
> >>> May 28 16:35:02 MYRB systemd[1]: Stopping vector packet processing engine...
> >>> May 28 16:35:02 MYRB vnet[31754]: unix_signal_handler:124: received signal SIGTERM, PC 0x7f1fb3c40867
> >>> May 28 16:35:03 MYRB vpp[31754]: vlib_worker_thread_barrier_sync_int: worker thread deadlock
> >>> May 28 16:35:03 MYRB systemd[1]: vpp.service: Main process exited, code=killed, status=6/ABRT
> >>> May 28 16:35:03 MYRB systemd[1]: Stopped vector packet processing engine.
> >>> May 28 16:35:03 MYRB systemd[1]: vpp.service: Unit entered failed state.
> >>> May 28 16:35:03 MYRB systemd[1]: vpp.service: Failed with result 'signal'.
> >>>
> >>> I have attached my vpp configs to this email. I also ran this test
> >>> with the same config but with 4 interfaces instead of two. In that
> >>> case nothing happened to vpp and it remained functional for a long
> >>> time.
> >>>
> >>> Thanks,
> >>> RB