Dear Rubina,

Excellent, thank you very much! The change is in master now.
Note that to keep the default memory footprint the same, I have temporarily
halved the default upper limit on sessions (since we now create two bihash
entries per session instead of one). FYI, I plan to do some more work on
session management/reuse before the 18.07 release.

--a

> On 2 Jun 2018, at 07:48, Rubina Bianchi <r_bian...@outlook.com> wrote:
>
> Dear Andrew
>
> Sorry for the delayed response. I checked your second patch and here is my
> test result:
>
> The best case is still the best: vpp throughput is at the maximum
> (18.5 Gbps) in my scenario.
> The worst case is better than before: I never see the deadlock again, and
> throughput increased from 50 Mbps to 5.5 Gbps. I have also added my T-Rex
> result.
>
> -Per port stats table
>       ports |            0 |            1
> -----------------------------------------------
>    opackets |   1119818503 |   1065627562
>      obytes | 490687253990 | 471065675962
>    ipackets |    274437415 |    391504529
>      ibytes | 120020261974 | 170214837563
>     ierrors |            0 |            0
>     oerrors |            0 |            0
>       Tx Bw |    9.48 Gbps |    9.08 Gbps
>
> -Global stats enabled
> Cpu Utilization : 88.4 %  7.0 Gb/core
> Platform_factor : 1.0
> Total-Tx        :  18.56 Gbps
> Total-Rx        :   5.78 Gbps
> Total-PPS       :   5.27 Mpps
> Total-CPS       :  79.51 Kcps
>
> Expected-PPS    :   9.02 Mpps
> Expected-CPS    : 135.31 Kcps
> Expected-BPS    :  31.77 Gbps
>
> Active-flows   :    88840  Clients :   252  Socket-util : 0.5598 %
> Open-flows     : 33973880  Servers : 65532  Socket      :   88840
> Socket/Clients : 352.5
> drop-rate      : 12.79 Gbps
> current time   : 423.4 sec
> test duration  : 99576.6 sec
>
> One point that I missed, and which may be helpful: I run T-Rex with the
> '-p' parameter:
> ./t-rex-64 -c 6 -d 100000 -f cap2/sfr.yaml --cfg cfg/trex_cfg.yaml -m 30 -p
>
> Thanks,
> Sincerely
>
> From: Andrew 👽 Yourtchenko <ayour...@gmail.com>
> Sent: Wednesday, May 30, 2018 12:08 PM
> To: Rubina Bianchi
> Cc: vpp-dev@lists.fd.io
> Subject: Re: [vpp-dev] Rx stuck to 0 after a while
>
> Dear Rubina,
>
> Thanks for checking it!
> yeah, actually that patch was leaking the sessions in the session reuse
> path. I got the setup in the lab locally yesterday and am working on a
> better way to do it...
>
> Will get back to you when I am happy with the way the code works..
>
> --a
>
> On 5/29/18, Rubina Bianchi <r_bian...@outlook.com> wrote:
> > Dear Andrew
> >
> > I cleaned everything and created new deb packages with your patch once
> > again. With your patch I never see the deadlock again, but I still have
> > a throughput problem in my scenario.
> >
> > -Per port stats table
> >       ports |            0 |            1
> > -----------------------------------------------
> >    opackets |    474826597 |    452028770
> >      obytes | 207843848531 | 199591809555
> >    ipackets |     71010677 |     72028456
> >      ibytes |  31441646551 |  31687562468
> >     ierrors |            0 |            0
> >     oerrors |            0 |            0
> >       Tx Bw |    9.56 Gbps |    9.16 Gbps
> >
> > -Global stats enabled
> > Cpu Utilization : 88.4 %  7.1 Gb/core
> > Platform_factor : 1.0
> > Total-Tx        :  18.72 Gbps
> > Total-Rx        :  59.30 Mbps
> > Total-PPS       :   5.31 Mpps
> > Total-CPS       :  79.79 Kcps
> >
> > Expected-PPS    :   9.02 Mpps
> > Expected-CPS    : 135.31 Kcps
> > Expected-BPS    :  31.77 Gbps
> >
> > Active-flows   :    88837  Clients :   252  Socket-util : 0.5598 %
> > Open-flows     : 14708455  Servers : 65532  Socket      :   88837
> > Socket/Clients : 352.5
> > Total_queue_full : 328355248
> > drop-rate      : 18.66 Gbps
> > current time   : 180.9 sec
> > test duration  : 99819.1 sec
> >
> > In the best case (4 interfaces on one NUMA node, with ACLs on only 2 of
> > them) my device (HP DL380 G9) reaches maximum throughput (18.72 Gbps),
> > but in the worst case (4 interfaces on one NUMA node, with ACLs on all
> > of them) the throughput drops from the maximum to around 60 Mbps. So the
> > patch just prevents the deadlock in my case, but throughput is the same
> > as before.
> >
> > ________________________________
> > From: Andrew 👽 Yourtchenko <ayour...@gmail.com>
> > Sent: Tuesday, May 29, 2018 10:11 AM
> > To: Rubina Bianchi
> > Cc: vpp-dev@lists.fd.io
> > Subject: Re: [vpp-dev] Rx stuck to 0 after a while
> >
> > Dear Rubina,
> >
> > thank you for quickly checking it!
> >
> > Judging by the logs, VPP quits, so I would say there should be a core
> > file; could you check?
> >
> > If you find it (double-check by the timestamps that it is indeed the
> > fresh one), you can load it in gdb (using gdb 'path-to-vpp-binary'
> > 'path-to-core') and then get the backtrace using 'bt'; this will give
> > a better idea of what is going on.
> >
> > --a
> >
> > On 5/29/18, Rubina Bianchi <r_bian...@outlook.com> wrote:
> >> Dear Andrew
> >>
> >> I tested your patch and my problem still exists, but my service status
> >> changed, and now there isn't any information about the deadlock problem.
> >> Do you have any idea how I can provide you more information?
> >>
> >> root@MYRB:~# service vpp status
> >> * vpp.service - vector packet processing engine
> >>    Loaded: loaded (/lib/systemd/system/vpp.service; disabled; vendor preset: enabled)
> >>    Active: inactive (dead)
> >>
> >> May 29 09:27:06 MYRB /usr/bin/vpp[30805]: load_one_vat_plugin:67: Loaded plugin: udp_ping_test_plugin.so
> >> May 29 09:27:06 MYRB /usr/bin/vpp[30805]: load_one_vat_plugin:67: Loaded plugin: stn_test_plugin.so
> >> May 29 09:27:06 MYRB vpp[30805]: /usr/bin/vpp[30805]: dpdk: EAL init args: -c 1ff -n 4 --huge-dir /run/vpp/hugepages --file-prefix vpp -w 0000:08:00.0 -w 0000:08:00.1 -w 0000:08
> >> May 29 09:27:06 MYRB /usr/bin/vpp[30805]: dpdk: EAL init args: -c 1ff -n 4 --huge-dir /run/vpp/hugepages --file-prefix vpp -w 0000:08:00.0 -w 0000:08:00.1 -w 0000:08:00.2 -w 000
> >> May 29 09:27:07 MYRB vnet[30805]: dpdk_ipsec_process:1012: not enough DPDK crypto resources, default to OpenSSL
> >> May 29 09:27:13 MYRB vnet[30805]: unix_signal_handler:124: received signal SIGCONT, PC 0x7fa535dfbac0
> >> May 29 09:27:13 MYRB vnet[30805]: received SIGTERM, exiting...
> >> May 29 09:27:13 MYRB systemd[1]: Stopping vector packet processing engine...
> >> May 29 09:27:13 MYRB vnet[30805]: unix_signal_handler:124: received signal SIGTERM, PC 0x7fa534121867
> >> May 29 09:27:13 MYRB systemd[1]: Stopped vector packet processing engine.
> >>
> >>
> >> ________________________________
> >> From: Andrew 👽 Yourtchenko <ayour...@gmail.com>
> >> Sent: Monday, May 28, 2018 5:58 PM
> >> To: Rubina Bianchi
> >> Cc: vpp-dev@lists.fd.io
> >> Subject: Re: [vpp-dev] Rx stuck to 0 after a while
> >>
> >> Dear Rubina,
> >>
> >> Thanks for catching and reporting this!
> >>
> >> I suspect what might be happening is that my recent change of using two
> >> unidirectional sessions in the bihash vs. the single one triggered a
> >> race, whereby as the owning worker is deleting the session, the
> >> non-owning worker is trying to update it. That would logically explain
> >> the "BUG: ..." line (since you don't change the interfaces nor move the
> >> traffic around, the 5-tuples should not collide), and as well the later
> >> stop.
> >>
> >> To take care of this issue, I think I will split the deletion of the
> >> session into two stages:
> >> 1) deactivation of the bihash entries that steer the traffic
> >> 2) freeing up the per-worker session structure
> >>
> >> and have a little pause in between these two stages so that the
> >> workers-in-progress can finish updating the structures.
> >>
> >> The below gerrit is the first cut:
> >>
> >> https://gerrit.fd.io/r/#/c/12770/
> >>
> >> It passes make test right now, but I did not kick its tires too much
> >> yet; will do tomorrow.
> >>
> >> You can try this change out in your test setup as well and tell me how
> >> it feels.
> >> --a
> >>
> >> On 5/28/18, Rubina Bianchi <r_bian...@outlook.com> wrote:
> >>> Hi
> >>>
> >>> I ran vpp v18.07-rc0~237-g525c9d0f with only 2 interfaces in a
> >>> stateful acl (permit+reflect) and generated sfr traffic using trex
> >>> v2.27. My rx drops to 0 after a short while, about 300 sec on my
> >>> machine. Here is the vpp status:
> >>>
> >>> root@MYRB:~# service vpp status
> >>> * vpp.service - vector packet processing engine
> >>>    Loaded: loaded (/lib/systemd/system/vpp.service; disabled; vendor preset: enabled)
> >>>    Active: failed (Result: signal) since Mon 2018-05-28 11:35:03 +0130; 37s ago
> >>>   Process: 32838 ExecStopPost=/bin/rm -f /dev/shm/db /dev/shm/global_vm /dev/shm/vpe-api (code=exited, status=0/SUCCESS)
> >>>   Process: 31754 ExecStart=/usr/bin/vpp -c /etc/vpp/startup.conf (code=killed, signal=ABRT)
> >>>   Process: 31750 ExecStartPre=/sbin/modprobe uio_pci_generic (code=exited, status=0/SUCCESS)
> >>>   Process: 31747 ExecStartPre=/bin/rm -f /dev/shm/db /dev/shm/global_vm /dev/shm/vpe-api (code=exited, status=0/SUCCESS)
> >>>  Main PID: 31754 (code=killed, signal=ABRT)
> >>>
> >>> May 28 16:32:47 MYRB vnet[31754]: acl_fa_node_fn:210: BUG: session LSB16(sw_if_index) and 5-tuple collision!
> >>> May 28 16:35:02 MYRB vnet[31754]: unix_signal_handler:124: received signal SIGCONT, PC 0x7f1fb591cac0
> >>> May 28 16:35:02 MYRB vnet[31754]: received SIGTERM, exiting...
> >>> May 28 16:35:02 MYRB systemd[1]: Stopping vector packet processing engine...
> >>> May 28 16:35:02 MYRB vnet[31754]: unix_signal_handler:124: received signal SIGTERM, PC 0x7f1fb3c40867
> >>> May 28 16:35:03 MYRB vpp[31754]: vlib_worker_thread_barrier_sync_int: worker thread deadlock
> >>> May 28 16:35:03 MYRB systemd[1]: vpp.service: Main process exited, code=killed, status=6/ABRT
> >>> May 28 16:35:03 MYRB systemd[1]: Stopped vector packet processing engine.
> >>> May 28 16:35:03 MYRB systemd[1]: vpp.service: Unit entered failed state.
> >>> May 28 16:35:03 MYRB systemd[1]: vpp.service: Failed with result 'signal'.
> >>>
> >>> I have attached my vpp configs to this email. I also ran this test
> >>> with the same config but with 4 interfaces instead of two. In that
> >>> case nothing happened to vpp and it remained functional for a long
> >>> time.
> >>>
> >>> Thanks,
> >>> RB