> I think your analysis is spot on. Thank you!
> Now that you're a VPP PPPoE expert -- do you have some suggestions on a fix?
Shouldn't the usual solution of storing the index in the pool instead of
the pointer itself be enough? I.e., use session_id as fixup_data, and use
it in pppoe_fixup() to retrieve the session from the pool.

ben

> On 17/10/2019 16:18, "vpp-dev@lists.fd.io on behalf of Raj"
> <vpp-d...@lists.fd.io on behalf of rajlistu...@gmail.com> wrote:
>
> Hello all,
>
> I have done some more analysis of this issue and I think I have
> identified what could be wrong here.
>
> During pppoe session creation, as we mentioned earlier, the function
> below is invoked from pppoe_update_adj():
>
> adj_nbr_midchain_update_rewrite (adj_index_t adj_index,
>                                  adj_midchain_fixup_t fixup,
>                                  const void *fixup_data, ...)
>
> Note the third argument, 'fixup_data': here we pass the current
> session's address as fixup_data.
>
> When subsequent sessions are added, this memory address gets altered
> because we resize the vector pool pem->sessions. Hence the stored
> fixup_data (adj->sub_type.midchain.fixup_data) address is no longer
> valid, and that address could already have been freed.
>
> I think that is the reason why we see the memory corruption. It would
> be great if someone with far better knowledge of VPP internals could
> take a look at this and confirm it.
>
> Thanks and Regards,
>
> Raj
>
> On Fri, Oct 11, 2019 at 12:17 PM Ni, Hongjun <hongjun...@intel.com> wrote:
> >
> > Hi Raj,
> >
> > I tried to reproduce your issue on VPP 20.01 as per your steps a few
> > times, but could not reproduce it.
> >
> > From your description, please set a breakpoint in
> > vnet_pppoe_add_del_session() and see what happens when you create
> > your second pppoe session with traffic running on the first pppoe
> > session.
> >
> > Thanks,
> > Hongjun
> >
> > -----Original Message-----
> > From: vpp-dev@lists.fd.io [mailto:vpp-dev@lists.fd.io] On Behalf Of Raj
> > Sent: Monday, September 30, 2019 11:54 PM
> > To: vpp-dev <vpp-dev@lists.fd.io>
> > Subject: Re: [vpp-dev] VPP core dump with PPPoE
> >
> > Hello all,
> >
> > I did some more debugging to find out when and where exactly the
> > pppoe_session_t gets corrupted. I added a couple of log entries, as
> > shown below, to log the pppoe session id when a session is created,
> > as well as when a packet traverses from north to south. I have tried
> > this on VPP 19.08, 19.04 and 19.01 with the same results.
> >
> > vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 20923
> > vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 35666
> > vnet [21892]: pppoe_fixup:169: New_Packet pppoe01 session id 35666
> > vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 58191
> >
> > The sequence when the corruption happens seems to be:
> >
> > 1. A new session is created.
> > 2. A packet for the newly created session traverses from north to south.
> > 3. The next session is created - and VPP crashes.
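In code terms, the analysis above and the suggestion at the top of the
thread come down to the following sketch (simplified; variable names as
in the pppoe plugin, trailing arguments elided as in the quote above;
illustrative, not a committed patch):

/* Today (fragile): pppoe_update_adj() hands the adjacency a raw
 * pointer into the sessions pool. A later pool_get() may resize
 * pem->sessions, moving the pool and freeing the old block, so the
 * stored pointer dangles. */
pppoe_session_t *t;
pool_get (pem->sessions, t);
adj_nbr_midchain_update_rewrite (adj_index, pppoe_fixup,
                                 t /* pointer into the pool! */, ...);

/* Suggested: store the pool index instead; it survives resizes. */
u32 session_index = t - pem->sessions;
adj_nbr_midchain_update_rewrite (adj_index, pppoe_fixup,
                                 (void *) (uword) session_index, ...);

/* ... and have the fixup resolve the index through the pool: */
static void
pppoe_fixup (vlib_main_t * vm, const ip_adjacency_t * adj,
             vlib_buffer_t * b0, const void *data)
{
  pppoe_main_t *pem = &pppoe_main;
  pppoe_session_t *t =
    pool_elt_at_index (pem->sessions, (u32) (uword) data);
  /* ... use t->session_id, t->encap_if_index etc. as before ... */
}

The index survives pool reallocation because pool_elt_at_index()
re-derives the element address from the pool's current base on every
call, whereas a saved pointer captures the base at one moment in time.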
> >
> > Digging deeper, I added a watch on all newly created sessions using
> > the following gdb script:
> >
> > b /root/build-1901/src/vnet/ip/ip4_forward.c:2444
> > commands 1
> > watch -l ((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id
> > bt
> > continue
> > end
> >
> > gdb, running with this script, bails out with the following message:
> >
> > Thread 1 "vpp_main" hit Hardware watchpoint 2: -location
> > ((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id
> >
> > Old value = 35666
> > New value = 4883
> > __memset_avx2_unaligned_erms () at
> > ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203
> > 203 ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No
> > such file or directory.
> > (gdb)
> >
> > It is interesting to note that 4883 is 0x1313.
> >
> > The back trace shows the path it took to get here:
> >
> > (gdb) bt
> > #0 __memset_avx2_unaligned_erms () at
> >    ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203
> > #1 0x00007ffff61a4179 in mspace_put (msp=0x7fffb4df7010,
> >    p_arg=0x7fffb592d9c8) at /root/build-1901/src/vppinfra/dlmalloc.c:4294
> > #2 0x00007ffff618ea39 in clib_mem_free (p=0x7fffb592d9c8) at
> >    /root/build-1901/src/vppinfra/mem.h:215
> > #3 0x00007ffff618edd8 in vec_resize_allocate_memory (v=0x7fffb592da00,
> >    length_increment=1, data_bytes=312, header_bytes=56, data_align=64)
> >    at /root/build-1901/src/vppinfra/vec.c:96
> > #4 0x00007fffb0aa4a29 in _vec_resize_inline (v=0x7fffb592da00,
> >    length_increment=1, data_bytes=256, header_bytes=48, data_align=64)
> >    at /root/build-1901/src/vppinfra/vec.h:147
> > #5 0x00007fffb0aa9ca4 in vnet_pppoe_add_del_session (a=0x7fffb6703950,
> >    sw_if_indexp=0x7fffb67038e8) at
> >    /root/build-1901/src/plugins/pppoe/pppoe.c:335
> > #6 0x00007fffb0aaadec in pppoe_add_del_session_command_fn
> >    (vm=0x7ffff68e3400 <vlib_global_main>, input=0x7fffb6703ee0,
> >    cmd=0x7fffb65dd73c) at /root/build-1901/src/plugins/pppoe/pppoe.c:554
> > #7 0x00007ffff6617db0 in vlib_cli_dispatch_sub_commands
> >    (vm=0x7ffff68e3400 <vlib_global_main>, cm=0x7ffff68e3600
> >    <vlib_global_main+512>, input=0x7fffb6703ee0,
> >    parent_command_index=21) at /root/build-1901/src/vlib/cli.c:644
> >
> > This does not occur if traffic is not initiated and there is no
> > packet flow through the system. It would be great if someone who
> > understands this code could confirm whether my analysis is correct
> > and give some pointers to figure out:
> >
> > 1. Why, when a new session is created, the data of an old session is
> >    changed to 0x1313.
> > 2. What debugging steps I can take next to figure out why this
> >    happens.
> >
> > Thanks and Regards,
> >
> > Raj
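The backtrace above is already most of the answer to question 1: frame
#1 shows the memset running inside mspace_put() while clib_mem_free()
releases the old sessions vector during the pool resize, so the freed
block appears to be overwritten with a 0x13 byte fill. The observed
values match that pattern exactly; a quick stand-alone check
(hypothetical snippet, plain C, not VPP code):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main (void)
{
  uint16_t session_id;
  uint32_t client_ip;

  /* Fill with the same byte pattern a freed, scrubbed block carries. */
  memset (&session_id, 0x13, sizeof (session_id));
  memset (&client_ip, 0x13, sizeof (client_ip));

  /* Prints: session_id = 4883 (0x1313), client_ip = 0x13131313 --
   * exactly the "garbage" seen in the watchpoint hit and the logs. */
  printf ("session_id = %u (0x%x)\n", session_id, session_id);
  printf ("client_ip  = 0x%x\n", client_ip);
  return 0;
}

In other words, the watchpoint does not fire because anyone writes a
new session id; it fires because the memory under the stale fixup_data
pointer is being freed and scrubbed by the allocator.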
> >
> > On Sat, Sep 28, 2019 at 6:09 PM Raj via Lists.Fd.Io
> > <rajlistuser=gmail....@lists.fd.io> wrote:
> > >
> > > Hello all,
> > >
> > > I have done some more tests to pinpoint the exact condition of the
> > > crash. What I could figure out was that the crash happens when
> > > memory is being allocated for a pppoe_session_t while packets are
> > > flowing through a pppoe interface.
> > >
> > > Here is what I did to arrive at this conclusion:
> > >
> > > 1. Configure VPP without any default route (to ensure packets do
> > >    not hit the north interface from the south).
> > > 2. Provision 100 PPPoE clients - no crash observed.
> > > 3. Deprovision all 100 PPPoE clients.
> > > 4. Configure a default route.
> > > 5. Provision 100 PPPoE clients again, and start a ping to an
> > >    external IP from each client - no crash observed.
> > > 6. Provision 50 more PPPoE clients - VPP crashes.
> > >
> > > Based on this test, and from what I could understand from the code,
> > > my guess is that there is some memory corruption happening inside
> > > the pppoe_session_t when memory is being allocated for it while
> > > packets are traversing a PPPoE interface.
> > >
> > > Thanks and Regards,
> > >
> > > Raj
> > >
> > > On Thu, Sep 26, 2019 at 7:15 PM Raj via Lists.Fd.Io
> > > <rajlistuser=gmail....@lists.fd.io> wrote:
> > > >
> > > > Hello all,
> > > >
> > > > I am observing a VPP crash when approximately 20-50 PPPoE clients
> > > > are connecting and traffic is flowing through them. This crash
> > > > was reproducible every time I tried.
> > > >
> > > > I did some debugging and here is what I could find out so far:
> > > >
> > > > If I understand correctly, when an incoming packet from the north
> > > > side is being sent to a PPPoE interface, pppoe_fixup() is called
> > > > to update pppoe0->length and t->encap_if_index. The length and
> > > > encap_if_index are taken from adj0->sub_type.midchain.fixup_data.
> > > >
> > > > My observation is that while clients are connecting and traffic
> > > > is flowing for connected clients, adj0->sub_type.midchain.fixup_data
> > > > appears to hold incorrect data at some point in time during the
> > > > test. What we have seen is that the incorrect data
> > > > (adj0->sub_type.midchain.fixup_data) is observed for clients
> > > > which have already been provisioned for some time and which had
> > > > packets flowing through them.
> > > >
> > > > I figured this out by using gdb and inspecting
> > > > adj0->sub_type.midchain.fixup_data after typecasting it to
> > > > pppoe_session_t.
> > > >
> > > > In the structure, I could see that session_id, client_ip and
> > > > encap_idx are incorrect. I did not check other values in the
> > > > structure.
> > > >
> > > > I also added code to log these fields in pppoe_fixup(), and the
> > > > logs too show incorrect data in the fields.
> > > >
> > > > Example logs taken just before the crash:
> > > >
> > > > vnet[12988]: pppoe_fixup:243: 40:7b:1b: 0:12:38 -> 2:42: a: 1: 0: 2
> > > > , type 8864
> > > > vnet[12988]: pppoe_fixup:271: pppoe session id 4883, client_ip
> > > > 0x13131313 encap idx 0x13131313
> > > >
> > > > The first log prints the packet headers, to verify that the data
> > > > in the packet is as expected and correct. The second log prints
> > > > values from the pppoe_session data, and it can be seen that the
> > > > values are obviously incorrect. At this point the packet is sent
> > > > out through the south interface. Again, after some time the TX
> > > > index values become something like 1422457436 and VPP core dumps.
> > > >
> > > > We have tested the following scenarios:
> > > >
> > > > 1. Add PPPoE clients without sending out any traffic: no crash
> > > >    observed.
> > > > 2. Add n PPPoE clients, then load traffic (no adding or removing
> > > >    clients while traffic is on; see the next scenario): no crash
> > > >    observed.
> > > > 3. Load traffic as soon as each client connects: VPP crash
> > > >    observed.
> > > >
> > > > Another observation is that encap_if_index is available in two
> > > > places inside pppoe_fixup():
> > > >
> > > > 1. adj->rewrite_header.sw_if_index
> > > > 2. t->encap_if_index
> > > >
> > > > t->encap_if_index is used for updating the TX counter, and this
> > > > gets corrupted, while adj->rewrite_header.sw_if_index has the
> > > > correct index.
> > > >
> > > > I can check and get back if you need any additional information.
> > > > Let me know if a bug report should be created for this.
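That two-copies observation also suggests a cheap way to catch the
corruption in the act: compare the session's cached interface index
against the adjacency's own copy inside the fixup. A hypothetical debug
aid (the signature follows adj_midchain_fixup_t; this check is not in
the plugin, it is only a suggestion):

static void
pppoe_fixup (vlib_main_t * vm, const ip_adjacency_t * adj,
             vlib_buffer_t * b0, const void *data)
{
  const pppoe_session_t *t = data;

  /* If the session pointer has gone stale, the two interface indices
   * will disagree; the adjacency's copy is the trustworthy one. */
  if (PREDICT_FALSE (t->encap_if_index != adj->rewrite_header.sw_if_index))
    clib_warning ("stale fixup_data? t->encap_if_index=%u "
                  "adj->rewrite_header.sw_if_index=%u",
                  t->encap_if_index, adj->rewrite_header.sw_if_index);

  /* ... existing length / TX-counter updates ... */
}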
> > > >
> > > > Environment:
> > > >
> > > > vpp# show version verbose
> > > > Version: v19.08.1-59~ga2aa83ca9-dirty
> > > > Compiled by: root
> > > > Compile host: build-02
> > > > Compile date: Thu Sep 26 16:44:00 IST 2019
> > > > Compile location: /root/build-1908
> > > > Compiler: GCC 7.4.0
> > > > Current PID: 7802
> > > >
> > > > Operating system: Ubuntu 18.04 amd64
> > > >
> > > > startup.conf and the associated exec file are attached.
> > > >
> > > > There is a small patch to stock VPP to disable
> > > > ETHERNET_ERROR_L3_MAC_MISMATCH, which is attached. I have also
> > > > attached the output of show hardware and the gdb bt output. I
> > > > have the core file and its matching VPP debs, which can be shared
> > > > if needed.
> > > >
> > > > In the bt, the incorrect value of the index can be seen in frame #5:
> > > >
> > > > #5 0x00007fba88e9ce0b in vlib_increment_combined_counter
> > > >    (n_bytes=<optimized out>, n_packets=1, index=538976288,
> > > >    thread_index=0, cm=0x7fba481f46a0) at
> > > >    /root/build-1908/src/vlib/counter.h:229
> > > >
> > > > Thanks and Regards,
> > > >
> > > > Raj
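One more data point from that backtrace: the counter index in frame #5
decodes the same way as the earlier garbage values - a repeating byte
fill rather than a plausible sw_if_index (quick check, hypothetical
snippet):

#include <stdio.h>

int main (void)
{
  /* 538976288 == 0x20202020: four identical 0x20 bytes, i.e. more
   * overwritten memory rather than a valid interface index. */
  printf ("0x%x\n", 538976288u);
  return 0;
}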