> I think your analysis is spot on. Thank you!
> Now that you're a VPP PPPoE expert, do you have some suggestions on a fix?

Shouldn't the usual solution of storing the pool index instead of the 
pointer itself be enough?
I.e., use session_id as fixup_data, and use it in pppoe_fixup() to retrieve 
the session from the pool.

ben

> On 17/10/2019 16:18, "vpp-dev@lists.fd.io on behalf of Raj" <vpp-
> d...@lists.fd.io on behalf of rajlistu...@gmail.com> wrote:
> 
>     Hello all,
> 
>     I did some more analysis of this issue and I think I have identified
>     what could be wrong here.
> 
>     During PPPoE session creation, as we mentioned earlier, the function
>     below is invoked from pppoe_update_adj():
> 
>     adj_nbr_midchain_update_rewrite(adj_index_t adj_index,
>     adj_midchain_fixup_t fixup, const void *fixup_data, ...)
> 
>     Note the 3rd argument, 'fixup_data': here we pass the current session's
>     address as fixup_data.
> 
>     When subsequent sessions are added, this memory address can change,
>     because adding a session resizes (and may reallocate) the pool
>     pem->sessions. Hence the stored fixup_data pointer
>     (adj->sub_type.midchain.fixup_data) is no longer valid, and the memory
>     it points to may already have been freed.
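> 
>     To illustrate the mechanism with a toy example (hypothetical code, not
>     from the plugin; pool_get()/pool_elt_at_index() are the vppinfra
>     macros):
> 
>     #include <vppinfra/pool.h>
> 
>     typedef struct { u16 session_id; } session_t;
> 
>     static void
>     pool_move_example (void)
>     {
>       session_t *sessions = 0;      /* pool, starts empty */
>       session_t *t0, *t1;
> 
>       pool_get (sessions, t0);      /* first element */
>       u32 idx0 = t0 - sessions;     /* the index survives resizes */
> 
>       pool_get (sessions, t1);      /* may realloc the pool, leaving
>                                        the old t0 pointer dangling */
>       t0 = pool_elt_at_index (sessions, idx0); /* safe: re-fetch by index */
>     }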
> 
>     I think that is the reason why we see the memory corruption. It would
>     be great if someone with far better knowledge of VPP internals could
>     take a look at this and confirm it.
> 
>     Thanks and Regards,
> 
>     Raj
> 
>     On Fri, Oct 11, 2019 at 12:17 PM Ni, Hongjun <hongjun...@intel.com>
> wrote:
>     >
>     > Hi Raj,
>     >
>     > I tried to reproduce your issue on VPP 20.01, following your steps
>     > several times, but could not reproduce it.
>     >
>     > Based on your description, please set a breakpoint in
>     > vnet_pppoe_add_del_session() and see what happens when you create the
>     > second PPPoE session while traffic is flowing on the first one.
>     >
>     > Thanks,
>     > Hongjun
>     >
>     > -----Original Message-----
>     > From: vpp-dev@lists.fd.io [mailto:vpp-dev@lists.fd.io] On Behalf Of
> Raj
>     > Sent: Monday, September 30, 2019 11:54 PM
>     > To: vpp-dev <vpp-dev@lists.fd.io>
>     > Subject: Re: [vpp-dev] VPP core dump with PPPoE
>     >
>     > Hello all,
>     >
>     > I did some more debugging to find out when and where exactly the
>     > pppoe_session_t gets corrupted. I added a couple of log entries, as
>     > shown below, to log the PPPoE session id both when a session is
>     > created and when packets traverse from north to south. I have tried
>     > this on VPP 19.08, 19.04 and 19.01 with the same results.
>     >
>     > vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 20923
>     > vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 35666
>     > vnet [21892]: pppoe_fixup:169:      New_Packet  pppoe01 session id 35666
>     > vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 58191
>     >
>     > The sequence when the corruption happens seems to be:
>     >
>     > 1. A new session is created
>     > 2. A packet for the newly created session traverses from north to south
>     > 3. The next session is created - and VPP crashes
>     >
>     > Digging deeper, I added a watchpoint for each newly created session
>     > using the following gdb script:
>     >
>     > b /root/build-1901/src/vnet/ip/ip4_forward.c:2444
>     > commands 1
>     >  watch -l ((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id
>     >  bt
>     > continue
>     > end
>     >
>     > gdb, running with this script, bails out with the following message:
>     >
>     > Thread 1 "vpp_main" hit Hardware watchpoint 2: -location ((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id
>     >
>     > Old value = 35666
>     > New value = 4883
>     > __memset_avx2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203
>     > 203     ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such file or directory.
>     > (gdb)
>     >
>     > It is interesting to note that 4883 is 0x1313. The watchpoint fires
>     > inside a memset() called from mspace_put(), so this looks like a fill
>     > pattern being written over the old session memory as it is freed.
>     >
>     > The backtrace shows the path taken to get here:
>     >
>     >
>     > (gdb) bt
>     > #0  __memset_avx2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203
>     > #1  0x00007ffff61a4179 in mspace_put (msp=0x7fffb4df7010, p_arg=0x7fffb592d9c8) at /root/build-1901/src/vppinfra/dlmalloc.c:4294
>     > #2  0x00007ffff618ea39 in clib_mem_free (p=0x7fffb592d9c8) at /root/build-1901/src/vppinfra/mem.h:215
>     > #3  0x00007ffff618edd8 in vec_resize_allocate_memory (v=0x7fffb592da00, length_increment=1, data_bytes=312, header_bytes=56, data_align=64) at /root/build-1901/src/vppinfra/vec.c:96
>     > #4  0x00007fffb0aa4a29 in _vec_resize_inline (v=0x7fffb592da00, length_increment=1, data_bytes=256, header_bytes=48, data_align=64) at /root/build-1901/src/vppinfra/vec.h:147
>     > #5  0x00007fffb0aa9ca4 in vnet_pppoe_add_del_session (a=0x7fffb6703950, sw_if_indexp=0x7fffb67038e8) at /root/build-1901/src/plugins/pppoe/pppoe.c:335
>     > #6  0x00007fffb0aaadec in pppoe_add_del_session_command_fn (vm=0x7ffff68e3400 <vlib_global_main>, input=0x7fffb6703ee0, cmd=0x7fffb65dd73c) at /root/build-1901/src/plugins/pppoe/pppoe.c:554
>     > #7  0x00007ffff6617db0 in vlib_cli_dispatch_sub_commands (vm=0x7ffff68e3400 <vlib_global_main>, cm=0x7ffff68e3600 <vlib_global_main+512>, input=0x7fffb6703ee0, parent_command_index=21) at /root/build-1901/src/vlib/cli.c:644
>     >
>     > This does not occur if traffic is not initiated and no packets flow
>     > through the system. It would be great if someone who understands this
>     > code could confirm whether my analysis is correct and give some
>     > pointers to figure out:
>     >
>     > 1. Why, when a new session is created, the data of an old session is
>     >    changed to 0x1313
>     > 2. What debugging steps I can take next to figure out why this happens
>     >
>     >
>     >
>     > Thanks and Regards,
>     >
>     >
>     > Raj
>     >
>     > On Sat, Sep 28, 2019 at 6:09 PM Raj via Lists.Fd.Io
> <rajlistuser=gmail....@lists.fd.io> wrote:
>     > >
>     > > Hello all,
>     > >
>     > > I have done some more tests to pinpoint the exact condition of the
>     > > crash. What I could figure out is that the crash happens when memory
>     > > is being allocated for a pppoe_session_t while packets are flowing
>     > > through a PPPoE interface.
>     > > Here is what I did to arrive at this conclusion:
>     > >
>     > > 1. Configure VPP without any default route (to ensure packets do not
>     > >    hit the north interface from the south)
>     > > 2. Provision 100 PPPoE clients - no crash observed
>     > > 3. Deprovision all 100 PPPoE clients
>     > > 4. Configure a default route
>     > > 5. Provision 100 PPPoE clients again, and start a ping to an external
>     > >    IP from each client - no crash observed
>     > > 6. Provision 50 more PPPoE clients - VPP crashes
>     > >
>     > > Based on this test, and on what I could understand from the code, my
>     > > guess is that some memory corruption happens inside the
>     > > pppoe_session_t when memory is being allocated for it while packets
>     > > are traversing a PPPoE interface.
>     > >
>     > > Thanks and Regards,
>     > >
>     > > Raj
>     > >
>     > >
>     > > On Thu, Sep 26, 2019 at 7:15 PM Raj via Lists.Fd.Io
>     > > <rajlistuser=gmail....@lists.fd.io> wrote:
>     > > >
>     > > > Hello all,
>     > > >
>     > > > I am observing a VPP crash when approximately 20-50 PPPoE clients
>     > > > are connecting and traffic is flowing through them. The crash is
>     > > > reproducible every time I try.
>     > > >
>     > > > I did some debugging and here is what I could find out so far:
>     > > >
>     > > > If I understand correctly, when an incoming packet from the north
>     > > > side is being sent to a PPPoE interface, pppoe_fixup() is called to
>     > > > update pppoe0->length and to apply t->encap_if_index. The length and
>     > > > encap_if_index are taken from adj0->sub_type.midchain.fixup_data.
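>     > > >
>     > > > For reference, my reading of pppoe_fixup() is roughly the following
>     > > > (paraphrased from the source, not verbatim; details may differ
>     > > > between releases):
>     > > >
>     > > > static void
>     > > > pppoe_fixup (vlib_main_t * vm, const ip_adjacency_t * adj,
>     > > >              vlib_buffer_t * b0, const void *data)
>     > > > {
>     > > >   /* fixup_data, cached at session creation, is cast straight back
>     > > >      to a session pointer */
>     > > >   const pppoe_session_t *t = data;
>     > > >   pppoe_header_t *pppoe0 =
>     > > >     vlib_buffer_get_current (b0) + sizeof (ethernet_header_t);
>     > > >
>     > > >   /* per-packet PPPoE payload length fixup */
>     > > >   pppoe0->length =
>     > > >     clib_host_to_net_u16 (vlib_buffer_length_in_chain (vm, b0)
>     > > >                           - sizeof (ethernet_header_t)
>     > > >                           - sizeof (pppoe_header_t)
>     > > >                           + sizeof (pppoe0->ppp_proto));
>     > > >
>     > > >   /* TX interface comes from the (possibly stale) session */
>     > > >   vnet_buffer (b0)->sw_if_index[VLIB_TX] = t->encap_if_index;
>     > > > }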
>     > > >
>     > > > My observation is that, while clients are connecting and traffic is
>     > > > flowing for already-connected clients,
>     > > > adj0->sub_type.midchain.fixup_data appears to hold incorrect data at
>     > > > some point during the test. What we have seen is that the incorrect
>     > > > data (adj0->sub_type.midchain.fixup_data) is observed for clients
>     > > > which have already been provisioned for some time and which had
>     > > > packets flowing through them.
>     > > >
>     > > > I figured this out by using gdb and inspecting
>     > > > adj0->sub_type.midchain.fixup_data after casting it to a
>     > > > pppoe_session_t pointer.
>     > > >
>     > > > In the structure, I could see that session_id, client_ip and
>     > > > encap_idx are incorrect. I did not check the other values in the
>     > > > structure.
>     > > >
>     > > > I also added code to log these fields in pppoe_fixup(), and the
>     > > > logs likewise show incorrect data in the fields.
>     > > >
>     > > > Example logs taken just before the crash:
>     > > >
>     > > > vnet[12988]: pppoe_fixup:243: 40:7b:1b: 0:12:38 ->  2:42: a: 1: 0: 2 , type 8864
>     > > > vnet[12988]: pppoe_fixup:271: pppoe session id 4883, client_ip 0x13131313 encap idx 0x13131313
>     > > >
>     > > > The first log line prints the packet headers, to verify that the
>     > > > data in the packet is as expected and correct. The second prints
>     > > > values from the pppoe_session data, and it can be seen that those
>     > > > values are obviously incorrect. At this point the packet is still
>     > > > sent out through the south interface. After some more time the TX
>     > > > index values become something like 1422457436 and VPP core dumps.
>     > > >
>     > > > We have tested the following scenarios:
>     > > >
>     > > > 1. Add PPPoE clients without sending any traffic: no crash observed.
>     > > > 2. Add n PPPoE clients, then load traffic [no adding or removing
>     > > >    clients while traffic is on; see the next scenario]: no crash
>     > > >    observed.
>     > > > 3. Load traffic as soon as each client connects: VPP crash observed.
>     > > >
>     > > > Another observation is that encap_if_index is available in two
>     > > > places inside pppoe_fixup():
>     > > >
>     > > > 1. adj->rewrite_header.sw_if_index
>     > > > 2. t->encap_if_index
>     > > >
>     > > > t->encap_if_index is used for updating TX, and this is what gets
>     > > > corrupted, while adj->rewrite_header.sw_if_index holds the correct
>     > > > index.
>     > > >
>     > > > I can check and get back if you need any additional information.
>     > > > Let me know if a bug report should be created for this.
>     > > >
>     > > > Environment:
>     > > >
>     > > > vpp# show version verbose
>     > > > Version:          v19.08.1-59~ga2aa83ca9-dirty
>     > > > Compiled by:          root
>     > > > Compile host:          build-02
>     > > > Compile date:          Thu Sep 26 16:44:00 IST 2019
>     > > > Compile location:      /root/build-1908
>     > > > Compiler:          GCC 7.4.0
>     > > > Current PID:          7802
>     > > >
>     > > > Operating system: Ubuntu 18.04 amd64
>     > > >
>     > > > startup.conf and the associated exec file are attached.
>     > > >
>     > > > There is a small patch on top of stock VPP to disable
>     > > > ETHERNET_ERROR_L3_MAC_MISMATCH, which is attached. I have also
>     > > > attached the output of 'show hardware' and the gdb bt output. I
>     > > > have the core file and its matching VPP debs, which can be shared
>     > > > if needed.
>     > > >
>     > > > In the backtrace, the incorrect index value can be seen in frame #5
>     > > > (note that 538976288 is 0x20202020):
>     > > >
>     > > > #5  0x00007fba88e9ce0b in vlib_increment_combined_counter (n_bytes=<optimized out>, n_packets=1, index=538976288, thread_index=0, cm=0x7fba481f46a0) at /root/build-1908/src/vlib/counter.h:229
>     > > >
>     > > > Thanks and Regards,
>     > > >
>     > > > Raj
> 
