Hi Raj,

I tried to reproduce your issue on VPP 20.01 as per your steps for some times, 
but cannot reproduce it.

From your description, please set a breakpoint in vnet_pppoe_add_del_session().
And to see what happened when you create your second pppoe session with traffic 
in first pppoe session.

Thanks,
Hongjun

-----Original Message-----
From: vpp-dev@lists.fd.io [mailto:vpp-dev@lists.fd.io] On Behalf Of Raj
Sent: Monday, September 30, 2019 11:54 PM
To: vpp-dev <vpp-dev@lists.fd.io>
Subject: Re: [vpp-dev] VPP core dump with PPPoE

Hello all,

I did some more debugging to find out when and where exactly the 
pppoe_session_t get corrupted. Added couple of log entries as shown below to 
log pppoe session id when a session is created as well as when packets from 
north traverses to south. I have tried this in VPP 19.08, 19.04 and 19.01 with 
same results.

vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 20923 vnet 
[21892]: pppoe_update_adj:195: New_Session pppoe01 session id 35666
vnet [21892]: pppoe_fixup:169:      New_Packet  pppoe01 session id 35666
vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 58191

The sequence when corruption happens seems to be:

1. A new session is created
2. A packet for the newly created session traverses from north to south 3. Next 
session is created - and vpp crashes.

Digging deeper, I added watch for all newly created sessions using the 
following gdb script

b /root/build-1901/src/vnet/ip/ip4_forward.c:2444
commands 1
 watch -l ((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id
 bt
continue
end

gdb, running with this script, bails out with following message:

Thread 1 "vpp_main" hit Hardware watchpoint 2: -location 
((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id

Old value = 35666
New value = 4883
__memset_avx2_unaligned_erms () at
../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203
203     ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No
such file or directory.
(gdb)

It is interesting to note that 4883 is 0x1313

Back trace shows the path it took to reach here:


(gdb) bt
#0  __memset_avx2_unaligned_erms () at
../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203
#1  0x00007ffff61a4179 in mspace_put (msp=0x7fffb4df7010,
p_arg=0x7fffb592d9c8) at /root/build-1901/src/vppinfra/dlmalloc.c:4294
#2  0x00007ffff618ea39 in clib_mem_free (p=0x7fffb592d9c8) at
/root/build-1901/src/vppinfra/mem.h:215
#3  0x00007ffff618edd8 in vec_resize_allocate_memory (v=0x7fffb592da00, 
length_increment=1, data_bytes=312, header_bytes=56, data_align=64) at
/root/build-1901/src/vppinfra/vec.c:96
#4  0x00007fffb0aa4a29 in _vec_resize_inline (v=0x7fffb592da00, 
length_increment=1, data_bytes=256, header_bytes=48, data_align=64) at
/root/build-1901/src/vppinfra/vec.h:147
#5  0x00007fffb0aa9ca4 in vnet_pppoe_add_del_session (a=0x7fffb6703950, 
sw_if_indexp=0x7fffb67038e8) at
/root/build-1901/src/plugins/pppoe/pppoe.c:335
#6  0x00007fffb0aaadec in pppoe_add_del_session_command_fn
(vm=0x7ffff68e3400 <vlib_global_main>, input=0x7fffb6703ee0,
cmd=0x7fffb65dd73c) at /root/build-1901/src/plugins/pppoe/pppoe.c:554
#7  0x00007ffff6617db0 in vlib_cli_dispatch_sub_commands
(vm=0x7ffff68e3400 <vlib_global_main>, cm=0x7ffff68e3600 
<vlib_global_main+512>, input=0x7fffb6703ee0, parent_command_index=21) at 
/root/build-1901/src/vlib/cli.c:644

This do not occur if traffic is not initiated, and there is no packet flow 
through the system. It would be great if some one who understands this code to 
confirm if my analysis is correct and give some pointers to figure out

1. Why, when a new session is created, the data of an old session is changed to 
0x1313 2. What debugging steps can I take next to figure out why this happens.



Thanks and Regards,


Raj

On Sat, Sep 28, 2019 at 6:09 PM Raj via Lists.Fd.Io 
<rajlistuser=gmail....@lists.fd.io> wrote:
>
> Hello all,
>
> I have done some more tests to pinpoint the exact condition of the 
> crash. What I could figure out was that the crash happens when memory 
> is being allocated for pppoe_session_t while packets are flowing 
> through pppoe interface.
>
> Here is what I did to arrive at this conclusion:
>
> 1. Configure VPP without any default route (to ensure packets do not 
> hit north interface from south) 2. Provision 100 PPPoE clients - No 
> crash observed 3. Deprovision all 100 PPPoE clients 4. Configure 
> default route 5. Provision 100 PPPoE clients again, and start a ping 
> to an external IP from each client - No Crash observed 6. Provision 50 
> more PPPoE clients - VPP crashes.
>
> Based on this test, and from what I could understand from the code, my 
> guess is  that there is some memory corruption happening inside the 
> pppoe_session_t when memory is being allocated for it when there is 
> packets traversing through PPPoE interface.
>
> Thanks and Regards,
>
> Raj
>
>
> On Thu, Sep 26, 2019 at 7:15 PM Raj via Lists.Fd.Io 
> <rajlistuser=gmail....@lists.fd.io> wrote:
> >
> > Hello all,
> >
> > I am observing a VPP crash when approximately 20 - 50 PPPoE clients 
> > are connecting and traffic is flowing through them. This crash was 
> > reproducible every time I tried.
> >
> > I did some debugging and here is what I could find out so far:
> >
> > If I understand correctly, when a incoming packet from north side is 
> > being sent to PPPoE interface, pppoe_fixup() is called to update
> > pppoe0->length, and t->encap_if_index. Length and encap_if_index is
> > taken from adj0->sub_type.midchain.fixup_data
> >
> > My observation is that while clients are connecting and traffic is 
> > flowing for connected clients, adj0->sub_type.midchain.fixup_data
> > appears to hold incorrect data, at some point in time, during the 
> > test. What we have seen is the incorrect data
> > (adj0->sub_type.midchain.fixup_data) is observed for clients which 
> > are already provisioned for some time and which had packets flowing 
> > through them.
> >
> > I figured this out by using gdb and inspecting
> > adj0->sub_type.midchain.fixup_data, after typecasting it into
> > pppoe_session_t
> >
> > In the structure, I could see that session_id, client_ip and 
> > encap_idx are incorrect. I did not check other values in the structure.
> >
> > I also added code to log this fields in pppoe_fixup() and logs too 
> > shows incorrect data in the fields.
> >
> > Example logs taken just before crash:
> >
> > vnet[12988]: pppoe_fixup:243: 40:7b:1b: 0:12:38 ->  2:42: a: 1: 0: 2 
> > , type 8864
> > vnet[12988]: pppoe_fixup:271: pppoe session id 4883, client_ip
> > 0x13131313 encap idx 0x13131313
> >
> > First log prints out packet headers, to verify that data in packet 
> > is as expected and is correct. Second log prints values in 
> > pppoe_session data, and it can be seen that the values are obviously 
> > incorrect. At this point the packet is sent out through the south 
> > interface. Again after some time the TX index values become some 
> > thing similar to
> > 1422457436 and VPP core dumps.
> >
> > We have tested the following scenarios:
> >
> > 1. Add PPPoE clients without sending out any traffic: There is no 
> > crash observed.
> > 2. Add n number of PPPoE clients, load traffic [No adding or removal 
> > or clients while traffic is on, see next scenario]: There is no 
> > crash observed 3. Load traffic as soon as each client connects: VPP 
> > crash observed.
> >
> > Another observation is that  encap_if_index is available in two 
> > places inside pppoe_fixup:
> >
> > 1. adj->rewrite_header.sw_if_index
> > 2. t->encap_if_index
> >
> > t->encap_if_index is used for updating TX, and this gets corrupted,
> > while adj->rewrite_header.sw_if_index has the correct index.
> >
> > I can check and get back if you need any additional information. Let 
> > me know if a bug report is to be created for this.
> >
> > Environment:
> >
> > vpp# show version verbose
> > Version:          v19.08.1-59~ga2aa83ca9-dirty
> > Compiled by:          root
> > Compile host:          build-02
> > Compile date:          Thu Sep 26 16:44:00 IST 2019
> > Compile location:      /root/build-1908
> > Compiler:          GCC 7.4.0
> > Current PID:          7802
> >
> > Operating system: Ubuntu 18.04 amd64
> >
> > startup.conf and associated exec file is attached.
> >
> > There is a small patch to stock VPP to disable 
> > ETHERNET_ERROR_L3_MAC_MISMATCH, which is attached. I have also 
> > attached output of show show hardware and gdb bt output. I have the 
> > core file and its matching VPP debs, and can be shared if needed.
> >
> > In the bt the incorrect value of index can be seen in bt #5:
> >
> > #5  0x00007fba88e9ce0b in vlib_increment_combined_counter 
> > (n_bytes=<optimized out>, n_packets=1, index=538976288, 
> > thread_index=0, cm=0x7fba481f46a0) at
> > /root/build-1908/src/vlib/counter.h:229
> >
> > Thanks and Regards,
> >
> > Raj
> > -=-=-=-=-=-=-=-=-=-=-=-
> > Links: You receive all messages sent to this group.
> >
> > View/Reply Online (#14063): 
> > https://lists.fd.io/g/vpp-dev/message/14063
> > Mute This Topic: https://lists.fd.io/mt/34298895/157026
> > Group Owner: vpp-dev+ow...@lists.fd.io
> > Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub  
> > [rajlistu...@gmail.com]
> > -=-=-=-=-=-=-=-=-=-=-=-
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
>
> View/Reply Online (#14081): 
> https://lists.fd.io/g/vpp-dev/message/14081
> Mute This Topic: https://lists.fd.io/mt/34298895/157026
> Group Owner: vpp-dev+ow...@lists.fd.io
> Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub  
> [rajlistu...@gmail.com]
> -=-=-=-=-=-=-=-=-=-=-=-
-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#14159): https://lists.fd.io/g/vpp-dev/message/14159
Mute This Topic: https://lists.fd.io/mt/34298895/21656
Group Owner: vpp-dev+ow...@lists.fd.io
Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub  [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-

Reply via email to