> On Jul 23, 2020, at 4:09 PM, Dave Barach via lists.fd.io 
> <dbarach=cisco....@lists.fd.io> wrote:
> 
[ swapped the order of my reply :) ]

> Without having all of the source code available and a reasonable way to repro 
> the issue, it's going to be quite hard to help you find the culprit.

Yes, I realize that, so I was really only looking for general debug tools to 
help me track it down, e.g., Benoit's suggestion of AddressSanitizer, which I 
have running; the latest SIGSEGVs are below.

> What is the invalid buffer index value? How many elements are in the frame? 
> Is it always the same frame element which takes a lightning hit?

Sorry for the late response; it's been tough to gather this time around. I 
have 2 instances that hit at around the same time:

vpp 1:

(gdb) frame
#1  0x00007ffff4df20ba in esp_encrypt_inline (vm=0x7fffb4a41040, 
node=0x7fffb4d95f00, frame=0x7fffb4a63ac0, is_ip6=0, is_tun=0, async_next=1) at 
/home/chopps/w/vpp/src/vnet/ipsec/esp_encrypt.c:628
628               p = vlib_buffer_get_current (b[1]);
(gdb) p *frame
$40 = {
  frame_flags = 6,
  flags = 0,
  scalar_size = 0 '\000',
  vector_size = 4 '\004',
  n_vectors = 5,
  arguments = 0x7fffb4a63ac8 "\376\376\376\376\376\376\376\376\344\317", 
<incomplete sequence \362>
}
(gdb) p n_left
$41 = 3
(gdb) p/x *from@5
$42 = {0xf2cfe4, 0xe8252e, 0xfb7161, 0xb4d0b42a, 0x72a90}
(gdb) p/x *from@10
$43 = {0xf2cfe4, 0xe8252e, 0xfb7161, 0xb4d0b42a, 0x72a90, 0xe41942, 0xe41dd3, 
0xe4ebf7, 0xe4c8bd, 0xf18a5f}

vpp 2:

(gdb) frame
#1  0x00007ffff4df20ba in esp_encrypt_inline (vm=0x7fffb4a41040, 
node=0x7fffb4b530c0, frame=0x7fffb48205c0, is_ip6=0, is_tun=0, async_next=1) at 
/home/chopps/w/vpp/src/vnet/ipsec/esp_encrypt.c:628
628               p = vlib_buffer_get_current (b[1]);
(gdb) p *frame
$12 = {
  frame_flags = 6,
  flags = 0,
  scalar_size = 0 '\000',
  vector_size = 4 '\004',
  n_vectors = 7,
  arguments = 0x7fffb48205c8 "\376\376\376\376\376\376\376\376\332\354\377"
}
(gdb) p n_left
$13 = 5
(gdb) p/x *from@7
$14 = {0xffecda, 0xe50030, 0xf0baed, 0x7442defe, 0x727ba, 0x219c, 0x0}
(gdb) p/x *from@10
$15 = {0xffecda, 0xe50030, 0xf0baed, 0x7442defe, 0x727ba, 0x219c, 0x0, 
0xe4369d, 0xeb2e1f, 0xfe172a}
(gdb)

In both cases above the corruption starts at the 4th element. It also appears 
to run to the end of the from array (5th and 7th element respectively), but not 
any further (that's why I printed @10 for both froms).

I'm going to rerun this now, but this time I am going to validate the from 
array on entry to esp_encrypt_inline to see whether the corruption is happening 
before or during that function's execution.
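
A minimal sketch of the kind of entry check I mean, as standalone C rather 
than actual VPP code. The function name and the two bounds are hypothetical: 
the bounds are chosen only because every intact index in the dumps above falls 
inside them, not because they reflect the real buffer pool configuration. A 
real check would resolve each index against the buffer pool instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bounds (assumption): picked from the intact entries in the
 * gdb dumps, which all lie between 0xe00000 and 0xffffff. They are NOT
 * derived from the actual VPP buffer pool. */
#define SKETCH_BI_MIN 0xe00000u
#define SKETCH_BI_MAX 0xffffffu

/* Return the 0-based position of the first entry outside [lo, hi],
 * or -1 if every entry is in range. */
static int
sketch_first_bad_index (const uint32_t *from, size_t n,
                        uint32_t lo, uint32_t hi)
{
  for (size_t i = 0; i < n; i++)
    if (from[i] < lo || from[i] > hi)
      return (int) i;
  return -1;
}
```

Called on entry with the frame's from array and n_vectors, a result other 
than -1 would mean the frame was already corrupt before esp_encrypt_inline 
touched it. With the two arrays printed above and these assumed bounds it 
returns 3 in both cases, i.e. the 4th element.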

The other variation I can try is to run the tunnel a bit faster and slower to 
change the per-call packet count and see if the corruption moves away from the 
4th element.

Thanks,
Chris.

> 
> D.
> -----Original Message-----
> From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Christian Hopps
> Sent: Thursday, July 23, 2020 1:10 PM
> To: vpp-dev <vpp-dev@lists.fd.io>
> Cc: Christian Hopps <cho...@chopps.org>
> Subject: [vpp-dev] debugging corrupted frame arguments
> 
> I have a very intermittent memory corruption occurring in the buffer indices 
> passed in a node's frame (encryption node).
> 
> Basically one of the indices is clearly not a valid buffer index, and this 
> is leading to a SIGSEGV when the code attempts to use the buffer.
> 
> I see that vlib_frame_ts are allocated from the main heap, so I'm wondering 
> is there any heap/alloc debugging I can enable to help figure out what is 
> corrupting this vlib_frame_t?
> 
> FWIW what seems weird is that I am validating the indices in 
> vlib_put_frame_to_node() (I changed the validation code to actually resolve 
> the buffer index), so apparently the vector is being corrupted between the 
> node that creates and puts the vlib_frame_t, and the pending node being 
> dispatched. The heap appears to be per CPU as well. This all makes things odd 
> since the pending node should be running immediately after the node that put 
> the valid frame there. So it's hard to imagine what could be corrupting this 
> memory in between the put and the dispatch.
> 
> Thanks,
> Chris.
> 


