Re: [vpp-dev] PPPOE

2018-11-27 Thread Zhang Dongya
Hi,

We have put our PPPoE client implementation on our GitHub; I have synced the
latest commits from our internal branch.

If you want to give it a try, you can compile that branch and refer to the
commands of plugin/pppox and plugin/pppoeclient.

The github link is:

https://github.com/raydonetworks/vpp-pppoeclient

xulang wrote on Mon, Nov 26, 2018 at 9:22 AM:

> Hi all,
> I would like to use a PPPoE server and PPPoE client; is there any material
> about this?
>
>
> Regards,
> Xlangyun
>
>
>
>


[vpp-dev] deadloop in internal_mallinfo

2022-07-06 Thread Zhang Dongya
Hi list,

Recently I encountered a dead loop in VPP (based on VPP 21.10) running in
1 master + 1 worker mode (it also happens in 1 master + 0 worker mode) while
applying some VPP configuration through govpp.

The thread stats in gdb is:
(gdb) info threads
  Id   Target IdFrame
* 1Thread 0x7f735c3287c0 (LWP 274213) "vpp_main"
 internal_mallinfo (m=0x7f731c227040) at
/home/fortitude/glx/vpp/src/vppinfra/dlmalloc.c:2100
  2Thread 0x7f73127d2700 (LWP 274214) "eal-intr-thread"
0x7f735c5f449e in epoll_wait (epfd=16, events=0x7f73127d1d30,
maxevents=7, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  3Thread 0x7f7311fd1700 (LWP 274215) "vpp_wk_0"
 0x7f735c5d774b in sched_yield () at
../sysdeps/unix/syscall-template.S:78

From what I observed, the inner loop (segment_holds) becomes a dead loop,
and I still have not found the reason (see the bottom of this mail).

There is an old mail in this list, https://lists.fd.io/mt/89947053/675661,
reporting the same issue, but it has no answer yet and I can't reply to that
thread, so I am creating this one.

I hope someone can give me a clue about this.

static struct dlmallinfo internal_mallinfo(mstate m) {
  struct dlmallinfo nm = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
  ensure_initialization();
  if (!PREACTION(m)) {
    check_malloc_state(m);
    if (is_initialized(m)) {
      size_t nfree = SIZE_T_ONE; /* top always free */
      size_t mfree = m->topsize + TOP_FOOT_SIZE;
      size_t sum = mfree;
      msegmentptr s = &m->seg;
      while (s != 0) {
        mchunkptr q = align_as_chunk(s->base);
        while (segment_holds(s, q) &&
               q != m->top && q->head != FENCEPOST_HEAD) {
          size_t sz = chunksize(q);
          sum += sz;
          if (!is_inuse(q)) {
            mfree += sz;
            ++nfree;
          }
          q = next_chunk(q);
        }
        s = s->next;
      }
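
For illustration, here is a small standalone C program (not VPP code; the
macros are simplified stand-ins for the dlmalloc ones) showing how a chunk
header corrupted to zero makes next_chunk() return the same pointer, so the
inner walk above can never advance:

#include <stdio.h>
#include <stddef.h>

typedef struct chunk { size_t head; } chunk_t;

/* simplified stand-ins for dlmalloc's chunksize()/next_chunk() macros */
#define FLAG_BITS      0x7UL
#define chunk_size(q)  ((q)->head & ~FLAG_BITS)
#define next_chunk(q)  ((chunk_t *) ((char *) (q) + chunk_size (q)))

int
main (void)
{
  char heap[64] = { 0 };
  chunk_t *q = (chunk_t *) heap;

  q->head = 0; /* an overrun that zeroes the header clears the size bits */
  printf ("next_chunk(q) == q? %s\n", next_chunk (q) == q ? "yes" : "no");
  return 0;
}

If such a chunk is neither m->top nor a fencepost, the inner while loop keeps
re-examining the same chunk forever, which matches the stall seen in thread 1
above.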




Re: [vpp-dev] deadloop in internal_mallinfo

2022-07-07 Thread Zhang Dongya
Hi list,

It turned out there was a memory overwrite in our plugin that corrupted the
heap; after fixing that, the dead loop vanished.


Zhang Dongya via lists.fd.io wrote on Wed, Jul 6, 2022 at 15:52:

> Hi list,
>
> Recently I encountered a deadloop in 1 master + 1 worker mode of vpp(also
> happens in 1 master 0 worker mode) when I do some vpp (based on vpp 21.10)
> configuration through govpp.
>
> The thread stats in gdb is:
> (gdb) info threads
>   Id   Target IdFrame
> * 1Thread 0x7f735c3287c0 (LWP 274213) "vpp_main"
>  internal_mallinfo (m=0x7f731c227040) at
> /home/fortitude/glx/vpp/src/vppinfra/dlmalloc.c:2100
>   2Thread 0x7f73127d2700 (LWP 274214) "eal-intr-thread"
> 0x7f735c5f449e in epoll_wait (epfd=16, events=0x7f73127d1d30,
> maxevents=7, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
>   3Thread 0x7f7311fd1700 (LWP 274215) "vpp_wk_0"
>  0x7f735c5d774b in sched_yield () at
> ../sysdeps/unix/syscall-template.S:78
>
> From what I observed, it is the inner loop (segment_holds) become a dead
> loop, and I still
> have not found the reason (see bottom of the mail.)
>
> there is old mail https://lists.fd.io/mt/89947053/675661 in this list
> report the same issue,
> but no answer yet and I can't reply to that thread, so I create this one.
>
> Hope someone can give some clue on this.
>
> static struct dlmallinfo internal_mallinfo(mstate m) {
> struct dlmallinfo nm = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
> ensure_initialization();
> if (!PREACTION(m)) {
> check_malloc_state(m);
> if (is_initialized(m)) {
> size_t nfree = SIZE_T_ONE; /* top always free */
> size_t mfree = m->topsize + TOP_FOOT_SIZE;
> size_t sum = mfree;
> msegmentptr s = &m->seg;
> while (s != 0) {
> mchunkptr q = align_as_chunk(s->base);
> while (segment_holds(s, q) &&
> q != m->top && q->head != FENCEPOST_HEAD) {
> size_t sz = chunksize(q);
> sum += sz;
> if (!is_inuse(q)) {
> mfree += sz;
> ++nfree;
> }
> q = next_chunk(q);
> }
> s = s->next;
> }
>
> 
>
>




[vpp-dev] vpp crash when close a host-stack tcp session in syn-sent state.

2022-10-12 Thread Zhang Dongya
Hi,

I am now trying to use the VPP host stack to negotiate a TCP session;
however, I found that VPP crashes if I call vnet_disconnect_session while
the TCP connection is stuck in the syn-sent state (which may be caused by my
having shut down the remote side).

VPP crashes in the following code, which calls svm_fifo_clear_deq_ntf while
the tx_fifo is not yet initialized; the tx_fifo is only allocated in
app_worker_init_connected.

Is this a bug, or am I using the host stack incorrectly?



void
session_close (session_t * s)
{
  if (!s)
    return;

  if (s->session_state >= SESSION_STATE_CLOSING)
    {
      /* Session will only be removed once both app and transport
       * acknowledge the close */
      if (s->session_state == SESSION_STATE_TRANSPORT_CLOSED
          || s->session_state == SESSION_STATE_TRANSPORT_DELETED)
        session_program_transport_ctrl_evt (s, SESSION_CTRL_EVT_CLOSE);
      return;
    }

  /* App closed so stop propagating dequeue notifications */
  svm_fifo_clear_deq_ntf (s->tx_fifo);
  s->session_state = SESSION_STATE_CLOSING;
  session_program_transport_ctrl_evt (s, SESSION_CTRL_EVT_CLOSE);
}




Re: [vpp-dev] vpp crash when close a host-stack tcp session in syn-sent state.

2022-10-12 Thread Zhang Dongya
Thanks a lot. I just added a check for the tx_fifo there locally (roughly as
sketched below) and it seems to work.
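
For reference, the local guard amounts to something like this (a sketch of
session_close from the original message with a NULL check added; the actual
upstream fix may differ):

void
session_close (session_t * s)
{
  if (!s)
    return;

  if (s->session_state >= SESSION_STATE_CLOSING)
    {
      /* Session will only be removed once both app and transport
       * acknowledge the close */
      if (s->session_state == SESSION_STATE_TRANSPORT_CLOSED
          || s->session_state == SESSION_STATE_TRANSPORT_DELETED)
        session_program_transport_ctrl_evt (s, SESSION_CTRL_EVT_CLOSE);
      return;
    }

  /* App closed so stop propagating dequeue notifications. A session
   * still in syn-sent has no tx_fifo yet, so guard the call. */
  if (s->tx_fifo)
    svm_fifo_clear_deq_ntf (s->tx_fifo);
  s->session_state = SESSION_STATE_CLOSING;
  session_program_transport_ctrl_evt (s, SESSION_CTRL_EVT_CLOSE);
}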

BTW,

I'd like to help by submitting a patch; however, when I try to log in to
Gerrit using my Linux Foundation ID it always reports a Forbidden error. Do
you know where I can get help to solve this, or does Gerrit need some
approval before I can get involved?

It's OK if you would rather push the fix yourself to get it in ASAP.

Florin Coras wrote on Wed, Oct 12, 2022 at 23:44:

> Hi,
>
> It looks like a bug. We should make sure the fifo exists, which is
> typically the case unless transport is stuck in half-open. Note that tcp
> does timeout and cleanups those stuck half-open sessions, but we should
> allow the app to cleanup as well.
>
> Let me know if you plan to push a patch or I should do it.
>
> Regards,
> Florin
>
> On Oct 12, 2022, at 12:44 AM, Zhang Dongya 
> wrote:
>
> Hi,
>
> I am now trying to use vpp host-stack to negotiate a valid TCP session,
> however, I found if I call vnet_disconnect_session when the TCP stuck in
> syn-sent state (this may be caused by I have shutdown the remove side).
>
> Vpp will crash in the following code which call svm_fifo_clear_deq_ntf
> while the tx_fifo is not inited, this is because the tx_fifo will be
> allocated init app_worker_init_connected.
>
> Is this a bug or I have something wrong with my using of host-stack?
>
>
>
> void
> session_close (session_t * s)
> {
> if (!s)
> return;
>
> if (s->session_state >= SESSION_STATE_CLOSING)
> {
> /* Session will only be removed once both app and transport
> * acknowledge the close */
> if (s->session_state == SESSION_STATE_TRANSPORT_CLOSED
> || s->session_state == SESSION_STATE_TRANSPORT_DELETED)
> session_program_transport_ctrl_evt (s, SESSION_CTRL_EVT_CLOSE);
> return;
> }
>
> /* App closed so stop propagating dequeue notifications */
> svm_fifo_clear_deq_ntf (s->tx_fifo);
> s->session_state = SESSION_STATE_CLOSING;
> session_program_transport_ctrl_evt (s, SESSION_CTRL_EVT_CLOSE);
> }
>
>
>
>
>
>
>
> 
>
>




Re: [vpp-dev] vpp crash when close a host-stack tcp session in syn-sent state.

2022-10-12 Thread Zhang Dongya
Yes, I can log in at link [1] and can see my account has been registered
with the LF for 5 years; however, when I log in to the Gerrit web UI it
still reports a Forbidden error. My account username is ZhangDongya.

OK, I will give the git command line a try.

Florin Coras wrote on Thu, Oct 13, 2022 at 10:12:

> An LF account should suffice. Could you confirm your lf credentials work
> here [1]?
>
> And, in case you haven’t seen this already, here are the steps to get you
> started on pushing the patch, once the above is solved [2].
>
> Regards,
> Florin
>
> [1] https://identity.linuxfoundation.org/
> [2]
> https://wiki.fd.io/view/VPP/Pulling,_Building,_Running,_Hacking_and_Pushing_VPP_Code#Pulling_code_via_ssh
>
>
> On Oct 12, 2022, at 6:21 PM, Zhang Dongya 
> wrote:
>
> Thanks a lot,I just add a check for tx_fifo there locally and it seems
> works.
>
> BTW,
>
> I'd like to help to submit a patch, however I don't know the reason when I
> trying to login gerrit using my linux foundation id, it always reports
> Forbidden error, do you know where I can get help to
> solve this ?  or gerrit need some approval for get involved?
>
> It's ok if you want to get it fixed asap.
>
> Florin Coras wrote on Wed, Oct 12, 2022 at 23:44:
>
>> Hi,
>>
>> It looks like a bug. We should make sure the fifo exists, which is
>> typically the case unless transport is stuck in half-open. Note that tcp
>> does timeout and cleanups those stuck half-open sessions, but we should
>> allow the app to cleanup as well.
>>
>> Let me know if you plan to push a patch or I should do it.
>>
>> Regards,
>> Florin
>>
>> On Oct 12, 2022, at 12:44 AM, Zhang Dongya 
>> wrote:
>>
>> Hi,
>>
>> I am now trying to use vpp host-stack to negotiate a valid TCP session,
>> however, I found if I call vnet_disconnect_session when the TCP stuck in
>> syn-sent state (this may be caused by I have shutdown the remove side).
>>
>> Vpp will crash in the following code which call svm_fifo_clear_deq_ntf
>> while the tx_fifo is not inited, this is because the tx_fifo will be
>> allocated init app_worker_init_connected.
>>
>> Is this a bug or I have something wrong with my using of host-stack?
>>
>>
>>
>> void
>> session_close (session_t * s)
>> {
>> if (!s)
>> return;
>>
>> if (s->session_state >= SESSION_STATE_CLOSING)
>> {
>> /* Session will only be removed once both app and transport
>> * acknowledge the close */
>> if (s->session_state == SESSION_STATE_TRANSPORT_CLOSED
>> || s->session_state == SESSION_STATE_TRANSPORT_DELETED)
>> session_program_transport_ctrl_evt (s, SESSION_CTRL_EVT_CLOSE);
>> return;
>> }
>>
>> /* App closed so stop propagating dequeue notifications */
>> svm_fifo_clear_deq_ntf (s->tx_fifo);
>> s->session_state = SESSION_STATE_CLOSING;
>> session_program_transport_ctrl_evt (s, SESSION_CTRL_EVT_CLOSE);
>> }
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>
> 
>
>




Re: [vpp-dev] vpp crash when close a host-stack tcp session in syn-sent state.

2022-10-13 Thread Zhang Dongya
Thanks a lot, I will give it a try.

Florin Coras wrote on Fri, Oct 14, 2022 at 01:01:

> Hi,
>
> [cc Vanessa]
>
> Could you please open a ticket here [1]? Hopefully this can be solved.
>
> Regards,
> Florin
>
> [1]
> https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/create/37
>
> On Oct 12, 2022, at 10:42 PM, Zhang Dongya 
> wrote:
>
> Yes, I can login to link [1] and can see my account have been registered
> in LF 5 years, however, when I login the gerrit web ui, it still reports
> Forbidden error, my account username is ZhangDongya.
>
> Ok, I will try to use git command line to give a try.
>
> Florin Coras wrote on Thu, Oct 13, 2022 at 10:12:
>
>> An LF account should suffice. Could you confirm your lf credentials work
>> here [1]?
>>
>> And, in case you haven’t seen this already, here are the steps to get you
>> started on pushing the patch, once the above is solved [2].
>>
>> Regards,
>> Florin
>>
>> [1] https://identity.linuxfoundation.org/
>> [2]
>> https://wiki.fd.io/view/VPP/Pulling,_Building,_Running,_Hacking_and_Pushing_VPP_Code#Pulling_code_via_ssh
>>
>>
>> On Oct 12, 2022, at 6:21 PM, Zhang Dongya 
>> wrote:
>>
>> Thanks a lot,I just add a check for tx_fifo there locally and it seems
>> works.
>>
>> BTW,
>>
>> I'd like to help to submit a patch, however I don't know the reason when
>> I trying to login gerrit using my linux foundation id, it always reports
>> Forbidden error, do you know where I can get help to
>> solve this ?  or gerrit need some approval for get involved?
>>
>> It's ok if you want to get it fixed asap.
>>
>> Florin Coras wrote on Wed, Oct 12, 2022 at 23:44:
>>
>>> Hi,
>>>
>>> It looks like a bug. We should make sure the fifo exists, which is
>>> typically the case unless transport is stuck in half-open. Note that tcp
>>> does timeout and cleanups those stuck half-open sessions, but we should
>>> allow the app to cleanup as well.
>>>
>>> Let me know if you plan to push a patch or I should do it.
>>>
>>> Regards,
>>> Florin
>>>
>>> On Oct 12, 2022, at 12:44 AM, Zhang Dongya 
>>> wrote:
>>>
>>> Hi,
>>>
>>> I am now trying to use vpp host-stack to negotiate a valid TCP session,
>>> however, I found if I call vnet_disconnect_session when the TCP stuck in
>>> syn-sent state (this may be caused by I have shutdown the remove side).
>>>
>>> Vpp will crash in the following code which call svm_fifo_clear_deq_ntf
>>> while the tx_fifo is not inited, this is because the tx_fifo will be
>>> allocated init app_worker_init_connected.
>>>
>>> Is this a bug or I have something wrong with my using of host-stack?
>>>
>>>
>>>
>>> void
>>> session_close (session_t * s)
>>> {
>>> if (!s)
>>> return;
>>>
>>> if (s->session_state >= SESSION_STATE_CLOSING)
>>> {
>>> /* Session will only be removed once both app and transport
>>> * acknowledge the close */
>>> if (s->session_state == SESSION_STATE_TRANSPORT_CLOSED
>>> || s->session_state == SESSION_STATE_TRANSPORT_DELETED)
>>> session_program_transport_ctrl_evt (s, SESSION_CTRL_EVT_CLOSE);
>>> return;
>>> }
>>>
>>> /* App closed so stop propagating dequeue notifications */
>>> svm_fifo_clear_deq_ntf (s->tx_fifo);
>>> s->session_state = SESSION_STATE_CLOSING;
>>> session_program_transport_ctrl_evt (s, SESSION_CTRL_EVT_CLOSE);
>>> }
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>
>
> 
>
>




[vpp-dev] weird crash when allocate new ho_session_alloc in debug image

2022-10-24 Thread Zhang Dongya
Hi list,

Recently I have been testing my TCP application in a plugin: I initiate a
TCP client from the plugin. However, when I build the debug image and test,
VPP crashes and complains about being out of memory.

After doing some research, it seems the following code may cause the crash:

always_inline session_t *
ho_session_alloc (void)
{
  session_t *s;
  ASSERT (vlib_get_thread_index () == 0);
  s = session_alloc (0);
  s->session_state = SESSION_STATE_CONNECTING;
  s->flags |= SESSION_F_HALF_OPEN;
  /* Not ideal. Half-opens are only allocated from main with worker barrier
   * but can be cleaned up, i.e., session_half_open_free, from main without
   * a barrier. In debug images, the free_bitmap can grow while workers peek
   * the sessions pool, e.g., session_half_open_migrate_notify, and as a
   * result crash while validating the session. To avoid this, grow the
   * bitmap now. */
  if (CLIB_DEBUG)
    {
      session_t *sp = session_main.wrk[0].sessions;
      clib_bitmap_validate (pool_header (sp)->free_bitmap, s->session_index);
    }
  return s;
}

since clib_bitmap_validate is defined as:

/* Make sure that a bitmap is at least n_bits in size */
#define clib_bitmap_validate(v,n_bits) \
  clib_bitmap_vec_validate ((v), ((n_bits) - 1) / BITS (uword))


The first half-open session has session_index zero, so (0 - 1) underflows
and the macro tries to validate a bitmap of roughly UINT64_MAX / 64 words,
an allocation that makes vppinfra abort.

I think we should change the code above to pass s->session_index + 1; if
that's correct, I will submit a patch later.
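
A tiny standalone illustration of the underflow (hypothetical example, not
VPP code):

#include <stdint.h>
#include <stdio.h>

int
main (void)
{
  uint64_t n_bits = 0;                  /* first half-open session index */
  /* clib_bitmap_validate computes ((n_bits) - 1) / BITS (uword); on an
   * unsigned 64-bit word this wraps around instead of giving -1 / 64. */
  uint64_t last_word_index = (n_bits - 1) / 64;
  printf ("words requested: %llu\n", (unsigned long long) last_word_index);
  return 0;
}

clib_bitmap_vec_validate then tries to grow the vector to that many uword
entries, which is the out-of-memory abort described above.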




Re: [vpp-dev] weird crash when allocate new ho_session_alloc in debug image

2022-10-24 Thread Zhang Dongya
Hi,

Can you elaborate a bit on that? If the session index is 64 and we do not
increase it by 1, the macro will only validate one 64-bit word for the
bitmap, which may not be able to hold that session index.

Florin Coras wrote on Tue, Oct 25, 2022 at 01:14:

> Hi,
>
> Could you replace s->session_index by s->session_index ? : 1 in the patch?
>
> Regards,
> Florin
>
> On Oct 24, 2022, at 12:23 AM, Zhang Dongya 
> wrote:
>
> Hi list,
>
> Recently I am testing my TCP application in a plugin, what I did is to
> initiate a TCP client in my plugin, however, when I build the debug image
> and test, the vpp
> will crash and complaint about out of memory.
>
> After doing some research, it seems the following code may cause the crash:
>
> always_inline session_t *
> ho_session_alloc (void)
> {
>   session_t *s;
>   ASSERT (vlib_get_thread_index () == 0);
>   s = session_alloc (0);
>   s->session_state = SESSION_STATE_CONNECTING;
>   s->flags |= SESSION_F_HALF_OPEN;
>   /* Not ideal. Half-opens are only allocated from main with worker barrier
>    * but can be cleaned up, i.e., session_half_open_free, from main without
>    * a barrier. In debug images, the free_bitmap can grow while workers peek
>    * the sessions pool, e.g., session_half_open_migrate_notify, and as a
>    * result crash while validating the session. To avoid this, grow the
>    * bitmap now. */
>   if (CLIB_DEBUG)
>     {
>       session_t *sp = session_main.wrk[0].sessions;
>       clib_bitmap_validate (pool_header (sp)->free_bitmap, s->session_index);
>     }
>   return s;
> }
>
> since the clib_bitmap_validate is defined as:
>
> /* Make sure that a bitmap is at least n_bits in size */
> #define clib_bitmap_validate(v,n_bits) \
>   clib_bitmap_vec_validate ((v), ((n_bits) - 1) / BITS (uword))
>
>
> The first half open session have a session_index with zero, so 0-1 will
> make a overflow which cause it try to allocate (UINT64_MAX-1)/ 64 memory
> which make
> the vppinfra abort.
>
> I think we should modify the code above with s->session_index + 1, if
> that's correct, I will submit a patch later.
>
>
>
>
>
> 
>
>




Re: [vpp-dev] weird crash when allocate new ho_session_alloc in debug image

2022-11-03 Thread Zhang Dongya
Hi,

I have made a patch and submitted it to Gerrit for review.

https://gerrit.fd.io/r/c/vpp/+/37567

I have run the VPP unit tests for the session feature against it, and no
regressions have been found yet.

Florin Coras wrote on Tue, Nov 1, 2022 at 23:42:

> Hi,
>
> Will you be pushing the fix or should I do it?
>
> Regards,
> Florin
>
> On Oct 25, 2022, at 9:26 AM, Florin Coras via lists.fd.io <
> fcoras.lists=gmail@lists.fd.io> wrote:
>
> Hi,
>
> Apologies, I missed your original point and only thought about the large
> bitmap we create at startup. So yes, go for s->session_index + 1.
>
> Regards,
> Florin
>
> On Oct 24, 2022, at 9:11 PM, Zhang Dongya 
> wrote:
>
> Hi,
>
> Can you elaborate a bit on that, If session index is 64, if we do not
> increase by 1, it will only make one 64B vec for the bitmap, which may not
> hold the session index.
>
> Florin Coras wrote on Tue, Oct 25, 2022 at 01:14:
>
>> Hi,
>>
>> Could you replace s->session_index by s->session_index ? : 1 in the
>> patch?
>>
>> Regards,
>> Florin
>>
>> On Oct 24, 2022, at 12:23 AM, Zhang Dongya 
>> wrote:
>>
>> Hi list,
>>
>> Recently I am testing my TCP application in a plugin, what I did is to
>> initiate a TCP client in my plugin, however, when I build the debug image
>> and test, the vpp
>> will crash and complaint about out of memory.
>>
>> After doing some research, it seems the following code may cause the
>> crash:
>>
>> always_inline session_t *
>> ho_session_alloc (void)
>> {
>>   session_t *s;
>>   ASSERT (vlib_get_thread_index () == 0);
>>   s = session_alloc (0);
>>   s->session_state = SESSION_STATE_CONNECTING;
>>   s->flags |= SESSION_F_HALF_OPEN;
>>   /* Not ideal. Half-opens are only allocated from main with worker barrier
>>    * but can be cleaned up, i.e., session_half_open_free, from main without
>>    * a barrier. In debug images, the free_bitmap can grow while workers peek
>>    * the sessions pool, e.g., session_half_open_migrate_notify, and as a
>>    * result crash while validating the session. To avoid this, grow the
>>    * bitmap now. */
>>   if (CLIB_DEBUG)
>>     {
>>       session_t *sp = session_main.wrk[0].sessions;
>>       clib_bitmap_validate (pool_header (sp)->free_bitmap, s->session_index);
>>     }
>>   return s;
>> }
>>
>> since the clib_bitmap_validate is defined as:
>>
>> /* Make sure that a bitmap is at least n_bits in size */
>> #define clib_bitmap_validate(v,n_bits) \
>>   clib_bitmap_vec_validate ((v), ((n_bits) - 1) / BITS (uword))
>>
>>
>> The first half open session have a session_index with zero, so 0-1 will
>> make a overflow which cause it try to allocate (UINT64_MAX-1)/ 64 memory
>> which make
>> the vppinfra abort.
>>
>> I think we should modify the code above with s->session_index + 1, if
>> that's correct, I will submit a patch later.
>>
>>
>>
>>
>>
>>
>>
>>
>
>
> 
>
>
>




[vpp-dev] possible use deleted sw if index in ip4-lookup and cause crash

2022-11-28 Thread Zhang Dongya
Hi list,

Recently I encountered a VPP crash with my plugin enabled; after some
investigation I found it may be related to an L3 sub-interface being deleted
while my process node adds work to the ip4-lookup node.

Intuitively I thought it might be related to barrier usage, so I tried to
fix it by adding checks in my process node to guard against the case where
the L3 sub-interface is deleted; however, the crash still occurs.

Finally, I think it is related to a pattern like this:

1, my process node adds a pkt by using put_frame_to_node to ip4-lookup
directly, which sets the rx interface to the l3 sub interface created before.
2, my control plane agent (using govpp) deletes the l3 sub interface. (it
should be handled in the vpp api-process node)
3, vpp schedules the pending nodes. since the rx interface is deleted, vpp
can't get a valid fib index and there is no check in the following
ip4_fib_forwarding_lookup, so it crashes with an abort.

I think vpp may schedule my process node (timeout driven) and the
api-process node one after another, and only then dispatch the pending
nodes.

Should I add a check in ip4-lookup, or is there a better way of sending
packets from a control process, or is my approach simply not correct?
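
For reference, the injection pattern in step 1 looks roughly like this (a
sketch with assumed names, based on the standard vlib frame API rather than
my exact plugin code):

#include <vlib/vlib.h>
#include <vnet/vnet.h>
#include <vnet/ip/ip.h>

/* Sketch: hand one buffer to ip4-lookup from a process node. The RX
 * interface picks the FIB; if it is deleted before the pending frame is
 * dispatched, ip4-lookup sees an invalid FIB index (the crash above). */
static void
send_pkt_to_ip4_lookup (vlib_main_t *vm, u32 bi, u32 rx_sw_if_index)
{
  vlib_buffer_t *b = vlib_get_buffer (vm, bi);
  vlib_frame_t *f = vlib_get_frame_to_node (vm, ip4_lookup_node.index);
  u32 *to_next = vlib_frame_vector_args (f);

  vnet_buffer (b)->sw_if_index[VLIB_RX] = rx_sw_if_index;
  vnet_buffer (b)->sw_if_index[VLIB_TX] = ~0;

  to_next[0] = bi;
  f->n_vectors = 1;
  vlib_put_frame_to_node (vm, ip4_lookup_node.index, f);
}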

Thanks a lot.




Re: [vpp-dev] possible use deleted sw if index in ip4-lookup and cause crash

2022-11-28 Thread Zhang Dongya
I have found a solution that resolves the crash.

In ip4_sw_interface_add_del, which is the callback invoked on interface
deletion, we can set the fib index of the removed interface to 0 (the
default fib) instead of ~0. This matches the behavior on interface creation.


Zhang Dongya via lists.fd.io wrote on Mon, Nov 28, 2022 at 19:41:

> Hi list,
>
> Recently I encountered a vpp crash with my plugin enabled, after some
> investigation I find it may related with l3 sub interface delete while my
> process node add work to ip4-lookup node.
>
> Intuitively I think it may related to a barrier usage but I tried to fix
> by add some check in my process node to guard the case that l3 sub
> interface is deleted. however the crash still exists.
>
> Finally I think it should be related to a pattern like this:
>
> 1, my process node adds a pkt by using put_frame_to_node to ip4-lookup
> directly, which set the rx interface to the l3 sub interface created before.
> 2, my control plane agent (using govpp) delete the l3 sub interface. (it
> should be handled in vpp api-process node)
> 3, vpp schedule pending nodes. since the rx interface is deleted, vpp
> can't get a valid fib index and there is not check in the following
> ip4_fib_forwarding_lookup, so it crash with abort.
>
> I think vpp may schedule my process node(timeout driven) and api-process
> node one over one, then it will schedule the pending nodes.
>
> Should I add some check in ip4-lookup or there are better way of sending
> pkt in ctrl process not correct ?
>
> Thanks a lot.
>
>
>
>
>
> 
>
>




Re: [vpp-dev] possible use deleted sw if index in ip4-lookup and cause crash

2022-11-29 Thread Zhang Dongya
Hi ben,

In the beginning I also thought it was a barrier issue; however, that
turned out not to be the case.

The packet whose sw_if_index[VLIB_RX] is set to the to-be-deleted interface
is actually put onto the ip4-lookup node by my process node; the process
node adds packets in a timer-driven way.

Since the packet is added by my process node, I think it is not affected by
the worker barrier. In my case the sub-interface is deleted via the API,
which is processed in the linux_epoll_input PRE_INPUT node; consider the
following sequence:

   1. my process node adds a pkt to the ip4-lookup node, and the pkt refers
   to a valid sw if index
   2. linux_epoll_input processes an API request to delete the above sw if
   index.
   3. vpp schedules the ip4-lookup node, which then crashes because the sw
   if index is deleted and the ip4-lookup node can't use
   sw_if_index[VLIB_RX], which now maps to a fib index of ~0, to get a
   valid fib index.

There is some code that works this way (ikev2_send_ike and others); I don't
think it is feasible to update the pending frames when the interface is
deleted.

Benoit Ganne (bganne) via lists.fd.io wrote on Tue, Nov 29, 2022 at 22:22:

> Hi Zhang,
>
> I'd expect the interface deletion to happen under the worker barrier. VPP
> workers should drain all their in-flight packets before entering the
> barrier, so it should not be possible for the interface to disappear
> between your node and ip4-lookup. Or am I missing something?
> What I have seen happening is you'd have some data structure where you
> keep the interface index that you use in your node, and this data is not
> updated when the interface is removed.
> Regarding your proposal, I suspect an issue could be when we reuse the
> sw_if_index: if you del a sw_interface and then add a new one, chances are
> you'll be reusing the same index, but fib_index might be different.
>
> Best
> ben
>
> > -Original Message-
> > From: vpp-dev@lists.fd.io  On Behalf Of Zhang
> Dongya
> > Sent: Tuesday, November 29, 2022 3:45
> > To: vpp-dev@lists.fd.io
> > Subject: Re: [vpp-dev] possible use deleted sw if index in ip4-lookup and
> > cause crash
> >
> >
> > I have found a solution and it can solve the crash issue.
> >
> > In ip4_sw_interface_add_del which is a callback for interface deletion,
> we
> > may set the fib index of the removed interface to 0 (default fib) instead
> > of ~0.  This behavior is same with interface creation.
> >
> >
> >
> > Zhang Dongya via lists.fd.io wrote on Mon, Nov 28, 2022 at 19:41:
> >
> >
> >   Hi list,
> >
> >   Recently I encountered a vpp crash with my plugin enabled, after
> > some investigation I find it may related with l3 sub interface delete
> > while my process node add work to ip4-lookup node.
> >
> >
> >   Intuitively I think it may related to a barrier usage but I tried
> > to fix by add some check in my process node to guard the case that l3 sub
> > interface is deleted. however the crash still exists.
> >
> >   Finally I think it should be related to a pattern like this:
> >
> >   1, my process node adds a pkt by using put_frame_to_node to ip4-
> > lookup directly, which set the rx interface to the l3 sub interface
> > created before.
> >
> >   2, my control plane agent (using govpp) delete the l3 sub
> > interface. (it should be handled in vpp api-process node)
> >
> >   3, vpp schedule pending nodes. since the rx interface is deleted,
> > vpp can't get a valid fib index and there is not check in the following
> > ip4_fib_forwarding_lookup, so it crash with abort.
> >
> >   I think vpp may schedule my process node(timeout driven) and api-
> > process node one over one, then it will schedule the pending nodes.
> >
> >   Should I add some check in ip4-lookup or there are better way of
> > sending pkt in ctrl process not correct ?
> >
> >   Thanks a lot.
> >
> >
> >
> >
> >
> >
> >
> >
>
>
> 
>
>




Re: [vpp-dev] possible use deleted sw if index in ip4-lookup and cause crash

2022-12-07 Thread Zhang Dongya
The crash has not occurred anymore.

Does this fix make sense? If it does, I will submit a patch later.

Zhang Dongya via lists.fd.io wrote on Tue, Nov 29, 2022 at 22:51:

> Hi ben,
>
> In the beginning I also think it should be a barrier issue, however it
> turned out not the case.
>
> The pkt which had sw_if_index[VLIB_RX] set as the to-be-deleted interface
> is actually being put to ip4-lookup node by my process node, the process
> node add pkt in a timer drive way.
>
> Since the pkt is added by my process node, I think it is not affected by
> the worker barrier.  in my case the sub if is deleted by API, which is
> processed in linux_epoll_input PRE_INPUT node, let's consider the following
> sequence:
>
>
>1. my process add a pkt to ip4-node, and the pkt refer to a valid sw
>if index
>2. linux_epoll_input process a API request to delete the above sw if
>index.
>3. vpp schedule ip4-lookup node, then it will crash because the sw if
>index is deleted and ip4_lookup node can't use sw_if_index[VLIB_RX] which
>is now ~0 to get a valid fib index.
>
>
> There are some code that do this way (ikev2_send_ike and others), I think
> it's not doable to update the pending frame when the interface is deleted.
>
> Benoit Ganne (bganne) via lists.fd.io wrote on Tue, Nov 29, 2022 at 22:22:
>
>> Hi Zhang,
>>
>> I'd expect the interface deletion to happen under the worker barrier. VPP
>> workers should drain all their in-flight packets before entering the
>> barrier, so it should not be possible for the interface to disappear
>> between your node and ip4-lookup. Or am I missing something?
>> What I have seen happening is you'd have some data structure where you
>> keep the interface index that you use in your node, and this data is not
>> updated when the interface is removed.
>> Regarding your proposal, I suspect an issue could be when we reuse the
>> sw_if_index: if you del a sw_interface and then add a new one, chances are
>> you'll be reusing the same index, but fib_index might be different.
>>
>> Best
>> ben
>>
>> > -Original Message-
>> > From: vpp-dev@lists.fd.io  On Behalf Of Zhang
>> Dongya
>> > Sent: Tuesday, November 29, 2022 3:45
>> > To: vpp-dev@lists.fd.io
>> > Subject: Re: [vpp-dev] possible use deleted sw if index in ip4-lookup
>> and
>> > cause crash
>> >
>> >
>> > I have found a solution and it can solve the crash issue.
>> >
>> > In ip4_sw_interface_add_del which is a callback for interface deletion,
>> we
>> > may set the fib index of the removed interface to 0 (default fib)
>> instead
>> > of ~0.  This behavior is same with interface creation.
>> >
>> >
>> >
>> > Zhang Dongya via lists.fd.io wrote on Mon, Nov 28, 2022 at 19:41:
>> >
>> >
>> >   Hi list,
>> >
>> >   Recently I encountered a vpp crash with my plugin enabled, after
>> > some investigation I find it may related with l3 sub interface delete
>> > while my process node add work to ip4-lookup node.
>> >
>> >
>> >   Intuitively I think it may related to a barrier usage but I tried
>> > to fix by add some check in my process node to guard the case that l3
>> sub
>> > interface is deleted. however the crash still exists.
>> >
>> >   Finally I think it should be related to a pattern like this:
>> >
>> >   1, my process node adds a pkt by using put_frame_to_node to ip4-
>> > lookup directly, which set the rx interface to the l3 sub interface
>> > created before.
>> >
>> >   2, my control plane agent (using govpp) delete the l3 sub
>> > interface. (it should be handled in vpp api-process node)
>> >
>> >   3, vpp schedule pending nodes. since the rx interface is deleted,
>> > vpp can't get a valid fib index and there is not check in the following
>> > ip4_fib_forwarding_lookup, so it crash with abort.
>> >
>> >   I think vpp may schedule my process node(timeout driven) and api-
>> > process node one over one, then it will schedule the pending nodes.
>> >
>> >   Should I add some check in ip4-lookup or there are better way of
>> > sending pkt in ctrl process not correct ?
>> >
>> >   Thanks a lot.
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>>
>>
>>
>>
>>
> 
>
>




Re: [vpp-dev] possible use deleted sw if index in ip4-lookup and cause crash

2022-12-13 Thread Zhang Dongya
Hi list,

During testing, when the L3 sub-interface is deleted, I got a new abort in
the interface drop node; it seems the packet references a deleted interface.

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> #1  0x7face8d17859 in __GI_abort () at abort.c:79
> #2  0x00407397 in os_exit (code=1) at
> /home/fortitude/glx/vpp/src/vpp/vnet/main.c:440
> #3  0x7face922dd57 in unix_signal_handler (signum=6,
> si=0x7faca2891170, uc=0x7faca2891040) at
> /home/fortitude/glx/vpp/src/vlib/unix/main.c:188
> #4  <signal handler called>
> #5  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> #6  0x7face8d17859 in __GI_abort () at abort.c:79
> #7  0x00407333 in os_panic () at
> /home/fortitude/glx/vpp/src/vpp/vnet/main.c:416
> #8  0x7face9067039 in debugger () at
> /home/fortitude/glx/vpp/src/vppinfra/error.c:84
> #9  0x7face9066dfa in _clib_error (how_to_die=2, function_name=0x0,
> line_number=0, fmt=0x7face9f7a208 "%s:%d (%s) assertion `%s' fails") at
> /home/fortitude/glx/vpp/src/vppinfra/error.c:143
> #10 0x7face9b28358 in vnet_get_sw_interface (vnm=0x7facea243f38
> , sw_if_index=14) at
> /home/fortitude/glx/vpp/src/vnet/interface_funcs.h:60
> #11 0x7face9b2a4ba in interface_drop_punt (vm=0x7facac8e5b00,
> node=0x7faca95c8840, frame=0x7facc2004a40,
> disposition=VNET_ERROR_DISPOSITION_DROP)
> at /home/fortitude/glx/vpp/src/vnet/interface_output.c:1061
> #12 0x7face9b29a96 in interface_drop_fn_hsw (vm=0x7facac8e5b00,
> node=0x7faca95c8840, frame=0x7facc2004a40) at
> /home/fortitude/glx/vpp/src/vnet/interface_output.c:1215
> #13 0x7face91cd50d in dispatch_node (vm=0x7facac8e5b00,
> node=0x7faca95c8840, type=VLIB_NODE_TYPE_INTERNAL,
> dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x7facc2004a40,
> last_time_stamp=404307411779413) at
> /home/fortitude/glx/vpp/src/vlib/main.c:961
> #14 0x7face91cdfb0 in dispatch_pending_node (vm=0x7facac8e5b00,
> pending_frame_index=3, last_time_stamp=404307411779413) at
> /home/fortitude/glx/vpp/src/vlib/main.c:1120
> #15 0x7face91c921f in vlib_main_or_worker_loop (vm=0x7facac8e5b00,
> is_main=0) at /home/fortitude/glx/vpp/src/vlib/main.c:1589
> #16 0x7face91c8947 in vlib_worker_loop (vm=0x7facac8e5b00) at
> /home/fortitude/glx/vpp/src/vlib/main.c:1723
> #17 0x7face92080a4 in vlib_worker_thread_fn (arg=0x7facaa227d00) at
> /home/fortitude/glx/vpp/src/vlib/threads.c:1579
> #18 0x7face9203195 in vlib_worker_thread_bootstrap_fn
> (arg=0x7facaa227d00) at /home/fortitude/glx/vpp/src/vlib/threads.c:418
> #19 0x7face9121609 in start_thread (arg=) at
> pthread_create.c:477
> #20 0x7face8e14133 in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>

From the first mail, I want to know whether the following sequence can
happen or not:

1, my process node adds a pkt by using put_frame_to_node to ip4-lookup
directly, which sets the rx interface to the l3 sub interface created before.
2, my control plane agent (using govpp) deletes the l3 sub interface. (it
should be handled in the vpp api-process node)
3, vpp schedules the pending nodes. since the rx interface is deleted, vpp
can't get a valid fib index and there is no check in the following
ip4_fib_forwarding_lookup, so it crashes with an abort.

I don't think an API barrier in step 2 can solve this, since the packet is
already in a pending frame.

Zhang Dongya via lists.fd.io wrote on Thu, Dec 8, 2022 at 00:17:

> The crash have not been found anymore.
>
> Does this fix make any sense? it it does, I will submit a patch later.
>
> Zhang Dongya via lists.fd.io wrote on Tue, Nov 29, 2022 at 22:51:
>
>> Hi ben,
>>
>> In the beginning I also think it should be a barrier issue, however it
>> turned out not the case.
>>
>> The pkt which had sw_if_index[VLIB_RX] set as the to-be-deleted interface
>> is actually being put to ip4-lookup node by my process node, the process
>> node add pkt in a timer drive way.
>>
>> Since the pkt is added by my process node, I think it is not affected by
>> the worker barrier.  in my case the sub if is deleted by API, which is
>> processed in linux_epoll_input PRE_INPUT node, let's consider the following
>> sequence:
>>
>>
>>1. my process add a pkt to ip4-node, and the pkt refer to a valid sw
>>if index
>>2. linux_epoll_input process a API request to delete the above sw if
>>index.
>>3. vpp schedule ip4-lookup node, then it will crash because the sw if
>>index is deleted and ip4_lookup node can't use sw_if_index[VLIB_RX] which
>>is now ~0 to get a valid fib index.
>>
>>
>> There are some code that do this way (ikev2_send_ike and others), I think
>> it's not doable to update the 

Re: [vpp-dev] possible use deleted sw if index in ip4-lookup and cause crash

2022-12-13 Thread Zhang Dongya
By adding the following code right after process dispatch in the main loop,
the crash is fixed.

So I think the condition mentioned above is a rare but valid case.

A ctrl process node, when scheduled, adds a packet (a pending frame)
destined for another node, and that packet refers to an interface which
will be deleted soon.

The interface is then deleted in the unix_epoll_input PRE_INPUT node, which
handles API input, and the subsequent graph scheduling then triggers various
assert failures.


>   {
>     /* Ctrl nodes may have added work to the pending vector too.
>        Process pending vector until there is nothing left.
>        All pending vectors will be processed from input -> output. */
>     for (i = 0; i < _vec_len (nm->pending_frames); i++)
>       cpu_time_now = dispatch_pending_node (vm, i, cpu_time_now);
>     /* Reset pending vector for next iteration. */
>     vec_set_len (nm->pending_frames, 0);
>
>     if (is_main)
>       {
>         /* We also need to do a barrier here to account for worker
>            nodes which have packets handed off to them. */
>         vlib_worker_thread_barrier_sync (vm);
>         vlib_worker_thread_barrier_release (vm);
>       }
>   }
>

Zhang Dongya via lists.fd.io wrote on Wed, Dec 14, 2022 at 11:52:

> Hi list,
>
> During the test, when l3sub if is deleted, I got a new abort in interface
> drop node, seems the packet reference to a deleted interface.
>
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>> #1  0x7face8d17859 in __GI_abort () at abort.c:79
>> #2  0x00407397 in os_exit (code=1) at
>> /home/fortitude/glx/vpp/src/vpp/vnet/main.c:440
>> #3  0x7face922dd57 in unix_signal_handler (signum=6,
>> si=0x7faca2891170, uc=0x7faca2891040) at
>> /home/fortitude/glx/vpp/src/vlib/unix/main.c:188
>> #4  
>> #5  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>> #6  0x7face8d17859 in __GI_abort () at abort.c:79
>> #7  0x00407333 in os_panic () at
>> /home/fortitude/glx/vpp/src/vpp/vnet/main.c:416
>> #8  0x7face9067039 in debugger () at
>> /home/fortitude/glx/vpp/src/vppinfra/error.c:84
>> #9  0x7face9066dfa in _clib_error (how_to_die=2, function_name=0x0,
>> line_number=0, fmt=0x7face9f7a208 "%s:%d (%s) assertion `%s' fails") at
>> /home/fortitude/glx/vpp/src/vppinfra/error.c:143
>> #10 0x7face9b28358 in vnet_get_sw_interface (vnm=0x7facea243f38
>> , sw_if_index=14) at
>> /home/fortitude/glx/vpp/src/vnet/interface_funcs.h:60
>> #11 0x7face9b2a4ba in interface_drop_punt (vm=0x7facac8e5b00,
>> node=0x7faca95c8840, frame=0x7facc2004a40,
>> disposition=VNET_ERROR_DISPOSITION_DROP)
>> at /home/fortitude/glx/vpp/src/vnet/interface_output.c:1061
>> #12 0x7face9b29a96 in interface_drop_fn_hsw (vm=0x7facac8e5b00,
>> node=0x7faca95c8840, frame=0x7facc2004a40) at
>> /home/fortitude/glx/vpp/src/vnet/interface_output.c:1215
>> #13 0x7face91cd50d in dispatch_node (vm=0x7facac8e5b00,
>> node=0x7faca95c8840, type=VLIB_NODE_TYPE_INTERNAL,
>> dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x7facc2004a40,
>> last_time_stamp=404307411779413) at
>> /home/fortitude/glx/vpp/src/vlib/main.c:961
>> #14 0x7face91cdfb0 in dispatch_pending_node (vm=0x7facac8e5b00,
>> pending_frame_index=3, last_time_stamp=404307411779413) at
>> /home/fortitude/glx/vpp/src/vlib/main.c:1120
>> #15 0x7face91c921f in vlib_main_or_worker_loop (vm=0x7facac8e5b00,
>> is_main=0) at /home/fortitude/glx/vpp/src/vlib/main.c:1589
>> #16 0x7face91c8947 in vlib_worker_loop (vm=0x7facac8e5b00) at
>> /home/fortitude/glx/vpp/src/vlib/main.c:1723
>> #17 0x7face92080a4 in vlib_worker_thread_fn (arg=0x7facaa227d00) at
>> /home/fortitude/glx/vpp/src/vlib/threads.c:1579
>> #18 0x7face9203195 in vlib_worker_thread_bootstrap_fn
>> (arg=0x7facaa227d00) at /home/fortitude/glx/vpp/src/vlib/threads.c:418
>> #19 0x7face9121609 in start_thread (arg=) at
>> pthread_create.c:477
>> #20 0x7face8e14133 in clone () at
>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>>
>
> From the first mail, I want to know is the sequence can happen or not ?
>
> 1, my process node adds a pkt by using put_frame_to_node to ip4-lookup
> directly, which set the rx interface to the l3 sub interface created before.
> 2, my control plane agent (using govpp) delete the l3 sub interface. (it
> should be handled in vpp api-process node)
> 3, vpp schedule pending nodes. since the rx interface is deleted, vpp
> can't get a valid fib index and there is not check in the following
> ip4_fib_forward

[vpp-dev] possible bug in sparse_vec_index2

2023-02-06 Thread Zhang Dongya
Hi list,

Recently I found a weird bug: when only 1 UDP local port is registered
(because our vpp application has a special initialization sequence),
sparse_vec_index2 triggers a vpp crash.

Comparing sparse_vec_index2 with sparse_vec_index_internal, it seems
sparse_vec_index2 does not properly handle the case where the looked-up
keys are not stored in the sparse vec.

in sparse_vec_index_internal,

>   w = h->is_member_bitmap[i];
>
>   /* count_trailing_zeros(0) == 0, take care of that case */
>   if (PREDICT_FALSE (maybe_range == 0 && insert == 0 && w == 0))
>     return 0;
>

the w == 0 case is checked and a lookup failure is returned to the caller
(return 0).

In the dual-lookup case, however, the w0 == 0 / w1 == 0 cases are not
checked.

It seems only a limited amount of code uses sparse_vec_index2, so this bug
may not be hit that frequently.




[vpp-dev] can't establish tcp connection with new introduced transport_endpoint_freelist

2023-03-13 Thread Zhang Dongya
Hi list,

We have merged the upstream session & tcp changes into our code base and
found a possible bug that prevents tcp connections from being established.

Our scenario is that we connect to a remote tcp server with a specified
local port and local ip. However, the new vpp code has introduced a
lcl_endpts_freelist which is only flushed when the number of pending local
endpoints exceeds the limit (32) or when transport_alloc_local_port is
called.

Since we specify the local port and local ip and the total session count is
limited (< 32), transport_cleanup_freelist is never called in this case, so
a previous session that used the specified local port and local ip is not
released after the session is aborted.

I think we should also try to free the list in this case, as I did in the
following code:

int
> transport_alloc_local_endpoint (u8 proto, transport_endpoint_cfg_t *
> rmt_cfg,
> ip46_address_t * lcl_addr, u16 * lcl_port)
> {
>   // ZDY:
>   transport_main_t *tm = &tp_main;
>   transport_endpoint_t *rmt = (transport_endpoint_t *) rmt_cfg;
>   session_error_t error;
>   int port;
>
>   /*
>* Find the local address
>*/
>   if (ip_is_zero (&rmt_cfg->peer.ip, rmt_cfg->peer.is_ip4))
> {
>   error = transport_find_local_ip_for_remote
> (&rmt_cfg->peer.sw_if_index,
>  rmt, lcl_addr);
>   if (error)
> return error;
> }
>   else
> {
>   /* Assume session layer vetted this address */
>   clib_memcpy_fast (lcl_addr, &rmt_cfg->peer.ip,
> sizeof (rmt_cfg->peer.ip));
> }
>
>   /*
>* Allocate source port
>*/
>   if (rmt_cfg->peer.port == 0)
> {
>   port = transport_alloc_local_port (proto, lcl_addr, rmt_cfg);
>   if (port < 1)
> return SESSION_E_NOPORT;
>   *lcl_port = port;
> }
>   else
> {
>   port = clib_net_to_host_u16 (rmt_cfg->peer.port);
>   *lcl_port = port;
>
>
>   // ZDY: need to add this cleanup because in the specified-src-port
>   // case we will not run transport_alloc_local_port, so the
>   // freelist would only be freed when the list is full (> 32).
>   /* Cleanup freelist if need be */
>   if (vec_len (tm->lcl_endpts_freelist))
>     transport_cleanup_freelist ();
>
>   return transport_endpoint_mark_used (proto, lcl_addr, port);
> }
>
>   return 0;
> }
>




Re: [vpp-dev] can't establish tcp connection with new introduced transport_endpoint_freelist

2023-03-14 Thread Zhang Dongya
I just applied this patch and the connection can be re-established after
being closed.

However, I found another possible bug when using the same local ip + local
port toward different target servers: transport_endpoint_mark_used returns
an error if it finds the local ip + port already marked.

I think it should increase the refcnt instead if it finds the 6-tuple is
unique.

static int
> transport_endpoint_mark_used (u8 proto, ip46_address_t *ip, u16 port)
> {
>   transport_main_t *tm = &tp_main;
>   local_endpoint_t *lep;
>   u32 tei;
>
>   ASSERT (vlib_get_thread_index () <= transport_cl_thread ());
>
  // BUG??? maybe should allow reuse ???
>
  tei =
> transport_endpoint_lookup (&tm->local_endpoints_table, proto, ip,
> port);
>   if (tei != ENDPOINT_INVALID_INDEX)
> return SESSION_E_PORTINUSE;
>
>   /* Pool reallocs with worker barrier */
>   lep = transport_endpoint_alloc ();
>   clib_memcpy_fast (&lep->ep.ip, ip, sizeof (*ip));
>   lep->ep.port = port;
>   lep->proto = proto;
>   lep->refcnt = 1;
>
>   transport_endpoint_table_add (&tm->local_endpoints_table, proto,
> &lep->ep,
> lep - tm->local_endpoints);
>
>   return 0;
> }
>
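
A sketch of that idea, applied to the early-return branch quoted above (an
assumption about how it could look; the actual upstream change, the patch
Florin links below, may take a different approach):

  tei =
    transport_endpoint_lookup (&tm->local_endpoints_table, proto, ip, port);
  if (tei != ENDPOINT_INVALID_INDEX)
    {
      /* Local ip:port already marked: bump the refcount instead of
       * failing, and rely on the session layer to reject genuinely
       * duplicate connections to the same remote endpoint. */
      lep = pool_elt_at_index (tm->local_endpoints, tei);
      lep->refcnt += 1;
      return 0;
    }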

Florin Coras wrote on Tue, Mar 14, 2023 at 11:38:

> Hi,
>
> Could you try this out [1]? I’ve hit this issue myself today but with udp
> sessions. Unfortunately, as you’ve correctly pointed out, we were forcing a
> cleanup only on the non-fixed local port branch.
>
> Regards,
> Florin
>
> [1] https://gerrit.fd.io/r/c/vpp/+/38473
>
> On Mar 13, 2023, at 7:35 PM, Zhang Dongya 
> wrote:
>
> Hi list,
>
> We have update coded from the upstream session&tcp changes to our code
> base and find a possible bug which cause tcp connection can't be
> established anymore.
>
> Our scenario is that we will connect to a remote tcp server with specified
> local port and local ip, however, new vpp code have introduced a
> lcl_endpts_freelist which will be either flushed when pending local
> endpoint exceeded the limit (32) or when transport_alloc_local_port is
> called.
>
> However, since we specify the local port and local ip and the total
> session count is limited (< 32), in this case, the
> transport_cleanup_freelist will never be called which cause the previous
> session which use the specified local port and local ip will not be
> released after the session aborted.
>
> I think we should also try to free the list in such case as I did in the
> following code:
>
> int
>> transport_alloc_local_endpoint (u8 proto, transport_endpoint_cfg_t *
>> rmt_cfg,
>> ip46_address_t * lcl_addr, u16 * lcl_port)
>> {
>>   // ZDY:
>>   transport_main_t *tm = &tp_main;
>>   transport_endpoint_t *rmt = (transport_endpoint_t *) rmt_cfg;
>>   session_error_t error;
>>   int port;
>>
>>   /*
>>* Find the local address
>>*/
>>   if (ip_is_zero (&rmt_cfg->peer.ip, rmt_cfg->peer.is_ip4))
>> {
>>   error = transport_find_local_ip_for_remote
>> (&rmt_cfg->peer.sw_if_index,
>>  rmt, lcl_addr);
>>   if (error)
>> return error;
>> }
>>   else
>> {
>>   /* Assume session layer vetted this address */
>>   clib_memcpy_fast (lcl_addr, &rmt_cfg->peer.ip,
>> sizeof (rmt_cfg->peer.ip));
>> }
>>
>>   /*
>>* Allocate source port
>>*/
>>   if (rmt_cfg->peer.port == 0)
>> {
>>   port = transport_alloc_local_port (proto, lcl_addr, rmt_cfg);
>>   if (port < 1)
>> return SESSION_E_NOPORT;
>>   *lcl_port = port;
>> }
>>   else
>> {
>>   port = clib_net_to_host_u16 (rmt_cfg->peer.port);
>>   *lcl_port = port;
>>
>>
>>
>>
>>
>>
>> *  // ZDY: need add this to to cleanup because in specified src port
>> // case, we will not run to transport_alloc_local_port, then  //
>> freelist will only be freeed when list is full (>32).  /* Cleanup
>> freelist if need be */  if (vec_len (tm->lcl_endpts_freelist))
>> transport_cleanup_freelist ();*
>>
>>   return transport_endpoint_mark_used (proto, lcl_addr, port);
>> }
>>
>>   return 0;
>> }
>>
>
>
>
>
>
>
> 
>
>




Re: [vpp-dev] can't establish tcp connection with new introduced transport_endpoint_freelist

2023-03-16 Thread Zhang Dongya
Yes, this is exactly what I wanted to do; this patch works as expected,
thanks a lot.

Florin Coras wrote on Wed, Mar 15, 2023 at 01:22:

> Hi,
>
> Are you looking for behavior similar to the one when random local ports
> are allocated when, if port is used, we check if the 5-tuple is available?
>
> Don’t think we explicitly supported this before but here’s a patch [1].
>
> Regards,
> Florin
>
> [1] https://gerrit.fd.io/r/c/vpp/+/38486
>
>
> On Mar 14, 2023, at 12:56 AM, Zhang Dongya 
> wrote:
>
> Just use this patch and the connection can be reconnected after closed.
>
> However, I find another possible bug when using local ip + local port for
> different target server due to transport_endpoint_mark_used return error
> if it find local ip + port being created.
>
> I think it should increase the refcnt instead if it find 6 tuple is unique.
>
> static int
>> transport_endpoint_mark_used (u8 proto, ip46_address_t *ip, u16 port)
>> {
>>   transport_main_t *tm = &tp_main;
>>   local_endpoint_t *lep;
>>   u32 tei;
>>
>>   ASSERT (vlib_get_thread_index () <= transport_cl_thread ());
>>
>   // BUG??? maybe should allow reuse ???
>>
>   tei =
>> transport_endpoint_lookup (&tm->local_endpoints_table, proto, ip,
>> port);
>>   if (tei != ENDPOINT_INVALID_INDEX)
>> return SESSION_E_PORTINUSE;
>>
>>   /* Pool reallocs with worker barrier */
>>   lep = transport_endpoint_alloc ();
>>   clib_memcpy_fast (&lep->ep.ip, ip, sizeof (*ip));
>>   lep->ep.port = port;
>>   lep->proto = proto;
>>   lep->refcnt = 1;
>>
>>   transport_endpoint_table_add (&tm->local_endpoints_table, proto,
>> &lep->ep,
>> lep - tm->local_endpoints);
>>
>>   return 0;
>> }
>>
>
> Florin Coras wrote on Tue, Mar 14, 2023 at 11:38:
>
>> Hi,
>>
>> Could you try this out [1]? I’ve hit this issue myself today but with udp
>> sessions. Unfortunately, as you’ve correctly pointed out, we were forcing a
>> cleanup only on the non-fixed local port branch.
>>
>> Regards,
>> Florin
>>
>> [1] https://gerrit.fd.io/r/c/vpp/+/38473
>>
>> On Mar 13, 2023, at 7:35 PM, Zhang Dongya 
>> wrote:
>>
>> Hi list,
>>
>> We have updated our code base with the upstream session & tcp changes and
>> found a possible bug which causes tcp connections to no longer be
>> established.
>>
>> Our scenario is that we connect to a remote tcp server with a specified
>> local port and local ip; however, the new vpp code has introduced a
>> lcl_endpts_freelist which is flushed either when the number of pending local
>> endpoints exceeds the limit (32) or when transport_alloc_local_port is
>> called.
>>
>> However, since we specify the local port and local ip and the total
>> session count is limited (< 32), transport_cleanup_freelist will never be
>> called, so a previous session that used the specified local port and
>> local ip is not released after the session is aborted.
>>
>> I think we should also try to free the list in this case, as I did in the
>> following code:
>>
>> int
>>> transport_alloc_local_endpoint (u8 proto, transport_endpoint_cfg_t *
>>> rmt_cfg,
>>> ip46_address_t * lcl_addr, u16 * lcl_port)
>>> {
>>>   // ZDY:
>>>   transport_main_t *tm = &tp_main;
>>>   transport_endpoint_t *rmt = (transport_endpoint_t *) rmt_cfg;
>>>   session_error_t error;
>>>   int port;
>>>
>>>   /*
>>>* Find the local address
>>>*/
>>>   if (ip_is_zero (&rmt_cfg->peer.ip, rmt_cfg->peer.is_ip4))
>>> {
>>>   error = transport_find_local_ip_for_remote
>>> (&rmt_cfg->peer.sw_if_index,
>>>  rmt, lcl_addr);
>>>   if (error)
>>> return error;
>>> }
>>>   else
>>> {
>>>   /* Assume session layer vetted this address */
>>>   clib_memcpy_fast (lcl_addr, &rmt_cfg->peer.ip,
>>> sizeof (rmt_cfg->peer.ip));
>>> }
>>>
>>>   /*
>>>* Allocate source port
>>>*/
>>>   if (rmt_cfg->peer.port == 0)
>>> {
>>>   port = transport_alloc_local_port (proto, lcl_addr, rmt_cfg);
>>>   if (port < 1)
>>> return SESSION_E_NOPORT;
>>>   *lcl_port = port;
>>> }
>>>   else
>>> {
>>>   port = clib_net_to_host_u16 (rmt_cfg->peer.port);
>>>   *lcl_port = port;
>>>
>>>   // ZDY: need to add this cleanup because in the specified src port
>>>   // case, we will not run transport_alloc_local_port, so the
>>>   // freelist will only be freed when the list is full (>32).
>>>   /* Cleanup freelist if need be */
>>>   if (vec_len (tm->lcl_endpts_freelist))
>>>     transport_cleanup_freelist ();
>>>
>>>   return transport_endpoint_mark_used (proto, lcl_addr, port);
>>> }
>>>
>>>   return 0;
>>> }
>>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>
> 
>
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#22717): https://lists.fd.io/g/vpp-dev/message/22717
Mute This Topic: https://lists.fd.io/mt/97596886/21656
Group Owner: vpp-dev+ow...@lists.fd.io
Unsubscribe: https://lists.fd.io/g/vpp-dev/leave/1480452/21656/631435203/xyzzy 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



[vpp-dev] Sigabrt in tcp46_input_inline for tcp_lookup_is_valid

2023-03-19 Thread Zhang Dongya
Hi list,

Recently our application constantly triggered the following abort, which
interrupts our connectivity for a while:

Mar 19 16:11:26 ubuntu vnet[2565933]: received signal SIGABRT, PC
0x7fefd3b2000b
Mar 19 16:11:26 ubuntu vnet[2565933]:
/home/fortitude/glx/vpp/src/vnet/tcp/tcp_input.c:3004 (tcp46_input_inline)
assertion `tcp_lookup_is_valid (tc0, b[0], tcp_buffer_hdr (b[0]))' fails

Our scenario is quite simple: we make 4 parallel tcp connections (using 4
fixed source ports) to a remote vpp stack (fixed ip and port) and do some
keepalive in our application layer. Since we only use the vpp tcp stack to
keep the middleboxes happy with the connection, we do not actually use the
tcp stack for data transport.

However, since the network conditions are complex, we frequently need to
abort the connection and reconnect.

I keep merging upstream session and tcp fixes, but the issue is still not
fixed. What I have found so far is that in some cases
tcp_half_open_connection_cleanup may not delete the half-open session from
the lookup table (bihash), while the session index gets reallocated to another
connection.

I hope the list can provide some hints on how to overcome this issue,
thanks a lot.
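
To illustrate what I suspect is happening, here is a toy model (made-up names,
not the actual vpp or bihash code) of the invariant the assert checks: the
connection found through the 5-tuple lookup must itself still carry that
5-tuple. If the half-open's lookup entry is left behind and its session index
is recycled by another connection, the check fails on the next segment that
matches the stale entry.

#include <assert.h>

/* Toy model of the tcp_lookup_is_valid() invariant. All names here
 * (tuple_key_t, conn_t, lookup_entry_t, ...) are made up for illustration. */
typedef struct
{
  unsigned int ip[2];           /* local/remote, simplified to ip4 */
  unsigned short port[2];
  unsigned char proto;
} tuple_key_t;

typedef struct
{
  tuple_key_t key;
  int in_use;
} conn_t;

#define N_CONNS 4
static conn_t conn_pool[N_CONNS];

/* 5-tuple -> pool index mapping, standing in for the session bihash. */
typedef struct
{
  tuple_key_t key;
  int conn_index;
} lookup_entry_t;

static int
tuple_equal (const tuple_key_t *a, const tuple_key_t *b)
{
  return a->ip[0] == b->ip[0] && a->ip[1] == b->ip[1]
         && a->port[0] == b->port[0] && a->port[1] == b->port[1]
         && a->proto == b->proto;
}

/* The invariant asserted on input: the connection reached through the
 * lookup table must match the packet's 5-tuple. */
static int
lookup_is_valid (const lookup_entry_t *e, const tuple_key_t *pkt_key)
{
  const conn_t *c = &conn_pool[e->conn_index];
  return c->in_use && tuple_equal (&c->key, pkt_key);
}

static void
demo_stale_entry (void)
{
  tuple_key_t a = { { 0x0a000001, 0x0a000002 }, { 1000, 443 }, 6 };
  tuple_key_t b = { { 0x0a000003, 0x0a000004 }, { 2000, 443 }, 6 };

  /* Half-open for tuple A gets pool slot 0 and a lookup entry... */
  conn_pool[0] = (conn_t) { a, 1 };
  lookup_entry_t stale = { a, 0 };

  /* ...the half-open is cleaned up, but its lookup entry is not removed,
   * and slot 0 is recycled for a different connection (tuple B). */
  conn_pool[0] = (conn_t) { b, 1 };

  /* A late segment for tuple A still finds slot 0 via the stale entry,
   * and the validity check (the assert) fails. */
  assert (lookup_is_valid (&stale, &a) == 0);
}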

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#22721): https://lists.fd.io/g/vpp-dev/message/22721
Mute This Topic: https://lists.fd.io/mt/97707823/21656
Group Owner: vpp-dev+ow...@lists.fd.io
Unsubscribe: https://lists.fd.io/g/vpp-dev/leave/1480452/21656/631435203/xyzzy 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [vpp-dev] Sigabrt in tcp46_input_inline for tcp_lookup_is_valid

2023-03-19 Thread Zhang Dongya
Hi,

The connection can be aborted either in established state or in half-open
state, because we enforce a timeout in our app layer.

Regarding your question,

- Yes, we added a builtin app that relies on C APIs, mainly using
vnet_connect/disconnect to connect or disconnect sessions.
- We call these APIs in a vpp ctrl process which runs on the
master thread; we never do session setup/teardown on a worker thread. (The
environment where this issue was found is configured with a 1 master + 1 worker
setup.)
- We started to develop the app on 22.06 and I keep merging upstream
changes to the latest vpp by cherry-picking. The reason for the line mismatch is
that I added some comments to the session layer code; it should be equal to
the master branch now.

Reading the code, I understand that the half-open is mainly cleaned up from
the bihash in session_stream_connect_notify. However, in syn-sent state the
session might be closed by my app due to a session setup timeout (on a scale
of seconds); in that case, the session will be marked as half_open_done and
the half-open session will be freed shortly afterwards in the ctrl thread
(the 1st worker?).

Should I also register the half-open callback, or is there some other reason
that leads to this failure?
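
For reference, our connect path looks roughly like the sketch below: it runs in
the ctrl process on main and uses a fixed local ip/port via the peer endpoint.
This is a simplified sketch, not our exact code; the sep_ext/peer field names
and the api_context usage are written from memory of recent vpp and should be
double-checked against application_interface.h and session_types.h.

#include <vnet/session/application_interface.h>

/* Sketch: connect to a fixed remote endpoint with a fixed local ip/port.
 * Field names are from memory and may differ slightly between vpp versions. */
static int
my_app_connect (u32 app_index, ip4_address_t lcl_ip4, u16 lcl_port,
                ip4_address_t rmt_ip4, u16 rmt_port, u32 opaque)
{
  vnet_connect_args_t _a, *a = &_a;

  clib_memset (a, 0, sizeof (*a));
  a->app_index = app_index;
  a->api_context = opaque;      /* echoed back in the connected callback */

  a->sep_ext.transport_proto = TRANSPORT_PROTO_TCP;
  a->sep_ext.is_ip4 = 1;
  a->sep_ext.ip.ip4 = rmt_ip4;  /* remote endpoint */
  a->sep_ext.port = clib_host_to_net_u16 (rmt_port);

  a->sep_ext.peer.is_ip4 = 1;   /* fixed local endpoint */
  a->sep_ext.peer.ip.ip4 = lcl_ip4;
  a->sep_ext.peer.port = clib_host_to_net_u16 (lcl_port);

  /* Returns synchronously with an error (e.g. port in use) if the local
   * endpoint is still held, for instance by a half-open that has not been
   * cleaned up yet; in that case we simply schedule a retry. */
  return vnet_connect (a);
}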


Florin Coras  于2023年3月20日周一 06:22写道:

> Hi,
>
> When you abort the connection, is it fully established or half-open?
> Half-opens are cleaned up by the owner thread after a timeout, but the
> 5-tuple should be assigned to the fully established session by that point.
> tcp_half_open_connection_cleanup does not clean up the bihash; instead,
> session_stream_connect_notify does, once tcp connect returns either success
> or failure.
>
> So a few questions:
> - is it accurate to assume you have a builtin vpp app and rely only on C
> apis to interact with host stack?
> - on what thread (main or first worker) do you call vnet_connect?
> - what api do you use to close the session?
> - what version of vpp is this because lines don’t match vpp latest?
>
> Regards,
> Florin
>
> > On Mar 19, 2023, at 2:08 AM, Zhang Dongya 
> wrote:
> >
> > Hi list,
> >
> > recently in our application, we constantly triggered such abrt issue
> which make our connectivity interrupt for a while:
> >
> > Mar 19 16:11:26 ubuntu vnet[2565933]: received signal SIGABRT, PC
> 0x7fefd3b2000b
> > Mar 19 16:11:26 ubuntu vnet[2565933]:
> /home/fortitude/glx/vpp/src/vnet/tcp/tcp_input.c:3004 (tcp46_input_inline)
> assertion `tcp_lookup_is_valid (tc0, b[0], tcp_buffer_hdr (b[0]))' fails
> >
> > Our scenario is quite simple, we will make 4 parallel tcp connection
> (use 4 fixed source ports) to a remote vpp stack (fixed ip and port), and
> will do some keepalive in our application layer, since we only use the vpp
> tcp stack to make the middle box happy with the connection, we do not use
> the data transport of tcp statck actually.
> >
> > However, since the network condition is complex, we have to  always need
> to abrt the connection and reconnect.
> >
> > I keep to merge upstream session and tcp fix however the issue still not
> fixed, what I found now it may be in some case
> tcp_half_open_connection_cleanup may not deleted the half open session from
> the lookup table (bihash) and the session index is realloced by other
> connection.
> >
> > Hope the list can provide some hint about how to overcome this issue,
> thanks a lot.
> >
> >
> >
>
>
> 
>
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#22727): https://lists.fd.io/g/vpp-dev/message/22727
Mute This Topic: https://lists.fd.io/mt/97707823/21656
Group Owner: vpp-dev+ow...@lists.fd.io
Unsubscribe: https://lists.fd.io/g/vpp-dev/leave/1480452/21656/631435203/xyzzy 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [vpp-dev] Sigabrt in tcp46_input_inline for tcp_lookup_is_valid

2023-03-20 Thread Zhang Dongya
Hi,

It seems the issue occurs when a disconnect is called, because our network
can't guarantee that a tcp connection won't be reset even after the 3-way
handshake has completed (a firewall issue :( ).

When we detect the app-layer timeout, we first disconnect (because we record
the session handle, this session might be a half-open session). Does the
vnet session layer guarantee that, if we reconnect from the master thread
while the half-open session has not been released yet (due to the asynchronous
logic), the reconnect fails? If so, we can retry the connect later.

I prefer not to register the half-open callback because I think it makes the
app complicated from a TCP programming perspective.

As for your patch, I think it should work: I can't delete the half-open
session immediately because a worker is configured, so the half-open will be
removed from the bihash when the syn retransmission times out. I have merged
the patch and will provide feedback later.
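
Concretely, our ctrl process treats a synchronous connect failure as "retry
later" rather than as fatal, roughly as in the sketch below (my_ctx_t,
my_connect_fixed and MY_EVT_RECONNECT are our own hypothetical names; only the
vlib process calls are standard):

typedef struct
{
  int need_reconnect;
  /* fixed local/remote endpoints, app_index, session handle, ... */
} my_ctx_t;

static my_ctx_t my_ctx;
/* wraps the vnet_connect() call, taking the fixed endpoints from ctx */
extern int my_connect_fixed (my_ctx_t *ctx);

#define MY_EVT_RECONNECT 1

static uword
my_ctrl_process (vlib_main_t *vm, vlib_node_runtime_t *rt, vlib_frame_t *f)
{
  my_ctx_t *ctx = &my_ctx;
  uword event_type, *event_data = 0;

  while (1)
    {
      /* Wake up on an app event (reset/disconnect/app-layer timeout) or
       * at the latest after one second. */
      vlib_process_wait_for_event_or_clock (vm, 1.0);
      event_type = vlib_process_get_events (vm, &event_data);
      vec_reset_length (event_data);

      if (ctx->need_reconnect || event_type == MY_EVT_RECONNECT)
        {
          if (my_connect_fixed (ctx) == 0)
            ctx->need_reconnect = 0;    /* outcome arrives via connected cb */
          else
            /* e.g. the fixed local port is still held by a half-open that
             * has not been cleaned up yet; keep the flag set and retry on
             * the next wakeup. */
            ctx->need_reconnect = 1;
        }
    }
  return 0;
}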

Florin Coras  于2023年3月20日周一 13:09写道:

> Hi,
>
> Inline.
>
> On Mar 19, 2023, at 6:47 PM, Zhang Dongya 
> wrote:
>
> Hi,
>
> It can be aborted both in established state or half open state because I
> will do timeout in our app layer.
>
>
> [fc] Okay! Is the issue present irrespective of the state of the session
> or does it happen only after a disconnect in half-open state? More below.
>
>
> Regarding your question,
>
> - Yes we add a builtin in app relys on C apis that  mainly use
> vnet_connect/disconnect to connect or disconnect session.
>
>
> [fc] Understood
>
> - We call these api in a vpp ctrl process which should be running on the
> master thread, we never do session setup/teardown on worker thread. (the
> environment that found this issue is configured with 1 master + 1 worker
> setup.)
>
>
> [fc] With vpp latest it’s possible to connect from first workers. It’s an
> optimization meant to avoid 1) worker barrier on syns and 2) entering poll
> mode on main (consume less cpu)
>
> - We started to develop the app using 22.06 and I keep to merge upstream
> changes to latest vpp by cherry-picking. The reason for line mismatch is
> that I added some comment to the session layer code, it should be equal to
> the master branch now.
>
>
> [fc] Ack
>
>
> When reading the code I understand that we mainly want to cleanup half
> open from bihash in session_stream_connect_notify, however, in syn-sent
> state if I choose to close the session, the session might be closed by my
> app due to session setup timeout (in second scale), in that case, session
> will be marked as half_open_done and half open session will be freed
> shortly in the ctrl thread (the 1st worker?).
>
>
> [fc] Actually, this might be the issue. We did start to provide a
> half-open session handle to apps which if closed does clean up the session
> but apparently it is missing the cleanup of the session lookup table. Could
> you try this patch [1]? It might need additional work.
>
> Having said that, forcing a close/cleanup will not free the port
> synchronously. So, if you’re using fixed ports, you’ll have to wait for the
> half-open cleanup notification.
>
>
> Should I also registered half open callback or there are some other reason
> that lead to this failure?
>
>
> [fc] Yes, see above.
>
> Regards,
> Florin
>
> [1] https://gerrit.fd.io/r/c/vpp/+/38526
>
>
> Florin Coras  于2023年3月20日周一 06:22写道:
>
>> Hi,
>>
>> When you abort the connection, is it fully established or half-open?
>> Half-opens are cleaned up by the owner thread after a timeout, but the
>> 5-tuple should be assigned to the fully established session by that point.
>> tcp_half_open_connection_cleanup does not cleanup the bihash instead
>> session_stream_connect_notify does once tcp connect returns either success
>> or failure.
>>
>> So a few questions:
>> - is it accurate to assume you have a builtin vpp app and rely only on C
>> apis to interact with host stack?
>> - on what thread (main or first worker) do you call vnet_connect?
>> - what api do you use to close the session?
>> - what version of vpp is this because lines don’t match vpp latest?
>>
>> Regards,
>> Florin
>>
>> > On Mar 19, 2023, at 2:08 AM, Zhang Dongya 
>> wrote:
>> >
>> > Hi list,
>> >
>> > recently in our application, we constantly triggered such abrt issue
>> which make our connectivity interrupt for a while:
>> >
>> > Mar 19 16:11:26 ubuntu vnet[2565933]: received signal SIGABRT, PC
>> 0x7fefd3b2000b
>> > Mar 19 16:11:26 ubuntu vnet[2565933]:
>> /home/fortitude/glx/vpp/src/vnet/tcp/tcp_input.c:3004 (tcp46_input_inline)
>> assertion `tcp_lookup_is_valid (tc0, b[0], tcp_buffer_hdr (b[0]))' fails

Re: [vpp-dev] Sigabrt in tcp46_input_inline for tcp_lookup_is_valid

2023-03-20 Thread Zhang Dongya
Hi,

After merging this patch and updating the test environment, the issue still
persists.

Let me clarify my client app config (a sketch of this wiring follows below):
1. register a reset callback, which calls vnet_disconnect and also
triggers a reconnect by sending an event to the ctrl process.
2. register a connected callback, which handles connect errors by triggering
a reconnect; on success, it records the session handle and extracts the tcp
sequence for our app usage.
3. register a disconnect callback, which basically does the same as the reset
callback.
4. register a cleanup callback and an accept callback, which basically keep
the session layer happy without any actual work to do.

There is a ctrl process on the master thread, which handles reconnects either
periodically or when triggered by an event.

BTW, I also frequently see the warning 'session %u hash delete rv -3' from
session_delete in my environment; hope this helps the investigation.
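
A rough sketch of that wiring is below. The session_cb_vft_t field names and
callback signatures are written from memory of recent vpp and should be checked
against src/vnet/session/session.h and application.h; the my_* functions are
our app's stubs and my_schedule_reconnect/my_app_index are hypothetical.

#include <vnet/session/session.h>
#include <vnet/session/application_interface.h>

extern void my_schedule_reconnect (u32 opaque); /* signals the ctrl process */
extern u32 my_app_index;

static int
my_connected_cb (u32 app_wrk_index, u32 opaque, session_t *s,
                 session_error_t err)
{
  if (err)
    {
      my_schedule_reconnect (opaque);   /* connect failed, retry later */
      return 0;
    }
  /* record the session handle, extract the tcp sequence for app usage */
  return 0;
}

static void
my_reset_cb (session_t *s)
{
  /* confirm the close to the stack, then schedule a reconnect */
  vnet_disconnect_args_t a = { .handle = session_handle (s),
                               .app_index = my_app_index };
  vnet_disconnect_session (&a);
  my_schedule_reconnect (s->opaque);
}

static void
my_disconnect_cb (session_t *s)
{
  my_reset_cb (s);              /* basically the same handling as reset */
}

static int
my_accept_cb (session_t *s)
{
  return 0;                     /* keep the session layer happy */
}

static void
my_cleanup_cb (session_t *s, session_cleanup_ntf_t ntf)
{
  /* nothing relevant to do */
}

static session_cb_vft_t my_session_cb_vft = {
  .session_connected_callback = my_connected_cb,
  .session_reset_callback = my_reset_cb,
  .session_disconnect_callback = my_disconnect_cb,
  .session_accept_callback = my_accept_cb,
  .session_cleanup_callback = my_cleanup_cb,
};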

Florin Coras  于2023年3月20日周一 23:29写道:

> Hi,
>
> Understood and yes, connect will synchronously fail if port is not
> available, so you should be able to retry it later.
>
> Regards,
> Florin
>
> On Mar 20, 2023, at 1:58 AM, Zhang Dongya 
> wrote:
>
> Hi,
>
> It seems the issue occurs when there are disconnect called because our
> network can't guarantee a tcp can't be reset even when 3 ways handshake is
> completed (firewall issue :( ).
>
> When we find the app layer timeout, we will first disconnect (because we
> record the session handle, this session might be a half open session), does
> vnet session layer guarantee that if we reconnect from master thread when
> the half open session still not be released yet (due to asynchronous logic)
> that the reconnect fail? if then we can retry connect later.
>
> I prefer to not registered half open callback because I think it make app
> complicated from a TCP programming prospective.
>
> For your patch, I think it should be work because I can't delete the half
> open session immediately because there is worker configured, so the half
> open will be removed from bihash when syn retrans timeout. I have merged
> the patch and will provide feedback later.
>
> Florin Coras  于2023年3月20日周一 13:09写道:
>
>> Hi,
>>
>> Inline.
>>
>> On Mar 19, 2023, at 6:47 PM, Zhang Dongya 
>> wrote:
>>
>> Hi,
>>
>> It can be aborted both in established state or half open state because I
>> will do timeout in our app layer.
>>
>>
>> [fc] Okay! Is the issue present irrespective of the state of the session
>> or does it happen only after a disconnect in half-open state? More below.
>>
>>
>> Regarding your question,
>>
>> - Yes we add a builtin in app relys on C apis that  mainly use
>> vnet_connect/disconnect to connect or disconnect session.
>>
>>
>> [fc] Understood
>>
>> - We call these api in a vpp ctrl process which should be running on the
>> master thread, we never do session setup/teardown on worker thread. (the
>> environment that found this issue is configured with 1 master + 1 worker
>> setup.)
>>
>>
>> [fc] With vpp latest it’s possible to connect from first workers. It’s an
>> optimization meant to avoid 1) worker barrier on syns and 2) entering poll
>> mode on main (consume less cpu)
>>
>> - We started to develop the app using 22.06 and I keep to merge upstream
>> changes to latest vpp by cherry-picking. The reason for line mismatch is
>> that I added some comment to the session layer code, it should be equal to
>> the master branch now.
>>
>>
>> [fc] Ack
>>
>>
>> When reading the code I understand that we mainly want to cleanup half
>> open from bihash in session_stream_connect_notify, however, in syn-sent
>> state if I choose to close the session, the session might be closed by my
>> app due to session setup timeout (in second scale), in that case, session
>> will be marked as half_open_done and half open session will be freed
>> shortly in the ctrl thread (the 1st worker?).
>>
>>
>> [fc] Actually, this might be the issue. We did start to provide a
>> half-open session handle to apps which if closed does clean up the session
>> but apparently it is missing the cleanup of the session lookup table. Could
>> you try this patch [1]? It might need additional work.
>>
>> Having said that, forcing a close/cleanup will not free the port
>> synchronously. So, if you’re using fixed ports, you’ll have to wait for the
>> half-open cleanup notification.
>>
>>
>> Should I also registered half open callback or there are some other
>> reason that lead to this failure?
>>
>>
>> [fc] Yes, see above.
>>
>> 

Re: [vpp-dev] Sigabrt in tcp46_input_inline for tcp_lookup_is_valid

2023-03-20 Thread Zhang Dongya
Hi,

After reviewing my code, I found that I had added a flag to the vnet_disconnect
API which calls session_reset instead of session_close; the reason I do
this is to make intermediate firewalls flush their state and rebuild it
if I reconnect later.

It seems that in the session_reset logic, for a half-open session, removing
the session from the lookup hash is also missing, which may cause the issue too.

I changed my code and will test it together with your patch; I will provide
feedback later.

I also noticed the bihash issue discussed on the list recently; I will
merge that later.
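
For clarity, the shape of that local change is roughly the sketch below
(illustration only: the abort_with_rst flag plumbing is our modification, not
upstream vpp; vnet_disconnect_session, session_reset and
session_get_from_handle are the vpp symbols as I recall them):

/* Sketch: let the app choose an abortive close so middleboxes drop their
 * state immediately, instead of a graceful FIN close. */
static void
my_app_teardown (session_handle_t sh, u32 app_index, int abort_with_rst)
{
  if (abort_with_rst)
    {
      /* Abortive close: sends an RST on an established session. As
       * discussed in this thread, doing this on a half-open is what got
       * us into trouble before the lookup-table cleanup fix. */
      session_t *s = session_get_from_handle (sh);
      session_reset (s);
    }
  else
    {
      /* Graceful close through the regular app API. */
      vnet_disconnect_args_t a = { .handle = sh, .app_index = app_index };
      vnet_disconnect_session (&a);
    }
}

With your updated patch, as I understand it, a forced reset on a half-open just
cleans it up (including the lookup table) without emitting an RST, which is
exactly what we need.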

Florin Coras  于2023年3月21日周二 11:56写道:

> Hi,
>
> That last thing is pretty interesting. It’s either the issue fixed by this
> patch [1] or sessions are somehow cleaned up multiple times. If it’s the
> latter, I’d really like to understand how that happens.
>
> Regards,
> Florin
>
> [1] https://gerrit.fd.io/r/c/vpp/+/38507
>
> On Mar 20, 2023, at 6:52 PM, Zhang Dongya 
> wrote:
>
> Hi,
>
> After merge this patch and update the test environment, the issue still
> persists.
>
> Let me clear my client app config:
> 1. register a reset callback, which will call vnet_disconnect there and
> also trigger reconnect by send event to the ctrl process.)
> 2. register a connected callback, which will handle connect err by trigger
> reconnect, on success, it will record session handle and extract tcp
> sequence for our app usage.
> 3. register a disconnect callback, which basically do same as reset
> callback.
> 4. register a cleanup callback and accept callback, which basically make
> the session layer happy without actually relevant work to do.
>
> There is a ctrl process in mater, which will handle periodically reconnect
> or triggered by event.
>
> BTW, I also see frequently warning 'session %u hash delete rv -3' in
> session_delete in my environment, hope this helps to investigate.
>
> Florin Coras  于2023年3月20日周一 23:29写道:
>
>> Hi,
>>
>> Understood and yes, connect will synchronously fail if port is not
>> available, so you should be able to retry it later.
>>
>> Regards,
>> Florin
>>
>> On Mar 20, 2023, at 1:58 AM, Zhang Dongya 
>> wrote:
>>
>> Hi,
>>
>> It seems the issue occurs when there are disconnect called because our
>> network can't guarantee a tcp can't be reset even when 3 ways handshake is
>> completed (firewall issue :( ).
>>
>> When we find the app layer timeout, we will first disconnect (because we
>> record the session handle, this session might be a half open session), does
>> vnet session layer guarantee that if we reconnect from master thread when
>> the half open session still not be released yet (due to asynchronous logic)
>> that the reconnect fail? if then we can retry connect later.
>>
>> I prefer to not registered half open callback because I think it make app
>> complicated from a TCP programming prospective.
>>
>> For your patch, I think it should be work because I can't delete the half
>> open session immediately because there is worker configured, so the half
>> open will be removed from bihash when syn retrans timeout. I have merged
>> the patch and will provide feedback later.
>>
>> Florin Coras  于2023年3月20日周一 13:09写道:
>>
>>> Hi,
>>>
>>> Inline.
>>>
>>> On Mar 19, 2023, at 6:47 PM, Zhang Dongya 
>>> wrote:
>>>
>>> Hi,
>>>
>>> It can be aborted both in established state or half open state because I
>>> will do timeout in our app layer.
>>>
>>>
>>> [fc] Okay! Is the issue present irrespective of the state of the session
>>> or does it happen only after a disconnect in half-open state? More below.
>>>
>>>
>>> Regarding your question,
>>>
>>> - Yes we add a builtin in app relys on C apis that  mainly use
>>> vnet_connect/disconnect to connect or disconnect session.
>>>
>>>
>>> [fc] Understood
>>>
>>> - We call these api in a vpp ctrl process which should be running on the
>>> master thread, we never do session setup/teardown on worker thread. (the
>>> environment that found this issue is configured with 1 master + 1 worker
>>> setup.)
>>>
>>>
>>> [fc] With vpp latest it’s possible to connect from first workers. It’s
>>> an optimization meant to avoid 1) worker barrier on syns and 2) entering
>>> poll mode on main (consume less cpu)
>>>
>>> - We started to develop the app using 22.06 and I keep to merge upstream
>>> changes to latest vpp by cherry-picking. The reason fo

Re: [vpp-dev] Sigabrt in tcp46_input_inline for tcp_lookup_is_valid

2023-03-21 Thread Zhang Dongya
Hi Florin,

Thanks a lot; with the previous patch and with reset disabled, it has been
running for 1 day without issue.

I will enable reset together with your new patch and will provide feedback
later.

Florin Coras  于2023年3月22日周三 02:12写道:

> Hi,
>
> Okay, resetting of half-opens definitely not supported. I updated the
> patch to just clean them up on forced reset, without sending a reset to
> make sure session lookup table cleanup still happens.
>
> Regards,
> Florin
>
> On Mar 20, 2023, at 9:13 PM, Zhang Dongya 
> wrote:
>
> Hi,
>
> After review my code, I found that I have add a flag to the
> vnet_disconnect API which will call session_reset instead of session_close,
> the reason I do this is to make intermediate firewall just flush the state
> and reconstruct if I later reconnect.
>
> It seems in session_reset logic, for half open session, it also missing to
> remove the session from the lookup hash which may cause the issue too.
>
> I change my code and will test with your patch along, will provide
> feedback later.
>
> I also noticed the bihash issue discussed in the list recently, I will
> merge later.
>
> Florin Coras  于2023年3月21日周二 11:56写道:
>
>> Hi,
>>
>> That last thing is pretty interesting. It’s either the issue fixed by
>> this patch [1] or sessions are somehow cleaned up multiple times. If it’s
>> the latter, I’d really like to understand how that happens.
>>
>> Regards,
>> Florin
>>
>> [1] https://gerrit.fd.io/r/c/vpp/+/38507
>>
>> On Mar 20, 2023, at 6:52 PM, Zhang Dongya 
>> wrote:
>>
>> Hi,
>>
>> After merge this patch and update the test environment, the issue still
>> persists.
>>
>> Let me clear my client app config:
>> 1. register a reset callback, which will call vnet_disconnect there and
>> also trigger reconnect by send event to the ctrl process.)
>> 2. register a connected callback, which will handle connect err by
>> trigger reconnect, on success, it will record session handle and extract
>> tcp sequence for our app usage.
>> 3. register a disconnect callback, which basically do same as reset
>> callback.
>> 4. register a cleanup callback and accept callback, which basically make
>> the session layer happy without actually relevant work to do.
>>
>> There is a ctrl process in mater, which will handle periodically
>> reconnect or triggered by event.
>>
>> BTW, I also see frequently warning 'session %u hash delete rv -3' in
>> session_delete in my environment, hope this helps to investigate.
>>
>> Florin Coras  于2023年3月20日周一 23:29写道:
>>
>>> Hi,
>>>
>>> Understood and yes, connect will synchronously fail if port is not
>>> available, so you should be able to retry it later.
>>>
>>> Regards,
>>> Florin
>>>
>>> On Mar 20, 2023, at 1:58 AM, Zhang Dongya 
>>> wrote:
>>>
>>> Hi,
>>>
>>> It seems the issue occurs when there are disconnect called because our
>>> network can't guarantee a tcp can't be reset even when 3 ways handshake is
>>> completed (firewall issue :( ).
>>>
>>> When we find the app layer timeout, we will first disconnect (because we
>>> record the session handle, this session might be a half open session), does
>>> vnet session layer guarantee that if we reconnect from master thread when
>>> the half open session still not be released yet (due to asynchronous logic)
>>> that the reconnect fail? if then we can retry connect later.
>>>
>>> I prefer to not registered half open callback because I think it make
>>> app complicated from a TCP programming prospective.
>>>
>>> For your patch, I think it should be work because I can't delete the
>>> half open session immediately because there is worker configured, so the
>>> half open will be removed from bihash when syn retrans timeout. I have
>>> merged the patch and will provide feedback later.
>>>
>>> Florin Coras  于2023年3月20日周一 13:09写道:
>>>
>>>> Hi,
>>>>
>>>> Inline.
>>>>
>>>> On Mar 19, 2023, at 6:47 PM, Zhang Dongya 
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> It can be aborted both in established state or half open state because
>>>> I will do timeout in our app layer.
>>>>
>>>>
>>>> [fc] Okay! Is the issue present irrespective of the state of the
>>>> session or does it happen only after a disconnect in half-open state? More
>>>> below.

Re: [vpp-dev] Sigabrt in tcp46_input_inline for tcp_lookup_is_valid

2023-03-23 Thread Zhang Dongya
Hi,

The new patch works as expected; no assert-triggered aborts anymore.

Really appreciate your help and thanks a lot.

Florin Coras  于2023年3月22日周三 11:54写道:

> Hi Zhang,
>
> Awesome! Thanks!
>
> Regards,
> Florin
>
> On Mar 21, 2023, at 7:41 PM, Zhang Dongya 
> wrote:
>
> Hi Florin,
>
> Thanks a lot, the previous patch and with reset disabled have been running
> 1 day without issue.
>
> I will enable reset and with your new patch, will provide feedback later.
>
> Florin Coras  于2023年3月22日周三 02:12写道:
>
>> Hi,
>>
>> Okay, resetting of half-opens definitely not supported. I updated the
>> patch to just clean them up on forced reset, without sending a reset to
>> make sure session lookup table cleanup still happens.
>>
>> Regards,
>> Florin
>>
>> On Mar 20, 2023, at 9:13 PM, Zhang Dongya 
>> wrote:
>>
>> Hi,
>>
>> After review my code, I found that I have add a flag to the
>> vnet_disconnect API which will call session_reset instead of session_close,
>> the reason I do this is to make intermediate firewall just flush the state
>> and reconstruct if I later reconnect.
>>
>> It seems in session_reset logic, for half open session, it also missing
>> to remove the session from the lookup hash which may cause the issue too.
>>
>> I change my code and will test with your patch along, will provide
>> feedback later.
>>
>> I also noticed the bihash issue discussed in the list recently, I will
>> merge later.
>>
>> Florin Coras  于2023年3月21日周二 11:56写道:
>>
>>> Hi,
>>>
>>> That last thing is pretty interesting. It’s either the issue fixed by
>>> this patch [1] or sessions are somehow cleaned up multiple times. If it’s
>>> the latter, I’d really like to understand how that happens.
>>>
>>> Regards,
>>> Florin
>>>
>>> [1] https://gerrit.fd.io/r/c/vpp/+/38507
>>>
>>> On Mar 20, 2023, at 6:52 PM, Zhang Dongya 
>>> wrote:
>>>
>>> Hi,
>>>
>>> After merge this patch and update the test environment, the issue still
>>> persists.
>>>
>>> Let me clear my client app config:
>>> 1. register a reset callback, which will call vnet_disconnect there and
>>> also trigger reconnect by send event to the ctrl process.)
>>> 2. register a connected callback, which will handle connect err by
>>> trigger reconnect, on success, it will record session handle and extract
>>> tcp sequence for our app usage.
>>> 3. register a disconnect callback, which basically do same as reset
>>> callback.
>>> 4. register a cleanup callback and accept callback, which basically make
>>> the session layer happy without actually relevant work to do.
>>>
>>> There is a ctrl process in mater, which will handle periodically
>>> reconnect or triggered by event.
>>>
>>> BTW, I also see frequently warning 'session %u hash delete rv -3' in
>>> session_delete in my environment, hope this helps to investigate.
>>>
>>> Florin Coras  于2023年3月20日周一 23:29写道:
>>>
>>>> Hi,
>>>>
>>>> Understood and yes, connect will synchronously fail if port is not
>>>> available, so you should be able to retry it later.
>>>>
>>>> Regards,
>>>> Florin
>>>>
>>>> On Mar 20, 2023, at 1:58 AM, Zhang Dongya 
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> It seems the issue occurs when there are disconnect called because our
>>>> network can't guarantee a tcp can't be reset even when 3 ways handshake is
>>>> completed (firewall issue :( ).
>>>>
>>>> When we find the app layer timeout, we will first disconnect (because
>>>> we record the session handle, this session might be a half open session),
>>>> does vnet session layer guarantee that if we reconnect from master thread
>>>> when the half open session still not be released yet (due to asynchronous
>>>> logic) that the reconnect fail? if then we can retry connect later.
>>>>
>>>> I prefer to not registered half open callback because I think it make
>>>> app complicated from a TCP programming prospective.
>>>>
>>>> For your patch, I think it should be work because I can't delete the
>>>> half open session immediately because there is worker configured, so the
>>>> half open will be removed from bihash when syn retrans timeout.