Hi Xue,

As I said, there are a few more things you have to do.


  1.  It is better to use the socket-based API. Otherwise, you may suffer the 
10ms sleep time of linux_epoll_input. Alternatively, hard-code the sleep time 
to 100~200us at the vlib_process_wait_for_event_or_clock call in 
vl_api_clnt_process:430 (see the first sketch after this list). This is hard 
to explain in a few words, but it is the behavior of the VPP process 
scheduler. I would not say it is a perfect design, but that is how it works.
  2.  Comment out the calls to fib_table_entry_special_add and 
fib_entry_child_add in vnet_gtpu_add_del_tunnel in gtpu.c (see the second 
sketch after this list, and the #if 0 snippet in my earlier mail quoted 
below). The former creates the fib entry for the tunnel and can therefore 
build a long linked list on the covering fib entry (the default route in 
most cases) when you have many tunnel endpoints, while the latter creates a 
child-node linked list on each fib entry created by 
fib_table_entry_special_add when you have just a few tunnel endpoints. We 
discussed why in a previous email. I would guess this fib entry is not so 
important for a gtp tunnel, because the gtp traffic will eventually hit a 
valid fib entry and be sent out.
  3.  You should really change the mheap as I mentioned (the code is in my 
earlier mail, quoted below). Otherwise the slow memory allocation will kill 
you in the end.
  4.  We made some other improvements to avoid frequent memory 
resizes/allocations. We managed to start with a very large (2M in our case) 
sw_if_index, so that the many vectors indexed by sw_if_index are validated 
to an appropriate size up front (see the last sketch after this list). 
However, I would not recommend you do the same, because you only need 100K.
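
For item 1, here is a minimal sketch of the sleep-time change, assuming the 
adaptive sleep_time logic in vl_api_clnt_process looks roughly like this in 
your tree (the exact line number and surrounding code may differ):

  /* In the vl_api_clnt_process main loop: instead of letting the
   * adaptive sleep_time grow toward the 10ms epoll timeout, clamp it
   * so the main thread polls for API events every ~100us. */
  sleep_time = 100e-6;          /* 100us; 100~200us works for us */
  vlib_process_wait_for_event_or_clock (vm, sleep_time);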
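
For item 2, a sketch of what to disable in vnet_gtpu_add_del_tunnel; the 
exact arguments are from memory and may not match your gtpu.c, but the idea 
is simply to #if 0 both calls:

  #if 0
  /* Sources a fib entry for the tunnel destination; with many tunnel
   * endpoints this grows a long child list on the covering entry. */
  t->fib_entry_index = fib_table_entry_special_add
    (t->encap_fib_index, &tun_dst_pfx, FIB_SOURCE_RR, FIB_ENTRY_FLAG_NONE);
  /* Links the tunnel as a child of that fib entry; restack walks then
   * have to touch every inserted child. */
  t->sibling_index = fib_entry_child_add
    (t->fib_entry_index, gtm->fib_node_type, t - gtm->tunnels);
  #endif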
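
For item 4, the pre-validation is just the standard vec_validate pattern run 
once at startup. A sketch, where my_per_if_data stands in for whichever 
per-sw_if_index vectors matter in your build:

  #include <vppinfra/vec.h>

  /* Validate vectors indexed by sw_if_index to their final size once,
   * so adding tunnels one by one never triggers a resize + copy. */
  u32 *my_per_if_data = 0;
  u32 max_sw_if_index = 100 * 1024;   /* ~100K is enough in your case */
  vec_validate_init_empty (my_per_if_data, max_sw_if_index, 0);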

Hope it helps, instead of confusing you.

Regards,
Kingwel


From: 薛欣颖 [mailto:xy...@fiberhome.com]
Sent: Thursday, April 19, 2018 7:43 PM
To: Kingwel Xie <kingwel....@ericsson.com>; nranns <nra...@cisco.com>
Cc: vpp-dev <vpp-dev@lists.fd.io>
Subject: Re: Re: [vpp-dev] questions in configuring tunnel

Hi Kingwel,

Thank you very much for sharing your solution to the 'cache line alignment' 
problem. I saw that you configured 2M gtpu tunnels in 200s. When I merged 
patch 10216, configuring 100K gtpu tunnels took 7 minutes. How did you 
achieve such a fast configuration rate?


Thanks,
Xyxue
________________________________

From: Kingwel Xie <kingwel....@ericsson.com>
Date: 2018-04-19 17:11
To: 薛欣颖 <xy...@fiberhome.com>; nranns <nra...@cisco.com>
CC: vpp-dev <vpp-dev@lists.fd.io>
Subject: Re: [vpp-dev] questions in configuring tunnel
Hi Xue,

I’m afraid it will take a few days to commit the code.

For now, I have copied the key changes below for your reference. They should 
work.

Regards,
Kingwel


/* Search free lists for object with given size and alignment. */
static uword
mheap_get_search_free_list (void *v,
                            uword * n_user_bytes_arg,
                            uword align, uword align_offset)
{
  mheap_t *h = mheap_header (v);
  uword bin, n_user_bytes, i, bi;

  n_user_bytes = *n_user_bytes_arg;
  bin = user_data_size_to_bin_index (n_user_bytes);

  if (MHEAP_HAVE_SMALL_OBJECT_CACHE
      && (h->flags & MHEAP_FLAG_SMALL_OBJECT_CACHE)
      && bin < 255
      && align == STRUCT_SIZE_OF (mheap_elt_t, user_data[0])
      && align_offset == 0)
    {
      uword r = mheap_get_small_object (h, bin);
      h->stats.n_small_object_cache_attempts += 1;
      if (r != MHEAP_GROUNDED)
        {
          h->stats.n_small_object_cache_hits += 1;
          return r;
        }
    }

  /* kingwel: look up a free bin big enough to hold everything:
     align + align_offset + lo_free_size + overhead. Starting the
     search from a bin sized for the worst case skips the smaller bins
     that cannot satisfy an aligned request anyway, which is what made
     aligned (rewrite) allocations so slow before. */
  word modifier = (align > MHEAP_USER_DATA_WORD_BYTES ?
                   align + align_offset + sizeof (mheap_elt_t) : 0);
  bin = user_data_size_to_bin_index (n_user_bytes + modifier);

  for (i = bin / BITS (uword); i < ARRAY_LEN (h->non_empty_free_elt_heads);
       i++)
    {
      uword non_empty_bin_mask = h->non_empty_free_elt_heads[i];

      /* No need to search smaller bins. */
      if (i == bin / BITS (uword))
        non_empty_bin_mask &= ~pow2_mask (bin % BITS (uword));

      /* Search each occupied free bin which is large enough. */
      /* *INDENT-OFF* */
      foreach_set_bit (bi, non_empty_bin_mask,
      ({
        uword r =
          mheap_get_search_free_bin (v, bi + i * BITS (uword),
                                     n_user_bytes_arg,
                                     align,
                                     align_offset);
        if (r != MHEAP_GROUNDED) return r;
      }));
      /* *INDENT-ON* */
    }

  return MHEAP_GROUNDED;
}



From: vpp-dev@lists.fd.io [mailto:vpp-dev@lists.fd.io] On Behalf Of xyxue
Sent: Thursday, April 19, 2018 4:02 PM
To: Kingwel Xie <kingwel....@ericsson.com>; nranns <nra...@cisco.com>
Cc: vpp-dev <vpp-dev@lists.fd.io>
Subject: Re: [vpp-dev] questions in configuring tunnel
Subject: Re: [vpp-dev] questions in configuring tunnel

Hi,

Thank you all for your help. I've learned a lot from your discussion. We 
have some questions to ask for advice:

About the solution to the third item (the mheap change): can you commit that 
part, or tell us how to handle it ourselves?



Patch 10216 is the solution for gtpu, geneve, vxlan, and vxlan-gre. When we 
create a gtpu tunnel, VPP adds a 'virtual node', but when we create mpls and 
gre tunnels, VPP adds a 'true node' for the encapsulation. We can avoid the 
gtpu 'virtual node', but not the 'true node'. Is there any solution for mpls 
and gre?

Thanks,
Xyxue
________________________________

From: Kingwel Xie <kingwel....@ericsson.com>
Date: 2018-04-19 13:44
To: Neale Ranns (nranns) <nra...@cisco.com>; 薛欣颖 <xy...@fiberhome.com>
CC: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] questions in configuring tunnel
Thanks for the comments. Please see my replies inline.


From: Neale Ranns (nranns) [mailto:nra...@cisco.com]
Sent: Wednesday, April 18, 2018 9:18 PM
To: Kingwel Xie <kingwel....@ericsson.com>; xyxue <xy...@fiberhome.com>
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] questions in configuring tunnel

Hi Kingwel,

Thank you for your analysis. Some comments inline (on subjects I know a bit 
about ☺ )

Regards,
neale

From: Kingwel Xie <kingwel....@ericsson.com>
Date: Wednesday, 18 April 2018 at 13:49
To: "Neale Ranns (nranns)" <nra...@cisco.com>, xyxue <xy...@fiberhome.com>
Cc: "vpp-dev@lists.fd.io" <vpp-dev@lists.fd.io>
Subject: RE: [vpp-dev] questions in configuring tunnel

Hi,

As we understand it, this patch bypasses node replication, so that adding a 
tunnel does not make the main thread wait for the workers to synchronize the 
node graph.

However, in addition to that, you have to do several more things to be able 
to add 40K or more tunnels in a predictable time. Here is what we did to add 
2M gtp tunnels, for your reference. MPLS tunnels should be pretty much the 
same.


  1.  Don't call fib_entry_child_add after adding the fib entry for the 
tunnel (fib_table_entry_special_add). That call inserts the tunnel into a 
linked list of all the child nodes belonging to the fib entry for the tunnel 
endpoint; as a result, adding tunnels becomes slower and slower. BTW, it is 
not a good fix, but it works:

                  #if 0
                  t->sibling_index = fib_entry_child_add
                    (t->fib_entry_index, gtm->fib_node_type, t - gtm->tunnels);
                  #endif

[nr] if you skip this then the tunnels are not part of the FIB graph and hence 
any updates in the forwarding to the tunnel’s destination will go unnoticed and 
hence you potentially black hole the tunnel traffic indefinitely (since the 
tunnel is not re-stacked). It is a linked list, but apart from the pool 
allocation of the list element, the list element insertion is O(1), no?
[kingwel] You are right that the update will not be noticed, but we think 
that is acceptable for a p2p tunnel interface. Inserting the list element 
itself is fine; it is the subsequent restack operation, which walks through 
all inserted elements, that hurts. That is the point I was making.


  2.  The bihash for adj_nbr. Each tunnel interface creates one bihash, 
which by default is 32MB, mmap'd and memset. Typically you don't need that 
many adjacencies for a p2p tunnel interface, so we changed the code to use a 
common heap for all p2p interfaces.

[nr] if you would push these changes upstream, I would be grateful.
[kingwel] The fix is quite ugly. Let’s see what we can do to make it better.


  3.  As mentioned in my earlier email, rewrites require cache-line 
alignment, which mheap cannot handle very well. Mheap can become super slow 
when you add too many tunnels.
  4.  In vl_api_clnt_process, make sleep_time always 100us. This avoids the 
main thread yielding to linux_epoll_input_inline's 10ms wait. It is not a 
perfect fix either, but without it each API call will probably have to wait 
up to 10ms before the main thread gets a chance to poll API events.
  5.  Be careful with the counters; they can eat up your memory very 
quickly. Each counter is expanded to (number of threads) x (number of 
tunnels). In other words, with 8 workers, 1M tunnels means 1M x 8 x 8B = 
64MB per simple counter. A combined counter takes double that, because it is 
16 bytes. Each interface has 9 simple and 2 combined counters. Besides, 
load_balance_t and adjacency_t also carry counters, and you will have at 
least that many objects if you have that many interfaces. The solution is 
simple: make a dedicated heap for all counters (a sketch follows the 
comments below).

[nr] this would also be a useful addition to the upstream
[kingwel] will do later.
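
A minimal sketch of the dedicated counter heap, assuming a private mheap and 
a wrapper around vlib_validate_simple_counter; counter_heap, its 512MB size, 
and validate_counter_on_heap are our own naming, not upstream API:

  #include <vppinfra/mheap.h>   /* mheap_alloc */
  #include <vlib/vlib.h>        /* vlib_validate_simple_counter */

  /* A private heap keeps the huge per-thread counter vectors from
   * fragmenting and repeatedly growing the main heap. */
  static void *counter_heap;

  static void
  counter_heap_init (void)
  {
    counter_heap = mheap_alloc (0 /* use VM */ , 512 << 20);
  }

  static void
  validate_counter_on_heap (vlib_simple_counter_main_t * cm, u32 index)
  {
    void *old = clib_mem_set_heap (counter_heap);
    vlib_validate_simple_counter (cm, index);
    clib_mem_set_heap (old);
  }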


  6.  We also made some other fixes to speed up memory allocation, e.g., 
pre-allocating a big enough pool for gtpu_tunnel_t (a sketch follows the 
comments below).

[nr] I understand why you would do this, and knobs in startup.conf to enable 
it might be a good approach, but for general consumption, IMHO, it's too 
specific – others may disagree.
[kingwel] agree☺
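
A sketch of the pool pre-allocation, using the standard pool_alloc macro; 
gtm here is the gtpu plugin main struct, and 2M matches our target tunnel 
count:

  /* Reserve pool space for all tunnels up front, so the per-tunnel
   * pool_get never triggers a pool resize + copy. */
  pool_alloc (gtm->tunnels, 2 << 20);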

To be honest, it is not easy; it took us quite some time to figure all of 
this out. In the end, we managed to add 2M tunnels and 2M routes in 250s.

Hope it helps.

Regards,
Kingwel


From: vpp-dev@lists.fd.io [mailto:vpp-dev@lists.fd.io] On Behalf Of Neale 
Ranns
Sent: Wednesday, April 18, 2018 4:33 PM
To: xyxue <xy...@fiberhome.com>; Kingwel Xie <kingwel....@ericsson.com>
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] questions in configuring tunnel

Hi Xyxue,

Try applying the changes in this patch:
   https://gerrit.fd.io/r/#/c/10216/
to MPLS tunnels. Please contribute any changes back to the community so we can 
all benefit.

Regards,
Neale


From: <vpp-dev@lists.fd.io> on behalf of xyxue <xy...@fiberhome.com>
Date: Wednesday, 18 April 2018 at 09:48
To: Xie <kingwel....@ericsson.com>
Cc: "vpp-dev@lists.fd.io" <vpp-dev@lists.fd.io>
Subject: [vpp-dev] questions in configuring tunnel


Hi,

We are testing MPLS tunnels. The following problems appear in our 
configuration:
1. Configuring one tunnel adds two nodes (this leads to very high memory 
consumption).
2. The more nodes there are, the longer vlib_node_runtime_update and the 
node info traversal take.

When we configured 40 thousand MPLS tunnels, the configuration took 10+ 
minutes and we ran out of memory.
How did you configure 2M gtpu tunnels? Can you share your configuration 
speed and memory usage?

Thanks,
Xyxue