On Sun, 2019-02-03 at 21:13 +0100, Damjan Marion wrote:


On 3 Feb 2019, at 20:13, Saxena, Nitin <nitin.sax...@cavium.com> wrote:

Hi Damjan,

See the function octeontx_fpa_bufpool_alloc(), called by octeontx_fpa_dequeue(). It's 
a single read instruction to get the data pointer.
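Roughly, the shape of such an alloc/free path is the following (a sketch only; the register mapping and names are illustrative, not the actual FPA driver code):

#include <stdint.h>

/* Sketch only -- not the actual OCTEONTx FPA driver code. The pool exposes
   memory-mapped alloc/free addresses: a 64-bit load from the alloc address
   returns a buffer pointer (or 0 if the pool is empty), and a 64-bit store
   to the free address hands the buffer back to the hardware free list. */

static inline void *
hw_pool_alloc (volatile uint64_t *alloc_addr)
{
  return (void *) (uintptr_t) *alloc_addr;   /* single load */
}

static inline void
hw_pool_free (volatile uint64_t *free_addr, void *buf)
{
  *free_addr = (uint64_t) (uintptr_t) buf;   /* single store */
}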

Yeah, I saw that, and today's VPP buffer manager can grab up to 16 buffer indices 
with one instruction, so no big deal here....

Similarly, octeontx_fpa_bufpool_free() is a single write instruction.

So, if you are able to prove with numbers that the current software solution 
performs poorly, and you are confident that you can do significantly better, 
I will be happy to work with you on implementing support for a hardware buffer 
manager.
First of all, I welcome your patch, as we were also trying to remove the latency 
introduced by the memcpy_x4() of the buffer template. As I said earlier, the hardware 
buffer coprocessor is used by other packet engines, hence its support has to be 
added in VPP. I am looking for suggestions on how to resolve this.

You can hardly get any suggestions from my side if you are ignoring my 
questions, which I asked in my previous email to get a better understanding of 
what your hardware does.

"It is hardware so it is fast" is not real argument, we need real datapoints 
before investing time into this area....


Adding more details on the HW mempool manager attributes:

1) Semantically, the HW mempool manager is the same as a SW mempool manager.
2) The HW mempool manager has "alloc/dequeue" and "free/enqueue" operations, just 
like a SW mempool manager.
3) HW mempool managers can work with a SW per-core local cache scheme too.
4) User metadata initialization is not done in HW; SW needs to do it before free() 
or after alloc().
5) Typically it has a "don't free" operation for the packet after Tx, which can 
be used as the back end for packet cloning (aka reference-count schemes).
6) How the HW pool manager improves performance:
- MP/MC access works without locks (HW takes care of it internally).
- HW frees the buffer on Tx, unlike the SW mempool case where the core does it. This 
saves CPU cycles on packet Tx and the cost of bringing the packet back
into the L1 cache.
- On the Rx side, HW allocs/dequeues the packet from the mempool; no SW intervention 
is required.

In terms of abstraction, the DPDK mempool manager already abstracts SW and HW mempools 
through a static struct rte_mempool_ops.
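As a rough sketch of how a HW-backed pool plugs into that abstraction (the handler names and bodies below are illustrative placeholders, not the actual octeontx/dpaa driver code):

#include <rte_mempool.h>

/* Illustrative HW-backed mempool handler; bodies are placeholders. */

static int
hw_pool_alloc (struct rte_mempool *mp)
{
  /* map/configure the HW pool and store its handle in mp->pool_data */
  mp->pool_data = NULL;
  return 0;
}

static void
hw_pool_free (struct rte_mempool *mp)
{
  /* release the HW pool */
}

static int
hw_pool_enqueue (struct rte_mempool *mp, void *const *obj_table, unsigned int n)
{
  /* one store per object to the pool's free address;
     HW maintains the free list */
  return 0;
}

static int
hw_pool_dequeue (struct rte_mempool *mp, void **obj_table, unsigned int n)
{
  /* one load per object from the pool's alloc address */
  return 0;
}

static unsigned int
hw_pool_get_count (const struct rte_mempool *mp)
{
  /* read the HW free-buffer counter */
  return 0;
}

static const struct rte_mempool_ops hw_pool_ops = {
  .name = "hw_pool",
  .alloc = hw_pool_alloc,
  .free = hw_pool_free,
  .enqueue = hw_pool_enqueue,
  .dequeue = hw_pool_dequeue,
  .get_count = hw_pool_get_count,
};

MEMPOOL_REGISTER_OPS (hw_pool_ops);

An application (or a driver layer such as VPP's dpdk plugin) would then select the handler with rte_mempool_set_ops_byname(mp, "hw_pool", NULL) before populating the pool.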

Limitations:
1) Some NPU packet-processing HW can work only with the HW mempool manager (i.e. it 
cannot work with a SW mempool manager, because on Rx the HW itself allocates from the 
mempool manager and then forms the packet).

Using the DPDK abstractions will enable us to write agnostic software which works on 
both NPU and CPU models.

/Jerin




Thanks,
Nitin

On 03-Feb-2019, at 11:39 PM, Damjan Marion via Lists.Fd.Io 
<dmarion=me....@lists.fd.io> wrote:




On 3 Feb 2019, at 18:38, Nitin Saxena <nitin.sax...@cavium.com> wrote:

Hi Damjan,

Which exact operation do they accelerate?
There are many… the basic features are:
- They provide fast buffer free and alloc: a single instruction is required for 
each operation.

I quickly looked into DPDK's octeontx_fpavf_dequeue() and it looks to me like much 
more than one instruction.

In the case of DPDK, how does that work with the DPDK mempool cache, or are you 
disabling the mempool cache completely?

Does the single-instruction alloc/free include:
 - reference_count check and decrement?
 - user metadata initialization?

- The free list is maintained by hardware, not software.

It sounds to me like it is slower to program the hardware than to simply add a few 
buffer indices to the end of a vector, but I may be wrong...
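For clarity, the software path I mean is roughly this (a sketch only, not VPP's actual buffer API): freeing is just appending 32-bit buffer indices to a per-thread array, a handful of plain stores with no device access.

#include <stdint.h>

/* Illustrative only -- not VPP's actual buffer API.
   Bounds/overflow handling is omitted for brevity. */
typedef struct
{
  uint32_t *indices;   /* per-thread cache of free buffer indices */
  uint32_t n_cached;   /* number of indices currently cached */
} sw_free_list_t;

static inline void
sw_free_buffers (sw_free_list_t *fl, const uint32_t *buffers, uint32_t n)
{
  for (uint32_t i = 0; i < n; i++)
    fl->indices[fl->n_cached++] = buffers[i];
}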


Further, other co-processors depend on buffers being managed by hardware 
instead of software, so it is a must to add support for the hardware mempool in VPP. 
A software mempool will not work with the other packet engines.

But that can also be handled internally by the device driver...

So, if you are able to prove with numbers that the current software solution 
performs poorly, and you are confident that you can do significantly better, 
I will be happy to work with you on implementing support for a hardware buffer 
manager.


Thanks,
Nitin

On 03-Feb-2019, at 10:34 PM, Damjan Marion via Lists.Fd.Io 
<dmarion=me....@lists.fd.io> wrote:




On 3 Feb 2019, at 16:58, Nitin Saxena <nsax...@marvell.com> wrote:

Hi Damjan,

I have a few queries regarding this patch.

 - DPDK mempools are not used anymore; we register custom mempool ops, and DPDK 
takes buffers from VPP
Some of the targets use a hardware memory allocator, like the OCTEONTx family and 
NXP's DPAA. Those hardware allocators are exposed as DPDK mempools.

Which exact operation do they accelerate?

Now, with this change, I can see that rte_mempool_populate_iova() is no longer 
called.

Yes, but the new code does pretty much the same thing: it populates both elt_list 
and mem_list. The new code also puts the IOVA into the mempool_objhdr.

So what is your suggestion for supporting such hardware?

Before I can provide any suggestions, I need to better understand what those 
hardware buffer managers do
and why they are better than the pure software solution we have today.



 - the first 64 bytes of metadata are initialised on free, so buffer alloc is very 
fast
Is it fair to say that if a mempool is created per worker core per sw_index 
(interface), then the buffer template copy can be avoided even during free (it 
could be done only once, at init time)?

The really expensive part of the buffer free operation is bringing the cacheline into 
L1, and we need to do that to verify the reference count of the packet.
Once the data is in L1, simply copying the template does not cost much: 
1-2 clocks on x86; I'm not sure about ARM, but I still expect it to result in 
4 128-bit stores.
That was the rationale for resetting the metadata during buffer free.
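For illustration, the template reset amounts to roughly this (a sketch only; the type and names are placeholders, not VPP's actual definitions):

#include <stdint.h>
#include <string.h>

/* Illustrative only: resetting the first 64 bytes of buffer metadata from a
   pre-built template. Once the cacheline is already in L1 (it had to be
   fetched anyway to check the reference count), the compiler turns this
   into a handful of vector stores. */
typedef struct
{
  uint8_t metadata[64] __attribute__ ((aligned (64)));
} buf_hdr_t;

static inline void
reset_metadata (buf_hdr_t *b, const buf_hdr_t *tmpl)
{
  memcpy (b->metadata, tmpl->metadata, 64);
}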

So to answer your question: having a buffer pool per sw-interface will likely improve 
performance a bit, but it will also cause sub-optimal use of buffer memory.
Such a solution will also have problems scaling, for example if you have 
hundreds of virtual interfaces...



Thanks,
Nitin

________________________________
From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> on behalf of Damjan Marion 
via Lists.Fd.Io <dmarion=me....@lists.fd.io>
Sent: Friday, January 25, 2019 10:38 PM
To: vpp-dev
Cc: vpp-dev@lists.fd.io
Subject: [vpp-dev] RFC: buffer manager rework


I am very close to the finish line with the buffer management rework patch, and 
would like to
ask people to take a look before it is merged.

https://gerrit.fd.io/r/16638

It significantly improves the performance of buffer alloc/free and introduces NUMA 
awareness.
On my Skylake Platinum 8180 system, with the native AVF driver, the observed 
performance improvement is:

- single core, 2 threads, IPv4 base forwarding test, CPU running at 2.5 GHz (turbo 
boost off):

old code - DPDK buffer manager: 20.4 Mpps
old code - old native buffer manager: 19.4 Mpps
new code: 24.9 Mpps

With DPDK drivers, performance stays the same, as DPDK maintains its own internal 
buffer cache.
So the major perf gain should be observed in native code like vhost-user, memif, 
AVF, and the host stack.

User-facing changes:
To change the number of buffers:
  old startup.conf:
    dpdk { num-mbufs XXXX }
  new startup.conf:
    buffers { buffers-per-numa XXXX }

Internal changes:
 - free lists are deprecated
 - buffer metadata is always initialised
 - the first 64 bytes of metadata are initialised on free, so buffer alloc is very 
fast
 - DPDK mempools are not used anymore; we register custom mempool ops, and DPDK 
takes buffers from VPP
 - to support such operation, a plugin can request external header space; in the case 
of DPDK it stores the rte_mbuf + rte_mempool_objhdr there (see the layout sketch below)
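
A rough sketch of that layout (the pointer helpers below are illustrative only, not the exact definitions from the patch):

#include <rte_mbuf.h>

/* Illustrative layout: with external header space in front of each VPP
   buffer, the DPDK plugin stores the rte_mempool_objhdr and rte_mbuf
   there, so memory looks like:

     | rte_mempool_objhdr | rte_mbuf | vlib_buffer_t | packet data ... |

   Converting between the two views is plain pointer arithmetic
   (helper names are made up for this sketch): */
static inline struct rte_mbuf *
mbuf_from_vlib_buffer (void *b)
{
  return ((struct rte_mbuf *) b) - 1;
}

static inline void *
vlib_buffer_from_mbuf (struct rte_mbuf *mb)
{
  return (void *) (mb + 1);
}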

I'm still running some tests, so minor changes are possible, but nothing major is 
expected.

--
Damjan
