On 2 June 2015 at 11:34, Maciej Czekaj <m...@semihalf.com> wrote:

> Zoltan,
>
> I am currently working on ThunderX port so I can offer an insight into one
> of the implementations.
>
> ThunderX has a more server-like network adapter as opposed to Octeon or
> QorIQ, so buffer management is done in software.
> I think the problem of pool starvation mostly affects those kinds of
> platforms, and any mismanagement here may have dire costs.
>
> On Thunder the buffers are handled the same way as in DPDK, so transmitted
> buffers have to be "reclaimed" after the hardware has finished processing
> them. The best and most efficient way to free the buffers is to do it
> while transmitting others on the same queue, i.e. in odp_pktio_send or in
> the enqueue operation. There are several reasons behind this:
>
The problem is that no one might actually be transmitting on the interface
where there are TX buffers waiting to be retired.

The DPDK solution seems to be to scatter your application code with
rte_eth_tx_burst(nb_pkts=0) calls.
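
A minimal sketch of such a call site (the port/queue IDs are placeholders,
and whether a zero-length burst actually reclaims completed mbufs depends
on the PMD):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Transmit nothing, just give the PMD a chance to run its TX-free
 * housekeeping and return completed mbufs to the mempool. Many PMDs do
 * that work before looking at nb_pkts, but it is not guaranteed. */
static inline void tx_done_kick(uint16_t port_id, uint16_t queue_id)
{
        struct rte_mbuf *dummy[1];

        (void)rte_eth_tx_burst(port_id, queue_id, dummy, 0);
}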


>  1. The TX ring is accessed anyway, so it minimizes cache misses.
>
But the already-transmitted packets (at the ring head) and the new packets
(at the ring tail) might not be located in the same cache lines.


>
>  2. The TX ring H/W registers are accessed while transmitting packets, so
> information about the ring occupancy is already extracted by software.
>      This minimizes the overhead of H/W register access, which may be
> quite significant even on an internal PCI-E bus.
>
Yes, unnecessary PCIe accesses will likely have a huge performance impact.
But do you have to get this information from HW? Can't the SW maintain a
pointer (index) to the oldest descriptor in the ring and check a descriptor
flag to see whether the buffer has been transmitted? You really want to
avoid accessing HW registers anyway because of synchronisation problems
(e.g. HW (device) access often requires a DSB or equivalent so that the CPU
and device are synchronised with regard to memory updates). Preferably the
driver should only work with shared coherent memory, not with the device
registers themselves (no kicking the device when there are new packets to
transmit; the device will have to poll the descriptor rings, which might
add some latency but will save CPU cycles).
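
For illustration, a sketch of what that software-only reclaim could look
like; the descriptor layout, the DONE flag and the ring bookkeeping below
are all made up, not Thunder's or any real NIC's:

#include <stdint.h>

#define TX_RING_SIZE 1024u              /* power of two, for cheap masking */
#define DESC_FLAG_DONE (1u << 0)        /* hypothetical completion bit     */

struct tx_desc {
        uint64_t buf_addr;
        uint32_t len;
        volatile uint32_t flags;        /* written back by the NIC into
                                         * coherent memory, no register read */
};

struct tx_ring {
        struct tx_desc desc[TX_RING_SIZE];
        void *buf[TX_RING_SIZE];        /* buffer attached to each slot */
        uint32_t head;                  /* oldest not-yet-reclaimed slot */
        uint32_t tail;                  /* next slot to fill on transmit */
};

/* Walk from the oldest descriptor and free every buffer whose DONE flag
 * the NIC has set; stop at the first descriptor it still owns. */
static unsigned tx_reclaim(struct tx_ring *r, void (*buf_free)(void *))
{
        unsigned n = 0;

        while (r->head != r->tail) {
                uint32_t i = r->head & (TX_RING_SIZE - 1);

                if (!(r->desc[i].flags & DESC_FLAG_DONE))
                        break;
                buf_free(r->buf[i]);
                r->head++;
                n++;
        }
        return n;
}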


>
>  3. Any other scheme, e.g. doing it in the mempool or in RX as suggested
> previously, incurs the extra overhead avoided in points 1 and 2, plus
> further overhead caused by synchronizing access to the ring:
>
My idea was that you should only do this extra TX-done processing when you
are running out of buffers and the system would otherwise misbehave (not
crash, but start dropping packets or otherwise suffer from the lack of
buffers).
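
Something like the sketch below, reusing the hypothetical struct tx_ring
and tx_reclaim() from above; pool_free_count(), buf_free() and the
watermark are made-up names as well:

#define POOL_LOW_WATERMARK 64u          /* arbitrary threshold */

extern unsigned pool_free_count(void);  /* free buffers left in the pool */
extern void buf_free(void *buf);        /* return one buffer to the pool */
extern struct tx_ring my_tx_ring;       /* this thread's own TX ring */

/* In the common case the pool has buffers and the TX ring is left alone;
 * reclaim only kicks in when the pool is under pressure. */
static inline void maybe_reclaim_tx(void)
{
        if (pool_free_count() < POOL_LOW_WATERMARK)
                (void)tx_reclaim(&my_tx_ring, buf_free);
}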


>      - accessing the TX ring from the mempool must be thread-safe; the
> mempool may be invoked from a different context than ring transmission
>
Either each thread has its own RX and TX rings, and threads only perform
TX-done processing on their own rings. Or the rings are shared between
threads (maybe the HW doesn't support per-thread/per-CPU rings) and the
drivers must be thread-safe. Is there such HW in existence anymore?

All these problems were solved with HW buffer and queue management... Now
the time machine has brought us back to the '90s.



>      - accessing the transmission ring from the receive operation leads
> to a similar thread-safety issue, where RX and TX, being independent
> operations from the H/W perspective, must be additionally synchronized
> with respect to each other
>
Not if the TX ring is specific to this thread. It doesn't matter that we
(the thread) access it from some RX function. We are not interrupting some
TX function that was in the process of accessing the same TX ring.


>
> Summarizing, any high-performance implementation must live with the fact
> that some buffers will be kept in the TX ring for a while, and choose the
> mempool size accordingly.
>
And is this a practical problem if there are enough packets in the pool? A
lot of packets may be stuck in TX rings waiting to be retired, but with a
large enough pool, reception should still proceed and cause new send calls,
which will eventually retire those TX packets.

We only need the work-around if there are so few packets that reception
stops due to the lack of them. In that case, we need to issue explicit
calls (but not necessarily from the application) to perform TX-done
processing, but preferably only on our own TX rings, to avoid thread
synchronisation issues. If packets are stuck on some other thread's TX
ring, we have to wait for that thread to do TX-done processing (either due
to a normal send call or forced by the lack of packets).
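
In ODP-DPDK terms that could look roughly like the following; the
per-thread bookkeeping is hypothetical, only rte_eth_tx_burst() is real
DPDK API, and as above the zero-length burst only helps if the PMD does its
TX-free work there:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define MAX_OWN_TXQ 8

struct own_txq_list {
        uint16_t port_id[MAX_OWN_TXQ];
        uint16_t queue_id[MAX_OWN_TXQ]; /* queues used only by this thread */
        int num;
};

static __thread struct own_txq_list own_txqs;

/* Called when the pool runs dry: flush only the calling thread's own TX
 * queues, so no locking against other threads is needed. Packets stuck on
 * another thread's queue stay there until that thread sends or flushes. */
static void flush_own_tx_queues(void)
{
        struct rte_mbuf *dummy[1];
        int i;

        for (i = 0; i < own_txqs.num; i++)
                (void)rte_eth_tx_burst(own_txqs.port_id[i],
                                       own_txqs.queue_id[i], dummy, 0);
}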


> This is true at least for Thunder and other similar "server" adapters. On
> the other hand, the issue may be non-existent in specialized network
> processors, but then there is no need for an extra API or extra software
> tricks anyway.
>
> The memory pressure may come not only from the TX ring but from the RX
> ring as well, when flooded with packets. That leads to the same challenge,
> but reversed, i.e. the receive function greedily allocates packets to feed
> the H/W with as many free buffers as possible, and there is currently no
> way to limit that.
>
> That is why, from the Thunder perspective, a practical solution is:
>  - explicitly stating the "depth" of the engine (both RX and TX) via an
> API or some parameter, and letting the implementer choose how to deal with
> the problem
>  - adding a note that the transmission functions are responsible for
> buffer cleanup, to let the application choose the best strategy
>
> This is by no means a silver bullet, but it gives the user the tools to
> deal with the problem and at the same time does not impose unnecessary
> overhead on certain implementations.
>
>
> Cheers
> Maciej
>
> 2015-05-29 18:03 GMT+02:00 Zoltan Kiss <zoltan.k...@linaro.org>:
>
>> Hi,
>>
>> On 29/05/15 16:58, Jerin Jacob wrote:
>>
>>> I agree. Is it possible to dedicate "core 0"/"any core" in the ODP-DPDK
>>> implementation to do the housekeeping job? If we are planning the
>>> ODP-DPDK implementation as just a wrapper over the DPDK API, then there
>>> will not be any value added by using the ODP API. At least from my
>>> experience, we have changed our SDK "a lot" to fit into the ODP model.
>>> IMO that kind of effort will be required for a useful ODP-DPDK port.
>>>
>>
>> It would be good to have some input from other implementations as well:
>> when do you release the sent packets in the Cavium implementation?
>>
>
_______________________________________________
lng-odp mailing list
lng-odp@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lng-odp
