On Sat, Jun 26, 2021 at 11:59:49AM +0800, fengchengwen wrote:
> Hi, all
> I analyzed the current DPDK DMA drivers and drew up this summary in
> conjunction with the previous discussion; it will serve as a basis for the
> V2 implementation.
> Feedback is welcome, thanks
>
>
> dpaa2_qdma:
> [probe]: mainly obtains the number of hardware queues.
> [dev_configure]: has the following parameters:
>     max_hw_queues_per_core: maximum number of HW-queues per core
>     max_vqs: maximum number of virt-queues
>     fle_queue_pool_cnt: the size of the FLE pool
> [queue_setup]: sets up one virt-queue, has the following parameters:
>     lcore_id: the core on which the virt-queue is created
>     flags: some control params, e.g. sg-list, long-format descriptor,
>            exclusive HW-queue...
>     rbp: some misc fields which impact the descriptor
>     Note: this API returns the index of the virt-queue which was
>           successfully set up.
> [enqueue_bufs]: data-plane API, the key fields:
>     vq_id: the index of the virt-queue
>     job: pointer to the job array
>     nb_jobs: number of jobs
>     Note: one job has src/dest/len/flag/cnxt/status/vq_id/use_elem fields;
>           the flag field indicates whether src/dst are physical addresses.
> [dequeue_bufs]: gets the pointers of the completed jobs
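>
> For illustration, a rough sketch of the job layout described above; only
> the field names come from the driver description, the types are my guesses:
>
>     #include <stdint.h>
>
>     struct qdma_job_sketch {
>         uint64_t src;      /* source address (VA or PHY, see flag)     */
>         uint64_t dest;     /* destination address                      */
>         uint32_t len;      /* copy length in bytes                     */
>         uint32_t flag;     /* e.g. whether src/dest are physical addrs */
>         uint64_t cnxt;     /* user context returned on completion      */
>         uint16_t status;   /* completion status written back by driver */
>         uint16_t vq_id;    /* virt-queue this job belongs to           */
>         uint8_t  use_elem; /* not described above; kept for completeness */
>     };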
>
> [key point]:
>     ------------      ------------
>     |virt-queue|      |virt-queue|
>     ------------      ------------
>              \           /
>               \         /
>                \       /
>     ------------      ------------
>     | HW-queue |      | HW-queue |
>     ------------      ------------
>              \           /
>               \         /
>                \       /
>             core/rawdev
> 1) In the probe stage, the driver reports how many HW-queues can be used.
> 2) User could specify the maximum number of HW-queues managed by a single
>    core in the dev_configure stage.
> 3) User could create one virt-queue by the queue_setup API; the virt-queue
>    has two types: a) exclusive HW-queue, b) shared HW-queue (as described
>    above), which is selected by the corresponding bit of the flags field.
> 4) In this mode, queue management is simplified. User does not need to
>    specify which HW-queue to apply for and then create a virt-queue on that
>    HW-queue. All they need to say is on which core to create a virt-queue.
> 5) The virt-queues could have different capabilities, e.g. virt-queue-0
>    supports scatter-gather format while virt-queue-1 does not; this is
>    controlled by the flags and rbp fields in the queue_setup stage.
> 6) The data-plane API uses definitions similar to rte_mbuf and
>    rte_eth_rx/tx_burst().
> PS: I still don't understand how the sg-list enqueue/dequeue works, and how
>     the user is supposed to use RTE_QDMA_VQ_NO_RESPONSE.
>
> Overall, I think it's a flexible and scalable design. The queue resource
> pool architecture in particular simplifies user invocations, although the
> 'core' concept is introduced a bit abruptly.
>
>
> octeontx2_dma:
> [dev_configure]: has one parameter:
>     chunk_pool: it's strange that this is not managed internally by the
>                 driver, but passed in through the API.
> [enqueue_bufs]: has three important parameters:
>     context: this is what Jerin referred to as 'channel'; it could hold the
>              completed ring of the jobs.
>     buffers: holds the pointer array of dpi_dma_buf_ptr_s
>     count: how many dpi_dma_buf_ptr_s
>     Note: one dpi_dma_buf_ptr_s may have many src and dst pairs (it's a
>           scatter-gather list), and it has one completed_ptr (when the HW
>           completes, it writes a value to this pointer). The current
>           completed_ptr struct is:
>           struct dpi_dma_req_compl_s {
>               uint64_t cdata;  /* driver inits this; HW updates it with the result */
>               void (*compl_cb)(void *dev, void *arg);
>               void *cb_data;
>           };
> [dequeue_bufs]: has two important parameters:
>     context: the driver will scan its completed ring to get completion info.
>     buffers: holds the pointer array of completed_ptr.
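>
> For illustration, a rough sketch of how the completion pointer might be
> used (using the struct shown above); the "pending" sentinel value is my own
> assumption, not taken from the driver:
>
>     #include <stdint.h>
>
>     #define COMPL_PENDING (~0ULL)   /* assumed "not yet completed" marker */
>
>     static inline uint64_t
>     wait_for_completion(volatile struct dpi_dma_req_compl_s *comp)
>     {
>         /* the driver writes the sentinel before enqueue; the HW overwrites
>          * cdata with the result code when the transfer finishes */
>         while (comp->cdata == COMPL_PENDING)
>             ;
>         return comp->cdata;
>     }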
>
> [key point]:
>        -----------      -----------
>        | channel |      | channel |
>        -----------      -----------
>               \           /
>                \         /
>                 \       /
>               ------------
>               | HW-queue |
>               ------------
>                     |
>                 --------
>                 |rawdev|
>                 --------
> 1) User could create one channel by initializing a context
>    (dpi_dma_queue_ctx_s); this interface is not standardized and needs to
>    be implemented by users.
> 2) Different channels can support different transfer types, e.g. one for
>    inner m2m, and another for inbound copy.
>
> Overall, I think the 'channel' is similar to the 'virt-queue' of dpaa2_qdma.
> The difference is that dpaa2_qdma supports multiple hardware queues. The
> 'channel' has the following characteristics:
> 1) A channel is an operable unit at the user level. User can create a
>    channel for each transfer type, for example a local-to-local channel
>    and a local-to-host channel. User could also get the completion status
>    per channel.
> 2) Multiple channels can run on the same HW-queue. In terms of API design,
>    this reduces the number of data-plane API parameters: the channel can
>    hold context info which is referred to when the data-plane APIs execute.
>
>
> ioat:
> [probe]: creates multiple rawdevs if it's a DSA device with multiple
>          HW-queues.
> [dev_configure]: has three parameters:
>     ring_size: the size of the HW descriptor ring.
>     hdls_disable: whether to ignore the user-supplied handle params
>     no_prefetch_completions:
> [rte_ioat_enqueue_copy]: has dev_id/src/dst/length/src_hdl/dst_hdl
>     parameters.
> [rte_ioat_completed_ops]: has dev_id/max_copies/status/num_unsuccessful/
>     src_hdls/dst_hdls parameters.
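>
> For reference, a minimal usage sketch built from the parameters listed
> above; the completed_ops signature has changed between releases, so treat
> this as an illustration rather than a reference:
>
>     #include <stdint.h>
>     #include <rte_memory.h>
>     #include <rte_ioat_rawdev.h>
>
>     static int
>     copy_one(int dev_id, rte_iova_t src, rte_iova_t dst, unsigned int len)
>     {
>         uint32_t status;
>         uint8_t num_unsuccessful = 0;
>         uintptr_t src_hdl, dst_hdl;
>         int ret;
>
>         if (rte_ioat_enqueue_copy(dev_id, src, dst, len, 0, 0) != 1)
>             return -1;                /* ring full: caller may retry */
>
>         rte_ioat_perform_ops(dev_id); /* ring the doorbell */
>
>         do {                          /* poll for the single completion */
>             ret = rte_ioat_completed_ops(dev_id, 1, &status,
>                                          &num_unsuccessful,
>                                          &src_hdl, &dst_hdl);
>         } while (ret == 0);
>
>         return (ret == 1 && num_unsuccessful == 0) ? 0 : -1;
>     }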
>
> Overall, it's one rawdev per HW-queue, and it doesn't have multiple
> 'channels' like octeontx2_dma does.
>
>
> Kunpeng_dma:
> 1) The hardware supports multiple modes (e.g. local-to-local /
>    local-to-PCIe-host / PCIe-host-to-local / immediate-to-local copy).
>    Note: Currently, we only implement local-to-local copy.
> 2) The hardware supports multiple HW-queues.
>
>
> Summary:
> 1) The dpaa2/octeontx2/Kunpeng devices are all in Arm SoCs, which may act
>    as endpoints of an x86 host (e.g. a smart NIC), so multiple memory
>    transfer requirements may exist, e.g. local-to-local/local-to-host...
>    From the point of view of API design, I think we should adopt a similar
>    'channel' or 'virt-queue' concept.
> 2) Whether to create a separate dmadev for each HW-queue? We discussed this
>    previously, and because HW-queues can be managed independently (like
>    Kunpeng_dma and Intel DSA), we preferred to create a separate dmadev for
>    each HW-queue. But I'm not sure whether that's the case with dpaa. I
>    think this can be left to the specific driver; no restriction is imposed
>    at the framework API layer.
> 3) I think we could set up the following abstraction at the dmadev device
>    level:
>        ------------      ------------
>        |virt-queue|      |virt-queue|
>        ------------      ------------
>                 \           /
>                  \         /
>                   \       /
>        ------------      ------------
>        | HW-queue |      | HW-queue |
>        ------------      ------------
>                 \           /
>                  \         /
>                   \       /
>                     dmadev
> 4) The driver's ops design (here we only list the key points):
>    [dev_info_get]: mainly returns the number of HW-queues
>    [dev_configure]: nothing important
>    [queue_setup]: creates one virt-queue, has the following main parameters:
>        HW-queue-index: the HW-queue index to use
>        nb_desc: the number of HW descriptors
>        opaque: driver-specific info
>        Note1: this API returns the virt-queue index which will be used in
>               later APIs. If users want to create multiple virt-queues on
>               the same HW-queue, they can do so by calling queue_setup with
>               the same HW-queue-index.
>        Note2: I think it's hard to define a common queue_setup config
>               parameter, and since this is a control-path API, I think it's
>               OK to use an opaque pointer to implement it (see the strawman
>               sketch below).
>    [dma_copy/memset/sg]: all have a vq_id input parameter.
>        Note: I notice dpaa can't support single and sg operations in one
>              virt-queue, and I think this may be a software implementation
>              policy rather than a HW restriction, because virt-queues could
>              share the same HW-queue.
>    Here we use vq_id to address different scenarios, like local-to-local/
>    local-to-host etc.
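>
>    A strawman of what the queue_setup call could look like; every name
>    below is invented for discussion, nothing is final:
>
>        /* strawman only: all names invented for discussion */
>        struct rte_dmadev_queue_conf {
>            uint16_t hw_queue_index; /* which HW-queue backs this virt-queue */
>            uint16_t nb_desc;        /* number of HW descriptors             */
>            void *opaque;            /* driver-specific info                 */
>        };
>
>        /* returns the virt-queue index (>= 0) on success, negative on error */
>        int rte_dmadev_queue_setup(struct rte_dmadev *dev,
>                                   const struct rte_dmadev_queue_conf *conf);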
> 5) And the dmadev public data-plane API (just a prototype):
>    dma_cookie_t rte_dmadev_memset(dev, vq_id, pattern, dst, len, flags)
>        -- flags: used as an extended parameter, it could be uint32_t
>    dma_cookie_t rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags)
>    dma_cookie_t rte_dmadev_memcpy_sg(dev, vq_id, sg, sg_len, flags)
>        -- sg: struct dma_scatterlist array
>    uint16_t rte_dmadev_completed(dev, vq_id, dma_cookie_t *cookie,
>                                  uint16_t nb_cpls, bool *has_error)
>        -- nb_cpls: indicates the max number of operations to process
>        -- has_error: indicates whether there is an error
>        -- return value: the number of successfully completed operations.
>        -- examples:
>           1) If there are already 32 completed ops, the 4th has an error,
>              and nb_cpls is 32, then the return value will be 3 (because
>              the 1st/2nd/3rd are OK) and has_error will be true.
>           2) If there are already 32 completed ops and all completed
>              successfully, then the return value will be min(32, nb_cpls)
>              and has_error will be false.
>           3) If there are already 32 completed ops and all of them failed,
>              then the return value will be 0 and has_error will be true.
>    uint16_t rte_dmadev_completed_status(dev_id, vq_id, dma_cookie_t *cookie,
>                                         uint16_t nb_status, uint32_t *status)
>        -- return value: the number of failed completed operations.
>    A usage sketch of these two completion APIs follows below.
>    And here I agree with Morten: we should design an API which adapts to
>    DPDK service scenarios. So we don't need to support things like
>    sound-card DMA, or 2D memory copy which is mainly used in video
>    scenarios.
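>
>    To illustrate the intended polling flow with the two completion APIs
>    above (this is my own assumption of how they would be combined):
>
>        #define BURST 32
>
>        dma_cookie_t cookie;
>        uint32_t status[BURST];
>        bool has_error = false;
>
>        /* returns only the leading successfully completed operations */
>        uint16_t nb_ok = rte_dmadev_completed(dev, vq_id, &cookie, BURST,
>                                              &has_error);
>        /* nb_ok operations up to 'cookie' finished without error */
>
>        if (has_error) {
>            /* an error stopped the scan: drain the failed operations and
>             * inspect their individual status codes */
>            uint16_t nb_fail = rte_dmadev_completed_status(dev, vq_id,
>                                                           &cookie, BURST,
>                                                           status);
>            /* handle nb_fail failed operations here */
>        }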
> 6) The dma_cookie_t is a signed int type; when < 0 it means error. It
>    increases monotonically per HW-queue (rather than per virt-queue). The
>    driver needs to guarantee this because the dmadev framework doesn't
>    manage dma_cookie creation.
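>
>    For example, a driver could derive the cookie from a per-HW-queue
>    submission counter (hypothetical sketch, names invented):
>
>        #include <stdint.h>
>
>        typedef int32_t dma_cookie_t;     /* as proposed: < 0 means error */
>
>        struct hw_queue_state {           /* hypothetical driver state */
>            dma_cookie_t next_cookie;
>        };
>
>        /* cookies increase monotonically per HW-queue and wrap back to 0
>         * before turning negative, so negative values stay reserved for
>         * errors */
>        static inline dma_cookie_t
>        hw_queue_next_cookie(struct hw_queue_state *q)
>        {
>            dma_cookie_t c = q->next_cookie;
>            q->next_cookie = (q->next_cookie + 1) & INT32_MAX;
>            return c;
>        }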
> 7) Because the data-plane APIs are not thread-safe, and the user determines
>    the virt-queue to HW-queue mapping (at the queue_setup stage), it is the
>    user's duty to ensure thread safety.
> 8) One example:
>        vq_id = rte_dmadev_queue_setup(dev, config.{HW-queue-index=x, opaque});
>        if (vq_id < 0) {
>            // creating the virt-queue failed
>            return;
>        }
>        // submit a memcpy task
>        cookie = rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags);
>        if (cookie < 0) {
>            // submit failed
>            return;
>        }
IMO rte_dmadev_memcpy should return the number of ops successfully
submitted; that makes it easier to re-submit when the previous session is
not fully submitted.
>        // get the completed task
>        ret = rte_dmadev_completed(dev, vq_id, &cookie, 1, &has_error);
>        if (!has_error && ret == 1) {
>            // the memcpy completed successfully
>        }
> 9) As octeontx2_dma supports an sg-list which can carry many valid buffers
>    in dpi_dma_buf_ptr_s, it could use the rte_dmadev_memcpy_sg API.
> 10) As for ioat, it could declare support for one HW-queue at the
>     dev_configure stage and only support creating one virt-queue.
> 11) As for dpaa2_qdma, I think it could migrate to the new framework, but
>     we are still waiting for feedback from the dpaa2_qdma maintainers.
> 12) About the src/dst parameter type of the rte_dmadev_memcpy prototype: we
>     have two candidates, iova and void *. How about introducing a
>     dma_addr_t type which could be either a VA or an IOVA?
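>
>     Something along these lines (just a sketch of the idea, not a final
>     definition):
>
>         #include <stdint.h>
>
>         /* a single address type wide enough to carry either a virtual
>          * address or an IOVA; which one it actually holds would be a
>          * per-device (or per-virt-queue) property reported via capability
>          * flags */
>         typedef uint64_t dma_addr_t;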
>