Hi, Pavan

> -----Original Message-----
> From: Pavan Nikhilesh Bhagavatula <pbhagavat...@marvell.com>
> Sent: January 8, 2021 17:13
> To: Feifei Wang <feifei.wa...@arm.com>; jer...@marvell.com; Harry
> van Haaren <harry.van.haa...@intel.com>
> Cc: dev@dpdk.org; nd <n...@arm.com>; Honnappa Nagarahalli
> <honnappa.nagaraha...@arm.com>; sta...@dpdk.org; Ruifeng Wang
> <ruifeng.w...@arm.com>; nd <n...@arm.com>; nd <n...@arm.com>
> Subject: RE: [RFC PATCH v1 4/6] app/eventdev: add release barriers for
> pipeline test
>
> Hi Feifei,
>
> >Hi, Pavan
> >
> >> -----Original Message-----
> >> From: Pavan Nikhilesh Bhagavatula <pbhagavat...@marvell.com>
> >> Sent: January 5, 2021 17:29
> >> To: Feifei Wang <feifei.wa...@arm.com>; jer...@marvell.com; Harry
> >> van Haaren <harry.van.haa...@intel.com>
> >> Cc: dev@dpdk.org; nd <n...@arm.com>; Honnappa Nagarahalli
> >> <honnappa.nagaraha...@arm.com>; sta...@dpdk.org; Ruifeng Wang
> >> <ruifeng.w...@arm.com>; nd <n...@arm.com>
> >> Subject: RE: [RFC PATCH v1 4/6] app/eventdev: add release barriers for
> >> pipeline test
> >>
> >> Hi Feifei,
> >>
> >> >Hi, Pavan
> >> >
> >> >Sorry for my late reply, and thanks very much for your review.
> > > >> > > > > >> >> -----Original Message----- > > > >> >> From: Pavan Nikhilesh Bhagavatula > ><mailto:pbhagavat...@marvell.com<mailto:pbhagavat...@marvell.com>> > > > >> >> Sent: 2020年12月22日 18:33 > > > >> >> To: Feifei Wang > ><mailto:feifei.wa...@arm.com<mailto:feifei.wa...@arm.com>>; > >mailto:jer...@marvell.com<mailto:jer...@marvell.com>; > > > >> >Harry van > > > >> >> Haaren > ><mailto:harry.van.haa...@intel.com<mailto:harry.van.haa...@intel.com>>; > >Pavan Nikhilesh > > > >> >> > ><pbhagavat...@caviumnetworks.com<mailto:pbhagavatula@caviumn > >etworks.com>> > > > >> >> Cc: mailto:dev@dpdk.org<mailto:dev@dpdk.org>; nd > ><mailto:n...@arm.com<mailto:n...@arm.com>>; Honnappa Nagarahalli > > > >> >> > ><honnappa.nagaraha...@arm.com<mailto:Honnappa.Nagarahalli@arm > >.com>>; mailto:sta...@dpdk.org<mailto:sta...@dpdk.org>; Phil Yang > > > >> >> <mailto:phil.y...@arm.com<mailto:phil.y...@arm.com>> > > > >> >> Subject: RE: [RFC PATCH v1 4/6] app/eventdev: add release > >barriers > > > >> >for > > > >> >> pipeline test > > > >> >> > > > >> >> > > > >> >> >Add release barriers before updating the processed packets for > > > >> >worker > > > >> >> >lcores to ensure the worker lcore has really finished data > > > >> >> >processing and then it can update the processed packets number. > > > >> >> > > > > >> >> > > > >> >> I believe we can live with minor inaccuracies in stats being > > > >> >> presented > > > >> >as > > > >> >> atomics are pretty heavy when scheduler is limited to burst size > >> >> as > >1. > > > >> >> > > > >> >> One option is to move it before a pipeline operation > > > >> >(pipeline_event_tx, > > > >> >> pipeline_fwd_event etc.) as they imply implicit release barrier > >> >> (as > > > >> >> all > > > >> >the > > > >> >> changes done to the event should be visible to the next core). > > > >> > > > > >> >If I understand correctly, your meaning is that move release > >> >barriers > > > >> >before pipeline_event_tx or pipeline_fwd_event. 
> >> >This can ensure the event has been processed before the next core begins
> >> >to tx/fwd. For example:
> >>
> >> What I meant was that event APIs such as `rte_event_enqueue_burst` and
> >> `rte_event_eth_tx_adapter_enqueue` act as an implicit release barrier, and
> >> the API `rte_event_dequeue_burst` acts as an implicit acquire barrier.
> >>
> >> Since the pipeline_* tests start with a dequeue() and end with an
> >> enqueue(), I don't believe we need barriers in between.
> >
> >Sorry for my misunderstanding this. And I agree with you that no barriers
> >are needed between dequeue and enqueue.
> >
> >Now, let's go back to the beginning. Actually, with this patch, our barrier
> >is mainly for the synchronization variable "w->processed_pkts". As we all
> >know, the event is first dequeued and then enqueued; after this, the event
> >can be treated as a processed event and included in the statistics
> >("w->processed_pkts++").
> >
> >Thus, we add a release barrier before "w->processed_pkts++" to prevent this
> >operation from being executed ahead of time, for example:
> >
> >    dequeue -> w->processed_pkts++ -> enqueue
> >
> >Here the worker has not actually finished processing the event, but the
> >event is already treated as processed and included in the statistics.
>
> But the current sequence is dequeue -> enqueue -> w->processed_pkts++,
> and enqueue already acts as an explicit release barrier, right?
>
Sorry, maybe I cannot understand how "enqueue" acts as an explicit release barrier. I can think of two possibilities:

1. As you said before, all the changes done to the event should be visible to the next core, and enqueue is an operation on the event, so the next core should wait for the event to be enqueued. I think this is due to a data dependence on the same variable. However, 'w->processed_pkts' and 'ev' are different variables, so this cannot prevent 'w->processed_pkts++' from being reordered before the enqueue. The main core may then load an updated 'w->processed_pkts' while the event is actually still being processed. For example:

    Time slot   Worker 1               Main core
    1           dequeue
    2           w->processed_pkts++
    3                                  load w->processed_pkts
    4           enqueue

2. Some release barriers are already included in enqueue. There is a release barrier in rte_ring_enqueue:

    move head -> copy elements to the ring -> release barrier -> update tail -> w->processed_pkts++

However, this barrier cannot prevent 'w->processed_pkts++' from being reordered before the tail update, and once the tail update has finished, the enqueue can be seen as complete.

> >_____________________________________________________________________________
> >
> >By the way, I have two other questions about the pipeline process test in
> >"test_pipeline_queue".
> >
> >1. When do we start counting processed events (w->processed_pkts)?
> >For the fwd mode (internal_port = false), when we choose a single stage, the
> >application increments the number of events processed after
> >"pipeline_event_enqueue". However, when we choose multiple stages, the
> >application increments the number of events processed before
> >"pipeline_event_enqueue".
>
> We count an event as processed when all the stages have completed and it is
> transmitted.
>
> >So, maybe we can unify this.
> >For example, for multiple stages:
> >
> >			if (cq_id == last_queue) {
> >				ev.queue_id = tx_queue[ev.mbuf->port];
> >				rte_event_eth_tx_adapter_txq_set(ev.mbuf, 0);
> >				pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);
> >+				pipeline_event_enqueue(dev, port, &ev);
> >				w->processed_pkts++;
> >			} else {
> >				ev.queue_id++;
> >				pipeline_fwd_event(&ev, sched_type_list[cq_id]);
> >+				pipeline_event_enqueue(dev, port, &ev);
> >			}
> >
> >-		pipeline_event_enqueue(dev, port, &ev);
>
> The above change makes sense.

Thanks for your review; I'll fold this change into the next version.

> >2. Is "pipeline_event_enqueue" needed after "pipeline_event_tx" for tx mode?
> >For the single_stage_burst_tx mode, after "pipeline_event_tx", the worker
> >has to enqueue again due to "pipeline_event_enqueue_burst", so maybe we
> >should jump out of the loop after "pipeline_event_tx",
>
> We call enqueue burst to release the events, i.e., enqueue events with
> RTE_EVENT_OP_RELEASE.

However, in the case of a single event, for 'pipeline_queue_worker_single_stage_tx' and 'pipeline_queue_worker_multi_stage_tx', there is no release operation after tx.
> >For example:
> >
> >			if (ev[i].sched_type == RTE_SCHED_TYPE_ATOMIC) {
> >				pipeline_event_tx(dev, port, &ev[i]);
> >				ev[i].op = RTE_EVENT_OP_RELEASE;
> >				w->processed_pkts++;
> >+				continue;
> >			} else {
> >				ev[i].queue_id++;
> >				pipeline_fwd_event(&ev[i],
> >						RTE_SCHED_TYPE_ATOMIC);
> >			}
> >		}
> >
> >		pipeline_event_enqueue_burst(dev, port, ev, nb_rx);
>
> >> >
> >> >		if (ev.sched_type == RTE_SCHED_TYPE_ATOMIC) {
> >> >+			__atomic_thread_fence(__ATOMIC_RELEASE);
> >> >			pipeline_event_tx(dev, port, &ev);
> >> >			w->processed_pkts++;
> >> >		} else {
> >> >			ev.queue_id++;
> >> >+			__atomic_thread_fence(__ATOMIC_RELEASE);
> >> >			pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);
> >> >			pipeline_event_enqueue(dev, port, &ev);
> >> >
> >> >However, there are two reasons to prevent this:
> >> >
> >> >First, compared with other tests in app/eventdev, for example the
> >> >eventdev perf test, the wmb is placed after the event operation to ensure
> >> >the operation has finished before w->processed_pkts++.
> >>
> >> In the case of the perf_* tests, starting with a dequeue() and finally
> >> ending with a mempool_put() should also act as implicit acquire/release
> >> pairs, making the stats consistent?
> >
> >For the perf tests, this consistency refers to the wmb after mempool_put().
> >Please refer to this link:
> >http://patches.dpdk.org/patch/85634/
> >
> >> >So, if we move the release barriers before tx/fwd, it may make the tests
> >> >in app/eventdev inconsistent. This may reduce the maintainability of the
> >> >code and make it difficult to understand.
> >> >
> >> >Second, it is a test case; though a heavy thread fence may cause
> >> >performance degradation, it can ensure that the operation order and the
> >> >test result are correct. And maybe for a test case, correctness is more
> >> >important than performance.
> >>
> >> Most of our internal perf tests run on 24/48-core combinations, and since
> >> the OcteonTx2 event device driver supports a burst size of 1, it will show
> >> up as a huge performance degradation.
> >
> >For the impact on performance, I did a test using the software driver; here
> >are some results:
> >--------------------------------------------------------------------------
> >Architecture: aarch64
> >NIC: ixgbe 82599
> >CPU: Cortex-A72
> >BURST_SIZE: 1
> >Command: ./dpdk-test-eventdev -l 0-15 -s 0x2 --vdev=event_sw0 -- --test=pipeline_queue --wlcore=4-14 --prod_type_ethdev --stlist=a,a
> >Flow: one flow, 64-byte packets, TX rate: 1.4 Mpps
> >
> >Without this patch: 0.954 Mpps, avg 0.953 Mpps
> >With this patch:    0.932 Mpps, avg 0.930 Mpps
> >--------------------------------------------------------------------------
> >
> >Based on the results above, there is no significant performance degradation
> >with this patch. This is because the release barrier is only for
> >"w->processed_pkts++". It just ensures that the worker core increments the
> >number of events processed after the enqueue, and it doesn't affect
> >dequeue/enqueue:
> >
> >    dequeue -> enqueue -> release barrier -> w->processed_pkts++
>
> Here enqueue already acts as an explicit release barrier.

Please refer to the reasons above.

> >On the other hand, I infer that the reason for the slight decrease in
> >measured performance is that the release barrier prevents
> >"w->processed_pkts++" from happening before the event has been processed
> >(enqueued). But I think this test result is closer to the real performance.
> >
> >And sorry that we have no OcteonTx2 device, so there is no test result for
> >the OcteonTx2 event device driver. Could you please help us test this patch
> >on OcteonTx2 when convenient? Thanks very much.
>
> I will report the performance numbers on Monday.

That's great, thanks very much for your help.
Best Regards
Feifei

> >
> >Best Regards
> >Feifei
>
> Regards,
> Pavan.
>
> >> >So, due to the two reasons above, I'm ambivalent about what we should do
> >> >in the next step.
> >> >
> >> >Best Regards
> >> >Feifei
> >>
> >> Regards,
> >> Pavan.
> >>
> >> >> >Fixes: 314bcf58ca8f ("app/eventdev: add pipeline queue worker
> >> >> >functions")
> >> >> >Cc: pbhagavat...@marvell.com
> >> >> >Cc: sta...@dpdk.org
> >> >> >
> >> >> >Signed-off-by: Phil Yang <phil.y...@arm.com>
> >> >> >Signed-off-by: Feifei Wang <feifei.wa...@arm.com>
> >> >> >Reviewed-by: Ruifeng Wang <ruifeng.w...@arm.com>
> >> >> >---
> >> >> > app/test-eventdev/test_pipeline_queue.c | 64 +++++++++++++++++++++----
> >> >> > 1 file changed, 56 insertions(+), 8 deletions(-)
> >> >> >
> >> >> >diff --git a/app/test-eventdev/test_pipeline_queue.c b/app/test-eventdev/test_pipeline_queue.c
> >> >> >index 7bebac34f..0c0ec0ceb 100644
> >> >> >--- a/app/test-eventdev/test_pipeline_queue.c
> >> >> >+++ b/app/test-eventdev/test_pipeline_queue.c
> >> >> >@@ -30,7 +30,13 @@ pipeline_queue_worker_single_stage_tx(void *arg)
> >> >> >
> >> >> > 		if (ev.sched_type == RTE_SCHED_TYPE_ATOMIC) {
> >> >> > 			pipeline_event_tx(dev, port, &ev);
> >> >> >-			w->processed_pkts++;
> >> >> >+
> >> >> >+			/* release barrier here ensures stored operation
> >> >> >+			 * of the event completes before the number of
> >> >> >+			 * processed pkts is visible to the main core
> >> >> >+			 */
> >> >> >+			__atomic_fetch_add(&(w->processed_pkts), 1,
> >> >> >+					__ATOMIC_RELEASE);
> >> >> > 		} else {
> >> >> > 			ev.queue_id++;
> >> >> > 			pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);
> >> >> >@@ -59,7 +65,13 @@ pipeline_queue_worker_single_stage_fwd(void *arg)
> >> >> > 		rte_event_eth_tx_adapter_txq_set(ev.mbuf, 0);
> >> >> > 		pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);
> >> >> > 		pipeline_event_enqueue(dev, port, &ev);
> >> >> >-		w->processed_pkts++;
> >> >> >+
> >> >> >+		/* release barrier here ensures stored operation
> >> >> >+		 * of the event completes before the number of
> >> >> >+		 * processed pkts is visible to the main core
> >> >> >+		 */
> >> >> >+		__atomic_fetch_add(&(w->processed_pkts), 1,
> >> >> >+				__ATOMIC_RELEASE);
> >> >> > 	}
> >> >> >
> >> >> > 	return 0;
> >> >> >@@ -84,7 +96,13 @@ pipeline_queue_worker_single_stage_burst_tx(void *arg)
> >> >> > 			if (ev[i].sched_type == RTE_SCHED_TYPE_ATOMIC) {
> >> >> > 				pipeline_event_tx(dev, port, &ev[i]);
> >> >> > 				ev[i].op = RTE_EVENT_OP_RELEASE;
> >> >> >-				w->processed_pkts++;
> >> >> >+
> >> >> >+				/* release barrier here ensures stored operation
> >> >> >+				 * of the event completes before the number of
> >> >> >+				 * processed pkts is visible to the main core
> >> >> >+				 */
> >> >> >+				__atomic_fetch_add(&(w->processed_pkts), 1,
> >> >> >+						__ATOMIC_RELEASE);
> >> >> > 			} else {
> >> >> > 				ev[i].queue_id++;
> >> >> > 				pipeline_fwd_event(&ev[i],
> >> >> >@@ -121,7 +139,13 @@ pipeline_queue_worker_single_stage_burst_fwd(void *arg)
> >> >> > 		}
> >> >> >
> >> >> > 		pipeline_event_enqueue_burst(dev, port, ev, nb_rx);
> >> >> >-		w->processed_pkts += nb_rx;
> >> >> >+
> >> >> >+		/* release barrier here ensures stored operation
> >> >> >+		 * of the event completes before the number of
> >> >> >+		 * processed pkts is visible to the main core
> >> >> >+		 */
> >> >> >+		__atomic_fetch_add(&(w->processed_pkts), nb_rx,
> >> >> >+				__ATOMIC_RELEASE);
> >> >> > 	}
> >> >> >
> >> >> > 	return 0;
> >> >> >@@ -146,7 +170,13 @@ pipeline_queue_worker_multi_stage_tx(void *arg)
> >> >> >
> >> >> > 		if (ev.queue_id == tx_queue[ev.mbuf->port]) {
> >> >> > 			pipeline_event_tx(dev, port, &ev);
> >> >> >-			w->processed_pkts++;
> >> >> >+
> >> >> >+			/* release barrier here ensures stored operation
> >> >> >+			 * of the event completes before the number of
> >> >> >+			 * processed pkts is visible to the main core
> >> >> >+			 */
> >> >> >+			__atomic_fetch_add(&(w->processed_pkts), 1,
> >> >> >+					__ATOMIC_RELEASE);
> >> >> > 			continue;
> >> >> > 		}
> >> >> >
> >> >> >@@ -180,7 +210,13 @@ pipeline_queue_worker_multi_stage_fwd(void *arg)
> >> >> > 			ev.queue_id = tx_queue[ev.mbuf->port];
> >> >> > 			rte_event_eth_tx_adapter_txq_set(ev.mbuf, 0);
> >> >> > 			pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);
> >> >> >-			w->processed_pkts++;
> >> >> >+
> >> >> >+			/* release barrier here ensures stored operation
> >> >> >+			 * of the event completes before the number of
> >> >> >+			 * processed pkts is visible to the main core
> >> >> >+			 */
> >> >> >+			__atomic_fetch_add(&(w->processed_pkts), 1,
> >> >> >+					__ATOMIC_RELEASE);
> >> >> > 		} else {
> >> >> > 			ev.queue_id++;
> >> >> > 			pipeline_fwd_event(&ev, sched_type_list[cq_id]);
> >> >> >@@ -214,7 +250,13 @@ pipeline_queue_worker_multi_stage_burst_tx(void *arg)
> >> >> > 				if (ev[i].queue_id == tx_queue[ev[i].mbuf->port]) {
> >> >> > 					pipeline_event_tx(dev, port, &ev[i]);
> >> >> > 					ev[i].op = RTE_EVENT_OP_RELEASE;
> >> >> >-					w->processed_pkts++;
> >> >> >+
> >> >> >+					/* release barrier here ensures stored operation
> >> >> >+					 * of the event completes before the number of
> >> >> >+					 * processed pkts is visible to the main core
> >> >> >+					 */
> >> >> >+					__atomic_fetch_add(&(w->processed_pkts), 1,
> >> >> >+							__ATOMIC_RELEASE);
> >> >> > 					continue;
> >> >> > 				}
> >> >> >
> >> >> >@@ -254,7 +296,13 @@ pipeline_queue_worker_multi_stage_burst_fwd(void *arg)
> >> >> >
> >> >> > 				rte_event_eth_tx_adapter_txq_set(ev[i].mbuf, 0);
> >> >> > 				pipeline_fwd_event(&ev[i],
> >> >> > 						RTE_SCHED_TYPE_ATOMIC);
> >> >> >-				w->processed_pkts++;
> >> >> >+
> >> >> >+				/* release barrier here ensures stored operation
> >> >> >+				 * of the event completes before the number of
> >> >> >+				 * processed pkts is visible to the main core
> >> >> >+				 */
> >> >> >+				__atomic_fetch_add(&(w->processed_pkts), 1,
> >> >> >+						__ATOMIC_RELEASE);
> >> >> > 			} else {
> >> >> > 				ev[i].queue_id++;
> >> >> > 				pipeline_fwd_event(&ev[i],
> >> >> >--
> >> >> >2.17.1