Re: [PATCH v11 2/5] ethdev: support proactive error handling mode

fengchengwen Tue, 11 Oct 2022 07:48:47 -0700

Hi Andrew,

On 2022/10/10 16:47, Andrew Rybchenko wrote:

On 10/9/22 12:10, Chengwen Feng wrote:

From: Kalesh AP <kalesh-anakkur.pura...@broadcom.com>


Some PMDs (e.g. hns3) can detect hardware or firmware errors and try
to recover from them. During this process, the PMD sets the data path
pointers to dummy functions (which prevents a crash) and also makes
sure that control path operations fail with return code -EBUSY.
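To make the mechanism above concrete, here is a minimal sketch of swapping
a data path pointer to a dummy function while rejecting control path calls
with -EBUSY. It is not the hns3 or ethdev implementation: the struct
layout and every sketch_* name are hypothetical and only illustrate the
idea.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a port; NOT the real rte_eth_dev layout. */
struct sketch_port;
typedef uint16_t (*sketch_rx_burst_t)(struct sketch_port *port,
				      void **pkts, uint16_t n);

struct sketch_port {
	sketch_rx_burst_t rx_burst;       /* data path entry point */
	sketch_rx_burst_t saved_rx_burst; /* restored after recovery */
	int recovering;                   /* non-zero while recovery runs */
};

/* Pretend "real" receive: reports every requested packet as received. */
static uint16_t sketch_real_rx(struct sketch_port *port, void **pkts,
			       uint16_t n)
{
	(void)port;
	(void)pkts;
	return n;
}

/* Dummy receive: never touches hardware, always reports zero packets. */
static uint16_t sketch_dummy_rx(struct sketch_port *port, void **pkts,
				uint16_t n)
{
	(void)port;
	(void)pkts;
	(void)n;
	return 0;
}

static void sketch_port_init(struct sketch_port *port)
{
	port->rx_burst = sketch_real_rx;
	port->saved_rx_burst = NULL;
	port->recovering = 0;
}

/* Error detected: park the data path on the dummy function. */
static void sketch_recovery_begin(struct sketch_port *port)
{
	assert(port != NULL);
	port->saved_rx_burst = port->rx_burst;
	port->rx_burst = sketch_dummy_rx;
	port->recovering = 1;
}

/* Recovery finished: restore the original data path function. */
static void sketch_recovery_end(struct sketch_port *port)
{
	port->rx_burst = port->saved_rx_burst;
	port->recovering = 0;
}

/* Example control path operation: refused with -EBUSY during recovery. */
static int sketch_set_mtu(struct sketch_port *port, uint16_t mtu)
{
	(void)mtu;
	if (port->recovering)
		return -EBUSY;
	return 0;
}
```

With this shape, a polling thread that keeps calling rx_burst during
recovery simply sees zero packets instead of touching a device that is
being reset.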


Could you explain why passive mode is not good? Why is
proactive better? What are the benefits? IMHO, it would
be simpler to have just one error recovery mode.

I think neither mode is inherently good or bad. To a large extent, the
choice is determined by the hardware and software design of the network
card chip. Take the hns3 driver as an example:

During error recovery, multiple handshakes are required between the
driver and the firmware, and the handshakes are subject to timeouts.

If passive mode is chosen, the application may not register the callback
(we also found that, among many DPDK-based open source projects, only
ovs-dpdk registers for the reset event), so the recovery will fail.
Furthermore, even if the callback is registered, the recovery process
involves multiple handshakes, which may take too much time to complete.
Imagine multiple ports reporting a reset at the same time. (This
possibility exists. Consider the case where a PF with multiple VFs under
it is reset.) In this case, many VFs report the event, but the event
callbacks are executed sequentially (because there is only one interrupt
thread). As a result, the later VFs cannot be processed in time, and the
reset may fail.
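As a rough illustration of the sequential-callback problem, the sketch
below counts how many ports would miss a handshake timeout when their
recoveries are processed one after another by a single thread. The model
is an assumption made for illustration only: all events arrive at the
same moment, every recovery takes the same time, and the timeout is
measured from the moment the events arrive. The function name and the
numbers are hypothetical, not hns3 values.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count the ports whose recovery would finish after the timeout when a
 * single thread handles them strictly one at a time.
 *
 * Assumptions (illustrative only): all ports report at time 0, each
 * recovery takes per_port_ms, and timeout_ms is measured from time 0.
 */
static unsigned int sketch_count_late_ports(unsigned int n_ports,
					    uint32_t per_port_ms,
					    uint32_t timeout_ms)
{
	unsigned int late = 0;

	for (unsigned int i = 1; i <= n_ports; i++) {
		/* The i-th port in the queue finishes at i * per_port_ms. */
		uint64_t finish_ms = (uint64_t)i * per_port_ms;

		if (finish_ms > timeout_ms)
			late++;
	}
	return late;
}
```

For example, with 8 ports, 100 ms per recovery and a 500 ms timeout, the
ports finish at 100, 200, ..., 800 ms, so the last 3 would be too late
under this model.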


In conclusion, the proactive mode is a viable troubleshooting method in
engineering practice.


The above error handling mode is known as
RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE (proactive error handling mode).

In some service scenarios, the application needs to be aware of the event
to determine whether to migrate services. So three events were
introduced:

1) RTE_ETH_EVENT_ERR_RECOVERING: used to notify the application that an
error was detected and the recovery is being started. Upon receiving the
event, the application should not invoke any control path APIs until it
receives the RTE_ETH_EVENT_RECOVERY_SUCCESS or
RTE_ETH_EVENT_RECOVERY_FAILED event.

2) RTE_ETH_EVENT_RECOVERY_SUCCESS: used to notify the application that
the port has recovered successfully from the error. The PMD has already
re-configured the port, and the effect is the same as that of a restart
operation.

3) RTE_ETH_EVENT_RECOVERY_FAILED: used to notify the application that
the port has failed to recover from the error and should not be used
anymore. The application should close the port.
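The three events above suggest a small state machine on the application
side. The following is a minimal sketch of that idea, not code from the
patch. The enums are local stand-ins that mirror the event names; a real
application would use the RTE_ETH_EVENT_* values from rte_ethdev.h inside
a callback registered with rte_eth_dev_callback_register(). All sketch_*
and SKETCH_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins mirroring the three event names from the patch. */
enum sketch_event {
	SKETCH_EVENT_ERR_RECOVERING,
	SKETCH_EVENT_RECOVERY_SUCCESS,
	SKETCH_EVENT_RECOVERY_FAILED,
};

/* Hypothetical per-port state tracked by the application. */
enum sketch_port_state {
	SKETCH_PORT_ACTIVE,     /* control path APIs may be called */
	SKETCH_PORT_RECOVERING, /* PMD is recovering, hold off */
	SKETCH_PORT_DEAD,       /* recovery failed, port must be closed */
};

/* Compute the next state for a port when an event arrives. */
static enum sketch_port_state
sketch_on_event(enum sketch_port_state cur, enum sketch_event ev)
{
	switch (ev) {
	case SKETCH_EVENT_ERR_RECOVERING:
		/* A dead port stays dead; otherwise start holding off. */
		return cur == SKETCH_PORT_DEAD ? cur : SKETCH_PORT_RECOVERING;
	case SKETCH_EVENT_RECOVERY_SUCCESS:
		/* Only meaningful while a recovery is in progress. */
		return cur == SKETCH_PORT_RECOVERING ? SKETCH_PORT_ACTIVE : cur;
	case SKETCH_EVENT_RECOVERY_FAILED:
		return SKETCH_PORT_DEAD;
	}
	return cur;
}

/* Control path APIs should only be invoked on an active port. */
static bool sketch_control_path_allowed(enum sketch_port_state s)
{
	return s == SKETCH_PORT_ACTIVE;
}

/* After a failed recovery the application should close the port. */
static bool sketch_should_close(enum sketch_port_state s)
{
	return s == SKETCH_PORT_DEAD;
}
```

An application could keep one such state per port and consult
sketch_control_path_allowed() before issuing configuration calls, and
could also use the RECOVERING state as the trigger for deciding whether
to migrate services.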

Signed-off-by: Kalesh AP <kalesh-anakkur.pura...@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.ko...@broadcom.com>
Signed-off-by: Chengwen Feng <fengcheng...@huawei.com>
Reviewed-by: Ajit Khaparde <ajit.khapa...@broadcom.com>


The code itself LGTM. I just want to understand why we need it.
It should be proved in the description.
