> -----Original Message----- > From: Paul Menzel <pmen...@molgen.mpg.de> > Sent: Monday, June 10, 2024 2:26 PM > To: Loktionov, Aleksandr <aleksandr.loktio...@intel.com>; Kang, Kelvin > <kelvin.k...@intel.com> > Cc: Sokolowski, Jan <jan.sokolow...@intel.com>; net...@vger.kernel.org; > Kubalewski, Arkadiusz <arkadiusz.kubalew...@intel.com>; Nguyen, Anthony L > <anthony.l.ngu...@intel.com>; intel-wired-...@lists.osuosl.org; Leon > Romanovsky <leo...@nvidia.com> > Subject: Re: [Intel-wired-lan] [PATCH iwl-net] i40e: fix hot issue NVM > content is corrupted after nvmupdate > > Dear Aleksandr, > > > Thank you for your response. > > > Am 10.06.24 um 12:20 schrieb Loktionov, Aleksandr: > > > > > >> -----Original Message----- > >> From: Intel-wired-lan <intel-wired-lan-boun...@osuosl.org> On Behalf > >> Of Loktionov, Aleksandr > >> Sent: Monday, June 10, 2024 12:16 PM > > >>> -----Original Message----- > >>> From: Loktionov, Aleksandr > >>> Sent: Monday, June 10, 2024 12:14 PM > > >>>> -----Original Message----- > >>>> From: Paul Menzel <pmen...@molgen.mpg.de> > >>>> Sent: Monday, June 10, 2024 11:45 AM > > >>>> Am 10.06.24 um 11:20 schrieb Aleksandr Loktionov: > >>>>> After 230f3d53a547 patch all I/O errors are being converted into > >>>>> EAGAIN which leads to retries until timeout so nvmupdate sometimes > >>>>> fails after more than 20 minutes! > >>>>> > >>>>> Remove misleading EIO to EGAIN conversion and pass all errors as > >>>> is. > >>>>> > >>>>> Fixes: 230f3d53a547 ("i40e: remove i40e_status") > >>>> > >>>> This commit is present since v6.6-rc1, released September last year > >>>> (2023). So until now, nobody noticed this? > >>>> > >>> Really, really. The regression affects users only when they update > >>> F/W, and not all F/W are affected, only that generate I/O errors > >>> while update. > >> Not all the cards are affected, but the consequences are serous as in > >> subj. > >> > >>>>> Co-developed-by: Kelvin Kang <kelvin.k...@intel.com> > >>>>> Signed-off-by: Kelvin Kang <kelvin.k...@intel.com> > >>>>> Reviewed-by: Arkadiusz Kubalewski > >>>> <arkadiusz.kubalew...@intel.com> > >>>>> Signed-off-by: Aleksandr Loktionov > >>>> <aleksandr.loktio...@intel.com> > >>>> > >>>> Please give more details about your test setup. For me it’s also > >>>> not clear, how the NVM content gets corrupted as stated in the > >>>> summary/title. Could you please elaborate that in the commit message. > > > For example X710DA2 with 0x8000ECB7 is affected, but there are > > probably more... > > Please amend the commit message with this information, and for ease also the > commands you executed. > > > The corruption is already described - because of timeout nvmupdate > > timeouts failing to update NVM. > > Only because something times out, does not mean it causes corruption. > > Please amend the commit message. It looks like you also missed more of my > questions below. > > >>>>> --- > >>>>> drivers/net/ethernet/intel/i40e/i40e_adminq.h | 4 ---- > >>>>> 1 file changed, 4 deletions(-) > >>>>> > >>>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h > >>>>> b/drivers/net/ethernet/intel/i40e/i40e_adminq.h > >>>>> index ee86d2c..55b5bb8 100644 > >>>>> --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h > >>>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h > >>>>> @@ -109,10 +109,6 @@ static inline int i40e_aq_rc_to_posix(int aq_ret, > int aq_rc) > >>>>> -EFBIG, /* I40E_AQ_RC_EFBIG */ > >>>>> }; > >>>>> > >>>>> - /* aq_rc is invalid if AQ timed out */ > >>>>> - if (aq_ret == -EIO) > >>>>> - return -EAGAIN; > >>>>> - > >>>>> if (!((u32)aq_rc < (sizeof(aq_to_posix) / > sizeof((aq_to_posix)[0])))) > >>>>> return -ERANGE; > >>>> > >>>> The referenced commit 230f3d53a547 does: > >>>> > >>>> ``` > >>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h > >>>> b/drivers/net/ethernet/intel/i40e/i40e_adminq.h > >>>> index ee394aacef4d..267f2e0a21ce 100644 > >>>> --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h > >>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h > >>>> @@ -5,7 +5,6 @@ > >>>> #define _I40E_ADMINQ_H_ > >>>> > >>>> #include "i40e_osdep.h" > >>>> - #include "i40e_status.h" > >>>> #include "i40e_adminq_cmd.h" > >>>> > >>>> #define I40E_ADMINQ_DESC(R, i) \ > >>>> @@ -117,7 +116,7 @@ static inline int i40e_aq_rc_to_posix(int aq_ret, > int aq_rc) > >>>> }; > >>>> > >>>> /* aq_rc is invalid if AQ timed out */ > >>>> - if (aq_ret == I40E_ERR_ADMIN_QUEUE_TIMEOUT) > >>>> + if (aq_ret == -EIO) > >>>> return -EAGAIN; > >>>> > >>>> if (!((u32)aq_rc < (sizeof(aq_to_posix) / > >>>> sizeof((aq_to_posix)[0])))) ``` > >>>> > >>>> So I do not see yet, why removing the whole hunk is the solution. The solution is to pass errors as mentioned in the commit as is from f/w to nvmupdate.
> Kind regards, > > Paul > > > PS: Would it be possible, that you use another email program. The cited > parts are wrapped awkwardly, and it takes some time to correct it in my > response.