> -----Original Message-----
> From: Intel-wired-lan <[email protected]> On Behalf Of Jacob 
> Keller
> Sent: 17 July 2025 22:27
> To: Nguyen, Anthony L <[email protected]>; Intel Wired LAN 
> <[email protected]>; Loktionov, Aleksandr 
> <[email protected]>
> Cc: Keller, Jacob E <[email protected]>; Grinberg, Vitaly 
> <[email protected]>; [email protected]
> Subject: [Intel-wired-lan] [PATCH iwl-net 2/2] ice: don't leave device 
> non-functional if Tx scheduler config fails
>
> The ice_cfg_tx_topo function attempts to apply Tx scheduler topology 
> configuration based on NVM parameters, selecting either a 5 or 9 layer 
> topology.
>
> As part of this flow, the driver acquires the "Global Configuration Lock", 
> which is a hardware resource associated with programming the DDP package to 
> the device. This "lock" is implemented by firmware as a way to guarantee that 
> only one PF can program the DDP for a device. Unlike a traditional lock, once 
> a PF has acquired this lock, no other PF will be able to acquire it again 
> (including that PF) until a CORER of the device.
> Future requests to acquire the lock report that global configuration has 
> already completed.
>
> The following flow is used to program the Tx topology:
>
> * Read the DDP package for scheduler configuration data
> * Acquire the global configuration lock
> * Program Tx scheduler topology according to DDP package data
> * Trigger a CORER which clears the global configuration lock
>
> This is followed by the flow for programming the DDP package:
>
> * Acquire the global configuration lock (again)
> * Download the DDP package to the device
> * Release the global configuration lock.
>
> However, if configuration of the Tx topology fails, (i.e.
> ice_get_set_tx_topo returns an error code), the driver exits
> ice_cfg_tx_topo() immediately, and fails to trigger CORER.
>
> While the global configuration lock is held, the firmware rejects most AdminQ 
> commands, as it is waiting for the DDP package download (or Tx scheduler 
> topology programming) to occur.
>
> The current driver flows assume that the global configuration lock has been 
> reset by CORER after programming the Tx topology. Thus, the same PF attempts 
> to acquire the global lock again, and fails. This results in the driver 
> reporting "an unknown error occurred when loading the DDP package".
> It then attempts to enter safe mode, but ultimately fails to finish
> ice_probe() since nearly all AdminQ command report error codes, and the 
> driver stops loading the device at some point during its initialization.
>
> The only currently known way that ice_get_set_tx_topo() can fail is with 
> certain older DDP packages which contain invalid topology configuration, on 
> firmware versions which strictly validate this data. The most recent releases 
> of the DDP have resolved the invalid data. However, it is still poor practice 
> to essentially brick the device, and prevent access to the device even 
> through safe mode or recovery mode. It is also plausible that this command 
> could fail for some other reason in the future.
>
> We cannot simply release the global lock after a failed call to 
> ice_get_set_tx_topo(). Releasing the lock indicates to firmware that global 
> configuration (downloading of the DDP) has completed. Future attempts by this 
> or other PFs to load the DDP will fail with a report that the DDP package has 
> already been downloaded. Then, PFs will enter safe mode as they realize that 
> the package on the device does not meet the minimum version requirement to 
> load. The reported error messages are confusing, as they indicate the version 
> of the default "safe mode" package in the NVM, rather than the version of the 
> file loaded from /lib/firmware.
>
> Instead, we need to trigger CORER to clear global configuration. This is the 
> lowest level of hardware reset which clears the global configuration lock and 
> related state. It also clears any already downloaded DDP.
> Crucially, it does *not* clear the Tx scheduler topology configuration.
>
> Refactor ice_cfg_tx_topo() to always trigger a CORER after acquiring the 
> global lock, regardless of success or failure of the topology configuration.
>
> We need to re-initialize the HW structure when we trigger the CORER. Thus, it 
> makes sense for this to be the responsibility of ice_cfg_tx_topo() rather 
> than its caller, ice_init_tx_topology(). This avoids needless 
> re-initialization in cases where we don't attempt to update the Tx scheduler 
> topology, such as if it has already been programmed.
>
> There is one catch: failure to re-initialize the HW struct should stop 
> ice_probe(). If this function fails, we won't have a valid HW structure and 
> cannot ensure the device is functioning properly. To handle this, ensure
Ice_cfg_tx_topo() returns a limited set of error codes. Set aside one 
specifically, -ENODEV, to indicate that the ice_init_tx_topology() should fail 
and stop probe.
>
>
> Other error codes indicate failure to apply the Tx scheduler topology. This 
> is treated as a non-fatal error, with an informational message informing the 
> system administrator that the updated Tx topology did not apply. > This 
> allows the device to load and function with the default Tx scheduler 
> topology, rather than failing to load entirely.
>
> Note that this use of CORER will not result in loops with future PFs 
> attempting to also load the invalid Tx topology configuration. The first PF 
> will acquire the global configuration lock as part of programming the DDP.
> Each PF after this will attempt to acquire the global lock as part of 
> programming the Tx topology, and will fail with the indication from firmware 
> that global configuration is already complete. Tx scheduler topology 
> configuration is only performed during driver init (probe or devlink reload) 
> and not during cleanup for a CORER that happens after probe completes.
>
> Fixes: 91427e6d9030 ("ice: Support 5 layer topology")
> Signed-off-by: Jacob Keller <[email protected]>
> ---
> drivers/net/ethernet/intel/ice/ice_ddp.c  | 44 
> ++++++++++++++++++++++---------  drivers/net/ethernet/intel/ice/ice_main.c | 
> 14 ++++++----
> 2 files changed, 41 insertions(+), 17 deletions(-)
>

Tested-by: Rinitha S <[email protected]> (A Contingent worker at Intel)

Reply via email to