Presently, to flash a cxl adapter with a new FPGA image, a warm PCIe reset is requested on the adapter once the bitstream has been loaded into the card's flash memory. This issues a pci-fundamental reset to the card slot, signaling the card controller to reconfigure the FPGA with the new bitstream. However, a pci-fundamental reset of the slot also results in a fenced PHB, which raises an EEH event and triggers the core EEH flow.

The core EEH code also maintains a counter named freeze_count for each PE, inside struct eeh_pe. The counter is incremented every time an EEH error is reported on the PE domain, and if the counter exceeds the threshold, the device is permanently disabled. The threshold is set by the variable eeh_max_freeze, which can be changed via debugfs. This creates problems for cxl adapters:

* It limits the number of times an FPGA image can be re-flashed, which is 5 times per hour by default.

* Since the adapter can potentially acquire a new personality after each reset, the freeze_count of the older FPGA image shouldn't be carried over to the newer image.

To fix these problems, this patch-set introduces a new function named eeh_pe_reset_freeze_counter that resets the freeze counter of the eeh_pe struct. The cxl module can then call this function before issuing the pci-fundamental reset to the card slot to load the new FPGA image.

Test Runs
==========
* Without the patchset:
# for i in $(seq 0 6); do echo 1 > /sys/class/cxl/card0/reset; sleep 20; done
bash: /sys/class/cxl/card0/reset: No such file or directory
# dmesg
...
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
EEH: PHB#22-PE#0 has failed 2 times in the last hour
...
EEH: PHB#22-PE#0 has failed 3 times in the last hour
...
EEH: PHB#22-PE#0 has failed 4 times in the last hour
...
EEH: PHB#22-PE#0 has failed 5 times in the last hour
...
EEH: PHB#22-PE#0 has failed 6 times in the last hour and has been permanently disabled.

* With the patchset:
# for i in $(seq 0 6); do echo 1 > /sys/class/cxl/card0/reset; sleep 20; done
# dmesg
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour

---
Vaibhav Jain (3):
  powerpc/eeh: Refactor eeh_pe_update_time_stamp to update freeze_count
  powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter
  cxl: Reset freeze counters before adapter PERST for flashing new image

 arch/powerpc/include/asm/eeh.h   | 11 ++++++++-
 arch/powerpc/kernel/eeh_driver.c | 20 +++-------------
 arch/powerpc/kernel/eeh_pe.c     | 50 ++++++++++++++++++++++++++--------------
 drivers/misc/cxl/pci.c           | 15 ++++++++++++
 4 files changed, 61 insertions(+), 35 deletions(-)

-- 
2.9.3