Interrupt remapping IO_PAGE_FAULT events have been observed on systems with a large number of VMs using pass-through devices. The issue can be reproduced with 64 VMs and 64 pass-through VFs of a Mellanox MT28800 Family [ConnectX-5 Ex] device, where each VM runs a small-packet netperf test via its pass-through device to the netserver running on the host. All VMs run in a reboot loop to trigger IRTE updates.

In addition, to accelerate the failure, irqbalance is triggered periodically (e.g. every 1-5 seconds), which should generate a large number of IRTE updates. This setup generally triggers IO_PAGE_FAULT within 3-4 hours.

Investigation has shown that the issue is in the code that updates the IRTE while interrupt remapping is enabled. Please see patch 2/2 for a detailed discussion.

This series has been tested in the setup described above for up to 96 hours without seeing the issue.

Thanks,
Suravee

Suravee Suthikulpanit (2):
  iommu: amd: Restore IRTE.RemapEn bit after programming IRTE
  iommu: amd: Use cmpxchg_double() when updating 128-bit IRTE

 drivers/iommu/amd/Kconfig |  2 +-
 drivers/iommu/amd/init.c  | 21 +++++++++++++++++++--
 drivers/iommu/amd/iommu.c | 19 +++++++++++++++----
 3 files changed, 35 insertions(+), 7 deletions(-)

-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu