>>> On 13.04.15 at 13:47, <m...@redhat.com> wrote:
> On Mon, Apr 13, 2015 at 12:34:34PM +0100, Jan Beulich wrote:
>> >>> On 13.04.15 at 13:19, <m...@redhat.com> wrote:
>> > Yes, Linux can't fix firmware-first mode, but the PCI Express spec
>> > says what firmware should do in this case:
>> >
>> > IMPLEMENTATION NOTE
>> > Software UR Reporting Compatibility with 1.0a Devices
>> >
>> > With 1.0a device Functions, if the Unsupported Request Reporting
>> > Enable bit is set, the Function when operating as a Completer will
>> > send an uncorrectable error Message (if enabled) when a UR error is
>> > detected. On platforms where an uncorrectable error Message is
>> > handled as a System Error, this will break PC-compatible
>> > Configuration Space probing, so software/firmware on such platforms
>> > may need to avoid setting the Unsupported Request Reporting Enable
>> > bit.
>> >
>> > With device Functions implementing Role-Based Error Reporting,
>> > setting the Unsupported Request Reporting Enable bit will not
>> > interfere with PC-compatible Configuration Space probing, assuming
>> > that the severity for UR is left at its default of non-fatal.
>> > However, setting the Unsupported Request Reporting Enable bit will
>> > enable the Function to report UR errors detected with posted
>> > Requests, helping avoid this case for potential silent data
>> > corruption.
>> >
>> > On platforms where robust error handling and PC-compatible
>> > Configuration Space probing is required, it is suggested that
>> > software or firmware have the Unsupported Request Reporting Enable
>> > bit Set for Role-Based Error Reporting Functions, but clear for
>> > 1.0a Functions. Software or firmware can distinguish the two
>> > classes of Functions by examining the Role-Based Error Reporting
>> > bit in the Device Capabilities register.
>> >
>> > What I think you have is a very old 1.0a system, and you set
>> > Unsupported Request Reporting Enable.
>> >
>> > Can you confirm?
>>
>> No. In at least one of the two cases in which we got reports of the
>> original problem (which led to finding this issue), this is a
>> brand-new system, only soon to become publicly available.
>> Furthermore, I'm confused by the mention of PC-compatible config
>> space probing above: the URs we are talking about here don't result
>> from config space accesses at all.
>
> OK. Can you please explain why a UR causes a system error, then?
> It looks like a hardware bug: PCIe 1.1 seems to say it shouldn't.
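The URRE policy suggested by the implementation note quoted above can be sketched in C roughly as follows. This is a minimal sketch, not code from Xen or Linux. It assumes the caller has already read the 32-bit Device Capabilities and 16-bit Device Control registers from the Function's PCI Express Capability. `func_has_rber` and `devctl_apply_ur_policy` are hypothetical names. The offsets and bit values match those in Linux's `include/uapi/linux/pci_regs.h`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Offsets are relative to the PCI Express Capability structure;
 * values match Linux's include/uapi/linux/pci_regs.h. */
#define PCI_EXP_DEVCAP       0x04        /* Device Capabilities (32-bit) */
#define PCI_EXP_DEVCAP_RBER  0x00008000u /* bit 15: Role-Based Error Reporting */
#define PCI_EXP_DEVCTL       0x08        /* Device Control (16-bit) */
#define PCI_EXP_DEVCTL_URRE  0x0008u     /* bit 3: Unsupported Request Reporting Enable */

/* True if the Function implements Role-Based Error Reporting,
 * i.e. it is not a 1.0a-style Function. */
static bool func_has_rber(uint32_t devcap)
{
    return (devcap & PCI_EXP_DEVCAP_RBER) != 0;
}

/* Hypothetical helper: return the Device Control value to write back,
 * with URRE set for RBER Functions and cleared for 1.0a Functions.
 * All other Device Control bits are preserved unchanged. */
static uint16_t devctl_apply_ur_policy(uint32_t devcap, uint16_t devctl)
{
    if (func_has_rber(devcap))
        return (uint16_t)(devctl | PCI_EXP_DEVCTL_URRE);
    return (uint16_t)(devctl & ~PCI_EXP_DEVCTL_URRE);
}
```

A caller would read the two registers at the offsets above using whatever config space accessor its environment provides, then write the returned value back to Device Control. Keeping the policy as a pure function of the two register values makes it easy to check in isolation.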
Quite possible. The ITP log we were provided shows the UR severity bit
clear (non-fatal), yet the error was surfaced to the OS as a fatal one
(I would guess because it validly gets flagged as uncorrectable at the
same time).

Jan
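The mismatch Jan describes can be expressed as a small check on the AER Uncorrectable Error Severity register. In that register, a set bit means the corresponding error is reported as fatal, and a clear bit means non-fatal. This is a minimal sketch under stated assumptions: the severity register value is taken as already captured (for example, from a log like the ITP one mentioned), the "reported" severity is whatever the OS was told. `ur_expected_msg`, `ur_severity_mismatch` and `enum aer_msg` are hypothetical names. The offset and UR bit value match Linux's `pci_regs.h`.

```c
#include <stdbool.h>
#include <stdint.h>

/* AER capability register offset and UR bit; values match Linux's
 * include/uapi/linux/pci_regs.h. */
#define PCI_ERR_UNCOR_SEVER  0x0c        /* Uncorrectable Error Severity */
#define PCI_ERR_UNC_UNSUP    0x00100000u /* bit 20: Unsupported Request */

/* Hypothetical classification of the uncorrectable error Message. */
enum aer_msg { AER_MSG_NONFATAL, AER_MSG_FATAL };

/* Severity a UR should be reported with, per the Severity register:
 * bit set = fatal, bit clear = non-fatal (the default for UR). */
static enum aer_msg ur_expected_msg(uint32_t uncor_sever)
{
    return (uncor_sever & PCI_ERR_UNC_UNSUP) ? AER_MSG_FATAL : AER_MSG_NONFATAL;
}

/* Hypothetical consistency check: true when the severity the OS was
 * given disagrees with what the UR severity bit is programmed to. */
static bool ur_severity_mismatch(uint32_t uncor_sever, enum aer_msg reported)
{
    return ur_expected_msg(uncor_sever) != reported;
}
```

With the UR severity bit clear and the error reported as fatal, `ur_severity_mismatch` returns true. That is the situation described above.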