[Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]

2018-12-03 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #16 from Christian König (christian.koe...@amd.com) ---
(In reply to Dave Airlie from comment #14)
> Should we at least push this patch to improve resilience a little?

We could, but I don't see much value in that. E.g. we would need to code the
software in a way which also works if the hardware is damaged.

That is possible, but I grepped a bit over the source and in this particular
case we would need to manually audit 2201 register accesses so that they also
work when the hardware suddenly goes up in flames.

That is totally unrealistic and just fixing this one case doesn't give us
much.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


[Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]

2018-12-03 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=198669

ro...@beardandsandals.co.uk (ro...@beardandsandals.co.uk) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |OBSOLETE



[Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]

2018-12-03 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #15 from ro...@beardandsandals.co.uk (ro...@beardandsandals.co.uk) ---
For information: I eventually tracked the hardware fault to bad solder flow in
the area of the DVI-D socket. I still stand by my original comments about
usability. To me, a recovery process that leaves 99.9% of end users clueless
about how to safely restart their system is not a good outcome from an
end-user perspective. This is my last word on this topic.



[Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]

2018-12-02 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=198669

Dave Airlie (airl...@linux.ie) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |airl...@linux.ie

--- Comment #14 from Dave Airlie (airl...@linux.ie) ---
Should we at least push this patch to improve resilience a little?



[Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]

2018-02-07 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #13 from ro...@beardandsandals.co.uk (ro...@beardandsandals.co.uk) ---
You can ignore comment 11. I thought the email reply had not worked, so I
posted a revised version directly. Comment 10 is the correct one.



[Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]

2018-02-07 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #12 from ro...@beardandsandals.co.uk (ro...@beardandsandals.co.uk) ---
On 7 February 2018 08:23:06 bugzilla-dae...@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=198669
>
> --- Comment #10 from Christian König (christian.koe...@amd.com) ---
> (In reply to ro...@beardandsandals.co.uk from comment #9)
>> The most likely cause of this kind of mechanical issue is the signal path
>> between the video interface hardware and the outside world, either a dry
>> joint or a mechanical fault in the cable or cable connectors.
>
> That is what I absolutely agree about.
>
>> The driver has sufficient
> information to determine that a hard failure has occurred, and that failure
>> is probably not in the gpu itself. I would like to see the driver doing a
>> hard reset of the card with rigorous error checking. If it cannot reset the
>> GPU in graphical mode it should try to set the display hardware into a basic
>> console mode.
>
> And that is the part you don't seem to understand. The driver is trying
> exactly what you are describing.
>
> We detect a problem because of a timeout, i.e. the hardware doesn't respond
> within a given time frame to commands we send to it.
>
> What we do then is query the hardware for how far we proceeded in the
> execution, and the hardware answered with a nonsense value. In other words,
> bits are set in the response which should never be set.
>
> This is a clear indicator that the PCIe transaction for the register read
> aborted because the device doesn't respond any more.
>
> The most likely cause of that is that the bus interface in the ASIC locked up
> because of an electrical problem (I think the ESD protection kicked in) and
> the only way to get out of that is a hard reset of the system.
>
> What we can try to do is prevent further failures like the crash you
> described by checking the values read from the hardware. This way you can at
> least access the box over the network or blindly shut it down with keyboard
> shortcuts.


Yes, I take your point. I was speculating on insufficient information. My 
apologies. The solution you propose sounds great.

Thank you for your patience.



[Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]

2018-02-07 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #11 from ro...@beardandsandals.co.uk (ro...@beardandsandals.co.uk) ---
Yes, I take your point. I was speculating on insufficient information. My
apologies.

The solution you propose is essentially what I have already been doing.
Logging in over the network already works with the unpatched driver. I have not
had any luck with keyboard shortcuts. It looks like xwayland/xserver does not
know that a problem has occurred and still has hold of the keyboard and
mouse. This is an obscure problem and probably not worth spending much time on,
especially as I no longer seem to be able to reproduce it!


Thank you for your patience.



[Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]

2018-02-07 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #10 from Christian König (christian.koe...@amd.com) ---
(In reply to ro...@beardandsandals.co.uk from comment #9)
> The most likely cause of this kind of mechanical issue is the signal path
> between the video interface hardware and the outside world, either a dry
> joint or a mechanical fault in the cable or cable connectors.

That is what I absolutely agree about.

> The driver has sufficient
> information to determine that a hard failure has occurred, and that failure
> is probably not in the gpu itself. I would like to see the driver doing a
> hard reset of the card with rigorous error checking. If it cannot reset the
> GPU in graphical mode it should try to set the display hardware into a basic
> console mode.

And that is the part you don't seem to understand. The driver is trying exactly
what you are describing.

We detect a problem because of a timeout, i.e. the hardware doesn't respond
within a given time frame to commands we send to it.

What we do then is query the hardware for how far we proceeded in the
execution, and the hardware answered with a nonsense value. In other words,
bits are set in the response which should never be set.

This is a clear indicator that the PCIe transaction for the register read
aborted because the device doesn't respond any more.

The most likely cause of that is that the bus interface in the ASIC locked up
because of an electrical problem (I think the ESD protection kicked in) and the
only way to get out of that is a hard reset of the system.

What we can try to do is prevent further failures like the crash you
described by checking the values read from the hardware. This way you can at
least access the box over the network or blindly shut it down with keyboard
shortcuts.
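
As an illustration of "checking the values read from the hardware", here is a
minimal user-space sketch in C. It is not taken from the radeon driver: the
helper names and the idea of a per-register valid-bit mask are assumptions
made for this example. It relies on the common behaviour that a PCIe read
from a device that no longer answers comes back as all ones.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value commonly seen when a PCIe read is aborted because the device
 * no longer responds. Treating it as "dead" is a heuristic. */
#define ALL_ONES_U32 0xFFFFFFFFu

/* Hypothetical helper: true if the value looks like an aborted read. */
static bool read_looks_dead(uint32_t value)
{
    return value == ALL_ONES_U32;
}

/* Hypothetical helper: true if the value is not an aborted read and has
 * no bits set outside valid_mask (bits "which should never be set"). */
static bool read_is_plausible(uint32_t value, uint32_t valid_mask)
{
    return !read_looks_dead(value) && (value & ~valid_mask) == 0;
}
```

A caller would test each value before using it as an index or a size, and
give up on recovery instead of continuing with garbage.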



[Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]

2018-02-06 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #9 from ro...@beardandsandals.co.uk (ro...@beardandsandals.co.uk) ---
I think we have to agree to differ on this one. You seem to be focussing on the
software interface between the GPU and the driver.

What follows is my personal opinion.

The most likely cause of this kind of mechanical issue is the signal path
between the video interface hardware and the outside world, either a dry joint
or a mechanical fault in the cable or cable connectors. I can only reiterate
what I said in my previous post. The driver has sufficient information to
determine that a hard failure has occurred, and that failure is probably not in
the gpu itself. I would like to see the driver doing a hard reset of the card
with rigorous error checking. If it cannot reset the GPU in graphical mode it
should try to set the display hardware into a basic console mode.



[Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]

2018-02-06 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #8 from Christian König (christian.koe...@amd.com) ---
(In reply to ro...@beardandsandals.co.uk from comment #7)
> The original point I made in the bug report was that this bug is not about
> the mechanical hardware glitch. It is about the driver being in what is
> obviously a failure mode and attempting a recovery that fails and leaves the
> system in an unusable state.

You are missing the point. The driver fails to recover because the hardware is
buggy and not because there is any problem with the recovery routine.

In other words we read back an impossible value from the hardware and that is
why the system is failing.

I mean I can handle this impossible value at this code location, but as you
actually figured out by yourself it then fails at the next best location.

There are simply hundreds or even thousands of locations where the assumption
is that the hardware works correctly and we don't handle the case of getting
nonsense values.



[Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]

2018-02-06 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #7 from ro...@beardandsandals.co.uk (ro...@beardandsandals.co.uk) ---
The original point I made in the bug report was that this bug is not about the
mechanical hardware glitch. It is about the driver being in what is obviously a
failure mode and attempting a recovery that fails and leaves the system in
an unusable state. The error recovery paths of any driver should be its most
resilient components, especially when the driver controls part of the system's
primary user interface.

To pose another question: why, when the driver has the information to tell it
that the GPU is irrevocably stalled, does it attempt a soft restart and leave
the system in an unusable state?



[Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]

2018-02-06 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #6 from Christian König (christian.koe...@amd.com) ---
Well the issue is triggered by the driver reading nonsense values from the
hardware.

E.g. we ask the hardware what the last good position on a 16k ring buffer is
and get 0x as result (or something like this) which obviously can't be
correct.

My patch mitigated that by clamping the value to a valid range, but if you read
nonsense values from the hardware because the hardware has a loose connection
and acts strangely under vibration, then I basically can't guarantee anything.
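
To illustrate forcing a reported ring position into a valid range, here is a
minimal user-space sketch in C. It is not the attached patch, whose contents
are not shown in this thread. The struct, the function names, the use of
masking instead of true clamping, and the 0xFFFFFFFF test value are all
assumptions made for the example. A 16k ring holds 4096 32-bit words.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's ring state; not the radeon struct. */
struct ring_sketch {
    uint32_t size_dw;   /* ring size in 32-bit words, must be a power of two */
    uint32_t ptr_mask;  /* size_dw - 1 */
};

/* Wrap a read pointer reported by the hardware into [0, size_dw). */
static uint32_t sanitize_rptr(const struct ring_sketch *ring, uint32_t raw_rptr)
{
    /* Masking is only a valid wrap for power-of-two ring sizes. */
    assert(ring->size_dw != 0 && (ring->size_dw & (ring->size_dw - 1)) == 0);
    return raw_rptr & ring->ptr_mask;
}

/* Number of words between rptr and wptr that a ring backup would copy.
 * Both pointers are expected to be in range already. */
static uint32_t pending_words(const struct ring_sketch *ring,
                              uint32_t rptr, uint32_t wptr)
{
    return (wptr + ring->size_dw - rptr) & ring->ptr_mask;
}
```

Sanitizing keeps the array index inside the ring, so the copy no longer
faults, but the copied data is still meaningless if the pointer was garbage.
That matches the point made above: the crash moves, it does not go away.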



[Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]

2018-02-06 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #5 from ro...@beardandsandals.co.uk (ro...@beardandsandals.co.uk) ---
My best guess is the error came from

r600.c:2848:DRM_ERROR("radeon: ring %d test failed (scratch(0x%04X)=0x%08X)\n",

I cannot reproduce the mechanical hardware failure. I don't want to clobber the
system any harder and risk damaging a disk.

I assume this is being called from the GPU reset path.



[Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]

2018-02-05 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #4 from ro...@beardandsandals.co.uk (ro...@beardandsandals.co.uk) ---
Well, it moved the problem. It crashed somewhere else in the driver with some
message about scratch. Sorry, I cannot tell you what it was because I screwed up
the save of the kernel message buffer, and now I cannot get the thing to
glitch again. My normal method of stamping on the floor next to the system is
not working. I think I might have overdone it and now the thing has bedded in.
Going to leave it powered off overnight and try again in the morning.



[Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]

2018-02-05 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #3 from Christian König (christian.koe...@amd.com) ---
Created attachment 274001
  --> https://bugzilla.kernel.org/attachment.cgi?id=274001&action=edit
Possible fix

The attached patch is a shot in the dark, but please give it a try.



[Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]

2018-02-04 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #2 from ro...@beardandsandals.co.uk (ro...@beardandsandals.co.uk) ---
Looking at the debug files.

radeon_ring_backup resolves to 0x33430, so +0xd3 is 0x33503.

The line info gives this:

radeon_ring.c    323    0x334f4
radeon_ring.c    324    0x33508
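
The lookup can be sketched in C as follows. The function base 0x33430, the
offset 0xd3 from the bug title, and the two line-table rows come from this
thread. The struct and function names are hypothetical, and the two-row table
is only an excerpt, so the last row has no upper bound here.

```c
#include <stddef.h>
#include <stdint.h>

/* One row of a line table: first address belonging to a source line. */
struct line_entry {
    uint32_t addr;
    int line;
};

/* Return the source line whose address range contains addr, or -1 if addr
 * lies before the first entry. The table must be sorted by ascending addr. */
static int line_for_addr(const struct line_entry *table, size_t n,
                         uint32_t addr)
{
    int line = -1;

    for (size_t i = 0; i < n && table[i].addr <= addr; i++)
        line = table[i].line;
    return line;
}
```

With these numbers, 0x33430 + 0xd3 = 0x33503, which lies between 0x334f4 and
0x33508, so the faulting instruction belongs to radeon_ring.c line 323.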



[Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]

2018-02-04 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=198669

Christian König (christian.koe...@amd.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |christian.koe...@amd.com

--- Comment #1 from Christian König (christian.koe...@amd.com) ---
What does radeon_ring_backup+0xd3 resolve to on your system?



[Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]

2018-02-04 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=198669

ro...@beardandsandals.co.uk (ro...@beardandsandals.co.uk) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                URL|                            |https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1746232
