Re: Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?

2020-04-09 Thread Clemens Eisserer
Hi Someguy,

I've been running with accelmethod=none and llvmpipe for opengl now
for over a week (more or less using only the display engine of my
rx570) and haven't experienced a single MCE during that period.
However, statistically, it will take 1-2 additional weeks to be sure
this is not a coincidence.
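
As a rough sketch of that reasoning (an illustration, not a rigorous
analysis): assuming the reboots arrive independently at the 2-3 per
week reported in the original post of this thread, a Poisson model
gives the chance of seeing a quiet period purely by luck. The function
name and the Poisson assumption are illustrative, not taken from the
thread.

```python
import math

def prob_no_events(rate_per_week, weeks):
    """Probability of zero events in `weeks` for a Poisson process.

    Assumes events are independent and occur at a constant average
    rate, which is only an approximation for real crash data.
    """
    return math.exp(-rate_per_week * weeks)

if __name__ == "__main__":
    # Rates of 2 and 3 per week bracket the "2-3x a week" observation.
    for rate in (2, 3):
        for weeks in (1, 2, 3):
            p = prob_no_events(rate, weeks)
            print(f"rate={rate}/week, {weeks} quiet week(s): p={p:.4f}")
```

The longer the quiet period, the smaller the probability that it is a
coincidence, which is why a few more weeks of observation help.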

This seems to confirm your observations (as well as the reports in the
reddit thread) that the GPU plays at least some role when it
comes to the MCE-caused reboots.

@John: Could it be the 3rd gen ryzen has some issues which are covered
by windows driver workarounds?
Are CPU errata data sheets publicly available?

Best regards, Clemens

PS: While it does seem to solve the reboot issue, it causes degraded
desktop performance as well as tearing when playing videos.
So I still hope I don't have to live with this forever.

PS2: AMD support responded after some delay, stating that the MCE is
actually caused by the watchdog timer.

On Sun, Apr 5, 2020 at 04:41, someguy108 wrote:
>
> How has it been going since you limited hardware acceleration? Any 
> improvement?
>
> On Thu, Apr 2, 2020, 11:13 PM Clemens Eisserer  wrote:
>>
>> Hi,
>>
>> I also had the impression this issue is triggered by power-state changes.
>> First I suspected CPU power state transitions, but now more and more
>> reports pop up mentioning that exchanging the GPU solved the issue (or in
>> your case it started when switching to an AMD GPU).
>> However, I've already tried to limit power states using the
>> /sys interface as well as the custom feature masks suggested in
>> the kernel bug report.
>>
>> Your theory regarding firefox and compositing might be right, so this
>> is what I did:
>> - Switch from Glamor to "none" acceleration for Xorg (glamor
>> translates 2d drawing commands to OpenGL)
>> - Switch Firefox from WebRender to Software-Rendering
>> - Disable KDE composition manager
>>
>> Only time will tell...
>>
>> Best regards, Clemens
>>
>>
>> On Fri, Apr 3, 2020 at 03:39, someguy108 wrote:
>> >
>> > Hi Clemens, I responded to that bug report with my findings! I do
>> > wonder since yours is happening at desktop, do you have compositing
>> > enabled with your window manager? If my theory, as noted just based
>> > off my limited understanding and observations, is correct about it
>> > revolving around power states, having Firefox playing a video with
>> > hardware acceleration enabled and desktop compositing could be causing
>> > back and forth swings with power states. Which could be the culprit
>> > with the hangs.
>> >
>> > I also formed some of my theory from the long debacle of AMD's Windows
>> > driver quality. As these hangs sound awfully similar to what Windows
>> > users have been enduring for almost all of 2019 and some parts of
>> > 2020. As noted, I had TDRs in Windows while alt-tabbing with
>> > games until I disabled hardware acceleration in Google Chrome. And for
>> > good measure I disabled most animations and compositing effects in the
>> > Windows UI (no more shadows under windows and such), as I like to play
>> > most of my games in a window. Though I do know some are Navi
>> > specific.
>> >
>> > On Thu, Apr 2, 2020 at 7:12 AM Clemens Eisserer  
>> > wrote:
>> > >
>> > > Hi Someguy,
>> > >
>> > > Your findings sound very familiar: my machine is also rock-solid
>> > > running Windows 10. Most of the MCEs happened for me in low-load
>> > > situations, but with Firefox playing YouTube in the background.
>> > > At first I didn't care that much, but now, having experienced corrupted
>> > > Firefox profiles and lost spreadsheet work, it is starting to get annoying.
>> > >
>> > > I've filed a kernel bug regarding this issue:
>> > > https://bugzilla.kernel.org/show_bug.cgi?id=206903
>> > > I would appreciate if you could report your findings there to give the
>> > > issue more data / weight.
>> > >
>> > > Thanks, Clemens
>> > >
>> > > On Thu, Apr 2, 2020 at 15:11, someguy108 wrote:
>> > > >
>> > > > Hello! I saw Clemens Eisserer's email regarding MCE errors with his RX
>> > > > 570 and 3700x, and I'd like to add to that list of MCE spontaneous
>> > > > reboots as well.
>> > > > This is my configuration:
>> > > > -Ryzen 3900x + Noctua D15
>> > > > -MSI X570 Unify (latest agesa as of w

Re: Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?

2020-04-02 Thread Clemens Eisserer
Hi Someguy,

Your findings sound very familiar: my machine is also rock-solid
running Windows 10. Most of the MCEs happened for me in low-load
situations, but with Firefox playing YouTube in the background.
At first I didn't care that much, but now, having experienced corrupted
Firefox profiles and lost spreadsheet work, it is starting to get annoying.

I've filed a kernel bug regarding this issue:
https://bugzilla.kernel.org/show_bug.cgi?id=206903
I would appreciate if you could report your findings there to give the
issue more data / weight.

Thanks, Clemens

On Thu, Apr 2, 2020 at 15:11, someguy108 wrote:
>
> Hello! I saw Clemens Eisserer's email regarding MCE errors with his RX 570 and
> 3700x, and I'd like to add to that list of MCE spontaneous reboots as well.
> This is my configuration:
> -Ryzen 3900x + Noctua D15
> -MSI X570 Unify (latest agesa as of writing)
> -DDR4 3200mhz 32GB kit
> -Sapphire Pulse 5700 XT
> -Corsair RMX 850 Watt
> -Arch Linux with kernel 5.5.13
> -Mesa 20.0.3
> -Early KMS enabled
>
> I've had this system up and running since November 2019 but initially with a 
> Nvidia 1060 and Windows 10. Everything was running smoothly. About a month 
> ago I switched back over to Linux after purchasing my 5700 XT as my initial 
> plan was to go back to Linux. Since returning I've experienced multiple 
> spontaneous MCE reboots. All happened while I was playing one particular 
> game, Warcraft 3 Reforged. The MCE event is the following:
>
> kernel: mce: [Hardware Error]: Machine check events logged
> kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5: 
> bea00108
> kernel: mce: [Hardware Error]: TSC 0 ADDR 1ad66d6fe MISC d0120001 
> SYND 4d00 IPID 500b0
> kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1585120217 SOCKET 0 
> APIC 2 microcode 8701013
> kernel: #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15
> kernel: mce: [Hardware Error]: Machine check events logged
> kernel: mce: [Hardware Error]: CPU 15: Machine Check: 0 Bank 5: 
> bea00108
> kernel: mce: [Hardware Error]: TSC 0 ADDR 1c1196eb6 MISC d0120001 
> SYND 4d00 IPID 500b0
> kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1585120217 SOCKET 0 
> APIC 9 microcode 8701013
> kernel: #16 #17 #18 #19 #20 #21 #22 #23
>
> Initially I figured it could be RAM, so I ran the usual tests with no
> problems. I also tested with standard JEDEC settings and eventually received
> an MCE during Warcraft 3 Reforged. After consulting with a few friends I
> decided to try a different power supply, to no avail. I then bit the bullet
> and bought a brand new 3900x. I also cleared CMOS before and after installing
> the new 3900x. All CPU values are on auto with no PBO or manual overclocking.
> The only fancy part is the RAM. Yesterday, after owning the new 3900x for
> three days, I had an MCE while I was playing Warcraft 3 Reforged. I have
> tested other games, but none of them caused an MCE or any crashes / freezes
> for that matter: World of Warcraft, The Outer Worlds, Stellaris, and
> Counter-Strike: Global Offensive.
>
> As with Clemens, and using the same decoder he used (MCE-Ryzen-Decoder
> from GitHub), the MCE is reported as the following:
> Bank: Execution Unit (EX)
> Error: Watchdog Timeout error (WDT 0x0)
>
> One thing to note is that I haven't received it during desktop usage, only in
> Warcraft 3. I have desktop compositing disabled in both Xfce and KDE, and
> always have. I used and tested both, and received the MCEs in sessions with
> each. I have noticed a pattern with the MCE crashes in Warcraft 3: they
> always happen during a transition where the GPU load drops off or increases.
> By that I mean when exiting a match to return to the lobby, or when loading
> a map and it switches from the loading screen to the match itself. The entire
> screen quickly turns black, everything is hard locked, and then after about a
> minute or so the machine reboots on its own. It hasn't happened yet in the
> middle of a match session, sitting in the lobby, or at the main menu screen.
> It's consistently been during a transition. My theory is that this could be a
> GPU hang when switching from one power state to another. The GPU hanging
> causes the CPU to stall, and thus an MCE. The GPU hanging could also explain
> the quick solid black screen, as all output is stopped. But I'm really just
> assuming here from my own observations and my limited understanding. A
> possible reason why this triggers in Warcraft 3 is that the other games have
> few moments of heavy power-state switching. The Outer Worlds, World of
> Warcraft, Stellaris, and Counter-Strike: Global Offensive
> all

Re: Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?

2020-03-21 Thread Clemens Eisserer
Hi John,

> >I know RX570 (polaris) should stay at PCI3 as far as I know.
>
> Yep... thought I remembered you mentioning having a 5700XT though... is that 
> in a different system ?

I am using an RX570; the guy from reddit changed from an R600 to a 5700XT
and it seems it did solve his reboot problems.

As the system is rock solid with windows-10 and others seem to
experience similar behaviour I've decided to file a bug at the
kernel's bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=206903

Thanks for your suggestions & best regards, Clemens


>
> ________
> From: Clemens Eisserer 
> Sent: March 9, 2020 2:30 AM
> To: Bridgman, John ; amd-gfx@lists.freedesktop.org 
> 
> Subject: Re: Possibility of RX570 responsible for spontaneous reboots (MCE) 
> with Ryzen 3700x?
>
> Hi John,
>
> Thanks a lot for taking the time to look at this, even if it doesn't
> seem to be GPU related at first.
>
> > OK, that's a bit strange... I found mce log and MCE-Ryzen-Decoder as 
> > options for decoding.
> Sorry for omitting that information - indeed I was using
> MCE-Ryzen-Decoder, thanks for pointing to mcelog.
> The mcelog output definitely makes more sense, I'll try to
> experiment a bit with RAM.
>
> Thanks also for the link to the forum; it seems that, of all the affected
> users, no one reported success in that thread.
>
> > For something as simple as the GPU bus interface not responding to an access
> > by the CPU I think you would get a different error (bus error) but not 100% 
> > sure about that.
> >
> > My first thought would be to see if your mobo BIOS has an option to force 
> > PCIE
> > gen3 instead of 4 and see if that makes a difference. There are some amdgpu 
> > module parms
> > related to PCIE as well but I'm not sure which ones to recommend.
>
> I'll give it a try and have a look at the PCIe options - but as far as
> I know the RX570 (Polaris) should stay at PCIe gen3 anyway.
> Disabling IOMMU didn't help as far as I recall.
>
> Thanks & best regards, Clemens
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?

2020-03-08 Thread Clemens Eisserer
Hi John,

Thanks a lot for taking the time to look at this, even if it doesn't
seem to be GPU related at first.

> OK, that's a bit strange... I found mce log and MCE-Ryzen-Decoder as options 
> for decoding.
Sorry for omitting that information - indeed I was using
MCE-Ryzen-Decoder, thanks for pointing to mcelog.
The mcelog output definitely makes more sense, I'll try to
experiment a bit with RAM.

Thanks also for the link to the forum; it seems that, of all the affected
users, no one reported success in that thread.

> For something as simple as the GPU bus interface not responding to an access
> by the CPU I think you would get a different error (bus error) but not 100% 
> sure about that.
>
> My first thought would be to see if your mobo BIOS has an option to force PCIE
> gen3 instead of 4 and see if that makes a difference. There are some amdgpu 
> module parms
> related to PCIE as well but I'm not sure which ones to recommend.

I'll give it a try and have a look at the PCIe options - but as far as
I know the RX570 (Polaris) should stay at PCIe gen3 anyway.
Disabling IOMMU didn't help as far as I recall.

Thanks & best regards, Clemens


Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?

2020-03-08 Thread Clemens Eisserer
Hi there,

Right after Ryzen3xxx was available I built a new system consisting of:
- Asrock Phantom Gaming 4 X570 (latest BIOS 2.3)
- Ryzen 3700x (not overclocked)
- MSI RX570 4GB
- Larger CPU cooler, high quality PSU, etc...

The system runs stable with Windows 10 (no reboot or BSOD in months) and
runs memtest86 (single/multicore) as well as various load-tests for
hours without errors. However running Linux I get a spontaneous reboot
every now and then (2-3x a week), with always the same machine check
exception logged:

[0.105003]  node  #0, CPUs:#1  #2
[0.107022] mce: [Hardware Error]: Machine check events logged
[0.107023] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5:
bea00108
[0.107092] mce: [Hardware Error]: TSC 0 ADDR 7f80a0c0181a MISC
d0120001 SYND 4d00 IPID 500b0
[0.107167] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
1580717835 SOCKET 0 APIC 4 microcode 8701013
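
For anyone collecting several of these events, the fields can be pulled
out of the log text with a few lines of Python. This is only a sketch
under assumptions: the helper name parse_mce is made up, the sample
lines are the ones above with the mail wrapping undone, and the hex
values are kept as opaque strings because they appear shortened in
this archive. It does not decode the status bits.

```python
import re
from datetime import datetime, timezone

# Sample lines copied from the log above, with the mail line wraps undone.
SAMPLE = """\
[0.107023] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea00108
[0.107092] mce: [Hardware Error]: TSC 0 ADDR 7f80a0c0181a MISC d0120001 SYND 4d00 IPID 500b0
[0.107167] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1580717835 SOCKET 0 APIC 4 microcode 8701013
"""

HEADER = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]+)")
FIELD = re.compile(
    r"\b(TSC|ADDR|MISC|SYND|IPID|TIME|SOCKET|APIC|microcode) ([0-9a-f]+)"
)

def parse_mce(text):
    """Collect the fields of one logged MCE event into a dict of strings."""
    record = {}
    for line in text.splitlines():
        header = HEADER.search(line)
        if header:
            record["cpu"] = int(header.group(1))
            record["bank"] = int(header.group(2))
            record["status"] = header.group(3)
        for key, value in FIELD.findall(line):
            record[key.lower()] = value
    if "time" in record:
        # TIME is a Unix timestamp in seconds; convert it to UTC.
        stamp = datetime.fromtimestamp(int(record["time"]), tz=timezone.utc)
        record["time_utc"] = stamp.isoformat()
    return record

if __name__ == "__main__":
    for key, value in parse_mce(SAMPLE).items():
        print(f"{key}: {value}")
```

Comparing the bank, status and time fields across events makes it
easier to see whether every reboot logs the same error.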

I've tried a lot of different CPU-related things, like disabling C6,
disabling MWAIT use for task switching, etc without success.
I tried two times to contact AMD support, only asking them to please
decode the MCE hex value - but as soon as they read the term
"linux" they basically abort any communication. And to be honest, I had
the impression that they did not actually know what an MCE is in the
first place.

Luckily I found a decoder on github which prints:
Bank: Execution Unit (EX)
Error: Watchdog Timeout error (WDT 0x0)

I was rather hopeless until I found the following reddit thread:
https://www.reddit.com/r/archlinux/comments/e33nyg/hard_reboots_with_ryzen_3600x/

The user there claims to experience exactly the same problem (even
with the same MCE code logged) but was using an R600-based graphics
card - he is even using the same mainboard. When he swapped his
R600 card for a new RX5700 the problems vanished.

I don't have the luxury to simply try another GPU (my RX570 is the
only one properly driving my 4k@60Hz panel), however the whole
observation makes me wonder: how can a GPU be responsible for
low-level errors such as the machine check exception in the execution
units like the one mentioned above?
Could DMA transfers gone bad be the culprit?
Are there any "safe mode" options available I could try regarding
amdgpu (I tried disabling low-power states but this didn't help and
only made my GPU fans spin up)?

Any help is highly appreciated.

Thanks, Clemens


Re: atombios stuck executing D850 when trying to switch to 4k@60Hz on Polaris10

2019-08-22 Thread Clemens Eisserer
Hi Alex,

> You need to use dc to support 4k@60.  Remove amdgpu.dc=0 from the kernel 
> command line in grub.

Thanks a lot!
Indeed, it worked immediately after removing those left-overs from my
kaveri-based system.
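
For others who carried kernel parameters over from an older system, a
quick check for this left-over could look like the sketch below. The
function names are made up for illustration; the only facts taken from
this thread are the amdgpu.dc=0 parameter and that removing it fixed
4k@60. Reading /proc/cmdline assumes a Linux system.

```python
def amdgpu_dc_disabled(cmdline):
    """Return True if the kernel command line contains amdgpu.dc=0."""
    return "amdgpu.dc=0" in cmdline.split()

def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read()
```

Calling amdgpu_dc_disabled(read_cmdline()) on the affected machine
should return True as long as the parameter is still set in grub.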

Br, Clemens

atombios stuck executing D850 when trying to switch to 4k@60Hz on Polaris10

2019-08-22 Thread Clemens Eisserer
Hi there,

I am trying to connect an LG 32UD59 UHD monitor to an MSI Armor RX570 4G
card via a (cheap) certified HDMI2 cable.
Unfortunately the setup only runs at 30Hz, whereas when booting
Windows it automatically selects 3840x2160@59Hz.

I played a bit with adding the modelines manually, however when
enabling those new modes the screen goes black and in syslog I find
the following entries:

[  571.174813] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in
loop for more than 5secs aborting
[  571.174862] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR*
atombios stuck executing D850 (len 824, WS 0, PS 0) @ 0xD992
[  571.174908] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR*
atombios stuck executing D70A (len 326, WS 0, PS 0) @ 0xD7A6

Xorg.0.log: https://pastebin.com/LmZ0bvyL
kernel log: https://pastebin.com/rXGVMTnV

Help would be really appreciated - I am rather latency sensitive and
those 30Hz are driving me nuts ;)

Best regards, Clemens