Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline
On Tue 2020-09-01 13:57:55, Harald Arnesen wrote: > Still (rc3) doesn't work without the three reverts. > > I'm not sure how to proceed, I cannot capture any oops, and see nothing > obvious in any logs. I believe this is the point where you ask Linus for reverts... Best regards, Pavel -- (english) http://www.livejournal.com/~pavelmachek (Czech, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline
Still (rc3) doesn't work without the three reverts. I'm not sure how to proceed: I cannot capture any oops, and see nothing obvious in any logs. -- Regards, Harald
Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline
Hi! > >> It's a Thinkpad T520. > > > > Oh, so this is a 64-bit machine? Yeah, that patch to flush vmalloc > > ranges won't make any difference on x86-64. > > > > Or are you for some reason running a 32-bit kernel on that thing? Have > > you tried building a 64-bit one (user-space can be 32-bit, it should > > all just work. Knock wood). > > No, I run a 64-bit kernel with 64-bit userspace (Void Linux). > Config is attached, in case anything is obvious from that. For the record, I'm running 5.9.0-rc2-next-20200825 w/o further patches, and it behaves okay on that 32-bit thinkpad x60. BTW... could we get the test farms to occasionally boot in 32-bit mode? Those modern CPUs can still do that :-). Best regards, Pavel
Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline
On Wed, Aug 26, 2020 at 1:53 PM Harald Arnesen wrote: > > It's a Thinkpad T520. Oh, so this is a 64-bit machine? Yeah, that patch to flush vmalloc ranges won't make any difference on x86-64. Or are you for some reason running a 32-bit kernel on that thing? Have you tried building a 64-bit one (user-space can be 32-bit, it should all just work. Knock wood). Linus
Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline
Dave Airlie [26.08.2020 22:47]: > On Thu, 27 Aug 2020 at 06:44, Harald Arnesen wrote: >> >> Linus Torvalds [26.08.2020 20:04]: >> >> > On Wed, Aug 26, 2020 at 2:30 AM Harald Arnesen wrote: >> >> Somehow related to lightdm or xfce4? However, it is a regression, since >> >> kernel 5.8 works. >> > Yeah, apparently there's something else wrong with the relocation changes >> > too. >> > >> > That said, does that patch at >> > >> > https://lore.kernel.org/intel-gfx/20200821123746.16904-1-j...@8bytes.org/ >> > >> > change things at all? If there are two independent bugs, maybe >> > applying that patch might at least give you an oops that gets saved in >> > the logs? >> > >> > (it might be worth waiting a bit after the machine locks up in case >> > the machine is alive enough to sync logs after a bit.. If ssh works, >> > that's obviously better yet) >> >> No, doesn't help. And I was wrong, ssh does not work at all when the >> display locks up. > > Did you say what hw you had? is it the same hw as Pavel or different? > > Dave. > It's a Thinkpad T520. Output from 'lspci' attached. 
-- 
Regards, Harald

00:00.0 Host bridge: Intel Corporation 2nd Generation Core Processor Family DRAM Controller (rev 09)
00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09)
00:16.0 Communication controller: Intel Corporation 6 Series/C200 Series Chipset Family MEI Controller #1 (rev 04)
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (Lewisville) (rev 04)
00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 04)
00:1b.0 Audio device: Intel Corporation 6 Series/C200 Series Chipset Family High Definition Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 1 (rev b4)
00:1c.1 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 2 (rev b4)
00:1c.3 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 4 (rev b4)
00:1c.4 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 5 (rev b4)
00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation QM67 Express Chipset LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family 6 port Mobile SATA AHCI Controller (rev 04)
00:1f.3 SMBus: Intel Corporation 6 Series/C200 Series Chipset Family SMBus Controller (rev 04)
03:00.0 Network controller: Intel Corporation Centrino Ultimate-N 6300 (rev 35)
0d:00.0 System peripheral: Ricoh Co Ltd PCIe SDXC/MMC Host Controller (rev 08)
0d:00.3 FireWire (IEEE 1394): Ricoh Co Ltd R5C832 PCIe IEEE 1394 Controller (rev 04)
Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline
On Thu, 27 Aug 2020 at 06:44, Harald Arnesen wrote: > > Linus Torvalds [26.08.2020 20:04]: > > > On Wed, Aug 26, 2020 at 2:30 AM Harald Arnesen wrote: > >> Somehow related to lightdm or xfce4? However, it is a regression, since > >> kernel 5.8 works. > > Yeah, apparently there's something else wrong with the relocation changes > > too. > > > > That said, does that patch at > > > > https://lore.kernel.org/intel-gfx/20200821123746.16904-1-j...@8bytes.org/ > > > > change things at all? If there are two independent bugs, maybe > > applying that patch might at least give you an oops that gets saved in > > the logs? > > > > (it might be worth waiting a bit after the machine locks up in case > > the machine is alive enough to sync logs after a bit.. If ssh works, > > that's obviously better yet) > > No, doesn't help. And I was wrong, ssh does not work at all when the > display locks up. Did you say what hw you had? is it the same hw as Pavel or different? Dave.
Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline
Linus Torvalds [26.08.2020 20:04]: > On Wed, Aug 26, 2020 at 2:30 AM Harald Arnesen wrote: >> Somehow related to lightdm or xfce4? However, it is a regression, since >> kernel 5.8 works. > Yeah, apparently there's something else wrong with the relocation changes too. > > That said, does that patch at > > https://lore.kernel.org/intel-gfx/20200821123746.16904-1-j...@8bytes.org/ > > change things at all? If there are two independent bugs, maybe > applying that patch might at least give you an oops that gets saved in > the logs? > > (it might be worth waiting a bit after the machine locks up in case > the machine is alive enough to sync logs after a bit.. If ssh works, > that's obviously better yet) No, doesn't help. And I was wrong, ssh does not work at all when the display locks up. -- Regards, Harald
Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline
On Wed, Aug 26, 2020 at 2:30 AM Harald Arnesen wrote: > > Somehow related to lightdm or xfce4? However, it is a regression, since > kernel 5.8 works. Yeah, apparently there's something else wrong with the relocation changes too. That said, does that patch at https://lore.kernel.org/intel-gfx/20200821123746.16904-1-j...@8bytes.org/ change things at all? If there are two independent bugs, maybe applying that patch might at least give you an oops that gets saved in the logs? (it might be worth waiting a bit after the machine locks up in case the machine is alive enough to sync logs after a bit.. If ssh works, that's obviously better yet) Linus
Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline
Harald Arnesen [26.08.2020 10:36]: > I was wrong about ssh working. The whole machine locks up when X starts. > > One strange thing: sometimes I can log in from lightdm before it locks up, > sometimes I cannot even use the login screen. Timing related? > > If I don't start X, console login seems to work fine, and I see nothing > obvious in the logs or kernel messages. > > I will try to start just a window manager with startx instead of going > through lightdm. Disabled lightdm, started the DE or WM from .xinitrc:
xfce4-session: machine locks up
enlightenment: machine works
Somehow related to lightdm or xfce4? However, it is a regression, since kernel 5.8 works. -- Regards, Harald
Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline
Linus Torvalds [25.08.2020 20:19]: > On Tue, Aug 25, 2020 at 9:32 AM Harald Arnesen wrote: >> >> > For posterity, I'm told the fix is [1]. >> > >> > [1] >> > https://lore.kernel.org/intel-gfx/20200821123746.16904-1-j...@8bytes.org/ >> >> Doesn't fix it for me. As soon as I start XFCE, the mouse and keyboard >> freeze. I can still ssh into the machine. >> >> The three reverts (763fedd6a216, 7ac2d2536dfa and 9e0f9464e2ab) fix >> the bug for me. > > Do you get any oops or other indication of what ends up going wrong? > Since ssh works that should be fairly easy to see. I was wrong about ssh working. The whole machine locks up when X starts. One strange thing: sometimes I can log in from lightdm before it locks up, sometimes I cannot even use the login screen. Timing related? If I don't start X, console login seems to work fine, and I see nothing obvious in the logs or kernel messages. I will try to start just a window manager with startx instead of going through lightdm. -- Regards, Harald
Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline
Linus Torvalds [25.08.2020 20:19]: >> Doesn't fix it for me. As soon as I start XFCE, the mouse and keyboard >> freeze. I can still ssh into the machine. >> >> The three reverts (763fedd6a216, 7ac2d2536dfa and 9e0f9464e2ab) fix >> the bug for me. > Do you get any oops or other indication of what ends up going wrong? > Since ssh works that should be fairly easy to see. Away from the machine now; will check tomorrow morning (CET). -- Regards, Harald
Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline
On Tue, Aug 25, 2020 at 9:32 AM Harald Arnesen wrote: > > > For posterity, I'm told the fix is [1]. > > > > [1] > > https://lore.kernel.org/intel-gfx/20200821123746.16904-1-j...@8bytes.org/ > > Doesn't fix it for me. As soon as I start XFCE, the mouse and keyboard > freeze. I can still ssh into the machine. > > The three reverts (763fedd6a216, 7ac2d2536dfa and 9e0f9464e2ab) fix > the bug for me. Do you get any oops or other indication of what ends up going wrong? Since ssh works that should be fairly easy to see. Linus
Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline
Jani Nikula [25.08.2020 11:55]: > On Fri, 21 Aug 2020, Pavel Machek wrote: >> On Thu 2020-08-20 09:16:18, Linus Torvalds wrote: >>> On Thu, Aug 20, 2020 at 2:23 AM Pavel Machek wrote: >>> > >>> > Yes, it seems they make things work. (Chris asked for new patch to be >>> > tested, so I am switching to his kernel, but it survived longer than >>> > it usually does.) >>> >>> Ok, so at worst we know how to solve it, at best the reverts won't be >>> needed because Chris' patch will fix the issue properly. >>> >>> So I'll archive this thread, but remind me if this hasn't gotten >>> sorted out in the later rc's. >> >> Yes, thank you, it seems we have a solution w/o the revert. > > For posterity, I'm told the fix is [1]. > > BR, > Jani. > > > [1] https://lore.kernel.org/intel-gfx/20200821123746.16904-1-j...@8bytes.org/ Doesn't fix it for me. As soon as I start XFCE, the mouse and keyboard freeze. I can still ssh into the machine. The three reverts (763fedd6a216, 7ac2d2536dfa and 9e0f9464e2ab) fix the bug for me. -- Regards, Harald
Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline
On Fri, 21 Aug 2020, Pavel Machek wrote: > On Thu 2020-08-20 09:16:18, Linus Torvalds wrote: >> On Thu, Aug 20, 2020 at 2:23 AM Pavel Machek wrote: >> > >> > Yes, it seems they make things work. (Chris asked for new patch to be >> > tested, so I am switching to his kernel, but it survived longer than >> > it usually does.) >> >> Ok, so at worst we know how to solve it, at best the reverts won't be >> needed because Chris' patch will fix the issue properly. >> >> So I'll archive this thread, but remind me if this hasn't gotten >> sorted out in the later rc's. > > Yes, thank you, it seems we have a solution w/o the revert. For posterity, I'm told the fix is [1]. BR, Jani. [1] https://lore.kernel.org/intel-gfx/20200821123746.16904-1-j...@8bytes.org/ -- Jani Nikula, Intel Open Source Graphics Center
Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline
On Thu 2020-08-20 09:16:18, Linus Torvalds wrote: > On Thu, Aug 20, 2020 at 2:23 AM Pavel Machek wrote: > > > > Yes, it seems they make things work. (Chris asked for new patch to be > > tested, so I am switching to his kernel, but it survived longer than > > it usually does.) > > Ok, so at worst we know how to solve it, at best the reverts won't be > needed because Chris' patch will fix the issue properly. > > So I'll archive this thread, but remind me if this hasn't gotten > sorted out in the later rc's. Yes, thank you, it seems we have a solution w/o the revert. Best regards, Pavel
Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline
On Thu, Aug 20, 2020 at 2:23 AM Pavel Machek wrote: > > Yes, it seems they make things work. (Chris asked for new patch to be > tested, so I am switching to his kernel, but it survived longer than > it usually does.) Ok, so at worst we know how to solve it, at best the reverts won't be needed because Chris' patch will fix the issue properly. So I'll archive this thread, but remind me if this hasn't gotten sorted out in the later rc's. Linus
Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline
Hi! > > I think there's been some discussion about reverting that change for > > other reasons, but it's quite likely the culprit. > > Hmm. It reverts cleanly, but the end result doesn't work, because of > other changes. > > Reverting all of > >763fedd6a216 ("drm/i915: Remove i915_gem_object_get_dirty_page()") >7ac2d2536dfa ("drm/i915/gem: Delete unused code") >9e0f9464e2ab ("drm/i915/gem: Async GPU relocations only") > > seems to at least build. > > Pavel, does doing those three reverts make things work for you? Yes, it seems they make things work. (Chris asked for new patch to be tested, so I am switching to his kernel, but it survived longer than it usually does.) Thanks and best regards, Pavel
Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline
On Tue 2020-08-18 18:59:27, Linus Torvalds wrote: > On Tue, Aug 18, 2020 at 6:13 PM Dave Airlie wrote: > > > > I think there's been some discussion about reverting that change for > > other reasons, but it's quite likely the culprit. > > Hmm. It reverts cleanly, but the end result doesn't work, because of > other changes. > > Reverting all of > >763fedd6a216 ("drm/i915: Remove i915_gem_object_get_dirty_page()") >7ac2d2536dfa ("drm/i915/gem: Delete unused code") >9e0f9464e2ab ("drm/i915/gem: Async GPU relocations only") > > seems to at least build. > > Pavel, does doing those three reverts make things work for you? Ok, so Chris' patches resulted in a (less severe?) crash; let me try this.

    pavel@amd:/data/l/linux-next-32$ git reset --hard 8eb858df0a5f6bcd371b5d5637255c987278b8c9
    HEAD is now at 8eb858df0a5f Add linux-next specific files for 20200819
    pavel@amd:/data/l/linux-next-32$ git revert 763fedd6a216
    Performing inexact rename detection: 100% (1212316/1212316), done.
    hint: Waiting for your editor to close the file...
    Editing file: /data/fast/l/linux-next-32/.git/COMMIT_EDITMSG
    /home/pavel/bin/emacsf: line 3: ed: command not found
    [detached HEAD 261cbba627b7] Revert "drm/i915: Remove i915_gem_object_get_dirty_page()"
     2 files changed, 18 insertions(+)
    pavel@amd:/data/l/linux-next-32$ git revert 7ac2d2536dfa
    warning: inexact rename detection was skipped due to too many files.
    warning: you may want to set your merge.renamelimit variable to at least 3877 and retry the command.
    hint: Waiting for your editor to close the file...
    Editing file: /data/fast/l/linux-next-32/.git/COMMIT_EDITMSG
    /home/pavel/bin/emacsf: line 3: ed: command not found
    [detached HEAD 526af90ea811] Revert "drm/i915/gem: Delete unused code"
     1 file changed, 19 insertions(+)
    pavel@amd:/data/l/linux-next-32$ git revert 9e0f9464e2ab
    warning: inexact rename detection was skipped due to too many files.
    warning: you may want to set your merge.renamelimit variable to at least 3877 and retry the command.
    hint: Waiting for your editor to close the file...
    Editing file: /data/fast/l/linux-next-32/.git/COMMIT_EDITMSG
    /home/pavel/bin/emacsf: line 3: ed: command not found
    [detached HEAD 173e46213949] Revert "drm/i915/gem: Async GPU relocations only"
     2 files changed, 289 insertions(+), 27 deletions(-)
    pavel@amd:/data/l/linux-next-32$

It is now running. It seems unison is what usually triggers this (due to memory pressure?). This time it survived unison (but without chromium). I'll really know whether it works in a day or two. Best regards, Pavel
Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline
Hi! > > I think there's been some discussion about reverting that change for > > other reasons, but it's quite likely the culprit. > > Hmm. It reverts cleanly, but the end result doesn't work, because of > other changes. > > Reverting all of > >763fedd6a216 ("drm/i915: Remove i915_gem_object_get_dirty_page()") >7ac2d2536dfa ("drm/i915/gem: Delete unused code") >9e0f9464e2ab ("drm/i915/gem: Async GPU relocations only") > > seems to at least build. > > Pavel, does doing those three reverts make things work for you? Thanks. I got "[PATCH 1/2] drm/i915/gem: Replace reloc chain with terminator on..." in my inbox; I believe that's related. Let me try those first. Best regards, Pavel
Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline
On Tue, Aug 18, 2020 at 6:13 PM Dave Airlie wrote: > > I think there's been some discussion about reverting that change for > other reasons, but it's quite likely the culprit. Hmm. It reverts cleanly, but the end result doesn't work, because of other changes. Reverting all of 763fedd6a216 ("drm/i915: Remove i915_gem_object_get_dirty_page()") 7ac2d2536dfa ("drm/i915/gem: Delete unused code") 9e0f9464e2ab ("drm/i915/gem: Async GPU relocations only") seems to at least build. Pavel, does doing those three reverts make things work for you? Linus
Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline
On Wed, 19 Aug 2020 at 10:38, Linus Torvalds wrote:
>
> Ping on this?
>
> The code disassembles to
>
>   24:   8b 85 d0 fd ff ff       mov    -0x230(%ebp),%eax
>   2a:*  c7 03 01 00 40 10       movl   $0x10400001,(%ebx)     <-- trapping instruction
>   30:   89 43 04                mov    %eax,0x4(%ebx)
>   33:   8b 85 b4 fd ff ff       mov    -0x24c(%ebp),%eax
>   39:   89 43 08                mov    %eax,0x8(%ebx)
>   3c:   e9                      jmp    ...
>
> which looks like it's one of the cases in __reloc_entry_gpu(). I *think*
> it's this one:
>
>         } else if (gen >= 3 &&
>                    !(IS_I915G(eb->i915) || IS_I915GM(eb->i915))) {
>                 *batch++ = MI_STORE_DWORD_IMM | MI_MEM_VIRTUAL;
>                 *batch++ = addr;
>                 *batch++ = target_addr;
>
> where that "batch" pointer is 0xf8601000, so it looks like it just
> overflowed into the next page that isn't there.
>
> The cleaned-up call trace is
>
>   drm_ioctl+0x1f4/0x38b ->
>   drm_ioctl_kernel+0x87/0xd0 ->
>   i915_gem_execbuffer2_ioctl+0xdd/0x360 ->
>   i915_gem_do_execbuffer+0xaab/0x2780 ->
>   eb_relocate_vma
>
> but there's a lot of inlining going on, so..
>
> The obvious suspect is commit 9e0f9464e2ab ("drm/i915/gem: Async GPU
> relocations only") but that's going purely by "that seems to be the
> main relocation change this merge window".

I think there's been some discussion about reverting that change for
other reasons, but it's quite likely the culprit.

Maybe we can push for a revert sooner (cc'ing more of the i915 team).

Dave.