Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline

2020-09-09 Thread Pavel Machek
On Tue 2020-09-01 13:57:55, Harald Arnesen wrote:
> Still (rc3) doesn't work without the three reverts.
> 
> I'm not sure how to proceed, I cannot capture any oops, and see nothing
> obvious in any logs.

I believe this is the point where you ask Linus for reverts...

Best regards,

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline

2020-09-01 Thread Harald Arnesen
Still (rc3) doesn't work without the three reverts.

I'm not sure how to proceed, I cannot capture any oops, and see nothing
obvious in any logs.
-- 
Hilsen Harald


Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline

2020-08-27 Thread Pavel Machek
Hi!

> >> It's a Thinkpad T520.
> > 
> > Oh, so this is a 64-bit machine? Yeah, that patch to flush vmalloc
> > ranges won't make any difference on x86-64.
> > 
> > Or are you for some reason running a 32-bit kernel on that thing? Have
> > you tried building a 64-bit one (user-space can be 32-bit, it should
> > all just work. Knock wood).
> 
> No, I run a 64-bit kernel with 64-bit userspace (Void Linux).
> Config is attached, in case anything is obvious from that.

For the record, I'm running 5.9.0-rc2-next-20200825 w/o further
patches, and it behaves okay on that 32-bit thinkpad x60.

BTW... could we get the test farms to occasionally boot in 32-bit
mode? Those modern CPUs can still do that :-).

Best regards,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline

2020-08-26 Thread Linus Torvalds
On Wed, Aug 26, 2020 at 1:53 PM Harald Arnesen  wrote:
>
> It's a Thinkpad T520.

Oh, so this is a 64-bit machine? Yeah, that patch to flush vmalloc
ranges won't make any difference on x86-64.

Or are you for some reason running a 32-bit kernel on that thing? Have
you tried building a 64-bit one (user-space can be 32-bit, it should
all just work. Knock wood).

   Linus


Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline

2020-08-26 Thread Harald Arnesen
Dave Airlie [26.08.2020 22:47]:

> On Thu, 27 Aug 2020 at 06:44, Harald Arnesen  wrote:
>>
>> Linus Torvalds [26.08.2020 20:04]:
>>
>> > On Wed, Aug 26, 2020 at 2:30 AM Harald Arnesen  wrote:
>> >> Somehow related to lightdm or xfce4? However, it is a regression, since
>> >> kernel 5.8 works.
>> > Yeah, apparently there's something else wrong with the relocation changes too.
>> >
>> > That said, does that patch at
>> >
>> >   https://lore.kernel.org/intel-gfx/20200821123746.16904-1-j...@8bytes.org/
>> >
>> > change things at all? If there are two independent bugs, maybe
>> > applying that patch might at least give you an oops that gets saved in
>> > the logs?
>> >
>> > (it might be worth waiting a bit after the machine locks up in case
>> > the machine is alive enough to sync logs after a bit. If ssh works,
>> > that's obviously better yet)
>>
>> No, doesn't help. And I was wrong, ssh does not work at all when the
>> display locks up.
> 
> Did you say what hw you had? is it the same hw as Pavel or different?
> 
> Dave.
> 

It's a Thinkpad T520.
Output from 'lspci' attached.

-- 
Hilsen Harald
00:00.0 Host bridge: Intel Corporation 2nd Generation Core Processor Family DRAM Controller (rev 09)
00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09)
00:16.0 Communication controller: Intel Corporation 6 Series/C200 Series Chipset Family MEI Controller #1 (rev 04)
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (Lewisville) (rev 04)
00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 04)
00:1b.0 Audio device: Intel Corporation 6 Series/C200 Series Chipset Family High Definition Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 1 (rev b4)
00:1c.1 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 2 (rev b4)
00:1c.3 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 4 (rev b4)
00:1c.4 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 5 (rev b4)
00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation QM67 Express Chipset LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family 6 port Mobile SATA AHCI Controller (rev 04)
00:1f.3 SMBus: Intel Corporation 6 Series/C200 Series Chipset Family SMBus Controller (rev 04)
03:00.0 Network controller: Intel Corporation Centrino Ultimate-N 6300 (rev 35)
0d:00.0 System peripheral: Ricoh Co Ltd PCIe SDXC/MMC Host Controller (rev 08)
0d:00.3 FireWire (IEEE 1394): Ricoh Co Ltd R5C832 PCIe IEEE 1394 Controller (rev 04)


Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline

2020-08-26 Thread Dave Airlie
On Thu, 27 Aug 2020 at 06:44, Harald Arnesen  wrote:
>
> Linus Torvalds [26.08.2020 20:04]:
>
> > On Wed, Aug 26, 2020 at 2:30 AM Harald Arnesen  wrote:
> >> Somehow related to lightdm or xfce4? However, it is a regression, since
> >> kernel 5.8 works.
> > Yeah, apparently there's something else wrong with the relocation changes too.
> >
> > That said, does that patch at
> >
> >   https://lore.kernel.org/intel-gfx/20200821123746.16904-1-j...@8bytes.org/
> >
> > change things at all? If there are two independent bugs, maybe
> > applying that patch might at least give you an oops that gets saved in
> > the logs?
> >
> > (it might be worth waiting a bit after the machine locks up in case
> > the machine is alive enough to sync logs after a bit. If ssh works,
> > that's obviously better yet)
>
> No, doesn't help. And I was wrong, ssh does not work at all when the
> display locks up.

Did you say what hw you had? is it the same hw as Pavel or different?

Dave.


Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline

2020-08-26 Thread Harald Arnesen
Linus Torvalds [26.08.2020 20:04]:

> On Wed, Aug 26, 2020 at 2:30 AM Harald Arnesen  wrote:
>> Somehow related to lightdm or xfce4? However, it is a regression, since
>> kernel 5.8 works.
> Yeah, apparently there's something else wrong with the relocation changes too.
> 
> That said, does that patch at
> 
>   https://lore.kernel.org/intel-gfx/20200821123746.16904-1-j...@8bytes.org/
> 
> change things at all? If there are two independent bugs, maybe
> applying that patch might at least give you an oops that gets saved in
> the logs?
> 
> (it might be worth waiting a bit after the machine locks up in case
> the machine is alive enough to sync logs after a bit. If ssh works,
> that's obviously better yet)

No, doesn't help. And I was wrong, ssh does not work at all when the
display locks up.
-- 
Hilsen Harald


Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline

2020-08-26 Thread Linus Torvalds
On Wed, Aug 26, 2020 at 2:30 AM Harald Arnesen  wrote:
>
> Somehow related to lightdm or xfce4? However, it is a regression, since
> kernel 5.8 works.

Yeah, apparently there's something else wrong with the relocation changes too.

That said, does that patch at

  https://lore.kernel.org/intel-gfx/20200821123746.16904-1-j...@8bytes.org/

change things at all? If there are two independent bugs, maybe
applying that patch might at least give you an oops that gets saved in
the logs?

(it might be worth waiting a bit after the machine locks up in case
the machine is alive enough to sync logs after a bit. If ssh works,
that's obviously better yet)

  Linus


Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline

2020-08-26 Thread Harald Arnesen
Harald Arnesen [26.08.2020 10:36]:

> I was wrong about ssh working. The whole machine locks up when X starts.
> 
> A strange thing, sometimes I can log in from lightdm before it locks up,
> sometimes I cannot even use the login screen. Timing related?
> 
> If I don't start X, console login seems to work fine, and I see nothing
> obvious in the logs or kernel messages.
> 
> I will try to start just a window manager with startx instead of going
> through lightdm.

Disabled lightdm, started DE or WM from .xinitrc:

xfce4-session: Machine locks up
enlightenment: Machine works

Somehow related to lightdm or xfce4? However, it is a regression, since
kernel 5.8 works.
-- 
Hilsen Harald


Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline

2020-08-26 Thread Harald Arnesen
Linus Torvalds [25.08.2020 20:19]:

> On Tue, Aug 25, 2020 at 9:32 AM Harald Arnesen  wrote:
>>
>> > For posterity, I'm told the fix is [1].
>> >
>> > [1] https://lore.kernel.org/intel-gfx/20200821123746.16904-1-j...@8bytes.org/
>>
>> Doesn't fix it for me. As soon as I start XFCE, the mouse and keyboard
>> freeze. I can still ssh into the machine.
>>
>> The three reverts (763fedd6a216, 7ac2d2536dfa and 9e0f9464e2ab) fix
>> the bug for me.
> 
> Do you get any oops or other indication of what ends up going wrong?
> Since ssh works that should be fairly easy to see.
I was wrong about ssh working. The whole machine locks up when X starts.

A strange thing, sometimes I can log in from lightdm before it locks up,
sometimes I cannot even use the login screen. Timing related?

If I don't start X, console login seems to work fine, and I see nothing
obvious in the logs or kernel messages.

I will try to start just a window manager with startx instead of going
through lightdm.
-- 
Hilsen Harald


Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline

2020-08-25 Thread Harald Arnesen
Linus Torvalds [25.08.2020 20:19]:

>> Doesn't fix it for me. As soon as I start XFCE, the mouse and keyboard
>> freeze. I can still ssh into the machine.
>>
>> The three reverts (763fedd6a216, 7ac2d2536dfa and 9e0f9464e2ab) fix
>> the bug for me.
> Do you get any oops or other indication of what ends up going wrong?
> Since ssh works that should be fairly easy to see.

Away from the machine now, will check tomorrow morning (CET).
-- 
Hilsen Harald


Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline

2020-08-25 Thread Linus Torvalds
On Tue, Aug 25, 2020 at 9:32 AM Harald Arnesen  wrote:
>
> > For posterity, I'm told the fix is [1].
> >
> > [1] https://lore.kernel.org/intel-gfx/20200821123746.16904-1-j...@8bytes.org/
>
> Doesn't fix it for me. As soon as I start XFCE, the mouse and keyboard
> freeze. I can still ssh into the machine.
>
> The three reverts (763fedd6a216, 7ac2d2536dfa and 9e0f9464e2ab) fix
> the bug for me.

Do you get any oops or other indication of what ends up going wrong?
Since ssh works that should be fairly easy to see.

Linus


Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline

2020-08-25 Thread Harald Arnesen
Jani Nikula [25.08.2020 11:55]:

> On Fri, 21 Aug 2020, Pavel Machek  wrote:
>> On Thu 2020-08-20 09:16:18, Linus Torvalds wrote:
>>> On Thu, Aug 20, 2020 at 2:23 AM Pavel Machek  wrote:
>>> >
>>> > Yes, it seems they make things work. (Chris asked for new patch to be
>>> > tested, so I am switching to his kernel, but it survived longer than
>>> > it usually does.)
>>> 
>>> Ok, so at worst we know how to solve it, at best the reverts won't be
>>> needed because Chris' patch will fix the issue properly.
>>> 
>>> So I'll archive this thread, but remind me if this hasn't gotten
>>> sorted out in the later rc's.
>>
>> Yes, thank you, it seems we have a solution w/o the revert.
> 
> For posterity, I'm told the fix is [1].
> 
> BR,
> Jani.
> 
> 
> [1] https://lore.kernel.org/intel-gfx/20200821123746.16904-1-j...@8bytes.org/

Doesn't fix it for me. As soon as I start XFCE, the mouse and keyboard
freeze. I can still ssh into the machine.

The three reverts (763fedd6a216, 7ac2d2536dfa and 9e0f9464e2ab) fix
the bug for me.
-- 
Hilsen Harald


Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline

2020-08-25 Thread Jani Nikula
On Fri, 21 Aug 2020, Pavel Machek  wrote:
> On Thu 2020-08-20 09:16:18, Linus Torvalds wrote:
>> On Thu, Aug 20, 2020 at 2:23 AM Pavel Machek  wrote:
>> >
>> > Yes, it seems they make things work. (Chris asked for new patch to be
>> > tested, so I am switching to his kernel, but it survived longer than
>> > it usually does.)
>> 
>> Ok, so at worst we know how to solve it, at best the reverts won't be
>> needed because Chris' patch will fix the issue properly.
>> 
>> So I'll archive this thread, but remind me if this hasn't gotten
>> sorted out in the later rc's.
>
> Yes, thank you, it seems we have a solution w/o the revert.

For posterity, I'm told the fix is [1].

BR,
Jani.


[1] https://lore.kernel.org/intel-gfx/20200821123746.16904-1-j...@8bytes.org/


-- 
Jani Nikula, Intel Open Source Graphics Center


Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline

2020-08-21 Thread Pavel Machek
On Thu 2020-08-20 09:16:18, Linus Torvalds wrote:
> On Thu, Aug 20, 2020 at 2:23 AM Pavel Machek  wrote:
> >
> > Yes, it seems they make things work. (Chris asked for new patch to be
> > tested, so I am switching to his kernel, but it survived longer than
> > it usually does.)
> 
> Ok, so at worst we know how to solve it, at best the reverts won't be
> needed because Chris' patch will fix the issue properly.
> 
> So I'll archive this thread, but remind me if this hasn't gotten
> sorted out in the later rc's.

Yes, thank you, it seems we have a solution w/o the revert.

Best regards,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline

2020-08-20 Thread Linus Torvalds
On Thu, Aug 20, 2020 at 2:23 AM Pavel Machek  wrote:
>
> Yes, it seems they make things work. (Chris asked for new patch to be
> tested, so I am switching to his kernel, but it survived longer than
> it usually does.)

Ok, so at worst we know how to solve it, at best the reverts won't be
needed because Chris' patch will fix the issue properly.

So I'll archive this thread, but remind me if this hasn't gotten
sorted out in the later rc's.

 Linus


Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline

2020-08-20 Thread Pavel Machek
Hi!

> > I think there's been some discussion about reverting that change for
> > other reasons, but it's quite likely the culprit.
> 
> Hmm. It reverts cleanly, but the end result doesn't work, because of
> other changes.
> 
> Reverting all of
> 
>    763fedd6a216 ("drm/i915: Remove i915_gem_object_get_dirty_page()")
>    7ac2d2536dfa ("drm/i915/gem: Delete unused code")
>    9e0f9464e2ab ("drm/i915/gem: Async GPU relocations only")
> 
> seems to at least build.
> 
> Pavel, does doing those three reverts make things work for you?

Yes, it seems they make things work. (Chris asked for new patch to be
tested, so I am switching to his kernel, but it survived longer than
it usually does.)

Thanks and best regards,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline

2020-08-19 Thread Pavel Machek
On Tue 2020-08-18 18:59:27, Linus Torvalds wrote:
> On Tue, Aug 18, 2020 at 6:13 PM Dave Airlie  wrote:
> >
> > I think there's been some discussion about reverting that change for
> > other reasons, but it's quite likely the culprit.
> 
> Hmm. It reverts cleanly, but the end result doesn't work, because of
> other changes.
> 
> Reverting all of
> 
>    763fedd6a216 ("drm/i915: Remove i915_gem_object_get_dirty_page()")
>    7ac2d2536dfa ("drm/i915/gem: Delete unused code")
>    9e0f9464e2ab ("drm/i915/gem: Async GPU relocations only")
> 
> seems to at least build.
> 
> Pavel, does doing those three reverts make things work for you?

Ok, so Chris' patches resulted in (less severe?) crash, let me try this.

pavel@amd:/data/l/linux-next-32$ git reset --hard 8eb858df0a5f6bcd371b5d5637255c987278b8c9
HEAD is now at 8eb858df0a5f Add linux-next specific files for 20200819
pavel@amd:/data/l/linux-next-32$ git revert 763fedd6a216
Performing inexact rename detection: 100% (1212316/1212316), done.
hint: Waiting for your editor to close the file... Editing file: /data/fast/l/linux-next-32/.git/COMMIT_EDITMSG
/home/pavel/bin/emacsf: line 3: ed: command not found
[detached HEAD 261cbba627b7] Revert "drm/i915: Remove i915_gem_object_get_dirty_page()"
 2 files changed, 18 insertions(+)
pavel@amd:/data/l/linux-next-32$ git revert 7ac2d2536dfa
warning: inexact rename detection was skipped due to too many files.
warning: you may want to set your merge.renamelimit variable to at least 3877 and retry the command.
hint: Waiting for your editor to close the file... Editing file: /data/fast/l/linux-next-32/.git/COMMIT_EDITMSG
/home/pavel/bin/emacsf: line 3: ed: command not found
[detached HEAD 526af90ea811] Revert "drm/i915/gem: Delete unused code"
 1 file changed, 19 insertions(+)
pavel@amd:/data/l/linux-next-32$ git revert 9e0f9464e2ab
warning: inexact rename detection was skipped due to too many files.
warning: you may want to set your merge.renamelimit variable to at least 3877 and retry the command.
hint: Waiting for your editor to close the file... Editing file: /data/fast/l/linux-next-32/.git/COMMIT_EDITMSG
/home/pavel/bin/emacsf: line 3: ed: command not found
[detached HEAD 173e46213949] Revert "drm/i915/gem: Async GPU relocations only"
 2 files changed, 289 insertions(+), 27 deletions(-)
pavel@amd:/data/l/linux-next-32$ 

It is now running; it seems unison is the thing that usually triggers
this (due to memory pressure?). This time it survived unison (but
without chromium). I'll really know if it works in a day or two.

Best regards,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline

2020-08-19 Thread Pavel Machek
Hi!

> > I think there's been some discussion about reverting that change for
> > other reasons, but it's quite likely the culprit.
> 
> Hmm. It reverts cleanly, but the end result doesn't work, because of
> other changes.
> 
> Reverting all of
> 
>    763fedd6a216 ("drm/i915: Remove i915_gem_object_get_dirty_page()")
>    7ac2d2536dfa ("drm/i915/gem: Delete unused code")
>    9e0f9464e2ab ("drm/i915/gem: Async GPU relocations only")
> 
> seems to at least build.
> 
> Pavel, does doing those three reverts make things work for you?

Thanks.

I got "[PATCH 1/2] drm/i915/gem: Replace reloc chain with terminator
on..." in my inbox; I believe that's related. Let me try those, first.

Best regards,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline

2020-08-18 Thread Linus Torvalds
On Tue, Aug 18, 2020 at 6:13 PM Dave Airlie  wrote:
>
> I think there's been some discussion about reverting that change for
> other reasons, but it's quite likely the culprit.

Hmm. It reverts cleanly, but the end result doesn't work, because of
other changes.

Reverting all of

   763fedd6a216 ("drm/i915: Remove i915_gem_object_get_dirty_page()")
   7ac2d2536dfa ("drm/i915/gem: Delete unused code")
   9e0f9464e2ab ("drm/i915/gem: Async GPU relocations only")

seems to at least build.

Pavel, does doing those three reverts make things work for you?

   Linus


Re: [Intel-gfx] 5.9-rc1: graphics regression moved from -next to mainline

2020-08-18 Thread Dave Airlie
On Wed, 19 Aug 2020 at 10:38, Linus Torvalds wrote:
>
> Ping on this?
>
> The code disassembles to
>
>   24: 8b 85 d0 fd ff ff       mov    -0x230(%ebp),%eax
>   2a:* c7 03 01 00 40 10      movl   $0x10400001,(%ebx) <-- trapping instruction
>   30: 89 43 04                mov    %eax,0x4(%ebx)
>   33: 8b 85 b4 fd ff ff       mov    -0x24c(%ebp),%eax
>   39: 89 43 08                mov    %eax,0x8(%ebx)
>   3c: e9                      jmp    ...
>
> which looks like is one of the cases in __reloc_entry_gpu(). I *think*
> it's this one:
>
>         } else if (gen >= 3 &&
>                    !(IS_I915G(eb->i915) || IS_I915GM(eb->i915))) {
>                 *batch++ = MI_STORE_DWORD_IMM | MI_MEM_VIRTUAL;
>                 *batch++ = addr;
>                 *batch++ = target_addr;
>
> where that "batch" pointer is 0xf8601000, so it looks like it just
> overflowed into the next page that isn't there.
>
> The cleaned-up call trace is
>
>   drm_ioctl+0x1f4/0x38b ->
> drm_ioctl_kernel+0x87/0xd0 ->
>   i915_gem_execbuffer2_ioctl+0xdd/0x360 ->
> i915_gem_do_execbuffer+0xaab/0x2780 ->
>   eb_relocate_vma
>
> but there's a lot of inlining going on, so..
>
> The obvious suspect is commit 9e0f9464e2ab ("drm/i915/gem: Async GPU
> relocations only") but that's going purely by "that seems to be the
> main relocation change this merge window".

I think there's been some discussion about reverting that change for
other reasons, but it's quite likely the culprit.

Maybe we can push for a revert sooner (cc'ing more of the i915 team).

Dave.