Re: [git pull] drm for 5.14-rc1

2021-09-22 Thread Maxime Ripard
On Mon, Sep 20, 2021 at 10:47:43AM -0700, Linus Torvalds wrote:
> On Mon, Sep 20, 2021 at 10:33 AM Maxime Ripard  wrote:
> >
> > What I was interested in was more about the context itself, and I'd
> > still like an answer on whether it's ok to wait for a review for 5
> > months though, or if it's an expectation from now on that we are
> > supposed to fix bugs over the week-end.
> 
> Oh, it's definitely not "over a weekend". These reverts happened on a
> Sunday just because that's when I do rc releases, and this was one of
> those pending issues that had been around long enough that I went "ok,
> I'm reverting now since it's been bisected and verified".
> 
> So it happened on a weekend, but that's pretty incidental.

Ok.

> You should not wait for 5 months to send bug-fixes. That's not the
> point of review, and review shouldn't hold up reported regressions of
> existing code. That's just basic _testing_ - either the fix should be
> applied, or - if the fix is too invasive or too ugly - the problematic
> source of the regression should be reverted.
> 
> Review should be about new code, it shouldn't be holding up "there's a
> bug report, here's the obvious fix".
> 
> And for something like a NULL pointer dereference, there really should
> generally be an "obvious fix".
> 
> Of course, a corollary to that "fixes are different from new
> development", though, is that bug fixes need to be kept separate from
> new code - just so that they _can_ be handled separately and so that
> you could have sent Sudip (and Michael, although that was apparently a
> very different bug, and the report came in later) a "can you test this
> fix" kind of thing.

I still don't have a way to reproduce Sudip's bug, so I can't even
provide that.

> I don't know what the review issue on the vc4 drm side is, but I
> suspect that the vc4 people are just perhaps not as integrated with a
> lot of the other core drm people. Or maybe review of new features are
> held off because there are bug reports on the old code.

It's not really about drm here, it's a dependency on the clock framework.

Maxime


signature.asc
Description: PGP signature


Re: [git pull] drm for 5.14-rc1

2021-09-20 Thread Linus Torvalds
On Mon, Sep 20, 2021 at 10:33 AM Maxime Ripard  wrote:
>
> What I was interested in was more about the context itself, and I'd
> still like an answer on whether it's ok to wait for a review for 5
> months though, or if it's an expectation from now on that we are
> supposed to fix bugs over the week-end.

Oh, it's definitely not "over a weekend". These reverts happened on a
Sunday just because that's when I do rc releases, and this was one of
those pending issues that had been around long enough that I went "ok,
I'm reverting now since it's been bisected and verified".

So it happened on a weekend, but that's pretty incidental.

You should not wait for 5 months to send bug-fixes. That's not the
point of review, and review shouldn't hold up reported regressions of
existing code. That's just basic _testing_ - either the fix should be
applied, or - if the fix is too invasive or too ugly - the problematic
source of the regression should be reverted.

Review should be about new code, it shouldn't be holding up "there's a
bug report, here's the obvious fix".

And for something like a NULL pointer dereference, there really should
generally be an "obvious fix".

Of course, a corollary to that "fixes are different from new
development", though, is that bug fixes need to be kept separate from
new code - just so that they _can_ be handled separately and so that
you could have sent Sudip (and Michael, although that was apparently a
very different bug, and the report came in later) a "can you test this
fix" kind of thing.

I don't know what the review issue on the vc4 drm side is, but I
suspect that the vc4 people are just perhaps not as integrated with a
lot of the other core drm people. Or maybe review of new features are
held off because there are bug reports on the old code.

   Linus


Re: [git pull] drm for 5.14-rc1

2021-09-20 Thread Maxime Ripard
On Mon, Sep 20, 2021 at 10:10:57AM -0700, Linus Torvalds wrote:
> On Mon, Sep 20, 2021 at 5:17 AM Maxime Ripard  wrote:
> >
> > I'm sorry, but I find that situation to be a bit ridiculous.
> 
> Honestly, the original report about the pulseaudio problem came in
> over two weeks ago, and all you seemed to do was to ignore everything
> that Sudip said and reported.
> 
> THAT is the ridiculous part.
> 
> The rules for the kernel are very clear: regressions get fixed or they
> get reverted. And I saw absolutely _zero_ indication that you took
> that pulseaudio oops at all seriously.

I wasn't questioning the revert itself though, I can see how you took
that decision for the audio patch.

What I was interested in was more about the context itself, and I'd
still like an answer on whether it's ok to wait for a review for 5
months though, or if it's an expectation from now on that we are
supposed to fix bugs over the week-end.

Maxime


signature.asc
Description: PGP signature


Re: [git pull] drm for 5.14-rc1

2021-09-20 Thread Linus Torvalds
On Mon, Sep 20, 2021 at 5:17 AM Maxime Ripard  wrote:
>
> I'm sorry, but I find that situation to be a bit ridiculous.

Honestly, the original report about the pulseaudio problem came in
over two weeks ago, and all you seemed to do was to ignore everything
that Sudip said and reported.

THAT is the ridiculous part.

The rules for the kernel are very clear: regressions get fixed or they
get reverted. And I saw absolutely _zero_ indication that you took
that pulseaudio oops at all seriously.

Linus


Re: [git pull] drm for 5.14-rc1

2021-09-20 Thread Sudip Mukherjee
On Mon, Sep 20, 2021 at 1:17 PM Maxime Ripard  wrote:
>
> On Sun, Sep 19, 2021 at 10:19:35AM -0700, Linus Torvalds wrote:
> > On Sun, Sep 19, 2021 at 4:05 AM Sudip Mukherjee
> >  wrote:
> > >
> > > And indeed, reverting 27da370e0fb3 ("drm/vc4: hdmi: Remove
> > > drm_encoder->crtc usage") on top of d4d016caa4b8 ("alpha: move
> > > __udiv_qrnnd library function to arch/alpha/lib/")
> > > has fixed the error.
> >
> > Ok, since your pulseaudio problem was reported already over two weeks
> > ago with no apparent movement, I've done that revert in my tree.
> >
> > I reverted the two runtime PM changes that cause problems for Michael too.
>
> I'm sorry, but I find that situation to be a bit ridiculous. In order to
> be properly addressed, Michael's issue needs some patches that have been
> unreviewed for 5 monthes now.
>
> However, if one reports an issue to you, without cc'ing the author, on a
> week-end, the revert is done in a single day.
>
> Even that audio bug only got a proper report yesterday, after you asked
> for it.

Apologies if I have missed any of your mail, but iirc you have only
asked me what tests I am running, but Linus asked if I can do a
bisect.


--
Regards
Sudip


Re: [git pull] drm for 5.14-rc1

2021-09-20 Thread Maxime Ripard
On Mon, Sep 20, 2021 at 10:55:31AM +0200, Maxime Ripard wrote:
> Hi,
> 
> On Sat, Sep 18, 2021 at 11:18:33AM +0200, Michael Stapelberg wrote:
> > torvalds at linux-foundation.org (Linus Torvalds) writes:
> > > Did I fix it up correctly? Who knows. The code makes more sense to me
> > > now and seems valid. But I really *really* want to stress how locking
> > > is important.
> > 
> > As far as I can tell, this merge conflict resolution made my Raspberry
> > Pi 3 hang on boot. You can find the full serial console output at:
> > 
> > https://t.zekjur.net/linux-5.14-raspberry-pi-3-hang-vc4.txt
> > 
> > The last few messages are from vc4, then the boot hangs.
> > 
> > Using git-bisect, I tracked this down to
> > https://github.com/torvalds/linux/commit/e058a84bfddc42ba356a2316f2cf1141974625c9,
> > which is the merge you’re talking about here, AFAICT.
> > 
> > I also tried the git://anongit.freedesktop.org/drm/drm, and that tree
> > boots as expected, suggesting that the problem really is with the
> > additional changes.
> > 
> > The code seems to work on my Raspberry Pi 4, just not on the Raspberry
> > Pi 3. Any ideas why that might be, and how to fix it?
> 
> I assume you run your Pi without anything connected on HDMI, and without
> hdmi_force_hotplug in your config.txt?
> 
> If so, can you test that branch, and let me know if it works for you
> https://github.com/mripard/linux/tree/rpi/bug-fixes

This breaks every one else, unfortunately. I'll try to come up with something.

Maxime


signature.asc
Description: PGP signature


Re: [git pull] drm for 5.14-rc1

2021-09-20 Thread Maxime Ripard
On Sun, Sep 19, 2021 at 10:19:35AM -0700, Linus Torvalds wrote:
> On Sun, Sep 19, 2021 at 4:05 AM Sudip Mukherjee
>  wrote:
> >
> > And indeed, reverting 27da370e0fb3 ("drm/vc4: hdmi: Remove
> > drm_encoder->crtc usage") on top of d4d016caa4b8 ("alpha: move
> > __udiv_qrnnd library function to arch/alpha/lib/")
> > has fixed the error.
> 
> Ok, since your pulseaudio problem was reported already over two weeks
> ago with no apparent movement, I've done that revert in my tree.
> 
> I reverted the two runtime PM changes that cause problems for Michael too.

I'm sorry, but I find that situation to be a bit ridiculous. In order to
be properly addressed, Michael's issue needs some patches that have been
unreviewed for 5 monthes now.

However, if one reports an issue to you, without cc'ing the author, on a
week-end, the revert is done in a single day.

Even that audio bug only got a proper report yesterday, after you asked
for it.

Do you really expect us to work during the week end too?

Maxime


signature.asc
Description: PGP signature


Re: [git pull] drm for 5.14-rc1

2021-09-20 Thread Maxime Ripard
Hi,

On Sat, Sep 18, 2021 at 11:18:33AM +0200, Michael Stapelberg wrote:
> torvalds at linux-foundation.org (Linus Torvalds) writes:
> > Did I fix it up correctly? Who knows. The code makes more sense to me
> > now and seems valid. But I really *really* want to stress how locking
> > is important.
> 
> As far as I can tell, this merge conflict resolution made my Raspberry
> Pi 3 hang on boot. You can find the full serial console output at:
> 
> https://t.zekjur.net/linux-5.14-raspberry-pi-3-hang-vc4.txt
> 
> The last few messages are from vc4, then the boot hangs.
> 
> Using git-bisect, I tracked this down to
> https://github.com/torvalds/linux/commit/e058a84bfddc42ba356a2316f2cf1141974625c9,
> which is the merge you’re talking about here, AFAICT.
> 
> I also tried the git://anongit.freedesktop.org/drm/drm, and that tree
> boots as expected, suggesting that the problem really is with the
> additional changes.
> 
> The code seems to work on my Raspberry Pi 4, just not on the Raspberry
> Pi 3. Any ideas why that might be, and how to fix it?

I assume you run your Pi without anything connected on HDMI, and without
hdmi_force_hotplug in your config.txt?

If so, can you test that branch, and let me know if it works for you
https://github.com/mripard/linux/tree/rpi/bug-fixes

Maxime


signature.asc
Description: PGP signature


Re: [git pull] drm for 5.14-rc1

2021-09-19 Thread Linus Torvalds
On Sun, Sep 19, 2021 at 4:05 AM Sudip Mukherjee
 wrote:
>
> And indeed, reverting 27da370e0fb3 ("drm/vc4: hdmi: Remove
> drm_encoder->crtc usage") on top of d4d016caa4b8 ("alpha: move
> __udiv_qrnnd library function to arch/alpha/lib/")
> has fixed the error.

Ok, since your pulseaudio problem was reported already over two weeks
ago with no apparent movement, I've done that revert in my tree.

I reverted the two runtime PM changes that cause problems for Michael too.

  Linus


Re: [git pull] drm for 5.14-rc1

2021-09-19 Thread Sudip Mukherjee
Hi Linus,

On Sun, Sep 19, 2021 at 12:06 AM Linus Torvalds
 wrote:
>
> On Sat, Sep 18, 2021 at 3:48 PM Sudip Mukherjee
>  wrote:
> >
> > Also, I have now tested by reverting those two commits and I still get
> > the same trace on rpi4.
>
> Ok. I'm afraid we really need to have the VC4 people figure it out - I
> count do the two reverts that are reported to fix the RPi3 issue, but
> it looks like the RPi4 pulseaudio issue needs something else.
>
> Any chance you could bisect that?

Done, here is the bisect log:

# bad: [7c636d4d20f8c5acfbfbc60f326fddb0e1cf5daa] Merge tag 'dt-5.15'
of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
# good: [9e9fb7655ed585da8f468e29221f0ba194a5f613] Merge tag
'net-next-5.15' of
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
git bisect start '7c636d4d20f8' '9e9fb7655ed5' '--' 'drivers/gpu/drm/vc4/'
# good: [776efe800feda95a29cefecce1ce36cc27d70b29] drm/vc4: hdmi: Drop
devm interrupt handler for hotplug interrupts
git bisect good 776efe800feda95a29cefecce1ce36cc27d70b29
# bad: [588b3eee528873d73bf777f329d35b2e65e24777] Merge tag
'drm-misc-next-2021-07-16' of
git://anongit.freedesktop.org/drm/drm-misc into drm-next
git bisect bad 588b3eee528873d73bf777f329d35b2e65e24777
# bad: [27da370e0fb343a0baf308f503bb3e5dcdfe3362] drm/vc4: hdmi:
Remove drm_encoder->crtc usage
git bisect bad 27da370e0fb343a0baf308f503bb3e5dcdfe3362
# good: [44fe9f90eb9d2533d049b4ba09540eed6cad9f49] drm/vc4: hdmi: Only
call into DRM framework if registered
git bisect good 44fe9f90eb9d2533d049b4ba09540eed6cad9f49
# first bad commit: [27da370e0fb343a0baf308f503bb3e5dcdfe3362]
drm/vc4: hdmi: Remove drm_encoder->crtc usage

And indeed, reverting 27da370e0fb3 ("drm/vc4: hdmi: Remove
drm_encoder->crtc usage") on top of d4d016caa4b8 ("alpha: move
__udiv_qrnnd library function to arch/alpha/lib/")
has fixed the error.


-- 
Regards
Sudip


Re: [git pull] drm for 5.14-rc1

2021-09-19 Thread Michael Stapelberg
Sure, no problem. Here are the patch files that make it work for me:

https://github.com/gokrazy/kernel/blob/d04d64114aae51aa27752adca6080ed4c9a0c70c/0001-Revert-drm-vc4-hdmi-Make-sure-the-controller-is-powe.patch
https://github.com/gokrazy/kernel/blob/d04d64114aae51aa27752adca6080ed4c9a0c70c/0002-Revert-drm-vc4-hdmi-Move-the-HSM-clock-enable-to-run.patch

On Sun, 19 Sept 2021 at 00:13, Linus Torvalds
 wrote:
>
> On Sat, Sep 18, 2021 at 1:13 PM Michael Stapelberg
>  wrote:
> >
> > > Michael - do things work if you revert those two (sadly, they don't
> > > revert cleanly exactly _because_ of the other changes in the same
> > > area)?
> >
> > Reverting only 9984d6664ce9 is not sufficient, but reverting both
> > 9984d6664ce9 and 411efa18e4b0 does indeed make my Raspberry Pi 3 boot
> > again!
>
> Since you did those reverts and fixed up the conflicts, would you mind
> sending out the resulting patch so that maybe Sudip can test it too?
>
> Maybe the RPi4 hdmi audio issues are related to the RPi4 hdmi problems
> despite the symptoms apparently being rather different..
>
>   Linus



-- 
Best regards,
Michael


Re: [git pull] drm for 5.14-rc1

2021-09-18 Thread Linus Torvalds
On Sat, Sep 18, 2021 at 3:48 PM Sudip Mukherjee
 wrote:
>
> Also, I have now tested by reverting those two commits and I still get
> the same trace on rpi4.

Ok. I'm afraid we really need to have the VC4 people figure it out - I
count do the two reverts that are reported to fix the RPi3 issue, but
it looks like the RPi4 pulseaudio issue needs something else.

Any chance you could bisect that?

Actually, looking at that first report of yours, the RPi4 sound
problem apparently happened in the range

9e9fb7655ed5..7c636d4d20f8

and while that range does have a drm merge from Dave Airlie, it _also_
has a sound merge from Takashi and various ARM SoC updates from Arnd.

But most (all?) of the changes in that range to the vc4 source code
came from the drm tree. And Maxime in particular. I think.

Linus


Re: [git pull] drm for 5.14-rc1

2021-09-18 Thread Sudip Mukherjee
On Sat, Sep 18, 2021 at 11:15 PM Linus Torvalds
 wrote:
>
> On Sat, Sep 18, 2021 at 3:00 PM Sudip Mukherjee
>  wrote:
> >
> > Its still there. I am seeing it every night. This was from last night
> > - https://lava.qa.codethink.co.uk/scheduler/job/164#L1356
>
> Note that that web server is not available at least to me.  Looks like
> some internal name or limited dns, I just get
>
> lava.qa.codethink.co.uk: Name or service not known
>
> so your reports aren't getting a lot of traction.

Looks like a DNS issue with the subdomains, someone else also reported
the same last week.
I have reported this to our ops and they are looking into it.

Also, I have now tested by reverting those two commits and I still get
the same trace on rpi4.
Copying here again for your reference.

[   37.929903] Unable to handle kernel access to user memory outside
uaccess routines at virtual address 0348
[   37.940882] Mem abort info:
[   37.943715]   ESR = 0x9604
[   37.946875]   EC = 0x25: DABT (current EL), IL = 32 bits
[   37.952547]   SET = 0, FnV = 0
[   37.956086]   EA = 0, S1PTW = 0
[   37.959570]   FSC = 0x04: level 0 translation fault
[   37.964561] Data abort info:
[   37.967518]   ISV = 0, ISS = 0x0004
[   37.971437]   CM = 0, WnR = 0
[   37.974666] user pgtable: 4k pages, 48-bit VAs, pgdp=50be
[   37.981239] [0348] pgd=, p4d=
[   37.988234] Internal error: Oops: 9604 [#1] PREEMPT SMP
[   37.993897] Modules linked in: overlay algif_hash algif_skcipher
af_alg bnep sch_fq_codel ppdev lp parport ip_tables x_tables autofs4
btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov
async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq
libcrc32c raid1 raid0 multipath linear hid_generic uas usbhid
usb_storage snd_soc_hdmi_codec btsdio hci_uart bcm2835_v4l2(C)
bcm2835_mmal_vchiq(C) btqca brcmfmac btrtl btbcm videobuf2_vmalloc
videobuf2_memops brcmutil btintel videobuf2_v4l2 cfg80211
raspberrypi_hwmon i2c_brcmstb videobuf2_common bluetooth videodev dwc2
udc_core roles vc4 crct10dif_ce ecdh_generic drm_kms_helper
pwm_bcm2835 cec ecc snd_bcm2835(C) mc drm snd_soc_core xhci_pci
xhci_pci_renesas ac97_bus snd_pcm_dmaengine snd_pcm snd_timer snd
fb_sys_fops syscopyarea sysfillrect sysimgblt phy_generic
uio_pdrv_genirq uio aes_neon_bs aes_neon_blk crypto_simd cryptd
[   38.073649] CPU: 1 PID: 1572 Comm: pulseaudio Tainted: G C
  5.15.0-rc1-d4d016caa4b8 #1
[   38.082913] Hardware name: Raspberry Pi 4 Model B (DT)
[   38.088119] pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   38.095178] pc : vc4_hdmi_audio_prepare+0x3c0/0xba4 [vc4]
[   38.100683] lr : vc4_hdmi_audio_prepare+0x308/0xba4 [vc4]
[   38.106173] sp : 800013073a60
[   38.109527] x29: 800013073a60 x28: 4086b800 x27: 
[   38.116767] x26:  x25: ac44 x24: 21002003
[   38.124004] x23: 800013073b50 x22: 0003 x21: 561b2080
[   38.131242] x20: 561b24c8 x19: 000583380600 x18: 
[   38.138479] x17:  x16:  x15: 
[   38.145716] x14:  x13:  x12: 0991
[   38.152952] x11: 0001 x10: 0001d4c0 x9 : 88f87708
[   38.160188] x8 : 0031 x7 : 0001d4c0 x6 : 0030
[   38.167424] x5 : 800013073aa8 x4 : 88f9baa8 x3 : 10624dd3
[   38.174661] x2 : 03e8 x1 :  x0 : 00562200
[   38.181899] Call trace:
[   38.184371]  vc4_hdmi_audio_prepare+0x3c0/0xba4 [vc4]
[   38.189511]  hdmi_codec_prepare+0xe8/0x1b0 [snd_soc_hdmi_codec]
[   38.195515]  snd_soc_pcm_dai_prepare+0x5c/0x10c [snd_soc_core]
[   38.201476]  soc_pcm_prepare+0x5c/0x130 [snd_soc_core]
[   38.206729]  snd_pcm_prepare+0x150/0x1f0 [snd_pcm]
[   38.211608]  snd_pcm_common_ioctl+0x1644/0x1d14 [snd_pcm]
[   38.217099]  snd_pcm_ioctl+0x3c/0x5c [snd_pcm]
[   38.221620]  __arm64_sys_ioctl+0xb4/0x100
[   38.225687]  invoke_syscall+0x50/0x120
[   38.229486]  el0_svc_common.constprop.0+0x180/0x1a0
[   38.234431]  do_el0_svc+0x34/0xa0
[   38.237788]  el0_svc+0x2c/0xc0
[   38.240883]  el0t_64_sync_handler+0xa4/0x12c
[   38.245208]  el0t_64_sync+0x1a4/0x1a8
[   38.248921] Code: 52807d02 72a20c43 f9400421 9ba37c13 (f941a423)
[   38.255098] ---[ end trace 925d8184702abf4d ]---


-- 
Regards
Sudip


Re: [git pull] drm for 5.14-rc1

2021-09-18 Thread Linus Torvalds
On Sat, Sep 18, 2021 at 3:00 PM Sudip Mukherjee
 wrote:
>
> Its still there. I am seeing it every night. This was from last night
> - https://lava.qa.codethink.co.uk/scheduler/job/164#L1356

Note that that web server is not available at least to me.  Looks like
some internal name or limited dns, I just get

lava.qa.codethink.co.uk: Name or service not known

so your reports aren't getting a lot of traction.

 Linus


Re: [git pull] drm for 5.14-rc1

2021-09-18 Thread Linus Torvalds
On Sat, Sep 18, 2021 at 1:13 PM Michael Stapelberg
 wrote:
>
> > Michael - do things work if you revert those two (sadly, they don't
> > revert cleanly exactly _because_ of the other changes in the same
> > area)?
>
> Reverting only 9984d6664ce9 is not sufficient, but reverting both
> 9984d6664ce9 and 411efa18e4b0 does indeed make my Raspberry Pi 3 boot
> again!

Since you did those reverts and fixed up the conflicts, would you mind
sending out the resulting patch so that maybe Sudip can test it too?

Maybe the RPi4 hdmi audio issues are related to the RPi4 hdmi problems
despite the symptoms apparently being rather different..

  Linus


Re: [git pull] drm for 5.14-rc1

2021-09-18 Thread Sudip Mukherjee
Hi Linus,

On Sat, Sep 18, 2021 at 8:24 PM Linus Torvalds
 wrote:
>
> On Sat, Sep 18, 2021 at 2:18 AM Michael Stapelberg
>  wrote:
> >
> > torvalds at linux-foundation.org (Linus Torvalds) writes:
> > > Did I fix it up correctly? Who knows. The code makes more sense to me
> > > now and seems valid. But I really *really* want to stress how locking
> > > is important.
> >
> > As far as I can tell, this merge conflict resolution made my Raspberry
> > Pi 3 hang on boot.
>
> Ok, that's a different merge issue than the locking one (which is
> about the amd ttm code).
>
> But the VC4 driver did have changes close to each other in the hdmi
> detection and clock setting code.
>
> And it doesn't seem to be just RPi3, there was a report back a couple
> of weeks ago about RPi4 also having regressed (with an Ubuntu
> install). That one was an oops in vc4_hdmi_audio_prepare(). I don't
> know if that got resolved, I heard nothing about it after the report.

Its still there. I am seeing it every night. This was from last night
- https://lava.qa.codethink.co.uk/scheduler/job/164#L1356
Last night's test was on top of 4357f03d6611 ("Merge tag 'pm-5.15-rc2'
of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm")


-- 
Regards
Sudip


Re: [git pull] drm for 5.14-rc1

2021-09-18 Thread Michael Stapelberg
On Sat, 18 Sept 2021 at 21:24, Linus Torvalds
 wrote:
>
> On Sat, Sep 18, 2021 at 2:18 AM Michael Stapelberg
>  wrote:
> >
> > torvalds at linux-foundation.org (Linus Torvalds) writes:
> > > Did I fix it up correctly? Who knows. The code makes more sense to me
> > > now and seems valid. But I really *really* want to stress how locking
> > > is important.
> >
> > As far as I can tell, this merge conflict resolution made my Raspberry
> > Pi 3 hang on boot.
>
> Ok, that's a different merge issue than the locking one (which is
> about the amd ttm code).

Ah, my apologies for getting these mixed up!

>
> But the VC4 driver did have changes close to each other in the hdmi
> detection and clock setting code.
>
> And it doesn't seem to be just RPi3, there was a report back a couple
> of weeks ago about RPi4 also having regressed (with an Ubuntu
> install). That one was an oops in vc4_hdmi_audio_prepare(). I don't
> know if that got resolved, I heard nothing about it after the report.
>
> So there's something seriously wrong in VC4 space.
>
> The main issue seems to be the runtime power management changes. As
> far as I can tell, the commits that didn't come in through that drm
> pull were these two
>
>   9984d6664ce9 ("drm/vc4: hdmi: Make sure the controller is powered in 
> detect")
>   411efa18e4b0 ("drm/vc4: hdmi: Move the HSM clock enable to runtime_pm")
>
> Michael - do things work if you revert those two (sadly, they don't
> revert cleanly exactly _because_ of the other changes in the same
> area)?

Reverting only 9984d6664ce9 is not sufficient, but reverting both
9984d6664ce9 and 411efa18e4b0 does indeed make my Raspberry Pi 3 boot
again!

Thanks for identifying the faulty commits. Please let me know if any
of y’all want me to test anything else in the process to get this
fixed for the next kernel release (or perhaps even a minor release)?

>
> Maxime?
>
>  Linus



-- 
Best regards,
Michael


Re: [git pull] drm for 5.14-rc1

2021-09-18 Thread Linus Torvalds
On Sat, Sep 18, 2021 at 2:18 AM Michael Stapelberg
 wrote:
>
> torvalds at linux-foundation.org (Linus Torvalds) writes:
> > Did I fix it up correctly? Who knows. The code makes more sense to me
> > now and seems valid. But I really *really* want to stress how locking
> > is important.
>
> As far as I can tell, this merge conflict resolution made my Raspberry
> Pi 3 hang on boot.

Ok, that's a different merge issue than the locking one (which is
about the amd ttm code).

But the VC4 driver did have changes close to each other in the hdmi
detection and clock setting code.

And it doesn't seem to be just RPi3, there was a report back a couple
of weeks ago about RPi4 also having regressed (with an Ubuntu
install). That one was an oops in vc4_hdmi_audio_prepare(). I don't
know if that got resolved, I heard nothing about it after the report.

So there's something seriously wrong in VC4 space.

The main issue seems to be the runtime power management changes. As
far as I can tell, the commits that didn't come in through that drm
pull were these two

  9984d6664ce9 ("drm/vc4: hdmi: Make sure the controller is powered in detect")
  411efa18e4b0 ("drm/vc4: hdmi: Move the HSM clock enable to runtime_pm")

Michael - do things work if you revert those two (sadly, they don't
revert cleanly exactly _because_ of the other changes in the same
area)?

Maxime?

 Linus


Re: [git pull] drm for 5.14-rc1

2021-09-18 Thread Simon Ser
CC Emma and Maxime

On Saturday, September 18th, 2021 at 11:18, Michael Stapelberg 
 wrote:

> Hi,
>
> torvalds at linux-foundation.org (Linus Torvalds) writes:
> > Did I fix it up correctly? Who knows. The code makes more sense to me
> > now and seems valid. But I really *really* want to stress how locking
> > is important.
>
> As far as I can tell, this merge conflict resolution made my Raspberry
> Pi 3 hang on boot. You can find the full serial console output at:
>
> https://t.zekjur.net/linux-5.14-raspberry-pi-3-hang-vc4.txt
>
> The last few messages are from vc4, then the boot hangs.
>
> Using git-bisect, I tracked this down to
> https://github.com/torvalds/linux/commit/e058a84bfddc42ba356a2316f2cf1141974625c9,
> which is the merge you’re talking about here, AFAICT.
>
> I also tried the git://anongit.freedesktop.org/drm/drm, and that tree
> boots as expected, suggesting that the problem really is with the
> additional changes.
>
> The code seems to work on my Raspberry Pi 4, just not on the Raspberry
> Pi 3. Any ideas why that might be, and how to fix it?
>
> Thanks!
>
> --
> Best regards,
> Michael


Re: [git pull] drm for 5.14-rc1

2021-09-18 Thread Michael Stapelberg
Hi,

torvalds at linux-foundation.org (Linus Torvalds) writes:
> Did I fix it up correctly? Who knows. The code makes more sense to me
> now and seems valid. But I really *really* want to stress how locking
> is important.

As far as I can tell, this merge conflict resolution made my Raspberry
Pi 3 hang on boot. You can find the full serial console output at:

https://t.zekjur.net/linux-5.14-raspberry-pi-3-hang-vc4.txt

The last few messages are from vc4, then the boot hangs.

Using git-bisect, I tracked this down to
https://github.com/torvalds/linux/commit/e058a84bfddc42ba356a2316f2cf1141974625c9,
which is the merge you’re talking about here, AFAICT.

I also tried the git://anongit.freedesktop.org/drm/drm, and that tree
boots as expected, suggesting that the problem really is with the
additional changes.

The code seems to work on my Raspberry Pi 4, just not on the Raspberry
Pi 3. Any ideas why that might be, and how to fix it?

Thanks!

-- 
Best regards,
Michael


Re: [git pull] drm for 5.14-rc1

2021-07-01 Thread Felix Kuehling
Am 2021-07-01 um 4:15 p.m. schrieb Linus Torvalds:
> On Wed, Jun 30, 2021 at 9:34 PM Dave Airlie  wrote:
>> Hi Linus,
>>
>> This is the main drm pull request for 5.14-rc1.
>>
>> I've done a test pull into your current tree, and hit two conflicts
>> (one in vc4, one in amdgpu), both seem pretty trivial, the amdgpu one
>> is recent and sfr sent out a resolution for it today.
> Well, the resolutions may be trivial, but the conflict made me look at
> the code, and it's buggy.
>
> Commit 04d8d73dbcbe ("drm/amdgpu: add common HMM get pages function")
> is broken. It made the code do
>
> mmap_read_lock(mm);
> vma = find_vma(mm, start);
> mmap_read_unlock(mm);
>
> and then it *uses* that "vma" after it has dropped the lock.
>
> That's a big no-no - once you've dropped the lock, the vma contents
> simply aren't reliable any more. That mapping could now be unmapped
> and removed at any time.
>
> Now, the conflict actually made one of the uses go away (switching to
> vma_lookup() means that the subsequent code no longer needs to look at
> "vm_start" to verify we're actually _inside_ the vma), but it still
> checks for vma->vm_file afterwards.
>
> So those locking changes in commit 04d8d73dbcbe are completely bogus.
>
> I tried to fix up that bug while handling the conflict, but who knows
> what else similar is going on elsewhere.
>
> So I would ask people to
>
>  (a) verify that I didn't make things worse as I fixed things up (note
> how I had to change the last argument to amdgpu_hmm_range_get_pages()
> from false to true etc).
>
>  (b) go and look at their vma lookup code: you can't just look up a
> vma under the lock, and then drop the lock, and then think things stay
> stable.
>
> In particular for that (b) case: it is *NOT* enough to look up
> vma->vm_file inside the lock and cache that. No - if the test is about
> "no backing file before looking up pages", then you have to *keep*
> holding the lock until after you've actually looked up the pages!
>
> Because otherwise any test for "vma->vm_file" is entirely pointless,
> for the same reason it's buggy to even look at it after dropping the
> lock: because once you've dropped the lock, the thing you just tested
> for might not be true any more.
>
> So no, it's not valid to do
>
> bool has_file = vma && vma->vm_file;
>
> and then drop the lock, because you don't use 'vma' any more as a
> pointer, and then use 'has_file' outside the lock. Because after
> you've dropped the lock, 'has_file' is now meaningless.
>
> So it's not just about "you can't look at vma->vm_file after dropping
> the lock". It's more fundamental than that. Any *decision* you make
> based on the vma is entirely pointless and moot after the lock is
> dropped!
>
> Did I fix it up correctly? Who knows. The code makes more sense to me
> now and seems valid. But I really *really* want to stress how locking
> is important.

Thank you for the fix and the explanation. Your fix looks correct. I
also double-checked all other uses of find_vma in the amdgpu driver.
They all hold the mmap lock correctly.

Two comments:

With this fix, we could remove the bool mmap_locked parameter from
amdgpu_hmm_range_get_pages because it always gets called with the lock
held now.

You're now holding the mmap lock from the vma_lookup until
hmm_range_fault is done. This ensures that the result of the
vma->vm_file check remains valid. This was broken even before our commit
04d8d73dbcbe ("drm/amdgpu: add common HMM get pages function").


>
> You also can't just unlock in the middle of an operation - even if you
> then take the lock *again* later (as amdgpu_hmm_range_get_pages() then
> did), the fact that you unlocked in the middle means that all the
> earlier tests you did are simply no longer valid when you re-take the
> lock.

I agree completely. I catch a lot of locking bugs in code review. I
probably missed this one because I wasn't paying enough attention to
what was being protected by the mmap_read_lock in this case.

Regards,
  Felix


>
>  Linus


Re: [git pull] drm for 5.14-rc1

2021-07-01 Thread pr-tracker-bot
The pull request you sent on Thu, 1 Jul 2021 14:34:15 +1000:

> git://anongit.freedesktop.org/drm/drm tags/drm-next-2021-07-01

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/e058a84bfddc42ba356a2316f2cf1141974625c9

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


Re: [git pull] drm for 5.14-rc1

2021-07-01 Thread Linus Torvalds
On Wed, Jun 30, 2021 at 9:34 PM Dave Airlie  wrote:
>
> Hi Linus,
>
> This is the main drm pull request for 5.14-rc1.
>
> I've done a test pull into your current tree, and hit two conflicts
> (one in vc4, one in amdgpu), both seem pretty trivial, the amdgpu one
> is recent and sfr sent out a resolution for it today.

Well, the resolutions may be trivial, but the conflict made me look at
the code, and it's buggy.

Commit 04d8d73dbcbe ("drm/amdgpu: add common HMM get pages function")
is broken. It made the code do

mmap_read_lock(mm);
vma = find_vma(mm, start);
mmap_read_unlock(mm);

and then it *uses* that "vma" after it has dropped the lock.

That's a big no-no - once you've dropped the lock, the vma contents
simply aren't reliable any more. That mapping could now be unmapped
and removed at any time.

Now, the conflict actually made one of the uses go away (switching to
vma_lookup() means that the subsequent code no longer needs to look at
"vm_start" to verify we're actually _inside_ the vma), but it still
checks for vma->vm_file afterwards.

So those locking changes in commit 04d8d73dbcbe are completely bogus.

I tried to fix up that bug while handling the conflict, but who knows
what else similar is going on elsewhere.

So I would ask people to

 (a) verify that I didn't make things worse as I fixed things up (note
how I had to change the last argument to amdgpu_hmm_range_get_pages()
from false to true etc).

 (b) go and look at their vma lookup code: you can't just look up a
vma under the lock, and then drop the lock, and then think things stay
stable.

In particular for that (b) case: it is *NOT* enough to look up
vma->vm_file inside the lock and cache that. No - if the test is about
"no backing file before looking up pages", then you have to *keep*
holding the lock until after you've actually looked up the pages!

Because otherwise any test for "vma->vm_file" is entirely pointless,
for the same reason it's buggy to even look at it after dropping the
lock: because once you've dropped the lock, the thing you just tested
for might not be true any more.

So no, it's not valid to do

bool has_file = vma && vma->vm_file;

and then drop the lock, because you don't use 'vma' any more as a
pointer, and then use 'has_file' outside the lock. Because after
you've dropped the lock, 'has_file' is now meaningless.

So it's not just about "you can't look at vma->vm_file after dropping
the lock". It's more fundamental than that. Any *decision* you make
based on the vma is entirely pointless and moot after the lock is
dropped!

Did I fix it up correctly? Who knows. The code makes more sense to me
now and seems valid. But I really *really* want to stress how locking
is important.

You also can't just unlock in the middle of an operation - even if you
then take the lock *again* later (as amdgpu_hmm_range_get_pages() then
did), the fact that you unlocked in the middle means that all the
earlier tests you did are simply no longer valid when you re-take the
lock.

 Linus