Re: imxdrm issue on SABRE Lite

2017-02-20 Thread Dan MacDonald
Hi Philipp

On Mon, Feb 20, 2017 at 3:42 PM, Philipp Zabel  wrote:
> Hi Dan,
>
> On Sat, 2017-02-11 at 21:09 +, Dan MacDonald wrote:
>> Hi Philipp
>>
>> I'm having trouble getting xf86-video-armada working properly on an
>> Element 14 / Embest SABRE Lite board running Arch Linux with kernel
>> 4.9.8. I have been in touch with RMK and he's confident the crash
>> below is caused by a kernel issue relating to imxdrm and a null
>> pointer dereference.
>
> I can't reproduce this on v4.9.8 with a Nitrogen6X with 1280x800 LVDS
> panel and 1920x1080 HDMI monitor using modetest from libdrm. Does this
> only happen with X?

I have read that there are supposedly differences between the Element
14 SL and the Boundary Devices board of the same name, but I have yet
to find out exactly what they are, as the specs are the same on paper.
That said, all of the U-Boot and OS images for the BD SL that I've
tried have worked without issue, including the GLES demos on the BD
Ubuntu 16.04 image.

I've never used Wayland. The console works fine with my 4K display and
I have been able to get X running at 720x480 under Arch with armada,
but at no other resolution.

> Do you have framebuffer console emulation enabled?

Not sure. I will need to check whether that's enabled by default under
ALARM (Arch Linux ARM). I've not manually enabled it.

> Does this still happen if you disable the LVDS panel in the device tree?

Russell sent me a different dtb file which seemed to fix some of my
kernel problems. Russell will be able to explain that better than me.

>
>> The first indication of X/armada not being happy is that it only starts
>> in 720x480 when connected to a 4K HDMI display, and when I try
>> running:
>>
>> xrandr --output HDMI1 --mode 1920x1080
>>
>> X crashes.
>>
>> Here's the serial console output for X loading and then crashing after
>> I try running that xrandr command:
>>
>>  Arch Linux 4.9.8-1-ARCH (ttymxc1)
>>
>> alarm login: [   19.197477] fec 2188000.ethernet eth0: Link is Up -
>> 1Gbps/Full - flow control rx/tx
>> [   19.205165] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
>> [   33.08] vgaarb: this pci device is not a vga device
>> [   33.203520] vgaarb: this pci device is not a vga device
>> [   33.447051] imx-ipuv3 240.ipu: DC stop timeout after 50 ms
>
> Does the "gpu: ipu-v3: remove IRQ dance on DC channel disable" patch
> (https://patchwork.kernel.org/patch/9417683) do anything to improve the
> situation, besides removing the above warning?

After trying Russell's dtb, he asked me to try booting with:

setenv bootargs cma=256M

Then X failed to start with the error:

[   57.705889] imx-drm display-subsystem: failed to allocate buffer
with size 33177600

Do you think that patch could fix such an issue?
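
A side note on that size: 33177600 bytes is exactly what one 3840x2160 buffer at 4 bytes per pixel would need, so it looks as if X is asking for a full 4K scanout buffer. That reading is an assumption; the log gives only the byte count. A minimal sketch of the arithmetic in C (fb_bytes() and fb_count_upper_bound() are made-up helper names, not kernel functions):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes needed for one linear framebuffer,
 * assuming no row padding (pitch == width * bytes_per_pixel). */
uint64_t fb_bytes(uint32_t width, uint32_t height, uint32_t bytes_per_pixel)
{
	assert(bytes_per_pixel != 0);
	return (uint64_t)width * height * bytes_per_pixel;
}

/* Hypothetical helper: how many such buffers fit in a CMA pool of
 * cma_bytes. Fragmentation and other CMA users are ignored, so this
 * is only an upper bound. */
uint64_t fb_count_upper_bound(uint64_t cma_bytes, uint64_t one_fb)
{
	assert(one_fb != 0);
	return cma_bytes / one_fb;
}
```

With these, fb_bytes(3840, 2160, 4) gives 33177600, and a 256 MiB pool holds at most eight such buffers before fragmentation and other CMA users are taken into account.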

Thanks
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: imxdrm issue on SABRE Lite

2017-02-20 Thread Philipp Zabel
Hi Dan,

On Sat, 2017-02-11 at 21:09 +, Dan MacDonald wrote:
> Hi Philipp
>
> I'm having trouble getting xf86-video-armada working properly on an
> Element 14 / Embest SABRE Lite board running Arch Linux with kernel
> 4.9.8. I have been in touch with RMK and he's confident the crash
> below is caused by a kernel issue relating to imxdrm and a null
> pointer dereference.

I can't reproduce this on v4.9.8 with a Nitrogen6X with 1280x800 LVDS
panel and 1920x1080 HDMI monitor using modetest from libdrm. Does this
only happen with X? Do you have framebuffer console emulation enabled?
Does this still happen if you disable the LVDS panel in the device tree?

> The first indication of X/armada not being happy is that it only starts
> in 720x480 when connected to a 4K HDMI display, and when I try
> running:
> 
> xrandr --output HDMI1 --mode 1920x1080
> 
> X crashes.
> 
> Here's the serial console output for X loading and then crashing after
> I try running that xrandr command:
> 
>  Arch Linux 4.9.8-1-ARCH (ttymxc1)
> 
> alarm login: [   19.197477] fec 2188000.ethernet eth0: Link is Up -
> 1Gbps/Full - flow control rx/tx
> [   19.205165] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [   33.08] vgaarb: this pci device is not a vga device
> [   33.203520] vgaarb: this pci device is not a vga device
> [   33.447051] imx-ipuv3 240.ipu: DC stop timeout after 50 ms

Does the "gpu: ipu-v3: remove IRQ dance on DC channel disable" patch
(https://patchwork.kernel.org/patch/9417683) do anything to improve the
situation, besides removing the above warning?

> [   33.707042] [ cut here ]
> [   33.711682] WARNING: CPU: 1 PID: 316 at
> drivers/gpu/drm/drm_atomic_helper.c:1140
> drm_atomic_helper_wait_for_vblanks+0x274/0x278
> [   33.723187] [CRTC:24] vblank wait timed out
> [   33.727395] Modules linked in: snd_soc_sgtl5000 snd_soc_fsl_ssi
> snd_soc_imx_sgtl5000 imx_pcm_fiq imx_pcm_dma snd_soc_fsl_asrc
> snd_soc_fsl_asoc_card snd_soc_core dw_hdmi_ahb_audio snd_pcm_dmaengine
> caam_jr imx_ipuv3_crtc snd_ac97_codec coda v4l2_mem2mem
> videobuf2_dma_contig ac97_bus imx_ipu_v3 snd_soc_imx_audmux snd_pcm
> videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core
> dw_hdmi_imx caam imx2_wdt ofpart spi_imx evdev dw_hdmi etnaviv imx_ldb
> pwm_imx snd_timer parallel_display uio_pdrv_genirq uio imxdrm
> sch_fq_codel ip_tables x_tables
> [   33.776237] CPU: 1 PID: 316 Comm: Xorg Not tainted 4.9.8-1-ARCH #1
> [   33.782419] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
> [   33.788965] [] (unwind_backtrace) from []
> (show_stack+0x10/0x14)
> [   33.796719] [] (show_stack) from []
> (dump_stack+0x88/0x9c)
> [   33.803951] [] (dump_stack) from [] (__warn+0xe8/0x100)
> [   33.810919] [] (__warn) from []
> (warn_slowpath_fmt+0x48/0x6c)
> [   33.818410] [] (warn_slowpath_fmt) from []
> (drm_atomic_helper_wait_for_vblanks+0x274/0x278)
> [   33.828516] [] (drm_atomic_helper_wait_for_vblanks) from
> [] (imx_drm_atomic_commit_tail+0x48/0x58 [imxdrm])
> [   33.840004] [] (imx_drm_atomic_commit_tail [imxdrm]) from
> [] (commit_tail+0x40/0x5c)
> [   33.849489] [] (commit_tail) from []
> (drm_atomic_helper_commit+0x94/0xd8)
> [   33.858019] [] (drm_atomic_helper_commit) from
> [] (drm_atomic_helper_set_config+0x78/0x9c)
> [   33.868028] [] (drm_atomic_helper_set_config) from
> [] (drm_mode_set_config_internal+0x58/0xdc)
> [   33.878382] [] (drm_mode_set_config_internal) from
> [] (drm_mode_setcrtc+0x4a4/0x550)
> [   33.887867] [] (drm_mode_setcrtc) from []
> (drm_ioctl+0x214/0x44c)
> [   33.895704] [] (drm_ioctl) from []
> (do_vfs_ioctl+0xac/0x980)
> [   33.903104] [] (do_vfs_ioctl) from []
> (SyS_ioctl+0x6c/0x7c)
> [   33.910422] [] (SyS_ioctl) from []
> (ret_fast_syscall+0x0/0x3c)
> [   33.918023] ---[ end trace e8a77aa320be7e55 ]---
> [   43.997066] [drm:drm_atomic_helper_commit_cleanup_done] *ERROR*
> [CRTC:24:crtc-0] flip_done timed out
> [   55.517063] [drm:drm_atomic_helper_commit_cleanup_done] *ERROR*
> [CRTC:24:crtc-0] flip_done timed out
> [   56.958264] alloc_contig_range: [4d0b7, 4d0b8) PFNs busy
> [   90.896615] usb 1-1.2.3: new low-speed USB device number 6 using ci_hdrc
> [   91.050796] usb 1-1.2.3: New USB device found, idVendor=0458, 
> idProduct=003a
> [   91.057888] usb 1-1.2.3: New USB device strings: Mfr=1, Product=2,
> SerialNumber=0
> [   91.065382] usb 1-1.2.3: Product: Optical Mouse
> [   91.069944] usb 1-1.2.3: Manufacturer: Genius
> [   91.082071] input: Genius Optical Mouse as
> /devices/soc0/soc/210.aips-bus/2184200.usb/ci_hdrc.1/usb1/1-1/1-1.2/1-1.2.3/1-1.2.3:1.0/0003:0458:003A.0003/input/input3
> [   91.098116] hid-generic 0003:0458:003A.0003: input,hidraw2: USB HID
> v1.11 Mouse [Genius Optical Mouse] on usb-ci_hdrc.1-1.2.3/input0
> [   91.162048] mousedev: PS/2 mouse device common for all mice
> [  214.765689] imx-ipuv3 240.ipu: DC stop timeout after 50 ms
> [  214.825688] Unable to handle kernel NULL pointer dereference at
> virtual address 

Re: imxdrm issue on SABRE Lite

2017-02-19 Thread Dan MacDonald
Hi all

Has there been any progress toward getting imxdrm working with the
SABRE Lite and similar?

I'm presuming that none of you own such a board and that this won't be
fixed in time for 4.10, right?

Thanks

On Mon, Feb 13, 2017 at 9:38 AM, Russell King - ARM Linux
 wrote:
> On Mon, Feb 13, 2017 at 08:55:53AM +, Chris Wilson wrote:
>> On Mon, Feb 13, 2017 at 09:05:33AM +0100, Thierry Reding wrote:
>> > On Sun, Feb 12, 2017 at 12:15:46AM +, Russell King - ARM Linux wrote:
>> > > On Sat, Feb 11, 2017 at 09:09:34PM +, Dan MacDonald wrote:
>> > > diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
>> > > index 21f992605541..46668d071d6a 100644
>> > > --- a/drivers/gpu/drm/drm_atomic_helper.c
>> > > +++ b/drivers/gpu/drm/drm_atomic_helper.c
>> > > @@ -1201,9 +1201,8 @@ static void commit_tail(struct drm_atomic_state *state)
>> > >   else
>> > >   drm_atomic_helper_commit_tail(state);
>> > >
>> > > - drm_atomic_helper_commit_cleanup_done(state);
>> > > -
>> > > - drm_atomic_state_free(state);
>> > > + if (drm_atomic_helper_commit_cleanup_done(state) == 0)
>> > > + drm_atomic_state_free(state);
>> >
>> > Chris (Cc'ed) added reference counting to atomic state for v4.10, maybe
>> > that already fixes the issue?
>>
>> I'm not confident it will, as there is not an independent ref on the
>> state for the phases, and so a forced timeout still leaves a dangling
>> pointer. The above chunk goes the opposite way and leaks the state to
>> avoid the invalid deref, what we need is a ref around its existence on
>> the dependency queue if that is outside the lifetime of the commit.
>
> I said as much in my email - unfortunately, Thierry cut all that context.
>
> Right now, we oops the kernel, which causes:
>
> (a) the death of the calling process
> (b) leaking of all memory associated with the modeset
>
> What I'm proposing for the -stable kernels is to _improve_ the situation
> by eliminating part of the problem, so it's possible to get a better
> idea of which bit went wrong and which outputs have failed.
>
> Fixing it properly is likely to be very invasive, since you'll need to
> add reference counting to the drm_crtc_commit structure, a pointer
> to that in the drm_pending_event structure, and ensure that the
> reference count gets incremented at the appropriate time.  Incrementing
> the reference count in drm_atomic_helper_setup_commit() certainly isn't
> the right place, that would be at the sites which queue the event, but
> they are scattered amongst all the atomic modeset drivers.
>
> For reference, here's my complete patch I posted yesterday:
>
>  drivers/gpu/drm/drm_atomic_helper.c | 15 +--
>  include/drm/drm_atomic_helper.h |  2 +-
>  2 files changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
> index 21f992605541..46668d071d6a 100644
> --- a/drivers/gpu/drm/drm_atomic_helper.c
> +++ b/drivers/gpu/drm/drm_atomic_helper.c
> @@ -1201,9 +1201,8 @@ static void commit_tail(struct drm_atomic_state *state)
> else
> drm_atomic_helper_commit_tail(state);
>
> -   drm_atomic_helper_commit_cleanup_done(state);
> -
> -   drm_atomic_state_free(state);
> +   if (drm_atomic_helper_commit_cleanup_done(state) == 0)
> +   drm_atomic_state_free(state);
>  }
>
>  static void commit_work(struct work_struct *work)
> @@ -1591,12 +1590,12 @@ EXPORT_SYMBOL(drm_atomic_helper_commit_hw_done);
>   * This is part of the atomic helper support for nonblocking commits, see
>   * drm_atomic_helper_setup_commit() for an overview.
>   */
> -void drm_atomic_helper_commit_cleanup_done(struct drm_atomic_state *state)
> +int drm_atomic_helper_commit_cleanup_done(struct drm_atomic_state *state)
>  {
> struct drm_crtc *crtc;
> struct drm_crtc_state *crtc_state;
> struct drm_crtc_commit *commit;
> -   int i;
> +   int i, failed = 0;
> long ret;
>
> for_each_crtc_in_state(state, crtc, crtc_state, i) {
> @@ -1621,15 +1620,19 @@ void drm_atomic_helper_commit_cleanup_done(struct drm_atomic_state *state)
>  * not hold a reference of its own. */
> ret = wait_for_completion_timeout(&commit->flip_done,
>   10*HZ);
> -   if (ret == 0)
> -   DRM_ERROR("[CRTC:%d:%s] flip_done timed out\n",
> +   if (ret == 0) {
> +   DRM_ERROR("[CRTC:%d:%s] flip_done timed out, memory leaked\n",
>   crtc->base.id, crtc->name);
> +   failed = -ETIMEDOUT;
> +   }
>
> spin_lock(&crtc->commit_lock);
>  del_commit:
> list_del(&commit->commit_entry);
> spin_unlock(&crtc->commit_lock);
> }
> }
> +
> +   return failed;
>  }
>  

Re: imxdrm issue on SABRE Lite

2017-02-13 Thread Russell King - ARM Linux
On Mon, Feb 13, 2017 at 09:05:33AM +0100, Thierry Reding wrote:
> On Sun, Feb 12, 2017 at 12:15:46AM +, Russell King - ARM Linux wrote:
> > diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
> > index 21f992605541..46668d071d6a 100644
> > --- a/drivers/gpu/drm/drm_atomic_helper.c
> > +++ b/drivers/gpu/drm/drm_atomic_helper.c
> > @@ -1201,9 +1201,8 @@ static void commit_tail(struct drm_atomic_state *state)
> > else
> > drm_atomic_helper_commit_tail(state);
> >  
> > -   drm_atomic_helper_commit_cleanup_done(state);
> > -
> > -   drm_atomic_state_free(state);
> > +   if (drm_atomic_helper_commit_cleanup_done(state) == 0)
> > +   drm_atomic_state_free(state);
> 
> Chris (Cc'ed) added reference counting to atomic state for v4.10, maybe
> that already fixes the issue?

No.  It's not the atomic state that's referenced, it's only a completion
within the drm_crtc_commit structure, which is completely separate from
the atomic state.  Moreover, the event code has no knowledge of commits,
so it can't "put" a reference count on it.  See:

void drm_send_event_locked(struct drm_device *dev, struct drm_pending_event *e)
{
	assert_spin_locked(&dev->event_lock);

	if (e->completion) {
		/* ->completion might disappear as soon as it signalled. */
		complete_all(e->completion);
		e->completion = NULL;
	}

vs the setup of the event done in drm_atomic_helper_setup_commit():

		if (!crtc_state->event) {
			commit->event = kzalloc(sizeof(*commit->event),
						GFP_KERNEL);
			if (!commit->event)
				return -ENOMEM;

			crtc_state->event = commit->event;
		}

		crtc_state->event->base.completion = &commit->flip_done;

"commit" gets freed before drm_send_event_locked() is called (hence
the timeout message) and when drm_send_event_locked() is eventually
called via drm_vblank_off(), this causes a use-after-free bug.
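
To make the ordering problem concrete, here is a small userspace model of it. This is only a sketch: the model_* names are invented for illustration, the structures are stand-ins for drm_crtc_commit and drm_pending_event rather than copies of them, and a "live" flag replaces kfree() so the model can report the bad access instead of actually performing one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins, NOT the kernel structures. */
struct model_completion { bool done; };
struct model_commit {
	bool live;                         /* false once "freed" */
	struct model_completion flip_done;
};
struct model_event {
	struct model_completion *completion; /* borrowed pointer */
	struct model_commit *owner;          /* model bookkeeping only */
};

/* Mirrors the setup step: the event borrows a pointer into the commit
 * without taking any reference on it. */
void model_setup(struct model_commit *c, struct model_event *e)
{
	assert(c != NULL && e != NULL);
	c->live = true;
	c->flip_done.done = false;
	e->completion = &c->flip_done;
	e->owner = c;
}

/* Stands in for freeing the commit; storage is kept so the model can
 * still inspect it safely. */
void model_free_commit(struct model_commit *c)
{
	c->live = false;
}

/* Mirrors the send step. Returns false at the point where the real
 * code would be touching freed memory. */
bool model_send_event(struct model_event *e)
{
	if (e->completion == NULL)
		return true;
	if (!e->owner->live)
		return false;
	e->completion->done = true;
	e->completion = NULL;
	return true;
}
```

Calling model_setup(), then model_send_event() completes flip_done as intended. Calling model_setup(), then model_free_commit() (the timeout path), then model_send_event() returns false, which is the case that oopses in the real kernel.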

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


Re: imxdrm issue on SABRE Lite

2017-02-13 Thread Russell King - ARM Linux
On Mon, Feb 13, 2017 at 08:55:53AM +, Chris Wilson wrote:
> On Mon, Feb 13, 2017 at 09:05:33AM +0100, Thierry Reding wrote:
> > On Sun, Feb 12, 2017 at 12:15:46AM +, Russell King - ARM Linux wrote:
> > > On Sat, Feb 11, 2017 at 09:09:34PM +, Dan MacDonald wrote:
> > > > diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
> > > > index 21f992605541..46668d071d6a 100644
> > > > --- a/drivers/gpu/drm/drm_atomic_helper.c
> > > > +++ b/drivers/gpu/drm/drm_atomic_helper.c
> > > > @@ -1201,9 +1201,8 @@ static void commit_tail(struct drm_atomic_state *state)
> > >   else
> > >   drm_atomic_helper_commit_tail(state);
> > >  
> > > - drm_atomic_helper_commit_cleanup_done(state);
> > > -
> > > - drm_atomic_state_free(state);
> > > + if (drm_atomic_helper_commit_cleanup_done(state) == 0)
> > > + drm_atomic_state_free(state);
> > 
> > Chris (Cc'ed) added reference counting to atomic state for v4.10, maybe
> > that already fixes the issue?
> 
> I'm not confident it will, as there is not an independent ref on the
> state for the phases, and so a forced timeout still leaves a dangling
> pointer. The above chunk goes the opposite way and leaks the state to
> avoid the invalid deref, what we need is a ref around its existence on
> the dependency queue if that is outside the lifetime of the commit.

I said as much in my email - unfortunately, Thierry cut all that context.

Right now, we oops the kernel, which causes:

(a) the death of the calling process
(b) leaking of all memory associated with the modeset

What I'm proposing for the -stable kernels is to _improve_ the situation
by eliminating part of the problem, so it's possible to get a better
idea of which bit went wrong and which outputs have failed.

Fixing it properly is likely to be very invasive, since you'll need to
add reference counting to the drm_crtc_commit structure, a pointer
to that in the drm_pending_event structure, and ensure that the
reference count gets incremented at the appropriate time.  Incrementing
the reference count in drm_atomic_helper_setup_commit() certainly isn't
the right place; that would be at the sites which queue the event, but
they are scattered amongst all the atomic modeset drivers.
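
A rough userspace sketch of the reference-counting idea described above, under the assumption that the event takes its own reference when it is queued and drops it when it is sent. All sketch_* names are hypothetical, and a plain int counter is used for brevity; kernel code would need its own atomic refcounting primitive instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted commit. */
struct sketch_commit {
	int refcount;   /* not atomic: single-threaded sketch only */
	int flip_done;
};

/* Hypothetical stand-in for a pending event that pins its commit. */
struct sketch_event {
	struct sketch_commit *commit;
};

struct sketch_commit *sketch_commit_new(void)
{
	struct sketch_commit *c = calloc(1, sizeof(*c));

	if (c)
		c->refcount = 1; /* reference held by the committer */
	return c;
}

void sketch_commit_get(struct sketch_commit *c)
{
	assert(c->refcount > 0);
	c->refcount++;
}

/* Drops one reference; returns 1 if this freed the commit. */
int sketch_commit_put(struct sketch_commit *c)
{
	assert(c->refcount > 0);
	if (--c->refcount == 0) {
		free(c);
		return 1;
	}
	return 0;
}

/* Queueing the event is where the extra reference is taken. */
void sketch_queue_event(struct sketch_event *e, struct sketch_commit *c)
{
	sketch_commit_get(c);
	e->commit = c;
}

/* Sending the event signals completion, then drops the event's
 * reference. Returns 1 if that freed the commit. */
int sketch_send_event(struct sketch_event *e)
{
	struct sketch_commit *c = e->commit;

	if (c == NULL)
		return 0;
	c->flip_done = 1;
	e->commit = NULL;
	return sketch_commit_put(c);
}
```

In this arrangement the committer can time out and drop its reference without freeing anything, and the commit is only released when the late event is finally sent.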

For reference, here's my complete patch I posted yesterday:

 drivers/gpu/drm/drm_atomic_helper.c | 15 +--
 include/drm/drm_atomic_helper.h |  2 +-
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
index 21f992605541..46668d071d6a 100644
--- a/drivers/gpu/drm/drm_atomic_helper.c
+++ b/drivers/gpu/drm/drm_atomic_helper.c
@@ -1201,9 +1201,8 @@ static void commit_tail(struct drm_atomic_state *state)
else
drm_atomic_helper_commit_tail(state);
 
-   drm_atomic_helper_commit_cleanup_done(state);
-
-   drm_atomic_state_free(state);
+   if (drm_atomic_helper_commit_cleanup_done(state) == 0)
+   drm_atomic_state_free(state);
 }
 
 static void commit_work(struct work_struct *work)
@@ -1591,12 +1590,12 @@ EXPORT_SYMBOL(drm_atomic_helper_commit_hw_done);
  * This is part of the atomic helper support for nonblocking commits, see
  * drm_atomic_helper_setup_commit() for an overview.
  */
-void drm_atomic_helper_commit_cleanup_done(struct drm_atomic_state *state)
+int drm_atomic_helper_commit_cleanup_done(struct drm_atomic_state *state)
 {
struct drm_crtc *crtc;
struct drm_crtc_state *crtc_state;
struct drm_crtc_commit *commit;
-   int i;
+   int i, failed = 0;
long ret;
 
for_each_crtc_in_state(state, crtc, crtc_state, i) {
@@ -1621,15 +1620,19 @@ void drm_atomic_helper_commit_cleanup_done(struct drm_atomic_state *state)
 * not hold a reference of its own. */
ret = wait_for_completion_timeout(&commit->flip_done,
  10*HZ);
-   if (ret == 0)
-   DRM_ERROR("[CRTC:%d:%s] flip_done timed out\n",
+   if (ret == 0) {
+   DRM_ERROR("[CRTC:%d:%s] flip_done timed out, memory leaked\n",
  crtc->base.id, crtc->name);
+   failed = -ETIMEDOUT;
+   }
 
spin_lock(&crtc->commit_lock);
 del_commit:
list_del(&commit->commit_entry);
spin_unlock(&crtc->commit_lock);
}
+
+   return failed;
 }
 EXPORT_SYMBOL(drm_atomic_helper_commit_cleanup_done);
 
diff --git a/include/drm/drm_atomic_helper.h b/include/drm/drm_atomic_helper.h
index 7ff92b09fd9c..ee3d642c1feb 100644
--- a/include/drm/drm_atomic_helper.h
+++ b/include/drm/drm_atomic_helper.h
@@ -88,7 +88,7 @@ int drm_atomic_helper_setup_commit(struct drm_atomic_state *state,
   bool nonblock);
 void drm_atomic_helper_wait_for_dependencies(struct drm_atomic_state *state);
 void 

Re: imxdrm issue on SABRE Lite

2017-02-13 Thread Chris Wilson
On Mon, Feb 13, 2017 at 09:05:33AM +0100, Thierry Reding wrote:
> On Sun, Feb 12, 2017 at 12:15:46AM +, Russell King - ARM Linux wrote:
> > On Sat, Feb 11, 2017 at 09:09:34PM +, Dan MacDonald wrote:
> > diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
> > index 21f992605541..46668d071d6a 100644
> > --- a/drivers/gpu/drm/drm_atomic_helper.c
> > +++ b/drivers/gpu/drm/drm_atomic_helper.c
> > @@ -1201,9 +1201,8 @@ static void commit_tail(struct drm_atomic_state *state)
> > else
> > drm_atomic_helper_commit_tail(state);
> >  
> > -   drm_atomic_helper_commit_cleanup_done(state);
> > -
> > -   drm_atomic_state_free(state);
> > +   if (drm_atomic_helper_commit_cleanup_done(state) == 0)
> > +   drm_atomic_state_free(state);
> 
> Chris (Cc'ed) added reference counting to atomic state for v4.10, maybe
> that already fixes the issue?

I'm not confident it will, as there is not an independent ref on the
state for the phases, and so a forced timeout still leaves a dangling
pointer. The above chunk goes the opposite way and leaks the state to
avoid the invalid deref; what we need is a ref around its existence on
the dependency queue if that is outside the lifetime of the commit.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


Re: imxdrm issue on SABRE Lite

2017-02-13 Thread Thierry Reding
On Sun, Feb 12, 2017 at 12:15:46AM +, Russell King - ARM Linux wrote:
> On Sat, Feb 11, 2017 at 09:09:34PM +, Dan MacDonald wrote:
> > [   43.997066] [drm:drm_atomic_helper_commit_cleanup_done] *ERROR*
> > [CRTC:24:crtc-0] flip_done timed out
> > [   55.517063] [drm:drm_atomic_helper_commit_cleanup_done] *ERROR*
> > [CRTC:24:crtc-0] flip_done timed out
> 
> This seems to lay the foundation for the kernel to Oops itself later.
> The problem seems to be this:
> 
> drm_atomic_helper_commit(state->dev, state, false)
> - drm_atomic_helper_setup_commit(state, false)
>   - foreach crtc in state
>     - commit->event = kzalloc()
>     - crtc_state->event = commit->event
>     - crtc_state->event->base.completion = &commit->flip_done
> ...
> - commit_tail(state)
>   - funcs->atomic_commit_tail(state)
>     ...
>     - drm_atomic_helper_commit_planes(dev, state,
>           DRM_PLANE_COMMIT_ACTIVE_ONLY |
>           DRM_PLANE_COMMIT_NO_DISABLE_AFTER_MODES$
>       - foreach active crtc in state
>         - funcs->atomic_begin(crtc, old_crtc_state)
>           - ipu_crtc_atomic_begin()
>             - drm_crtc_vblank_on()
>             - if crtc->state->event
>               - drm_crtc_arm_vblank_event(crtc, crtc->state->event)
>               - crtc->state->event = NULL
> 
> At this point, the "commit->flip_done" completion is queued with the
> event onto the vblank list.
> ...
>   - drm_atomic_helper_commit_cleanup_done(state)
>     - foreach crtc in state
>       - try_wait_for_completion(&commit->hw_done)
>       - wait_for_completion_timeout(&commit->flip_done, 10sec)
> 
> This is where we get the timeout message.
> 
>   - drm_atomic_state_free(state)
> 
> This "clears" the commit state (calling drm_crtc_commit_put() on it)
> which has the effect of kfree()'ing the structure containing the
> flip_done, but which is still on the vblank list.
> 
> The next time we try to set a mode, the result is that a call to
> drm_crtc_vblank_off() causes all queued events to be sent, including
> the now kfree()'d flip_done completion, resulting in the reported
> kernel oops.
> 
> It seems others are also suffering similar issues when the flip_done
> completion times out with other drivers:
> 
> https://lkml.org/lkml/2016/12/1/171
> https://bugs.freedesktop.org/show_bug.cgi?id=96781
> https://lists.opensuse.org/opensuse-bugs/2016-10/msg03011.html
> https://patchwork.kernel.org/patch/9280223/ (which is me...)
> 
> This is likely the same, although the timeout line was not captured:
> https://bugzilla.redhat.com/show_bug.cgi?id=1415180
> https://bodhi.fedoraproject.org/updates/kernel-4.8.7-200.fc24
> 
> So, can we please avoid killing the kernel when the hardware doesn't
> quite behave as we want it to?
> 
> Right now, when we oops the kernel, we're leaking all the memory
> associated with the atomic modeset, so if we stop oopsing the kernel
> but still leak the memory, surely that would be an improvement?
> Maybe something like the untested patch at the bottom of this mail?
> 
> It would give the opportunity to poke about on a failed system to
> work out what happened and maybe why the hardware misbehaved.
> 
> The real answer is for the hardware to behave, but we can't always
> have our cake.
> 
> Note - I can't reproduce Dan's problem here on 4.10-rc7 as I suspect
> it needs multiple CRTCs/outputs running in the IPU to trigger it,
> which Sabre lite has.  I'll try enabling the (disconnected) LVDS
> output tomorrow (I have Fabio's LVDS patch knocking about), but I
> suspect those with a deeper knowledge of the IPU need to investigate
> what's going on.
> 
> > [  214.765689] imx-ipuv3 240.ipu: DC stop timeout after 50 ms
> > [  214.825688] Unable to handle kernel NULL pointer dereference at
> > virtual address 
> > [  214.833783] pgd = ed1b8000
> > [  214.836491] [] *pgd=4c974831
> > [  214.840084] Internal error: Oops: 17 [#1] SMP ARM
> > [  214.844789] Modules linked in: mousedev snd_soc_sgtl5000
> > snd_soc_fsl_ssi snd_soc_imx_sgtl5000 imx_pcm_fiq imx_pcm_dma
> > snd_soc_fsl_asrc snd_soc_fsl_asoc_card snd_soc_core dw_hdmi_ahb_audio
> > snd_pcm_dmaengine caam_jr imx_ipuv3_crtc snd_ac97_codec coda
> > v4l2_mem2mem videobuf2_dma_contig ac97_bus imx_ipu_v3
> > snd_soc_imx_audmux snd_pcm videobuf2_vmalloc videobuf2_memops
> > videobuf2_v4l2 videobuf2_core dw_hdmi_imx caam imx2_wdt ofpart spi_imx
> > evdev dw_hdmi etnaviv imx_ldb pwm_imx snd_timer parallel_display
> > uio_pdrv_genirq uio imxdrm sch_fq_codel ip_tables x_tables
> > [  214.894338] CPU: 2 PID: 316 Comm: Xorg Tainted: GW
> > 4.9.8-1-ARCH #1
> > [  214.901735] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
> > [  214.908264] task: ed2c4d00 task.stack: ed2a6000
> > [  214.912803] PC is at __wake_up_common+0x1c/0x80
> > [  214.917337] LR is at __wake_up_locked+0x14/0x1c
> > [  214.921871] pc : []lr : []psr: a0070093
> > [  214.921871] sp : ed2a7c68  ip :   fp : c0fa2a70
> > [  

imxdrm issue on SABRE Lite

2017-02-12 Thread Dan MacDonald
Hi Philipp

I'm having trouble getting xf86-video-armada working properly on an
Element 14 / Embest SABRE Lite board running Arch Linux with kernel
4.9.8. I have been in touch with RMK and he's confident the crash
below is caused by a kernel issue relating to imxdrm and a null
pointer dereference.

The first indication of X/armada not being happy is that it only starts
in 720x480 when connected to a 4K HDMI display, and when I try
running:

xrandr --output HDMI1 --mode 1920x1080

X crashes.

Here's the serial console output for X loading and then crashing after
I try running that xrandr command:

 Arch Linux 4.9.8-1-ARCH (ttymxc1)

alarm login: [   19.197477] fec 2188000.ethernet eth0: Link is Up -
1Gbps/Full - flow control rx/tx
[   19.205165] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   33.08] vgaarb: this pci device is not a vga device
[   33.203520] vgaarb: this pci device is not a vga device
[   33.447051] imx-ipuv3 240.ipu: DC stop timeout after 50 ms
[   33.707042] [ cut here ]
[   33.711682] WARNING: CPU: 1 PID: 316 at
drivers/gpu/drm/drm_atomic_helper.c:1140
drm_atomic_helper_wait_for_vblanks+0x274/0x278
[   33.723187] [CRTC:24] vblank wait timed out
[   33.727395] Modules linked in: snd_soc_sgtl5000 snd_soc_fsl_ssi
snd_soc_imx_sgtl5000 imx_pcm_fiq imx_pcm_dma snd_soc_fsl_asrc
snd_soc_fsl_asoc_card snd_soc_core dw_hdmi_ahb_audio snd_pcm_dmaengine
caam_jr imx_ipuv3_crtc snd_ac97_codec coda v4l2_mem2mem
videobuf2_dma_contig ac97_bus imx_ipu_v3 snd_soc_imx_audmux snd_pcm
videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core
dw_hdmi_imx caam imx2_wdt ofpart spi_imx evdev dw_hdmi etnaviv imx_ldb
pwm_imx snd_timer parallel_display uio_pdrv_genirq uio imxdrm
sch_fq_codel ip_tables x_tables
[   33.776237] CPU: 1 PID: 316 Comm: Xorg Not tainted 4.9.8-1-ARCH #1
[   33.782419] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[   33.788965] [] (unwind_backtrace) from []
(show_stack+0x10/0x14)
[   33.796719] [] (show_stack) from []
(dump_stack+0x88/0x9c)
[   33.803951] [] (dump_stack) from [] (__warn+0xe8/0x100)
[   33.810919] [] (__warn) from []
(warn_slowpath_fmt+0x48/0x6c)
[   33.818410] [] (warn_slowpath_fmt) from []
(drm_atomic_helper_wait_for_vblanks+0x274/0x278)
[   33.828516] [] (drm_atomic_helper_wait_for_vblanks) from
[] (imx_drm_atomic_commit_tail+0x48/0x58 [imxdrm])
[   33.840004] [] (imx_drm_atomic_commit_tail [imxdrm]) from
[] (commit_tail+0x40/0x5c)
[   33.849489] [] (commit_tail) from []
(drm_atomic_helper_commit+0x94/0xd8)
[   33.858019] [] (drm_atomic_helper_commit) from
[] (drm_atomic_helper_set_config+0x78/0x9c)
[   33.868028] [] (drm_atomic_helper_set_config) from
[] (drm_mode_set_config_internal+0x58/0xdc)
[   33.878382] [] (drm_mode_set_config_internal) from
[] (drm_mode_setcrtc+0x4a4/0x550)
[   33.887867] [] (drm_mode_setcrtc) from []
(drm_ioctl+0x214/0x44c)
[   33.895704] [] (drm_ioctl) from []
(do_vfs_ioctl+0xac/0x980)
[   33.903104] [] (do_vfs_ioctl) from []
(SyS_ioctl+0x6c/0x7c)
[   33.910422] [] (SyS_ioctl) from []
(ret_fast_syscall+0x0/0x3c)
[   33.918023] ---[ end trace e8a77aa320be7e55 ]---
[   43.997066] [drm:drm_atomic_helper_commit_cleanup_done] *ERROR*
[CRTC:24:crtc-0] flip_done timed out
[   55.517063] [drm:drm_atomic_helper_commit_cleanup_done] *ERROR*
[CRTC:24:crtc-0] flip_done timed out
[   56.958264] alloc_contig_range: [4d0b7, 4d0b8) PFNs busy
[   90.896615] usb 1-1.2.3: new low-speed USB device number 6 using ci_hdrc
[   91.050796] usb 1-1.2.3: New USB device found, idVendor=0458, idProduct=003a
[   91.057888] usb 1-1.2.3: New USB device strings: Mfr=1, Product=2,
SerialNumber=0
[   91.065382] usb 1-1.2.3: Product: Optical Mouse
[   91.069944] usb 1-1.2.3: Manufacturer: Genius
[   91.082071] input: Genius Optical Mouse as
/devices/soc0/soc/210.aips-bus/2184200.usb/ci_hdrc.1/usb1/1-1/1-1.2/1-1.2.3/1-1.2.3:1.0/0003:0458:003A.0003/input/input3
[   91.098116] hid-generic 0003:0458:003A.0003: input,hidraw2: USB HID
v1.11 Mouse [Genius Optical Mouse] on usb-ci_hdrc.1-1.2.3/input0
[   91.162048] mousedev: PS/2 mouse device common for all mice
[  214.765689] imx-ipuv3 240.ipu: DC stop timeout after 50 ms
[  214.825688] Unable to handle kernel NULL pointer dereference at
virtual address 
[  214.833783] pgd = ed1b8000
[  214.836491] [] *pgd=4c974831
[  214.840084] Internal error: Oops: 17 [#1] SMP ARM
[  214.844789] Modules linked in: mousedev snd_soc_sgtl5000
snd_soc_fsl_ssi snd_soc_imx_sgtl5000 imx_pcm_fiq imx_pcm_dma
snd_soc_fsl_asrc snd_soc_fsl_asoc_card snd_soc_core dw_hdmi_ahb_audio
snd_pcm_dmaengine caam_jr imx_ipuv3_crtc snd_ac97_codec coda
v4l2_mem2mem videobuf2_dma_contig ac97_bus imx_ipu_v3
snd_soc_imx_audmux snd_pcm videobuf2_vmalloc videobuf2_memops
videobuf2_v4l2 videobuf2_core dw_hdmi_imx caam imx2_wdt ofpart spi_imx
evdev dw_hdmi etnaviv imx_ldb pwm_imx snd_timer parallel_display
uio_pdrv_genirq uio imxdrm sch_fq_codel ip_tables x_tables
[  214.894338] CPU: 2 PID: 316 

Re: imxdrm issue on SABRE Lite

2017-02-12 Thread Russell King - ARM Linux
On Sat, Feb 11, 2017 at 09:09:34PM +, Dan MacDonald wrote:
> [   43.997066] [drm:drm_atomic_helper_commit_cleanup_done] *ERROR*
> [CRTC:24:crtc-0] flip_done timed out
> [   55.517063] [drm:drm_atomic_helper_commit_cleanup_done] *ERROR*
> [CRTC:24:crtc-0] flip_done timed out

This seems to lay the foundation for the kernel to Oops itself later.
The problem seems to be this:

drm_atomic_helper_commit(state->dev, state, false)
- drm_atomic_helper_setup_commit(state, false)
  - foreach crtc in state
    - commit->event = kzalloc()
    - crtc_state->event = commit->event
    - crtc_state->event->base.completion = &commit->flip_done
...
- commit_tail(state)
  - funcs->atomic_commit_tail(state)
    ...
    - drm_atomic_helper_commit_planes(dev, state,
          DRM_PLANE_COMMIT_ACTIVE_ONLY |
          DRM_PLANE_COMMIT_NO_DISABLE_AFTER_MODES$
      - foreach active crtc in state
        - funcs->atomic_begin(crtc, old_crtc_state)
          - ipu_crtc_atomic_begin()
            - drm_crtc_vblank_on()
            - if crtc->state->event
              - drm_crtc_arm_vblank_event(crtc, crtc->state->event)
              - crtc->state->event = NULL

At this point, the "commit->flip_done" completion is queued with the
event onto the vblank list.
...
  - drm_atomic_helper_commit_cleanup_done(state)
    - foreach crtc in state
      - try_wait_for_completion(&commit->hw_done)
      - wait_for_completion_timeout(&commit->flip_done, 10sec)

This is where we get the timeout message.

  - drm_atomic_state_free(state)

This "clears" the commit state (calling drm_crtc_commit_put() on it)
which has the effect of kfree()'ing the structure containing the
flip_done, but which is still on the vblank list.

The next time we try to set a mode, the result is that a call to
drm_crtc_vblank_off() causes all queued events to be sent, including
the now kfree()'d flip_done completion, resulting in the reported
kernel oops.

It seems others are also suffering similar issues when the flip_done
completion times out with other drivers:

https://lkml.org/lkml/2016/12/1/171
https://bugs.freedesktop.org/show_bug.cgi?id=96781
https://lists.opensuse.org/opensuse-bugs/2016-10/msg03011.html
https://patchwork.kernel.org/patch/9280223/ (which is me...)

This is likely the same, although the timeout line was not captured:
https://bugzilla.redhat.com/show_bug.cgi?id=1415180
https://bodhi.fedoraproject.org/updates/kernel-4.8.7-200.fc24

So, can we please avoid killing the kernel when the hardware doesn't
quite behave as we want it to?

Right now, when we oops the kernel, we're leaking all the memory
associated with the atomic modeset, so if we stop oopsing the kernel
but still leak the memory, surely that would be an improvement?
Maybe something like the untested patch at the bottom of this mail?

It would give the opportunity to poke about on a failed system to
work out what happened and maybe why the hardware misbehaved.
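
The trade-off can be sketched in plain C: on a flip_done timeout the state is deliberately left allocated rather than freed while something may still point into it. The sketch_* names below are hypothetical stand-ins, not the DRM helpers themselves, and the timeout is modelled as a simple flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the atomic state; flip_done models
 * whether the completion was signalled before the wait expired. */
struct sketch_state {
	bool flip_done;
};

/* Models the cleanup step: 0 on success, -ETIMEDOUT if the flip
 * never completed. */
int sketch_cleanup_done(const struct sketch_state *s)
{
	assert(s != NULL);
	return s->flip_done ? 0 : -ETIMEDOUT;
}

/* Models the tail of the commit. Returns true if the state was freed,
 * false if it was deliberately leaked because a queued event may
 * still reference it. */
bool sketch_commit_tail(struct sketch_state *s)
{
	if (sketch_cleanup_done(s) == 0) {
		free(s);
		return true;
	}
	return false; /* leak instead of risking a use-after-free */
}
```

A leaked state costs memory on every failed modeset, but the machine stays up and can be inspected, which is the point being argued here.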

The real answer is for the hardware to behave, but we can't always
have our cake.

Note - I can't reproduce Dan's problem here on 4.10-rc7 as I suspect
it needs multiple CRTCs/outputs running in the IPU to trigger it,
which Sabre lite has.  I'll try enabling the (disconnected) LVDS
output tomorrow (I have Fabio's LVDS patch knocking about), but I
suspect those with a deeper knowledge of the IPU need to investigate
what's going on.

> [  214.765689] imx-ipuv3 240.ipu: DC stop timeout after 50 ms
> [  214.825688] Unable to handle kernel NULL pointer dereference at
> virtual address 
> [  214.833783] pgd = ed1b8000
> [  214.836491] [] *pgd=4c974831
> [  214.840084] Internal error: Oops: 17 [#1] SMP ARM
> [  214.844789] Modules linked in: mousedev snd_soc_sgtl5000
> snd_soc_fsl_ssi snd_soc_imx_sgtl5000 imx_pcm_fiq imx_pcm_dma
> snd_soc_fsl_asrc snd_soc_fsl_asoc_card snd_soc_core dw_hdmi_ahb_audio
> snd_pcm_dmaengine caam_jr imx_ipuv3_crtc snd_ac97_codec coda
> v4l2_mem2mem videobuf2_dma_contig ac97_bus imx_ipu_v3
> snd_soc_imx_audmux snd_pcm videobuf2_vmalloc videobuf2_memops
> videobuf2_v4l2 videobuf2_core dw_hdmi_imx caam imx2_wdt ofpart spi_imx
> evdev dw_hdmi etnaviv imx_ldb pwm_imx snd_timer parallel_display
> uio_pdrv_genirq uio imxdrm sch_fq_codel ip_tables x_tables
> [  214.894338] CPU: 2 PID: 316 Comm: Xorg Tainted: GW
> 4.9.8-1-ARCH #1
> [  214.901735] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
> [  214.908264] task: ed2c4d00 task.stack: ed2a6000
> [  214.912803] PC is at __wake_up_common+0x1c/0x80
> [  214.917337] LR is at __wake_up_locked+0x14/0x1c
> [  214.921871] pc : []lr : []psr: a0070093
> [  214.921871] sp : ed2a7c68  ip :   fp : c0fa2a70
> [  214.933348] r10: c0f37384  r9 : 0003  r8 : 
> [  214.938574] r7 :   r6 : edbf3410  r5 : edbf3408  r4 : edbf340c
> [  214.945101] r3 :   r2 :   r1 :   r0 : edbf340c
> [  214.951630] Flags: NzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM
> Segment