Re: [Intel-gfx] GPU hang with kernel 4.10rc3

2017-05-12 Thread Juergen Gross
On 11/05/17 23:08, Pavel Machek wrote:
> On Mon 2017-01-23 10:39:27, Juergen Gross wrote:
>> On 13/01/17 15:41, Juergen Gross wrote:
>>> On 12/01/17 10:21, Chris Wilson wrote:
>>>> On Thu, Jan 12, 2017 at 07:03:25AM +0100, Juergen Gross wrote:
>>>>> On 11/01/17 18:08, Chris Wilson wrote:
>>>>>> On Wed, Jan 11, 2017 at 05:33:34PM +0100, Juergen Gross wrote:
>>>>>>> With kernel 4.10rc3 running as Xen dom0 I get at each boot:
>>>>>>>
>>>>>>> [   49.213697] [drm] GPU HANG: ecode 7:0:0x3d1d3d3d, in gnome-shell
>>>>>>> [1431], reason: Hang on render ring, action: reset
>>>>>>> [   49.213699] [drm] GPU hangs can indicate a bug anywhere in the entire
>>>>>>> gfx stack, including userspace.
>>>>>>> [   49.213700] [drm] Please file a _new_ bug report on
>>>>>>> bugs.freedesktop.org against DRI -> DRM/Intel
>>>>>>> [   49.213700] [drm] drm/i915 developers can then reassign to the right
>>>>>>> component if it's not a kernel issue.
>>>>>>> [   49.213700] [drm] The gpu crash dump is required to analyze gpu
>>>>>>> hangs, so please always attach it.
>>>>>>> [   49.213701] [drm] GPU crash dump saved to /sys/class/drm/card0/error
>>>>>>> [   49.213755] drm/i915: Resetting chip after gpu hang
>>>>>>> [   60.213769] drm/i915: Resetting chip after gpu hang
>>>>>>> [   71.189737] drm/i915: Resetting chip after gpu hang
>>>>>>> [   82.165747] drm/i915: Resetting chip after gpu hang
>>>>>>> [   93.205727] drm/i915: Resetting chip after gpu hang
>>>>>>>
>>>>>>> The dump is attached.
>>>>>>
>>>>>> That's a nasty one. The first couple of pages of the batchbuffer appear
>>>>>> to be overwritten. (Full of 0xc2c2c2c2, i.e. probably pixel data.) That
>>>>>> may be a concurrent write by either the GPU or CPU, or we may have
>>>>>> incorrectly mapped a set of pages. That it doesn't recover suggests
>>>>>> that the corruption occurs frequently, probably on every request/batch.
>>>>>
>>>>> I hoped someone would have an idea already.
>>>>
>>>> Sorry, first report of something like this in a long time (that I can
>>>> remember at least). And the problem is that it can be anything from a
>>>> coherency to a concurrency issue, so no one patch springs to mind.
>>>> Thankfully it appears to be kernel related.
>>>> -Chris
>>>>
>>>
>>> Bisecting took longer than I thought, but I had to cherry pick some
>>> patches and rebase one of them multiple times...
>>>
>>> Finally I found the commit to blame: 920cf4194954ec ("drm/i915:
>>> Introduce an internal allocator for disposable private objects")
>>>
>>> In case you need me to produce some more data or test a patch
>>> feel free to reach out.
>>
>> Anything new for this severe regression?
>>
>> Without a fix 4.10 will be unusable with Xen on a machine with i915
>> graphics!
> 
> Did this get solved?

Yes. Commit 7152187159193056f30ad5726741bb25028672bf.


Juergen

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: Re: [Intel-gfx] GPU hang with kernel 4.10rc3

2017-05-11 Thread Pavel Machek
On Mon 2017-01-23 10:39:27, Juergen Gross wrote:
> On 13/01/17 15:41, Juergen Gross wrote:
> > On 12/01/17 10:21, Chris Wilson wrote:
> >> On Thu, Jan 12, 2017 at 07:03:25AM +0100, Juergen Gross wrote:
> >>> On 11/01/17 18:08, Chris Wilson wrote:
> >>>> On Wed, Jan 11, 2017 at 05:33:34PM +0100, Juergen Gross wrote:
> >>>>> With kernel 4.10rc3 running as Xen dom0 I get at each boot:
> >>>>>
> >>>>> [   49.213697] [drm] GPU HANG: ecode 7:0:0x3d1d3d3d, in gnome-shell
> >>>>> [1431], reason: Hang on render ring, action: reset
> >>>>> [   49.213699] [drm] GPU hangs can indicate a bug anywhere in the entire
> >>>>> gfx stack, including userspace.
> >>>>> [   49.213700] [drm] Please file a _new_ bug report on
> >>>>> bugs.freedesktop.org against DRI -> DRM/Intel
> >>>>> [   49.213700] [drm] drm/i915 developers can then reassign to the right
> >>>>> component if it's not a kernel issue.
> >>>>> [   49.213700] [drm] The gpu crash dump is required to analyze gpu
> >>>>> hangs, so please always attach it.
> >>>>> [   49.213701] [drm] GPU crash dump saved to /sys/class/drm/card0/error
> >>>>> [   49.213755] drm/i915: Resetting chip after gpu hang
> >>>>> [   60.213769] drm/i915: Resetting chip after gpu hang
> >>>>> [   71.189737] drm/i915: Resetting chip after gpu hang
> >>>>> [   82.165747] drm/i915: Resetting chip after gpu hang
> >>>>> [   93.205727] drm/i915: Resetting chip after gpu hang
> >>>>>
> >>>>> The dump is attached.
> >>>>
> >>>> That's a nasty one. The first couple of pages of the batchbuffer appear
> >>>> to be overwritten. (Full of 0xc2c2c2c2, i.e. probably pixel data.) That
> >>>> may be a concurrent write by either the GPU or CPU, or we may have
> >>>> incorrectly mapped a set of pages. That it doesn't recover suggests
> >>>> that the corruption occurs frequently, probably on every request/batch.
> >>>
> >>> I hoped someone would have an idea already.
> >>
> >> Sorry, first report of something like this in a long time (that I can
> >> remember at least). And the problem is that it can be anything from a
> >> coherency to a concurrency issue, so no one patch springs to mind.
> >> Thankfully it appears to be kernel related.
> >> -Chris
> >>
> > 
> > Bisecting took longer than I thought, but I had to cherry pick some
> > patches and rebase one of them multiple times...
> > 
> > Finally I found the commit to blame: 920cf4194954ec ("drm/i915:
> > Introduce an internal allocator for disposable private objects")
> > 
> > In case you need me to produce some more data or test a patch
> > feel free to reach out.
> 
> Anything new for this severe regression?
> 
> Without a fix 4.10 will be unusable with Xen on a machine with i915
> graphics!

Did this get solved?

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: Re: [Intel-gfx] GPU hang with kernel 4.10rc3

2017-01-23 Thread Juergen Gross
On 13/01/17 15:41, Juergen Gross wrote:
> On 12/01/17 10:21, Chris Wilson wrote:
>> On Thu, Jan 12, 2017 at 07:03:25AM +0100, Juergen Gross wrote:
>>> On 11/01/17 18:08, Chris Wilson wrote:
>>>> On Wed, Jan 11, 2017 at 05:33:34PM +0100, Juergen Gross wrote:
>>>>> With kernel 4.10rc3 running as Xen dom0 I get at each boot:
>>>>>
>>>>> [   49.213697] [drm] GPU HANG: ecode 7:0:0x3d1d3d3d, in gnome-shell
>>>>> [1431], reason: Hang on render ring, action: reset
>>>>> [   49.213699] [drm] GPU hangs can indicate a bug anywhere in the entire
>>>>> gfx stack, including userspace.
>>>>> [   49.213700] [drm] Please file a _new_ bug report on
>>>>> bugs.freedesktop.org against DRI -> DRM/Intel
>>>>> [   49.213700] [drm] drm/i915 developers can then reassign to the right
>>>>> component if it's not a kernel issue.
>>>>> [   49.213700] [drm] The gpu crash dump is required to analyze gpu
>>>>> hangs, so please always attach it.
>>>>> [   49.213701] [drm] GPU crash dump saved to /sys/class/drm/card0/error
>>>>> [   49.213755] drm/i915: Resetting chip after gpu hang
>>>>> [   60.213769] drm/i915: Resetting chip after gpu hang
>>>>> [   71.189737] drm/i915: Resetting chip after gpu hang
>>>>> [   82.165747] drm/i915: Resetting chip after gpu hang
>>>>> [   93.205727] drm/i915: Resetting chip after gpu hang
>>>>>
>>>>> The dump is attached.
>>>>
>>>> That's a nasty one. The first couple of pages of the batchbuffer appear
>>>> to be overwritten. (Full of 0xc2c2c2c2, i.e. probably pixel data.) That
>>>> may be a concurrent write by either the GPU or CPU, or we may have
>>>> incorrectly mapped a set of pages. That it doesn't recover suggests
>>>> that the corruption occurs frequently, probably on every request/batch.
>>>
>>> I hoped someone would have an idea already.
>>
>> Sorry, first report of something like this in a long time (that I can
>> remember at least). And the problem is that it can be anything from a
>> coherency to a concurrency issue, so no one patch springs to mind.
>> Thankfully it appears to be kernel related.
>> -Chris
>>
> 
> Bisecting took longer than I thought, but I had to cherry pick some
> patches and rebase one of them multiple times...
> 
> Finally I found the commit to blame: 920cf4194954ec ("drm/i915:
> Introduce an internal allocator for disposable private objects")
> 
> In case you need me to produce some more data or test a patch
> feel free to reach out.

Anything new for this severe regression?

Without a fix 4.10 will be unusable with Xen on a machine with i915
graphics!


Juergen


Re: [Intel-gfx] GPU hang with kernel 4.10rc3

2017-01-15 Thread Juergen Gross
On 12/01/17 10:21, Chris Wilson wrote:
> On Thu, Jan 12, 2017 at 07:03:25AM +0100, Juergen Gross wrote:
>> On 11/01/17 18:08, Chris Wilson wrote:
>>> On Wed, Jan 11, 2017 at 05:33:34PM +0100, Juergen Gross wrote:
>>>> With kernel 4.10rc3 running as Xen dom0 I get at each boot:
>>>>
>>>> [   49.213697] [drm] GPU HANG: ecode 7:0:0x3d1d3d3d, in gnome-shell
>>>> [1431], reason: Hang on render ring, action: reset
>>>> [   49.213699] [drm] GPU hangs can indicate a bug anywhere in the entire
>>>> gfx stack, including userspace.
>>>> [   49.213700] [drm] Please file a _new_ bug report on
>>>> bugs.freedesktop.org against DRI -> DRM/Intel
>>>> [   49.213700] [drm] drm/i915 developers can then reassign to the right
>>>> component if it's not a kernel issue.
>>>> [   49.213700] [drm] The gpu crash dump is required to analyze gpu
>>>> hangs, so please always attach it.
>>>> [   49.213701] [drm] GPU crash dump saved to /sys/class/drm/card0/error
>>>> [   49.213755] drm/i915: Resetting chip after gpu hang
>>>> [   60.213769] drm/i915: Resetting chip after gpu hang
>>>> [   71.189737] drm/i915: Resetting chip after gpu hang
>>>> [   82.165747] drm/i915: Resetting chip after gpu hang
>>>> [   93.205727] drm/i915: Resetting chip after gpu hang
>>>>
>>>> The dump is attached.
>>>
>>> That's a nasty one. The first couple of pages of the batchbuffer appear
>>> to be overwritten. (Full of 0xc2c2c2c2, i.e. probably pixel data.) That
>>> may be a concurrent write by either the GPU or CPU, or we may have
>>> incorrectly mapped a set of pages. That it doesn't recover suggests
>>> that the corruption occurs frequently, probably on every request/batch.
>>
>> I hoped someone would have an idea already.
> 
> Sorry, first report of something like this in a long time (that I can
> remember at least). And the problem is that it can be anything from a
> coherency to a concurrency issue, so no one patch springs to mind.
> Thankfully it appears to be kernel related.
> -Chris
> 

Bisecting took longer than I thought, but I had to cherry pick some
patches and rebase one of them multiple times...

Finally I found the commit to blame: 920cf4194954ec ("drm/i915:
Introduce an internal allocator for disposable private objects")

In case you need me to produce some more data or test a patch
feel free to reach out.


Juergen


Re: [Intel-gfx] GPU hang with kernel 4.10rc3

2017-01-12 Thread Juergen Gross
On 11/01/17 18:08, Chris Wilson wrote:
> On Wed, Jan 11, 2017 at 05:33:34PM +0100, Juergen Gross wrote:
>> With kernel 4.10rc3 running as Xen dom0 I get at each boot:
>>
>> [   49.213697] [drm] GPU HANG: ecode 7:0:0x3d1d3d3d, in gnome-shell
>> [1431], reason: Hang on render ring, action: reset
>> [   49.213699] [drm] GPU hangs can indicate a bug anywhere in the entire
>> gfx stack, including userspace.
>> [   49.213700] [drm] Please file a _new_ bug report on
>> bugs.freedesktop.org against DRI -> DRM/Intel
>> [   49.213700] [drm] drm/i915 developers can then reassign to the right
>> component if it's not a kernel issue.
>> [   49.213700] [drm] The gpu crash dump is required to analyze gpu
>> hangs, so please always attach it.
>> [   49.213701] [drm] GPU crash dump saved to /sys/class/drm/card0/error
>> [   49.213755] drm/i915: Resetting chip after gpu hang
>> [   60.213769] drm/i915: Resetting chip after gpu hang
>> [   71.189737] drm/i915: Resetting chip after gpu hang
>> [   82.165747] drm/i915: Resetting chip after gpu hang
>> [   93.205727] drm/i915: Resetting chip after gpu hang
>>
>> The dump is attached.
> 
> That's a nasty one. The first couple of pages of the batchbuffer appear
> to be overwritten. (Full of 0xc2c2c2c2, i.e. probably pixel data.) That
> may be a concurrent write by either the GPU or CPU, or we may have
> incorrectly mapped a set of pages. That it doesn't recover suggests
> that the corruption occurs frequently, probably on every request/batch.

I hoped someone would have an idea already.

> Is this a new bug? Bisection would be the fastest way to triage it.

Commit 7453c549f was still okay. Starting bisect now (2882 commits, 12
steps) ...


Juergen
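
As a rough check on the "12 steps" figure above: a bisection halves the
candidate range with every build-and-boot test, so N candidate commits need
about ceil(log2(N)) tests. The sketch below is plain arithmetic, not git's own
estimator (which may report a slightly different number); `bisect_steps` is a
name made up for this illustration.

```python
def bisect_steps(candidates: int) -> int:
    """Upper bound on tests needed to isolate one bad commit by halving.

    Illustrative helper (not part of git): equals ceil(log2(candidates)).
    """
    if candidates < 1:
        raise ValueError("need at least one candidate commit")
    # (n - 1).bit_length() is ceil(log2(n)) for n >= 1, with no float rounding.
    return (candidates - 1).bit_length()


if __name__ == "__main__":
    # 2882 is the commit count quoted in the message above.
    print(bisect_steps(2882))  # prints 12
```

With a reboot into Xen per step, a dozen iterations is already a noticeable
amount of work, before any cherry-picking needed to keep intermediate
commits bootable.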


Re: [Intel-gfx] GPU hang with kernel 4.10rc3

2017-01-12 Thread Chris Wilson
On Thu, Jan 12, 2017 at 07:03:25AM +0100, Juergen Gross wrote:
> On 11/01/17 18:08, Chris Wilson wrote:
> > On Wed, Jan 11, 2017 at 05:33:34PM +0100, Juergen Gross wrote:
> >> With kernel 4.10rc3 running as Xen dom0 I get at each boot:
> >>
> >> [   49.213697] [drm] GPU HANG: ecode 7:0:0x3d1d3d3d, in gnome-shell
> >> [1431], reason: Hang on render ring, action: reset
> >> [   49.213699] [drm] GPU hangs can indicate a bug anywhere in the entire
> >> gfx stack, including userspace.
> >> [   49.213700] [drm] Please file a _new_ bug report on
> >> bugs.freedesktop.org against DRI -> DRM/Intel
> >> [   49.213700] [drm] drm/i915 developers can then reassign to the right
> >> component if it's not a kernel issue.
> >> [   49.213700] [drm] The gpu crash dump is required to analyze gpu
> >> hangs, so please always attach it.
> >> [   49.213701] [drm] GPU crash dump saved to /sys/class/drm/card0/error
> >> [   49.213755] drm/i915: Resetting chip after gpu hang
> >> [   60.213769] drm/i915: Resetting chip after gpu hang
> >> [   71.189737] drm/i915: Resetting chip after gpu hang
> >> [   82.165747] drm/i915: Resetting chip after gpu hang
> >> [   93.205727] drm/i915: Resetting chip after gpu hang
> >>
> >> The dump is attached.
> > 
> > That's a nasty one. The first couple of pages of the batchbuffer appear
> > to be overwritten. (Full of 0xc2c2c2c2, i.e. probably pixel data.) That
> > may be a concurrent write by either the GPU or CPU, or we may have
> > incorrectly mapped a set of pages. That it doesn't recover suggests
> > that the corruption occurs frequently, probably on every request/batch.
> 
> I hoped someone would have an idea already.

Sorry, first report of something like this in a long time (that I can
remember at least). And the problem is that it can be anything from a
coherency to a concurrency issue, so no one patch springs to mind.
Thankfully it appears to be kernel related.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


Re: [Intel-gfx] GPU hang with kernel 4.10rc3

2017-01-11 Thread Chris Wilson
On Wed, Jan 11, 2017 at 05:33:34PM +0100, Juergen Gross wrote:
> With kernel 4.10rc3 running as Xen dom0 I get at each boot:
> 
> [   49.213697] [drm] GPU HANG: ecode 7:0:0x3d1d3d3d, in gnome-shell
> [1431], reason: Hang on render ring, action: reset
> [   49.213699] [drm] GPU hangs can indicate a bug anywhere in the entire
> gfx stack, including userspace.
> [   49.213700] [drm] Please file a _new_ bug report on
> bugs.freedesktop.org against DRI -> DRM/Intel
> [   49.213700] [drm] drm/i915 developers can then reassign to the right
> component if it's not a kernel issue.
> [   49.213700] [drm] The gpu crash dump is required to analyze gpu
> hangs, so please always attach it.
> [   49.213701] [drm] GPU crash dump saved to /sys/class/drm/card0/error
> [   49.213755] drm/i915: Resetting chip after gpu hang
> [   60.213769] drm/i915: Resetting chip after gpu hang
> [   71.189737] drm/i915: Resetting chip after gpu hang
> [   82.165747] drm/i915: Resetting chip after gpu hang
> [   93.205727] drm/i915: Resetting chip after gpu hang
> 
> The dump is attached.

That's a nasty one. The first couple of pages of the batchbuffer appear
to be overwritten. (Full of 0xc2c2c2c2, i.e. probably pixel data.) That
may be a concurrent write by either the GPU or CPU, or we may have
incorrectly mapped a set of pages. That it doesn't recover suggests
that the corruption occurs frequently, probably on every request/batch.

Is this a new bug? Bisection would be the fastest way to triage it.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
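
The reset timestamps in the quoted log make the "doesn't recover" observation
concrete: every reset is followed by another one a fixed, short time later.
A minimal sketch that extracts those intervals from the log lines as they
appear in this thread; the regular expression and the name `reset_intervals`
are assumptions made for this illustration, not part of any i915 tooling.

```python
import re

# Reset messages copied from the report quoted above.
LOG = """\
[   49.213755] drm/i915: Resetting chip after gpu hang
[   60.213769] drm/i915: Resetting chip after gpu hang
[   71.189737] drm/i915: Resetting chip after gpu hang
[   82.165747] drm/i915: Resetting chip after gpu hang
[   93.205727] drm/i915: Resetting chip after gpu hang
"""

# Captures the kernel timestamp (seconds since boot) of each reset message.
RESET_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+drm/i915: Resetting chip after gpu hang", re.M
)


def reset_intervals(log: str) -> list[float]:
    """Return the seconds elapsed between consecutive GPU reset messages."""
    stamps = [float(m.group(1)) for m in RESET_RE.finditer(log)]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    for gap in reset_intervals(LOG):
        print(f"{gap:.2f} s")
```

For the timestamps in this thread all four gaps come out close to 11 seconds,
which is consistent with the hang recurring after every reset rather than
being a one-off failure.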