Re: [Intel-gfx] GPU hang with kernel 4.10rc3
On 11/05/17 23:08, Pavel Machek wrote:
> On Mon 2017-01-23 10:39:27, Juergen Gross wrote:
>> On 13/01/17 15:41, Juergen Gross wrote:
>>> On 12/01/17 10:21, Chris Wilson wrote:
>>>> On Thu, Jan 12, 2017 at 07:03:25AM +0100, Juergen Gross wrote:
>>>>> On 11/01/17 18:08, Chris Wilson wrote:
>>>>>> On Wed, Jan 11, 2017 at 05:33:34PM +0100, Juergen Gross wrote:
>>>>>>> With kernel 4.10rc3 running as Xen dom0 I get at each boot:
>>>>>>>
>>>>>>> [ 49.213697] [drm] GPU HANG: ecode 7:0:0x3d1d3d3d, in gnome-shell [1431], reason: Hang on render ring, action: reset
>>>>>>> [ 49.213699] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
>>>>>>> [ 49.213700] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
>>>>>>> [ 49.213700] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
>>>>>>> [ 49.213700] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
>>>>>>> [ 49.213701] [drm] GPU crash dump saved to /sys/class/drm/card0/error
>>>>>>> [ 49.213755] drm/i915: Resetting chip after gpu hang
>>>>>>> [ 60.213769] drm/i915: Resetting chip after gpu hang
>>>>>>> [ 71.189737] drm/i915: Resetting chip after gpu hang
>>>>>>> [ 82.165747] drm/i915: Resetting chip after gpu hang
>>>>>>> [ 93.205727] drm/i915: Resetting chip after gpu hang
>>>>>>>
>>>>>>> The dump is attached.
>>>>>>
>>>>>> That's a nasty one. The first couple of pages of the batchbuffer appear
>>>>>> to be overwritten. (Full of 0xc2c2c2c2, i.e. probably pixel data.) That
>>>>>> may be a concurrent write by either the GPU or CPU, or we may have
>>>>>> incorrectly mapped a set of pages. That it doesn't recover suggests
>>>>>> that the corruption occurs frequently, probably on every request/batch.
>>>>>
>>>>> I hoped someone would have an idea already.
>>>>
>>>> Sorry, first report of something like this in a long time (that I can
>>>> remember at least). And the problem is that it can be anything from a
>>>> coherency to a concurrency issue, so no one patch springs to mind.
>>>> Thankfully it appears to be kernel related.
>>>> -Chris
>>>
>>> Bisecting took longer than I thought, but I had to cherry pick some
>>> patches and rebase one of them multiple times...
>>>
>>> Finally I found the commit to blame: 920cf4194954ec ("drm/i915:
>>> Introduce an internal allocator for disposable private objects")
>>>
>>> In case you need me to produce some more data or test a patch
>>> feel free to reach out.
>>
>> Anything new for this severe regression?
>>
>> Without a fix 4.10 will be unusable with Xen on a machine with i915
>> graphics!
>
> Did this get solved?

Yes. Commit 7152187159193056f30ad5726741bb25028672bf.

Juergen
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [Intel-gfx] GPU hang with kernel 4.10rc3
On Mon 2017-01-23 10:39:27, Juergen Gross wrote:
> On 13/01/17 15:41, Juergen Gross wrote:
> > On 12/01/17 10:21, Chris Wilson wrote:
> >> On Thu, Jan 12, 2017 at 07:03:25AM +0100, Juergen Gross wrote:
> >>> On 11/01/17 18:08, Chris Wilson wrote:
> >>>> On Wed, Jan 11, 2017 at 05:33:34PM +0100, Juergen Gross wrote:
> >>>>> With kernel 4.10rc3 running as Xen dom0 I get at each boot:
> >>>>>
> >>>>> [ 49.213697] [drm] GPU HANG: ecode 7:0:0x3d1d3d3d, in gnome-shell [1431], reason: Hang on render ring, action: reset
> >>>>> [ 49.213699] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
> >>>>> [ 49.213700] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
> >>>>> [ 49.213700] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
> >>>>> [ 49.213700] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
> >>>>> [ 49.213701] [drm] GPU crash dump saved to /sys/class/drm/card0/error
> >>>>> [ 49.213755] drm/i915: Resetting chip after gpu hang
> >>>>> [ 60.213769] drm/i915: Resetting chip after gpu hang
> >>>>> [ 71.189737] drm/i915: Resetting chip after gpu hang
> >>>>> [ 82.165747] drm/i915: Resetting chip after gpu hang
> >>>>> [ 93.205727] drm/i915: Resetting chip after gpu hang
> >>>>>
> >>>>> The dump is attached.
> >>>>
> >>>> That's a nasty one. The first couple of pages of the batchbuffer appear
> >>>> to be overwritten. (Full of 0xc2c2c2c2, i.e. probably pixel data.) That
> >>>> may be a concurrent write by either the GPU or CPU, or we may have
> >>>> incorrectly mapped a set of pages. That it doesn't recover suggests
> >>>> that the corruption occurs frequently, probably on every request/batch.
> >>>
> >>> I hoped someone would have an idea already.
> >>
> >> Sorry, first report of something like this in a long time (that I can
> >> remember at least). And the problem is that it can be anything from a
> >> coherency to a concurrency issue, so no one patch springs to mind.
> >> Thankfully it appears to be kernel related.
> >> -Chris
> >
> > Bisecting took longer than I thought, but I had to cherry pick some
> > patches and rebase one of them multiple times...
> >
> > Finally I found the commit to blame: 920cf4194954ec ("drm/i915:
> > Introduce an internal allocator for disposable private objects")
> >
> > In case you need me to produce some more data or test a patch
> > feel free to reach out.
>
> Anything new for this severe regression?
>
> Without a fix 4.10 will be unusable with Xen on a machine with i915
> graphics!

Did this get solved?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [Intel-gfx] GPU hang with kernel 4.10rc3
On 13/01/17 15:41, Juergen Gross wrote:
> On 12/01/17 10:21, Chris Wilson wrote:
>> On Thu, Jan 12, 2017 at 07:03:25AM +0100, Juergen Gross wrote:
>>> On 11/01/17 18:08, Chris Wilson wrote:
>>>> On Wed, Jan 11, 2017 at 05:33:34PM +0100, Juergen Gross wrote:
>>>>> With kernel 4.10rc3 running as Xen dom0 I get at each boot:
>>>>>
>>>>> [ 49.213697] [drm] GPU HANG: ecode 7:0:0x3d1d3d3d, in gnome-shell [1431], reason: Hang on render ring, action: reset
>>>>> [ 49.213699] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
>>>>> [ 49.213700] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
>>>>> [ 49.213700] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
>>>>> [ 49.213700] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
>>>>> [ 49.213701] [drm] GPU crash dump saved to /sys/class/drm/card0/error
>>>>> [ 49.213755] drm/i915: Resetting chip after gpu hang
>>>>> [ 60.213769] drm/i915: Resetting chip after gpu hang
>>>>> [ 71.189737] drm/i915: Resetting chip after gpu hang
>>>>> [ 82.165747] drm/i915: Resetting chip after gpu hang
>>>>> [ 93.205727] drm/i915: Resetting chip after gpu hang
>>>>>
>>>>> The dump is attached.
>>>>
>>>> That's a nasty one. The first couple of pages of the batchbuffer appear
>>>> to be overwritten. (Full of 0xc2c2c2c2, i.e. probably pixel data.) That
>>>> may be a concurrent write by either the GPU or CPU, or we may have
>>>> incorrectly mapped a set of pages. That it doesn't recover suggests
>>>> that the corruption occurs frequently, probably on every request/batch.
>>>
>>> I hoped someone would have an idea already.
>>
>> Sorry, first report of something like this in a long time (that I can
>> remember at least). And the problem is that it can be anything from a
>> coherency to a concurrency issue, so no one patch springs to mind.
>> Thankfully it appears to be kernel related.
>> -Chris
>
> Bisecting took longer than I thought, but I had to cherry pick some
> patches and rebase one of them multiple times...
>
> Finally I found the commit to blame: 920cf4194954ec ("drm/i915:
> Introduce an internal allocator for disposable private objects")
>
> In case you need me to produce some more data or test a patch
> feel free to reach out.

Anything new for this severe regression?

Without a fix 4.10 will be unusable with Xen on a machine with i915
graphics!

Juergen
Re: [Intel-gfx] GPU hang with kernel 4.10rc3
On 12/01/17 10:21, Chris Wilson wrote:
> On Thu, Jan 12, 2017 at 07:03:25AM +0100, Juergen Gross wrote:
>> On 11/01/17 18:08, Chris Wilson wrote:
>>> On Wed, Jan 11, 2017 at 05:33:34PM +0100, Juergen Gross wrote:
>>>> With kernel 4.10rc3 running as Xen dom0 I get at each boot:
>>>>
>>>> [ 49.213697] [drm] GPU HANG: ecode 7:0:0x3d1d3d3d, in gnome-shell [1431], reason: Hang on render ring, action: reset
>>>> [ 49.213699] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
>>>> [ 49.213700] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
>>>> [ 49.213700] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
>>>> [ 49.213700] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
>>>> [ 49.213701] [drm] GPU crash dump saved to /sys/class/drm/card0/error
>>>> [ 49.213755] drm/i915: Resetting chip after gpu hang
>>>> [ 60.213769] drm/i915: Resetting chip after gpu hang
>>>> [ 71.189737] drm/i915: Resetting chip after gpu hang
>>>> [ 82.165747] drm/i915: Resetting chip after gpu hang
>>>> [ 93.205727] drm/i915: Resetting chip after gpu hang
>>>>
>>>> The dump is attached.
>>>
>>> That's a nasty one. The first couple of pages of the batchbuffer appear
>>> to be overwritten. (Full of 0xc2c2c2c2, i.e. probably pixel data.) That
>>> may be a concurrent write by either the GPU or CPU, or we may have
>>> incorrectly mapped a set of pages. That it doesn't recover suggests
>>> that the corruption occurs frequently, probably on every request/batch.
>>
>> I hoped someone would have an idea already.
>
> Sorry, first report of something like this in a long time (that I can
> remember at least). And the problem is that it can be anything from a
> coherency to a concurrency issue, so no one patch springs to mind.
> Thankfully it appears to be kernel related.
> -Chris

Bisecting took longer than I thought, but I had to cherry pick some
patches and rebase one of them multiple times...

Finally I found the commit to blame: 920cf4194954ec ("drm/i915:
Introduce an internal allocator for disposable private objects")

In case you need me to produce some more data or test a patch
feel free to reach out.

Juergen
Re: [Intel-gfx] GPU hang with kernel 4.10rc3
On 11/01/17 18:08, Chris Wilson wrote:
> On Wed, Jan 11, 2017 at 05:33:34PM +0100, Juergen Gross wrote:
>> With kernel 4.10rc3 running as Xen dom0 I get at each boot:
>>
>> [ 49.213697] [drm] GPU HANG: ecode 7:0:0x3d1d3d3d, in gnome-shell [1431], reason: Hang on render ring, action: reset
>> [ 49.213699] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
>> [ 49.213700] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
>> [ 49.213700] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
>> [ 49.213700] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
>> [ 49.213701] [drm] GPU crash dump saved to /sys/class/drm/card0/error
>> [ 49.213755] drm/i915: Resetting chip after gpu hang
>> [ 60.213769] drm/i915: Resetting chip after gpu hang
>> [ 71.189737] drm/i915: Resetting chip after gpu hang
>> [ 82.165747] drm/i915: Resetting chip after gpu hang
>> [ 93.205727] drm/i915: Resetting chip after gpu hang
>>
>> The dump is attached.
>
> That's a nasty one. The first couple of pages of the batchbuffer appear
> to be overwritten. (Full of 0xc2c2c2c2, i.e. probably pixel data.) That
> may be a concurrent write by either the GPU or CPU, or we may have
> incorrectly mapped a set of pages. That it doesn't recover suggests
> that the corruption occurs frequently, probably on every request/batch.

I hoped someone would have an idea already.

> Is this a new bug? Bisection would be the fastest way to triage it.

Commit 7453c549f was still okay. Starting bisect now (2882 commits,
12 steps) ...

Juergen
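[Editor's sketch, not part of the original mail] The "2882 commits, 12
steps" figure above is consistent with a plain halving search: each
tested build rules out roughly half of the remaining candidates, so
about log2(2882) builds are needed. The Python below illustrates that
arithmetic only. The names bisect_steps and find_first_bad are
hypothetical, the commit list is a stand-in of integers, and git's own
step estimate may be computed with slightly different rounding.

```python
def bisect_steps(num_commits: int) -> int:
    """Worst-case number of test builds to isolate one bad commit
    among num_commits candidates by repeated halving."""
    if num_commits < 1:
        raise ValueError("need at least one candidate commit")
    # ceil(log2(n)) computed with integers to avoid float rounding.
    return (num_commits - 1).bit_length()


def find_first_bad(commits, is_bad):
    """Binary search for the first bad commit.

    Assumes commits are ordered oldest to newest, the last one is bad,
    and every commit after the first bad one is also bad.
    Returns (first_bad_commit, number_of_tests_run).
    """
    lo, hi = 0, len(commits) - 1
    steps = 0
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1  # one kernel build + boot test in real life
        if is_bad(commits[mid]):
            hi = mid        # first bad commit is at mid or earlier
        else:
            lo = mid + 1    # first bad commit is after mid
    return commits[lo], steps


if __name__ == "__main__":
    print(bisect_steps(2882))
    # Stand-in history: commit numbers 0..2881, regression at 1999.
    culprit, tests = find_first_bad(list(range(2882)), lambda c: c >= 1999)
    print(culprit, tests)
```

In practice the is_bad check is the expensive part (build, boot under
Xen, watch for the GPU hang), which is why the step count matters more
than the search code itself.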
Re: [Intel-gfx] GPU hang with kernel 4.10rc3
On Thu, Jan 12, 2017 at 07:03:25AM +0100, Juergen Gross wrote:
> On 11/01/17 18:08, Chris Wilson wrote:
> > On Wed, Jan 11, 2017 at 05:33:34PM +0100, Juergen Gross wrote:
> >> With kernel 4.10rc3 running as Xen dom0 I get at each boot:
> >>
> >> [ 49.213697] [drm] GPU HANG: ecode 7:0:0x3d1d3d3d, in gnome-shell [1431], reason: Hang on render ring, action: reset
> >> [ 49.213699] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
> >> [ 49.213700] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
> >> [ 49.213700] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
> >> [ 49.213700] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
> >> [ 49.213701] [drm] GPU crash dump saved to /sys/class/drm/card0/error
> >> [ 49.213755] drm/i915: Resetting chip after gpu hang
> >> [ 60.213769] drm/i915: Resetting chip after gpu hang
> >> [ 71.189737] drm/i915: Resetting chip after gpu hang
> >> [ 82.165747] drm/i915: Resetting chip after gpu hang
> >> [ 93.205727] drm/i915: Resetting chip after gpu hang
> >>
> >> The dump is attached.
> >
> > That's a nasty one. The first couple of pages of the batchbuffer appear
> > to be overwritten. (Full of 0xc2c2c2c2, i.e. probably pixel data.) That
> > may be a concurrent write by either the GPU or CPU, or we may have
> > incorrectly mapped a set of pages. That it doesn't recover suggests
> > that the corruption occurs frequently, probably on every request/batch.
>
> I hoped someone would have an idea already.

Sorry, first report of something like this in a long time (that I can
remember at least). And the problem is that it can be anything from a
coherency to a concurrency issue, so no one patch springs to mind.
Thankfully it appears to be kernel related.
-Chris

--
Chris Wilson, Intel Open Source Technology Centre
Re: [Intel-gfx] GPU hang with kernel 4.10rc3
On Wed, Jan 11, 2017 at 05:33:34PM +0100, Juergen Gross wrote:
> With kernel 4.10rc3 running as Xen dom0 I get at each boot:
>
> [ 49.213697] [drm] GPU HANG: ecode 7:0:0x3d1d3d3d, in gnome-shell [1431], reason: Hang on render ring, action: reset
> [ 49.213699] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
> [ 49.213700] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
> [ 49.213700] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
> [ 49.213700] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
> [ 49.213701] [drm] GPU crash dump saved to /sys/class/drm/card0/error
> [ 49.213755] drm/i915: Resetting chip after gpu hang
> [ 60.213769] drm/i915: Resetting chip after gpu hang
> [ 71.189737] drm/i915: Resetting chip after gpu hang
> [ 82.165747] drm/i915: Resetting chip after gpu hang
> [ 93.205727] drm/i915: Resetting chip after gpu hang
>
> The dump is attached.

That's a nasty one. The first couple of pages of the batchbuffer appear
to be overwritten. (Full of 0xc2c2c2c2, i.e. probably pixel data.) That
may be a concurrent write by either the GPU or CPU, or we may have
incorrectly mapped a set of pages. That it doesn't recover suggests
that the corruption occurs frequently, probably on every request/batch.

Is this a new bug? Bisection would be the fastest way to triage it.
-Chris

--
Chris Wilson, Intel Open Source Technology Centre