hibernate random memory corruption, workaround i915.modeset=0

2012-03-21 Thread Stanislaw Gruszka
On Mon, Mar 19, 2012 at 10:21:28AM -0700, Keith Packard wrote:
> On Mon, 19 Mar 2012 15:53:54 +0100, Stanislaw Gruszka <sgrus...@redhat.com> wrote:
> 
> > Keith, is there a chance that this bug can be fixed by i915 team?
> 
> Yes, I'm working on figuring out how to actually reproduce this and then
> work on a few work-arounds.

This must be hardware dependent then. Yesterday and today I tried to
reproduce it on a Lenovo T60 with:

00:02.0 VGA compatible controller [0300]: Intel Corporation Mobile 945GM/GMS, 
943/940GML Express Integrated Graphics Controller [8086:27a2] (rev 03)

and there is no sign of corruption. I double-checked that slab poisoning
is enabled.
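
When comparing affected and unaffected adapters, the bracketed
vendor:device pair in the lspci line (here [8086:27a2]) is the least
ambiguous key. A minimal sketch of extracting it, assuming the line has
first been re-joined from its wrapped form; the helper name pci_id is
hypothetical and not part of any tool mentioned in this thread:

```python
import re
from typing import Optional, Tuple

# Matches the bracketed vendor:device pair, e.g. [8086:27a2].
# The class code (e.g. [0300]) contains no colon, so it is not matched.
_PCI_ID = re.compile(r"\[([0-9a-fA-F]{4}):([0-9a-fA-F]{4})\]")


def pci_id(lspci_line: str) -> Optional[Tuple[str, str]]:
    """Return (vendor, device) from one lspci line in the format above, or None."""
    match = _PCI_ID.search(lspci_line)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2).lower()
```

Keying reports on this pair rather than on the marketing name avoids
confusing the several chipsets that share a description string.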

> > If not, can we disable hibernate on i915 with modeset=1 and add
> > module option, which enable it for those who want to risk?
> 
> I'd love to know if disabling modeset on just the booting kernel helps;
> leaving the resuming kernel with modeset=1. I haven't been able to
> reproduce this locally yet to test this theory though.

On a Lenovo T500 with:

00:02.1 Display controller [0380]: Intel Corporation Mobile 4 Series Chipset 
Integrated Graphics Controller [8086:2a43] (rev 07)

that is the case. The script attached to my previous email triggers
corruption with modeset=1 (on various older and newer kernels) in fewer
than 20 iterations. There is no corruption after 100 iterations if the
i915.modeset=0 parameter is used. Some users on RH bugzilla confirmed
that workaround as well.
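
To confirm which setting a given boot actually used, the kernel command
line can be inspected. A minimal sketch, assuming the parameter was
passed on the kernel command line and is therefore visible in
/proc/cmdline; the helper name i915_modeset is hypothetical, and treating
the last occurrence as the effective one is an assumption of this sketch:

```python
from typing import Optional


def i915_modeset(cmdline: str) -> Optional[str]:
    """Return the value of i915.modeset from a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == "i915.modeset":
            # Assumption: if the parameter is repeated, the last one wins.
            value = val
    return value


# Typical use on a running Linux system:
#     with open("/proc/cmdline") as f:
#         print(i915_modeset(f.read()))
```

Logging this value alongside each iteration makes it harder to mix up
runs made with and without the workaround.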

If you want a list of more i915 adapters where the problem happens, I can
check our bugzilla and provide it.

Thanks
Stanislaw 


hibernate random memory corruption, workaround i915.modeset=0

2012-03-21 Thread Dave Airlie
On Wed, Mar 21, 2012 at 3:14 PM, Stanislaw Gruszka <sgrus...@redhat.com> wrote:
> On Mon, Mar 19, 2012 at 10:21:28AM -0700, Keith Packard wrote:
>> On Mon, 19 Mar 2012 15:53:54 +0100, Stanislaw Gruszka <sgrus...@redhat.com> wrote:
>>
>> > Keith, is there a chance that this bug can be fixed by i915 team?
>>
>> Yes, I'm working on figuring out how to actually reproduce this and then
>> work on a few work-arounds.
>
> This must be hardware dependent then, yesterday and today I tried to
> reproduce on Lenovo T60 with:
>
> 00:02.0 VGA compatible controller [0300]: Intel Corporation Mobile 945GM/GMS, 
> 943/940GML Express Integrated Graphics Controller [8086:27a2] (rev 03)
>
> and there is no sign of corruption, I double checked if slab poisoning
> is enabled.
>
>> > If not, can we disable hibernate on i915 with modeset=1 and add
>> > module option, which enable it for those who want to risk?
>>
>> I'd love to know if disabling modeset on just the booting kernel helps;
>> leaving the resuming kernel with modeset=1. I haven't been able to
>> reproduce this locally yet to test this theory though.
>
> On Lenovo T500 with:
>
> 00:02.1 Display controller [0380]: Intel Corporation Mobile 4 Series Chipset 
> Integrated Graphics Controller [8086:2a43] (rev 07)

I also did a bunch of tests on my 965GM and it didn't die,

so it might be that it's GM45 and Ironlake at least.

Dave.


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


hibernate random memory corruption, workaround i915.modeset=0

2012-03-19 Thread Stanislaw Gruszka
On Mon, Feb 27, 2012 at 01:42:43PM +0100, Stanislaw Gruszka wrote:
> I'm able to reproduce random memory corruption after hibernate.
> Corruption is not reproducible when I disable mode setting, what
> seems to blame i915 driver or generic DRM kernel code.
> 
> I'm able to reproduce bug on Fedora 11 with 2.6.30 kernel (first
> fedora with KMS support) and on the latest 3.3-rc kernels. So this
> issue is there from very beginning, hence it is not bisectable.
> 
> I'm attaching script to reproduce (with accompanying memory checker
> program). Script is basically sequence of hibernate - reset - check
> memory. Kernel should be compiled with CONFIG_DEBUG_SLAB to detect
> poison/redzone overwrites.
>  
> I already tried to debug this using CONFIG_DEBUG_PAGEALLOC and new
> kernel option debug_guardpage_minorder, but without any success.
> Seems corruption happen behind CPU MMU, i.e. is DMA unit programming
> bug. I'm not able to turn on IOMMU on that hardware.
> 
> This happen on T500 laptop with, lspci output attached.
> 
> I'm attaching also dmesg's with poison/redzone overwrites from
> 3.3-rc4 and 2.6.30 kernels.
> 
> Some more information can be found on:
> https://bugzilla.redhat.com/show_bug.cgi?id=746169
> https://bugzilla.redhat.com/show_bug.cgi?id=701857
> 
> i.e there is invalid DMA address warning that could be a good hint:
> https://bugzilla.redhat.com/show_bug.cgi?id=746169#c7
> 
> I would appreciate any help with solving this issue. I think many
> people are hitting this, but since corruption happens at random,
> not many people notice it, or when notice, did not find out that
> this could be i915/DRM issue.

So, after googling a bit, I found out that we are writing pixels into
memory, and that the issue has been known since at least 2010:
http://codemonkey.org.uk/2012/03/12/i915-hibernate-memory-corruption/
https://bugzilla.novell.com/show_bug.cgi?id=697699
https://bugzilla.kernel.org/show_bug.cgi?id=13811
https://bugzilla.kernel.org/show_bug.cgi?id=37142

Keith, is there a chance that this bug can be fixed by the i915 team?
If not, can we disable hibernate on i915 with modeset=1 and add a
module option that enables it for those who want to take the risk?

Thanks
Stanislaw



hibernate random memory corruption, workaround i915.modeset=0

2012-03-19 Thread Keith Packard
On Mon, 19 Mar 2012 15:53:54 +0100, Stanislaw Gruszka <sgrus...@redhat.com> wrote:

> Keith, is there a chance that this bug can be fixed by i915 team?

Yes, I'm working on figuring out how to actually reproduce this and then
work on a few work-arounds.

> If not, can we disable hibernate on i915 with modeset=1 and add
> module option, which enable it for those who want to risk?

I'd love to know if disabling modeset on just the booting kernel helps;
leaving the resuming kernel with modeset=1. I haven't been able to
reproduce this locally yet to test this theory though.

-- 
keith.packard at intel.com


hibernate random memory corruption, workaround i915.modeset=0

2012-02-27 Thread Stanislaw Gruszka
Hi.

I'm able to reproduce random memory corruption after hibernate.
The corruption is not reproducible when I disable mode setting, which
seems to point to the i915 driver or the generic DRM kernel code.

I'm able to reproduce the bug on Fedora 11 with the 2.6.30 kernel (the
first Fedora with KMS support) and on the latest 3.3-rc kernels. So this
issue has been there from the very beginning, hence it is not bisectable.

I'm attaching a script to reproduce it (with an accompanying memory
checker program). The script is basically a sequence of hibernate -
reset - check memory. The kernel should be compiled with
CONFIG_DEBUG_SLAB to detect poison/redzone overwrites.
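
For readers without the attachment, the "check memory" step can be
pictured roughly as follows. This is only a minimal user-space sketch of
the general fill-and-verify idea, not the attached checker program; the
names fill_pattern and find_corruption and the fill byte 0xA5 are
arbitrary choices made for illustration:

```python
from typing import List, Tuple

PATTERN = 0xA5  # arbitrary fill byte chosen for this sketch


def fill_pattern(size: int, pattern: int = PATTERN) -> bytearray:
    """Allocate a buffer and fill every byte with the known pattern."""
    return bytearray([pattern]) * size


def find_corruption(buf: bytearray, pattern: int = PATTERN) -> List[Tuple[int, int]]:
    """Return (offset, value) pairs for every byte that no longer matches."""
    return [(i, b) for i, b in enumerate(buf) if b != pattern]
```

In such a scheme the buffer would be filled before hibernating and
scanned after resume; any returned offsets indicate bytes that something
other than the owning process overwrote.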

I already tried to debug this using CONFIG_DEBUG_PAGEALLOC and the new
kernel option debug_guardpage_minorder, but without any success.
It seems the corruption happens behind the CPU MMU, i.e. it is a DMA
unit programming bug. I'm not able to turn on the IOMMU on that hardware.

This happens on a T500 laptop; lspci output is attached.

I'm also attaching dmesg output with poison/redzone overwrites from the
3.3-rc4 and 2.6.30 kernels.

Some more information can be found at:
https://bugzilla.redhat.com/show_bug.cgi?id=746169
https://bugzilla.redhat.com/show_bug.cgi?id=701857

In particular, there is an invalid DMA address warning that could be a
good hint:
https://bugzilla.redhat.com/show_bug.cgi?id=746169#c7

I would appreciate any help with solving this issue. I think many
people are hitting this, but since the corruption happens at random,
not many people notice it, or when they do notice, they do not find out
that this could be an i915/DRM issue.

Thanks
Stanislaw
-- next part --
A non-text attachment was scrubbed...
Name: hib_corruption_reproducer.tar.bz2
Type: application/x-bzip2
Size: 1468 bytes
Desc: not available
URL: 

-- next part --
00:00.0 Host bridge [0600]: Intel Corporation Mobile 4 Series Chipset Memory 
Controller Hub [8086:2a40] (rev 07)
Subsystem: Lenovo Device [17aa:20e0]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- SERR- 
Kernel driver in use: agpgart-intel

00:02.0 VGA compatible controller [0300]: Intel Corporation Mobile 4 Series 
Chipset Integrated Graphics Controller [8086:2a42] (rev 07) (prog-if 00 [VGA 
controller])
Subsystem: Lenovo Device [17aa:20e4]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- SERR-  [disabled]
Capabilities: [90] MSI: Mask- 64bit- Count=1/1 Enable+
Address: fee0200c  Data: 41b1
Capabilities: [d0] Power Management version 3
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA 
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Kernel driver in use: i915
Kernel modules: i915

00:02.1 Display controller [0380]: Intel Corporation Mobile 4 Series Chipset 
Integrated Graphics Controller [8086:2a43] (rev 07)
Subsystem: Lenovo Device [17aa:20e4]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- SERR- TAbort- SERR- TAbort- SERR- TAbort- SERR- TAbort- SERR- 
Kernel driver in use: e1000e
Kernel modules: e1000e

00:1a.0 USB Controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB UHCI 
Controller #4 [8086:2937] (rev 03) (prog-if 00 [UHCI])
Subsystem: Lenovo Device [17aa:20f0]
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- 
SERR- 
Kernel driver in use: uhci_hcd

00:1a.1 USB Controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB UHCI 
Controller #5 [8086:2938] (rev 03) (prog-if 00 [UHCI])
Subsystem: Lenovo Device [17aa:20f0]
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- 
SERR- 
Kernel driver in use: uhci_hcd

00:1a.2 USB Controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB UHCI 
Controller #6 [8086:2939] (rev 03) (prog-if 00 [UHCI])
Subsystem: Lenovo Device [17aa:20f0]
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- 
SERR- 
Kernel driver in use: uhci_hcd

00:1a.7 USB Controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB2 EHCI 
Controller #2 [8086:293c] (rev 03) (prog-if 20 [EHCI])
Subsystem: Lenovo Device [17aa:20f1]
Control: I/O- Mem+ BusMaster+ SpecCycle-