Re: [ubuntu-x] Automatic GPU lockup bug reports

2010-03-24 Thread Bryce Harrington
On Wed, Mar 24, 2010 at 01:11:27PM +0100, Geir Ove Myhr wrote:
> On Wed, Mar 17, 2010 at 11:16 AM, Geir Ove Myhr wrote:
> > The first happens whether the GPU is wedged or not (as defined by
> > dev_priv->mm.wedged). There is no uevent that is triggered for all
> > chipsets, but only if the GPU is wedged, which may be what we want.
> > [...]
> > Open question:
> > - Is wedged the same as hung, or is there a subtle difference?
> 
> I just realized that there is /sys/kernel/debug/dri/0/i915_wedged on
> Lucid now that the .33 drm is included [1]. Attaching this file
> automatically may aid in deciphering what's going on sometimes. I see
> the apport hook was just disabled, but possibly for next time...
> 
> [1]: 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f3cd474bb235f2331c1a6f579bdbf892386e5c7c

We discussed it in yesterday's meeting.  Rick, Raof, Sarvatt, and I
reached consensus that we now have a sufficient number of bug reports on
freeze issues to work on, and at this point the script is mostly
gathering dupes (or invalid reports, as your thorough analysis has
shown), so it's giving diminishing returns at the moment.
We can turn it on again later once we get these bugs resolved, but by
turning it off we can spend less manpower on triaging and hopefully
more on pulling in fixes.

Meanwhile, you've done some excellent investigation into the apport
hook, and I definitely agree it would be good to get those changes
implemented, at least in Lucid+1.  It's a bit of a bummer that the key
seems to be carrying that temporary debug kernel patch, because I'm
doubtful the kernel team would be open to that at this late stage in
the release.  (Also, I suspect that even if we did this, upstream
would just come back with something else needed.)

What I've done is scheduled a session at UDS to discuss freeze hooks in
general, and I've captured some of your advice into a wiki page[1] which
we can use for reference at the UDS session.  That session will cover
getting freeze hooks set up for -ati and -nouveau as well, using the
infrastructure and lessons learned from doing this for -intel.
If you have further thoughts or want to do copyediting on the wiki page,
please do!  That would make life easier for whoever drafts this
blueprint post-UDS.  :-)

1: https://wiki.ubuntu.com/X/Blueprints/ApportFreezeHooks

Bryce

-- 
Ubuntu-x mailing list
Ubuntu-x@lists.ubuntu.com
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/ubuntu-x


Re: [ubuntu-x] Automatic GPU lockup bug reports

2010-03-24 Thread Geir Ove Myhr
On Wed, Mar 17, 2010 at 11:16 AM, Geir Ove Myhr wrote:
> The first happens whether the GPU is wedged or not (as defined by
> dev_priv->mm.wedged). There is no uevent that is triggered for all
> chipsets, but only if the GPU is wedged, which may be what we want.
> [...]
> Open question:
> - Is wedged the same as hung, or is there a subtle difference?

I just realized that there is /sys/kernel/debug/dri/0/i915_wedged on
Lucid now that the .33 drm is included [1]. Attaching this file
automatically may aid in deciphering what's going on sometimes. I see
the apport hook was just disabled, but possibly for next time...

[1]: 
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f3cd474bb235f2331c1a6f579bdbf892386e5c7c
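
A minimal sketch of how a hook could pick this file up, written in plain
Python rather than against any particular apport helper. The function
name attach_wedged_state and the report key "i915_wedged" are made up
for illustration, and the report is assumed to behave like a dict.
debugfs is normally readable by root only, so the read is allowed to
fail quietly.

```python
import os

# Path mentioned above; it only exists with the .33 drm and a mounted debugfs.
WEDGED_PATH = "/sys/kernel/debug/dri/0/i915_wedged"


def attach_wedged_state(report, path=WEDGED_PATH, key="i915_wedged"):
    """Store the contents of `path` in the dict-like `report` under `key`.

    Returns True if the file was read, False if it is missing or unreadable.
    The function name and the key are hypothetical, not part of apport.
    """
    if not os.path.exists(path):
        return False
    try:
        with open(path) as f:
            report[key] = f.read().strip()
    except (IOError, OSError):
        # debugfs is usually root-only; skip the attachment rather than fail.
        return False
    return True
```

Returning a boolean instead of raising keeps the hook usable on kernels
without the file, which is the common case before the .33 drm.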



Re: [ubuntu-x] Automatic GPU lockup bug reports

2010-03-17 Thread Geir Ove Myhr
A little incremental update on the apport GPU lockup reports...

On how GPU reset works:

I have looked a little at the code, and the first thing that pops out
is that only i965 and above (GM45 included) are being reset. i945, G33,
and below are not reset. This resonates well with what I see in the
bug reports. On the chipsets where the GPU is not reset, the attached
IntelGpuDump.txt is compatible with the (limited) information in
i915_error_state. For the bug reports where I have got a manual dump
of i915_error_state with drm-intel-next kernel which dumps all
relevant information there, the information is compatible with
IntelGpuDump.txt, although more complete (i.e. includes all the
relevant buffers, IntelGpuDump.txt often lacks some important ones).
On chipsets where the GPU is reset, IntelGpuDump.txt is a dump of a
freshly initialized GPU. The best sign is that the HEAD is right in
the beginning of the ringbuffer, i.e. it just got started. The other
sign is that ACTHD and IPEHR are different from the ones recorded in
i915_error_state. With drm.debug=0x02 as kernel parameter, we can also
see that the GPU is being reset in dmesg output (see [1] for an
example from LP # 516909). The code that triggers the reset is
i915_error_work_func in drivers/gpu/drm/i915/i915_irq.c [2]. The
actual reset happens in i965_reset in i915_drv.c [3].

[1]: https://bugs.freedesktop.org/attachment.cgi?id=34126&action=edit
[2]: 
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/gpu/drm/i915/i915_irq.c;h=5388354da0d176df4ff2a3b7c33de069abff12da;hb=HEAD
[3]: 
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/gpu/drm/i915/i915_drv.c;h=1b2e95455c05d0cce04d17483c7bd4ff9f218fe0;hb=HEAD
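
The two signs of a reset described above could be checked mechanically.
This is only a sketch under assumptions: the register values are taken
to be already parsed into dicts of integers, the keys "ACTHD", "IPEHR"
and "HEAD" are hypothetical names, HEAD is assumed to be an offset into
the ring, and parsing the real files is left out entirely.

```python
def reset_signs(error_state, gpu_dump, ring_start=0):
    """List the signs that the GPU was reset between the two captures.

    Both arguments are dicts of already-parsed integer register values.
    The keys "ACTHD", "IPEHR" and "HEAD" are hypothetical; parsing the
    real files is left out.
    """
    signs = []
    # Best sign: HEAD sits right at the beginning of the ringbuffer.
    if gpu_dump.get("HEAD") == ring_start:
        signs.append("HEAD at start of ring")
    # Other sign: ACTHD / IPEHR no longer match what the kernel recorded.
    for reg in ("ACTHD", "IPEHR"):
        if reg in error_state and reg in gpu_dump:
            if error_state[reg] != gpu_dump[reg]:
                signs.append(reg + " differs")
    return signs
```

An empty list means neither sign was found, which is what one would
expect on the chipsets that are not reset.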

On how the udev events are triggered:

The udev events are sent from i915_error_work_func mentioned above.
When a GPU reset happens, three events are sent: one at the beginning
of the function, when we know that an error has been detected, one
right before the reset, and one after. The last two only happen on
i965 and above, so we don't want to listen for them. The first happens
whether the GPU is wedged or not (as defined by dev_priv->mm.wedged).
There is no uevent that fires on all chipsets and only when the GPU is
wedged, which may be what we want.

The i915_error_work_func is called from the end of i915_handle_error
(also in i915_irq.c), which takes care of recording the error state to
i915_error_state in debugfs first, so it's fine to grab this file on
the first udev event also in the cases where the GPU will be reset (I
was worried about this in previous emails). i915_handle_error is
called from two places. One is when a bit in the error register EIR
gets set, which triggers an interrupt. The other is when the hangcheck
timer elapses, i.e. EIR is not set, but the GPU makes no progress. In
the latter case "Hangcheck timer elapsed... GPU hung\n" is logged. In
both cases i915_handle_error prints "render error detected, EIR:
0x%08x\n" (i.e. the EIR register is printed), but this will probably
change in drm-intel-next soon, so that this is only printed when a bit
in EIR is set [4].

[4]: http://lists.freedesktop.org/archives/intel-gfx/2010-March/006150.html
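
The two messages quoted above are enough to tell the two entry points
apart from a dmesg capture. A rough sketch, assuming the message texts
stay exactly as quoted; the function name scan_dmesg is made up, and
whatever prefix the kernel log puts in front of each line is ignored.

```python
import re

HANGCHECK_MSG = "Hangcheck timer elapsed... GPU hung"
# Matches the "0x%08x" format quoted above: exactly eight hex digits.
EIR_RE = re.compile(r"render error detected, EIR: 0x([0-9a-fA-F]{8})")


def scan_dmesg(dmesg):
    """Return (hangcheck_seen, eir) for a dmesg text.

    eir is the value from the last "render error detected" line,
    or None if no such line is present.
    """
    hangcheck_seen = HANGCHECK_MSG in dmesg
    matches = EIR_RE.findall(dmesg)
    eir = int(matches[-1], 16) if matches else None
    return hangcheck_seen, eir
```

A result of (True, 0) would correspond to the hangcheck path with
nothing set in EIR; if the change in [4] lands, that case would show up
as (True, None) instead.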

On what upstream wants:

Chris Wilson says that they would prefer dumps from kernels with the
i915_error_state dumping patch [5]. IntelGpuDump.txt usually lacks
some important information.

[5]: 
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=9df30794f609d9412f14cfd0eb7b45dd64d0b14e

On what we can do:

1. Differentiate between "GPU hung" and other GPU errors. I think I
got this part right in my previous email:
- If there is "Hangcheck timer elapsed... GPU hung" in dmesg, give
title "GPU hung ++",
- If there is "page table error" in dmesg, give title "GPU page table error ++"
- If none of the above, simply let the title be "GPU error ++" for now.
2. Include error registers in the right priority in the title
- If PGTBL_ER is non-zero, use that.
- Otherwise, if EIR is non-zero, use that.
- Ignore ESR, it's useless.
3. If possible, carry the record-batch-buffer-following-GPU-error
patch [5] (above) in the kernel. Possibly drop it before release. This
will make the dumps for pre-i965 become better, and will make the
post-i965 dumps become useful.
4. Possibly add some message in the apport script that says that while
we are recording the logs of the incident, they don't tell us how the
reporter experienced the problem. We get a lot of descriptions that
only say things like "problem happened", and we don't know if the
computer hung and needed a reboot or if it recovered all by itself and
the only thing the user noticed was apport asking them to report a
problem they were unaware of.
5. Fix whatever caused
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/539533
. This seemed to happen for a lot of people since yesterday. It seems
to be related to tryi

Re: [ubuntu-x] Automatic GPU lockup bug reports

2010-03-12 Thread Geir Ove Myhr
On Fri, Mar 12, 2010 at 1:48 AM, Bryce Harrington wrote:
> On Wed, Mar 10, 2010 at 09:12:33AM +0100, Geir Ove Myhr wrote:
>> I have noticed in the GPU-lockup bug reports that we have been
>> receiving 
>> (https://launchpad.net/ubuntu/+bugs?field.searchtext=GPU+lockup&orderby=-datecreated)
>> that the attached IntelGpuDump.txt is usually incomplete, but
>> can be useful for gathering statistics since dumps on the same chipset
>> often have similar characteristics. This may be due to the race
>> condition that Chris mentions.
>
> Incomplete in what sense?

Like below where the currently executing batchbuffer is completely
missing. I have also seen dumps without any ringbuffer or batchbuffer
at all. Then there's the issue of how much to trust the dump if the
kernel is racing to reset the GPU while userspace is trying to dump
it.

> Btw, you've noticed the random number strings that are included in
> titles.  That is basically a checksum hex of the dump report, which I'm
> calling the 'dump sign'.  If two bug reports have exactly the same gpu
> dump (character-for-character) then they'll have identical dump signs
> and thus are almost assuredly dupes.  Looking through our existing bug
> reports I found half a dozen with the same hex, and sure enough they
> were all against 915gm and so I marked them all dupes.

Yes, I knew that. I just thought I'd replace it with what I thought
was the actual problem. I think we will only get matches in degenerate
cases, like the 915gm one where the ringbuffer is only a big 0. It is
kind of like taking the MD5 of a 'ps aux' dump: if one symptom is that
no processes are started, we will get matching MD5s, but for any
"normal" output the MD5 will not match even if there are other
characteristics that are the same. I think it's okay to have the first
few hex digits in the title along with any other useful information
that we can add, like you did for the last ones.
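
For reference, a shortened dump sign could look roughly like this. It is
a sketch only: the function name dump_sign and the default of eight
digits are assumptions, and the hook's actual checksum code is not shown
in this thread.

```python
import hashlib


def dump_sign(dump_text, digits=8):
    """Return the first `digits` hex digits of the MD5 of the dump text."""
    digest = hashlib.md5(dump_text.encode("utf-8")).hexdigest()
    return digest[:digits]
```

Two dumps get the same sign only if they are character-for-character
identical (ignoring hash collisions), so a single differing register
value gives a different sign.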

> However, I recognize these hex strings are nigh-unreadable for triagers,
> and notice you've been replacing them with the PGTBL_ER or ESR values in
> some cases.  To save you some typing I've updated the report to append
> these to the title, if the values are non-zero.  I did not include
> looking at the EIR but notice this is discussed in your other email -
> let me know if that would be worth including and if it should be
> used preferentially to ESR and/or PGTBL_ER.

Good idea. I added the ESR before I found out that it is essentially
useless. So PGTBL_ER should be used before EIR, and ESR should never
be used. EIR=0x10 is the general sign of a page table error and in
that case more detailed information about it can be found in PGTBL_ER,
so we always have EIR=0x10 (and possibly other errors) if PGTBL_ER is
non-zero.
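
The priority described above, as a small sketch. The function name
title_register and the exact text appended to the title are assumptions;
only the ordering (PGTBL_ER first, then EIR, never ESR) comes from the
discussion.

```python
def title_register(pgtbl_er, eir):
    """Pick the register to show in the bug title, or None.

    PGTBL_ER wins when non-zero, then EIR. ESR is deliberately not an
    input, since it should never be used.
    """
    if pgtbl_er:
        return "PGTBL_ER: 0x%08x" % pgtbl_er
    if eir:
        return "EIR: 0x%08x" % eir
    return None
```

With a page table error, EIR=0x10 is expected to be set as well, so
checking PGTBL_ER first simply picks the more detailed of the two.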

>> One thing that I see a lot is that only the ringbuffer is captured,
>> while the GPU is executing a batchbuffer (see
>> https://wiki.ubuntu.com/X/InterpretingIntelGpuDump for a high level
>> description of ringbuffers and batchbuffers) . One example is
>> IntelGpuDump.txt from
>> https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/535477
>> . The first line captures the memory address of the active head, i.e.
>> where the GPU is currently executing (ACTHD: 0x0e366d50). From the
>> ringbuffer dump we see
>> 0x00012500:      0x18800080: MI_BATCH_BUFFER_START
>> 0x00012504:      0x0e363001:    dword 1
>> 0x00012508: HEAD 0x0204: MI_FLUSH
>> which means that the last executed command in the ringbuffer was to
>> start a batch buffer at memory address 0x0e363001. This is a little
>> bit below ACTHD, so we can assume that the GPU is executing in that
>> batchbuffer, but the batchbuffer is not part of the dump, which makes
>> it hard to say what the GPU is up to. The only thing we can see is
>> that the last executed instruction is 0x1500 (from the IPEHR
>> register which is loaded with every instruction that is processed).
> Can you propose a mechanism for how we can solve this?  I only half grok
> the freeze dumping stuff, and unfortunately some other X projects are
> demanding my time.  But if you can propose some specific changes I can
> at least supply some time to update the apport hook and/or get the bits
> into the archive.  I would love patches or even just bash snippets that
> can be put into the apport hook, udev hook, or whatever.

One option would be to carry the record-GPU-error-state kernel patch
http://git.kernel.org/?p=linux/kernel/git/anholt/drm-intel.git;a=commit;h=9df30794f609d9412f14cfd0eb7b45dd64d0b14e
until some time before release and capture the i915_error_state a
little later. This would need some testing though, so our best option
may be to simply leave it at the status quo and ask the reporters of
the most promising automatic reports to test a drm-intel-next kernel
and get a manual dump.

>> I'm also wondering if there are many false positives, since I don't
>> always see signs of a GPU error in the dmesg output. Even when there
>> are GPU hung messages, t

Re: [ubuntu-x] Automatic GPU lockup bug reports

2010-03-11 Thread Bryce Harrington
On Thu, Mar 11, 2010 at 04:48:07PM -0800, Bryce Harrington wrote:
> However, I recognize these hex strings are nigh-unreadable for triagers,
> and notice you've been replacing them with the PGTBL_ER or ESR values in
> some cases.  To save you some typing I've updated the report to append
> these to the title, if the values are non-zero.  I did not include
> looking at the EIR but notice this is discussed in your other email -
> let me know if that would be worth including and if it should be
> used preferentially to ESR and/or PGTBL_ER.

I've gone ahead and added EIR.  It will be used preferentially to the
other two if it is non-zero.  Let me know if that's incorrect.

Bryce



Re: [ubuntu-x] Automatic GPU lockup bug reports

2010-03-11 Thread Bryce Harrington
On Wed, Mar 10, 2010 at 09:12:33AM +0100, Geir Ove Myhr wrote:
> I have noticed in the GPU-lockup bug reports that we have been
> receiving 
> (https://launchpad.net/ubuntu/+bugs?field.searchtext=GPU+lockup&orderby=-datecreated)
> that the attached IntelGpuDump.txt is usually incomplete, but
> can be useful for gathering statistics since dumps on the same chipset
> often have similar characteristics. This may be due to the race
> condition that Chris mentions.

Incomplete in what sense?

Btw, you've noticed the random number strings that are included in
titles.  That is basically a checksum hex of the dump report, which I'm
calling the 'dump sign'.  If two bug reports have exactly the same gpu
dump (character-for-character) then they'll have identical dump signs
and thus are almost assuredly dupes.  Looking through our existing bug
reports I found half a dozen with the same hex, and sure enough they
were all against 915gm and so I marked them all dupes.

Ideally, when apport tries filing a bug report with the same dump sign
as one already filed, it should automatically set it as a dupe.  I don't
know whether this is working yet; the dupe detection stuff is still
magical to me.

However, I recognize these hex strings are nigh-unreadable for triagers,
and notice you've been replacing them with the PGTBL_ER or ESR values in
some cases.  To save you some typing I've updated the report to append
these to the title, if the values are non-zero.  I did not include
looking at the EIR but notice this is discussed in your other email -
let me know if that would be worth including and if it should be
used preferentially to ESR and/or PGTBL_ER.
 
> One thing that I see a lot is that only the ringbuffer is captured,
> while the GPU is executing a batchbuffer (see
> https://wiki.ubuntu.com/X/InterpretingIntelGpuDump for a high level
> description of ringbuffers and batchbuffers) . One example is
> IntelGpuDump.txt from
> https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/535477
> . The first line captures the memory address of the active head, i.e.
> where the GPU is currently executing (ACTHD: 0x0e366d50). From the
> ringbuffer dump we see
> 0x00012500:      0x18800080: MI_BATCH_BUFFER_START
> 0x00012504:      0x0e363001:    dword 1
> 0x00012508: HEAD 0x0204: MI_FLUSH
> which means that the last executed command in the ringbuffer was to
> start a batch buffer at memory address 0x0e363001. This is a little
> bit below ACTHD, so we can assume that the GPU is executing in that
> batchbuffer, but the batchbuffer is not part of the dump, which makes
> it hard to say what the GPU is up to. The only thing we can see is
> that the last executed instruction is 0x1500 (from the IPEHR
> register which is loaded with every instruction that is processed).

Can you propose a mechanism for how we can solve this?  I only half grok
the freeze dumping stuff, and unfortunately some other X projects are
demanding my time.  But if you can propose some specific changes I can
at least supply some time to update the apport hook and/or get the bits
into the archive.  I would love patches or even just bash snippets that
can be put into the apport hook, udev hook, or whatever.


> I'm also wondering if there are many false positives, since I don't
> always see signs of a GPU error in the dmesg output. Even when there
> are GPU hung messages, there may be messages in dmesg for a long time
> after that, which means that it couldn't have been that GPU hang that
> triggered the udev rule. I'm not sure how to interpret this.

Can you propose a string to look for in the dmesg output?  It would be
straightforward to have the apport hook scan for that string and refuse
to file a bug report unless it sees it.
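
A sketch of such a scan, using the message strings quoted elsewhere in
this thread ("Hangcheck timer elapsed... GPU hung", "page table error",
"render error detected, EIR:"). The function name classify_dmesg and the
returned title prefixes are assumptions; returning None is meant as the
"refuse to file" case.

```python
def classify_dmesg(dmesg):
    """Return a bug title prefix, or None if dmesg shows no GPU error."""
    if "Hangcheck timer elapsed... GPU hung" in dmesg:
        return "GPU hung"
    if "page table error" in dmesg:
        return "GPU page table error"
    if "render error detected, EIR:" in dmesg:
        return "GPU error"
    # No known sign of a GPU error: treat the report as a false positive.
    return None
```

The hook would then skip filing whenever the result is None. Whether
these strings catch every real hang is untested here, so it may be safer
to log the skipped cases for a while before relying on the filter.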

> Since the number of bug reports is quite overwhelming, I think a
> suitable thing to do would be to lump similar automatic reports
> together by duplicating them to a master bug report. Most likely, the
> i8xx reports are mostly this issue:
> http://bugs.freedesktop.org/show_bug.cgi?id=26345 .

It could be.  Are we sufficiently confident that we could just dupe all
the bug reports in launchpad?  Or if we're not sure, we could go ahead
and start forwarding the bug reports and let upstream dupe them there.
The former is probably less total work, and like you mention we can
always undupe them ourselves as we learn more.

With 8xx, another option we could pursue would be to blacklist KMS in
the kernel and force them to use UMS instead.  Do you know if there has
been testing to verify that the freezes experienced by 8xx are specific
to KMS?  I'd hate to blacklist 845 for example, only to find it still
doesn't work.

I've removed the --kms-only flag on -intel, so it should now be possible
for 8xx users to switch off KMS via modeset=0 I think.  If we can get
some verifications that this helps eliminate the freezes, let me know
and we can proceed with blacklisting 8xx chips.

> The bugs on i945
> also seem similar to one an

Re: [ubuntu-x] Automatic GPU lockup bug reports

2010-03-10 Thread Geir Ove Myhr
> Yes, the userspace notification is asynchronous and the kernel does not
> wait before starting the reset procedure (if supported). Hence there is a
> race to capture the accurate data.
>
> The current i915_error_state gets around this by performing the capture in
> the error handler and aims to collect all the data that is strictly relevant
> to the crash. I would strongly recommend that this is used, and I want to
> deprecate the ringbuffer_info and batchbuffers debug files in the future -
> hence killing intel_gpu_dump.

I have noticed in the GPU-lockup bug reports that we have been
receiving 
(https://launchpad.net/ubuntu/+bugs?field.searchtext=GPU+lockup&orderby=-datecreated)
that the attached IntelGpuDump.txt is usually incomplete, but
can be useful for gathering statistics since dumps on the same chipset
often have similar characteristics. This may be due to the race
condition that Chris mentions.

One thing that I see a lot is that only the ringbuffer is captured,
while the GPU is executing a batchbuffer (see
https://wiki.ubuntu.com/X/InterpretingIntelGpuDump for a high level
description of ringbuffers and batchbuffers) . One example is
IntelGpuDump.txt from
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/535477
. The first line captures the memory address of the active head, i.e.
where the GPU is currently executing (ACTHD: 0x0e366d50). From the
ringbuffer dump we see
0x00012500:      0x18800080: MI_BATCH_BUFFER_START
0x00012504:      0x0e363001:    dword 1
0x00012508: HEAD 0x0204: MI_FLUSH
which means that the last executed command in the ringbuffer was to
start a batch buffer at memory address 0x0e363001. This is a little
bit below ACTHD, so we can assume that the GPU is executing in that
batchbuffer, but the batchbuffer is not part of the dump, which makes
it hard to say what the GPU is up to. The only thing we can see is
that the last executed instruction is 0x1500 (from the IPEHR
register which is loaded with every instruction that is processed).
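
The address comparison above can be written out as a small calculation.
This is a sketch under one assumption that is not confirmed in this
thread: that the lowest bits of the address dword (the trailing 1 in
0x0e363001) are flag bits rather than part of the address, so they are
masked off. The function name and the mask value are made up.

```python
def offset_in_batch(acthd, batch_start_dword, flag_mask=0x3):
    """Return ACTHD's byte offset from the start of the batchbuffer.

    flag_mask is an assumption: the low bits of the address dword are
    treated as flags, not address bits, and masked off.
    """
    batch_addr = batch_start_dword & ~flag_mask
    return acthd - batch_addr


# Values taken from the dump discussed above.
offset = offset_in_batch(0x0e366d50, 0x0e363001)
print(hex(offset))  # 0x3d50, i.e. 15696 bytes past the start of the batch
```

A small positive offset like this is consistent with the GPU executing
inside that batchbuffer; a negative or very large one would suggest that
ACTHD belongs to some other buffer.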

I'm also wondering if there are many false positives, since I don't
always see signs of a GPU error in the dmesg output. Even when there
are GPU hung messages, there may be messages in dmesg for a long time
after that, which means that it couldn't have been that GPU hang that
triggered the udev rule. I'm not sure how to interpret this.

Since the number of bug reports is quite overwhelming, I think a
suitable thing to do would be to lump similar automatic reports
together by duplicating them to a master bug report. Most likely, the
i8xx reports are mostly this issue:
http://bugs.freedesktop.org/show_bug.cgi?id=26345 . The bugs on i945
also seem similar to one another. Then we can coordinate some testing
from the master bug report, but ask people to comment on their
findings on their own reports. That way the master bug report will not
be overcommented and we can easily detach bug reports later.

Geir Ove
