i915: add reset_state for hw_contexts

Ian Romanick Wed, 27 Feb 2013 17:15:29 -0800

On 02/27/2013 01:13 AM, Chris Wilson wrote:

On Tue, Feb 26, 2013 at 05:47:12PM -0800, Ian Romanick wrote:

On 02/26/2013 03:05 AM, Mika Kuoppala wrote:

For arb-robustness, every context needs to have it's own
reset state tracking. Default context will be handled in a identical
way as the no-context case in further down in the patch set.
For no-context case, the reset state will be stored in
the file_priv part.


v2: handle default context inside get_reset_state


This isn't the interface we want.  I already sent you the patches
for Mesa, and you seem to have completely ignored them.  Moreover,
this interface provides no mechanism to query for its existence
(other than relying on the kernel version), and no method to
deprecate it.


It's existence, and the existence of any successor, is the only test you
need to check for the interface. Testing the flag field up front also
lets you know if individual flags are known prior to use.

Based on e-mail discussions, I think danvet agrees with me here.

Putting guilty / innocent counting in kernel puts policy decisions
in the kernel that belong with the user space API that implements
them. Putting these choices in the kernel significantly decreases
how "future proof" the interface is.  Since any kernel/user
interface has to be kept forever, this is just asking for
maintenance trouble down the road.


I think you have the policy standpoint reversed.

Can you elaborate? I've been out of the kernel loop for a long time,but I always thought policy and mechanism were supposed to be separated.In the case of drivers with a user-mode component, the mechanism wasin the kernel (where else could it be?), and policy decisions were inuser-mode. This is often at odds with keeping the interfaces betweenkernel and user thin, so that may be where my misunderstanding is.

Also, it's difficult (for me) to tell from the rest of the series
whether or not a context that was not affected by a reset (had no
batches in flight at all) can still observe that a reset occurred.


Yes, the context can observe that the global reset increased, its did
not.


See below for a couple reasons why I think this may be a mistake.

 From the GL point of view, once a context has observed a reset, it's
garbage.  Knowing that it saw 1 reset or 43,000,000 resets is the
same.  Since it's a count, GL will have to query the values before
any rendering happens or it won't know whether the value 6 means
there was a reset or not.

The right interface is a set of flags indicating the state the
context was in when it observed a reset.  As this patch series
stands, Mesa will not use this interface.


This interface conforms to your specifications and interface that you
last posted to the list and were revised on list. There are two
substantive differences; there is *less* policy in the kernel than
before, and the breadcrumb of last reset state was overlooked.

In order to make the reset status work in a multi-threaded environment
the kernel exists in, we have to assume that there will be multiple
actors on the reset status, and they will be comparing the state that
they are interested in. The set of flags that GL understands is
deducible from this interface.

We had some discussion on the list about a proposed interface last year.We had a really good discussion, and both you and Daniel provided somereally valuable feedback. The biggest piece of feedback that both ofyou provided was to simplify. I took that advice to heart. Theoriginal interface was clunky and over complicated.

If memory serves, that discussion was on the order of 8 to 10 monthsago. Quite a lot has happened since then. The biggest thing thathappened is I started implementing the interface we discussed in thekernel, libdrm, and Mesa. In that process I found a number of problemswith even the simplified interface. Back in September I used this newdata to simplify the interface even further. This allowed both the

kernel and the user-mode code to be much less complex.

Right about that same time the whole Mesa team got super busy. First wehad OpenGL 3.1, then we had OpenGL ES 3.0. During that time all of myARB_robustness work languished. I didn't want to send another round ofnot-fully-baked ideas (or patches) to the list, so they just sat. Andsat. And sat. :(

At FOSDEM earlier this month, I discussed this with Jesse. I told himthat there was approximately epsilon probability that I would get backto this work. As a result, I was going to have to throw it over thewall to the kernel team. At no point did he mention that anyone wasalready working on this.

About a week ago Daniel pulled me into a discussion with Mika about theARB_robustness work. I dug out my old code, paged back in the memory ofthe work from off-line storage, and provided a bunch of feedback to Mikaand Daniel. After I sent around my explanation of why I abandoned thecounting interface and circulated my work, there was no response fromMika or anyone on that team. I had a brief discussion with Daniel onIRC, and he seemed to agree with my sentiments and like the direction ofmy code.


Then this patch series appeared on the list, and here we are.

There are two requirements in the ARB_robustness extension (andadditional layered extensions) that I believe cause problems with all ofthe proposed counting interfaces.

1. GL contexts that are not affected by a reset (i.e., didn't lose anyobjects, didn't lose any rendering, etc.) should not observe it. Thisis the big one that all the browser vendors want. If you have 5 tabsopen, a reset caused by a malicious WebGL app in one tab shouldn't, ifpossible, force them to rebuild all of their state for all the other tabs.

2. After a GL context observes a reset, it is garbage. It must bedestroyed and a new one created. This means we only need to know that areset of any variety affected the context.

In addition, I'm concerned about having one GL context be able toobserve, in any manner, resets that did not affect it. Given thatpeople have figured out how to get encryption keys by observing cachetimings, it's not impossible for a malicious program to use informationabout reset counts to do something. Leaving a potential gap like thisin an extension that's used to improved security seems... like we'rejust asking for a Phoronix article. We don't have any requirement toexpose this information, so I don't think we should.

We also don't have any requirement to expose this functionality onanything pre-Sandy Bridge. We may want, at some point, to expose it onIronlake. Based on that, it seems reasonable to tie the availability ofreset notifications to hardware contexts. Since we won't test it, Mesacertainly isn't going to expose it on 8xx or 915.

Based on this, a simplified interface became obvious: for each hardwarecontext, track a mask of kinds of resets that have affected thatcontext. I defined three bits in the mask:


1. The hw context had a batch executing when the reset occurred.

2. The hw context had a batch pending when the reset occurred.Hopefully in the future we could make most occurrences of this go awayby re-queuing the batch. It may also be possible to do this inuser-mode, but that may be more difficult.


3. "Other."

There's also the potential to add other bits to communicate informationabout lost buffers, textures, etc. This could be used to create alayered extension that lets applications transfer data from the deadcontext to the new context. Browsers may not be too interested in this,but I could imagine compositors or other kinds of applicationsbenefiting. We may never implement that, but it's easier to communicatethis through a set of flags than through a set of counts.

The guilty / innocent counts in this patch series lose information. Wecan probably reconstruct the current flags from the counts, but we wouldhave to do some refactoring (or additional counting) to get other flagsin the future. All of this would make the kernel code more complex.Why reconstruct the data you want when you could just track that data inthe first place?

Since the functionality in not available on all hardware, I also added aquery to determine the availability. This has the added benefit that,should this interface prove to be insufficient in the future, we coulddeprecate it by having the query always return false. All softwareinvolved has to be able the handle the case where the interface is notavailable, so I don't think this should run awry of the rules aboutkernel/user interface breakage. Please correct me if I'm wrong.


The Mesa patches are in my robustness2 branch:

http://cgit.freedesktop.org/~idr/mesa/log/?h=robustness2

As Mika pointed out, the patch that makes use of the new kernelinterface is:


http://cgit.freedesktop.org/~idr/mesa/commit/?h=robustness2&id=ef97f4092c703afd49b3516b15dfbfcd27d4bc97

In addition,

http://cgit.freedesktop.org/~idr/mesa/commit/?h=robustness2&id=6f73cbeecb90b0a08b0f016e19a16f6f50a1a99c

changes to brw_context.c to use intel_get_boolean to determine theavailability of the kernel query. I'd really like to only expose__DRI2_ROBUSTNESS when the kernel query exists, but I don't see a way todo that with the DRI extension mechanism.


The libdrm patches are in master of:

http://cgit.freedesktop.org/~idr/drm

I had attacked the problem from the opposite end from Mika. My initialimplementation always reported "other" because a lot of other workneeded to happen to differentiate the other cases. It's good to seethat work progressing. My really, really, really incomplete kernelpatch is at:


http://people.freedesktop.org/~idr/robust.patch

This patch was against some kernel from early September 2012. It maynot apply or build at this point. I believe I was able to observe areset by having an app run a shader with an infinite loop and pollglGetGraphicsResetStatusARB. I couldn't find that code, so I may bemisremembering. With the exception of the I915_PARAM_HAS_RESET_QUERYhandling, nearly all of this patch is probably useless at this point.


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

Re: [Intel-gfx] [PATCH 09/13] drm/i915: add reset_state for hw_contexts

Reply via email to