On 12/01/2017 01:52 PM, Miguel Angel Vico wrote:


On Fri, 1 Dec 2017 13:38:41 -0500
Rob Clark <robdcl...@gmail.com> wrote:

On Fri, Dec 1, 2017 at 12:09 PM, Nicolai Hähnle <nhaeh...@gmail.com> wrote:
On 01.12.2017 16:06, Rob Clark wrote:

On Thu, Nov 30, 2017 at 5:43 PM, Nicolai Hähnle <nhaeh...@gmail.com> wrote:

Hi,

I've had a chance to look a bit more closely at the allocator prototype repository now. There's a whole bunch of low-level API design feedback, but for now let's focus on the high-level stuff first.

Thanks for taking a look.

Going by the 4.5 major object types (as also seen on slide 5 of your
presentation [0]), assertions and usages make sense to me.

Capabilities and capability sets should be cleaned up in my opinion, as the status quo is overly obfuscating things. What capability sets really represent, as far as I understand them, is *memory layouts*, and so that's what they should be called.

This conceptually simplifies `derive_capabilities` significantly without any loss of expressiveness as far as I can see. Given two lists of memory layouts, we simply look for which memory layouts appear in both lists, and then merge their constraints and capabilities.
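
For concreteness, a minimal sketch of that merge step; all type and function names here are invented for illustration, not the prototype's actual API:

   #include <stdint.h>
   #include <stddef.h>

   /* Hypothetical types, illustration only. */
   struct constraints { uint64_t alignment; };

   struct memory_layout {
       uint32_t layout_id;              /* e.g. AMD/tiled */
       struct constraints constraints;  /* e.g. alignment=64k */
   };

   static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }

   /* Keep only layouts present in both lists, merging constraints by
    * taking the stricter (larger) alignment.  How the capabilities of
    * matching layouts should combine is the open question below. */
   static size_t merge_layouts(const struct memory_layout *a, size_t na,
                               const struct memory_layout *b, size_t nb,
                               struct memory_layout *out)
   {
       size_t n = 0;
       for (size_t i = 0; i < na; i++) {
           for (size_t j = 0; j < nb; j++) {
               if (a[i].layout_id != b[j].layout_id)
                   continue;
               out[n] = a[i];
               out[n].constraints.alignment =
                   max_u64(a[i].constraints.alignment,
                           b[j].constraints.alignment);
               n++;
           }
       }
       return n;
   }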

Merging constraints looks good to me.

Capabilities need some more thought. The prototype removes capabilities when merging layouts, but I'd argue that that is often undesirable. (In fact, I cannot think of capabilities which we'd always want to remove.)

A typical example for this is compression (i.e. DCC in our case). For
rendering usage, we'd return something like:

Memory layout: AMD/tiled; constraints(alignment=64k); caps(AMD/DCC)

For display usage, we might return (depending on hardware):

Memory layout: AMD/tiled; constraints(alignment=64k); caps(none)

Merging these in the prototype would remove the DCC capability, even though it might well make sense to keep it there for rendering. Dealing with the fact that display usage does not have this capability is precisely one of the two things that transitions are about! The other thing that transitions are about is caches.

I think this is kind of what Rob was saying in one of his mails.


Perhaps "layout" is a better name than "caps".. either way I think of
both AMD/tiled and AMD/DCC as the same type of "thing".. the
difference between AMD/tiled and AMD/DCC is that a transition can be
provided for AMD/DCC.  Other than that they are both things describing
the layout.


The reason that a transition can be provided is that they aren't quite the same thing, though. In a very real sense, AMD/DCC is a "child" property of AMD/tiled: DCC is implemented as a meta surface whose memory layout depends on the layout of the main surface.

I suppose this is six-of-one, half-dozen of the other..

what you are calling a layout is what I'm calling a cap that just
happens not to have an associated transition

Although, if there are GPUs that can do an in-place "transition" between
different tiling layouts, then the distinction is perhaps really not as
clear-cut. I guess that would only apply to tiled renderers.

I suppose the advantage of just calling both layout and caps the same
thing, and just saying that a "cap" (or "layout" if you prefer that
name) can optionally have one or more associated transitions, is that
you can deal with cases where sometimes a tiled format might actually
have an in-place transition ;-)

So let's say you have a setup where both display and GPU support FOO/tiled, but only the GPU supports compressed (FOO/CC) and cached (FOO/cached). But the GPU supports the following transitions:

    trans_a: FOO/CC -> null
    trans_b: FOO/cached -> null

Then the sets for each device (in order of preference):

GPU:
    1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
    2: caps(FOO/tiled, FOO/CC); constraints(alignment=32k)
    3: caps(FOO/tiled); constraints(alignment=32k)

Display:
    1: caps(FOO/tiled); constraints(alignment=64k)

Merged Result:
    1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
       transition(GPU->display: trans_a, trans_b; display->GPU: none)
    2: caps(FOO/tiled, FOO/CC); constraints(alignment=64k);
       transition(GPU->display: trans_a; display->GPU: none)
    3: caps(FOO/tiled); constraints(alignment=64k);
       transition(GPU->display: none; display->GPU: none)
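
For concreteness, the shape of one of those merged entries could look something like this; none of these type names exist in the prototype, this just restates the example in struct form:

   struct capability;    /* e.g. FOO/tiled, FOO/CC, FOO/cached */
   struct constraints;   /* e.g. alignment=64k */
   struct transition;    /* e.g. trans_a: FOO/CC -> null */

   struct merged_entry {
       const struct capability  *caps;
       unsigned                  num_caps;
       const struct constraints *constraints;

       /* entry 1 above: gpu_to_display = {trans_a, trans_b},
        *                display_to_gpu = {} (none) */
       const struct transition **gpu_to_display;
       unsigned                  num_gpu_to_display;
       const struct transition **display_to_gpu;
       unsigned                  num_display_to_gpu;
   };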


We definitely don't want to expose a way of getting uncached rendering
surfaces for radeonsi. I mean, I think we are supposed to be able to program
our hardware so that the backend bypasses all caches, but (a) nobody
validates that and (b) it's basically suicide in terms of performance. Let's
build fewer footguns :)

sure, this was just a hypothetical example.  But to take this case as
another example, if you didn't want to expose uncached rendering (or
cached w/ cache flushes after each draw), you would exclude the entry
from the GPU set which didn't have FOO/cached (I'm adding back a
cached but not CC config just to make it interesting), and end up
with:

    trans_a: FOO/CC -> null
    trans_b: FOO/cached -> null

GPU:
   1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
   2: caps(FOO/tiled, FOO/cached); constraints(alignment=32k)

Display:
   1: caps(FOO/tiled); constraints(alignment=64k)

Merged Result:
   1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
      transition(GPU->display: trans_a, trans_b; display->GPU: none)
   2: caps(FOO/tiled, FOO/cached); constraints(alignment=64k);
      transition(GPU->display: trans_b; display->GPU: none)

So there isn't anything in the result set that doesn't have GPU cache,
and the cache-flush transition is always in the set of required
transitions going from GPU -> display

Hmm, I guess this does require the concept of a required cap..

Which we already introduced to the allocator API when we realized we
would need them as we were prototyping.

Note I also posed the question of how to handle things like cached (and similarly compression, since I view compression as roughly an equivalent mechanism to a cache) in one of the open issues on my XDC 2017 slides, because of this very over-pruning problem it causes. It's on slide 15, as "No device-local capabilities". You'll have to listen to my coverage of it in the recorded presentation for that slide to make any sense, but it's the same thing Nicolai has laid out here.

As I continued working through our prototype driver support, I found I didn't actually need to include cached or compressed as capabilities: The GPU just applies them as needed and the usage transitions make it transparent to the non-GPU engines. That does mean the GPU driver currently needs to be the one to realize the allocation from the capability set to get optimal behavior. We could fix that by reworking our driver though. At this point, not including device-local properties like on-device caching in capabilities seems like the right solution to me. I'm curious whether this applies universally though, or if other hardware doesn't fit the "compression and stuff all behaves like a cache" idiom.


So at least for radeonsi, we wouldn't want to have an AMD/cached bit, but
we'd still want to have a transition between the GPU and display precisely
to flush caches.

Two interesting questions:

1. If we query for multiple usages on the same device, can we get a
capability which can only be used for a subset of those usages?


I think the original idea was, "no".. perhaps that restriction could be lifted if transitions were part of the result. Or maybe you just query the same device independently for multiple different usages, and then merge the resulting cap-sets.
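
Something like the following, where query_caps/merge_caps_sets are made-up names for illustration, not the actual allocator API:

   /* Hypothetical: one query per usage against the same device, then
    * merge the resulting capability sets. */
   struct caps_set *render  = query_caps(dev, USAGE_RENDER);
   struct caps_set *scanout = query_caps(dev, USAGE_SCANOUT);
   struct caps_set *merged  = merge_caps_sets(render, scanout);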

(Do we need to care about intra-device transitions?  Or can we just
let the driver care about that, same as it always has?)
2. What happens when we merge memory layouts with sets of capabilities where neither is a subset of the other?


I think this is a case where no zero-copy sharing is possible, right?


Not necessarily. Let's say we have some industry-standard tiling layout foo, and vendors support their own proprietary framebuffer compression on top of that.

In that case, we may get:

Device 1, rendering: caps(BASE/foo, VND1/compressed)
Device 2, sampling/scanout: caps(BASE/foo, VND2/compressed)

It should be possible to allocate a surface as

caps(BASE/foo, VND1/compressed)

and just transition the cap(VND1/compressed) away after rendering before
accessing it with device 2.

In this case presumably VND1 defines transitions VND1/cc <-> null, and
VND2 does the same for VND2/cc ?

The interesting question is whether it would be possible or ever useful to have a surface allocated as caps(BASE/foo, VND1/compressed, VND2/compressed).

not sure if it is useful or not, but I think the idea of defining cap
transitions and returning the associated transitions when merging a
caps set could express this..

Yeah. I guess if we define per-cap transitions, the allocator should be
able to find a transition chain to express layouts like the above when
merging.

My guess is: there will be cases where it's possible, but there won't be
cases where it's useful (because you tend to render on device 1 and just
sample or scanout on device 2).

So it makes sense to say that derive_capabilities should just provide both
layouts in this case.

As for the actual transition API, I accept that some metadata may be required, and the metadata probably needs to depend on the memory layout, which is often vendor-specific. But even linear layouts need some transitions for caches. We probably need at least some generic "off-device usage" bit.


I've started thinking of cached as a capability with a transition.. I think that helps. Maybe it needs to somehow be more specific (i.e. if you have two devices, both with their own cache and no coherency between the two).


As I wrote above, I'd prefer not to think of "cached" as a capability at
least for radeonsi.

From the desktop perspective, I would say let's ignore caches; the drivers know which caches they need to flush to make data visible to other devices on the system.

On the other hand, there are probably SoC cases where non-coherent caches
are shared between some but not all devices, and in that case perhaps we do
need to communicate this.

So perhaps we should have two kinds of "capabilities".

The first, like framebuffer compression, is a capability of the allocated
memory layout (because the compression requires a meta surface), and devices
that expose it may opportunistically use it.

The second, like caches, is a capability that the device/driver will use and you don't get a say in it, but other devices/drivers also don't need to be aware of them.

yeah, a required cap..  we had tried to avoid this, since unlike
constraints which are well defined, the core constraint/capability
merging wouldn't know what to do about merging parameterized caps.
But I guess if transitions are provided then it doesn't have to.

We are going to need required caps either way, right? The core
capability merging logic would try to find a compatible layout to be
used across engines by either using available transitions or dropping
caps. We'd need a way to indicate that a particular engine won't be
able to handle the resulting layout if a certain capability was dropped.
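
Concretely, that could be as simple as a per-capability flag; the field names below are a guess at the shape rather than the prototype's actual definitions:

   #include <stdbool.h>
   #include <stdint.h>

   struct capability {
       uint32_t vendor;    /* e.g. FOO */
       uint32_t type;      /* e.g. FOO/tiled, FOO/CC, FOO/cached */
       bool     required;  /* if set, merging must not silently drop
                            * this cap: either a transition keeps it
                            * usable, or the whole layout is discarded
                            * for the requesting engine */
   };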


So then you could theoretically have a system that gives you:

GPU:     FOO/tiled(layout-caps=FOO/cc, dev-caps=FOO/gpu-cache)
Display: FOO/tiled(layout-caps=FOO/cc)
Video:   FOO/tiled(dev-caps=FOO/vid-cache)
Camera:  FOO/tiled(dev-caps=FOO/vid-cache)

... from which a FOO/tiled(FOO/cc) surface would be allocated.

The idea here is that whether a transition is required is fully visible from the capabilities:

1. Moving an image from the camera to the video engine for immediate
compression requires no transition.

2. Moving an image from the camera or video engine to the display requires a
transition by the video/camera device/API, which may flush the video cache.

3. Moving an image from the camera or video engine to the GPU additionally
requires a transition by the GPU, which may invalidate the GPU cache.

4. Moving an image from the GPU anywhere else requires a transition by the
GPU; in all cases, GPU caches may be flushed. When moving to the video
engine or camera, the image additionally needs to be decompressed. When
moving to the video engine (or camera? :)), a transition by the video engine
is also required, which may invalidate the video cache.

5. Moving an image from the display to the video engine requires a
decompression -- oops! :)

I guess it should be possible for devices to provide transitions in
both directions, which would deal with this..

Ignoring that last point for now, I don't think you actually need a
"query_transition" function in libdevicealloc with this approach, for the
most part.

with the idea of being able to provide optional transitions, I'm
leaning towards just having one of outputs of merging the caps sets
being the sets of transitions required to pass a buffer between
different devices..


I think I like the idea of having transitions being part of the
per-device/engine cap sets, so that such information can be used upon
merging to know which capabilities may remain or have to be dropped.

I think James's proposal for usage transitions was intended to work
with flows like:

   1. App gets GPU caps for RENDER usage
   2. App allocates GPU memory using a layout from (1)
   3. App now decides it wants to use the buffer for SCANOUT
   4. App queries usage transition metadata from RENDER to SCANOUT,
      given the current memory layout.
   5. Do the transition and hand the buffer off to display

No; in the model I assumed, all usages the app intends to transition to must be specified up front when initially querying caps. The app then specifies some subset (up to the full set) of the specified usages as a src and dst when querying transition metadata.

The problem I see with this is that it isn't guaranteed that there will
be a chain of transitions for the buffer to be usable by display.

I hadn't thought hard about it, but my initial thoughts were that it would be required that the driver support transitioning to any single usage given the capabilities returned. However, transitioning to multiple usages (e.g., simultaneously rendering and scanning out) could fail to produce a valid transition, in which case the app would have to fall back to a copy, or avoid that simultaneous usage combination in some other way.

Adding transition metadata to the original capability sets, and using
that information when merging could give us a compatible memory layout
that would be usable by both GPU and display.

I'll look into extending the current merging logic to also take into
account transitions.

Yes, it'll be good to see whether this can be made to work. I agree Rob's example outcomes above are ideal, but it's not clear to me how to code up such an algorithm. This also all seems unnecessary if "device local" capabilities aren't needed, as posited above.

although maybe the user doesn't need to know every possible transition
between devices once you have more than two devices..

We should be able to infer how buffers are going to be moved around
from the list of usages, shouldn't we?

Maybe we are missing some bits of information there, but I think the
allocator should be able to know what transitions the app will care
about and provide only those.

The allocator only knows the requested union of all usages currently. The number of possible transitions grows combinatorially with every usage requested, I believe. I expect there will be cases where ~10 usages are specified, so generating all possible transitions all the time may be excessive, when the app will probably only care about 2 or 3 states, and in practice there will probably only be 2 or 3 different underlying combinations of operations.
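
A rough count to back that up, assuming transition metadata could be queried between any pair of usage subsets:

   usages requested:           n = 10
   possible usage subsets:     2^10         = 1024
   ordered (src, dst) pairs:   1024 * 1024 ~= 1,000,000

versus the 2 or 3 combinations a given app will actually exercise.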


/me shrugs

Instead, each API needs to provide import and export transition/barrier functions which receive the previous/next layout-and-capability set.

Basically, to import a frame from the camera to OpenGL/Vulkan in the above
system, you'd first do the camera transition:

   struct layout_capability cc_cap = { FOO, FOO_CC };
   struct device_capability gpu_cache = { FOO, FOO_GPU_CACHE };

   cameraExportTransition(image, 1, &cc_cap, 1, &gpu_cache, &fence);

and then e.g. an OpenGL import transition:

   struct device_capability vid_cache = { FOO, FOO_VID_CACHE };

   glImportTransitionEXT(texture, 0, NULL, 1, &vid_cache, fence);

By looking at the capabilities for the other device, each API's driver can
derive the required transition steps.

There are probably more gaps, but these are the two I can think of right
now, and both related to the initialization status of meta surfaces, i.e.
FOO/cc:

1. Point 5 above about moving away from the display engine in the example.
This is an ugly asymmetry in the rule that each engine performs its required
import and export transitions.

2. When the GPU imports a FOO/tiled(FOO/cc) surface, the compression meta
surface can be in one of two states:
- reflecting a fully decompressed surface (if the surface was previously
exported from the GPU), or
- garbage (if the surface was allocated by the GPU driver, but then handed
off to the camera before being re-imported for processing)
The GPU's import transition needs to distinguish the two, but it can't with
the scheme above.

hmm, so I suppose this is also true in the cache case.. you want to
know if the buffer was written by someone else since you saw it last..

Something to think about :)

Also, not really a gap, but something to keep in mind: for multi-GPU
systems, the cache-capability needs to carry the device number or PCI bus id
or something, at least as long as those caches are not coherent between
GPUs.

yeah, maybe shouldn't be FOO/gpucache but FOO/gpucache($id)..

That just seems an implementation detail of the representation the
particular vendor chooses for the CACHE capability, right?

Agreed.
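
(For illustration, one way a vendor might parameterize such a cap; this struct is entirely hypothetical:)

   #include <stdint.h>

   struct capability_header { uint32_t vendor, type; };

   /* A vendor cache cap carrying the owning device, so that
    * FOO/gpucache on GPU 0 and FOO/gpucache on GPU 1 never compare
    * equal while the two GPUs' caches are not coherent. */
   struct foo_gpucache_cap {
       struct capability_header header;  /* vendor=FOO, type=GPUCACHE */
       uint16_t pci_domain;
       uint8_t  pci_bus, pci_dev, pci_func;
   };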

One final note: When I initially wrote up the capability merging logic, I treated "layout" as a sort of "special" capability, basically like Nicolai originally outlined above. Miguel suggested I add the "required" bit instead to generalize things, and it ended up working out much cleaner. Besides the layout, there is at least one other candidate for a "required" capability that became obvious as soon as I started coding up the prototype driver: memory location. It might seem like memory location is a simple device-agnostic constraint rather than a capability, but it's actually too complicated (we need more memory locations than "device" and "host"). It has to be vendor-specific, and hence fits in better as a capability.
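
To illustrate why a generic constraint doesn't fit, here's the sort of vendor-specific location enumeration I mean (values invented for this sketch, not our driver's actual ones):

   /* More locations than a device-agnostic "device" / "host" flag can
    * express; hence vendor-specific, and hence a (required) capability
    * rather than a constraint. */
   enum vnd_memory_location {
       VND_MEM_VIDMEM,           /* local video memory */
       VND_MEM_SYSMEM_CACHED,    /* system memory, CPU-cached */
       VND_MEM_SYSMEM_UNCACHED,  /* system memory, write-combined */
       VND_MEM_PEER_VIDMEM,      /* another GPU's video memory */
   };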

I think if possible, we should try to keep the design generalized to as few types of objects and special cases as possible. The more we can generalize the solutions to our existing problem set, the better the mechanism should hold up as we apply it to new and unknown problems as they arise.

Thanks,
-James

Thanks,
Miguel.



BR,
-R

