Francisco Jerez <curroje...@riseup.net> writes:

> This is the first part of a series meant to improve our usage of the L3 cache.
> Currently it's far from ideal since the following objects aren't taking any
> advantage of it:
>  - Pull constants (i.e. UBOs and demoted uniforms)
>  - Buffer textures
>  - Shader scratch space (i.e. register spills and fills)
>  - Atomic counters
>  - (Soon) Images
>
> This first series addresses the first two issues. Fixing the last three is
> going to be a bit more difficult because we need to modify the partitioning of
> the L3 cache in order to increase the number of ways assigned to the DC, which
> happens to be zero on boot until Gen8. That's likely to require kernel
> changes because we don't have any extremely satisfactory API to change that
> from userspace right now.
>
> The first patch in the series sets the MOCS L3 cacheability bit in the surface
> state structure for buffers so the mentioned memory objects (except the shader
> scratch space that gets its MOCS from elsewhere) have a chance of getting
> cached in L3.
>
> The fourth patch in the series switches to using the constant cache (which,
> unlike the data cache that was used years ago before we started using the
> sampler, is cached on L3 with the default partitioning on all gens) for
> uniform pull constants loads. The overall performance numbers I've collected
> are included in the commit message of the same patch for future reference.
> Most of it points at the constant cache being faster than the sampler in a
> number of cases (assuming the L3 caching settings are correct), it's also
> likely to alleviate some cache thrashing caused by the competition with
> textures for the L1/L2 sampler caches, and it allows fetching up to eight
> consecutive owords (128B) with just one message.
>
> The sixth patch enables 4 oword loads because they're basically for free and
> they avoid some of the shortcomings of the 1 and 2 oword messages (see the
> commit message for more details). I'll have a look into enabling 8 oword
> loads but it's going to require an analysis pass to avoid wasting bandwidth
> and increasing the register pressure unnecessarily when the shader doesn't
> actually need as many constants.
>
> We could do something similar for non-uniform offset pull constant loads and
> for both kinds of pull constant loads on the vec4 back-end, but I don't have
> enough performance data to support that yet.
>
> [PATCH 1/7] i965: Enable L3 caching of buffer surfaces.
> [PATCH 2/7] i965: Remove the create_raw_surface vtbl hook.
> [PATCH 3/7] i965: Let the caller of brw_set_dp_write/read_message control the
>             target cache.
> [PATCH 4/7] i965/fs: Switch to the constant cache for uniform pull constants.
> [PATCH 5/7] i965/fs: Less broken handling of force_writemask_all in
>             lower_load_payload().
> [PATCH 6/7] i965/fs: Fetch one cacheline of pull constants at a time.
> [PATCH 7/7] i965/fs: Remove the FS_OPCODE_SET_SIMD4X2_OFFSET virtual opcode.
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Any volunteer to review the rest of this performance-improving series before the merge window closes?
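
For anyone picking this up, here is a rough sketch of the arithmetic behind "fetch one cacheline of pull constants at a time" from patch 6. It is not the code in the series. It assumes a 16-byte oword (consistent with eight owords being 128B) and that a 4-oword load covers one 64B block; the struct and function names are hypothetical and exist only for illustration.

```c
/* One oword is 16 bytes (8 owords = 128B, as stated in the cover letter). */
#define OWORD_SIZE   16u
/* Assumption: a 4-oword load fetches one 64B block of pull constants. */
#define BLOCK_OWORDS 4u
#define BLOCK_SIZE   (OWORD_SIZE * BLOCK_OWORDS)

/* Hypothetical helper type, not part of the i965 driver. */
struct pull_block {
   unsigned base;   /* byte offset of the aligned block to fetch */
   unsigned offset; /* byte offset of the constant inside that block */
};

/* Split a uniform byte offset into the aligned block that contains it
 * and the position of the constant within that block.  BLOCK_SIZE is a
 * power of two, so masking is enough. */
static struct pull_block
pull_block_for_offset(unsigned byte_offset)
{
   struct pull_block b;
   b.base = byte_offset & ~(BLOCK_SIZE - 1u);
   b.offset = byte_offset & (BLOCK_SIZE - 1u);
   return b;
}
```

With these assumptions, two constants at byte offsets 68 and 100 both fall in the block starting at 64, so a single 4-oword load could serve both instead of two narrower messages.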