Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup
On Thu, 2011-06-30 at 17:53 +0200, Roland Scheidegger wrote:
> Am 30.06.2011 16:14, schrieb Adam Jackson:
> > On Thu, 2011-06-30 at 03:36 +0200, Roland Scheidegger wrote:
> >> Ok in fact there's a gcc bug about memcmp:
> >> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
> >> In short gcc's memcmp builtin is totally lame and loses to glibc's
> >> memcmp (including call overhead, no knowledge about alignment etc.) even
> >> when comparing only very few bytes (and loses BIG time for lots of bytes
> >> to compare). Oops. Well at least if the strings are the same (I'd guess
> >> if the first byte is different it's hard to beat the gcc builtin...).
> >> So this is really a gcc bug. The bug is quite old though with no fix in
> >> sight apparently so might need to think about some workaround (but just
> >> not doing the comparison doesn't look like the right idea, since
> >> apparently it would be faster with the comparison if gcc's memcmp got
> >> fixed).
> >
> > How do things fare if you build with -fno-builtin-memcmp?
>
> This is even faster:
> original ipers: 12.1 fps
> ajax patch: 15.5 fps
> optimized struct compare: 16.8 fps
> -fno-builtin-memcmp: 18.1 fps
>
> Looks like we have a winner :-) I guess glibc optimizes the hell out of
> it (in contrast to the other results, this affected all memcmp though I
> don't know if any others benefited from that on average).
> As noted by Keith though the struct we compare is really large (over 4k)
> so trimming the size might be a good idea anyway (of course the 4k size
> also meant any call overhead and non-optimal code due to glibc not
> knowing alignment beforehand and usage of return value is completely
> insignificant).
> A 50% improvement from disabling a compiler optimization, lol.

We probably want this everywhere throughout Mesa & Gallium...

Keith

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup
Am 30.06.2011 16:14, schrieb Adam Jackson:
> On Thu, 2011-06-30 at 03:36 +0200, Roland Scheidegger wrote:
>> Ok in fact there's a gcc bug about memcmp:
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
>> In short gcc's memcmp builtin is totally lame and loses to glibc's
>> memcmp (including call overhead, no knowledge about alignment etc.) even
>> when comparing only very few bytes (and loses BIG time for lots of bytes
>> to compare). Oops. Well at least if the strings are the same (I'd guess
>> if the first byte is different it's hard to beat the gcc builtin...).
>> So this is really a gcc bug. The bug is quite old though with no fix in
>> sight apparently so might need to think about some workaround (but just
>> not doing the comparison doesn't look like the right idea, since
>> apparently it would be faster with the comparison if gcc's memcmp got
>> fixed).
>
> How do things fare if you build with -fno-builtin-memcmp?

This is even faster:
original ipers: 12.1 fps
ajax patch: 15.5 fps
optimized struct compare: 16.8 fps
-fno-builtin-memcmp: 18.1 fps

Looks like we have a winner :-) I guess glibc optimizes the hell out of
it (in contrast to the other results, this affected all memcmp though I
don't know if any others benefited from that on average).
As noted by Keith though the struct we compare is really large (over 4k)
so trimming the size might be a good idea anyway (of course the 4k size
also meant any call overhead and non-optimal code due to glibc not
knowing alignment beforehand and usage of return value is completely
insignificant).

A 50% improvement from disabling a compiler optimization, lol.

Roland
Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup
Am 30.06.2011 12:14, schrieb Jose Fonseca:
> ----- Original Message -----
>> Hmm.
>> Forgive my ignorance, but isn't memcmp() on structs pretty prone to give
>> incorrect != results, given that there may be padding between members in
>> structs and that IIRC gcc struct assignment is member-wise.
>
> There's no alternative to bitwise comparison in C:
>
> $ cat cmp.c
> struct foo {
>    int a;
>    int b;
> };
>
> int cmp(const struct foo *a, const struct foo *b) {
>    return *a == *b;
> }
> $ gcc -c cmp.c
> cmp.c: In function ‘cmp’:
> cmp.c:7:12: error: invalid operands to binary == (have ‘const struct foo’ and ‘const struct foo’)
>
>> What happens if there's padding between the jit_context and variant
>> members of struct lp_rast_state?
>>
>> I seem to recall hitting similar issues a number of times in the past.
>
> I recall that as well, but my memory is the other way around: struct
> assignment is considered harmful because it breaks memcmp. Instead all
> structures should be initialized with memset(0) first, and always copied
> with memcpy. This should ensure that padding doesn't get clobbered.
>
> But now that you mention it, I'm still not 100% sure whether unused bits
> in bitfields are preserved like that or not when being assigned. We
> probably should ensure that all bits in bitfields are used, using
> "reserved" members, so that the zeros there stay zero.

We've definitely hit issues like that in the past - I think if you use
struct assignment you'll need to initialize the dst struct to 0 initially
(but only once - even though the padding is probably undefined after such
an assignment, any implementation should either copy everything including
the padding or leave the padding alone). I don't think anything else will
touch the unused parts, though I guess it might be possible for instance
if a 32bit int is assigned to a 16 bit bitfield which has padding after it.
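The memset-then-memcmp discipline discussed above can be sketched in a few lines. The struct and the `state_init`/`state_equal` names are hypothetical, not Mesa code; the point is only that zeroing the whole object first puts the padding bytes in a known state:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical example struct: on typical ABIs there are 3 padding
 * bytes between 'tag' and 'value'. */
struct state {
    char tag;
    int value;
};

/* memset() the whole struct before setting members, so the padding
 * bytes are zeroed and memcmp() gives reliable equality results. */
static void state_init(struct state *s, char tag, int value)
{
    memset(s, 0, sizeof *s);   /* zeroes the padding too */
    s->tag = tag;
    s->value = value;
}

static int state_equal(const struct state *a, const struct state *b)
{
    return memcmp(a, b, sizeof *a) == 0;
}
```

With plain member-wise initialization instead of `memset`, the padding contents would be indeterminate and the `memcmp` could report a spurious mismatch.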
But generally using memcmp/memcpy should work ok, and it gives the
compiler all the information it needs to do it fast. Well, whether it
actually uses that is another question... I think the problem with gcc is
that when it inserts the comparison with repz cmpsb it knows the
alignment and size to compare, but doesn't know the result is only used
as a strict equality test. That makes it impossible to generate really
optimized code there (as you need some byte comparison to get correct
memcmp semantics, unless you use bswap or do a dword comparison followed
by a byte comparison on a non-match - both of which are probably slower
if you expect the comparison to fail on the first byte, which is the case
for instance in substring searches), and later it can't optimize that
into something more sensible. So this might not be trivial to fix in gcc.
Too bad no builtin is available which only returns true/false...

Roland
Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup
On Thu, 2011-06-30 at 03:36 +0200, Roland Scheidegger wrote:
> Ok in fact there's a gcc bug about memcmp:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
> In short gcc's memcmp builtin is totally lame and loses to glibc's
> memcmp (including call overhead, no knowledge about alignment etc.) even
> when comparing only very few bytes (and loses BIG time for lots of bytes
> to compare). Oops. Well at least if the strings are the same (I'd guess
> if the first byte is different it's hard to beat the gcc builtin...).
> So this is really a gcc bug. The bug is quite old though with no fix in
> sight apparently so might need to think about some workaround (but just
> not doing the comparison doesn't look like the right idea, since
> apparently it would be faster with the comparison if gcc's memcmp got
> fixed).

How do things fare if you build with -fno-builtin-memcmp?

- ajax
Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup
On Thu, 2011-06-30 at 03:27 -0700, Jose Fonseca wrote:
> ----- Original Message -----
> > On Thu, 2011-06-30 at 03:36 +0200, Roland Scheidegger wrote:
> > > Ok in fact there's a gcc bug about memcmp:
> > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
> > > In short gcc's memcmp builtin is totally lame and loses to glibc's
> > > memcmp (including call overhead, no knowledge about alignment etc.) even
> > > when comparing only very few bytes (and loses BIG time for lots of bytes
> > > to compare). Oops. Well at least if the strings are the same (I'd guess
> > > if the first byte is different it's hard to beat the gcc builtin...).
> > > So this is really a gcc bug. The bug is quite old though with no fix in
> > > sight apparently so might need to think about some workaround (but just
> > > not doing the comparison doesn't look like the right idea, since
> > > apparently it would be faster with the comparison if gcc's memcmp got
> > > fixed).
> >
> > Looking at the struct again (it's been a while), it seems like it could
> > be rearranged to be variable-sized and on average significantly smaller:
> >
> > struct lp_rast_state {
> >    struct lp_jit_context jit_context;
> >    struct lp_fragment_shader_variant *variant;
> > };
> >
> > struct lp_jit_context {
> >    const float *constants;
> >    float alpha_ref_value;
> >    uint32_t stencil_ref_front, stencil_ref_back;
> >    uint8_t *blend_color;
> >    struct lp_jit_texture textures[PIPE_MAX_SAMPLERS];
> > };
> >
> > If we moved the jit_context part behind "variant", and then hopefully
> > note that most of those lp_jit_texture structs are not in use. That
> > would save time on the memcmp *and* space in the binned data.
>
> Yeah, sounds a good idea.
> But there's some subtlety to computing the number of textures: it
> can't be just the NULL textures, because they may be referred to by the
> JIT code, which has no NULL checks and relies on the state setup to
> provide storage for all textures, or dummy memory if one is not bound.

So it's a property of the variant, right? We should just store that
information when we generate the llvm variant.

> I think a better idea would be:
> - split the texture/sampler state
> - to make the lp_jit_context::textures an array of pointers, and put the
>   struct lp_jit_texture in the pipe_texture object themselves
> - to make the lp_jit_context::samplers an array of pointers, and put the
>   struct lp_jit_sampler in the pipe_sampler_state CSO

I like this too - it's somewhat more involved of course. In fact the two
are orthogonal -- the struct below can still be shrunk significantly by
knowing how many samplers & textures the variant refers to. Interleaving
them or packing them would reduce the bytes to be compared. Alternatively
there could be just a pointer in jit_context to textures/samplers binned
elsewhere.

> struct lp_jit_context {
>    struct lp_jit_texture *textures[PIPE_MAX_SAMPLERS];
>    struct lp_jit_sampler *samplers[PIPE_MAX_SAMPLERS];
> };

The jit context above seems to have lost some of its fields...
The next step might be to split the context into four parts: textures,
samplers, constants, "other", and have jit_context just be a set of
pointers into the binned data:

struct lp_jit_context {
   struct lp_jit_texture **textures;
   struct lp_jit_sampler **samplers;
   const float *constants;
   const struct lp_jit_other *other;
};

struct lp_jit_other {
   float alpha_ref_value;
   uint32_t stencil_ref_front;
   uint32_t stencil_ref_back;
   uint8_t *blend_color;
};

> struct lp_jit_texture
> {
>    uint32_t width;
>    uint32_t height;
>    uint32_t depth;
>    uint32_t first_level;
>    uint32_t last_level;
>    uint32_t row_stride[LP_MAX_TEXTURE_LEVELS];
>    uint32_t img_stride[LP_MAX_TEXTURE_LEVELS];
>    const void *data[LP_MAX_TEXTURE_LEVELS];
>    /* sampler state, actually */
>    float min_lod;
>    float max_lod;
>    float lod_bias;
>    float border_color[4];
> };
>
> struct lp_jit_sampler
> {
>    float min_lod;
>    float max_lod;
>    float lod_bias;
>    float border_color[4];
> };
>
> Jose
Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup
----- Original Message -----
> On Thu, 2011-06-30 at 03:36 +0200, Roland Scheidegger wrote:
> > Ok in fact there's a gcc bug about memcmp:
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
> > In short gcc's memcmp builtin is totally lame and loses to glibc's
> > memcmp (including call overhead, no knowledge about alignment etc.) even
> > when comparing only very few bytes (and loses BIG time for lots of bytes
> > to compare). Oops. Well at least if the strings are the same (I'd guess
> > if the first byte is different it's hard to beat the gcc builtin...).
> > So this is really a gcc bug. The bug is quite old though with no fix in
> > sight apparently so might need to think about some workaround (but just
> > not doing the comparison doesn't look like the right idea, since
> > apparently it would be faster with the comparison if gcc's memcmp got
> > fixed).
>
> Looking at the struct again (it's been a while), it seems like it could
> be rearranged to be variable-sized and on average significantly smaller:
>
> struct lp_rast_state {
>    struct lp_jit_context jit_context;
>    struct lp_fragment_shader_variant *variant;
> };
>
> struct lp_jit_context {
>    const float *constants;
>    float alpha_ref_value;
>    uint32_t stencil_ref_front, stencil_ref_back;
>    uint8_t *blend_color;
>    struct lp_jit_texture textures[PIPE_MAX_SAMPLERS];
> };
>
> If we moved the jit_context part behind "variant", and then hopefully
> note that most of those lp_jit_texture structs are not in use. That
> would save time on the memcmp *and* space in the binned data.

Yeah, sounds a good idea.

But there's some subtlety to computing the number of textures: it can't
be just the NULL textures, because they may be referred to by the JIT
code, which has no NULL checks and relies on the state setup to provide
storage for all textures, or dummy memory if one is not bound.
I think a better idea would be:
- split the texture/sampler state
- to make the lp_jit_context::textures an array of pointers, and put the
  struct lp_jit_texture in the pipe_texture object themselves
- to make the lp_jit_context::samplers an array of pointers, and put the
  struct lp_jit_sampler in the pipe_sampler_state CSO

struct lp_jit_context {
   struct lp_jit_texture *textures[PIPE_MAX_SAMPLERS];
   struct lp_jit_sampler *samplers[PIPE_MAX_SAMPLERS];
};

struct lp_jit_texture
{
   uint32_t width;
   uint32_t height;
   uint32_t depth;
   uint32_t first_level;
   uint32_t last_level;
   uint32_t row_stride[LP_MAX_TEXTURE_LEVELS];
   uint32_t img_stride[LP_MAX_TEXTURE_LEVELS];
   const void *data[LP_MAX_TEXTURE_LEVELS];
   /* sampler state, actually */
   float min_lod;
   float max_lod;
   float lod_bias;
   float border_color[4];
};

struct lp_jit_sampler
{
   float min_lod;
   float max_lod;
   float lod_bias;
   float border_color[4];
};

Jose
Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup
----- Original Message -----
> Hmm.
> Forgive my ignorance, but isn't memcmp() on structs pretty prone to give
> incorrect != results, given that there may be padding between members in
> structs and that IIRC gcc struct assignment is member-wise.

There's no alternative to bitwise comparison in C:

$ cat cmp.c
struct foo {
   int a;
   int b;
};

int cmp(const struct foo *a, const struct foo *b) {
   return *a == *b;
}
$ gcc -c cmp.c
cmp.c: In function ‘cmp’:
cmp.c:7:12: error: invalid operands to binary == (have ‘const struct foo’ and ‘const struct foo’)

> What happens if there's padding between the jit_context and variant
> members of struct lp_rast_state?
>
> I seem to recall hitting similar issues a number of times in the past.

I recall that as well, but my memory is the other way around: struct
assignment is considered harmful because it breaks memcmp. Instead all
structures should be initialized with memset(0) first, and always copied
with memcpy. This should ensure that padding doesn't get clobbered.

But now that you mention it, I'm still not 100% sure whether unused bits
in bitfields are preserved like that or not when being assigned. We
probably should ensure that all bits in bitfields are used, using
"reserved" members, so that the zeros there stay zero.

Jose

> /Thomas
>
> On 06/30/2011 03:36 AM, Roland Scheidegger wrote:
> > Ok in fact there's a gcc bug about memcmp:
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
> > In short gcc's memcmp builtin is totally lame and loses to glibc's
> > memcmp (including call overhead, no knowledge about alignment etc.) even
> > when comparing only very few bytes (and loses BIG time for lots of bytes
> > to compare). Oops. Well at least if the strings are the same (I'd guess
> > if the first byte is different it's hard to beat the gcc builtin...).
> > So this is really a gcc bug.
> > The bug is quite old though with no fix in
> > sight apparently so might need to think about some workaround (but just
> > not doing the comparison doesn't look like the right idea, since
> > apparently it would be faster with the comparison if gcc's memcmp got
> > fixed).
> >
> > Roland
> >
> > Am 30.06.2011 01:47, schrieb Roland Scheidegger:
> >> I didn't even look at that, was just curious why the memcmp (which is
> >> used a lot in other places) is slow. However, none of the other memcmp
> >> seem to show up prominently (cso functions are quite low in profiles,
> >> _mesa_search_program_cache uses memcmp too but it's not that high
> >> neither). So I guess those functions either aren't called that often or
> >> the sizes they compare are small.
> >> So should maybe just file a gcc bug for memcmp and look at that
> >> particular llvmpipe issue again :-).
> >>
> >> Roland
> >>
> >> Am 30.06.2011 01:16, schrieb Corbin Simpson:
> >>> Okay, so maybe I'm failing to recognize the exact situation here, but
> >>> wouldn't it be possible to mark the FS state with a serial number and
> >>> just compare those? Or are these FS states not CSO-cached?
> >>>
> >>> ~ C.
> >>>
> >>> On Wed, Jun 29, 2011 at 3:44 PM, Roland Scheidegger wrote:
> >>>> Actually I ran some numbers here and tried out an optimized struct
> >>>> compare:
> >>>> original ipers: 12.1 fps
> >>>> ajax patch: 15.5 fps
> >>>> optimized struct compare: 16.8 fps
> >>>>
> >>>> This is the function I used for that (just enabled in that lp_setup
> >>>> function):
> >>>>
> >>>> static INLINE int util_cmp_struct(const void *src1, const void *src2,
> >>>>                                   unsigned count)
> >>>> {
> >>>>    /* hmm pointer casting is evil */
> >>>>    const uint32_t *src1_ptr = (uint32_t *)src1;
> >>>>    const uint32_t *src2_ptr = (uint32_t *)src2;
> >>>>    unsigned i;
> >>>>    assert(count % 4 == 0);
> >>>>    for (i = 0; i < count/4; i++) {
> >>>>       if (*src1_ptr != *src2_ptr) {
> >>>>          return 1;
> >>>>       }
> >>>>       src1_ptr++;
> >>>>       src2_ptr++;
> >>>>    }
> >>>>    return 0;
> >>>> }
> >>>>
> >>>> (And no, this doesn't use repz cmpsd here.)
> >>>>
> >>>> So, unless I made some mistake, memcmp is just dead slow (*), most of
> >>>> the slowness probably coming from the bytewise comparison (and
> >>>> apparently I was wrong in assuming the comparison there might never be
> >>>> the same for ipers).
> >>>> Of course, the optimized struct compare relies on structs really being
> >>>> dword aligned (I think this is always the case), and additionally it
> >>>> relies on the struct size being a whole multiple of dwords - likely
> >>>> struct needs padding to ensure that (at least I don't think this is
> >>>> always guaranteed for all structs).
> >>>> But since memcmp is used extensively (cso for one) maybe some
> >>>> optimization along these lines might be worth it (though of course for
> >>>> small structs the win isn't going
Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup
Great work Roland! And thanks ajax for finding this hot spot.

We use memcmp a lot -- all CSO caching, so we should use this everywhere.
We should also code a sse2 version with intrinsics for x86-64, which is
guaranteed to always have SSE2.

Jose

----- Original Message -----
> Actually I ran some numbers here and tried out an optimized struct
> compare:
> original ipers: 12.1 fps
> ajax patch: 15.5 fps
> optimized struct compare: 16.8 fps
>
> This is the function I used for that (just enabled in that lp_setup
> function):
>
> static INLINE int util_cmp_struct(const void *src1, const void *src2,
>                                   unsigned count)
> {
>    /* hmm pointer casting is evil */
>    const uint32_t *src1_ptr = (uint32_t *)src1;
>    const uint32_t *src2_ptr = (uint32_t *)src2;
>    unsigned i;
>    assert(count % 4 == 0);
>    for (i = 0; i < count/4; i++) {
>       if (*src1_ptr != *src2_ptr) {
>          return 1;
>       }
>       src1_ptr++;
>       src2_ptr++;
>    }
>    return 0;
> }
>
> (And no, this doesn't use repz cmpsd here.)
>
> So, unless I made some mistake, memcmp is just dead slow (*), most of
> the slowness probably coming from the bytewise comparison (and
> apparently I was wrong in assuming the comparison there might never be
> the same for ipers).
> Of course, the optimized struct compare relies on structs really being
> dword aligned (I think this is always the case), and additionally it
> relies on the struct size being a whole multiple of dwords - likely
> struct needs padding to ensure that (at least I don't think this is
> always guaranteed for all structs).
> But since memcmp is used extensively (cso for one) maybe some
> optimization along these lines might be worth it (though of course for
> small structs the win isn't going to be as big - and can't beat the repz
> cmps in code size...).
>
> Roland
>
> (*) I actually found some references some implementations might be
> better: they don't just use repz cmpsb but split this up in parts which
> do dword (or qword even - well, for really large structs could use sse2)
> comparisons for the parts where it's possible and only byte comparisons
> for the remaining bytes (and if the compiler does that it probably would
> know the size at compile time here hence could leave out much of the
> code). Of course memcmp requires that the return value isn't just a true
> or false value, hence there's more code needed once an unequal dword is
> found, though the compiler could optimize that away too in case it's not
> needed. Much the same as memcpy is optimized usually really, so blame
> gcc :-).
>
> Am 29.06.2011 20:33, schrieb Roland Scheidegger:
> > Ohh that's interesting, you'd think the comparison shouldn't be that
> > expensive (though I guess in ipers case the comparison is never true).
> > memcmp is quite extensively used everywhere. Maybe we could replace that
> > with something faster (since we only ever care if the blocks are the
> > same but not care about the lexicographic ordering and always compare
> > whole structs, should compare dwords instead of bytes for a 4 time
> > speedup)? Or isn't that the reason cmpsb instead of cmpsd is used?
> > Also I guess it would help if the values which are more likely to be
> > unequal are first in the struct (if we can tell that).
> > Of course though if it's unlikely to be the same as the compared value
> > anyway, not comparing at all still might be a win (here).
> >
> > Roland
> >
> > Am 29.06.2011 19:19, schrieb Adam Jackson:
> >> Perversely, do this by eliminating the comparison between stored and
> >> current fs state. On ipers, a perf trace showed try_update_scene_state
> >> using 31% of a CPU, and 98% of that was in 'repz cmpsb', ie, the
> >> memcmp.
> >> Taking that out takes try_update_scene_state down to 6.5% of the
> >> profile; more importantly, ipers goes from 10 to 14fps and gears goes
> >> from 790 to 830fps.
> >>
> >> Signed-off-by: Adam Jackson
> >> ---
> >>  src/gallium/drivers/llvmpipe/lp_setup.c |   61 ++-
> >>  1 files changed, 27 insertions(+), 34 deletions(-)
> >>
> >> diff --git a/src/gallium/drivers/llvmpipe/lp_setup.c
> >> b/src/gallium/drivers/llvmpipe/lp_setup.c
> >> index cbe06e5..9118db5 100644
> >> --- a/src/gallium/drivers/llvmpipe/lp_setup.c
> >> +++ b/src/gallium/drivers/llvmpipe/lp_setup.c
> >> @@ -839,42 +839,35 @@ try_update_scene_state( struct lp_setup_context *setup )
> >>        setup->dirty |= LP_SETUP_NEW_FS;
> >>     }
> >>
> >> -
> >>     if (setup->dirty & LP_SETUP_NEW_FS) {
> >> -      if (!setup->fs.stored ||
> >> -          memcmp(setup->fs.stored,
> >> -                 &setup->fs.current,
> >> -                 sizeof setup->fs.current) != 0)
> >> -      {
> >> -         struct lp_rast_state *stored;
> >> -         uint i;
> >> -
> >> -         /* The fs state that's been stored in the scene is different from
> >> -          * the new, current state. So allocate a n
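Jose's suggestion earlier in the thread of an SSE2 intrinsics version of the equality-only compare might look roughly like this. The `util_cmp_struct_sse2` name and the constraints (unaligned loads, size assumed to be a multiple of 16 bytes) are assumptions for the sketch, not actual Mesa code; it keeps the same 0-equal/1-different convention as util_cmp_struct:

```c
#include <emmintrin.h>   /* SSE2 intrinsics, baseline on x86-64 */
#include <assert.h>
#include <stddef.h>

/* Equality-only struct compare, 16 bytes at a time.
 * Returns 0 if equal, 1 if different. */
static int util_cmp_struct_sse2(const void *src1, const void *src2,
                                size_t count)
{
    const char *p1 = (const char *)src1;
    const char *p2 = (const char *)src2;
    size_t i;
    assert(count % 16 == 0);
    for (i = 0; i < count; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(p1 + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(p2 + i));
        /* cmpeq sets each byte to 0xff where equal; movmskb collects
         * one bit per byte, so 0xffff means all 16 bytes matched. */
        __m128i eq = _mm_cmpeq_epi8(a, b);
        if (_mm_movemask_epi8(eq) != 0xffff)
            return 1;
    }
    return 0;
}
```

Since only equality is needed, no per-byte fallback is required on a mismatch, which is exactly the freedom memcmp's three-way return value denies the compiler.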
Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup
On Thu, 2011-06-30 at 03:36 +0200, Roland Scheidegger wrote:
> Ok in fact there's a gcc bug about memcmp:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
> In short gcc's memcmp builtin is totally lame and loses to glibc's
> memcmp (including call overhead, no knowledge about alignment etc.) even
> when comparing only very few bytes (and loses BIG time for lots of bytes
> to compare). Oops. Well at least if the strings are the same (I'd guess
> if the first byte is different it's hard to beat the gcc builtin...).
> So this is really a gcc bug. The bug is quite old though with no fix in
> sight apparently so might need to think about some workaround (but just
> not doing the comparison doesn't look like the right idea, since
> apparently it would be faster with the comparison if gcc's memcmp got
> fixed).

Looking at the struct again (it's been a while), it seems like it could
be rearranged to be variable-sized and on average significantly smaller:

struct lp_rast_state {
   struct lp_jit_context jit_context;
   struct lp_fragment_shader_variant *variant;
};

struct lp_jit_context {
   const float *constants;
   float alpha_ref_value;
   uint32_t stencil_ref_front, stencil_ref_back;
   uint8_t *blend_color;
   struct lp_jit_texture textures[PIPE_MAX_SAMPLERS];
};

If we moved the jit_context part behind "variant", we could then
hopefully note that most of those lp_jit_texture structs are not in use.
That would save time on the memcmp *and* space in the binned data.

It's weird this wasn't showing up in past profiling.

Keith
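The variable-sized rearrangement described above could be sketched with a C99 flexible array member. The stub type definitions and the `_packed` name here are hypothetical stand-ins (the real definitions live in the llvmpipe headers), shown only to illustrate the layout:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stub stand-ins so the sketch compiles on its own. */
struct lp_jit_texture { uint32_t width, height; const void *data; };
struct lp_fragment_shader_variant { int dummy; };

/* Variable-sized binned state: fixed-size part first, then only as
 * many texture entries as the variant actually references. */
struct lp_rast_state_packed {
    struct lp_fragment_shader_variant *variant;
    const float *constants;
    float alpha_ref_value;
    uint32_t stencil_ref_front, stencil_ref_back;
    uint8_t *blend_color;
    unsigned num_textures;
    struct lp_jit_texture textures[];   /* C99 flexible array member */
};

/* Both the scene allocation and the memcmp in try_update_scene_state
 * would then cover only this many bytes. */
static size_t rast_state_size(unsigned num_textures)
{
    return sizeof(struct lp_rast_state_packed)
         + num_textures * sizeof(struct lp_jit_texture);
}
```

With no textures bound, the compared region shrinks from the fixed PIPE_MAX_SAMPLERS-sized array down to just the fixed header.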
Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup
On Wed, 2011-06-29 at 16:16 -0700, Corbin Simpson wrote:
> Okay, so maybe I'm failing to recognize the exact situation here, but
> wouldn't it be possible to mark the FS state with a serial number and
> just compare those? Or are these FS states not CSO-cached?

No, the struct being compared is poorly named & collides with a CSO
entity. It's really all the state which the compiled fragment shader
will reference when it is later invoked. It's all packed into a single
struct because it's easier to pass a single parameter to llvm-compiled
shaders and add/change that parameter, but it is somewhat non-orthogonal
and we end up generating too many of them.

Keith
Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup
Hmm.

Forgive my ignorance, but isn't memcmp() on structs pretty prone to give
incorrect != results, given that there may be padding between members in
structs and that IIRC gcc struct assignment is member-wise?

What happens if there's padding between the jit_context and variant
members of struct lp_rast_state?

I seem to recall hitting similar issues a number of times in the past.

/Thomas

On 06/30/2011 03:36 AM, Roland Scheidegger wrote:
Ok in fact there's a gcc bug about memcmp:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
In short gcc's memcmp builtin is totally lame and loses to glibc's
memcmp (including call overhead, no knowledge about alignment etc.) even
when comparing only very few bytes (and loses BIG time for lots of bytes
to compare). Oops. Well at least if the strings are the same (I'd guess
if the first byte is different it's hard to beat the gcc builtin...).
So this is really a gcc bug. The bug is quite old though with no fix in
sight apparently so might need to think about some workaround (but just
not doing the comparison doesn't look like the right idea, since
apparently it would be faster with the comparison if gcc's memcmp got
fixed).

Roland

Am 30.06.2011 01:47, schrieb Roland Scheidegger:
I didn't even look at that, was just curious why the memcmp (which is
used a lot in other places) is slow. However, none of the other memcmp
seem to show up prominently (cso functions are quite low in profiles,
_mesa_search_program_cache uses memcmp too but it's not that high
neither). So I guess those functions either aren't called that often or
the sizes they compare are small.
So should maybe just file a gcc bug for memcmp and look at that
particular llvmpipe issue again :-).

Roland

Am 30.06.2011 01:16, schrieb Corbin Simpson:
Okay, so maybe I'm failing to recognize the exact situation here, but
wouldn't it be possible to mark the FS state with a serial number and
just compare those? Or are these FS states not CSO-cached?

~ C.
On Wed, Jun 29, 2011 at 3:44 PM, Roland Scheidegger wrote:
Actually I ran some numbers here and tried out an optimized struct compare:
original ipers: 12.1 fps
ajax patch: 15.5 fps
optimized struct compare: 16.8 fps

This is the function I used for that (just enabled in that lp_setup
function):

static INLINE int util_cmp_struct(const void *src1, const void *src2,
                                  unsigned count)
{
   /* hmm pointer casting is evil */
   const uint32_t *src1_ptr = (uint32_t *)src1;
   const uint32_t *src2_ptr = (uint32_t *)src2;
   unsigned i;
   assert(count % 4 == 0);
   for (i = 0; i < count/4; i++) {
      if (*src1_ptr != *src2_ptr) {
         return 1;
      }
      src1_ptr++;
      src2_ptr++;
   }
   return 0;
}

(And no, this doesn't use repz cmpsd here.)

So, unless I made some mistake, memcmp is just dead slow (*), most of
the slowness probably coming from the bytewise comparison (and
apparently I was wrong in assuming the comparison there might never be
the same for ipers).
Of course, the optimized struct compare relies on structs really being
dword aligned (I think this is always the case), and additionally it
relies on the struct size being a whole multiple of dwords - likely
struct needs padding to ensure that (at least I don't think this is
always guaranteed for all structs).
But since memcmp is used extensively (cso for one) maybe some
optimization along these lines might be worth it (though of course for
small structs the win isn't going to be as big - and can't beat the repz
cmps in code size...).

Roland

(*) I actually found some references some implementations might be
better: they don't just use repz cmpsb but split this up in parts which
do dword (or qword even - well, for really large structs could use sse2)
comparisons for the parts where it's possible and only byte comparisons
for the remaining bytes (and if the compiler does that it probably would
know the size at compile time here hence could leave out much of the
code).
Of course memcmp requires that the return value isn't just a true or
false value, hence there's more code needed once an unequal dword is
found, though the compiler could optimize that away too in case it's not
needed. Much the same as memcpy is optimized usually really, so blame
gcc :-).

Am 29.06.2011 20:33, schrieb Roland Scheidegger:
Ohh that's interesting, you'd think the comparison shouldn't be that
expensive (though I guess in ipers case the comparison is never true).
memcmp is quite extensively used everywhere. Maybe we could replace that
with something faster (since we only ever care if the blocks are the
same but not care about the lexicographic ordering and always compare
whole structs, should compare dwords instead of bytes for a 4 time
speedup)? Or isn't that the reason cmpsb instead of cmpsd is used?
Also I guess it would help if the values which are more likely to be
unequal are first in the struct (if we can tell that).
Of course though if it's unlikely to be the same as the compared value any
Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup
Ok in fact there's a gcc bug about memcmp:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
In short gcc's memcmp builtin is totally lame and loses to glibc's memcmp
(including call overhead, no knowledge about alignment etc.) even when
comparing only very few bytes (and loses BIG time for lots of bytes to
compare). Oops. Well at least if the strings are the same (I'd guess if the
first byte is different it's hard to beat the gcc builtin...).
So this is really a gcc bug. The bug is quite old though with no fix in sight
apparently so might need to think about some workaround (but just not doing
the comparison doesn't look like the right idea, since apparently it would be
faster with the comparison if gcc's memcmp got fixed).

Roland

Am 30.06.2011 01:47, schrieb Roland Scheidegger:
> I didn't even look at that, was just curious why the memcmp (which is
> used a lot in other places) is slow. However, none of the other memcmps
> seem to show up prominently (cso functions are quite low in profiles,
> _mesa_search_program_cache uses memcmp too but it's not that high
> either). So I guess those functions either aren't called that often or
> the sizes they compare are small.
> So should maybe just file a gcc bug for memcmp and look at that
> particular llvmpipe issue again :-).
>
> Roland
>
> Am 30.06.2011 01:16, schrieb Corbin Simpson:
>> Okay, so maybe I'm failing to recognize the exact situation here, but
>> wouldn't it be possible to mark the FS state with a serial number and
>> just compare those? Or are these FS states not CSO-cached?
>>
>> ~ C.
>> >> On Wed, Jun 29, 2011 at 3:44 PM, Roland Scheidegger >> wrote: >>> Actually I ran some numbers here and tried out a optimized struct compare: >>> original ipers: 12.1 fps >>> ajax patch: 15.5 fps >>> optimized struct compare: 16.8 fps >>> >>> >>> This is the function I used for that (just enabled in that lp_setup >>> function): >>> >>> static INLINE int util_cmp_struct(const void *src1, const void *src2, >>> unsigned count) >>> { >>> /* hmm pointer casting is evil */ >>> const uint32_t *src1_ptr = (uint32_t *)src1; >>> const uint32_t *src2_ptr = (uint32_t *)src2; >>> unsigned i; >>> assert(count % 4 == 0); >>> for (i = 0; i < count/4; i++) { >>>if (*src1_ptr != *src2_ptr) { >>> return 1; >>>} >>>src1_ptr++; >>>src2_ptr++; >>> } >>> return 0; >>> } >>> >>> (And no this doesn't use repz cmpsd here.) >>> >>> So, unless I made some mistake, memcmp is just dead slow (*), most of >>> the slowness probably coming from the bytewise comparison (and >>> apparently I was wrong in assuming the comparison there might never be >>> the same for ipers). >>> Of course, the optimized struct compare relies on structs really being >>> dword aligned (I think this is always the case), and additionally it >>> relies on the struct size being a whole multiple of dwords - likely >>> struct needs padding to ensure that (at least I don't think this is >>> always guaranteed for all structs). >>> But since memcmp is used extensively (cso for one) maybe some >>> optimization along these lines might be worth it (though of course for >>> small structs the win isn't going to be as big - and can't beat the repz >>> cmps in code size...). 
>>> >>> Roland >>> >>> (*) I actually found some references some implementations might be >>> better they don't just use repz cmpsb but they split this up in parts >>> which do dword (or qword even - well for really large structs could use >>> sse2) comparisons for the parts where it's possible and only byte >>> comparisons for the remaining bytes (and if the compiler does that it >>> probably would know the size at compile time here hence could leave out >>> much of the code). Of course memcmp requires that the return value isn't >>> just a true or false value, hence there's more code needed once an >>> unequal dword is found, though the compiler could optimize that away too >>> in case it's not needed. Much the same as memcpy is optimized usually >>> really, so blame gcc :-). >>> >>> >>> >>> Am 29.06.2011 20:33, schrieb Roland Scheidegger: Ohh that's interesting, you'd think the comparison shouldn't be that expensive (though I guess in ipers case the comparison is never true). memcmp is quite extensively used everywhere. Maybe we could replace that with something faster (since we only ever care if the blocks are the same but not care about the lexographic ordering and always compare whole structs, should compare dwords instead of bytes for a 4 time speedup)? Or isn't that the reason cmpsb instead of cmpsd is used? Also I guess it would help if the values which are more likely to be unequal are first in the struct (if we can tell that). Of course though if it's unlikely to be the same as the compared value anyway not comparing at all still might be a win (here). Roland Am 29.06.2011 19:19, schrieb Adam Jackson: > Perversely, do this by eliminating
Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup
I didn't even look at that was just curious why the memcmp (which is used a lot in other places) is slow. However, none of the other memcmp seem to show up prominently (cso functions are quite low in profiles, _mesa_search_program_cache uses memcmp too but it's not that high neither). So I guess those functions either aren't called that often or the sizes they compare are small. So should maybe just file a gcc bug for memcmp and look at that particular llvmpipe issue again :-). Roland Am 30.06.2011 01:16, schrieb Corbin Simpson: > Okay, so maybe I'm failing to recognize the exact situation here, but > wouldn't it be possible to mark the FS state with a serial number and > just compare those? Or are these FS states not CSO-cached? > > ~ C. > > On Wed, Jun 29, 2011 at 3:44 PM, Roland Scheidegger > wrote: >> Actually I ran some numbers here and tried out a optimized struct compare: >> original ipers: 12.1 fps >> ajax patch: 15.5 fps >> optimized struct compare: 16.8 fps >> >> >> This is the function I used for that (just enabled in that lp_setup >> function): >> >> static INLINE int util_cmp_struct(const void *src1, const void *src2, >> unsigned count) >> { >> /* hmm pointer casting is evil */ >> const uint32_t *src1_ptr = (uint32_t *)src1; >> const uint32_t *src2_ptr = (uint32_t *)src2; >> unsigned i; >> assert(count % 4 == 0); >> for (i = 0; i < count/4; i++) { >>if (*src1_ptr != *src2_ptr) { >> return 1; >>} >>src1_ptr++; >>src2_ptr++; >> } >> return 0; >> } >> >> (And no this doesn't use repz cmpsd here.) >> >> So, unless I made some mistake, memcmp is just dead slow (*), most of >> the slowness probably coming from the bytewise comparison (and >> apparently I was wrong in assuming the comparison there might never be >> the same for ipers). 
>> Of course, the optimized struct compare relies on structs really being >> dword aligned (I think this is always the case), and additionally it >> relies on the struct size being a whole multiple of dwords - likely >> struct needs padding to ensure that (at least I don't think this is >> always guaranteed for all structs). >> But since memcmp is used extensively (cso for one) maybe some >> optimization along these lines might be worth it (though of course for >> small structs the win isn't going to be as big - and can't beat the repz >> cmps in code size...). >> >> Roland >> >> (*) I actually found some references some implementations might be >> better they don't just use repz cmpsb but they split this up in parts >> which do dword (or qword even - well for really large structs could use >> sse2) comparisons for the parts where it's possible and only byte >> comparisons for the remaining bytes (and if the compiler does that it >> probably would know the size at compile time here hence could leave out >> much of the code). Of course memcmp requires that the return value isn't >> just a true or false value, hence there's more code needed once an >> unequal dword is found, though the compiler could optimize that away too >> in case it's not needed. Much the same as memcpy is optimized usually >> really, so blame gcc :-). >> >> >> >> Am 29.06.2011 20:33, schrieb Roland Scheidegger: >>> Ohh that's interesting, you'd think the comparison shouldn't be that >>> expensive (though I guess in ipers case the comparison is never true). >>> memcmp is quite extensively used everywhere. Maybe we could replace that >>> with something faster (since we only ever care if the blocks are the >>> same but not care about the lexographic ordering and always compare >>> whole structs, should compare dwords instead of bytes for a 4 time >>> speedup)? Or isn't that the reason cmpsb instead of cmpsd is used? 
>>> Also I guess it would help if the values which are more likely to be >>> unequal are first in the struct (if we can tell that). >>> Of course though if it's unlikely to be the same as the compared value >>> anyway not comparing at all still might be a win (here). >>> >>> Roland >>> >>> Am 29.06.2011 19:19, schrieb Adam Jackson: Perversely, do this by eliminating the comparison between stored and current fs state. On ipers, a perf trace showed try_update_scene_state using 31% of a CPU, and 98% of that was in 'repz cmpsb', ie, the memcmp. Taking that out takes try_update_scene_state down to 6.5% of the profile; more importantly, ipers goes from 10 to 14fps and gears goes from 790 to 830fps. Signed-off-by: Adam Jackson --- src/gallium/drivers/llvmpipe/lp_setup.c | 61 ++- 1 files changed, 27 insertions(+), 34 deletions(-) diff --git a/src/gallium/drivers/llvmpipe/lp_setup.c b/src/gallium/drivers/llvmpipe/lp_setup.c index cbe06e5..9118db5 100644 --- a/src/gallium/drivers/llvmpipe/lp_setup.c +++ b/src/gallium/drivers/llvmpipe/lp_setup.c @@ -839,42 +839,35 @@ try_update_scene_state( struct lp_setup_context *setup )
Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup
Okay, so maybe I'm failing to recognize the exact situation here, but wouldn't it be possible to mark the FS state with a serial number and just compare those? Or are these FS states not CSO-cached? ~ C. On Wed, Jun 29, 2011 at 3:44 PM, Roland Scheidegger wrote: > Actually I ran some numbers here and tried out a optimized struct compare: > original ipers: 12.1 fps > ajax patch: 15.5 fps > optimized struct compare: 16.8 fps > > > This is the function I used for that (just enabled in that lp_setup > function): > > static INLINE int util_cmp_struct(const void *src1, const void *src2, > unsigned count) > { > /* hmm pointer casting is evil */ > const uint32_t *src1_ptr = (uint32_t *)src1; > const uint32_t *src2_ptr = (uint32_t *)src2; > unsigned i; > assert(count % 4 == 0); > for (i = 0; i < count/4; i++) { > if (*src1_ptr != *src2_ptr) { > return 1; > } > src1_ptr++; > src2_ptr++; > } > return 0; > } > > (And no this doesn't use repz cmpsd here.) > > So, unless I made some mistake, memcmp is just dead slow (*), most of > the slowness probably coming from the bytewise comparison (and > apparently I was wrong in assuming the comparison there might never be > the same for ipers). > Of course, the optimized struct compare relies on structs really being > dword aligned (I think this is always the case), and additionally it > relies on the struct size being a whole multiple of dwords - likely > struct needs padding to ensure that (at least I don't think this is > always guaranteed for all structs). > But since memcmp is used extensively (cso for one) maybe some > optimization along these lines might be worth it (though of course for > small structs the win isn't going to be as big - and can't beat the repz > cmps in code size...). 
> > Roland > > (*) I actually found some references some implementations might be > better they don't just use repz cmpsb but they split this up in parts > which do dword (or qword even - well for really large structs could use > sse2) comparisons for the parts where it's possible and only byte > comparisons for the remaining bytes (and if the compiler does that it > probably would know the size at compile time here hence could leave out > much of the code). Of course memcmp requires that the return value isn't > just a true or false value, hence there's more code needed once an > unequal dword is found, though the compiler could optimize that away too > in case it's not needed. Much the same as memcpy is optimized usually > really, so blame gcc :-). > > > > Am 29.06.2011 20:33, schrieb Roland Scheidegger: >> Ohh that's interesting, you'd think the comparison shouldn't be that >> expensive (though I guess in ipers case the comparison is never true). >> memcmp is quite extensively used everywhere. Maybe we could replace that >> with something faster (since we only ever care if the blocks are the >> same but not care about the lexographic ordering and always compare >> whole structs, should compare dwords instead of bytes for a 4 time >> speedup)? Or isn't that the reason cmpsb instead of cmpsd is used? >> Also I guess it would help if the values which are more likely to be >> unequal are first in the struct (if we can tell that). >> Of course though if it's unlikely to be the same as the compared value >> anyway not comparing at all still might be a win (here). >> >> Roland >> >> Am 29.06.2011 19:19, schrieb Adam Jackson: >>> Perversely, do this by eliminating the comparison between stored and >>> current fs state. On ipers, a perf trace showed try_update_scene_state >>> using 31% of a CPU, and 98% of that was in 'repz cmpsb', ie, the memcmp. 
>>> Taking that out takes try_update_scene_state down to 6.5% of the >>> profile; more importantly, ipers goes from 10 to 14fps and gears goes >>> from 790 to 830fps. >>> >>> Signed-off-by: Adam Jackson >>> --- >>> src/gallium/drivers/llvmpipe/lp_setup.c | 61 >>> ++- >>> 1 files changed, 27 insertions(+), 34 deletions(-) >>> >>> diff --git a/src/gallium/drivers/llvmpipe/lp_setup.c >>> b/src/gallium/drivers/llvmpipe/lp_setup.c >>> index cbe06e5..9118db5 100644 >>> --- a/src/gallium/drivers/llvmpipe/lp_setup.c >>> +++ b/src/gallium/drivers/llvmpipe/lp_setup.c >>> @@ -839,42 +839,35 @@ try_update_scene_state( struct lp_setup_context >>> *setup ) >>> setup->dirty |= LP_SETUP_NEW_FS; >>> } >>> >>> - >>> if (setup->dirty & LP_SETUP_NEW_FS) { >>> - if (!setup->fs.stored || >>> - memcmp(setup->fs.stored, >>> - &setup->fs.current, >>> - sizeof setup->fs.current) != 0) >>> - { >>> - struct lp_rast_state *stored; >>> - uint i; >>> - >>> - /* The fs state that's been stored in the scene is different from >>> - * the new, current state. So allocate a new lp_rast_state object >>> - * and append it to the bin's setup data buffer. >>> - */ >>> - stored = (struct lp_rast_state *) l
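Corbin's serial-number suggestion could be sketched roughly as below. All of the names here are hypothetical illustrations, not the real llvmpipe or cso_cache structures; the point is just that a monotonically increasing id turns the ~4k state comparison into a single integer compare:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fs state carrying a serial number assigned at bind time. */
struct fs_state {
   uint64_t serial;   /* unique per distinct bound state */
   /* ... the rest of the (large) rasterization state ... */
};

static uint64_t fs_serial_counter;

/* Called whenever a (possibly new) fs state is bound: stamp it with a
 * fresh serial.  A CSO-style cache would instead reuse the serial of an
 * already-cached identical state. */
static void
fs_state_bind(struct fs_state *fs)
{
   fs->serial = ++fs_serial_counter;
}

/* O(1) "did the state change?" check replacing the big memcmp. */
static int
fs_state_changed(const struct fs_state *stored,
                 const struct fs_state *current)
{
   return stored == NULL || stored->serial != current->serial;
}
```

The catch the thread hints at: this only works if every path that mutates the state also bumps (or re-resolves) the serial, which is exactly the dirty-flag discipline the comparison was papering over.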
Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup
Actually I ran some numbers here and tried out a optimized struct compare: original ipers: 12.1 fps ajax patch: 15.5 fps optimized struct compare: 16.8 fps This is the function I used for that (just enabled in that lp_setup function): static INLINE int util_cmp_struct(const void *src1, const void *src2, unsigned count) { /* hmm pointer casting is evil */ const uint32_t *src1_ptr = (uint32_t *)src1; const uint32_t *src2_ptr = (uint32_t *)src2; unsigned i; assert(count % 4 == 0); for (i = 0; i < count/4; i++) { if (*src1_ptr != *src2_ptr) { return 1; } src1_ptr++; src2_ptr++; } return 0; } (And no this doesn't use repz cmpsd here.) So, unless I made some mistake, memcmp is just dead slow (*), most of the slowness probably coming from the bytewise comparison (and apparently I was wrong in assuming the comparison there might never be the same for ipers). Of course, the optimized struct compare relies on structs really being dword aligned (I think this is always the case), and additionally it relies on the struct size being a whole multiple of dwords - likely struct needs padding to ensure that (at least I don't think this is always guaranteed for all structs). But since memcmp is used extensively (cso for one) maybe some optimization along these lines might be worth it (though of course for small structs the win isn't going to be as big - and can't beat the repz cmps in code size...). Roland (*) I actually found some references some implementations might be better they don't just use repz cmpsb but they split this up in parts which do dword (or qword even - well for really large structs could use sse2) comparisons for the parts where it's possible and only byte comparisons for the remaining bytes (and if the compiler does that it probably would know the size at compile time here hence could leave out much of the code). 
Of course memcmp requires that the return value isn't just a true or false value, hence there's more code needed once an unequal dword is found, though the compiler could optimize that away too in case it's not needed. Much the same as memcpy is optimized usually really, so blame gcc :-). Am 29.06.2011 20:33, schrieb Roland Scheidegger: > Ohh that's interesting, you'd think the comparison shouldn't be that > expensive (though I guess in ipers case the comparison is never true). > memcmp is quite extensively used everywhere. Maybe we could replace that > with something faster (since we only ever care if the blocks are the > same but not care about the lexographic ordering and always compare > whole structs, should compare dwords instead of bytes for a 4 time > speedup)? Or isn't that the reason cmpsb instead of cmpsd is used? > Also I guess it would help if the values which are more likely to be > unequal are first in the struct (if we can tell that). > Of course though if it's unlikely to be the same as the compared value > anyway not comparing at all still might be a win (here). > > Roland > > Am 29.06.2011 19:19, schrieb Adam Jackson: >> Perversely, do this by eliminating the comparison between stored and >> current fs state. On ipers, a perf trace showed try_update_scene_state >> using 31% of a CPU, and 98% of that was in 'repz cmpsb', ie, the memcmp. >> Taking that out takes try_update_scene_state down to 6.5% of the >> profile; more importantly, ipers goes from 10 to 14fps and gears goes >> from 790 to 830fps. 
>> >> Signed-off-by: Adam Jackson >> --- >> src/gallium/drivers/llvmpipe/lp_setup.c | 61 >> ++- >> 1 files changed, 27 insertions(+), 34 deletions(-) >> >> diff --git a/src/gallium/drivers/llvmpipe/lp_setup.c >> b/src/gallium/drivers/llvmpipe/lp_setup.c >> index cbe06e5..9118db5 100644 >> --- a/src/gallium/drivers/llvmpipe/lp_setup.c >> +++ b/src/gallium/drivers/llvmpipe/lp_setup.c >> @@ -839,42 +839,35 @@ try_update_scene_state( struct lp_setup_context *setup >> ) >>setup->dirty |= LP_SETUP_NEW_FS; >> } >> >> - >> if (setup->dirty & LP_SETUP_NEW_FS) { >> - if (!setup->fs.stored || >> - memcmp(setup->fs.stored, >> - &setup->fs.current, >> - sizeof setup->fs.current) != 0) >> - { >> - struct lp_rast_state *stored; >> - uint i; >> - >> - /* The fs state that's been stored in the scene is different from >> - * the new, current state. So allocate a new lp_rast_state object >> - * and append it to the bin's setup data buffer. >> - */ >> - stored = (struct lp_rast_state *) lp_scene_alloc(scene, sizeof >> *stored); >> - if (!stored) { >> -assert(!new_scene); >> -return FALSE; >> - } >> + struct lp_rast_state *stored; >> + uint i; >> + >> + /* The fs state that's been stored in the scene is different from >> + * the new, current state. So allocate a new lp_rast_state object >> + * and append it to
Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup
Ohh that's interesting, you'd think the comparison shouldn't be that expensive (though I guess in ipers case the comparison is never true). memcmp is quite extensively used everywhere. Maybe we could replace that with something faster (since we only ever care if the blocks are the same but not care about the lexographic ordering and always compare whole structs, should compare dwords instead of bytes for a 4 time speedup)? Or isn't that the reason cmpsb instead of cmpsd is used? Also I guess it would help if the values which are more likely to be unequal are first in the struct (if we can tell that). Of course though if it's unlikely to be the same as the compared value anyway not comparing at all still might be a win (here). Roland Am 29.06.2011 19:19, schrieb Adam Jackson: > Perversely, do this by eliminating the comparison between stored and > current fs state. On ipers, a perf trace showed try_update_scene_state > using 31% of a CPU, and 98% of that was in 'repz cmpsb', ie, the memcmp. > Taking that out takes try_update_scene_state down to 6.5% of the > profile; more importantly, ipers goes from 10 to 14fps and gears goes > from 790 to 830fps. > > Signed-off-by: Adam Jackson > --- > src/gallium/drivers/llvmpipe/lp_setup.c | 61 > ++- > 1 files changed, 27 insertions(+), 34 deletions(-) > > diff --git a/src/gallium/drivers/llvmpipe/lp_setup.c > b/src/gallium/drivers/llvmpipe/lp_setup.c > index cbe06e5..9118db5 100644 > --- a/src/gallium/drivers/llvmpipe/lp_setup.c > +++ b/src/gallium/drivers/llvmpipe/lp_setup.c > @@ -839,42 +839,35 @@ try_update_scene_state( struct lp_setup_context *setup ) >setup->dirty |= LP_SETUP_NEW_FS; > } > > - > if (setup->dirty & LP_SETUP_NEW_FS) { > - if (!setup->fs.stored || > - memcmp(setup->fs.stored, > - &setup->fs.current, > - sizeof setup->fs.current) != 0) > - { > - struct lp_rast_state *stored; > - uint i; > - > - /* The fs state that's been stored in the scene is different from > - * the new, current state. 
So allocate a new lp_rast_state object > - * and append it to the bin's setup data buffer. > - */ > - stored = (struct lp_rast_state *) lp_scene_alloc(scene, sizeof > *stored); > - if (!stored) { > -assert(!new_scene); > -return FALSE; > - } > + struct lp_rast_state *stored; > + uint i; > + > + /* The fs state that's been stored in the scene is different from > + * the new, current state. So allocate a new lp_rast_state object > + * and append it to the bin's setup data buffer. > + */ > + stored = (struct lp_rast_state *) lp_scene_alloc(scene, sizeof > *stored); > + if (!stored) { > + assert(!new_scene); > + return FALSE; > + } > > - memcpy(stored, > -&setup->fs.current, > -sizeof setup->fs.current); > - setup->fs.stored = stored; > - > - /* The scene now references the textures in the rasterization > - * state record. Note that now. > - */ > - for (i = 0; i < Elements(setup->fs.current_tex); i++) { > -if (setup->fs.current_tex[i]) { > - if (!lp_scene_add_resource_reference(scene, > -setup->fs.current_tex[i], > -new_scene)) { > - assert(!new_scene); > - return FALSE; > - } > + memcpy(stored, > + &setup->fs.current, > + sizeof setup->fs.current); > + setup->fs.stored = stored; > + > + /* The scene now references the textures in the rasterization > + * state record. Note that now. > + */ > + for (i = 0; i < Elements(setup->fs.current_tex); i++) { > + if (setup->fs.current_tex[i]) { > +if (!lp_scene_add_resource_reference(scene, > + setup->fs.current_tex[i], > + new_scene)) { > + assert(!new_scene); > + return FALSE; > } > } >} ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup
On Wed, 2011-06-29 at 13:19 -0400, Adam Jackson wrote: > Perversely, do this by eliminating the comparison between stored and > current fs state. On ipers, a perf trace showed try_update_scene_state > using 31% of a CPU, and 98% of that was in 'repz cmpsb', ie, the memcmp. > Taking that out takes try_update_scene_state down to 6.5% of the > profile; more importantly, ipers goes from 10 to 14fps and gears goes > from 790 to 830fps. Some of the motivation for that memcpy is about keeping the memory usage of the binned scene from exploding and forcing unnecessary flushes on more complex apps. I wonder if there is a way to improve the dirty flag handling to avoid ending up in that memcpy so often? Note that freeglut is probably dominating your gears numbers by trying to reinitialize your SpaceBall device (I don't have one either) on every swapbuffers. http://lists.freedesktop.org/archives/mesa-dev/2011-February/005599.html Keith > Signed-off-by: Adam Jackson > --- > src/gallium/drivers/llvmpipe/lp_setup.c | 61 > ++- > 1 files changed, 27 insertions(+), 34 deletions(-) > > diff --git a/src/gallium/drivers/llvmpipe/lp_setup.c > b/src/gallium/drivers/llvmpipe/lp_setup.c > index cbe06e5..9118db5 100644 > --- a/src/gallium/drivers/llvmpipe/lp_setup.c > +++ b/src/gallium/drivers/llvmpipe/lp_setup.c > @@ -839,42 +839,35 @@ try_update_scene_state( struct lp_setup_context *setup ) >setup->dirty |= LP_SETUP_NEW_FS; > } > > - > if (setup->dirty & LP_SETUP_NEW_FS) { > - if (!setup->fs.stored || > - memcmp(setup->fs.stored, > - &setup->fs.current, > - sizeof setup->fs.current) != 0) > - { > - struct lp_rast_state *stored; > - uint i; > - > - /* The fs state that's been stored in the scene is different from > - * the new, current state. So allocate a new lp_rast_state object > - * and append it to the bin's setup data buffer. 
> - */ > - stored = (struct lp_rast_state *) lp_scene_alloc(scene, sizeof > *stored); > - if (!stored) { > -assert(!new_scene); > -return FALSE; > - } > + struct lp_rast_state *stored; > + uint i; > + > + /* The fs state that's been stored in the scene is different from > + * the new, current state. So allocate a new lp_rast_state object > + * and append it to the bin's setup data buffer. > + */ > + stored = (struct lp_rast_state *) lp_scene_alloc(scene, sizeof > *stored); > + if (!stored) { > + assert(!new_scene); > + return FALSE; > + } > > - memcpy(stored, > -&setup->fs.current, > -sizeof setup->fs.current); > - setup->fs.stored = stored; > - > - /* The scene now references the textures in the rasterization > - * state record. Note that now. > - */ > - for (i = 0; i < Elements(setup->fs.current_tex); i++) { > -if (setup->fs.current_tex[i]) { > - if (!lp_scene_add_resource_reference(scene, > -setup->fs.current_tex[i], > -new_scene)) { > - assert(!new_scene); > - return FALSE; > - } > + memcpy(stored, > + &setup->fs.current, > + sizeof setup->fs.current); > + setup->fs.stored = stored; > + > + /* The scene now references the textures in the rasterization > + * state record. Note that now. > + */ > + for (i = 0; i < Elements(setup->fs.current_tex); i++) { > + if (setup->fs.current_tex[i]) { > +if (!lp_scene_add_resource_reference(scene, > + setup->fs.current_tex[i], > + new_scene)) { > + assert(!new_scene); > + return FALSE; > } > } >} ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
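The fps numbers traded back and forth in this thread can be approximated with a small standalone harness like the following. Everything here is a sketch for reproduction purposes (none of these names exist in Mesa), and the absolute numbers depend heavily on the compiler, `-fno-builtin-memcmp`, and the libc in use:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define BLOB_DWORDS 1024   /* ~4 KiB, roughly the fs state size discussed */

/* Boolean-only dword comparison, same shape as util_cmp_struct. */
static int
cmp_dwords(const uint32_t *a, const uint32_t *b, size_t n)
{
   size_t i;
   for (i = 0; i < n; i++) {
      if (a[i] != b[i])
         return 1;
   }
   return 0;
}

/* Times 'iters' equal-blob comparisons with memcmp and with the dword
 * loop, printing raw clock() ticks.  Equal blobs are the worst case and
 * match the ipers scenario where the compared states rarely differ. */
static void
bench_compare(int iters)
{
   static uint32_t a[BLOB_DWORDS], b[BLOB_DWORDS];  /* zero-initialized */
   volatile int diff = 0;   /* volatile so the loops aren't optimized out */
   clock_t t0, t1, t2;
   int i;

   t0 = clock();
   for (i = 0; i < iters; i++)
      diff |= memcmp(a, b, sizeof a) != 0;
   t1 = clock();
   for (i = 0; i < iters; i++)
      diff |= cmp_dwords(a, b, BLOB_DWORDS);
   t2 = clock();

   printf("memcmp: %ld ticks, dword loop: %ld ticks (diff=%d)\n",
          (long)(t1 - t0), (long)(t2 - t1), (int)diff);
}
```

Building this once normally and once with `-fno-builtin-memcmp` would show whether gcc's inline memcmp expansion or the libc call is being measured, which is the distinction the thread converges on.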
[Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup
Perversely, do this by eliminating the comparison between stored and
current fs state. On ipers, a perf trace showed try_update_scene_state
using 31% of a CPU, and 98% of that was in 'repz cmpsb', ie, the memcmp.
Taking that out takes try_update_scene_state down to 6.5% of the
profile; more importantly, ipers goes from 10 to 14fps and gears goes
from 790 to 830fps.

Signed-off-by: Adam Jackson
---
 src/gallium/drivers/llvmpipe/lp_setup.c |   61 ++-
 1 files changed, 27 insertions(+), 34 deletions(-)

diff --git a/src/gallium/drivers/llvmpipe/lp_setup.c b/src/gallium/drivers/llvmpipe/lp_setup.c
index cbe06e5..9118db5 100644
--- a/src/gallium/drivers/llvmpipe/lp_setup.c
+++ b/src/gallium/drivers/llvmpipe/lp_setup.c
@@ -839,42 +839,35 @@ try_update_scene_state( struct lp_setup_context *setup )
       setup->dirty |= LP_SETUP_NEW_FS;
    }
 
-
    if (setup->dirty & LP_SETUP_NEW_FS) {
-      if (!setup->fs.stored ||
-          memcmp(setup->fs.stored,
-                 &setup->fs.current,
-                 sizeof setup->fs.current) != 0)
-      {
-         struct lp_rast_state *stored;
-         uint i;
-
-         /* The fs state that's been stored in the scene is different from
-          * the new, current state.  So allocate a new lp_rast_state object
-          * and append it to the bin's setup data buffer.
-          */
-         stored = (struct lp_rast_state *) lp_scene_alloc(scene, sizeof *stored);
-         if (!stored) {
-            assert(!new_scene);
-            return FALSE;
-         }
+      struct lp_rast_state *stored;
+      uint i;
+
+      /* The fs state that's been stored in the scene is different from
+       * the new, current state.  So allocate a new lp_rast_state object
+       * and append it to the bin's setup data buffer.
+       */
+      stored = (struct lp_rast_state *) lp_scene_alloc(scene, sizeof *stored);
+      if (!stored) {
+         assert(!new_scene);
+         return FALSE;
+      }
 
-         memcpy(stored,
-                &setup->fs.current,
-                sizeof setup->fs.current);
-         setup->fs.stored = stored;
-
-         /* The scene now references the textures in the rasterization
-          * state record.  Note that now.
-          */
-         for (i = 0; i < Elements(setup->fs.current_tex); i++) {
-            if (setup->fs.current_tex[i]) {
-               if (!lp_scene_add_resource_reference(scene,
-                                                    setup->fs.current_tex[i],
-                                                    new_scene)) {
-                  assert(!new_scene);
-                  return FALSE;
-               }
+      memcpy(stored,
+             &setup->fs.current,
+             sizeof setup->fs.current);
+      setup->fs.stored = stored;
+
+      /* The scene now references the textures in the rasterization
+       * state record.  Note that now.
+       */
+      for (i = 0; i < Elements(setup->fs.current_tex); i++) {
+         if (setup->fs.current_tex[i]) {
+            if (!lp_scene_add_resource_reference(scene,
+                                                 setup->fs.current_tex[i],
+                                                 new_scene)) {
+               assert(!new_scene);
+               return FALSE;
            }
         }
      }
-- 
1.7.5.4