Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup

2011-06-30 Thread Keith Whitwell
On Thu, 2011-06-30 at 17:53 +0200, Roland Scheidegger wrote:
> On 30.06.2011 16:14, Adam Jackson wrote:
> > On Thu, 2011-06-30 at 03:36 +0200, Roland Scheidegger wrote:
> >> Ok in fact there's a gcc bug about memcmp:
> >> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
> >> In short gcc's memcmp builtin is totally lame and loses to glibc's
> >> memcmp (including call overhead, no knowledge about alignment etc.) even
> >> when comparing only very few bytes (and loses BIG time for lots of bytes
> >> to compare). Oops. Well at least if the strings are the same (I'd guess
> >> if the first byte is different it's hard to beat the gcc builtin...).
> >> So this is really a gcc bug. The bug is quite old though with no fix in
> >> sight apparently so might need to think about some workaround (but just
> >> not doing the comparison doesn't look like the right idea, since
> >> apparently it would be faster with the comparison if gcc's memcmp got
> >> fixed).
> > 
> > How do things fare if you build with -fno-builtin-memcmp?
> 
> This is even faster:
> original ipers: 12.1 fps
> ajax patch: 15.5 fps
> optimized struct compare: 16.8 fps
> -fno-builtin-memcmp: 18.1 fps
> 
> Looks like we have a winner :-) I guess glibc optimizes the hell out of
> it (in contrast to the other results, this affected all memcmp though I
> don't know if any others benefited from that on average).
> As noted by Keith though the struct we compare is really large (over 4k)
> so trimming the size might be a good idea anyway (of course the 4k size
> also meant any call overhead and non-optimal code due to glibc not
> knowing alignment beforehand and usage of return value is completely
> insignificant).
> A 50% improvement from disabling a compiler optimization, lol.

We probably want this everywhere throughout Mesa & Gallium...

Keith

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup

2011-06-30 Thread Roland Scheidegger
On 30.06.2011 16:14, Adam Jackson wrote:
> On Thu, 2011-06-30 at 03:36 +0200, Roland Scheidegger wrote:
>> Ok in fact there's a gcc bug about memcmp:
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
>> In short gcc's memcmp builtin is totally lame and loses to glibc's
>> memcmp (including call overhead, no knowledge about alignment etc.) even
>> when comparing only very few bytes (and loses BIG time for lots of bytes
>> to compare). Oops. Well at least if the strings are the same (I'd guess
>> if the first byte is different it's hard to beat the gcc builtin...).
>> So this is really a gcc bug. The bug is quite old though with no fix in
>> sight apparently so might need to think about some workaround (but just
>> not doing the comparison doesn't look like the right idea, since
>> apparently it would be faster with the comparison if gcc's memcmp got
>> fixed).
> 
> How do things fare if you build with -fno-builtin-memcmp?

This is even faster:
original ipers: 12.1 fps
ajax patch: 15.5 fps
optimized struct compare: 16.8 fps
-fno-builtin-memcmp: 18.1 fps

Looks like we have a winner :-) I guess glibc optimizes the hell out of
it (in contrast to the other results, this affected all memcmp though I
don't know if any others benefited from that on average).
As noted by Keith though the struct we compare is really large (over 4k)
so trimming the size might be a good idea anyway (of course the 4k size
also meant any call overhead and non-optimal code due to glibc not
knowing alignment beforehand and usage of return value is completely
insignificant).
A 50% improvement from disabling a compiler optimization, lol.

Roland


Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup

2011-06-30 Thread Roland Scheidegger
On 30.06.2011 12:14, Jose Fonseca wrote:
> 
> 
> - Original Message -
>> Hmm.
>> Forgive my ignorance, but isn't memcmp() on structs pretty prone to
>> give
>> incorrect != results, given that there may be padding between members
>> in
>> structs and that IIRC gcc struct assignment is member-wise.
> 
> There's no alternative to bitwise comparison in C:
> 
> $ cat cmp.c 
> struct foo {
>   int a;
>   int b;
> };
> 
> 
> int cmp(const struct foo *a, const struct foo *b) {
>   return *a == *b;
> }
> $ gcc -c cmp.c 
> cmp.c: In function ‘cmp’:
> cmp.c:7:12: error: invalid operands to binary == (have ‘const struct foo’ and 
> ‘const struct foo’)
> 
>> What happens if there's padding between the jit_context and variant
>> members of struct lp_rast_state?
>>
>> I seem to recall hitting similar issues a number of times in the
>> past.
> 
> I recall that as well, but my memory is the other way around: struct 
> assignment is considered harmful because it breaks memcmp. Instead all 
> structures should be initialized with memset(0) first, and always copied with 
> memcpy.  This should ensure that padding doesn't get clobbered.
> 
> But now that you mention it, I'm still not 100% sure whether unused bits in 
> bitfields are preserved like that when being assigned. We probably should 
> ensure that all bits in bitfields are used, using "reserved" members, 
> so that the zeros there stay zero.

We've definitely hit issues like that in the past - I think if you use
struct assignment you'll need to initialize the dst struct to 0
initially (but only once - even though the padding is probably undefined
after such an assignment, any implementation should either copy
everything including the padding or leave padding alone).
I don't think anything else will touch the unused parts, though I guess
it might be possible for instance if a 32bit int is assigned to a 16 bit
bitfield which has padding after it.
But generally using memcmp/memcpy should work ok, and it gives the
compiler all the information it needs to do it fast. Well if it uses it
or not is another question... I think the problem with gcc is that when
it inserts the comparisons with repz cmpsb it knows alignment and size
to copy but doesn't know the result is only used as a strict comparison
- that makes it impossible to generate really optimized code there (as
you need some byte comparison to get correct memcmp semantics unless you
use bswap or do dword comparison then byte comparison on a non-match,
both of which are probably slower if you expect the comparison to fail
on first byte which is the case for instance in substring searches) and
later it can't optimize that into something more sensible. So this might
not be trivial to fix in gcc. Too bad no builtin is available which only
returns true/false...
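To make the point concrete: an equality-only predicate can compare word-sized chunks and bail out on the first mismatch, with no byte-level fixup to recover ordering. A minimal sketch of the idea (illustrative only; `mem_equal` is a hypothetical name, not a Mesa or libc function):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Equality-only memory compare: returns 1 if the blocks match, 0 if not.
 * Unlike memcmp, no ordering information is needed, so any differing
 * word ends the scan immediately. */
static int mem_equal(const void *a, const void *b, size_t n)
{
   const unsigned char *pa = a;
   const unsigned char *pb = b;
   size_t i = 0;

   /* word-wise head; memcpy keeps the loads alignment-safe */
   for (; i + sizeof(uintptr_t) <= n; i += sizeof(uintptr_t)) {
      uintptr_t wa, wb;
      memcpy(&wa, pa + i, sizeof wa);
      memcpy(&wb, pb + i, sizeof wb);
      if (wa != wb)
         return 0;
   }
   /* byte-wise tail for sizes that aren't a word multiple */
   for (; i < n; i++)
      if (pa[i] != pb[i])
         return 0;
   return 1;
}
```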

Roland



Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup

2011-06-30 Thread Adam Jackson
On Thu, 2011-06-30 at 03:36 +0200, Roland Scheidegger wrote:
> Ok in fact there's a gcc bug about memcmp:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
> In short gcc's memcmp builtin is totally lame and loses to glibc's
> memcmp (including call overhead, no knowledge about alignment etc.) even
> when comparing only very few bytes (and loses BIG time for lots of bytes
> to compare). Oops. Well at least if the strings are the same (I'd guess
> if the first byte is different it's hard to beat the gcc builtin...).
> So this is really a gcc bug. The bug is quite old though with no fix in
> sight apparently so might need to think about some workaround (but just
> not doing the comparison doesn't look like the right idea, since
> apparently it would be faster with the comparison if gcc's memcmp got
> fixed).

How do things fare if you build with -fno-builtin-memcmp?

- ajax




Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup

2011-06-30 Thread Keith Whitwell
On Thu, 2011-06-30 at 03:27 -0700, Jose Fonseca wrote:
> 
> - Original Message -
> > On Thu, 2011-06-30 at 03:36 +0200, Roland Scheidegger wrote:
> > > Ok in fact there's a gcc bug about memcmp:
> > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
> > > In short gcc's memcmp builtin is totally lame and loses to glibc's
> > > memcmp (including call overhead, no knowledge about alignment etc.)
> > > even
> > > when comparing only very few bytes (and loses BIG time for lots of
> > > bytes
> > > to compare). Oops. Well at least if the strings are the same (I'd
> > > guess
> > > if the first byte is different it's hard to beat the gcc
> > > builtin...).
> > > So this is really a gcc bug. The bug is quite old though with no
> > > fix in
> > > sight apparently so might need to think about some workaround (but
> > > just
> > > not doing the comparison doesn't look like the right idea, since
> > > apparently it would be faster with the comparison if gcc's memcmp
> > > got
> > > fixed).
> > 
> > Looking at the struct again (it's been a while), it seems like it
> > could
> > be rearranged to be variable-sized and on average significantly
> > smaller:
> > 
> > struct lp_rast_state {
> >struct lp_jit_context jit_context;
> >struct lp_fragment_shader_variant *variant;
> > };
> > 
> > struct lp_jit_context {
> >const float *constants;
> >float alpha_ref_value;
> >uint32_t stencil_ref_front, stencil_ref_back;
> >uint8_t *blend_color;
> >struct lp_jit_texture textures[PIPE_MAX_SAMPLERS];
> > };
> > 
> > If we moved the jit_context part behind "variant", and then hopefully
> > note that most of those lp_jit_texture structs are not in use.  That
> > would save time on the memcmp *and* space in the binned data.
> 
> Yeah, sounds a good idea.
> 
> But there's some subtlety to computing the number of textures: it
>  can't be just the NULL textures, because they may be referred to by the
>  JIT code, which has no NULL checks and relies on the state setup to
>  provide storage for all textures, or dummy memory if one is not bound.

So it's a property of the variant, right?  We should just store that
information when we generate the llvm variant.

> I think a better idea would be:
> - split the texture/sampler state
> - to make the lp_jit_context::textures an array of pointers, and put the 
> struct lp_jit_texture in the pipe_texture object themselves
> - to make the lp_jit_context::samplers an array of pointers, and put the 
> struct lp_jit_sampler in the pipe_sampler_state CSO

I like this too - it's somewhat more involved of course.

In fact the two are orthogonal -- the struct below can still be shrunk
significantly by knowing how many samplers & textures the variant refers
to.  Interleaving them or packing them would reduce the bytes to be
compared.

Alternatively there could be just a pointer in jit_context to
textures/samplers binned elsewhere.

> struct lp_jit_context {
> struct lp_jit_texture *textures[PIPE_MAX_SAMPLERS];
> struct lp_jit_sampler *samplers[PIPE_MAX_SAMPLERS];
> };

The jit context above seems to have lost some of its fields...

The next step might be to split the context into four parts: textures,
samplers, constants, "other", and have jit_context just be a set of
pointers into the binned data:

struct lp_jit_context {
 struct lp_jit_texture **textures;
 struct lp_jit_sampler **samplers;
 const float *constants;
 const struct lp_jit_other *other;   
};

struct lp_jit_other {
   float alpha_ref_value;
   uint32_t stencil_ref_front;
   uint32_t stencil_ref_back;
   uint8_t *blend_color;
};

> struct lp_jit_texture
> {
>    uint32_t width;
>    uint32_t height;
>    uint32_t depth;
>    uint32_t first_level;
>    uint32_t last_level;
>    uint32_t row_stride[LP_MAX_TEXTURE_LEVELS];
>    uint32_t img_stride[LP_MAX_TEXTURE_LEVELS];
>    const void *data[LP_MAX_TEXTURE_LEVELS];
>    /* sampler state, actually */
>    float min_lod;
>    float max_lod;
>    float lod_bias;
>    float border_color[4];
> };
> 
> struct lp_jit_sampler
> {
>    float min_lod;
>    float max_lod;
>    float lod_bias;
>    float border_color[4];
> };
> 
> 
> Jose




Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup

2011-06-30 Thread Jose Fonseca


- Original Message -
> On Thu, 2011-06-30 at 03:36 +0200, Roland Scheidegger wrote:
> > Ok in fact there's a gcc bug about memcmp:
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
> > In short gcc's memcmp builtin is totally lame and loses to glibc's
> > memcmp (including call overhead, no knowledge about alignment etc.)
> > even
> > when comparing only very few bytes (and loses BIG time for lots of
> > bytes
> > to compare). Oops. Well at least if the strings are the same (I'd
> > guess
> > if the first byte is different it's hard to beat the gcc
> > builtin...).
> > So this is really a gcc bug. The bug is quite old though with no
> > fix in
> > sight apparently so might need to think about some workaround (but
> > just
> > not doing the comparison doesn't look like the right idea, since
> > apparently it would be faster with the comparison if gcc's memcmp
> > got
> > fixed).
> 
> Looking at the struct again (it's been a while), it seems like it
> could
> be rearranged to be variable-sized and on average significantly
> smaller:
> 
> struct lp_rast_state {
>struct lp_jit_context jit_context;
>struct lp_fragment_shader_variant *variant;
> };
> 
> struct lp_jit_context {
>const float *constants;
>float alpha_ref_value;
>uint32_t stencil_ref_front, stencil_ref_back;
>uint8_t *blend_color;
>struct lp_jit_texture textures[PIPE_MAX_SAMPLERS];
> };
> 
> If we moved the jit_context part behind "variant", and then hopefully
> note that most of those lp_jit_texture structs are not in use.  That
> would save time on the memcmp *and* space in the binned data.

Yeah, sounds a good idea.

But there's some subtlety to computing the number of textures: it can't be 
just the NULL textures, because they may be referred to by the JIT code, which has 
no NULL checks and relies on the state setup to provide storage for all 
textures, or dummy memory if one is not bound.

I think a better idea would be:
- split the texture/sampler state
- to make the lp_jit_context::textures an array of pointers, and put the struct 
lp_jit_texture in the pipe_texture object themselves
- to make the lp_jit_context::samplers an array of pointers, and put the struct 
lp_jit_sampler in the pipe_sampler_state CSO

struct lp_jit_context {
struct lp_jit_texture *textures[PIPE_MAX_SAMPLERS];
struct lp_jit_sampler *samplers[PIPE_MAX_SAMPLERS];
};

struct lp_jit_texture
{
   uint32_t width;
   uint32_t height;
   uint32_t depth;
   uint32_t first_level;
   uint32_t last_level;
   uint32_t row_stride[LP_MAX_TEXTURE_LEVELS];
   uint32_t img_stride[LP_MAX_TEXTURE_LEVELS];
   const void *data[LP_MAX_TEXTURE_LEVELS];
   /* sampler state, actually */
   float min_lod;
   float max_lod;
   float lod_bias;
   float border_color[4];
};

struct lp_jit_sampler
{
   float min_lod;
   float max_lod;
   float lod_bias;
   float border_color[4];
};


Jose


Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup

2011-06-30 Thread Jose Fonseca


- Original Message -
> Hmm.
> Forgive my ignorance, but isn't memcmp() on structs pretty prone to
> give
> incorrect != results, given that there may be padding between members
> in
> structs and that IIRC gcc struct assignment is member-wise.

There's no alternative to bitwise comparison in C:

$ cat cmp.c 
struct foo {
int a;
int b;
};


int cmp(const struct foo *a, const struct foo *b) {
return *a == *b;
}
$ gcc -c cmp.c 
cmp.c: In function ‘cmp’:
cmp.c:7:12: error: invalid operands to binary == (have ‘const struct foo’ and 
‘const struct foo’)

> What happens if there's padding between the jit_context and variant
> members of struct lp_rast_state?
> 
> I seem to recall hitting similar issues a number of times in the
> past.

I recall that as well, but my memory is the other way around: struct assignment 
is considered harmful because it breaks memcmp. Instead all structures should 
be initialized with memset(0) first, and always copied with memcpy.  This 
should ensure that padding doesn't get clobbered.

But now that you mention it, I'm still not 100% sure whether unused bits in 
bitfields are preserved like that when being assigned. We probably should ensure 
that all bits in bitfields are used, using "reserved" members, so that 
the zeros there stay zero.
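The discipline described above can be sketched like this (an illustrative pattern only; `struct foo` and its helpers are stand-ins, not Mesa code):

```c
#include <assert.h>
#include <string.h>

struct foo {
   char tag;     /* the compiler may insert padding after this member */
   int value;
};

/* Zero the whole object, padding included, before writing any member;
 * copy only with memcpy so padding bytes travel along unchanged.
 * memcmp then never observes indeterminate padding. */
static void foo_init(struct foo *f, char tag, int value)
{
   memset(f, 0, sizeof *f);
   f->tag = tag;
   f->value = value;
}

static int foo_equal(const struct foo *a, const struct foo *b)
{
   return memcmp(a, b, sizeof *a) == 0;
}
```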

Jose


> /Thomas
> 
> On 06/30/2011 03:36 AM, Roland Scheidegger wrote:
> > Ok in fact there's a gcc bug about memcmp:
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
> > In short gcc's memcmp builtin is totally lame and loses to glibc's
> > memcmp (including call overhead, no knowledge about alignment etc.)
> > even
> > when comparing only very few bytes (and loses BIG time for lots of
> > bytes
> > to compare). Oops. Well at least if the strings are the same (I'd
> > guess
> > if the first byte is different it's hard to beat the gcc
> > builtin...).
> > So this is really a gcc bug. The bug is quite old though with no
> > fix in
> > sight apparently so might need to think about some workaround (but
> > just
> > not doing the comparison doesn't look like the right idea, since
> > apparently it would be faster with the comparison if gcc's memcmp
> > got
> > fixed).
> >
> >
> > Roland
> >
> >
> >
> > On 30.06.2011 01:47, Roland Scheidegger wrote:
> >
> >> I didn't even look at that, was just curious why the memcmp (which
> >> is
> >> used a lot in other places) is slow. However, none of the other
> >> memcmp
> >> seem to show up prominently (cso functions are quite low in
> >> profiles,
> >> _mesa_search_program_cache uses memcmp too but it's not that high
> >> neither). So I guess those functions either aren't called that
> >> often or
> >> the sizes they compare are small.
> >> So should maybe just file a gcc bug for memcmp and look at that
> >> particular llvmpipe issue again :-).
> >>
> >> Roland
> >>
> >> On 30.06.2011 01:16, Corbin Simpson wrote:
> >>  
> >>> Okay, so maybe I'm failing to recognize the exact situation here,
> >>> but
> >>> wouldn't it be possible to mark the FS state with a serial number
> >>> and
> >>> just compare those? Or are these FS states not CSO-cached?
> >>>
> >>> ~ C.
> >>>
> >>> On Wed, Jun 29, 2011 at 3:44 PM, Roland
> >>> Scheidegger  wrote:
> >>>
>  Actually I ran some numbers here and tried out an optimized
>  struct compare:
>  original ipers: 12.1 fps
>  ajax patch: 15.5 fps
>  optimized struct compare: 16.8 fps
> 
> 
>  This is the function I used for that (just enabled in that
>  lp_setup
>  function):
> 
>  static INLINE int util_cmp_struct(const void *src1, const void
>  *src2,
>  unsigned count)
>  {
>    /* hmm pointer casting is evil */
>    const uint32_t *src1_ptr = (uint32_t *)src1;
>    const uint32_t *src2_ptr = (uint32_t *)src2;
>    unsigned i;
>    assert(count % 4 == 0);
>    for (i = 0; i < count/4; i++) {
>  if (*src1_ptr != *src2_ptr) {
>    return 1;
>  }
>  src1_ptr++;
>  src2_ptr++;
>    }
>    return 0;
>  }
> 
>  (And no this doesn't use repz cmpsd here.)
> 
>  So, unless I made some mistake, memcmp is just dead slow (*),
>  most of
>  the slowness probably coming from the bytewise comparison (and
>  apparently I was wrong in assuming the comparison there might
>  never be
>  the same for ipers).
>  Of course, the optimized struct compare relies on structs really
>  being
>  dword aligned (I think this is always the case), and
>  additionally it
>  relies on the struct size being a whole multiple of dwords -
>  likely
>  struct needs padding to ensure that (at least I don't think this
>  is
>  always guaranteed for all structs).
>  But since memcmp is used extensively (cso for one) maybe some
>  optimization along these lines might be worth it (though of
>  course for
>  small structs the win isn't going 

Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup

2011-06-30 Thread Jose Fonseca
Great work Roland! And thanks Ajax for finding this hot spot.

We use memcmp a lot -- all CSO caching, so we should use this everywhere.

We should also code a sse2 version with intrinsics for x86-64, which is 
guaranteed to always have SSE2.
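Such an SSE2 version might look roughly like this (a sketch only, using unaligned loads; `util_memcmp_sse2` is a hypothetical name, and the size is assumed to be a multiple of 16):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Equality-only compare, 16 bytes per iteration.  Returns 0 if the
 * blocks are equal, 1 as soon as a differing 16-byte chunk is seen. */
static int util_memcmp_sse2(const void *a, const void *b, size_t size)
{
   const char *pa = a;
   const char *pb = b;
   size_t i;

   assert(size % 16 == 0);
   for (i = 0; i < size; i += 16) {
      __m128i va = _mm_loadu_si128((const __m128i *)(pa + i));
      __m128i vb = _mm_loadu_si128((const __m128i *)(pb + i));
      /* per-byte compare; movemask collects one bit per byte,
       * so 0xffff means all 16 bytes matched */
      if (_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) != 0xffff)
         return 1;
   }
   return 0;
}
```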

Jose

- Original Message -
> Actually I ran some numbers here and tried out an optimized struct
> compare:
> original ipers: 12.1 fps
> ajax patch: 15.5 fps
> optimized struct compare: 16.8 fps
> 
> 
> This is the function I used for that (just enabled in that lp_setup
> function):
> 
> static INLINE int util_cmp_struct(const void *src1, const void *src2,
> unsigned count)
> {
>   /* hmm pointer casting is evil */
>   const uint32_t *src1_ptr = (uint32_t *)src1;
>   const uint32_t *src2_ptr = (uint32_t *)src2;
>   unsigned i;
>   assert(count % 4 == 0);
>   for (i = 0; i < count/4; i++) {
> if (*src1_ptr != *src2_ptr) {
>   return 1;
> }
> src1_ptr++;
> src2_ptr++;
>   }
>   return 0;
> }
> 
> (And no this doesn't use repz cmpsd here.)
> 
> So, unless I made some mistake, memcmp is just dead slow (*), most of
> the slowness probably coming from the bytewise comparison (and
> apparently I was wrong in assuming the comparison there might never
> be
> the same for ipers).
> Of course, the optimized struct compare relies on structs really
> being
> dword aligned (I think this is always the case), and additionally it
> relies on the struct size being a whole multiple of dwords - likely
> struct needs padding to ensure that (at least I don't think this is
> always guaranteed for all structs).
> But since memcmp is used extensively (cso for one) maybe some
> optimization along these lines might be worth it (though of course
> for
> small structs the win isn't going to be as big - and can't beat the
> repz
> cmps in code size...).
> 
> Roland
> 
> (*) I actually found some references some implementations might be
> better they don't just use repz cmpsb but they split this up in parts
> which do dword (or qword even - well for really large structs could
> use
> sse2) comparisons for the parts where it's possible and only byte
> comparisons for the remaining bytes (and if the compiler does that it
> probably would know the size at compile time here hence could leave
> out
> much of the code). Of course memcmp requires that the return value
> isn't
> just a true or false value, hence there's more code needed once an
> unequal dword is found, though the compiler could optimize that away
> too
> in case it's not needed. Much the same as memcpy is optimized usually
> really, so blame gcc :-).
> 
> 
> 
> > On 29.06.2011 20:33, Roland Scheidegger wrote:
> > Ohh that's interesting, you'd think the comparison shouldn't be
> > that
> > expensive (though I guess in ipers case the comparison is never
> > true).
> > memcmp is quite extensively used everywhere. Maybe we could replace
> > that
> > with something faster (since we only ever care if the blocks are
> > the
> > same but not care about the lexicographic ordering and always compare
> > whole structs, should compare dwords instead of bytes for a 4x
> > speedup)? Or isn't that the reason cmpsb instead of cmpsd is used?
> > Also I guess it would help if the values which are more likely to
> > be
> > unequal are first in the struct (if we can tell that).
> > Of course though if it's unlikely to be the same as the compared
> > value
> > anyway not comparing at all still might be a win (here).
> > 
> > Roland
> > 
> > On 29.06.2011 19:19, Adam Jackson wrote:
> >> Perversely, do this by eliminating the comparison between stored
> >> and
> >> current fs state.  On ipers, a perf trace showed
> >> try_update_scene_state
> >> using 31% of a CPU, and 98% of that was in 'repz cmpsb', ie, the
> >> memcmp.
> >> Taking that out takes try_update_scene_state down to 6.5% of the
> >> profile; more importantly, ipers goes from 10 to 14fps and gears
> >> goes
> >> from 790 to 830fps.
> >>
> >> Signed-off-by: Adam Jackson 
> >> ---
> >>  src/gallium/drivers/llvmpipe/lp_setup.c |   61
> >>  ++-
> >>  1 files changed, 27 insertions(+), 34 deletions(-)
> >>
> >> diff --git a/src/gallium/drivers/llvmpipe/lp_setup.c
> >> b/src/gallium/drivers/llvmpipe/lp_setup.c
> >> index cbe06e5..9118db5 100644
> >> --- a/src/gallium/drivers/llvmpipe/lp_setup.c
> >> +++ b/src/gallium/drivers/llvmpipe/lp_setup.c
> >> @@ -839,42 +839,35 @@ try_update_scene_state( struct
> >> lp_setup_context *setup )
> >>setup->dirty |= LP_SETUP_NEW_FS;
> >> }
> >>  
> >> -
> >> if (setup->dirty & LP_SETUP_NEW_FS) {
> >> -  if (!setup->fs.stored ||
> >> -  memcmp(setup->fs.stored,
> >> - &setup->fs.current,
> >> - sizeof setup->fs.current) != 0)
> >> -  {
> >> - struct lp_rast_state *stored;
> >> - uint i;
> >> -
> >> - /* The fs state that's been stored in the scene is
> >> different from
> >> -  * the new, current state.  So allocate a n

Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup

2011-06-30 Thread Keith Whitwell
On Thu, 2011-06-30 at 03:36 +0200, Roland Scheidegger wrote:
> Ok in fact there's a gcc bug about memcmp:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
> In short gcc's memcmp builtin is totally lame and loses to glibc's
> memcmp (including call overhead, no knowledge about alignment etc.) even
> when comparing only very few bytes (and loses BIG time for lots of bytes
> to compare). Oops. Well at least if the strings are the same (I'd guess
> if the first byte is different it's hard to beat the gcc builtin...).
> So this is really a gcc bug. The bug is quite old though with no fix in
> sight apparently so might need to think about some workaround (but just
> not doing the comparison doesn't look like the right idea, since
> apparently it would be faster with the comparison if gcc's memcmp got
> fixed).

Looking at the struct again (it's been a while), it seems like it could
be rearranged to be variable-sized and on average significantly smaller:

struct lp_rast_state {
   struct lp_jit_context jit_context;
   struct lp_fragment_shader_variant *variant;
};

struct lp_jit_context {
   const float *constants;
   float alpha_ref_value;
   uint32_t stencil_ref_front, stencil_ref_back;
   uint8_t *blend_color;
   struct lp_jit_texture textures[PIPE_MAX_SAMPLERS];
};

If we moved the jit_context part behind "variant", and then hopefully
note that most of those lp_jit_texture structs are not in use.  That
would save time on the memcmp *and* space in the binned data.

It's weird this wasn't showing up in past profiling.

Keith




Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup

2011-06-30 Thread Keith Whitwell
On Wed, 2011-06-29 at 16:16 -0700, Corbin Simpson wrote:
> Okay, so maybe I'm failing to recognize the exact situation here, but
> wouldn't it be possible to mark the FS state with a serial number and
> just compare those? Or are these FS states not CSO-cached?

No, the struct being compared is poorly named & collides with a CSO
entity.  It's really all the state which the compiled fragment shader
will reference when it is later invoked.  It's all packed into a single
struct because it's easier to pass a single parameter to llvm-compiled
shaders and add/change that parameter, but it is somewhat non-orthogonal
and we end up generating too many of them.

Keith



Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup

2011-06-30 Thread Thomas Hellstrom

Hmm.
Forgive my ignorance, but isn't memcmp() on structs pretty prone to give 
incorrect != results, given that there may be padding between members in 
structs and that IIRC gcc struct assignment is member-wise.


What happens if there's padding between the jit_context and variant 
members of struct lp_rast_state?


I seem to recall hitting similar issues a number of times in the past.

/Thomas

On 06/30/2011 03:36 AM, Roland Scheidegger wrote:

Ok in fact there's a gcc bug about memcmp:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
In short gcc's memcmp builtin is totally lame and loses to glibc's
memcmp (including call overhead, no knowledge about alignment etc.) even
when comparing only very few bytes (and loses BIG time for lots of bytes
to compare). Oops. Well at least if the strings are the same (I'd guess
if the first byte is different it's hard to beat the gcc builtin...).
So this is really a gcc bug. The bug is quite old though with no fix in
sight apparently so might need to think about some workaround (but just
not doing the comparison doesn't look like the right idea, since
apparently it would be faster with the comparison if gcc's memcmp got
fixed).


Roland



On 30.06.2011 01:47, Roland Scheidegger wrote:
   

I didn't even look at that, was just curious why the memcmp (which is
used a lot in other places) is slow. However, none of the other memcmp
seem to show up prominently (cso functions are quite low in profiles,
_mesa_search_program_cache uses memcmp too but it's not that high
neither). So I guess those functions either aren't called that often or
the sizes they compare are small.
So should maybe just file a gcc bug for memcmp and look at that
particular llvmpipe issue again :-).

Roland

On 30.06.2011 01:16, Corbin Simpson wrote:
 

Okay, so maybe I'm failing to recognize the exact situation here, but
wouldn't it be possible to mark the FS state with a serial number and
just compare those? Or are these FS states not CSO-cached?

~ C.

On Wed, Jun 29, 2011 at 3:44 PM, Roland Scheidegger  wrote:
   

Actually I ran some numbers here and tried out an optimized struct compare:
original ipers: 12.1 fps
ajax patch: 15.5 fps
optimized struct compare: 16.8 fps


This is the function I used for that (just enabled in that lp_setup
function):

static INLINE int util_cmp_struct(const void *src1, const void *src2,
unsigned count)
{
  /* hmm pointer casting is evil */
  const uint32_t *src1_ptr = (uint32_t *)src1;
  const uint32_t *src2_ptr = (uint32_t *)src2;
  unsigned i;
  assert(count % 4 == 0);
  for (i = 0; i < count/4; i++) {
if (*src1_ptr != *src2_ptr) {
  return 1;
}
src1_ptr++;
src2_ptr++;
  }
  return 0;
}

(And no this doesn't use repz cmpsd here.)

So, unless I made some mistake, memcmp is just dead slow (*), most of
the slowness probably coming from the bytewise comparison (and
apparently I was wrong in assuming the comparison there might never be
the same for ipers).
Of course, the optimized struct compare relies on structs really being
dword aligned (I think this is always the case), and additionally it
relies on the struct size being a whole multiple of dwords - likely
struct needs padding to ensure that (at least I don't think this is
always guaranteed for all structs).
But since memcmp is used extensively (cso for one) maybe some
optimization along these lines might be worth it (though of course for
small structs the win isn't going to be as big - and can't beat the repz
cmps in code size...).

Roland

(*) I actually found some references some implementations might be
better they don't just use repz cmpsb but they split this up in parts
which do dword (or qword even - well for really large structs could use
sse2) comparisons for the parts where it's possible and only byte
comparisons for the remaining bytes (and if the compiler does that it
probably would know the size at compile time here hence could leave out
much of the code). Of course memcmp requires that the return value isn't
just a true or false value, hence there's more code needed once an
unequal dword is found, though the compiler could optimize that away too
in case it's not needed. Much the same as memcpy is optimized usually
really, so blame gcc :-).



Am 29.06.2011 20:33, schrieb Roland Scheidegger:
 

Ohh that's interesting, you'd think the comparison shouldn't be that
expensive (though I guess in ipers case the comparison is never true).
memcmp is quite extensively used everywhere. Maybe we could replace that
with something faster (since we only ever care if the blocks are the
same but not care about the lexographic ordering and always compare
whole structs, should compare dwords instead of bytes for a 4 time
speedup)? Or isn't that the reason cmpsb instead of cmpsd is used?
Also I guess it would help if the values which are more likely to be
unequal are first in the struct (if we can tell that).
Of course though if it's unlikely to be the same as the compared value
any

Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup

2011-06-29 Thread Roland Scheidegger
Ok, in fact there's a gcc bug about memcmp:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
In short, gcc's memcmp builtin is totally lame and loses to glibc's
memcmp (even counting call overhead, glibc's lack of alignment knowledge
beforehand, etc.) when comparing only very few bytes - and loses BIG time
when there are lots of bytes to compare. Oops. At least that's the case
when the strings are equal (I'd guess if the first byte already differs,
the gcc builtin is hard to beat...).
So this really is a gcc bug. The bug is quite old, though, with apparently
no fix in sight, so we might need to think about a workaround. Just not
doing the comparison doesn't look like the right idea, since apparently
the comparison would be a win if gcc's memcmp got fixed.
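
The workaround being discussed can be sketched as a build-flag change
(-fno-builtin-memcmp is a standard gcc option; the file name below is
just an example, whether to apply it per-file or globally is a policy
choice):

```shell
# Tell gcc not to expand memcmp inline; calls then go through to
# glibc's optimized implementation instead of the naive builtin.
gcc -O2 -fno-builtin-memcmp -c lp_setup.c -o lp_setup.o
```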


Roland




Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup

2011-06-29 Thread Roland Scheidegger
I didn't even look at that; I was just curious why the memcmp (which is
used a lot in other places) is slow. However, none of the other memcmp
calls seem to show up prominently (cso functions are quite low in
profiles, _mesa_search_program_cache uses memcmp too but it's not that
high either). So I guess those functions either aren't called that often
or the sizes they compare are small.
So we should maybe just file a gcc bug for memcmp and look at that
particular llvmpipe issue again :-).

Roland


Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup

2011-06-29 Thread Corbin Simpson
Okay, so maybe I'm failing to recognize the exact situation here, but
wouldn't it be possible to mark the FS state with a serial number and
just compare those? Or are these FS states not CSO-cached?

~ C.


Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup

2011-06-29 Thread Roland Scheidegger
Actually I ran some numbers here and tried out an optimized struct compare:
original ipers: 12.1 fps
ajax patch: 15.5 fps
optimized struct compare: 16.8 fps


This is the function I used for that (just enabled in that lp_setup
function):

static INLINE int
util_cmp_struct(const void *src1, const void *src2, unsigned count)
{
   /* hmm, pointer casting is evil */
   const uint32_t *src1_ptr = (const uint32_t *) src1;
   const uint32_t *src2_ptr = (const uint32_t *) src2;
   unsigned i;
   assert(count % 4 == 0);
   for (i = 0; i < count / 4; i++) {
      if (*src1_ptr != *src2_ptr) {
         return 1;
      }
      src1_ptr++;
      src2_ptr++;
   }
   return 0;
}

(And no this doesn't use repz cmpsd here.)

So, unless I made some mistake, memcmp is just dead slow (*), with most
of the slowness probably coming from the bytewise comparison (and
apparently I was wrong in assuming the compared data would never be
equal for ipers).
Of course, the optimized struct compare relies on structs really being
dword aligned (I think this is always the case), and additionally it
relies on the struct size being a whole multiple of dwords - the struct
likely needs padding to ensure that (at least I don't think this is
always guaranteed for all structs).
But since memcmp is used extensively (cso for one), some optimization
along these lines might be worth it (though of course for small structs
the win isn't going to be as big - and it can't beat the repz cmps in
code size...).

Roland

(*) I actually found some references saying some implementations do
better: they don't just use repz cmpsb, but split the comparison into
parts - dword (or even qword; for really large structs SSE2 could be
used) comparisons wherever possible, and byte comparisons only for the
remaining bytes. And if the compiler does that, it probably knows the
size at compile time here, hence could leave out much of the code. Of
course memcmp requires that the return value isn't just a true or false
value, hence more code is needed once an unequal dword is found, though
the compiler could optimize that away too in case it's not needed. Much
the same as how memcpy is usually optimized, really - so blame gcc :-).




Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup

2011-06-29 Thread Roland Scheidegger
Ohh, that's interesting - you'd think the comparison shouldn't be that
expensive (though I guess in ipers' case the comparison is never true).
memcmp is used quite extensively everywhere. Maybe we could replace it
with something faster: since we only ever care whether the blocks are
the same, never about lexicographic ordering, and always compare whole
structs, we could compare dwords instead of bytes for a 4x speedup. Or
is that the reason cmpsb is used instead of cmpsd?
Also I guess it would help if the values which are more likely to be
unequal come first in the struct (if we can tell that).
Of course, if the compared value is unlikely to be equal anyway, not
comparing at all still might be a win (here).

Roland


___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup

2011-06-29 Thread Keith Whitwell
On Wed, 2011-06-29 at 13:19 -0400, Adam Jackson wrote:
> Perversely, do this by eliminating the comparison between stored and
> current fs state.  On ipers, a perf trace showed try_update_scene_state
> using 31% of a CPU, and 98% of that was in 'repz cmpsb', ie, the memcmp.
> Taking that out takes try_update_scene_state down to 6.5% of the
> profile; more importantly, ipers goes from 10 to 14fps and gears goes
> from 790 to 830fps.

Some of the motivation for that memcpy is about keeping the memory usage
of the binned scene from exploding and forcing unnecessary flushes on
more complex apps.

I wonder if there is a way to improve the dirty flag handling to avoid
ending up in that memcpy so often?


Note that freeglut is probably dominating your gears numbers by trying
to reinitialize your SpaceBall device (I don't have one either) on every
swapbuffers.

http://lists.freedesktop.org/archives/mesa-dev/2011-February/005599.html


Keith






[Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup

2011-06-29 Thread Adam Jackson
Perversely, do this by eliminating the comparison between stored and
current fs state.  On ipers, a perf trace showed try_update_scene_state
using 31% of a CPU, and 98% of that was in 'repz cmpsb', ie, the memcmp.
Taking that out takes try_update_scene_state down to 6.5% of the
profile; more importantly, ipers goes from 10 to 14fps and gears goes
from 790 to 830fps.

Signed-off-by: Adam Jackson 
---
 src/gallium/drivers/llvmpipe/lp_setup.c |   61 ++-
 1 files changed, 27 insertions(+), 34 deletions(-)

diff --git a/src/gallium/drivers/llvmpipe/lp_setup.c 
b/src/gallium/drivers/llvmpipe/lp_setup.c
index cbe06e5..9118db5 100644
--- a/src/gallium/drivers/llvmpipe/lp_setup.c
+++ b/src/gallium/drivers/llvmpipe/lp_setup.c
@@ -839,42 +839,35 @@ try_update_scene_state( struct lp_setup_context *setup )
   setup->dirty |= LP_SETUP_NEW_FS;
}
 
-
if (setup->dirty & LP_SETUP_NEW_FS) {
-  if (!setup->fs.stored ||
-  memcmp(setup->fs.stored,
- &setup->fs.current,
- sizeof setup->fs.current) != 0)
-  {
- struct lp_rast_state *stored;
- uint i;
- 
- /* The fs state that's been stored in the scene is different from
-  * the new, current state.  So allocate a new lp_rast_state object
-  * and append it to the bin's setup data buffer.
-  */
- stored = (struct lp_rast_state *) lp_scene_alloc(scene, sizeof 
*stored);
- if (!stored) {
-assert(!new_scene);
-return FALSE;
- }
+  struct lp_rast_state *stored;
+  uint i;
+  
+  /* The fs state that's been stored in the scene is different from
+   * the new, current state.  So allocate a new lp_rast_state object
+   * and append it to the bin's setup data buffer.
+   */
+  stored = (struct lp_rast_state *) lp_scene_alloc(scene, sizeof *stored);
+  if (!stored) {
+ assert(!new_scene);
+ return FALSE;
+  }
 
- memcpy(stored,
-&setup->fs.current,
-sizeof setup->fs.current);
- setup->fs.stored = stored;
- 
- /* The scene now references the textures in the rasterization
-  * state record.  Note that now.
-  */
- for (i = 0; i < Elements(setup->fs.current_tex); i++) {
-if (setup->fs.current_tex[i]) {
-   if (!lp_scene_add_resource_reference(scene,
-setup->fs.current_tex[i],
-new_scene)) {
-  assert(!new_scene);
-  return FALSE;
-   }
+  memcpy(stored,
+ &setup->fs.current,
+ sizeof setup->fs.current);
+  setup->fs.stored = stored;
+  
+  /* The scene now references the textures in the rasterization
+   * state record.  Note that now.
+   */
+  for (i = 0; i < Elements(setup->fs.current_tex); i++) {
+ if (setup->fs.current_tex[i]) {
+if (!lp_scene_add_resource_reference(scene,
+ setup->fs.current_tex[i],
+ new_scene)) {
+   assert(!new_scene);
+   return FALSE;
 }
  }
   }
-- 
1.7.5.4
