Cool.  Isosurf is a good benchmark for these rasterization functions.

I found one reason why main is slower than master.  Master has this commit:



Author: Keith Whitwell <kei...@vmware.com>  2010-09-12 14:29:00
Committer: Keith Whitwell <kei...@vmware.com>  2010-10-11 23:43:53
Parent: 5d9de332bfef20c2e8b8942980f3e085915df251 (llvmpipe: add debug helpers for epi32 etc)
Child:  309d7bb01bdc306bd4f1964768e78f5479deb5ab (llvmpipe: remove perspective-divide-per-quad code)
Branch: main
Follows: mesa-7.8.1
Precedes: 

    llvmpipe: fix wierd performance regression in isosurf
    
    I really don't understand the mechanism behind this, but it
    seems like the way data blocks for a scene are malloced, and in
    particular whether we treat them as stack or a queue, and whether
    we retain the most recently allocated or least recently allocated
    has a real affect (~5%) on isosurf framerates...
    
    This is probably specific to my distro or even just my machine,
    but none the less, it's nicer not to see the framerates go in the
    wrong direction.



which fixed a problem I thought didn't exist on main - turns out it does.

The "problem" as such is that although malloc()/free() are usually very fast, 
brk() and mmap() are not fast at all, and unfortunately it's hard to predict 
whether a given malloc() will end up calling brk() or not.  The "fix" above 
just removes a few mallocs & gets better behaviour out of the Linux malloc 
implementation.  Without it, we seem to be doing a brk() every scene, which is 
responsible for (some of) the slowdown.
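
To make the idea concrete, here is a minimal sketch of retaining released data blocks on a free list so that steady-state scenes never reach malloc(), and therefore never give the allocator a chance to call brk().  This is not the llvmpipe code from the commit above; the names (data_block, block_pool, pool_get, pool_put, pool_destroy) and the sizes are hypothetical, and the LIFO (stack) order is just one of the policies the commit message mentions.

```c
#include <stdlib.h>

#define BLOCK_SIZE   (64 * 1024)  /* hypothetical block size */
#define MAX_RETAINED 16           /* hypothetical cap on cached blocks */

struct data_block {
    struct data_block *next;
    size_t used;
    unsigned char data[BLOCK_SIZE];
};

struct block_pool {
    struct data_block *free_list; /* LIFO: most recently released block first */
    unsigned retained;            /* number of blocks currently cached */
    unsigned malloc_calls;        /* how often we fell through to malloc() */
};

/* Take a block for the current scene, reusing a cached one if possible. */
static struct data_block *pool_get(struct block_pool *pool)
{
    struct data_block *b = pool->free_list;

    if (b) {
        pool->free_list = b->next;
        pool->retained--;
    } else {
        b = malloc(sizeof *b);
        if (!b)
            return NULL;
        pool->malloc_calls++;
    }
    b->next = NULL;
    b->used = 0;
    return b;
}

/* Return a block at end of scene; cache it instead of calling free(). */
static void pool_put(struct block_pool *pool, struct data_block *b)
{
    if (pool->retained < MAX_RETAINED) {
        b->next = pool->free_list;
        pool->free_list = b;
        pool->retained++;
    } else {
        free(b);
    }
}

/* Release everything still cached. */
static void pool_destroy(struct block_pool *pool)
{
    while (pool->free_list) {
        struct data_block *b = pool->free_list;
        pool->free_list = b->next;
        free(b);
    }
    pool->retained = 0;
}
```

With a pool like this, a scene that needs two blocks costs two malloc() calls the first time and none afterwards, however many scenes follow.  Whether that actually avoids the brk() depends on the allocator, so it would still need to be measured.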

All of this is Linux-specific & may not be relevant on more complex demos, but 
it is fairly interesting.

Keith




________________________________________
From: José Fonseca [jfons...@vmware.com]
Sent: Tuesday, October 12, 2010 1:57 PM
To: Keith Whitwell
Cc: mesa-commit@lists.freedesktop.org
Subject: Re: Mesa (master): llvmpipe: try to do more of rast_tri_3_16 with 
intrinsics

> +   __m128i b4a4_mask       = _mm_and_si128(b4a4, mask);
> +   __m128i b4a4_mask_shift = _mm_slli_si128(b4a4_mask, 4);

I think you could replace these two calls with _mm_slli_epi64(b4a4_mask,
32)

I also think you should replace the two _mm_srli_si128 above with
_mm_srli_epi64, as _mm_srli_si128 could be up to 4x slower, according
to Intel's intrinsic guide.

There's a patch attached that does this, but I'm not sure how to
benchmark such a tiny optimization.
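
For reference, a small sketch of why the two forms should give the same result.  It assumes that `mask` keeps 32-bit elements 0 and 2 and clears elements 1 and 3; the quoted hunk does not show the mask's value, so that is a guess, and the helper names (even_mask, shl_bytes, shl_lanes, shr_bytes, shr_lanes, same) are made up for illustration.  Only SSE2 intrinsics are used.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Assumed mask: keep 32-bit elements 0 and 2, clear elements 1 and 3.
 * _mm_set_epi32 takes its arguments as (e3, e2, e1, e0). */
static __m128i even_mask(void)
{
    return _mm_set_epi32(0, -1, 0, -1);
}

/* Form in the patch: mask, then shift the whole register left by 4 bytes. */
static __m128i shl_bytes(__m128i v)
{
    return _mm_slli_si128(_mm_and_si128(v, even_mask()), 4);
}

/* Suggested form: shift each 64-bit lane left by 32 bits.  The odd
 * elements fall off the top of each lane, so no mask is needed. */
static __m128i shl_lanes(__m128i v)
{
    return _mm_slli_epi64(v, 32);
}

/* Right shift by 4 bytes across the whole register, then mask. */
static __m128i shr_bytes(__m128i v)
{
    return _mm_and_si128(_mm_srli_si128(v, 4), even_mask());
}

/* Right shift of each 64-bit lane by 32 bits; the upper half of each
 * lane is zero-filled, which matches the masked result above. */
static __m128i shr_lanes(__m128i v)
{
    return _mm_srli_epi64(v, 32);
}

/* Returns 1 if all four 32-bit elements of a and b are equal. */
static int same(__m128i a, __m128i b)
{
    return _mm_movemask_epi8(_mm_cmpeq_epi32(a, b)) == 0xFFFF;
}
```

Calling same(shl_bytes(v), shl_lanes(v)) and same(shr_bytes(v), shr_lanes(v)) on a few test vectors is a cheap correctness check under that mask assumption.  It says nothing about speed, which is the part that is hard to benchmark.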

> +   __m128i result          = _mm_or_si128(ba_mask, b4a4_mask_shift);
> +#endif
> +
> +   return result;
> +}

Jose