On Fri, Jul 10, 2015 at 02:06:26PM -0600, Jeff Law wrote:
> On 07/10/2015 07:25 AM, Ondřej Bílka wrote:
> >On Fri, Jul 10, 2015 at 12:43:48PM +0200, Jakub Jelinek wrote:
> >>On Fri, Jul 10, 2015 at 11:37:18AM +0200, Uros Bizjak wrote:
> >>>Have you tried new SSE4.2 implementation (the one with asm flags) with
> >>>unrolled loop?
> >>
> >>Also, the SSE4.2 implementation looks shorter, so more I-cache friendly,
> >>so I wouldn't really say it is redundant if they are roughly same speed.
> >>
> >Ok, I also tried to optimize sse4 and found that the main problem was
> >that checking index==16 caused high latency.
> >
> >The trick was checking the first 64 bytes in a header using flags. The
> >loop is then relatively unlikely to run, as lines longer than 64 bytes
> >are relatively rare.
> >
> >I tested that on more machines. On haswell sse4 is noticeably faster, on
> >nehalem sse2 is still a bit faster and on amd fx10 it's a lot slower. How
> >do I check the processor to select sse2 on amd processors where sse4 is
> >considerably slower?
> I doubt any of this is worth the maintenance burden.  I think we
> should pick a reasonably performant implementation and move on to
> bigger issues.
> 
Then we could proceed with this patch on the basis that Intel processors
are more common than AMD ones. On fx10 the new sse2 implementation is 20%
faster than sse4, but on haswell it's the opposite and sse4 is 20% faster
than sse2.
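
For reference, the vendor check asked about above could look roughly like
the sketch below. It assumes the x86 CPU-detection builtins that GCC
provides (__builtin_cpu_init, __builtin_cpu_supports, __builtin_cpu_is);
the enum and function names are made up for illustration and are not taken
from the patch. The policy it encodes (prefer sse2 on any AMD part) is only
an assumption drawn from the fx10 numbers above, not a measured rule.

```c
/* Hypothetical labels for the candidate implementations.  */
enum line_search_impl { SEARCH_GENERIC, SEARCH_SSE2, SEARCH_SSE42 };

/* Pick an implementation at run time.  Assumption: sse4.2 is only
   worth using on non-AMD processors; everything else gets sse2 when
   available, and the generic code otherwise.  */
static enum line_search_impl
pick_line_search (void)
{
#if defined (__GNUC__) && (defined (__x86_64__) || defined (__i386__))
  /* Harmless to call more than once; needed if this runs early.  */
  __builtin_cpu_init ();

  if (__builtin_cpu_supports ("sse4.2") && !__builtin_cpu_is ("amd"))
    return SEARCH_SSE42;
  if (__builtin_cpu_supports ("sse2"))
    return SEARCH_SSE2;
#endif
  /* Non-x86 target, or no usable vector extension.  */
  return SEARCH_GENERIC;
}

/* Human-readable name, e.g. for logging which variant was chosen.  */
static const char *
line_search_name (enum line_search_impl impl)
{
  switch (impl)
    {
    case SEARCH_SSE42:
      return "sse4.2";
    case SEARCH_SSE2:
      return "sse2";
    default:
      return "generic";
    }
}
```

The result would be computed once at startup and used to set a function
pointer, so the check itself stays off the hot path.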

