Optimizing bit extract

2014-02-14 Thread Allan Sandfeld Jensen
Hello gcc

I have been looking at optimizations of pixel-format conversion recently and 
have noticed that gcc does take advantage of SSE4a extrq, BMI1 bextr TBM 
bextri or BMI2 pext instructions when it could be useful.

As far as I can tell it should not be that hard. A bextr expression can 
typically be recognized as ((x  s)   mask) or ((x  s1))  s2). But I am 
unsure where to do such a matching since the mask needs to have specific form 
to be valid for bextr, so it seems it needs to be done before instruction 
selection.

Secondly the bextr instruction in itself only replace two already fast 
instructions so is very minor (unless extracting variable bit-fields which is 
harder recognize).  The real optimization comes from being able to use pext 
(parallel bit extract), which can implement several bextr expressions in 
parallel. 

So, where would be the right place to implement such instructions. Would it 
make sense to recognize bextr early before we get to i386 code, or would it be 
better to recognize it late. And where do I put such instruction selection 
optimizations?

Motivating example:

unsigned rgb32_to_rgb16(unsigned rgb32) {
unsigned char red = (rgb32  19)  0x1f;
unsigned char green = (rgb32  10)  0x3f;
unsigned char blue = rgb32   0x1f;
   return (red  11) | (green  5) | blue;
}

can be implemented as pext(rgb32, 0x001f3f1f)

Best regards
`Allan Sandfeld


Re: Optimizing bit extract

2014-02-14 Thread Richard Biener
On Fri, Feb 14, 2014 at 2:23 PM, Allan Sandfeld Jensen
li...@carewolf.com wrote:
 Hello gcc

 I have been looking at optimizations of pixel-format conversion recently and
 have noticed that gcc does take advantage of SSE4a extrq, BMI1 bextr TBM
 bextri or BMI2 pext instructions when it could be useful.

 As far as I can tell it should not be that hard. A bextr expression can
 typically be recognized as ((x  s)   mask) or ((x  s1))  s2). But I am
 unsure where to do such a matching since the mask needs to have specific form
 to be valid for bextr, so it seems it needs to be done before instruction
 selection.

 Secondly the bextr instruction in itself only replace two already fast
 instructions so is very minor (unless extracting variable bit-fields which is
 harder recognize).  The real optimization comes from being able to use pext
 (parallel bit extract), which can implement several bextr expressions in
 parallel.

 So, where would be the right place to implement such instructions. Would it
 make sense to recognize bextr early before we get to i386 code, or would it be
 better to recognize it late. And where do I put such instruction selection
 optimizations?

 Motivating example:

 unsigned rgb32_to_rgb16(unsigned rgb32) {
 unsigned char red = (rgb32  19)  0x1f;
 unsigned char green = (rgb32  10)  0x3f;
 unsigned char blue = rgb32   0x1f;
return (red  11) | (green  5) | blue;
 }

 can be implemented as pext(rgb32, 0x001f3f1f)

We have a special pass that already deals with similar patterns,
the bswap pass in tree-ssa-math-opts.c.  It does symbolic
execution to produce the composition of a value.  It currently
handles byte-shifts only I think (not shifting by 19 or 10) but
this is certainly the way I'd recognize pext() (and other generic
shuffles supported by vector ISAs).

You'd have to extend the representation it uses to handle
these more arbitrary shifts/masks of course.

Richard.

 Best regards
 `Allan Sandfeld