Hello gcc,

I have been looking at optimizations of pixel-format conversion recently and have noticed that gcc does not take advantage of the SSE4a extrq, BMI1 bextr, TBM bextri, or BMI2 pext instructions when they could be useful.

As far as I can tell, it should not be that hard. A bextr expression can typically be recognized as ((x >> s) & mask) or ((x << s1) >> s2). But I am unsure where to do such matching: the mask needs to have a specific form, (1 << n) - 1, to be valid for bextr, so it seems the matching needs to be done before instruction selection.

Secondly, the bextr instruction by itself only replaces two already fast instructions, so the gain is very minor (unless extracting variable bit-fields, which is harder to recognize). The real optimization comes from being able to use pext (parallel bit extract), which can implement several bextr expressions in parallel.

So, where would be the right place to implement such instructions? Would it make sense to recognize bextr early, before we get to the i386 code, or would it be better to recognize it late? And where do I put such instruction selection optimizations?

Motivating example:

    unsigned rgb32_to_rgb16(unsigned rgb32) {
        unsigned char red = (rgb32 >> 19) & 0x1f;
        unsigned char green = (rgb32 >> 10) & 0x3f;
        unsigned char blue = rgb32 & 0x1f;
        return (red << 11) | (green << 5) | blue;
    }

can be implemented as pext(rgb32, 0x00f8fc1f), where the mask selects bits 19-23, 10-15 and 0-4, the same bits the shifts and masks above extract.

Best regards
Allan Sandfeld
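
P.S. A minimal sketch for checking the claims above in portable C, so it does not need BMI2 hardware. All helper names (extract_shift_mask, extract_two_shifts, pext32_soft, check_rgb_equivalence) and the constant RGB565_PEXT_MASK are hypothetical, made up for illustration. pext32_soft is only a software model of what pext computes, not the instruction or a GCC builtin. The mask 0x00f8fc1f is derived from the shifts and masks in the function: bits 19-23, 10-15 and 0-4.

```c
#include <assert.h>
#include <stdint.h>

/* Mask selecting exactly the bits rgb32_to_rgb16 reads (derived, see above). */
#define RGB565_PEXT_MASK 0x00f8fc1fu

/* bextr-like form 1: shift right, then mask with len contiguous low bits.
   Assumes 0 < len < 32 and start + len <= 32. */
static uint32_t extract_shift_mask(uint32_t x, unsigned start, unsigned len)
{
    return (x >> start) & ((1u << len) - 1u);
}

/* bextr-like form 2: shift left to drop the high bits, then shift right.
   Same assumptions as above. */
static uint32_t extract_two_shifts(uint32_t x, unsigned start, unsigned len)
{
    return (x << (32u - start - len)) >> (32u - len);
}

/* Software model of pext: gather the bits of src selected by mask
   into the low bits of the result, lowest mask bit first. */
static uint32_t pext32_soft(uint32_t src, uint32_t mask)
{
    uint32_t result = 0;
    uint32_t out_bit = 1;
    while (mask != 0) {
        uint32_t lowest = mask & (~mask + 1u); /* isolate lowest set mask bit */
        if (src & lowest)
            result |= out_bit;
        out_bit <<= 1;
        mask &= mask - 1u;                     /* clear that mask bit */
    }
    return result;
}

/* The motivating example, unchanged. */
static unsigned rgb32_to_rgb16(unsigned rgb32)
{
    unsigned char red = (rgb32 >> 19) & 0x1f;
    unsigned char green = (rgb32 >> 10) & 0x3f;
    unsigned char blue = rgb32 & 0x1f;
    return (red << 11) | (green << 5) | blue;
}

/* Compare the scalar code against the pext model on pseudo-random inputs. */
static void check_rgb_equivalence(void)
{
    uint32_t x = 0x12345678u;
    for (int i = 0; i < 100000; i++) {
        assert(rgb32_to_rgb16(x) == pext32_soft(x, RGB565_PEXT_MASK));
        assert(extract_shift_mask(x, 10, 6) == extract_two_shifts(x, 10, 6));
        x = x * 1664525u + 1013904223u; /* simple LCG step */
    }
}
```

Calling check_rgb_equivalence() from a small main() exercises both the two bextr-style forms and the pext replacement. On a BMI2 target, I would expect _pext_u32 from <immintrin.h> (built with -mbmi2) to be usable in place of the software model.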