On Tue, 12 Jul 2016, Richard Biener wrote:

> On Tue, 12 Jul 2016, Uros Bizjak wrote:
> >
> > On Tue, Jul 12, 2016 at 10:58 AM, Richard Biener <rguent...@suse.de> wrote:
> > > On Sun, 10 Jul 2016, Uros Bizjak wrote:
> > >
> > >> On Wed, Jul 6, 2016 at 3:18 PM, Richard Biener <rguent...@suse.de> wrote:
> > >>
> > >> >> > 2016-07-04 Richard Biener <rguent...@suse.de>
> > >> >> >
> > >> >> > PR rtl-optimization/68961
> > >> >> > * fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and
> > >> >> > CONCAT
> > >> >> > to simplify to a non-constant.
> > >> >> >
> > >> >> > * gcc.target/i386/pr68961.c: New testcase.
> > >> >>
> > >> >> Thanks, LGTM.
> > >> >
> > >> > Bootstrapped and tested on x86_64-unknown-linux-gnu, it causes
> > >> >
> > >> > FAIL: gcc.target/i386/sse2-load-multi.c scan-assembler-times movup 2
> > >> >
> > >> > as the peephole created for that testcase no longer applies as fwprop
> > >> > does
> > >> >
> > >> > In insn 10, replacing
> > >> > (vec_concat:V2DF (vec_select:DF (reg:V2DF 91)
> > >> > (parallel [
> > >> > (const_int 0 [0])
> > >> > ]))
> > >> > (mem:DF (reg/f:DI 95) [0 S8 A128]))
> > >> > with (vec_concat:V2DF (reg:DF 93 [ MEM[(const double *)&a + 8B] ])
> > >> > (mem:DF (reg/f:DI 95) [0 S8 A128]))
> > >> > Changed insn 10
> > >> >
> > >> > resulting in
> > >> >
> > >> > movsd a+8(%rip), %xmm0
> > >> > movhpd a+16(%rip), %xmm0
> > >> >
> > >> > again rather than movupd.
> > >> >
> > >> > Uros, there is probably a missing peephole for the new form - can you
> > >> > fix this as a followup or should I hold on this patch for a bit longer?
> > >>
> > >> No, please proceed with the patch, I'll fix this fallout with a
> > >> followup patch in a couple of days.
> > >
> > > Applied as r238238. Is the following x86 change ok then which
> > > adjusts the vectorizer vector construction cost to sth more sensible?
> > > I have adjusted the generic implementation in targhooks.c this way
> > > a few weeks ago already.
> > >
> > > Thanks,
> > > Richard.
> > >
> > > 2016-07-12 Richard Biener <rguent...@suse.de>
> > >
> > > * targhooks.c (default_builtin_vectorization_cost): Adjust
> > > vec_construct cost.
> > > * config/i386/i386.c (ix86_builtin_vectorization_cost): Likewise.
> >
> > Looks OK to me, but let's also give Intel chance to comment.
>
> Btw, the motivation is that the cost of large initializers like for
> v16qi or v32qi is underestimated currently. You end up with
> 15 or 31 vinsert calls (or similar with other ISAs) and you can't do
> better than elements - 1 operations. It doesn't really matter
> for smaller vectors of course (seen for CPU v6 x264)
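
For reference, the "elements - 1" bound quoted above can be written down as a tiny standalone sketch. This is only an illustration of the arithmetic, not the code in targhooks.c or i386.c; the helper name vec_construct_ops is made up for this example, and it assumes the first element is placed for free with each remaining element needing one insert.

```c
#include <assert.h>

/* Hypothetical helper, for illustration only (not a GCC function).
   Assumption: building a vector from scalars places the first element
   for free and needs one insert per remaining element, so the count
   is the "elements - 1" bound mentioned in the quoted message.  */
static int
vec_construct_ops (int nelts)
{
  assert (nelts >= 1);
  return nelts - 1;
}
```

With this, vec_construct_ops (16) gives 15 and vec_construct_ops (32) gives 31, matching the v16qi and v32qi numbers above, while a two-element vector such as v2df needs a single operation, which is why small vectors are barely affected.
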

I've applied the patch now given no further comments from Intel.

Richard.