On Fri, Sep 16, 2011 at 6:20 PM, Jakub Jelinek <ja...@redhat.com> wrote:
>> Surprisingly with -mavx2 the integer loops aren't vectorized with
>> 32-byte vectors, wonder why.  But looking at the integer
>> umin/umax/smin/smax 16-byte reductions, they generate good code even
>> without reduc_* patterns, apparently using vector shifts.
>
> Seems on that testcase the integer loops weren't using 32-byte vectors
> because there were no expanders for 32-byte integer min/max.
> The following patch adds them (and also the related 32-byte integer
> condition expanders vcond/vcondu).  With this, all the integer loops
> in that testcase are nicely vectorized with 32-byte vectors with
> -mavx2; unfortunately the reductions look terrible.
>
> The problem is that AVX2 doesn't have a 32-byte whole-vector shift
> right (well, in theory it has one if the shift count is exactly 128 -
> vextractf128).  For shift counts > 128 we could in theory handle it as
> two instructions, vextractf128 plus a 16-byte whole-vector shift by
> count - 128, but reductions don't actually need the two steps: we only
> care about the bottom bits after the shifts, and the upper bits can
> contain anything.
>
> So we can fix this either by adding
> reduc_{smin,smax,umin,umax}_v{32q,16h,8s,4d}i patterns (at which point
> I guess I should just macroize them together with
> reduc_{smin,smax,umin,umax}_v{4sf,8sf,4df}) and handling the four
> 32-byte integer modes in ix86_expand_reduc as well, or by coming up
> with some new optab for an operation like whole-vector shift right
> which would allow the upper bits to be undefined and would only allow
> shifts by vector size / 2, / 4, / 8 down to the element size, with a
> corresponding tree code.  What do you prefer?

I think the former approach is better.  We don't have a full-vector
shift in this case, so faking it with some very constrained optab would
IMO be pointless.
> OT: seems the AVX2 support put the avx2_<code><mode>3 and
> *avx2_<code><mode>3 patterns (the former after this patch
> <code><mode>3) in a wrong spot, in between the vec_shr_<mode> expander
> and the sse2_lshrv1ti3 insn which implements what the expander expands
> to.  Uros, would you like to move them elsewhere?  Where exactly?

I'd put these after the sse4_1 umaxmin patterns, just before:

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;
;; Parallel integral comparisons
;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

> This patch has been tested on x86_64-linux and i686-linux on
> SandyBridge.
>
> 2011-09-16  Jakub Jelinek  <ja...@redhat.com>
>
> 	* config/i386/i386.c (ix86_build_const_vector): Handle V8SImode
> 	and V4DImode.
> 	(ix86_build_signbit_mask): Likewise.
> 	(ix86_expand_int_vcond): Likewise.  Handle V16HImode and
> 	V32QImode.
> 	(bdesc_args): Use CODE_FOR_{s,u}m{ax,in}v{32q,16h,8s}i3
> 	instead of CODE_FOR_avx2_{s,u}m{ax,in}v{32q,16h,8s}i3.
> 	* config/i386/sse.md (avx2_<code><mode>3 umaxmin expand): Rename
> 	to...
> 	(<code><mode>3): ... this.
> 	(avx2_<code><mode>3 smaxmin expand): Rename to...
> 	(<code><mode>3): ... this.
> 	(smax<mode>3, smin<mode>3): Macroize using smaxmin code iterator.
> 	(smaxv2di3, sminv2di3): Macroize using smaxmin code iterator and
> 	VI8_AVX2 mode iterator.
> 	(umaxv2di3, uminv2di3): Macroize using umaxmin code iterator and
> 	VI8_AVX2 mode iterator.
> 	(vcond<V_256:mode><VI_256:mode>, vcondu<V_256:mode><VI_256:mode>):
> 	New expanders.

This is OK for mainline SVN.

Thanks,
Uros.