https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90028

            Bug ID: 90028
           Summary: On Intel Skylake (-march=native) generated avx512
                    instruction can be wrong
           Product: gcc
           Version: 8.3.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ferruh.yigit at intel dot com
  Target Milestone: ---

Created attachment 46114
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46114&action=edit
19.05-rc1 default gcc build on skylake

gcc version:
gcc (GCC) 8.3.1 20190223 (Red Hat 8.3.1-2)

binutils:
GNU ld version 2.31.1-24.fc29

This is observed in the DPDK project (https://git.dpdk.org/dpdk/tree/?h=v19.05-rc1)
on an Intel Skylake CPU.

Full build command (-I and -D options removed):
gcc -Wp,-MD,./.rte_kni.o.d.tmp  -m64 -pthread -march=native -W -Wall
-Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations
-Wold-style-definition -Wpointer-arith -Wcast-align -Wnested-externs
-Wcast-qual -Wformat-nonliteral -Wformat-security -Wundef -Wwrite-strings
-Wdeprecated -Werror -Wimplicit-fallthrough=2 -Wno-format-truncation -O3
-fno-strict-aliasing -o rte_kni.o -c
/root/development/dpdk-next-net/lib/librte_kni/rte_kni.c 


When the related code is built with the "-mno-avx512f" flag, the problem goes
away. The output of clang (clang version 7.0.1 (Fedora 7.0.1-6.fc29)) also
works fine.


The suspect is the 'vpgatherqq' instruction usage.

The related .c code is
(https://git.dpdk.org/dpdk/tree/lib/librte_kni/rte_kni.c?h=v19.05-rc1#n546):

"
  static void *
  va2pa(struct rte_mbuf *m)
  {
          return (void *)((unsigned long)m -
                          ((unsigned long)m->buf_addr -
                           (unsigned long)m->buf_iova));
  }

  unsigned
  rte_kni_tx_burst(struct rte_kni *kni, struct rte_mbuf **mbufs, unsigned num)
  {
          void *phy_mbufs[num];
          unsigned int ret;
          unsigned int i;

          for (i = 0; i < num; i++)
                  phy_mbufs[i] = va2pa(mbufs[i]);
....
"

'm->buf_addr' and 'm->buf_iova' are next to each other in the struct, so their
addresses differ by 8 bytes.
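
As a minimal sketch of the layout claim above, the following C fragment uses a
hypothetical, simplified stand-in for struct rte_mbuf. It holds only the two
fields of interest, in the order implied by the gather displacements 0x0 and
0x8, and assumes an LP64 target where pointers are 8 bytes. The names
fake_mbuf and fake_va2pa are illustrative and not part of DPDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf: only the two relevant fields. */
struct fake_mbuf {
        void *buf_addr;     /* virtual address of the buffer, offset 0 */
        uint64_t buf_iova;  /* IO address of the buffer, offset 8 on LP64 */
};

/* Assumption: LP64 target, so the two fields are exactly 8 bytes apart. */
static_assert(offsetof(struct fake_mbuf, buf_iova) -
              offsetof(struct fake_mbuf, buf_addr) == 8,
              "buf_iova is expected 8 bytes after buf_addr");

/* Same arithmetic as va2pa() in the report, using uintptr_t. */
static void *
fake_va2pa(struct fake_mbuf *m)
{
        return (void *)((uintptr_t)m -
                        ((uintptr_t)m->buf_addr - (uintptr_t)m->buf_iova));
}
```

With this layout, a correct vectorized va2pa() has to load one quadword at
displacement 0x0 and one at displacement 0x8 from each mbuf pointer.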

Generated asm code:
avx512 enabled code snippet:

232c:       ba ff ff ff ff          mov    $0xffffffff,%edx                     
2331:       48 c1 e0 05             shl    $0x5,%rax                            
2335:       31 c9                   xor    %ecx,%ecx                            
2337:       c5 f9 92 ca             kmovb  %edx,%k1                             
233b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)                     
2340:       62 f1 fe 28 6f 0c 0e    vmovdqu64 (%rsi,%rcx,1),%ymm1               
2347:       c5 f9 90 d1             kmovb  %k1,%k2                              
*234b:       62 f2 fd 2a 91 04 0d    vpgatherqq 0x1(,%ymm1,1),%ymm0{%k2}        
2352:       01 00 00 00                                                         
2356:       c5 f9 90 d9             kmovb  %k1,%k3                              
235a:       c5 fd d4 c1             vpaddq %ymm1,%ymm0,%ymm0                    
235e:       62 f2 fd 2b 91 14 0d    vpgatherqq 0x0(,%ymm1,1),%ymm2{%k3}         
2365:       00 00 00 00                                                         
2369:       c5 fd fb c2             vpsubq %ymm2,%ymm0,%ymm0                    
236d:       62 d1 fe 28 7f 04 08    vmovdqu64 %ymm0,(%r8,%rcx,1)


Same code with avx512 disabled (avx2), code snippet:

2332:       c5 ed 76 d2             vpcmpeqd %ymm2,%ymm2,%ymm2                  
2336:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)                 
233d:       00 00 00                                                            
2340:       c5 fe 6f 0c 0e          vmovdqu (%rsi,%rcx,1),%ymm1                 
2345:       c5 fd 6f e2             vmovdqa %ymm2,%ymm4                         
2349:       c4 e2 dd 91 04 0d 08    vpgatherqq %ymm4,0x8(,%ymm1,1),%ymm0        
2350:       00 00 00                                                            
2353:       c5 fd 6f ea             vmovdqa %ymm2,%ymm5                         
2357:       c4 e2 d5 91 1c 0d 00    vpgatherqq %ymm5,0x0(,%ymm1,1),%ymm3        
235e:       00 00 00                                                            
2361:       c5 fd d4 c1             vpaddq %ymm1,%ymm0,%ymm0                    
2365:       c5 fd fb c3             vpsubq %ymm3,%ymm0,%ymm0                    
2369:       c4 c1 7e 7f 04 08       vmovdqu %ymm0,(%r8,%rcx,1)


Full asm outputs are attached.

In the avx512 one, for the first 'vpgatherqq' (the line marked with '*'), it
looks like the offset should be 0x8 instead of 0x1.
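
To illustrate why a displacement of 0x1 would give a wrong result, here is a
scalar C model of what a single lane of 'vpgatherqq disp(,%ymm1,1)' reads:
8 bytes at (mbuf pointer + disp). This is only a sketch; gather_lane is a
hypothetical helper, and the mbuf is modelled as two 64-bit words
{buf_addr, buf_iova}.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of one vpgatherqq lane with scale 1 and no base register:
 * load 8 bytes from (index + disp), where index is the mbuf pointer. */
static uint64_t
gather_lane(const void *index, size_t disp)
{
        uint64_t v;

        /* memcpy permits the unaligned read that disp = 1 implies. */
        memcpy(&v, (const unsigned char *)index + disp, sizeof(v));
        return v;
}
```

In this model gather_lane(m, 8) returns the second word (buf_iova), while
gather_lane(m, 1) returns a value assembled from bytes that straddle both
fields, which matches the wrong addresses seen with the avx512 build.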
