On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
As a follow up patch will show, the nvptx BE falls back to using
vector_length = 32 when a vector loop is nested inside a worker loop.

I disabled the fallback, and analyzed the vred2d-128.c illegal memory access execution failure.

I minimized that down to this ptx:
...
.shared .align 8 .u8 __oacc_bcast[176];

{
  {
    .reg .u32 %x;
    mov.u32 %x,%tid.x;
    setp.ne.u32 %r86,%x,0;
  }

  {
    .reg .u32 %tidy;
    .reg .u64 %t_bcast;
    .reg .u64 %y64;
    mov.u32 %tidy,%tid.y;
    cvt.u64.u32 %y64,%tidy;
    add.u64 %y64,%y64,1;
    cvta.shared.u64 %t_bcast,__oacc_bcast;
    mad.lo.u64 %r66,%y64,88,%t_bcast;
  }

  @ %r86 bra $L28;
  st.u32 [%r66+80],0;
 $L28:
  ret;
}
...

The ptx is called with 2 workers and 128 vector_length.

So, 2 workers mean %tid.y has values 0 and 1.
Then %y64 has values 1 and 2.
Then %r66 has values __oacc_bcast + (1 * 88) and __oacc_bcast + (2 * 88).
Then the st.u32 accesss __oacc_bcast + (1 * 88) + 80 and __oacc_bcast + (2 * 88) + 80.

So we're accessing memory at location 256, while the __oacc_bcast is only 176 bytes big.

I formulated this assert that AFAIU detects this situation in the compiler:
...
@@ -1125,6 +1125,8 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
   fprintf (file, "\t}\n");
 }

+static int nvptx_mach_max_workers ();
+
 /* Emit code to initialize OpenACC worker broadcast and synchronization
    registers.  */

@@ -1148,6 +1150,7 @@ nvptx_init_oacc_workers (FILE *file)
               "// vector broadcast offset\n",
               REGNO (cfun->machine->bcast_partition),
               oacc_bcast_partition);
+ gcc_assert (oacc_bcast_partition * (nvptx_mach_max_workers () + 1) <= oacc_bcast_size);
     }
   if (cfun->machine->sync_bar)
     fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; "
...

The assert is not triggered when the fallback is used.

Thanks,
- Tom

Reply via email to