On Fri, Aug 23, 2013 at 01:28:10PM +0400, Michael V. Zolotukhin wrote:
> Sure, I used '#pragma omp target' just for a simple example.  The
> question about '#pragma omp target data' is still open.  As far as I
> understand, all three of the pragmas could require data marshalling (but
> not necessarily - an 'omp target' located inside an 'omp target data'
> which specifies all the variables needed for the 'omp target' won't need
> any data marshalling - right?).  This data movement could be done, as you
> noted, by a single call or by a set of calls (one for each clause) - and
> while a single call seems appealing, it could be better to use separate
> calls in cases when, e.g., we have a 'target update' for only a subset of
> the variables described in 'target data'.  The one-call approach has
> difficulties with specifying which subset of the data we want to
> update.
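
For reference, the kind of nesting described above would look roughly like
this (illustrative only; the arrays and sizes are made up):

  double a[1024], b[1024];
  #pragma omp target data map(to: a) map(tofrom: b)
  {
    /* No extra data marshalling needed here: a and b are already mapped
       by the enclosing 'target data'.  */
    #pragma omp target
    for (int i = 0; i < 1024; i++)
      b[i] += a[i];

    /* Refresh only b on the host, i.e. a subset of what the enclosing
       'target data' describes.  */
    #pragma omp target update from(b)
  }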

The single call approach would just be passed an array of control structures
that would describe each of the MAP clauses, and you'd simply loop over
them; the standard requires that if some object is already mapped to the
device, then nothing is performed (well, target update is an exception).
So you'd have some data structure that maps [address,address+length)
intervals for each device into target addresses and start with doing a
lookup on that (note, if [addr,addr+6) -> devaddr is already in, then addr+4
should just be mapped to devaddr+4).  If not found, you'd allocate the
memory on the device, add it into the mapping data structure and record it
in a vector for the target/target data in question, so that on exit from
that construct it would be copied back if requested, and deallocated.  I
don't see how a single call vs. multiple calls would change anything on
that.  Plus a single call allows the device id to be looked up just once
and all the mappings to be treated as a batch (GOMP_target_data would
probably need a corresponding GOMP_target_data_end call, perhaps just with
a device_id, which would pop from the stack of pending target data regions
for that device).
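
To make the single call approach concrete, here is a rough sketch (not
actual libgomp code; the structure layout and all helper names below are
invented):

  /* One control structure per MAP clause.  */
  struct map_clause
  {
    void *host_addr;     /* start of the host object */
    size_t length;       /* size of the mapped region */
    unsigned char kind;  /* alloc / to / from / tofrom */
  };

  void
  GOMP_target_data (int device_id, struct map_clause *clauses, size_t num)
  {
    struct device *dev = resolve_device (device_id);  /* hypothetical */
    for (size_t i = 0; i < num; i++)
      {
        struct map_clause *c = &clauses[i];
        /* Look up [host_addr, host_addr+length) in the per-device
           mapping structure (e.g. a splay tree of intervals).  */
        void *devaddr = lookup_mapping (dev, c->host_addr, c->length);
        if (devaddr == NULL)
          {
            /* Not mapped yet: allocate on the device, record the mapping,
               and remember it so the matching GOMP_target_data_end can
               copy back (if requested) and deallocate.  */
            devaddr = device_alloc (dev, c->length);
            insert_mapping (dev, c->host_addr, c->length, devaddr);
            record_for_unmap (dev, c, devaddr);
            if (c->kind == MAP_TO || c->kind == MAP_TOFROM)
              device_copy_to (dev, devaddr, c->host_addr, c->length);
          }
        /* Already mapped: the standard says nothing is done here.  */
      }
  }
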
> > I'd prefer GOMP_target instead of GOMP_offload for the function name, to
> > make it clearly related to the corresponding directive.
> That makes sense - I just used first name that came to my mind here.
> 
> > > GOMP_offload is a call to libgomp, which will be implemented somehow like 
> > > this:
> > >   void GOMP_offload (void (*fn)(void*), void *data, const char *fname)
> > >   {
> > >     if (gomp_offload_available ())
> > 
> > This really isn't just a check whether an accelerator is available; we
> > need to query all accelerators in the system (and cache that somehow in
> > the library), assign device numbers to individual devices (say, you could
> > have two Intel MIC cards, one AMD HSAIL capable GPGPU and 4 Nvidia PTX
> > capable GPGPUs or similar), ensure that already assigned device numbers
> > aren't reused when discovering new ones, and then just check what device
> > the user requested (if not available, fall back to host), next check
> > whether we have corresponding code for that accelerator (again, fall back
> > to host otherwise), optionally compile the code if not compiled yet
> > (HSAIL/PTX code only), then finally do the name lookup and spawn it.
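
A rough sketch of that dispatch (all helper names below are invented, just
to show the flow with host fallback):

  void
  GOMP_target (int device_id, void (*host_fn) (void *), const char *fn_name,
               void *data)
  {
    struct device *dev = resolve_device (device_id);  /* may be NULL */
    if (dev == NULL || !device_has_code (dev, fn_name))
      {
        /* Requested device not available, or no code was generated for
           it: fall back to host execution.  */
        host_fn (data);
        return;
      }
    /* For HSAIL/PTX this would also be the place to compile the code
       if that hasn't happened yet.  */
    void *target_fn = device_lookup_symbol (dev, fn_name);  /* name lookup */
    device_run (dev, target_fn, data);                      /* spawn it */
  }
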
> Multi-target support raises another bunch of questions :)  Could you
> please check whether my vision of how GCC would handle multiple offload
> targets is correct?  Here it is:
> We have GCC with a set of plugins for compiling code for each available
> offloading target.  These plugins work similarly to lto-plugin, i.e.
> they consume gimple as the input, but produce code for the specific
> target.  Libgomp also has a similar set of plugins for HW-specific
> implementations of functions for remotely running code, transferring
> data, getting device status, etc.
> For example, for Intel MIC, AMD HSAIL and Nvidia PTX we'll have host-GCC
> with three plugins and host-libgomp, also with three plugins.
> Invoking GCC with options like '-mmic', '-mhsail', '-mptx' triggers
> usage of the corresponding plugins in the GCC driver.  As a result, after
> the compilation we'd have four binaries: one for host and three for the
> possible targets.

I meant just a single plugin that would handle all of them, or, as richi
said, perhaps teach the LTO plugin to do that.
For options, my vision was something like:
-ftarget=mic -ftarget=hsail='-mfoobaz=4 -mbazbaz'
which would mean:
1) compile LTO IL from the accelerator section for mic with
   the originally recorded gcc command line options with the Target options
   removed and no extra options added
2) compile LTO IL also for the hsail target, with the originally recorded
   gcc command line options, but with Target options removed and
   -mfoobaz=4 -mbazbaz options added
3) don't compile for ptx
The thing is, if you originally compile with
-O3 -ftree-vectorize -march=corei7-avx -minline-all-stringops
the -m* options might not apply to the target compiler at all.
So you'd construct the command line from the original command line sans
CL_TARGET options, and append to that the link-time override for the
accelerator.  Then another question is how to find the corresponding
compiler (including its driver) for the target from the plugin.
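
For the command line construction part, something like this sketch, i.e.
filtering the saved options on the CL_TARGET flag (the saved_options array
and the append_* helpers here are invented, just to show the idea):

  /* Build the command line for the accelerator compiler: take the
     originally recorded options, drop target-specific ones (e.g.
     -march=corei7-avx, -minline-all-stringops), then append the
     link-time override given via -ftarget=<target>='...'.  */
  for (size_t i = 0; i < num_saved_options; i++)
    {
      if (cl_options[saved_options[i].opt_index].flags & CL_TARGET)
        continue;
      append_decoded_option (&cmdline, &saved_options[i]);   /* hypothetical */
    }
  append_option_string (&cmdline, target_override_options);  /* hypothetical */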

> Now, the libgomp part.  The host binary contains calls to libgomp.so, which
> is target-independent (i.e. it's host-specific).  It should be able to
> call plugins for all three targets, so in functions like
> gomp_offload_available it would probably iterate through all available
> plugins, asking for device status and code availability.  This
> iteration would include dlopen'ing the corresponding plugin, calling a
> function from it and moving on to the next plugin.
> Is this correct?

libgomp would start by trying to dlopen all available plugins,
and for each of them call some routine that would query the hw
for available devices; then libgomp would assign device ids to them (0 and
up) and then, for the target specific parts, just dispatch again to the
plugin corresponding to the chosen device.
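
I.e. roughly (a sketch only; the plugin file names, the entry point name
and the register/assign helpers are all made up here):

  #include <dlfcn.h>

  struct plugin
  {
    void *handle;
    int (*get_num_devices) (void);
    /* ... further hooks: alloc, copy to/from device, run, ...  */
  };

  static void
  gomp_load_plugins (void)
  {
    static const char *const names[]
      = { "libgomp-plugin-mic.so", "libgomp-plugin-hsail.so",
          "libgomp-plugin-ptx.so" };
    int next_device_id = 0;
    for (size_t i = 0; i < sizeof (names) / sizeof (names[0]); i++)
      {
        void *h = dlopen (names[i], RTLD_LAZY);
        if (h == NULL)
          continue;  /* plugin not installed, skip that target */
        struct plugin *p = register_plugin (h);  /* hypothetical */
        p->handle = h;
        p->get_num_devices
          = (int (*) (void)) dlsym (h, "GOMP_OFFLOAD_get_num_devices");
        /* Assign consecutive device ids (0 and up) to each device the
           plugin reports; later GOMP_* calls dispatch by that id.  */
        int n = p->get_num_devices ();
        for (int d = 0; d < n; d++)
          assign_device_id (p, d, next_device_id++);  /* hypothetical */
      }
  }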

        Jakub
