Ping.

Cesar

On 06/20/2018 02:59 PM, Cesar Philippidis wrote:
> At present, the nvptx libgomp plugin does not take into account the
> amount of shared resources on GPUs (mostly shared-memory and register
> usage) when selecting the default num_gangs and num_workers. In certain
> situations, an OpenACC offloaded function can fail to launch if the GPU
> does not have sufficient shared resources to accommodate all of the
> threads in a CUDA block. This typically manifests when a PTX function
> uses a lot of registers and num_workers is set too large, although it
> can also happen if the shared-memory has been exhausted by the threads
> in a vector.
> 
> This patch resolves that issue by adjusting num_workers based on the amount
> of shared resources used by each thread. If worker parallelism has been
> requested, libgomp will spawn as many workers as possible, up to 32.
> Without this patch, libgomp would always default to launching 32 workers
> when worker parallelism is used.
> 
> Besides the worker parallelism adjustment, this patch also includes some
> heuristics for selecting num_gangs. Before, the plugin would launch two
> gangs per GPU multiprocessor. Now it follows the formula contained in
> the "CUDA Occupancy Calculator" spreadsheet that's distributed with CUDA.
> 
> Is this patch OK for trunk?
> 
> Thanks,
> Cesar
> 
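For reference, here is a rough sketch of how num_workers could be capped from
per-thread register usage via the CUDA driver API. This is not the code from
the patch itself: choose_num_workers, its parameters, and the exact arithmetic
are illustrative assumptions; only the 32-worker default upper bound comes
from the description quoted above.

#include <cuda.h>

/* Illustrative only: cap the number of workers (warps per block) so that
   the block's register demand fits within the device limits.  */
static int
choose_num_workers (CUfunction fn, CUdevice dev, int vector_length)
{
  int regs_per_thread, max_regs_per_block, max_threads_per_block;

  /* Registers used by one thread of the offloaded PTX function.  */
  cuFuncGetAttribute (&regs_per_thread, CU_FUNC_ATTRIBUTE_NUM_REGS, fn);

  /* Per-block hardware limits of the device.  */
  cuDeviceGetAttribute (&max_regs_per_block,
                        CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK, dev);
  cuDeviceGetAttribute (&max_threads_per_block,
                        CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, dev);

  /* Threads that fit without exhausting the register file, then warps.  */
  int threads = max_regs_per_block / (regs_per_thread ? regs_per_thread : 1);
  if (threads > max_threads_per_block)
    threads = max_threads_per_block;

  int workers = threads / vector_length;
  if (workers > 32)   /* Default upper bound mentioned above.  */
    workers = 32;
  return workers > 0 ? workers : 1;
}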

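Similarly, the num_gangs heuristic can be approximated with the driver API's
occupancy helper, which applies essentially the same formula as the occupancy
calculator spreadsheet. This is a sketch under that assumption rather than the
patch's actual computation; threads_per_block and dyn_shared_bytes are
hypothetical parameters supplied by the caller.

/* Illustrative only: pick a default gang count from the occupancy of the
   offloaded function across the whole device.  */
static int
choose_num_gangs (CUfunction fn, CUdevice dev, int threads_per_block,
                  size_t dyn_shared_bytes)
{
  int blocks_per_sm = 0, num_sms = 0;

  /* Resident blocks per multiprocessor for this function and block shape;
     register and shared-memory usage are taken into account internally.  */
  cuOccupancyMaxActiveBlocksPerMultiprocessor (&blocks_per_sm, fn,
                                               threads_per_block,
                                               dyn_shared_bytes);
  cuDeviceGetAttribute (&num_sms,
                        CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);

  /* One gang per resident block across the device; fall back to the old
     default of two gangs per multiprocessor if occupancy comes back zero.  */
  int gangs = blocks_per_sm * num_sms;
  return gangs > 0 ? gangs : 2 * num_sms;
}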