On 08/03/2018 08:22 AM, Tom de Vries wrote:
> On 08/01/2018 09:11 PM, Cesar Philippidis wrote:
>> On 08/01/2018 07:12 AM, Tom de Vries wrote:
>>
>>>>>> +              gangs = grids * (blocks / warp_size);
>>>>>
>>>>> So, we launch with gangs == grids * workers ? Is that intentional?
>>>>
>>>> Yes. At least that's what I've been using in og8. Setting num_gangs =
>>>> grids alone caused significant slow downs.
>>>>
>>>
>>> Well, what you're saying here is: increasing num_gangs increases
>>> performance.
>>>
>>> You don't explain why you multiply with workers specifically.
>>
>> I set it that way because I think the occupancy calculator is
>> determining the occupancy of a single multiprocessor unit, rather than
>> the entire GPU. Looking at the og8 code again, I had
>>
>>    num_gangs = 2 * threads_per_sm / warp_size * dev_size
>>
>> which corresponds to
>>
>>    2 * grids * blocks / warp_size
>>
> 
> I've done an experiment using the sample simpleOccupancy. The kernel is
> small, so the blocks returned is the maximum: max_threads_per_block (1024).
> 
> The grids returned is 10, which I tentatively interpret as num_dev *
> (max_threads_per_multi_processor / blocks). [ Where num_dev == 5, and
> max_threads_per_multi_processor == 2048. ]
> 
> Substituting that into the og8 code, and equating
> max_threads_per_multi_processor with threads_per_sm, I indeed get
> 
> num_gangs = 2 * grids * blocks / warp_size.
> 
> So with this extra information I see how you got there.
> 
> But I still see no rationale why blocks is used here, and I wonder
> whether something like num_gangs = grids * 64 would give similar results.

My original intent was to keep the load proportional to the block size.
So, in the case where the block size is limited by shared memory or
register file capacity, the runtime wouldn't excessively over-assign
gangs to the multiprocessor units, whose state would otherwise get
swapped out even more than necessary.

With that said, I could be wrong here. It would be nice if Nvidia
provided us with more insights into their hardware.

> Anyway, given that this is what is used on og8, I'm ok with using that,
> so let's go with:
> ...
>             gangs = 2 * grids * (blocks / warp_size);
> ...
> [ so, including the factor two you explicitly left out from the original
> patch. Unless you see a pressing reason not to include it. ]
> 
> Can you repost after retesting? [ note: the updated patch I posted
> earlier doesn't apply on trunk anymore due to the cuda-lib.def change. ]

Thanks for looking into this. I got bogged down tracking down a problem
with allocatable scalars in Fortran. I'll repost this patch after I've
tested it with an older version of CUDA (probably CUDA 5.5 using the
Nvidia driver 331.113 on a K40).

Cesar