ABataev added a comment.

In D99432#2738859 <https://reviews.llvm.org/D99432#2738859>, @estewart08 wrote:

> In D99432#2736981 <https://reviews.llvm.org/D99432#2736981>, @ABataev wrote:
>
>> In D99432#2736970 <https://reviews.llvm.org/D99432#2736970>, @estewart08 
>> wrote:
>>
>>> In D99432#2728788 <https://reviews.llvm.org/D99432#2728788>, @ABataev wrote:
>>>
>>>> In D99432#2726997 <https://reviews.llvm.org/D99432#2726997>, @estewart08 
>>>> wrote:
>>>>
>>>>> In D99432#2726845 <https://reviews.llvm.org/D99432#2726845>, @ABataev 
>>>>> wrote:
>>>>>
>>>>>> In D99432#2726588 <https://reviews.llvm.org/D99432#2726588>, @estewart08 
>>>>>> wrote:
>>>>>>
>>>>>>> In D99432#2726391 <https://reviews.llvm.org/D99432#2726391>, @ABataev 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> In D99432#2726337 <https://reviews.llvm.org/D99432#2726337>, 
>>>>>>>> @estewart08 wrote:
>>>>>>>>
>>>>>>>>> In D99432#2726060 <https://reviews.llvm.org/D99432#2726060>, @ABataev 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> In D99432#2726050 <https://reviews.llvm.org/D99432#2726050>, 
>>>>>>>>>> @estewart08 wrote:
>>>>>>>>>>
>>>>>>>>>>> In D99432#2726025 <https://reviews.llvm.org/D99432#2726025>, 
>>>>>>>>>>> @ABataev wrote:
>>>>>>>>>>>
>>>>>>>>>>>> In D99432#2726019 <https://reviews.llvm.org/D99432#2726019>, 
>>>>>>>>>>>> @estewart08 wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> In reference to https://bugs.llvm.org/show_bug.cgi?id=48851, I do 
>>>>>>>>>>>>> not see how this helps SPMD mode with team privatization of 
>>>>>>>>>>>>> declarations in-between target teams and parallel regions.
>>>>>>>>>>>>
>>>>>>>>>>>> Did you try the reproducer with the applied patch?
>>>>>>>>>>>
>>>>>>>>>>> Yes, I still saw the test fail, although it was not with the latest 
>>>>>>>>>>> llvm-project. Are you saying the reproducer passes for you?
>>>>>>>>>>
>>>>>>>>>> I don't have CUDA installed, but from what I see in the LLVM IR it 
>>>>>>>>>> should pass. Do you have a debug log? Does it crash or produce 
>>>>>>>>>> incorrect results?
>>>>>>>>>
>>>>>>>>> This is on an AMDGPU but I assume the behavior would be similar for 
>>>>>>>>> NVPTX.
>>>>>>>>>
>>>>>>>>> It produces incorrect/incomplete results in the dist[0] index after a 
>>>>>>>>> manual reduction, and in turn the final global gpu_results array is 
>>>>>>>>> incorrect.
>>>>>>>>> When thread 0 does the reduction into dist[0], it has no knowledge of 
>>>>>>>>> dist[1] having been updated by thread 1, which tells me the array is 
>>>>>>>>> still thread-private.
>>>>>>>>> Adding some printfs and looking at one team's output:
>>>>>>>>>
>>>>>>>>> SPMD
>>>>>>>>>
>>>>>>>>>   Thread 0: dist[0]: 1
>>>>>>>>>   Thread 0: dist[1]: 0  // This should be 1
>>>>>>>>>   After reduction into dist[0]: 1  // This should be 2
>>>>>>>>>   gpu_results = [1,1]  // [2,2] expected
>>>>>>>>>
>>>>>>>>> Generic Mode:
>>>>>>>>>
>>>>>>>>>   Thread 0: dist[0]: 1
>>>>>>>>>   Thread 0: dist[1]: 1   
>>>>>>>>>   After reduction into dist[0]: 2
>>>>>>>>>   gpu_results = [2,2]
>>>>>>>>
>>>>>>>> Hmm, I would expect a crash if the array were allocated in local 
>>>>>>>> memory. Could you add some more printfs (with the data and the 
>>>>>>>> addresses of the array) to check the results? Maybe there is a data 
>>>>>>>> race somewhere in the code?
>>>>>>>
>>>>>>> As a reminder, each thread updates a unique index in the dist array and 
>>>>>>> each team updates a unique index in gpu_results.
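>>>>>>>
>>>>>>> The shape of the code is roughly the following (a simplified sketch with 
>>>>>>> hypothetical names and sizes, NTEAMS/NTHREADS, not the actual reproducer 
>>>>>>> source); dist is declared between the target teams and parallel 
>>>>>>> directives, so per the OpenMP rules it should be shared by the threads of 
>>>>>>> each team:
>>>>>>>
>>>>>>>   // Simplified sketch (hypothetical names), not the actual reproducer.
>>>>>>>   #include <omp.h>
>>>>>>>   #include <stdio.h>
>>>>>>>   #define NTEAMS 2
>>>>>>>   #define NTHREADS 2
>>>>>>>
>>>>>>>   int main(void) {
>>>>>>>     int gpu_results[NTEAMS] = {0};
>>>>>>>     #pragma omp target teams num_teams(NTEAMS) thread_limit(NTHREADS) map(tofrom: gpu_results[0:NTEAMS])
>>>>>>>     {
>>>>>>>       int dist[NTHREADS] = {0};          // declared between "teams" and "parallel": team-local, not thread-private
>>>>>>>       #pragma omp parallel num_threads(NTHREADS)
>>>>>>>       dist[omp_get_thread_num()] = 1;    // each thread writes its own index
>>>>>>>
>>>>>>>       int sum = 0;                       // manual reduction by the team's initial thread
>>>>>>>       for (int i = 0; i < NTHREADS; ++i)
>>>>>>>         sum += dist[i];
>>>>>>>       gpu_results[omp_get_team_num()] = sum;  // each team writes its own index
>>>>>>>     }
>>>>>>>     printf("%d %d\n", gpu_results[0], gpu_results[1]);  // expected: 2 2
>>>>>>>     return 0;
>>>>>>>   }
>>>>>>>
>>>>>>> With dist shared per team, each element of gpu_results should come out as 
>>>>>>> 2; thread-private copies of dist give 1 instead.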
>>>>>>>
>>>>>>> SPMD - shows that each thread has a unique address for the dist array
>>>>>>>
>>>>>>>   Team 0 Thread 1: dist[0]: 0, 0x7f92e24a8bf8
>>>>>>>   Team 0 Thread 1: dist[1]: 1, 0x7f92e24a8bfc
>>>>>>>   
>>>>>>>   Team 0 Thread 0: dist[0]: 1, 0x7f92e24a8bf0
>>>>>>>   Team 0 Thread 0: dist[1]: 0, 0x7f92e24a8bf4
>>>>>>>   
>>>>>>>   Team 0 Thread 0: After reduction into dist[0]: 1
>>>>>>>   Team 0 Thread 0: gpu_results address: 0x7f92a5000000
>>>>>>>   --------------------------------------------------
>>>>>>>   Team 1 Thread 1: dist[0]: 0, 0x7f92f9ec5188
>>>>>>>   Team 1 Thread 1: dist[1]: 1, 0x7f92f9ec518c
>>>>>>>   
>>>>>>>   Team 1 Thread 0: dist[0]: 1, 0x7f92f9ec5180
>>>>>>>   Team 1 Thread 0: dist[1]: 0, 0x7f92f9ec5184
>>>>>>>   
>>>>>>>   Team 1 Thread 0: After reduction into dist[0]: 1
>>>>>>>   Team 1 Thread 0: gpu_results address: 0x7f92a5000000
>>>>>>>   
>>>>>>>   gpu_results[0]: 1
>>>>>>>   gpu_results[1]: 1
>>>>>>>
>>>>>>> Generic - shows that the threads of each team share the same dist array address
>>>>>>>
>>>>>>>   Team 0 Thread 1: dist[0]: 1, 0x7fac01938880
>>>>>>>   Team 0 Thread 1: dist[1]: 1, 0x7fac01938884
>>>>>>>   
>>>>>>>   Team 0 Thread 0: dist[0]: 1, 0x7fac01938880
>>>>>>>   Team 0 Thread 0: dist[1]: 1, 0x7fac01938884
>>>>>>>   
>>>>>>>   Team 0 Thread 0: After reduction into dist[0]: 2
>>>>>>>   Team 0 Thread 0: gpu_results address: 0x7fabc5000000
>>>>>>>   --------------------------------------------------
>>>>>>>   Team 1 Thread 1: dist[0]: 1, 0x7fac19354e10
>>>>>>>   Team 1 Thread 1: dist[1]: 1, 0x7fac19354e14
>>>>>>>   
>>>>>>>   Team 1 Thread 0: dist[0]: 1, 0x7fac19354e10
>>>>>>>   Team 1 Thread 0: dist[1]: 1, 0x7fac19354e14
>>>>>>>   
>>>>>>>   Team 1 Thread 0: After reduction into dist[0]: 2
>>>>>>>   Team 1 Thread 0: gpu_results address: 0x7fabc5000000
>>>>>>
>>>>>> Could you check if it works with the 
>>>>>> `-fno-openmp-cuda-parallel-target-regions` option?
>>>>>
>>>>> Unfortunately, that crashes:
>>>>>
>>>>>   llvm-project/llvm/lib/IR/Instructions.cpp:495: void 
>>>>>   llvm::CallInst::init(llvm::FunctionType*, llvm::Value*, 
>>>>>   llvm::ArrayRef<llvm::Value*>, 
>>>>>   llvm::ArrayRef<llvm::OperandBundleDefT<llvm::Value*> >, const 
>>>>>   llvm::Twine&): Assertion `(i >= FTy->getNumParams() || 
>>>>>   FTy->getParamType(i) == Args[i]->getType()) && "Calling a function with a 
>>>>>   bad signature!"' failed.
>>>>
>>>> Hmm, could you provide a full stack trace?
>>>
>>> At this point I am not sure I want to dig into that crash, as our 
>>> LLVM branch is not caught up to trunk.
>>>
>>> I did build trunk and ran some tests on an sm_70:
>>> - Without this patch: the code fails with incomplete results.
>>> - Without this patch and with -fno-openmp-cuda-parallel-target-regions: the 
>>> code fails with incomplete results.
>>>
>>> - With this patch: the code fails with incomplete results (thread-private array):
>>>   Team 0 Thread 1: dist[0]: 0, 0x7c1e800000a8
>>>   Team 0 Thread 1: dist[1]: 1, 0x7c1e800000ac
>>>
>>>   Team 0 Thread 0: dist[0]: 1, 0x7c1e800000a0
>>>   Team 0 Thread 0: dist[1]: 0, 0x7c1e800000a4
>>>
>>>   Team 0 Thread 0: After reduction into dist[0]: 1
>>>   Team 0 Thread 0: gpu_results address: 0x7c1ebc800000
>>>
>>>   Team 1 Thread 1: dist[0]: 0, 0x7c1e816f27c8
>>>   Team 1 Thread 1: dist[1]: 1, 0x7c1e816f27cc
>>>
>>>   Team 1 Thread 0: dist[0]: 1, 0x7c1e816f27c0
>>>   Team 1 Thread 0: dist[1]: 0, 0x7c1e816f27c4
>>>
>>>   Team 1 Thread 0: After reduction into dist[0]: 1
>>>   Team 1 Thread 0: gpu_results address: 0x7c1ebc800000
>>>
>>>   gpu_results[0]: 1
>>>   gpu_results[1]: 1
>>>   FAIL
>>>
>>> - With this patch and with -fno-openmp-cuda-parallel-target-regions: the code passes:
>>>   Team 0 Thread 1: dist[0]: 1, 0x7a5b56000018
>>>   Team 0 Thread 1: dist[1]: 1, 0x7a5b5600001c
>>>
>>>   Team 0 Thread 0: dist[0]: 1, 0x7a5b56000018
>>>   Team 0 Thread 0: dist[1]: 1, 0x7a5b5600001c
>>>
>>>   Team 0 Thread 0: After reduction into dist[0]: 2
>>>   Team 0 Thread 0: gpu_results address: 0x7a5afc800000
>>>
>>>   Team 1 Thread 1: dist[0]: 1, 0x7a5b56000018
>>>   Team 1 Thread 1: dist[1]: 1, 0x7a5b5600001c
>>>
>>>   Team 1 Thread 0: dist[0]: 1, 0x7a5b56000018
>>>   Team 1 Thread 0: dist[1]: 1, 0x7a5b5600001c
>>>
>>>   Team 1 Thread 0: After reduction into dist[0]: 2
>>>   Team 1 Thread 0: gpu_results address: 0x7a5afc800000
>>>
>>>   gpu_results[0]: 2
>>>   gpu_results[1]: 2
>>>   PASS
>>>
>>> I am concerned about team 0 and team 1 having the same address for the dist 
>>> array here.
>>
>> It is caused by a problem in the runtime. It should work with the 
>> `-fno-openmp-cuda-parallel-target-regions` option, I think (it uses a 
>> different runtime function for this case), and I just wanted to check that it 
>> really works. It looks like the runtime currently allocates a unique array for 
>> each thread.
>
> Unfortunately, this patch plus the flag does not work for the larger reproducer: 
> the CPU check passes, but the GPU check fails with incorrect results.
> https://github.com/zjin-lcf/oneAPI-DirectProgramming/blob/master/all-pairs-distance-omp/main.cpp

OK, I'll drop this patch and prepare a more conservative one, where the 
kernel will be executed in non-SPMD mode instead. Later it can be improved by 
LLVM optimizations.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99432/new/

https://reviews.llvm.org/D99432
