P.S. I suspect that, if you have a collision between two workers in the same
wavefront, that function may go into an infinite loop... something that
should be tested carefully.
On 13 Aug 2015 17:44, "CRV§ADER//KY" <[email protected]> wrote:

> 1. Try replacing the whole atomic function with a trivial *x += y and
> retest the performance, out of curiosity.
>
> 2. Try invoking the function from a trivial kernel that guarantees that
> you never have a collision, e.g.
> GAtomicAdd(x + get_global_id(0), 1.0)
>
> If in the second case you get good performance (compared to the non-atomic
> *x += y), then your collision rate is not <0.001 after all...
>
> Also worth considering: even if collisions were perfectly spread across
> the array, you'll inevitably get a worse collision rate if you buy a video
> card with more workers.
>
> Also, SIMD size doesn't help you. In case of a collision, on NVIDIA you'll
> have 32 workers spend 2 cycles in that while loop. On ATI, the stuck
> workers will be 64.
>
> Finally, I second Jerome in saying that your code is probably improvable.
> See if you can have workgroups do partial sums in local memory, and then do
> a single-pass final sum in global memory at the end.
> On 13 Aug 2015 15:14, "Joe Haywood" <[email protected]> wrote:
>
>> A little clarification. Here is the function I refer to when I say global
>> add; I should have said global atomic add of floats.
>>
>> inline void GAtomicAdd(volatile __global float *source, const float
>> operand) {
>>     union {
>>         unsigned int intVal;
>>         float floatVal;
>>     } newVal;
>>
>>     union {
>>         unsigned int intVal;
>>         float floatVal;
>>     } prevVal;
>>
>>     do {
>>         prevVal.floatVal = *source;
>>         newVal.floatVal = prevVal.floatVal + operand;
>>     } while (atomic_cmpxchg((volatile __global unsigned int *)source,
>>                             prevVal.intVal, newVal.intVal) != prevVal.intVal);
>> }
>>
>> As used in the code:
>> GAtomicAdd(&pdose[indz + indx*NZ + indy*NZ*NCOL], urn);
>>
>> Pdose is a large array stored in global memory. Urn is the value the
>> current work-item has calculated and needs to add to the array at the current
>> item position. Each work-item takes a random path, and a race condition is
>> highly unlikely given the large number of cells in Pdose (millions) compared
>> to the number of work-items (thousands), but I have to include the atomic for
>> those times it does occur. This is the function the AMD profiler says all the
>> time is spent executing.
>>
>>
>> -------- Original message --------
>> From: CRV§ADER//KY <[email protected]>
>> Date: 08/13/2015 3:44 AM (GMT-05:00)
>> To: Joe Haywood <[email protected]>
>> Cc: Pyopencl <[email protected]>
>> Subject: Re: [PyOpenCL] Opinions
>>
>> Excuse my confusion, but what do you mean by global addition? c = a + b,
>> where a, b, and c are vectors of single-precision floats in shared memory?
>> Or is it double precision?
>> On 12 Aug 2015 22:15, "Joe Haywood" <[email protected]> wrote:
>>
>>> I apologize in advance for asking the following questions, because they
>>> are not directly related to pyopencl. Also, I realize opinions can be very
>>> diverse, but I think you all might be able to help me. I am planning on
>>> purchasing a new laptop to have for programming at home. I am currently
>>> using a workstation with an NVIDIA 780 Ti while at work. I have been able
>>> to get my pyopencl code to run at nearly the same speed as my CUDA code on
>>> this hardware. I have tried running the pyopencl code on an AMD FirePro
>>> V4800 and see serious speed degradation. According to the AMD profiler, the
>>> bottleneck is the global add. A few websites suggest that utilizing
>>> float4s would increase the speed, but programming float4s in this
>>> embarrassingly parallel Monte Carlo code is impractical due to branching.
>>>
>>> Further investigation using the old CompuBench website (early 2014 or so)
>>> confirmed that global addition on anything except NVIDIA was very slow. That
>>> was nearly 2 years ago, and the CompuBench website no longer lists global add
>>> as an evaluation. So, in your experience, is it still the case that
>>> anything except NVIDIA will be slow at global additions? Or have AMD and
>>> Intel "caught up"? I cannot find any laptops spec'd exactly the way I want,
>>> but the 2015 MacBook Pro is close. I just don't want to buy one, run the
>>> code, and see it also suffer a terrible loss of speed.
>>>
>>> Finally, I noticed on the CompuBench website that the NVIDIA GTX 980M is
>>> equal to or better than the GTX 780 Ti in nearly all tests. If you have this
>>> hardware, can you confirm this with your own code? I can run some tests on
>>> my computer if someone has a 980M and would be willing to give me numbers.
>>>
>>>
>>>
>>> Again, I apologize for being off topic (private messages might be best),
>>> and I appreciate your help.
>>>
>>>
>>>
>>> Thanks
>>>
>>> Reese
>>>
>>>
>>>
>>> Joe Reese Haywood, Ph.D., DABR
>>>
>>> Medical Physicist
>>>
>>> Johnson Family Cancer Center
>>>
>>> Mercy Health Muskegon
>>>
>>> 1440 E. Sherman Blvd, Suite 300
>>>
>>> Muskegon, MI 49444
>>>
>>> Phone: 231-672-2019
>>>
>>> Email: [email protected]
>>>
>>>
>>>
>>> Confidentiality Notice:
>>> This e-mail, including any attachments is the property of Trinity Health
>>> and is intended for the sole use of the intended recipient(s). It may
>>> contain information that is privileged and confidential.  Any unauthorized
>>> review, use, disclosure, or distribution is prohibited. If you are not the
>>> intended recipient, please delete this message, and reply to the sender
>>> regarding the error in a separate email.
>>>
>>> _______________________________________________
>>> PyOpenCL mailing list
>>> [email protected]
>>> http://lists.tiker.net/listinfo/pyopencl
>>>
>>>
>