1. Try replacing the whole atomic function with a trivial *x += y and retest
the performance, out of curiosity.

2. Try invoking the function from a trivial kernel that guarantees you
never have a collision, e.g.
GAtomicAdd(x + get_global_id(0), 1.0f)

If in the second case you get good performance (compared to the non-atomic
*x += y), then your collision rate is not < 0.001 after all...

Also worth considering: even if collisions were spread perfectly evenly,
you'll inevitably get a worse collision rate if you buy a video card with
more workers.

Also, SIMD size doesn't help you. In case of collision, on NVidia you'll
have 32 workers spend 2 cycles in that while loop; on ATI, the stuck
workers will be 64.

Finally, I second Jerome in saying that your code is probably improvable.
See if you can have workgroups do partial sums in local memory, and then do
a single-pass final sum in global memory at the end.
On 13 Aug 2015 15:14, "Joe Haywood" <[email protected]> wrote:

> A little clarification. Here is the function I refer to when I say global
> add; I should have said global atomic add of floats.
>
> inline void GAtomicAdd(volatile __global float *source, const float operand)
> {
>     union {
>         unsigned int intVal;
>         float floatVal;
>     } newVal;
>
>     union {
>         unsigned int intVal;
>         float floatVal;
>     } prevVal;
>
>     do {
>         prevVal.floatVal = *source;
>         newVal.floatVal = prevVal.floatVal + operand;
>     } while (atomic_cmpxchg((volatile __global unsigned int *)source,
>                             prevVal.intVal, newVal.intVal) != prevVal.intVal);
> }
>
> As used in the code
> GAtomicAdd(&pdose[indz+indx*NZ+indy*NZ*NCOL],urn);
>
> pdose is a large array stored in global memory. urn is the value the
> current work-item has calculated and needs to add to the array at the
> current item's position. Each work-item takes a random path, and it is
> highly unlikely that two work-items collide on the same cell, given the
> large number of cells in pdose (millions) compared to the number of
> work-items (thousands), but I have to include the atomic for those times
> it does occur. This is the function the AMD profiler says all the time is
> spent executing.
>
>
> Sent from my Samsung Galaxy Tab® S
>
>
> -------- Original message --------
> From: CRV§ADER//KY <[email protected]>
> Date: 08/13/2015 3:44 AM (GMT-05:00)
> To: Joe Haywood <[email protected]>
> Cc: Pyopencl <[email protected]>
> Subject: Re: [PyOpenCL] Opinions
>
> Excuse my confusion, but what do you mean by global addition? C = a + b,
> where a, b and c are vectors of single-precision floats in shared
> memory?
> Or is it double precision?
> On 12 Aug 2015 22:15, "Joe Haywood" <[email protected]> wrote:
>
>> I apologize in advance for asking the following questions because they
>> are not directly related to pyopencl. Also, I realize opinions can be very
>> diverse, but I think you all might be able to help me.
>>
>> I am planning on purchasing a new laptop to have for programming at home.
>> I am currently using a workstation with an NVIDIA 780 TI while at work. I
>> have been able to get my pyopencl code to run at nearly the same speed as
>> my CUDA code on this hardware. I have tried running the pyopencl code on
>> an AMD FirePro V4800 and see serious speed degradation. According to the
>> AMD profiler, the bottleneck is the global add. Also, a few websites
>> suggest utilizing float4s would increase the speed, but programming
>> float4s in this embarrassingly parallel Monte Carlo code is impractical
>> due to branching.
>>
>> Further investigation using the old CompuBench website (early 2014 ish)
>> confirmed the global addition on anything except NVIDIA was very slow.
>> That was nearly 2 years ago, and the CompuBench website no longer lists
>> global add as an evaluation. So, in your experience, is this still the
>> case, that anything except NVIDIA will be slow at global additions? Or
>> have AMD and Intel "caught up"? I cannot find any laptops spec'd exactly
>> the way I want, but the 2015 MacBook Pro is close. I just don't want to
>> buy one, run the code, and see it also suffers a terrible loss of speed.
>>
>> Finally, I noticed on the CompuBench website that the NVIDIA GTX 980M is
>> equal or better than the GTX 780 TI in nearly all tests. If you have this
>> hardware, can you confirm this with your own code? I can run some tests
>> on my computer if someone has a 980M they would be willing to give me
>> numbers on.
>>
>>
>>
>> Again, I apologize for being off topic, private messages might be best,
>> and I appreciate your help.
>>
>>
>>
>> Thanks
>>
>> Reese
>>
>>
>>
>> Joe Reese Haywood, Ph.D., DABR
>>
>> Medical Physicist
>>
>> Johnson Family Cancer Center
>>
>> Mercy Health Muskegon
>>
>> 1440 E. Sherman Blvd, Suite 300
>>
>> Muskegon, MI 49444
>>
>> Phone: 231-672-2019
>>
>> Email: [email protected]
>>
>>
>>
>> Confidentiality Notice:
>> This e-mail, including any attachments is the property of Trinity Health
>> and is intended for the sole use of the intended recipient(s). It may
>> contain information that is privileged and confidential.  Any unauthorized
>> review, use, disclosure, or distribution is prohibited. If you are not the
>> intended recipient, please delete this message, and reply to the sender
>> regarding the error in a separate email.
>>
>> _______________________________________________
>> PyOpenCL mailing list
>> [email protected]
>> http://lists.tiker.net/listinfo/pyopencl
>>
>>