1. Try replacing the whole atomic function with a trivial *x += y and retest the performance, out of curiosity.
2. Try invoking the function from a trivial kernel that guarantees you never have a collision, e.g. GAtomicAdd(x + get_global_id(0), 1.0). If in the second case you get good performance (compared to the non-atomic *x += y), then your collision rate is not < 0.001 after all...

Also worth considering: even if collisions were perfectly spread across the array, you'll inevitably get a worse collision rate if you buy a video card with more workers. SIMD size doesn't help you either: in case of a collision, on NVidia you'll have 32 workers spend 2 cycles in that while loop; on ATI, the stuck workers will be 64.

Finally, I second Jerome in saying that your code is probably improvable. See if you can have work-groups do partial sums in local memory, and then do a single-pass final sum in global memory at the end?

On 13 Aug 2015 15:14, "Joe Haywood" <[email protected]> wrote:

> A little clarification. Here is the function I refer to when I say global
> add; I should have said "global atomic add of floats".
>
>     inline void GAtomicAdd(volatile __global float *source,
>                            const float operand) {
>         union {
>             unsigned int intVal;
>             float floatVal;
>         } newVal;
>
>         union {
>             unsigned int intVal;
>             float floatVal;
>         } prevVal;
>
>         do {
>             prevVal.floatVal = *source;
>             newVal.floatVal = prevVal.floatVal + operand;
>         } while (atomic_cmpxchg((volatile __global unsigned int *)source,
>                                 prevVal.intVal,
>                                 newVal.intVal) != prevVal.intVal);
>     }
>
> As used in the code:
>
>     GAtomicAdd(&pdose[indz + indx*NZ + indy*NZ*NCOL], urn);
>
> pdose is a large array stored in global memory. urn is the value the
> current work-item has calculated and needs to add to the array at the
> current item's position. Each work-item takes a random path, and given
> the large number of cells in pdose (millions) compared to the number of
> work-items (thousands), it is highly unlikely that a race condition
> occurs, but I have to include the atomic for those times it does. This
> is the function the AMD profiler says all the time is spent executing.
> Sent from my Samsung Galaxy Tab® S
>
> -------- Original message --------
> From: CRV§ADER//KY <[email protected]>
> Date: 08/13/2015 3:44 AM (GMT-05:00)
> To: Joe Haywood <[email protected]>
> Cc: Pyopencl <[email protected]>
> Subject: Re: [PyOpenCL] Opinions
>
> Excuse my confusion, but what do you mean by "global addition"? c = a + b,
> where a, b, and c are vectors of single-precision floats in shared memory?
> Or is it double precision?
>
> On 12 Aug 2015 22:15, "Joe Haywood" <[email protected]> wrote:
>
>> I apologize in advance for asking the following questions, because they
>> are not directly related to pyopencl. Also, I realize opinions can be
>> very diverse, but I think you all might be able to help me. I am
>> planning on purchasing a new laptop to have for programming at home. I
>> am currently using a workstation with an NVIDIA 780 Ti while at work. I
>> have been able to get my pyopencl code to run at nearly the same speed
>> as my CUDA code on this hardware. I have tried running the pyopencl
>> code on an AMD FirePro V4800 and see serious speed degradation.
>> According to the AMD profiler, the bottleneck is the global add. Also,
>> a few websites suggest that utilizing float4s would increase the speed,
>> but programming float4s in this embarrassingly parallel Monte Carlo
>> code is impractical due to branching. Further investigation using the
>> old CompuBench website (early 2014 ish) confirmed that global addition
>> on anything except NVIDIA was very slow. That was nearly 2 years ago.
>> The CompuBench website no longer lists global add as an evaluation. So,
>> in your experience, is this still the case, that anything except NVIDIA
>> will be slow at global additions? Or have AMD and Intel "caught up"? I
>> cannot find any laptops spec'd exactly the way I want, but the 2015
>> MacBook Pro is close. I just don't want to buy one, run the code, and
>> see it also suffer a terrible loss of speed.
>> Finally, I noticed on the CompuBench website that the NVIDIA GTX 980M
>> is equal to or better than the GTX 780 Ti in nearly all tests. If you
>> have this hardware, can you confirm this with your own code? I can run
>> some tests on my computer if someone with a 980M would be willing to
>> give me numbers on it.
>>
>> Again, I apologize for being off topic (private messages might be
>> best), and I appreciate your help.
>>
>> Thanks,
>>
>> Reese
>>
>> Joe Reese Haywood, Ph.D., DABR
>> Medical Physicist
>> Johnson Family Cancer Center
>> Mercy Health Muskegon
>> 1440 E. Sherman Blvd, Suite 300
>> Muskegon, MI 49444
>> Phone: 231-672-2019
>> Email: [email protected]
>>
>> Confidentiality Notice:
>> This e-mail, including any attachments, is the property of Trinity
>> Health and is intended for the sole use of the intended recipient(s).
>> It may contain information that is privileged and confidential. Any
>> unauthorized review, use, disclosure, or distribution is prohibited.
>> If you are not the intended recipient, please delete this message and
>> reply to the sender regarding the error in a separate email.
_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl
