PS: I suspect that, if you get a collision between two workers in the same wavefront, that function may go into an infinite loop... something that should be tested carefully.

On 13 Aug 2015 17:44, "CRV§ADER//KY" <[email protected]> wrote:
> 1. Try replacing the whole atomic function with a trivial *x += y and
> retest the performance, out of curiosity.
>
> 2. Try invoking the function from a trivial kernel that guarantees you
> never have a collision, e.g.
>
>     GAtomicAdd(x + get_global_id(0), 1.0f);
>
> If in the second case you get good performance (compared to the
> non-atomic *x += y), then your collision rate is not < 0.001 after all.
>
> Also worth considering: even if collisions were perfectly spread out,
> you'll inevitably get a worse collision rate if you buy a video card
> with more workers.
>
> SIMD size doesn't help you either. In case of collision, on NVIDIA
> you'll have 32 workers spend 2 cycles in that while loop; on ATI, the
> stuck workers will be 64.
>
> Finally, I second Jerome in saying that your code is probably
> improvable. See if you can have workgroups do partial sums in local
> memory, and then do a single-pass final sum in global memory at the end?
>
> On 13 Aug 2015 15:14, "Joe Haywood" <[email protected]> wrote:
>
>> A little clarification. Here is the function I refer to when I say
>> "global add" -- I should have said "global atomic add of floats":
>>
>>     inline void GAtomicAdd(volatile __global float *source,
>>                            const float operand)
>>     {
>>         union {
>>             unsigned int intVal;
>>             float floatVal;
>>         } newVal;
>>
>>         union {
>>             unsigned int intVal;
>>             float floatVal;
>>         } prevVal;
>>
>>         do {
>>             prevVal.floatVal = *source;
>>             newVal.floatVal = prevVal.floatVal + operand;
>>         } while (atomic_cmpxchg((volatile __global unsigned int *)source,
>>                                 prevVal.intVal, newVal.intVal)
>>                  != prevVal.intVal);
>>     }
>>
>> As used in the code:
>>
>>     GAtomicAdd(&pdose[indz + indx*NZ + indy*NZ*NCOL], urn);
>>
>> pdose is a large array stored in global memory. urn is the value the
>> current work-item has calculated and needs to add to the array at the
>> current item's position.
>> Each work-item takes a random path, and it is highly unlikely that a
>> race condition exists, given the large number of cells in pdose
>> (millions) compared to the number of work-items (thousands), but I
>> have to include it for those times it does occur. This is the function
>> the AMD profiler says all the time is spent executing.
>>
>> -------- Original message --------
>> From: CRV§ADER//KY <[email protected]>
>> Date: 08/13/2015 3:44 AM (GMT-05:00)
>> To: Joe Haywood <[email protected]>
>> Cc: Pyopencl <[email protected]>
>> Subject: Re: [PyOpenCL] Opinions
>>
>> Excuse my confusion, but what do you mean by "global addition"?
>> c = a + b, where a, b, and c are vectors of single-precision floats in
>> shared memory? Or is it double precision?
>>
>> On 12 Aug 2015 22:15, "Joe Haywood" <[email protected]> wrote:
>>
>>> I apologize in advance for asking the following questions, because
>>> they are not directly related to PyOpenCL. Also, I realize opinions
>>> can be very diverse, but I think you all might be able to help me. I
>>> am planning on purchasing a new laptop to have for programming at
>>> home. I am currently using a workstation with an NVIDIA 780 Ti while
>>> at work. I have been able to get my PyOpenCL code to run at nearly
>>> the same speed as my CUDA code on this hardware. I have tried running
>>> the PyOpenCL code on an AMD FirePro V4800 and see serious speed
>>> degradation. According to the AMD profiler, the bottleneck is the
>>> global add. A few websites suggest that using float4s would increase
>>> the speed, but programming float4s into this embarrassingly parallel
>>> Monte Carlo code is impractical due to branching. Further
>>> investigation using the old CompuBench website (early 2014 or so)
>>> confirmed that global addition on anything except NVIDIA was very
>>> slow. That was nearly two years ago, and the CompuBench website no
>>> longer lists global add as an evaluation.
>>> So, in your experience, is this still the case -- that anything
>>> except NVIDIA will be slow at global additions? Or have AMD and Intel
>>> "caught up"? I cannot find any laptops spec'd exactly the way I want,
>>> but the 2015 MacBook Pro is close. I just don't want to buy one, run
>>> the code, and see it also suffer a terrible loss of speed. Finally, I
>>> noticed on the CompuBench website that the NVIDIA GTX 980M is equal
>>> or better than the GTX 780 Ti in nearly all tests. If you have this
>>> hardware, can you confirm this with your own code? I can run some
>>> tests on my computer if someone with a 980M would be willing to give
>>> me numbers.
>>>
>>> Again, I apologize for being off topic (private messages might be
>>> best), and I appreciate your help.
>>>
>>> Thanks,
>>> Reese
>>>
>>> Joe Reese Haywood, Ph.D., DABR
>>> Medical Physicist
>>> Johnson Family Cancer Center
>>> Mercy Health Muskegon
>>> 1440 E. Sherman Blvd, Suite 300
>>> Muskegon, MI 49444
>>> Phone: 231-672-2019
>>> Email: [email protected]
>>>
>>> Confidentiality Notice:
>>> This e-mail, including any attachments, is the property of Trinity
>>> Health and is intended for the sole use of the intended recipient(s).
>>> It may contain information that is privileged and confidential. Any
>>> unauthorized review, use, disclosure, or distribution is prohibited.
>>> If you are not the intended recipient, please delete this message and
>>> reply to the sender regarding the error in a separate email.
>>>
>>> _______________________________________________
>>> PyOpenCL mailing list
>>> [email protected]
>>> http://lists.tiker.net/listinfo/pyopencl
