Hi Andreas,

I know it's subject to rounding etc., but ReductionKernel often finds a max value at a
different position than both numpy.argmax and thrust::max_element, while the latter two
always agree: http://dev.math.canterbury.ac.nz/home/pub/26/
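To be concrete, the kind of index-comparing ReductionKernel I have in mind looks roughly
like this -- a minimal sketch, not the exact worksheet code, with a made-up size. On ties
(or values that only differ below float32 precision) the winning index depends on the
reduction order rather than being the lowest index, whereas numpy.argmax and
thrust::max_element both return the first maximum, which would explain the differing
positions:

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as drv
    import pycuda.gpuarray as gpuarray
    from pycuda.reduction import ReductionKernel

    # Reduce over indices, comparing the data the indices point at.
    argmax_krnl = ReductionKernel(
        np.uint32,
        neutral="0",
        reduce_expr="(x[a] >= x[b]) ? a : b",
        map_expr="i",
        arguments="float *x")

    x = gpuarray.to_gpu(np.random.rand(10**7).astype(np.float32))
    pos = int(argmax_krnl(x).get())

    # Copy back just the winning element instead of the whole array.
    val = np.empty(1, dtype=x.dtype)
    drv.memcpy_dtoh(val, int(x.gpudata) + pos * x.dtype.itemsize)

    print(pos, val[0], np.argmax(x.get()))   # cross-check against numpy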
Sum reduction seems to be taking about the same time in both:
http://dev.math.canterbury.ac.nz/home/pub/27/
(Check out the wildly different answer from numpy.sum -- a sketch of the sum kernel I'm
timing is at the bottom of this mail.)

Igor

On Thu, May 31, 2012 at 6:59 PM, Andreas Kloeckner <li...@informa.tiker.net> wrote:
> On Thu, 31 May 2012 12:56:15 +1200, Igor <rych...@gmail.com> wrote:
>> I've updated http://dev.math.canterbury.ac.nz/home/pub/26/ with a
>> larger vector, a billion elements.
>>
>> As for returning the value, it's the pair of max value and position we
>> are talking about. Thrust returns the position, and I'm now timing the
>> extraction of the value from the gpu array, which didn't change timing
>> too much.
>>
>> ReductionKernel still appears 5 times slower than thrust.
>>
>> Bryan, on the same worksheet the numpy timing is printed as well:
>> argmax is 3 times slower than ReductionKernel.
>
> I've looked at this for a little while, can't quite make heads or tails
> of it yet. Here's the profiler output:
>
> method=[ reduce_kernel_stage1 ] gputime=[ 20617.984 ] cputime=[ 20647.000 ]
> gridsize=[ 128, 1 ] threadblocksize=[ 512, 1, 1 ] occupancy=[ 1.000 ]
> l1_shared_bank_conflict=[ 672 ] inst_issued=[ 7906011 ]
> method=[ reduce_kernel_stage2 ] gputime=[ 9.696 ] cputime=[ 29.000 ]
> gridsize=[ 1, 1 ] threadblocksize=[ 512, 1, 1 ] occupancy=[ 0.333 ]
> l1_shared_bank_conflict=[ 96 ]
>
> method=[ _ZN6thrust<snip> ] gputime=[ 3556.736 ] cputime=[ 3583.000 ]
> gridsize=[ 32, 1 ] threadblocksize=[ 768, 1, 1 ] occupancy=[ 1.000 ]
> l1_shared_bank_conflict=[ 1255 ] inst_issued=[ 2964333 ]
> method=[ _ZN6thrust6<snip> ] gputime=[ 8.640 ] cputime=[ 30.000 ]
> gridsize=[ 1, 1 ] threadblocksize=[ 32, 1, 1 ] occupancy=[ 0.021 ]
> l1_shared_bank_conflict=[ 18 ]
>
> Second stages are comparable, but PyCUDA receives a sound beating in the
> first stage. I don't quite understand why, though. Code-wise, PyCUDA and
> thrust do mostly the same thing -- some parameters are different, but I've
> twiddled them, and they don't make a big difference. From the profile,
> the main killer seems to be that thrust's code simply issues three times
> fewer instructions. But I don't get why -- the codes aren't that
> different.
>
> Compare for yourself:
>
> https://code.google.com/p/thrust/source/browse/thrust/system/detail/generic/extrema.inl
> https://code.google.com/p/thrust/source/browse/thrust/system/cuda/detail/reduce.inl
> https://code.google.com/p/thrust/source/browse/thrust/system/cuda/detail/block/reduce.h
>
> vs
>
> https://github.com/inducer/pycuda/blob/b28595eb92345f561096e833062f11b896013d47/pycuda/reduction.py
>
> I've also made a version of reduction that's even more directly modeled on
> what thrust does:
>
> https://github.com/inducer/pycuda/blob/thrusty-reduce/pycuda/reduction.py
>
> The timing is about the same, even a tad slower. I'd much appreciate
> any clues. Igor, can you please check if the perf difference is the same
> on just a simple sum'o'floats?
>
> Andreas
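P.S. Re the simple sum'o'floats check: this is essentially what I'm timing against
numpy.sum -- a minimal sketch with a made-up size, assuming float32 input; the worksheet
version may differ. I suspect the wildly different numpy.sum answer is just float32
accumulation order (for O(1) float32 data a sequential running total stops picking up new
terms long before a billion elements, while the GPU's tree reduction keeps partial sums
small), not a bug on either side.

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.reduction import ReductionKernel

    # Plain sum of floats, same two-stage reduction machinery as the argmax case.
    sum_krnl = ReductionKernel(
        np.float32,
        neutral="0",
        reduce_expr="a+b",
        map_expr="x[i]",
        arguments="float *x")

    x = gpuarray.to_gpu(np.random.rand(10**8).astype(np.float32))
    h = x.get()

    gpu_sum = float(sum_krnl(x).get())
    cpu_sum = float(h.sum())                    # float32 accumulation on the host
    ref_sum = float(h.sum(dtype=np.float64))    # float64 reference

    print(gpu_sum, cpu_sum, ref_sum)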