Hi Andreas,
I know it's subject to rounding etc., but ReductionKernel often finds
the max value at a different position than both numpy.argmax and
thrust::max_element; the latter two always agree.
http://dev.math.canterbury.ac.nz/home/pub/26/
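
For concreteness, here's a minimal sketch of an index-carrying argmax
reduction along these lines (names and expressions are illustrative,
not the worksheet's code verbatim). With a >= comparison, ties between
equal values are resolved by whichever order the tree reduction happens
to combine pairs in, so the returned position can differ even when the
value itself agrees:

    import numpy as np
    import pycuda.autoinit  # noqa: F401 -- initializes a CUDA context
    import pycuda.gpuarray as gpuarray
    from pycuda.reduction import ReductionKernel

    # The reduction carries int32 indices; d[a] >= d[b] keeps the index
    # of the larger element, so ties depend on combination order.
    argmax = ReductionKernel(
        np.int32,
        neutral="0",                     # index 0 is a valid first candidate
        reduce_expr="(d[a] >= d[b]) ? a : b",
        map_expr="i",                    # each element starts as its own index
        arguments="const float *d",
    )

    x = np.random.randn(1 << 20).astype(np.float32)
    x_gpu = gpuarray.to_gpu(x)
    pos_gpu = int(argmax(x_gpu).get())
    pos_np = int(np.argmax(x))
    # Positions may differ on ties, but the values at them should match.
    print(pos_gpu, pos_np, x[pos_gpu] == x[pos_np])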

Sum reduction seems to take about the same time in both:
http://dev.math.canterbury.ac.nz/home/pub/27/
(Check out the wildly different answer from numpy.sum)
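
The wild difference is presumably just float32 accumulation order:
numpy adds serially on the CPU while the GPU reduces as a tree, so the
rounding errors pile up differently. A minimal sketch of the comparison
(sizes and names illustrative, not the worksheet verbatim), with a
float64 sum as the reference:

    import numpy as np
    import pycuda.autoinit  # noqa: F401 -- initializes a CUDA context
    import pycuda.gpuarray as gpuarray
    from pycuda.reduction import ReductionKernel

    # Plain float32 sum, equivalent in spirit to gpuarray.sum().
    sum_krnl = ReductionKernel(
        np.float32,
        neutral="0",
        reduce_expr="a+b",
        map_expr="x[i]",
        arguments="const float *x",
    )

    x = np.random.rand(10**7).astype(np.float32)
    x_gpu = gpuarray.to_gpu(x)

    gpu_sum = float(sum_krnl(x_gpu).get())
    # The two float32 results disagree because the additions happen in
    # different orders; the float64 sum shows which is closer.
    print(gpu_sum, float(x.sum()), float(x.astype(np.float64).sum()))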

Igor

On Thu, May 31, 2012 at 6:59 PM, Andreas Kloeckner
<li...@informa.tiker.net> wrote:
> On Thu, 31 May 2012 12:56:15 +1200, Igor <rych...@gmail.com> wrote:
>> I've updated http://dev.math.canterbury.ac.nz/home/pub/26/ with a
>> larger vector, a billion elements.
>>
>> As for returning the value: it's the pair of max value and position
>> we are talking about. Thrust returns the position, and I'm now also
>> timing the extraction of the value from the GPU array, which didn't
>> change the timing much.
>>
>> ReductionKernel still appears to be 5 times slower than thrust.
>>
>> Bryan, on the same worksheet the numpy timing is printed as well:
>> numpy.argmax is 3 times slower than ReductionKernel.
>
> I've looked at this for a little while, but can't quite make heads or
> tails of it yet. Here's the profiler output:
>
> method=[ reduce_kernel_stage1 ] gputime=[ 20617.984 ] cputime=[ 20647.000 ]
>   gridsize=[ 128, 1 ] threadblocksize=[ 512, 1, 1 ] occupancy=[ 1.000 ]
>   l1_shared_bank_conflict=[ 672 ] inst_issued=[ 7906011 ]
> method=[ reduce_kernel_stage2 ] gputime=[ 9.696 ] cputime=[ 29.000 ]
>   gridsize=[ 1, 1 ] threadblocksize=[ 512, 1, 1 ] occupancy=[ 0.333 ]
>   l1_shared_bank_conflict=[ 96 ]
>
> method=[ _ZN6thrust<snip> ] gputime=[ 3556.736 ] cputime=[ 3583.000 ]
>   gridsize=[ 32, 1 ] threadblocksize=[ 768, 1, 1 ] occupancy=[ 1.000 ]
>   l1_shared_bank_conflict=[ 1255 ] inst_issued=[ 2964333 ]
> method=[ _ZN6thrust6<snip> ] gputime=[ 8.640 ] cputime=[ 30.000 ]
>   gridsize=[ 1, 1 ] threadblocksize=[ 32, 1, 1 ] occupancy=[ 0.021 ]
>   l1_shared_bank_conflict=[ 18 ]
>
> Second stages are comparable, but PyCUDA receives a sound beating in the
> first stage. I don't quite understand why though. Code-wise, PyCUDA and
> thrust do mostly the same thing--some parameters are different, but I've
> twiddled them, and they don't make a big difference. From the profile,
> the main killer seems to be that thrust's code simply issues three times
> fewer instructions. But I don't get why--the codes aren't that
> different.
>
> Compare yourself:
>
> https://code.google.com/p/thrust/source/browse/thrust/system/detail/generic/extrema.inl
> https://code.google.com/p/thrust/source/browse/thrust/system/cuda/detail/reduce.inl
> https://code.google.com/p/thrust/source/browse/thrust/system/cuda/detail/block/reduce.h
>
> vs
>
> https://github.com/inducer/pycuda/blob/b28595eb92345f561096e833062f11b896013d47/pycuda/reduction.py
>
> I've even made a version of the reduction that's more directly like
> what thrust does:
>
> https://github.com/inducer/pycuda/blob/thrusty-reduce/pycuda/reduction.py
>
> The timing is about the same, maybe even a tad slower. I'd much
> appreciate any clues. Igor, can you please check whether the perf
> difference is the same on just a simple sum'o'floats?
>
> Andreas
