On Thu, 31 May 2012 12:56:15 +1200, Igor <rych...@gmail.com> wrote:
> I've updated the http://dev.math.canterbury.ac.nz/home/pub/26/
> larger vector, a billion elements.
> As for returning the value, it's the pair of max value and position we
> are talking about, thrust returns the position and I'm now timing the
> extraction of the value from the gpu array which didn't change timing
> too much.
> ReductionKernel still appears 5 times slower than thrust.
> Bryan, on the same worksheet the numpy timing is printed as well:
> argmax is 3 times slower than ReductionKernel.

I've looked at this for a little while, can't quite make heads or tails
of it yet. Here's the profiler output:

method=[ reduce_kernel_stage1 ] gputime=[ 20617.984 ] cputime=[ 20647.000 ]
  gridsize=[ 128, 1 ] threadblocksize=[ 512, 1, 1 ] occupancy=[ 1.000 ]
  l1_shared_bank_conflict=[ 672 ] inst_issued=[ 7906011 ]
method=[ reduce_kernel_stage2 ] gputime=[ 9.696 ] cputime=[ 29.000 ]
  gridsize=[ 1, 1 ] threadblocksize=[ 512, 1, 1 ] occupancy=[ 0.333 ]
  l1_shared_bank_conflict=[ 96 ]

method=[ _ZN6thrust<snip>] gputime=[ 3556.736 ] cputime=[ 3583.000 ]
  gridsize=[ 32, 1 ] threadblocksize=[ 768, 1, 1 ] occupancy=[ 1.000 ]
  l1_shared_bank_conflict=[ 1255 ] inst_issued=[ 2964333 ]
method=[ _ZN6thrust6<snip>] gputime=[ 8.640 ] cputime=[ 30.000 ]
  gridsize=[ 1, 1 ] threadblocksize=[ 32, 1, 1 ] occupancy=[ 0.021 ]
  l1_shared_bank_conflict=[ 18 ]

Second stages are comparable, but PyCUDA receives a sound beating in the
first stage. I don't quite understand why though. Code-wise, PyCUDA and
thrust do mostly the same thing--some parameters are different, but I've
twiddled them, and they don't make a big difference. From the profile,
the main killer seems to be that thrust's code simply issues three times
fewer instructions. But I don't get why--the codes aren't that different.
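
To put numbers on the first-stage gap, here is a small sketch that just divides the figures from the profiler output above. The dictionary layout and variable names are made up for illustration; only the numeric values come from the profile.

```python
# Stage-one figures copied from the profiler output above.
# The dict keys and variable names are illustrative, not profiler API.
pycuda_stage1 = {"gputime": 20617.984, "inst_issued": 7906011,
                 "grid": 128, "block": 512}
thrust_stage1 = {"gputime": 3556.736, "inst_issued": 2964333,
                 "grid": 32, "block": 768}

# How much longer PyCUDA's first stage runs than thrust's.
time_ratio = pycuda_stage1["gputime"] / thrust_stage1["gputime"]

# How many more instructions PyCUDA's first stage issues.
inst_ratio = pycuda_stage1["inst_issued"] / thrust_stage1["inst_issued"]

# Total threads launched by each first-stage kernel (grid size * block size).
pycuda_threads = pycuda_stage1["grid"] * pycuda_stage1["block"]
thrust_threads = thrust_stage1["grid"] * thrust_stage1["block"]

print("time ratio:        %.2f" % time_ratio)
print("instruction ratio: %.2f" % inst_ratio)
print("threads launched:  %d vs %d" % (pycuda_threads, thrust_threads))
```

With these values the time ratio comes out near 5.8 and the instruction ratio near 2.7. The two kernels also launch 65536 and 24576 threads respectively; the profile alone does not say whether that is related to the instruction count.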

Compare for yourself:




I've also made a version of the reduction that follows what thrust does
even more directly:


The timing is about the same, if anything a tad slower. I'd much appreciate
any clues. Igor, can you please check if the perf difference is the same
on just a simple sum'o'floats?
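
For the sum-of-floats check, a CPU-side reference may help confirm that both GPU results agree before comparing timings. The sketch below is plain Python, not PyCUDA or thrust code: it only mimics the two-stage shape visible in the profile (many partial results, then one final combine). The function names, the strided partitioning, and the default of 128 blocks are assumptions chosen for illustration.

```python
import math
import random


def stage1_partial_sums(data, num_blocks):
    """Stage 1: each hypothetical 'block' sums its own strided slice."""
    return [sum(data[b::num_blocks]) for b in range(num_blocks)]


def stage2_final_sum(partials):
    """Stage 2: a single 'block' combines the per-block partial sums."""
    return sum(partials)


def two_stage_sum(data, num_blocks=128):
    """Run both stages; num_blocks mirrors the stage-one grid size above."""
    return stage2_final_sum(stage1_partial_sums(data, num_blocks))


random.seed(0)
values = [random.random() for _ in range(100000)]

reference = math.fsum(values)   # accurately rounded reference sum
result = two_stage_sum(values)

print("absolute difference from reference:", abs(result - reference))
```

The summation order differs from a straight left-to-right sum, so a small floating-point difference from the reference is expected; a GPU result should be compared with a tolerance rather than for exact equality.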



PyCUDA mailing list
