Hi Bryan,
http://dev.math.canterbury.ac.nz/home/pub/26/ now has the timing
measured with Python's time.time() -- there isn't much difference from
the CUDA event timings. The card is a Tesla C2070.
Igor
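
A minimal sketch of the host-side timing described above, assuming the GPU work is wrapped in a callable. The names time_from_host, work, and sync are hypothetical and are not taken from the worksheet. Kernel launches return asynchronously, so the sketch calls an optional sync hook before stopping the clock; with PyCUDA one might pass pycuda.driver.Context.synchronize, while a call that ends in .get() already blocks on the device-to-host copy.

```python
import time


def time_from_host(work, sync=None, repeats=5):
    """Return (best wall-clock seconds, last result) for calling work()."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        if sync is not None:
            sync()  # drain any previously queued GPU work
        start = time.time()
        result = work()
        if sync is not None:
            sync()  # wait for asynchronous GPU work to finish
        best = min(best, time.time() - start)
    return best, result


if __name__ == "__main__":
    # CPU stand-in so the sketch runs without a GPU.
    data = list(range(100_000))
    elapsed, position = time_from_host(lambda: data.index(max(data)))
    print(f"best of 5: {elapsed:.6f} s, max at index {position}")
```

Taking the best of several repeats is one way to keep one-off costs, such as lazy loading of a shared library on the first call, out of the reported number.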


On Thu, May 31, 2012 at 3:31 PM, Bryan Catanzaro <bcatanz...@acm.org> wrote:
> Hi Igor -
> I meant that it's more useful to know the execution time of code
> running on the GPU from Python's perspective, since Python is the one
> driving the work, and the execution overheads can be significant.
> What timings do you get when you use timeit rather than CUDA events?
> Also, what GPU are you running on?
>
> - bryan
>
> On Wed, May 30, 2012 at 5:56 PM, Igor <rych...@gmail.com> wrote:
>> I've updated http://dev.math.canterbury.ac.nz/home/pub/26/ with a
>> larger vector, a billion elements.
>>
>> As for returning the value: it's the pair of max value and position
>> we are talking about. thrust returns the position, and I'm now also
>> timing the extraction of the value from the GPU array, which didn't
>> change the timing much.
>>
>> ReductionKernel still appears 5 times slower than thrust.
>>
>> Bryan, on the same worksheet the numpy timing is printed as well:
>> argmax is 3 times slower than ReductionKernel.
>>
>>
>>
>>
>> On Thu, May 31, 2012 at 12:08 PM, Andreas Kloeckner
>> <li...@informa.tiker.net> wrote:
>>> On Wed, 30 May 2012 22:13:27 +1200, Igor <rych...@gmail.com> wrote:
>>>> Hi Andreas,
>>>> I'm attaching an example for your wiki demonstrating how to find the
>>>> position of the max element using both ReductionKernel and
>>>> thrust-nvcc-ctypes. The latter doesn't quite work on Windows yet. It
>>>> should work on Linux; just change the FOLDER. There is a live version
>>>> published on my Sage server
>>>> (http://dev.math.canterbury.ac.nz/home/pub/26/ ) -- there both work
>>>> and show a discouraging 5-fold slowdown of ReductionKernel compared
>>>> to thrust (run twice, as the .so file is loaded lazily?). Could you
>>>> take a look and edit it if necessary?
>>>
>>> Not a fair comparison. The PyCUDA test includes the transfer of the
>>> result to the host (.get()). Doesn't look like that's the case for
>>> thrust. Also, an 80 MB vector is tiny. At 200 GB/s, that's about
>>> 4e-4 s, which is in the vicinity of launch overhead.
>>>
>>> Andreas
>>
>> _______________________________________________
>> PyCUDA mailing list
>> PyCUDA@tiker.net
>> http://lists.tiker.net/listinfo/pycuda
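
The back-of-the-envelope estimate in Andreas's reply quoted above can be reproduced in a few lines of Python. This is only a sketch: the 200 GB/s figure is the rough bandwidth used in that reply, not a measured value for the C2070, and the function name min_read_time is hypothetical.

```python
def min_read_time(num_bytes, bandwidth_bytes_per_s):
    """Lower bound on the time to stream num_bytes once through memory."""
    return num_bytes / bandwidth_bytes_per_s


MB = 1e6
GB = 1e9

# 80 MB vector at an assumed 200 GB/s, as in the quoted estimate.
small = min_read_time(80 * MB, 200 * GB)
print(f"80 MB: {small:.1e} s")  # prints "80 MB: 4.0e-04 s"
```

When the ideal time is this small, fixed costs such as kernel launch overhead can account for much of the measured time, which is why a larger vector gives a more meaningful comparison.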

