Hi Yue Hu,

On 06/13/2013 01:07 AM, Yue Hu wrote:
> However, no matter how much computation is allocated to CPU, it has a time
> overhead of about 3 seconds. For better illustration, I listed the code that
> uses CPU to do the whole computation as below. The red lines are the execution
> time I measured for a 1024x1024 matrix multiplication case. I also tested
> other matrix sizes; the time overhead is still about 3 seconds.

My first guess is compilation overhead. You can verify this
by running the kernel again in the same program: the second time the
compilation overhead should not be there, thanks to the caching of
the compilation results.
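
A quick way to test this is to launch the same kernel twice in one
process and time both launches. Below is a minimal sketch (not your
code; the 'dummy' kernel, the buffer size and the timing helper are
just placeholders, and error checking is omitted). If the first launch
carries the ~3 second overhead and the second one does not, you are
seeing the kernel compilation time:

/* Time the same kernel launch twice in one process: the first launch
 * includes the kernel compilation, the second one should reuse the
 * cached compilation result. */
#include <stdio.h>
#include <time.h>
#include <CL/cl.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static const char *src =
    "__kernel void dummy(__global float *d) {"
    "  d[get_global_id(0)] *= 2.0f;"
    "}";

int main(void)
{
    cl_platform_id platform;
    cl_device_id dev;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "dummy", NULL);

    size_t n = 1024 * 1024;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                n * sizeof(float), NULL, NULL);
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);

    /* Launch the same kernel twice and time each launch. */
    for (int i = 0; i < 2; ++i) {
        double t0 = now_sec();
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
        clFinish(q);
        printf("launch %d: %.3f s\n", i, now_sec() - t0);
    }

    clReleaseMemObject(buf);
    clReleaseKernel(k);
    clReleaseProgram(prog);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    return 0;
}

Compile with something like 'gcc test.c -lOpenCL -lrt'.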

This relates to the recent discussion of a proper compiler cache that
stores the compilation results across program launches and properly
checks the validity of cache entries.

Have you compiled your LLVM/Clang with optimizations enabled? An
unoptimized LLVM can easily account for an order-of-magnitude
slowdown in the kernel compiler.
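
One way to check (assuming llvm-config is in your PATH and your
version supports the option) is:

  llvm-config --build-mode

which should print something like "Release" or "Release+Asserts"
rather than "Debug" for an optimized build.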

There can be other surprising performance issues. E.g., I fixed one
when I noticed that LLVM 3.3 produced fmuladd intrinsics automatically,
and on my machine those intrinsics were converted to uninlinable
math library calls, which caused significant performance regressions
in some cases.

I use the 'valgrind' tool to produce execution profiles. This is
something you might want to do as well, unless it's clearly the
compilation overhead you are seeing:

valgrind --tool=cachegrind --trace-children=yes ./your_opencl_program

Then it dumps several traces which you can visualize, e.g., with
kcachegrind to find the hot spots.
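
For example (cachegrind writes one output file per traced process,
named cachegrind.out.<pid>):

  kcachegrind cachegrind.out.<pid>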

BR,
-- 
Pekka
