Hi,

I had the opportunity to benchmark the open-source OpenCL drivers (POCL on
CPU, Beignet on GPU) against the proprietary ones, and they now perform
very well!

Test computer:
MacBook Pro 13" with an Iris 5100 GPU integrated into the Haswell
processor (i5-4308U), running Debian Jessie (or Mac OS X)

The code used is described on pages 7-14 of this document:
http://pdebuyl.be/tmp/esp2014_draft.pdf

It consists of a map operation (cast and multiplications/divisions)
followed by a sparse-matrix dense-vector multiplication, implemented either
as an array of structs (method called LUT, better suited to CPUs) or as a
struct of arrays (called CSR, better suited to GPUs). CSR is implemented
using a parallel reduction within a workgroup. All OpenCL methods use
single-precision floating-point arithmetic with Kahan summation, while the
OpenMP code uses double-precision arithmetic.
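To make the CSR variant concrete, here is a minimal serial sketch in
Python/NumPy of a CSR sparse-matrix dense-vector product with one Kahan
(compensated) sum per row, in single precision. It is not the benchmarked
OpenCL kernel: the function names (kahan_sum, csr_matvec) and the array
names (data, indices, indptr) are assumptions made for illustration, and
the workgroup-level parallel reduction is replaced by a plain loop.

```python
import numpy as np


def kahan_sum(values):
    """Compensated (Kahan) summation carried out in single precision."""
    total = np.float32(0.0)
    comp = np.float32(0.0)  # running estimate of the low-order bits lost so far
    for v in values:
        y = np.float32(v) - comp   # re-inject the previously lost part
        t = total + y              # low-order bits of y may be lost here
        comp = (t - total) - y     # recover what was lost in this addition
        total = t
    return total


def csr_matvec(data, indices, indptr, vec):
    """CSR sparse matrix times dense vector, one Kahan sum per row."""
    nrows = len(indptr) - 1
    out = np.zeros(nrows, dtype=np.float32)
    for row in range(nrows):
        start, stop = indptr[row], indptr[row + 1]
        # Products of the non-zero coefficients of this row with the
        # matching entries of the dense vector
        out[row] = kahan_sum(data[start:stop] * vec[indices[start:stop]])
    return out


# Tiny illustrative matrix [[1, 0, 2], [0, 3, 0]] in CSR form
data = np.array([1.0, 2.0, 3.0], dtype=np.float32)
indices = np.array([0, 2, 1], dtype=np.int32)
indptr = np.array([0, 2, 3], dtype=np.int32)
vec = np.array([1.0, 2.0, 3.0], dtype=np.float32)
print(csr_matvec(data, indices, indptr, vec))
```

The compensation term is what lets a single-precision sum stay close to a
double-precision one when many small contributions are added to a large
running total.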

This benchmark reports the execution time, in milliseconds, of the complete
treatment for input images of various sizes (from 1 to 16 Mpixel).
Each value is the best timing out of 3 runs, each run being averaged over
10 processings, measured with the timeit module from Python.
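For reference, one way to obtain such a "best of 3, averaged over 10"
figure with timeit is sketched below. The helper name best_time_ms and the
process() stand-in are hypothetical, and reading the protocol as repeat=3,
number=10 is my interpretation rather than the exact benchmark script.

```python
import timeit


def best_time_ms(func, repeat=3, number=10):
    """Best of `repeat` runs, each averaged over `number` calls, in ms."""
    # timeit.repeat returns one total duration (in seconds) per run
    totals = timeit.repeat(func, repeat=repeat, number=number)
    return min(totals) / number * 1000.0


def process():
    # Hypothetical stand-in for the complete treatment of one image
    sum(i * i for i in range(10000))


print("%.3f ms" % best_time_ms(process))
```

Taking the minimum over the repeats filters out runs disturbed by other
activity on the machine.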

Reference timings:
1D_CPU_LUT_OpenMP
Img size        Linux/gcc       Apple/clang
   1.02         12.12           13.451
   2.10         30.14           35.307
   4.19         63.79           87.110
   6.22         96.17           130.77
  11.90         222.15          265.94
  16.78         270.42          359.93
        
1D_CPU_CSR_OpenMP
Img size        Linux/gcc       Apple/clang
   1.02         12.31           12.256
   2.10         30.20           33.220
   4.19         64.34           76.948
   6.22         88.82           111.60
  11.90         206.82          218.81
  16.78         280.03          443.35

Execution on the CPU:
1D_CPU_LUT_OpenCL
Img size        AMD             Intel           Apple           POCL
   1.02         13.11           8.25            9.7813          8.47
   2.10         29.85           15.20           20.563          17.85
   4.19         58.08           32.77           47.877          47.19
   6.22         97.88           53.04           80.372          62.53
  11.90         184.29          125.52          149.33          135.89
  16.78         261.21          149.31          205.81          190.14

1D_CPU_CSR_OpenCL
Img size        AMD             Intel           Apple           POCL
   1.02         16.96           10.05           9.8027          10.02
   2.10         37.12           18.46           21.904          21.35
   4.19         82.78           42.24           46.961          59.89
   6.22         133.41          70.17           68.312          73.87
  11.90         271.61          182.41          143.57          178.77
  16.78         346.55          222.82          212.17          260.62

Execution on the integrated GPU:
1D_GPU_LUT_OpenCL
Img size        Beignet         Apple
   1.02         7.50            10.066
   2.10         14.44           16.345
   4.19         28.91           34.538
   6.22         -----           37.570
  11.90         -----           68.443
  16.78         -----           78.333
(----- means no data: MemoryError, only 256 MB available on the GPU)

1D_GPU_CSR_OpenCL
Img size        Beignet         Apple
   1.02         3.95            6.0475
   2.10         7.55            13.324
   4.19         15.62           23.255
   6.22         23.88           33.352
  11.90         45.63           55.099
  16.78         68.78           82.569

It is amusing to note that, with the same code, this laptop GPU outperforms
an Intel Xeon Phi accelerator, which is much more expensive than the whole
laptop.

Cheers,
-- 
Jérôme Kieffer
Data analysis unit - ESRF
