On Mon, Oct 12, 2015 at 2:13 PM, Mani Chandra <[email protected]> wrote:
> Hi Karl,
>
> My motivation was to avoid duplicating code for the CPU and the GPU. This
> is important considering that it takes a long time to test and make sure
> the code produces the right results.
>
> I guess I can add a switch in my code with something like:
>
>   if (usingCPU)      use VecGetArray()
>   else if (usingGPU) use VecViennaCLGetArray()
>
> and then wrap the pointers that the above functions return with OpenCL
> buffers with the appropriate memory flags (CL_MEM_USE_HOST_PTR for the CPU
> and CL_ALLOC_.. for the GPU).
>
> Hopefully, this will avoid unnecessary data transfers.

I do not understand this comment at all. This looks crazy to me. The whole
point of having Vec is so that no one ever ever ever ever does anything
like this. I saw nothing in the thread that would compel you to do this.
What are you trying to accomplish with this switch?

  Matt

> Cheers,
> Mani
>
> On Sun, Oct 11, 2015 at 1:14 PM, Karl Rupp <[email protected]> wrote:
>
>> Hi Mani,
>>
>>> Following http://www.mcs.anl.gov/petsc/petsc-20/conference/Rupp_K.pdf
>>> (page 16), I ran KSP ex12 for two cases:
>>>
>>> 1) time ./ex12 -m 100 -n 100 -log_summary > log_summary_no_viennacl
>>>
>>>    real  0m0.213s
>>>    user  0m0.206s
>>>    sys   0m0.004s
>>>
>>> 2) time ./ex12 -m 100 -n 100 -vec_type viennacl -mat_type aijviennacl
>>>    -log_summary > log_summary_with_viennacl
>>>
>>>    real  0m20.296s
>>>    user  0m46.025s
>>>    sys   0m1.435s
>>>
>>> The runs were performed on a CPU, an AMD A10-5800K, with OpenCL from
>>> AMD-APP-SDK-v3.0.
>>
>> There are a couple of things to note here:
>>
>> a) The total execution time contains the OpenCL kernel compilation time,
>> which is on the order of one or two seconds. Thus, you need much larger
>> problem sizes to get a good comparison.
>>
>> b) Most of the execution time is spent in VecMDot, which is optimized for
>> GPUs. (CPUs are not an optimization goal in ViennaCL's OpenCL backend,
>> because there one can just use plain C/C++/whatever.)
>> c) My experiences with this AMD APU are quite mixed: I've never found a
>> way to get more than 45% of STREAM bandwidth with OpenCL on the CPU part.
>> The integrated GPU, however, reached 80% without much effort, which is
>> particularly remarkable as both the CPU and the GPU share the same DDR3
>> memory link. Thus, it is highly unlikely that you will ever beat the
>> performance of PETSc's native types.
>>
>>> Attached are:
>>>
>>> 1) configure.log for the PETSc build
>>> 2) log summary without ViennaCL
>>> 3) log summary with ViennaCL
>>> 4) OpenCL info for the system on which the runs were performed
>>>
>>> Perhaps the reason for the slow performance is superfluous copies being
>>> performed, which need not occur when running ViennaCL on the CPU.
>>> Looking at
>>> http://www.mcs.anl.gov/petsc/petsc-dev/src/vec/vec/impls/seq/seqviennacl/vecviennacl.cxx:
>>>
>>> /* Copies a vector from the CPU to the GPU unless we already have an
>>>    up-to-date copy on the GPU */
>>> PetscErrorCode VecViennaCLCopyToGPU(Vec v)
>>> {
>>>   PetscErrorCode ierr;
>>>
>>>   PetscFunctionBegin;
>>>   ierr = VecViennaCLAllocateCheck(v);CHKERRQ(ierr);
>>>   if (v->map->n > 0) {
>>>     if (v->valid_GPU_array == PETSC_VIENNACL_CPU) {
>>>       ierr = PetscLogEventBegin(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>>>       try {
>>>         ViennaCLVector *vec = ((Vec_ViennaCL*)v->spptr)->GPUarray;
>>>         viennacl::fast_copy(*(PetscScalar**)v->data,
>>>                             *(PetscScalar**)v->data + v->map->n,
>>>                             vec->begin());
>>>         ViennaCLWaitForGPU();
>>>       } catch(std::exception const & ex) {
>>>         SETERRQ1(PETSC_COMM_SELF,PETSC_ERR_LIB,"ViennaCL error: %s",ex.what());
>>>       }
>>>       ierr = PetscLogEventEnd(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>>>       v->valid_GPU_array = PETSC_VIENNACL_BOTH;
>>>     }
>>>   }
>>>   PetscFunctionReturn(0);
>>> }
>>>
>>> When running ViennaCL with OpenCL on the CPU, should the above function
>>> maybe be modified?
>>
>> Unfortunately, that is quite hard: OpenCL manages its own memory handles,
>> so 'injecting' memory that was not allocated by the OpenCL runtime into
>> an OpenCL kernel is not recommended, fairly tricky, and still involves
>> some overhead. As I see no reason to run OpenCL on a CPU, I refrained
>> from adding this extra code complexity.
>>
>> Overall, I recommend rerunning the benchmark on more powerful discrete
>> GPUs with GDDR5 (or on-chip memory). Otherwise you won't see any
>> performance benefits.
>>
>> Hope this sheds some light on things :-)
>>
>> Best regards,
>> Karli

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
  -- Norbert Wiener
