On Mon, Oct 12, 2015 at 6:56 PM, Mani Chandra <[email protected]> wrote:
> Here is the source file:
> https://github.com/AFD-Illinois/grim/blob/opencl/computeresidual.cl
>
> Attached is the assembly code "assembly_code.asm" generated using:
>
> ioc64 -input=computeresidual.cl -bo='-DOPENCL' -device='cpu'
> -asm=assembly_code

The kernel is beyond enormous and I unfortunately cannot make any sense
of it.

  Thanks,

     Matt

> Cheers,
> Mani
>
> Caution: The source code in the opencl branch of
> https://github.com/AFD-Illinois/grim is not very clean.
>
> On Mon, Oct 12, 2015 at 4:28 PM, Matthew Knepley <[email protected]>
> wrote:
>
>> On Mon, Oct 12, 2015 at 3:25 PM, Mani Chandra <[email protected]> wrote:
>>
>>> Here is the code: http://github.com/afd-illinois/grim, branch: opencl.
>>>
>>> Now using
>>>
>>> Kernel Builder for OpenCL API - compiler command line, version 1.4.0.134
>>> Copyright (C) 2014 Intel Corporation. All rights reserved.
>>>
>>> manic@bh27:~/grim_opencl/grim> ioc64 -input=computeresidual.cl
>>> -bo='-DOPENCL' -device='cpu'
>>>
>>> No command specified, using 'build' as default
>>
>> Sorry if I am being obtuse, but I cannot find that source file in the
>> repo above. Can you give the direct link to the file?
>>
>>> Using build options: -DOPENCL
>>> Setting target instruction set architecture to: Default (Advanced
>>> Vector Extension (AVX))
>>> OpenCL Intel CPU device was found!
>>> Device name: Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
>>> Device version: OpenCL 1.2 (Build 44)
>>> Device vendor: Intel(R) Corporation
>>> Device profile: FULL_PROFILE
>>> Compilation started
>>> Compilation done
>>> Linking started
>>> Linking done
>>> Device build started
>>> Device build done
>>> Kernel <ComputeResidual> was successfully vectorized
>>> Done.
>>> Build succeeded!
>>
>> It definitely says it was vectorized, but what code did it generate?
>> Can you post the object file, since I do not have the compiler? I have
>> seen that message with really bad code before.
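One way to see whether "successfully vectorized" means anything is to scan the generated assembly for packed AVX arithmetic (mnemonics ending in `pd`/`ps`) versus scalar arithmetic (`sd`/`ss`). A minimal sketch of that check; the sample file below is illustrative stand-in text, not real ioc64 output, and in practice you would grep the `assembly_code.asm` attached to this thread:

```shell
# Sketch: count packed vs. scalar double-precision AVX ops in a .asm file.
# A tiny illustrative sample is created here so the commands are
# self-contained; substitute the real assembly_code.asm in practice.
cat > sample.asm <<'EOF'
vmovupd ymm0, [rax]
vmulpd  ymm0, ymm0, ymm1
vaddsd  xmm2, xmm2, xmm3
EOF

# Packed (vectorized) ops end in 'pd'; scalar ones end in 'sd'.
packed=$(grep -cE 'v(mul|add|sub|fmadd[0-9]*)pd' sample.asm)
scalar=$(grep -cE 'v(mul|add|sub|fmadd[0-9]*)sd' sample.asm)
echo "packed=$packed scalar=$scalar"
```

A kernel that was genuinely vectorized should show mostly packed instructions in its hot loop; a high scalar count despite the "vectorized" message is the bad-code case Matt describes.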
>>
>> Thanks,
>>
>>    Matt
>>
>>> Cheers,
>>> Mani
>>>
>>> On Mon, Oct 12, 2015 at 12:52 PM, Matthew Knepley <[email protected]>
>>> wrote:
>>>
>>>> On Mon, Oct 12, 2015 at 2:44 PM, Mani Chandra <[email protected]> wrote:
>>>>>
>>>>> On Mon, Oct 12, 2015 at 12:36 PM, Barry Smith <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> > On Oct 12, 2015, at 2:29 PM, Matthew Knepley <[email protected]>
>>>>>> wrote:
>>>>>> >
>>>>>> > On Mon, Oct 12, 2015 at 2:13 PM, Mani Chandra <[email protected]>
>>>>>> wrote:
>>>>>> > Hi Karl,
>>>>>> >
>>>>>> > My motivation was to avoid duplicating code for the CPU and the
>>>>>> GPU. This is important, considering that it takes a long time to
>>>>>> test the code and make sure it produces the right results.
>>>>>> >
>>>>>> > I guess I can add a switch in my code with something like:
>>>>>> >
>>>>>> >   if (usingCPU) use VecGetArray()
>>>>>> >   else if (usingGPU) use VecViennaCLGetArray()
>>>>>> >
>>>>>> > and then wrap the pointers that the above functions return in
>>>>>> OpenCL buffers with the appropriate memory flags
>>>>>> (CL_MEM_USE_HOST_PTR for the CPU and CL_ALLOC_... for the GPU).
>>>>>> >
>>>>>> > Hopefully, this will avoid unnecessary data transfers.
>>>>>> >
>>>>>> > I do not understand this comment at all. This looks crazy to me.
>>>>>> The whole point of having Vec is so that no one ever ever ever ever
>>>>>> does anything like this. I saw nothing in the thread that would
>>>>>> compel you to do this. What are you trying to accomplish with this
>>>>>> switch?
>>>>>
>>>>> I'm trying to assemble the residual needed for SNES using an OpenCL
>>>>> kernel. The kernel operates on OpenCL buffers, which can live either
>>>>> on the CPU or on the GPU.
>>>>>
>>>>> I think it is useful to use OpenCL on the CPU basically because of
>>>>> vectorization and vector data types. If I had to write the usual C
>>>>> code, I'd have to use all sorts of pragmas in icc to get the code to
>>>>> vectorize, and even then it's pretty hard.
>>>>>
>>>>
>>>> I would completely agree with you, if I thought the compiler actually
>>>> vectorized that code. I do not think that is the case. Is there an
>>>> example where you get vectorized assembly?
>>>>
>>>> Thanks,
>>>>
>>>>    Matt
>>>>
>>>>> Mani
>>>>>
>>>>>> Matt,
>>>>>>
>>>>>>    The current OpenCL code in PETSc is hardwired for GPU usage. So
>>>>>> the correct fix, I believe, is to add support for using either the
>>>>>> GPU or the CPU to the VecViennaCL wrappers.
>>>>>>
>>>>>>    Barry
>>>>>>
>>>>>> >
>>>>>> > Matt
>>>>>> >
>>>>>> > Cheers,
>>>>>> > Mani
>>>>>> >
>>>>>> > On Sun, Oct 11, 2015 at 1:14 PM, Karl Rupp <[email protected]>
>>>>>> wrote:
>>>>>> > Hi Mani,
>>>>>> >
>>>>>> > > Following http://www.mcs.anl.gov/petsc/petsc-20/conference/Rupp_K.pdf
>>>>>> > > (page 16), I ran KSP ex12 for two cases:
>>>>>> > >
>>>>>> > > 1) time ./ex12 -m 100 -n 100 -log_summary > log_summary_no_viennacl
>>>>>> > >
>>>>>> > > real 0m0.213s
>>>>>> > > user 0m0.206s
>>>>>> > > sys  0m0.004s
>>>>>> > >
>>>>>> > > 2) time ./ex12 -m 100 -n 100 -vec_type viennacl -mat_type aijviennacl
>>>>>> > > -log_summary > log_summary_with_viennacl
>>>>>> > >
>>>>>> > > real 0m20.296s
>>>>>> > > user 0m46.025s
>>>>>> > > sys  0m1.435s
>>>>>> > >
>>>>>> > > The runs were performed on an AMD A10-5800K CPU, with OpenCL
>>>>>> > > from AMD-APP-SDK-v3.0.
>>>>>> >
>>>>>> > There are a couple of things to note here:
>>>>>> >
>>>>>> > a) The total execution time contains the OpenCL kernel compilation
>>>>>> time, which is on the order of one or two seconds. Thus, you need
>>>>>> much larger problem sizes to get a good comparison.
>>>>>> >
>>>>>> > b) Most of the execution time is spent in VecMDot, which is
>>>>>> optimized for GPUs (CPUs are not an optimization goal in ViennaCL's
>>>>>> OpenCL backend, because one can use just plain C/C++/whatever).
>>>>>> >
>>>>>> > c) My experiences with this AMD APU are quite mixed, as I've never
>>>>>> found a way to get more than 45% of STREAM bandwidth with OpenCL on
>>>>>> the CPU part. The integrated GPU, however, reached 80% without much
>>>>>> effort. This is particularly remarkable, as both the CPU and the GPU
>>>>>> share the same DDR3 memory link. Thus, it is more than unlikely that
>>>>>> you will ever beat the performance of PETSc's native types.
>>>>>> >
>>>>>> > > Attached are:
>>>>>> > > 1) configure.log for the petsc build
>>>>>> > > 2) log summary without viennacl
>>>>>> > > 3) log summary with viennacl
>>>>>> > > 4) OpenCL info for the system on which the runs were performed
>>>>>> > >
>>>>>> > > Perhaps the reason for the slow performance is superfluous
>>>>>> > > copies being performed, which need not occur when running
>>>>>> > > ViennaCL on the CPU. Looking at
>>>>>> > > http://www.mcs.anl.gov/petsc/petsc-dev/src/vec/vec/impls/seq/seqviennacl/vecviennacl.cxx:
>>>>>> > >
>>>>>> > > /* Copies a vector from the CPU to the GPU unless we already
>>>>>> > >    have an up-to-date copy on the GPU */
>>>>>> > > PetscErrorCode VecViennaCLCopyToGPU(Vec v)
>>>>>> > > {
>>>>>> > >   PetscErrorCode ierr;
>>>>>> > >
>>>>>> > >   PetscFunctionBegin;
>>>>>> > >   ierr = VecViennaCLAllocateCheck(v);CHKERRQ(ierr);
>>>>>> > >   if (v->map->n > 0) {
>>>>>> > >     if (v->valid_GPU_array == PETSC_VIENNACL_CPU) {
>>>>>> > >       ierr = PetscLogEventBegin(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>>>>>> > >       try {
>>>>>> > >         ViennaCLVector *vec = ((Vec_ViennaCL*)v->spptr)->GPUarray;
>>>>>> > >         viennacl::fast_copy(*(PetscScalar**)v->data,
>>>>>> > >                             *(PetscScalar**)v->data + v->map->n,
>>>>>> > >                             vec->begin());
>>>>>> > >         ViennaCLWaitForGPU();
>>>>>> > >       } catch(std::exception const & ex) {
>>>>>> > >         SETERRQ1(PETSC_COMM_SELF,PETSC_ERR_LIB,"ViennaCL error: %s", ex.what());
>>>>>> > >       }
>>>>>> > >       ierr = PetscLogEventEnd(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>>>>>> > >       v->valid_GPU_array = PETSC_VIENNACL_BOTH;
>>>>>> > >     }
>>>>>> > >   }
>>>>>> > >   PetscFunctionReturn(0);
>>>>>> > > }
>>>>>> > >
>>>>>> > > When running ViennaCL with OpenCL on the CPU, should the above
>>>>>> > > function maybe be modified?
>>>>>> >
>>>>>> > Unfortunately, that is quite hard: OpenCL manages its own memory
>>>>>> handles, so 'injecting' memory that was not allocated by the OpenCL
>>>>>> runtime into an OpenCL kernel is not recommended, fairly tricky, and
>>>>>> still involves some overhead. As I see no reason to run OpenCL on a
>>>>>> CPU, I refrained from adding this extra code complexity.
>>>>>> >
>>>>>> > Overall, I recommend rerunning the benchmark on a more powerful
>>>>>> discrete GPU with GDDR5 (or on-chip memory). Otherwise you won't see
>>>>>> any performance benefits.
>>>>>> >
>>>>>> > Hope this sheds some light on things :-)
>>>>>> >
>>>>>> > Best regards,
>>>>>> > Karli

--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
