Hi Barry,

Barry Smith writes:
> Dave,
>
> I cannot explain why it does not use the MatMult_SeqAIJCusp() - it does for
> me.
Do you get good performance running a problem like ex2?

> Have you updated to the latest cusp/thrust? From the mercurial
> repositories?

I did try the latest version of cusp from mercurial initially but the
build failed. So I am currently using the latest cusp tarball. I did not
try the latest version of thrust but instead was just using what came
with the released version of Cuda 4.0. I could try the mercurial versions
of both.

> There is a difference, in your new 4.0 build you added
> --download-txpetscgpu=yes BTW: that doesn't work for me with the latest
> cusp and thrust from the mercurial repositories; can you try
> reconfiguring and making without that?

Yes, I can try that. Maybe that is why my original build with cusp from
mercurial failed.

Thanks for your help,

Dave

> Barry
>
> On Oct 2, 2011, at 9:08 PM, Dave Nystrom wrote:
>
>> Barry Smith writes:
>>> On Oct 2, 2011, at 6:39 PM, Dave Nystrom wrote:
>>>
>>>> Thanks for the update. I don't believe I have gotten a run with good
>>>> performance yet, either from C or Fortran. I wish there was an easy
>>>> way for me to force use of only one of my gpus. I don't want to have
>>>> to pull one of the gpus in order to see if that is complicating
>>>> things with Cuda 4.0. I'll be eager to hear if you make any progress
>>>> on figuring things out.
>>>>
>>>> Do you understand yet why the petsc ex2.c example is able to parse
>>>> the "-cuda_show_devices" argument but ex2f.F does not?
>>>
>>> Matt put the code in the wrong place in PETSc, that is all, no big
>>> existentialist reason. I will fix that.
>>
>> Thanks. I'll look forward to testing out the new version.
>>
>> Dave
>>
>>> Barry
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Dave
>>>>
>>>> Barry Smith writes:
>>>>> It is not doing the MatMult operation on the GPU and hence needs to
>>>>> move the vectors back and forth for each operation (since MatMult is
>>>>> done on the CPU with the vector while vector operations are done on
>>>>> the GPU) hence the terrible performance.
>>>>>
>>>>> Not sure why yet. It is copying the Mat down for me from C.
>>>>>
>>>>> Barry
>>>>>
>>>>> On Oct 2, 2011, at 4:18 PM, Dave Nystrom wrote:
>>>>>
>>>>>> In case it might be useful, I have attached two log files of runs
>>>>>> with the ex2f petsc example from src/ksp/ksp/examples/tutorials.
>>>>>> One was run back in April with petsc-dev linked to Cuda 3.2. It
>>>>>> shows excellent runtime performance. The other was run today with
>>>>>> petsc-dev checked out of the mercurial repo yesterday morning and
>>>>>> linked to Cuda 4.0. In addition to the differences in run time
>>>>>> performance, I also do not see an entry for MatCUSPCopyTo in the
>>>>>> profiling section. I'm not sure what the significance of that is.
>>>>>> I do observe that the run time for PCApply is about the same for
>>>>>> the two cases. I think I would expect that to be the case even if
>>>>>> the problem were partitioned across two gpus. However, it does make
>>>>>> me wonder if the absence of MatCUSPCopyTo in the profiling section
>>>>>> of the Cuda 4.0 log file is an indication that the matrix was not
>>>>>> actually copied to the gpu. I'm not sure yet how to check for that.
>>>>>> Hope this might be useful.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Dave
>>>>>>
>>>>>>
>>>>>> <ex2f_3200_3200_cuda_yes_cuda_3.2.log><ex2f_3200_3200_cuda_yes_cuda_4.0.log>
>>>>>> Dave Nystrom writes:
>>>>>>> Matthew Knepley writes:
>>>>>>>> On Sat, Oct 1, 2011 at 11:26 PM, Dave Nystrom <Dave.Nystrom at
>>>>>>>> tachyonlogic.com> wrote:
>>>>>>>>> Barry Smith writes:
>>>>>>>>>> On Oct 1, 2011, at 9:22 PM, Dave Nystrom wrote:
>>>>>>>>>>> Hi Barry,
>>>>>>>>>>>
>>>>>>>>>>> I've sent a couple more emails on this topic. What I am trying
>>>>>>>>>>> to do at the moment is to figure out how to have a problem run
>>>>>>>>>>> on only one gpu if it will fit in the memory of that gpu. Back
>>>>>>>>>>> in April when I had built petsc-dev with Cuda 3.2, petsc would
>>>>>>>>>>> only use one gpu if you had multiple gpus on your machine. In
>>>>>>>>>>> order to use multiple gpus for a problem, one had to use
>>>>>>>>>>> multiple threads with a separate thread assigned to control
>>>>>>>>>>> each gpu. But Cuda 4.0 has, I believe, made that transparent
>>>>>>>>>>> and under the hood. So now when I run a small example problem
>>>>>>>>>>> such as src/ksp/ksp/examples/tutorials/ex2f.F with an 800x800
>>>>>>>>>>> problem, it gets partitioned to run on both of the gpus in my
>>>>>>>>>>> machine. The result is a very large performance hit because of
>>>>>>>>>>> communication back and forth from one gpu to the other via the
>>>>>>>>>>> cpu.
>>>>>>>>>>
>>>>>>>>>> How do you know there is lots of communication from the GPU to
>>>>>>>>>> the CPU? In the -log_summary? Nope because PETSc does not
>>>>>>>>>> manage anything like that (that is one CPU process using both
>>>>>>>>>> GPUs).
>>>>>>>>>
>>>>>>>>> What I believe is that it is being managed by Cuda 4.0, not by
>>>>>>>>> petsc.
>>>>>>>>>
>>>>>>>>>>> So this problem with a 3200x3200 grid runs 5x slower
>>>>>>>>>>> now than it did with Cuda 3.2.
>>>>>>>>>>> I believe if one is programming down at the cuda level, it is
>>>>>>>>>>> possible to have a smaller problem run on only one gpu so that
>>>>>>>>>>> there is communication only between the cpu and gpu and only
>>>>>>>>>>> at the start and end of the calculation.
>>>>>>>>>>>
>>>>>>>>>>> To me, it seems like what is needed is a petsc option to
>>>>>>>>>>> specify the number of gpus to run on that can somehow get
>>>>>>>>>>> passed down to the cuda level through cusp and thrust. I fear
>>>>>>>>>>> that the short term solution is going to have to be for me to
>>>>>>>>>>> pull one of the gpus out of my desktop system but it would be
>>>>>>>>>>> nice if there was a way to tell petsc and friends to just use
>>>>>>>>>>> one gpu when I want it to.
>>>>>>>>>>>
>>>>>>>>>>> If necessary, I can send a couple of log files to demonstrate
>>>>>>>>>>> what I am trying to describe regarding the performance hit.
>>>>>>>>>>
>>>>>>>>>> I am not convinced that the poor performance you are getting
>>>>>>>>>> now has anything to do with using both GPUs. Please run a PETSc
>>>>>>>>>> program with the command -cuda_show_devices
>>>>>>>>>
>>>>>>>>> I ran the following command:
>>>>>>>>>
>>>>>>>>> ex2f -m 8 -n 8 -ksp_type cg -pc_type jacobi -log_summary
>>>>>>>>> -cuda_show_devices -mat_type aijcusp -vec_type cusp
>>>>>>>>> -options_left
>>>>>>>>>
>>>>>>>>> The result was a report that there was one option left, that
>>>>>>>>> being -cuda_show_devices. I am using a copy of petsc-dev that I
>>>>>>>>> cloned and built this morning.
>>>>>>>>
>>>>>>>> What do you have at src/sys/object/pinit.c:825? You should see
>>>>>>>> the code that processes this option. You should be able to break
>>>>>>>> there in the debugger and see what happens. This sounds again
>>>>>>>> like you are not processing options correctly.
>>>>>>>
>>>>>>> Hi Matt,
>>>>>>>
>>>>>>> I'll take a look at that in a bit and see if I can figure out what
>>>>>>> is going on. I do see the code that you mention that processes the
>>>>>>> arguments that Barry mentioned. In terms of processing options
>>>>>>> correctly, at least in this case I am actually running one of the
>>>>>>> petsc examples rather than my own code. And it seems to correctly
>>>>>>> process the other command line arguments. Anyway, I'll write more
>>>>>>> after I have had a chance to investigate more.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Dave
>>>>>>>
>>>>>>>> Matt
>>>>>>>>
>>>>>>>>>> What are the choices? You can then pick one of them and run
>>>>>>>>>> with -cuda_set_device integer
>>>>>>>>>
>>>>>>>>> The -cuda_set_device option does not appear to be recognized
>>>>>>>>> either, even if I choose an integer like 0.
>>>>>>>>>
>>>>>>>>>> Does this change things?
>>>>>>>>>
>>>>>>>>> I suspect it would change things if I could get it to work.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Dave
>>>>>>>>>
>>>>>>>>>> Barry
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Dave
>>>>>>>>>>>
>>>>>>>>>>> Barry Smith writes:
>>>>>>>>>>>> Dave,
>>>>>>>>>>>>
>>>>>>>>>>>> We have no mechanism in the PETSc code for a PETSc single CPU
>>>>>>>>>>>> process to use two GPUs at the same time. However you could
>>>>>>>>>>>> have two MPI processes each using their own GPU.
>>>>>>>>>>>>
>>>>>>>>>>>> The one tricky part is you need to make sure each MPI process
>>>>>>>>>>>> uses a different GPU. We currently do not have a mechanism to
>>>>>>>>>>>> do this assignment automatically. I think it can be done with
>>>>>>>>>>>> cudaSetDevice(). But I don't know the details, sending this
>>>>>>>>>>>> to petsc-dev at mcs.anl.gov where more people may know.
>>>>>>>>>>>>
>>>>>>>>>>>> PETSc-folks,
>>>>>>>>>>>>
>>>>>>>>>>>> We need a way to have this setup automatically.
>>>>>>>>>>>>
>>>>>>>>>>>> Barry
>>>>>>>>>>>>
>>>>>>>>>>>> On Oct 1, 2011, at 5:43 PM, Dave Nystrom wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I'm running petsc on a machine with Cuda 4.0 and 2 gpus.
>>>>>>>>>>>>> This is a desktop machine with a single processor. I know
>>>>>>>>>>>>> that Cuda 4.0 has support for running on multiple gpus but
>>>>>>>>>>>>> don't know if petsc uses that. But suppose I have a problem
>>>>>>>>>>>>> that will fit in the memory for a single gpu. Will petsc run
>>>>>>>>>>>>> the problem on a single gpu or does it split it between the
>>>>>>>>>>>>> 2 gpus and incur the communication overhead of copying data
>>>>>>>>>>>>> between the two gpus?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dave
>>>>>>>>
>>>>>>>> --
>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>> experiments is infinitely more interesting than any results to
>>>>>>>> which their experiments lead.
>>>>>>>> -- Norbert Wiener
>>>>>
>>>
>