Hi Barry,

Barry Smith writes:
 > Dave,
 > 
 > I cannot explain why it does not use the MatMult_SeqAIJCusp() - it does for 
 > me.

Do you get good performance running a problem like ex2?

 > Have you updated to the latest cusp/thrust? From the mercurial repositories?

I did try the latest version of cusp from mercurial initially but the build
failed.  So I am currently using the latest cusp tarball.  I did not try the
latest version of thrust but instead was just using what came with the
released version of Cuda 4.0.  I could try the mercurial versions of both.

 > There is a difference: in your new 4.0 build you added
 > --download-txpetscgpu=yes.  BTW, that doesn't work for me with the latest
 > cusp and thrust from the mercurial repositories.  Can you try reconfiguring
 > and making without that?

Yes, I can try that.  Maybe that is why my original build with cusp from
mercurial failed.
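
In case it helps, the reconfigure I have in mind would look roughly like
this; the option names are my guess from the configure help, so treat it as
a sketch rather than my actual build line:

```shell
# Sketch only: reconfigure petsc-dev with GPU support but without
# --download-txpetscgpu.  The exact option names/values are assumptions.
./configure --with-cuda=1 --with-cusp=1 --with-thrust=1
make all
```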

Thanks for your help,

Dave

 > Barry
 > 
 > On Oct 2, 2011, at 9:08 PM, Dave Nystrom wrote:
 > 
 >> Barry Smith writes:
 >>> On Oct 2, 2011, at 6:39 PM, Dave Nystrom wrote:
 >>> 
 >>>> Thanks for the update.  I don't believe I have gotten a run with good
 >>>> performance yet, either from C or Fortran.  I wish there was an easy way
 >>>> for me to force use of only one of my gpus.  I don't want to have to pull
 >>>> one of the gpus in order to see if that is complicating things with Cuda
 >>>> 4.0.  I'll be eager to hear if you make any progress on figuring things
 >>>> out.
 >>>> 
 >>>> Do you understand yet why the petsc ex2.c example is able to parse the
 >>>> "-cuda_show_devices" argument but ex2f.F does not?
 >>> 
 >>> Matt put the code in the wrong place in PETSc, that is all, no big
 >>> existential reason.  I will fix that.
 >> 
 >> Thanks.  I'll look forward to testing out the new version.
 >> 
 >> Dave
 >> 
 >>> Barry
 >>> 
 >>>> 
 >>>> Thanks,
 >>>> 
 >>>> Dave
 >>>> 
 >>>> Barry Smith writes:
 >>>>> It is not doing the MatMult operation on the GPU and hence needs to move
 >>>>> the vectors back and forth for each operation (since MatMult is done on
 >>>>> the CPU with the vector while vector operations are done on the GPU),
 >>>>> hence the terrible performance.
 >>>>> 
 >>>>> Not sure why yet. It is copying the Mat down for me from C.
 >>>>> 
 >>>>> Barry
 >>>>> 
 >>>>> On Oct 2, 2011, at 4:18 PM, Dave Nystrom wrote:
 >>>>> 
 >>>>>> In case it might be useful, I have attached two log files of runs
 >>>>>> with the ex2f petsc example from src/ksp/ksp/examples/tutorials.  One
 >>>>>> was run back in April with petsc-dev linked to Cuda 3.2.  It shows
 >>>>>> excellent runtime performance.  The other was run today with petsc-dev
 >>>>>> checked out of the mercurial repo yesterday morning and linked to
 >>>>>> Cuda 4.0.  In addition to the differences in run time performance, I
 >>>>>> also do not see an entry for MatCUSPCopyTo in the profiling section.
 >>>>>> I'm not sure what the significance of that is.  I do observe that the
 >>>>>> run time for PCApply is about the same for the two cases.  I think I
 >>>>>> would expect that to be the case even if the problem were partitioned
 >>>>>> across two gpus.  However, it does make me wonder if the absence of
 >>>>>> MatCUSPCopyTo in the profiling section of the Cuda 4.0 log file is an
 >>>>>> indication that the matrix was not actually copied to the gpu.  I'm
 >>>>>> not sure yet how to check for that.  Hope this might be useful.
 >>>>>> 
 >>>>>> Thanks,
 >>>>>> 
 >>>>>> Dave
 >>>>>> 
 >>>>>> 
 >>>>>> <ex2f_3200_3200_cuda_yes_cuda_3.2.log><ex2f_3200_3200_cuda_yes_cuda_4.0.log>
 >>>>>> Dave Nystrom writes:
 >>>>>>> Matthew Knepley writes:
 >>>>>>>> On Sat, Oct 1, 2011 at 11:26 PM, Dave Nystrom
 >>>>>>>> <Dave.Nystrom at tachyonlogic.com> wrote:
 >>>>>>>>> Barry Smith writes:
 >>>>>>>>>> On Oct 1, 2011, at 9:22 PM, Dave Nystrom wrote:
 >>>>>>>>>>> Hi Barry,
 >>>>>>>>>>> 
 >>>>>>>>>>> I've sent a couple more emails on this topic.  What I am trying
 >>>>>>>>>>> to do at the moment is to figure out how to have a problem run
 >>>>>>>>>>> on only one gpu if it will fit in the memory of that gpu.  Back
 >>>>>>>>>>> in April when I had built petsc-dev with Cuda 3.2, petsc would
 >>>>>>>>>>> only use one gpu if you had multiple gpus on your machine.  In
 >>>>>>>>>>> order to use multiple gpus for a problem, one had to use multiple
 >>>>>>>>>>> threads with a separate thread assigned to control each gpu.  But
 >>>>>>>>>>> Cuda 4.0 has, I believe, made that transparent and under the
 >>>>>>>>>>> hood.  So now when I run a small example problem such as
 >>>>>>>>>>> src/ksp/ksp/examples/tutorials/ex2f.F with an 800x800 problem,
 >>>>>>>>>>> it gets partitioned to run on both of the gpus in my machine.
 >>>>>>>>>>> The result is a very large performance hit because of
 >>>>>>>>>>> communication back and forth from one gpu to the other via the
 >>>>>>>>>>> cpu.
 >>>>>>>>>> 
 >>>>>>>>>> How do you know there is lots of communication from the GPU to
 >>>>>>>>>> the CPU?  In the -log_summary?  Nope, because PETSc does not
 >>>>>>>>>> manage anything like that (that is one CPU process using both
 >>>>>>>>>> GPUs).
 >>>>>>>>> 
 >>>>>>>>> What I believe is that it is being managed by Cuda 4.0, not by petsc.
 >>>>>>>>> 
 >>>>>>>>>>> So this problem with a 3200x3200 grid runs 5x slower now than it
 >>>>>>>>>>> did with Cuda 3.2.  I believe if one is programming down at the
 >>>>>>>>>>> cuda level, it is possible to have a smaller problem run on only
 >>>>>>>>>>> one gpu so that there is communication only between the cpu and
 >>>>>>>>>>> gpu and only at the start and end of the calculation.
 >>>>>>>>>>> 
 >>>>>>>>>>> To me, it seems like what is needed is a petsc option to specify
 >>>>>>>>>>> the number of gpus to run on that can somehow get passed down to
 >>>>>>>>>>> the cuda level through cusp and thrust.  I fear that the short
 >>>>>>>>>>> term solution is going to have to be for me to pull one of the
 >>>>>>>>>>> gpus out of my desktop system but it would be nice if there was
 >>>>>>>>>>> a way to tell petsc and friends to just use one gpu when I want
 >>>>>>>>>>> it to.
 >>>>>>>>>>> 
 >>>>>>>>>>> If necessary, I can send a couple of log files to demonstrate
 >>>>>>>>>>> what I am trying to describe regarding the performance hit.
 >>>>>>>>>> 
 >>>>>>>>>> I am not convinced that the poor performance you are getting now
 >>>>>>>>>> has anything to do with using both GPUs.  Please run a PETSc
 >>>>>>>>>> program with the command -cuda_show_devices
 >>>>>>>>> 
 >>>>>>>>> I ran the following command:
 >>>>>>>>> 
 >>>>>>>>> ex2f -m 8 -n 8 -ksp_type cg -pc_type jacobi -log_summary
 >>>>>>>>> -cuda_show_devices -mat_type aijcusp -vec_type cusp -options_left
 >>>>>>>>> 
 >>>>>>>>> The result was a report that there was one option left, that being
 >>>>>>>>> -cuda_show_devices.  I am using a copy of petsc-dev that I cloned
 >>>>>>>>> and built this morning.
 >>>>>>>> 
 >>>>>>>> What do you have at src/sys/object/pinit.c:825?  You should see the
 >>>>>>>> code that processes this option.  You should be able to break there
 >>>>>>>> in the debugger and see what happens.  This sounds again like you
 >>>>>>>> are not processing options correctly.
 >>>>>>> 
 >>>>>>> Hi Matt,
 >>>>>>> 
 >>>>>>> I'll take a look at that in a bit and see if I can figure out what
 >>>>>>> is going on.  I do see the code that you mention that processes the
 >>>>>>> arguments that Barry mentioned.  In terms of processing options
 >>>>>>> correctly, at least in this case I am actually running one of the
 >>>>>>> petsc examples rather than my own code.  And it seems to correctly
 >>>>>>> process the other command line arguments.  Anyway, I'll write more
 >>>>>>> after I have had a chance to investigate more.
 >>>>>>> 
 >>>>>>> Thanks,
 >>>>>>> 
 >>>>>>> Dave
 >>>>>>> 
 >>>>>>>> Matt
 >>>>>>>> 
 >>>>>>>>>> What are the choices?  You can then pick one of them and run with
 >>>>>>>>>> -cuda_set_device integer
 >>>>>>>>> 
 >>>>>>>>> The -cuda_set_device option does not appear to be recognized
 >>>>>>>>> either, even if I choose an integer like 0.
 >>>>>>>>> 
 >>>>>>>>>> Does this change things?
 >>>>>>>>> 
 >>>>>>>>> I suspect it would change things if I could get it to work.
 >>>>>>>>> 
 >>>>>>>>> Thanks,
 >>>>>>>>> 
 >>>>>>>>> Dave
 >>>>>>>>> 
 >>>>>>>>>> Barry
 >>>>>>>>>> 
 >>>>>>>>>>> 
 >>>>>>>>>>> Thanks,
 >>>>>>>>>>> 
 >>>>>>>>>>> Dave
 >>>>>>>>>>> 
 >>>>>>>>>>> Barry Smith writes:
 >>>>>>>>>>>> Dave,
 >>>>>>>>>>>> 
 >>>>>>>>>>>> We have no mechanism in the PETSc code for a PETSc single CPU
 >>>>>>>>>>>> process to use two GPUs at the same time.  However you could
 >>>>>>>>>>>> have two MPI processes each using their own GPU.
 >>>>>>>>>>>> 
 >>>>>>>>>>>> The one tricky part is you need to make sure each MPI process
 >>>>>>>>>>>> uses a different GPU.  We currently do not have a mechanism to
 >>>>>>>>>>>> do this assignment automatically.  I think it can be done with
 >>>>>>>>>>>> cudaSetDevice().  But I don't know the details; sending this to
 >>>>>>>>>>>> petsc-dev at mcs.anl.gov where more people may know.
 >>>>>>>>>>>> 
 >>>>>>>>>>>> PETSc-folks,
 >>>>>>>>>>>> 
 >>>>>>>>>>>> We need a way to have this setup automatically.
 >>>>>>>>>>>> 
 >>>>>>>>>>>> Barry
 >>>>>>>>>>>> 
 >>>>>>>>>>>> On Oct 1, 2011, at 5:43 PM, Dave Nystrom wrote:
 >>>>>>>>>>>> 
 >>>>>>>>>>>>> I'm running petsc on a machine with Cuda 4.0 and 2 gpus.  This
 >>>>>>>>>>>>> is a desktop machine with a single processor.  I know that
 >>>>>>>>>>>>> Cuda 4.0 has support for running on multiple gpus but don't
 >>>>>>>>>>>>> know if petsc uses that.  But suppose I have a problem that
 >>>>>>>>>>>>> will fit in the memory for a single gpu.  Will petsc run the
 >>>>>>>>>>>>> problem on a single gpu or does it split it between the 2 gpus
 >>>>>>>>>>>>> and incur the communication overhead of copying data between
 >>>>>>>>>>>>> the two gpus?
 >>>>>>>>>>>>> 
 >>>>>>>>>>>>> Thanks,
 >>>>>>>>>>>>> 
 >>>>>>>>>>>>> Dave
 >>>>>>>> 
 >>>>>>>> -- 
 >>>>>>>> What most experimenters take for granted before they begin their
 >>>>>>>> experiments is infinitely more interesting than any results to which
 >>>>>>>> their experiments lead.
 >>>>>>>> -- Norbert Wiener
 >>>>> 
 >>> 
 > 
