I will look into the issue and fix.

-Paul


John, Paul,

I ran the example with the same options, and the code aborts at a
different location in CUSP, although still called from PCSetUp_SACUSP.
The example works fine if txpetscgpu is not used. Valgrind does not
show any relevant issues prior to the std::terminate.

My best guess, based on this and some investigation, is that this is
happening because of inconsistent C-style casts in the code (which are
#ifdef'ed out when txpetscgpu is not used). They could be related to
the different code paths taken in calling MatCUSPCopyToGPU in
sacusp.cu, depending on the txpetscgpu macro. I'm busy with other
stuff, but I'll let you know when this gets fixed.

Chetan
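For what it's worth, the suspicious size in John's backtrace below,
46912574500784, is 0x2aaaaf56afb0 in hex, which falls in the same
0x2aaa... mapped-address range as the library frames in the trace;
that is consistent with the bits of a pointer being read as an element
count, which is exactly what an inconsistent cast could produce. A
minimal sketch of that failure mode follows (the struct layouts, the
names, and the PETSC_HAVE_TXPETSCGPU macro spelling are invented for
illustration; this is not the actual PETSc or txpetscgpu source):

/* Hypothetical sketch: a C-style cast through void* compiles no matter
 * which layout the pointer really has, so if two #ifdef branches
 * disagree about a struct layout, a size field can end up holding the
 * bits of a pointer instead of a length. */
#include <cstddef>
#include <thrust/device_vector.h>

struct MatGPUPlain { double *values; std::size_t nnz; }; /* one code path  */
struct MatGPUTx    { std::size_t nnz; double *values; }; /* the other path */

void copy_to_gpu(void *spptr, thrust::device_vector<double> &d_vals)
{
#if defined(PETSC_HAVE_TXPETSCGPU)
  MatGPUTx *m = (MatGPUTx *)spptr;  /* wrong if spptr is really MatGPUPlain* */
#else
  MatGPUPlain *m = (MatGPUPlain *)spptr;
#endif
  /* With the wrong layout, m->nnz holds pointer bits, and resize() is
   * asked for ~4.7e13 doubles (~375 TB), which Thrust reports as
   * "std::bad_alloc: out of memory". */
  d_vals.resize(m->nnz);
}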
From: petsc-dev-bounces at mcs.anl.gov [mailto:petsc-dev-bounces at mcs.anl.gov] On Behalf Of John Fettig
Sent: Monday, February 27, 2012 2:02 PM
To: For users of the development version of PETSc
Subject: Re: [petsc-dev] PETSc GPU capabilities

It finally finished running through cuda-gdb. Here's a backtrace.
new_size=46912574500784 in the call to thrust::detail::vector_base<double,
thrust::device_malloc_allocator<double> >::resize looks suspicious.

#0  0x0000003e1c832885 in raise () from /lib64/libc.so.6
#1  0x0000003e1c834065 in abort () from /lib64/libc.so.6
#2  0x0000003e284bea7d in __gnu_cxx::__verbose_terminate_handler() ()
    from /usr/lib64/libstdc++.so.6
#3  0x0000003e284bcc06 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x0000003e284bcc33 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5  0x0000003e284bcd2e in __cxa_throw () from /usr/lib64/libstdc++.so.6
#6  0x00002aaaab45ad71 in thrust::detail::backend::cuda::malloc<0u>
    (n=375300596006272) at malloc.inl:50
#7  0x00002aaaab454322 in thrust::detail::backend::dispatch::malloc<0u>
    (n=375300596006272) at malloc.h:56
#8  0x00002aaaab453555 in thrust::device_malloc (n=375300596006272)
    at device_malloc.inl:32
#9  0x00002aaaab46477d in thrust::device_malloc<double> (n=46912574500784)
    at device_malloc.inl:38
#10 0x00002aaaab461fce in thrust::device_malloc_allocator<double>::allocate
    (this=0x7fffffff9880, cnt=46912574500784) at device_malloc_allocator.h:101
#11 0x00002aaaab45ee91 in thrust::detail::contiguous_storage<double,
    thrust::device_malloc_allocator<double> >::allocate
    (this=0x7fffffff9880, n=46912574500784) at contiguous_storage.inl:134
#12 0x00002aaaab46ebba in thrust::detail::contiguous_storage<double,
    thrust::device_malloc_allocator<double> >::contiguous_storage
    (this=0x7fffffff9880, n=46912574500784) at contiguous_storage.inl:46
#13 0x00002aaaab46cd1e in thrust::detail::vector_base<double,
    thrust::device_malloc_allocator<double> >::fill_insert
    (this=0x13623990, position=..., n=46912574500784, x=@0x7fffffff9f18)
    at vector_base.inl:792
#14 0x00002aaaab46b058 in thrust::detail::vector_base<double,
    thrust::device_malloc_allocator<double> >::insert (this=0x13623990,
    position=..., n=46912574500784, x=@0x7fffffff9f18) at vector_base.inl:561
#15 0x00002aaaab4692a3 in thrust::detail::vector_base<double,
    thrust::device_malloc_allocator<double> >::resize (this=0x13623990,
    new_size=46912574500784, x=@0x7fffffff9f18) at vector_base.inl:222
#16 0x00002aaaac2c3d9b in cusp::precond::smoothed_aggregation<int, double,
    thrust::detail::cuda_device_space_tag>::smoothed_aggregation<
    cusp::csr_matrix<int, double, thrust::detail::cuda_device_space_tag> >
    (this=0x136182b0, A=..., theta=0) at smoothed_aggregation.inl:210
#17 0x00002aaaac27cf84 in PCSetUp_SACUSP (pc=0x1360f330) at sacusp.cu:76
#18 0x00002aaaac1f0024 in PCSetUp (pc=0x1360f330) at precon.c:832
#19 0x00002aaaabd02144 in KSPSetUp (ksp=0x135d2a00) at itfunc.c:261
#20 0x00002aaaabd0396e in KSPSolve (ksp=0x135d2a00, b=0x135a0fa0,
    x=0x135a2b50) at itfunc.c:385
#21 0x0000000000403619 in main (argc=17, args=0x7fffffffc538) at ex2.c:217
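Two details in the trace are worth spelling out. First, the garbage
element count from frame #15 (new_size=46912574500784) flows intact
down to frame #9 and becomes a byte count in frames #6-#8:
375300596006272 = 8 * 46912574500784, i.e. the same length times
sizeof(double). Second, the device allocation cannot satisfy the
request, Thrust throws, and since nothing between the
smoothed_aggregation constructor and main() catches the exception,
__cxa_throw ends in std::terminate and abort (frames #0-#5). A minimal
standalone sketch of that mechanism (assuming only CUDA and Thrust, as
in the trace; this is not the PETSc code path itself):

// Requesting an absurd element count makes the device allocation fail;
// the resulting exception, if left uncaught, terminates the program
// exactly as in the backtrace above.
#include <cstdio>
#include <new>
#include <thrust/device_vector.h>

int main(void)
{
  thrust::device_vector<double> v;
  try {
    v.resize(46912574500784ULL);  // ~375 TB of doubles, the size in frame #15
  } catch (const std::bad_alloc &e) {
    // thrust::system::detail::bad_alloc derives from std::bad_alloc,
    // so this handler catches it; without it, std::terminate runs.
    std::printf("caught: %s\n", e.what());
    return 1;
  }
  return 0;
}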
On Mon, Feb 27, 2012 at 4:48 PM, John Fettig <john.fettig at gmail.com> wrote:

Hi Paul,

This is very interesting. I tried building the code with
--download-txpetscgpu and it doesn't work for me. It runs out of memory
no matter how small the problem is (this is ex2 from
src/ksp/ksp/examples/tutorials):

mpirun -np 1 ./ex2 -n 10 -m 10 -ksp_type cg -pc_type sacusp -mat_type
aijcusp -vec_type cusp -cusp_storage_format csr -use_cusparse 0

terminate called after throwing an instance of
'thrust::system::detail::bad_alloc'
  what():  std::bad_alloc: out of memory
MPI Application rank 0 killed before MPI_Finalize() with signal 6

This example works fine when I build without your GPU additions (and
for much larger problems, too). Am I doing something wrong?

For reference, I'm using CUDA 4.1, CUSP 0.3, and Thrust 1.5.1.

John

On Fri, Feb 10, 2012 at 5:04 PM, Paul Mullowney <paulm at txcorp.com> wrote:

Hi All,

I've been developing GPU capabilities for PETSc. The development has
focused mostly on:

(1) An efficient multi-GPU SpMV, i.e. MatMult. This is working well.
(2) The triangular solve used in ILU preconditioners, i.e. MatSolve.
The performance of this ... is what it is :|

This code is in beta mode; keep that in mind if you decide to use it.
It supports single and double precision, real numbers only! Complex
will be supported at some point in the future, but not any time soon.

To build with these capabilities, add the following to your configure
line:

--download-txpetscgpu=yes

The capabilities of the SpMV code are accessed with the following two
command-line flags:

-cusp_storage_format csr (the other options are coo (coordinate), ell
(ellpack), and dia (diagonal); hyb (hybrid) is not yet supported)
-use_cusparse (a boolean; at the moment it is only supported with csr
format matrices. In the future, cusparse will work with the ell, coo,
and hyb formats.)

Regarding the number of GPUs to run on: imagine a system with P nodes,
N cores per node, and M GPUs per node. Then, to use only the GPUs, I
would run with M ranks per node over P nodes. As an example, I have a
system with 2 nodes; each node has 8 cores and 4 GPUs attached (P=2,
N=8, M=4). In a PBS queue script, one would request 2 nodes at 4
processors per node. Each MPI rank (CPU processor) will then be
attached to a GPU.

You do not need to explicitly manage the GPUs, apart from
understanding what type of system you are running on. To learn how
many devices are available per node, use the command line flag:

-cuda_show_devices

-Paul