I will look into the issue and fix it.
> John, Paul,
> I ran the example with the same options and the code aborts at
> a different location in cusp, although it is still called from
> PCSetUp_SACUSP.
> The example works fine if txpetscgpu is not used.
> Valgrind does not show any relevant issues prior to the std::terminate.
> My best guess based on this and some investigation is that this is
> happening because of inconsistent C-style casts in the code (which are
> #ifdefed out when txpetscgpu is not used). They could be related to
> different code paths taken in calling MatCUSPCopyToGPU in sacusp.cu,
> depending on the txpetscgpu macro.
> I'm busy with other stuff, but I'll let you know when this gets fixed.
> Chetan
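
To make the cast hypothesis above concrete, here is a minimal sketch of how
two pieces of code that disagree about an object's layout can turn a pointer
into a huge "size". The struct names and fields are hypothetical and are not
the actual PETSc, CUSP, or txpetscgpu types; the sketch only illustrates the
general failure mode and does not claim to be the bug in sacusp.cu. memcpy is
used in place of a C-style cast so that the example itself stays well-defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical layout assumed by code compiled WITHOUT some macro.
struct LayoutWithoutMacro {
    std::size_t n;     // element count comes first
    double*     data;
};

// Hypothetical layout assumed by code compiled WITH the macro.
struct LayoutWithMacro {
    double*     data;  // pointer comes first
    std::size_t n;
};

static_assert(sizeof(LayoutWithMacro) == sizeof(LayoutWithoutMacro),
              "sketch assumes both layouts occupy the same number of bytes");

// Returns the "size" that code believing in LayoutWithoutMacro would read
// from an object that was really built as LayoutWithMacro.
std::size_t size_seen_through_wrong_layout(const LayoutWithMacro& real) {
    assert(real.data != nullptr);
    LayoutWithoutMacro view;
    std::memcpy(&view, &real, sizeof view);  // reinterpret the same bytes
    return view.n;                           // actually the bytes of real.data
}
```

On a typical 64-bit platform the value returned is the numeric address held
in `data`, not the real element count, which is the kind of number that would
make a later resize request an absurd amount of memory.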
> *From:* petsc-dev-bounces at mcs.anl.gov *On Behalf Of* John Fettig
> *Sent:* Monday, February 27, 2012 2:02 PM
> *To:* For users of the development version of PETSc
> *Subject:* Re: [petsc-dev] PETSc GPU capabilities
> It finally finished running through cuda-gdb.  Here's a backtrace.  
> new_size=46912574500784 in the call to 
> thrust::detail::vector_base<double, 
> thrust::device_malloc_allocator<double> >::resize looks suspicious.
> #0  0x0000003e1c832885 in raise () from /lib64/libc.so.6
> #1  0x0000003e1c834065 in abort () from /lib64/libc.so.6
> #2  0x0000003e284bea7d in __gnu_cxx::__verbose_terminate_handler() ()
>    from /usr/lib64/libstdc++.so.6
> #3  0x0000003e284bcc06 in ?? () from /usr/lib64/libstdc++.so.6
> #4  0x0000003e284bcc33 in std::terminate() () from 
> /usr/lib64/libstdc++.so.6
> #5  0x0000003e284bcd2e in __cxa_throw () from /usr/lib64/libstdc++.so.6
> #6  0x00002aaaab45ad71 in thrust::detail::backend::cuda::malloc<0u> 
> (n=375300596006272)
>     at malloc.inl:50
> #7  0x00002aaaab454322 in 
> thrust::detail::backend::dispatch::malloc<0u> (n=375300596006272)
>     at malloc.h:56
> #8  0x00002aaaab453555 in thrust::device_malloc (n=375300596006272) at 
> device_malloc.inl:32
> #9  0x00002aaaab46477d in thrust::device_malloc<double> (n=46912574500784)
>     at device_malloc.inl:38
> #10 0x00002aaaab461fce in 
> thrust::device_malloc_allocator<double>::allocate (
>     this=0x7fffffff9880, cnt=46912574500784) at 
> device_malloc_allocator.h:101
> #11 0x00002aaaab45ee91 in thrust::detail::contiguous_storage<double, 
> thrust::device_malloc_allocator<double> >::allocate 
> (this=0x7fffffff9880, n=46912574500784)
>     at contiguous_storage.inl:134
> #12 0x00002aaaab46ebba in thrust::detail::contiguous_storage<double, 
> thrust::device_malloc_allocator<double> >::contiguous_storage 
> (this=0x7fffffff9880, n=46912574500784)
>     at contiguous_storage.inl:46
> #13 0x00002aaaab46cd1e in thrust::detail::vector_base<double, 
> thrust::device_malloc_allocator<double> >::fill_insert 
> (this=0x13623990, position=..., n=46912574500784,
>     x=@0x7fffffff9f18) at vector_base.inl:792
> #14 0x00002aaaab46b058 in thrust::detail::vector_base<double, 
> thrust::device_malloc_allocator<double> >::insert (this=0x13623990, 
> position=..., n=46912574500784, x=@0x7fffffff9f18)
>     at vector_base.inl:561
> #15 0x00002aaaab4692a3 in thrust::detail::vector_base<double, 
> thrust::device_malloc_allocator<double> >::resize (this=0x13623990, 
> new_size=46912574500784, x=@0x7fffffff9f18)
>     at vector_base.inl:222
> #16 0x00002aaaac2c3d9b in cusp::precond::smoothed_aggregation<int, 
> double, 
> thrust::detail::cuda_device_space_tag>::smoothed_aggregation<cusp::csr_matrix<int,
> double, thrust::detail::cuda_device_space_tag> > (this=0x136182b0, 
> A=..., theta=0) at smoothed_aggregation.inl:210
> #17 0x00002aaaac27cf84 in PCSetUp_SACUSP (pc=0x1360f330) at
> sacusp.cu:76
> #18 0x00002aaaac1f0024 in PCSetUp (pc=0x1360f330) at precon.c:832
> #19 0x00002aaaabd02144 in KSPSetUp (ksp=0x135d2a00) at itfunc.c:261
> #20 0x00002aaaabd0396e in KSPSolve (ksp=0x135d2a00, b=0x135a0fa0, 
> x=0x135a2b50)
>     at itfunc.c:385
> #21 0x0000000000403619 in main (argc=17, args=0x7fffffffc538) at ex2.c:217
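
As a quick sanity check on the numbers in the backtrace above, the sketch
below confirms that the byte count passed to thrust::device_malloc in frames
#6 to #8 is exactly new_size * sizeof(double), and formats new_size in
hexadecimal. The helper names are made up for this illustration. The request
works out to roughly 341 TiB, and new_size itself is 0x2aaaaf56afb0, which
lies in the same range as the code addresses in the backtrace (for example
0x00002aaaab45ad71). That is consistent with the value being a stray pointer
or uninitialized memory rather than a real size, but it does not prove it.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Values copied from the backtrace.
constexpr std::uint64_t kNewSize   = 46912574500784ULL;   // new_size, frame #15
constexpr std::uint64_t kByteCount = 375300596006272ULL;  // n, frames #6 to #8

// Bytes needed to store `count` doubles.
constexpr std::uint64_t bytes_for_doubles(std::uint64_t count) {
    return count * sizeof(double);
}

// Formats a value as lowercase hex with a 0x prefix, for comparison with
// the addresses printed by the debugger.
std::string as_hex(std::uint64_t value) {
    char buf[32];
    std::snprintf(buf, sizeof buf, "0x%llx",
                  static_cast<unsigned long long>(value));
    return std::string(buf);
}

// Checks that the element count and the byte count in the trace agree.
void check_backtrace_arithmetic() {
    assert(bytes_for_doubles(kNewSize) == kByteCount);
}
```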
> On Mon, Feb 27, 2012 at 4:48 PM, John Fettig <john.fettig at gmail.com>
> wrote:
> Hi Paul,
> This is very interesting.  I tried building the code with 
> --download-txpetscgpu and it doesn't work for me.  It runs out of 
> memory, no matter how small the problem (this is ex2 from 
> src/ksp/ksp/examples/tutorials):
> mpirun -np 1 ./ex2 -n 10 -m 10 -ksp_type cg -pc_type sacusp -mat_type aijcusp -vec_type cusp -cusp_storage_format csr -use_cusparse 0
> terminate called after throwing an instance of 'thrust::system::detail::bad_alloc'
>   what():  std::bad_alloc: out of memory
> MPI Application rank 0 killed before MPI_Finalize() with signal 6
> This example works fine when I build without your gpu additions (and 
> for much larger problems too).  Am I doing something wrong?
> For reference, I'm using CUDA 4.1, CUSP 0.3, and Thrust 1.5.1
> John
> On Fri, Feb 10, 2012 at 5:04 PM, Paul Mullowney <paulm at txcorp.com>
> wrote:
> Hi All,
> I've been developing GPU capabilities for PETSc. The development has 
> focused mostly on
> (1) An efficient multi-GPU SpMV, i.e. MatMult. This is working well.
> (2) Triangular Solve used in ILU preconditioners; i.e. MatSolve. The 
> performance of this ... is what it is :|
> This code is in beta mode. Keep that in mind if you decide to use it.
> It supports single and double precision, real numbers only! Complex 
> will be supported at some point in the future, but not any time soon.
> To build with these capabilities, add the following to your configure 
> line.
> --download-txpetscgpu=yes
> The capabilities of the SpMV code are accessed with the following 2 
> command line flags
> -cusp_storage_format csr (other options are coo (coordinate), ell 
> (ellpack), dia (diagonal). hyb (hybrid) is not yet supported)
> -use_cusparse (this is a boolean and at the moment is only supported 
> with csr format matrices. In the future, cusparse will work with ell, 
> coo, and hyb formats).
> Regarding the number of GPUs to run on:
> Imagine a system with P nodes, N cores per node, and M GPUs per node. 
> Then, to use only the GPUs, I would run with M ranks per node over P 
> nodes.  As an example, I have a system with 2 nodes. Each node has 8 
> cores, and 4 GPUs attached to each node (P=2, N=8, M=4). In a PBS 
> queue script, one would use 2 nodes at 4 processors per node. Each MPI
> rank (CPU process) will be attached to a GPU.
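
The rank counting in the example above can be sketched as follows. The total
number of ranks (M per node over P nodes) comes straight from the
description. The rank-to-device mapping is an assumption made for this
illustration only: the message does not say how ranks are bound to GPUs, so
the round-robin rule below (ranks laid out node by node, device = rank % M)
and the function names are hypothetical.

```cpp
#include <cassert>

// Total MPI ranks when running one rank per GPU: M ranks on each of P nodes.
int total_ranks(int nodes, int gpus_per_node) {
    assert(nodes > 0 && gpus_per_node > 0);
    return nodes * gpus_per_node;
}

// ASSUMPTION (not stated in the message): ranks are numbered node by node
// and each rank uses the device with index rank % M on its node.
int assumed_device_for_rank(int rank, int gpus_per_node) {
    assert(rank >= 0 && gpus_per_node > 0);
    return rank % gpus_per_node;
}
```

For the system described (P=2, M=4) this gives 8 ranks in total. Under the
assumed mapping, rank 5 would land on device 1 of the second node. The 8
cores per node (N) do not enter the calculation when only the GPUs are used.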
> You do not need to explicitly manage the GPUs, apart from 
> understanding what type of system you are running on. To learn how 
> many devices are available per node, use the command line flag:
> -cuda_show_devices
> -Paul
