On 17.10.2012 17:50, Hong Zhang wrote:
> Thomas:
>
> Does this occur only for large matrices?
> Can you dump your matrices into petsc binary files
> (e.g., A.dat, B.dat) and send them to us for debugging?
>
> Recently we added a new implementation of MatTransposeMatMult() in petsc-dev
> which has been shown to be much faster than the released MatTransposeMatMult().
> You might give it a try:
> 1. install petsc-dev (see
>    http://www.mcs.anl.gov/petsc/developers/index.html)
> 2. run your code with the option '-mattransposematmult_viamatmatmult 1'
> Let us know what you get.

I checked the problem with petsc-dev. Here, the code just hangs somewhere
inside MatTransposeMatMult. I checked what MatTranspose does on the
corresponding matrix, and the behavior is the same.

I extracted the matrix from my simulations; it is of size 123,432 x 1,533,726
and very sparse (2 to 8 nonzeros per row). I'm sorry, but this is the smallest
matrix for which I found the problem (I will send the matrix file to
petsc-maint; a sketch of how I wrote it out is at the very end of this
message).

I wrote a small piece of code that just reads the matrix and runs
MatTranspose.
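Roughly, it looks like this (a minimal sketch rather than the exact code; the
filename A.dat is just a placeholder, and I assume the matrix was written with
MatView to a PETSc binary viewer):

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A, At;
  PetscViewer    viewer;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, PETSC_NULL, PETSC_NULL);CHKERRQ(ierr);

  /* read the matrix from the binary file written by the simulation */
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "A.dat", FILE_MODE_READ, &viewer);CHKERRQ(ierr);
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetType(A, MATAIJ);CHKERRQ(ierr);
  ierr = MatLoad(A, viewer);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);

  /* this is the call that hangs (or gets killed) with more than one MPI task */
  ierr = MatTranspose(A, MAT_INITIAL_MATRIX, &At);CHKERRQ(ierr);

  ierr = MatDestroy(&At);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return 0;
}

I run it with several MPI tasks, e.g. mpiexec -n 8 ./mattranspose_test (the
program name is arbitrary).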
With 1 MPI task, it works fine. With a small number of MPI tasks (around 8), I
get the following error message:

[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[1]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[1]PETSC ERROR: likely location of problem given in stack below
[1]PETSC ERROR: --------------------- Stack Frames ------------------------------------
[1]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[1]PETSC ERROR:       INSTEAD the line number of the start of the function
[1]PETSC ERROR:       is given.
[1]PETSC ERROR: [1] PetscSFReduceEnd line 1259 src/sys/sf/sf.c
[1]PETSC ERROR: [1] MatTranspose_MPIAIJ line 2045 src/mat/impls/aij/mpi/mpiaij.c
[1]PETSC ERROR: [1] MatTranspose line 4341 src/mat/interface/matrix.c

With 32 MPI tasks, which I also use in my simulation, the code hangs in
MatTranspose. If there is anything more I can do to help you find the problem,
please let me know!

Thomas

> Hong
>
> My code makes use of the function MatTransposeMatMult, and usually
> it works fine! For some larger input data, it now stops with a lot
> of MPI errors:
>
> Fatal error in PMPI_Barrier: Other MPI error, error stack:
> PMPI_Barrier(476)..: MPI_Barrier(comm=0x84000001) failed
> MPIR_Barrier(82)...:
> MPI_Waitall(261): MPI_Waitall(count=9, req_array=0xa787ba0,
> status_array=0xa789240) failed
> MPI_Waitall(113): The supplied request in array element 8 was
> invalid (kind=0)
> Fatal error in PMPI_Barrier: Other MPI error, error stack:
> PMPI_Barrier(476)..: MPI_Barrier(comm=0x84000001) failed
> MPIR_Barrier(82)...:
> mpid_irecv_done(98): read from socket failed - request
> state:recv(pde)done
>
> Here is the stack print from the debugger:
>
> 6, MatTransposeMatMult (matrix.c:8907)
> 6, MatTransposeMatMult_MPIAIJ_MPIAIJ (mpimatmatmult.c:809)
> 6, MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ (mpimatmatmult.c:1136)
> 6, PetscGatherMessageLengths2 (mpimesg.c:213)
> 6, PMPI_Waitall
> 6, MPIR_Err_return_comm
> 6, MPID_Abort
>
> I use PETSc 3.3-p3. Any idea whether this is or could be related
> to some bug in PETSc or whether I make wrong use of the function
> in some way?
>
> Thomas
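P.S. Since Hong asked about dumping the matrices to PETSc binary files: inside
the simulation the matrix is written out roughly like this (again only a
sketch; A is the assembled parallel matrix and A.dat an example filename):

  PetscViewer viewer;
  /* write the assembled matrix A to a PETSc binary file */
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "A.dat", FILE_MODE_WRITE, &viewer);CHKERRQ(ierr);
  ierr = MatView(A, viewer);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);

The file can then be read back with MatLoad() as in the small test program
above.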
