On Wed, Jun 27, 2018 at 3:12 PM, Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>
>   David,
>
>   This is ugly but should work. BEFORE reading in the matrix and right
>   hand side set the LOCAL sizes for the matrix and vector. This way you can
>   control exactly which rows go on which process. Note you will have to have
>   your own mechanism to know what the local sizes should be (for example have
>   the original program print out the sizes and just cut and paste them into
>   your copy of ex10.c) PETSc doesn't provide an automatic way to do this (nor
>   should it).
>
>   Barry
>

Thanks Barry and Stefano, the approach you both suggested (calling
MatSetSizes and VecSetSizes before reading in) was what I was looking for.

Best,
David

> On Jun 27, 2018, at 1:36 PM, David Knezevic <david.kneze...@akselos.com> wrote:
> >
> > I ran into a case where using MUMPS (called via "-ksp_type preonly
> > -pc_type lu -pc_factor_mat_solver_package mumps") for a particular solve
> > hangs indefinitely with 24 MPI processes (but it works fine with other
> > numbers of processes). The stack trace when killing the job is below, in
> > case that gives any clue as to what is wrong.
> >
> > I'm trying to replicate this with a simple test case. I wrote out the
> > matrix and right-hand side to disk using MatView and VecView, and then I
> > modified ksp ex10 to read in these files and solve with 24 cores. However,
> > that did not replicate the error, so I think I also need to make sure that
> > I use the same number of rows per process in the test case as in the case
> > that hung. As a result I'm wondering if there is a way to modify the
> > parallel layout of the matrix and vector after I read them in?
> >
> > Also, if there are any other suggestions about reproducing or debugging
> > this issue, please let me know!
> >
> > Best,
> > David
> >
> > --------------------------------
> >
> > #0 0x00007fb12bf0e74d in poll () at ../sysdeps/unix/syscall-template.S:84
> > #1 0x00007fb126262e58 in ?? () from /usr/lib/libopen-pal.so.13
> > #2 0x00007fb1262596fb in opal_libevent2021_event_base_loop () from /usr/lib/libopen-pal.so.13
> > #3 0x00007fb126223238 in opal_progress () from /usr/lib/libopen-pal.so.13
> > #4 0x00007fb12cef53db in ompi_request_default_test () from /usr/lib/libmpi.so.12
> > #5 0x00007fb12cf21d61 in PMPI_Test () from /usr/lib/libmpi.so.12
> > #6 0x00007fb127a5b939 in pmpi_test__ () from /usr/lib/libmpi_mpifh.so.12
> > #7 0x00007fb132888d87 in dmumps_try_recvtreat (comm_load=8,
> >     ass_irecv=40, blocking=.FALSE., set_irecv=.TRUE., message_received=.FALSE.,
> >     msgsou=-1, msgtag=-1, status=..., bufr=..., lbufr=401408,
> >     lbufr_bytes=1605629, procnode_steps=..., posfac=410095, iwpos=3151,
> >     iwposcb=30557, iptrlu=1536548, lrlu=1126454, lrlus=2864100, n=30675, iw=...,
> >     liw=39935, a=..., la=3367108, ptrist=..., ptlust=..., ptrfac=...,
> >     ptrast=..., step=..., pimaster=..., pamaster=..., nstk_s=..., comp=0,
> >     iflag=0, ierror=0, comm=7, nbprocfils=..., ipool=..., lpool=48, leaf=2,
> >     nbfin=90, myid=33, slavef=90, root=..., opassw=353031,
> >     opeliw=700399235, itloc=..., rhs_mumps=..., fils=..., ptrarw=...,
> >     ptraiw=..., intarr=..., dblarr=..., icntl=..., keep=..., keep8=...,
> >     dkeep=..., nd=..., frere=..., lptrar=30675, nelt=1, frtptr=..., frtelt=...,
> >     istep_to_iniv2=..., tab_pos_in_pere=...,
> >     stack_right_authorized=.TRUE., lrgroups=...) at dfac_process_message.F:646
> > #8 0x00007fb1328cfcd1 in dmumps_fac_par_m::dmumps_fac_par (n=30675,
> >     iw=..., liw=39935, a=..., la=3367108, nstk_steps=..., nbprocfils=...,
> >     nd=..., fils=..., step=..., frere=..., dad=..., cand=...,
> >     istep_to_iniv2=..., tab_pos_in_pere=..., maxfrt=0, ntotpv=0, nmaxnpiv=150,
> >     ptrist=..., ptrast=..., pimaster=..., pamaster=..., ptrarw=...,
> >     ptraiw=..., itloc=..., rhs_mumps=..., ipool=..., lpool=48, rinfo=...,
> >     posfac=410095, iwpos=3151, lrlu=1126454, iptrlu=1536548, lrlus=2864100,
> >     leaf=2, nbroot=1, nbrtot=90, uu=0.01, icntl=..., ptlust=..., ptrfac=...,
> >     nsteps=1, info=..., keep=..., keep8=..., procnode_steps=...,
> >     slavef=90, myid=33, comm_nodes=7, myid_nodes=33, bufr=..., lbufr=401408,
> >     lbufr_bytes=1605629, intarr=..., dblarr=..., root=..., perm=..., nelt=1,
> >     frtptr=..., frtelt=..., lptrar=30675, comm_load=8, ass_irecv=40,
> >     seuil=0, seuil_ldlt_niv2=0, mem_distrib=..., ne=..., dkeep=...,
> >     pivnul_list=..., lpn_list=1, lrgroups=...) at dfac_par_m.F:207
> > #9 0x00007fb13287f875 in dmumps_fac_b (n=30675, nsteps=1, a=...,
> >     la=3367108, iw=..., liw=39935, sym_perm=..., na=..., lna=47, ne_steps=...,
> >     nfsiz=..., fils=..., step=..., frere=..., dad=..., cand=...,
> >     istep_to_iniv2=..., tab_pos_in_pere=..., ptrar=..., ldptrar=30675,
> >     ptrist=..., ptlust_s=..., ptrfac=..., iw1=..., iw2=..., itloc=...,
> >     rhs_mumps=..., pool=..., lpool=48, cntl1=0.01, icntl=..., info=...,
> >     rinfo=..., keep=..., keep8=..., procnode_steps=..., slavef=90,
> >     comm_nodes=7, myid=33, myid_nodes=33, bufr=..., lbufr=401408,
> >     lbufr_bytes=1605629, intarr=..., dblarr=..., root=..., nelt=1, frtptr=...,
> >     frtelt=..., comm_load=8, ass_irecv=40, seuil=0, seuil_ldlt_niv2=0,
> >     mem_distrib=..., dkeep=..., pivnul_list=..., lpn_list=1, lrgroups=...)
> >     at dfac_b.F:167
> > #10 0x00007fb1328419ed in dmumps_fac_driver (id=<error reading variable:
> >     value requires 600640 bytes, which is more than max-value-size>) at
> >     dfac_driver.F:2291
> > #11 0x00007fb1327ff6dc in dmumps (id=<error reading variable: value
> >     requires 600640 bytes, which is more than max-value-size>) at
> >     dmumps_driver.F:1686
> > #12 0x00007fb1327faf0a in dmumps_f77 (job=2, sym=0, par=1, comm_f77=5,
> >     n=30675, icntl=..., cntl=..., keep=..., dkeep=..., keep8=..., nz=0, nnz=0,
> >     irn=..., irnhere=0, jcn=..., jcnhere=0, a=..., ahere=0, nz_loc=622296,
> >     nnz_loc=0, irn_loc=..., irn_lochere=1, jcn_loc=...,
> >     jcn_lochere=1, a_loc=..., a_lochere=1, nelt=0, eltptr=...,
> >     eltptrhere=0, eltvar=..., eltvarhere=0, a_elt=..., a_elthere=0,
> >     perm_in=..., perm_inhere=0, rhs=..., rhshere=0, redrhs=..., redrhshere=0,
> >     info=..., rinfo=..., infog=..., rinfog=..., deficiency=0, lwk_user=0,
> >     size_schur=0, listvar_schur=..., listvar_schurhere=0, schur=...,
> >     schurhere=0, wk_user=..., wk_userhere=0, colsca=..., colscahere=0,
> >     rowsca=..., rowscahere=0, instance_number=1, nrhs=1, lrhs=0, lredrhs=0,
> >     rhs_sparse=..., rhs_sparsehere=0, sol_loc=..., sol_lochere=0,
> >     irhs_sparse=..., irhs_sparsehere=0, irhs_ptr=..., irhs_ptrhere=0,
> >     isol_loc=..., isol_lochere=0, nz_rhs=0, lsol_loc=0, schur_mloc=0,
> >     schur_nloc=0, schur_lld=0, mblock=0, nblock=0, nprow=0, npcol=0,
> >     ooc_tmpdir=..., ooc_prefix=..., write_problem=..., tmpdirlen=20,
> >     prefixlen=20, write_problemlen=20) at dmumps_f77.F:267
> > #13 0x00007fb1327f9cfa in dmumps_c (mumps_par=mumps_par@entry=0x12bd9660)
> >     at mumps_c.c:417
> > #14 0x00007fb1321a23fc in MatFactorNumeric_MUMPS (F=0x12bd8b60,
> >     A=0x26bd890, info=<optimized out>) at
> >     /home/buildslave/software/petsc-src/src/mat/impls/aij/mpi/mumps/mumps.c:1073
> > #15 0x00007fb131ec6ea7 in MatLUFactorNumeric (fact=0x12bd8b60,
> >     mat=0x26bd890, info=info@entry=0xc2a66f8) at
> >     /home/buildslave/software/petsc-src/src/mat/interface/matrix.c:3025
> > #16 0x00007fb1325040d6 in PCSetUp_LU (pc=0xc2a6380) at
> >     /home/buildslave/software/petsc-src/src/ksp/pc/impls/factor/lu/lu.c:131
> > #17 0x00007fb13259903e in PCSetUp (pc=0xc2a6380) at
> >     /home/buildslave/software/petsc-src/src/ksp/pc/interface/precon.c:923
> > #18 0x00007fb13263e53f in KSPSetUp (ksp=ksp@entry=0x12b28c70) at
> >     /home/buildslave/software/petsc-src/src/ksp/ksp/interface/itfunc.c:381
> > #19 0x00007fb13263ed36 in KSPSolve (ksp=0x12b28c70, b=0xad77d50,
> >     x=0xad801c0) at
> >     /home/buildslave/software/petsc-src/src/ksp/ksp/interface/itfunc.c:612
> > #20 0x00007fb12db5dfc2 in
> >     libMesh::PetscLinearSolver<double>::solve(libMesh::SparseMatrix<double>&,
> >     libMesh::SparseMatrix<double>&, libMesh::NumericVector<double>&,
> >     libMesh::NumericVector<double>&, double, unsigned int) () from
> >     /mnt/fileserver/akselos-4.2.x/scrbe/build/bin/../../third_party/opt_real/libmesh_opt.so.0
> > #21 0x00007fb1338d0c06 in
> >     libMesh::PetscLinearSolver<double>::solve(libMesh::SparseMatrix<double>&,
> >     libMesh::NumericVector<double>&, libMesh::NumericVector<double>&, double,
> >     unsigned int) () from
> >     /mnt/fileserver/akselos-4.2.x/scrbe/build/bin/../lib/libscrbe-opt_real.so
> > #22 0x00007fb1335e8abd in std::pair<unsigned int, double>
> >     SolveHelper::try_linear_solve<libMesh::LinearSolver<double>
> >     >(libMesh::LinearSolver<double>&, libMesh::SolverConfiguration&,
> >     libMesh::SparseMatrix<double>&, libMesh::NumericVector<double>&,
> >     libMesh::NumericVector<double>&) () from
> >     /mnt/fileserver/akselos-4.2.x/scrbe/build/bin/../lib/libscrbe-opt_real.so
> > #23 0x00007fb133a70206 in
> >
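
For anyone following the approach Barry describes in this thread (setting the LOCAL sizes before reading in), here is a minimal sketch of the bookkeeping involved. It is plain C with no PETSc headers so that it compiles on its own. The helper names default_local_rows and total_rows are hypothetical and are not PETSc API. The split formula is, to my understanding, what PETSc applies when a local size is left as PETSC_DECIDE, so it should describe the layout an unmodified ex10 would pick. The PETSc calls appear only in comments, as an outline of where the recorded sizes would go.

```c
#include <assert.h>
#include <stddef.h>

/* Rows a rank would own under the default split (assumption: this mirrors
 * PETSC_DECIDE): every rank gets N / size rows and the first N % size
 * ranks get one extra row. */
int default_local_rows(int global_rows, int nranks, int rank)
{
    assert(nranks > 0 && rank >= 0 && rank < nranks);
    return global_rows / nranks + ((global_rows % nranks) > rank ? 1 : 0);
}

/* Sum a table of per-rank local sizes, e.g. the sizes printed by the
 * original program, so it can be checked against the global size before
 * being pasted into a copy of ex10.c. */
long total_rows(const int *local_rows, size_t nranks)
{
    long total = 0;
    for (size_t r = 0; r < nranks; ++r) {
        total += local_rows[r];
    }
    return total;
}

/* Outline of the PETSc side (not compiled here). With m the recorded local
 * size for this rank, the sizes are set BEFORE loading:
 *
 *   MatCreate(PETSC_COMM_WORLD, &A);
 *   MatSetSizes(A, m, m, PETSC_DETERMINE, PETSC_DETERMINE);
 *   MatLoad(A, viewer);
 *
 *   VecCreate(PETSC_COMM_WORLD, &b);
 *   VecSetSizes(b, m, PETSC_DETERMINE);
 *   VecLoad(b, viewer);
 */
```

With the global size n=30675 from the stack trace and 24 ranks, default_local_rows gives 1279 rows on ranks 0 to 2 and 1278 rows on the others. If the sizes printed by the original program differ from that, the default layout is not the one that hung, and the recorded sizes are the ones to pass to MatSetSizes and VecSetSizes.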