David,

Superlu_dist seems slightly better. Does mumps crash during numeric
factorization due to a memory limitation? You may try the option
'-mat_mumps_icntl_14 <num>' with num > 20 (ICNTL(14): percentage of
estimated workspace increase, default=20). Run your code with '-help'
to see all available options.
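For example, to allow a 40% workspace increase instead of the default
20% (the executable name and process count below are only placeholders
for your own run command):

   mpiexec -n 24 ./your_app <your usual options> -mat_mumps_icntl_14 40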
From your output,
  > MatSolve          135030 1.0 3.0340e+03 ...
i.e., you called MatMatSolve() 30 times with a total of 135030
right-hand sides (matrix B has 135030/30 columns). Although
superlu_dist and mumps support multiple right-hand sides, the petsc
interface actually calls MatSolve() in a loop, which could be
accelerated if petsc interfaced superlu_dist's and mumps' multiple-rhs
solves directly. I'll try to add this to the interface and let you know
after I'm done (it might take a while because I'm tied up with other
projects). May I have your calling sequence of using MatMatSolve()?

To me, the performance of superlu_dist and mumps is reasonable under
the current version of the petsc library.

Thanks for providing us the data,

Hong

On Sun, 15 Mar 2009, David Fuentes wrote:

> On Sat, 14 Mar 2009, Hong Zhang wrote:
>
>> David,
>>
>> Yes, MatMatSolve dominates. Can you also send us the output of
>> '-log_summary' from superlu_dist?
>>
>> MUMPS only supports a centralized rhs vector b. Thus, we must scatter
>> the petsc distributed b into a sequential rhs vector (stored on the
>> root proc) in the petsc interface, which explains why the root proc
>> takes longer.
>> I see that the numerical factorization and MatMatSolve are called
>> 30 times. Do you iterate with a sequence similar to
>>   for i = 0, 1, ...
>>     B_i = X_(i-1)
>>     Solve A_i * X_i = B_i
>> i.e., the rhs B is based on the previously computed X?
>
> Hong,
>
> Yes, my sequence is similar to the algorithm above.
>
> The numbers I sent were from superlu. I'm seeing pretty similar
> performance profiles between the two. Sorry, I tried to get a good
> apples-to-apples comparison, but I'm getting seg faults as I increase
> the # of processors w/ mumps, which is why it was run w/ only 24 procs
> and superlu w/ 40 procs.
>
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)       Flops                             --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max        Ratio   Max      Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 3: State Update (superlu 40 processors)
>
> VecCopy           135030 1.0 6.3319e-01   1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecWAXPY              30 1.0 1.6069e-04   1.9 4.32e+03 1.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   840
> VecScatterBegin       30 1.0 7.6072e-03   1.5 0.00e+00 0.0 4.7e+04 9.0e+02 0.0e+00  0  0 15  0  0   0  0 50  0  0     0
> VecScatterEnd         30 1.0 9.1272e-02   6.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatMultAdd            30 1.0 3.3028e-01   1.4 3.89e+07 1.7 4.7e+04 9.0e+02 0.0e+00  0  0 15  0  0   0  0 50  0  0  3679
> MatSolve          135030 1.0 3.0340e+03   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 78  0  0  0  0  81  0  0  0  0     0
> MatLUFactorSym        30 1.0 2.2563e-02   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatLUFactorNum        30 1.0 2.7990e+02   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0     0
> MatConvert           150 1.0 2.9276e+00   1.3 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+02  0  0  0  0  2   0  0  0  0 30     0
> MatScale              60 1.0 2.7492e-01   1.9 1.94e+07 1.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2210
> MatAssemblyBegin     180 1.0 1.1748e+02 236.9 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02  2  0  0  0  2   2  0  0  0 40     0
> MatAssemblyEnd       180 1.0 1.9992e-02   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02  0  0  0  0  2   0  0  0  0 40     0
> MatGetRow           4320 1.7 2.2634e-01   1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatMatMult            30 1.0 4.2578e+02   1.0 1.75e+11 1.7 4.7e+04 4.0e+06 2.4e+02 11 100 15 97  2  11 100 50 100 40 12841
> MatMatSolve           30 1.0 3.0256e+03   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+01 77  0  0  0  1  81  0  0  0 10     0
>
> --- Event Stage 3: State Update (mumps 24 processors)
>
> VecWAXPY              30 1.0 3.5802e-04   2.0 6.00e+03 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   377
> VecScatterBegin   270090 1.0 2.6040e+01  21.1 0.00e+00 0.0 3.1e+06 2.3e+03 0.0e+00  0  0 97  6  0   0  0 99  6  0     0
> VecScatterEnd     135060 1.0 3.7928e+01  64.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatMultAdd            30 1.0 4.5802e-01   2.3 5.40e+07 1.1 1.7e+04 1.5e+03 0.0e+00  0  0  1  0  0   0  0  1  0  0  2653
> MatSolve          135030 1.0 6.4960e+03   1.0 0.00e+00 0.0 3.1e+06 2.3e+03 1.5e+02 81  0 96  6  0  86  0 99  6  7     0
> MatLUFactorSym        30 1.0 1.0538e-04   1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatLUFactorNum        30 1.0 4.4708e+02   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+02  6  0  0  0  0   6  0  0  0  9     0
> MatConvert           150 1.0 4.7433e+00   1.3 0.00e+00 0.0 0.0e+00 0.0e+00 6.3e+02  0  0  0  0  0   0  0  0  0 30     0
> MatScale              60 1.0 4.3342e-01   6.7 2.70e+07 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1402
> MatAssemblyBegin     180 1.0 8.4294e+01   5.9 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02  1  0  0  0  0   1  0  0  0 12     0
> MatAssemblyEnd       180 1.0 1.3100e-01   2.5 0.00e+00 0.0 0.0e+00 0.0e+00 4.2e+02  0  0  0  0  0   0  0  0  0 20     0
> MatGetRow           6000 1.1 3.6813e-01   1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatMatMult            30 1.0 6.1625e+02   1.0 2.43e+11 1.1 1.7e+04 6.8e+06 5.1e+02  8 100  1 91  0   8 100  1 94 25  8872
> MatMatSolve           30 1.0 6.4946e+03   1.0 0.00e+00 0.0 3.1e+06 2.3e+03 1.2e+02 81  0 96  6  0  86  0 99  6  6     0
> ------------------------------------------------------------------------------------------------------------------------
>
>> On Sat, 14 Mar 2009, David Fuentes wrote:
>>
>>> Thanks a lot Hong,
>>>
>>> The switch definitely seemed to balance the load during the SuperLU
>>> MatMatSolve, although I'm not completely sure what I'm seeing.
>>> Changing the # of dof also seemed to affect the load balance of the
>>> Mumps MatMatSolve. I need to investigate a bit more.
>>>
>>> Looking at the profile, the majority of the time is spent in the
>>> MatSolve called by the MatMatSolve.
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Event                Count      Time (sec)       Flops                             --- Global ---  --- Stage ---   Total
>>>                    Max Ratio  Max        Ratio   Max      Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> VecCopy           135030 1.0 6.3319e-01   1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecWAXPY              30 1.0 1.6069e-04   1.9 4.32e+03 1.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   840
>>> VecScatterBegin       30 1.0 7.6072e-03   1.5 0.00e+00 0.0 4.7e+04 9.0e+02 0.0e+00  0  0 15  0  0   0  0 50  0  0     0
>>> VecScatterEnd         30 1.0 9.1272e-02   6.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatMultAdd            30 1.0 3.3028e-01   1.4 3.89e+07 1.7 4.7e+04 9.0e+02 0.0e+00  0  0 15  0  0   0  0 50  0  0  3679
>>> MatSolve          135030 1.0 3.0340e+03   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 78  0  0  0  0  81  0  0  0  0     0
>>> MatLUFactorSym        30 1.0 2.2563e-02   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatLUFactorNum        30 1.0 2.7990e+02   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0     0
>>> MatConvert           150 1.0 2.9276e+00   1.3 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+02  0  0  0  0  4   0  0  0  0 30     0
>>> MatScale              60 1.0 2.7492e-01   1.9 1.94e+07 1.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2210
>>> MatAssemblyBegin     180 1.0 1.1748e+02 236.9 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02  2  0  0  0  5   2  0  0  0 40     0
>>> MatAssemblyEnd       180 1.0 1.9992e-02   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02  0  0  0  0  5   0  0  0  0 40     0
>>> MatGetRow           4320 1.7 2.2634e-01   1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatMatMult            30 1.0 4.2578e+02   1.0 1.75e+11 1.7 4.7e+04 4.0e+06 2.4e+02 11 100 15 97  5  11 100 50 100 40 12841
>>> MatMatSolve           30 1.0 3.0256e+03   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+01 77  0  0  0  1  81  0  0  0 10     0
>>>
>>> df
>>>
>>> On Fri, 13 Mar 2009, Hong Zhang wrote:
>>>
>>>> David,
>>>>
>>>> You may run with the option '-log_summary <log_file>' and
>>>> check which function dominates the time.
>>>> I suspect the symbolic factorization, because it is
>>>> implemented sequentially in mumps.
>>>>
>>>> If this is the case, you may switch to superlu_dist,
>>>> which supports parallel symbolic factorization
>>>> in the latest release.
>>>>
>>>> Let us know what you get,
>>>>
>>>> Hong
>>>>
>>>> On Fri, 13 Mar 2009, David Fuentes wrote:
>>>>
>>>>> The majority of the time in my code is spent in the MatMatSolve.
>>>>> I'm running MatMatSolve in parallel using Mumps as the factored
>>>>> matrix. Using top, I've noticed that during the MatMatSolve the
>>>>> majority of the load seems to be on the root process. Is this
>>>>> expected? Or do I most likely have a problem with the matrices
>>>>> that I'm passing in?
>>>>>
>>>>> thank you,
>>>>> David Fuentes
>>>>
>>>
>>
>
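For reference, here is a minimal sketch of the kind of calling sequence
discussed above. It is not the actual application code: the routine
name, the reassembly placeholder, and the choice of "mumps" as the
solver package are illustrative only; B and X are assumed to be dense
matrices created elsewhere, and error checking is omitted.

  #include <petscmat.h>

  /* One state-update loop: factor A_i with an external package, solve
     A_i * X_i = B_i for all right-hand-side columns at once, then reuse
     X_i as the right-hand side B_(i+1) of the next step. */
  PetscErrorCode StateUpdate(Mat A,Mat B,Mat X,PetscInt nsteps)
  {
    Mat           F;              /* factored matrix from the package */
    MatFactorInfo info;
    IS            rperm,cperm;    /* row/column orderings */
    PetscInt      i;

    MatFactorInfoInitialize(&info);
    for (i=0; i<nsteps; i++) {
      /* ... reassemble A for step i here ... */

      /* request an LU factorization from mumps (or "superlu_dist");
         the external package picks its own fill-reducing ordering,
         so a natural ordering is passed here */
      MatGetOrdering(A,MATORDERINGNATURAL,&rperm,&cperm);
      MatGetFactor(A,"mumps",MAT_FACTOR_LU,&F);
      MatLUFactorSymbolic(F,A,rperm,cperm,&info);
      MatLUFactorNumeric(F,A,&info);

      /* solve for all columns of dense B into dense X in one call */
      MatMatSolve(F,B,X);

      /* next step uses the current solution as its right-hand side */
      MatCopy(X,B,SAME_NONZERO_PATTERN);

      ISDestroy(&rperm);
      ISDestroy(&cperm);
      MatDestroy(&F);
    }
    return 0;
  }

Refactoring inside the loop matches the 30 MatLUFactorSym/MatLUFactorNum
calls in the logs above; if the nonzero pattern of A did not change
between steps, the symbolic factorization could in principle be done
once outside the loop and only the numeric factorization repeated.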