On Mon, 06 Oct 2014 12:53:00 +0100 "Garth N. Wells" <[email protected]> wrote:
>
> On Mon, 6 Oct, 2014 at 12:32 PM, Jan Blechta <[email protected]> wrote:
> > On Mon, 06 Oct 2014 11:53:58 +0100 "Garth N. Wells" <[email protected]> wrote:
> >
> >> On Mon, 6 Oct, 2014 at 11:23 AM, Martin Sandve Alnæs <[email protected]> wrote:
> >> > All collective destructors must be managed explicitly in python, preferably via the with statement. Are there any apart from file objects? Vectors? Matrices? Meshes?
> >>
> >> Off the top of my head I can't think of any cases, apart from IO, in which a (collective) MPI call needs to be made inside a destructor. For IO, we could insist on a user closing or flushing a file explicitly. We cannot guarantee that 3rd party linear algebra backends do not call MPI when objects are destroyed.
> >
> > VecDestroy and MatDestroy (called by the PETScVector and PETScBaseMatrix destructors) are claimed to be collective by the PETSc docs:
> > http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Vec/VecDestroy.html
> > http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatDestroy.html
>
> Yes, they are collective but don't necessarily make MPI calls. My understanding is that 'collective' is not the issue but whether or not MPI calls are made from a destructor. Some functions will only make sense if called collectively (e.g., VecDestroy), but might not make collective MPI calls.

In the usual sense, collectivity implies communication; see http://en.wikipedia.org/wiki/Collective_operation. I think a collective PETSc routine does employ an MPI collective operation (at least when the MPI implementation of Vec/Mat is used). What you are speaking about is a "logically collective" routine, e.g.
http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatSetOption.html
Can you confirm, Jed, whether we are correct?

Jan

> For the tests, assuming PyTest permits a 'setup' function like unittest, would it help if MPI is initialised explicitly in the setup function and closed down at the end of a test suite (if possible with PyTest)?
>
> Garth
>
> > Jan
> >
> >> We have had this problem in the past with the 'automatic' finalisation of MPI, which is a problem if MPI is shut down before PETSc.
> >>
> >> Garth
> >>
> >> > On 6 Oct 2014 12:18, "Jan Blechta" <[email protected]> wrote:
> >> >> On Mon, 6 Oct 2014 12:07:02 +0200 Martin Sandve Alnæs <[email protected]> wrote:
> >> >>
> >> >> > The problem is that gc is nondeterministic and in particular not running with equal timing and ordering on each mpi process.
> >> >> >
> >> >> > We can't use the with statement to handle the scope of every single dolfin object in a program.
> >> >>
> >> >> Most of the DOLFIN destructors are not collective. So the moral is that we should avoid collective destructors as much as possible and document them the way the PETSc docs do.
> >> >>
> >> >> Jan
> >> >>
> >> >> > We can change all file handling to use with, and require the user to use that in parallel.
> >> >> > On 6 Oct 2014 11:41, "Jan Blechta" <[email protected]> wrote:
> >> >> >
> >> >> > > On Mon, 6 Oct 2014 09:48:29 +0200 Martin Sandve Alnæs <[email protected]> wrote:
> >> >> > >
> >> >> > > > The 'fix' that's in the branch now was to trigger python garbage collection (suggested by Øyvind Evju) before each test.
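
For what it's worth, that per-test GC trigger (and something along the lines of Garth's setup-function idea) could live in a shared conftest.py as an autouse fixture. This is only an untested sketch of the idea, not necessarily what is in the branch; it reuses the MPI.barrier/mpi_comm_world helpers that already appear in the patch further down:

import gc

import pytest
from dolfin import MPI, mpi_comm_world

@pytest.fixture(autouse=True)
def collect_garbage_collectively():
    # Run the collector at the same point on every rank so that any
    # collective destructors left over from a previous test fire
    # together, rather than whenever gc happens to run on each process.
    gc.collect()
    MPI.barrier(mpi_comm_world())

Whether this removes the HDF5 hang or just makes it less likely is of course the question.
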
> >> >> > > >
> >> >> > > > This probably means we have a general problem in dolfin with non-deterministic destruction order of objects in parallel. Any destructor that uses MPI represents a potential deadlock.
> >> >> > >
> >> >> > > To understand the issue: is the problem that garbage collection does not ensure when the object is destroyed?
> >> >> > >
> >> >> > > Here http://stackoverflow.com/a/5071376/1796717 the distinction between variable scoping and object cleanup is discussed. Quoting it:
> >> >> > >
> >> >> > >   Deterministic cleanup happens through the with statement.
> >> >> > >
> >> >> > > which might be a proper solution to the problem.
> >> >> > >
> >> >> > > Jan
> >> >> > >
> >> >> > > > On 19 September 2014 12:52, Jan Blechta <[email protected]> wrote:
> >> >> > > > >
> >> >> > > > > On Fri, 19 Sep 2014 00:27:50 +0200 Jan Blechta <[email protected]> wrote:
> >> >> > > > >
> >> >> > > > > > Yes, after many trials using
> >> >> > > > > >
> >> >> > > > > > $ cd test/unit/io/python
> >> >> > > > > > $ while true; do git clean -fdx && mpirun -n 3 xterm -e gdb -ex r -ex q -args python -m pytest -sv; done
> >> >> > > > > > # when it hangs and you interrupt it, it asks for confirmation for quitting, so you type n and enjoy gdb...
> >> >> > > > > >
> >> >> > > > > > I've seen a situation when 2 processes deadlocked on HDF5Interface::close_file() in DOLFIN with a backtrace like
> >> >> > > > > >
> >> >> > > > > > # MPI barrier
> >> >> > > > > > ...
> >> >> > > > > > # MPI close
> >> >> > > > > > # HDF5 lib calls
> >> >> > > > > > H5Fclose()
> >> >> > > > > > dolfin::HDF5Interface::close_file()
> >> >> > > > > > dolfin::HDF5File::close()
> >> >> > > > > > dolfin::HDF5File::~HDF5File()
> >> >> > > > > > dolfin::HDF5File::~HDF5File()
> >> >> > > > > > # smart ptr management
> >> >> > > > > > # garbage collection
> >> >> > > > > >
> >> >> > > > > > while the 3rd process is waiting far away. Isn't it strange that the destructor is there twice in the stacktrace? (The upper one is on a '}' line, which I don't get.) What does it mean?
> >> >> > > > >
> >> >> > > > > Probably just a code generation artifact - nothing harmful, see http://stackoverflow.com/a/15244091/1796717
> >> >> > > > >
> >> >> > > > > Jan
> >> >> > > > >
> >> >> > > > > > Jan
> >> >> > > > > >
> >> >> > > > > > On Thu, 18 Sep 2014 16:20:51 +0200 Martin Sandve Alnæs <[email protected]> wrote:
> >> >> > > > > >
> >> >> > > > > > > I've added the mpi fixes for the temppath fixture and fixed some other related issues while at it: when parameterizing a test that uses a temppath fixture, there is a need for separate directories for each parameter combo. A further improvement would be automatic cleaning of old tempdirs, but I leave that for now.
> >> >> > > > > > >
> >> >> > > > > > > I've pushed these changes to the branch aslakbergersen/topic-change-unittest-to-pytest
> >> >> > > > > > >
> >> >> > > > > > > The tests still hang though, in the closing of HDF5File.
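
On the with-statement route: since the hang is in the closing of HDF5File, the IO tests could close the file at a deterministic, collective point instead of leaving it to gc. A rough, untested sketch; the HDF5File constructor arguments here are from memory, and contextlib.closing only assumes the Python-wrapped HDF5File exposes close(), which the backtrace above suggests it does:

from contextlib import closing

from dolfin import HDF5File, UnitSquareMesh, mpi_comm_world

mesh = UnitSquareMesh(4, 4)
with closing(HDF5File(mpi_comm_world(), "mesh.h5", "w")) as f:
    f.write(mesh, "/mesh")
# f.close() has now been called collectively at the same point on all
# ranks, rather than whenever each process happens to run its gc.
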
> >> >> > > > > > >
> >> >> > > > > > > Here's how to debug if someone wants to give it a shot. Just run:
> >> >> > > > > > >
> >> >> > > > > > > mpirun -np 3 python -m pytest -s -v
> >> >> > > > > > >
> >> >> > > > > > > With gdb:
> >> >> > > > > > >
> >> >> > > > > > > mpirun -np 3 xterm -e gdb --args python -m pytest
> >> >> > > > > > >
> >> >> > > > > > > then enter 'r' in each of the three xterms. You may have to try a couple of times to get the hanging behaviour.
> >> >> > > > > > >
> >> >> > > > > > > Martin
> >> >> > > > > > >
> >> >> > > > > > > On 18 September 2014 13:23, Martin Sandve Alnæs <[email protected]> wrote:
> >> >> > > > > > > >
> >> >> > > > > > > > Good spotting both of you, thanks.
> >> >> > > > > > > >
> >> >> > > > > > > > Martin
> >> >> > > > > > > >
> >> >> > > > > > > > On 18 September 2014 13:01, Lawrence Mitchell <[email protected]> wrote:
> >> >> > > > > > > >
> >> >> > > > > > > >> On 18/09/14 11:42, Jan Blechta wrote:
> >> >> > > > > > > >> > Some problems (when running in a clean dir) are avoided using this (although incorrect) patch. There are race conditions in the creation of the temp dir. It should be done using an atomic operation.
> >> >> > > > > > > >> >
> >> >> > > > > > > >> > Jan
> >> >> > > > > > > >> >
> >> >> > > > > > > >> > ==================================================================
> >> >> > > > > > > >> > diff --git a/test/unit/io/python/test_XDMF.py b/test/unit/io/python/test_XDMF.py
> >> >> > > > > > > >> > index 9ad65a4..31471f1 100755
> >> >> > > > > > > >> > --- a/test/unit/io/python/test_XDMF.py
> >> >> > > > > > > >> > +++ b/test/unit/io/python/test_XDMF.py
> >> >> > > > > > > >> > @@ -28,8 +28,9 @@ def temppath():
> >> >> > > > > > > >> >      filedir = os.path.dirname(os.path.abspath(__file__))
> >> >> > > > > > > >> >      basename = os.path.basename(__file__).replace(".py", "_data")
> >> >> > > > > > > >> >      temppath = os.path.join(filedir, basename, "")
> >> >> > > > > > > >> > -    if not os.path.exists(temppath):
> >> >> > > > > > > >> > -        os.mkdir(temppath)
> >> >> > > > > > > >> > +    if MPI.rank(mpi_comm_world()) == 0:
> >> >> > > > > > > >> > +        if not os.path.exists(temppath):
> >> >> > > > > > > >> > +            os.mkdir(temppath)
> >> >> > > > > > > >> >      return temppath
> >> >> > > > > > > >>
> >> >> > > > > > > >> There's still a race condition here because ranks other than zero might try and use temppath before it's created. I think you want something like the below:
> >> >> > > > > > > >>
> >> >> > > > > > > >> if MPI.rank(mpi_comm_world()) == 0:
> >> >> > > > > > > >>     if not os.path.exists(temppath):
> >> >> > > > > > > >>         os.mkdir(temppath)
> >> >> > > > > > > >> MPI.barrier(mpi_comm_world())
> >> >> > > > > > > >> return temppath
> >> >> > > > > > > >>
> >> >> > > > > > > >> If you're worried about the OS not creating files atomically, you can always mkdir into a tmp directory and then os.rename(tmp, temppath), since posix guarantees that renames are atomic.
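
Put together, Lawrence's rename variant could look roughly like the helper below. This is only an illustrative sketch with a made-up name (make_shared_tempdir), not what ended up in the branch; it assumes the same DOLFIN MPI helpers used in the patch above:

import os

from dolfin import MPI, mpi_comm_world

def make_shared_tempdir(temppath):
    # Rank 0 creates the directory under a temporary name and renames it
    # into place; rename is atomic on POSIX, so other processes never
    # see a half-created directory.
    target = temppath.rstrip(os.sep)
    if MPI.rank(mpi_comm_world()) == 0 and not os.path.exists(target):
        tmp = target + ".tmp-%d" % os.getpid()
        os.mkdir(tmp)
        try:
            os.rename(tmp, target)
        except OSError:
            # A concurrent run created target first; drop our copy.
            os.rmdir(tmp)
    # No rank may touch temppath before rank 0 is done creating it.
    MPI.barrier(mpi_comm_world())
    return temppath
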
> >> >> > > > > > > >>
> >> >> > > > > > > >> Lawrence

_______________________________________________
fenics mailing list
[email protected]
http://fenicsproject.org/mailman/listinfo/fenics
