> 2) Recommend that users throw in some gc.collect() calls in their
> code if objects go out of scope. This doesn't seem to be a big
> problem, but it's a lingering non-deterministic MPI deadlock
> waiting to happen and very hard to debug.
>
What about insisting that objects that require collective calls
during destruction must have a collective 'clear' or 'destroy'
function that cleans up the object?

Related to this discussion, we really need to start marking
(logically) collective functions in the docstrings.
Garth
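To make the suggestion concrete, a minimal sketch of what an
explicitly destroyed object and the docstring marking could look
like (the class and methods below are hypothetical, not existing
DOLFIN API):

    class ParallelResource:
        """Hypothetical wrapper around an object whose cleanup needs MPI."""

        def __init__(self, obj):
            self._obj = obj

        def destroy(self):
            """Release the underlying object.

            *Collective*: must be called on all processes and may
            make MPI calls (e.g. a collective close or barrier).
            """
            if self._obj is not None:
                self._obj.close()
                self._obj = None

        def __del__(self):
            # Deliberately no MPI calls here: __del__ runs at a
            # non-deterministic time, and in a non-deterministic
            # order, on each rank.
            pass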
> Martin
>
>
> On 6 October 2014 15:05, Martin Sandve Alnæs <[email protected]>
> wrote: Yes. The difference is that MPI initialization / destruction
> happens at the beginning / end of the process, while the destructors
> happen all the time, anywhere. I think that makes this a harder
> problem to solve.
>
> Anyway, I was replying to "would it help if MPI is initialised
> explicitly in the setup", and the answer is still no, because MPI
> init is not the problem in the tests, although it is of a similar
> nature.
>
> I'm pondering whether it's possible (if necessary) to add a
> dolfin.mpi_gc() function and overload __del__ in some classes to
> handle this deterministically.
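For illustration, a minimal sketch of what such a helper might
amount to (dolfin.mpi_gc() does not exist; MPI.barrier and
mpi_comm_world are the existing helpers used elsewhere in this
thread):

    import gc

    from dolfin import MPI, mpi_comm_world

    def mpi_gc():
        # Run garbage collection at the same logical point on every
        # rank, so objects with collective destructors are destroyed
        # together rather than whenever each process's gc happens to
        # run.
        MPI.barrier(mpi_comm_world())
        gc.collect()
        MPI.barrier(mpi_comm_world())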
>
> On 6 Oct 2014 14:54, "Garth N. Wells" <[email protected]> wrote:
>
>
>
> On Mon, 6 Oct, 2014 at 1:10 PM, Martin Sandve Alnæs
> <[email protected]> wrote: MPI initialization has nothing to do
> with the test problems. The problem is the destructors of objects.
> It is temporarily solved by calling gc.collect() in pytest fixtures.
>
>
> The core problem is the same. The problem I describe occurs when
> the SubSystemsManager singleton that controls MPI initialisation is
> destroyed (and finalises MPI) before a PETSc object is destroyed.
> It is an issue of destruction order.
>
> Garth
>
> I think we should implement the with statement pattern for all file
> types in dolfin to allow scope management.
>
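As a sketch of what the with-statement pattern could look like from
user code, relying only on HDF5File having a close() method (the
backtrace later in this thread shows it does); the mesh and file
name are just examples:

    from contextlib import closing

    from dolfin import HDF5File, UnitSquareMesh, mpi_comm_world

    mesh = UnitSquareMesh(4, 4)
    # closing() guarantees f.close() runs at the end of the block on
    # every rank, instead of at some gc-dependent time later.
    with closing(HDF5File(mpi_comm_world(), "mesh.h5", "w")) as f:
        f.write(mesh, "/mesh")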
> If vectors _do_ call MPI in destructors, that's a problem for
> nontrivial dolfin Python programs.
>
> On 6 Oct 2014 13:53, "Garth N. Wells" <[email protected]> wrote:
>
>
> On Mon, 6 Oct, 2014 at 12:32 PM, Jan Blechta
> <[email protected]> wrote: On Mon, 06 Oct 2014 11:53:58
> +0100 "Garth N. Wells" <[email protected]> wrote:
>
> On Mon, 6 Oct, 2014 at 11:23 AM, Martin Sandve Alnæs
> <[email protected]> wrote:
> > All collective destructors must be managed explicitly in Python,
> > preferably via the with statement. Are there any apart from file
> > objects? Vectors? Matrices? Meshes?
> >
>
> Off the top of my head I can't think of any cases, apart from IO,
> in which a (collective) MPI call needs to be made inside a
> destructor. For IO, we could insist on a user closing or flushing a
> file explicitly. We cannot guarantee that 3rd party linear algebra
> backends do not call MPI when objects are destroyed.
>
> VecDestroy and MatDestroy (called by the PETScVector and
> PETScBaseMatrix destructors) are claimed to be collective by the
> PETSc docs:
>
> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Vec/VecDestroy.html
>
> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatDestroy.html
>
> Yes, they are collective but don't necessarily make MPI calls. My
> understanding is that 'collective' is not the issue but whether or
> not MPI calls are made from a destructor. Some functions will only
> make sense if called collectively (e.g., VecDestroy), but might
> not make collective MPI calls.
>
> For the tests, assuming PyTest permits a 'setup' function like
> unittest, would it help if MPI is initialised explicitly in the
> setup function and closed down at the end of a test suite (if
> possible with PyTest)?
>
> Garth
>
>
>
> Jan
>
>
> We have had this problem in the past with the 'automatic'
> finalisation of MPI, which is a problem if MPI is shut down before
> PETSc.
>
> Garth
>
>
> > On 6 Oct 2014 12:18, "Jan Blechta" <[email protected]> wrote:
> >> On Mon, 6 Oct 2014 12:07:02 +0200
> >> Martin Sandve Alnæs <[email protected]> wrote:
> >>
> >> > The problem is that gc is nondeterministic and in particular
> >> > not running with equal timing and ordering on each mpi
> >> > process.
> >> >
> >> > We can't use the with statement to handle the scope of every
> >> > single dolfin object in a program.
> >>
> >> Most of the DOLFIN destructors are not collective. So the moral
> >> is that we should avoid collective destructors as much as
> >> possible and document it as is done in the PETSc docs.
> >>
> >> Jan
> >>
> >> >
> >> > We can change all file handling to use with, and require the
> >> > user to use that in parallel.
> >> > On 6 Oct 2014 11:41, "Jan Blechta" <[email protected]>
> >> > wrote:
> >> >
> >> > > On Mon, 6 Oct 2014 09:48:29 +0200
> >> > > Martin Sandve Alnæs <[email protected]> wrote:
> >> > >
> >> > > > The 'fix' that's in the branch now was to trigger Python
> >> > > > garbage collection (suggested by Øyvind Evju) before each
> >> > > > test.
> >> > > >
> >> > > > This probably means we have a general problem in dolfin
> >> > > > with non-deterministic destruction order of objects in
> >> > > > parallel. Any destructor that uses MPI represents a
> >> > > > potential deadlock.
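Roughly, the fixture-based workaround amounts to something like the
following (a sketch, not the exact code in the branch):

    import gc

    import pytest
    from dolfin import MPI, mpi_comm_world

    @pytest.fixture(autouse=True)
    def collect_garbage():
        # Collect garbage at the same point on every rank before each
        # test, so leftover objects with collective destructors are
        # destroyed together instead of mid-test on some ranks only.
        gc.collect()
        MPI.barrier(mpi_comm_world())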
> >> > >
> >> > > To understand the issue: is the problem that garbage
> >> > > collection does not determine when an object is destroyed?
> >> > >
> >> > > Here http://stackoverflow.com/a/5071376/1796717 the
> >> > > distinction between variable scoping and object cleanup is
> >> > > discussed. Quoting it:
> >> > >
> >> > >     Deterministic cleanup happens through the with statement.
> >> > >
> >> > > which might be a proper solution to the problem.
> >> > >
> >> > > Jan
> >> > >
> >> > > >
> >> > > > On 19 September 2014 12:52, Jan Blechta
> >> > > > <[email protected]> wrote:
> >> > > >
> >> > > > > On Fri, 19 Sep 2014 00:27:50 +0200
> >> > > > > Jan Blechta <[email protected]> wrote:
> >> > > > >
> >> > > > > > Yes, after many trials using
> >> > > > > >
> >> > > > > > $ cd test/unit/io/python
> >> > > > > > $ while true; do git clean -fdx && mpirun -n 3 xterm \
> >> > > > > >       -e gdb -ex r -ex q -args python -m pytest -sv; done
> >> > > > > > # when it hangs and you interrupt it, it asks for
> >> > > > > > # confirmation for quitting, so you type n and enjoy
> >> > > > > > # gdb...
> >> > > > > >
> >> > > > > > I've seen a situation when 2 processes deadlocked on
> >> > > > > > HDF5Interface::close_file() in DOLFIN with backtrace
> >> > > > > > like
> >> > > > > >
> >> > > > > > # MPI barrier
> >> > > > > > ...
> >> > > > > > # MPI close
> >> > > > > > # HDF5 lib calls
> >> > > > > > H5FClose()
> >> > > > > > dolfin::HDF5Interface::close_file()
> >> > > > > > dolfin::HDF5File::close()
> >> > > > > > dolfin::HDF5File::~HDF5File()
> >> > > > > > dolfin::HDF5File::~HDF5File()
> >> > > > > > # smart ptr management
> >> > > > > > # garbage collection
> >> > > > > >
> >> > > > > > while the 3rd process is waiting far away. Isn't it
> >> > > > > > strange that the destructor is there twice in the
> >> > > > > > stacktrace? (The upper one is on the '}' line, which I
> >> > > > > > don't get.) What does it mean?
> >> > > > >
> >> > > > > Probably just a code generation artifact - nothing
> >> > > > > harmful, see http://stackoverflow.com/a/15244091/1796717
> >> > > > >
> >> > > > > Jan
> >> > > > >
> >> > > > > >
> >> > > > > > Jan
> >> > > > > >
> >> > > > > >
> >> > > > > > On Thu, 18 Sep 2014 16:20:51 +0200
> >> > > > > > Martin Sandve Alnæs <[email protected]> wrote:
> >> > > > > >
> >> > > > > > > I've added the MPI fixes for the temppath fixture and
> >> > > > > > > fixed some other related issues while at it: when
> >> > > > > > > parameterizing a test that uses a temppath fixture,
> >> > > > > > > there is a need for separate directories for each
> >> > > > > > > parameter combo. A further improvement would be
> >> > > > > > > automatic cleaning of old tempdirs, but I leave that
> >> > > > > > > for now.
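A sketch of one way to give each parameter combination its own
directory (a hypothetical fixture, not the code that was pushed to
the branch):

    import os

    import pytest
    from dolfin import MPI, mpi_comm_world

    @pytest.fixture
    def temppath(request):
        filedir = os.path.dirname(os.path.abspath(__file__))
        basename = os.path.basename(__file__).replace(".py", "_data")
        # request.node.name includes the parameters, e.g.
        # "test_save_mesh[3D]", so each combination gets its own dir.
        path = os.path.join(filedir, basename, request.node.name, "")
        if MPI.rank(mpi_comm_world()) == 0 and not os.path.exists(path):
            os.makedirs(path)
        MPI.barrier(mpi_comm_world())
        return path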
> >> > > > > > >
> >> > > > > > > I've pushed these changes to the branch
> >> > > > > > > aslakbergersen/topic-change-unittest-to-pytest
> >> > > > > > >
> >> > > > > > > The tests still hang though, in the closing of
> >> > > > > > > HDF5File.
> >> > > > > > >
> >> > > > > > > Here's how to debug if someone wants to give it a
> >> > > > > > > shot. Just run:
> >> > > > > > >     mpirun -np 3 python -m pytest -s -v
> >> > > > > > > With gdb:
> >> > > > > > >     mpirun -np 3 xterm -e gdb --args python -m pytest
> >> > > > > > > then enter 'r' in each of the three xterms.
> >> > > > > > >
> >> > > > > > > You may have to try a couple of times to get the
> >> > > > > > > hanging behaviour.
> >> > > > > > >
> >> > > > > > > Martin
> >> > > > > > >
> >> > > > > > > On 18 September 2014 13:23, Martin Sandve Alnæs
> >> > > > > > > <[email protected]> wrote:
> >> > > > > > >
> >> > > > > > > > Good spotting both of you, thanks.
> >> > > > > > > >
> >> > > > > > > > Martin
> >> > > > > > > >
> >> > > > > > > > On 18 September 2014 13:01, Lawrence Mitchell <
> >> > > > > > > > [email protected]> wrote:
> >> > > > > > > >
> >> > > > > > > >> On 18/09/14 11:42, Jan Blechta wrote:
> >> > > > > > > >> > Some problems (when running in a clean dir) are
> >> > > > > > > >> > avoided using this (although incorrect) patch.
> >> > > > > > > >> > There are race conditions in the creation of the
> >> > > > > > > >> > temp dir. It should be done using an atomic
> >> > > > > > > >> > operation.
> >> > > > > > > >> >
> >> > > > > > > >> > Jan
> >> > > > > > > >> >
> >> > > > > > > >> >
> >> > > > > > > >> >
> >> > > > > > > >> > ==================================================================
> >> > > > > > > >> > diff --git a/test/unit/io/python/test_XDMF.py b/test/unit/io/python/test_XDMF.py
> >> > > > > > > >> > index 9ad65a4..31471f1 100755
> >> > > > > > > >> > --- a/test/unit/io/python/test_XDMF.py
> >> > > > > > > >> > +++ b/test/unit/io/python/test_XDMF.py
> >> > > > > > > >> > @@ -28,8 +28,9 @@ def temppath():
> >> > > > > > > >> >      filedir = os.path.dirname(os.path.abspath(__file__))
> >> > > > > > > >> >      basename = os.path.basename(__file__).replace(".py", "_data")
> >> > > > > > > >> >      temppath = os.path.join(filedir, basename, "")
> >> > > > > > > >> > -    if not os.path.exists(temppath):
> >> > > > > > > >> > -        os.mkdir(temppath)
> >> > > > > > > >> > +    if MPI.rank(mpi_comm_world()) == 0:
> >> > > > > > > >> > +        if not os.path.exists(temppath):
> >> > > > > > > >> > +            os.mkdir(temppath)
> >> > > > > > > >> >      return temppath
> >> > > > > > > >>
> >> > > > > > > >> There's still a race condition here because ranks
> >> > > > > > > >> other than zero might try and use temppath before
> >> > > > > > > >> it's created. I think you want something like the
> >> > > > > > > >> below:
> >> > > > > > > >>
> >> > > > > > > >> if MPI.rank(mpi_comm_world()) == 0:
> >> > > > > > > >>     if not os.path.exists(temppath):
> >> > > > > > > >>         os.mkdir(temppath)
> >> > > > > > > >> MPI.barrier(mpi_comm_world())
> >> > > > > > > >> return temppath
> >> > > > > > > >>
> >> > > > > > > >> If you're worried about the OS not creating files
> >> > > > > > > >> atomically, you can always mkdir into a tmp
> >> > > > > > > >> directory and then os.rename(tmp, temppath), since
> >> > > > > > > >> POSIX guarantees that renames are atomic.
> >> > > > > > > >>
> >> > > > > > > >> Lawrence
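A sketch of that rename-based variant (make_temppath is a
hypothetical name; MPI.rank, MPI.barrier and mpi_comm_world are the
same helpers used in the patch above):

    import os
    import tempfile

    from dolfin import MPI, mpi_comm_world

    def make_temppath(temppath):
        target = temppath.rstrip(os.sep)
        if MPI.rank(mpi_comm_world()) == 0 and not os.path.exists(target):
            # Create the directory under a temporary name, then move
            # it into place; POSIX guarantees the rename is atomic.
            tmp = tempfile.mkdtemp(dir=os.path.dirname(target))
            try:
                os.rename(tmp, target)
            except OSError:
                os.rmdir(tmp)  # the directory appeared in the meantime
        # The other ranks wait here until the directory exists.
        MPI.barrier(mpi_comm_world())
        return temppath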
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > >
> >> > > > >
> >> > > > >
> >> > >
> >> > >
> >>
>
>
>
>
_______________________________________________
fenics mailing list
[email protected]
http://fenicsproject.org/mailman/listinfo/fenics