On 6 Oct 2014, at 16:38, Martin Sandve Alnæs <[email protected]> wrote:

> I think this is the best solution:
> 
> 1) Require the user to close file objects deterministically.
> Relying on the del operator is not deterministic; we need to support
> .close() and/or __enter__/__exit__ for the with statement in dolfin.
> 

Sounds good. We can print a warning message from the File object destructors if 
a file is not closed (this can later become an error).
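
Something like the sketch below could work. This is rough and untested; the
ManagedFile name and the warning behaviour are purely illustrative, not
existing DOLFIN API:

    import warnings

    class ManagedFile(object):
        """Illustrative wrapper: deterministic closing via close() or the
        with statement, plus a warning from the destructor if the user
        forgot to close. close() is (logically) collective."""

        def __init__(self, backend_file):
            self._file = backend_file
            self._closed = False

        def close(self):
            # The collective close happens here, at a deterministic point
            if not self._closed:
                self._file.close()
                self._closed = True

        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            self.close()

        def __del__(self):
            # No MPI calls here; just warn (could later become an error)
            if not self._closed:
                warnings.warn("File was not closed explicitly; "
                              "call close() or use the 'with' statement.")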

> 2) Recommend users to throw in some gc.collect() calls if objects go out of
> scope in their code. This doesn't seem to be a big problem, but it's a
> lingering non-deterministic MPI deadlock waiting to happen and very hard to
> debug.
> 

What about insisting that objects which require collective calls during
destruction provide a collective 'clear' or 'destroy' function that cleans up
the object?

Related to this discussion, we really need to start marking (logically)
collective functions as such in the docstrings.
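
For example, the convention could look like this (a sketch with made-up names,
not current DOLFIN code):

    class ExampleVector(object):
        """Toy stand-in for a wrapped backend vector (illustration only)."""

        def __init__(self, comm):
            self.comm = comm
            self._handle = object()  # stands in for e.g. a PETSc Vec handle

        def clear(self):
            """Collective. Destroy the underlying backend object.

            Must be called at the same point on all processes in self.comm,
            since the backend destroy routine (e.g. VecDestroy) is
            collective.
            """
            self._handle = None

        def __del__(self):
            # Deliberately no MPI/backend calls: tear-down must happen
            # explicitly and deterministically via clear().
            pass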

Garth

> Martin
> 
> 
> On 6 October 2014 15:05, Martin Sandve Alnæs <[email protected]> wrote:
> Yes. The difference is that MPI initialization / destruction happens at the
> beginning / end of the process, while destructors can run at any time,
> anywhere. I think that makes this a harder problem to solve.
> 
> Anyway, I was replying to "would it help if MPI is initialised explicitly in
> the setup" and the answer is still no, because MPI init is not the problem in
> the tests, although it is of a similar nature.
> 
> I'm pondering whether it's possible (if necessary) to add a dolfin.mpi_gc()
> function and overload __del__ in some classes to handle this
> deterministically.
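> 
> A rough sketch of what I have in mind (untested; the mpi_gc name and the
> use of mpi4py here are just for illustration):
> 
>     import gc
>     from mpi4py import MPI
> 
>     def mpi_gc(comm=MPI.COMM_WORLD):
>         """Collective. Trigger garbage collection at the same point on
>         all processes, so destructors that make collective calls run at
>         matching points, then synchronise with a barrier."""
>         gc.collect()
>         comm.Barrier()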
> 
> On 6 Oct 2014, at 14:54, "Garth N. Wells" <[email protected]> wrote:
> 
> 
> 
> On Mon, 6 Oct, 2014 at 1:10 PM, Martin Sandve Alnæs <[email protected]> 
> wrote:
> MPI initialization has nothing to do with the test problems. The problem is 
> the destructors of objects. It is temporarily worked around by calling 
> gc.collect() in pytest fixtures.
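> 
> Roughly like this (a sketch; the actual fixture in the branch may differ):
> 
>     import gc
>     import pytest
> 
>     @pytest.fixture(autouse=True)
>     def collect_garbage():
>         """Run gc before each test so objects left over from the previous
>         test are destroyed at the same point on all processes."""
>         gc.collect()
>         yield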
> 
> 
> The core problem is the same. The problem I describe occurs when the 
> SubSystemsManager singleton that controls MPI initialisation is destroyed (and 
> finalises MPI) before a PETSc object is destroyed. It is an issue of 
> destruction order.
> 
> Garth
> 
> I think we should implement the with statement pattern for all file types in 
> dolfin to allow scope management.
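> 
> Usage would then look roughly like this (a sketch, assuming HDF5File gains
> __enter__/__exit__ support):
> 
>     from dolfin import HDF5File, UnitSquareMesh, mpi_comm_world
> 
>     mesh = UnitSquareMesh(8, 8)
>     with HDF5File(mpi_comm_world(), "mesh.h5", "w") as f:
>         f.write(mesh, "mesh")
>     # the file is closed (collectively) here, at a deterministic point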
> 
> If vectors _do_ call MPI in their destructors, that's a problem for nontrivial 
> dolfin Python programs.
> 
> On 6 Oct 2014, at 13:53, "Garth N. Wells" <[email protected]> wrote:
> 
> 
> On Mon, 6 Oct, 2014 at 12:32 PM, Jan Blechta <[email protected]> 
> wrote:
> On Mon, 06 Oct 2014 11:53:58 +0100
> "Garth N. Wells" <[email protected]> wrote:
> 
>  On Mon, 6 Oct, 2014 at 11:23 AM, Martin Sandve Alnæs
>  <[email protected]> wrote:
>  > All collective destructors must be managed explicitly in Python,
>  > preferably via the with statement. Are there any apart from file
>  > objects? Vectors? Matrices? Meshes?
>  >
> 
>  Off the top of my head I can't think of any cases, apart from IO, in
>  which a (collective) MPI call needs to be made inside a destructor.
>  For IO, we could insist on a user closing or flushing a file
>  explicitly. We cannot guarantee that 3rd party linear algebra
>  backends do not call MPI when objects are destroyed.
> 
> VecDestroy and MatDestroy (called by the PETScVector and PETScBaseMatrix
> destructors) are documented as collective in the PETSc docs:
> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Vec/VecDestroy.html
> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatDestroy.html
> 
> Yes, they are collective but don't necessarily make MPI calls. My 
> understanding is that 'collective' is not the issue; the issue is whether or 
> not MPI calls are made from a destructor. Some functions only make sense if 
> called collectively (e.g., VecDestroy), but might not make collective MPI 
> calls.
> 
> For the tests, assuming PyTest permits a 'setup' function like unittest, 
> would it help if MPI is initialised explicitly in the setup function and 
> closed down at the end of a test suite (if possible with PyTest)?
> 
> Garth
> 
> 
> 
> Jan
> 
> 
>  We have had this problem in the past with the 'automatic'
>  finalisation of MPI, which is a problem if MPI is shut down before
>  PETSc.
> 
>  Garth
> 
> 
>  > On 6 Oct 2014, at 12:18, "Jan Blechta" <[email protected]> wrote:
>  >> On Mon, 6 Oct 2014 12:07:02 +0200
>  >> Martin Sandve Alnæs <[email protected]> wrote:
>  >>
>  >> > The problem is that gc is nondeterministic and in particular does not
>  >> > run with equal timing and ordering on each MPI process.
>  >> >
>  >> > We can't use the with statement to handle the scope of every
>  >> > single dolfin object in a program.
>  >>
>  >> Most of the DOLFIN destructors are not collective. So the moral is
>  >> that we should avoid collective destructors as far as possible and
>  >> document them the way it is done in the PETSc documentation.
>  >>
>  >> Jan
>  >>
>  >> >
>  >> > We can change all file handling to use with, and require the user
>  >> > to use that in parallel.
>  >> > On 6 Oct 2014, at 11:41, "Jan Blechta" <[email protected]> wrote:
>  >> >
>  >> > > On Mon, 6 Oct 2014 09:48:29 +0200
>  >> > > Martin Sandve Alnæs <[email protected]> wrote:
>  >> > >
>  >> > > > The 'fix' that's in the branch now was to trigger Python garbage
>  >> > > > collection (suggested by Øyvind Evju) before each test.
>  >> > > >
>  >> > > > This probably means we have a general problem in dolfin with
>  >> > > > non-deterministic destruction order of objects in parallel.
>  >> > > > Any destructor that uses MPI represents a potential deadlock.
>  >> > >
>  >> > > To understand the issue: is the problem that garbage collection
>  >> > > does not guarantee when an object is destroyed?
>  >> > >
>  >> > > Here http://stackoverflow.com/a/5071376/1796717 the distinction
>  >> > > between variable scoping and object cleanup is discussed. Quoting it:
>  >> > >
>  >> > >   Deterministic cleanup happens through the with statement.
>  >> > >
>  >> > > which might be a proper solution to the problem.
>  >> > >
>  >> > > Jan
>  >> > >
>  >> > > >
>  >> > > > On 19 September 2014 12:52, Jan Blechta
>  >> > > > <[email protected]> wrote:
>  >> > > >
>  >> > > > > On Fri, 19 Sep 2014 00:27:50 +0200
>  >> > > > > Jan Blechta <[email protected]> wrote:
>  >> > > > >
>  >> > > > > > Yes, after many trials using
>  >> > > > > >
>  >> > > > > > $ cd test/unit/io/python
>  >> > > > > > $ while true; do git clean -fdx && mpirun -n 3 xterm -e
>  >> > > > > > gdb -ex r -ex q -args python -m pytest -sv; done
>  >> > > > > > # when it hangs and you interrupt it, it asks for
>  >> > > > > > # confirmation for quitting, so you type n and enjoy gdb...
>  >> > > > > >
>  >> > > > > > I've seen a situation when 2 processes deadlocked on
>  >> > > > > > HDF5Interface::close_file() in DOLFIN with backtrace like
>  >> > > > > >
>  >> > > > > > # MPI barrier
>  >> > > > > > ...
>  >> > > > > > # MPI close
>  >> > > > > > # HDF5 lib calls
>  >> > > > > > H5FClose()
>  >> > > > > > dolfin::HDF5Interface::close_file()
>  >> > > > > > dolfin::HDF5File::close()
>  >> > > > > > dolfin::HDF5File::~HDF5File()
>  >> > > > > > dolfin::HDF5File::~HDF5File()
>  >> > > > > > # smart ptr management
>  >> > > > > > # garbage collection
>  >> > > > > >
>  >> > > > > > while the 3rd process is waiting elsewhere. Isn't it strange
>  >> > > > > > that the destructor appears twice in the stack trace? (The
>  >> > > > > > upper one is on the '}' line, which I don't get.) What does
>  >> > > > > > it mean?
>  >> > > > >
>  >> > > > > Probably just a code-generation artifact; nothing harmful.
>  >> > > > > See http://stackoverflow.com/a/15244091/1796717
>  >> > > > >
>  >> > > > > Jan
>  >> > > > >
>  >> > > > > >
>  >> > > > > > Jan
>  >> > > > > >
>  >> > > > > >
>  >> > > > > > On Thu, 18 Sep 2014 16:20:51 +0200
>  >> > > > > > Martin Sandve Alnæs <[email protected]> wrote:
>  >> > > > > >
>  >> > > > > > > I've added the MPI fixes for the temppath fixture and fixed
>  >> > > > > > > some other related issues while at it: when parameterizing
>  >> > > > > > > a test that uses a temppath fixture, each parameter combo
>  >> > > > > > > needs a separate directory. A further improvement would be
>  >> > > > > > > automatic cleaning of old tempdirs, but I leave that for now.
>  >> > > > > > >
>  >> > > > > > > I've pushed these changes to the branch
>  >> > > > > > > aslakbergersen/topic-change-unittest-to-pytest
>  >> > > > > > >
>  >> > > > > > > The tests still hang though, in the closing of
>  >> > > > > > > HDF5File.
>  >> > > > > > >
>  >> > > > > > > Here's how to debug if someone wants to give it a shot:
>  >> > > > > > > Just run:
>  >> > > > > > >     mpirun -np 3 python -m pytest -s -v
>  >> > > > > > > With gdb:
>  >> > > > > > >     mpirun -np 3 xterm -e gdb --args python -m pytest
>  >> > > > > > > then enter 'r' in each of the three xterms.
>  >> > > > > > >
>  >> > > > > > > You may have to try a couple of times to get the
>  >> > > > > > > hanging behaviour.
>  >> > > > > > >
>  >> > > > > > > Martin
>  >> > > > > > >
>  >> > > > > > > On 18 September 2014 13:23, Martin Sandve Alnæs
>  >> > > > > > > <[email protected]> wrote:
>  >> > > > > > >
>  >> > > > > > > > Good spotting both of you, thanks.
>  >> > > > > > > >
>  >> > > > > > > > Martin
>  >> > > > > > > >
>  >> > > > > > > > On 18 September 2014 13:01, Lawrence Mitchell <
>  >> > > > > > > > [email protected]> wrote:
>  >> > > > > > > >
>  >> > > > > > > >> On 18/09/14 11:42, Jan Blechta wrote:
>  >> > > > > > > >> > Some problems (when running in a clean dir) are avoided
>  >> > > > > > > >> > using this (although incorrect) patch. There are race
>  >> > > > > > > >> > conditions in the creation of the temp dir. It should be
>  >> > > > > > > >> > done using an atomic operation.
>  >> > > > > > > >> >
>  >> > > > > > > >> > Jan
>  >> > > > > > > >> >
>  >> > > > > > > >> >
>  >> > > > > > > >> >
>  >> > > > > > > >> > ==================================================================
>  >> > > > > > > >> > diff --git a/test/unit/io/python/test_XDMF.py b/test/unit/io/python/test_XDMF.py
>  >> > > > > > > >> > index 9ad65a4..31471f1 100755
>  >> > > > > > > >> > --- a/test/unit/io/python/test_XDMF.py
>  >> > > > > > > >> > +++ b/test/unit/io/python/test_XDMF.py
>  >> > > > > > > >> > @@ -28,8 +28,9 @@ def temppath():
>  >> > > > > > > >> >      filedir = os.path.dirname(os.path.abspath(__file__))
>  >> > > > > > > >> >      basename = os.path.basename(__file__).replace(".py", "_data")
>  >> > > > > > > >> >      temppath = os.path.join(filedir, basename, "")
>  >> > > > > > > >> > -    if not os.path.exists(temppath):
>  >> > > > > > > >> > -        os.mkdir(temppath)
>  >> > > > > > > >> > +    if MPI.rank(mpi_comm_world()) == 0:
>  >> > > > > > > >> > +        if not os.path.exists(temppath):
>  >> > > > > > > >> > +            os.mkdir(temppath)
>  >> > > > > > > >> >      return temppath
>  >> > > > > > > >>
>  >> > > > > > > >> There's still a race condition here because ranks other
>  >> > > > > > > >> than zero might try and use temppath before it's created.
>  >> > > > > > > >> I think you want something like the below:
>  >> > > > > > > >>
>  >> > > > > > > >> if MPI.rank(mpi_comm_world()) == 0:
>  >> > > > > > > >>     if not os.path.exists(temppath):
>  >> > > > > > > >>         os.mkdir(temppath)
>  >> > > > > > > >> MPI.barrier(mpi_comm_world())
>  >> > > > > > > >> return temppath
>  >> > > > > > > >>
>  >> > > > > > > >> If you're worried about the OS not creating files
>  >> > > > > > > >> atomically, you can always mkdir into a tmp directory and
>  >> > > > > > > >> then os.rename(tmp, temppath), since POSIX guarantees
>  >> > > > > > > >> that renames are atomic.
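>  >> > > > > > > >>
>  >> > > > > > > >> Roughly (a sketch, untested; reuses filedir, temppath and
>  >> > > > > > > >> the DOLFIN MPI helpers from the snippet above):
>  >> > > > > > > >>
>  >> > > > > > > >> import os
>  >> > > > > > > >> import tempfile
>  >> > > > > > > >>
>  >> > > > > > > >> dst = temppath.rstrip(os.sep)  # drop trailing separator
>  >> > > > > > > >> if MPI.rank(mpi_comm_world()) == 0 and not os.path.isdir(dst):
>  >> > > > > > > >>     tmp = tempfile.mkdtemp(dir=filedir)
>  >> > > > > > > >>     try:
>  >> > > > > > > >>         os.rename(tmp, dst)  # atomic on POSIX
>  >> > > > > > > >>     except OSError:
>  >> > > > > > > >>         os.rmdir(tmp)  # dst appeared meanwhile; discard tmp
>  >> > > > > > > >> MPI.barrier(mpi_comm_world())
>  >> > > > > > > >> return temppath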
>  >> > > > > > > >>
>  >> > > > > > > >> Lawrence
>  >> > > > > > > >>
>  >> > > > > > > >
>  >> > > > > > > >
>  >> > > > > >
>  >> > > > >
>  >> > > > >
>  >> > >
>  >> > >
>  >>
> 
> 
> 
> 
