On Tue, 07 Oct 2014 13:34:17 +0100
"Garth N. Wells" <[email protected]> wrote:

> 
> 
> On Tue, 7 Oct, 2014 at 10:35 AM, Jan Blechta 
> <[email protected]> wrote:
> > On Tue, 7 Oct 2014 10:23:21 +0100
> > "Garth N. Wells" <[email protected]> wrote:
> > 
> >> 
> >>  On 6 Oct 2014, at 16:38, Martin Sandve Alnæs <[email protected]>
> >>  wrote:
> >> 
> >>  > I think this is the best solution:
> >>  >
> >>  > 1) Require the user to close file objects deterministically.
> >>  > Relying on the del operator is not deterministic, we need to
> >>  > support .close() and/or __enter__/__exit__ for the with statement
> >>  > in dolfin.
> >>  >
> >> 
> >>  Sounds good. We can print a warning message from the File object
> >>  destructors if a file is not closed (this can later become an error).
> > 
> > Good idea, but maybe the warning could be issued from the __del__
> > operator if the object was not properly destroyed/closed. In the C++
> > layer everything is OK.
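
For concreteness, point 1) plus the destructor warning could look roughly
like the sketch below (class and attribute names are made up, this is not
the actual DOLFIN File API):

import warnings

class ManagedFile(object):
    """Sketch of a file object with explicit, deterministic cleanup."""

    def __init__(self, path):
        self._path = path
        self._closed = False
        # ... open the underlying (possibly collective) file handle here

    def close(self):
        # Collective close; must be called at the same point on all ranks.
        if not self._closed:
            # ... close the underlying handle here
            self._closed = True

    # Support 'with ManagedFile(path) as f:' for scope-based cleanup.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()

    def __del__(self):
        # Called non-deterministically by the gc; never make MPI calls
        # here, only warn that the user forgot an explicit close().
        if not self._closed:
            warnings.warn("file %r was never closed explicitly" % self._path)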
> 
> There are some advantages to insisting on explicit file
> opening/closing in the C++ interface too, particularly in parallel,
> so we could reasonably keep the interface consistent across Python
> and C++.

OK, but do you really want to go that far for the sake of consistency and
require explicit destruction of PETScVector / PETScMatrix?

Jan

> 
> > 
> > Maybe we should also check how petsc4py deals with the issue and take
> > some inspiration from it.
> 
> I haven't checked in detail, but I presumed that petsc4py wraps the 
> PETSc 'FooDestroy' functions. From a quick survey of some of the 
> petsc4py demos, it appears that some explicitly clean up, e.g.
> 
> https://bitbucket.org/petsc/petsc4py/src/7081705ebf90034163de05034df749fcd50cc288/demo/taosolve/chwirut.py?at=master
> 
> and some do not, e.g.
> 
> https://bitbucket.org/petsc/petsc4py/src/7081705ebf90034163de05034df749fcd50cc288/demo/kspsolve/test_mat_cg.py?at=master
> 
> I suspect that petsc4py could suffer the same issue that we're seeing 
> in the case that objects are not explicitly cleaned up.
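
For reference, explicit cleanup in petsc4py looks like the minimal sketch
below (what happens when one relies on the garbage collector instead is
exactly the open question):

from petsc4py import PETSc

x = PETSc.Vec().createMPI(100)   # collective creation
x.set(1.0)
# ... use the vector ...
x.destroy()                      # wraps VecDestroy: collective, deterministic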
> 
> Garth
> 
> 
> > 
> > Jan
> > 
> >> 
> >>  > 2) Recommend users to throw in some gc.collect() calls in their
> >>  > code where objects go out of scope. This doesn't seem to be a big
> >>  > problem, but it's a lingering non-deterministic MPI deadlock
> >>  > waiting to happen and very hard to debug.
> >>  >
> >> 
> >>  What about insisting that objects that require collective calls
> >>  during destruction must have a collective 'clear' or 'destroy'
> >>  function that cleans up the object?
> >> 
> >>  Related to this discussion, we really need to start marking
> >>  (logically) collective functions in the docstrings.
> >> 
> >>  Garth
> >> 
> >>  > Martin
> >>  >
> >>  >
> >>  > On 6 October 2014 15:05, Martin Sandve Alnæs
> >>  > <[email protected]> wrote:
> >>  >
> >>  > Yes. The difference is that MPI initialization / destruction
> >>  > happens at the beginning / end of the process, while the
> >>  > destructors happen all the time, anywhere. I think that makes this
> >>  > a harder problem to solve.
> >>  >
> >>  > Anyway, I was replying to "would it help if MPI is initialised
> >>  > explicitly in the setup" and the answer is still no, because MPI
> >>  > init is not the problem in the tests, although it is of a similar
> >>  > nature.
> >>  >
> >>  > I'm pondering whether it's possible (and necessary) to add a
> >>  > dolfin.mpi_gc() function and overload __del__ in some classes to
> >>  > handle this deterministically.
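
In case it helps the discussion, the mpi_gc() idea could be as simple as
the sketch below (dolfin.mpi_gc does not exist; this is just the proposal
spelled out):

import gc
from dolfin import MPI, mpi_comm_world

def mpi_gc():
    """Run a garbage collection pass at the same point on every rank."""
    gc.collect()                   # run pending destructors now
    MPI.barrier(mpi_comm_world())  # keep the ranks together afterwards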
> >>  >
> >>  > On 6 Oct 2014 at 14:54, "Garth N. Wells" <[email protected]>
> >>  > wrote:
> >>  >
> >>  >
> >>  >
> >>  > On Mon, 6 Oct, 2014 at 1:10 PM, Martin Sandve Alnæs
> >>  > <[email protected]> wrote:
> >>  >
> >>  > MPI initialization has nothing to do with the test problems. The
> >>  > problem is the destructors of objects. It is temporarily solved by
> >>  > calling gc.collect() in pytest fixtures.
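
Something like the following autouse fixture is presumably what is meant
here (a guess at its shape, not the actual branch contents):

import gc
import pytest

@pytest.fixture(autouse=True)
def collect_garbage(request):
    # Force a gc pass after every test so that collective destructors,
    # e.g. HDF5 file closes, run at the same point on all ranks.
    request.addfinalizer(gc.collect)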
> >>  >
> >>  >
> >>  > The core problem is the same. The problem I describe occurs when
> >>  > the SubSystemsManager singleton that controls MPI initialisation
> >>  > is destroyed (and finalises MPI) before a PETSc object is
> >>  > destroyed. It is an issue of destruction order.
> >>  >
> >>  > Garth
> >>  >
> >>  > I think we should implement the with statement pattern for all
> >>  > file types in dolfin to allow scope management.
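
That is, user code would then read something like the sketch below
(assuming HDF5File grows the __enter__/__exit__ support proposed above):

from dolfin import HDF5File, UnitSquareMesh, mpi_comm_world

mesh = UnitSquareMesh(8, 8)
with HDF5File(mpi_comm_world(), "mesh.h5", "w") as f:
    f.write(mesh, "mesh")
# the file is closed collectively here, at the same point on every rank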
> >>  >
> >>  > If vectors _do_ call MPI in destructors, that's a problem for
> >>  > nontrivial dolfin Python programs.
> >>  >
> >>  > On 6 Oct 2014 at 13:53, "Garth N. Wells" <[email protected]>
> >>  > wrote:
> >>  >
> >>  >
> >>  > On Mon, 6 Oct, 2014 at 12:32 PM, Jan Blechta
> >>  > <[email protected]> wrote:
> >>  >
> >>  > On Mon, 06 Oct 2014 11:53:58 +0100
> >>  > "Garth N. Wells" <[email protected]> wrote:
> >>  >
> >>  >  On Mon, 6 Oct, 2014 at 11:23 AM, Martin Sandve Alnæs
> >>  >  <[email protected]> wrote:
> >>  >  > All collective destructors must be managed explicitly in
> >>  >  > Python, preferably via the with statement. Are there any apart
> >>  >  > from file objects? Vectors? Matrices? Meshes?
> >>  >  >
> >>  >
> >>  >  Off the top of my head I can't think of any cases, apart from
> >>  >  IO, in which a (collective) MPI call needs to be made inside a
> >>  >  destructor. For IO, we could insist on a user closing or flushing
> >>  >  a file explicitly. We cannot guarantee that 3rd party linear
> >>  >  algebra backends do not call MPI when objects are destroyed.
> >>  >
> >>  > VecDestroy and MatDestroy (called by the PETScVector and
> >>  > PETScBaseMatrix destructors) are claimed to be collective by the
> >>  > PETSc docs:
> >>  > http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Vec/VecDestroy.html
> >>  > http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatDestroy.html
> >>  >
> >>  > Yes, they are collective but don't necessarily make MPI calls.
> >>  > My understanding is that 'collective' is not the issue but
> >>  > whether or not MPI calls are made from a destructor. Some
> >>  > functions will only make sense if called collectively (e.g.,
> >>  > VecDestroy), but might not make collective MPI calls.
> >>  >
> >>  > For the tests, assuming PyTest permits a 'setup' function like
> >>  > unittest, would it help if MPI is initialised explicitly in the
> >>  > setup function and closed down at the end of a test suite (if
> >>  > possible with PyTest)?
> >>  >
> >>  > Garth
> >>  >
> >>  >
> >>  >
> >>  > Jan
> >>  >
> >>  >
> >>  >  We have had this problem in the past with the 'automatic'
> >>  >  finalisation of MPI, which is a problem if MPI is shut down
> >>  >  before PETSc.
> >>  >
> >>  >  Garth
> >>  >
> >>  >
> >>  >  > On 6 Oct 2014 at 12:18, "Jan Blechta"
> >>  >  > <[email protected]> wrote:
> >>  >  >> On Mon, 6 Oct 2014 12:07:02 +0200
> >>  >  >> Martin Sandve Alnæs <[email protected]> wrote:
> >>  >  >>
> >>  >  >> > The problem is that gc is nondeterministic and, in
> >>  >  >> > particular, does not run with equal timing and ordering on
> >>  >  >> > each MPI process.
> >>  >  >> >
> >>  >  >> > We can't use the with statement to handle the scope of
> >>  >  >> > every single dolfin object in a program.
> >>  >  >>
> >>  >  >> Most of the DOLFIN destructors are not collective. So the
> >>  >  >> moral is that we should avoid collective destructors as much
> >>  >  >> as possible and document them the way it is done in the PETSc
> >>  >  >> docs.
> >>  >  >>
> >>  >  >> Jan
> >>  >  >>
> >>  >  >> >
> >>  >  >> > We can change all file handling to use with, and require
> >>  >  >> > the user to use that in parallel.
> >>  >  >> >
> >>  >  >> > On 6 Oct 2014 at 11:41, "Jan Blechta" <[email protected]>
> >>  >  >> > wrote:
> >>  >  >> >
> >>  >  >> > > On Mon, 6 Oct 2014 09:48:29 +0200
> >>  >  >> > > Martin Sandve Alnæs <[email protected]> wrote:
> >>  >  >> > >
> >>  >  >> > > > The 'fix' that's in the branch now was to trigger Python
> >>  >  >> > > > garbage collection (suggested by Øyvind Evju) before
> >>  >  >> > > > each test.
> >>  >  >> > > >
> >>  >  >> > > > This probably means we have a general problem in
> >>  >  >> > > > dolfin with non-deterministic destruction order of
> >>  >  >> > > > objects in parallel. Any destructor that uses MPI
> >>  >  >> > > > represents a potential deadlock.
> >>  >  >> > >
> >>  >  >> > > To understand the issue: is the problem that garbage
> >>  >  >> > > collection does not guarantee when the object is
> >>  >  >> > > destroyed?
> >>  >  >> > >
> >>  >  >> > > Here http://stackoverflow.com/a/5071376/1796717 the
> >>  >  >> > > distinction between variable scoping and object cleanup
> >>  >  >> > > is discussed. Quoting it:
> >>  >  >> > >
> >>  >  >> > >   Deterministic cleanup happens through the with statement.
> >>  >  >> > >
> >>  >  >> > > which might be a proper solution to the problem.
> >>  >  >> > >
> >>  >  >> > > Jan
> >>  >  >> > >
> >>  >  >> > > >
> >>  >  >> > > > On 19 September 2014 12:52, Jan Blechta
> >>  >  >> > > > <[email protected]> wrote:
> >>  >  >> > > >
> >>  >  >> > > > > On Fri, 19 Sep 2014 00:27:50 +0200
> >>  >  >> > > > > Jan Blechta <[email protected]> wrote:
> >>  >  >> > > > >
> >>  >  >> > > > > > Yes, after many trials using
> >>  >  >> > > > > >
> >>  >  >> > > > > > $ cd test/unit/io/python
> >>  >  >> > > > > > $ while true; do git clean -fdx && mpirun -n 3 xterm -e gdb -ex r -ex q -args python -m pytest -sv; done
> >>  >  >> > > > > > # when it hangs and you interrupt it, it asks for
> >>  >  >> > > > > > # confirmation for quitting, so you type n and enjoy gdb...
> >>  >  >> > > > > >
> >>  >  >> > > > > > I've seen a situation where 2 processes deadlocked
> >>  >  >> > > > > > on HDF5Interface::close_file() in DOLFIN with a
> >>  >  >> > > > > > backtrace like:
> >>  >  >> > > > > >
> >>  >  >> > > > > > # MPI barrier
> >>  >  >> > > > > > ...
> >>  >  >> > > > > > # MPI close
> >>  >  >> > > > > > # HDF5 lib calls
> >>  >  >> > > > > > H5FClose()
> >>  >  >> > > > > > dolfin::HDF5Interface::close_file()
> >>  >  >> > > > > > dolfin::HDF5File::close()
> >>  >  >> > > > > > dolfin::HDF5File::~HDF5File()
> >>  >  >> > > > > > dolfin::HDF5File::~HDF5File()
> >>  >  >> > > > > > # smart ptr management
> >>  >  >> > > > > > # garbage collection
> >>  >  >> > > > > >
> >>  >  >> > > > > > while the 3rd process is waiting far away. Isn't it
> >>  >  >> > > > > > strange that the destructor is there twice in the
> >>  >  >> > > > > > stacktrace? (The upper one is on a '}' line, which I
> >>  >  >> > > > > > don't get.) What does it mean?
> >>  >  >> > > > >
> >>  >  >> > > > > Probably just a code generation artifact - nothing
> >>  >  >> > > > > harmful, see http://stackoverflow.com/a/15244091/1796717
> >>  >  >> > > > >
> >>  >  >> > > > > Jan
> >>  >  >> > > > >
> >>  >  >> > > > > >
> >>  >  >> > > > > > Jan
> >>  >  >> > > > > >
> >>  >  >> > > > > >
> >>  >  >> > > > > > On Thu, 18 Sep 2014 16:20:51 +0200
> >>  >  >> > > > > > Martin Sandve Alnæs <[email protected]> wrote:
> >>  >  >> > > > > >
> >>  >  >> > > > > > > I've added the MPI fixes for the temppath fixture
> >>  >  >> > > > > > > and fixed some other related issues while at it:
> >>  >  >> > > > > > > when parameterizing a test that uses a temppath
> >>  >  >> > > > > > > fixture, there is a need for separate directories
> >>  >  >> > > > > > > for each parameter combo. A further improvement
> >>  >  >> > > > > > > would be automatic cleaning of old tempdirs, but I
> >>  >  >> > > > > > > leave that for now.
> >>  >  >> > > > > > >
> >>  >  >> > > > > > > I've pushed these changes to the branch
> >>  >  >> > > > > > > aslakbergersen/topic-change-unittest-to-pytest
> >>  >  >> > > > > > >
> >>  >  >> > > > > > > The tests still hang though, in the closing of
> >>  >  >> > > > > > > HDF5File.
> >>  >  >> > > > > > >
> >>  >  >> > > > > > > Here's how to debug if someone wants to give it
> >>  >  >> > > > > > > a shot. Just run:
> >>  >  >> > > > > > >     mpirun -np 3 python -m pytest -s -v
> >>  >  >> > > > > > > With gdb:
> >>  >  >> > > > > > >     mpirun -np 3 xterm -e gdb --args python -m pytest
> >>  >  >> > > > > > > then enter 'r' in each of the three xterms.
> >>  >  >> > > > > > >
> >>  >  >> > > > > > > You may have to try a couple of times to get the
> >>  >  >> > > > > > > hanging behaviour.
> >>  >  >> > > > > > >
> >>  >  >> > > > > > > Martin
> >>  >  >> > > > > > >
> >>  >  >> > > > > > > On 18 September 2014 13:23, Martin Sandve Alnæs
> >>  >  >> > > > > > > <[email protected]> wrote:
> >>  >  >> > > > > > >
> >>  >  >> > > > > > > > Good spotting both of you, thanks.
> >>  >  >> > > > > > > >
> >>  >  >> > > > > > > > Martin
> >>  >  >> > > > > > > >
> >>  >  >> > > > > > > > On 18 September 2014 13:01, Lawrence Mitchell
> >>  >  >> > > > > > > > < [email protected]> wrote:
> >>  >  >> > > > > > > >
> >>  >  >> > > > > > > >> On 18/09/14 11:42, Jan Blechta wrote:
> >>  >  >> > > > > > > >> > Some problems (when running in a clean dir)
> >>  >  >> > > > > > > >> > are avoided using this (although incorrect)
> >>  >  >> > > > > > > >> > patch. There are race conditions in the
> >>  >  >> > > > > > > >> > creation of the temp dir. It should be done
> >>  >  >> > > > > > > >> > using an atomic operation.
> >>  >  >> > > > > > > >> >
> >>  >  >> > > > > > > >> > Jan
> >>  >  >> > > > > > > >> >
> >>  >  >> > > > > > > >> >
> >>  >  >> > > > > > > >> >
> >>  >  >> > > > > > > >> > ==================================================================
> >>  >  >> > > > > > > >> > diff --git a/test/unit/io/python/test_XDMF.py b/test/unit/io/python/test_XDMF.py
> >>  >  >> > > > > > > >> > index 9ad65a4..31471f1 100755
> >>  >  >> > > > > > > >> > --- a/test/unit/io/python/test_XDMF.py
> >>  >  >> > > > > > > >> > +++ b/test/unit/io/python/test_XDMF.py
> >>  >  >> > > > > > > >> > @@ -28,8 +28,9 @@ def temppath():
> >>  >  >> > > > > > > >> >      filedir = os.path.dirname(os.path.abspath(__file__))
> >>  >  >> > > > > > > >> >      basename = os.path.basename(__file__).replace(".py", "_data")
> >>  >  >> > > > > > > >> >      temppath = os.path.join(filedir, basename, "")
> >>  >  >> > > > > > > >> > -    if not os.path.exists(temppath):
> >>  >  >> > > > > > > >> > -        os.mkdir(temppath)
> >>  >  >> > > > > > > >> > +    if MPI.rank(mpi_comm_world()) == 0:
> >>  >  >> > > > > > > >> > +        if not os.path.exists(temppath):
> >>  >  >> > > > > > > >> > +            os.mkdir(temppath)
> >>  >  >> > > > > > > >> >      return temppath
> >>  >  >> > > > > > > >>
> >>  >  >> > > > > > > >> There's still a race condition here because
> >>  >  >> > > > > > > >> ranks other than zero might try and use
> >>  >  >> > > > > > > >> temppath before it's created.  I think you want
> >>  >  >> > > > > > > >> something like the below:
> >>  >  >> > > > > > > >>
> >>  >  >> > > > > > > >> if MPI.rank(mpi_comm_world()) == 0:
> >>  >  >> > > > > > > >>     if not os.path.exists(temppath):
> >>  >  >> > > > > > > >>         os.mkdir(temppath)
> >>  >  >> > > > > > > >> MPI.barrier(mpi_comm_world())
> >>  >  >> > > > > > > >> return temppath
> >>  >  >> > > > > > > >>
> >>  >  >> > > > > > > >> If you're worried about the OS not creating
> >>  >  >> > > > > > > >> files atomically, you can always mkdir into a
> >>  >  >> > > > > > > >> tmp directory and then os.rename(tmp, temppath),
> >>  >  >> > > > > > > >> since POSIX guarantees that renames are atomic.
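
A sketch of that variant, using the same dolfin MPI wrappers as in the
patch above (the helper name is made up and the trailing-separator handling
is an assumption):

import os
import tempfile
from dolfin import MPI, mpi_comm_world

def make_temppath(temppath):
    target = temppath.rstrip(os.sep)
    if MPI.rank(mpi_comm_world()) == 0 and not os.path.exists(target):
        # Create a uniquely named directory, then rename it into place;
        # the rename is atomic on POSIX filesystems.
        tmp = tempfile.mkdtemp(dir=os.path.dirname(target))
        try:
            os.rename(tmp, target)
        except OSError:
            os.rmdir(tmp)  # lost the race: the directory already exists
    MPI.barrier(mpi_comm_world())  # other ranks wait until the dir exists
    return temppath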
> >>  >  >> > > > > > > >>
> >>  >  >> > > > > > > >> Lawrence
> >>  >  >> > > > > > > >
> >>  >  >> > > > > > > >
> >>  >  >> > > > > >
> >>  >  >> > > > >
> >>  >  >> > > > >
> >>  >  >> > >
> >>  >  >> > >
> >>  >  >>
> >>  >
> >>  >
> >>  >
> >>  >
> >> 
> > 

_______________________________________________
fenics mailing list
[email protected]
http://fenicsproject.org/mailman/listinfo/fenics
