The problem is that garbage collection is nondeterministic, and in particular does not run with the same timing and ordering on each MPI process.
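To make the hazard concrete, here is a minimal sketch (illustrative only; CollectiveResource is a made-up stand-in, not a real dolfin class) of how a destructor that performs a collective MPI call goes wrong once the garbage collector runs at different points on different ranks:

    from dolfin import MPI, mpi_comm_world

    class CollectiveResource:
        """Stand-in for something like HDF5File, whose cleanup is collective."""

        def close(self):
            # Collective call: every rank must reach it for it to complete.
            MPI.barrier(mpi_comm_world())

        def __del__(self):
            # If this object ends up in a reference cycle (or is kept alive by
            # the test framework), it is only reclaimed when the cyclic garbage
            # collector happens to run, and that moment is not synchronised
            # between ranks. One rank can then block in the barrier below while
            # another rank has not started collecting yet: deadlock.
            self.close()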
We can't use the with statement to handle the scope of every single dolfin object in a program. We can change all file handling to use with, and require the user to use that in parallel.

On 6 Oct 2014 11:41, "Jan Blechta" <[email protected]> wrote:
> On Mon, 6 Oct 2014 09:48:29 +0200
> Martin Sandve Alnæs <[email protected]> wrote:
>
> > The 'fix' that's in the branch now was to trigger Python garbage
> > collection (suggested by Øyvind Evju) before each test.
> >
> > This probably means we have a general problem in dolfin with
> > non-deterministic destruction order of objects in parallel. Any
> > destructor that uses MPI represents a potential deadlock.
>
> To understand the issue: is the problem that garbage collection does
> not guarantee when the object is destroyed?
>
> Here http://stackoverflow.com/a/5071376/1796717 the distinction between
> variable scoping and object cleanup is discussed. Quoting it:
>
>     Deterministic cleanup happens through the with statement.
>
> which might be a proper solution to the problem.
>
> Jan
>
> > On 19 September 2014 12:52, Jan Blechta <[email protected]> wrote:
> >
> > > On Fri, 19 Sep 2014 00:27:50 +0200
> > > Jan Blechta <[email protected]> wrote:
> > >
> > > > Yes, after many trials using
> > > >
> > > >     $ cd test/unit/io/python
> > > >     $ while true; do git clean -fdx && mpirun -n 3 xterm -e gdb -ex r -ex q -args python -m pytest -sv; done
> > > >     # when it hangs and you interrupt it, it asks for confirmation for
> > > >     # quitting, so you type n and enjoy gdb...
> > > >
> > > > I've seen a situation when 2 processes deadlocked on
> > > > HDF5Interface::close_file() in DOLFIN with a backtrace like
> > > >
> > > >     # MPI barrier
> > > >     ...
> > > >     # MPI close
> > > >     # HDF5 lib calls
> > > >     H5Fclose()
> > > >     dolfin::HDF5Interface::close_file()
> > > >     dolfin::HDF5File::close()
> > > >     dolfin::HDF5File::~HDF5File()
> > > >     dolfin::HDF5File::~HDF5File()
> > > >     # smart ptr management
> > > >     # garbage collection
> > > >
> > > > while the 3rd process is waiting far away. Isn't it strange that the
> > > > destructor is there twice in the stack trace? (The upper one is on the
> > > > '}' line, which I don't get.) What does it mean?
> > >
> > > Probably just a code generation artifact - nothing harmful, see
> > > http://stackoverflow.com/a/15244091/1796717
> > >
> > > Jan
> > >
> > > > Jan
> > > >
> > > > On Thu, 18 Sep 2014 16:20:51 +0200
> > > > Martin Sandve Alnæs <[email protected]> wrote:
> > > >
> > > > > I've added the MPI fixes for the temppath fixture and fixed
> > > > > some other related issues while at it: when parameterizing
> > > > > a test that uses a temppath fixture, there is a need for
> > > > > separate directories for each parameter combo.
> > > > > A further improvement would be automatic cleaning of old
> > > > > tempdirs, but I leave that for now.
> > > > >
> > > > > I've pushed these changes to the branch
> > > > > aslakbergersen/topic-change-unittest-to-pytest
> > > > >
> > > > > The tests still hang though, in the closing of HDF5File.
> > > > >
> > > > > Here's how to debug if someone wants to give it a shot:
> > > > > Just run:
> > > > >     mpirun -np 3 python -m pytest -s -v
> > > > > With gdb:
> > > > >     mpirun -np 3 xterm -e gdb --args python -m pytest
> > > > > then enter 'r' in each of the three xterms.
> > > > >
> > > > > You may have to try a couple of times to get the hanging
> > > > > behaviour.
> > > > >
> > > > > Martin
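As a side note, the per-test garbage collection workaround mentioned earlier in the thread could be expressed as a pytest fixture roughly like this (a sketch only, not the exact code in the branch; the extra barrier is my addition):

    import gc
    import pytest
    from dolfin import MPI, mpi_comm_world

    @pytest.fixture(autouse=True)
    def collect_garbage_before_test():
        # Run the cyclic garbage collector at the same well-defined point on
        # every rank, so objects left over from the previous test are not
        # destroyed at arbitrary, rank-dependent times later on.
        gc.collect()
        # Extra precaution: keep any rank from starting the next test while
        # another is still finishing collective cleanup triggered above.
        MPI.barrier(mpi_comm_world())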
> > > > > On 18 September 2014 13:23, Martin Sandve Alnæs
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Good spotting, both of you, thanks.
> > > > > >
> > > > > > Martin
> > > > > >
> > > > > > On 18 September 2014 13:01, Lawrence Mitchell
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > >> On 18/09/14 11:42, Jan Blechta wrote:
> > > > > >> > Some problems (when running in a clean dir) are avoided
> > > > > >> > using this (although incorrect) patch. There are race
> > > > > >> > conditions in the creation of the temp dir. It should be done
> > > > > >> > using an atomic operation.
> > > > > >> >
> > > > > >> > Jan
> > > > > >> >
> > > > > >> > ==================================================================
> > > > > >> > diff --git a/test/unit/io/python/test_XDMF.py b/test/unit/io/python/test_XDMF.py
> > > > > >> > index 9ad65a4..31471f1 100755
> > > > > >> > --- a/test/unit/io/python/test_XDMF.py
> > > > > >> > +++ b/test/unit/io/python/test_XDMF.py
> > > > > >> > @@ -28,8 +28,9 @@ def temppath():
> > > > > >> >      filedir = os.path.dirname(os.path.abspath(__file__))
> > > > > >> >      basename = os.path.basename(__file__).replace(".py", "_data")
> > > > > >> >      temppath = os.path.join(filedir, basename, "")
> > > > > >> > -    if not os.path.exists(temppath):
> > > > > >> > -        os.mkdir(temppath)
> > > > > >> > +    if MPI.rank(mpi_comm_world()) == 0:
> > > > > >> > +        if not os.path.exists(temppath):
> > > > > >> > +            os.mkdir(temppath)
> > > > > >> >      return temppath
> > > > > >>
> > > > > >> There's still a race condition here because ranks other than
> > > > > >> zero might try to use temppath before it's created. I
> > > > > >> think you want something like the below:
> > > > > >>
> > > > > >>     if MPI.rank(mpi_comm_world()) == 0:
> > > > > >>         if not os.path.exists(temppath):
> > > > > >>             os.mkdir(temppath)
> > > > > >>     MPI.barrier(mpi_comm_world())
> > > > > >>     return temppath
> > > > > >>
> > > > > >> If you're worried about the OS not creating files
> > > > > >> atomically, you can always mkdir into a tmp directory and
> > > > > >> then os.rename(tmp, temppath), since POSIX guarantees that
> > > > > >> renames are atomic.
> > > > > >>
> > > > > >> Lawrence
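For completeness, a sketch of the rename-based variant Lawrence describes (hypothetical code, not part of any patch in this thread): rank 0 stages the directory under a temporary name and atomically renames it into place, and the barrier keeps the other ranks from using it too early.

    import os
    import tempfile
    from dolfin import MPI, mpi_comm_world

    def ensure_temppath(temppath):
        target = temppath.rstrip(os.sep)
        if MPI.rank(mpi_comm_world()) == 0 and not os.path.exists(target):
            # Create under a unique temporary name first, then rename into
            # place. POSIX guarantees rename() is atomic, so a concurrent run
            # never sees a half-created directory.
            staging = tempfile.mkdtemp(dir=os.path.dirname(target))
            try:
                os.rename(staging, target)
            except OSError:
                # Someone else created target in the meantime; drop staging.
                os.rmdir(staging)
        # No rank proceeds until rank 0 is done creating the directory.
        MPI.barrier(mpi_comm_world())
        return temppath

The staging-plus-rename step only matters if several independent runs can race on the same directory; within a single MPI job the rank-0 check plus barrier already covers it.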
