On Mon, Aug 21, 2006 at 01:34:05PM -0500, Murali Vilayannur wrote:
> Hey Rob,
> Shall we first try to diagnose the problem a little bit?
> Can you describe the problem a little bit more?

Ok, sure:

Some background: You can specify a ROMIO hint (cb_config_list) that lists machines (as returned by MPI_Get_processor_name). We call these nodes 'aggregators', and ROMIO will use just those nodes for collective I/O when it carries out its two-phase optimization. Why would you do this? Consider, for example, a situation where you have many compute nodes, but only some of them are well-connected to the storage devices.

noncontig_coll2 is a test that exercises the aggregator code of ROMIO. We construct a list of all the machines involved and pass several permutations of that list to ROMIO as the hint. If you only have one host, the permutations aren't very exciting, so to really exercise this test you have to run it on several distinct machines. I only recently added noncontig_coll2 to the jazz nightly tests, which is why the problem has gone undetected for a few weeks.

There is a core I/O routine that does the following:

- create MPI info with cb_config_list hint
- construct noncontiguous datatype
- rank 0 deletes the file
- everyone opens the file, passing the info with cb_config_list
- set noncontiguous file view
- write/read/verify
- close
- rank 0 deletes
- open, again with hint set. Upon re-opening, the default (contiguous) file view is in effect
- write/read/verify
- close
- rank 0 deletes
- write/read/verify
- close

This core I/O routine is called several times. We call it once without any hints set, and we get no errors. We call it again, setting the hint to its default value, and again, no errors. We call it a third time, setting a cb_config_list with a different order of hosts, and the first write/read/verify works, but not the second: we fail with an I/O error.

Here's a link to the noncontig_coll2 test:
http://www.mcs.anl.gov/mpi/mpich2/cvsweb.cgi/romio/test/noncontig_coll2.c

I've verified that turning off the ncache makes the full noncontig_coll2 test pass.
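
To make the "different order of hosts" case concrete, here is a rough sketch of how such a hint string could be put together. It is not the code from noncontig_coll2.c: the function name build_cb_config_list and its parameters are made up for illustration, and it assumes a plain comma-separated list of host names is an acceptable cb_config_list value.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of ROMIO or the test): join host names
 * into a comma-separated string for use as a cb_config_list value.
 * If `reverse` is nonzero, the hosts are emitted in reverse order.
 * Returns 0 on success, -1 if `dest` is too small to hold the result.
 */
int build_cb_config_list(const char *hosts[], int nhosts,
                         int reverse, char *dest, size_t len)
{
    size_t used = 0;
    int i;

    assert(dest != NULL && len > 0);
    dest[0] = '\0';

    for (i = 0; i < nhosts; i++) {
        /* pick the i-th host, counting from the end when reversing */
        const char *name = hosts[reverse ? nhosts - 1 - i : i];
        /* prefix every entry except the first with a comma */
        int n = snprintf(dest + used, len - used, "%s%s",
                         i > 0 ? "," : "", name);
        if (n < 0 || (size_t)n >= len - used) {
            dest[0] = '\0';     /* do not leave a truncated list behind */
            return -1;
        }
        used += (size_t)n;
    }
    return 0;
}
```

In an MPI program the resulting string would then be attached with MPI_Info_set(info, "cb_config_list", buf) and the info object passed to MPI_File_open, with every rank supplying the same value.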

==rob

-- 
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Labs, IL USA                B29D F333 664A 4280 315B