On Mon, Aug 21, 2006 at 01:34:05PM -0500, Murali Vilayannur wrote:
> Hey Rob,
> Shall we first try to diagnose the problem a little bit?
> Can you describe the problem a little bit more?

Ok, sure:

Some background:

You can specify a ROMIO hint (cb_config_list) that lists machines (as
returned by MPI_Get_processor_name).  We call these nodes
'aggregators', and ROMIO will use just those nodes for collective I/O
when it carries out its two-phase optimization.  Why would you do
this?  Consider, for example, a situation where you have many compute
nodes, but only some of them are well-connected to the storage
devices.  

noncontig_coll2 is a test to exercise the aggregator code of ROMIO.
We construct a list of all the machines involved, and pass several
permutations of that hint to ROMIO.   If you only have one host, the
permutations aren't very exciting, so to really exercise this test you
have to run it on several distinct machines.  I only recently added
noncontig_coll2 to the jazz nightly tests, so this is why the problem
has gone undetected for a few weeks.
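
To make the "permutations of that hint" idea concrete, here is a minimal
sketch of building a cb_config_list string from a host list in a chosen
order.  It assumes the usual "hostname:count" comma-separated syntax for
the hint; build_cb_config_list and reverse_order are hypothetical helper
names for illustration, not the functions noncontig_coll2.c actually uses.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write "hosts[order[0]]:1,hosts[order[1]]:1,..."
 * into out.  Returns 0 on success, -1 if out is too small. */
int build_cb_config_list(const char *hosts[], const int order[],
                         int nhosts, char *out, size_t outlen)
{
    size_t used = 0;
    int i;

    assert(hosts != NULL && order != NULL && out != NULL);
    assert(nhosts > 0 && outlen > 0);

    out[0] = '\0';
    for (i = 0; i < nhosts; i++) {
        /* Prefix every entry except the first with a comma. */
        int n = snprintf(out + used, outlen - used, "%s%s:1",
                         i ? "," : "", hosts[order[i]]);
        if (n < 0 || (size_t)n >= outlen - used)
            return -1;          /* entry did not fit in the buffer */
        used += (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: fill order[] with nhosts-1, ..., 1, 0. */
void reverse_order(int order[], int nhosts)
{
    int i;

    for (i = 0; i < nhosts; i++)
        order[i] = nhosts - 1 - i;
}
```

The resulting string would then be handed to MPI_Info_set under the
"cb_config_list" key.  With a single host every ordering yields the same
string, which is why the test needs several distinct machines.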

There is a core i/o routine that does the following:

- create MPI info with cb_config_list hint
- construct noncontiguous datatype
- rank 0 deletes the file
- everyone opens the file, passing the info with cb_config_list
- set noncontiguous file view
- write/read/verify 
- close
- rank 0 deletes
- open, again with hint set.  upon re-opening, the default
  (contiguous) file view is in effect
- write/read/verify
- close
- rank 0 deletes
- write/read/verify
- close

This core i/o routine is called several times.  We call it once w/o
any hints set, and we get no errors.  We call it again, setting a hint
to the default values and again, no errors.  We call it a third time,
setting a cb_config_list with a different order of hosts, and the
first write/read/verify works, but not the second.  We fail with an
I/O error.

Here's a link to the noncontig_coll2 test:
http://www.mcs.anl.gov/mpi/mpich2/cvsweb.cgi/romio/test/noncontig_coll2.c

I've verified that turning off the ncache makes the full
noncontig_coll2 test pass.  

==rob
-- 
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Labs, IL USA                B29D F333 664A 4280 315B
_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
