file_status_get_count

Eugene Loh Fri, 4 Nov 2011 12:06:08 -0400

On 11/4/2011 5:56 AM, Jeff Squyres wrote:

On Oct 28, 2011, at 1:59 AM, Eugene Loh wrote:

In our MTT testing, we see ibm/io/file_status_get_count fail occasionally with:


File locking failed in ADIOI_Set_lock(fd A,cmd F_SETLKW/7,type F_RDLCK/0,whence 0) with return value FFFFFFFF and errno 5.
- If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running on all the machines, and mount the directory with the 'noac' option (no attribute caching).
- If the file system is LUSTRE, ensure that the directory is mounted with the 'flock' option.
ADIOI_Set_lock:: Input/output error
ADIOI_Set_lock:offset 0, length 1

One of the curious things (to us) about this test is that no one else appears 
to run it.  Looking back through a lot of MTT results, essentially the only 
results reported are Oracle.  Almost no non-Oracle results for this test have 
been reported in the last few months.  Is there something special about this 
test we should know about?

Not that I'm aware of.

I see why Cisco skipped it -- I didn't have the "io" directory listed in my 
list of IBM directories to traverse.  Doh!  That's been fixed.

(Cisco's MTT runs look like they need a bit of TLC -- I'm guessing IB is down 
on a node or two, resulting in a lot of false failures, but I likely won't have 
time to look at them until after SC :-( )

Yeah. In our recent experience, everyone's MTT runs seem to need lots of TLC. Anyhow, thanks for the feedback: it appears there is no general intentional avoidance of this particular test that we were simply unaware of.

P.S.  We're also interested in understanding the error message better.  I 
suppose that's more appropriately taken up with ROMIO folks, which I will do, 
but if anyone on this list has useful information I'd love to hear it.  The 
error apparently comes when MPI_File_get_size sets a lock.  Each process has 
its own file and the test usually passes, so it's unclear to me what the 
problem is.  Further, the error message discussing NFS and Lustre strikes me as 
rather speculative.  We tend to run these tests repeatedly on the same file 
systems from the same test nodes.  Anyone have any idea how sound the 
NFSv3/lockd/noac advice is or what the real issue is here?

No.  You'll need to ask Rob Latham.

Thanks. He replied to my inquiry on the MPICH list. Main answer is that robustness bets are off on NFS and the message might be a little misleading.
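
For anyone who wants to poke at the file system without MPI in the picture, the lock request named in the error message (F_SETLKW, F_RDLCK, whence 0, offset 0, length 1) can be issued directly through fcntl. What follows is only a minimal sketch, assuming a POSIX system. The helper name try_read_lock is hypothetical and is not ROMIO or Open MPI code, and because the test usually passes, a single call may well succeed on the same file system where the MTT run failed.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not part of ROMIO or Open MPI): take and then
 * release a blocking read lock on the first byte of `path`, using the
 * same parameters that appear in the error message.
 * Returns 0 on success, or the errno value of the first failing call. */
int try_read_lock(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return errno;

    struct flock lock;
    memset(&lock, 0, sizeof lock);
    lock.l_type = F_RDLCK;     /* shared (read) lock */
    lock.l_whence = SEEK_SET;  /* "whence 0" in the message */
    lock.l_start = 0;          /* "offset 0" */
    lock.l_len = 1;            /* "length 1" */

    int err = 0;
    if (fcntl(fd, F_SETLKW, &lock) == -1) {
        /* This is the call that would report EIO in the failing case. */
        err = errno;
    } else {
        /* Lock acquired; release it again before closing. */
        lock.l_type = F_UNLCK;
        if (fcntl(fd, F_SETLK, &lock) == -1)
            err = errno;
    }

    close(fd);
    return err;
}
```

Calling try_read_lock in a loop from a small driver on each test node, with a path on the file system in question, would show whether the lock fails intermittently outside of MPI. A nonzero return can be passed to strerror to get the same kind of text as the "Input/output error" line above.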

Re: [OMPI devel] ibm/io/file_status_get_count