Hello, 

We have a few files created by a particular application where reads to those 
files consistently hang. The debug log on a client attempting a read() has 
messages like:

> ldlm_completion_ast(): waiting indefinitely because of NO_TIMEOUT ...

This is printed when the flag LDLM_FL_NO_TIMEOUT is set, and code comments 
above that flag imply that it is set for group locks. So, we've been trying to 
determine whether the application in question uses group locks. (I have reached 
out to the app's developers but do not have a response yet.)

If I open the file with O_NONBLOCK, any reads immediately return with error 11 
/ EWOULDBLOCK. This behavior is documented to occur for Lustre group locks.
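
A minimal sketch of this kind of check, assuming Python 3 on a Linux client. 
The function name probe_nonblocking_read is hypothetical and the file path is 
whatever file is being tested; nothing here is Lustre-specific, it only uses 
the standard open()/read() calls with O_NONBLOCK:

```python
import errno
import os


def probe_nonblocking_read(path, nbytes=4096):
    """Open `path` with O_NONBLOCK and attempt a single read.

    Returns ("ok", number_of_bytes_read) if the read succeeds, or
    ("blocked", errno_value) if it fails with EAGAIN / EWOULDBLOCK.
    Any other error is re-raised unchanged.
    """
    fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    try:
        data = os.read(fd, nbytes)
        return ("ok", len(data))
    except OSError as exc:
        # On Linux EAGAIN and EWOULDBLOCK share the same value (11).
        if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
            return ("blocked", exc.errno)
        raise
    finally:
        os.close(fd)
```

Calling probe_nonblocking_read() on an unaffected file should return an "ok" 
tuple; on one of the affected files I would expect a "blocked" tuple with 
errno 11, matching the behavior described above.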

However, I would like to clarify: is the LDLM_FL_NO_TIMEOUT flag set *only* 
when a group lock is held, or are there other circumstances where the behavior 
described above could occur?

If this is caused by a group lock, is there an easy way to tell from server-side 
logs or data which client(s) hold the group lock and are blocking access? The 
motivation is that we believe any jobs accessing these files have long since 
been killed, and no nodes from the job are expected to be holding the files 
open. We would like to confirm or rule out that possibility by easily 
identifying any such clients.

Advice on how to effectively debug ldlm issues could be useful beyond just this 
issue. In general, if there is a reliable way to start from a log entry for a 
lock like 

> ... ns: lustre-OST0000-osc-ffff9a0942c79800 lock: 
> 000000003f3a5950/0xe54ca8d2d7b66d03 lrc: 4/1,0 mode: --/PR  ...

and get information about the client(s) holding that lock and any contending 
locks, that would be helpful in debugging situations like this.
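
As a starting point, the fields visible in such an entry can be pulled apart 
mechanically. The following is only a sketch in Python 3: the regular 
expression covers just the fields shown in the excerpt above, and the field 
names (ptr, handle, refs, readers, writers, granted, requested) are my own 
assumed reading of the "lock:", "lrc:" and "mode:" values, not something 
confirmed from the Lustre source. The elided parts of the line ("...") are 
ignored rather than guessed at.

```python
import re

# Matches only the portion of the LDLM debug line quoted above.
# Group names are assumptions about what each value means.
LOCK_RE = re.compile(
    r"ns:\s*(?P<ns>\S+)"
    r"\s+lock:\s*(?P<ptr>[0-9a-fA-F]+)/(?P<handle>0x[0-9a-fA-F]+)"
    r"\s+lrc:\s*(?P<refs>\d+)/(?P<readers>\d+),(?P<writers>\d+)"
    r"\s+mode:\s*(?P<granted>[-A-Za-z]+)/(?P<requested>[-A-Za-z]+)"
)


def parse_lock_line(text):
    """Parse a (possibly wrapped, '>'-quoted) lock debug entry.

    Returns a dict of the matched fields, or None if the text does not
    look like the excerpt above.
    """
    # Strip email-style quote markers and rejoin wrapped lines.
    joined = " ".join(line.lstrip("> ").strip() for line in text.splitlines())
    match = LOCK_RE.search(joined)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "> ... ns: lustre-OST0000-osc-ffff9a0942c79800 lock: \n"
        "> 000000003f3a5950/0xe54ca8d2d7b66d03 lrc: 4/1,0 mode: --/PR  ..."
    )
    print(parse_lock_line(sample))
```

What I do not know is how to take a value such as the handle extracted here 
and reliably find the matching lock, and its holders, on the server side.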

server: 2.15.2
client that application ran on: 2.15.0.4_rc2_cray_172_ge66844d
client that I tested file access from: 2.15.2

Thanks!

- Thomas Bertschinger
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org