On Feb 03, 2009 12:21 -0500, Charles Taylor wrote:
> In our experience, despite what has been said and what we have read,
> if we lose or take down a single OSS, our clients lose access (i/o
> seems blocked) to the file system until that OSS is back up and has
> completed recovery. That's just our experience and it has been very
> consistent. We've never seen otherwise, though we would like to. :)

To be clear: a client process will wait indefinitely until the OST is
back alive, unless either the process is killed (this should be possible
once the Lustre recovery timeout has been exceeded, 100s by default), or
the OST is explicitly marked "inactive" on the clients:

    lctl --device {failed OSC device on client} deactivate

Once the OSC is marked inactive, all IO to that OST should immediately
return with -EIO, and not hang.

If you have experiences other than this it is a bug. If this isn't
explained in the documentation it is a documentation bug.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
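
As a rough illustration of the deactivation step described above, here is a minimal Python sketch that picks the OSC device number for a failed OST out of a client device listing and builds the matching `lctl --device ... deactivate` command. It assumes the listing comes from `lctl dl` on the client and has whitespace-separated columns (index, status, type, name, UUID, refcount); that layout, the `testfs` file system name, the sample listing, and the helper names are all hypothetical and should be checked against a real client. The sketch only builds the command and does not run it.

```python
from typing import List

# Hypothetical device listing, in the column layout this sketch assumes
# (index, status, type, name, uuid, refcount). Not captured from a real client.
SAMPLE_LISTING = """\
  0 UP mgc MGC10.0.0.1@tcp 1a2b3c 5
  1 UP lov testfs-clilov-ffff8800 3c4d5e 4
  2 UP mdc testfs-MDT0000-mdc-ffff8800 3c4d5e 5
  3 UP osc testfs-OST0000-osc-ffff8800 3c4d5e 5
  4 UP osc testfs-OST0001-osc-ffff8800 3c4d5e 5
"""


def find_osc_devices(device_listing: str, ost_name: str) -> List[int]:
    """Return the device numbers of OSC entries that belong to ost_name."""
    devices = []
    for line in device_listing.splitlines():
        fields = line.split()
        # Skip blank or malformed lines that do not start with a device index.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        index, _status, dev_type, name = fields[:4]
        # Assumed naming: the client-side OSC is called "<ost_name>-osc-...".
        if dev_type == "osc" and name.startswith(ost_name + "-osc"):
            devices.append(int(index))
    return devices


def deactivate_commands(device_listing: str, ost_name: str) -> List[List[str]]:
    """Build (but do not run) one deactivate command per matching OSC device."""
    return [
        ["lctl", "--device", str(dev), "deactivate"]
        for dev in find_osc_devices(device_listing, ost_name)
    ]


if __name__ == "__main__":
    for cmd in deactivate_commands(SAMPLE_LISTING, "testfs-OST0001"):
        print(" ".join(cmd))
```

Returning the command as an argument list rather than executing it keeps the sketch side-effect free; an administrator would review the result before running it on each client.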