Hi,

I have a Lustre cluster composed of 1 MDS and 2 OSS servers. The clients are both physical machines (about 25 boxes) and virtual machines (instantiated on an OpenStack cluster). These virtual machines are dynamically created and destroyed as needed (we have machinery that provides this automatic elasticity). They access the Lustre cluster through a NAT.

We start having problems when the number of virtual machines reaches about 130-140. At that point we can no longer mount Lustre on new clients, and access to the Lustre file system becomes very slow.
In the OSS and MDS syslogs I see a lot of errors, such as:

Request sent has timed out for slow reply
bulk GET failed
Request sent has failed due to network error
lock blocking callback time out

I saved a copy of these syslogs (only the Lustre-related entries, and only for a time window when the problem occurred) at:

https://dl.dropboxusercontent.com/u/7639059/LustreLog/lustre-mds.txt
https://dl.dropboxusercontent.com/u/7639059/LustreLog/lustre-oss-01.txt
https://dl.dropboxusercontent.com/u/7639059/LustreLog/lustre-oss-03.txt

In this example, 10.64.22.248 is a new VM that is not able to mount the Lustre filesystem.
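
To get a rough idea of which of the errors quoted above dominates in each saved syslog, something like the following could be used. It is only a sketch: it matches the phrases exactly as I summarized them above (the full log lines may be worded slightly differently), using a plain case-insensitive substring test, and the function name is made up for illustration.

```python
import sys
from collections import Counter

# Error phrases as quoted in this message; the real log lines may differ
# slightly in wording, so adjust these strings if nothing matches.
PATTERNS = [
    "Request sent has timed out for slow reply",
    "bulk GET failed",
    "Request sent has failed due to network error",
    "lock blocking callback time out",
]


def count_errors(lines, patterns=PATTERNS):
    """Return a Counter mapping each phrase to the number of lines containing it."""
    counts = Counter({p: 0 for p in patterns})
    lowered = [(p, p.lower()) for p in patterns]
    for line in lines:
        text = line.lower()
        for original, needle in lowered:
            if needle in text:
                counts[original] += 1
    return counts


if __name__ == "__main__":
    # Each command-line argument is the path of a locally saved syslog copy.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            counts = count_errors(fh)
        print(path)
        for phrase, n in counts.most_common():
            print(f"  {n:6d}  {phrase}")
```

Assuming the script is saved under a hypothetical name such as count_lustre_errors.py, it would be run on local copies of the files as: python count_lustre_errors.py lustre-mds.txt lustre-oss-01.txt lustre-oss-03.txt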
There is no network saturation when the problem happens, and the Lustre servers do not appear heavily loaded.

I would appreciate any hints that could help in troubleshooting this issue.

Thanks,
Massimo
