Hi

I have a Lustre cluster composed of 1 MDS and 2 OSS servers.
Clients are both physical machines (~25 boxes) and virtual machines (instantiated on an OpenStack cluster). These virtual machines are created and destroyed dynamically as needed (we have a mechanism that provides this automatic elasticity). They access the Lustre cluster through a NAT.

Problems start when the number of virtual machines reaches a certain value (about 130 - 140): new clients can no longer mount Lustre, and access to the Lustre file system becomes very slow.


In the OSS and MDS syslogs I see a lot of errors, such as:

Request sent has timed out for slow reply
bulk GET failed
Request sent has failed due to network error
lock blocking callback time out
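
To see which of these messages dominates on each server, the saved syslogs can be tallied with a short script. This is only a minimal sketch: it assumes the four fragments quoted above appear verbatim in the log lines, and the function name `tally_errors` is made up for illustration.

```python
from collections import Counter

# The four message fragments quoted above, matched as plain substrings.
PATTERNS = [
    "Request sent has timed out for slow reply",
    "bulk GET failed",
    "Request sent has failed due to network error",
    "lock blocking callback time out",
]


def tally_errors(lines):
    """Count how many lines contain each known message fragment."""
    counts = Counter()
    for line in lines:
        for pattern in PATTERNS:
            if pattern in line:
                counts[pattern] += 1
    return counts
```

It can be fed an open file directly, for example `tally_errors(open("lustre-mds.txt", errors="replace"))`, and the per-server counts compared.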

I saved a copy of these syslogs (only the Lustre-related entries, and only for a time slot when the problem happened) at:

https://dl.dropboxusercontent.com/u/7639059/LustreLog/lustre-mds.txt
https://dl.dropboxusercontent.com/u/7639059/LustreLog/lustre-oss-01.txt
https://dl.dropboxusercontent.com/u/7639059/LustreLog/lustre-oss-03.txt

In this example 10.64.22.248 is a new VM that is not able to mount the Lustre file system.
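
To follow that one client through the three logs, the lines that mention its address can be pulled out with something like the sketch below. It assumes the address appears in the log lines in dotted form; the helper name `lines_for_client` is hypothetical.

```python
import re


def lines_for_client(lines, address):
    """Yield the lines that mention `address` as a whole IPv4 address."""
    # Reject partial matches, e.g. 10.64.22.24 inside 10.64.22.248.
    pattern = re.compile(r"(?<![\d.])" + re.escape(address) + r"(?!\d)")
    for line in lines:
        if pattern.search(line):
            yield line
```

For example, `list(lines_for_client(open("lustre-oss-01.txt", errors="replace"), "10.64.22.248"))` would collect the matching entries from one of the saved files.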


The network is not saturated when the problem happens, and the Lustre servers don't appear heavily loaded.

I would appreciate any hints that could help in troubleshooting this issue.


Thanks, Massimo


_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
