Hi,
We're setting up fairly large Lustre 2.1.2 filesystems, each with 18 nodes and 159 resources, all in one Corosync/Pacemaker cluster as suggested by our vendor. We're getting mixed messages from our vendor and others about how large a Corosync/Pacemaker cluster can be and still work well.

1. Are there Lustre Corosync/Pacemaker clusters out there of this size or larger?
2. If so, what tuning needed to be done to get them to work well?
3. Should we be looking more seriously into splitting this Corosync/Pacemaker cluster into pairs or sets of 4 nodes?

Right now, our current configuration takes a long time to start/stop all resources (~30-45 mins), and failing back OSTs puts a heavy load on the cib process on every node in the cluster. Under heavy IO load, many of the nodes will show as "unclean/offline" and many OST resources will show as inactive in crm status, despite the fact that every single MDT and OST is still mounted in the appropriate place.

We are running 2 corosync rings, each on a private 1 GbE network. We have a bonded 10 GbE network for the LNET.

Thanks,
Shawn
_______________________________________________ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss