We are dipping our toes into the waters of Lustre HA using Pacemaker. We have 16 7.2 TB OSTs across 4 OSSs (4 OSTs each). The four OSSs are arranged as two dual-active failover pairs running Lustre 1.8.5. Mostly the water is fine, but we've encountered a few surprises.
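For context, our cluster configuration looks roughly like the sketch below (crm shell syntax; the device paths, node names, and timeout values here are illustrative placeholders, not our actual settings):

```
# One Filesystem resource per OST; Lustre mounts can be slow, so the
# start/stop timeouts are set generously (values illustrative).
primitive resOST0000 ocf:heartbeat:Filesystem \
    params device="/dev/mapper/ost0000" directory="/mnt/ost0000" fstype="lustre" \
    op start timeout="300s" \
    op stop timeout="300s" \
    op monitor interval="120s" timeout="60s"

# IPMI-based STONITH for each OSS (hostnames and credentials are placeholders)
primitive stonith-oss1 stonith:external/ipmi \
    params hostname="oss1" ipaddr="10.0.0.1" userid="admin" passwd="xxxx"

# Prefer each OST on its "home" OSS; its partner takes over on failure
location locOST0000 resOST0000 100: oss1
```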
1. An 8-client iozone write test, in which we write 64 files of 1.7 TB each, seems to go well until the end, at which point iozone appears to finish successfully and begins its "cleanup" phase; that is, it starts to remove all 64 large files. At this point the ll_ost threads go bananas, consuming all available CPU cycles on all 8 cores of each server. This appears to block the corosync "totem" exchange long enough to initiate a STONITH request.

2. We have found that re-mounting the OSTs, whether via the HA agent or manually, can often take a *very* long time, on the order of four or five minutes. We have not yet figured out why. An strace of the mount process has not yielded much; the mount seems to just be waiting for something, but we can't tell what.

We are starting to adjust our HA parameters to compensate for these observations, but we hate to do this in a vacuum and wonder whether others have observed these behaviors and what, if anything, was done to compensate or correct for them?

Regards,

Charlie Taylor
UF HPC Center
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
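In case it helps anyone comparing notes, the totem parameters we are experimenting with to ride out the busy periods look roughly like this (corosync.conf fragment; the values are guesses we are still testing, not recommendations):

```
totem {
    version: 2
    # The default token timeout is 1000 ms; with ll_ost threads
    # saturating every core during the large unlinks, we are trying a
    # much larger value so a busy-but-healthy node is not fenced
    # prematurely (values illustrative).
    token: 20000
    token_retransmits_before_loss_const: 10
    # consensus should be larger than token (default is 1.2 * token)
    consensus: 24000
}
```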