On Wed, 11 May 2022 at 04:37, Laura Hild <l...@jlab.org> wrote:

> The non-dummy SRP module is in the kmod-srp package, which isn't included in the Lustre repository...
Thanks Laura,

Yeah, I realised that earlier in the week, and have rebuilt the srp module from source via mlnxofedinstall (sketch in the P.S.). Sure enough, installing srp-4.9-OFED.4.9.4.1.6.1.kver.3.10.0_1160.49.1.el7_lustre.x86_64.x86_64.rpm (gotta love those short names) gives me working srp again.

Hat tip to a DDN contact here (we owe him even more beers now) for some extra tuning parameters:

    options ib_srp cmd_sg_entries=255 indirect_sg_entries=2048 allow_ext_sg=1 ch_count=1 use_imm_data=0

(how we persisted and applied those is also in the P.S.) and I'm pleased to say that it _seems_ to be working much better. I'd done one half of the HA pairs earlier in the week - lfsck completed, full Robinhood scan done (dropped the DB and rescanned from fresh) - and I'm just bringing the other half of the pairs up to the same software stack now.

A couple of pointers for anyone caught in the same boat - things we apparently did correctly:

* upgrade your e2fsprogs to the latest - if you're fsck'ing disks, make sure you're not introducing more problems with a buggy old e2fsck (sketch in the P.S.)
* tunefs.lustre --writeconf isn't too destructive (read the warnings: you'll lose pool info, but in our case that wasn't critical) (sketch in the P.S.)
* monitoring is good, but honestly, given the rate of change and that it happened out of hours, we likely couldn't have intervened anyway
* so quotas are better (sketch in the P.S.)

Thanks to those who replied on and off-list - I'm just grateful we only had the pair of MDTs, not the 40 (!!!) that Orion's getting (yeah, I was watching the LUG talk last night). Service isn't quite back to users, but we're getting there!

Andrew
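P.S. For the archive, rough sketches of the commands behind the points above. File names, device names and limits below are illustrative rather than verbatim, so adjust for your own setup. First, the SRP rebuild against the Lustre-patched kernel went along these lines (the exact mlnxofedinstall flags depend on your MLNX_OFED version; --without-fw-update is just a habit, not required):

    # from the unpacked MLNX_OFED bundle: rebuild the packages against the
    # running Lustre-patched kernel, then install the resulting srp rpm
    ./mlnxofedinstall --add-kernel-support \
        --kernel 3.10.0-1160.49.1.el7_lustre.x86_64 \
        --without-fw-update
    rpm -ivh srp-4.9-OFED.4.9.4.1.6.1.kver.3.10.0_1160.49.1.el7_lustre.x86_64.x86_64.rpm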
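Persisting the ib_srp tuning so it survives a reboot, then reloading the module to pick it up - don't do the reload with SRP-backed targets still mounted, and the conf file name is just our own choice:

    # drop the options into modprobe.d
    cat > /etc/modprobe.d/ib_srp.conf <<'EOF'
    options ib_srp cmd_sg_entries=255 indirect_sg_entries=2048 allow_ext_sg=1 ch_count=1 use_imm_data=0
    EOF

    # reload so the new values take effect, then sanity-check them
    modprobe -r ib_srp
    modprobe ib_srp
    grep . /sys/module/ib_srp/parameters/*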
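On the e2fsprogs point: the Lustre-patched e2fsprogs from Whamcloud is the usual source. Check what's installed, then do a read-only pass on the unmounted target before letting e2fsck write anything (device name is made up):

    # check what you have installed
    rpm -q e2fsprogs

    # read-only check first: -f force a check, -n answer "no" to all fixes
    e2fsck -fn /dev/mapper/mdt0

    # then the actual repair pass (-p fixes safe problems automatically)
    e2fsck -fp /dev/mapper/mdt0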
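The writeconf dance, roughly as the manual has it: stop the whole filesystem, regenerate the config logs on every target, then bring things back MGS/MDT first. This is the step that erases pool definitions, hence the caveat above - note them down beforehand:

    # with all clients unmounted and all targets stopped, on each server:
    tunefs.lustre --writeconf /dev/mapper/mdt0
    tunefs.lustre --writeconf /dev/mapper/ost0

    # remount MGS/MDT(s) first, then the OSTs, then re-create any pools
    mount -t lustre /dev/mapper/mdt0 /mnt/lustre-mdt0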
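And the quota point: enforcement has to be switched on before limits do anything, and an inode cap is presumably what would have helped in a case like ours, since a runaway job then stalls before the MDT fills. Filesystem name, user and limits here are invented:

    # on the MGS: enable user quota enforcement on the MDTs
    lctl conf_param fsname.quota.mdt=u

    # give a user soft/hard inode limits (-i soft, -I hard)
    lfs setquota -u someuser -i 1000000 -I 1100000 /mnt/lustre

    # and keep an eye on usage
    lfs quota -u someuser /mnt/lustre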