[lustre-discuss] BCP for High Availability?
Hi Folks,

I'm just rebuilding my testbed and have got to the "sort out all the pacemaker stuff" part. What's the best current practice for the current LTS (2.15.x) release tree?

I've always done this as multiple individual HA clusters covering each pair of servers with common dual-connected drive array(s), but I remember seeing a talk some years ago where one of the US labs was using 'pacemaker-remote' and bringing them all up from a central node.

I note there are a few (old) crib notes on the wiki - referenced from the Lustre manual - but nothing updated in the last couple of years. What are people out there doing?

Many thanks

Andrew
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
[lustre-discuss] 2.15.x with ConnectX-3 cards
Hi Gang,

I've just gone and reimaged a test system in prep for doing an upgrade to Rocky 8 + 2.15.1 (what's the bet 2.15.2 comes out the night I push to prod?).

However, the 2.15.1-ib release uses MOFED 5.6 ... which no longer supports CX-3 cards (yeah, it's olde hardware...). Having been badly bitten (see posts passim) with using non-OFED 2.10/2.12 on this hardware (it needs to talk ib_srp to an SFA7700X), what's the advice for going to a 2.15 server? Non-OFED, or some rebuild based on 4.9?

Many thanks

Andrew
[lustre-discuss] Version interoperability
Hi folks,

We're faced with a (short term - measured in months, not years, thankfully) seriously large gap in versions between our existing clients (2.7.5) and new hardware clients (2.15.0) that will be mounting the same file system. It's currently on 2.10.8-ib (ldiskfs) with ConnectX-5 cards, and I have a maintenance window coming up where I have the opportunity to upgrade it.

Which is likely to cause less breakage:

* stick with 2.10.8 on the server and live with the annoyances with multi-rail / discovery when talking to our new system
* upgrade to 2.12.9, sticking with the same OS major version
* upgrade to 2.15.1, including a jump to RHEL 8 / Rocky 8 (depending on licensing, as we seem to have lost our HA add-on)

I can't upgrade the old 2.7.5 clients, as this system is already on the decommissioning roadmap for next year.

Many thanks

Andrew
[lustre-discuss] 2.12.9-ib release?
Hi folks,

I see the 2.12.9/ release tree on https://downloads.whamcloud.com/public/lustre/, but I don't see the accompanying 2.12.9-ib/ one. ISTR someone needed to poke a build process last time to get this public - can they do the same this time please?

Many thanks

Andrew
[lustre-discuss] unclear language in Operations manual
Hi folks,

I've recently come across this snippet in the ops manual (section 13.8, Running Multiple Lustre File Systems, page 111 in the current pdf):

> Note
> If a client(s) will be mounted on several file systems, add the following
> line to /etc/xattr.conf file to avoid problems when files are moved between
> the file systems: lustre.* skip

Is this describing the case where a single client mounts more than one Lustre filesystem simultaneously? ie

mount -t lustre mgsnode:/foo /mnt/foo
AND
mount -t lustre mgsnode:/bar /mnt/bar

I suspect I should file a LUDOC if so, as the language doesn't flow. As (ahem) we've never done this, what's it likely to screw up when a user's copying from /mnt/foo to /mnt/bar?

Many thanks

Andrew
[lustre-discuss] jobstats
Hi folks,

I've finally started to re-investigate pushing jobstats to our central dashboards and realised there's a dearth of scripts / tooling to actually gather the job_stats files and push them to $whatever. I have seen the telegraf one, and the DDN fork of collectd seems somewhat abandonware. Hence at this stage I'm back to rolling another Python script to feed InfluxDB. Yes, I know all the cool kids are using Prometheus, but I'm not one of them.

However, while rummaging I came across LU-11407 (Improve stats data) - Andreas commented[1] that he was hoping to add start_time and elapsed_time fields, but are these targeted at an upcoming release (it still shows 'open')? It's also referred to in LU-15826 - is that likely to make a point release of 2.15, or will it be targeted at the next major release? It would be handy to save me correlating with Slurm job start times, especially if the user job does $other_stuff before actually hitting the disks.

Many thanks

Andrew

[1] https://jira.whamcloud.com/browse/LU-11407?focusedCommentId=234830=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-234830
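In case it helps anyone rolling the same thing, here's a rough sketch of the parsing half only: it turns a job_stats dump (as read via lctl get_param) into per-job dicts you could then push to InfluxDB or anywhere else. The field layout is the usual YAML-ish dump; exact stat names vary by release, and the TSDB push is deliberately left out.

```python
import re

# Hedged sketch: parse a Lustre job_stats dump into (job_id, stats) pairs.
# Only "samples" and "sum" are extracted per counter line; extend as needed.
STAT_RE = re.compile(r'^\s+(\w+):\s+\{ samples:\s*(\d+),.*?sum:\s*(\d+)', re.M)

def parse_job_stats(text):
    """Return a list of (job_id, {stat: {"samples": n, "sum": n}})."""
    jobs = []
    # each job block in the dump starts with "- job_id: ..."
    for block in re.split(r'^- ', text, flags=re.M)[1:]:
        m = re.search(r'job_id:\s*(\S+)', block)
        if not m:
            continue
        stats = {name: {"samples": int(samples), "sum": int(total)}
                 for name, samples, total in STAT_RE.findall(block)}
        jobs.append((m.group(1), stats))
    return jobs

# illustrative dump, trimmed to the fields the regex uses
sample = """job_stats:
- job_id:          1234
  snapshot_time:   1620000000
  read_bytes:      { samples: 10, unit: bytes, min: 4096, max: 1048576, sum: 10485760 }
  write_bytes:     { samples: 5, unit: bytes, min: 4096, max: 4096, sum: 20480 }
"""

if __name__ == "__main__":
    print(parse_job_stats(sample))
```

From there it's one line-protocol write per (job, stat) pair into whatever measurement scheme suits your dashboards.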
Re: [lustre-discuss] Corrupted? MDT not mounting
On Wed, 11 May 2022 at 04:37, Laura Hild wrote:
> The non-dummy SRP module is in the kmod-srp package, which isn't included in
> the Lustre repository...

Thanks Laura,

Yeah, I realised that earlier in the week, and have rebuilt the srp module from source via mlnxofedinstall, and sure enough installing srp-4.9-OFED.4.9.4.1.6.1.kver.3.10.0_1160.49.1.el7_lustre.x86_64.x86_64.rpm (gotta love those short names) gives me working srp again.

Hat tip to a DDN contact here (we owe him even more beers now) for some extra tuning parameters:

options ib_srp cmd_sg_entries=255 indirect_sg_entries=2048 allow_ext_sg=1 ch_count=1 use_imm_data=0

and I'm pleased to say that it _seems_ to be working much better. I'd done one half of the HA pairs earlier in the week, lfsck completed, full robinhood scan done (dropped the DB and rescanned from fresh), and I'm just bringing the other half of the pairs up to the same software stack now.

Couple of pointers for anyone caught in the same boat, which apparently we did correctly:

* upgrade your e2fsprogs to the latest - if you're fsck'ing disks, make sure you're not introducing more problems with a buggy old e2fsck
* tunefs.lustre --writeconf isn't too destructive (see the warnings; you'll lose pool info, but in our case that wasn't critical)
* monitoring is good, but tbh the rate of change and the fact it happened out of hours means we likely couldn't have intervened
* so quotas are better.

Thanks to those who replied on and off-list - I'm just grateful we only had the pair of MDTs, not the 40 (!!!) that Origin's getting (yeah, I was watching the LUG talk last night) - service isn't quite back to users but we're getting there!

Andrew
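For the archive, the writeconf dance mentioned above amounts to roughly the following. This is a sketch of the standard regeneration sequence from the operations manual, not a transcript of what was run here; device names are illustrative, and note the warning above about losing pool info.

```shell
# Hedged sketch of a config-log regeneration via tunefs.lustre --writeconf.
# Do this with the filesystem fully stopped on all servers and clients.

# 1. unmount every target everywhere (clients first, then OSTs, MDTs, MGS)
umount /lustre/astrofs-MDT0001        # repeat for every mounted target

# 2. run --writeconf on each target device (illustrative device names)
tunefs.lustre --writeconf /dev/mapper/MDT0001
tunefs.lustre --writeconf /dev/mapper/OST0000   # ...and the rest

# 3. remount in order: MGS first, then MDT(s), then OSTs, then clients
mount -t lustre /dev/mapper/MGS /lustre/MGS
mount -t lustre /dev/mapper/MDT0001 /lustre/astrofs-MDT0001
```

The config logs are rewritten as each target registers with the MGS on its first mount, which is why the mount order matters.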
Re: [lustre-discuss] Corrupted? MDT not mounting
On Fri, 6 May 2022 at 20:04, Andreas Dilger wrote:
> MOFED is usually preferred over in-kernel OFED, it is just tested and fixed a
> lot more.

Fair enough. However, is the 2.12.8-ib tree built with all the features? Specifically
https://downloads.whamcloud.com/public/lustre/lustre-2.12.8-ib/MOFED-4.9-4.1.7.0/el7/server/

If I compare the ib_srp module from 2.12 in-kernel:

[root@astrofs-oss3 ~]# find /lib/modules/`uname -r` -name ib_srp.ko.xz
/lib/modules/3.10.0-1160.49.1.el7_lustre.x86_64/kernel/drivers/infiniband/ulp/srp/ib_srp.ko.xz
[root@astrofs-oss3 ~]# rpm -qf /lib/modules/3.10.0-1160.49.1.el7_lustre.x86_64/kernel/drivers/infiniband/ulp/srp/ib_srp.ko.xz
kernel-3.10.0-1160.49.1.el7_lustre.x86_64
[root@astrofs-oss3 ~]# modinfo ib_srp
filename:       /lib/modules/3.10.0-1160.49.1.el7_lustre.x86_64/kernel/drivers/infiniband/ulp/srp/ib_srp.ko.xz
license:        Dual BSD/GPL
description:    InfiniBand SCSI RDMA Protocol initiator
author:         Roland Dreier
retpoline:      Y
rhelversion:    7.9
srcversion:     1FB80E3A962EE7F39AD3959
depends:        ib_core,scsi_transport_srp,ib_cm,rdma_cm
intree:         Y
vermagic:       3.10.0-1160.49.1.el7_lustre.x86_64 SMP mod_unload modversions
signer:         CentOS Linux kernel signing key
sig_key:        FA:A3:27:4B:D9:17:36:F0:FD:43:6A:42:1B:6A:A4:FA:FE:D0:AC:FA
sig_hashalgo:   sha256
parm:           srp_sg_tablesize:Deprecated name for cmd_sg_entries (uint)
parm:           cmd_sg_entries:Default number of gather/scatter entries in the SRP command (default is 12, max 255) (uint)
parm:           indirect_sg_entries:Default max number of gather/scatter entries (default is 12, max is 2048) (uint)
parm:           allow_ext_sg:Default behavior when there are more than cmd_sg_entries S/G entries after mapping; fails the request when false (default false) (bool)
parm:           topspin_workarounds:Enable workarounds for Topspin/Cisco SRP target bugs if != 0 (int)
parm:           prefer_fr:Whether to use fast registration if both FMR and fast registration are supported (bool)
parm:           register_always:Use memory registration even for contiguous memory regions (bool)
parm:           never_register:Never register memory (bool)
parm:           reconnect_delay:Time between successive reconnect attempts
parm:           fast_io_fail_tmo:Number of seconds between the observation of a transport layer error and failing all I/O. "off" means that this functionality is disabled.
parm:           dev_loss_tmo:Maximum number of seconds that the SRP transport should insulate transport layer errors. After this time has been exceeded the SCSI host is removed. Should be between 1 and SCSI_DEVICE_BLOCK_MAX_TIMEOUT if fast_io_fail_tmo has not been set. "off" means that this functionality is disabled.
parm:           ch_count:Number of RDMA channels to use for communication with an SRP target. Using more than one channel improves performance if the HCA supports multiple completion vectors. The default value is the minimum of four times the number of online CPU sockets and the number of completion vectors supported by the HCA. (uint)
parm:           use_blk_mq:Use blk-mq for SRP (bool)
[root@astrofs-oss3 ~]#

... it all looks normal and capable of mounting our Exascaler LUNs, cf the one from 2.12.8-ib:

==========================================================================================
 Package                   Arch    Version                      Repository          Size
==========================================================================================
Installing:
 kernel                    x86_64  3.10.0-1160.49.1.el7_lustre  lustre-2.12-mofed   50 M
 kmod-lustre-osd-ldiskfs   x86_64  2.12.8_6_g5457c37-1.el7      lustre-2.12-mofed  469 k
 lustre                    x86_64  2.12.8_6_g5457c37-1.el7      lustre-2.12-mofed  805 k
Installing for dependencies:
 kmod-lustre               x86_64  2.12.8_6_g5457c37-1.el7      lustre-2.12-mofed  3.9 M
 kmod-mlnx-ofa_kernel      x86_64  4.9-OFED.4.9.4.1.7.1         lustre-2.12-mofed  1.3 M
 lustre-osd-ldiskfs-mount  x86_64  2.12.8_6_g5457c37-1.el7      lustre-2.12-mofed   15 k
 mlnx-ofa_kernel           x86_64  4.9-OFED.4.9.4.1.7.1         lustre-2.12-mofed  108 k

[root@astrofs-oss1 ~]# find /lib/modules/`uname -r` -name ib_srp.ko.xz
Re: [lustre-discuss] Corrupted? MDT not mounting
> It's looking more like something filled up our space - I'm just
> copying the files out as a backup (mounted as ldiskfs just now) -

Ahem. Inode quotas are a good idea. Turns out that a user creating about 130 million directories rapidly is more than a small MDT volume can take.

An update on recovery progress - upgrading the MDS to 2.12 got us over the issue in LU-12674 enough to recover, and I've migrated half (one of the HA pairs) of the OSSs to RHEL 7.9 / Lustre 2.12.8 too. It needed a set of writeconf's doing before they'd mount, and e2fsck has run over any suspect LUNs.

The filesystem "works" in that under light testing I can read/write OK, but as soon as it gets stressed, OSSs are falling over:

[ 1226.864430] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 1226.872281] IP: [] __list_add+0x1b/0xc0
[ 1226.877699] PGD 1ffba0d067 PUD 1ffa48e067 PMD 0
[ 1226.882360] Oops: [#1] SMP
[ 1226.885619] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) dm_round_robin ib_srp scsi_transport_srp scsi_tgt tcp_diag inet_diag ib_isert iscsi_target_mod target_core_mod rpcrdma rdma_ucm ib_iser ib_umad bonding rdma_cm ib_ipoib iw_cm libiscsi scsi_transport_iscsi ib_cm mlx4_ib ib_uverbs ib_core sunrpc ext4 mbcache jbd2 sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel iTCO_wdt kvm iTCO_vendor_support irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr i2c_i801 lpc_ich mei_me joydev mei sg ioatdma wmi ipmi_si ipmi_devintf ipmi_msghandler dm_multipath acpi_pad acpi_power_meter dm_mod ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mlx4_en ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm igb ahci libahci mpt2sas mlx4_core ptp crct10dif_pclmul crct10dif_common libata crc32c_intel pps_core dca raid_class devlink i2c_algo_bit drm_panel_orientation_quirks scsi_transport_sas nfit libnvdimm [last unloaded: scsi_tgt]
[ 1226.987670] CPU: 6 PID: 366 Comm: kworker/u24:6 Kdump: loaded Tainted: G OE 3.10.0-1160.49.1.el7_lustre.x86_64 #1
[ 1227.000168] Hardware name: SGI.COM CH-C1104-GP6/X10SRW-F, BIOS 3.1 06/06/2018
[ 1227.007310] Workqueue: rdma_cm cma_work_handler [rdma_cm]
[ 1227.012725] task: 934839f0b180 ti: 934836c2 task.ti: 934836c2
[ 1227.020195] RIP: 0010:[] [] __list_add+0x1b/0xc0
[ 1227.028036] RSP: 0018:934836c23d68 EFLAGS: 00010246
[ 1227.09] RAX: RBX: 934836c23d90 RCX:
[ 1227.040463] RDX: 932fa518e680 RSI: RDI: 934836c23d90
[ 1227.047587] RBP: 934836c23d80 R08: R09: b2df8c1b3dcb3100
[ 1227.054712] R10: b2df8c1b3dcb3100 R11: 00ff R12: 932fa518e680
[ 1227.061835] R13: R14: R15: 932fa518e680
[ 1227.068958] FS: () GS:93483f38() knlGS:
[ 1227.077034] CS: 0010 DS: ES: CR0: 80050033
[ 1227.082772] CR2: CR3: 001fe47a8000 CR4: 003607e0
[ 1227.089895] DR0: DR1: DR2:
[ 1227.097020] DR3: DR6: fffe0ff0 DR7: 0400
[ 1227.104142] Call Trace:
[ 1227.106593] [] __mutex_lock_slowpath+0xa6/0x1d0
[ 1227.112770] [] ? __switch_to+0xce/0x580
[ 1227.118255] [] mutex_lock+0x1f/0x2f
[ 1227.123399] [] cma_work_handler+0x25/0xa0 [rdma_cm]
[ 1227.129922] [] process_one_work+0x17f/0x440
[ 1227.135752] [] worker_thread+0x126/0x3c0
[ 1227.141324] [] ? manage_workers.isra.26+0x2a0/0x2a0
[ 1227.147849] [] kthread+0xd1/0xe0
[ 1227.152729] [] ? insert_kthread_work+0x40/0x40
[ 1227.158822] [] ret_from_fork_nospec_begin+0x7/0x21
[ 1227.165260] [] ? insert_kthread_work+0x40/0x40
[ 1227.171348] Code: ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 55 48 89 e5 41 55 49 89 f5 41 54 49 89 d4 53 4c 8b 42 08 48 89 fb 49 39 f0 75 2a <4d> 8b 45 00 4d 39 c4 75 68 4c 39 e3 74 3e 4c 39 eb 74 39 49 89
[ 1227.191295] RIP [] __list_add+0x1b/0xc0
[ 1227.196798] RSP
[ 1227.200284] CR2:

and I'm able to reproduce this on multiple servers :-/

I can see a few mentions (https://access.redhat.com/solutions/4969471 for example) that seem to hint it's low-memory triggered, but they also say it's fixed in the Red Hat 7.9 kernel (and we're running the 2.12.8 stock 3.10.0-1160.49.1.el7_lustre.x86_64). I've got a case open with the vendor to see if there are any firmware updates - but I'm not hopeful.

These are 6-core single-socket Broadwells with 128G of RAM; storage disks are mounted over SRP from a DDN appliance. Would jumping to MOFED make a difference? Otherwise I'm open to suggestions, as it's getting very tiring wrangling servers back to life.

[root@astrofs-oss1 ~]# ls -l /var/crash/ | grep 2022
drwxr-xr-x 2 root root 44
Re: [lustre-discuss] Corrupted? MDT not mounting
Thanks Stéphane,

It's looking more like something filled up our space - I'm just copying the files out as a backup (mounted as ldiskfs just now). We're running DNE (MDT0000 and this one, MDT0001), but I don't understand why so much space is being taken up in REMOTE_PARENT_DIR - we seem to have actual user data stashed in there:

[root@astrofs-mds2 SSINS_uvfits]# pwd
/mnt/REMOTE_PARENT_DIR/0xa40002340:0x1:0x0/MWA/data/1061313128/SSINS_uvfits
[root@astrofs-mds2 SSINS_uvfits]# ls -l
total 0
-rw-rw-r--+ 1 redacted redacted 67153694400 Oct  9  2018 1061313128_noavg_noflag_00.uvfits
-rw-rw-r--+ 1 redacted redacted           0 Oct  9  2018 1061313128_noavg_noflag_01.uvfits
[root@astrofs-mds2 SSINS_uvfits]#

and although this one was noticeably large, it's not the only non-zero sized file under REMOTE_PARENT_DIR:

[root@astrofs-mds2 1061314832]# ls -l | head
total 116
-rw-rw-r--+ 1 redacted redacted 7338240 Nov 14  2017 1061314832_01.mwaf
-rw-rw-r--+ 1 redacted redacted 7338240 Nov 14  2017 1061314832_02.mwaf
-rw-rw-r--+ 1 redacted redacted 7404480 Nov 14  2017 1061314832_03.mwaf
-rw-rw-r--+ 1 redacted redacted 7404480 Nov 14  2017 1061314832_04.mwaf
-rw-rw-r--+ 1 redacted redacted 7338240 Nov 14  2017 1061314832_05.mwaf
-rw-rw-r--+ 1 redacted redacted 7338240 Nov 14  2017 1061314832_06.mwaf
-rw-rw-r--+ 1 redacted redacted 7404480 Nov 14  2017 1061314832_07.mwaf
-rw-rw-r--+ 1 redacted redacted 7404480 Nov 14  2017 1061314832_08.mwaf
-rw-rw-r--+ 1 redacted redacted 7404480 Nov 14  2017 1061314832_09.mwaf
[root@astrofs-mds2 1061314832]# pwd
/mnt/REMOTE_PARENT_DIR/0xa40002340:0x1:0x0/MWA/data/1061314832

Suggestions for how to clean up and recover, anyone?

Andrew
[lustre-discuss] Corrupted? MDT not mounting
Hi Folks,

One of our filesystems seemed to fail over the holiday weekend - we're running DNE and MDT0001 won't mount. At first it looked like we'd run out of space (rc = -28), but then we were seeing this:

mount.lustre: mount /dev/mapper/MDT0001 at /lustre/astrofs-MDT0001 failed: File exists retries left: 0
mount.lustre: mount /dev/mapper/MDT0001 at /lustre/astrofs-MDT0001 failed: File exists

possibly

kernel: LustreError: 13921:0:(genops.c:478:class_register_device()) astrofs-OST0000-osc-MDT0001: already exists, won't add

lustre_rmmod wouldn't remove everything cleanly (osc in use), and so after a reboot everything *seemed* to start OK:

[root@astrofs-mds1 ~]# mount -t lustre
/dev/mapper/MGS on /lustre/MGS type lustre (ro)
/dev/mapper/MDT0000 on /lustre/astrofs-MDT0000 type lustre (ro)
/dev/mapper/MDT0001 on /lustre/astrofs-MDT0001 type lustre (ro)

... but not for long:

kernel: LustreError: 12355:0:(osp_sync.c:343:osp_sync_declare_add()) ASSERTION( ctxt ) failed:
kernel: LustreError: 12355:0:(osp_sync.c:343:osp_sync_declare_add()) LBUG

possibly a corrupt llog? I see LU-12674, which looks like our problem, but only backported to the 2.12 branch (these servers are still 2.10.8).

Piecing together what *might* have happened: a user possibly ran out of inodes and then did a rm -r before the system stopped responding.

Mounting just now I'm getting:

[ 1985.078422] LustreError: 10953:0:(llog.c:654:llog_process_thread()) astrofs-OST0001-osc-MDT0001: Local llog found corrupted #0x7ede0:1:0 plain index 35518 count 2
[ 1985.095129] LustreError: 10959:0:(llog_osd.c:961:llog_osd_next_block()) astrofs-MDT0001-osd: invalid llog tail at log id [0x7ef40:0x1:0x0]:0 offset 577536 bytes 4096
[ 1985.109892] LustreError: 10959:0:(osp_sync.c:1242:osp_sync_thread()) astrofs-OST0004-osc-MDT0001: llog process with osp_sync_process_queues failed: -22
[ 1985.126797] LustreError: 10973:0:(llog_cat.c:269:llog_cat_id2handle()) astrofs-OST000b-osc-MDT0001: error opening log id [0x7ef76:0x1:0x0]:0: rc = -2
[ 1985.140169] LustreError: 10973:0:(llog_cat.c:823:llog_cat_process_cb()) astrofs-OST000b-osc-MDT0001: cannot find handle for llog [0x7ef76:0x1:0x0]: rc = -2
[ 1985.155321] Lustre: astrofs-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[ 1985.169404] Lustre: astrofs-MDT0001: in recovery but waiting for the first client to connect
[ 1985.177869] Lustre: astrofs-MDT0001: Will be in recovery for at least 2:30, or until 1508 clients reconnect
[ 1985.187612] Lustre: astrofs-MDT0001: Connection restored to a5e41149-73fc-b60a-30b1-da096a5c2527 (at 1170@gni1)
[ 2017.251374] Lustre: astrofs-MDT0001: Connection restored to 7a388f58-bc16-6bd7-e0c8-4ffa7c0dd305 (at 400@gni1)
[ 2017.261374] Lustre: Skipped 1275 previous similar messages
[ 2081.458117] Lustre: astrofs-MDT0001: Connection restored to 10.10.36.143@o2ib4 (at 10.10.36.143@o2ib4)
[ 2081.467419] Lustre: Skipped 277 previous similar messages
[ 2082.324547] Lustre: astrofs-MDT0001: Recovery over after 1:37, of 1508 clients 1508 recovered and 0 were evicted.

Message from syslogd@astrofs-mds2 at Apr 19 17:32:49 ...
kernel: LustreError: 11082:0:(osp_sync.c:343:osp_sync_declare_add()) ASSERTION( ctxt ) failed:
Message from syslogd@astrofs-mds2 at Apr 19 17:32:49 ...
kernel: LustreError: 11082:0:(osp_sync.c:343:osp_sync_declare_add()) LBUG

[ 2082.392381] LustreError: 11082:0:(osp_sync.c:343:osp_sync_declare_add()) ASSERTION( ctxt ) failed:
[ 2082.401422] LustreError: 11082:0:(osp_sync.c:343:osp_sync_declare_add()) LBUG
[ 2082.408558] Pid: 11082, comm: orph_cleanup_as 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Mon May 27 03:45:37 UTC 2019
[ 2082.418891] Call Trace:
[ 2082.421340] [] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 2082.427890] [] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 2082.434077] [] osp_sync_declare_add+0x3a9/0x3e0 [osp]
[ 2082.440797] [] osp_declare_destroy+0xc9/0x1c0 [osp]
[ 2082.447338] [] lod_sub_declare_destroy+0xce/0x2d0 [lod]
[ 2082.454237] [] lod_obj_stripe_destroy_cb+0x85/0x90 [lod]
[ 2082.461213] [] lod_obj_for_each_stripe+0xb6/0x230 [lod]
[ 2082.468104] [] lod_declare_destroy+0x43b/0x5c0 [lod]
[ 2082.474736] [] orph_key_test_and_del+0x5f6/0xd30 [mdd]
[ 2082.481538] [] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
[ 2082.488250] [] kthread+0xd1/0xe0
[ 2082.493147] [] ret_from_fork_nospec_begin+0x7/0x21
[ 2082.499601] [] 0x
[ 2082.504585] Kernel panic - not syncing: LBUG

e2fsck when mounted as ldiskfs seems to be clean, but is there a way I can get it mounted enough to run lfsck?

Alternatively, can I upgrade the MDSs to 2.12.x while having the OSSs still on 2.10? Yes, I know this isn't ideal, but I wasn't planning a large upgrade at zero notice to our users (also, we still have a legacy system accessing it with a 2.7 client - its replacement arrived last Sept, but still hasn't been handed over to us yet, so I really don't want to get too out of step).

Many thanks

Andrew
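(For anyone searching the archive later: assuming the MDT can be made to mount at all, lfsck is driven from the MDS via lctl. A sketch per the LFSCK chapter of the operations manual, using this thread's target name for illustration:)

```shell
# Hedged sketch: run a full lfsck against the suspect MDT once mounted.
# Target name is this thread's; check lctl lfsck_start options for your release.
lctl lfsck_start -M astrofs-MDT0001 -t all

# watch progress of the namespace phase
lctl get_param -n mdd.astrofs-MDT0001.lfsck_namespace

# stop early if it makes things worse
lctl lfsck_stop -M astrofs-MDT0001
```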
[lustre-discuss] Hardware advice for homelab
Hi folks,

Given my homelab testing for Lustre tends to be contained within VirtualBox on a laptop ($work has a physical hardware test bed once mucking around gets serious), I'm considering expanding to some real hardware at home for testing. My MythTV days are over, but I'd ideally like an aarch64 client that can run on a Raspberry Pi, in case I ever poke at Kodi.

What server hardware would people advise that fulfils:

* low running cost (it's my electricity bill!)
* fairly cheap to buy (own budget)
* if I'm buying a cased 'nuc' type thing, it must be able to fit a 3.5" SATA drive (as I have some old ones that fell off the back of a rack)
* not full of screaming fans

Given it's not planned for production use 24/7, I don't care about HA with multi-tailed drives, but would quite like the ability to add more OSSs as required. Cable sprawl / mounting isn't that much of an issue, providing it can live in the shed.

Any suggestions?

Andrew
[lustre-discuss] Determining server version from client
Hi All,

Is there a trivial command to determine the server-side version of Lustre (in my case, trying to confirm which types of quotas are allowed: project - 2.10+, default - 2.12+)?

I was hoping there'd be something in lfs, such as lfs getname --version, which would ideally spit out something like

$ lfs getname --version
fs1-9920dde7d000 /fs1 2.10.4
testfs-992073597800 /testfs 2.12.5

but that's wishful thinking :-) as lfs --version merely gives me the client version, as expected.

Is this fairly trivial to implement? If so, I'll open a jira ticket for the request - I know the comparison is done at mount time, as the kernel can log

kernel: Lustre: Server MGS version (2.5.1.0) is much older than client. Consider upgrading server (2.12.5)

Many thanks

Andrew
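One crude stopgap, since the version comparison does get logged at mount time as quoted above: scrape it back out of the kernel log. A sketch (GNU grep assumed for the -P lookbehind; the sample line is the one from the mail, and in practice you'd pipe dmesg or journalctl -k instead):

```shell
# Hedged sketch: recover the server version from the mount-time kernel
# message. Only works if such a message was actually logged.
line='kernel: Lustre: Server MGS version (2.5.1.0) is much older than client. Consider upgrading server (2.12.5)'
echo "$line" | grep -oP 'Server MGS version \(\K[0-9.]+'
# prints: 2.5.1.0
```

Obviously this only fires when the versions differ enough to warrant the warning, which is part of why a first-class lfs option would be nicer.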
Re: [lustre-discuss] quotas not being enforced
On Thu, 14 Jan 2021 at 17:12, Andrew Elwell wrote:
> I'm struggling to debug quota enforcement (or more worryingly, lack
> of) in recentish LTS releases.
>
> [root@pgfs-mds3 ~]# lctl conf_param testfs.quota.ost=g
> ... time passes
> [root@pgfs-mds4 ~]# lctl get_param osd-*.*.quota_slave.info | egrep '(name|enabled)'
> target name:    testfs-OST0000
> quota enabled:  none
> target name:    testfs-OST0001
> quota enabled:  none
>
> doesn't seem to be rippling out.

Not sure I've done the right thing, but poking via lctl set_param osd-ldiskfs..quota_slave.enabled= *seems* to have done the trick!

[root@pgfs-mds3 ~]# lctl get_param osd-*.*.quota_slave.info
osd-ldiskfs.testfs-MDT0000.quota_slave.info=
target name:    testfs-MDT0000
pool ID:        0
type:           md
quota enabled:  ug
conn to master: setup
space acct:     ug
user uptodate:  glb[1],slv[1],reint[0]
group uptodate: glb[1],slv[1],reint[0]
project uptodate: glb[0],slv[0],reint[0]
[root@pgfs-mds3 ~]# lctl get_param osd-ldiskfs.*.quota_slave.enabled
osd-ldiskfs.testfs-MDT0000.quota_slave.enabled=ug
[root@pgfs-mds3 ~]# lctl set_param osd-ldiskfs.testfs-MDT0000.quota_slave.enabled=g
osd-ldiskfs.testfs-MDT0000.quota_slave.enabled=g
[root@pgfs-mds3 ~]# lctl get_param osd-*.*.quota_slave.info
osd-ldiskfs.testfs-MDT0000.quota_slave.info=
target name:    testfs-MDT0000
pool ID:        0
type:           md
quota enabled:  g
conn to master: setup
space acct:     ug
user uptodate:  glb[1],slv[1],reint[0]
group uptodate: glb[1],slv[1],reint[0]
project uptodate: glb[0],slv[0],reint[0]

Another Q - if I use the 2.12+ default quota feature (ie sudo lfs setquota -G -B 10m -I 1025 /testfs) and then read/write with an old client, will it enforce the quotas OK (even if I can't manipulate / display them clearly), or is Something Bad (tm) going to happen behind the scenes? It _seemed_ to be behaving as expected from the client side (2.7.mumble).

Many thanks

Andrew
[lustre-discuss] quotas not being enforced
Hi folks,

I'm struggling to debug quota enforcement (or more worryingly, lack of) in recentish LTS releases.

Our test system (2 servers, shared SAS disks between them) is running

lustre-2.12.6-1.el7.x86_64
e2fsprogs-1.45.6.wc3-0.el7.x86_64
kernel-3.10.0-1160.2.1.el7_lustre.x86_64

but the storage LUNs have been upgraded from 2.7 onwards (maybe a reformat at 2.10? - it's a test system so gets a hard life).

Couple of things:

1) In the Lustre manual (snapshot as at 2021-01-13), section 25.8, Lustre Quota Statistics - are these obsolete, as in pre-2.4 versions?

[root@pgfs-mds4 ~]# lctl get_param lquota.testfs-OST0000.stats
error: get_param: param_path 'lquota/testfs-OST0000/stats': No such file or directory

ie - is this what's referred to in LUDOC-362?

2) How long should I have to wait for a change of enforcement on the MGS to rattle out onto the MDT/OSTs?

[root@pgfs-mds3 ~]# lctl get_param osd-*.*.quota_slave.info | egrep '(name|enabled)'
target name:    testfs-MDT0000
quota enabled:  ug
[root@pgfs-mds3 ~]# lctl conf_param testfs.quota.mdt=g
[root@pgfs-mds3 ~]# mount -t lustre
/dev/mapper/TEST_MGT on /lustre/testfs-MGT type lustre (ro,svname=MGS,nosvc,mgs,osd=osd-ldiskfs,user_xattr,errors=remount-ro)
/dev/mapper/TEST_MDT on /lustre/testfs-MDT0000 type lustre (ro,svname=testfs-MDT0000,mgsnode=10.10.36.145@o2ib4:10.10.36.145@o2ib4,osd=osd-ldiskfs,user_xattr,errors=remount-ro)
[root@pgfs-mds3 ~]# lctl get_param osd-*.*.quota_slave.info
osd-ldiskfs.testfs-MDT0000.quota_slave.info=
target name:    testfs-MDT0000
pool ID:        0
type:           md
quota enabled:  ug
conn to master: setup
space acct:     ug
user uptodate:  glb[1],slv[1],reint[0]
group uptodate: glb[1],slv[1],reint[0]
project uptodate: glb[0],slv[0],reint[0]
[root@pgfs-mds3 ~]#

ie still showing user enforcement. Similarly:

[root@pgfs-mds3 ~]# lctl conf_param testfs.quota.ost=g
... time passes
[root@pgfs-mds4 ~]# lctl get_param osd-*.*.quota_slave.info | egrep '(name|enabled)'
target name:    testfs-OST0000
quota enabled:  none
target name:    testfs-OST0001
quota enabled:  none

doesn't seem to be rippling out. Do I need to umount / tunefs --writeconf / e2fsck / whatever them? Where do I look for debug info on what's (not) happening?

Many thanks

Andrew
[lustre-discuss] CentOS / LTS plans
Hi All,

I'm guessing most of you have heard of the recent roadmap for CentOS (discussion of which isn't on topic for this list), but can we have a vague (happy for it to be "at this point we're thinking about X, but we haven't really decided" level) indication of what the plan for the upcoming releases is likely to be?

Thanks for the 2.12.6 update the other day - that's on this afternoon's plan to get onto our testbed - and I see from Peter's mail that 2.12.7 will be the next LTS release. Will this likely be using RHEL 7.x for server again? Are the remaining 2.12.x LTS releases likely to stick with RHEL 7 for server? Is the "next big branch" LTS release (whatever that may be) likely to be based on RHEL 8 for server?

Many thanks

Andrew
(who's trying to work out what licence purchases we're likely to need to include in storage plans)
[lustre-discuss] status of HSM copytools?
Hi folks,

I'm looking round to see what's current / 'supported' / working in the state of copytools - ideally one that can migrate to/from object stores (Ceph or S3).

The github repo for Lemur (https://github.com/whamcloud/lemur/commits/master) doesn't seem to have had any substantial work since it left Intel - unlucky timing with the owner shift? I've seen another from Compute Canada (https://github.com/ComputeCanada/lustre-obj-copytool), but that too hasn't been touched for years.

Anyone care to comment on some working ones? Horror stories? Ones to avoid? Hey, I'm even (I'll probably regret this) open to _email_ from salesdroids if you have a working product and can point me to some users (but don't try and phone me or make me sit through a webinar).

Many thanks

Andrew
[lustre-discuss] Commvault lustre backup / archive
Hi folks,

I see from their release notes that Commvault should be able to act on changelogs for backup. Anyone here doing so? Any gotchas to worry about? Is it better than scanning (ugh) and making the MDS unhappy?

Similarly, how good is the archive functionality? Does it play well with the Lustre HSM design?

(Feel free to contact me off list if you'd rather.) I tried to get info out of our local reseller without much success...

Andrew
Re: [lustre-discuss] Pacemaker resource Agents
> I've been trying to locate the Lustre specific Pacemaker resource agents but
> I've had no luck at github where they were meant to be hosted, maybe I am
> looking at the wrong project?
> Has anyone recently implemented a HA lustre cluster using pacemaker and did
> you use lustre specific RA's?

I just grabbed them from the repo:
https://downloads.whamcloud.com/public/lustre/latest-2.10-release/el7/server/RPMS/x86_64/lustre-resource-agents-2.10.8-1.el7.x86_64.rpm
(yum install lustre-resource-agents)

Andrew
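For anyone who lands here from a search, a minimal sketch of wiring those agents into Pacemaker. Resource names, node names and device paths are invented for illustration, and the agent name (ocf:lustre:Lustre) should be checked against what the rpm actually installs on your release:

```shell
# Hypothetical two-node pair managing one OST via the packaged agent.
# All names and paths below are illustrative, not from this thread.
pcs resource create testfs-ost01 ocf:lustre:Lustre \
    target=/dev/mapper/OST0001 mountpoint=/lustre/testfs-OST0001

# prefer oss1, allow failover to its partner oss2
pcs constraint location testfs-ost01 prefers oss1=100
pcs constraint location testfs-ost01 prefers oss2=50
```

Plus the usual fencing/STONITH configuration, without which a shared-disk Lustre pair isn't safe to fail over.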
Re: [lustre-discuss] Jobstats harvesting
On Mon., 17 Feb. 2020, 18:06 Andreas Dilger, wrote:
> You don't mention which Lustre release you are using, but newer
> releases allow "complex JobIDs" that can contain both the SLURM JobID
> as well as other constant strings (e.g. cluster name), hostname, UID, GID,
> and process name.

Yeah, I twigged that once I'd sent the mail: we're still 2.10.8 in production, so having the option of the more complex jobid string is another reason for upgrading.

Related: I've found the DDN fork of collectd, and I see the lustre2.c plugin is GPL2, but are there any plans to get it merged upstream?

Andrew

(Also, who's mad enough to be running MythTV on Lustre, judging from the examples?)
[lustre-discuss] Jobstats harvesting
Hi folks, I've finally got round to enabling jobstats on a test system. As we're a Slurm shop, setting jobid_var=SLURM_JOB_ID works OK, but is it possible to use a combination of variables, i.e. ${PAWSEY_CLUSTER}-${SLURM_JOB_ID} (or even SLURM_CLUSTER_NAME, which is the same as $PAWSEY_CLUSTER)? If so, what's the syntax? (Yes, I know that setting it to federated would expand the JobId namespace to include a cluster identifier, but that's not happening for now.) However, the main reason for this mail is to find out what people use to harvest the stats off the MDTs/OSTs. I'm aware of Roland Laifer's LAD15 presentation (sadly his tarball misses out a sample config file, so it's taken me a bit of iteration over the Perl scripts to recreate the syntax), which saves to a file-based structure, and I've seen others using Prometheus (via https://grafana.com/grafana/dashboards/9671). We've got influxdb (lnet / mds / ost stats gathered as well as regular collectd output) and mariaDB (slurmdbd and robinhood) DBs available, so I'd rather go with something that fed into those. We're not doing serious high-throughput (financial style) work, but more traditional HPC with a lot (sigh) of single node jobs over 4 production filesystems (of which 3 are non-appliance LTS releases maintained by us). Hopefully the discussion here will lead to some updated content at http://wiki.lustre.org/Lustre_Monitoring_and_Statistics_Guide (hat tip to Scott for a great start). Many thanks Andrew
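[Since the question is about feeding influxdb: the per-job counters come out of mdt.*.job_stats / obdfilter.*.job_stats in a YAML-ish shape that flattens nicely into InfluxDB line protocol. A Python sketch under stated assumptions - the sample below is invented (job IDs, counts and timestamps are made up) but mirrors the layout the manual documents:]

```python
import re

# Hypothetical sample of 'lctl get_param -n mdt.*.job_stats' output.
SAMPLE = """\
job_stats:
- job_id:          magnus-123456
  snapshot_time:   1583903168
  open:            { samples:      12, unit:  reqs }
  close:           { samples:      12, unit:  reqs }
- job_id:          magnus-123457
  snapshot_time:   1583903170
  open:            { samples:       3, unit:  reqs }
"""

JOB = re.compile(r'^- job_id:\s+(\S+)')
TIME = re.compile(r'^\s+snapshot_time:\s+(\d+)')
OP = re.compile(r'^\s+(\w+):\s+\{ samples:\s+(\d+),')

def to_line_protocol(text, measurement='lustre_jobstats'):
    """Flatten job_stats into InfluxDB line protocol strings."""
    lines, job, ts = [], None, None
    for raw in text.splitlines():
        if m := JOB.match(raw):
            job, ts = m.group(1), None   # new record: reset the timestamp
        elif m := TIME.match(raw):
            ts = int(m.group(1)) * 10**9  # influx wants nanoseconds
        elif (m := OP.match(raw)) and job and ts:
            op, samples = m.group(1), int(m.group(2))
            lines.append(f'{measurement},job_id={job},op={op} samples={samples}i {ts}')
    return lines
```

[Each returned string can be POSTed straight to influx's /write endpoint; tags (job_id, op) index cheaply, the sample count goes in as an integer field.]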
Re: [lustre-discuss] Slow mount on clients
> HA / MGS running on second node in fstab :-) that was one of the first things we checked, and I've tried manually mounting it, but no change:

10.10.36.224@o2ib4:10.10.36.225@o2ib4:/askapfs1 3.7P 3.0P 507T 86% /askapbuffer

hpc-admin2:~ # lctl ping 10.10.36.224@o2ib4
12345-0@lo
12345-10.10.36.224@o2ib4
hpc-admin2:~ # lctl ping 10.10.36.225@o2ib4
12345-0@lo
12345-10.10.36.225@o2ib4
hpc-admin2:~ # umount /askapbuffer
hpc-admin2:~ # time mount /askapbuffer/

real 1m15.099s
user 0m0.012s
sys 0m0.021s
hpc-admin2:~ #

and on the server:

[root@askap-fs1-mds01 ~]# mount -t lustre
/dev/mapper/array00_2 on /lustre/MGS type lustre (ro)
/dev/mapper/array00_1 on /lustre/askapfs1-MDT0001 type lustre (ro)
[root@askap-fs1-mds01 ~]# lctl list_nids
10.10.36.224@o2ib4
[root@askap-fs1-mds01 ~]# tunefs.lustre --dryrun /dev/mapper/array00_2
checking for existing Lustre data: found
Reading CONFIGS/mountdata

Read previous values:
Target: MGS
Index: unassigned
Lustre FS: askapfs1
Mount type: ldiskfs
Flags: 0x1004 (MGS no_primnode )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters: failover.node=10.10.36.224@o2ib4:10.10.36.225@o2ib4

Permanent disk data:
Target: MGS
Index: unassigned
Lustre FS: askapfs1
Mount type: ldiskfs
Flags: 0x1004 (MGS no_primnode )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters: failover.node=10.10.36.224@o2ib4:10.10.36.225@o2ib4

exiting before disk write.
[root@askap-fs1-mds01 ~]#

(MDT is mounted on the other node at this time).
Re: [lustre-discuss] Running an older Lustre server (2.5) with a newer client (2.11)
On Fri., 30 Aug. 2019, 09:01 Kirill Lozinskiy wrote: > Is there anyone out there running Lustre server version 2.5.x with a > Lustre client version 2.11.x? I'm curious if you are running this > combination and whether or not you saw any gains or losses when you went to > the newer Lustre client. > Not quite that new (we're still sles12 based), but we still have a 2.5 based filesystem (neo 2.0) and happily mount it on 2.10.8 clients (together with our 2.10 LTS filesystems). Given that 2.11 was chosen by the same supplier as their 2.5 based system, I suspect you'd have a good case for support if you hit any issues... Andrew
Re: [lustre-discuss] Wanted: multipath.conf for dell ME4 series arrays
Hi Jeff, On Wed, 21 Aug 2019 at 17:34, Jeff Johnson wrote: > What underlying Lustre target filesystem? (assuming ldiskfs with a hardware > RAID array) correct - ldiskfs, using 8* raid6 luns per ME4084 > What does your current multipath.conf look like? we just had blacklist, WWNs and mappings; we were missing any ME4-specific device {} settings. However, I've since found the magic incantation from https://downloads.dell.com/manuals/common/powervault-me4-series-linux-dell-emc-2018-3924-bp-l_wp_en-us.pdf, notably:

device {
    vendor "DellEMC"
    product "ME4"
    path_grouping_policy "group_by_prio"
    path_checker "tur"
    hardware_handler "1 alua"
    prio "alua"
    failback immediate
    rr_weight "uniform"
    path_selector "service-time 0"
}

and it seems to be working a whole lot better :-)
[lustre-discuss] Wanted: multipath.conf for dell ME4 series arrays
Hi folks, we're seeing MMP reluctance to hand over the (unmounted) OSTs to the partner of the pair on our shiny new ME4084 arrays. Does anyone have the device {} settings they'd be willing to share? My gut feeling is that we've not defined path failover properly and some timeouts need tweaking. (4* ME4084s per 2 740 servers with SAS cabling, Lustre 2.10.8 and CentOS 7.x) Many thanks Andrew
[lustre-discuss] State of arm client?
Hi folks, I remember seeing a press release from DDN/Whamcloud last November saying they were going to support ARM, but can anyone point me to the current state of the client? I'd like to deploy it onto a raspberry pi cluster (only 4-5 nodes), ideally on raspbian, for demo / training purposes. (Yes, I know it won't *quite* be infiniband performance, but as it's hitting a VM based set of lustre servers, that's the least of my worries.) Ideally 2.10.x, but I'd take a 2.12 client if it can talk to 2.10.x servers. Many thanks Andrew
[lustre-discuss] lfs check *, change of behaviour from 2.7 to 2.10?
I've just noticed that 'lfs check mds' / 'lfs check servers' no longer works (2.10.0 or greater clients) for unprivileged users, yet it worked for 2.7.x clients. Is this by design? (lfs quota thankfully still works as a normal user, though.) Andrew
Re: [lustre-discuss] Suspended jobs and rebooting lustre servers
On Tue, 26 Feb 2019 at 23:25, Andreas Dilger wrote: > I agree that having an option that creates the OSTs as inactive might be > helpful, though I wouldn't want that to be the default, as I'd imagine it > would also cause problems for the majority of users that wouldn't know that they > need to enable the OSTs after they are mounted. > Could you file a feature request for this in Jira? Done: https://jira.whamcloud.com/browse/LU-12036
Re: [lustre-discuss] Command line tool to monitor Lustre I/O ?
On Fri., 21 Dec. 2018, 01:05 Laifer, Roland (SCC) wrote: > Dear Lustre administrators, > > what is a good command line tool to monitor current Lustre metadata and > throughput operations on the local client or server? > I wrote a small python script to parse lctl get_param output and inject it straight into our influxdb server. As I was dropping this onto a Sonexion (as well as our newer systems, which had collectd installed), I didn't want to require any software not already installed on the system. My plan [one of these days, in my spare time] is to wrap it properly as a collectd python plugin - if people are interested, I'll probably see if I can find some time to work on it over xmas. Once it's in influx, we can then just plot it with our normal tooling (grafana) - some pictures in the pptx at https://www.dropbox.com/s/rck1lm73wlwlg6v/monitoring.pptx?dl=0 (near the end). Andrew
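[The parsing side of such a script is mostly one regex. A Python sketch - the sample below is invented, but the row layout (name, sample count, unit, then optional min/max/sum) mirrors what the obdfilter/mdt stats files print, so treat the exact spacing as an assumption to verify on your own servers:]

```python
import re

# Hypothetical sample of 'lctl get_param -n obdfilter.*.stats' output;
# the numbers are made up for illustration.
SAMPLE = """\
snapshot_time             1583903168.123456789 secs.nsecs
read_bytes                2407 samples [bytes] 4096 1048576 159727616
write_bytes               1234 samples [bytes] 4096 1048576 987654321
setattr                   17 samples [reqs]
"""

# name, samples, [unit], then optional "min max sum" for byte counters.
ROW = re.compile(r'^(\w+)\s+(\d+) samples \[(\w+)\](?:\s+(\d+)\s+(\d+)\s+(\d+))?')

def parse_stats(text):
    """Return {counter: {'samples': n, 'sum': total_or_None}}."""
    out = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue   # skips snapshot_time, which has its own layout
        name, samples, _unit, _mn, _mx, total = m.groups()
        out[name] = {'samples': int(samples),
                     'sum': int(total) if total else None}
    return out
```

[From there, emitting influx line protocol or collectd value lists is just string formatting over the returned dict; the stats files are cumulative counters, so a consumer usually diffs successive reads to get rates.]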
[lustre-discuss] Openstack Manila + Lustre?
Hi All, Is there anyone on list exporting Lustre filesystems to (private) cloud services - possibly via Manila? I can see https://www.openstack.org/assets/science/CrossroadofCloudandHPC-Print.pdf and Simon's talk from LUG2016 (https://nci.org.au/wp-content/uploads/2016/07/LUG-2016-sjjfowler-hpc-data-in-the-cloud.pdf) which seems to be pretty much what I'm after. Does anyone else have updated notes / success / horror stories they'd be willing to share? Many thanks Andrew (no prizes for guessing who's been asked to look at integrating our HPC storage with cloud...)
[lustre-discuss] rsync target for https://downloads.whamcloud.com/public/?
Hi folks, Is there an rsync (or other easily mirrorable) target for downloads.whamcloud.com? I'm trying to pull e2fsprogs/latest/el7/ and lustre/latest-release/el7/server/ locally to reinstall a bunch of machines. Many thanks, Andrew
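[In the absence of an advertised rsync target, a recursive HTTPS mirror with wget is one workaround. A sketch only - the flag choice is mine, not an official recommendation, and you may want to adjust --cut-dirs for your local layout:]

```shell
# -m mirrors recursively with timestamping, -np stays below the given path,
# -nH drops the hostname directory, --cut-dirs=1 trims the leading 'public/'.
wget -m -np -nH --cut-dirs=1 -e robots=off \
    https://downloads.whamcloud.com/public/e2fsprogs/latest/el7/ \
    https://downloads.whamcloud.com/public/lustre/latest-release/el7/server/
```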
Re: [lustre-discuss] Does lustre 2.10 client support 2.5 server ?
> My Lustre server is running version 2.5 and I want to use the 2.10 client. > Is this combination supported? Is there anything that I need to be aware of? 2 of our storage appliances (Sonexion 1600 based) run 2.5.1; I've mounted these fine on infiniband clients with 2.10.0 and 2.10.1, but a colleague has since had to downgrade some of our clients to 2.9.0 on OPA / KNL hosts, as we were seeing strange issues (can't remember the ticket details). We do see the warnings at startup: Lustre: Server MGS version (2.5.1.0) is much older than client. Consider upgrading server (2.10.0) Andrew
Re: [lustre-discuss] 1 MDS and 1 OSS
On 31 Oct. 2017 07:20, "Dilger, Andreas" wrote: Having a larger MDT isn't bad if you plan future expansion. That said, you would get better performance over FDR if you used SSDs for the MDT rather than HDDs (if you aren't already planning this), and for a single OSS you probably don't need the extra MDT capacity. With both ldiskfs+LVM and ZFS you can also expand the MDT size in the future if you need more capacity. Can someone with wiki editing rights summarise the advantages of different hardware combinations? For example, I remember Daniel @ NCI had some nice comments about which components (MDS vs OSS) benefited from faster cores over thread count, and where more RAM was important. I feel this would be useful for people building small test systems and comparing vendor responses for large tenders. Many thanks, Andrew
[lustre-discuss] Point release updates
Hi Folks, We currently have a couple of storage systems based on IEEL 3.0:

[root@pgfs-oss1 ~]# cat /proc/fs/lustre/version
lustre: 2.7.16.8
kernel: patchless_client
build: jenkins-arch=x86_64,build_type=client,distro=el7,ib_stack=inkernel-15--PRISTINE-3.10.0-327.36.1.el7_lustre.x86_64
[root@pgfs-oss1 ~]#

[root@astrofs-oss1 ~]# cat /proc/fs/lustre/version
lustre: 2.7.19.8
kernel: patchless_client
build: jenkins-arch=x86_64,build_type=server,distro=el7,ib_stack=inkernel-165--PRISTINE-3.10.0-514.2.2.el7_lustre.x86_64
[root@astrofs-oss1 ~]#

and we'd like to update these - the preferred choice would be 2.10.1 once it's out, but we're happy to go for an intermediate 2.7 release. (Mainly because we're seeing lots of these in the logs: astrofs-MDT0001-osd: FID [whatever] != self_fid [whatever] - which seems to be https://jira.hpdd.intel.com/browse/LU-8532 / LU-8319, which apparently has a backport to 2.7.) However - where do we get "blessed" point release updates from? https://downloads.hpdd.intel.com/public/lustre/ doesn't seem to have any. Many thanks Andrew