[lustre-discuss] BCP for High Availability?
Hi Folks,

I'm just rebuilding my testbed and have got to the "sort out all the pacemaker stuff" part. What's the best current practice for the current LTS (2.15.x) release tree?

I've always done this as multiple individual HA clusters, one covering each pair of servers with common dual-connected drive array(s), but I remember seeing a talk some years ago where one of the US labs was using 'pacemaker-remote' and bringing them all up from a central node. (A rough sketch of my current per-pair setup is below, for reference.)

I note there are a few (old) crib notes on the wiki, referenced from the Lustre manual, but nothing updated in the last couple of years. What are people out there doing?

Many thanks
Andrew
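P.S. For context, this is roughly what each of my existing per-pair clusters looks like. It's only a sketch: host, device and filesystem names are placeholders, it assumes the ocf:lustre:Lustre agent shipped in lustre-resource-agents and EL8-era pcs syntax, and the fence agent and its parameters obviously depend on your hardware.

# one two-node cluster per HA pair
pcs host auth oss01 oss02 -u hacluster
pcs cluster setup lustre-pair1 oss01 oss02
pcs cluster start --all

# STONITH is mandatory with shared storage; agent and params are site-specific
pcs stonith create fence-oss01 fence_ipmilan ip=oss01-bmc username=admin password=xxx pcmk_host_list=oss01

# one resource per Lustre target on the shared array, preferring its "home" server
pcs resource create testfs-OST0000 ocf:lustre:Lustre target=/dev/mapper/testfs-ost0000 mountpoint=/lustre/testfs-OST0000
pcs constraint location testfs-OST0000 prefers oss01=100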
[lustre-discuss] 2.15.x with ConnectX-3 cards
Hi Gang,

I've just gone and reimaged a test system in prep for doing an upgrade to Rocky 8 + 2.15.1. (What's the bet 2.15.2 comes out the night I push to prod?)

However, the 2.15.1-ib release uses MOFED 5.6 ... which no longer supports CX-3 cards. (Yeah, it's olde hardware...)

Having been badly bitten (see posts passim) using non-OFED 2.10/2.12 on this hardware (it needs to talk ib_srp to an SFA7700X), what's the advice for going to a 2.15 server? Non-OFED, or some rebuild based on 4.9? (A sketch of the rebuild I have in mind is below.)

Many thanks
Andrew
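P.S. The sort of rebuild I have in mind, in case it helps frame the question. This is only a sketch: it assumes the MOFED 4.9.x LTS kernel bits are already installed and built against the running kernel, and the ofa_kernel path is from memory, so check it against the Lustre build docs before trusting it.

# build the server packages against the external OFED rather than the in-kernel stack
cd lustre-release
./configure --with-linux=/usr/src/kernels/$(uname -r) --with-o2ib=/usr/src/ofa_kernel/default
make rpms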
[lustre-discuss] Version interoperability
Hi folks,

We're faced with a seriously large gap in versions (short term, measured in months not years, thankfully) between our existing clients (2.7.5) and new hardware clients (2.15.0) that will be mounting the same file system. It's currently on 2.10.8-ib (ldiskfs) with ConnectX-5 cards, and I have a maintenance window coming up where I have the opportunity to upgrade it.

Which is likely to cause less breakage:

* stick with 2.10.8 on the servers and live with the annoyances with multi-rail / discovery when talking to our new system
* upgrade to 2.12.9, sticking with the same OS major version
* upgrade to 2.15.1, including a jump to RHEL 8 / Rocky 8 (depending on licensing, as we seem to have lost our HA add-on)

I can't upgrade the old 2.7.5 clients, as that system is already on the decommissioning roadmap for next year.

Many thanks
Andrew
[lustre-discuss] 2.12.9-ib release?
Hi folks,

I see the 2.12.9/ release tree on https://downloads.whamcloud.com/public/lustre/, but I don't see the accompanying 2.12.9-ib/ one. ISTR someone needed to poke a build process last time to get this public - can they do the same this time please?

Many thanks
Andrew
[lustre-discuss] unclear language in Operations manual
Hi folks,

I've recently come across this snippet in the ops manual (section 13.8, "Running Multiple Lustre File Systems", page 111 in the current PDF):

> Note
> If a client(s) will be mounted on several file systems, add the following
> line to /etc/xattr.conf file to avoid problems when files are moved between
> the file systems: lustre.* skip

Is this describing the case where a single client mounts more than one Lustre filesystem simultaneously? i.e.

mount -t lustre mgsnode:/foo /mnt/foo
AND
mount -t lustre mgsnode:/bar /mnt/bar

I suspect I should file a LUDOC if so, as the language doesn't flow. As (ahem) we've never done this, what's it likely to screw up when a user's copying from /mnt/foo to /mnt/bar? (My reading of the setup is sketched below.)

Many thanks
Andrew
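P.S. My reading of what the note is describing, for the avoidance of doubt. A sketch only; the filesystem names are from the example above, and the comment about lustre.lov is my guess at the "problems" the manual means.

# one client, two Lustre filesystems mounted at the same time
mount -t lustre mgsnode:/foo /mnt/foo
mount -t lustre mgsnode:/bar /mnt/bar

# per the manual's note: tell the xattr-copying tools to skip lustre.* attributes
# (presumably the lustre.lov layout) when files move between the two filesystems
echo 'lustre.* skip' >> /etc/xattr.conf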
[lustre-discuss] jobstats
Hi folks,

I've finally started to re-investigate pushing jobstats to our central dashboards and realised there's a dearth of scripts / tooling to actually gather the job_stats files and push them to $whatever. I have seen the telegraf one, and the DDN fork of collectd seems somewhat abandonware. Hence at this stage I'm back to rolling another Python script to feed InfluxDB. Yes, I know all the cool kids are using Prometheus, but I'm not one of them. (A rough sketch of what I'm rolling is below, in case it's useful.)

However, while rummaging I came across LU-11407 (Improve stats data). Andreas commented[1] that he was hoping to add start_time and elapsed_time fields, but are these targeted at an upcoming release? (The ticket still shows 'open'.) It's also referred to in LU-15826 - is that likely to make a point release of 2.15, or will it be targeted at the next major release? It might be handy to save me correlating with Slurm job start times, especially if the user job does $other_stuff before actually hitting the disks.

Many thanks
Andrew

[1] https://jira.whamcloud.com/browse/LU-11407?focusedCommentId=234830&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-234830
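P.S. The sort of thing I'm currently rolling, in case it saves someone else starting from zero. This is a rough shell sketch rather than the eventual Python; the InfluxDB URL, database and measurement names are made up, and the awk field handling assumes the 2.12-era job_stats YAML-ish layout, so treat it as indicative only.

#!/bin/bash
# scrape job_stats on an OSS and push per-job read/write byte sums
# to InfluxDB's line-protocol write endpoint
INFLUX_URL="http://influxdb.example.org:8086/write?db=lustre"

lctl get_param -n obdfilter.*.job_stats 2>/dev/null |
awk -v host="$(hostname -s)" '
    /^- job_id:/   { job  = $3 }
    /read_bytes:/  { rsum = $(NF-1) }   # "sum:" is the last key inside the braces
    /write_bytes:/ { wsum = $(NF-1)
                     printf "jobstats,host=%s,job_id=%s read_bytes=%s,write_bytes=%s\n", host, job, rsum, wsum }
' | curl -s -XPOST "$INFLUX_URL" --data-binary @-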
Re: [lustre-discuss] Corrupted? MDT not mounting
On Wed, 11 May 2022 at 04:37, Laura Hild wrote:
> The non-dummy SRP module is in the kmod-srp package, which isn't included in
> the Lustre repository...

Thanks Laura,

Yeah, I realised that earlier in the week, and have rebuilt the srp module from source via mlnxofedinstall, and sure enough installing srp-4.9-OFED.4.9.4.1.6.1.kver.3.10.0_1160.49.1.el7_lustre.x86_64.x86_64.rpm (gotta love those short names) gives me working srp again.

Hat tip to a DDN contact here (we owe him even more beers now) for some extra tuning parameters:

options ib_srp cmd_sg_entries=255 indirect_sg_entries=2048 allow_ext_sg=1 ch_count=1 use_imm_data=0

and I'm pleased to say that it _seems_ to be working much better. I'd done one half of the HA pairs earlier in the week - lfsck completed, full robinhood scan done (dropped the DB and rescanned from fresh) - and I'm just bringing the other half of the pairs up to the same software stack now.

Couple of pointers for anyone caught in the same boat, things that apparently we did correctly (rough shape of the commands below):

* upgrade your e2fsprogs to the latest - if you're fsck'ing disks, make sure you're not introducing more problems with a buggy old e2fsck
* tunefs.lustre --writeconf isn't too destructive (see the warnings; you'll lose pool info, but in our case that wasn't critical)
* monitoring is good, but tbh the rate of change and that it happened out of hours means we likely couldn't have intervened
* so quotas are better

Thanks to those who replied on and off-list - I'm just grateful we only had the pair of MDTs, not the 40 (!!!) that Origin's getting (yeah, I was watching the LUG talk last night). Service isn't quite back to users, but we're getting there!

Andrew
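P.S. For anyone landing here from a search later, the rough shape of the commands involved. Sketch only: device names are placeholders, and read the tunefs.lustre man page warnings about what --writeconf throws away (pools etc.) before copying any of this.

# persist the ib_srp tuning so it's picked up on the next boot
echo 'options ib_srp cmd_sg_entries=255 indirect_sg_entries=2048 allow_ext_sg=1 ch_count=1 use_imm_data=0' > /etc/modprobe.d/ib_srp.conf

# regenerate the config logs; run against each target while it is unmounted
tunefs.lustre --writeconf /dev/mapper/MDT0000
tunefs.lustre --writeconf /dev/mapper/OST0000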
Re: [lustre-discuss] Corrupted? MDT not mounting
On Fri, 6 May 2022 at 20:04, Andreas Dilger wrote:
> MOFED is usually preferred over in-kernel OFED, it is just tested and fixed a
> lot more.

Fair enough. However, is the 2.12.8-ib tree built with all the features? Specifically
https://downloads.whamcloud.com/public/lustre/lustre-2.12.8-ib/MOFED-4.9-4.1.7.0/el7/server/

If I compare the ib_srp module from 2.12 in-kernel:

[root@astrofs-oss3 ~]# find /lib/modules/`uname -r` -name ib_srp.ko.xz
/lib/modules/3.10.0-1160.49.1.el7_lustre.x86_64/kernel/drivers/infiniband/ulp/srp/ib_srp.ko.xz
[root@astrofs-oss3 ~]# rpm -qf /lib/modules/3.10.0-1160.49.1.el7_lustre.x86_64/kernel/drivers/infiniband/ulp/srp/ib_srp.ko.xz
kernel-3.10.0-1160.49.1.el7_lustre.x86_64
[root@astrofs-oss3 ~]# modinfo ib_srp
filename:       /lib/modules/3.10.0-1160.49.1.el7_lustre.x86_64/kernel/drivers/infiniband/ulp/srp/ib_srp.ko.xz
license:        Dual BSD/GPL
description:    InfiniBand SCSI RDMA Protocol initiator
author:         Roland Dreier
retpoline:      Y
rhelversion:    7.9
srcversion:     1FB80E3A962EE7F39AD3959
depends:        ib_core,scsi_transport_srp,ib_cm,rdma_cm
intree:         Y
vermagic:       3.10.0-1160.49.1.el7_lustre.x86_64 SMP mod_unload modversions
signer:         CentOS Linux kernel signing key
sig_key:        FA:A3:27:4B:D9:17:36:F0:FD:43:6A:42:1B:6A:A4:FA:FE:D0:AC:FA
sig_hashalgo:   sha256
parm:           srp_sg_tablesize:Deprecated name for cmd_sg_entries (uint)
parm:           cmd_sg_entries:Default number of gather/scatter entries in the SRP command (default is 12, max 255) (uint)
parm:           indirect_sg_entries:Default max number of gather/scatter entries (default is 12, max is 2048) (uint)
parm:           allow_ext_sg:Default behavior when there are more than cmd_sg_entries S/G entries after mapping; fails the request when false (default false) (bool)
parm:           topspin_workarounds:Enable workarounds for Topspin/Cisco SRP target bugs if != 0 (int)
parm:           prefer_fr:Whether to use fast registration if both FMR and fast registration are supported (bool)
parm:           register_always:Use memory registration even for contiguous memory regions (bool)
parm:           never_register:Never register memory (bool)
parm:           reconnect_delay:Time between successive reconnect attempts
parm:           fast_io_fail_tmo:Number of seconds between the observation of a transport layer error and failing all I/O. "off" means that this functionality is disabled.
parm:           dev_loss_tmo:Maximum number of seconds that the SRP transport should insulate transport layer errors. After this time has been exceeded the SCSI host is removed. Should be between 1 and SCSI_DEVICE_BLOCK_MAX_TIMEOUT if fast_io_fail_tmo has not been set. "off" means that this functionality is disabled.
parm:           ch_count:Number of RDMA channels to use for communication with an SRP target. Using more than one channel improves performance if the HCA supports multiple completion vectors. The default value is the minimum of four times the number of online CPU sockets and the number of completion vectors supported by the HCA. (uint)
parm:           use_blk_mq:Use blk-mq for SRP (bool)
[root@astrofs-oss3 ~]#

it all looks normal and capable of mounting our ExaScaler LUNs - cf. the one from 2.12.8-ib:

================================================================================
 Package                    Arch     Version                      Repository          Size
================================================================================
Installing:
 kernel                     x86_64   3.10.0-1160.49.1.el7_lustre  lustre-2.12-mofed   50 M
 kmod-lustre-osd-ldiskfs    x86_64   2.12.8_6_g5457c37-1.el7      lustre-2.12-mofed   469 k
 lustre                     x86_64   2.12.8_6_g5457c37-1.el7      lustre-2.12-mofed   805 k
Installing for dependencies:
 kmod-lustre                x86_64   2.12.8_6_g5457c37-1.el7      lustre-2.12-mofed   3.9 M
 kmod-mlnx-ofa_kernel       x86_64   4.9-OFED.4.9.4.1.7.1         lustre-2.12-mofed   1.3 M
 lustre-osd-ldiskfs-mount   x86_64   2.12.8_6_g5457c37-1.el7      lustre-2.12-mofed   15 k
 mlnx-ofa_kernel            x86_64   4.9-OFED.4.9.4.1.7.1         lustre-2.12-mofed   108 k

[root@astrofs-oss1 ~]# find /lib/modules/`uname -r` -name ib_srp.ko.xz
/lib/modules/3.10.0-1160.49.1.el7_lustre.x86_64/kernel/drivers/infiniban
Re: [lustre-discuss] Corrupted? MDT not mounting
> It's looking more like something filled up our space - I'm just
> copying the files out as a backup (mounted as ldiskfs just now) -

Ahem. Inode quotas are a good idea. Turns out that a user creating about 130 million directories rapidly is more than a small MDT volume can take.

An update on recovery progress: upgrading the MDS to 2.12 got us over the issue in LU-12674 enough to recover, and I've migrated half (one of the HA pairs) of the OSSs to RHEL 7.9 / Lustre 2.12.8 too. It needed a set of writeconfs doing before they'd mount, and e2fsck has run over any suspect LUNs.

The filesystem "works", in that under light testing I can read/write OK, but as soon as it gets stressed, OSSs are falling over:

[ 1226.864430] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 1226.872281] IP: [] __list_add+0x1b/0xc0
[ 1226.877699] PGD 1ffba0d067 PUD 1ffa48e067 PMD 0
[ 1226.882360] Oops: [#1] SMP
[ 1226.885619] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) dm_round_robin ib_srp scsi_transport_srp scsi_tgt tcp_diag inet_diag ib_isert iscsi_target_mod target_core_mod rpcrdma rdma_ucm ib_iser ib_umad bonding rdma_cm ib_ipoib iw_cm libiscsi scsi_transport_iscsi ib_cm mlx4_ib ib_uverbs ib_core sunrpc ext4 mbcache jbd2 sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel iTCO_wdt kvm iTCO_vendor_support irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr i2c_i801 lpc_ich mei_me joydev mei sg ioatdma wmi ipmi_si ipmi_devintf ipmi_msghandler dm_multipath acpi_pad acpi_power_meter dm_mod ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mlx4_en ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm igb ahci libahci mpt2sas mlx4_core ptp crct10dif_pclmul crct10dif_common libata crc32c_intel pps_core dca raid_class devlink i2c_algo_bit drm_panel_orientation_quirks scsi_transport_sas nfit libnvdimm [last unloaded: scsi_tgt]
[ 1226.987670] CPU: 6 PID: 366 Comm: kworker/u24:6 Kdump: loaded Tainted: G OE 3.10.0-1160.49.1.el7_lustre.x86_64 #1
[ 1227.000168] Hardware name: SGI.COM CH-C1104-GP6/X10SRW-F, BIOS 3.1 06/06/2018
[ 1227.007310] Workqueue: rdma_cm cma_work_handler [rdma_cm]
[ 1227.012725] task: 934839f0b180 ti: 934836c2 task.ti: 934836c2
[ 1227.020195] RIP: 0010:[] [] __list_add+0x1b/0xc0
[ 1227.028036] RSP: 0018:934836c23d68 EFLAGS: 00010246
[ 1227.09] RAX: RBX: 934836c23d90 RCX:
[ 1227.040463] RDX: 932fa518e680 RSI: RDI: 934836c23d90
[ 1227.047587] RBP: 934836c23d80 R08: R09: b2df8c1b3dcb3100
[ 1227.054712] R10: b2df8c1b3dcb3100 R11: 00ff R12: 932fa518e680
[ 1227.061835] R13: R14: R15: 932fa518e680
[ 1227.068958] FS: () GS:93483f38() knlGS:
[ 1227.077034] CS: 0010 DS: ES: CR0: 80050033
[ 1227.082772] CR2: CR3: 001fe47a8000 CR4: 003607e0
[ 1227.089895] DR0: DR1: DR2:
[ 1227.097020] DR3: DR6: fffe0ff0 DR7: 0400
[ 1227.104142] Call Trace:
[ 1227.106593] [] __mutex_lock_slowpath+0xa6/0x1d0
[ 1227.112770] [] ? __switch_to+0xce/0x580
[ 1227.118255] [] mutex_lock+0x1f/0x2f
[ 1227.123399] [] cma_work_handler+0x25/0xa0 [rdma_cm]
[ 1227.129922] [] process_one_work+0x17f/0x440
[ 1227.135752] [] worker_thread+0x126/0x3c0
[ 1227.141324] [] ? manage_workers.isra.26+0x2a0/0x2a0
[ 1227.147849] [] kthread+0xd1/0xe0
[ 1227.152729] [] ? insert_kthread_work+0x40/0x40
[ 1227.158822] [] ret_from_fork_nospec_begin+0x7/0x21
[ 1227.165260] [] ? insert_kthread_work+0x40/0x40
[ 1227.171348] Code: ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 55 48 89 e5 41 55 49 89 f5 41 54 49 89 d4 53 4c 8b 42 08 48 89 fb 49 39 f0 75 2a <4d> 8b 45 00 4d 39 c4 75 68 4c 39 e3 74 3e 4c 39 eb 74 39 49 89
[ 1227.191295] RIP [] __list_add+0x1b/0xc0
[ 1227.196798] RSP
[ 1227.200284] CR2:

and I'm able to reproduce this on multiple servers :-/

I can see a few mentions (https://access.redhat.com/solutions/4969471 for example) that seem to hint it's low-memory triggered, but they also say it's fixed in the Red Hat 7.9 kernel (and we're running the 2.12.8 stock 3.10.0-1160.49.1.el7_lustre.x86_64). I've got a case open with the vendor to see if there are any firmware updates - but I'm not hopeful. These are 6-core single-socket Broadwells with 128G of RAM; storage disks are mounted over SRP from a DDN appliance.

Would jumping to MOFED make a difference? Otherwise I'm open to suggestions, as it's getting very tiring wrangling servers back to life.

[root@astrofs-oss1 ~]# ls -l /var/crash/ | grep 2022
drwxr-xr-x 2 root root 44 M
Re: [lustre-discuss] Corrupted? MDT not mounting
Thanks Stéphane,

It's looking more like something filled up our space - I'm just copying the files out as a backup (mounted as ldiskfs just now). We're running DNE (MDT0000 and this one, MDT0001), but I don't understand why so much space is being taken up in REMOTE_PARENT_DIR - we seem to have actual user data stashed in there:

[root@astrofs-mds2 SSINS_uvfits]# pwd
/mnt/REMOTE_PARENT_DIR/0xa40002340:0x1:0x0/MWA/data/1061313128/SSINS_uvfits
[root@astrofs-mds2 SSINS_uvfits]# ls -l
total 0
-rw-rw-r--+ 1 redacted redacted 67153694400 Oct  9  2018 1061313128_noavg_noflag_00.uvfits
-rw-rw-r--+ 1 redacted redacted           0 Oct  9  2018 1061313128_noavg_noflag_01.uvfits
[root@astrofs-mds2 SSINS_uvfits]#

and although this one was noticeably large, it's not the only non-zero sized file under REMOTE_PARENT_DIR:

[root@astrofs-mds2 1061314832]# ls -l | head
total 116
-rw-rw-r--+ 1 redacted redacted 7338240 Nov 14  2017 1061314832_01.mwaf
-rw-rw-r--+ 1 redacted redacted 7338240 Nov 14  2017 1061314832_02.mwaf
-rw-rw-r--+ 1 redacted redacted 7404480 Nov 14  2017 1061314832_03.mwaf
-rw-rw-r--+ 1 redacted redacted 7404480 Nov 14  2017 1061314832_04.mwaf
-rw-rw-r--+ 1 redacted redacted 7338240 Nov 14  2017 1061314832_05.mwaf
-rw-rw-r--+ 1 redacted redacted 7338240 Nov 14  2017 1061314832_06.mwaf
-rw-rw-r--+ 1 redacted redacted 7404480 Nov 14  2017 1061314832_07.mwaf
-rw-rw-r--+ 1 redacted redacted 7404480 Nov 14  2017 1061314832_08.mwaf
-rw-rw-r--+ 1 redacted redacted 7404480 Nov 14  2017 1061314832_09.mwaf
[root@astrofs-mds2 1061314832]# pwd
/mnt/REMOTE_PARENT_DIR/0xa40002340:0x1:0x0/MWA/data/1061314832

Suggestions for how to clean up and recover, anyone?

Andrew
[lustre-discuss] Corrupted? MDT not mounting
Hi Folks,

One of our filesystems seemed to fail over the holiday weekend - we're running DNE and MDT0001 won't mount. At first it looked like we'd run out of space (rc = -28), but then we were seeing this:

mount.lustre: mount /dev/mapper/MDT0001 at /lustre/astrofs-MDT0001 failed: File exists
retries left: 0
mount.lustre: mount /dev/mapper/MDT0001 at /lustre/astrofs-MDT0001 failed: File exists

possibly due to:

kernel: LustreError: 13921:0:(genops.c:478:class_register_device()) astrofs-OST0000-osc-MDT0001: already exists, won't add

lustre_rmmod wouldn't remove everything cleanly (osc in use), and so after a reboot everything *seemed* to start OK:

[root@astrofs-mds1 ~]# mount -t lustre
/dev/mapper/MGS on /lustre/MGS type lustre (ro)
/dev/mapper/MDT0000 on /lustre/astrofs-MDT0000 type lustre (ro)
/dev/mapper/MDT0001 on /lustre/astrofs-MDT0001 type lustre (ro)

... but not for long:

kernel: LustreError: 12355:0:(osp_sync.c:343:osp_sync_declare_add()) ASSERTION( ctxt ) failed:
kernel: LustreError: 12355:0:(osp_sync.c:343:osp_sync_declare_add()) LBUG

possibly a corrupt llog? I see LU-12674, which looks like our problem, but it's only backported to the 2.12 branch (these servers are still 2.10.8).

Piecing together what *might* have happened: a user possibly ran out of inodes and then did a rm -r before the system stopped responding.

Mounting just now I'm getting:

[ 1985.078422] LustreError: 10953:0:(llog.c:654:llog_process_thread()) astrofs-OST0001-osc-MDT0001: Local llog found corrupted #0x7ede0:1:0 plain index 35518 count 2
[ 1985.095129] LustreError: 10959:0:(llog_osd.c:961:llog_osd_next_block()) astrofs-MDT0001-osd: invalid llog tail at log id [0x7ef40:0x1:0x0]:0 offset 577536 bytes 4096
[ 1985.109892] LustreError: 10959:0:(osp_sync.c:1242:osp_sync_thread()) astrofs-OST0004-osc-MDT0001: llog process with osp_sync_process_queues failed: -22
[ 1985.126797] LustreError: 10973:0:(llog_cat.c:269:llog_cat_id2handle()) astrofs-OST000b-osc-MDT0001: error opening log id [0x7ef76:0x1:0x0]:0: rc = -2
[ 1985.140169] LustreError: 10973:0:(llog_cat.c:823:llog_cat_process_cb()) astrofs-OST000b-osc-MDT0001: cannot find handle for llog [0x7ef76:0x1:0x0]: rc = -2
[ 1985.155321] Lustre: astrofs-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[ 1985.169404] Lustre: astrofs-MDT0001: in recovery but waiting for the first client to connect
[ 1985.177869] Lustre: astrofs-MDT0001: Will be in recovery for at least 2:30, or until 1508 clients reconnect
[ 1985.187612] Lustre: astrofs-MDT0001: Connection restored to a5e41149-73fc-b60a-30b1-da096a5c2527 (at 1170@gni1)
[ 2017.251374] Lustre: astrofs-MDT0001: Connection restored to 7a388f58-bc16-6bd7-e0c8-4ffa7c0dd305 (at 400@gni1)
[ 2017.261374] Lustre: Skipped 1275 previous similar messages
[ 2081.458117] Lustre: astrofs-MDT0001: Connection restored to 10.10.36.143@o2ib4 (at 10.10.36.143@o2ib4)
[ 2081.467419] Lustre: Skipped 277 previous similar messages
[ 2082.324547] Lustre: astrofs-MDT0001: Recovery over after 1:37, of 1508 clients 1508 recovered and 0 were evicted.

Message from syslogd@astrofs-mds2 at Apr 19 17:32:49 ...
kernel: LustreError: 11082:0:(osp_sync.c:343:osp_sync_declare_add()) ASSERTION( ctxt ) failed:

Message from syslogd@astrofs-mds2 at Apr 19 17:32:49 ...
kernel: LustreError: 11082:0:(osp_sync.c:343:osp_sync_declare_add()) LBUG

[ 2082.392381] LustreError: 11082:0:(osp_sync.c:343:osp_sync_declare_add()) ASSERTION( ctxt ) failed:
[ 2082.401422] LustreError: 11082:0:(osp_sync.c:343:osp_sync_declare_add()) LBUG
[ 2082.408558] Pid: 11082, comm: orph_cleanup_as 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Mon May 27 03:45:37 UTC 2019
[ 2082.418891] Call Trace:
[ 2082.421340] [] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 2082.427890] [] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 2082.434077] [] osp_sync_declare_add+0x3a9/0x3e0 [osp]
[ 2082.440797] [] osp_declare_destroy+0xc9/0x1c0 [osp]
[ 2082.447338] [] lod_sub_declare_destroy+0xce/0x2d0 [lod]
[ 2082.454237] [] lod_obj_stripe_destroy_cb+0x85/0x90 [lod]
[ 2082.461213] [] lod_obj_for_each_stripe+0xb6/0x230 [lod]
[ 2082.468104] [] lod_declare_destroy+0x43b/0x5c0 [lod]
[ 2082.474736] [] orph_key_test_and_del+0x5f6/0xd30 [mdd]
[ 2082.481538] [] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
[ 2082.488250] [] kthread+0xd1/0xe0
[ 2082.493147] [] ret_from_fork_nospec_begin+0x7/0x21
[ 2082.499601] [] 0x
[ 2082.504585] Kernel panic - not syncing: LBUG

e2fsck seems to be clean (and the volume mounts fine as ldiskfs), but is there a way I can get it mounted enough to run lfsck? (A sketch of the invocation I have in mind is below.) Alternatively, can I upgrade the MDSs to 2.12.x while having the OSSs still on 2.10? Yes, I know this isn't ideal, but I wasn't planning a large upgrade at zero notice to our users. (Also, we still have a legacy system accessing it with a 2.7 client - its replacement arrived last Sept, but still hasn't been handed over to us yet, so I really don't want to get too out of step.)

Many thanks
Andrew
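P.S. The sort of invocation I have in mind once MDT0001 will stay mounted - a sketch pieced together from the man pages rather than anything I've actually run yet, so treat the options as indicative:

# skip recovery so the target comes up without waiting on clients
mount -t lustre -o abort_recov /dev/mapper/MDT0001 /lustre/astrofs-MDT0001

# start lfsck on that MDT and poll its progress
lctl lfsck_start -M astrofs-MDT0001 -t all
lctl get_param mdd.astrofs-MDT0001.lfsck_namespace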
[lustre-discuss] Hardware advice for homelab
Hi folks,

Given my homelab testing for Lustre tends to be contained within VirtualBox on a laptop ($work has a physical hardware test bed once mucking around gets serious), I'm considering expanding to some real hardware at home for testing. My MythTV days are over, but I'd ideally like an aarch64 client that can run on a Raspberry Pi, in case I ever poke at Kodi.

What server hardware would people advise that fulfils:

* low running cost (it's my electricity bill!)
* fairly cheap to buy (own budget)
* if I'm buying a cased 'NUC' type thing, it must be able to take a 3.5" SATA drive (as I have some old ones that fell off the back of a rack)
* not full of screaming fans

Given it's not planned for production use 24/7, I don't care about HA with multi-tailed drives, but I would quite like the ability to add more OSSs as required. Cable sprawl / mounting isn't that much of an issue, providing it can live in the shed.

Any suggestions?

Andrew