[lustre-discuss] Lnet errors

2023-10-05 Thread Alastair Basden via lustre-discuss
Hi, Lustre 2.12.2. We are seeing lots of errors on the servers such as:
Oct 5 11:16:48 oss04 kernel: LNetError: 6414:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-172.19.171.15@o2ib1: -125
Oct 5 11:16:48 oss04 kernel: LustreError:
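
Error -125 is -ECANCELED; in the 2.12 LNet resend path this typically means a queued message was abandoned after retries. A first diagnostic pass is usually the LNet health state on both ends; a minimal sketch, assuming the 2.12-style lnetctl health counters, with the peer NID taken from the log line above:

    lnetctl net show -v                              # local NIs and their health values
    lnetctl peer show --nid 172.19.171.15@o2ib1 -v   # peer health and resend counters
    lnetctl stats show                               # global drop/resend totals
    lctl ping 172.19.171.15@o2ib1                    # basic reachability check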

[lustre-discuss] Changing OST servicenode

2022-11-16 Thread Alastair Basden via lustre-discuss
Hi, We want to change the service node of an OST. We think this involves:
1. umount the OST
2. tunefs.lustre --erase-param failover.node --servicenode=172.18.100.1@o2ib,172.17.100.1@tcp pool1/ost1
Is this all? Unclear from the documentation whether a writeconf is required (if it is, then
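
A sketch of that procedure as stated, with the mount point as a placeholder; whether the writeconf is needed is exactly the open question here:

    umount /mnt/ost1                          # 1. stop the OST
    tunefs.lustre --erase-param failover.node \
        --servicenode=172.18.100.1@o2ib,172.17.100.1@tcp pool1/ost1   # 2. rewrite NIDs
    # If a writeconf does turn out to be required, it is filesystem-wide:
    # unmount all clients and targets, run tunefs.lustre --writeconf on
    # every target, then remount (MGS/MDT before the OSTs).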

Re: [lustre-discuss] Lustre recycle bin

2022-10-17 Thread Alastair Basden via lustre-discuss
Hi Francois, We had something similar a few months back - I suspect a bug somewhere. Basically files weren't getting removed from the OST. Eventually, we mounted as ext, and removed them manually, I think. A reboot of the file system meant that rm operations then proceeded correctly after
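
The manual cleanup mentioned here goes roughly like this; a destructive sketch with device and mount paths as placeholders, only for objects confirmed to be orphans:

    umount /mnt/ost0                         # stop the OST first
    mount -t ldiskfs /dev/ostdev /mnt/tmp    # ldiskfs targets are patched ext4
    ls /mnt/tmp/O/0/d*/                      # OST objects live under O/<seq>/d<N>
    rm /mnt/tmp/O/0/d5/12345                 # remove a confirmed-orphan object (example name)
    umount /mnt/tmp
    mount -t lustre /dev/ostdev /mnt/ost0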

[lustre-discuss] (no subject)

2022-05-17 Thread Alastair Basden via lustre-discuss
Hi all, We had a problem with one of our MDSs (ldiskfs) on Lustre 2.12.6, which we think is a bug - but haven't been able to identify it. Can anyone shed any light? We unmounted and remounted the MDT at around 23:00. Client logs:
May 16 22:15:41 m8011 kernel: LustreError: 11-0:

[lustre-discuss] ZFS wobble

2022-04-28 Thread Alastair Basden via lustre-discuss
Hi, We have OSDs on ZFS (0.7.9) / Lustre 2.12.6. Recently, one of our JBODs had a wobble, and the disks (as presented to the OS) disappeared for a few seconds (and then returned). This upset a few zpools which SUSPENDED. A zpool clear on these then started the resilvering process, and zpool
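
The recovery sequence for a SUSPENDED pool, roughly as described above (pool name is a placeholder):

    zpool status -v ostpool    # state: SUSPENDED after the disks dropped out
    zpool clear ostpool        # resume I/O; stale members start resilvering
    zpool status ostpool       # watch the resilver progress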

Re: [lustre-discuss] 2.12.6 freeze

2021-12-01 Thread Alastair Basden
Hi, Turns out there is a problem with the zpool, which we think got corrupted by a stonith event when a disk on another pool started reporting a predicted failure. A zpool scrub has been done, and there are 5 files with permanent errors (zpool status -v): errors: Permanent errors have been

Re: [lustre-discuss] 2.12.6 freeze

2021-11-29 Thread Alastair Basden
Additional info - exporting the pool, importing on another (HA) server and attempting to mount there also has the same problem, i.e. a kernel panic, and the trace shown below. A writeconf does not help. On Mon, 29 Nov 2021, Alastair Basden wrote: Some more information

Re: [lustre-discuss] 2.12.6 freeze

2021-11-29 Thread Alastair Basden
We suspect corruption on the OST caused by a stonith event, but could be wrong. Any tips on how to solve this manually would be great... Thanks, Alastair. On Mon, 29 Nov 2021, Alastair Basden wrote: Hi all, Upon attempting to mount a zfs OST, we are getting: Message from sysl

[lustre-discuss] 2.12.6 freeze

2021-11-29 Thread Alastair Basden
Hi all, Upon attempting to mount a zfs OST, we are getting:
Message from syslogd@c8oss01 at Nov 29 18:11:47 ...
kernel:LustreError: 58223:0:(lu_object.c:1267:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
Message from syslogd@c8oss01 at Nov 29 18:11:47 ...

Re: [lustre-discuss] Full OST

2021-09-22 Thread Alastair Basden
, at 04:42, Alastair Basden <a.g.bas...@durham.ac.uk> wrote: Next step would be to unmount OST004e, run a full e2fsck, and then check lost+found and/or a regular "find /mnt/ost -type f -size +1M" or similar to find where the files are. Thanks. e2fsck returns clean (on

Re: [lustre-discuss] Full OST

2021-09-16 Thread Alastair Basden
or until you get the older software replaced (or change your clients to operate the old way). -Cory On 9/16/21, 3:45 AM, "lustre-discuss on behalf of Alastair Basden" wrote: Hi all, We mounted as ext4, removed the files, and then remounted as lustre (and did the l

Re: [lustre-discuss] Full OST

2021-09-16 Thread Alastair Basden
). Cheers, Alastair. On Thu, 9 Sep 2021, Andreas Dilger wrote: On Sep 8, 2021, at 04:42, Alastair Basden <a.g.bas...@durham.ac.uk> wrote: Next step would be to unmount OST004e, run a full e2fsck, and then check lost+found and/or a regular "find /mnt/

Re: [lustre-discuss] Full OST

2021-09-08 Thread Alastair Basden
Next step would be to unmount OST004e, run a full e2fsck, and then check lost+found and/or a regular "find /mnt/ost -type f -size +1M" or similar to find where the files are. Thanks. e2fsck returns clean (on its own, with -p and with -f). Now, the find command does return a large number
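
The sequence under discussion, as a sketch (device path is a placeholder; the forced e2fsck needs the OST unmounted):

    umount /mnt/ost004e
    e2fsck -f -p /dev/ost004e              # full forced pass; came back clean here
    mount -t ldiskfs /dev/ost004e /mnt/tmp
    ls /mnt/tmp/lost+found/                # anything e2fsck reattached
    find /mnt/tmp -type f -size +1M        # where the space is actually going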

Re: [lustre-discuss] Full OST

2021-09-06 Thread Alastair Basden
dir. I think Andreas is referring especially to directories '0', '1' and '10' in your output. Try looking into them, you should see multiple 'dXX' directories with objects in them. Aurélien On 06/09/2021 10:12, "Alastair Basden" wrote:

Re: [lustre-discuss] Full OST

2021-09-06 Thread Alastair Basden
ose objects. Cheers, Andreas On Sep 4, 2021, at 00:54, Alastair Basden wrote: Ah, of course - has to be done on a client. None of these files are on the dodgy OST. Any further suggestions? Essentially we have what seems to be a full OST with nothing on it. Thanks, Alastair. On Sat, 4 Se

Re: [lustre-discuss] Full OST

2021-09-04 Thread Alastair Basden
, 2021, at 14:51, Alastair Basden <a.g.bas...@durham.ac.uk> wrote: Hi, lctl get_param mdt.*.exports.*.open_files returns:
mdt.snap8-MDT.exports.172.18.180.21@o2ib.open_files= [0x2b90e:0x10aa:0x0]
mdt.snap8-MDT.exports.172.18.180.22@o2ib.open_files= [0x2b90e:0x21
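
Each entry in open_files is a FID, and on a client those map back to pathnames; a sketch using the first FID from the output above:

    # run on a client with /snap8 mounted
    lfs fid2path /snap8 '[0x2b90e:0x10aa:0x0]'
    # if no path comes back, the file is open but already unlinked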

Re: [lustre-discuss] Full OST

2021-09-03 Thread Alastair Basden
ien On 03/09/2021 09:50, "lustre-discuss on behalf of Alastair Basden" <lustre-discuss-boun...@lists.lustre.org on behalf of a.g.bas...@durham.ac.uk> wrote:

Re: [lustre-discuss] Full OST

2021-09-03 Thread Alastair Basden
, this is due to an open-unlinked file, typically a log file which is still in use and some processes keep writing to it until it fills the OSTs it is using. Look for such files on your clients (use lsof). Aurélien On 03/09/2021 09:50, "lustre-discuss on behalf of Alastair Basden" wrote
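
A quick way to find such open-unlinked files on each client; +L1 selects open files whose link count has dropped below one:

    lsof +L1 /snap8    # NLINK column reads 0; COMMAND/PID identify the writer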

[lustre-discuss] Full OST

2021-09-03 Thread Alastair Basden
Hi, We have a file system where each OST is a single SSD. One of those is reporting as 100% full (lfs df -h /snap8):
snap8-OST004d_UUID  5.8T  2.0T  3.5T   37% /snap8[OST:77]
snap8-OST004e_UUID  5.8T  5.5T  7.5G  100% /snap8[OST:78]
snap8-OST004f_UUID
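
To see which files actually have objects on the full OST, a client-side search by OST index works; index 78 here, matching OST004e (0x4e):

    lfs find /snap8 --ost 78    # lists files with at least one stripe on OST004e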

Re: [lustre-discuss] OST not being used

2021-06-23 Thread Alastair Basden
Hi Megan, Thanks - yes, lctl ping responds. In the end, we did a writeconf, and this seems to have fixed the problem, so it was probably some earlier transient issue. I would however have expected it to heal whilst online - taking the filesystem down and doing a writeconf seems a bit drastic! Cheers,
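
For anyone hitting the same symptom, the checks mentioned plus the eventual (drastic) fix, sketched with placeholder names:

    lctl ping 172.18.100.5@o2ib    # LNet reachability of the OSS (responded here)
    lctl dl                        # local device list; targets should be UP
    # The eventual fix was a writeconf, which is filesystem-wide:
    # unmount clients and all targets, run tunefs.lustre --writeconf
    # on every target, then remount.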

Re: [lustre-discuss] OST not being used

2021-06-21 Thread Alastair Basden
Hi Megan, all, Yes, sorry, I should have said. It's 2.12.6. A bit more detail. I can set the stripe index to 0-3 and 8-191, and it works fine. However, when I set the stripe index to 4-7, they all end up on OST 8. It is a system with 192 OSTs and 24 OSSs. These 4 OSTs are all served on

[lustre-discuss] OST not being used

2021-06-21 Thread Alastair Basden
Hi, I'm trying to specify a particular OST to be used with: lfs setstripe --stripe-index 7 myfile.dat However, an lfs getstripe reveals that it hasn't managed to use this OST:
myfile.dat
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:       raid0
lmm_layout_gen:    0
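
For reference, pinning a file to a single OST and verifying it, as attempted above; note that setstripe creates the file, so it must not already exist:

    lfs setstripe --stripe-index 7 --stripe-count 1 myfile.dat
    lfs getstripe myfile.dat    # lmm_stripe_offset should read 7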

Re: [lustre-discuss] Multiple IB Interfaces

2021-03-12 Thread Alastair Basden via lustre-discuss
Hi all, Thanks for the replies. The issue as I see it is with sending data from an OST to the client, avoiding the inter-CPU link. So, if I have:
cpu1 - IB card 1 (10.0.0.1), nvme1 (OST1)
cpu2 - IB card 2 (10.0.0.2), nvme2 (OST2)
Both IB cards on the same subnet. Therefore, by default,
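
One common approach is to put the two cards on separate LNet networks, so each OST's NID is tied to its socket-local card; a sketch, with interface names assumed:

    # /etc/modprobe.d/lnet.conf on the OSS:
    options lnet networks="o2ib0(ib0),o2ib1(ib1)"
    # or configured at runtime:
    lnetctl net add --net o2ib0 --if ib0
    lnetctl net add --net o2ib1 --if ib1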

[lustre-discuss] Multiple IB interfaces

2021-03-09 Thread Alastair Basden via lustre-discuss
Hi, We are installing some new Lustre servers with 2 InfiniBand cards, 1 attached to each CPU socket. Storage is nvme, again, some drives attached to each socket. We want to ensure that data to/from each drive uses the appropriate IB card, and doesn't need to travel through the inter-cpu

[lustre-discuss] Help mounting MDT

2020-10-05 Thread Alastair Basden
Hi all, We are having a problem mounting an ldiskfs MDT. The mount command is hanging, with /var/log/messages containing:
Oct 5 16:26:17 c6mds1 kernel: INFO: task mount.lustre:4285 blocked for more than 120 seconds.
Oct 5 16:26:17 c6mds1 kernel: "echo 0 >
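
When mount.lustre sits in D-state like this, the kernel's blocked-task dump usually shows where it is stuck; a sketch:

    echo 1 > /proc/sys/kernel/sysrq      # make sure sysrq is enabled
    echo w > /proc/sysrq-trigger         # dump blocked (D-state) task stacks
    dmesg | grep -B2 -A25 mount.lustre   # the hung mount's stack trace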

Re: [lustre-discuss] Centos 7.7 upgrade

2020-06-02 Thread Alastair Basden
the same errors when the kmod-lustre-osd-zfs package was missing on my system. cheers Pascal On 6/2/20 1:20 AM, Alastair Basden wrote: Hi, We have just upgraded Lustre servers from 2.12.2 on centos 7.6 to 2.12.3 on centos 7.7. The OSSs are on top of zfs (0.7.13 as recommended), and we
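
The symptom matches a missing ZFS OSD module, as Pascal suggests; a quick check, assuming RPM packaging as used in the thread:

    rpm -qa | grep -E 'lustre|zfs' | sort   # kmod-lustre-osd-zfs should be listed
    yum install kmod-lustre-osd-zfs         # must match the kernel and lustre builds
    modprobe -v osd_zfs                     # should now load cleanly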

Re: [lustre-discuss] Centos 7.7 upgrade

2020-06-02 Thread Alastair Basden
? --Jeff On Mon, Jun 1, 2020 at 4:21 PM Alastair Basden wrote: Hi, We have just upgraded Lustre servers from 2.12.2 on centos 7.6 to 2.12.3 on centos 7.7. The OSSs are on top of zfs (0.7.13 as recommended), and we are using 3.10.0-1062.1.1.el7_lustre.x86_64 After the update, Lustre

[lustre-discuss] Centos 7.7 upgrade

2020-06-01 Thread Alastair Basden
Hi, We have just upgraded Lustre servers from 2.12.2 on centos 7.6 to 2.12.3 on centos 7.7. The OSSs are on top of zfs (0.7.13 as recommended), and we are using 3.10.0-1062.1.1.el7_lustre.x86_64 After the update, Lustre will no longer mount - and messages such as: Jun 2 00:02:44 hostname