On Oct 1, 2023, at 00:36, Tung-Han Hsieh via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:
> I should apologize for replying late. Here I would like to clarify why in my 
> opinion the Lustre ldiskfs code is not self-contained.
> 
> In the past, to compile Lustre with ldiskfs, we needed to patch the Linux 
> kernel using the patches provided by the Lustre source code. And YES, for 
> recent Lustre versions the necessary patches are fewer, and it is even OK 
> without applying any patches to the Linux kernel.  However, there is another 
> set of patches for the Lustre ldiskfs code: at compile time the Linux kernel 
> ext4 code is copied into the Lustre source tree, the patches provided by 
> Lustre are applied to it, and the result becomes the ldiskfs code.

Hello T.H.,
it is true that Lustre needs to patch the ldiskfs code in order to properly 
integrate with the ext4/jbd2 transaction handling, and to add some features 
that Lustre depends on but that ext4 is lacking.  Ideally we could get these 
features integrated into the upstream ext4 code to avoid the need to patch it, 
or at least minimize the number of patches, but that is often difficult and 
time consuming and doesn't get done as often as anyone would like.

Lustre is an open-source project.  If you are depending on the Linux 5.4 
stable kernel branch for your servers, it would be welcome for you to submit 
patches that update the kernel patch series whenever a new stable kernel 
breaks it, just as other developers maintain the kernel patch series for the 
distros they are interested in (primarily RHEL derivatives, but more recently 
Ubuntu).
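
To make that maintenance concrete, here is roughly where the per-kernel patch 
lists live in the Lustre source tree.  Only the 5.4 series file name and the 
first few patch names are taken from the build log quoted below; the rest of 
the listing is illustrative:

```
# Each supported kernel line has its own series file naming the ext4 patches
# to apply; the patches themselves are grouped by the distro/kernel they were
# written against.  (Output is illustrative, based on the log below.)
$ ls lustre-2.15.3/ldiskfs/kernel_patches/series/
ldiskfs-5.4.136-ml.series  ...        # "-ml" series target vanilla kernels
$ head -3 lustre-2.15.3/ldiskfs/kernel_patches/series/ldiskfs-5.4.136-ml.series
rhel8/ext4-inode-version.patch
linux-5.4/ext4-lookup-dotdot.patch
suse15/ext4-print-inum-in-htree-warning.patch
```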

> Here is the compilation log showing the patching procedure. To compile 
> lustre-2.15.3 with Linux-5.4.135, we run
> 
> ./configure --prefix=/opt/lustre --with-linux=/usr/src/linux-5.4.135 
> --with-o2ib=no --with-ldiskfsprogs=/opt/e2fs --enable-mpitests=no 
> 
> followed by "make". The log reads:
> 
> ====================================================================================================
> rm -rf linux-stage linux sources trace
> mkdir -p linux-stage/fs/ext4 linux-stage/include/linux \
>          linux-stage/include/trace/events
> cp /usr/src/linux-5.4.135/fs/ext4/file.c /usr/src/linux-5.4.135/fs/ext4/ioctl.c /usr/src/linux-5.4.135/fs/ext4/dir.c .... linux-stage/fs/ext4
> if test -n "" ; then \
>         cp  linux-stage/include/linux; \
> fi
> if test -n "/usr/src/linux-5.4.135/include/trace/events/ext4.h" ; then \
>         cp /usr/src/linux-5.4.135/include/trace/events/ext4.h 
> linux-stage/include/trace/events; \
> fi
> ln -s ../../ldiskfs/kernel_patches/patches linux-stage/patches
> ln -s ../../ldiskfs/kernel_patches/series/ldiskfs-5.4.136-ml.series linux-stage/series
> cd linux-stage && quilt push -a -q
> Applying patch patches/rhel8/ext4-inode-version.patch
> Applying patch patches/linux-5.4/ext4-lookup-dotdot.patch
> Applying patch patches/suse15/ext4-print-inum-in-htree-warning.patch
> Applying patch patches/rhel8/ext4-prealloc.patch
> Applying patch patches/ubuntu18/ext4-osd-iop-common.patch
> Applying patch patches/ubuntu19/ext4-misc.patch
> Applying patch patches/rhel8/ext4-mballoc-extra-checks.patch
> Applying patch patches/linux-5.4/ext4-hash-indexed-dir-dotdot-update.patch
> Applying patch patches/linux-5.4/ext4-kill-dx-root.patch
> Applying patch patches/rhel7.6/ext4-mballoc-pa-free-mismatch.patch
> Applying patch patches/linux-5.4/ext4-data-in-dirent.patch
> Applying patch patches/rhel8/ext4-nocmtime.patch
> Applying patch patches/base/ext4-htree-lock.patch
> Applying patch patches/linux-5.4/ext4-pdirop.patch
> Applying patch patches/rhel8/ext4-max-dir-size.patch
> Applying patch patches/rhel8/ext4-corrupted-inode-block-bitmaps-handling-patches.patch
> Applying patch patches/linux-5.4/ext4-give-warning-with-dir-htree-growing.patch
> Applying patch patches/ubuntu18/ext4-jcb-optimization.patch
> Applying patch patches/linux-5.4/ext4-attach-jinode-in-writepages.patch
> Applying patch patches/rhel8/ext4-dont-check-before-replay.patch
> Applying patch patches/rhel7.6/ext4-use-GFP_NOFS-in-ext4_inode_attach_jinode.patch
> Applying patch patches/rhel7.6/ext4-export-orphan-add.patch
> Applying patch patches/rhel8/ext4-export-mb-stream-allocator-variables.patch
> Applying patch patches/ubuntu19/ext4-iget-with-flags.patch
> Applying patch patches/linux-5.4/export-ext4fs-dirhash-helper.patch
> Applying patch patches/linux-5.4/ext4-misc.patch
> Applying patch patches/linux-5.4/ext4-simple-blockalloc.patch
> Applying patch patches/linux-5.4/ext4-xattr-disable-credits-check.patch
> Applying patch patches/base/ext4-no-max-dir-size-limit-for-iam-objects.patch
> Applying patch patches/rhel8/ext4-ialloc-uid-gid-and-pass-owner-down.patch
> Applying patch patches/base/ext4-projid-xattrs.patch
> Applying patch patches/linux-5.4/ext4-enc-flag.patch
> Applying patch patches/base/ext4-delayed-iput.patch
> Now at patch patches/base/ext4-delayed-iput.patch
> ====================================================================================================
> 
> If you check the Lustre source code before running "make", the directory 
> "lustre-2.15.3/ldiskfs/linux-stage" does not exist. After running "make", 
> it is created by the above patching procedure.
> 
> Looking into the Lustre code further, there is a directory containing the 
> patches used to create the ldiskfs code:
> 
> lustre-2.15.3/ldiskfs/kernel_patches/series
> 
> Obviously these patches depend heavily on the version of the Linux kernel. We 
> have to balance the requirements of driving our hardware, staying on a 
> long-term maintained Linux kernel (see https://www.kernel.org/), and 
> compatibility with Lustre. For now, we are lucky that vanilla Linux-5.4.135 
> fulfills all of the above requirements. But if Linux-5.4.135 had bugs driving 
> our fiber storage device and we had to change to, say, Linux-5.4.150 or 
> later, then we would get into trouble. This rare situation has happened 
> before.

In the majority of cases, if a patch series fails to apply after a maintenance 
update (e.g. 5.4.135 to 5.4.150), this is usually due to trivial conflicts in 
the patches that can be resolved with a small edit to the patch or the patched 
file, with very little need to understand the code details.  See 
ldiskfs/README for details.
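
As a rough illustration (not a substitute for the README), resolving such a 
conflict with quilt might look like the sketch below.  The failing file name 
is a made-up placeholder, and the linux-stage tree is the one created by the 
build step shown in the log above:

```
# Hypothetical example: rebase the vanilla-5.4 ldiskfs series after a kernel
# maintenance update.  quilt stops at the first patch that no longer applies.
cd lustre-2.15.3/ldiskfs/linux-stage
quilt push -a

# Force-apply the failing patch so the conflicting hunks are written out as
# *.rej files, fix those hunks by hand, then regenerate the patch in place
# (patches/ is a symlink back into ldiskfs/kernel_patches/patches/).
quilt push -f
vi fs/ext4/some-conflicting-file.c      # placeholder: resolve the rejects
quilt refresh

# Apply the remainder of the series.
quilt push -a
```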

Alternatively, you might consider using a kernel that is more commonly used on 
Lustre servers.  For example, the Ubuntu20 patch series seems to cover a 
similar kernel version and looks like it is updated more regularly than the 
vanilla kernel patches.

> So it would be much appreciated if, in a future release, the ldiskfs code 
> could be self-contained, similar to ZFS, instead of being copied from the 
> Linux kernel and patched into ldiskfs code at build time. That way, ldiskfs 
> would be less dependent on the minor revision of the Linux kernel, giving us 
> more freedom to choose a suitable Linux kernel version for our applications.

We've thought about this in the past - to provide a pre-patched ldiskfs source 
tree with Lustre, but then the same complexity exists in a different form.  The 
kernel APIs are not "static" and the common ldiskfs code would need to have 
conditional build support for a wide range of kernel versions.  It would need 
to be continually updated to include all of the patches in the upstream ext4 
code.  It would also make tracking and merging of ext4/ldiskfs patches into 
upstream Linux *much* more complex, because the changes would no longer be 
available as separate patches, but would need to be extracted out of the 
patched ldiskfs source tree itself.

As I mentioned above, for "maintenance" kernel updates, the changes needed to 
apply patches are usually trivial, and if that work needs to be done by someone 
then it would benefit everyone in the community to submit the patches to Gerrit 
to avoid the problem for the next user.  That is how an open-source community 
works.
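
For anyone who wants to do that, a minimal sketch of the submission flow is 
below.  It assumes the standard Gerrit setup used by the project 
(review.whamcloud.com); the branch name, ticket number, and commit message are 
placeholders, and the contribution documentation on the Lustre wiki describes 
the full required commit-message format:

```
# Hypothetical example of sending a refreshed kernel patch series to Gerrit.
git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
git checkout -b ldiskfs-5.4-refresh master

# ...edit ldiskfs/kernel_patches/... as needed, then commit with a LU ticket.
git commit -as -m "LU-XXXXX ldiskfs: refresh vanilla 5.4 patch series"

# Push the change for review (requires a registered Gerrit account/SSH key).
git push ssh://<username>@review.whamcloud.com:29418/fs/lustre-release \
    HEAD:refs/for/master
```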

Cheers, Andreas 

> On Thu, Sep 28, 2023 at 2:02 AM, Audet, Martin <martin.au...@cnrc-nrc.gc.ca> wrote:
>> Hello Hsieh,
>> 
>> Thanks for sharing your experience in response to my request. Thanks also 
>> for the clarification concerning --writeconf. Since I didn't see that in the 
>> documentation for major updates, I thought I missed something. Let's hope 
>> that others will follow and that more information becomes available.
>> 
>> From your message I noticed that you mentioned that compiling Lustre with 
>> ldiskfs as a back end is a nightmare. You argue that the Lustre code is not 
>> self-contained and essentially consists of a set of patches over the kernel 
>> ext4 sources. While I don't contest this, I have noticed that the number and 
>> volume of patches fortunately tend to decrease over time from RHEL 7 to 8 and 
>> finally to 9. With RHEL 9.2 I think there are only two patches remaining. 
>> There is also the patchless option for servers. I don't know if you have any 
>> experience with that. Older documentation warned about a performance penalty 
>> when taking advantage of this feature...
>> 
>> From what you describe I see that your systems are way larger than ours. As 
>> I said, our cluster is small and our Lustre installation is minimal (24 
>> compute nodes each with two 12-core CPUs, Infiniband EDR, 192 GB RAM per 
>> node, and one similarly configured head node and a file server node; the 
>> file server node hosts the Lustre MGS, MDS and OSS servers and is connected 
>> to two RAID-6 arrays each consisting of six 8 TB NLSAS HDDs for data and one 
>> RAID-1 array of two 480 GB SATA SSDs for metadata). We use Lustre essentially 
>> to have a consistent, "relatively large and fast" file system able to do 
>> MPI-IO correctly. Our system is sufficient for our small team. As I said 
>> before, we plan in the next weeks to switch from CentOS 7.9 with Lustre 
>> 2.12.9 and MOFED 4.9 LTS to Lustre 2.15.3, MOFED 5.8 LTS with Alma/RHEL 9.2 
>> on the head and compute nodes and Alma/RHEL 8.8 on the file server (Lustre) node.
>> 
>> Let's hope we don't hit too many problems during this large update !
>> 
>> Thanks,
>> Martin Audet 
>> 
>> 
>> P.S. Why do you want to stay with MOFED 4.9? MOFED 4.9 includes UCX 1.8.0, 
>> which is considered problematic (e.g. memory corruption) by OpenMPI versions 
>> newer than 4.1.0 (they refuse to compile against it; fortunately, to my 
>> knowledge, we never encountered this problem). More recent MOFED releases 
>> also come bundled with xpmem, which is supposed to accelerate intra-node 
>> communication (a zero-copy mechanism) much better than CMA or KNEM (and 
>> xpmem is supported by both OpenMPI and MPICH).
>> 
>> From: Tung-Han Hsieh <tunghan.hs...@gmail.com>
>>> Sent: September 26, 2023 9:57 PM
>>> To: Audet, Martin
>>> Cc: lustre-discuss@lists.lustre.org
>>> Subject: Re: [lustre-discuss] Cannot mount MDT after upgrading from Lustre 
>>> 2.12.6 to 2.15.3
>>> 
>>> OK. Let me provide more details about the upgrade from 2.12.X to 2.15.X.
>>> 
>>> We have several production clusters. Due to different requirements and 
>>> hardware, each has a slightly different Lustre configuration. Our Linux 
>>> distribution is Debian. Sometimes, due to hardware specs, we need to use 
>>> different Linux kernel versions. So we always have to compile the kernel 
>>> (vanilla Linux kernel), MLNX_OFED, e2fsprogs / ZFS, and Lustre ourselves. 
>>> Here I will skip the compilation (which involves a lot of details) and just 
>>> describe the upgrade procedure for two different cases.
>>> 
>>> 1. Two of our clusters have Lustre-2.12.6, with both the MDT and the OSTs 
>>> on a ZFS backend, where ZFS is 0.7.13. From the compatibility list in the 
>>> documentation, the upgrade involves both a ZFS upgrade to 2.0.7 and a 
>>> Lustre upgrade to 2.15.X. It was very smooth and successful. The procedure 
>>> is:
>>>     1) Compile and install the new versions of the Linux kernel, MLNX_OFED, 
>>> ZFS, and Lustre. Depending on the hardware, some servers use Linux kernel 
>>> 4.19.X + MLNX_OFED-4.6, or Linux kernel 4.19.X + MLNX_OFED-4.9, or Linux 
>>> kernel 5.4.X + MLNX_OFED-4.9. They are all compatible with Lustre-2.15.X 
>>> and ZFS-2.0.7.
>>>     2) After installing the new software, start up ZFS first:
>>> ```
>>> modprobe zfs
>>> zpool import <lustre_pool_name>
>>> zpool status <lustre_pool_name>
>>> ```
>>>         There will be messages saying that the ZFS pool needs to be 
>>> upgraded to enable new features. Note that after the upgrade, the ZFS pool 
>>> will no longer be compatible with the old ZFS version. The command to 
>>> upgrade the pools is (for all the MDT and OST pools):
>>> ```
>>> zpool upgrade <lustre_pool_name>
>>> zpool status <lustre_pool_name>
>>> ```
>>>         Checking zpool status again, the warning messages are gone.
>>>     3) I checked the Lustre Operations Manual again and saw that the recent 
>>> version says it is not necessary to run the tunefs.lustre --writeconf 
>>> command. Sorry, that was my fault. But in our case, we run this command 
>>> whenever we do a Lustre upgrade. Note that it has to be run on all of the 
>>> Lustre MDT and OST pools. It clears out the configuration logs, which are 
>>> regenerated when Lustre is mounted.
>>> 
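For reference, a minimal sketch of the writeconf sequence described above 
(device paths, mount points and the target layout are placeholders; the 
Lustre Operations Manual is the authoritative reference):

```
# Hypothetical example of regenerating the Lustre configuration logs.
# All targets must be unmounted before running --writeconf on any of them.
umount /mnt/mdt                         # on the MDS
umount /mnt/ost0                        # on each OSS, for every OST

tunefs.lustre --writeconf /dev/mdtdev   # on the MDS (MGS/MDT)
tunefs.lustre --writeconf /dev/ostdev   # on each OSS, for every OST

# Remount with the MGS/MDT first, then the OSTs; the logs are regenerated.
mount -t lustre /dev/mdtdev /mnt/mdt
mount -t lustre /dev/ostdev /mnt/ost0
```
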
>>> 2. One of our clusters has Lustre-2.12.6, with the MDT on an ldiskfs 
>>> backend and the OSTs on a ZFS backend, where ZFS is 0.7.13. We had to 
>>> upgrade it to e2fsprogs-1.47.0, ZFS-2.0.7 and Lustre-2.15.3. The upgrade of 
>>> the OST part is exactly the same as above, so I won't repeat it. The major 
>>> challenge is the MDT with ldiskfs. What I did was:
>>>     1) After installing all the new versions of the software, run 
>>> tunefs.lustre --writeconf (for all MDTs and OSTs). This was probably the 
>>> wrong step for the upgrade to Lustre-2.15.X.
>>>     2) According to the Lustre Operations Manual chapter 17, to upgrade 
>>> Lustre from 2.13.0 or earlier to 2.15.X, we should run
>>> tune2fs -O ea_inode /dev/mdtdev
>>> After that, as I posted earlier, we encountered the problem mounting the 
>>> MDT. We then cured the problem by following section 18 of the Lustre 
>>> Operations Manual.
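One way to sanity-check that step before mounting is to look at the ldiskfs 
feature list with dumpe2fs; this is a sketch rather than part of the original 
procedure, and the device name is a placeholder:

```
# Verify that the ea_inode (large xattr) feature is now enabled on the MDT.
tune2fs -O ea_inode /dev/mdtdev
dumpe2fs -h /dev/mdtdev | grep -i 'Filesystem features'
# The printed feature list should now include "ea_inode".
```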
>>> 
>>> My personal suggestions are:
>>> 1. In the future, before doing a major version upgrade of our production 
>>> systems (say, 2.12.X to 2.15.X, or 2.15.X to 2.16 or later), I will set up 
>>> a small testing system, install exactly the same software as the production 
>>> system, and test the upgrade, to make sure that every step is correct. We 
>>> did this for upgrading Lustre with the ZFS backend, but this time, due to 
>>> time pressure, we skipped this step for upgrading Lustre with the ldiskfs 
>>> backend. I think that no matter the situation, it is still worth doing this 
>>> step in order to avoid any risk.
>>> 
>>> 2. Currently, compiling Lustre with the ldiskfs backend is still a 
>>> nightmare. The ldiskfs code is not a self-contained, stand-alone code base. 
>>> It actually copies code from the kernel ext4 sources, applies a lot of 
>>> patches, and then does the compilation, all on the fly. So we have to be 
>>> very careful to select a Linux kernel that is compatible with both our 
>>> hardware and the Lustre version. The ZFS backend is much cleaner: it is a 
>>> stand-alone, self-contained code base, and we don't need to apply patches 
>>> on the fly. So I would like to suggest that the Lustre developers consider 
>>> making ldiskfs a stand-alone, self-contained code base in a future release. 
>>> That would bring us a lot of convenience.
>>> 
>>> I hope that the above experiences can be useful to our community.
>>> 
>>> P.S. Lustre Operations Manual: 
>>> https://doc.lustre.org/lustre_manual.xhtml#Upgrading_2.x
>>> 
>>> Best Regards,
>>> 
>>> T.H.Hsieh
>>> 
>>> On Wed, Sep 27, 2023 at 3:44 AM, Audet, Martin <martin.au...@cnrc-nrc.gc.ca> wrote:
>>>> Hello all,
>>>> 
>>>> I would appreciate it if the community would give more attention to this 
>>>> issue, because upgrading from 2.12.x to 2.15.x, two LTS versions, is 
>>>> something that we can expect many cluster admins will try to do in the 
>>>> next few months...
>>>> 
>>>> We ourselves plan to upgrade a small Lustre (production) system from 
>>>> 2.12.9 to 2.15.3 in the next couple of weeks...
>>>> 
>>>> After seeing problem reports like this, we are starting to feel a bit 
>>>> nervous...
>>>> 
>>>> The documentation for doing this major update appears to me to be not 
>>>> very specific...
>>>> 
>>>> In this document, for example, 
>>>> https://doc.lustre.org/lustre_manual.xhtml#upgradinglustre , the update 
>>>> process appears not so difficult and there is no mention of using 
>>>> "tunefs.lustre --writeconf" for this kind of update.
>>>> 
>>>> Or am I missing something?
>>>> 
>>>> Thanks in advance for providing more tips for this kind of update.
>>>> 
>>>> Martin Audet
>>>> 
>>>> From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf 
>>>> of Tung-Han Hsieh via lustre-discuss <lustre-discuss@lists.lustre.org>
>>>>> Sent: September 23, 2023 2:20 PM
>>>>> To: lustre-discuss@lists.lustre.org
>>>>> Subject: [lustre-discuss] Cannot mount MDT after upgrading from Lustre 
>>>>> 2.12.6 to 2.15.3
>>>>> 
>>>>> Dear All,
>>>>> 
>>>>> Today we tried to upgrade the Lustre file system from version 2.12.6 to 
>>>>> 2.15.3. But after the work, we cannot mount the MDT successfully. Our MDT 
>>>>> has an ldiskfs backend. The upgrade procedure was:
>>>>> 
>>>>> 1. Install the new version of e2fsprogs-1.47.0
>>>>> 2. Install Lustre-2.15.3
>>>>> 3. After reboot, run: tunefs.lustre --writeconf /dev/md0
>>>>> 
>>>>> Then when mounting the MDT, we got these error messages in dmesg:
>>>>> 
>>>>> ===========================================================
>>>>> [11662.434724] LDISKFS-fs (md0): mounted filesystem with ordered data 
>>>>> mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
>>>>> [11662.584593] Lustre: 3440:0:(scrub.c:189:scrub_file_load()) 
>>>>> chome-MDT0000: reset scrub OI count for format change (LU-16655)
>>>>> [11666.036253] Lustre: MGS: Logs for fs chome were removed by user 
>>>>> request.  All servers must be restarted in order to regenerate the logs: 
>>>>> rc = 0
>>>>> [11666.523144] Lustre: chome-MDT0000: Imperative Recovery not enabled, 
>>>>> recovery window 300-900
>>>>> [11666.594098] LustreError: 3440:0:(mdd_device.c:1355:mdd_prepare()) 
>>>>> chome-MDD0000: get default LMV of root failed: rc = -2
>>>>> [11666.594291] LustreError: 
>>>>> 3440:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start 
>>>>> targets: -2
>>>>> [11666.594951] Lustre: Failing over chome-MDT0000
>>>>> [11672.868438] Lustre: 3440:0:(client.c:2295:ptlrpc_expire_one_request()) 
>>>>> @@@ Request sent has timed out for slow reply: [sent 1695492248/real 
>>>>> 1695492248]  req@000000005dfd9b53 x1777852464760768/t0(0) 
>>>>> o251->MGC192.168.32.240@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 
>>>>> 1695492254 ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
>>>>> [11672.925905] Lustre: server umount chome-MDT0000 complete
>>>>> [11672.926036] LustreError: 3440:0:(super25.c:183:lustre_fill_super()) 
>>>>> llite: Unable to mount <unknown>: rc = -2
>>>>> [11872.893970] LDISKFS-fs (md0): mounted filesystem with ordered data 
>>>>> mode. Opts: (null)
>>>>> ============================================================
>>>>> 
>>>>> Could anyone help solve this problem? Sorry, it is really urgent.
>>>>> 
>>>>> Thank you very much.
>>>>> 
>>>>> T.H.Hsieh
>>>>> 

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
