You can check the ashift of the zpool via "zpool get all | grep ashift".  If 
this differs between the two pools, it will make a huge difference in space 
usage.  There are a number of ZFS articles that discuss this; it isn't 
specific to Lustre.
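
For example (a sketch only; the pool and dataset names mdt0000 and 
mdt0000/mdt0000 follow the output quoted below and may differ on your 
servers), you could compare the two servers with:

# zpool get ashift mdt0000                 # or: zpool get all | grep ashift
# zfs get recordsize mdt0000/mdt0000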

Also, RAID-Z2 is going to have much more space overhead for the MDT than 
mirroring, because the MDT is almost entirely small blocks.  Normally the MDT 
uses mirrored VDEVs.

The reason is that RAID-Z2 stores two parity sectors per data stripe vs. a 
single extra mirror copy per data block, so if all data blocks are 4KB that 
doubles the parity overhead compared to mirroring.  Secondly, depending on the 
geometry, RAID-Z2 needs padding sectors to align the variable-width RAID-Z 
stripes, which mirrors do not.

For large files/blocks RAID-Z2 is better, but that isn't the workload on the 
MDT unless you are storing DoM files there (e.g. 64KB or larger).
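
As a rough sketch (illustrative numbers only, assuming ashift=12, i.e. 4KB 
sectors, and 2-way mirrors), the per-block cost works out to:

  mirror:  4KB data + 1 x 4KB copy   =  8KB on disk per 4KB block (100% overhead)
  RAID-Z2: 4KB data + 2 x 4KB parity = 12KB on disk per 4KB block (200% overhead)

(Three 4KB sectors already happen to be a multiple of nparity+1 = 3, so there 
are no extra padding sectors in this particular case; other block sizes do get 
padded.)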

Cheers, Andreas

On Nov 11, 2019, at 13:48, Hebenstreit, Michael <michael.hebenstr...@intel.com> wrote:

Recordsize/ashift: in both cases default values were used (but on different 
versions of Lustre).  How can I check the actual recordsize/ashift values on 
both filesystems so I can compare them?

The zpool layout is quite different though.  The bad filesystem uses a simple raidz2:

          raidz2-0  ONLINE       0     0     0
            sdd     ONLINE       0     0     0
….

errors: No known data errors

The good filesystem uses mirrored vdevs (mirror-0 through mirror-10):

        NAME         STATE     READ WRITE CKSUM
        mdt0000      ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            sdd      ONLINE       0     0     0
            sde      ONLINE       0     0     0
          mirror-1   ONLINE       0     0     0
            sdf      ONLINE       0     0     0
            sdg      ONLINE       0     0     0
          mirror-2   ONLINE       0     0     0
            sdh      ONLINE       0     0     0
            sdi      ONLINE       0     0     0
          mirror-3   ONLINE       0     0     0
            sdj      ONLINE       0     0     0
            sdk      ONLINE       0     0     0
          mirror-4   ONLINE       0     0     0
            sdl      ONLINE       0     0     0
            sdm      ONLINE       0     0     0
          mirror-5   ONLINE       0     0     0
            sdn      ONLINE       0     0     0
            sdo      ONLINE       0     0     0
          mirror-6   ONLINE       0     0     0
            sdp      ONLINE       0     0     0
            sdq      ONLINE       0     0     0
          mirror-7   ONLINE       0     0     0
            sdr      ONLINE       0     0     0
            sds      ONLINE       0     0     0
          mirror-8   ONLINE       0     0     0
            sdt      ONLINE       0     0     0
            sdu      ONLINE       0     0     0
          mirror-9   ONLINE       0     0     0
            sdv      ONLINE       0     0     0
            sdw      ONLINE       0     0     0
          mirror-10  ONLINE       0     0     0
            sdx      ONLINE       0     0     0
            sdy      ONLINE       0     0     0

thanks
Michael

From: Andreas Dilger <adil...@whamcloud.com>
Sent: Monday, November 11, 2019 14:42
To: Hebenstreit, Michael <michael.hebenstr...@intel.com>
Cc: Mohr Jr, Richard Frank <rm...@utk.edu>; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] changing inode size on MDT

There isn't really enough information to make any kind of real analysis.


My guess would be that you are using a larger ZFS recordsize or ashift on the 
new filesystem, or the RAID config is different?

Cheers, Andreas

On Nov 7, 2019, at 08:45, Hebenstreit, Michael <michael.hebenstr...@intel.com> wrote:
So we went ahead and used the FS, using rsync to duplicate the existing FS.  
The available inodes on the NEW MDT (which is almost twice the size of the 
other MDT) are dropping rapidly and are now LESS than on the smaller MDT (even 
though the sync is only 90% complete).  Both filesystems run almost identical 
Lustre 2.10 versions.  I can no longer say which ZFS version was used to 
format the good FS.

Any ideas why those 2 MDTs behave so differently?

old GOOD FS:
# df -i
mgt/mgt           81718714       205   81718509    1% /lfs/lfsarc02/mgt
mdt0000/mdt0000  458995000 130510339  328484661   29% /lfs/lfsarc02/mdt
# df -h
mgt/mgt          427G  7.0M  427G   1% /lfs/lfsarc02/mgt
mdt0000/mdt0000  4.6T  1.4T  3.3T  29% /lfs/lfsarc02/mdt
# rpm -q -a | grep zfs
libzfs2-0.7.9-1.el7.x86_64
lustre-osd-zfs-mount-2.10.4-1.el7.x86_64
lustre-zfs-dkms-2.10.4-1.el7.noarch
zfs-0.7.9-1.el7.x86_64
zfs-dkms-0.7.9-1.el7.noarch

new BAD FS:
# df -ih
mgt/mgt            83M   169   83M    1% /lfs/lfsarc01/mgt
mdt0000/mdt0000   297M  122M  175M   42% /lfs/lfsarc01/mdt
# df -h
mgt/mgt          427G  5.8M  427G   1% /lfs/lfsarc01/mgt
mdt0000/mdt0000  8.2T  3.4T  4.9T  41% /lfs/lfsarc01/mdt
# rpm -q -a | grep zfs
libzfs2-0.7.9-1.el7.x86_64
lustre-osd-zfs-mount-2.10.8-1.el7.x86_64
lustre-zfs-dkms-2.10.8-1.el7.noarch
zfs-0.7.9-1.el7.x86_64
zfs-dkms-0.7.9-1.el7.noarch

From: Andreas Dilger <adil...@whamcloud.com>
Sent: Thursday, October 03, 2019 20:38
To: Hebenstreit, Michael <michael.hebenstr...@intel.com>
Cc: Mohr Jr, Richard Frank <rm...@utk.edu>; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] changing inode size on MDT

On Oct 3, 2019, at 20:09, Hebenstreit, Michael <michael.hebenstr...@intel.com> wrote:

So bottom line – don’t change the default values, it won’t get better?

Like I wrote previously, there *are* no default/tunable values to change for 
ZFS.  The tunables are only for ldiskfs, which statically allocates everything, 
but that will cause problems if you guess incorrectly at the instant you format 
the filesystem.

The number reported by raw ZFS and by Lustre-on-ZFS is just an estimate, and 
you will (essentially) run out of inodes once you run out of space on the MDT 
or all OSTs.  And I didn't say "it won't get better"; rather, I said the 
estimate _will_ get better once you actually start using the filesystem.

If the estimated 2-3B inodes on the MDT are insufficient, you can always add 
another (presumably mirrored) VDEV to the MDT pool, or add a new MDT to the 
filesystem to increase the number of inodes available.
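
For example (a sketch only; the pool name and device names below are 
hypothetical placeholders), growing an existing ZFS MDT pool by one more 
mirror pair looks like:

# zpool add mdt0000 mirror /dev/sdz /dev/sdaa
# zpool list mdt0000      # confirm the larger pool size
# zpool status mdt0000    # confirm the new mirror vdev

Lustre should then report the additional space (and a correspondingly higher 
inode estimate) once ZFS does.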

Cheers, Andreas




From: Andreas Dilger <adil...@whamcloud.com>
Sent: Thursday, October 03, 2019 19:38
To: Hebenstreit, Michael <michael.hebenstr...@intel.com>
Cc: Mohr Jr, Richard Frank <rm...@utk.edu>; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] changing inode size on MDT

On Oct 3, 2019, at 05:03, Hebenstreit, Michael <michael.hebenstr...@intel.com> wrote:

So you are saying that on a ZFS-based Lustre there is no way to increase the 
number of available inodes?  I have an 8TB MDT with roughly 17G inodes:

[root@elfsa1m1 ~]# df -h
Filesystem       Size  Used Avail Use% Mounted on
mdt0000          8.3T  256K  8.3T   1% /mdt0000

[root@elfsa1m1 ~]# df -i
Filesystem           Inodes  IUsed       IFree IUse% Mounted on
mdt0000         17678817874      6 17678817868    1% /mdt0000

For ZFS the only way to increase inodes on the *MDT* is to increase the size of 
the MDT, though more on that below.  Note that the "number of inodes" reported 
by ZFS is an estimate based on the currently-allocated blocks and inodes (i.e. 
bytes_per_inode_ratio = bytes_used / inodes_used, total inode estimate = 
bytes_free / inode_ratio + inodes_used), which becomes more accurate as the MDT 
fills up.  With 17B inodes on an 8TB MDT that is a bytes-per-inode ratio of 
497, which is unrealistically low for Lustre since the MDT always stores 
multiple xattrs on each inode.  Note that the filesystem only has 6 inodes 
allocated, so the ZFS total inode estimate is unrealistically high and will get 
better as more inodes are allocated in the filesystem.
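
To make that concrete (a back-of-the-envelope sketch with illustrative round 
numbers, not measurements): if an MDT had 3.4TB used by 122M allocated inodes, 
then

  bytes_per_inode_ratio = 3.4TB / 122M                      ≈ 28KB per inode
  total inode estimate  = bytes_free / ratio + inodes_used  ≈ 4.9TB / 28KB + 122M ≈ 297M inodes

so the reported total shrinks or grows as the observed bytes-per-inode ratio 
changes.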

Formatting under Lustre 2.10.8:

mkfs.lustre --mdt --backfstype=zfs --fsname=lfsarc01 --index=0 
--mgsnid="36.101.92.22@tcp" --reformat mdt0000/mdt0000

this translates to only 948M inodes on the Lustre FS.

[root@elfsa1m1 ~]# df -i
Filesystem           Inodes  IUsed       IFree IUse% Mounted on
mdt0000         17678817874      6 17678817868    1% /mdt0000
mdt0000/mdt0000   948016092    263   948015829    1% /lfs/lfsarc01/mdt

[root@elfsa1m1 ~]# df -h
Filesystem       Size  Used Avail Use% Mounted on
mdt0000          8.3T  256K  8.3T   1% /mdt0000
mdt0000/mdt0000  8.2T   24M  8.2T   1% /lfs/lfsarc01/mdt

and there is no reasonable option to provide more file entries except for 
adding another MDT?

The Lustre statfs code will factor in some initial estimates for the 
bytes-per-inode ratio when computing the total inode estimate for the 
filesystem.  When the filesystem is nearly empty, as is the case here, those 
initial estimates will dominate, but once you've allocated a few thousand 
inodes in the filesystem the actual values will dominate and you will have a 
much more accurate number for the total inode count.  This will probably be 
more in the range of 2B-4B inodes in the end, unless you also use Data-on-MDT 
(Lustre 2.11 and later) to store small files directly on the MDT.

You've also excluded the OST lines from the above output?  For the Lustre 
filesystem you (typically) also need at least one OST inode (object) for each 
file in the filesystem, possibly more than one, so "df" of the Lustre 
filesystem may also be limited by the number of inodes reported by the OSTs 
(which may themselves depend on the average bytes-per-inode for files stored on 
the OST).  If you use Data-on-MDT and only have small files, then no OST 
objects are needed for those files, but you consume correspondingly more space 
on the MDT.
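
As a sketch (the client mount point /mnt/lustre below is a hypothetical 
placeholder), the per-target inode and space usage is visible from any client 
with:

# lfs df -i /mnt/lustre    # inodes used/free per MDT and OST
# lfs df -h /mnt/lustre    # space used/free per MDT and OST

which makes it easy to see whether the MDT or the OSTs are the limiting side 
of the filesystem-wide "df -i" number.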

Cheers, Andreas


From: Andreas Dilger <adil...@whamcloud.com>
Sent: Wednesday, October 02, 2019 18:49
To: Hebenstreit, Michael <michael.hebenstr...@intel.com>
Cc: Mohr Jr, Richard Frank <rm...@utk.edu>; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] changing inode size on MDT

There are several confusing/misleading comments on this thread that need to be 
clarified...

On Oct 2, 2019, at 13:45, Hebenstreit, Michael <michael.hebenstr...@intel.com> wrote:

http://wiki.lustre.org/Lustre_Tuning#Number_of_Inodes_for_MDS

Note that I've updated this page to reflect current defaults.  The Lustre 
Operations Manual has a much better description of these parameters.


and I'd like to use --mkfsoptions='-i 1024' to have more inodes in the MDT.  We 
already ran out of inodes on that FS (probably due to a ZFS bug in an early 
IEEL version), so I'd like to increase the number of inodes if possible.

The "-i 1024" option (bytes-per-inode ratio) is only needed for ldiskfs since 
it statically allocates the inodes at mkfs time, it is not relevant for ZFS 
since ZFS dynamically allocates inodes and blocks as needed.
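
For completeness, a sketch of how the ratio is passed for an ldiskfs MDT (the 
fsname, index, MGS NID and device here are placeholders, and as noted this has 
no effect on a ZFS MDT):

# mkfs.lustre --mdt --backfstype=ldiskfs --fsname=testfs --index=0 \
      --mgsnid=192.168.1.1@tcp --mkfsoptions='-i 2048' /dev/sdb

where '-i 2048' is the bytes-per-inode *ratio* handed through to mke2fs; see 
the caveats further down before reducing it below the default.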

On Oct 2, 2019, at 14:00, Colin Faber <cfa...@gmail.com> wrote:
With 1K inodes you won't have space to accommodate new features; IIRC the 
current minimum on modern Lustre is 2K now.  If you're running out of MDT space 
you might consider DNE and multiple MDTs to accommodate that larger namespace.

To clarify, since Lustre 2.10 any new ldiskfs MDT will allocate 1024 bytes for 
the inode itself (-I 1024).  That allows enough space *within* the inode to 
efficiently store xattrs for more complex layouts (PFL, FLR, DoM).  If the 
xattrs do not fit inside the inode itself then they will be stored in an 
external 4KB xattr block.

The MDT is formatted with a bytes-per-inode *ratio* of 2.5KB, which means 
(approximately) one inode will be created for every 2.5KB of the total MDT 
size.  That 2.5KB of space includes the 1KB for the inode itself, plus space 
for a directory entry (or multiple if hard-linked), extra xattrs, the journal 
(up to 4GB for large MDTs), Lustre recovery logs, ChangeLogs, etc.  Each 
directory inode will have at least one 4KB block allocated.

So, it is _possible_ to reduce the inode *ratio* below 2.5KB if you know what 
you are doing (e.g. 2KB/inode or 1.5KB/inode; this can be an arbitrary number 
of bytes and doesn't have to be an even multiple of anything), but it 
definitely isn't possible to have a 1KB inode size and a 1KB per-inode ratio, 
as there wouldn't be *any* space left for directories, log files, journal, etc.
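
As a quick sanity check of the arithmetic (illustrative numbers only), an 8TB 
ldiskfs MDT would be formatted with roughly:

  8TB / 2.5KB-per-inode ≈ 3.2B inodes  (the default ratio)
  8TB / 2.0KB-per-inode ≈ 4.0B inodes

which is the scale of change you can expect from tuning the ratio.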

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
