Re: [lustre-discuss] changing inode size on MDT

2019-11-11 Thread Andreas Dilger
You can check the ashift of the zpool via "zpool get all | grep ashift".  If it 
differs between the two pools, that will make a huge difference in space usage. 
There are a number of ZFS articles that discuss this; it isn't specific to Lustre.
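
For example, a minimal comparison to run on the two MDS nodes (the pool/dataset 
name "mdt/mdt" is taken from the outputs later in this thread; adjust to your layout):

# zpool get ashift mdt
# zfs get recordsize mdt/mdt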

Also, RAID-Z2 is going to have much more space overhead for the MDT than 
mirroring, because the MDT is almost entirely small blocks.  Normally the MDT 
is using mirrored VDEVs.

The reason is that RAID-Z2 writes two parity sectors per data stripe vs. a single 
extra mirror copy per data block, so if all data blocks are 4KB that would double 
the parity overhead vs. mirroring. Secondly, depending on the geometry, RAID-Z2 
needs padding sectors to align its variable-width stripes, which mirrors do not.
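
As a rough worked example (assuming ashift=12, i.e. 4KB sectors, and 2-way 
mirrors): a single 4KB metadata block on RAID-Z2 is written as one data sector 
plus two parity sectors, i.e. 12KB allocated (3x the data), and RAID-Z additionally 
rounds each allocation up to a multiple of parity+1 sectors; the same 4KB block on 
a mirror costs two copies, i.e. 8KB (2x the data).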

For large files/blocks RAID-Z2 is better, but that isn't the workload on the 
MDT unless you are storing DoM files there (e.g. 64KB or larger).

Cheers, Andreas

On Nov 11, 2019, at 13:48, Hebenstreit, Michael <michael.hebenstr...@intel.com> wrote:

Recordsize/ashift: in both cases the default values were used (but on different 
versions of Lustre). How can I check the actual recordsize/ashift values on the 
two filesystems to compare them?

zpool mirroring is quite different though – the bad pool is a simple raidz2:

  raidz2-0  ONLINE   0 0 0
sdd ONLINE   0 0 0
….

errors: No known data errors

the good pool uses 11 mirror vdevs:

NAME STATE READ WRITE CKSUM
mdt  ONLINE   0 0 0
  mirror-0   ONLINE   0 0 0
sdd  ONLINE   0 0 0
sde  ONLINE   0 0 0
  mirror-1   ONLINE   0 0 0
sdf  ONLINE   0 0 0
sdg  ONLINE   0 0 0
  mirror-2   ONLINE   0 0 0
sdh  ONLINE   0 0 0
sdi  ONLINE   0 0 0
  mirror-3   ONLINE   0 0 0
sdj  ONLINE   0 0 0
sdk  ONLINE   0 0 0
  mirror-4   ONLINE   0 0 0
sdl  ONLINE   0 0 0
sdm  ONLINE   0 0 0
  mirror-5   ONLINE   0 0 0
sdn  ONLINE   0 0 0
sdo  ONLINE   0 0 0
  mirror-6   ONLINE   0 0 0
sdp  ONLINE   0 0 0
sdq  ONLINE   0 0 0
  mirror-7   ONLINE   0 0 0
sdr  ONLINE   0 0 0
sds  ONLINE   0 0 0
  mirror-8   ONLINE   0 0 0
sdt  ONLINE   0 0 0
sdu  ONLINE   0 0 0
  mirror-9   ONLINE   0 0 0
sdv  ONLINE   0 0 0
sdw  ONLINE   0 0 0
  mirror-10  ONLINE   0 0 0
sdx  ONLINE   0 0 0
sdy  ONLINE   0 0 0

thanks
Michael

From: Andreas Dilger <adil...@whamcloud.com>
Sent: Monday, November 11, 2019 14:42
To: Hebenstreit, Michael <michael.hebenstr...@intel.com>
Cc: Mohr Jr, Richard Frank <rm...@utk.edu>; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] changing inode size on MDT

There isn't really enough information to make any kind of real analysis.


My guess would be that you are using a larger ZFS recordsize or ashift on the 
new filesystem, or the RAID config is different?

Cheers, Andreas

On Nov 7, 2019, at 08:45, Hebenstreit, Michael <michael.hebenstr...@intel.com> wrote:
So we went ahead and used the FS, using rsync to duplicate the existing FS. 
The inodes available on the NEW mdt (which is almost twice the size of the 
second mdt) are dropping rapidly and are now LESS than on the smaller mdt (even 
though the sync is only 90% complete). Both filesystems are running nearly 
identical Lustre 2.10 versions. I can no longer say which ZFS version was used 
to format the good FS.

Any ideas why those 2 MDTs behave so differently?

old GOOD FS:
# df -i
mgt/mgt   81718714       205   81718509    1% /lfs/lfsarc02/mgt
mdt/mdt  458995000 130510339  328484661   29% /lfs/lfsarc02/mdt
# df -h
mgt/mgt  427G  7.0M  427G   1% /lfs/lfsarc02/mgt
mdt/mdt  4.6T  1.4T  3.3T  29% /lfs/lfsarc02/mdt
# rpm -q -a | grep zfs
libzfs2-0.7.9-1.el7.x86_64
lustre-osd-zfs-mount-2.10.4-1.el7.x86_64
lustre-zfs-dkms-2.10.4-1.el7.noarch
zfs-0.7.9-1.el7.x86_64
zfs-dkms-0.7.9-1.el7.noarch

new BAD FS
# df -ih
mgt/mgt    83M   169   83M    1% /lfs/lfsarc01/mgt
mdt/mdt   297M  122M  175M   42% /lfs/lfsarc01/mdt
# df -h
mgt/mgt  427G  5.8M  427G   1% /lfs/lfsarc01/mgt
mdt/mdt  8.2T  3.4T  4.9T  41% /lfs/lfsarc01/mdt
# rpm -q -a | grep zfs
libzfs2-0.7.9-1.el7.x86_64
lustre-osd-zfs-mount-2.10.8-1.el7.x86_64
lustre-zfs-dkms-2.10.8-1.el7.noarch
zfs-0.7.9-1.el7.x86_64
zfs-dkms-0.7.9-1.el7.noarch

From: Andreas Dilger <adil...@whamcloud.com>
Sent: Thursday, October 03, 2019 20:38
To: Hebenstreit, Michael <michael.hebenstr...@intel.com>
Cc: Mohr Jr, Richard Frank <rm...@utk.edu>; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] changing inode size on MDT

On Oct 3, 2019, at 20:09, Hebenstreit, Michael <michael.hebenstr...@intel.com> wrote:

So bottom line – don’t change the default values, it won’t get better?

Like I wrote previously, there *are* no default/tunable values to change for 
ZFS.  The tunables are only for ldiskfs, which statically allocates everything 
at format time, so a wrong guess when you format the filesystem will cause 
problems later.
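
For reference, a hypothetical ldiskfs MDT format showing where those tunables 
would live (the device path, MGS NID, and the -i/-I values are illustrative 
placeholders only, and none of this applies to the ZFS MDT discussed here):

mkfs.lustre --mdt --backfstype=ldiskfs --fsname=lfsarc01 --index=0 --mgsnode=<mgs_nid> --mkfsoptions="-i 4096 -I 1024" /dev/<mdt_device>

Here -i sets the ldiskfs bytes-per-inode ratio and -I the on-disk inode size, 
both of which are fixed once the filesystem is formatted.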

The number reported by raw ZFS and by Lustre-on-ZFS is just an estimate, and 
you will (essentially) run out of inodes once you run out of space on the MDT 
or all OSTs.  And I didn't say "it won't get better"; rather, I said the 
estimate _will_ get better once you actually start using the filesystem.

If the 2-3B inodes on the MDT (my estimate) are insufficient, you can always add 
another (presumably mirrored) VDEV to the MDT, or add a new MDT to the 
filesystem, to increase the number of inodes available.
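
A minimal sketch of the first option (hypothetical device names, run against the 
pool backing the MDT, here assumed to be "mdt" as in the outputs elsewhere in 
this thread):

# zpool add mdt mirror <new_disk1> <new_disk2>

Once the pool has more free space, the free-space-based inode estimate grows 
accordingly.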

Cheers, Andreas



From: Andreas Dilger <adil...@whamcloud.com>
Sent: Thursday, October 03, 2019 19:38
To: Hebenstreit, Michael <michael.hebenstr...@intel.com>
Cc: Mohr Jr, Richard Frank <rm...@utk.edu>; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] changing inode size on MDT

On Oct 3, 2019, at 05:03, Hebenstreit, Michael <michael.hebenstr...@intel.com> wrote:

So you are saying that on a ZFS-based Lustre there is no way to increase the 
number of available inodes? I have an 8TB MDT with roughly 17G inodes:

[root@elfsa1m1 ~]# df -h
Filesystem   Size  Used Avail Use% Mounted on
mdt  8.3T  256K  8.3T   1% /mdt

[root@elfsa1m1 ~]# df -i
Filesystem   Inodes  IUsed   IFree IUse% Mounted on
mdt     17678817874      6  17678817868    1% /mdt

For ZFS the only way to increase inodes on the *MDT* is to increase the size of 
the MDT, though more on that below.  Note that the "number of inodes" reported 
by ZFS is an estimate based on the currently-allocated blocks and inodes (i.e. 
bytes_per_inode_ratio = bytes_used / inodes_used, and total_inode_estimate = 
bytes_free / bytes_per_inode_ratio + inodes_used), which becomes more accurate 
as the MDT becomes more full.  With 17B inodes on an 8TB MDT that is a 
bytes-per-inode ratio of 497, which is unrealistically low for Lustre since the 
MDT always stores multiple xattrs on each inode.  Note that the filesystem only 
has 6 inodes allocated, so the ZFS total-inodes estimate is unrealistically high 
and will get better as more inodes are allocated in the filesystem.
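
As a rough sanity check of that estimate (my arithmetic, assuming a few KB are 
actually consumed per inode on a ZFS MDT): 8TB / 17,678,817,874 inodes works out 
to the ~497 bytes-per-inode figure above, whereas at ~4KB per inode the same 8TB 
would hold roughly 2.1B inodes, in line with the 2-3B estimate mentioned earlier 
in this thread, which is where the ZFS estimate should converge as the MDT fills.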

Formatting under Lustre 2.10.8

mkfs.lustre --mdt --backfstype=zfs --fsname=lfsarc01 --index=0 
--mgsnid="36.101.92.22@tcp" --reformat mdt/mdt

This translates to only 948M inodes on the Lustre FS:

[root@elfsa1m1 ~]# df -i
Filesystem   Inodes  IUsed   IFree IUse% Mounted on
mdt     17678817874      6  17678817868    1% /mdt
mdt/mdt
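
The 948M figure would be the Lustre-level count rather than the raw ZFS estimate 
shown by df above; assuming a mounted client (mount point hypothetical), the two 
can be compared with:

# lfs df -i /lfs/lfsarc01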