Re: [lustre-discuss] Experience with resizing MDT

2018-09-20 Thread Mohr Jr, Richard Frank (Rick Mohr)

> On Sep 19, 2018, at 8:09 PM, Colin Faber  wrote:
> 
> Why wouldn't you use DNE?

I am considering it as an option, but there appear to be some potential 
drawbacks.

If I use DNE1, then I have to manually create directories on specific MDTs.  I 
will need to monitor MDT usage and make adjustments as necessary (which is not 
the end of the world, but still involves some additional work).  This might be 
fine when I am creating new top-level directories for new users/projects, but 
any existing directories created before we add a new MDT will still only use 
MDT0.  Since the bulk of our user/project directories will be created early on, 
we still have the potential issue of running out of inodes on MDT0.  
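
If I understand the tools correctly, that manual placement would look something
like the following (paths and the MDT index here are just made-up examples):

  lfs df -i /lustre                        # check per-MDT inode usage first
  lfs mkdir -i 1 /lustre/projects/projX    # create the new directory on MDT1

so every new top-level directory requires picking an MDT index by hand.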

Based on that, I think DNE2 would be the better alternative, but it still has 
similar limitations.  The directories created initially will still be striped 
over only a single MDT.  When another MDT is added, I would need to 
recursively adjust all the existing directories to have a stripe count of 2 (or 
risk having MDT0 run out of inodes).  Based on my understanding of how 
striped directories work, the files in a striped directory are split about evenly 
across all the MDTs that the directory is striped over (which doesn’t 
work very well if MDT0 is mostly full and MDT1 is mostly empty).  Most likely 
we would want to have every directory striped across all MDTs, but there is a 
note in the Lustre manual explicitly mentioning that this is not a good idea.
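
For reference, my understanding is that a striped directory under DNE2 would be
created with something like this (again, made-up paths, and a stripe count of 2
assuming two MDTs):

  lfs mkdir -c 2 /lustre/projects/projX    # directory striped across 2 MDTs
  lfs getdirstripe /lustre/projects/projX  # verify the directory layout

but that only helps for directories created after the second MDT exists, which
is exactly the limitation described above.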

So that is why I was thinking that resizing the MDT might be the simplest 
approach.  Of course, I might be misunderstanding something about DNE2, and 
if that is the case, someone can correct me.  Or if there are options I am not 
considering, I would welcome those too.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Experience with resizing MDT

2018-09-20 Thread Andreas Dilger
On Sep 19, 2018, at 18:49, Mohr Jr, Richard Frank (Rick Mohr)  
wrote:
> 
> Has anyone had recent experience resizing an ldiskfs-backed MDT using the 
> resize2fs tool?  We may be purchasing a small Lustre file system in the near 
> future with the expectation that it could grow considerably over time.  Since 
> we don’t have a clear idea of how many inodes we might need in the future, I 
> wanted to consider options that would allow us to potentially expand the MDT 
> capacity over time.  (Of course, using ZFS would allow us to easily expand 
> capacity, but I do have some concerns about metadata performance.  Also, we 
> will likely need to use project quotas, which as far as I remember are not 
> available for ZFS at the moment.)

Rick,
I've used resize2fs to resize my MDT in the past, but this was for relatively 
small resize increments, and I haven't done it recently.  Note that for Lustre 
MDT/OST mounts, you can't resize the filesystem directly while mounted.  It is 
possible to *also* mount the filesystem as type ldiskfs to run the resize2fs 
command (basically to give it a place to call the resize ioctl()), but YMMV.
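
A rough sketch of that procedure (device and mountpoint names are placeholders,
and you should test this on a non-production target first):

  # after the underlying device/LUN has been grown:
  mount -t ldiskfs /dev/mapper/mdt0 /mnt/mdt-ldiskfs
  resize2fs /dev/mapper/mdt0        # grow the filesystem to fill the device
  umount /mnt/mdt-ldiskfs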

It is always a good idea to have an MDT backup (every few days if possible), 
since an MDT backup takes a relatively small amount of space and may avoid a 
large amount of data loss/restore work.  Even a "dd" backup of the live MDT is 
likely usable after e2fsck, and better than a broken MDT if something goes wrong.
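
As a sketch (device and paths are placeholders):

  dd if=/dev/mapper/mdt0 of=/backup/mdt0.img bs=1M
  # if the copy was taken from a live MDT, check it before relying on it:
  e2fsck -fy /backup/mdt0.img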

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud









___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Second read or write performance

2018-09-20 Thread Andreas Dilger
On Sep 20, 2018, at 03:07, fırat yılmaz  wrote:
> 
> Hi all,
> 
> OS=Redhat 7.4
> Lustre Version: Intel® Manager for Lustre* software 4.0.3.0
> Interconnect: Mellanox OFED, ConnectX-5
> 72 OSTs over 6 OSSs with HA
> 1 MDT and 1 MGT on 2 MDS with HA
> 
> Lustre servers fine tuning parameters:
> lctl set_param timeout=600
> lctl set_param ldlm_timeout=200
> lctl set_param at_min=250
> lctl set_param at_max=600
> lctl set_param obdfilter.*.read_cache_enable=1
> lctl set_param obdfilter.*.writethrough_cache_enable=1
> lctl set_param obdfilter.lfs3test-OST*.brw_size=16
> 
> Lustre clients fine tuning parameters:
> lctl set_param osc.*.checksums=0
> lctl set_param timeout=600
> lctl set_param at_min=250
> lctl set_param at_max=600
> lctl set_param ldlm.namespaces.*.lru_size=2000
> lctl set_param osc.*OST*.max_rpcs_in_flight=256
> lctl set_param osc.*OST*.max_dirty_mb=1024
> lctl set_param osc.*.max_pages_per_rpc=1024
> lctl set_param llite.*.max_read_ahead_mb=1024
> lctl set_param llite.*.max_read_ahead_per_file_mb=1024
> 
> Mountpoint stripe count: 72, stripe size: 1M
> 
> I have a 2 PB Lustre filesystem.  In the benchmark tests I get the optimum 
> values for read and write, but when I start a concurrent I/O operation, the 
> second job's throughput stays around 100-200 MB/s.  I have tried lowering the 
> stripe count to 36, but since the concurrent operations will not occur in a 
> way that keeps the OST volumes balanced, I think that is not a good way to move 
> on.  Secondly, I saw some discussion about turning off flock, which ended up 
> unpromising.
> 
> As I check the stripe behaviour:
> the first operation starts to use the first 36 OSTs;
> when a second job starts during the first job, it uses the second 36 OSTs.
> 
> But when the second job starts after the first job, it uses the first 36 
> OSTs, which causes OST imbalance.
> 
> Is there a round-robin setup so that each set of 36 OSTs is used in a 
> round-robin way?
> 
> Any kind of suggestions are appreciated.

Can you please describe what command you are using for testing?  Lustre is 
already using round-robin OST allocation by default, so the second job should 
use the next set of 36 OSTs, unless the file layout has been specified (e.g. to 
start on a specific OST) or the space usage of the OSTs is very imbalanced 
(more than 17% of the remaining free space).
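
To check this, something like the following should show the current layout and
balance (run the lod parameter on the MDS; the exact parameter path can vary
between Lustre versions):

  lfs df -h /mnt/lustre                  # per-OST space usage
  lfs getstripe -d /mnt/lustre/testdir   # default layout of the test directory
  lctl get_param lod.*.qos_threshold_rr  # imbalance threshold for leaving round-robin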

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre/ZFS snapshots mount error

2018-09-20 Thread Kirk, Benjamin (JSC-EG311)
I just opened an LU for this issue (https://jira.whamcloud.com/browse/LU-11411), 
for anyone interested.

Thanks a lot!

-Ben



On Aug 27, 2018, at 4:56 PM, Andreas Dilger 
mailto:adil...@whamcloud.com>> wrote:

It's probably best to file an LU ticket for this issue.

It looks like there is something with the log processing at mount that is 
trying to modify the configuration files.  I'm not sure whether that should be 
allowed or not.

Does fsB have the same MGS as fsA?  Does it have the same MDS node as fsA?
If it has a different MDS, you might consider giving it its own MGS as well.
That doesn't have to be a separate MGS node, just a separate filesystem (a ZFS 
fileset in the same zpool) on the MDS node.
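
As a sketch (pool and dataset names are placeholders), the separate MGS would
just be another dataset in the existing pool:

  mkfs.lustre --mgs --backfstype=zfs metadata/fsB-mgs
  mkdir -p /mnt/fsB-mgs
  mount -t lustre metadata/fsB-mgs /mnt/fsB-mgs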

Cheers, Andreas

On Aug 27, 2018, at 10:18, Kirk, Benjamin (JSC-EG311) 
mailto:benjamin.k...@nasa.gov>> wrote:

Hi all,

We have two filesystems, fsA & fsB (eadc below), both of which get snapshots 
taken daily, rotated over a week.  It’s a beautiful feature we’ve been using in 
production ever since it was introduced with 2.10.

-) We’ve got Lustre/ZFS 2.10.4 on CentOS 7.5.
-) Both fsA & fsB have changelogs active.
-) fsA has a combined mgt/mdt on a single ZFS filesystem.
-) fsB has a single mdt on a single ZFS filesystem.
-) For fsA, I have no issues mounting any of the snapshots via lctl.
-) For fsB, I can mount the three most recent snapshots, then encounter errors:

[root@hpfs-fsl-mds0 ~]# lctl snapshot_mount -F eadc -n eadc_AutoSS-Mon
mounted the snapshot eadc_AutoSS-Mon with fsname 3d40bbc
[root@hpfs-fsl-mds0 ~]# lctl snapshot_umount -F eadc -n eadc_AutoSS-Mon
[root@hpfs-fsl-mds0 ~]# lctl snapshot_mount -F eadc -n eadc_AutoSS-Sun
mounted the snapshot eadc_AutoSS-Sun with fsname 584c07a
[root@hpfs-fsl-mds0 ~]# lctl snapshot_umount -F eadc -n eadc_AutoSS-Sun
[root@hpfs-fsl-mds0 ~]# lctl snapshot_mount -F eadc -n eadc_AutoSS-Sat
mounted the snapshot eadc_AutoSS-Sat with fsname 4e646fe
[root@hpfs-fsl-mds0 ~]# lctl snapshot_umount -F eadc -n eadc_AutoSS-Sat
[root@hpfs-fsl-mds0 ~]# lctl snapshot_mount -F eadc -n eadc_AutoSS-Fri
mount.lustre: mount metadata/meta-eadc@eadc_AutoSS-Fri at 
/mnt/eadc_AutoSS-Fri_MDT failed: Read-only file system
Can't mount the snapshot eadc_AutoSS-Fri: Read-only file system

The relevant bits from dmesg are
[1353434.417762] Lustre: 3d40bbc-MDT: set dev_rdonly on this device
[1353434.417765] Lustre: Skipped 3 previous similar messages
[1353434.649480] Lustre: 3d40bbc-MDT: Imperative Recovery enabled, recovery 
window shrunk from 300-900 down to 150-900
[1353434.649484] Lustre: Skipped 3 previous similar messages
[1353434.866228] Lustre: 3d40bbc-MDD: changelog on
[1353434.866233] Lustre: Skipped 1 previous similar message
[1353435.427744] Lustre: 3d40bbc-MDT: Connection restored to ...@tcp (at 
...@tcp)
[1353435.427747] Lustre: Skipped 23 previous similar messages
[1353445.255899] Lustre: Failing over 3d40bbc-MDT
[1353445.255903] Lustre: Skipped 3 previous similar messages
[1353445.256150] LustreError: 11-0: 3d40bbc-OST-osc-MDT: operation 
ost_disconnect to node ...@tcp failed: rc = -107
[1353445.257896] LustreError: Skipped 23 previous similar messages
[1353445.353874] Lustre: server umount 3d40bbc-MDT complete
[1353445.353877] Lustre: Skipped 3 previous similar messages
[1353475.302224] Lustre: 4e646fe-MDD: changelog on
[1353475.302228] Lustre: Skipped 1 previous similar message
[1353498.964016] LustreError: 25582:0:(osd_handler.c:341:osd_trans_create()) 
36ca26b-MDT-osd: someone try to start transaction under readonly mode, 
should be disabled.
[1353498.967260] LustreError: 25582:0:(osd_handler.c:341:osd_trans_create()) 
Skipped 1 previous similar message
[1353498.968829] CPU: 6 PID: 25582 Comm: mount.lustre Kdump: loaded Tainted: P  
 OE     3.10.0-862.6.3.el7.x86_64 #1
[1353498.968830] Hardware name: Supermicro SYS-6027TR-D71FRF/X9DRT, BIOS 3.2a 
08/04/2015
[1353498.968832] Call Trace:
[1353498.968841]  [] dump_stack+0x19/0x1b
[1353498.968851]  [] osd_trans_create+0x38b/0x3d0 [osd_zfs]
[1353498.968876]  [] llog_destroy+0x1f4/0x3f0 [obdclass]
[1353498.968887]  [] llog_cat_reverse_process_cb+0x246/0x3f0 
[obdclass]
[1353498.968897]  [] llog_reverse_process+0x38c/0xaa0 
[obdclass]
[1353498.968910]  [] ? llog_cat_process_cb+0x4e0/0x4e0 
[obdclass]
[1353498.968922]  [] llog_cat_reverse_process+0x179/0x270 
[obdclass]
[1353498.968932]  [] ? llog_init_handle+0xd5/0x9a0 [obdclass]
[1353498.968943]  [] ? llog_open_create+0x78/0x320 [obdclass]
[1353498.968949]  [] ? mdd_root_get+0xf0/0xf0 [mdd]
[1353498.968954]  [] mdd_prepare+0x13ff/0x1c70 [mdd]
[1353498.968966]  [] mdt_prepare+0x57/0x3b0 [mdt]
[1353498.968983]  [] server_start_targets+0x234d/0x2bd0 
[obdclass]
[1353498.968999]  [] ? class_config_dump_handler+0x7e0/0x7e0 
[obdclass]
[1353498.969012]  [] server_fill_super+0x109d/0x185a 
[obdclass]
[1353498.969025]  [] lustre_fill_super+0x328/0x950 [obdclass]
[1353498.969038]  [] ? lustre_common_put_super+0x270/0x270 
