Re: [lustre-discuss] Experience with resizing MDT
> On Sep 19, 2018, at 8:09 PM, Colin Faber wrote:
>
> Why wouldn't you use DNE?

I am considering it as an option, but there appear to be some potential drawbacks. If I use DNE1, then I have to manually create directories on specific MDTs. I will need to monitor MDT usage and make adjustments as necessary (which is not the end of the world, but still involves some additional work). This might be fine when I am creating new top-level directories for new users/projects, but any existing directories created before we add a new MDT will still only use MDT0. Since the bulk of our user/project directories will be created early on, we still have the potential issue of running out of inodes on MDT0.

Based on that, I think DNE2 would be the better alternative, but it still has similar limitations. The directories created initially will still be striped over only a single MDT. When another MDT is added, I would need to recursively adjust all the existing directories to have a stripe count of 2 (or risk having MDT0 run out of inodes). Based on my understanding of how striped directories work, the files in a striped directory are split about evenly across all the MDTs that the directory is striped over (which doesn't work very well if MDT0 is mostly full and MDT1 is mostly empty). Most likely we would want to have every directory striped across all MDTs, but there is a note in the Lustre manual explicitly mentioning that this is not a good idea.

So that is why I was thinking that resizing the MDT might be the simplest approach. Of course, I might be misunderstanding something about DNE2, and if that is the case, someone can correct me. Or if there are options I am not considering, I would welcome those too.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
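[For readers unfamiliar with the two DNE modes discussed above: both are driven from the client with `lfs mkdir`. A quick sketch, with the directory paths purely illustrative:

```
# DNE1 ("remote directories"): pin a new top-level directory to a given MDT
lfs mkdir -i 1 /lustre/project_new      # -i selects the MDT index

# DNE2 ("striped directories"): spread one directory's entries over MDTs
lfs mkdir -c 2 /lustre/project_wide     # -c sets the MDT stripe count

# Inspect where an existing directory lives / how it is striped
lfs getdirstripe /lustre/project_wide
```

As Rick notes, both settings only affect directories created after the fact; entries already on MDT0 stay there unless migrated.]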
Re: [lustre-discuss] Experience with resizing MDT
> On Sep 19, 2018, at 18:49, Mohr Jr, Richard Frank (Rick Mohr) wrote:
>
> Has anyone had recent experience resizing a ldiskfs-backed MDT using the
> resize2fs tool? We may be purchasing a small lustre file system in the near
> future with the expectation that it could grow considerably over time. Since
> we don't have a clear idea of how many inodes we might need in the future, I
> wanted to consider options that would allow us to potentially expand the MDT
> capacity over time. (Of course, using ZFS would allow us to easily expand
> capacity, but I do have some concern about metadata performance. Also, we
> will likely need to use project quotas, which as far as I remember are not
> available for ZFS at the moment.)

Rick,

I've used resize2fs to resize my MDT in the past, but this was for relatively small resize increments, and I haven't done it recently.

Note that for Lustre MDT/OST mounts, you can't resize the filesystem directly while mounted. It is possible to *also* mount the filesystem as type ldiskfs to run the resize2fs command (basically to give it a place to call the resize ioctl()), but YMMV.

It is always a good idea to have an MDT backup (every few days if possible), since an MDT backup takes a relatively small amount of space to store and may avoid a large amount of data loss/restore. Even a "dd" backup of the live MDT is likely usable after e2fsck, and better than a broken MDT if something goes wrong.

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud
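[For anyone following along, the offline procedure Andreas describes might look roughly like the sketch below. Device paths and mount points are hypothetical, and the MDT must be stopped as a Lustre target first:

```
# Stop the Lustre target, then mount the backing ldiskfs directly
umount /mnt/lustre-mdt
mount -t ldiskfs /dev/mapper/mdt-dev /mnt/mdt-ldiskfs

# Grow the filesystem after the underlying device has been enlarged
resize2fs /dev/mapper/mdt-dev
umount /mnt/mdt-ldiskfs

# Periodic raw backup of the (comparatively small) MDT device,
# per Andreas's suggestion; verify with e2fsck before relying on it
dd if=/dev/mapper/mdt-dev of=/backup/mdt.img bs=4M
```

Test this on a scratch target before trying it on production metadata.]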
Re: [lustre-discuss] Second read or write performance
> On Sep 20, 2018, at 03:07, fırat yılmaz wrote:
>
> Hi all,
>
> OS: Red Hat 7.4
> Lustre version: Intel® Manager for Lustre* software 4.0.3.0
> Interconnect: Mellanox OFED, ConnectX-5
> 72 OSTs over 6 OSS with HA
> 1 MDT and 1 MGT on 2 MDS with HA
>
> Lustre server tuning parameters:
> lctl set_param timeout=600
> lctl set_param ldlm_timeout=200
> lctl set_param at_min=250
> lctl set_param at_max=600
> lctl set_param obdfilter.*.read_cache_enable=1
> lctl set_param obdfilter.*.writethrough_cache_enable=1
> lctl set_param obdfilter.lfs3test-OST*.brw_size=16
>
> Lustre client tuning parameters:
> lctl set_param osc.*.checksums=0
> lctl set_param timeout=600
> lctl set_param at_min=250
> lctl set_param at_max=600
> lctl set_param ldlm.namespaces.*.lru_size=2000
> lctl set_param osc.*OST*.max_rpcs_in_flight=256
> lctl set_param osc.*OST*.max_dirty_mb=1024
> lctl set_param osc.*.max_pages_per_rpc=1024
> lctl set_param llite.*.max_read_ahead_mb=1024
> lctl set_param llite.*.max_read_ahead_per_file_mb=1024
>
> Mountpoint stripe count: 72, stripe size: 1M
>
> I have a 2 PB Lustre filesystem. In benchmark tests I get the optimum
> values for read and write, but when I start a concurrent I/O operation,
> the second job's throughput stays around 100-200 Mb/s. I have tried
> lowering the stripe count to 36, but since the concurrent operations will
> not occur in a way that keeps the OST volumes balanced, I think that is
> not a good way to move on. Secondly, I saw some discussion about turning
> off flock, which ended up unpromising.
>
> As I check the stripe behaviour:
> the first operation starts using the first 36 OSTs;
> when a second job starts during the first job, it uses the second 36 OSTs.
>
> But when the second job starts after the 1st job, it uses the first 36
> OSTs, which causes OST imbalance.
>
> Is there a round-robin setup so that each set of 36 OSTs is used in a
> round-robin way?
>
> Any kind of suggestions are appreciated.

Can you please describe what command you are using for testing?
Lustre already uses round-robin OST allocation by default, so the second job should use the next set of 36 OSTs, unless the file layout has been specified to start on a particular OST, or the space usage of the OSTs is very imbalanced (differing by more than 17% of the remaining free space).

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud
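[A toy model, not Lustre's actual allocator code, of the behaviour Andreas describes: the round-robin starting index is shared state that advances with every allocation, so a second job is handed the *next* 36 OSTs regardless of whether the first job is still running. The class and names below are invented for illustration.

```python
# Toy round-robin OST allocator: a shared cursor advances as objects
# are allocated, so successive files continue where the last left off.
class RoundRobinAllocator:
    def __init__(self, num_osts):
        self.num_osts = num_osts
        self.next_ost = 0  # shared across all files and jobs

    def allocate(self, stripe_count):
        """Return the OST indices for one file and advance the cursor."""
        osts = [(self.next_ost + i) % self.num_osts
                for i in range(stripe_count)]
        self.next_ost = (self.next_ost + stripe_count) % self.num_osts
        return osts

alloc = RoundRobinAllocator(72)
print(alloc.allocate(36))  # first job:  OSTs 0..35
print(alloc.allocate(36))  # second job: OSTs 36..71, even if started later
```

When free-space imbalance between OSTs exceeds the threshold Andreas mentions (17% by default, exposed as the `qos_threshold_rr` tunable), Lustre switches from pure round-robin to weighted allocation that favors emptier OSTs.]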
Re: [lustre-discuss] Lustre/ZFS snapshots mount error
I just opened an LU on the issue for anyone interested:
https://jira.whamcloud.com/browse/LU-11411

Thanks a lot!
-Ben

On Aug 27, 2018, at 4:56 PM, Andreas Dilger wrote:

It's probably best to file an LU ticket for this issue. It looks like there is something with the log processing at mount that is trying to modify the configuration files. I'm not sure whether that should be allowed or not.

Does fsB have the same MGS as fsA? Does it have the same MDS node as fsA? If it has a different MDS, you might consider giving it its own MGS as well. That doesn't have to be a separate MGS node, just a separate filesystem (a ZFS fileset in the same zpool) on the MDS node.

Cheers, Andreas

On Aug 27, 2018, at 10:18, Kirk, Benjamin (JSC-EG311) wrote:

Hi all,

We have two filesystems, fsA & fsB (eadc below), both of which get snapshots taken daily, rotated over a week. It's a beautiful feature we've been using in production ever since it was introduced with 2.10.

-) We've got Lustre/ZFS 2.10.4 on CentOS 7.5.
-) Both fsA & fsB have changelogs active.
-) fsA has a combined mgt/mdt on a single ZFS filesystem.
-) fsB has a single mdt on a single ZFS filesystem.
-) For fsA, I have no issues mounting any of the snapshots via lctl.
-) For fsB, I can mount the three most recent snapshots, then encounter errors:

[root@hpfs-fsl-mds0 ~]# lctl snapshot_mount -F eadc -n eadc_AutoSS-Mon
mounted the snapshot eadc_AutoSS-Mon with fsname 3d40bbc
[root@hpfs-fsl-mds0 ~]# lctl snapshot_umount -F eadc -n eadc_AutoSS-Mon
[root@hpfs-fsl-mds0 ~]# lctl snapshot_mount -F eadc -n eadc_AutoSS-Sun
mounted the snapshot eadc_AutoSS-Sun with fsname 584c07a
[root@hpfs-fsl-mds0 ~]# lctl snapshot_umount -F eadc -n eadc_AutoSS-Sun
[root@hpfs-fsl-mds0 ~]# lctl snapshot_mount -F eadc -n eadc_AutoSS-Sat
mounted the snapshot eadc_AutoSS-Sat with fsname 4e646fe
[root@hpfs-fsl-mds0 ~]# lctl snapshot_umount -F eadc -n eadc_AutoSS-Sat
[root@hpfs-fsl-mds0 ~]# lctl snapshot_mount -F eadc -n eadc_AutoSS-Fri
mount.lustre: mount metadata/meta-eadc@eadc_AutoSS-Fri at /mnt/eadc_AutoSS-Fri_MDT failed: Read-only file system
Can't mount the snapshot eadc_AutoSS-Fri: Read-only file system

The relevant bits from dmesg are:

[1353434.417762] Lustre: 3d40bbc-MDT: set dev_rdonly on this device
[1353434.417765] Lustre: Skipped 3 previous similar messages
[1353434.649480] Lustre: 3d40bbc-MDT: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[1353434.649484] Lustre: Skipped 3 previous similar messages
[1353434.866228] Lustre: 3d40bbc-MDD: changelog on
[1353434.866233] Lustre: Skipped 1 previous similar message
[1353435.427744] Lustre: 3d40bbc-MDT: Connection restored to ...@tcp (at ...@tcp)
[1353435.427747] Lustre: Skipped 23 previous similar messages
[1353445.255899] Lustre: Failing over 3d40bbc-MDT
[1353445.255903] Lustre: Skipped 3 previous similar messages
[1353445.256150] LustreError: 11-0: 3d40bbc-OST-osc-MDT: operation ost_disconnect to node ...@tcp failed: rc = -107
[1353445.257896] LustreError: Skipped 23 previous similar messages
[1353445.353874] Lustre: server umount 3d40bbc-MDT complete
[1353445.353877] Lustre: Skipped 3 previous similar messages
[1353475.302224] Lustre: 4e646fe-MDD: changelog on
[1353475.302228] Lustre: Skipped 1 previous similar message
[1353498.964016] LustreError: 25582:0:(osd_handler.c:341:osd_trans_create()) 36ca26b-MDT-osd: someone try to start transaction under readonly mode, should be disabled.
[1353498.967260] LustreError: 25582:0:(osd_handler.c:341:osd_trans_create()) Skipped 1 previous similar message
[1353498.968829] CPU: 6 PID: 25582 Comm: mount.lustre Kdump: loaded Tainted: P OE 3.10.0-862.6.3.el7.x86_64 #1
[1353498.968830] Hardware name: Supermicro SYS-6027TR-D71FRF/X9DRT, BIOS 3.2a 08/04/2015
[1353498.968832] Call Trace:
[1353498.968841] [] dump_stack+0x19/0x1b
[1353498.968851] [] osd_trans_create+0x38b/0x3d0 [osd_zfs]
[1353498.968876] [] llog_destroy+0x1f4/0x3f0 [obdclass]
[1353498.968887] [] llog_cat_reverse_process_cb+0x246/0x3f0 [obdclass]
[1353498.968897] [] llog_reverse_process+0x38c/0xaa0 [obdclass]
[1353498.968910] [] ? llog_cat_process_cb+0x4e0/0x4e0 [obdclass]
[1353498.968922] [] llog_cat_reverse_process+0x179/0x270 [obdclass]
[1353498.968932] [] ? llog_init_handle+0xd5/0x9a0 [obdclass]
[1353498.968943] [] ? llog_open_create+0x78/0x320 [obdclass]
[1353498.968949] [] ? mdd_root_get+0xf0/0xf0 [mdd]
[1353498.968954] [] mdd_prepare+0x13ff/0x1c70 [mdd]
[1353498.968966] [] mdt_prepare+0x57/0x3b0 [mdt]
[1353498.968983] [] server_start_targets+0x234d/0x2bd0 [obdclass]
[1353498.968999] [] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
[1353498.969012] [] server_fill_super+0x109d/0x185a [obdclass]
[1353498.969025] [] lustre_fill_super+0x328/0x950 [obdclass]
[1353498.969038] [] ? lustre_common_put_super+0x270/0x270
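[If fsB does end up with its own MGS as Andreas suggested, creating it as another fileset in the existing metadata pool might look like the sketch below. The dataset and mount-point names are hypothetical:

```
# Format an MGS-only target as a new dataset in the existing zpool
# (for a ZFS backend, mkfs.lustre creates the dataset itself)
mkfs.lustre --mgs --backfstype=zfs metadata/fsB-mgs

# Start the MGS by mounting it as type lustre
mkdir -p /mnt/fsB-mgs
mount -t lustre metadata/fsB-mgs /mnt/fsB-mgs
```

The fsB targets would then be formatted (or tunefs'd) with `--mgsnode=` pointing at this MGS instead of sharing fsA's.]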
[lustre-discuss] Second read or write performance
Hi all,

OS: Red Hat 7.4
Lustre version: Intel® Manager for Lustre* software 4.0.3.0
Interconnect: Mellanox OFED, ConnectX-5
72 OSTs over 6 OSS with HA
1 MDT and 1 MGT on 2 MDS with HA

Lustre server tuning parameters:
lctl set_param timeout=600
lctl set_param ldlm_timeout=200
lctl set_param at_min=250
lctl set_param at_max=600
lctl set_param obdfilter.*.read_cache_enable=1
lctl set_param obdfilter.*.writethrough_cache_enable=1
lctl set_param obdfilter.lfs3test-OST*.brw_size=16

Lustre client tuning parameters:
lctl set_param osc.*.checksums=0
lctl set_param timeout=600
lctl set_param at_min=250
lctl set_param at_max=600
lctl set_param ldlm.namespaces.*.lru_size=2000
lctl set_param osc.*OST*.max_rpcs_in_flight=256
lctl set_param osc.*OST*.max_dirty_mb=1024
lctl set_param osc.*.max_pages_per_rpc=1024
lctl set_param llite.*.max_read_ahead_mb=1024
lctl set_param llite.*.max_read_ahead_per_file_mb=1024

Mountpoint stripe count: 72, stripe size: 1M

I have a 2 PB Lustre filesystem. In benchmark tests I get the optimum values for read and write, but when I start a concurrent I/O operation, the second job's throughput stays around 100-200 Mb/s. I have tried lowering the stripe count to 36, but since the concurrent operations will not occur in a way that keeps the OST volumes balanced, I think that is not a good way to move on. Secondly, I saw some discussion about turning off flock, which ended up unpromising.

As I check the stripe behaviour:
the first operation starts using the first 36 OSTs;
when a second job starts during the first job, it uses the second 36 OSTs.

But when the second job starts after the 1st job, it uses the first 36 OSTs, which causes OST imbalance.

Is there a round-robin setup so that each set of 36 OSTs is used in a round-robin way?

Any kind of suggestions are appreciated.

Best regards.
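[A quick way to confirm or rule out the suspected OST imbalance and see exactly where a job's files landed; the mount point and file path here are illustrative:

```
# Per-OST capacity and inode usage for the filesystem
lfs df -h /lustre
lfs df -i /lustre

# Show which OSTs a particular test file's objects were placed on
lfs getstripe /lustre/benchmark/testfile
```

Comparing `lfs getstripe` output for the first and second jobs' files would show directly whether both runs started on the same OST index.]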