Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-28 Thread Andrew Elwell
On Tue, 26 Feb 2019 at 23:25, Andreas Dilger wrote: > I agree that having an option that creates the OSTs as inactive might be helpful, though I wouldn't want that to be the default, as I'd imagine it would also cause problems for the majority of users who wouldn't know that they need to
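Until such an option exists, roughly the same effect can be had by hand. A minimal sketch, assuming a hypothetical fsname "testfs" and new OST index 0018; the set_param form runs on the MDS, the conf_param form on the MGS:

    # On the MDS: stop new object allocation on the freshly added OST
    lctl set_param osp.testfs-OST0018-osc-MDT0000.max_create_count=0

    # Or mark the OST inactive for clients via the MGS (persistent)
    lctl conf_param testfs-OST0018.osc.active=0

    # Re-enable later, once you are ready to let new files land there
    lctl set_param osp.testfs-OST0018-osc-MDT0000.max_create_count=20000
    lctl conf_param testfs-OST0018.osc.active=1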

Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-28 Thread Patrick Farrell
> My strategy for adding new OSTs on a live filesystem is to define a pool with the currently running OSTs and apply a pool stripe (lfs setstripe -p [live-ost-pool]) on all

Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-28 Thread Jongwoo Han
My strategy for adding new OSTs on a live filesystem is to define a pool with the currently running OSTs and apply a pool stripe (lfs setstripe -p [live-ost-pool]) to all existing directories. It is better when this is done at initial filesystem creation. After that, you can safely add new OSTs without newly
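A minimal sketch of that pool-based approach, assuming a hypothetical fsname "testfs", a pool named "live", current OST indices 0000-0017, new OST indices 0018-001f, and a client mount at /mnt/testfs; the pool commands run on the MGS:

    # On the MGS: define a pool containing only the OSTs that exist today
    lctl pool_new testfs.live
    lctl pool_add testfs.live testfs-OST[0000-0017]

    # On a client: point existing directories at that pool so new files
    # allocate only on the current OSTs
    lfs setstripe -p live /mnt/testfs

    # Later, when the new OSTs should start taking traffic, add them too
    lctl pool_add testfs.live testfs-OST[0018-001f]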

Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-27 Thread Stephane Thiell
On one of our filesystems, we add a few new OSTs almost every month with no downtime; this is very convenient. The only thing that I would recommend is to avoid doing that during a peak of I/O on your filesystem (we usually do it as early as possible in the morning), as the added OSTs will
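To watch how the freshly added (empty) OSTs are filling relative to the old ones, a quick check from any client is enough. A sketch, assuming a mount at /mnt/testfs:

    # Per-OST capacity and usage, showing how the new OSTs absorb writes
    lfs df -h /mnt/testfs

    # Per-OST inode/object usage, for a rough view of allocation balance
    lfs df -i /mnt/testfs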

Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-22 Thread Andreas Dilger
This is not really correct. Lustre clients can handle the addition of OSTs to a running filesystem. The MGS will register the new OSTs, and the clients will be notified by the MGS that the OSTs have been added, so no need to unmount the clients during this process. Cheers, Andreas On Feb 21,
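A sketch of that flow, assuming a hypothetical fsname "testfs", a new OST already formatted on /dev/mapper/ost0018, and a client mount at /mnt/testfs:

    # On the OSS: the first mount of the new target registers it with the MGS
    mount -t lustre /dev/mapper/ost0018 /lustre/ost0018

    # On any client, without remounting: the new OST appears once the MGS
    # has pushed the updated configuration
    lfs osts /mnt/testfs
    lfs df -h /mnt/testfs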

Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-21 Thread Raj Ayyampalayam
Got it. I'd rather be safe than sorry. This is my first time doing a Lustre configuration change. Raj On Thu, Feb 21, 2019, 11:55 PM Raj wrote: > I also agree with Colin's comment. If the current OSTs are not touched, and you are only adding new OSTs to existing OSS nodes and adding new

Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-21 Thread Raj
I also agree with Colin's comment. If the current OSTs are not touched, and you are only adding new OSTs to existing OSS nodes and adding new ost-mount resources in your existing (already running) Pacemaker configuration, you can achieve the upgrade with no downtime. If your Corosync-Pacemaker
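A minimal sketch of adding such an ost-mount resource with pcs, assuming a hypothetical device /dev/mapper/ost0018, mount point /lustre/ost0018, resource name testfs-OST0018, and OSS hostnames oss01/oss02; the actual agent, timeouts, and constraints in an existing Corosync/Pacemaker setup will differ:

    # Add a Filesystem resource for the new OST, disabled until it is formatted
    pcs resource create testfs-OST0018 ocf:heartbeat:Filesystem \
        device=/dev/mapper/ost0018 directory=/lustre/ost0018 fstype=lustre \
        op monitor interval=120s timeout=300s --disabled

    # Prefer the primary OSS, allow failover to its partner
    pcs constraint location testfs-OST0018 prefers oss01=100
    pcs constraint location testfs-OST0018 prefers oss02=50

    # Bring the new resource online
    pcs resource enable testfs-OST0018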

Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-21 Thread Raj Ayyampalayam
Hi Raj, Thanks for the explanation. We will have to rethink our upgrade process. Thanks again. Raj On Thu, Feb 21, 2019, 10:23 PM Raj wrote: > Hello Raj, It's best and safest to unmount all the clients and then do the upgrade. Your FS is getting more OSTs and the configuration is changing on the

Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-21 Thread Raj
Hello Raj, It's best and safest to unmount all the clients and then do the upgrade. Your FS is getting more OSTs and the configuration is changing on the existing ones, so your clients need to pick up the new layout by remounting. You also mentioned client eviction; during eviction the client has to drop

Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-21 Thread Raj Ayyampalayam
What can I expect to happen to the jobs that are suspended during the file system restart? Will the processes holding an open file handle die when I unsuspend them after the filesystem restart? Thanks! -Raj On Thu, Feb 21, 2019 at 12:52 PM Colin Faber wrote: > Ah yes, if you're adding to

Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-21 Thread Colin Faber
Ah yes, if you're adding to an existing OSS, then you will need to reconfigure the file system, which requires a writeconf event. On Thu, Feb 21, 2019 at 10:00 AM Raj Ayyampalayam wrote: > The new OSTs will be added to the existing file system (the OSS nodes are already part of the
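For reference, a writeconf is roughly the following sequence. A sketch only, assuming a hypothetical fsname "testfs", a combined MGS/MDT, and hypothetical device paths and mount points; see the Lustre manual for the authoritative procedure:

    # 1. Unmount all clients, then all OSTs, then the MDT/MGT
    umount -a -t lustre               # on each client
    umount /lustre/ost0000            # on each OSS, for every OST
    umount /lustre/mdt0               # on the MDS

    # 2. Regenerate the configuration logs on every target
    tunefs.lustre --writeconf /dev/mapper/mdt0      # on the MDS
    tunefs.lustre --writeconf /dev/mapper/ost0000   # on each OSS, per OST

    # 3. Remount in order: MGT/MDT first, then the OSTs, then the clients
    mount -t lustre /dev/mapper/mdt0 /lustre/mdt0
    mount -t lustre /dev/mapper/ost0000 /lustre/ost0000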

Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-21 Thread Raj Ayyampalayam
The new OSTs will be added to the existing file system (the OSS nodes are already part of the filesystem), so I will have to re-configure the current HA resource configuration to tell it about the 4 new OSTs. Our ExaScaler's HA monitors the individual OSTs, and I need to re-configure the HA on the

Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-21 Thread Colin Faber
It seems to me that steps may still be missing? You're going to rack/stack and provision the OSS nodes with new OSTs. Then you're going to introduce failover options somewhere? New OSTs? Existing system? Etc.? If you're introducing failover with the new OSTs and leaving the existing system in

Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-21 Thread Raj Ayyampalayam
Our upgrade strategy is as follows: 1) Load all disks into the storage array. 2) Create RAID pools and virtual disks. 3) Create the Lustre file system using the mkfs.lustre command. (I still have to figure out all the parameters used on the existing OSTs.) 4) Create mount points on all OSSs. 5) Mount the
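For step 3, one way to recover the parameters used on the existing OSTs and format a new one to match. A sketch, assuming a hypothetical fsname "testfs", new OST index 24, MGS NID 10.0.0.1@o2ib, failover NID 10.0.0.3@o2ib, and device paths under /dev/mapper:

    # Inspect an existing OST without changing anything; this prints the
    # fsname, index, mgsnode, failover NIDs and mount options as formatted
    tunefs.lustre --dryrun /dev/mapper/ost0000

    # Format a new OST with matching parameters and the next free index
    mkfs.lustre --ost --fsname=testfs --index=24 \
        --mgsnode=10.0.0.1@o2ib --failnode=10.0.0.3@o2ib \
        /dev/mapper/ost0024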

Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-20 Thread Colin Faber
Can you provide more details on your upgrade strategy? In some cases expanding your storage shouldn't impact client / job activity at all. On Wed, Feb 20, 2019, 11:09 AM Raj Ayyampalayam wrote: > Hello, > > We are planning on expanding our storage by adding more OSTs to our lustre > file

Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-20 Thread Gin Tan
Hi Raj, You can add the OSTs online; we have been doing it. But if you are expanding the storage array, you might want to think about what is involved, such as cabling, etc., depending on the recommendation from your storage vendor. We added an expansion to every Dell storage array last year, and because

[lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-20 Thread Raj Ayyampalayam
Hello, We are planning on expanding our storage by adding more OSTs to our Lustre file system. It looks like it would be easier to expand if we bring the filesystem down and perform the necessary operations. We are planning to suspend all the jobs running on the cluster. We originally planned to