Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

Andreas Dilger Fri, 22 Feb 2019 14:04:06 -0800

This is not really correct.

Lustre clients can handle the addition of OSTs to a running filesystem. The MGS 
will register the new OSTs, and the clients will be notified by the MGS that 
the OSTs have been added, so no need to unmount the clients during this process.
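
One way to confirm this from a client is to compare the OST list before and after the new OSTs are mounted. The sketch below is only an illustration: it parses text in the usual `lfs df` layout, where each target appears as `<fsname>-OSTxxxx_UUID` with a hexadecimal index. The function names are hypothetical and not part of Lustre.

```python
import re

# Target names look like "testfs-OST000a_UUID"; the 4-digit index is hexadecimal.
OST_RE = re.compile(r"^(\S+)-OST([0-9a-fA-F]{4})_UUID\b")


def ost_indices(lfs_df_output):
    """Return the sorted OST indices named in the text output of `lfs df`."""
    indices = set()
    for line in lfs_df_output.splitlines():
        match = OST_RE.match(line.strip())
        if match:
            indices.add(int(match.group(2), 16))
    return sorted(indices)


def new_osts(before, after):
    """Return the OST indices present in `after` but not in `before`."""
    return sorted(set(after) - set(before))
```

Capture `lfs df` on a client once before and once after the new OSTs are mounted, pass each capture to `ost_indices`, and `new_osts` should list the added indices without the client ever having been remounted.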



Cheers, Andreas

On Feb 21, 2019, at 19:23, Raj <rajgau...@gmail.com> wrote:

Hello Raj,
It’s best and safest to unmount the filesystem from all the clients and then do 
the upgrade. Your FS is getting more OSTs and a changed configuration on the 
existing ones, so your clients need to get the new layout by remounting.
You also mentioned client eviction: during eviction the client has to drop its 
dirty pages, and all the open file descriptors in the FS will be gone.

On Thu, Feb 21, 2019 at 12:25 PM Raj Ayyampalayam <ans...@gmail.com> wrote:
What can I expect to happen to the jobs that are suspended during the file 
system restart?
Will the processes holding an open file handle die when I unsuspend them after 
the filesystem restart?

Thanks!
-Raj


On Thu, Feb 21, 2019 at 12:52 PM Colin Faber <cfa...@gmail.com> wrote:
Ah yes,

If you're adding to an existing OSS, then you will need to reconfigure the file 
system, which requires a writeconf event.
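
For reference, a writeconf is normally performed with `tunefs.lustre --writeconf` on each target while the whole filesystem (clients and servers) is stopped, regenerating the configuration logs on the MDT first and then on the OSTs. The sketch below only builds the command strings in that order and runs nothing. The helper name and device paths are hypothetical, and a separate MGS device is not covered.

```python
import shlex


def writeconf_commands(mdt_devices, ost_devices):
    """Build `tunefs.lustre --writeconf` commands, MDT devices before OSTs.

    Nothing is executed; the caller decides how and where to run them.
    """
    devices = list(mdt_devices) + list(ost_devices)
    return [shlex.join(["tunefs.lustre", "--writeconf", dev]) for dev in devices]


# Hypothetical device paths, for illustration only.
for command in writeconf_commands(["/dev/mapper/mdt0"], ["/dev/mapper/ost0"]):
    print(command)
```

After the logs are regenerated, the targets are usually remounted in the same order (MGS/MDT first, then OSTs) before any client mounts.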

On Thu, Feb 21, 2019 at 10:00 AM Raj Ayyampalayam <ans...@gmail.com> wrote:
The new OSTs will be added to the existing file system (the OSS nodes are 
already part of the filesystem). I will have to reconfigure the current HA 
resource configuration to tell it about the 4 new OSTs.
Our exascaler's HA monitors the individual OSTs, and I need to reconfigure the 
HA on the existing filesystem.

Our vendor support has confirmed that we would have to restart the filesystem 
if we want to regenerate the HA configs to include the new OSTs.

Thanks,
-Raj


On Thu, Feb 21, 2019 at 11:23 AM Colin Faber <cfa...@gmail.com> wrote:
It seems to me that steps may still be missing?

You're going to rack/stack and provision the OSS nodes with new OSTs.

Then you're going to introduce failover options somewhere? For the new OSTs? 
For the existing system?

If you're introducing failover with the new OSTs and leaving the existing 
system in place, you should be able to accomplish this without bringing the 
system offline.

If you're going to be introducing failover to your existing system then you 
will need to reconfigure the file system to accommodate the new failover 
settings (failover nodes, etc.)

-cf


On Thu, Feb 21, 2019 at 9:13 AM Raj Ayyampalayam <ans...@gmail.com> wrote:
Our upgrade strategy is as follows:

1) Load all disks into the storage array.
2) Create RAID pools and virtual disks.
3) Create the lustre file system using the mkfs.lustre command. (I still have 
to figure out all the parameters used on the existing OSTs.)
4) Create mount points on all OSSs.
5) Mount the lustre OSTs.
6) Maybe rebalance the filesystem.
My understanding is that the above can be done without bringing the filesystem 
down. I want to create the HA configuration (corosync and pacemaker) for the 
new OSTs. This step requires the filesystem to be down. I want to know what 
would happen to the suspended processes across the cluster when I bring the 
filesystem down to re-generate the HA configs.
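
Steps 3 to 5 above can be sketched as a small command generator. This is only an illustration under stated assumptions: the filesystem name, MGS NID, OST index, device path and mount point are hypothetical placeholders, failover options are omitted, and nothing is executed. The parameters of an existing OST can usually be read without changing anything by running `tunefs.lustre --dryrun <device>` on the OSS.

```python
import shlex


def ost_commands(fsname, mgs_nids, index, device, mount_point):
    """Build the format, mkdir and mount commands for one new OST.

    All argument values are supplied by the caller; the commands are
    returned as strings and never run here.
    """
    mkfs = ["mkfs.lustre", "--ost", f"--fsname={fsname}", f"--index={index}"]
    mkfs += [f"--mgsnode={nid}" for nid in mgs_nids]
    mkfs.append(device)
    return [
        shlex.join(mkfs),                                           # step 3
        shlex.join(["mkdir", "-p", mount_point]),                   # step 4
        shlex.join(["mount", "-t", "lustre", device, mount_point]), # step 5
    ]


# Hypothetical values, for illustration only.
for command in ost_commands("testfs", ["10.0.0.1@tcp"], 4,
                            "/dev/mapper/ost4", "/mnt/ost4"):
    print(command)
```

The index must be one that no existing OST in the filesystem already uses, and the remaining options should mirror what the existing OSTs were formatted with.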

Thanks,
-Raj

On Thu, Feb 21, 2019 at 12:59 AM Colin Faber <cfa...@gmail.com> wrote:
Can you provide more details on your upgrade strategy? In some cases expanding 
your storage shouldn't impact client / job activity at all.

On Wed, Feb 20, 2019, 11:09 AM Raj Ayyampalayam <ans...@gmail.com> wrote:
Hello,

We are planning on expanding our storage by adding more OSTs to our lustre file 
system. It looks like it would be easier to expand if we bring the filesystem 
down and perform the necessary operations. We are planning to suspend all the 
jobs running on the cluster. We originally planned to add new OSTs to the live 
filesystem.

We are trying to determine the potential impact to the suspended jobs if we 
bring down the filesystem for the upgrade.
One of the questions we have is what would happen to the suspended processes 
that hold an open file handle in the lustre file system when the filesystem is 
brought down for the upgrade?
Will they recover from the client eviction?

We do have vendor support and have engaged them. I wanted to ask the community 
and get some feedback.

Thanks,
-Raj
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org