Re: [lustre-discuss] Repeatable ldlm_enqueue error

2019-10-31 Thread Raj Ayyampalayam
I had the same thought, so I checked all the nodes, and they all show
exactly the same time.
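(For reference, a quick way to spot-check clock skew across the servers and
clients in one shot, assuming pdsh and chrony are available; the host list
below is only a placeholder:)

  # Print seconds-since-epoch from every node and eyeball the spread
  pdsh -w mds[1-2],oss[1-4],n305 date +%s.%N | sort -n -k2

  # On each node, confirm the NTP daemon really is synchronized
  chronyc tracking      # or: ntpq -p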

Raj

On Wed, Oct 30, 2019, 10:19 PM Raj  wrote:

> Raj,
> Just eyeballing your logs from the server and the client, it looks like
> their clocks differ. Are they out of sync? It is important for the clients
> and the servers to have the same time.
>
> On Wed, Oct 30, 2019 at 3:37 PM Raj Ayyampalayam  wrote:
>
>> Hello,
>>
>> A particular job (MPI Maker genome annotation) on our cluster produces
>> the following error, and the job fails with a "Could not open file" error.
>> Server: The server is running lustre-2.10.4.
>> Client: I've tried 2.10.5, 2.10.8 and 2.12.3 with the same result.
>> I don't see any other servers (other MDS and OSS nodes) reporting
>> communication loss to the client. The IB fabric is stable. The job runs to
>> completion when using local storage on the node or NFS-mounted storage.
>> The job creates a lot of IO but does not increase the load on the
>> lustre servers.
>>
>> Client:
>> Oct 22 14:56:39 n305 kernel: LustreError: 11-0:
>> lustre2-MDT-mdc-8c3f222c4800: operation ldlm_enqueue to node
>> 10.55.49.215@o2ib failed: rc = -107
>> Oct 22 14:56:39 n305 kernel: Lustre:
>> lustre2-MDT-mdc-8c3f222c4800: Connection to lustre2-MDT (at
>> 10.55.49.215@o2ib) was lost; in progress operations using this service
>> will wait for recovery to complete
>> Oct 22 14:56:39 n305 kernel: Lustre: Skipped 2 previous similar messages
>> Oct 22 14:56:39 n305 kernel: LustreError: 167-0:
>> lustre2-MDT-mdc-8c3f222c4800: This client was evicted by
>> lustre2-MDT; in progress operations using this service will fail.
>> Oct 22 14:56:39 n305 kernel: LustreError:
>> 125851:0:(file.c:172:ll_close_inode_openhandle())
>> lustre2-clilmv-8c3f222c4800: inode [0x2ef38:0xffd6:0x0] mdc close
>> failed: rc = -108
>> Oct 22 14:56:39 n305 kernel: LustreError: Skipped 1 previous similar
>> message
>> Oct 22 14:56:40 n305 kernel: LustreError:
>> 125959:0:(file.c:3644:ll_inode_revalidate_fini()) lustre2: revalidate FID
>> [0x2eedf:0xed9d:0x0] error: rc = -108
>> Oct 22 14:56:40 n305 kernel: LustreError:
>> 125665:0:(vvp_io.c:1474:vvp_io_init()) lustre2: refresh file layout
>> [0x2ef38:0xff55:0x0] error -108.
>> Oct 22 14:56:40 n305 kernel: LustreError:
>> 125883:0:(ldlm_resource.c:1100:ldlm_resource_complain())
>> lustre2-MDT-mdc-8c3f222c4800: namespace resource
>> [0x2ef38:0xff55:0x0].0x0 (8bdc6823c9c0) refcount nonzero (1) after
>> lock cleanup; forcing cleanup.
>> Oct 22 14:56:40 n305 kernel: LustreError:
>> 125883:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource:
>> [0x2ef38:0xff55:0x0].0x0 (8bdc6823c9c0) refcount = 1
>> Oct 22 14:56:40 n305 kernel: Lustre:
>> lustre2-MDT-mdc-8c3f222c4800: Connection restored to
>> 10.55.49.215@o2ib (at 10.55.49.215@o2ib)
>> Oct 22 14:56:40 n305 kernel: Lustre: Skipped 1 previous similar message
>> Oct 22 14:56:40 n305 kernel: LustreError:
>> 125959:0:(file.c:3644:ll_inode_revalidate_fini()) Skipped 2 previous
>> similar messages
>>
>> Server:
>> mds2-eno1: Oct 22 14:59:36 mds2 kernel: LustreError:
>> 7182:0:(ldlm_lockd.c:697:ldlm_handle_ast_error()) ### client (nid
>> 10.55.14.49@o2ib) failed to reply to blocking AST (req@881b0e68b900
>> x1635734905828112 status 0 rc -110), evict it ns: mdt-lustre2-MDT_UUID
>> lock: 88187ec45e00/0x121438a5db957b5 lrc: 4/0,0 mode: PR/PR res:
>> [0x2ef38:0xffec:0x0].0x0 bits 0x20 rrc: 4 type: IBT flags:
>> 0x6020040020 nid: 10.55.14.49@o2ib remote: 0x3154abaef2786884
>> expref: 72083 pid: 7182 timeout: 16143455124 lvb_type: 0
>> mds2-eno1: Oct 22 14:59:36 mds2 kernel: LustreError: 138-a:
>> lustre2-MDT: A client on nid 10.55.14.49@o2ib was evicted due to a
>> lock blocking callback time out: rc -110
>> mds2-eno1: Oct 22 14:59:36 mds2 kernel: Lustre: lustre2-MDT:
>> Connection restored to 3b42ec33-0885-6b7f-6575-9b200c4b6f55 (at
>> 10.55.14.49@o2ib)
>> mds2-eno1: Oct 22 14:59:37 mds2 kernel: LustreError:
>> 8936:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED
>> req@881b0e68b900 x1635734905828176/t0(0)
>> o104->lustre2-MDT@10.55.14.49@o2ib:15/16 lens 296/224 e 0 to 0 dl 0
>> ref 1 fl Rpc:/0/ rc 0/-1
>>
>>
>> Can anyone point me in the right direction on how to debug this issue?
>>
>> Thanks,
>> -Raj
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Repeatable ldlm_enqueue error

2019-10-30 Thread Raj Ayyampalayam
Hello,

A particular job (MPI Maker genome annotation) on our cluster produces the
following error, and the job fails with a "Could not open file" error.
Server: The server is running lustre-2.10.4.
Client: I've tried 2.10.5, 2.10.8 and 2.12.3 with the same result.
I don't see any other servers (other MDS and OSS nodes) reporting
communication loss to the client. The IB fabric is stable. The job runs to
completion when using local storage on the node or NFS-mounted storage.
The job creates a lot of IO but does not increase the load on the lustre
servers.

Client:
Oct 22 14:56:39 n305 kernel: LustreError: 11-0:
lustre2-MDT-mdc-8c3f222c4800: operation ldlm_enqueue to node
10.55.49.215@o2ib failed: rc = -107
Oct 22 14:56:39 n305 kernel: Lustre: lustre2-MDT-mdc-8c3f222c4800:
Connection to lustre2-MDT (at 10.55.49.215@o2ib) was lost; in progress
operations using this service will wait for recovery to complete
Oct 22 14:56:39 n305 kernel: Lustre: Skipped 2 previous similar messages
Oct 22 14:56:39 n305 kernel: LustreError: 167-0:
lustre2-MDT-mdc-8c3f222c4800: This client was evicted by
lustre2-MDT; in progress operations using this service will fail.
Oct 22 14:56:39 n305 kernel: LustreError:
125851:0:(file.c:172:ll_close_inode_openhandle())
lustre2-clilmv-8c3f222c4800: inode [0x2ef38:0xffd6:0x0] mdc close
failed: rc = -108
Oct 22 14:56:39 n305 kernel: LustreError: Skipped 1 previous similar message
Oct 22 14:56:40 n305 kernel: LustreError:
125959:0:(file.c:3644:ll_inode_revalidate_fini()) lustre2: revalidate FID
[0x2eedf:0xed9d:0x0] error: rc = -108
Oct 22 14:56:40 n305 kernel: LustreError:
125665:0:(vvp_io.c:1474:vvp_io_init()) lustre2: refresh file layout
[0x2ef38:0xff55:0x0] error -108.
Oct 22 14:56:40 n305 kernel: LustreError:
125883:0:(ldlm_resource.c:1100:ldlm_resource_complain())
lustre2-MDT-mdc-8c3f222c4800: namespace resource
[0x2ef38:0xff55:0x0].0x0 (8bdc6823c9c0) refcount nonzero (1) after
lock cleanup; forcing cleanup.
Oct 22 14:56:40 n305 kernel: LustreError:
125883:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource:
[0x2ef38:0xff55:0x0].0x0 (8bdc6823c9c0) refcount = 1
Oct 22 14:56:40 n305 kernel: Lustre: lustre2-MDT-mdc-8c3f222c4800:
Connection restored to 10.55.49.215@o2ib (at 10.55.49.215@o2ib)
Oct 22 14:56:40 n305 kernel: Lustre: Skipped 1 previous similar message
Oct 22 14:56:40 n305 kernel: LustreError:
125959:0:(file.c:3644:ll_inode_revalidate_fini()) Skipped 2 previous
similar messages

Server:
mds2-eno1: Oct 22 14:59:36 mds2 kernel: LustreError:
7182:0:(ldlm_lockd.c:697:ldlm_handle_ast_error()) ### client (nid
10.55.14.49@o2ib) failed to reply to blocking AST (req@881b0e68b900
x1635734905828112 status 0 rc -110), evict it ns: mdt-lustre2-MDT_UUID
lock: 88187ec45e00/0x121438a5db957b5 lrc: 4/0,0 mode: PR/PR res:
[0x2ef38:0xffec:0x0].0x0 bits 0x20 rrc: 4 type: IBT flags:
0x6020040020 nid: 10.55.14.49@o2ib remote: 0x3154abaef2786884 expref:
72083 pid: 7182 timeout: 16143455124 lvb_type: 0
mds2-eno1: Oct 22 14:59:36 mds2 kernel: LustreError: 138-a:
lustre2-MDT: A client on nid 10.55.14.49@o2ib was evicted due to a lock
blocking callback time out: rc -110
mds2-eno1: Oct 22 14:59:36 mds2 kernel: Lustre: lustre2-MDT: Connection
restored to 3b42ec33-0885-6b7f-6575-9b200c4b6f55 (at 10.55.14.49@o2ib)
mds2-eno1: Oct 22 14:59:37 mds2 kernel: LustreError:
8936:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED
req@881b0e68b900 x1635734905828176/t0(0)
o104->lustre2-MDT@10.55.14.49@o2ib:15/16 lens 296/224 e 0 to 0 dl 0 ref
1 fl Rpc:/0/ rc 0/-1


Can anyone point me in the right direction on how to debug this issue?
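(In case a concrete starting point helps: one way to capture more detail
around the next eviction, using the standard lctl debug controls in
2.10/2.12; the output path is just an example.)

  # On the client (n305) and the MDS, enlarge the debug buffer and enable
  # DLM lock and RPC tracing before re-running the job
  lctl set_param debug_mb=1024
  lctl set_param debug=+dlmtrace
  lctl set_param debug=+rpctrace

  # After the eviction reappears, dump the kernel debug log on both sides
  lctl dk /tmp/lustre-debug.$(hostname).log

  # Also worth recording the timeout/lock settings currently in effect
  lctl get_param timeout at_min at_max
  lctl get_param ldlm.namespaces.*mdc*.lru_size

Comparing the two dumps around the blocking-AST timestamps should show
whether the client ever saw the AST at all or simply failed to answer it in
time.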

Thanks,
-Raj
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-21 Thread Raj Ayyampalayam
Got it. I'd rather be safe than sorry. This is my first time making a Lustre
configuration change.

Raj

On Thu, Feb 21, 2019, 11:55 PM Raj  wrote:

> I also agree with Colin's comment.
> If the current OSTs are not touched, and you are only adding new OSTs to
> existing OSS nodes and adding new ost-mount resources to your existing
> (already running) Pacemaker configuration, you can achieve the upgrade with
> no downtime. If your Corosync/Pacemaker configuration is working correctly,
> you can fail over, fail back, and take turns rebooting each OSS node. But
> the chances of human error in doing this are high.
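(As a rough illustration of that failover/failback sequence with pcs — node
names and commands assume a stock Pacemaker setup, which may differ from what
Exascaler's management tooling expects:)

  # Drain oss1: its OST resources fail over to the HA partner
  pcs node standby oss1        # older pcs: pcs cluster standby oss1
  pcs status                   # wait until all of oss1's OSTs run on oss2

  # Reboot oss1, then fail back and move on to the next pair
  ssh oss1 reboot
  pcs node unstandby oss1
  pcs status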
>
> On Thu, Feb 21, 2019 at 10:30 PM Raj Ayyampalayam 
> wrote:
>
>> Hi Raj,
>>
>> Thanks for the explanation. We will have to rethink our upgrade process.
>>
>> Thanks again.
>> Raj
>>
>> On Thu, Feb 21, 2019, 10:23 PM Raj  wrote:
>>
>>> Hello Raj,
>>> It’s best and safest to unmount all the clients and then do the upgrade.
>>> Your FS is getting more OSTs and the configuration of the existing ones
>>> is changing, so your clients need to pick up the new layout by remounting.
>>> Also, you mentioned client eviction: during eviction the client has to
>>> drop its dirty pages, and all open file descriptors in the FS will be
>>> gone.
>>>
>>> On Thu, Feb 21, 2019 at 12:25 PM Raj Ayyampalayam 
>>> wrote:
>>>
>>>> What can I expect to happen to the jobs that are suspended during the
>>>> file system restart?
>>>> Will the processes holding an open file handle die when I unsuspend
>>>> them after the filesystem restart?
>>>>
>>>> Thanks!
>>>> -Raj
>>>>
>>>>
>>>> On Thu, Feb 21, 2019 at 12:52 PM Colin Faber  wrote:
>>>>
>>>>> Ah yes,
>>>>>
>>>>> If you're adding to an existing OSS, then you will need to reconfigure
>>>>> the file system, which requires a writeconf event.
>>>>>
>>>>
>>>>> On Thu, Feb 21, 2019 at 10:00 AM Raj Ayyampalayam 
>>>>> wrote:
>>>>>
>>>>>> The new OST's will be added to the existing file system (the OSS
>>>>>> nodes are already part of the filesystem), I will have to re-configure 
>>>>>> the
>>>>>> current HA resource configuration to tell it about the 4 new OST's.
>>>>>> Our exascaler's HA monitors the individual OST and I need to
>>>>>> re-configure the HA on the existing filesystem.
>>>>>>
>>>>>> Our vendor support has confirmed that we would have to restart the
>>>>>> filesystem if we want to regenerate the HA configs to include the new 
>>>>>> OST's.
>>>>>>
>>>>>> Thanks,
>>>>>> -Raj
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 21, 2019 at 11:23 AM Colin Faber 
>>>>>> wrote:
>>>>>>
>>>>>>> It seems to me that steps may still be missing?
>>>>>>>
>>>>>>> You're going to rack/stack and provision the OSS nodes with new
>>>>>>> OSTs.
>>>>>>>
>>>>>>> Then you're going to introduce failover options somewhere? new osts?
>>>>>>> existing system? etc?
>>>>>>>
>>>>>>> If you're introducing failover with the new OST's and leaving the
>>>>>>> existing system in place, you should be able to accomplish this without
>>>>>>> bringing the system offline.
>>>>>>>
>>>>>>> If you're going to be introducing failover to your existing system
>>>>>>> then you will need to reconfigure the file system to accommodate the new
>>>>>>> failover settings (failover nodes, etc.)
>>>>>>>
>>>>>>> -cf
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Feb 21, 2019 at 9:13 AM Raj Ayyampalayam 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Our upgrade strategy is as follows:
>>>>>>>>
>>>>>>>> 1) Load all disks into the storage array.
>>>>>>>> 2) Create RAID pools and virtual disks.
>>>>>>>> 3) Create lustre file system using mkfs.lustre command. (I still have
>>>>>>>> to figure out all the parameters used on the existing OSTs).

Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-21 Thread Raj Ayyampalayam
Hi Raj,

Thanks for the explanation. We will have to rethink our upgrade process.

Thanks again.
Raj

On Thu, Feb 21, 2019, 10:23 PM Raj  wrote:

> Hello Raj,
> It’s best and safest to unmount all the clients and then do the upgrade.
> Your FS is getting more OSTs and the configuration of the existing ones is
> changing, so your clients need to pick up the new layout by remounting.
> Also, you mentioned client eviction: during eviction the client has to
> drop its dirty pages, and all open file descriptors in the FS will be
> gone.
>
> On Thu, Feb 21, 2019 at 12:25 PM Raj Ayyampalayam 
> wrote:
>
>> What can I expect to happen to the jobs that are suspended during the
>> file system restart?
>> Will the processes holding an open file handle die when I unsuspend them
>> after the filesystem restart?
>>
>> Thanks!
>> -Raj
>>
>>
>> On Thu, Feb 21, 2019 at 12:52 PM Colin Faber  wrote:
>>
>>> Ah yes,
>>>
>>> If you're adding to an existing OSS, then you will need to reconfigure
>>> the file system, which requires a writeconf event.
>>>
>>
>>> On Thu, Feb 21, 2019 at 10:00 AM Raj Ayyampalayam 
>>> wrote:
>>>
>>>> The new OST's will be added to the existing file system (the OSS nodes
>>>> are already part of the filesystem), I will have to re-configure the
>>>> current HA resource configuration to tell it about the 4 new OST's.
>>>> Our exascaler's HA monitors the individual OST and I need to
>>>> re-configure the HA on the existing filesystem.
>>>>
>>>> Our vendor support has confirmed that we would have to restart the
>>>> filesystem if we want to regenerate the HA configs to include the new 
>>>> OST's.
>>>>
>>>> Thanks,
>>>> -Raj
>>>>
>>>>
>>>> On Thu, Feb 21, 2019 at 11:23 AM Colin Faber  wrote:
>>>>
>>>>> It seems to me that steps may still be missing?
>>>>>
>>>>> You're going to rack/stack and provision the OSS nodes with new OSTs.
>>>>>
>>>>> Then you're going to introduce failover options somewhere? new osts?
>>>>> existing system? etc?
>>>>>
>>>>> If you're introducing failover with the new OST's and leaving the
>>>>> existing system in place, you should be able to accomplish this without
>>>>> bringing the system offline.
>>>>>
>>>>> If you're going to be introducing failover to your existing system
>>>>> then you will need to reconfigure the file system to accommodate the new
>>>>> failover settings (failover nodes, etc.)
>>>>>
>>>>> -cf
>>>>>
>>>>>
>>>>> On Thu, Feb 21, 2019 at 9:13 AM Raj Ayyampalayam 
>>>>> wrote:
>>>>>
>>>>>> Our upgrade strategy is as follows:
>>>>>>
>>>>>> 1) Load all disks into the storage array.
>>>>>> 2) Create RAID pools and virtual disks.
>>>>>> 3) Create lustre file system using mkfs.lustre command. (I still have
>>>>>> to figure out all the parameters used on the existing OSTs).
>>>>>> 4) Create mount points on all OSSs.
>>>>>> 5) Mount the lustre OSTs.
>>>>>> 6) Maybe rebalance the filesystem.
>>>>>> My understanding is that the above can be done without bringing the
>>>>>> filesystem down. I want to create the HA configuration (corosync and
>>>>>> pacemaker) for the new OSTs. This step requires the filesystem to be 
>>>>>> down.
>>>>>> I want to know what would happen to the suspended processes across the
>>>>>> cluster when I bring the filesystem down to re-generate the HA configs.
>>>>>>
>>>>>> Thanks,
>>>>>> -Raj
>>>>>>
>>>>>> On Thu, Feb 21, 2019 at 12:59 AM Colin Faber 
>>>>>> wrote:
>>>>>>
>>>>>>> Can you provide more details on your upgrade strategy? In some cases
>>>>>>> expanding your storage shouldn't impact client / job activity at all.
>>>>>>>
>>>>>>> On Wed, Feb 20, 2019, 11:09 AM Raj Ayyampalayam 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>>

Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-21 Thread Raj Ayyampalayam
What can I expect to happen to the jobs that are suspended during the file
system restart?
Will the processes holding an open file handle die when I unsuspend them
after the filesystem restart?

Thanks!
-Raj


On Thu, Feb 21, 2019 at 12:52 PM Colin Faber  wrote:

> Ah yes,
>
> If you're adding to an existing OSS, then you will need to reconfigure the
> file system, which requires a writeconf event.
>
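(For the archives, a writeconf is roughly the sequence below — a sketch only,
with placeholder device paths; the whole filesystem has to be stopped first,
and the manual's ordering of targets followed:)

  # With all clients unmounted and all targets stopped:
  # regenerate the config logs on the MGS/MDT first, then on every OST
  tunefs.lustre --writeconf /dev/mapper/mdt0
  tunefs.lustre --writeconf /dev/mapper/ost0    # repeat for each OST device

  # Remount in order: MGS/MDT, then OSTs, then clients
  mount -t lustre /dev/mapper/mdt0 /mnt/lustre2/mdt0
  mount -t lustre /dev/mapper/ost0 /mnt/lustre2/ost0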
> On Thu, Feb 21, 2019 at 10:00 AM Raj Ayyampalayam 
> wrote:
>
>> The new OST's will be added to the existing file system (the OSS nodes
>> are already part of the filesystem), I will have to re-configure the
>> current HA resource configuration to tell it about the 4 new OST's.
>> Our exascaler's HA monitors the individual OST and I need to re-configure
>> the HA on the existing filesystem.
>>
>> Our vendor support has confirmed that we would have to restart the
>> filesystem if we want to regenerate the HA configs to include the new OST's.
>>
>> Thanks,
>> -Raj
>>
>>
>> On Thu, Feb 21, 2019 at 11:23 AM Colin Faber  wrote:
>>
>>> It seems to me that steps may still be missing?
>>>
>>> You're going to rack/stack and provision the OSS nodes with new OSTs.
>>>
>>> Then you're going to introduce failover options somewhere? new osts?
>>> existing system? etc?
>>>
>>> If you're introducing failover with the new OST's and leaving the
>>> existing system in place, you should be able to accomplish this without
>>> bringing the system offline.
>>>
>>> If you're going to be introducing failover to your existing system then
>>> you will need to reconfigure the file system to accommodate the new
>>> failover settings (failover nodes, etc.)
>>>
>>> -cf
>>>
>>>
>>> On Thu, Feb 21, 2019 at 9:13 AM Raj Ayyampalayam 
>>> wrote:
>>>
>>>> Our upgrade strategy is as follows:
>>>>
>>>> 1) Load all disks into the storage array.
>>>> 2) Create RAID pools and virtual disks.
>>>> 3) Create lustre file system using mkfs.lustre command. (I still have
>>>> to figure out all the parameters used on the existing OSTs).
>>>> 4) Create mount points on all OSSs.
>>>> 5) Mount the lustre OSTs.
>>>> 6) Maybe rebalance the filesystem.
>>>> My understanding is that the above can be done without bringing the
>>>> filesystem down. I want to create the HA configuration (corosync and
>>>> pacemaker) for the new OSTs. This step requires the filesystem to be down.
>>>> I want to know what would happen to the suspended processes across the
>>>> cluster when I bring the filesystem down to re-generate the HA configs.
>>>>
>>>> Thanks,
>>>> -Raj
>>>>
>>>> On Thu, Feb 21, 2019 at 12:59 AM Colin Faber  wrote:
>>>>
>>>>> Can you provide more details on your upgrade strategy? In some cases
>>>>> expanding your storage shouldn't impact client / job activity at all.
>>>>>
>>>>> On Wed, Feb 20, 2019, 11:09 AM Raj Ayyampalayam 
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> We are planning on expanding our storage by adding more OSTs to our
>>>>>> lustre file system. It looks like it would be easier to expand if we 
>>>>>> bring
>>>>>> the filesystem down and perform the necessary operations. We are planning
>>>>>> to suspend all the jobs running on the cluster. We originally planned to
>>>>>> add new OSTs to the live filesystem.
>>>>>>
>>>>>> We are trying to determine the potential impact to the suspended jobs
>>>>>> if we bring down the filesystem for the upgrade.
>>>>>> One of the questions we have is what would happen to the suspended
>>>>>> processes that hold an open file handle in the lustre file system when 
>>>>>> the
>>>>>> filesystem is brought down for the upgrade?
>>>>>> Will they recover from the client eviction?
>>>>>>
>>>>>> We do have vendor support and have engaged them. I wanted to ask the
>>>>>> community and get some feedback.
>>>>>>
>>>>>> Thanks,
>>>>>> -Raj
>>>>>>
>>>>> ___
>>>>>> lustre-discuss mailing list
>>>>>> lustre-discuss@lists.lustre.org
>>>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>>>>>
>>>>>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-21 Thread Raj Ayyampalayam
The new OSTs will be added to the existing file system (the OSS nodes are
already part of the filesystem), so I will have to re-configure the current
HA resource configuration to tell it about the 4 new OSTs.
Our ExaScaler's HA monitors the individual OSTs, and I need to re-configure
the HA on the existing filesystem.

Our vendor support has confirmed that we would have to restart the
filesystem if we want to regenerate the HA configs to include the new OSTs.
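(For what it's worth, on a plain corosync/pacemaker stack a new OST mount is
usually just another Filesystem resource, something like the sketch below —
resource, device and node names are placeholders, and an Exascaler-managed
configuration may require going through the vendor tools instead:)

  # Define the new OST mount as a Pacemaker resource
  pcs resource create lustre2-OST0010 ocf:heartbeat:Filesystem \
      device=/dev/mapper/ost10 directory=/mnt/lustre2/ost10 fstype=lustre \
      op monitor interval=30s timeout=120s

  # Prefer its primary OSS but allow failover to the partner node
  pcs constraint location lustre2-OST0010 prefers oss3=100
  pcs constraint location lustre2-OST0010 prefers oss4=50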

Thanks,
-Raj


On Thu, Feb 21, 2019 at 11:23 AM Colin Faber  wrote:

> It seems to me that steps may still be missing?
>
> You're going to rack/stack and provision the OSS nodes with new OSTs.
>
> Then you're going to introduce failover options somewhere? new osts?
> existing system? etc?
>
> If you're introducing failover with the new OST's and leaving the existing
> system in place, you should be able to accomplish this without bringing the
> system offline.
>
> If you're going to be introducing failover to your existing system then
> you will need to reconfigure the file system to accommodate the new
> failover settings (failover nodes, etc.)
>
> -cf
>
>
> On Thu, Feb 21, 2019 at 9:13 AM Raj Ayyampalayam  wrote:
>
>> Our upgrade strategy is as follows:
>>
>> 1) Load all disks into the storage array.
>> 2) Create RAID pools and virtual disks.
>> 3) Create lustre file system using mkfs.lustre command. (I still have to
>> figure out all the parameters used on the existing OSTs).
>> 4) Create mount points on all OSSs.
>> 5) Mount the lustre OSTs.
>> 6) Maybe rebalance the filesystem.
>> My understanding is that the above can be done without bringing the
>> filesystem down. I want to create the HA configuration (corosync and
>> pacemaker) for the new OSTs. This step requires the filesystem to be down.
>> I want to know what would happen to the suspended processes across the
>> cluster when I bring the filesystem down to re-generate the HA configs.
>>
>> Thanks,
>> -Raj
>>
>> On Thu, Feb 21, 2019 at 12:59 AM Colin Faber  wrote:
>>
>>> Can you provide more details on your upgrade strategy? In some cases
>>> expanding your storage shouldn't impact client / job activity at all.
>>>
>>> On Wed, Feb 20, 2019, 11:09 AM Raj Ayyampalayam 
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> We are planning on expanding our storage by adding more OSTs to our
>>>> lustre file system. It looks like it would be easier to expand if we bring
>>>> the filesystem down and perform the necessary operations. We are planning
>>>> to suspend all the jobs running on the cluster. We originally planned to
>>>> add new OSTs to the live filesystem.
>>>>
>>>> We are trying to determine the potential impact to the suspended jobs
>>>> if we bring down the filesystem for the upgrade.
>>>> One of the questions we have is what would happen to the suspended
>>>> processes that hold an open file handle in the lustre file system when the
>>>> filesystem is brought down for the upgrade?
>>>> Will they recover from the client eviction?
>>>>
>>>> We do have vendor support and have engaged them. I wanted to ask the
>>>> community and get some feedback.
>>>>
>>>> Thanks,
>>>> -Raj
>>>>
>>> ___
>>>> lustre-discuss mailing list
>>>> lustre-discuss@lists.lustre.org
>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>>>
>>>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-21 Thread Raj Ayyampalayam
Our upgrade strategy is as follows:

1) Load all disks into the storage array.
2) Create RAID pools and virtual disks.
3) Create the lustre file system using the mkfs.lustre command. (I still have
to figure out all the parameters used on the existing OSTs; see the sketch
after this list.)
4) Create mount points on all OSSs.
5) Mount the lustre OSTs.
6) Maybe rebalance the filesystem.
My understanding is that the above can be done without bringing the
filesystem down. I want to create the HA configuration (corosync and
pacemaker) for the new OSTs. This step requires the filesystem to be down.
I want to know what would happen to the suspended processes across the
cluster when I bring the filesystem down to re-generate the HA configs.
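For step 3, a rough way to recover the parameters used on the existing OSTs
and carry them over to a new one (device paths, index and NIDs below are
placeholders, not our real values):

  # Print an existing OST's configuration without changing anything
  tunefs.lustre --dryrun /dev/mapper/ost0

  # Format a new OST reusing the same fsname/mgsnode/servicenode settings
  mkfs.lustre --ost --fsname=lustre2 --index=16 \
      --mgsnode=10.55.49.215@o2ib \
      --servicenode=10.55.49.216@o2ib --servicenode=10.55.49.217@o2ib \
      /dev/mapper/ost16

  # For step 6, one way to rebalance is to migrate files off the fullest OSTs
  lfs df /mnt/lustre2
  lfs find /mnt/lustre2 --ost lustre2-OST0000_UUID -type f | lfs_migrate -y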

Thanks,
-Raj

On Thu, Feb 21, 2019 at 12:59 AM Colin Faber  wrote:

> Can you provide more details on your upgrade strategy? In some cases
> expanding your storage shouldn't impact client / job activity at all.
>
> On Wed, Feb 20, 2019, 11:09 AM Raj Ayyampalayam  wrote:
>
>> Hello,
>>
>> We are planning on expanding our storage by adding more OSTs to our
>> lustre file system. It looks like it would be easier to expand if we bring
>> the filesystem down and perform the necessary operations. We are planning
>> to suspend all the jobs running on the cluster. We originally planned to
>> add new OSTs to the live filesystem.
>>
>> We are trying to determine the potential impact to the suspended jobs if
>> we bring down the filesystem for the upgrade.
>> One of the questions we have is what would happen to the suspended
>> processes that hold an open file handle in the lustre file system when the
>> filesystem is brought down for the upgrade?
>> Will they recover from the client eviction?
>>
>> We do have vendor support and have engaged them. I wanted to ask the
>> community and get some feedback.
>>
>> Thanks,
>> -Raj
>>
> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-20 Thread Raj Ayyampalayam
Hello,

We are planning on expanding our storage by adding more OSTs to our lustre
file system. It looks like it would be easier to expand if we bring the
filesystem down and perform the necessary operations. We are planning to
suspend all the jobs running on the cluster. We originally planned to add
new OSTs to the live filesystem.

We are trying to determine the potential impact to the suspended jobs if we
bring down the filesystem for the upgrade.
One of the questions we have is what would happen to the suspended
processes that hold an open file handle in the lustre file system when the
filesystem is brought down for the upgrade?
Will they recover from the client eviction?
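(Assuming the restart itself goes cleanly, one way to watch recovery
afterwards — standard lctl parameters, target names are just examples:)

  # On the MDS/OSS nodes: recovery status per target after remount
  lctl get_param mdt.*.recovery_status
  lctl get_param obdfilter.*.recovery_status

  # On a client: import state of each target (should return to FULL)
  lctl get_param mdc.*.import | grep -E 'state|target'
  lctl get_param osc.*.import | grep -E 'state|target'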

We do have vendor support and have engaged them. I wanted to ask the
community and get some feedback.

Thanks,
-Raj
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre client 2.10.3 with 2.1 server

2018-02-28 Thread Raj Ayyampalayam
Yes, this is a CStor 1500 unit originally supplied by Xyratex.
Thanks for your recommendation.

-Raj

On Tue, Feb 27, 2018 at 8:05 PM Dilger, Andreas 
wrote:

> On Feb 27, 2018, at 16:19, Raj Ayyampalayam  wrote:
> >
> > We are using a lustre 2.1 server with a 2.5 client.
> >
> > Can the latest 2.10.3 client be used with the 2.1 server?
> > I figured I would ask the list before I start installing the client on a
> test node.
>
> I don't believe this is possible, due to changes in the protocol.  In any
> case, we haven't tested the 2.1 code in many years.
>
> Very likely your "2.1" server is really a vendor port with thousands of
> patches, so you might consider asking the vendor, in case they've tested
> this.  If not, then I'd strongly recommend upgrading to a newer release on
> the server.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Intel Corporation
>
>
>
>
>
>
>
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Lustre client 2.10.3 with 2.1 server

2018-02-27 Thread Raj Ayyampalayam
Hello,

We are using a lustre 2.1 server with a 2.5 client.

Can the latest 2.10.3 client be used with the 2.1 server?
I figured I would ask the list before I start installing the client on a
test node.
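(If it helps, a quick way to confirm what is actually running on an existing
client and on the servers before testing:)

  # Lustre version as reported by the running modules
  lctl get_param version

  # Version of the installed client package
  rpm -q lustre-client    # or: modinfo lustre | grep -i version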

Thanks!
Raj
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org