Re: [lustre-discuss] Repeatable ldlm_enqueue error

2019-10-31 Thread Raj Ayyampalayam
I had the same thought, so I checked all the nodes, and their clocks were
all exactly in sync.
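
For what it's worth, one quick sanity check is to parse the syslog timestamps
from both logs and compute the gap directly. A minimal Python sketch using the
sample lines from this thread (note the ~3 minute gap between these particular
events may simply be the eviction timeout elapsing, not clock skew):

```python
from datetime import datetime

def parse_syslog_ts(line, year=2019):
    """Extract the leading 'Mon DD HH:MM:SS' syslog timestamp from a log line."""
    stamp = " ".join(line.split()[:3])  # e.g. "Oct 22 14:56:39"
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

# Sample lines taken from the client and server logs in this thread:
client_line = "Oct 22 14:56:39 n305 kernel: LustreError: 11-0: ..."
server_line = "Oct 22 14:59:36 mds2 kernel: LustreError: 138-a: ..."

gap = parse_syslog_ts(server_line) - parse_syslog_ts(client_line)
print(gap.total_seconds())  # 177.0 seconds between the two log events
```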

Raj

On Wed, Oct 30, 2019, 10:19 PM Raj  wrote:

> Raj,
> Just eyeballing your logs from the server and client, it looks like their
> clocks differ. Are they out of sync? It is important for both clients and
> servers to have the same time.
>
> On Wed, Oct 30, 2019 at 3:37 PM Raj Ayyampalayam  wrote:
>
>> Hello,
>>
>> A particular job (MPI Maker genome annotation) on our cluster produces
>> the following errors, and the job fails with a "Could not open file"
>> error.
>> Server: The server is running lustre-2.10.4.
>> Client: I've tried 2.10.5, 2.10.8 and 2.12.3, all with the same result.
>> I don't see any other servers (other MDS and OSS nodes) reporting
>> communication loss to the client. The IB fabric is stable. The job runs to
>> completion when using local storage on the node or NFS-mounted storage.
>> The job generates a lot of I/O, but it does not increase the load on the
>> Lustre servers.
>>
>> Client:
>> Oct 22 14:56:39 n305 kernel: LustreError: 11-0:
>> lustre2-MDT-mdc-8c3f222c4800: operation ldlm_enqueue to node
>> 10.55.49.215@o2ib failed: rc = -107
>> Oct 22 14:56:39 n305 kernel: Lustre:
>> lustre2-MDT-mdc-8c3f222c4800: Connection to lustre2-MDT (at
>> 10.55.49.215@o2ib) was lost; in progress operations using this service
>> will wait for recovery to complete
>> Oct 22 14:56:39 n305 kernel: Lustre: Skipped 2 previous similar messages
>> Oct 22 14:56:39 n305 kernel: LustreError: 167-0:
>> lustre2-MDT-mdc-8c3f222c4800: This client was evicted by
>> lustre2-MDT; in progress operations using this service will fail.
>> Oct 22 14:56:39 n305 kernel: LustreError:
>> 125851:0:(file.c:172:ll_close_inode_openhandle())
>> lustre2-clilmv-8c3f222c4800: inode [0x2ef38:0xffd6:0x0] mdc close
>> failed: rc = -108
>> Oct 22 14:56:39 n305 kernel: LustreError: Skipped 1 previous similar
>> message
>> Oct 22 14:56:40 n305 kernel: LustreError:
>> 125959:0:(file.c:3644:ll_inode_revalidate_fini()) lustre2: revalidate FID
>> [0x2eedf:0xed9d:0x0] error: rc = -108
>> Oct 22 14:56:40 n305 kernel: LustreError:
>> 125665:0:(vvp_io.c:1474:vvp_io_init()) lustre2: refresh file layout
>> [0x2ef38:0xff55:0x0] error -108.
>> Oct 22 14:56:40 n305 kernel: LustreError:
>> 125883:0:(ldlm_resource.c:1100:ldlm_resource_complain())
>> lustre2-MDT-mdc-8c3f222c4800: namespace resource
>> [0x2ef38:0xff55:0x0].0x0 (8bdc6823c9c0) refcount nonzero (1) after
>> lock cleanup; forcing cleanup.
>> Oct 22 14:56:40 n305 kernel: LustreError:
>> 125883:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource:
>> [0x2ef38:0xff55:0x0].0x0 (8bdc6823c9c0) refcount = 1
>> Oct 22 14:56:40 n305 kernel: Lustre:
>> lustre2-MDT-mdc-8c3f222c4800: Connection restored to
>> 10.55.49.215@o2ib (at 10.55.49.215@o2ib)
>> Oct 22 14:56:40 n305 kernel: Lustre: Skipped 1 previous similar message
>> Oct 22 14:56:40 n305 kernel: LustreError:
>> 125959:0:(file.c:3644:ll_inode_revalidate_fini()) Skipped 2 previous
>> similar messages
>>
>> Server:
>> mds2-eno1: Oct 22 14:59:36 mds2 kernel: LustreError:
>> 7182:0:(ldlm_lockd.c:697:ldlm_handle_ast_error()) ### client (nid
>> 10.55.14.49@o2ib) failed to reply to blocking AST (req@881b0e68b900
>> x1635734905828112 status 0 rc -110), evict it ns: mdt-lustre2-MDT_UUID
>> lock: 88187ec45e00/0x121438a5db957b5 lrc: 4/0,0 mode: PR/PR res:
>> [0x2ef38:0xffec:0x0].0x0 bits 0x20 rrc: 4 type: IBT flags:
>> 0x6020040020 nid: 10.55.14.49@o2ib remote: 0x3154abaef2786884
>> expref: 72083 pid: 7182 timeout: 16143455124 lvb_type: 0
>> mds2-eno1: Oct 22 14:59:36 mds2 kernel: LustreError: 138-a:
>> lustre2-MDT: A client on nid 10.55.14.49@o2ib was evicted due to a
>> lock blocking callback time out: rc -110
>> mds2-eno1: Oct 22 14:59:36 mds2 kernel: Lustre: lustre2-MDT:
>> Connection restored to 3b42ec33-0885-6b7f-6575-9b200c4b6f55 (at
>> 10.55.14.49@o2ib)
>> mds2-eno1: Oct 22 14:59:37 mds2 kernel: LustreError:
>> 8936:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED
>> req@881b0e68b900 x1635734905828176/t0(0)
>> o104->lustre2-MDT@10.55.14.49@o2ib:15/16 lens 296/224 e 0 to 0 dl 0
>> ref 1 fl Rpc:/0/ rc 0/-1
>>
>>
>> Can anyone point me in the right direction on how to debug this issue?
>>
>> Thanks,
>> -Raj
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
>
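
A side note that may help with debugging: the negative rc values in these
messages (-107, -108, -110) are ordinary Linux errno codes, so decoding them
is a useful first step. A small sketch; the mapping below is hardcoded from
the Linux errno headers rather than queried from a running kernel:

```python
# Negative rc values in Lustre kernel messages are Linux errno codes
# (see include/uapi/asm-generic/errno.h in the kernel tree).
LINUX_ERRNO = {
    107: "ENOTCONN",   # transport endpoint is not connected
    108: "ESHUTDOWN",  # cannot send after shutdown (typical after an eviction)
    110: "ETIMEDOUT",  # connection timed out (the blocking AST reply timed out)
}

def decode_rc(rc):
    """Translate a Lustre log 'rc = -NNN' value into an errno name."""
    return LINUX_ERRNO.get(-rc, f"unknown ({rc})")

for rc in (-107, -108, -110):
    print(rc, decode_rc(rc))
```

Reading the logs this way, the MDS evicted the client because a blocking AST
got no reply in time (ETIMEDOUT), after which the client's in-flight
operations failed with ENOTCONN/ESHUTDOWN.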


[lustre-discuss] Disk caching

2019-10-31 Thread Louis Allen
I'm currently looking at how to get the best performance out of Lustre when 
deploying to Azure, and I would like to know whether disk caching should be 
enabled or considered at all, at the OS or Azure level. I have the following 
options available to me in Azure:

Host Caching: None, Read-Only or Read-Write

I'm seeing high I/O wait build up with my current configuration, which has 
Host Caching: Read-Write enabled in Azure for my disks (as well as the 
default OS disk caching), and I was wondering if that might be the cause.
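
To confirm how much time actually goes to iowait before changing cache
settings, the aggregate CPU counters in /proc/stat can be sampled a few
seconds apart. A rough Python sketch; the two sample strings below are
made-up illustrative values, not real measurements:

```python
# Field order after the "cpu" tag in /proc/stat:
# user nice system idle iowait irq softirq steal guest guest_nice
def iowait_fraction(sample_before, sample_after):
    """Fraction of elapsed CPU time spent in iowait between two /proc/stat samples."""
    before = list(map(int, sample_before.split()[1:]))
    after = list(map(int, sample_after.split()[1:]))
    deltas = [b - a for a, b in zip(before, after)]
    return deltas[4] / sum(deltas)  # index 4 is the iowait column

# Hypothetical samples, e.g. read from /proc/stat a few seconds apart:
sample_1 = "cpu 1000 0 500 8000 200 0 0 0"
sample_2 = "cpu 1100 0 550 8600 500 0 0 0"
print(round(iowait_fraction(sample_1, sample_2), 2))  # 0.29
```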

Thanks