I had the same thought, so I checked all the nodes, and they all had exactly the same time.
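For reference, this is roughly how that check can be scripted rather than eyeballed. A minimal sketch, assuming passwordless ssh to each node; the `NODES` list is hypothetical, so substitute your own client and server hostnames:

```python
import subprocess
import time

# Hypothetical node list -- substitute your Lustre client/server hostnames.
NODES = ["n305", "mds2"]

def clock_offset(local_epoch, remote_epoch):
    """Signed offset in seconds of a remote clock relative to the local one."""
    return remote_epoch - local_epoch

def check_nodes(nodes):
    local = int(time.time())
    for host in nodes:
        # Assumes passwordless ssh; 'date +%s' prints the remote epoch time.
        out = subprocess.run(["ssh", host, "date", "+%s"],
                             capture_output=True, text=True)
        if out.returncode == 0:
            print(f"{host}: {clock_offset(local, int(out.stdout)):+d}s")
        else:
            print(f"{host}: unreachable")
```

Anything beyond a second or two of skew is worth fixing with ntpd or chrony before digging further into Lustre itself.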
Raj

On Wed, Oct 30, 2019, 10:19 PM Raj <rajgau...@gmail.com> wrote:

> Raj,
> Just eyeballing your logs from the server and the client, it looks like
> they have different times. Are they out of sync? It is important for both
> the clients and the servers to have the same time.
>
> On Wed, Oct 30, 2019 at 3:37 PM Raj Ayyampalayam <ans...@gmail.com> wrote:
>
>> Hello,
>>
>> A particular job (MPI Maker genome annotation) on our cluster produces
>> the following errors, and the job fails with a "Could not open file"
>> error.
>> Server: The server is running lustre-2.10.4.
>> Client: I've tried 2.10.5, 2.10.8, and 2.12.3 with the same result.
>> I don't see any of the other servers (the other MDS and OSS nodes)
>> reporting communication loss to the client. The IB fabric is stable.
>> The job runs to completion when using local storage on the node or an
>> NFS-mounted storage. The job creates a lot of I/O, but it does not
>> increase the load on the Lustre servers.
>>
>> Client:
>> Oct 22 14:56:39 n305 kernel: LustreError: 11-0: lustre2-MDT0000-mdc-ffff8c3f222c4800: operation ldlm_enqueue to node 10.55.49.215@o2ib failed: rc = -107
>> Oct 22 14:56:39 n305 kernel: Lustre: lustre2-MDT0000-mdc-ffff8c3f222c4800: Connection to lustre2-MDT0000 (at 10.55.49.215@o2ib) was lost; in progress operations using this service will wait for recovery to complete
>> Oct 22 14:56:39 n305 kernel: Lustre: Skipped 2 previous similar messages
>> Oct 22 14:56:39 n305 kernel: LustreError: 167-0: lustre2-MDT0000-mdc-ffff8c3f222c4800: This client was evicted by lustre2-MDT0000; in progress operations using this service will fail.
>> Oct 22 14:56:39 n305 kernel: LustreError: 125851:0:(file.c:172:ll_close_inode_openhandle()) lustre2-clilmv-ffff8c3f222c4800: inode [0x20000ef38:0xffd6:0x0] mdc close failed: rc = -108
>> Oct 22 14:56:39 n305 kernel: LustreError: Skipped 1 previous similar message
>> Oct 22 14:56:40 n305 kernel: LustreError: 125959:0:(file.c:3644:ll_inode_revalidate_fini()) lustre2: revalidate FID [0x20000eedf:0xed9d:0x0] error: rc = -108
>> Oct 22 14:56:40 n305 kernel: LustreError: 125665:0:(vvp_io.c:1474:vvp_io_init()) lustre2: refresh file layout [0x20000ef38:0xff55:0x0] error -108.
>> Oct 22 14:56:40 n305 kernel: LustreError: 125883:0:(ldlm_resource.c:1100:ldlm_resource_complain()) lustre2-MDT0000-mdc-ffff8c3f222c4800: namespace resource [0x20000ef38:0xff55:0x0].0x0 (ffff8bdc6823c9c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
>> Oct 22 14:56:40 n305 kernel: LustreError: 125883:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x20000ef38:0xff55:0x0].0x0 (ffff8bdc6823c9c0) refcount = 1
>> Oct 22 14:56:40 n305 kernel: Lustre: lustre2-MDT0000-mdc-ffff8c3f222c4800: Connection restored to 10.55.49.215@o2ib (at 10.55.49.215@o2ib)
>> Oct 22 14:56:40 n305 kernel: Lustre: Skipped 1 previous similar message
>> Oct 22 14:56:40 n305 kernel: LustreError: 125959:0:(file.c:3644:ll_inode_revalidate_fini()) Skipped 2 previous similar messages
>>
>> Server:
>> mds2-eno1: Oct 22 14:59:36 mds2 kernel: LustreError: 7182:0:(ldlm_lockd.c:697:ldlm_handle_ast_error()) ### client (nid 10.55.14.49@o2ib) failed to reply to blocking AST (req@ffff881b0e68b900 x1635734905828112 status 0 rc -110), evict it ns: mdt-lustre2-MDT0000_UUID lock: ffff88187ec45e00/0x121438a5db957b5 lrc: 4/0,0 mode: PR/PR res: [0x20000ef38:0xffec:0x0].0x0 bits 0x20 rrc: 4 type: IBT flags: 0x60200400000020 nid: 10.55.14.49@o2ib remote: 0x3154abaef2786884 expref: 72083 pid: 7182 timeout: 16143455124 lvb_type: 0
>> mds2-eno1: Oct 22 14:59:36 mds2 kernel: LustreError: 138-a: lustre2-MDT0000: A client on nid 10.55.14.49@o2ib was evicted due to a lock blocking callback time out: rc -110
>> mds2-eno1: Oct 22 14:59:36 mds2 kernel: Lustre: lustre2-MDT0000: Connection restored to 3b42ec33-0885-6b7f-6575-9b200c4b6f55 (at 10.55.14.49@o2ib)
>> mds2-eno1: Oct 22 14:59:37 mds2 kernel: LustreError: 8936:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff881b0e68b900 x1635734905828176/t0(0) o104->lustre2-MDT0000@10.55.14.49@o2ib:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
>>
>> Can anyone point me in the right direction on how to debug this issue?
>>
>> Thanks,
>> -Raj
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
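As an aside, the negative rc values in the quoted logs (-107, -108, -110) are standard Linux kernel errno codes, so they can be decoded mechanically. A minimal sketch using Python's errno module on Linux:

```python
import errno

def decode_rc(rc):
    """Map a negative Lustre rc value to its Linux errno name."""
    return errno.errorcode.get(-rc, "unknown")

# rc values seen in the logs above:
for rc in (-107, -108, -110):
    print(rc, decode_rc(rc))
```

On Linux this prints ENOTCONN (-107), ESHUTDOWN (-108), and ETIMEDOUT (-110), which matches the story the logs tell: the client lost its MDT connection, in-flight operations were shut down, and the server evicted the client after its blocking AST reply timed out.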
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org