CentOS 7.2
Lustre 2.8.0
ZFS 0.6.5.5
OPA 10.2.0.0.158

The clients and servers are on the same OPA network, with no routing. Once a client gets into this state, filesystem performance drops to a fraction of what it is capable of.
The client must be rebooted to clear the issue.

I imagine I am missing an existing bug in Jira for this. Does this look like a known issue?


Pertinent debug messages from the server:

00000800:00020000:34.0:1478026118.277782:0:29892:0:(o2iblnd_cb.c:3109:kiblnd_check_txs_locked()) Timed out tx: active_txs, 4 seconds
00000800:00020000:34.0:1478026118.277785:0:29892:0:(o2iblnd_cb.c:3172:kiblnd_check_conns()) Timed out RDMA with 10.4.119.112@o2ib (3): c: 112, oc: 0, rc: 66
00000800:00000100:34.0:1478026118.277787:0:29892:0:(o2iblnd_cb.c:1913:kiblnd_close_conn_locked()) Closing conn to 10.4.119.112@o2ib: error -110(waiting)
00000100:00020000:34.0:1478026118.277844:0:29892:0:(events.c:447:server_bulk_callback()) event type 5, status -103, desc ffff883e8e8bcc00
00000100:00020000:34.0:1478026118.288714:0:29892:0:(events.c:447:server_bulk_callback()) event type 3, status -103, desc ffff883e8e8bcc00
00000100:00020000:34.0:1478026118.299574:0:29892:0:(events.c:447:server_bulk_callback()) event type 5, status -103, desc ffff8810e92e9c00
00000100:00020000:34.0:1478026118.310434:0:29892:0:(events.c:447:server_bulk_callback()) event type 3, status -103, desc ffff8810e92e9c00


And from the client:

00000400:00000100:8.0:1477949860.565777:0:3629:0:(lib-move.c:1489:lnet_parse_put()) Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 offset 0 length 192: 4
00000400:00000100:8.0:1477949860.565782:0:3629:0:(lib-move.c:1489:lnet_parse_put()) Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 offset 0 length 192: 4
00000800:00020000:8.0:1477949860.702666:0:3629:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.4.108.81@o2ib (32), src idx/frags: 16/27 dst idx/frags: 16/27
00000800:00020000:8.0:1477949860.702667:0:3629:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.4.108.81@o2ib: -90
00000100:00020000:8.0:1477949860.702669:0:3629:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff880fd5d9bc00
00000800:00020000:8.0:1477949860.816666:0:3629:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.4.108.81@o2ib (32), src idx/frags: 16/27 dst idx/frags: 16/27
00000800:00020000:8.0:1477949860.816668:0:3629:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.4.108.81@o2ib: -90
00000100:00020000:8.0:1477949860.816669:0:3629:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff880fd5d9bc00
00000400:00000100:8.0:1477949861.573660:0:3629:0:(lib-move.c:1489:lnet_parse_put()) Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 offset 0 length 192: 4
00000400:00000100:8.0:1477949861.573664:0:3629:0:(lib-move.c:1489:lnet_parse_put()) Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 offset 0 length 192: 4
00000400:00000100:8.0:1477949861.573667:0:3629:0:(lib-move.c:1489:lnet_parse_put()) Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 offset 0 length 192: 4
00000400:00000100:8.0:1477949861.573669:0:3629:0:(lib-move.c:1489:lnet_parse_put()) Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 offset 0 length 192: 4
00000400:00000100:8.0:1477949861.573671:0:3629:0:(lib-move.c:1489:lnet_parse_put()) Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 offset 0 length 192: 4
00000400:00000100:8.0:1477949861.573673:0:3629:0:(lib-move.c:1489:lnet_parse_put()) Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 offset 0 length 192: 4
00000400:00000100:8.0:1477949861.573675:0:3629:0:(lib-move.c:1489:lnet_parse_put()) Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 offset 0 length 192: 4
00000400:00000100:8.0:1477949861.573677:0:3629:0:(lib-move.c:1489:lnet_parse_put()) Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 offset 0 length 192: 4
00000800:00020000:8.0:1477949861.721668:0:3629:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.4.108.81@o2ib (32), src idx/frags: 16/27 dst idx/frags: 16/27
00000800:00020000:8.0:1477949861.721669:0:3629:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.4.108.81@o2ib: -90
00000100:00020000:8.0:1477949861.721670:0:3629:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff880fd5d9bc00
00000800:00020000:8.0:1477949861.836668:0:3629:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.4.108.81@o2ib (32), src idx/frags: 16/27 dst idx/frags: 16/27
00000800:00020000:8.0:1477949861.836669:0:3629:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.4.108.81@o2ib: -90
00000100:00020000:8.0:1477949861.836670:0:3629:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff880fd5d9bc00
00000800:00020000:8.0:1477949862.061668:0:3629:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.4.108.81@o2ib (32), src idx/frags: 16/27 dst idx/frags: 16/27
00000800:00020000:8.0:1477949862.061669:0:3629:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.4.108.81@o2ib: -90
00000100:00020000:8.0:1477949862.061670:0:3629:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc ffff880fd5d9bc00
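
For anyone reading the numeric codes in the messages above, here is a rough sketch that maps them to their Linux errno names. The helper names (decode, codes_in) and the hardcoded table are my own illustration and not part of Lustre or LNet. The sample strings are trimmed from the logs above.

```python
import re

# Linux errno values (asm-generic numbering), hardcoded so the result does
# not depend on the platform running this script. Only the codes that appear
# in the logs above are listed.
LINUX_ERRNO = {
    5: "EIO",
    90: "EMSGSIZE",
    103: "ECONNABORTED",
    110: "ETIMEDOUT",
}

def decode(code):
    """Return the symbolic name for a negative kernel-style error code."""
    return LINUX_ERRNO.get(abs(code), "unknown")

def codes_in(text):
    """Collect the distinct negative codes that follow 'error ', 'status ' or ': '."""
    pattern = re.compile(r"(?:error |status |: )(-\d+)")
    return sorted({int(m) for m in pattern.findall(text)}, reverse=True)

# Message fragments trimmed from the server and client logs.
SAMPLE = """\
Closing conn to 10.4.119.112@o2ib: error -110(waiting)
event type 5, status -103, desc ffff883e8e8bcc00
Can't setup rdma for GET from 10.4.108.81@o2ib: -90
event type 1, status -5, desc ffff880fd5d9bc00
"""

if __name__ == "__main__":
    for code in codes_in(SAMPLE):
        print(code, decode(code))
```

Read this way, the server side shows a timeout (-110) followed by aborted connections (-103), while the client side shows a message-too-long error (-90) from kiblnd_init_rdma() that surfaces as an I/O error (-5) in the bulk callback.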

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org