For the "RDMA has too many fragments" issue, you need the newly landed patch: http://review.whamcloud.com/12451. For the slow access, I'm not sure whether that is related to the too-many-fragments error. Once a node hits the too-many-fragments error, it usually needs the LNet module unloaded and reloaded to recover.
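Roughly, on the affected client, that looks like the following (a sketch only; it assumes the node can be drained of Lustre I/O first and that your mount setup may differ):

# umount -a -t lustre        (stop Lustre on the node)
# lustre_rmmod               (remove the Lustre and LNet modules)
# modprobe lustre
# lctl network up            (bring LNet back up; remounting Lustre will also do this)
# mount -a -t lustre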
Doug

On May 1, 2017, at 7:47 AM, Hans Henrik Happe <ha...@nbi.ku.dk> wrote:

Hi,

We have experienced problems with losing the connection to an OSS. It starts with:

May 1 03:35:46 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.21.10.116@o2ib (256), src idx/frags: 128/236 dst idx/frags: 128/236
May 1 03:35:46 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.21.10.116@o2ib: -90

The rest of the log is attached. After this, Lustre access is very slow, i.e. a 'df' can take minutes. Also, 'lctl ping' to the OSS gives I/O errors. Doing 'lnet net del/add' makes ping work again until file I/O starts; then the I/O errors return.

We use both IB and TCP on the servers, so no routers.

In the attached log astro-OST0001 has been moved to the other server in the HA pair. This is because 'lctl dl -t' showed strange output when it was on the right server:

# lctl dl -t
0 UP mgc MGC10.21.10.102@o2ib 0b0bbbce-63b6-bf47-403c-28f0c53e8307 5
1 UP lov astro-clilov-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 4
2 UP lmv astro-clilmv-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 4
3 UP mdc astro-MDT0000-mdc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.102@o2ib
4 UP osc astro-OST0002-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.116@o2ib
5 UP osc astro-OST0001-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 172.20.10.115@tcp1
6 UP osc astro-OST0003-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.117@o2ib
7 UP osc astro-OST0000-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.114@o2ib

So astro-OST0001 seems to be connected through 172.20.10.115@tcp1, even though it should use 10.21.10.115@o2ib (verified by a performance test and by disabling tcp1 on the IB nodes).

Please ask for more details if needed.

Cheers,
Hans Henrik

<client.log>
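On the astro-OST0001/tcp1 observation above: the usual way to keep a client off tcp1 so the OSTs only connect over IB is the lnet 'networks' module option. A minimal sketch, assuming the IB interface on the client is ib0 (adjust for your hosts), in /etc/modprobe.d/lustre.conf:

options lnet networks="o2ib(ib0)"

This only takes effect after the LNet module is reloaded, so it fits naturally with the unload/reload step above.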
_______________________________________________ lustre-discuss mailing list lustre-discuss@lists.lustre.org http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org