For the "RDMA has too many fragments" issue, you need the newly landed patch:
http://review.whamcloud.com/12451.  As for the slow access, I'm not sure whether
it is related to the "too many fragments" error.  Once a node hits the too many
fragments error, it usually needs the LNet module unloaded and reloaded to
recover.
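As a rough sketch, the unload/reload recovery described above might look like the following on the affected client. This is an illustration, not a verified procedure: the mount point `/mnt/lustre` is a placeholder, and the MGS NID and `astro` fsname are taken from the `lctl dl -t` output later in this thread.

```shell
# Sketch of recovering a node from the "too many fragments" state by
# reloading LNet. Run as root on the affected client; adjust paths/NIDs.
umount /mnt/lustre            # stop Lustre I/O on this node first (example mount point)
lctl network down             # bring LNet networking down
lustre_rmmod                  # unload the Lustre and LNet kernel modules
modprobe lnet                 # reload LNet
lctl network up               # bring networking back up
lctl ping 10.21.10.116@o2ib   # check that the problem OSS responds again
# Remount the filesystem (MGS NID and fsname as seen in this thread's lctl dl output):
mount -t lustre 10.21.10.102@o2ib:/astro /mnt/lustre
```

If `lctl ping` still returns I/O errors after the reload, the fragment mismatch is likely being renegotiated on reconnect, which is what the patch above addresses.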

Doug

On May 1, 2017, at 7:47 AM, Hans Henrik Happe <ha...@nbi.ku.dk> wrote:

Hi,

We have experienced problems with losing the connection to an OSS. It starts with:

May  1 03:35:46 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.21.10.116@o2ib (256), src idx/frags: 128/236 dst idx/frags: 128/236
May  1 03:35:46 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.21.10.116@o2ib: -90

The rest of the log is attached.

After this, Lustre access is very slow; e.g., a 'df' can take minutes.
Also, 'lctl ping' to the OSS gives I/O errors. Doing 'lnet net del/add'
makes ping work again until file I/O starts; then the I/O errors return.

We use both IB and TCP on servers, so no routers.

In the attached log astro-OST0001 has been moved to the other server in
the HA pair. This is because 'lctl dl -t' showed strange output when on
the right server:

# lctl dl -t
 0 UP mgc MGC10.21.10.102@o2ib 0b0bbbce-63b6-bf47-403c-28f0c53e8307 5
 1 UP lov astro-clilov-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 4
 2 UP lmv astro-clilmv-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 4
 3 UP mdc astro-MDT0000-mdc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.102@o2ib
 4 UP osc astro-OST0002-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.116@o2ib
 5 UP osc astro-OST0001-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 172.20.10.115@tcp1
 6 UP osc astro-OST0003-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.117@o2ib
 7 UP osc astro-OST0000-osc-ffff88107412e800 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.114@o2ib

So astro-OST0001 seems to be connected through 172.20.10.115@tcp1, even
though it should be using 10.21.10.115@o2ib (verified by a performance
test and by disabling tcp1 on the IB nodes).

Please ask for more details if needed.

Cheers,
Hans Henrik

<client.log>

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
