Hi, I tried some of the network tuning suggestions made by other people on this list who faced similar problems:
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.core.rmem_default = 87380
net.core.wmem_default = 65536
net.core.optmem_max = 25165824
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
net.ipv4.tcp_moderate_rcvbuf = 0
net.ipv4.tcp_synack_retries = 2
net.core.somaxconn = 16384
net.core.netdev_max_backlog = 250000
net.ipv4.tcp_max_syn_backlog = 252144
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_slow_start_after_idle = 0

Some of these seemed to help a bit, as it would take longer to trigger the issue (about 10-15 minutes instead of 5 minutes). However, what ultimately solved the issue was upgrading the kernel to 3.19. I also had to update the Mellanox drivers at the same time, so I guess it could have been that. Anyway, I have now done a "rados bench write" for over 100 minutes with no sign of the issue.

Thanks,
Brendan
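
For anyone wanting to check which of the sysctl settings listed above are actually in effect on a node, here is a minimal Python sketch. It assumes the settings are kept as "key = value" lines (sysctl.conf style) and that each key maps to a file under /proc/sys with dots replaced by slashes. The helper names (parse_sysctl, proc_path, diff_settings) are made up for illustration and are not part of any Ceph or kernel tooling.

```python
from pathlib import Path


def parse_sysctl(text):
    """Parse 'key = value' lines (sysctl.conf style) into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        # Collapse runs of whitespace so "4096  87380" compares equal to "4096 87380".
        settings[key.strip()] = " ".join(value.split())
    return settings


def proc_path(key, root="/proc/sys"):
    """Map a sysctl key such as net.core.rmem_max to its /proc/sys file."""
    return Path(root) / key.replace(".", "/")


def diff_settings(wanted, root="/proc/sys"):
    """Return {key: (current, wanted)} for every key whose live value differs.

    current is None when the key cannot be read (for example, a setting
    that does not exist on the running kernel).
    """
    differences = {}
    for key, value in wanted.items():
        try:
            current = " ".join(proc_path(key, root).read_text().split())
        except OSError:
            current = None
        if current != value:
            differences[key] = (current, value)
    return differences


if __name__ == "__main__":
    # A few of the values from this post, used only as sample input.
    sample = """
    net.core.rmem_max = 33554432
    net.ipv4.tcp_rmem = 4096 87380 33554432
    net.ipv4.tcp_slow_start_after_idle = 0
    """
    for key, (current, wanted) in diff_settings(parse_sysctl(sample)).items():
        print(f"{key}: current={current!r} wanted={wanted!r}")
```

The script only reads values and changes nothing. Run on a Linux host, it prints one line per setting that differs from the list or cannot be read.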
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com