Hi, I want to test Ceph's performance with RDMA, so I built Ceph with RDMA support and deployed it into my test environment manually.
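For reference, I built it roughly as follows (from memory; WITH_RDMA is what I believe enables the RDMA messenger, but take the exact cmake flag as my assumption):

    git clone https://github.com/ceph/ceph.git
    cd ceph
    git submodule update --init --recursive
    # WITH_RDMA: the cmake option I believe turns on the RDMA-capable async messenger
    ./do_cmake.sh -DWITH_RDMA=ON
    cd build
    make -j$(nproc)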
I use fio for my performance evaluation, and it works fine when Ceph uses *async + posix* as its ms_type. After changing the ms_type from *async + posix* to *async + rdma*, some OSDs go down during the performance test, so fio can't finish its job. The log files of those OSDs show that something goes wrong when the OSD tries to send a message, as you can see below.

...
2017-03-20 09:43:10.096042 7faac163e700 -1 Infiniband recv_msg got error -104: (104) Connection reset by peer
2017-03-20 09:43:10.096314 7faac163e700 0 -- 10.0.0.16:6809/23853 >> 10.0.0.17:6813/32315 conn(0x563de5282000 :-1 s=STATE_OPEN pgs=264 cs=29 l=0).fault initiating reconnect
2017-03-20 09:43:10.251606 7faac1e3f700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
2017-03-20 09:43:10.251755 7faac1e3f700 0 -- 10.0.0.16:6809/23853 >> 10.0.0.17:6821/32509 conn(0x563de51f1000 :-1 s=STATE_OPEN pgs=314 cs=24 l=0).fault initiating reconnect
2017-03-20 09:43:10.254103 7faac1e3f700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
2017-03-20 09:43:10.254375 7faac1e3f700 0 -- 10.0.0.16:6809/23853 >> 10.0.0.15:6821/48196 conn(0x563de514b000 :6809 s=STATE_OPEN pgs=275 cs=30 l=0).fault initiating reconnect
2017-03-20 09:43:10.260622 7faac1e3f700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
2017-03-20 09:43:10.260693 7faac1e3f700 0 -- 10.0.0.16:6809/23853 >> 10.0.0.15:6805/47835 conn(0x563de537d800 :-1 s=STATE_OPEN pgs=310 cs=11 l=0).fault with nothing to send, going to standby
2017-03-20 09:43:10.264621 7faac163e700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
2017-03-20 09:43:10.264682 7faac163e700 0 -- 10.0.0.16:6809/23853 >> 10.0.0.15:6829/48397 conn(0x563de5fdb000 :-1 s=STATE_OPEN pgs=231 cs=23 l=0).fault with nothing to send, going to standby
2017-03-20 09:43:10.291832 7faac163e700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
2017-03-20 09:43:10.291895 7faac163e700 0 -- 10.0.0.16:6809/23853 >> 10.0.0.17:6817/32412 conn(0x563de50f5800 :-1 s=STATE_OPEN pgs=245 cs=25 l=0).fault initiating reconnect
2017-03-20 09:43:10.387540 7faac2e41700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
2017-03-20 09:43:10.387565 7faac2e41700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
2017-03-20 09:43:10.387635 7faac2e41700 0 -- 10.0.0.16:6809/23853 >> 10.0.0.17:6801/32098 conn(0x563de51ab800 :6809 s=STATE_OPEN pgs=268 cs=23 l=0).fault with nothing to send, going to standby
2017-03-20 09:43:11.453373 7faabdee0700 -1 osd.10 902 heartbeat_check: no reply from 10.0.0.15:6803 osd.0 since back 2017-03-20 09:42:50.610507 front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
2017-03-20 09:43:11.453422 7faabdee0700 -1 osd.10 902 heartbeat_check: no reply from 10.0.0.15:6807 osd.1 since back 2017-03-20 09:42:50.610507 front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
2017-03-20 09:43:11.453435 7faabdee0700 -1 osd.10 902 heartbeat_check: no reply from 10.0.0.15:6811 osd.2 since back 2017-03-20 09:42:50.610507 front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
2017-03-20 09:43:11.453444 7faabdee0700 -1 osd.10 902 heartbeat_check: no reply from 10.0.0.15:6815 osd.3 since back 2017-03-20 09:42:50.610507 front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
...
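(In case it is relevant: plain RDMA connectivity between the nodes can be sanity-checked with the standard OFED/librdmacm tools, independent of Ceph. Something like the following, using my nodes' IPs and device name:)

    # on one node (e.g. 10.0.0.15), start an rping responder
    rping -s -a 10.0.0.15 -v
    # from another node, run 10 ping-pong iterations against it
    rping -c -a 10.0.0.15 -C 10 -v
    # and confirm the HCA port state on every node
    ibv_devinfo -d mlx4_0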
The following is my environment.

*[Software]*

*Ceph Version*: ceph version 12.0.0-1356-g7ba32cb (built by myself from the master branch)
*Deployment*: without ceph-deploy and systemd; I just invoke every daemon manually.
*Host*: Ubuntu 16.04.1 LTS (x86_64), with Linux kernel 4.4.0-66-generic
*NIC*: Ethernet controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
*NIC Driver*: MLNX_OFED_LINUX-4.0-1.0.1.0 (OFED-4.0-1.0.1)

*[Configuration]*

Ceph.conf:

    [global]
    fsid = 0612cc7e-6239-456c-978b-b4df781fe831
    mon initial members = ceph-1,ceph-2,ceph-3
    mon host = 10.0.0.15,10.0.0.16,10.0.0.17
    osd pool default size = 2
    osd pool default pg num = 1024
    osd pool default pgp num = 1024
    ms_type=async+rdma
    ms_async_rdma_device_name = mlx4_0

Fio.conf:

    [global]
    ioengine=rbd
    clientname=admin
    pool=rbd
    rbdname=rbd
    clustername=ceph
    runtime=120
    iodepth=128
    numjobs=6
    group_reporting
    size=256G
    direct=1
    ramp_time=5

    [r75w25]
    bs=4k
    rw=randrw
    rwmixread=75

*[Cluster Env]*
1. Three nodes in total.
2. 3 Ceph monitors, one on each node.
3. 8 Ceph OSDs on each node (24 OSDs in total).

Thanks
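P.S. One more thing I suspect but haven't confirmed: since I start the daemons by hand (no systemd unit to raise the locked-memory limit for them), the OSDs inherit my shell's RLIMIT_MEMLOCK, and RDMA has to register (pin) memory. A minimal check, assuming bash and with <pid> standing for one OSD's pid:

    # limit of the shell that spawns the daemons
    ulimit -l
    # limit the running OSD actually has
    grep locked /proc/<pid>/limits
    # raise it before starting the daemons (may need root or limits.conf)
    ulimit -l unlimited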