Please use the master branch to test RDMA.

On Sun, Mar 19, 2017 at 11:08 PM, Hung-Wei Chiu (邱宏瑋) <hwc...@cs.nctu.edu.tw> wrote:
> Hi,
>
> I want to test the performance of Ceph with RDMA, so I built Ceph with
> RDMA support and deployed it into my test environment manually.
>
> I use fio for the performance evaluation, and it works fine if Ceph
> uses *async + posix* as its ms_type.
> After changing the ms_type from *async + posix* to *async + rdma*, some
> OSDs go down during the performance test, which causes fio to fail to
> finish its job.
> The log file of those OSDs shows that something goes wrong when the OSD
> tries to send a message, as you can see below.
>
> ...
> 2017-03-20 09:43:10.096042 7faac163e700 -1 Infiniband recv_msg got error -104: (104) Connection reset by peer
> 2017-03-20 09:43:10.096314 7faac163e700  0 -- 10.0.0.16:6809/23853 >> 10.0.0.17:6813/32315 conn(0x563de5282000 :-1 s=STATE_OPEN pgs=264 cs=29 l=0).fault initiating reconnect
> 2017-03-20 09:43:10.251606 7faac1e3f700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
> 2017-03-20 09:43:10.251755 7faac1e3f700  0 -- 10.0.0.16:6809/23853 >> 10.0.0.17:6821/32509 conn(0x563de51f1000 :-1 s=STATE_OPEN pgs=314 cs=24 l=0).fault initiating reconnect
> 2017-03-20 09:43:10.254103 7faac1e3f700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
> 2017-03-20 09:43:10.254375 7faac1e3f700  0 -- 10.0.0.16:6809/23853 >> 10.0.0.15:6821/48196 conn(0x563de514b000 :6809 s=STATE_OPEN pgs=275 cs=30 l=0).fault initiating reconnect
> 2017-03-20 09:43:10.260622 7faac1e3f700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
> 2017-03-20 09:43:10.260693 7faac1e3f700  0 -- 10.0.0.16:6809/23853 >> 10.0.0.15:6805/47835 conn(0x563de537d800 :-1 s=STATE_OPEN pgs=310 cs=11 l=0).fault with nothing to send, going to standby
> 2017-03-20 09:43:10.264621 7faac163e700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
> 2017-03-20 09:43:10.264682 7faac163e700  0 -- 10.0.0.16:6809/23853 >> 10.0.0.15:6829/48397 conn(0x563de5fdb000 :-1 s=STATE_OPEN pgs=231 cs=23 l=0).fault with nothing to send, going to standby
> 2017-03-20 09:43:10.291832 7faac163e700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
> 2017-03-20 09:43:10.291895 7faac163e700  0 -- 10.0.0.16:6809/23853 >> 10.0.0.17:6817/32412 conn(0x563de50f5800 :-1 s=STATE_OPEN pgs=245 cs=25 l=0).fault initiating reconnect
> 2017-03-20 09:43:10.387540 7faac2e41700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
> 2017-03-20 09:43:10.387565 7faac2e41700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
> 2017-03-20 09:43:10.387635 7faac2e41700  0 -- 10.0.0.16:6809/23853 >> 10.0.0.17:6801/32098 conn(0x563de51ab800 :6809 s=STATE_OPEN pgs=268 cs=23 l=0).fault with nothing to send, going to standby
> 2017-03-20 09:43:11.453373 7faabdee0700 -1 osd.10 902 heartbeat_check: no reply from 10.0.0.15:6803 osd.0 since back 2017-03-20 09:42:50.610507 front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
> 2017-03-20 09:43:11.453422 7faabdee0700 -1 osd.10 902 heartbeat_check: no reply from 10.0.0.15:6807 osd.1 since back 2017-03-20 09:42:50.610507 front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
> 2017-03-20 09:43:11.453435 7faabdee0700 -1 osd.10 902 heartbeat_check: no reply from 10.0.0.15:6811 osd.2 since back 2017-03-20 09:42:50.610507 front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
> 2017-03-20 09:43:11.453444 7faabdee0700 -1 osd.10 902 heartbeat_check: no reply from 10.0.0.15:6815 osd.3 since back 2017-03-20 09:42:50.610507 front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
> ...
>
> The following is my environment.
>
> [Software]
> Ceph Version: ceph version 12.0.0-1356-g7ba32cb (built by myself from the master branch)
>
> Deployment: Without ceph-deploy and systemd; every daemon is invoked manually.
>
> Host: Ubuntu 16.04.1 LTS (x86_64), with Linux kernel 4.4.0-66-generic.
>
> NIC: Ethernet controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
>
> NIC Driver: MLNX_OFED_LINUX-4.0-1.0.1.0 (OFED-4.0-1.0.1)
>
> [Configuration]
> ceph.conf
>
> [global]
> fsid = 0612cc7e-6239-456c-978b-b4df781fe831
> mon initial members = ceph-1,ceph-2,ceph-3
> mon host = 10.0.0.15,10.0.0.16,10.0.0.17
> osd pool default size = 2
> osd pool default pg num = 1024
> osd pool default pgp num = 1024
> ms_type = async+rdma
> ms_async_rdma_device_name = mlx4_0
>
> fio.conf
>
> [global]
> ioengine=rbd
> clientname=admin
> pool=rbd
> rbdname=rbd
> clustername=ceph
> runtime=120
> iodepth=128
> numjobs=6
> group_reporting
> size=256G
> direct=1
> ramp_time=5
> [r75w25]
> bs=4k
> rw=randrw
> rwmixread=75
>
> [Cluster Env]
>
> 1. Three nodes in total.
> 2. 3 ceph monitors (one on each node).
> 3. 8 ceph OSDs on each node (24 OSDs in total).
>
> Thanks
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
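One thing worth checking with *async + rdma*: the RDMA messenger pins and registers memory with the HCA, and a too-low locked-memory (memlock) limit can make memory registration fail under load, which can surface as dropped connections like the ones in your log. Since you start the daemons manually without systemd, the limit in effect is whatever the launching shell inherits. A sketch of one way to raise it (the limits.conf entries below are illustrative, not taken from your setup; scope them to whatever account runs the daemons):

```
# /etc/security/limits.conf -- illustrative entries; replace "*" with
# the user that launches the ceph daemons if you want a narrower scope
*    soft    memlock    unlimited
*    hard    memlock    unlimited
```

After logging in again, `ulimit -l` in the shell that launches the OSDs should report `unlimited` before you start them.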