Just to be clear about the issue: you have a 3-server setup and performance is good. You add a server (with 1 OSD?) and performance goes down, is that right?
Can you give us more details? What's your complete setup? How many OSDs per node, bluestore/filestore, WAL/DB setup, etc.? You're talking about sdb, sde, etc. -- are those supposed to be OSD disks? What performance did you see before adding the last server, and how does it compare to the performance after? Are your OSD weights set correctly after the move (and after the data settles)?

Mohamad

On 04/05/2018 11:23 AM, Steven Vacaroaia wrote:
> Hi,
>
> I have a strange issue - OSDs from a specific server are introducing a
> huge performance issue.
>
> This is a brand new installation on 3 identical servers:
> DELL R620 with PERC H710, bluestore DB and WAL on SSD, 10Gb
> dedicated private/public networks.
>
> When I add the OSD I see gaps like below and huge latency.
>
> atop provides no clear culprit EXCEPT very low network and specific
> disk utilization BUT 100% DSK for the ceph-osd process, which stays
> like that (100%) for the duration of the test (see below).
>
> Not sure why the ceph-osd process DSK stays at 100% while all the
> specific disks (sdb, sde, etc.) are 1% busy?
>
> Any help / instructions for how to troubleshoot this will be appreciated.
>
> (apologies if the format is not being kept)
>
> CPU | sys 4% | user 1% | irq 1% | idle 794% | wait 0% | steal 0% | guest 0% | curf 2.20GHz | curscal ?% |
> CPL | avg1 0.00 | avg5 0.00 | avg15 0.00 | csw 547/s | intr 832/s | numcpu 8 |
> MEM | tot 62.9G | free 61.4G | cache 520.6M | dirty 0.0M | buff 7.5M | slab 98.9M | slrec 64.8M | shmem 8.8M | shrss 0.0M | shswp 0.0M | vmbal 0.0M | hptot 0.0M | hpuse 0.0M |
> SWP | tot 6.0G | free 6.0G | vmcom 1.5G | vmlim 37.4G |
> LVM | dm-0 | busy 1% | read 0/s | write 54/s | KiB/r 0 | KiB/w 455 | MBr/s 0.0 | MBw/s 24.0 | avq 3.69 | avio 0.14 ms |
> DSK | sdb | busy 1% | read 0/s | write 102/s | KiB/r 0 | KiB/w 240 | MBr/s 0.0 | MBw/s 24.0 | avq 6.69 | avio 0.08 ms |
> DSK | sda | busy 0% | read 0/s | write 12/s | KiB/r 0 | KiB/w 4 | MBr/s 0.0 | MBw/s 0.1 | avq 1.00 | avio 0.05 ms |
> DSK | sde | busy 0% | read 0/s | write 0/s | KiB/r 0 | KiB/w 0 | MBr/s 0.0 | MBw/s 0.0 | avq 1.00 | avio 2.50 ms |
> NET | transport | tcpi 718/s | tcpo 972/s | udpi 0/s | udpo 0/s | tcpao 0/s | tcppo 0/s | tcprs 21/s | tcpie 0/s | tcpor 0/s | udpnp 0/s | udpie 0/s |
> NET | network | ipi 719/s | ipo 399/s | ipfrw 0/s | deliv 719/s | icmpi 0/s | icmpo 0/s |
> NET | eth5 1% | pcki 2214/s | pcko 939/s | sp 10 Gbps | si 154 Mbps | so 52 Mbps | coll 0/s | mlti 0/s | erri 0/s | erro 0/s | drpi 0/s | drpo 0/s |
> NET | eth4 0% | pcki 712/s | pcko 54/s | sp 10 Gbps | si 50 Mbps | so 90 Kbps | coll 0/s | mlti 0/s | erri 0/s | erro 0/s | drpi 0/s | drpo 0/s |
>
>   PID   TID   RDDSK    WRDSK   WCANCL   DSK   CMD 1/21
>  2067     -    0K/s   0.0G/s     0K/s  100%   ceph-osd
>
>
> 2018-04-05 10:55:24.316549 min lat: 0.0203278 max lat: 10.7501 avg lat: 0.496822
>
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>    40      16      1096      1080   107.988         0            -    0.496822
>    41      16      1096      1080   105.354         0            -    0.496822
>    42      16      1096      1080   102.846         0            -    0.496822
>    43      16      1096      1080   100.454         0            -    0.496822
>    44      16      1205      1189   108.079   48.4444    0.0430396    0.588127
>    45      16      1234      1218   108.255       116    0.0318717    0.575485
>    46      16      1234      1218   105.901         0            -    0.575485
>    47      16      1234      1218   103.648         0            -    0.575485
>    48      16      1234      1218   101.489         0            -    0.575485
>    49      16      1261      1245   101.622        27     0.157469    0.604268
>    50      16      1335      1319   105.508       296     0.191907    0.604862
>    51      16      1418      1402   109.949       332    0.0367004    0.573429
>    52      16      1437      1421   109.296        76     0.031818    0.566289
>    53      16      1481      1465   110.554       176    0.0405567    0.564885
>    54      16      1516      1500   111.099       140    0.0272873    0.552698
>    55      16      1516      1500   109.079         0            -    0.552698
>    56      16      1516      1500   107.131         0            -    0.552698
>    57      16      1516      1500   105.252         0            -    0.552698
>    58      16      1555      1539   106.127        39      0.15675    0.601747
>
> Total time run:       58.971664
> Total reads made:     1565
> Read size:            4194304
> Object size:          4194304
> Bandwidth (MB/sec):   106.153
> Average IOPS:         26
> Stddev IOPS:          33
> Max IOPS:             121
> Min IOPS:             0
> Average Latency(s):   0.600788
> Max latency(s):       10.7501
> Min latency(s):       0.019135
>
>
> megacli -LDGetProp -cache -Lall -a0
>
> Adapter 0-VD 0(target id: 0): Cache Policy:WriteThrough, ReadAheadNone, Direct, Write Cache OK if bad BBU
> Adapter 0-VD 1(target id: 1): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
> Adapter 0-VD 2(target id: 2): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
> Adapter 0-VD 3(target id: 3): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
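On the weights question: a quick way to spot a miscalibrated OSD is to compare each OSD's CRUSH weight against the rest of the cluster. A rough sketch of that check, parsing the kind of JSON `ceph osd df --format json` emits (the sample data below is made up for illustration; osd.2 stands in for a hypothetical newly added OSD whose weight never got adjusted):

```python
import json

# Made-up data shaped like `ceph osd df --format json` output;
# osd.2 plays the role of a new OSD with a suspiciously low weight.
sample = json.loads("""
{"nodes": [
  {"id": 0, "name": "osd.0", "crush_weight": 0.54, "utilization": 41.2},
  {"id": 1, "name": "osd.1", "crush_weight": 0.54, "utilization": 43.8},
  {"id": 2, "name": "osd.2", "crush_weight": 0.05, "utilization": 2.1}
]}
""")

weights = [o["crush_weight"] for o in sample["nodes"]]
median = sorted(weights)[len(weights) // 2]

for osd in sample["nodes"]:
    # Flag any OSD whose CRUSH weight is far below the cluster median.
    flag = "  <-- weight looks off" if osd["crush_weight"] < 0.5 * median else ""
    print(f'{osd["name"]}: weight {osd["crush_weight"]:.2f}, '
          f'util {osd["utilization"]:.1f}%{flag}')
```

If a newly created OSD shows a weight near zero while its peers sit around the size-derived value, `ceph osd crush reweight` is the usual fix.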
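For what it's worth, the per-second numbers in that bench run are the telling part: throughput comes in bursts separated by multi-second stalls at 0 MB/s, which is exactly the pattern that produces a stddev IOPS larger than the average (your summary shows avg 26, stddev 33). A small sketch with hypothetical samples shaped like that run:

```python
import statistics

# Hypothetical per-second throughput samples (MB/s), shaped like the
# rados bench run above: short bursts separated by stalls at 0 MB/s.
samples = [0, 0, 0, 48, 116, 0, 0, 0, 27, 296, 332, 76, 176, 140, 0, 0, 0, 39]

# 4 MiB objects, as in the bench output, so IOPS = (MB/s) / 4.
iops = [mb / 4 for mb in samples]

avg = statistics.mean(iops)
sd = statistics.pstdev(iops)
stalled = sum(1 for mb in samples if mb == 0)

print(f"avg IOPS {avg:.1f}, stddev {sd:.1f}, stalled {stalled}/{len(samples)} s")
```

When stddev exceeds the average like this, the cluster isn't uniformly slow; some component (one OSD, one disk, one network path) is periodically blocking all 16 in-flight ops, which also explains the 10.75 s max latency.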