Below is ceph -s

>  cluster:
>    id:     {id}
>    health: HEALTH_WARN
>            noout flag(s) set
>            260610/1068004947 objects misplaced (0.024%)
>            Degraded data redundancy: 23157232/1068004947 objects degraded (2.168%), 332 pgs unclean, 328 pgs degraded, 328 pgs undersized
>
>  services:
>    mon: 3 daemons, quorum mon02,mon01,mon03
>    mgr: mon03(active), standbys: mon02
>    mds: cephfs-1/1/1 up {0=mon03=up:active}, 1 up:standby
>    osd: 74 osds: 74 up, 74 in; 332 remapped pgs
>         flags noout
>
>  data:
>    pools:   5 pools, 5316 pgs
>    objects: 339M objects, 46627 GB
>    usage:   154 TB used, 108 TB / 262 TB avail
>    pgs:     23157232/1068004947 objects degraded (2.168%)
>             260610/1068004947 objects misplaced (0.024%)
>             4984 active+clean
>             183  active+undersized+degraded+remapped+backfilling
>             145  active+undersized+degraded+remapped+backfill_wait
>             3    active+remapped+backfill_wait
>             1    active+remapped+backfilling
>
>  io:
>    client:   8428 kB/s rd, 47905 B/s wr, 130 op/s rd, 0 op/s wr
>    recovery: 37057 kB/s, 50 keys/s, 217 objects/s
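For reference, a rough way to keep an eye on this recovery state and the per-pool split further down (a sketch using stock ceph CLI calls; nothing here is specific to this cluster):

    # overall recovery/backfill progress, same summary as the ceph -s above
    watch -n 5 'ceph -s'

    # per-pool recovery and client io -- the per-pool breakdown below
    # looks like the output of this command
    ceph osd pool stats

    # count PGs currently in a backfilling state (vs. backfill_wait)
    ceph pg dump pgs_brief 2>/dev/null | grep -c 'backfilling'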
Also, the two pools on the SSDs are the objects pool at 4096 PGs and the fs-metadata pool at 32 PGs.

> Are you sure the recovery is actually going slower, or are the individual ops larger or more expensive?

The objects should not vary wildly in size. Even if they did differ in size, the SSDs are roughly idle in their current state of backfilling when examining wait in iotop, atop, or sysstat/iostat. Compare that to when the backfills were going “full speed” and I was fully saturating the SATA backplane with over 1000 MB/s of writes to multiple disks.

Here is a breakdown of recovery io by pool:

> pool objects-ssd id 20
>   recovery io 6779 kB/s, 92 objects/s
>   client io 3071 kB/s rd, 50 op/s rd, 0 op/s wr
>
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 28 keys/s, 2 objects/s
>   client io 109 kB/s rd, 67455 B/s wr, 1 op/s rd, 0 op/s wr
>
> pool cephfs-hdd id 17
>   recovery io 40542 kB/s, 158 objects/s
>   client io 10056 kB/s rd, 142 op/s rd, 0 op/s wr

So at the moment the 24 HDDs are outperforming the 50 SSDs for both recovery and client traffic, which seems conspicuous to me. Most of the OSDs with recovery ops to the SSDs are reporting 8-12 ops, with one OSD occasionally spiking up to 300-500 for a few minutes.

Stats are being pulled both by local collectd instances on each node and by the influx plugin in the mgr, as we evaluate the latter against collectd.

Thanks,
Reed

> On Feb 22, 2018, at 6:21 PM, Gregory Farnum <gfar...@redhat.com> wrote:
>
> What's the output of "ceph -s" while this is happening?
>
> Is there some identifiable difference between these two states, like you get a lot of throughput on the data pools but then metadata recovery is slower?
>
> Are you sure the recovery is actually going slower, or are the individual ops larger or more expensive?
>
> My WAG is that recovering the metadata pool, composed mostly of directories stored in omap objects, is going much slower for some reason. You can adjust the cost of those individual ops some by changing osd_recovery_max_omap_entries_per_chunk (default: 8096), but I'm not sure which way you want to go or indeed if this has anything to do with the problem you're seeing. (eg, it could be that reading out the omaps is expensive, so you can get higher recovery op numbers by turning down the number of entries per request, but not actually see faster backfilling because you have to issue more requests.)
> -Greg
>
> On Wed, Feb 21, 2018 at 2:57 PM Reed Dier <reed.d...@focusvq.com> wrote:
> Hi all,
>
> I am running into an odd situation that I cannot easily explain.
> I am currently in the midst of a destroy-and-rebuild of OSDs from filestore to bluestore.
> With my HDDs I am seeing expected behavior, but with my SSDs I am seeing unexpected behavior. The HDDs and SSDs are set in crush accordingly.
>
> My path to replacing the OSDs is to set the noout, norecover, and norebalance flags, destroy the OSD, create the OSD back (iterate n times, all within a single failure domain), unset the flags, and let it go. It finishes; rinse, repeat.
>
> The SSD OSDs are SATA SSDs (Samsung SM863a), 10 to a node, with 2 NVMe drives (Intel P3700), 5 SATA SSDs to 1 NVMe drive, and 16G partitions for block.db (previously filestore journals).
> 2x10GbE networking between the nodes. The SATA backplane caps out at around 10 Gb/s, as it is 2x 6 Gb/s controllers. Luminous 12.2.2.
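For reference, the flag handling described above amounts to roughly the following sequence (a sketch; the destroy/re-create steps in between are omitted since the exact tooling used is not specified):

    # set cluster flags before destroying OSDs within one failure domain
    ceph osd set noout
    ceph osd set norecover
    ceph osd set norebalance

    # ... destroy and re-create each OSD as bluestore ...

    # unset the flags to let backfill begin
    ceph osd unset norebalance
    ceph osd unset norecover
    ceph osd unset noout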
>
> When the flags are unset, recovery starts and I see a very large rush of traffic; however, after the first machine completed, the performance tapered off rapidly and now trickles. Comparatively, I'm getting 100-200 recovery ops on 3 HDDs backfilling from 21 other HDDs, whereas I'm getting 150-250 recovery ops on 5 SSDs backfilling from 40 other SSDs. Every once in a while I will see a spike up to 500, 1000, or even 2000 ops on the SSDs, often a few hundred recovery ops from one OSD and 8-15 ops from the others that are backfilling.
>
> This is a far cry from the 15-30k recovery ops it started off recovering at, with 1-3k recovery ops coming from a single OSD to the backfilling OSD(s). And an even farther cry from the >15k recovery ops I was sustaining for an hour or more before. I was able to rebuild a 1.9T SSD (1.1T used) in a little under an hour, and I could do about 5 at a time and still keep it to roughly an hour to backfill all of them, but then I hit a roadblock after the first machine, when I tried to do 10 at a time (a single machine). I am now still experiencing the same thing on the third node, while doing 5 OSDs at a time.
>
> The pools associated with these SSDs are cephfs-metadata, as well as a pure rados object pool we use for our own internal applications. Both are size=3, min_size=2.
>
> It appears I am not the first to run into this, but it looks like there was no resolution: https://www.spinics.net/lists/ceph-users/msg41493.html
>
> Recovery parameters for the OSDs match what was in the previous thread, sans the osd conf block listed. Current osd_max_backfills = 30 and osd_recovery_max_active = 35. There is very little client activity on the OSDs during this period, so there should not be any contention for iops on the SSDs.
>
> The only oddity I can point to is that we had a few periods where the disk load on one of the mons was high enough to cause that mon to drop out of quorum briefly, a few times. But I wouldn't think backfills would get throttled just because of mons flapping.
>
> Hopefully someone has some experience or can steer me down a path to improve the performance of the backfills, so that I'm not stuck in backfill purgatory longer than I need to be.
>
> Linking an imgur album with some screen grabs of the recovery ops over time for the first machine versus the second and third machines, to demonstrate the delta between them:
> https://imgur.com/a/OJw4b
>
> Also including a ceph osd df of the SSDs; highlighted in red are the OSDs currently backfilling. Could this possibly be PG overdose? I don't ever run into 'stuck activating' PGs; it's just painfully slow backfills, as if they are being throttled by ceph, that are causing me to worry. The drives aren't worn (<30 P/E cycles), so plenty of life left in them.
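On the "being throttled by ceph" point: the recovery settings mentioned above can be checked and injected at runtime along these lines (a sketch; osd.24 is just one of the SSD OSDs from the df output below, and the values shown are the ones already quoted in the thread, not a recommendation):

    # inspect the current recovery/backfill throttles on one SSD OSD (run on its host);
    # any non-zero osd_recovery_sleep* value will pace recovery even on idle SSDs
    ceph daemon osd.24 config show | egrep 'osd_max_backfills|osd_recovery_max_active|osd_recovery_sleep'

    # inject the values from the thread cluster-wide without a restart
    ceph tell osd.* injectargs '--osd-max-backfills 30 --osd-recovery-max-active 35'

    # Greg's omap suggestion: change the per-op omap chunk size for metadata recovery
    # (8096 is the default he quotes; 1024 is only an illustrative value)
    ceph tell osd.* injectargs '--osd-recovery-max-omap-entries-per-chunk 1024'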
>
> Thanks,
> Reed
>
>> $ ceph osd df
>> ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
>> 24 ssd   1.76109 1.00000  1803G 1094G  708G 60.69 1.08 260
>> 25 ssd   1.76109 1.00000  1803G 1136G  667G 63.01 1.12 271
>> 26 ssd   1.76109 1.00000  1803G 1018G  785G 56.46 1.01 243
>> 27 ssd   1.76109 1.00000  1803G 1065G  737G 59.10 1.05 253
>> 28 ssd   1.76109 1.00000  1803G 1026G  776G 56.94 1.02 245
>> 29 ssd   1.76109 1.00000  1803G 1132G  671G 62.79 1.12 270
>> 30 ssd   1.76109 1.00000  1803G  944G  859G 52.35 0.93 224
>> 31 ssd   1.76109 1.00000  1803G 1061G  742G 58.85 1.05 252
>> 32 ssd   1.76109 1.00000  1803G 1003G  799G 55.67 0.99 239
>> 33 ssd   1.76109 1.00000  1803G 1049G  753G 58.20 1.04 250
>> 34 ssd   1.76109 1.00000  1803G 1086G  717G 60.23 1.07 257
>> 35 ssd   1.76109 1.00000  1803G  978G  824G 54.26 0.97 232
>> 36 ssd   1.76109 1.00000  1803G 1057G  745G 58.64 1.05 252
>> 37 ssd   1.76109 1.00000  1803G 1025G  777G 56.88 1.01 244
>> 38 ssd   1.76109 1.00000  1803G 1047G  756G 58.06 1.04 250
>> 39 ssd   1.76109 1.00000  1803G 1031G  771G 57.20 1.02 246
>> 40 ssd   1.76109 1.00000  1803G 1029G  774G 57.07 1.02 245
>> 41 ssd   1.76109 1.00000  1803G 1033G  770G 57.28 1.02 245
>> 42 ssd   1.76109 1.00000  1803G  993G  809G 55.10 0.98 236
>> 43 ssd   1.76109 1.00000  1803G 1072G  731G 59.45 1.06 256
>> 44 ssd   1.76109 1.00000  1803G 1039G  763G 57.64 1.03 248
>> 45 ssd   1.76109 1.00000  1803G  992G  810G 55.06 0.98 236
>> 46 ssd   1.76109 1.00000  1803G 1068G  735G 59.23 1.06 254
>> 47 ssd   1.76109 1.00000  1803G 1020G  783G 56.57 1.01 242
>> 48 ssd   1.76109 1.00000  1803G  945G  857G 52.44 0.94 225
>> 49 ssd   1.76109 1.00000  1803G  649G 1154G 36.01 0.64 139
>> 50 ssd   1.76109 1.00000  1803G  426G 1377G 23.64 0.42  83
>> 51 ssd   1.76109 1.00000  1803G  610G 1193G 33.84 0.60 131
>> 52 ssd   1.76109 1.00000  1803G  558G 1244G 30.98 0.55 118
>> 53 ssd   1.76109 1.00000  1803G  731G 1072G 40.54 0.72 161
>> 54 ssd   1.74599 1.00000  1787G  859G  928G 48.06 0.86 229
>> 55 ssd   1.74599 1.00000  1787G  942G  844G 52.74 0.94 252
>> 56 ssd   1.74599 1.00000  1787G  928G  859G 51.94 0.93 246
>> 57 ssd   1.74599 1.00000  1787G 1039G  748G 58.15 1.04 277
>> 58 ssd   1.74599 1.00000  1787G  963G  824G 53.87 0.96 255
>> 59 ssd   1.74599 1.00000  1787G  909G  877G 50.89 0.91 241
>> 60 ssd   1.74599 1.00000  1787G 1039G  748G 58.15 1.04 277
>> 61 ssd   1.74599 1.00000  1787G  892G  895G 49.91 0.89 238
>> 62 ssd   1.74599 1.00000  1787G  927G  859G 51.90 0.93 245
>> 63 ssd   1.74599 1.00000  1787G  864G  922G 48.39 0.86 229
>> 64 ssd   1.74599 1.00000  1787G  968G  819G 54.16 0.97 257
>> 65 ssd   1.74599 1.00000  1787G  892G  894G 49.93 0.89 237
>> 66 ssd   1.74599 1.00000  1787G  951G  836G 53.23 0.95 252
>> 67 ssd   1.74599 1.00000  1787G  878G  908G 49.16 0.88 232
>> 68 ssd   1.74599 1.00000  1787G  899G  888G 50.29 0.90 238
>> 69 ssd   1.74599 1.00000  1787G  948G  839G 53.04 0.95 252
>> 70 ssd   1.74599 1.00000  1787G  914G  873G 51.15 0.91 246
>> 71 ssd   1.74599 1.00000  1787G 1004G  782G 56.21 1.00 266
>> 72 ssd   1.74599 1.00000  1787G  812G  974G 45.47 0.81 216
>> 73 ssd   1.74599 1.00000  1787G  932G  855G 52.15 0.93 247
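On the PG-overdose question above: the PGS column in the df output is already the per-OSD placement-group count (roughly 240-280 on most of these SSD OSDs), so a quick check looks something like this (a sketch; mon01 is one of the mons from the ceph -s output, and the daemon command has to run on that mon host):

    # average and max PGs per SSD OSD, straight from the PGS column above
    ceph osd df | awk '$2 == "ssd" {sum += $NF; n++; if ($NF > max) max = $NF}
                       END {printf "avg %.0f, max %d PGs per SSD OSD\n", sum/n, max}'

    # Luminous overdose protection is relative to this limit (default 200),
    # if the option is present in this 12.2.2 build; run on the mon host
    ceph daemon mon.mon01 config get mon_max_pg_per_osd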
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com