Yeah, the objects being marked degraded here are a consequence of writes landing while backfill is in progress; it doesn't last long because only a certain range of objects is affected. I didn't think that should escalate to the PG itself being marked degraded, but I may be misinformed. Still planning to dig through that, but I haven't gotten to it yet. :)
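If you want to see that correlation live, watching the counters next to the recovery state is usually enough; a minimal sketch (the interval and grep pattern are arbitrary):

    watch -n 10 "ceph -s | grep -E 'degraded|misplaced|backfill'"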
On Thu, Jul 20, 2017 at 8:13 AM Andras Pataki <apat...@flatironinstitute.org> wrote:
> Hi Greg,
>
> I have just now added a single drive/osd to a clean cluster, and can see
> the degradation immediately.  We are on ceph 10.2.9 everywhere.
>
> Here is how the cluster looked before the OSD got added:
>
>     cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
>      health HEALTH_WARN
>             noout flag(s) set
>      monmap e31: 3 mons at {cephmon00=10.128.128.100:6789/0,cephmon01=10.128.128.101:6789/0,cephmon02=10.128.128.102:6789/0}
>             election epoch 46092, quorum 0,1,2 cephmon00,cephmon01,cephmon02
>       fsmap e26638: 1/1/1 up {0=cephmon01=up:active}, 2 up:standby
>      osdmap e681227: 1270 osds: 1270 up, 1270 in
>             flags noout,sortbitwise,require_jewel_osds
>       pgmap v54583934: 42496 pgs, 6 pools, 1488 TB data, 437 Mobjects
>             4471 TB used, 3416 TB / 7887 TB avail
>                42491 active+clean
>                    5 active+clean+scrubbing+deep
>   client io 2193 kB/s rd, 27240 kB/s wr, 85 op/s rd, 47 op/s wr
>
> And this is shortly after it was added (after all the peering was done):
>
>     cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
>      health HEALTH_WARN
>             141 pgs backfill_wait
>             117 pgs backfilling
>             20 pgs degraded
>             20 pgs recovery_wait
>             56 pgs stuck unclean
>             recovery 130/1376744346 objects degraded (0.000%)
>             recovery 3827502/1376744346 objects misplaced (0.278%)
>             noout flag(s) set
>      monmap e31: 3 mons at {cephmon00=10.128.128.100:6789/0,cephmon01=10.128.128.101:6789/0,cephmon02=10.128.128.102:6789/0}
>             election epoch 46092, quorum 0,1,2 cephmon00,cephmon01,cephmon02
>       fsmap e26638: 1/1/1 up {0=cephmon01=up:active}, 2 up:standby
>      osdmap e681238: 1271 osds: 1271 up, 1271 in; 258 remapped pgs
>             flags noout,sortbitwise,require_jewel_osds
>       pgmap v54585141: 42496 pgs, 6 pools, 1488 TB data, 437 Mobjects
>             4471 TB used, 3423 TB / 7895 TB avail
>             *130/1376744346 objects degraded (0.000%)*
>             3827502/1376744346 objects misplaced (0.278%)
>                42210 active+clean
>                  141 active+remapped+wait_backfill
>                  117 active+remapped+backfilling
>                  *20 active+recovery_wait+degraded*
>                    7 active+clean+scrubbing+deep
>                    1 active+clean+scrubbing
>   recovery io 17375 MB/s, 5069 objects/s
>   client io 12210 kB/s rd, 29887 kB/s wr, 4 op/s rd, 140 op/s wr
>
> Even though there was no failure, we have 20 degraded PGs and 130
> degraded objects.  My expectation was for some data to move around and
> start filling the added drive, but I would not expect to see degraded
> objects or PGs.
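To pin down exactly which PGs those 20 are and what they're waiting on, something like this should work on Jewel (the pgid is a placeholder; if your build doesn't accept 'degraded' as a dump_stuck state, 'unclean' plus grep does the same job):

    ceph health detail | grep -i degraded
    ceph pg dump_stuck degraded
    ceph pg <pgid> query       # the "recovery_state" section says why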
>
> Also, as time passes, the number of degraded objects increases steadily;
> here is a snapshot a little later:
>
>     cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
>      health HEALTH_WARN
>             63 pgs backfill_wait
>             4 pgs backfilling
>             67 pgs stuck unclean
>             recovery 706/1377244134 objects degraded (0.000%)
>             recovery 843267/1377244134 objects misplaced (0.061%)
>             noout flag(s) set
>      monmap e31: 3 mons at {cephmon00=10.128.128.100:6789/0,cephmon01=10.128.128.101:6789/0,cephmon02=10.128.128.102:6789/0}
>             election epoch 46092, quorum 0,1,2 cephmon00,cephmon01,cephmon02
>       fsmap e26640: 1/1/1 up {0=cephmon01=up:active}, 2 up:standby
>      osdmap e681569: 1271 osds: 1271 up, 1271 in; 67 remapped pgs
>             flags noout,sortbitwise,require_jewel_osds
>       pgmap v54588554: 42496 pgs, 6 pools, 1488 TB data, 437 Mobjects
>             4471 TB used, 3423 TB / 7895 TB avail
>             *706/1377244134 objects degraded (0.000%)*
>             843267/1377244134 objects misplaced (0.061%)
>                42422 active+clean
>                   63 active+remapped+wait_backfill
>                    5 active+clean+scrubbing+deep
>                    4 active+remapped+backfilling
>                    2 active+clean+scrubbing
>   recovery io 779 MB/s, 229 objects/s
>   client io 306 MB/s rd, 344 MB/s wr, 138 op/s rd, 226 op/s wr
>
> From past experience, the degraded object count keeps going up for most
> of the time the disk is being filled.  Towards the end it decreases.  Is
> writing to a pool that is waiting for backfilling causing degraded
> objects to appear, perhaps?
>
> I took a 'pg dump' before and after the change, as well as an 'osd tree'
> before and after.  All these are available at
> http://voms.simonsfoundation.org:50013/m1Maf76sV1kS95spXQpijycmne92yjm/ceph-20170720/
>
> All pools are now with replicated size 3 and min size 2.  Let me know if
> any other info would be helpful.
>
> Andras
>
> On 07/06/2017 02:30 PM, Andras Pataki wrote:
>
> Hi Greg,
>
> At the moment our cluster is all in balance.  We have one failed drive
> that will be replaced in a few days (the OSD has been removed from ceph
> and will be re-added with the replacement drive).  I'll document the
> state of the PGs before the addition of the drive and during the recovery
> process and report back.
>
> We have a few pools; most are on 3 replicas now, and some with
> non-critical data that we have elsewhere are on 2.  But I've seen the
> degradation even on the 3-replica pools (I think my original report
> included such a pool as well).
>
> Andras
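On the "is writing during backfill causing this" question: the PGs to watch are exactly the ones whose acting set differs from their up set, since writes there land on fewer complete copies. A rough way to list them, assuming the Jewel pgs_brief text layout from memory (up in column 3, acting in column 5; columns may shift between releases):

    ceph pg dump pgs_brief 2>/dev/null | awk 'NR>1 && $3 != $5 {print $1, $2, "up="$3, "acting="$5}'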
>
> On 06/30/2017 04:38 PM, Gregory Farnum wrote:
>
> On Wed, Jun 21, 2017 at 6:57 AM Andras Pataki <
> apat...@flatironinstitute.org> wrote:
>
>> Hi cephers,
>>
>> I noticed something I don't understand about ceph's behavior when adding
>> an OSD.  When I start with a clean cluster (all PGs active+clean) and
>> add an OSD (via ceph-deploy, for example), the crush map gets updated,
>> PGs get reassigned to different OSDs, and the new OSD starts getting
>> filled with data.  As the new OSD gets filled, I start seeing PGs in
>> degraded states.  Here is an example:
>>
>>     pgmap v52068792: 42496 pgs, 6 pools, 1305 TB data, 390 Mobjects
>>           3164 TB used, 781 TB / 3946 TB avail
>>           *8017/994261437 objects degraded (0.001%)*
>>           2220581/994261437 objects misplaced (0.223%)
>>              42393 active+clean
>>                 91 active+remapped+wait_backfill
>>                  9 active+clean+scrubbing+deep
>>                 *1 active+recovery_wait+degraded*
>>                  1 active+clean+scrubbing
>>                  1 active+remapped+backfilling
>>
>> Any ideas why there would be any persistent degradation in the cluster
>> while the newly added drive is being filled?  It takes perhaps a day or
>> two to fill the drive, and during all this time the cluster seems to be
>> running degraded.  As data is written to the cluster, the number of
>> degraded objects increases over time.  Once the newly added OSD is
>> filled, the cluster comes back to clean again.
>>
>> Here is the PG that is degraded in this picture:
>>
>> 7.87c  1  0  2  0  0  4194304  7  7
>>     active+recovery_wait+degraded  2017-06-20 14:12:44.119921
>>     344610'7  583572:2797  [402,521]  402  [402,521]  402
>>     344610'7  2017-06-16 06:04:55.822503  344610'7  2017-06-16 06:04:55.822503
>>
>> The newly added osd here is 521.  Before it got added, this PG had two
>> clean replicas, but one got forgotten somehow?
>
> This sounds a bit concerning at first glance.  Can you provide some
> output of exactly what commands you're invoking, and the "ceph -s" output
> as it changes in response?
>
> I really don't see how adding a new OSD can result in it "forgetting"
> about existing valid copies (it's definitely not supposed to), so I
> wonder if there's a collision in how it's deciding to remove old
> locations.
>
> Are you running with only two copies of your data?  It shouldn't matter,
> but there could also be errors resulting in a behavioral difference
> between two and three copies.
> -Greg
>
>> Other remapped PGs have 521 in their "up" set but still have the two
>> existing copies in their "acting" set, and no degradation is shown.
>> Examples:
>>
>> 2.f24   14282  0  16   28564  0  51014850801  3102  3102
>>     active+remapped+wait_backfill  2017-06-20 14:12:42.650308
>>     583553'2033479  583573:2033266  [467,521]  467  [467,499]  467
>>     582430'2033337  2017-06-16 09:08:51.055131  582036'2030837
>>     2017-05-31 20:37:54.831178
>> 6.2b7d  10499  0  140  20998  0  37242874687  3673  3673
>>     active+remapped+wait_backfill  2017-06-20 14:12:42.070019
>>     583569'165163  583572:342128  [541,37,521]  541  [541,37,532]  541
>>     582430'161890  2017-06-18 09:42:49.148402  582430'161890
>>     2017-06-18 09:42:49.148402
>>
>> We are running the latest Jewel patch level everywhere (10.2.7).  Any
>> insights would be appreciated.
>>
>> Andras
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
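P.S. For the next drive swap: a timestamped capture of "ceph -s" while the OSD is added makes these transitions much easier to reconstruct after the fact; a minimal sketch (the interval and filename are arbitrary):

    while sleep 30; do date; ceph -s; done >> ceph-status-during-add.log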