Yeah, the degraded objects here are a consequence of writes coming in
while backfill is happening; it doesn't last long because only a certain
range of objects is affected.
I didn't think that should escalate to the whole PG being marked degraded,
but I may be misinformed. I'm still planning to dig through that but
haven't gotten to it yet. :)
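
(For anyone who wants to dig into a specific PG: querying one of the
degraded PGs should show its recovery_state and its up/acting sets. The
PG id below is just a placeholder.)

    ceph pg <pgid> query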

On Thu, Jul 20, 2017 at 8:13 AM Andras Pataki <apat...@flatironinstitute.org>
wrote:

> Hi Greg,
>
> I have just added a single drive/OSD to a clean cluster, and can see the
> degradation immediately.  We are on ceph 10.2.9 everywhere.
>
> Here is how the cluster looked before the OSD got added:
>
>     cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
>      health HEALTH_WARN
>             noout flag(s) set
>      monmap e31: 3 mons at {cephmon00=10.128.128.100:6789/0,cephmon01=10.128.128.101:6789/0,cephmon02=10.128.128.102:6789/0}
>             election epoch 46092, quorum 0,1,2 cephmon00,cephmon01,cephmon02
>       fsmap e26638: 1/1/1 up {0=cephmon01=up:active}, 2 up:standby
>      osdmap e681227: 1270 osds: 1270 up, 1270 in
>             flags noout,sortbitwise,require_jewel_osds
>       pgmap v54583934: 42496 pgs, 6 pools, 1488 TB data, 437 Mobjects
>             4471 TB used, 3416 TB / 7887 TB avail
>                42491 active+clean
>                    5 active+clean+scrubbing+deep
>   client io 2193 kB/s rd, 27240 kB/s wr, 85 op/s rd, 47 op/s wr
>
>
> And this is shortly after it was added (after all the peering was done):
>
>     cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
>      health HEALTH_WARN
>             141 pgs backfill_wait
>             117 pgs backfilling
>             20 pgs degraded
>             20 pgs recovery_wait
>             56 pgs stuck unclean
>             recovery 130/1376744346 objects degraded (0.000%)
>             recovery 3827502/1376744346 objects misplaced (0.278%)
>             noout flag(s) set
>      monmap e31: 3 mons at {cephmon00=10.128.128.100:6789/0,cephmon01=10.128.128.101:6789/0,cephmon02=10.128.128.102:6789/0}
>             election epoch 46092, quorum 0,1,2 cephmon00,cephmon01,cephmon02
>       fsmap e26638: 1/1/1 up {0=cephmon01=up:active}, 2 up:standby
>      osdmap e681238: 1271 osds: 1271 up, 1271 in; 258 remapped pgs
>             flags noout,sortbitwise,require_jewel_osds
>       pgmap v54585141: 42496 pgs, 6 pools, 1488 TB data, 437 Mobjects
>             4471 TB used, 3423 TB / 7895 TB avail
> *            130/1376744346 objects degraded (0.000%)*
>             3827502/1376744346 objects misplaced (0.278%)
>                42210 active+clean
>                  141 active+remapped+wait_backfill
>                  117 active+remapped+backfilling
> *                  20 active+recovery_wait+degraded*
>                    7 active+clean+scrubbing+deep
>                    1 active+clean+scrubbing
> recovery io 17375 MB/s, 5069 objects/s
>   client io 12210 kB/s rd, 29887 kB/s wr, 4 op/s rd, 140 op/s wr
>
>
> Even though there was no failure, we have 20 degraded PGs and 130
> degraded objects.  My expectation was that some data would move around and
> start filling the added drive, but I would not expect to see degraded
> objects or PGs.
>
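> (To see exactly which PGs those are, something like the following should
> work; the grep is just the simplest filter:)
>
>     ceph health detail | grep degraded
>     ceph pg dump pgs_brief | grep degraded
>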
> Also, as time passes, the number of degraded objects increases steadily,
> here is a snapshot a little later:
>
>     cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
>      health HEALTH_WARN
>             63 pgs backfill_wait
>             4 pgs backfilling
>             67 pgs stuck unclean
>             recovery 706/1377244134 objects degraded (0.000%)
>             recovery 843267/1377244134 objects misplaced (0.061%)
>             noout flag(s) set
>      monmap e31: 3 mons at {cephmon00=10.128.128.100:6789/0,cephmon01=10.128.128.101:6789/0,cephmon02=10.128.128.102:6789/0}
>             election epoch 46092, quorum 0,1,2 cephmon00,cephmon01,cephmon02
>       fsmap e26640: 1/1/1 up {0=cephmon01=up:active}, 2 up:standby
>      osdmap e681569: 1271 osds: 1271 up, 1271 in; 67 remapped pgs
>             flags noout,sortbitwise,require_jewel_osds
>       pgmap v54588554: 42496 pgs, 6 pools, 1488 TB data, 437 Mobjects
>             4471 TB used, 3423 TB / 7895 TB avail
> *            706/1377244134 objects degraded (0.000%)*
>             843267/1377244134 objects misplaced (0.061%)
>                42422 active+clean
>                   63 active+remapped+wait_backfill
>                    5 active+clean+scrubbing+deep
>                    4 active+remapped+backfilling
>                    2 active+clean+scrubbing
> recovery io 779 MB/s, 229 objects/s
>   client io 306 MB/s rd, 344 MB/s wr, 138 op/s rd, 226 op/s wr
>
> From past experience, the degraded object count keeps climbing for most of
> the time the disk is being filled, and only decreases towards the end.  Could
> writing to a pool that is waiting for backfill be what causes degraded
> objects to appear?
>
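> (To track the degraded count over time, one can sample it periodically,
> roughly like this; the 60-second interval is arbitrary:)
>
>     while true; do ceph -s | grep 'objects degraded'; sleep 60; done
>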
> I took a 'pg dump' before and after the change, as well as an 'osd tree'
> before and after.  All these are available at
> http://voms.simonsfoundation.org:50013/m1Maf76sV1kS95spXQpijycmne92yjm/ceph-20170720/
>
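> (For reference, roughly the commands behind those before/after dumps; the
> output file names are just illustrative:)
>
>     ceph pg dump > pg-dump-before.txt
>     ceph osd tree > osd-tree-before.txt
>     # ... add the OSD, then repeat into the -after.txt files ...
>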
> All pools are now replicated with size 3 and min_size 2.  Let me know if
> any other info would be helpful.
>
>
> Andras
>
>
>
> On 07/06/2017 02:30 PM, Andras Pataki wrote:
>
> Hi Greg,
>
> At the moment our cluster is all in balance.  We have one failed drive
> that will be replaced in a few days (the OSD has been removed from ceph and
> will be re-added with the replacement drive).  I'll document the state of
> the PGs before the addition of the drive and during the recovery process
> and report back.
>
> We have a few pools; most are on 3 replicas now, and some holding
> non-critical data that we also have elsewhere are on 2.  But I've seen the
> degradation even on the 3-replica pools (I think my original example
> included such a pool as well).
>
> Andras
>
>
> On 06/30/2017 04:38 PM, Gregory Farnum wrote:
>
> On Wed, Jun 21, 2017 at 6:57 AM Andras Pataki <
> apat...@flatironinstitute.org> wrote:
>
>> Hi cephers,
>>
>> I noticed something I don't understand about ceph's behavior when adding
>> an OSD.  When I start with a clean cluster (all PGs active+clean) and add
>> an OSD (via ceph-deploy, for example), the CRUSH map gets updated and PGs
>> get reassigned to different OSDs, and the new OSD starts getting filled
>> with data.  As the new OSD gets filled, I start seeing PGs in degraded
>> states.  Here is an example:
>>
>>       pgmap v52068792: 42496 pgs, 6 pools, 1305 TB data, 390 Mobjects
>>             3164 TB used, 781 TB / 3946 TB avail
>> *            8017/994261437 objects degraded (0.001%)*
>>             2220581/994261437 objects misplaced (0.223%)
>>                42393 active+clean
>>                   91 active+remapped+wait_backfill
>>                    9 active+clean+scrubbing+deep
>> *                   1 active+recovery_wait+degraded*
>>                    1 active+clean+scrubbing
>>                    1 active+remapped+backfilling
>>
>>
>> Any ideas why there would be any persistent degradation in the cluster
>> while the newly added drive is being filled?  It takes perhaps a day or two
>> to fill the drive - and during all this time the cluster seems to be
>> running degraded.  As data is written to the cluster, the number of
>> degraded objects increases over time.  Once the newly added OSD is filled,
>> the cluster comes back to clean again.
>>
>> Here is the PG that is degraded in this picture:
>>
>> 7.87c    1    0    2    0    0    4194304    7    7
>> active+recovery_wait+degraded    2017-06-20 14:12:44.119921    344610'7
>> 583572:2797    [402,521]    402    [402,521]    402    344610'7
>> 2017-06-16 06:04:55.822503    344610'7    2017-06-16 06:04:55.822503
>>
>> The newly added OSD here is 521.  Before it got added, this PG had two
>> clean replicas, but one of them got forgotten somehow?
>>
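>> (To double-check how this PG is mapped, something like this should print
>> the current up and acting sets:)
>>
>>     ceph pg map 7.87c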
>
> This sounds a bit concerning at first glance. Can you provide some output
> of exactly what commands you're invoking, and the "ceph -s" output as it
> changes in response?
>
> I really don't see how adding a new OSD can result in it "forgetting"
> about existing valid copies — it's definitely not supposed to — so I wonder
> if there's a collision in how it's deciding to remove old locations.
>
> Are you running with only two copies of your data? It shouldn't matter but
> there could also be errors resulting in a behavioral difference between two
> and three copies.
> -Greg
>
>
>>
>> Other remapped PGs have 521 in their "up" set but still have the two
>> existing copies in their "acting" set - and no degradation is shown.
>> Examples:
>>
>> 2.f24    14282    0    16    28564    0    51014850801    3102    3102
>> active+remapped+wait_backfill    2017-06-20 14:12:42.650308
>> 583553'2033479    583573:2033266    [467,521]    467    [467,499]    467
>> 582430'2033337    2017-06-16 09:08:51.055131    582036'2030837
>> 2017-05-31 20:37:54.831178
>> 6.2b7d    10499    0    140    20998    0    37242874687    3673
>> 3673    active+remapped+wait_backfill    2017-06-20 14:12:42.070019
>> 583569'165163    583572:342128    [541,37,521]    541    [541,37,532]
>> 541    582430'161890    2017-06-18 09:42:49.148402    582430'161890
>> 2017-06-18 09:42:49.148402
>>
>> We are running the latest Jewel patch level everywhere (10.2.7).  Any
>> insights would be appreciated.
>>
>> Andras
>>
>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
