Hi Greg,

I have just added a single drive/OSD to a clean cluster, and I can see the degradation immediately. We are on Ceph 10.2.9 everywhere.
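
The OSD was brought in with a plain ceph-deploy add, roughly along these lines (the host and device names below are just placeholders, not the real ones):

        ceph-deploy osd create cephstor42:/dev/sdq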

Here is how the cluster looked before the OSD got added:

        cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
         health HEALTH_WARN
                noout flag(s) set
         monmap e31: 3 mons at {cephmon00=10.128.128.100:6789/0,cephmon01=10.128.128.101:6789/0,cephmon02=10.128.128.102:6789/0}
                election epoch 46092, quorum 0,1,2 cephmon00,cephmon01,cephmon02
          fsmap e26638: 1/1/1 up {0=cephmon01=up:active}, 2 up:standby
         osdmap e681227: 1270 osds: 1270 up, 1270 in
                flags noout,sortbitwise,require_jewel_osds
          pgmap v54583934: 42496 pgs, 6 pools, 1488 TB data, 437 Mobjects
                4471 TB used, 3416 TB / 7887 TB avail
                   42491 active+clean
                       5 active+clean+scrubbing+deep
      client io 2193 kB/s rd, 27240 kB/s wr, 85 op/s rd, 47 op/s wr


And this is shortly after it was added (after all the peering was done):

        cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
         health HEALTH_WARN
                141 pgs backfill_wait
                117 pgs backfilling
                20 pgs degraded
                20 pgs recovery_wait
                56 pgs stuck unclean
                recovery 130/1376744346 objects degraded (0.000%)
                recovery 3827502/1376744346 objects misplaced (0.278%)
                noout flag(s) set
         monmap e31: 3 mons at {cephmon00=10.128.128.100:6789/0,cephmon01=10.128.128.101:6789/0,cephmon02=10.128.128.102:6789/0}
                election epoch 46092, quorum 0,1,2 cephmon00,cephmon01,cephmon02
          fsmap e26638: 1/1/1 up {0=cephmon01=up:active}, 2 up:standby
         osdmap e681238: 1271 osds: 1271 up, 1271 in; 258 remapped pgs
                flags noout,sortbitwise,require_jewel_osds
          pgmap v54585141: 42496 pgs, 6 pools, 1488 TB data, 437 Mobjects
                4471 TB used, 3423 TB / 7895 TB avail
                *130/1376744346 objects degraded (0.000%)*
                3827502/1376744346 objects misplaced (0.278%)
                   42210 active+clean
                     141 active+remapped+wait_backfill
                     117 active+remapped+backfilling
                      *20 active+recovery_wait+degraded*
                       7 active+clean+scrubbing+deep
                       1 active+clean+scrubbing
   recovery io 17375 MB/s, 5069 objects/s
      client io 12210 kB/s rd, 29887 kB/s wr, 4 op/s rd, 140 op/s wr


Even though there was no failure, we have 20 degraded PGs and 130 degraded objects. My expectation was that some data would move around and start filling the added drive, but I would not expect to see degraded objects or PGs.
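
To see exactly which PGs these are, something along these lines works (7.87c is just an example PG id, taken from the June incident quoted below):

        ceph health detail | grep degraded
        ceph pg 7.87c query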

Also, as time passes, the number of degraded objects increases steadily; here is a snapshot taken a little later:

        cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
         health HEALTH_WARN
                63 pgs backfill_wait
                4 pgs backfilling
                67 pgs stuck unclean
                recovery 706/1377244134 objects degraded (0.000%)
                recovery 843267/1377244134 objects misplaced (0.061%)
                noout flag(s) set
         monmap e31: 3 mons at {cephmon00=10.128.128.100:6789/0,cephmon01=10.128.128.101:6789/0,cephmon02=10.128.128.102:6789/0}
                election epoch 46092, quorum 0,1,2 cephmon00,cephmon01,cephmon02
          fsmap e26640: 1/1/1 up {0=cephmon01=up:active}, 2 up:standby
         osdmap e681569: 1271 osds: 1271 up, 1271 in; 67 remapped pgs
                flags noout,sortbitwise,require_jewel_osds
          pgmap v54588554: 42496 pgs, 6 pools, 1488 TB data, 437 Mobjects
                4471 TB used, 3423 TB / 7895 TB avail
                *706/1377244134 objects degraded (0.000%)*
                843267/1377244134 objects misplaced (0.061%)
                   42422 active+clean
                      63 active+remapped+wait_backfill
                       5 active+clean+scrubbing+deep
                       4 active+remapped+backfilling
                       2 active+clean+scrubbing
   recovery io 779 MB/s, 229 objects/s
      client io 306 MB/s rd, 344 MB/s wr, 138 op/s rd, 226 op/s wr

From past experience, the degraded object count keeps going up for most of the time the disk is being filled, and only starts decreasing towards the end. Could writing to a pool that is waiting for backfill be what causes degraded objects to appear?
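
For what it's worth, the degraded/misplaced counts are easy to track over time with a trivial loop, something like:

        while true; do ceph -s | grep -E 'degraded|misplaced'; sleep 60; done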

I took a 'pg dump' before and after the change, as well as an 'osd tree' before and after. All these are available at http://voms.simonsfoundation.org:50013/m1Maf76sV1kS95spXQpijycmne92yjm/ceph-20170720/
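
(These were captured with the obvious commands, roughly:

        ceph pg dump > pg_dump.before.txt
        ceph osd tree > osd_tree.before.txt

and the same again after the OSD was added; the file names above are just illustrative.)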

All pools now have replicated size 3 and min_size 2. Let me know if any other info would be helpful.
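
(The replica settings can be double-checked per pool with:

        ceph osd pool get <pool> size
        ceph osd pool get <pool> min_size

where <pool> stands for each pool name.)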

Andras


On 07/06/2017 02:30 PM, Andras Pataki wrote:
Hi Greg,

At the moment our cluster is all in balance. We have one failed drive that will be replaced in a few days (the OSD has been removed from ceph and will be re-added with the replacement drive). I'll document the state of the PGs before the addition of the drive and during the recovery process and report back.

We have a few pools; most are on 3 replicas now, and some with non-critical data that we also have elsewhere are on 2. But I've seen the degradation even on the 3-replica pools (I believe my original example included such a pool as well).

Andras


On 06/30/2017 04:38 PM, Gregory Farnum wrote:
On Wed, Jun 21, 2017 at 6:57 AM Andras Pataki <apat...@flatironinstitute.org> wrote:

    Hi cephers,

    I noticed something I don't understand about ceph's behavior when
    adding an OSD.  When I start with a clean cluster (all PG's
    active+clean) and add an OSD (via ceph-deploy for example), the
    crush map gets updated and PGs get reassigned to different OSDs,
    and the new OSD starts getting filled with data.  As the new OSD
    gets filled, I start seeing PGs in degraded states.  Here is an
    example:

              pgmap v52068792: 42496 pgs, 6 pools, 1305 TB data, 390 Mobjects
                    3164 TB used, 781 TB / 3946 TB avail
                    *8017/994261437 objects degraded (0.001%)*
                    2220581/994261437 objects misplaced (0.223%)
                       42393 active+clean
                          91 active+remapped+wait_backfill
                           9 active+clean+scrubbing+deep
                          *1 active+recovery_wait+degraded*
                           1 active+clean+scrubbing
                           1 active+remapped+backfilling


    Any ideas why there would be any persistent degradation in the
    cluster while the newly added drive is being filled?  It takes
    perhaps a day or two to fill the drive - and during all this time
    the cluster seems to be running degraded.  As data is written to
    the cluster, the number of degraded objects increases over time.
    Once the newly added OSD is filled, the cluster comes back to
    clean again.

    Here is the PG that is degraded in this picture:

    7.87c    1    0    2    0    0    4194304    7    7    active+recovery_wait+degraded
        2017-06-20 14:12:44.119921    344610'7    583572:2797
        [402,521]    402    [402,521]    402
        344610'7    2017-06-16 06:04:55.822503    344610'7    2017-06-16 06:04:55.822503

    The newly added OSD here is 521.  Before it got added, this PG
    had two clean replicas, but one got forgotten somehow?


This sounds a bit concerning at first glance. Can you provide some output of exactly what commands you're invoking, and the "ceph -s" output as it changes in response?

I really don't see how adding a new OSD can result in it "forgetting" about existing valid copies — it's definitely not supposed to — so I wonder if there's a collision in how it's deciding to remove old locations.

Are you running with only two copies of your data? It shouldn't matter but there could also be errors resulting in a behavioral difference between two and three copies.
-Greg


    Other remapped PGs have 521 in their "up" set but still have the
    two existing copies in their "acting" set - and no degradation is
    shown.  Examples:

    2.f24     14282    0    16     28564    0    51014850801    3102    3102    active+remapped+wait_backfill
        2017-06-20 14:12:42.650308    583553'2033479    583573:2033266
        [467,521]    467    [467,499]    467
        582430'2033337    2017-06-16 09:08:51.055131    582036'2030837    2017-05-31 20:37:54.831178
    6.2b7d    10499    0    140    20998    0    37242874687    3673    3673    active+remapped+wait_backfill
        2017-06-20 14:12:42.070019    583569'165163    583572:342128
        [541,37,521]    541    [541,37,532]    541
        582430'161890    2017-06-18 09:42:49.148402    582430'161890    2017-06-18 09:42:49.148402
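
    The up vs. acting sets for any of these can also be pulled directly with,
    for example:

        ceph pg map 2.f24

    which prints the osdmap epoch and the up and acting OSD lists for that PG.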

    We are running the latest Jewel patch level everywhere (10.2.7). Any insights would be appreciated.

    Andras
