On 10/9/2018 12:19 PM, Gregory Farnum wrote:
On Wed, Oct 3, 2018 at 10:18 AM Graham Allan <g...@umn.edu> wrote:

    However I have one pg which is stuck in state remapped+incomplete
    because it has only 4 out of 6 osds running, and I have been unable to
    bring the missing two back into service.

     > PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg incomplete
     >     pg 70.82d is remapped+incomplete, acting [2147483647,2147483647,190,448,61,315]
     >     (reducing pool .rgw.buckets.ec42 min_size from 5 may help; search
     >     ceph.com/docs for 'incomplete')

    I don't think I want to do anything with min_size as that would make all
    other pgs vulnerable to running dangerously undersized (unless there is
    any way to force that state for only a single pg). It seems to me that
    with 4/6 osds available, it should maybe be possible to force ceph to
    select one or two new osds to rebalance this pg to?


I think, unfortunately, the easiest way for you to fix this will be to set min_size back to 4 until the PG is recovered (or at least has 5 shards done). This will be fixed in a later version of Ceph and probably backported, but sadly it's not done yet.
-Greg
Thanks Greg, though sadly I've tried that; whatever I do, one of the 4 osds involved will simply crash (not just the ones I previously tried to re-import via ceph-objectstore-tool). I just spend time chasing them around, never succeeding in keeping a complete set running long enough to make progress. They seem to crash when starting backfill on the next object, so there must be something in the current set of shards which they can't handle.
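
For reference, the min_size change itself was straightforward enough; what I
tried was along these lines (pool name from the health warning above, putting
it back to 5 afterwards):

  # drop min_size on the EC pool so the pg can go active with only 4 shards
  ceph osd pool set .rgw.buckets.ec42 min_size 4
  # restore it once the pg has recovered (or at least has 5 shards again)
  ceph osd pool set .rgw.buckets.ec42 min_size 5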

Since then I've been focusing on trying to get the pg to revert to an earlier interval using osd_find_best_info_ignore_history_les, though the information I can find about it is minimal.

Most sources seem to suggest setting it for the primary osd and then either marking it down or restarting it, but that simply seems to result in the osd disappearing from the pg. After setting the flag for all of the "acting" osds (most recent interval), the pg switched to having its set of "acting" osds == "up" osds, but it is still "incomplete" (it has not reverted to the set of osds from an earlier interval). At present it is still stuck on the condition "peering_blocked_by_history_les_bound".
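
For completeness, the way I've been setting the flag is roughly this - inject
it at runtime on one of the acting osds (osd.190 here, from the set above),
mark the osd down so it re-peers, then clear the flag again afterwards. I'm
not certain injectargs is sufficient for this option, as opposed to putting
it in ceph.conf and restarting the daemon:

  ceph tell osd.190 injectargs '--osd_find_best_info_ignore_history_les=true'
  ceph osd down 190
  # revert once peering has (hopefully) completed
  ceph tell osd.190 injectargs '--osd_find_best_info_ignore_history_les=false'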

I'm guessing that I actually need to set the flag osd_find_best_info_ignore_history_les for *all* osds involved in the historical record of this pg (the "probing osds" list?), and restart them all...
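
If it matters, the list I've been looking at comes from the pg query output,
something like:

  # the peering section of the query lists probing_osds (and, I believe,
  # down_osds_we_would_probe)
  ceph pg 70.82d query > /tmp/pg.70.82d.json
  grep -A12 '"probing_osds"' /tmp/pg.70.82d.json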

Still also trying to understand exactly how the flag works. I think I see now that the "_les" bit must refer to "last epoch started"...
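
If that's right, the last_epoch_started values it compares should be visible
per shard in the same pg query output, e.g.:

  # last_epoch_started appears in the pg's own history and in each peer's info
  ceph pg 70.82d query | grep last_epoch_started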

--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu