Re: [ceph-users] backfill_toofull while OSDs are not full

2019-01-30 Thread Wido den Hollander


On 1/30/19 9:08 PM, David Zafman wrote:
> 
> Strange, I can't reproduce this with v13.2.4.  I tried the following
> scenarios:
> 
> pg acting 1, 0, 2 -> up 1, 0, 4 (osd.2 marked out).  The df on osd.2
> shows 0 space, but only osd.4 (the backfill target) checks full space.
> 
> pg acting 1, 0, 2 -> up 4,3,5 (osd.1,0,2 all marked out).  The df for
> 1,0,2 shows 0 space, but osd.4,3,5 (the backfill targets) check full space.
> 
> FYI, in a later release, even when a backfill target is below
> backfillfull_ratio, backfill_toofull occurs if there isn't enough room
> for the pg to fit.
> 
> 
> The question in your case is: were any of OSDs 999, 1900, or 145 above
> 90% (backfillfull_ratio) usage?

I triple-checked and this was not the case. I've run into this myself on two
Mimic 13.2.4 installations, and somebody else has reported it to me as well.

In a few weeks I'll be performing an expansion with a customer where I'm
expecting this to show up again.

I'll check again, note the utilization of all OSDs, and report back.
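
To make sure I capture the right data next time, I'll grab something along
these lines before and during the backfill (just a rough sketch; the file
name and the states to watch are arbitrary):

  # snapshot the utilization and reweight of every OSD, plus the ratios
  ceph osd df tree > osd-df-$(date +%F-%H%M).txt
  ceph osd dump | grep ratio
  # and keep track of which PGs are reporting backfill_toofull
  ceph pg ls backfill_toofull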

Wido

> 
> David
> 
> On 1/27/19 11:34 PM, Wido den Hollander wrote:
>>
>> On 1/25/19 8:33 AM, Gregory Farnum wrote:
>>> This doesn’t look familiar to me. Is the cluster still doing recovery so
>>> we can at least expect them to make progress when the “out” OSDs get
>>> removed from the set?
>> The recovery has already finished. It resolves itself, but in the
>> meantime I saw many PGs in the backfill_toofull state for a long time.
>>
>> This is new since Mimic.
>>
>> Wido
>>
>>> On Tue, Jan 22, 2019 at 2:44 PM Wido den Hollander wrote:
>>>
>>>  Hi,
>>>
>>>  I've got a couple of PGs which are stuck in backfill_toofull,
>>> but none
>>>  of them are actually full.
>>>
>>>    "up": [
>>>      999,
>>>      1900,
>>>      145
>>>    ],
>>>    "acting": [
>>>      701,
>>>      1146,
>>>      1880
>>>    ],
>>>    "backfill_targets": [
>>>      "145",
>>>      "999",
>>>      "1900"
>>>    ],
>>>    "acting_recovery_backfill": [
>>>      "145",
>>>      "701",
>>>      "999",
>>>      "1146",
>>>      "1880",
>>>      "1900"
>>>    ],
>>>
>>>  I checked all these OSDs, but they are all <75% utilization.
>>>
>>>  full_ratio 0.95
>>>  backfillfull_ratio 0.9
>>>  nearfull_ratio 0.9
>>>
>>>  So I started checking all the PGs and I've noticed that each of
>>> these
>>>  PGs has one OSD in the 'acting_recovery_backfill' which is
>>> marked as
>>>  out.
>>>
>>>  In this case osd.1880 is marked as out and thus its capacity is shown
>>>  as zero.
>>>
>>>  [ceph@ceph-mgr ~]$ ceph osd df|grep 1880
>>>  1880   hdd 4.54599        0     0 B      0 B      0 B     0    0  27
>>>  [ceph@ceph-mgr ~]$
>>>
>>>  This is on a Mimic 13.2.4 cluster. Is this expected, or is it an
>>>  unknown side-effect of one of the OSDs being marked as out?
>>>
>>>  Thanks,
>>>
>>>  Wido


Re: [ceph-users] backfill_toofull while OSDs are not full

2019-01-30 Thread David Zafman


Strange, I can't reproduce this with v13.2.4.  I tried the following 
scenarios:


pg acting 1, 0, 2 -> up 1, 0, 4 (osd.2 marked out).  The df on osd.2
shows 0 space, but only osd.4 (the backfill target) checks full space.


pg acting 1, 0, 2 -> up 4,3,5 (osd.1,0,2 all marked out).  The df for
1,0,2 shows 0 space, but osd.4,3,5 (the backfill targets) check full space.
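
(Roughly, such a test can be driven on a throwaway cluster like this; just a
sketch using the IDs above, not necessarily the exact commands I ran:)

  ceph osd out osd.2                                    # scenario 1: one acting-set member out
  ceph osd out osd.0 osd.1 osd.2                        # scenario 2: the whole acting set out
  ceph pg ls backfilling backfill_wait backfill_toofull # watch the PG while it remaps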


FYI, in a later release, even when a backfill target is below
backfillfull_ratio, backfill_toofull occurs if there isn't enough room
for the pg to fit.
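
(Hypothetical example of that: a backfill target at 85% of a 4 TiB OSD has
roughly 600 GiB free, so a PG carrying ~800 GiB of data would still be
flagged backfill_toofull even though 85% is below the 90% backfillfull_ratio.)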



The question in your case is: were any of OSDs 999, 1900, or 145 above
90% (backfillfull_ratio) usage?
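
Something like this (a quick, untested sketch using the OSD IDs from your
paste) should show it at a glance:

  ceph osd df | grep -E '^ *(145|999|1900) '   # compare %USE against backfillfull_ratio (90%)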


David

On 1/27/19 11:34 PM, Wido den Hollander wrote:


On 1/25/19 8:33 AM, Gregory Farnum wrote:

This doesn’t look familiar to me. Is the cluster still doing recovery so
we can at least expect them to make progress when the “out” OSDs get
removed from the set?

The recovery has already finished. It resolves itself, but in the
meantime I saw many PGs in the backfill_toofull state for a long time.

This is new since Mimic.

Wido


On Tue, Jan 22, 2019 at 2:44 PM Wido den Hollander <w...@42on.com> wrote:

 Hi,

 I've got a couple of PGs which are stuck in backfill_toofull, but none
 of them are actually full.

   "up": [
     999,
     1900,
     145
   ],
   "acting": [
     701,
     1146,
     1880
   ],
   "backfill_targets": [
     "145",
     "999",
     "1900"
   ],
   "acting_recovery_backfill": [
     "145",
     "701",
     "999",
     "1146",
     "1880",
     "1900"
   ],

 I checked all these OSDs, but they are all <75% utilization.

 full_ratio 0.95
 backfillfull_ratio 0.9
 nearfull_ratio 0.9

 So I started checking all the PGs and I've noticed that each of these
 PGs has one OSD in the 'acting_recovery_backfill' which is marked as
 out.

 In this case osd.1880 is marked as out and thus its capacity is shown
 as zero.

 [ceph@ceph-mgr ~]$ ceph osd df|grep 1880
 1880   hdd 4.54599        0     0 B      0 B      0 B     0    0  27
 [ceph@ceph-mgr ~]$

 This is on a Mimic 13.2.4 cluster. Is this expected, or is it an unknown
 side-effect of one of the OSDs being marked as out?

 Thanks,

 Wido


Re: [ceph-users] backfill_toofull while OSDs are not full

2019-01-27 Thread Wido den Hollander


On 1/25/19 8:33 AM, Gregory Farnum wrote:
> This doesn’t look familiar to me. Is the cluster still doing recovery so
> we can at least expect them to make progress when the “out” OSDs get
> removed from the set?

The recovery has already finished. It resolves itself, but in the
meantime I saw many PGs in the backfill_toofull state for a long time.

This is new since Mimic.

Wido

> On Tue, Jan 22, 2019 at 2:44 PM Wido den Hollander wrote:
> 
> Hi,
> 
> I've got a couple of PGs which are stuck in backfill_toofull, but none
> of them are actually full.
> 
>   "up": [
>     999,
>     1900,
>     145
>   ],
>   "acting": [
>     701,
>     1146,
>     1880
>   ],
>   "backfill_targets": [
>     "145",
>     "999",
>     "1900"
>   ],
>   "acting_recovery_backfill": [
>     "145",
>     "701",
>     "999",
>     "1146",
>     "1880",
>     "1900"
>   ],
> 
> I checked all these OSDs, but they are all <75% utilization.
> 
> full_ratio 0.95
> backfillfull_ratio 0.9
> nearfull_ratio 0.9
> 
> So I started checking all the PGs and I've noticed that each of these
> PGs has one OSD in the 'acting_recovery_backfill' which is marked as
> out.
> 
> In this case osd.1880 is marked as out and thus its capacity is shown
> as zero.
> 
> [ceph@ceph-mgr ~]$ ceph osd df|grep 1880
> 1880   hdd 4.54599        0     0 B      0 B      0 B     0    0  27
> [ceph@ceph-mgr ~]$
> 
> This is on a Mimic 13.2.4 cluster. Is this expected, or is it an unknown
> side-effect of one of the OSDs being marked as out?
> 
> Thanks,
> 
> Wido


Re: [ceph-users] backfill_toofull while OSDs are not full

2019-01-25 Thread Paul Emmerich
I've also seen this effect a few times since Mimic (it never happened in
Luminous). It always resolved itself, but the HEALTH_ERR status can be
confusing to users.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Jan 25, 2019 at 8:33 AM Gregory Farnum  wrote:
>
> This doesn’t look familiar to me. Is the cluster still doing recovery so we 
> can at least expect them to make progress when the “out” OSDs get removed 
> from the set?
> On Tue, Jan 22, 2019 at 2:44 PM Wido den Hollander  wrote:
>>
>> Hi,
>>
>> I've got a couple of PGs which are stuck in backfill_toofull, but none
>> of them are actually full.
>>
>>   "up": [
>> 999,
>> 1900,
>> 145
>>   ],
>>   "acting": [
>> 701,
>> 1146,
>> 1880
>>   ],
>>   "backfill_targets": [
>> "145",
>> "999",
>> "1900"
>>   ],
>>   "acting_recovery_backfill": [
>> "145",
>> "701",
>> "999",
>> "1146",
>> "1880",
>> "1900"
>>   ],
>>
>> I checked all these OSDs, but they are all <75% utilization.
>>
>> full_ratio 0.95
>> backfillfull_ratio 0.9
>> nearfull_ratio 0.9
>>
>> So I started checking all the PGs and I've noticed that each of these
>> PGs has one OSD in the 'acting_recovery_backfill' which is marked as out.
>>
>> In this case osd.1880 is marked as out and thus its capacity is shown
>> as zero.
>>
>> [ceph@ceph-mgr ~]$ ceph osd df|grep 1880
>> 1880   hdd 4.54599        0     0 B      0 B      0 B     0    0  27
>> [ceph@ceph-mgr ~]$
>>
>> This is on a Mimic 13.2.4 cluster. Is this expected, or is it an unknown
>> side-effect of one of the OSDs being marked as out?
>>
>> Thanks,
>>
>> Wido


Re: [ceph-users] backfill_toofull while OSDs are not full

2019-01-24 Thread Gregory Farnum
This doesn’t look familiar to me. Is the cluster still doing recovery so we
can at least expect them to make progress when the “out” OSDs get removed
from the set?
On Tue, Jan 22, 2019 at 2:44 PM Wido den Hollander  wrote:

> Hi,
>
> I've got a couple of PGs which are stuck in backfill_toofull, but none
> of them are actually full.
>
>   "up": [
> 999,
> 1900,
> 145
>   ],
>   "acting": [
> 701,
> 1146,
> 1880
>   ],
>   "backfill_targets": [
> "145",
> "999",
> "1900"
>   ],
>   "acting_recovery_backfill": [
> "145",
> "701",
> "999",
> "1146",
> "1880",
> "1900"
>   ],
>
> I checked all these OSDs, but they are all <75% utilization.
>
> full_ratio 0.95
> backfillfull_ratio 0.9
> nearfull_ratio 0.9
>
> So I started checking all the PGs and I've noticed that each of these
> PGs has one OSD in the 'acting_recovery_backfill' which is marked as out.
>
> In this case osd.1880 is marked as out and thus its capacity is shown
> as zero.
>
> [ceph@ceph-mgr ~]$ ceph osd df|grep 1880
> 1880   hdd 4.54599        0     0 B      0 B      0 B     0    0  27
> [ceph@ceph-mgr ~]$
>
> This is on a Mimic 13.2.4 cluster. Is this expected, or is it an unknown
> side-effect of one of the OSDs being marked as out?
>
> Thanks,
>
> Wido


[ceph-users] backfill_toofull while OSDs are not full

2019-01-22 Thread Wido den Hollander
Hi,

I've got a couple of PGs which are stuck in backfill_toofull, but none
of them are actually full.

  "up": [
999,
1900,
145
  ],
  "acting": [
701,
1146,
1880
  ],
  "backfill_targets": [
"145",
"999",
"1900"
  ],
  "acting_recovery_backfill": [
"145",
"701",
"999",
"1146",
"1880",
"1900"
  ],

I checked all these OSDs, but they are all <75% utilization.

full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.9

So I started checking all the PGs and I've noticed that each of these
PGs has one OSD in the 'acting_recovery_backfill' which is marked as out.

In this case osd.1880 is marked as out and thus its capacity is shown
as zero.

[ceph@ceph-mgr ~]$ ceph osd df|grep 1880
1880   hdd 4.54599        0     0 B      0 B      0 B     0    0  27
[ceph@ceph-mgr ~]$

This is on a Mimic 13.2.4 cluster. Is this expected, or is it an unknown
side-effect of one of the OSDs being marked as out?
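
(In case anyone wants to repeat the check, this is roughly how it can be
scripted; a sketch that assumes the pg query JSON layout shown above and jq
installed, and the exact "ceph health detail" wording may differ per release:)

  for pg in $(ceph health detail | awk '/backfill_toofull/ && $1 == "pg" {print $2}'); do
    echo "== $pg =="
    for osd in $(ceph pg $pg query | jq -r '.acting_recovery_backfill[]'); do
      # the REWEIGHT column shows 0 for an OSD that is marked out
      ceph osd df | awk -v id="$osd" '$1 == id'
    done
  done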

Thanks,

Wido