Hi all,

thank you for your support; the file system is not degraded any more. Now I
have negative degradation :-)

2014-10-21 10:15:22.303139 mon.0 [INF] pgmap v43376478: 3328 pgs: 3281 
active+clean, 47 active+remapped; 1609 GB data, 5022 GB used, 1155 GB / 6178 GB 
avail; 8034B/s rd, 3548KB/s wr, 161op/s; -1638/1329293 degraded (-0.123%)

but ceph still reports HEALTH_WARN: 47 pgs stuck unclean; recovery
-1638/1329293 degraded (-0.123%)

I think this warning is reported because of the 47 active+remapped PGs.
Any ideas how to fix that now?
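
A rough sketch of how the stuck PGs can be inspected (standard Ceph CLI
commands; <pgid> below is just a placeholder for one of the reported PG IDs):

    # show which PGs are stuck and why
    ceph health detail

    # list the PGs that are stuck in the unclean state
    ceph pg dump_stuck unclean

    # query one of the reported PGs to see its up/acting OSD sets
    ceph pg <pgid> query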

Kind Regards
Harald Roessler


On 21.10.2014 at 01:03, Craig Lewis <cle...@centraldesktop.com> wrote:

I've been in a state where reweight-by-utilization was deadlocked (not the
daemons, but the remap scheduling).  After successive osd reweight commands,
two OSDs wanted to swap PGs, but they were both toofull.  I ended up
temporarily increasing mon_osd_nearfull_ratio to 0.87.  That removed the
impediment, everything went smoothly, and I changed the ratio back once all
the remapping had finished.

Just be careful if you need to get close to mon_osd_full_ratio.  Ceph compares
these percentages with greater-than, not greater-than-or-equal.  You really
don't want any disk to go above mon_osd_full_ratio, because all external IO
will stop until you resolve that.
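
For anyone trying the same workaround, the commands look roughly like this on
the releases of that era (a sketch only; check the exact option names against
your version, and note that the backfill "toofull" check also depends on
osd_backfill_full_ratio on the OSD side):

    # temporarily raise the nearfull threshold stored in the PG map
    ceph pg set_nearfull_ratio 0.87

    # if PGs remain in backfill_toofull, the OSD-side backfill threshold
    # can be raised at runtime as well
    ceph tell osd.* injectargs '--osd-backfill-full-ratio 0.87'

    # set both back to their defaults once the remapping has finished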


On Mon, Oct 20, 2014 at 10:18 AM, Leszek Master <keks...@gmail.com> wrote:
You can set a lower weight on the full OSDs, or try changing the
osd_near_full_ratio parameter in your cluster from 85 to, for example, 89.
But I don't know what can go wrong when you do that.
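
A minimal sketch of both options (the OSD id and the 0.9 value are
placeholders; pick them based on your actual utilization):

    # lower the reweight value (0.0 - 1.0) of a full OSD so data moves off it
    ceph osd reweight <osd-id> 0.9

    # or let Ceph pick overloaded OSDs automatically, here everything above
    # 120% of the average utilization
    ceph osd reweight-by-utilization 120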


2014-10-20 17:12 GMT+02:00 Wido den Hollander <w...@42on.com>:
On 10/20/2014 05:10 PM, Harald Rößler wrote:
> yes, tomorrow I will get the replacement for the failed disk; getting a new
> node with many disks will take a few days.
> No other ideas?
>

If the disks are all full, then no.

Sorry to say this, but it came down to poor capacity management. Never
let any disk in your cluster fill over 80% to prevent these situations.

Wido
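
For completeness, a quick way to see how close each OSD is to those ratios
(availability varies by release; ceph osd df only exists on newer versions):

    # per-OSD used/available space as tracked by the monitors
    ceph pg dump osds

    # on newer releases this is more readable
    ceph osd df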

> Harald Rößler
>
>
>> On 20.10.2014 at 16:45, Wido den Hollander <w...@42on.com> wrote:
>>
>> On 10/20/2014 04:43 PM, Harald Rößler wrote:
>>> Yes, I had some OSDs which were near full; I tried to fix the problem with
>>> "ceph osd reweight-by-utilization", but that did not help. After that I set
>>> the near full ratio to 88% with the idea that the remapping would fix the
>>> issue. A restart of the OSDs didn't help either. At the same time I had a
>>> hardware failure of one disk. :-( After that failure the recovery process
>>> started at "degraded ~13%" and stopped at 7%.
>>> Honestly, I am scared at the moment that I am doing the wrong operation.
>>>
>>
>> Any chance of adding a new node with some fresh disks? It seems like you
>> are operating at the storage capacity limit of the nodes and your only
>> remedy would be adding more spindles.
>>
>> Wido
>>
>>> Regards
>>> Harald Rößler
>>>
>>>
>>>
>>>> On 20.10.2014 at 14:51, Wido den Hollander <w...@42on.com> wrote:
>>>>
>>>> On 10/20/2014 02:45 PM, Harald Rößler wrote:
>>>>> Dear All
>>>>>
>>>>> I have an issue with my cluster at the moment: the recovery process stops.
>>>>>
>>>>
>>>> See this: 2 active+degraded+remapped+backfill_toofull
>>>>
>>>> 156 pgs backfill_toofull
>>>>
>>>> You have one or more OSDs which are too full, and that causes recovery to
>>>> stop.
>>>>
>>>> If you add more capacity to the cluster, recovery will continue and finish.
>>>>
>>>>> ceph -s
>>>>>  health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs 
>>>>> backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck 
>>>>> unclean; recovery 111487/1488290 degraded (7.491%)
>>>>>  monmap e2: 3 mons at 
>>>>> {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0},
>>>>>  election epoch 332, quorum 0,1,2 0,12,6
>>>>>  osdmap e6748: 24 osds: 23 up, 23 in
>>>>>   pgmap v43314672: 3328 pgs: 3031 active+clean, 43 
>>>>> active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 
>>>>> active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 
>>>>> 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 
>>>>> active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 
>>>>> active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 
>>>>> 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 
>>>>> active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 
>>>>> active+degraded+remapped+backfill_toofull, 2 
>>>>> active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 
>>>>> GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 
>>>>> degraded (7.491%)
>>>>>
>>>>>
>>>>> I have tried to restart all OSDs in the cluster, but that did not help to
>>>>> finish the recovery of the cluster.
>>>>>
>>>>> Does anyone have any ideas?
>>>>>
>>>>> Kind Regards
>>>>> Harald Rößler
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Wido den Hollander
>>>> Ceph consultant and trainer
>>>> 42on B.V.
>>>>
>>>> Phone: +31 (0)20 700 9902
>>>> Skype: contact42on
>>>
>>
>>
>> --
>> Wido den Hollander
>> Ceph consultant and trainer
>> 42on B.V.
>>
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on
>


--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
