Hmm, it's possible there aren't any safeguards against filling the whole drive 
when increasing PGs. Actually, I think ceph only cares about free space when 
backfilling, which is not what happened (at least not directly) in your case.
However, having a completely full OSD filesystem is not going to end well - 
better to trash the OSD if it crashes because of it.
Be aware that whenever ceph starts backfilling it temporarily needs more space, 
and sometimes it shuffles more data than you'd expect. What can happen is that 
while OSD1 is trying to get rid of data, it simultaneously gets filled with 
data from another OSD (because crush-magic happens), and if that eats the last 
bits of space it's going to go FUBAR.
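
If you want to keep an eye on how full each OSD gets while data is moving, a 
few plain commands are enough - just a sketch, assuming your release already 
has "ceph osd df" (plain df on the mounts tells the same story either way):

ceph osd df                       # per-OSD utilization as the cluster sees it
ceph health detail                # lists the near-full / full OSDs explicitly
df -h /var/lib/ceph/osd/ceph-*    # what the filesystems themselves report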

You can set "nobackfill" on the cluster; that will temporarily prevent ceph 
from shuffling anything around (set it before you restart the OSDs). But I 
wonder if it's too late - that 20 KB free in the df output scares me.
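
For the record, those are just the cluster-wide flags - a minimal sketch:

ceph osd set nobackfill      # stop backfill cluster-wide
ceph osd set norecover       # optionally pause recovery as well
# ...restart / fix the OSDs...
ceph osd unset norecover
ceph osd unset nobackfill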

The safest way would probably be to trash osd.5 and osd.4 in your case, create 
two new OSDs in their place and backfill them again (with lower reweight). It's 
up to you whether you can afford the IO it will cause.
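
Roughly like this for each of the two OSDs - a sketch from memory, so check it 
against the docs for your release; re-creating the OSD depends on whatever 
tooling you normally use (ceph-disk, ceph-deploy, ...):

# example for osd.4 - stop the daemon first, then remove it from the cluster
ceph osd out 4
ceph osd crush remove osd.4
ceph auth del osd.4
ceph osd rm 4
# wipe the drive and re-create the OSD with your usual tooling, then lower
# its reweight before it starts taking data:
ceph osd reweight osd.4 0.8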

Which OSDs actually crashed? 4 and 5? Too late to save them methinks...

Jan



> On 17 Feb 2016, at 23:06, Lukáš Kubín <lukas.ku...@gmail.com> wrote:
> 
> You're right, the "full" osd was still up and in until I increased the pg_num 
> of one of the pools. The redistribution has not completed yet and perhaps 
> that's what is still filling the drive. With this info - do you think it's 
> still safe for me to follow the steps suggested in your previous post?
> 
> Thanks!
> 
> Lukas
> 
> On Wed, Feb 17, 2016 at 10:29 PM Jan Schermer <j...@schermer.cz> wrote:
> Something must be on those 2 OSDs that ate all that space - ceph by default 
> doesn't allow an OSD to get completely full (filesystem-wise), and from what 
> you've shown those filesystems are really, really full.
> OSDs don't usually go down when "full" (95%)... or do they? I don't think 
> so... so the reason they stopped is most likely a completely full filesystem. 
> You have to move something out of the way, restart those OSDs with lower 
> reweight, and hopefully everything will be good.
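> 
> To see what actually ate the space before you move anything, something like 
> this should do (a sketch, assuming the usual FileStore layout with the pg 
> directories under current/):
> 
> du -sh /var/lib/ceph/osd/ceph-4/current/* | sort -rh | head -20
> du -sh /var/lib/ceph/osd/ceph-5/current/* | sort -rh | head -20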
> 
> Jan
> 
> 
>> On 17 Feb 2016, at 22:22, Lukáš Kubín <lukas.ku...@gmail.com> wrote:
>> 
>> Ahoj Jan, thanks for the quick hint!
>> 
>> Those 2 OSDs are currently full and down. How should I handle that? Is it ok 
>> to delete some pg directories again and start the OSD daemons on both 
>> drives in parallel, then set the weights as recommended?
>> 
>> What effect should I expect then - will the cluster attempt to move some pgs 
>> off these drives to other local OSDs? I'm asking because when I attempted 
>> to delete pg dirs and restart the OSD for the first time, the OSD got full 
>> again very fast.
>> 
>> Thank you.
>> 
>> Lukas
>> 
>> 
>> 
>> On Wed, Feb 17, 2016 at 9:48 PM Jan Schermer <j...@schermer.cz> wrote:
>> Ahoj ;-)
>> 
>> You can reweight them temporarily, that shifts the data from the full drives.
>> 
>> ceph osd reweight osd.XX YY
>> (XX = the number of the full OSD, YY is the "weight", which defaults to 1)
>> 
>> This is different from "crush reweight" which defaults to drive size in TB.
>> 
>> Beware that reweighting will (afaik) only shuffle the data to other local 
>> drives, so you should reweight both full drives at the same time and only 
>> by a little bit at a time (0.95 is a good starting point).
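>> 
>> So in your case that would be something like this - both full OSDs at once, 
>> and only a small step to start with:
>> 
>> ceph osd reweight osd.4 0.95
>> ceph osd reweight osd.5 0.95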
>> 
>> Jan
>> 
>>  
>> 
>>> On 17 Feb 2016, at 21:43, Lukáš Kubín <lukas.ku...@gmail.com> wrote:
>>> 
>> 
>>> Hi,
>>> I'm running a very small setup of 2 nodes with 6 OSDs each. There are 2 
>>> pools, each of size=2. Today, one of our OSDs got full and another 2 became 
>>> near full, and the cluster turned into ERR state. I noticed uneven space 
>>> distribution among the OSD drives, between 70 and 100 percent. I realized 
>>> there was a low number of pgs in those 2 pools (128 each) and increased one 
>>> of them to 512, expecting some magic to happen and redistribute the space 
>>> evenly.
>>> 
>>> Well, something happened - another OSD became full during the 
>>> redistribution, and the cluster stopped both OSDs and marked them down. 
>>> After some hours the remaining drives partially rebalanced and the cluster 
>>> got to WARN state.
>>> 
>>> I've deleted 3 placement group directories from the filesystem of one of 
>>> the full OSDs, which allowed me to start it up again. Soon, however, this 
>>> drive became full again.
>>> 
>>> So now 2 of the 12 OSDs are down, the cluster is in WARN, and I have no 
>>> drives to add.
>>> 
>>> Is there a way to get out of this situation without adding OSDs? I will 
>>> attempt to release some space - I'm just waiting for a colleague to 
>>> identify RBD volumes (OpenStack images and volumes) which can be deleted.
>>> 
>>> Thank you.
>>> 
>>> Lukas
>>> 
>>> 
>>> This is my cluster state now:
>>> 
>>> [root@compute1 ~]# ceph -w
>>>     cluster d35174e9-4d17-4b5e-80f2-02440e0980d5
>>>      health HEALTH_WARN
>>>             10 pgs backfill_toofull
>>>             114 pgs degraded
>>>             114 pgs stuck degraded
>>>             147 pgs stuck unclean
>>>             114 pgs stuck undersized
>>>             114 pgs undersized
>>>             1 requests are blocked > 32 sec
>>>             recovery 56923/640724 objects degraded (8.884%)
>>>             recovery 29122/640724 objects misplaced (4.545%)
>>>             3 near full osd(s)
>>>      monmap e3: 3 mons at {compute1=10.255.242.14:6789/0,compute2=10.255.242.15:6789/0,compute3=10.255.242.16:6789/0}
>>>             election epoch 128, quorum 0,1,2 compute1,compute2,compute3
>>>      osdmap e1073: 12 osds: 10 up, 10 in; 39 remapped pgs
>>>       pgmap v21609066: 640 pgs, 2 pools, 2390 GB data, 309 kobjects
>>>             4365 GB used, 890 GB / 5256 GB avail
>>>             56923/640724 objects degraded (8.884%)
>>>             29122/640724 objects misplaced (4.545%)
>>>                  493 active+clean
>>>                  108 active+undersized+degraded
>>>                   29 active+remapped
>>>                    6 active+undersized+degraded+remapped+backfill_toofull
>>>                    4 active+remapped+backfill_toofull
>>> 
>>> [root@ceph1 ~]# df|grep osd
>>> /dev/sdg1               580496384 500066812  80429572  87% /var/lib/ceph/osd/ceph-3
>>> /dev/sdf1               580496384 502131428  78364956  87% /var/lib/ceph/osd/ceph-2
>>> /dev/sde1               580496384 506927100  73569284  88% /var/lib/ceph/osd/ceph-0
>>> /dev/sdb1               287550208 287550188        20 100% /var/lib/ceph/osd/ceph-5
>>> /dev/sdd1               580496384 580496364        20 100% /var/lib/ceph/osd/ceph-4
>>> /dev/sdc1               580496384 478675672 101820712  83% /var/lib/ceph/osd/ceph-1
>>> 
>>> [root@ceph2 ~]# df|grep osd
>>> /dev/sdf1               580496384 448689872 131806512  78% /var/lib/ceph/osd/ceph-7
>>> /dev/sdb1               287550208 227054336  60495872  79% /var/lib/ceph/osd/ceph-11
>>> /dev/sdd1               580496384 464175196 116321188  80% /var/lib/ceph/osd/ceph-10
>>> /dev/sdc1               580496384 489451300  91045084  85% /var/lib/ceph/osd/ceph-6
>>> /dev/sdg1               580496384 470559020 109937364  82% /var/lib/ceph/osd/ceph-9
>>> /dev/sde1               580496384 490289388  90206996  85% /var/lib/ceph/osd/ceph-8
>>> 
>>> [root@ceph2 ~]# ceph df
>>> GLOBAL:
>>>     SIZE      AVAIL     RAW USED     %RAW USED
>>>     5256G      890G        4365G         83.06
>>> POOLS:
>>>     NAME       ID     USED      %USED     MAX AVAIL     OBJECTS
>>>     glance     6      1714G     32.61          385G      219579
>>>     cinder     7       676G     12.86          385G       97488
>>> 
>>> [root@ceph2 ~]# ceph osd pool get glance pg_num
>>> pg_num: 512
>>> [root@ceph2 ~]# ceph osd pool get cinder pg_num
>>> pg_num: 128
>>> 
>> 
>> 
> 

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
