[ceph-users] RBD client newer than cluster

2017-02-14 Thread Lukáš Kubín
Hi,
I'm most probably hitting bug http://tracker.ceph.com/issues/13755 -
libvirt-mounted RBD disks suspend I/O during snapshot creation until a hard
reboot.

My Ceph cluster (monitors and OSDs) is running v0.94.3, while clients
(OpenStack/KVM computes) run v0.94.5. Can I still update the client
packages (librbd1 and dependencies) to a patched release 0.94.7, while
keeping the cluster on v0.94.3?
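
For context, the update I have in mind on a compute node is nothing more than
something like this (a sketch, assuming RPM-based client nodes and the hammer
package repo):

rpm -q librbd1 librados2       # current client-side versions (0.94.5 here)
yum update librbd1 librados2   # pull the patched 0.94.7 packages
# note: running qemu processes keep the old library loaded until the
# instances are restarted or live-migrated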

I realize it's not ideal, but does it present any risk? Can I assume that
patching the client is sufficient to resolve the mentioned bug?

The Ceph cluster nodes can't receive updates currently and this will remain so
for some time, but I need to resolve the snapshot bug urgently.

Greetings,

Lukáš


Re: [ceph-users] RBD client newer than cluster

2017-02-14 Thread Lukáš Kubín
Yes, also. The main reason, though, is a temporarily missing connection from
the Ceph nodes to the package repo - it will take some days or weeks to restore.
The client nodes can connect and update.

Thanks,

Lukáš

On Tue, Feb 14, 2017 at 6:56 PM, Shinobu Kinjo  wrote:

> On Wed, Feb 15, 2017 at 2:18 AM, Lukáš Kubín 
> wrote:
> > Hi,
> > I'm most probably hitting bug http://tracker.ceph.com/issues/13755 -
> when
> > libvirt mounted RBD disks suspend I/O during snapshot creation until hard
> > reboot.
> >
> > My Ceph cluster (monitors and OSDs) is running v0.94.3, while clients
> > (OpenStack/KVM computes) run v0.94.5. Can I still update the client
> packages
> > (librbd1 and dependencies) to a patched release 0.94.7, while keeping the
> > cluster on v0.94.3?
>
> The latest hammer release is v0.94.9, and hammer will be EOL this spring.
> Why do you want to keep v0.94.3? Is it just to avoid any risks related
> to upgrading packages?
>
> >
> > I realize it's not ideal but does it present any risk? Can I assume that
> > patching the client is sufficient to resolve the mentioned bug?
> >
> > Ceph cluster nodes can't receive updates currently and this will stay so
> for
> > some time still, but I need to resolve the snapshot bug urgently.
> >
> > Greetings,
> >
> > Lukáš
> >
>


[ceph-users] How to recover from OSDs full in small cluster

2016-02-17 Thread Lukáš Kubín
Hi,
I'm running a very small setup of 2 nodes with 6 OSDs each. There are 2
pools, each with size=2. Today one of our OSDs got full and another 2 got near
full, and the cluster turned into ERR state. I noticed uneven space
distribution among the OSD drives, between 70 and 100 percent. I realized
there's a low number of pgs in those 2 pools (128 each) and increased one
of them to 512, expecting magic to happen and redistribute the space
evenly.
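
For reference, the increase amounts to the following commands (glance is the
pool in question, per the pool get output at the end; pgp_num is assumed to
have been bumped along with pg_num):

ceph osd pool set glance pg_num 512
ceph osd pool set glance pgp_num 512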

Well, something happened - another OSD became full during the
redistribution, and the cluster stopped both OSDs and marked them down. After
some hours the remaining drives partially rebalanced and the cluster got to
WARN state.

I've deleted 3 placement group directories from one of the full OSDs'
filesystem, which allowed me to start it up again. Soon, however, this drive
became full again.

So now 2 of the 12 OSDs are down, the cluster is in WARN and I have no
drives to add.

Is there a way to get out of this situation without adding OSDs? I will
attempt to release some space, I'm just waiting for a colleague to identify RBD
volumes (OpenStack images and volumes) which can be deleted.
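
To identify the candidates we plan to simply list the images per pool and
match them against OpenStack, roughly like this (pool names as in the
"ceph df" output below):

rbd ls -l -p glance   # image names and sizes in the glance pool
rbd ls -l -p cinder   # volume names and sizes in the cinder pool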

Thank you.

Lukas


This is my cluster state now:

[root@compute1 ~]# ceph -w
cluster d35174e9-4d17-4b5e-80f2-02440e0980d5
 health HEALTH_WARN
10 pgs backfill_toofull
114 pgs degraded
114 pgs stuck degraded
147 pgs stuck unclean
114 pgs stuck undersized
114 pgs undersized
1 requests are blocked > 32 sec
recovery 56923/640724 objects degraded (8.884%)
recovery 29122/640724 objects misplaced (4.545%)
3 near full osd(s)
 monmap e3: 3 mons at {compute1=
10.255.242.14:6789/0,compute2=10.255.242.15:6789/0,compute3=10.255.242.16:6789/0
}
election epoch 128, quorum 0,1,2 compute1,compute2,compute3
 osdmap e1073: 12 osds: 10 up, 10 in; 39 remapped pgs
  pgmap v21609066: 640 pgs, 2 pools, 2390 GB data, 309 kobjects
4365 GB used, 890 GB / 5256 GB avail
56923/640724 objects degraded (8.884%)
29122/640724 objects misplaced (4.545%)
 493 active+clean
 108 active+undersized+degraded
  29 active+remapped
   6 active+undersized+degraded+remapped+backfill_toofull
   4 active+remapped+backfill_toofull

[root@ceph1 ~]# df|grep osd
/dev/sdg1   580496384 500066812  80429572  87%
/var/lib/ceph/osd/ceph-3
/dev/sdf1   580496384 502131428  78364956  87%
/var/lib/ceph/osd/ceph-2
/dev/sde1   580496384 506927100  73569284  88%
/var/lib/ceph/osd/ceph-0
/dev/sdb1   287550208 287550188        20 100%
/var/lib/ceph/osd/ceph-5
/dev/sdd1   580496384 580496364        20 100%
/var/lib/ceph/osd/ceph-4
/dev/sdc1   580496384 478675672 101820712  83%
/var/lib/ceph/osd/ceph-1

[root@ceph2 ~]# df|grep osd
/dev/sdf1   580496384 448689872 131806512  78%
/var/lib/ceph/osd/ceph-7
/dev/sdb1   287550208 227054336  60495872  79%
/var/lib/ceph/osd/ceph-11
/dev/sdd1   580496384 464175196 116321188  80%
/var/lib/ceph/osd/ceph-10
/dev/sdc1   580496384 489451300  91045084  85%
/var/lib/ceph/osd/ceph-6
/dev/sdg1   580496384 470559020 109937364  82%
/var/lib/ceph/osd/ceph-9
/dev/sde1   580496384 490289388  90206996  85%
/var/lib/ceph/osd/ceph-8

[root@ceph2 ~]# ceph df
GLOBAL:
SIZE  AVAIL RAW USED %RAW USED
5256G  890G4365G 83.06
POOLS:
NAME   ID USED  %USED MAX AVAIL OBJECTS
glance 6  1714G 32.61  385G  219579
cinder 7   676G 12.86  385G   97488

[root@ceph2 ~]# ceph osd pool get glance pg_num
pg_num: 512
[root@ceph2 ~]# ceph osd pool get cinder pg_num
pg_num: 128


Re: [ceph-users] How to recover from OSDs full in small cluster

2016-02-17 Thread Lukáš Kubín
Ahoj Jan, thanks for the quick hint!

Those 2 OSDs are currently full and down. How should I handle that? Is it
OK if I delete some pg directories again and start the OSD daemons on
both drives in parallel, and then set the weights as recommended?
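
By that I mean something like the following - osd.4 and osd.5 are the full
ones here, and 0.95 is the value you suggested (just my understanding of your
advice):

ceph osd reweight 4 0.95
ceph osd reweight 5 0.95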

What effect should I expect then - will the cluster attempt to move some
pgs out of these drives to different local OSDs? I'm asking because when
I attempted to delete pg dirs and restart the OSD for the first time, the
OSD got full again very fast.

Thank you.

Lukas



On Wed, Feb 17, 2016 at 9:48 PM Jan Schermer  wrote:

> Ahoj ;-)
>
> You can reweight them temporarily, that shifts the data from the full
> drives.
>
> ceph osd reweight osd.XX YY
> (XX = the number of the full OSD, YY is the "weight", which defaults to 1)
>
> This is different from "crush reweight" which defaults to drive size in TB.
>
> Beware that reweighting will (afaik) only shuffle the data to other local
> drives, so you should reweight both of the full drives at the same time and
> only by a little bit at a time (0.95 is a good starting point).
>
> Jan
>
>
>
> On 17 Feb 2016, at 21:43, Lukáš Kubín  wrote:
>
> Hi,
> I'm running a very small setup of 2 nodes with 6 OSDs each. There are 2
> pools, each of size=2. Today, one of our OSDs got full, another 2 near
> full. Cluster turned into ERR state. I have noticed uneven space
> distribution among OSD drives between 70 and 100 percent. I have realized
> there's a low amount of pgs in those 2 pools (128 each) and increased one
> of them to 512, expecting a magic to happen and redistribute the space
> evenly.
>
> Well, something happened - another OSD became full during the
> redistribution and cluster stopped both OSDs and marked them down. After
> some hours the remaining drives partially rebalanced and cluster got to
> WARN state.
>
> I've deleted 3 placement group directories from one of the full OSD's
> filesystem which allowed me to start it up again. Soon, however this drive
> became full again.
>
> So now, there are 2 of 12 OSDs down, cluster is in WARN and I have no
> drives to add.
>
> Is there a way to get out of this situation without adding OSDs? I
> will attempt to release some space, just waiting for colleague to identify
> RBD volumes (openstack images and volumes) which can be deleted.
>
> Thank you.
>
> Lukas
>
>
> This is my cluster state now:
>
> [root@compute1 ~]# ceph -w
> cluster d35174e9-4d17-4b5e-80f2-02440e0980d5
>  health HEALTH_WARN
> 10 pgs backfill_toofull
> 114 pgs degraded
> 114 pgs stuck degraded
> 147 pgs stuck unclean
> 114 pgs stuck undersized
> 114 pgs undersized
> 1 requests are blocked > 32 sec
> recovery 56923/640724 objects degraded (8.884%)
> recovery 29122/640724 objects misplaced (4.545%)
> 3 near full osd(s)
>  monmap e3: 3 mons at {compute1=
> 10.255.242.14:6789/0,compute2=10.255.242.15:6789/0,compute3=10.255.242.16:6789/0
> }
> election epoch 128, quorum 0,1,2 compute1,compute2,compute3
>  osdmap e1073: 12 osds: 10 up, 10 in; 39 remapped pgs
>   pgmap v21609066: 640 pgs, 2 pools, 2390 GB data, 309 kobjects
> 4365 GB used, 890 GB / 5256 GB avail
> 56923/640724 objects degraded (8.884%)
> 29122/640724 objects misplaced (4.545%)
>  493 active+clean
>  108 active+undersized+degraded
>   29 active+remapped
>6 active+undersized+degraded+remapped+backfill_toofull
>4 active+remapped+backfill_toofull
>
> [root@ceph1 ~]# df|grep osd
> /dev/sdg1   580496384 500066812  80429572  87%
> /var/lib/ceph/osd/ceph-3
> /dev/sdf1   580496384 502131428  78364956  87%
> /var/lib/ceph/osd/ceph-2
> /dev/sde1   580496384 506927100  73569284  88%
> /var/lib/ceph/osd/ceph-0
> /dev/sdb1   287550208 287550188        20 100%
> /var/lib/ceph/osd/ceph-5
> /dev/sdd1   580496384 580496364        20 100%
> /var/lib/ceph/osd/ceph-4
> /dev/sdc1   580496384 478675672 101820712  83%
> /var/lib/ceph/osd/ceph-1
>
> [root@ceph2 ~]# df|grep osd
> /dev/sdf1   580496384 448689872 131806512  78%
> /var/lib/ceph/osd/ceph-7
> /dev/sdb1   287550208 227054336  60495872  79%
> /var/lib/ceph/osd/ceph-11
> /dev/sdd1   580496384 464175196 116321188  80%
> /var/lib/ceph/osd/ceph-10
> /dev/sdc1   580496384 489451300  91045084  85%
> /var/lib/ceph/osd/ceph-6
> /dev/sdg1   580496384 470559020 109937364  82%

Re: [ceph-users] How to recover from OSDs full in small cluster

2016-02-17 Thread Lukáš Kubín
You're right, the "full" osd was still up and in until I increased the pg
values of one of the pools. The redistribution has not completed yet and
perhaps that's what is still filling the drive. With this info - do you
think I'm still safe to follow the steps suggested in previous post?

Thanks!

Lukas

On Wed, Feb 17, 2016 at 10:29 PM Jan Schermer  wrote:

> Something must be on those 2 OSDs that ate all that space - Ceph by
> default doesn't allow an OSD to get completely full (filesystem-wise), and from
> what you've shown those filesystems are really, really full.
> OSDs don't usually go down when "full" (95%) .. or do they? I don't think
> so... so the reason they stopped is likely a completely full filesystem.
> You have to move something out of the way, restart those OSDs with a lower
> reweight, and hopefully everything will be good.
>
> Jan
>
>
> On 17 Feb 2016, at 22:22, Lukáš Kubín  wrote:
>
> Ahoj Jan, thanks for the quick hint!
>
> Those 2 OSDs are currently full and down. How should I handle that? Is it
> ok that I delete some pg directories again and start the OSD daemons, on
> both drives in parallel. Then set the weights as recommended ?
>
> What effect should I expect then - will the cluster attempt to move some
> pgs out of these drives to different local OSDs? I'm asking because when
> I've attempted to delete pg dirs and restart OSD for the first time, the
> OSD got full again very fast.
>
> Thank you.
>
> Lukas
>
>
>
> On Wed, Feb 17, 2016 at 9:48 PM Jan Schermer  wrote:
>
>> Ahoj ;-)
>>
>> You can reweight them temporarily, that shifts the data from the full
>> drives.
>>
>> ceph osd reweight osd.XX YY
>> (XX = the number of the full OSD, YY is the "weight", which defaults to 1)
>>
>> This is different from "crush reweight" which defaults to drive size in
>> TB.
>>
>> Beware that reweighting will (afaik) only shuffle the data to other local
>> drives, so you should reweight both of the full drives at the same time and
>> only by a little bit at a time (0.95 is a good starting point).
>>
>> Jan
>>
>>
>>
>> On 17 Feb 2016, at 21:43, Lukáš Kubín  wrote:
>>
>> Hi,
>> I'm running a very small setup of 2 nodes with 6 OSDs each. There are 2
>> pools, each of size=2. Today, one of our OSDs got full, another 2 near
>> full. Cluster turned into ERR state. I have noticed uneven space
>> distribution among OSD drives between 70 and 100 percent. I have realized
>> there's a low amount of pgs in those 2 pools (128 each) and increased one
>> of them to 512, expecting a magic to happen and redistribute the space
>> evenly.
>>
>> Well, something happened - another OSD became full during the
>> redistribution and cluster stopped both OSDs and marked them down. After
>> some hours the remaining drives partially rebalanced and cluster got to
>> WARN state.
>>
>> I've deleted 3 placement group directories from one of the full OSD's
>> filesystem which allowed me to start it up again. Soon, however this drive
>> became full again.
>>
>> So now, there are 2 of 12 OSDs down, cluster is in WARN and I have no
>> drives to add.
>>
>> Is there a way to get out of this situation without adding OSDs? I
>> will attempt to release some space, just waiting for colleague to identify
>> RBD volumes (openstack images and volumes) which can be deleted.
>>
>> Thank you.
>>
>> Lukas
>>
>>
>> This is my cluster state now:
>>
>> [root@compute1 ~]# ceph -w
>> cluster d35174e9-4d17-4b5e-80f2-02440e0980d5
>>  health HEALTH_WARN
>> 10 pgs backfill_toofull
>> 114 pgs degraded
>> 114 pgs stuck degraded
>> 147 pgs stuck unclean
>> 114 pgs stuck undersized
>> 114 pgs undersized
>> 1 requests are blocked > 32 sec
>> recovery 56923/640724 objects degraded (8.884%)
>> recovery 29122/640724 objects misplaced (4.545%)
>> 3 near full osd(s)
>>  monmap e3: 3 mons at {compute1=
>> 10.255.242.14:6789/0,compute2=10.255.242.15:6789/0,compute3=10.255.242.16:6789/0
>> }
>> election epoch 128, quorum 0,1,2 compute1,compute2,compute3
>>  osdmap e1073: 12 osds: 10 up, 10 in; 39 remapped pgs
>>   pgmap v21609066: 640 pgs, 2 pools, 2390 GB data, 309 kobjects
>> 4365 GB used, 890 GB / 5256 GB avail
>> 56923/640724 objects degraded (8.884%)
>>   

Re: [ceph-users] How to recover from OSDs full in small cluster

2016-02-18 Thread Lukáš Kubín
Hi,
we've managed to release some space from our cluster. Now I would like to
restart those 2 full OSDs. As they're completely full, I probably need to
delete some data from them first.

I would like to ask: Is it OK to delete all pg directories (e.g. all
subdirectories in /var/lib/ceph/osd/ceph-5/current/) and then start the stopped
OSD daemon? This process seems the simplest, I'm just not sure if it is
correct - whether Ceph can handle such a situation. (I've noticed similar
advice here:
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
 )

Another option, as suggested by Jan, is to remove the OSDs from the cluster and
recreate them. That presents more steps though, and perhaps some additional
safety prerequisites (nobackfill?) to prevent more data movement/full disks
while removing/re-adding.
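
For clarity, by the safety prerequisites I mean something like the following
around the remove/re-add (just my assumption of what would be needed):

ceph osd set nobackfill
ceph osd set norecover
# ... remove and recreate the OSDs here ...
ceph osd unset norecover
ceph osd unset nobackfill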

Thanks!

Lukas


Current status:

[root@ceph1 ~]# ceph osd stat
 osdmap e1107: 12 osds: 10 up, 10 in; 29 remapped pgs
[root@ceph1 ~]# ceph pg stat
v21691144: 640 pgs: 503 active+clean, 29 active+remapped, 108
active+undersized+degraded; 1892 GB data, 3476 GB used, 1780 GB / 5256 GB
avail; 0 B/s rd, 323 kB/s wr, 49 op/s; 42998/504482 objects degraded
(8.523%); 10304/504482 objects misplaced (2.042%)
[root@ceph1 ~]# df -h|grep osd
/dev/sdg1    554G  383G  172G  70% /var/lib/ceph/osd/ceph-3
/dev/sdf1    554G  401G  154G  73% /var/lib/ceph/osd/ceph-2
/dev/sde1    554G  381G  174G  69% /var/lib/ceph/osd/ceph-0
/dev/sdb1    275G  275G   20K 100% /var/lib/ceph/osd/ceph-5
/dev/sdd1    554G  554G   20K 100% /var/lib/ceph/osd/ceph-4
/dev/sdc1    554G  359G  196G  65% /var/lib/ceph/osd/ceph-1
[root@ceph1 ~]# ceph osd tree
ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 5.93991 root default
-2 2.96996 host ceph1
 0 0.53999 osd.0   up  1.0  1.0
 1 0.53999 osd.1   up  1.0  1.0
 2 0.53999 osd.2   up  1.0  1.0
 3 0.53999 osd.3   up  1.0  1.0
 4 0.53999 osd.4 down0  1.0
 5 0.26999 osd.5 down0  1.0
-3 2.96996 host ceph2
 6 0.53999 osd.6   up  1.0  1.0
 7 0.53999 osd.7   up  1.0  1.0
 8 0.53999 osd.8   up  1.0  1.0
 9 0.53999 osd.9   up  1.0  1.0
10 0.53999 osd.10  up  1.0  1.0
11 0.26999 osd.11  up  1.0  1.0


On Wed, Feb 17, 2016 at 9:43 PM Lukáš Kubín  wrote:

> Hi,
> I'm running a very small setup of 2 nodes with 6 OSDs each. There are 2
> pools, each of size=2. Today, one of our OSDs got full, another 2 near
> full. Cluster turned into ERR state. I have noticed uneven space
> distribution among OSD drives between 70 and 100 percent. I have realized
> there's a low amount of pgs in those 2 pools (128 each) and increased one
> of them to 512, expecting a magic to happen and redistribute the space
> evenly.
>
> Well, something happened - another OSD became full during the
> redistribution and cluster stopped both OSDs and marked them down. After
> some hours the remaining drives partially rebalanced and cluster got to
> WARN state.
>
> I've deleted 3 placement group directories from one of the full OSD's
> filesystem which allowed me to start it up again. Soon, however this drive
> became full again.
>
> So now, there are 2 of 12 OSDs down, cluster is in WARN and I have no
> drives to add.
>
> Is there a way to get out of this situation without adding OSDs? I
> will attempt to release some space, just waiting for colleague to identify
> RBD volumes (openstack images and volumes) which can be deleted.
>
> Thank you.
>
> Lukas
>
>
> This is my cluster state now:
>
> [root@compute1 ~]# ceph -w
> cluster d35174e9-4d17-4b5e-80f2-02440e0980d5
>  health HEALTH_WARN
> 10 pgs backfill_toofull
> 114 pgs degraded
> 114 pgs stuck degraded
> 147 pgs stuck unclean
> 114 pgs stuck undersized
> 114 pgs undersized
> 1 requests are blocked > 32 sec
> recovery 56923/640724 objects degraded (8.884%)
> recovery 29122/640724 objects misplaced (4.545%)
> 3 near full osd(s)
>  monmap e3: 3 mons at {compute1=
> 10.255.242.14:6789/0,compute2=10.255.242.15:6789/0,compute3=10.255.242.16:6789/0
> }
> election epoch 128, quorum 0,1,2 compute1,compute2,compute3
>  osdmap e1073: 12 osds: 10 up, 10 in; 39 remapped pgs
>   pgmap v21609066: 640 pgs, 2 pools, 2390 GB data, 309 kobjects
> 4365 GB used, 890 GB / 5256 GB avail
> 56923/640724 ob

Re: [ceph-users] How to recover from OSDs full in small cluster

2016-02-19 Thread Lukáš Kubín
Hi Jan, Bryan,
the cluster is healthy now again. Thanks for your support and tips!

I have identified all active+clean pgs which had their directories also
located on those 2 full disks and deleted those directories from
/var/lib/ceph/osd/ceph-N/current. That released a lot of space from those
disks and I could restart the OSDs one by one. I had to manipulate the weights
using "ceph osd reweight" when the utilization of those OSDs exceeded the other
drives' average again. That helped.
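
Roughly, the identification can be scripted like this (a sketch only, assuming
the filestore layout where each pg has a <pgid>_head directory; osd.4 is used
as the example):

ceph pg dump pgs_brief > /tmp/pgs_brief
ls /var/lib/ceph/osd/ceph-4/current | sed -n 's/_head$//p' | while read pg; do
    # print the pgid if it exists on this OSD and is active+clean cluster-wide
    awk -v p="$pg" '$1 == p && $2 == "active+clean" {print p}' /tmp/pgs_brief
done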

One strange thing - a few pgs still remained in a degraded or misplaced state
until I set the weights of all OSDs back to 1.

When all degraded pgs turned to the active+clean state, I still had a "1 requests
are blocked > 32 sec" error on one OSD (other than those which got full).
That was fixed by restarting that OSD's daemon. After that, I got the HEALTH_OK
state.

Thanks and greetings,

Lukas

On Thu, Feb 18, 2016 at 11:51 PM Stillwell, Bryan <
bryan.stillw...@twcable.com> wrote:

> When I've run into this situation I look for PGs that are on the full
> drives, but are in an active+clean state in the cluster.  That way I can
> safely remove the PGs from the full drives and not have to risk data loss.
>
> It usually doesn't take much before you can restart the OSDs and let ceph
> take care of the rest.
>
> Bryan
>
> From:  ceph-users  on behalf of Lukáš
> Kubín 
> Date:  Thursday, February 18, 2016 at 2:39 PM
> To:  "ceph-users@lists.ceph.com" 
> Subject:  Re: [ceph-users] How to recover from OSDs full in small cluster
>
>
> >Hi,
> >we've managed to release some space from our cluster. Now I would like to
> >restart those 2 full OSDs. As they're completely full I probably need to
> >delete some data from them.
> >
> >I would like to ask: Is it OK to delete all pg directories (eg. all
> >subdirectories in /var/lib/ceph/osd/ceph-5/current/) and start the
> >stopped OSD daemon then? This process seems most simple I'm just not sure
> >if it is correct - if ceph can handle such
> > situation. (I've noticed similar advice here:
> >
> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd
> >/ )
> >
> >Another option as suggested by Jan is to remove OSD from cluster, and
> >recreate them back. That presents more steps though and perhaps some more
> >safety prerequirements (nobackfill?) to prevent more block
> >movements/disks full while removing/readding.
> >
> >Thanks!
> >
> >Lukas
> >
> >
> >Current status:
> >
> >[root@ceph1 ~]# ceph osd stat
> > osdmap e1107: 12 osds: 10 up, 10 in; 29 remapped pgs
> >
> >[root@ceph1 ~]# ceph pg stat
> >
> >v21691144: 640 pgs: 503 active+clean, 29 active+remapped, 108
> >active+undersized+degraded; 1892 GB data, 3476 GB used, 1780 GB / 5256 GB
> >avail; 0 B/s rd, 323 kB/s wr, 49 op/s; 42998/504482 objects degraded
> >(8.523%); 10304/504482 objects misplaced (2.042%)
> >[root@ceph1 ~]# df -h|grep osd
> >/dev/sdg1    554G  383G  172G  70% /var/lib/ceph/osd/ceph-3
> >/dev/sdf1    554G  401G  154G  73% /var/lib/ceph/osd/ceph-2
> >/dev/sde1    554G  381G  174G  69% /var/lib/ceph/osd/ceph-0
> >/dev/sdb1    275G  275G   20K 100% /var/lib/ceph/osd/ceph-5
> >/dev/sdd1    554G  554G   20K 100% /var/lib/ceph/osd/ceph-4
> >/dev/sdc1    554G  359G  196G  65% /var/lib/ceph/osd/ceph-1
> >
> >[root@ceph1 ~]# ceph osd tree
> >ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
> >-1 5.93991 root default
> >-2 2.96996 host ceph1
> > 0 0.53999 osd.0   up  1.0  1.0
> > 1 0.53999 osd.1   up  1.0  1.0
> > 2 0.53999 osd.2   up  1.0  1.0
> > 3 0.53999 osd.3   up  1.0  1.0
> > 4 0.53999 osd.4 down0  1.0
> > 5 0.26999 osd.5 down0  1.0
> >-3 2.96996 host ceph2
> > 6 0.53999 osd.6   up  1.0  1.0
> > 7 0.53999 osd.7   up  1.0  1.0
> > 8 0.53999 osd.8   up  1.0  1.0
> > 9 0.53999 osd.9   up  1.0  1.0
> >10 0.53999 osd.10  up  1.0  1.0
> >11 0.26999 osd.11  up  1.0  1.0
> >
> >
> >
> >On Wed, Feb 17, 2016 at 9:43 PM Lukáš Kubín 
> wrote:
> >
> >
> >Hi,
> >I'm running a very small setup of 2 nodes with 6 OSDs each. There are 2
> >pools, each of size=2. Today, one of our OSDs g

[ceph-users] OSD process exhausting server memory

2014-10-29 Thread Lukáš Kubín
Hello,
I've found my Ceph v0.80.3 cluster in a state with 5 of 34 OSDs being down
through the night, after months of running without change. From the Linux logs I
found out the OSD processes were killed because they consumed all available
memory.

Those 5 failed OSDs were from different hosts of my 4-node cluster (see
below). Two hosts act as an SSD cache tier for some of my pools. The other two
hosts are the default rotational-drive storage.

After checking that Linux was not out of memory, I've attempted to restart
those failed OSDs. Most of those OSD daemons exhaust all memory in seconds
and get killed by Linux again:

Oct 28 22:16:34 q07 kernel: Out of memory: Kill process 24207 (ceph-osd)
score 867 or sacrifice child
Oct 28 22:16:34 q07 kernel: Killed process 24207, UID 0, (ceph-osd)
total-vm:59974412kB, anon-rss:59076880kB, file-rss:512kB


On the host I've found lots of similar "slow request" messages preceding
the crash:

2014-10-28 22:11:20.885527 7f25f84d1700  0 log [WRN] : slow request
31.117125 seconds old, received at 2014-10-28 22:10:49.768291:
osd_sub_op(client.168752.0:2197931 14.2c7
888596c7/rbd_data.293272f8695e4.006f/head//14 [] v 1551'377417
snapset=0=[]:[] snapc=0=[]) v10 currently no flag points reached
2014-10-28 22:11:21.885668 7f25f84d1700  0 log [WRN] : 67 slow requests, 1
included below; oldest blocked for > 9879.304770 secs


Apparently I can't get the cluster fixed by restarting the OSDs over and over
again. Is there any other option then?

Thank you.

Lukas Kubin



[root@q04 ~]# ceph -s
cluster ec433b4a-9dc0-4d08-bde4-f1657b1fdb99
 health HEALTH_ERR 9 pgs backfill; 1 pgs backfilling; 521 pgs degraded;
425 pgs incomplete; 13 pgs inconsistent; 20 pgs recovering; 50 pgs
recovery_wait; 151 pgs stale; 425 pgs stuck inactive; 151 pgs stuck stale;
1164 pgs stuck unclean; 12070270 requests are blocked > 32 sec; recovery
887322/35206223 objects degraded (2.520%); 119/17131232 unfound (0.001%);
13 scrub errors
 monmap e2: 3 mons at {q03=
10.255.253.33:6789/0,q04=10.255.253.34:6789/0,q05=10.255.253.35:6789/0},
election epoch 90, quorum 0,1,2 q03,q04,q05
 osdmap e2194: 34 osds: 31 up, 31 in
  pgmap v7429812: 5632 pgs, 7 pools, 1446 GB data, 16729 kobjects
2915 GB used, 12449 GB / 15365 GB avail
887322/35206223 objects degraded (2.520%); 119/17131232 unfound
(0.001%)
  38 active+recovery_wait+remapped
4455 active+clean
  65 stale+incomplete
   3 active+recovering+remapped
 359 incomplete
  12 active+recovery_wait
 139 active+remapped
  86 stale+active+degraded
  16 active+recovering
   1 active+remapped+backfilling
  13 active+clean+inconsistent
   9 active+remapped+wait_backfill
 434 active+degraded
   1 remapped+incomplete
   1 active+recovering+degraded+remapped
  client io 0 B/s rd, 469 kB/s wr, 48 op/s

[root@q04 ~]# ceph osd tree
# id    weight  type name       up/down reweight
-5      3.24    root ssd
-6      1.62            host q06
16      0.18                    osd.16  up      1
17      0.18                    osd.17  up      1
18      0.18                    osd.18  up      1
19      0.18                    osd.19  up      1
20      0.18                    osd.20  up      1
21      0.18                    osd.21  up      1
22      0.18                    osd.22  up      1
23      0.18                    osd.23  up      1
24      0.18                    osd.24  up      1
-7      1.62            host q07
25      0.18                    osd.25  up      1
26      0.18                    osd.26  up      1
27      0.18                    osd.27  up      1
28      0.18                    osd.28  up      1
29      0.18                    osd.29  up      1
30      0.18                    osd.30  up      1
31      0.18                    osd.31  up      1
32      0.18                    osd.32  up      1
33      0.18                    osd.33  up      1
-1      14.56   root default
-4      14.56   root sata
-2      7.28            host q08
0       0.91                    osd.0   up      1
1       0.91                    osd.1   up      1
2       0.91                    osd.2   up      1
3       0.91                    osd.3   up      1
11      0.91                    osd.11  up      1
12      0.91                    osd.12  up      1
13      0.91                    osd.13  down    0
14      0.91                    osd.14  up      1
-3      7.28            host q09
4       0.91                    osd.4   up      1
5       0.91                    osd.5   up      1
6       0.91                    osd.6   up      1
7       0.91                    osd.7   up      1
8       0.91

Re: [ceph-users] OSD process exhausting server memory

2014-10-29 Thread Lukáš Kubín
I've ended up at step "ceph osd unset noin". My OSDs are up, but not in,
even after an hour:

[root@q04 ceph-recovery]# ceph osd stat
 osdmap e2602: 34 osds: 34 up, 0 in
flags nobackfill,norecover,noscrub,nodeep-scrub


There seems to be no activity generated by the OSD processes; occasionally they
show 0.3%, which I believe is just some basic communication processing. No
load on the network interfaces.

Is there some other step needed to bring the OSDs in?

Thank you.

Lukas

On Wed, Oct 29, 2014 at 3:58 PM, Michael J. Kidd 
wrote:

> Hello Lukas,
>   Please try the following process for getting all your OSDs up and
> operational...
>
> * Set the following flags: noup, noin, noscrub, nodeep-scrub, norecover,
> nobackfill
> for i in noup noin noscrub nodeep-scrub norecover nobackfill; do ceph osd
> set $i; done
>
> * Stop all OSDs (I know, this seems counter productive)
> * Set all OSDs down / out
> for i in $(ceph osd tree | grep osd | awk '{print $3}'); do ceph osd down
> $i; ceph osd out $i; done
> * Set recovery / backfill throttles as well as heartbeat and OSD map
> processing tweaks in the /etc/ceph/ceph.conf file under the [osd] section:
> [osd]
> osd_max_backfills = 1
> osd_recovery_max_active = 1
> osd_recovery_max_single_start = 1
> osd_backfill_scan_min = 8
> osd_heartbeat_interval = 36
> osd_heartbeat_grace = 240
> osd_map_message_max = 1000
> osd_map_cache_size = 3136
>
> * Start all OSDs
> * Monitor 'top' for 0% CPU on all OSD processes.. it may take a while..  I
> usually issue 'top' then, the keys M c
>  - M = Sort by memory usage
>  - c = Show command arguments
>  - This allows to easily monitor the OSD process and know which OSDs have
> settled, etc..
> * Once all OSDs have hit 0% CPU utilization, remove the 'noup' flag
>  - ceph osd unset noup
> * Again, wait for 0% CPU utilization (may  be immediate, may take a
> while.. just gotta wait)
> * Once all OSDs have hit 0% CPU again, remove the 'noin' flag
>  - ceph osd unset noin
>  - All OSDs should now appear up/in, and will go through peering..
> * Once ceph -s shows no further activity, and OSDs are back at 0% CPU
> again, unset 'nobackfill'
>  - ceph osd unset nobackfill
> * Once ceph -s shows no further activity, and OSDs are back at 0% CPU
> again, unset 'norecover'
>  - ceph osd unset norecover
> * Monitor OSD memory usage... some OSDs may get killed off again, but
> their subsequent restart should consume less memory and allow more recovery
> to occur between each step above.. and ultimately, hopefully... your entire
> cluster will come back online and be usable.
>
> ## Clean-up:
> * Remove all of the above set options from ceph.conf
> * Reset the running OSDs to their defaults:
> ceph tell osd.\* injectargs '--osd_max_backfills 10
> --osd_recovery_max_active 15 --osd_recovery_max_single_start 5
> --osd_backfill_scan_min 64 --osd_heartbeat_interval 6 --osd_heartbeat_grace
> 36 --osd_map_message_max 100 --osd_map_cache_size 500'
> * Unset the noscrub and nodeep-scrub flags:
>  - ceph osd unset noscrub
>  - ceph osd unset nodeep-scrub
>
>
> ## For help identifying why memory usage was so high, please provide:
> * ceph osd dump | grep pool
> * ceph osd crush rule dump
>
> Let us know if this helps... I know it looks extreme, but it's worked for
> me in the past..
>
>
> Michael J. Kidd
> Sr. Storage Consultant
> Inktank Professional Services
>  - by Red Hat
>
> On Wed, Oct 29, 2014 at 8:51 AM, Lukáš Kubín 
> wrote:
>
>> Hello,
>> I've found my ceph v 0.80.3 cluster in a state with 5 of 34 OSDs being
>> down through night after months of running without change. From Linux logs
>> I found out the OSD processes were killed because they consumed all
>> available memory.
>>
>> Those 5 failed OSDs were from different hosts of my 4-node cluster (see
>> below). Two hosts act as SSD cache tier in some of my pools. The other two
>> hosts are the default rotational drives storage.
>>
>> After checking the Linux was not out of memory I've attempted to restart
>> those failed OSDs. Most of those OSD daemon exhaust all memory in seconds
>> and got killed by Linux again:
>>
>> Oct 28 22:16:34 q07 kernel: Out of memory: Kill process 24207 (ceph-osd)
>> score 867 or sacrifice child
>> Oct 28 22:16:34 q07 kernel: Killed process 24207, UID 0, (ceph-osd)
>> total-vm:59974412kB, anon-rss:59076880kB, file-rss:512kB
>>
>>
>> On the host I've found lots of similar "slow request" messages preceding
>> the crash:
>>
>> 2014-10-28 22:11:20.885527 7f25f84d170

Re: [ceph-users] OSD process exhausting server memory

2014-10-29 Thread Lukáš Kubín
ated size 2 min_size 1 crush_ruleset 2 object_hash
rjenkins pg_num 512 pgp_num 512 last_change 634 flags hashpspool
stripe_width 0
pool 7 'volumes' replicated size 2 min_size 2 crush_ruleset 0 object_hash
rjenkins pg_num 1024 pgp_num 1024 last_change 1517 flags hashpspool tiers
14 read_tier 14 write_tier 14 stripe_width 0
pool 8 'images' replicated size 2 min_size 2 crush_ruleset 0 object_hash
rjenkins pg_num 1024 pgp_num 1024 last_change 1519 flags hashpspool
stripe_width 0
pool 12 'backups' replicated size 2 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 1024 pgp_num 1024 last_change 862 flags hashpspool
stripe_width 0
pool 14 'volumes-cache' replicated size 2 min_size 1 crush_ruleset 1
object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 1517 flags
hashpspool tier_of 7 cache_mode writeback target_bytes 1
hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0}
3600s x1 stripe_width 0


On Wed, Oct 29, 2014 at 6:43 PM, Michael J. Kidd 
wrote:

> Ah, sorry... since they were set out manually, they'll need to be set in
> manually..
>
> for i in $(ceph osd tree | grep osd | awk '{print $3}'); do ceph osd in
> $i; done
>
>
>
> Michael J. Kidd
> Sr. Storage Consultant
> Inktank Professional Services
>  - by Red Hat
>
> On Wed, Oct 29, 2014 at 12:33 PM, Lukáš Kubín 
> wrote:
>
>> I've ended up at step "ceph osd unset noin". My OSDs are up, but not in,
>> even after an hour:
>>
>> [root@q04 ceph-recovery]# ceph osd stat
>>  osdmap e2602: 34 osds: 34 up, 0 in
>> flags nobackfill,norecover,noscrub,nodeep-scrub
>>
>>
>> There seems to be no activity generated by OSD processes, occasionally
>> they show 0,3% which I believe is just some basic communication processing.
>> No load in network interfaces.
>>
>> Is there some other step needed to bring the OSDs in?
>>
>> Thank you.
>>
>> Lukas
>>
>> On Wed, Oct 29, 2014 at 3:58 PM, Michael J. Kidd <
>> michael.k...@inktank.com> wrote:
>>
>>> Hello Lukas,
>>>   Please try the following process for getting all your OSDs up and
>>> operational...
>>>
>>> * Set the following flags: noup, noin, noscrub, nodeep-scrub, norecover,
>>> nobackfill
>>> for i in noup noin noscrub nodeep-scrub norecover nobackfill; do ceph
>>> osd set $i; done
>>>
>>> * Stop all OSDs (I know, this seems counter productive)
>>> * Set all OSDs down / out
>>> for i in $(ceph osd tree | grep osd | awk '{print $3}'); do ceph osd
>>> down $i; ceph osd out $i; done
>>> * Set recovery / backfill throttles as well as heartbeat and OSD map
>>> processing tweaks in the /etc/ceph/ceph.conf file under the [osd] section:
>>> [osd]
>>> osd_max_backfills = 1
>>> osd_recovery_max_active = 1
>>> osd_recovery_max_single_start = 1
>>> osd_backfill_scan_min = 8
>>> osd_heartbeat_interval = 36
>>> osd_heartbeat_grace = 240
>>> osd_map_message_max = 1000
>>> osd_map_cache_size = 3136
>>>
>>> * Start all OSDs
>>> * Monitor 'top' for 0% CPU on all OSD processes.. it may take a while..
>>> I usually issue 'top' then, the keys M c
>>>  - M = Sort by memory usage
>>>  - c = Show command arguments
>>>  - This allows to easily monitor the OSD process and know which OSDs
>>> have settled, etc..
>>> * Once all OSDs have hit 0% CPU utilization, remove the 'noup' flag
>>>  - ceph osd unset noup
>>> * Again, wait for 0% CPU utilization (may  be immediate, may take a
>>> while.. just gotta wait)
>>> * Once all OSDs have hit 0% CPU again, remove the 'noin' flag
>>>  - ceph osd unset noin
>>>  - All OSDs should now appear up/in, and will go through peering..
>>> * Once ceph -s shows no further activity, and OSDs are back at 0% CPU
>>> again, unset 'nobackfill'
>>>  - ceph osd unset nobackfill
>>> * Once ceph -s shows no further activity, and OSDs are back at 0% CPU
>>> again, unset 'norecover'
>>>  - ceph osd unset norecover
>>> * Monitor OSD memory usage... some OSDs may get killed off again, but
>>> their subsequent restart should consume less memory and allow more recovery
>>> to occur between each step above.. and ultimately, hopefully... your entire
>>> cluster will come back online and be usable.
>>>
>>> ## Clean-up:
>>> * Remove all of the above set options from ceph.conf
>>> * Reset the ru

Re: [ceph-users] OSD process exhausting server memory

2014-10-30 Thread Lukáš Kubín
Hi,
I've noticed that the following messages always accumulate in the OSD log before
it exhausts all memory:

2014-10-30 08:48:42.994190 7f80a2019700  0 log [WRN] : slow request
38.901192 seconds old, received at 2014-10-30 08:48:04.092889:
osd_op(osd.29.3076:207644827 rbd_data.2e4ee3ba663be.363b@17
[copy-get max 8388608] 7.af87e887
ack+read+ignore_cache+ignore_overlay+map_snap_clone e3359) v4 currently
reached pg


Note this is always from the most frequently failing osd.10 (sata tier)
referring to osd.29 (ssd cache tier). That osd.29 is consuming huge CPU and
memory resources, but keeps running without failures.

Can this be e.g. a bug? Or some erroneous I/O request which initiated this
behaviour? Can I e.g. attempt to upgrade Ceph to a more recent release
in the current unhealthy state of the cluster? Can I e.g. try disabling the
caching tier? Or just somehow evacuate the problematic OSD?

I'll welcome any ideas. Currently I'm keeping osd.10 in an automatic
restart loop with a 60-second pause before starting it again.
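
Concretely, the loop is nothing more than this (a sketch; it assumes the
sysvinit-style ceph service script on our CentOS nodes):

while true; do
    # start osd.10 again if its process is gone (OOM-killed)
    pgrep -f 'ceph-osd .*-i 10( |$)' > /dev/null || service ceph start osd.10
    sleep 60
done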

Thanks and greetings,

Lukas

On Wed, Oct 29, 2014 at 8:04 PM, Lukáš Kubín  wrote:

> I should have figured that out myself since I did that recently. Thanks.
>
> Unfortunately, I'm still at the step "ceph osd unset noin". After setting
> all the OSDs in, the original issue reapears preventing me to proceed with
> recovery. It now appears mostly at single OSD - osd.10 which consumes ~200%
> CPU and all memory within 45 seconds being killed by Linux then:
>
> Oct 29 18:24:38 q09 kernel: Out of memory: Kill process 17202 (ceph-osd)
> score 912 or sacrifice child
> Oct 29 18:24:38 q09 kernel: Killed process 17202, UID 0, (ceph-osd)
> total-vm:62713176kB, anon-rss:62009772kB, file-rss:328kB
>
>
> I've tried to restart it several times with same result. Similar situation
> with OSDs 0 and 13.
>
> Also, I've noticed one of SSD cache tier's OSD - osd.29 generating high
> CPU utilization around 180%.
>
> All the problematic OSDs have been the same ones all the time - OSDs
> 0, 8, 10, 13 and 29 - they are those which I found to be down this morning.
>
> There is some minor load coming from the clients - OpenStack instances; I
> preferred not to kill them:
>
> [root@q04 ceph-recovery]# ceph -s
> cluster ec433b4a-9dc0-4d08-bde4-f1657b1fdb99
>  health HEALTH_ERR 31 pgs backfill; 241 pgs degraded; 62 pgs down; 193
> pgs incomplete; 13 pgs inconsistent; 62 pgs peering; 12 pgs recovering; 205
> pgs recovery_wait; 93 pgs stuck inactive; 608 pgs stuck unclean; 381138
> requests are blocked > 32 sec; recovery 1162468/35207488 objects degraded
> (3.302%); 466/17112963 unfound (0.003%); 13 scrub errors; 1/34 in osds are
> down; nobackfill,norecover,noscrub,nodeep-scrub flag(s) set
>  monmap e2: 3 mons at {q03=
> 10.255.253.33:6789/0,q04=10.255.253.34:6789/0,q05=10.255.253.35:6789/0},
> election epoch 92, quorum 0,1,2 q03,q04,q05
>  osdmap e2782: 34 osds: 33 up, 34 in
> flags nobackfill,norecover,noscrub,nodeep-scrub
>   pgmap v7440374: 5632 pgs, 7 pools, 1449 GB data, 16711 kobjects
> 3148 GB used, 15010 GB / 18158 GB avail
> 1162468/35207488 objects degraded (3.302%); 466/17112963
> unfound (0.003%)
>   13 active
>   22 active+recovery_wait+remapped
>1 active+recovery_wait+inconsistent
> 4794 active+clean
>  193 incomplete
>   62 down+peering
>9 active+degraded+remapped+wait_backfill
>  182 active+recovery_wait
>   74 active+remapped
>   12 active+recovering
>   12 active+clean+inconsistent
>   22 active+remapped+wait_backfill
>4 active+clean+replay
>  232 active+degraded
>   client io 0 B/s rd, 1048 kB/s wr, 184 op/s
>
>
> Below I'm sending the requested output.
>
> Do you have any other ideas how to recover from this?
>
> Thanks a lot.
>
> Lukas
>
>
>
>
> [root@q04 ceph-recovery]# ceph osd crush rule dump
> [
> { "rule_id": 0,
>   "rule_name": "replicated_ruleset",
>   "ruleset": 0,
>   "type": 1,
>   "min_size": 1,
>   "max_size": 10,
>   "steps": [
> { "op": "take",
>   "item": -1,
>   "item_name": "default"},
> { "op": "chooseleaf_firstn",
>   "num": 0,
>   "type": "host"},
> { "op": 

Re: [ceph-users] OSD process exhausting server memory

2014-10-30 Thread Lukáš Kubín
Thanks Michael, still no luck.

Leaving the problematic osd.10 down has no effect. Within minutes, more
OSDs fail with the same issue after consuming ~50 GB of memory. Also, I can see
two of those cache-tier OSDs on separate hosts which remain at almost 200% CPU
utilization all the time.

I've performed an upgrade of the whole cluster to 0.80.7. It did not help.

I have also tried to unset the norecover+nobackfill flags to let the
recovery complete. No luck - several OSDs fail with the same issue,
preventing the recovery from completing. I've performed your fix steps from the
start again and I'm currently past the "unset noin" step.

I could get some of the pools to a state with no degraded objects temporarily.
Then (within minutes) some OSD fails and it's degraded again.

I have also tried to let the OSD processes get restarted automatically to
keep them up as much as possible.

I'm considering disabling the tiering pool "volumes-cache", as that's something
I can do without:

pool name   category KB  objects   clones
degraded
backups -  000
   0
data-  000
   0
images  -  777989590950270
8883
metadata-  000
   0
rbd -  000
   0
volumes -  11560869325965  179
3307
volumes-cache   -  649577103 16708730 9894
 1144650


Can I just switch it into forward mode and let it empty
(cache-flush-evict-all) to see if that changes anything?
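
In other words, something along these lines - just my reading of the cache
tiering docs, not a tested procedure:

ceph osd tier cache-mode volumes-cache forward
rados -p volumes-cache cache-flush-evict-all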

Could you or any of your colleagues suggest anything else to try?

Thank you,

Lukas


On Thu, Oct 30, 2014 at 3:05 PM, Michael J. Kidd 
wrote:

> Hello Lukas,
>   The 'slow request' logs are expected while the cluster is in such a
> state.. the OSD processes simply aren't able to respond quickly to client
> IO requests.
>
> I would recommend trying to recover without the most problematic disk (
> seems to be OSD.10? ).. Simply shut it down and see if the other OSDs
> settle down.  You should also take a look at the kernel logs for any
> indications of a problem with the disks themselves, or possibly do an FIO
> test against the drive with the OSD shut down (to a file on the OSD
> filesystem, not the raw drive.. this would be destructive).
>
> Also, you could upgrade to 0.80.7.  There are some bug fixes, but I'm not
> sure if any would specifically help this situation.. not likely to hurt
> though.
>
> The desired state is for the cluster to be steady-state before the next
> move (unsetting the next flag).  Hopefully this can be achieved without
> needing to take down OSDs in multiple hosts.
>
> I'm also unsure about the cache tiering and how it could relate to the
> load being seen.
>
> Hope this helps...
>
> Michael J. Kidd
> Sr. Storage Consultant
> Inktank Professional Services
>  - by Red Hat
>
> On Thu, Oct 30, 2014 at 4:00 AM, Lukáš Kubín 
> wrote:
>
>> Hi,
>> I've noticed the following messages always accumulate in OSD log before
>> it exhausts all memory:
>>
>> 2014-10-30 08:48:42.994190 7f80a2019700  0 log [WRN] : slow request
>> 38.901192 seconds old, received at 2014-10-30 08:48:04.092889:
>> osd_op(osd.29.3076:207644827 rbd_data.2e4ee3ba663be.363b@17
>> [copy-get max 8388608] 7.af87e887
>> ack+read+ignore_cache+ignore_overlay+map_snap_clone e3359) v4 currently
>> reached pg
>>
>>
>> Note this is always from the most frequently failing osd.10 (sata tier)
>> referring to osd.29 (ssd cache tier). That osd.29 is consuming huge CPU and
>> memory resources, but keeps running without failures.
>>
>> Can this be eg. a bug? Or some erroneous I/O request which initiated this
>> behaviour? Can I eg. attempt to upgrade the Ceph to a more recent release
>> in the current unhealthy status of the cluster? Can I eg. try disabling the
>> caching tier? Or just somehow evacuate the problematic OSD?
>>
>> I'll welcome any ideas. Currently, I'm keeping the osd.10 in an automatic
>> restart loop with 60 seconds pause before starting again.
>>
>> Thanks and greetings,
>>
>> Lukas
>>
>> On Wed, Oct 29, 2014 at 8:04 PM, Lukáš Kubín 
>> wrote:
>>
>>> I should have figured that out myself since I did that recently. Thanks.
>>>
>>> Unfortunately, I'm still at the step "ceph osd unset noin". After
>>> setting all the OSDs in, the original issue reapears preventing me to
>>> proceed with recovery. It now appe

[ceph-users] OSD process exhausting server memory

2014-10-30 Thread Lukáš Kubín
Never mind - you helped me a lot by showing this OSD startup procedure,
Michael. Big thanks!

I seem to have made some progress now by setting the cache-mode to forward.
The OSD processes on the SATA hosts immediately stopped failing. I'm now
waiting for the cache tier to flush. Then I'll try to enable recovery and
backfill again to see if the cluster recovers.

Best greetings,

Lukas

On Thu, Oct 30, 2014 at 6:33 PM, Michael J. Kidd 
wrote:

> Hello Lukas,
>   Unfortunately, I'm all out of ideas at the moment.  There are some
> memory profiling techniques which can help identify what is causing the
> memory utilization, but it's a bit beyond what I typically work on.  Others
> on the list may have experience with this (or otherwise have ideas) and may
> chip in...
>
> Wish I could be more help..
>
> Michael J. Kidd
> Sr. Storage Consultant
> Inktank Professional Services
>  - by Red Hat
>
> On Thu, Oct 30, 2014 at 11:00 AM, Lukáš Kubín 
> wrote:
>
>> Thanks Michael, still no luck.
>>
>> Letting the problematic OSD.10 down has no effect. Within minutes more of
>> OSDs fail on same issue after consuming ~50GB of memory. Also, I can see
>> two of those cache-tier OSDs on separate hosts which remain utilized almost
>> 200% CPU all the time
>>
>> I've performed upgrade of all cluster to 0.80.7. Did not help.
>>
>> I have also tried to unset norecovery+nobackfill flags to attempt a
>> recovery completion. No luck, several OSDs fail with the same issue
>> preventing the recovery to complete. I've performed your fix steps from the
>> start again and currently I'm behind the "unset noin" step.
>>
>> I could get some of pools to a state with no degraded objects
>> temporarily. Then (within minutes) some OSD fails and it's degraded again.
>>
>> I have also tried to let the OSD processes get restarted automatically to
>> keep them up as much as possible.
>>
>> I consider disabling the tiering pool "volumes-cache" as that's something
>> I can miss:
>>
>> pool name   category KB  objects   clones
>> degraded
>> backups -  000
>>  0
>> data-  000
>>  0
>> images  -  777989590950270
>>   8883
>> metadata-  000
>>  0
>> rbd -  000
>>  0
>> volumes -  11560869325965  179
>>   3307
>> volumes-cache   -  649577103 16708730 9894
>>1144650
>>
>>
>> Can I just switch it into the forward mode and let it empty
>> (cache-flush-evict-all) to see if that changes anything?
>>
>> Could you or any of your colleagues provide anything else to try?
>>
>> Thank you,
>>
>> Lukas
>>
>>
>> On Thu, Oct 30, 2014 at 3:05 PM, Michael J. Kidd <
>> michael.k...@inktank.com> wrote:
>>
>>> Hello Lukas,
>>>   The 'slow request' logs are expected while the cluster is in such a
>>> state.. the OSD processes simply aren't able to respond quickly to client
>>> IO requests.
>>>
>>> I would recommend trying to recover without the most problematic disk (
>>> seems to be OSD.10? ).. Simply shut it down and see if the other OSDs
>>> settle down.  You should also take a look at the kernel logs for any
>>> indications of a problem with the disks themselves, or possibly do an FIO
>>> test against the drive with the OSD shut down (to a file on the OSD
>>> filesystem, not the raw drive.. this would be destructive).
>>>
>>> Also, you could upgrade to 0.80.7.  There are some bug fixes, but I'm
>>> not sure if any would specifically help this situation.. not likely to hurt
>>> though.
>>>
>>> The desired state is for the cluster to be steady-state before the next
>>> move (unsetting the next flag).  Hopefully this can be achieved without
>>> needing to take down OSDs in multiple hosts.
>>>
>>> I'm also unsure about the cache tiering and how it could relate to the
>>> load being seen.
>>>
>>> Hope this helps...
>>>
>>> Michael J. Kidd
>>> Sr. Storage Consultant
>>> Inktank Professional Services
>>>  - by Red Hat
>>>
>>> On Thu

Re: [ceph-users] OSD process exhausting server memory

2014-10-30 Thread Lukáš Kubín
Fixed. My cluster is HEALTH_OK again now. It went quickly in the right
direction after I set the cache-mode to forward (from the original writeback)
and disabled the norecover and nobackfill flags.

Still, I'm waiting for 15 million objects to get flushed from the cache
tier.

It seems that the issue was somehow related to the caching tier. Does
anybody have an idea how to prevent this? Has anybody experienced a similar
issue with a writeback cache tier?

Big thanks to Michael J. Kidd for all his support!


Best greetings,

Lukas

On Thu, Oct 30, 2014 at 8:18 PM, Lukáš Kubín  wrote:

> Nevermind, you helped me a lot by showing this OSD startup procedure
> Michael. Big Thanks!
>
> I seem to have made some progress now by setting the cache-mode to
> forward. The OSD processes of SATA hosts stopped failing immediately. I'm
> now waiting for the cache tier to flush. Then I'll try to enable recover
> and backfill again to see if the cluster recovers.
>
> Best greetings,
>
> Lukas
>
> On Thu, Oct 30, 2014 at 6:33 PM, Michael J. Kidd  > wrote:
>
>> Hello Lukas,
>>   Unfortunately, I'm all out of ideas at the moment.  There are some
>> memory profiling techniques which can help identify what is causing the
>> memory utilization, but it's a bit beyond what I typically work on.  Others
>> on the list may have experience with this (or otherwise have ideas) and may
>> chip in...
>>
>> Wish I could be more help..
>>
>> Michael J. Kidd
>> Sr. Storage Consultant
>> Inktank Professional Services
>>  - by Red Hat
>>
>> On Thu, Oct 30, 2014 at 11:00 AM, Lukáš Kubín 
>> wrote:
>>
>>> Thanks Michael, still no luck.
>>>
>>> Letting the problematic OSD.10 down has no effect. Within minutes more
>>> of OSDs fail on same issue after consuming ~50GB of memory. Also, I can see
>>> two of those cache-tier OSDs on separate hosts which remain utilized almost
>>> 200% CPU all the time
>>>
>>> I've performed upgrade of all cluster to 0.80.7. Did not help.
>>>
>>> I have also tried to unset norecovery+nobackfill flags to attempt a
>>> recovery completion. No luck, several OSDs fail with the same issue
>>> preventing the recovery to complete. I've performed your fix steps from the
>>> start again and currently I'm behind the "unset noin" step.
>>>
>>> I could get some of pools to a state with no degraded objects
>>> temporarily. Then (within minutes) some OSD fails and it's degraded again.
>>>
>>> I have also tried to let the OSD processes get restarted automatically
>>> to keep them up as much as possible.
>>>
>>> I consider disabling the tiering pool "volumes-cache" as that's
>>> something I can miss:
>>>
>>> pool name   category KB  objects   clones
>>>   degraded
>>> backups -  000
>>>  0
>>> data-  000
>>>  0
>>> images  -  777989590950270
>>> 8883
>>> metadata-  000
>>>  0
>>> rbd -  000
>>>  0
>>> volumes -  11560869325965  179
>>>   3307
>>> volumes-cache   -  649577103 16708730 9894
>>>1144650
>>>
>>>
>>> Can I just switch it into the forward mode and let it empty
>>> (cache-flush-evict-all) to see if that changes anything?
>>>
>>> Could you or any of your colleagues provide anything else to try?
>>>
>>> Thank you,
>>>
>>> Lukas
>>>
>>>
>>> On Thu, Oct 30, 2014 at 3:05 PM, Michael J. Kidd <
>>> michael.k...@inktank.com> wrote:
>>>
>>>> Hello Lukas,
>>>>   The 'slow request' logs are expected while the cluster is in such a
>>>> state.. the OSD processes simply aren't able to respond quickly to client
>>>> IO requests.
>>>>
>>>> I would recommend trying to recover without the most problematic disk (
>>>> seems to be OSD.10? ).. Simply shut it down and see if the other OSDs
>>>> settle down.  You should also take a look at the kernel logs for any
>>>> indications of a problem with the disks themselves, or pos

[ceph-users] Tunables client support

2019-08-22 Thread Lukáš Kubín
Hello,
I am considering enabling the optimal crush tunables in our Jewel cluster (4
nodes, 52 OSDs, used as an OpenStack Cinder+Nova backend = RBD images). I've
got two questions:

1. Do I understand correctly that having the optimal tunables on can be
considered best practice and should be applied in most scenarios? Or is
there something I should be warned about?

2. There's a minimum kernel version requirement for KRBD clients. Does a
similar restriction apply to librbd (libvirt/qemu) clients too? Basically,
I just need to ensure we're not going to harm our clients (OpenStack
instances) after setting the tunables. We're running 4.4 Linux kernels on
the compute nodes, which is not supported by KRBD with the Jewel set of
tunables.
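
For clarity, the change I have in mind is just the following; I understand it
will trigger a data rebalance:

ceph osd crush show-tunables      # what the cluster currently reports
ceph osd crush tunables optimal   # the change being considered (moves data)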

Thanks and greetings,

Lukas


[ceph-users] Increase pg_num while backfilling

2019-08-22 Thread Lukáš Kubín
Hello,
yesterday I've added 4th OSD node (increase from 39 to 52 OSDs) into our
Jewel cluster. Backfilling of remapped pgs is still running and seems it
will run for another day until complete.

I know the pg_num of the largest pool is undersized and I should increase it
from 512 to 2048.

The question is - should I wait for the backfilling to complete, or can I
increase pg_num (and pgp_num) while backfilling is still running?
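
That is, the change would be (pool name left out here, <pool> is a placeholder
for our largest pool):

ceph osd pool set <pool> pg_num 2048
ceph osd pool set <pool> pgp_num 2048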

Thanks and greetings,

Lukas