[ceph-users] Re: Upgrade tips from Luminous to Nautilus?

2021-07-08 Thread Mark Schouten

Hi,

On 15-05-2021 at 22:17, Mark Schouten wrote:

Ok, so that helped for one of the MDSes. Trying to deactivate another
MDS, it started to release inos and dentries until it was almost done.
When it had about 50 left, a client started to complain and got
blacklisted until I restarted the MDS that was being deactivated.

So no joy yet: I'm still not down to a single active MDS. Any ideas on
how to achieve that are appreciated.


I've been able to deactivate the second MDS. I'm not sure why it worked
this time, but I had a lot of stray entries, which I cleaned up by
running a `find -ls` over the whole CephFS tree.
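
Roughly what that looked like, as a sketch (the MDS daemon name and
mount point are placeholders; the num_strays counter is what I watched
going down while the find was running):

ceph daemon mds.osdnode06 perf dump | jq .mds_cache.num_strays
find /mnt/cephfs -ls > /dev/null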


That was already a few weeks ago, but I decided to just try
deactivating the second MDS again, and this time it worked. So now I can
finally do the upgrade. :)


Thanks!

--
Mark Schouten
CTO, Tuxis B.V. | https://www.tuxis.nl/
 | +31 318 200208


[ceph-users] Re: Upgrade tips from Luminous to Nautilus?

2021-05-15 Thread Mark Schouten
On Fri, May 14, 2021 at 09:12:07PM +0200, Mark Schouten wrote:
> It seems (the documentation was no longer available, so it took some
> searching) that I needed to run ceph mds deactivate $fs:$rank for every
> MDS I wanted to deactivate.

Ok, so that helped for one of the MDSes. Trying to deactivate another
MDS, it started to release inos and dentries until it was almost done.
When it had about 50 left, a client started to complain and got
blacklisted until I restarted the MDS that was being deactivated.

So no joy yet: I'm still not down to a single active MDS. Any ideas on
how to achieve that are appreciated.

Thanks!

-- 
Mark Schouten | Tuxis B.V.
KvK: 74698818 | http://www.tuxis.nl/
T: +31 318 200208 | i...@tuxis.nl


[ceph-users] Re: Upgrade tips from Luminous to Nautilus?

2021-05-14 Thread Mark Schouten
On Mon, May 10, 2021 at 10:46:45PM +0200, Mark Schouten wrote:
> I still have three active ranks. Do I simply restart two of the MDSes
> and force max_mds down to one daemon, or is there a nicer way to move two
> MDSes from active to standby?

It seems (the documentation was no longer available, so it took some
searching) that I needed to run ceph mds deactivate $fs:$rank for every
MDS I wanted to deactivate.
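
Roughly, the sequence looked like this (a sketch; ranks 2 and 1 as
examples, and each rank should fully stop before you deactivate the
next one):

ceph fs set dadup_pmrb max_mds 1
ceph mds deactivate dadup_pmrb:2
ceph fs status dadup_pmrb          # wait until rank 2 has stopped
ceph mds deactivate dadup_pmrb:1
ceph fs status dadup_pmrb          # wait until only rank 0 remains active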

That helped!

-- 
Mark Schouten | Tuxis B.V.
KvK: 74698818 | http://www.tuxis.nl/
T: +31 318 200208 | i...@tuxis.nl


[ceph-users] Re: Upgrade tips from Luminous to Nautilus?

2021-05-11 Thread Mark Schouten
On Tue, May 11, 2021 at 09:13:51AM +, Eugen Block wrote:
> You can check whether the remaining active daemons have pinned subtrees:
> 
> ceph daemon mds.daemon-a get subtrees | jq '.[] | [.dir.path, .auth_first]'

This gives me a whole lot of lines of output. However, none of the
directories listed are ones anyone has ever actively pinned...
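
To cut that list down, this rough sketch (the same command, just
filtered further with jq) shows only the subtrees that are
authoritative on a nonzero rank:

ceph daemon mds.daemon-a get subtrees | jq '.[] | select(.auth_first != 0) | [.dir.path, .auth_first]'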


-- 
Mark Schouten | Tuxis B.V.
KvK: 74698818 | http://www.tuxis.nl/
T: +31 318 200208 | i...@tuxis.nl


[ceph-users] Re: Upgrade tips from Luminous to Nautilus?

2021-05-11 Thread Eugen Block

You can check whether the remaining active daemons have pinned subtrees:

ceph daemon mds.daemon-a get subtrees | jq '.[] | [.dir.path, .auth_first]'
[
  "/dir1/subdir1",
  6
]
[
  "",
  0
]
[
  "~mds6",
  6
]

If there's no pinning enabled it should probably look like this:

[
  "",
  0
]
[
  "~mds0",
  0
]
[
  "/dir2",
  0
]


If you mount the CephFS root directory, you can check the
subdirectories with getfattr:


host:~ # getfattr -n ceph.dir.pin /mnt/dir1/subdir1
getfattr: Removing leading '/' from absolute path names
# file: mnt/dir1/subdir1
ceph.dir.pin="1"


Does that help?

Quoting Mark Schouten:


On Tue, May 11, 2021 at 08:47:26AM +, Eugen Block wrote:

I don't have a Luminous cluster at hand right now, but setting max_mds to 1
should already take care of that and stop the extra MDS daemons. Do you have
pinning enabled (subdirectories pinned to a specific MDS)?


Not on this cluster, AFAIK. How can I check that?

--
Mark Schouten | Tuxis B.V.
KvK: 74698818 | http://www.tuxis.nl/
T: +31 318 200208 | i...@tuxis.nl





[ceph-users] Re: Upgrade tips from Luminous to Nautilus?

2021-05-11 Thread Mark Schouten
On Tue, May 11, 2021 at 08:47:26AM +, Eugen Block wrote:
> I don't have a Luminous cluster at hand right now, but setting max_mds to 1
> should already take care of that and stop the extra MDS daemons. Do you have
> pinning enabled (subdirectories pinned to a specific MDS)?

Not on this cluster, AFAIK. How can I check that?

-- 
Mark Schouten | Tuxis B.V.
KvK: 74698818 | http://www.tuxis.nl/
T: +31 318 200208 | i...@tuxis.nl


[ceph-users] Re: Upgrade tips from Luminous to Nautilus?

2021-05-11 Thread Eugen Block
I don't have a Luminous cluster at hand right now, but setting max_mds
to 1 should already take care of that and stop the extra MDS daemons. Do
you have pinning enabled (subdirectories pinned to a specific MDS)?



Quoting Mark Schouten:


On Thu, Apr 29, 2021 at 10:58:15AM +0200, Mark Schouten wrote:

We've done our fair share of Ceph cluster upgrades since Hammer, and
have not seen many problems with them. I'm now at the point where I have
to upgrade a rather large cluster running Luminous, and I would like to
hear from other users whether they have experienced issues I should
expect, so that I can anticipate them beforehand.



Thanks for the replies!

Just one question, though. Step one for me was to lower max_mds to one.
The documentation seems to suggest that the cluster automagically moves the
extra MDSes to a standby state. However, nothing really happens.

root@osdnode01:~# ceph fs get dadup_pmrb | grep max_mds
max_mds 1

I still have three active ranks. Do I simply restart two of the MDSes
and force max_mds down to one daemon, or is there a nicer way to move two
MDSes from active to standby?

Thanks again!

--
Mark Schouten | Tuxis B.V.
KvK: 74698818 | http://www.tuxis.nl/
T: +31 318 200208 | i...@tuxis.nl





[ceph-users] Re: Upgrade tips from Luminous to Nautilus?

2021-05-10 Thread Mark Schouten
On Thu, Apr 29, 2021 at 10:58:15AM +0200, Mark Schouten wrote:
> We've done our fair share of Ceph cluster upgrades since Hammer, and
> have not seen many problems with them. I'm now at the point where I have
> to upgrade a rather large cluster running Luminous, and I would like to
> hear from other users whether they have experienced issues I should
> expect, so that I can anticipate them beforehand.


Thanks for the replies! 

Just one question, though. Step one for me was to lower max_mds to one.
The documentation seems to suggest that the cluster automagically moves the
extra MDSes to a standby state. However, nothing really happens.

root@osdnode01:~# ceph fs get dadup_pmrb | grep max_mds
max_mds 1
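
For what it's worth, the active ranks themselves are easier to see with
(a sketch, using the fs name above):

ceph fs status dadup_pmrb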

I still have three active ranks. Do I simply restart two of the MDSes
and force max_mds down to one daemon, or is there a nicer way to move two
MDSes from active to standby?

Thanks again!

-- 
Mark Schouten | Tuxis B.V.
KvK: 74698818 | http://www.tuxis.nl/
T: +31 318 200208 | i...@tuxis.nl


[ceph-users] Re: Upgrade tips from Luminous to Nautilus?

2021-04-29 Thread Alex Gorbachev
Mark,

My main note is to make sure NOT to enable msgr2 until all OSDs are
upgraded to Nautilus. I made that mistake early on in the lab, and had to
work hard to get the cluster back together. Otherwise, it was a pretty
smooth process.
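
In other words, save the finalization steps for the very end, once
everything reports Nautilus. Roughly (a sketch of the usual order, not
cluster-specific advice):

ceph versions                           # all mons, mgrs, osds, mds, rgw on 14.2.x
ceph osd require-osd-release nautilus   # lock out pre-Nautilus OSDs
ceph mon enable-msgr2                   # only now enable msgr2 (tcp/3300)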

--
Alex Gorbachev
ISS/Storcium



On Thu, Apr 29, 2021 at 4:58 AM Mark Schouten  wrote:

> Hi,
>
> We've done our fair share of Ceph cluster upgrades since Hammer, and
> have not seen many problems with them. I'm now at the point where I have
> to upgrade a rather large cluster running Luminous, and I would like to
> hear from other users whether they have experienced issues I should
> expect, so that I can anticipate them beforehand.
>
> As said, the cluster is running Luminous (12.2.13) and has the following
> services active:
>   services:
> mon: 3 daemons, quorum osdnode01,osdnode02,osdnode04
> mgr: osdnode01(active), standbys: osdnode02, osdnode03
> mds: pmrb-3/3/3 up
> {0=osdnode06=up:active,1=osdnode08=up:active,2=osdnode07=up:active}, 1
> up:standby
> osd: 116 osds: 116 up, 116 in;
> rgw: 3 daemons active
>
>
> Of the OSDs, 11 are SSDs and 105 are HDDs. The capacity of the cluster
> is 1.01 PiB.
>
> We have 2 active CRUSH rules on 18 pools. All pools have a size of 3, and
> there is a total of 5760 PGs.
> {
> "rule_id": 1,
> "rule_name": "hdd-data",
> "ruleset": 1,
> "type": 1,
> "min_size": 1,
> "max_size": 10,
> "steps": [
> {
> "op": "take",
> "item": -10,
> "item_name": "default~hdd"
> },
> {
> "op": "chooseleaf_firstn",
> "num": 0,
> "type": "host"
> },
> {
> "op": "emit"
> }
> ]
> },
> {
> "rule_id": 2,
> "rule_name": "ssd-data",
> "ruleset": 2,
> "type": 1,
> "min_size": 1,
> "max_size": 10,
> "steps": [
> {
> "op": "take",
> "item": -21,
> "item_name": "default~ssd"
> },
> {
> "op": "chooseleaf_firstn",
> "num": 0,
> "type": "host"
> },
> {
> "op": "emit"
> }
> ]
> }
>
> rbd -> crush_rule: hdd-data
> .rgw.root -> crush_rule: hdd-data
> default.rgw.control -> crush_rule: hdd-data
> default.rgw.data.root -> crush_rule: ssd-data
> default.rgw.gc -> crush_rule: ssd-data
> default.rgw.log -> crush_rule: ssd-data
> default.rgw.users.uid -> crush_rule: hdd-data
> default.rgw.usage -> crush_rule: ssd-data
> default.rgw.users.email -> crush_rule: hdd-data
> default.rgw.users.keys -> crush_rule: hdd-data
> default.rgw.meta -> crush_rule: hdd-data
> default.rgw.buckets.index -> crush_rule: ssd-data
> default.rgw.buckets.data -> crush_rule: hdd-data
> default.rgw.users.swift -> crush_rule: hdd-data
> default.rgw.buckets.non-ec -> crush_rule: ssd-data
> DB0475 -> crush_rule: hdd-data
> cephfs_pmrb_data -> crush_rule: hdd-data
> cephfs_pmrb_metadata -> crush_rule: ssd-data
>
>
> All but four clients are running Luminous; those four are running Jewel
> and need upgrading before proceeding with this upgrade.
>
> So, normally, I would 'just' upgrade all Ceph packages on the
> monitor-nodes and restart mons and then mgrs.
>
> After that, I would upgrade all Ceph packages on the OSD nodes and
> restart all the OSD's. Then, after that, the MDSes and RGWs. Restarting
> the OSD's will probably take a while.
>
> If anyone has a hint on what I should expect to cause some extra load or
> waiting time, that would be great.
>
> Obviously, we have read
> https://ceph.com/releases/v14-2-0-nautilus-released/ , but I'm looking
> for real world experiences.
>
> Thanks!
>
>
> --
> Mark Schouten | Tuxis B.V.
> KvK: 74698818 | http://www.tuxis.nl/
> T: +31 318 200208 | i...@tuxis.nl


[ceph-users] Re: Upgrade tips from Luminous to Nautilus?

2021-04-29 Thread Eugen Block

Hi,

I haven't had any issues upgrading from Luminous to Nautilus in
multiple clusters (mostly RBD usage, but also CephFS), including a
couple of different setups in my lab (RGW, iGW).
Just recently I upgraded a customer cluster with around 220 OSDs, and it
was pretty straightforward, without any hiccups.
The only thing sticking out is your multi-active MDS setup, which I
haven't had yet. But I don't see any reason why that should be an issue.


Regards,
Eugen


Quoting Mark Schouten:


Hi,

We've done our fair share of Ceph cluster upgrades since Hammer, and
have not seen many problems with them. I'm now at the point where I have
to upgrade a rather large cluster running Luminous, and I would like to
hear from other users whether they have experienced issues I should
expect, so that I can anticipate them beforehand.

As said, the cluster is running Luminous (12.2.13) and has the following
services active:
  services:
mon: 3 daemons, quorum osdnode01,osdnode02,osdnode04
mgr: osdnode01(active), standbys: osdnode02, osdnode03
mds: pmrb-3/3/3 up  
{0=osdnode06=up:active,1=osdnode08=up:active,2=osdnode07=up:active},  
1 up:standby

osd: 116 osds: 116 up, 116 in;
rgw: 3 daemons active


Of the OSDs, 11 are SSDs and 105 are HDDs. The capacity of the cluster
is 1.01 PiB.

We have 2 active CRUSH rules on 18 pools. All pools have a size of 3, and
there is a total of 5760 PGs.

{
"rule_id": 1,
"rule_name": "hdd-data",
"ruleset": 1,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -10,
"item_name": "default~hdd"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 2,
"rule_name": "ssd-data",
"ruleset": 2,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -21,
"item_name": "default~ssd"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}

rbd -> crush_rule: hdd-data
.rgw.root -> crush_rule: hdd-data
default.rgw.control -> crush_rule: hdd-data
default.rgw.data.root -> crush_rule: ssd-data
default.rgw.gc -> crush_rule: ssd-data
default.rgw.log -> crush_rule: ssd-data
default.rgw.users.uid -> crush_rule: hdd-data
default.rgw.usage -> crush_rule: ssd-data
default.rgw.users.email -> crush_rule: hdd-data
default.rgw.users.keys -> crush_rule: hdd-data
default.rgw.meta -> crush_rule: hdd-data
default.rgw.buckets.index -> crush_rule: ssd-data
default.rgw.buckets.data -> crush_rule: hdd-data
default.rgw.users.swift -> crush_rule: hdd-data
default.rgw.buckets.non-ec -> crush_rule: ssd-data
DB0475 -> crush_rule: hdd-data
cephfs_pmrb_data -> crush_rule: hdd-data
cephfs_pmrb_metadata -> crush_rule: ssd-data


All but four clients are running Luminous; those four are running Jewel
and need upgrading before proceeding with this upgrade.

So, normally, I would 'just' upgrade all Ceph packages on the
monitor-nodes and restart mons and then mgrs.

After that, I would upgrade all Ceph packages on the OSD nodes and
restart all the OSD's. Then, after that, the MDSes and RGWs. Restarting
the OSD's will probably take a while.

If anyone has a hint on what I should expect to cause some extra load or
waiting time, that would be great.

Obviously, we have read
https://ceph.com/releases/v14-2-0-nautilus-released/ , but I'm looking
for real world experiences.

Thanks!


--
Mark Schouten | Tuxis B.V.
KvK: 74698818 | http://www.tuxis.nl/
T: +31 318 200208 | i...@tuxis.nl





[ceph-users] Re: Upgrade tips from Luminous to Nautilus?

2021-04-29 Thread Nico Schottelius


I believe it was nautilus that started requiring

ms_bind_ipv4 = false
ms_bind_ipv6 = true

if you run IPv6-only clusters. OSDs prior to Nautilus worked without
these settings for us.

I'm not sure if the port change (v1->v2) was part of luminous->nautilus
as well, but you might want to check your firewalling (if any).
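
Concretely, msgr2 listens on tcp/3300 next to the legacy msgr1 port
tcp/6789, so a quick reachability check from another node could look
like this (a sketch; the mon hostname is a placeholder):

nc -zv mon1.example.com 6789   # msgr v1 (legacy)
nc -zv mon1.example.com 3300   # msgr v2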

Overall I recall Luminous -> Nautilus being a bit rockier than usual
(compared to the previous releases), but nothing too serious.

Cheers,

Nico

Mark Schouten  writes:

> Hi,
>
> We've done our fair share of Ceph cluster upgrades since Hammer, and
> have not seen many problems with them. I'm now at the point where I have
> to upgrade a rather large cluster running Luminous, and I would like to
> hear from other users whether they have experienced issues I should
> expect, so that I can anticipate them beforehand.
>
> As said, the cluster is running Luminous (12.2.13) and has the following
> services active:
>   services:
> mon: 3 daemons, quorum osdnode01,osdnode02,osdnode04
> mgr: osdnode01(active), standbys: osdnode02, osdnode03
> mds: pmrb-3/3/3 up 
> {0=osdnode06=up:active,1=osdnode08=up:active,2=osdnode07=up:active}, 1 
> up:standby
> osd: 116 osds: 116 up, 116 in;
> rgw: 3 daemons active
>
>
> Of the OSDs, 11 are SSDs and 105 are HDDs. The capacity of the cluster
> is 1.01 PiB.
>
> We have 2 active CRUSH rules on 18 pools. All pools have a size of 3, and
> there is a total of 5760 PGs.
> {
> "rule_id": 1,
> "rule_name": "hdd-data",
> "ruleset": 1,
> "type": 1,
> "min_size": 1,
> "max_size": 10,
> "steps": [
> {
> "op": "take",
> "item": -10,
> "item_name": "default~hdd"
> },
> {
> "op": "chooseleaf_firstn",
> "num": 0,
> "type": "host"
> },
> {
> "op": "emit"
> }
> ]
> },
> {
> "rule_id": 2,
> "rule_name": "ssd-data",
> "ruleset": 2,
> "type": 1,
> "min_size": 1,
> "max_size": 10,
> "steps": [
> {
> "op": "take",
> "item": -21,
> "item_name": "default~ssd"
> },
> {
> "op": "chooseleaf_firstn",
> "num": 0,
> "type": "host"
> },
> {
> "op": "emit"
> }
> ]
> }
>
> rbd -> crush_rule: hdd-data
> .rgw.root -> crush_rule: hdd-data
> default.rgw.control -> crush_rule: hdd-data
> default.rgw.data.root -> crush_rule: ssd-data
> default.rgw.gc -> crush_rule: ssd-data
> default.rgw.log -> crush_rule: ssd-data
> default.rgw.users.uid -> crush_rule: hdd-data
> default.rgw.usage -> crush_rule: ssd-data
> default.rgw.users.email -> crush_rule: hdd-data
> default.rgw.users.keys -> crush_rule: hdd-data
> default.rgw.meta -> crush_rule: hdd-data
> default.rgw.buckets.index -> crush_rule: ssd-data
> default.rgw.buckets.data -> crush_rule: hdd-data
> default.rgw.users.swift -> crush_rule: hdd-data
> default.rgw.buckets.non-ec -> crush_rule: ssd-data
> DB0475 -> crush_rule: hdd-data
> cephfs_pmrb_data -> crush_rule: hdd-data
> cephfs_pmrb_metadata -> crush_rule: ssd-data
>
>
> All but four clients are running Luminous; those four are running Jewel
> and need upgrading before proceeding with this upgrade.
>
> So, normally, I would 'just' upgrade all Ceph packages on the
> monitor-nodes and restart mons and then mgrs.
>
> After that, I would upgrade all Ceph packages on the OSD nodes and
> restart all the OSD's. Then, after that, the MDSes and RGWs. Restarting
> the OSD's will probably take a while.
>
> If anyone has a hint on what I should expect to cause some extra load or
> waiting time, that would be great.
>
> Obviously, we have read
> https://ceph.com/releases/v14-2-0-nautilus-released/ , but I'm looking
> for real world experiences.
>
> Thanks!


--
Sustainable and modern Infrastructures by ungleich.ch