[ceph-users] snapshot timestamp

2023-08-03 Thread Tony Liu
Hi,

We know a snapshot captures a point in time. Is this point in time tracked
internally by some sort of sequence number, by the timestamp shown by "snap ls",
or by something else?

I noticed that with "deep cp" the timestamps of all snapshots are changed to the
copy time. Say I create a snapshot at 1 PM and make a copy at 3 PM; the timestamp
of the snapshot in the copy is 3 PM. If I roll the copy back to this snapshot, I'd
assume it will actually bring me back to the state of 1 PM. Is that correct?

If the above is true, I won't be able to rely on the timestamp to track snapshots.

Say I create a snapshot every hour and make a backup by copying at the end of the
day. Then the original image is damaged and the backup is used to restore the work.
On this backup image, how do I know which snapshot was taken at 1 PM, which at
2 PM, etc.? Any advice on tracking snapshots properly in such a case?

I can definitely build something else to help with this, but I'd like to know how
much of it Ceph can support.
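For example, something along these lines, encoding the creation time in the
snapshot name itself so it survives a copy (a minimal sketch; pool and image
names are placeholders, and it assumes "deep cp" preserves snapshot names):

rbd snap create pool/image@hourly-$(date -u +%Y%m%dT%H%MZ)   # name carries the wall-clock time
rbd snap ls pool/image-copy                                  # after the copy, the names still encode the original times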


Thanks!
Tony
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] What's the max of snap ID?

2023-08-03 Thread Tony Liu
Hi,

There is a snap ID for each snapshot. How is this ID allocated, sequentially?
I did some tests, and it seems this ID is per pool, starting from 4 and always
going up. Is that correct?
What's the max of this ID?
What happens when the ID reaches the max? Does it wrap around and start from 4
again?
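For reference, the IDs I was looking at come from the SNAPID column of
"rbd snap ls", e.g. (pool and image names are just examples):

rbd snap create rbd/img1@s1
rbd snap ls rbd/img1    # SNAPID column shows the allocated ID
rbd snap create rbd/img2@s2
rbd snap ls rbd/img2    # in my tests the new ID continued counting up pool-wide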


Thanks!
Tony
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs snapshot mirror peer_bootstrap import hung

2023-08-03 Thread Adiga, Anantha
Attached log file

-Original Message-
From: Adiga, Anantha  
Sent: Thursday, August 3, 2023 5:50 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: cephfs snapshot mirror peer_bootstrap import hung


[ceph-users] Re: cephfs snapshot mirror peer_bootstrap import hung

2023-08-03 Thread Adiga, Anantha
Adding additional info:

Clusters A and B both have the same cluster name, ceph, and each has a single
filesystem with the same name, cephfs. Is that the issue?


I tried the peer_add command and it hangs as well:

root@fl31ca104ja0201:/# ls /etc/ceph/
cr_ceph.conf  client.mirror_remote.keying ceph.client.admin.keyring  ceph.conf

(remote cluster)
root@cr21meg16ba0101:/etc/ceph# ls /etc/ceph
ceph.client.admin.keyring  ceph.conf   ceph.mon.keyring
  

root@fl31ca104ja0201:/# ceph fs snapshot mirror peer_add cephfs 
client.mirror_remote@cr_ceph  cephfs 
v2:172.18.55.71:3300,v1:172.18.55.71:6789],[v2:172.18.55.72:3300,v1:172.18.55.72:6789],[v2:172.18.55.73:3300,v1:172.18.55.73:6789
 AQCfwMlkM90pLBAAwXtvpp8j04IvC8tqpAG9bA==




[ceph-users] Ceph Quincy and liburing.so.2 on Rocky Linux 9

2023-08-03 Thread dobrie2
I've been digging and I can't see that this has come up anywhere.

I'm trying to update a client from Quincy 17.2.3-2 to 17.2.6-4 and I'm getting 
the error

Error: 
 Problem: cannot install the best update candidate for package 
ceph-base-2:17.2.3-2.el9s.x86_64
  - nothing provides liburing.so.2()(64bit) needed by 
ceph-base-2:17.2.6-4.el9s.x86_64
  - nothing provides liburing.so.2(LIBURING_2.0)(64bit) needed by 
ceph-base-2:17.2.6-4.el9s.x86_64
(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use 
not only best candidate packages)

Did Ceph Quincy switch to requiring liburing 2? Rocky 9 only provides 0.7-7.
CentOS Stream seems to have 1.0.7-3 (at least back to when I set up that repo
on Foreman; I don't remember if I'm keeping it up-to-date).

Can I/should I just do --nobest when updating? I could probably build it from a 
source RPM from another RH-based distro, but I'd rather keep it clean with the 
same distro.
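A couple of read-only checks I can run before deciding (a sketch):

dnf repoquery --whatprovides 'liburing.so.2()(64bit)'    # does any enabled repo satisfy it?
dnf update --nobest --assumeno ceph-base                 # preview which version would be held back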
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs snapshot mirror peer_bootstrap import hung

2023-08-03 Thread Adiga, Anantha


I tried the peer_add command and it hangs as well:
root@fl31ca104ja0201:/# ceph fs snapshot mirror peer_add cephfs 
client.mirror_remote@cr_ceph  cephfs 
v2:172.18.55.71:3300,v1:172.18.55.71:6789],[v2:172.18.55.72:3300,v1:172.18.55.72:6789],[v2:172.18.55.73:3300,v1:172.18.55.73:6789
 AQCfwMlkM90pLBAAwXtvpp8j04IvC8tqpAG9bA==



-Original Message-
From: Adiga, Anantha  
Sent: Thursday, August 3, 2023 2:31 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: cephfs snapshot mirror peer_bootstrap import hung


[ceph-users] Re: unbalanced OSDs

2023-08-03 Thread Pavlo Astakhov

Take a look at https://github.com/TheJJ/ceph-balancer

We switched to it after a lot of attempts to make the internal balancer work
as expected, and now we have ~even OSD utilization across the cluster:


# ./placementoptimizer.py -v balance --ensure-optimal-moves 
--ensure-variance-decrease

[2023-08-03 23:33:27,954] gathering cluster state via ceph api...
[2023-08-03 23:33:36,081] running pg balancer
[2023-08-03 23:33:36,088] current OSD fill rate per crushclasses:
[2023-08-03 23:33:36,089]   ssd: average=49.86%, median=50.27%, 
without_placement_constraints=53.01%

[2023-08-03 23:33:36,090] cluster variance for crushclasses:
[2023-08-03 23:33:36,090]   ssd: 4.163
[2023-08-03 23:33:36,090] min osd.14 44.698%
[2023-08-03 23:33:36,090] max osd.22 51.897%
[2023-08-03 23:33:36,101] in descending full-order, couldn't empty 
osd.22, so we're done. if you want to try more often, set 
--max-full-move-attempts=$nr, this may unlock more balancing possibilities.
[2023-08-03 23:33:36,101] 


[2023-08-03 23:33:36,101] generated 0 remaps.
[2023-08-03 23:33:36,101] total movement size: 0.0B.
[2023-08-03 23:33:36,102] 


[2023-08-03 23:33:36,102] old cluster variance per crushclass:
[2023-08-03 23:33:36,102]   ssd: 4.163
[2023-08-03 23:33:36,102] old min osd.14 44.698%
[2023-08-03 23:33:36,102] old max osd.22 51.897%
[2023-08-03 23:33:36,102] 


[2023-08-03 23:33:36,103] new min osd.14 44.698%
[2023-08-03 23:33:36,103] new max osd.22 51.897%
[2023-08-03 23:33:36,103] new cluster variance:
[2023-08-03 23:33:36,103]   ssd: 4.163
[2023-08-03 23:33:36,103] 




On 03.08.2023 16:38, Spiros Papageorgiou wrote:



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs snapshot mirror peer_bootstrap import hung

2023-08-03 Thread Adiga, Anantha
Hi

Could you please provide guidance on how to diagnose this issue:

In this case, there are two Ceph clusters: cluster A with 4 nodes and cluster B
with 3 nodes, in different locations. Both are already running RGW multi-site;
A is the master.

CephFS snapshot mirroring is being configured on the clusters. Cluster A is the
primary and cluster B is the peer. The bootstrap import step on the primary node
hangs.

On the target cluster :
---
"version": "16.2.5",
"release": "pacific",
"release_type": "stable"

root@cr21meg16ba0101:/# ceph fs snapshot mirror peer_bootstrap create cephfs 
client.mirror_remote flex2-site
{"token": 
"eyJmc2lkIjogImE2ZjUyNTk4LWU1Y2QtNGEwOC04NDIyLTdiNmZkYjFkNWRiZSIsICJmaWxlc3lzdGVtIjogImNlcGhmcyIsICJ1c2VyIjogImNsaWVudC5taXJyb3JfcmVtb3RlIiwgInNpdGVfbmFtZSI6ICJmbGV4Mi1zaXRlIiwgImtleSI6ICJBUUNmd01sa005MHBMQkFBd1h0dnBwOGowNEl2Qzh0cXBBRzliQT09IiwgIm1vbl9ob3N0IjogIlt2MjoxNzIuMTguNTUuNzE6MzMwMC8wLHYxOjE3Mi4xOC41NS43MTo2Nzg5LzBdIFt2MjoxNzIuMTguNTUuNzM6MzMwMC8wLHYxOjE3Mi4xOC41NS43Mzo2Nzg5LzBdIn0="}
root@cr21meg16ba0101:/var/run/ceph#

On the source cluster:

"version": "17.2.6",
"release": "quincy",
"release_type": "stable"

root@fl31ca104ja0201:/# ceph -s
  cluster:
id: d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e
health: HEALTH_OK

  services:
mon:   3 daemons, quorum 
fl31ca104ja0202,fl31ca104ja0203,fl31ca104ja0201 (age 111m)
mgr:   fl31ca104ja0201.nwpqlh(active, since 11h), standbys: 
fl31ca104ja0203, fl31ca104ja0202
mds:   1/1 daemons up, 2 standby
osd:   44 osds: 44 up (since 111m), 44 in (since 4w)
cephfs-mirror: 1 daemon active (1 hosts)
rgw:   3 daemons active (3 hosts, 1 zones)

  data:
volumes: 1/1 healthy
pools:   25 pools, 769 pgs
objects: 614.40k objects, 1.9 TiB
usage:   2.8 TiB used, 292 TiB / 295 TiB avail
pgs: 769 active+clean

root@fl31ca104ja0302:/# ceph mgr module enable mirroring
module 'mirroring' is already enabled
root@fl31ca104ja0302:/# ceph fs snapshot mirror peer_bootstrap import cephfs 
eyJmc2lkIjogImE2ZjUyNTk4LWU1Y2QtNGEwOC04NDIyLTdiNmZkYjFkNWRiZSIsICJmaWxlc3lzdGVtIjogImNlcGhmcyIsICJ1c2VyIjogImNsaWVudC5taXJyb3JfcmVtb3RlIiwgInNpdGVfbmFtZSI6ICJmbGV4Mi1zaXRlIiwgImtleSI6ICJBUUNmd01sa005MHBMQkFBd1h0dnBwOGowNEl2Qzh0cXBBRzliQT09IiwgIm1vbl9ob3N0IjogIlt2MjoxNzIuMTguNTUuNzE6MzMwMC8wLHYxOjE3Mi4xOC41NS43MTo2Nzg5LzBdIFt2MjoxNzIuMTguNTUuNzM6MzMwMC8wLHYxOjE3Mi4xOC41NS43Mzo2Nzg5LzBdIn0=

root@fl31ca104ja0201:/# ceph fs snapshot mirror daemon status
[{"daemon_id": 5300887, "filesystems": [{"filesystem_id": 1, "name": "cephfs", 
"directory_count": 0, "peers": []}]}]

root@fl31ca104ja0302:/var/run/ceph# ceph --admin-daemon 
/var/run/ceph/ceph-client.cephfs-mirror.fl31ca104ja0302.sypagt.7.94083135960976.asok
 status {
"metadata": {
"ceph_sha1": "d7ff0d10654d2280e08f1ab989c7cdf3064446a5",
"ceph_version": "ceph version 17.2.6 
(d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)",
"entity_id": "cephfs-mirror.fl31ca104ja0302.sypagt",
"hostname": "fl31ca104ja0302",
"pid": "7",
"root": "/"
},
"dentry_count": 0,
"dentry_pinned_count": 0,
"id": 5194553,
"inst": {
"name": {
"type": "client",
"num": 5194553
},
"addr": {
"type": "v1",
"addr": "10.45.129.5:0",
"nonce": 2497002034
}
},
"addr": {
"type": "v1",
"addr": "10.45.129.5:0",
"nonce": 2497002034
},
"inst_str": "client.5194553 10.45.129.5:0/2497002034",
"addr_str": "10.45.129.5:0/2497002034",
"inode_count": 1,
"mds_epoch": 118,
"osd_epoch": 6266,
"osd_epoch_barrier": 0,
"blocklisted": false,
"fs_name": "cephfs"
}

root@fl31ca104ja0302:/home/general# docker logs 
ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e-cephfs-mirror-fl31ca104ja0302-sypagt 
--tail 10
debug 2023-08-03T05:24:27.413+ 7f8eb6fc0280  0 ceph version 17.2.6 
(d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable), process 
cephfs-mirror, pid 7
debug 2023-08-03T05:24:27.413+ 7f8eb6fc0280  0 pidfile_write: ignore empty 
--pid-file
debug 2023-08-03T05:24:27.445+ 7f8eb6fc0280  1 mgrc service_daemon_register 
cephfs-mirror.5184622 metadata 
{arch=x86_64,ceph_release=quincy,ceph_version=ceph version 17.2.6 
(d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy 
(stable),ceph_version_short=17.2.6,container_hostname=fl31ca104ja0302,container_image=quay.io/ceph/ceph@sha256:af79fedafc42237b7612fe2d18a9c64ca62a0b38ab362e614ad671efa4a0547e,cpu=Intel(R)
 Xeon(R) Gold 6252 CPU @ 2.10GHz,distro=centos,distro_description=CentOS Stream 
8,distro_version=8,hostname=fl31ca104ja0302,id=fl31ca104ja0302.sypagt,instance_id=5184622,kernel_description=#82-Ubuntu
 SMP Tue Jun 6 23:10:23 UTC 

[ceph-users] cephfs snapshot mirror peer_bootstrap import hung

2023-08-03 Thread Adiga, Anantha
Hi

Could you please provide guidance on how to diagnose this issue:

In this case, there are two Ceph clusters: cluster A with 4 nodes and cluster B
with 3 nodes, in different locations. Both are already running RGW multi-site;
A is the master.

CephFS snapshot mirroring is being configured on the clusters. Cluster A is the
primary and cluster B is the peer. The bootstrap import step on the primary node
hangs.

On the target cluster :
---
"version": "16.2.5",
"release": "pacific",
"release_type": "stable"

root@cr21meg16ba0101:/# ceph fs snapshot mirror peer_bootstrap create cephfs 
client.mirror_remote flex2-site
{"token": 
"eyJmc2lkIjogImE2ZjUyNTk4LWU1Y2QtNGEwOC04NDIyLTdiNmZkYjFkNWRiZSIsICJmaWxlc3lzdGVtIjogImNlcGhmcyIsICJ1c2VyIjogImNsaWVudC5taXJyb3JfcmVtb3RlIiwgInNpdGVfbmFtZSI6ICJmbGV4Mi1zaXRlIiwgImtleSI6ICJBUUNmd01sa005MHBMQkFBd1h0dnBwOGowNEl2Qzh0cXBBRzliQT09IiwgIm1vbl9ob3N0IjogIlt2MjoxNzIuMTguNTUuNzE6MzMwMC8wLHYxOjE3Mi4xOC41NS43MTo2Nzg5LzBdIFt2MjoxNzIuMTguNTUuNzM6MzMwMC8wLHYxOjE3Mi4xOC41NS43Mzo2Nzg5LzBdIn0="}
root@cr21meg16ba0101:/var/run/ceph#

On the source cluster:

"version": "17.2.6",
"release": "quincy",
"release_type": "stable"

root@fl31ca104ja0201:/# ceph -s
  cluster:
id: d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e
health: HEALTH_OK

  services:
mon:   3 daemons, quorum 
fl31ca104ja0202,fl31ca104ja0203,fl31ca104ja0201 (age 111m)
mgr:   fl31ca104ja0201.nwpqlh(active, since 11h), standbys: 
fl31ca104ja0203, fl31ca104ja0202
mds:   1/1 daemons up, 2 standby
osd:   44 osds: 44 up (since 111m), 44 in (since 4w)
cephfs-mirror: 1 daemon active (1 hosts)
rgw:   3 daemons active (3 hosts, 1 zones)

  data:
volumes: 1/1 healthy
pools:   25 pools, 769 pgs
objects: 614.40k objects, 1.9 TiB
usage:   2.8 TiB used, 292 TiB / 295 TiB avail
pgs: 769 active+clean

root@fl31ca104ja0302:/# ceph mgr module enable mirroring
module 'mirroring' is already enabled
root@fl31ca104ja0302:/# ceph fs snapshot mirror peer_bootstrap import cephfs 
eyJmc2lkIjogImE2ZjUyNTk4LWU1Y2QtNGEwOC04NDIyLTdiNmZkYjFkNWRiZSIsICJmaWxlc3lzdGVtIjogImNlcGhmcyIsICJ1c2VyIjogImNsaWVudC5taXJyb3JfcmVtb3RlIiwgInNpdGVfbmFtZSI6ICJmbGV4Mi1zaXRlIiwgImtleSI6ICJBUUNmd01sa005MHBMQkFBd1h0dnBwOGowNEl2Qzh0cXBBRzliQT09IiwgIm1vbl9ob3N0IjogIlt2MjoxNzIuMTguNTUuNzE6MzMwMC8wLHYxOjE3Mi4xOC41NS43MTo2Nzg5LzBdIFt2MjoxNzIuMTguNTUuNzM6MzMwMC8wLHYxOjE3Mi4xOC41NS43Mzo2Nzg5LzBdIn0=


root@fl31ca104ja0302:/var/run/ceph# ceph --admin-daemon 
/var/run/ceph/ceph-client.cephfs-mirror.fl31ca104ja0302.sypagt.7.94083135960976.asok
 status
{
"metadata": {
"ceph_sha1": "d7ff0d10654d2280e08f1ab989c7cdf3064446a5",
"ceph_version": "ceph version 17.2.6 
(d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)",
"entity_id": "cephfs-mirror.fl31ca104ja0302.sypagt",
"hostname": "fl31ca104ja0302",
"pid": "7",
"root": "/"
},
"dentry_count": 0,
"dentry_pinned_count": 0,
"id": 5194553,
"inst": {
"name": {
"type": "client",
"num": 5194553
},
"addr": {
"type": "v1",
"addr": "10.45.129.5:0",
"nonce": 2497002034
}
},
"addr": {
"type": "v1",
"addr": "10.45.129.5:0",
"nonce": 2497002034
},
"inst_str": "client.5194553 10.45.129.5:0/2497002034",
"addr_str": "10.45.129.5:0/2497002034",
"inode_count": 1,
"mds_epoch": 118,
"osd_epoch": 6266,
"osd_epoch_barrier": 0,
"blocklisted": false,
"fs_name": "cephfs"
}
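
Additional read-only checks that may help narrow this down (a sketch; per the
cephfs-mirroring docs, peer_list should only show the remote peer once the import
actually completes):

ceph fs snapshot mirror peer_list cephfs
ceph osd blocklist ls    # a blocklisted cephfs-mirror client could also explain a hang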

root@fl31ca104ja0302:/home/general# docker logs 
ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e-cephfs-mirror-fl31ca104ja0302-sypagt 
--tail  10
debug 2023-08-03T05:24:27.413+ 7f8eb6fc0280  0 ceph version 17.2.6 
(d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable), process 
cephfs-mirror, pid 7
debug 2023-08-03T05:24:27.413+ 7f8eb6fc0280  0 pidfile_write: ignore empty 
--pid-file
debug 2023-08-03T05:24:27.445+ 7f8eb6fc0280  1 mgrc service_daemon_register 
cephfs-mirror.5184622 metadata 
{arch=x86_64,ceph_release=quincy,ceph_version=ceph version 17.2.6 
(d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy 
(stable),ceph_version_short=17.2.6,container_hostname=fl31ca104ja0302,container_image=quay.io/ceph/ceph@sha256:af79fedafc42237b7612fe2d18a9c64ca62a0b38ab362e614ad671efa4a0547e,cpu=Intel(R)
 Xeon(R) Gold 6252 CPU @ 2.10GHz,distro=centos,distro_description=CentOS Stream 
8,distro_version=8,hostname=fl31ca104ja0302,id=fl31ca104ja0302.sypagt,instance_id=5184622,kernel_description=#82-Ubuntu
 SMP Tue Jun 6 23:10:23 UTC 
2023,kernel_version=5.15.0-75-generic,mem_swap_kb=8388604,mem_total_kb=527946928,os=Linux}
debug 2023-08-03T05:27:10.419+ 7f8ea1b2c700  0 client.5194553 
ms_handle_reset on v2:10.45.128.141:3300/0
debug 2023-08-03T05:50:10.917+ 

[ceph-users] ceph-csi-cephfs - InvalidArgument desc = provided secret is empty

2023-08-03 Thread Shawn Weeks
I'm attempting to set up the CephFS CSI on K3s managed by Rancher against an
external CephFS using the Helm chart. I'm using all default values on the Helm
chart except for cephConf and secret. I've verified that the configmap
ceph-config gets created with the values from Helm, and I've verified that the
secret csi-cephfs-secret also gets created with the same values, as seen below.
Any attempt to create a PVC results in the following error. The only posts I've
found are about expansion, and I am not trying to expand a CephFS volume, just
create one.

I0803 19:23:39.715036   1 event.go:298] 
Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"coder", 
Name:"test", UID:"9c7e51b6-0321-48e1-9950-444f786c14fb", APIVersion:"v1", 
ResourceVersion:"4523108", FieldPath:""}): type: 'Warning' reason: 
'ProvisioningFailed' failed to provision volume with StorageClass "cephfs": rpc 
error: code = InvalidArgument desc = provided secret is empty

cephConfConfigMapName: ceph-config
cephconf: |
  [global]
fsid = 9b98ccd8-450e-4172-af70-512e4e77bc36
mon_host = [v2:10.0.5.11:3300/0,v1:10.0.5.11:6789/0] 
[v2:10.0.5.12:3300/0,v1:10.0.5.12:6789/0] 
[v2:10.0.5.13:3300/0,v1:10.0.5.13:6789/0]
commonLabels: {}
configMapName: ceph-csi-config
csiConfig: null
driverName: cephfs.csi.ceph.com
externallyManagedConfigmap: false
kubeletDir: /var/lib/kubelet
logLevel: 5
nodeplugin:
  affinity: {}
  fusemountoptions: ''
  httpMetrics:
containerPort: 8081
enabled: true
service:
  annotations: {}
  clusterIP: ''
  enabled: true
  externalIPs: null
  loadBalancerIP: ''
  loadBalancerSourceRanges: null
  servicePort: 8080
  type: ClusterIP
  imagePullSecrets: null
  kernelmountoptions: ''
  name: nodeplugin
  nodeSelector: {}
  plugin:
image:
  pullPolicy: IfNotPresent
  repository: quay.io/cephcsi/cephcsi
  tag: v3.9.0
resources: {}
  priorityClassName: system-node-critical
  profiling:
enabled: false
  registrar:
image:
  pullPolicy: IfNotPresent
  repository: registry.k8s.io/sig-storage/csi-node-driver-registrar
  tag: v2.8.0
resources: {}
  tolerations: null
  updateStrategy: RollingUpdate
pluginSocketFile: csi.sock
provisioner:
  affinity: {}
  enableHostNetwork: false
  httpMetrics:
containerPort: 8081
enabled: true
service:
  annotations: {}
  clusterIP: ''
  enabled: true
  externalIPs: null
  loadBalancerIP: ''
  loadBalancerSourceRanges: null
  servicePort: 8080
  type: ClusterIP
  imagePullSecrets: null
  name: provisioner
  nodeSelector: {}
  priorityClassName: system-cluster-critical
  profiling:
enabled: false
  provisioner:
extraArgs: null
image:
  pullPolicy: IfNotPresent
  repository: registry.k8s.io/sig-storage/csi-provisioner
  tag: v3.5.0
resources: {}
  replicaCount: 3
  resizer:
enabled: true
extraArgs: null
image:
  pullPolicy: IfNotPresent
  repository: registry.k8s.io/sig-storage/csi-resizer
  tag: v1.8.0
name: resizer
resources: {}
  setmetadata: true
  snapshotter:
extraArgs: null
image:
  pullPolicy: IfNotPresent
  repository: registry.k8s.io/sig-storage/csi-snapshotter
  tag: v6.2.2
resources: {}
  strategy:
rollingUpdate:
  maxUnavailable: 50%
type: RollingUpdate
  timeout: 60s
  tolerations: null
provisionerSocketFile: csi-provisioner.sock
rbac:
  create: true
secret:
  adminID: 
  adminKey: 
  create: true
  name: csi-cephfs-secret
selinuxMount: true
serviceAccounts:
  nodeplugin:
create: true
name: null
  provisioner:
create: true
name: null
sidecarLogLevel: 1
storageClass:
  allowVolumeExpansion: true
  annotations: {}
  clusterID: 
  controllerExpandSecret: csi-cephfs-secret
  controllerExpandSecretNamespace: ''
  create: false
  fsName: myfs
  fuseMountOptions: ''
  kernelMountOptions: ''
  mountOptions: null
  mounter: ''
  name: csi-cephfs-sc
  nodeStageSecret: csi-cephfs-secret
  nodeStageSecretNamespace: ''
  pool: ''
  provisionerSecret: csi-cephfs-secret
  provisionerSecretNamespace: ''
  reclaimPolicy: Delete
  volumeNamePrefix: ''
global:
  cattle:
clusterId: c-m-xschvkd5
clusterName: dev-cluster
rkePathPrefix: ''
rkeWindowsPathPrefix: ''
systemProjectId: p-g6rqs
url: https://rancher.example.com
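
A minimal sketch of supplying the secret values through the chart, in case the
empty adminID/adminKey above are the actual values rather than redaction (release
and namespace names are placeholders; client.admin is used only as an example and
a dedicated CephX user is preferable):

helm upgrade --install ceph-csi-cephfs ceph-csi/ceph-csi-cephfs \
  --namespace ceph-csi-cephfs \
  --set secret.adminID=admin \
  --set secret.adminKey="$(ceph auth get-key client.admin)"
kubectl -n ceph-csi-cephfs get secret csi-cephfs-secret -o jsonpath='{.data.adminID}' | base64 -d   # verify it is non-empty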
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Backfill Performance for

2023-08-03 Thread Jonathan Suever
I am in the process of expanding our cluster capacity by ~50% and have
noticed some unexpected behavior during the backfill and recovery process
that I'd like to understand and see if there is a better configuration that
will yield a faster and smoother backfill.

Pool Information:

OSDs: 243 spinning HDDs
PGs: 1024 (yes, this is low for our number of disks)

I inherited the cluster and it has the following settings which seem to
have been done in an attempt to get the cluster to recover quickly:

osd_max_backfills: 6 (default is 1)
osd_recovery_sleep_hdd: 0.0 (default is 0.1)
osd_recovery_max_active_hdd: 9

When watching the PGs recover I am noticing a few things:

- All PGs seem to be backfilling at the same time which seems to be in
violation of osd_max_backfills. I understand that there should be 6 readers
and 6 writers at a time, but I'm seeing a given OSD participate in more
than 6 PG backfills. Is an OSD only considered as backfilling if it is not
present in both the UP and ACTING sets (i.e. it will have its data
altered)?

- Some PGs are recovering at a much slower rate than others (some as little
as kilobytes per second) despite the disks being all of a similar speed. Is
there some way to dig into why that may be?

- In general, the recovery is happening very slowly (between 1 and 5
objects per second per PG). Is it possible the settings above are too
aggressive and causing performance degradation due to disk thrashing?

- Currently, all misplaced PGs are backfilling; if I were to change some of
the settings above (specifically `osd_max_backfills`) would that
essentially pause backfilling PGs or will those backfills have to end and
then start over when it is done waiting?

- Given that all PGs are backfilling simultaneously there is no way to
prioritize one PG over another (we have some disks with very high usage
that we're trying to reduce). Would reducing those max backfills allow for
proper prioritization of PGs with force-backfill?

- We have had some OSDs restart during the process and their misplaced
object count is now zero but they are incrementing their recovering objects
bytes. Is that expected and is there a way to estimate when that will
complete?
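
In case it matters, this is roughly how I would inspect and dial the settings
back at runtime (a sketch; the values are only examples):

ceph config get osd osd_max_backfills
ceph config set osd osd_max_backfills 1        # runtime change, no OSD restart needed
ceph config set osd osd_recovery_sleep_hdd 0.1
ceph pg force-backfill <pgid>                  # nudge a specific PG ahead of the rest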

Thanks for the help!

-Jonathan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Luminous Bluestore issues and RGW Multi-site Recovery

2023-08-03 Thread Konstantin Shalygin
Hi,

Can you show `smartctl -a` for this device?
Does this drive show input/output errors in dmesg when you try to run ceph-osd?
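For example (device path is a placeholder):

smartctl -a /dev/sdX
dmesg -T | grep -i -e 'i/o error' -e sdX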


k
Sent from my iPhone

> On 2 Aug 2023, at 21:44, Greg O'Neill  wrote:
> 
> Syslog says the drive is not in write-protect mode, however smart says life 
> remaining is at 1%.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: unbalanced OSDs

2023-08-03 Thread Spiros Papageorgiou

On 03-Aug-23 12:11 PM, Eugen Block wrote:

ceph balancer status


I changed the PGs and it started rebalancing (and turned the autoscaler off),
so now it will not report status:


It reports: "optimize_result": "Too many objects (0.088184 > 0.05) 
are misplaced; try again later"


Lets wait a few hours to see what happens...

Thanx!

Sp

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [EXTERNAL] Upgrading nautilus / centos7 to octopus / ubuntu 20.04. - Suggestions and hints?

2023-08-03 Thread Beaman, Joshua
We went through this exercise, though our starting point was ubuntu 16.04 / 
nautilus.  We reduced our double builds as follows:


  1.  Rebuild each monitor host on 18.04/bionic and rejoin, still on nautilus
  2.  Upgrade all mons, mgrs, and (optionally) rgws to pacific
  3.  Convert each mon, mgr, rgw to cephadm and enable orchestrator
  4.  Rebuild each mon, mgr, rgw on 20.04/focal and rejoin the pacific cluster
  5.  Drain and rebuild each osd host on focal and pacific

This has the advantage of only having to drain and rebuild the OSD hosts once.  
Double building the control cluster hosts isn’t so bad, and orchestrator makes 
all of the ceph parts easy once it’s enabled.

The biggest challenge we ran into was: https://tracker.ceph.com/issues/51652 
because we still had a lot of filestore osds.  It’s frustrating, but we managed 
to get through it without much client interruption on a dozen prod clusters, 
most of which were 38 osd hosts and 912 total osds each.  One thing which 
helped was, before beginning the osd host builds, to set all of the old osds' 
primary-affinity to something <1.  This way when the new pacific (or octopus) 
osds join the cluster they will automatically be favored for primary on their 
pgs.  If a heartbeat timeout storm starts to get out of control, start by 
setting nodown and noout.  The flapping osds are the worst.  Then figure out 
which osds are the culprit and restart them.
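
A sketch of the commands involved (osd id and affinity value are just examples):

ceph osd primary-affinity osd.12 0.5    # favor the new osds as primary on shared pgs
ceph osd set nodown                     # damp a flapping/heartbeat storm
ceph osd set noout
ceph osd unset nodown                   # clear the flags once things settle
ceph osd unset noout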

Hopefully your nautilus osds are all bluestore and you won’t have this problem. 
 We put up with it, because the filestore to bluestore conversion was one of 
the most important parts of this upgrade for us.

Best of luck, whatever route you take.

Regards,
Josh Beaman

From: Götz Reinicke 
Date: Tuesday, August 1, 2023 at 1:01 PM
To: ceph-users@ceph.io 
Subject: [EXTERNAL] [ceph-users] Upgrading nautilus / centos7 to octopus / 
ubuntu 20.04. - Suggestions and hints?
Hi,

As I've read and thought a lot about the migration, and as this is a bigger
project, I was wondering if anyone has done it already and might share some notes
or playbooks, because in everything I read there were some parts missing or hard
for me to understand.

I do have some different approaches in mind, so may be you have some 
suggestions or hints.

a) upgrade nautilus on centos 7, with the few missing features like dashboard
and prometheus. After that, migrate one node after another to ubuntu 20.04 with
octopus and then upgrade ceph to the recent stable version.

b) migrate one node after another to ubuntu 18.04 with nautilus, then upgrade to
octopus, and after that move to ubuntu 20.04.

or

c) upgrade one node after another to ubuntu 20.04 with octopus and join it to
the cluster until all nodes are upgraded.


As a test I tried c) with a mon node, but adding it to the cluster fails with
some failed state, still probing for the other mons. (I don't have the right log
at hand right now.)

So my questions are:

a) What would be the best (most stable) migration path and

b) is it in general possible to add a new octopus mon (not an upgraded one) to a
nautilus cluster, where the other mons are still on nautilus?


I hope my thoughts and questions are understandable :)

Thanks for any hint and suggestion. Best . Götz
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: unbalanced OSDs

2023-08-03 Thread Eugen Block
Turn off the autoscaler and increase pg_num to 512 or so (power of 2).  
The recommendation is to have between 100 and 150 PGs per OSD (incl.  
replicas). And then let the balancer handle the rest. What is the  
current balancer status (ceph balancer status)?
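Roughly (pool name and target pg_num are placeholders):

ceph osd pool set <pool> pg_autoscale_mode off
ceph osd pool set <pool> pg_num 512
ceph balancer status
ceph balancer on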


Zitat von Spiros Papageorgiou :





___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] unbalanced OSDs

2023-08-03 Thread Spiros Papageorgiou

Hi all,


I have a ceph cluster with 3 nodes. ceph version is 16.2.9. There are 7 
SSD OSDs on each server and one pool that resides on these OSDs.


My OSDs are terribly unbalanced:

ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
-9         28.42200         -   28 TiB  9.3 TiB  9.2 TiB  161 MiB   26 GiB   19 TiB  32.56  1.09    -          root ssddisks
-2          9.47400         -  9.5 TiB  3.4 TiB  3.4 TiB   66 MiB  9.2 GiB  6.1 TiB  35.52  1.19    -          host px1-ssd
 0    ssd   1.74599   0.85004  1.7 TiB  810 GiB  807 GiB  3.2 MiB  2.3 GiB  978 GiB  45.28  1.51   26      up  osd.0
 5    ssd   0.82999   0.85004  850 GiB  581 GiB  580 GiB   22 MiB  912 MiB  269 GiB  68.38  2.29   19      up  osd.5
 6    ssd   0.82999   1.0      850 GiB  8.2 GiB  7.8 GiB  9.5 MiB  435 MiB  842 GiB   0.97  0.03    4      up  osd.6
 7    ssd   0.82999   1.0      850 GiB  294 GiB  293 GiB   26 MiB  591 MiB  556 GiB  34.60  1.16   11      up  osd.7
16    ssd   1.74599   0.85004  1.7 TiB  872 GiB  869 GiB  3.1 MiB  2.3 GiB  916 GiB  48.75  1.63   27      up  osd.16
23    ssd   1.74599   1.0      1.7 TiB  438 GiB  436 GiB  1.5 MiB  1.7 GiB  1.3 TiB  24.48  0.82   14      up  osd.23
24    ssd   1.74599   1.0      1.7 TiB  444 GiB  443 GiB  1.6 MiB  1.0 GiB  1.3 TiB  24.81  0.83   17      up  osd.24
-6          9.47400         -  9.5 TiB  2.9 TiB  2.9 TiB   46 MiB  8.1 GiB  6.6 TiB  30.39  1.02    -          host px2-ssd
12    ssd   0.82999   1.0      850 GiB  154 GiB  154 GiB   21 MiB  368 MiB  696 GiB  18.16  0.61    9      up  osd.12
13    ssd   0.82999   1.0      850 GiB  144 GiB  143 GiB  527 KiB  469 MiB  706 GiB  16.92  0.57    4      up  osd.13
14    ssd   0.82999   1.0      850 GiB  149 GiB  149 GiB   16 MiB  299 MiB  700 GiB  17.58  0.59    7      up  osd.14
29    ssd   1.74599   1.0      1.7 TiB  449 GiB  448 GiB  1.6 MiB  1.4 GiB  1.3 TiB  25.11  0.84   20      up  osd.29
30    ssd   1.74599   0.85004  1.7 TiB  885 GiB  882 GiB  3.1 MiB  2.3 GiB  903 GiB  49.48  1.65   31      up  osd.30
31    ssd   1.74599   1.0      1.7 TiB  728 GiB  727 GiB  2.6 MiB  1.8 GiB  1.0 TiB  40.74  1.36   22      up  osd.31
32    ssd   1.74599   1.0      1.7 TiB  438 GiB  437 GiB  1.6 MiB  1.4 GiB  1.3 TiB  24.51  0.82   15      up  osd.32
-4          9.47400         -  9.5 TiB  3.0 TiB  3.0 TiB   49 MiB  8.7 GiB  6.5 TiB  31.78  1.06    -          host px3-ssd
19    ssd   0.82999   1.0      850 GiB  293 GiB  292 GiB   14 MiB  500 MiB  557 GiB  34.47  1.15    9      up  osd.19
20    ssd   0.82999   1.0      850 GiB  290 GiB  290 GiB   10 MiB  482 MiB  560 GiB  34.15  1.14   10      up  osd.20
21    ssd   0.82999   1.0      850 GiB  148 GiB  147 GiB   16 MiB  428 MiB  702 GiB  17.36  0.58    5      up  osd.21
25    ssd   1.74599   1.0      1.7 TiB  446 GiB  445 GiB  1.8 MiB  1.6 GiB  1.3 TiB  24.96  0.83   19      up  osd.25
26    ssd   1.74599   1.0      1.7 TiB  739 GiB  737 GiB  2.6 MiB  2.0 GiB  1.0 TiB  41.33  1.38   29      up  osd.26
27    ssd   1.74599   1.0      1.7 TiB  725 GiB  723 GiB  2.6 MiB  2.1 GiB  1.0 TiB  40.55  1.36   21      up  osd.27
28    ssd   1.74599   1.0      1.7 TiB  442 GiB  440 GiB  1.6 MiB  1.7 GiB  1.3 TiB  24.72  0.83   17      up  osd.28


I have done a "ceph osd reweight-by-utilization" and "ceph osd 
set-require-min-compat-client luminous". The pool has 32 PGs which were 
set by autoscale_mode, which is on.


Why are my OSDs so unbalanced? I have osd.5 at 68.3% and osd.6 at 0.97%. Also,
after the reweight-by-utilization, osd.5's utilization actually increased...



What am i missing here?


Sp

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-volume lvm migrate error

2023-08-03 Thread Eugen Block
Check out the ownership of the newly created DB device, according to  
your output it belongs to the root user. In the osd.log you probably  
should see something related to "permission denied". If you change it  
to ceph:ceph the OSD might start properly.
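Something along these lines, using the block.db symlink from your listing below
(note that raw /dev ownership can revert after a reboot):

chown -h ceph:ceph /var/lib/ceph/osd/ceph-14/block.db
chown ceph:ceph /dev/dm-20      # the mapper device the symlink points to
systemctl start ceph-osd@14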


Zitat von Roland Giesler :


Ouch, I got excited too quickly!

On 2023/08/02 21:27, Roland Giesler wrote:

# systemctl start ceph-osd@14

And, voila!, it did it.

# ls -la /var/lib/ceph/osd/ceph-14/block*
lrwxrwxrwx 1 ceph ceph 50 Dec 25  2022  
/var/lib/ceph/osd/ceph-14/block ->  
/dev/mapper/0GVWr9-dQ65-LHcx-y6fD-z7fI-10A9-gVWZkY
lrwxrwxrwx 1 root root 10 Aug  2 21:17  
/var/lib/ceph/osd/ceph-14/block.db -> /dev/dm-20


It crashed!

# systemctl status ceph-osd@14
● ceph-osd@14.service - Ceph object storage daemon osd.14
 Loaded: loaded (/lib/systemd/system/ceph-osd@.service;  
enabled-runtime; vendor preset: enabled)

    Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
 └─ceph-after-pve-cluster.conf
 Active: failed (Result: exit-code) since Wed 2023-08-02  
21:18:54 SAST; 10min ago
    Process: 520652  
ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster  
${CLUSTER} --id 14 (code=exited, status=0/SUCCESS)
    Process: 520660 ExecStart=/usr/bin/ceph-osd -f --cluster  
${CLUSTER} --id 14 --setuser ceph --setgroup ceph (code=exited,  
status=1/FAILURE)

   Main PID: 520660 (code=exited, status=1/FAILURE)
    CPU: 90ms

Aug 02 21:18:54 FT1-NodeC systemd[1]: ceph-osd@14.service: Scheduled  
restart job, restart counter is at 3.
Aug 02 21:18:54 FT1-NodeC systemd[1]: Stopped Ceph object storage  
daemon osd.14.
Aug 02 21:18:54 FT1-NodeC systemd[1]: ceph-osd@14.service: Start  
request repeated too quickly.
Aug 02 21:18:54 FT1-NodeC systemd[1]: ceph-osd@14.service: Failed  
with result 'exit-code'.
Aug 02 21:18:54 FT1-NodeC systemd[1]: Failed to start Ceph object  
storage daemon osd.14.
Aug 02 21:28:49 FT1-NodeC systemd[1]: ceph-osd@14.service: Start  
request repeated too quickly.
Aug 02 21:28:49 FT1-NodeC systemd[1]: ceph-osd@14.service: Failed  
with result 'exit-code'.
Aug 02 21:28:49 FT1-NodeC systemd[1]: Failed to start Ceph object  
storage daemon osd.14.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ref v18.2.0 QE Validation status

2023-08-03 Thread Thomas Lamprecht
Am 03/08/2023 um 00:30 schrieb Yuri Weinstein:
> 1. bookworm distro build support
> We will not build bookworm until Debian bug
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1030129 is resolved


FYI, there's also a bug in Debian's GCC 12, which is used by default
in Debian Bookworm, that causes issues with the gf-complete erasure
coding library on older AMD CPUs, generating illegal instructions
which then kill e.g. the ceph-mon

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1012935

Their workaround is to compile gf-complete explicitly with -O1:

https://salsa.debian.org/openstack-team/third-party/gf-complete/-/commit/7751c075f868bf95873c6739d0d942f2a668c58f

While we (Proxmox) saw it for Ceph Quincy and didn't yet confirm it
for upcoming Ceph Reef, it's quite likely still there as the compiler
here seems to be at fault (and gf-complete code didn't change since
quincy FWICT).

- Thomas
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mgr services frequently crash on nodes 2,3,4

2023-08-03 Thread 陶冬冬
This mgr assert failure is fixed by https://github.com/ceph/ceph/pull/46688.
You can upgrade to 16.2.13 to get the fix.
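
If the cluster is cephadm-managed (the crash paths suggest it is), a sketch of
the upgrade:

ceph orch upgrade start --ceph-version 16.2.13
ceph orch upgrade status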


Eugen Block wrote on Thursday, August 3, 2023 at 14:57:


[ceph-users] Re: mgr services frequently crash on nodes 2,3,4

2023-08-03 Thread Eugen Block

Can you query those config options yourself?

storage01:~ # ceph config get mgr mgr/dashboard/standby_behaviour
storage01:~ # ceph config get mgr mgr/dashboard/AUDIT_API_ENABLED

I'm not sure if those are responsible for the crash though.

Zitat von "Adiga, Anantha" :


Hi,

Mgr service crash frequently on nodes 2 3 and 4  with the same  
condition after the 4th node was added.


root@zp3110b001a0104:/# ceph crash stat
19 crashes recorded
16 older than 1 days old:
2023-07-29T03:35:32.006309Z_7b622c2b-a2fc-425a-acb8-dc1673b4c189
2023-07-29T03:35:32.055174Z_a2ee1e23-5f41-4dbe-86ff-643fbf870dc9
2023-07-29T14:34:13.752432Z_39b6a0d9-1bc3-4481-9a14-c92fea6c2710
2023-07-30T03:02:57.510867Z_df595e04-0ac2-4e3d-93be-a7225348ea19
2023-07-30T06:20:09.322530Z_0c2485f8-281c-4440-8b08-89b08a669de4
2023-07-30T10:16:46.798405Z_79082f37-ee08-4a2b-84d1-d96c4026f321
2023-07-30T10:16:46.843441Z_788391d6-3278-48c4-a95b-1934ee3265c1
2023-07-31T02:26:55.903966Z_416a1e94-a8e1-4057-a683-a907faf400a1
2023-07-31T04:40:10.216044Z_bef9d811-4e92-45cd-bcd7-3282962c8dfe
2023-07-31T08:44:20.893344Z_037688ae-266f-4879-932c-2239f4679fd6
2023-07-31T09:22:12.527968Z_f136c93b-7156-4176-a734-66a5a62513a4
2023-07-31T15:22:08.417988Z_b80c6255-5eb3-41dd-b0b1-8bc5b070094f
2023-07-31T23:05:16.589501Z_20ed8ef9-a478-49de-a371-08ea7a9937e5
2023-08-01T01:26:01.911387Z_670f9e3c-7fbe-497f-9f0b-abeaefd8f2b3
2023-08-01T01:51:39.759874Z_ff8206e4-34aa-44fe-82ac-7339e6714bb7
2023-08-01T01:56:21.955706Z_98c86cdd-45ec-47dc-8f0c-2e5e09731db8
7 older than 3 days old:
2023-07-29T03:35:32.006309Z_7b622c2b-a2fc-425a-acb8-dc1673b4c189
2023-07-29T03:35:32.055174Z_a2ee1e23-5f41-4dbe-86ff-643fbf870dc9
2023-07-29T14:34:13.752432Z_39b6a0d9-1bc3-4481-9a14-c92fea6c2710
2023-07-30T03:02:57.510867Z_df595e04-0ac2-4e3d-93be-a7225348ea19
2023-07-30T06:20:09.322530Z_0c2485f8-281c-4440-8b08-89b08a669de4
2023-07-30T10:16:46.798405Z_79082f37-ee08-4a2b-84d1-d96c4026f321
2023-07-30T10:16:46.843441Z_788391d6-3278-48c4-a95b-1934ee3265c1

root@zp3110b001a0104:/var/lib/ceph/8dbfcd81-fee3-49d2-ac0c-e988c8be7178/crash/posted/2023-07-31T08:44:20.893344Z_037688ae-266f-4879-932c-2239f4679fd6# cat  
meta

{
"crash_id":  
"2023-07-31T08:44:20.893344Z_037688ae-266f-4879-932c-2239f4679fd6",

"timestamp": "2023-07-31T08:44:20.893344Z",
"process_name": "ceph-mgr",
"entity_name": "mgr.zp3110b001a0104.tmbkzq",
"ceph_version": "16.2.5",
"utsname_hostname": "zp3110b001a0104",
"utsname_sysname": "Linux",
"utsname_release": "5.4.0-153-generic",
"utsname_version": "#170-Ubuntu SMP Fri Jun 16 13:43:31 UTC 2023",
"utsname_machine": "x86_64",
"os_name": "CentOS Linux",
"os_id": "centos",
"os_version_id": "8",
"os_version": "8",
"assert_condition": "pending_service_map.epoch > service_map.epoch",
"assert_func": "DaemonServer::got_service_map()::ServiceMap&)>",
"assert_file":  
"/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mgr/DaemonServer.cc",

"assert_line": 2932,
"assert_thread_name": "ms_dispatch",
"assert_msg":  
"/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mgr/DaemonServer.cc: In function 'DaemonServer::got_service_map()::' thread 7f127440a700 time 2023-07-31T08:44:20.887150+\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mgr/DaemonServer.cc: 2932: FAILED ceph_assert(pending_service_map.epoch >  
service_map.epoch)\n",

"backtrace": [
"/lib64/libpthread.so.0(+0x12b20) [0x7f127c611b20]",
"gsignal()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int,  
char const*)+0x1a9) [0x7f127da26b75]",

"/usr/lib64/ceph/libceph-common.so.2(+0x276d3e) [0x7f127da26d3e]",
"(DaemonServer::got_service_map()+0xb2d) [0x5625aee23a4d]",
 
"(Mgr::handle_service_map(boost::intrusive_ptr)+0x1b6)  
[0x5625aee527c6]",
"(Mgr::ms_dispatch2(boost::intrusive_ptr  
const&)+0x894) [0x5625aee55424]",
"(MgrStandby::ms_dispatch2(boost::intrusive_ptr  
const&)+0xb0) [0x5625aee5ec10]",

"(DispatchQueue::entry()+0x126a) [0x7f127dc610ca]",
"(DispatchQueue::DispatchThread::entry()+0x11) [0x7f127dd11591]",
"/lib64/libpthread.so.0(+0x814a) [0x7f127c60714a]",
"clone()"
]
}