hi,

I have a 3-node Proxmox Ceph cluster that's been acting up whenever I try to
do anything with one of its pools (fastwrx).

`rbd pool stats fastwrx` just hangs on one node, but on the other two it
responds instantly.

`ceph -s` looks like this:

root@ibnmajid:~# ceph -s
  cluster:
    id:     310af567-1607-402b-bc5d-c62286a129d5
    health: HEALTH_WARN
            insufficient standby MDS daemons available
 
  services:
    mon: 3 daemons, quorum ibnmajid,ganges,riogrande (age 47h)
    mgr: riogrande(active, since 47h)
    mds: 2/2 daemons up, 1 hot standby
    osd: 18 osds: 18 up (since 47h), 18 in (since 47h)
 
  data:
    volumes: 2/2 healthy
    pools:   7 pools, 1537 pgs
    objects: 793.24k objects, 1.9 TiB
    usage:   4.1 TiB used, 10 TiB / 14 TiB avail
    pgs:     1537 active+clean
 
  io:
    client:   1.5 MiB/s rd, 243 KiB/s wr, 3 op/s rd, 19 op/s wr

I don't really know where to begin here. Nothing jumps out at me in syslog.
It's as if the rbd client on that node, rather than anything involved in
serving data, is somehow broken.
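
I was thinking of re-running it on that node with client-side debug logging
turned up to see where it stalls, something like:

  rbd --debug-ms=1 --debug-rbd=20 pool stats fastwrx

but I'm not sure what I should be looking for in that output.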

`ceph status` on that node works fine; the problem appears to be limited to
just the one pool.
root@ibnmajid:~# ceph status
  cluster:
    id:     310af567-1607-402b-bc5d-c62286a129d5
    health: HEALTH_WARN
            insufficient standby MDS daemons available

  services:
    mon: 3 daemons, quorum ibnmajid,ganges,riogrande (age 2d)
    mgr: riogrande(active, since 2d)
    mds: 2/2 daemons up, 1 hot standby
    osd: 18 osds: 18 up (since 2d), 18 in (since 2d)

  data:
    volumes: 2/2 healthy
    pools:   7 pools, 1537 pgs
    objects: 793.28k objects, 1.9 TiB
    usage:   4.1 TiB used, 10 TiB / 14 TiB avail
    pgs:     1537 active+clean

  io:
    client:   2.3 MiB/s rd, 137 KiB/s wr, 2 op/s rd, 18 op/s wr

If I try a different pool on the same node, that works fine:
root@ibnmajid:~# rbd pool stats largewrx
Total Images: 0
Total Snapshots: 0
Provisioned Size: 0 B
(Those statistics are correct; that pool isn't in direct use, but rather in
use via CephFS.)
Similarly, the CephFS pools related to fastwrx don't work on this node either,
but the others do:
root@ibnmajid:~# rbd pool stats fastwrxFS_data
^C
root@ibnmajid:~# rbd pool stats fastwrxFS_metadata
^C
root@ibnmajid:~# rbd pool stats largewrxFS_data
Total Images: 0
Total Snapshots: 0
Provisioned Size: 0 B
root@ibnmajid:~# rbd pool stats largewrxFS_metadata
Total Images: 0
Total Snapshots: 0
Provisioned Size: 0 B
root@ibnmajid:~#
On another node, everything returns results instantly, though fastwrxFS is
definitely in use, so I'm not sure why it reports zero:
root@ganges:~# rbd pool stats fastwrx
Total Images: 17
Total Snapshots: 0
Provisioned Size: 1.3 TiB
root@ganges:~# rbd pool stats fastwrxFS_data
Total Images: 0
Total Snapshots: 0
Provisioned Size: 0 B
root@ganges:~# rbd pool stats fastwrxFS_metadata
Total Images: 0
Total Snapshots: 0
Provisioned Size: 0 B
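(My guess is that `rbd pool stats` only counts RBD images, so the CephFS pools
would report zero even though they hold data; I suppose `rados df` or
`ceph df detail` would show the actual usage, but I haven't cross-checked that.)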
Here's what happens if I try `ceph osd pool stats` on a "good" node:
root@ganges:~# ceph osd pool stats
pool fastwrx id 9
  client io 0 B/s rd, 105 KiB/s wr, 0 op/s rd, 14 op/s wr

pool largewrx id 10
  nothing is going on

pool fastwrxFS_data id 17
  nothing is going on

pool fastwrxFS_metadata id 18
  client io 852 B/s rd, 1 op/s rd, 0 op/s wr

pool largewrxFS_data id 20
  client io 2.9 MiB/s rd, 2 op/s rd, 0 op/s wr

pool largewrxFS_metadata id 21
  nothing is going on

pool .mgr id 22
  nothing is going on
And on the broken node:
root@ibnmajid:~# ceph osd pool stats
pool fastwrx id 9
  client io 0 B/s rd, 93 KiB/s wr, 0 op/s rd, 5 op/s wr

pool largewrx id 10
  nothing is going on

pool fastwrxFS_data id 17
  nothing is going on

pool fastwrxFS_metadata id 18
  client io 852 B/s rd, 1 op/s rd, 0 op/s wr

pool largewrxFS_data id 20
  client io 1.9 MiB/s rd, 0 op/s rd, 0 op/s wr

pool largewrxFS_metadata id 21
  nothing is going on

pool .mgr id 22
  nothing is going on
So whatever interface that command uses seems to interact with the pool fine,
I guess.
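
My (possibly wrong) mental model is that `ceph status` and `ceph osd pool stats`
only talk to the mon/mgr, whereas `rbd pool stats` has to read objects from the
OSDs backing the pool, so maybe ibnmajid can't reach some of the OSDs serving
fastwrx. If that's plausible, I guess I'd start with something like:

  ceph osd map fastwrx someobject   # someobject is just a made-up name, to see which PG/OSDs it maps to
  ceph pg ls-by-pool fastwrx        # list the pool's PGs and their acting OSDs
  rados -p fastwrx ls               # check whether raw RADOS access to the pool hangs here too

but I haven't tried any of that yet, and I'm not sure it's the right direction.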

How do I get started fixing this?

thanks!
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
