Yes, this is all set up. It was working fine until the problem with the OSD host that lost the cluster/sync network occurred.

There are a few other VMs that keep running along fine without this issue. I've restarted the problematic VM without success (that is, creating a file works, but overwriting one still hangs right away). fsck runs fine, so reading the whole image works.
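
For reference, this is roughly how it reproduces inside the VM (the file name is just an example):

    # creating a new file works
    dd if=/dev/zero of=/root/testfile bs=4k count=1
    # overwriting the same file in place hangs immediately
    # (conv=notrunc prevents truncation, forcing an in-place write)
    dd if=/dev/zero of=/root/testfile bs=4k count=1 conv=notrunc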

I'm kind of stumped as to what can cause this.

Because of the lengthy recovery, and the PG autoscaler currently doing things, there are lots of PGs that haven't been scrubbed, but I doubt that is an issue here.
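
If it's useful, the unscrubbed PGs show up like this (a sketch; the output differs a bit between releases):

    # health warnings list the PGs not scrubbed/deep-scrubbed in time
    ceph health detail
    # per-PG scrub timestamps
    ceph pg dump pgs | less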


Den 2023-09-29 kl. 18:52, skrev Anthony D'Atri:
EC for RBD wasn't possible until Luminous IIRC, so I had to ask.  You have a 
replicated metadata pool defined?  Does proxmox know that this is an EC pool?  
When connecting it needs to know both the metadata and data pools.
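
Something like this, with the data pool given both at image creation and in the 
storage definition (pool names are just examples; IIRC recent Proxmox versions 
have a data-pool option for RBD storage):

    # metadata in a replicated pool, data objects in the EC pool
    rbd create --size 100G --data-pool ecpool rbd_meta/vm-100-disk-0

    # /etc/pve/storage.cfg
    rbd: ceph-ec
        pool rbd_meta
        data-pool ecpool
        content images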

On Sep 29, 2023, at 12:49, peter.lin...@fiberdirekt.se wrote:

(sorry for duplicate emails)

This turns out to be a good question actually.

The cluster is running Quincy, 17.2.6.

The compute node that is running the VM is proxmox, version 7.4-3. Supposedly this is 
fairly new, but librbd1 claims to be version 14.2.21 when I check with "apt 
list". We are not using proxmox's own ceph cluster release. We haven't had 
any issues with this setup before, but then we had neither been using erasure-coded 
pools nor had a node half-dead for such a long time.
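
For completeness, this is how I checked (the second command just dumps the daemon versions on the cluster side):

    # on the Proxmox node
    apt list --installed 2>/dev/null | grep librbd1
    # on the cluster
    ceph versions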

The VM is configured using proxmox, which is not libvirt but similar, and krbd 
is not enabled. I don't know for sure whether proxmox links its own librbd into 
qemu/kvm.
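
One way to check what the running qemu actually loaded (the process is named kvm 
on Proxmox, if I'm not mistaken):

    # see which librbd the VM's kvm process has mapped
    grep -m1 librbd /proc/$(pidof -s kvm)/maps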

"ceph features" looks like this:

{
     "mon": [
         {
             "features": "0x3f01cfbf7ffdffff",
             "release": "luminous",
             "num": 5
         }
     ],
     "osd": [
         {
             "features": "0x3f01cfbf7ffdffff",
             "release": "luminous",
             "num": 24
         }
     ],
     "client": [
         {
             "features": "0x3f01cfb87fecffff",
             "release": "luminous",
             "num": 4
         },
         {
             "features": "0x3f01cfbf7ffdffff",
             "release": "luminous",
             "num": 12
         }
     ],
     "mgr": [
         {
             "features": "0x3f01cfbf7ffdffff",
             "release": "luminous",
             "num": 2
         }
     ]
}

Regards,

Peter


Den 2023-09-29 kl. 17:55, skrev Anthony D'Atri:
Which Ceph releases are installed on the VM and the back end?  Is the VM using 
librbd through libvirt, or krbd?

On Sep 29, 2023, at 09:09, Peter Linder <peter.lin...@fiberdirekt.se> wrote:

Dear all,

I have a problem: after an OSD host lost connection to the sync/cluster 
rear network for many hours (the public network stayed online), a test VM using 
RBD can't overwrite its files. I can create a new file inside it just fine, but 
not overwrite it; the process just hangs.

The VM's disk is on an erasure-coded data pool with a replicated pool in front 
of it. EC overwrites are enabled for the pool.
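
For reference, EC overwrites were enabled the usual way (the pool name is just an example):

    ceph osd pool set ecpool allow_ec_overwrites true
    # verify
    ceph osd pool get ecpool allow_ec_overwrites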

The cluster consists of 5 hosts with 4 OSDs each, and separate hosts for 
compute. There are separate public and cluster networks. In this 
case, the AOC cable to the cluster network went link-down on a host; it had 
to be replaced and the host was rebooted. Recovery took about a week to 
complete. The host was half-down like this for about 12 hours.

I have some other VMs as well with images in the same pool (four in total), and 
they seem to work fine; it is just this one that can't overwrite.

I'm thinking there is somehow something wrong with just this image?

Regards,

Peter
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
