We experienced this problem in the past on older (pre-Jewel) releases,
where a PG split that affected the RBD header object would result in
librados losing the watch. Any chance you know whether the affected
RBD header objects were involved in a PG split? Can you generate a
gcore dump of one of the affected VMs and ceph-post-file it for
analysis?
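
In case it helps with checking the PG split theory, something along
these lines should map a header object to its PG (the <image id>
placeholder is the hex id from the block_name_prefix that 'rbd info'
reports):

$ rbd info volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086
$ ceph osd map volumes rbd_header.<image id>

and for the core dump of the QEMU process:

$ gcore <qemu pid>
$ ceph-post-file core.<qemu pid>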

As for the VM going R/O, that is the expected behavior when another
client breaks the exclusive lock held by a (presumed dead) client.
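
If you want to inspect the lock side as well, the exclusive lock
should show up in the advisory lock list on the image (with an id of
the form 'auto <cookie>'):

$ rbd lock ls volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086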

On Wed, Nov 29, 2017 at 8:48 AM, Wido den Hollander <w...@42on.com> wrote:
> Hi,
>
> In an OpenStack environment I encountered a VM which went into R/O mode
> after an RBD snapshot was created.
>
> Digging into this, I found tens (out of thousands) of RBD images which DO
> have a running VM, but do NOT have a watcher on the RBD image.
>
> For example:
>
> $ rbd status volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086
>
> 'Watchers: none'
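>
> A rough scan along these lines lists the images without watchers (the
> results then still need to be cross-checked against the running VMs):
>
> $ for img in $(rbd ls volumes); do
>     rbd status "volumes/$img" | grep -q 'Watchers: none' && echo "$img"
>   done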
>
> The VM has, however, been running since September 5th, 2017, with Jewel
> 10.2.7 on the client.
>
> In the meantime the cluster has already been upgraded to 10.2.10.
>
> Looking further, I found a compute node with 10.2.10 installed which also
> has RBD images without watchers.
>
> Restarting or live migrating the VM to a different host resolves this issue.
>
> The internet is full of posts about RBD images that still have watchers
> when people don't expect them, but in this case I'm expecting a watcher
> which isn't there.
>
> The main problem right now is that creating a snapshot can put a VM into a
> read-only state, because without a watcher the client never receives the
> snapshot notification.
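>
> The trigger is nothing special; a plain snapshot create against the image
> is enough, e.g. (the snapshot name here is just an example):
>
> $ rbd snap create volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086@test-snap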
>
> Has anybody seen this as well?
>
> Thanks,
>
> Wido

-- 
Jason