Re: [ceph-users] RBD image has no active watchers while OpenStack KVM VM is running

2017-12-05 Thread Wido den Hollander

> On 5 December 2017 at 15:27, Jason Dillaman wrote:
> 
> 
> On Tue, Dec 5, 2017 at 9:13 AM, Wido den Hollander  wrote:
> >
> >> On 29 November 2017 at 14:56, Jason Dillaman wrote:
> >>
> >>
> >> We experienced this problem in the past on older (pre-Jewel) releases
> >> where a PG split that affected the RBD header object would result in
> >> the watch getting lost by librados. Any chance you know if the
> >> affected RBD header objects were involved in a PG split? Can you
> >> generate a gcore dump of one of the affected VMs and ceph-post-file it
> >> for analysis?
> >>
> >
> > I asked again for the gcore, but they can't release it as it contains 
> > confidential information about the Instance and the Ceph cluster. I 
> > understand their reasoning and they also understand that it makes it 
> > difficult to debug this.
> >
> > I am allowed to look at the gcore dump when I'm on site (next week), but 
> > I'm not allowed to share it.
> 
> Indeed -- best chance would be if you could reproduce on a VM that you
> are permitted to share.
> 

We are looking into that.

> >> As for the VM going R/O, that is the expected behavior when a client
> >> breaks the exclusive lock held by a (dead) client.
> >>
> >
> > We noticed another VM going into R/O when a snapshot was created. When we 
> > checked last week, this Instance still had a watcher, but after the 
> > snapshot/R/O event it no longer had one registered.
> >
> > Any suggestions or ideas?
> 
> If you have the admin socket enabled, you could run "ceph
> --admin-daemon /path/to/asok objecter_requests" to dump the ops. That
> probably won't be useful unless there is a smoking gun. Did you have
> any OSDs go out/down? Network issues?
> 

The admin socket is currently not enabled, but I will ask them to do that. We 
will then have to wait for this to happen again.

We didn't have any network issues there. A few OSDs did go down and come back 
up over the last few weeks, though not very recently, AFAIK.

I'll look into the admin socket!
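
For the archives: I expect enabling it for the qemu/librbd clients to look 
something like this in ceph.conf on the compute nodes (a sketch; the exact 
socket path and log location are assumptions, adjust to your setup):

  [client]
      admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
      log file = /var/log/ceph/qemu-guest-$pid.log

The directory needs to be writable by the user qemu runs as, and running VMs 
only pick this up after a restart or live migration. Once a socket exists, 
the in-flight ops can be dumped with:

$ ceph --admin-daemon /var/run/ceph/<client>.asok objecter_requests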

Wido



Re: [ceph-users] RBD image has no active watchers while OpenStack KVM VM is running

2017-12-05 Thread Jason Dillaman
On Tue, Dec 5, 2017 at 9:13 AM, Wido den Hollander  wrote:
>
>> On 29 November 2017 at 14:56, Jason Dillaman wrote:
>>
>>
>> We experienced this problem in the past on older (pre-Jewel) releases
>> where a PG split that affected the RBD header object would result in
>> the watch getting lost by librados. Any chance you know if the
>> affected RBD header objects were involved in a PG split? Can you
>> generate a gcore dump of one of the affected VMs and ceph-post-file it
>> for analysis?
>>
>
> I asked again for the gcore, but they can't release it as it contains 
> confidential information about the Instance and the Ceph cluster. I 
> understand their reasoning and they also understand that it makes it 
> difficult to debug this.
>
> I am allowed to look at the gcore dump when I'm on site (next week), but I'm 
> not allowed to share it.

Indeed -- best chance would be if you could reproduce on a VM that you
are permitted to share.

>> As for the VM going R/O, that is the expected behavior when a client
>> breaks the exclusive lock held by a (dead) client.
>>
>
> We noticed another VM going into R/O when a snapshot was created. When we 
> checked last week, this Instance still had a watcher, but after the 
> snapshot/R/O event it no longer had one registered.
>
> Any suggestions or ideas?

If you have the admin socket enabled, you could run "ceph
--admin-daemon /path/to/asok objecter_requests" to dump the ops. That
probably won't be useful unless there is a smoking gun. Did you have
any OSDs go out/down? Network issues?

--
Jason


Re: [ceph-users] RBD image has no active watchers while OpenStack KVM VM is running

2017-12-05 Thread Wido den Hollander

> On 29 November 2017 at 14:56, Jason Dillaman wrote:
> 
> 
> We experienced this problem in the past on older (pre-Jewel) releases
> where a PG split that affected the RBD header object would result in
> the watch getting lost by librados. Any chance you know if the
> affected RBD header objects were involved in a PG split? Can you
> generate a gcore dump of one of the affected VMs and ceph-post-file it
> for analysis?
> 

I asked again for the gcore, but they can't release it as it contains 
confidential information about the Instance and the Ceph cluster. I understand 
their reasoning and they also understand that it makes it difficult to debug 
this.

I am allowed to look at the gcore dump when I'm on site (next week), but I'm 
not allowed to share it.

> As for the VM going R/O, that is the expected behavior when a client
> breaks the exclusive lock held by a (dead) client.
> 

We noticed another VM going into R/O when a snapshot was created. When we 
checked last week, this Instance still had a watcher, but after the 
snapshot/R/O event it no longer had one registered.

Any suggestions or ideas?

Wido



Re: [ceph-users] RBD image has no active watchers while OpenStack KVM VM is running

2017-11-30 Thread Wido den Hollander

> On 30 November 2017 at 14:19, Jason Dillaman wrote:
> 
> 
> On Thu, Nov 30, 2017 at 4:00 AM, Wido den Hollander  wrote:
> >
> >> On 29 November 2017 at 14:56, Jason Dillaman wrote:
> >>
> >>
> >> We experienced this problem in the past on older (pre-Jewel) releases
> >> where a PG split that affected the RBD header object would result in
> >> the watch getting lost by librados. Any chance you know if the
> >> affected RBD header objects were involved in a PG split? Can you
> >> generate a gcore dump of one of the affected VMs and ceph-post-file it
> >> for analysis?
> >>
> >
> > There has been no PG splitting on this cluster in recent months, so that 
> > can't be what happened here.
> 
> Possible alternative explanation: are you using cache tiering?

No, no cache tiering either. It's plain RBD behind OpenStack.

The cluster has around 2,000 OSDs, all with 4TB disks, running 3x replication.

I'll wait for the gcore dump of a running VM, but that may take a few days.

Wido



Re: [ceph-users] RBD image has no active watchers while OpenStack KVM VM is running

2017-11-30 Thread Jason Dillaman
On Thu, Nov 30, 2017 at 4:00 AM, Wido den Hollander  wrote:
>
>> On 29 November 2017 at 14:56, Jason Dillaman wrote:
>>
>>
>> We experienced this problem in the past on older (pre-Jewel) releases
>> where a PG split that affected the RBD header object would result in
>> the watch getting lost by librados. Any chance you know if the
>> affected RBD header objects were involved in a PG split? Can you
>> generate a gcore dump of one of the affected VMs and ceph-post-file it
>> for analysis?
>>
>
> There has been no PG splitting on this cluster in recent months, so that 
> can't be what happened here.

Possible alternative explanation: are you using cache tiering?
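
(If you are not sure, something like this should rule it out; an untested 
one-liner, but pools participating in a tier should show tier_of / cache_mode 
or read_tier / write_tier fields in their pool lines:

$ ceph osd dump | grep -E 'tier_of|cache_mode|read_tier'

No output would mean no tiering.)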

--
Jason


Re: [ceph-users] RBD image has no active watchers while OpenStack KVM VM is running

2017-11-30 Thread Wido den Hollander

> On 29 November 2017 at 14:56, Jason Dillaman wrote:
> 
> 
> We experienced this problem in the past on older (pre-Jewel) releases
> where a PG split that affected the RBD header object would result in
> the watch getting lost by librados. Any chance you know if the
> affected RBD header objects were involved in a PG split? Can you
> generate a gcore dump of one of the affected VMs and ceph-post-file it
> for analysis?
> 

There has been no PG splitting on this cluster in recent months, so that 
can't be what happened here.

I've asked the OpenStack team for a gcore dump, but they have to get that 
cleared before they can send it to me.

This might take a bit of time!

Wido



Re: [ceph-users] RBD image has no active watchers while OpenStack KVM VM is running

2017-11-29 Thread Jason Dillaman
We experienced this problem in the past on older (pre-Jewel) releases
where a PG split that affected the RBD header object would result in
the watch getting lost by librados. Any chance you know if the
affected RBD header objects were involved in a PG split? Can you
generate a gcore dump of one of the affected VMs and ceph-post-file it
for analysis?
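
Something along these lines ought to do it (a sketch; it assumes a single 
qemu process on the host, and note that gcore briefly stops the process while 
it writes the core):

$ gcore -o /tmp/qemu-rbd $(pidof qemu-system-x86_64)
$ ceph-post-file -d 'rbd image lost its watcher' /tmp/qemu-rbd.<pid>

gcore appends the PID to the output file name, and ceph-post-file should 
print a unique id you can paste here so we can locate the upload.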

As for the VM going R/O, that is the expected behavior when a client
breaks the exclusive lock held by a (dead) client.
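
If it happens again, it may also be worth capturing the lock state at that 
moment, e.g.:

$ rbd lock list volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086

With exclusive-lock images you would normally see a single lock whose owner 
matches the expected watcher; the owner's address should tell you which 
(possibly dead) client last held it.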




-- 
Jason


Re: [ceph-users] RBD image has no active watchers while OpenStack KVM VM is running

2017-11-29 Thread Logan Kuhn
We've seen this. Our environment isn't identical, though: we use oVirt and 
connect to Ceph (11.2.1) via Cinder (9.2.1). But it's so rare that we've 
never had any luck pinpointing it, and we have far fewer VMs (<300).

Regards,
Logan



[ceph-users] RBD image has no active watchers while OpenStack KVM VM is running

2017-11-29 Thread Wido den Hollander
Hi,

On an OpenStack environment I encountered a VM which went into R/O mode after 
an RBD snapshot was created.

Digging into this, I found tens (out of thousands) of RBD images which DO 
have a running VM, but do NOT have a watcher on the RBD image.

For example:

$ rbd status volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086

'Watchers: none'
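
(The watch can also be cross-checked directly against the header object, 
bypassing the rbd tool; for a format 2 image that should be something like:

$ rbd info volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086 | grep block_name_prefix
$ rados -p volumes listwatchers rbd_header.<id>

where <id> is the hex id taken from the block_name_prefix, i.e. the part 
after 'rbd_data.'.)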

The VM has, however, been running since September 5th, 2017, with Jewel 
10.2.7 on the client.

In the meantime the cluster has already been upgraded to 10.2.10.

Looking further, I also found a compute node with 10.2.10 installed which 
likewise has RBD images without watchers.

Restarting or live migrating the VM to a different host resolves this issue.

The internet is full of posts where RBD images still have watchers when 
people don't expect them, but in this case I'm expecting a watcher which 
isn't there.

The main problem right now is that creating a snapshot potentially puts a VM in 
Read-Only state because of the lack of notification.
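
A quick-and-dirty loop like this is enough to find the affected images (slow 
on thousands of volumes, and only volumes attached to a running VM should 
have a watcher at all, so the output still needs cross-referencing against 
the list of running Instances):

$ for vol in $(rbd ls volumes); do
    rbd status "volumes/$vol" | grep -q 'watcher=' || echo "volumes/$vol: no watchers"
  done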

Has anybody seen this as well?

Thanks,

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com