[ceph-users] Re: rbd snap create now working and just hangs forever

2021-04-23 Thread Ilya Dryomov
On Fri, Apr 23, 2021 at 1:12 PM Boris Behrens  wrote:
>
>
>
> On Fri, Apr 23, 2021 at 1:00 PM Ilya Dryomov wrote:
>>
>> On Fri, Apr 23, 2021 at 12:46 PM Boris Behrens  wrote:
>> >
>> >
>> >
>> > On Fri, Apr 23, 2021 at 12:16 PM Ilya Dryomov wrote:
>> >>
>> >> On Fri, Apr 23, 2021 at 12:03 PM Boris Behrens  wrote:
>> >> >
>> >> >
>> >> >
>> >> > On Fri, Apr 23, 2021 at 11:52 AM Ilya Dryomov wrote:
>> >> >>
>> >> >>
>> >> >> This snippet confirms my suspicion.  Unfortunately without a verbose
>> >> >> log from that VM from three days ago (i.e. when it got into this state)
>> >> >> it's hard to tell what exactly went wrong.
>> >> >>
>> >> >> The problem is that the VM doesn't consider itself to be the rightful
>> >> >> owner of the lock and so when "rbd snap create" requests the lock from
>> >> >> it in order to make a snapshot, the VM just ignores the request because
>> >> >> even though it owns the lock, its record appears to be out of sync.
>> >> >>
>> >> >> I'd suggest kicking it by restarting osd36.  If the VM is active, it
>> >> >> should reacquire the lock and hopefully update its internal record as
>> >> >> expected.  If "rbd snap create" still hangs after that, it would mean
>> >> >> that we have a reproducer and can gather logs on the VM side.
>> >> >>
>> >> >> What version of qemu/librbd and ceph is in use (both on the VM side and
>> >> >> on the side you are running "rbd snap create")?
>> >> >>
>> >> > I just stopped the OSD, waited some seconds and started it again.
>> >> > I still can't create snapshots.
>> >> >
>> >> > Ceph version is 14.2.18 across the board
>> >> > qemu is 4.1.0-1
>> >> > as we use krbd, the kernel version is 5.2.9-arch1-1-ARCH
>> >> >
>> >> > How can I gather more logs to debug it?
>> >>
>> >> Are you saying that this image is mapped and the lock is held by the
>> >> kernel client?  It doesn't look that way from the logs you shared.
>> >
>> > We use krbd instead of librbd (at least this is what I think I know), but 
>> > qemu is doing the kvm/rbd stuff.
>>
>> I'm going to assume that by "qemu is doing the kvm/rbd stuff", you
>> mean that you are using the librbd driver inside qemu and that this
>> image is opened by qemu (i.e. that driver).  If you don't know what
>> access method is being used, debugging this might be challenging ;)
>>
>> Let's start with the same output: "rbd lock ls", "rbd status" and "rbd
>> snap create --debug-ms=1 --debug-rbd=20".  It should be different after
>> osd36 was restarted.
>
> Here is the new one: https://pastebin.com/6qTsJK6W
> Ah OK, this CPU node still has the old setup and uses librbd rather than krbd
> to access RBD.

Sorry, I forgot that simply restarting the OSD doesn't trigger the
code path that I'm hoping would cause librbd inside the VM to update
its state.  I took a look at the code and I think there are a couple
of ways to do it (listed in the order of preference):

- cut the network between the VM and the cluster for more than 30
  seconds; it should be done externally so that to the VM it looks
  like a long network blip

- stop the VM process for more than 30 seconds

  $ PID=
  $ kill -STOP $PID && sleep 40 && kill -CONT $PID

- stop the osd36 process for more than 30 seconds with "nodown" flag
  set

  $ ceph osd set nodown
  $ PID=
  $ kill -STOP $PID && sleep 40 && kill -CONT $PID
  $ ceph osd unset nodown
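
The second and third options share the same pause-and-resume step. A
minimal sketch of that step as a shell function is given here; the name
pause_for and the trap are illustrative additions (not part of any Ceph
tooling), and the PID still has to be looked up by hand.

```shell
#!/bin/sh
# pause_for PID SECONDS
# Freeze a process with SIGSTOP, wait, then resume it with SIGCONT.
# Hypothetical helper: the caller supplies the PID of the qemu or OSD process.
pause_for() {
    pid=$1
    secs=$2
    kill -STOP "$pid" || return 1
    # Make sure the process is resumed even if the wait is interrupted.
    trap 'kill -CONT "$pid"' INT TERM
    sleep "$secs"
    kill -CONT "$pid"
    trap - INT TERM
}
```

Usage would be "pause_for $PID 40"; for the OSD variant, "ceph osd set
nodown" and "ceph osd unset nodown" still go around the call as shown above.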

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd snap create now working and just hangs forever

2021-04-23 Thread Ilya Dryomov
On Fri, Apr 23, 2021 at 12:46 PM Boris Behrens  wrote:
>
>
>
> On Fri, Apr 23, 2021 at 12:16 PM Ilya Dryomov wrote:
>>
>> On Fri, Apr 23, 2021 at 12:03 PM Boris Behrens  wrote:
>> >
>> >
>> >
>> > On Fri, Apr 23, 2021 at 11:52 AM Ilya Dryomov wrote:
>> >>
>> >>
>> >> This snippet confirms my suspicion.  Unfortunately without a verbose
>> >> log from that VM from three days ago (i.e. when it got into this state)
>> >> it's hard to tell what exactly went wrong.
>> >>
>> >> The problem is that the VM doesn't consider itself to be the rightful
>> >> owner of the lock and so when "rbd snap create" requests the lock from
>> >> it in order to make a snapshot, the VM just ignores the request because
>> >> even though it owns the lock, its record appears to be out of sync.
>> >>
>> >> I'd suggest kicking it by restarting osd36.  If the VM is active, it
>> >> should reacquire the lock and hopefully update its internal record as
>> >> expected.  If "rbd snap create" still hangs after that, it would mean
>> >> that we have a reproducer and can gather logs on the VM side.
>> >>
>> >> What version of qemu/librbd and ceph is in use (both on the VM side and
>> >> on the side you are running "rbd snap create")?
>> >>
>> > I just stopped the OSD, waited some seconds and started it again.
>> > I still can't create snapshots.
>> >
>> > Ceph version is 14.2.18 across the board
>> > qemu is 4.1.0-1
>> > as we use krbd, the kernel version is 5.2.9-arch1-1-ARCH
>> >
>> > How can I gather more logs to debug it?
>>
>> Are you saying that this image is mapped and the lock is held by the
>> kernel client?  It doesn't look that way from the logs you shared.
>
> We use krbd instead of librbd (at least this is what I think I know), but 
> qemu is doing the kvm/rbd stuff.

I'm going to assume that by "qemu is doing the kvm/rbd stuff", you
mean that you are using the librbd driver inside qemu and that this
image is opened by qemu (i.e. that driver).  If you don't know what
access method is being used, debugging this might be challenging ;)

Let's start with the same output: "rbd lock ls", "rbd status" and "rbd
snap create --debug-ms=1 --debug-rbd=20".  It should be different after
osd36 was restarted.
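
One way to capture all three outputs in a single file is sketched here.
This is only an illustration: the function name and the RBD and HANG_SECS
variables are made up for the example, and timeout is the coreutils
utility, used so that the hanging command is eventually stopped.

```shell
#!/bin/sh
# collect_rbd_state IMAGE SNAP OUTFILE
# Records "rbd lock ls" and "rbd status" before and after a verbose
# "rbd snap create" attempt. RBD and HANG_SECS are hypothetical knobs.
RBD=${RBD:-rbd}
HANG_SECS=${HANG_SECS:-180}

collect_rbd_state() {
    image=$1
    snap=$2
    out=$3
    {
        echo "== before =="
        $RBD lock ls "$image"
        $RBD status "$image"
        echo "== snap create =="
        # Give up on the hung command after HANG_SECS seconds.
        timeout "$HANG_SECS" $RBD snap create "$image@$snap" \
            --debug-ms=1 --debug-rbd=20
        echo "== after =="
        $RBD lock ls "$image"
        $RBD status "$image"
    } >"$out" 2>&1
}
```

For example: "collect_rbd_state rbd/IMAGE test-bb-1 snap-debug.log".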

Thanks,

Ilya


[ceph-users] Re: rbd snap create now working and just hangs forever

2021-04-23 Thread Boris Behrens
On Fri, Apr 23, 2021 at 12:16 PM Ilya Dryomov wrote:

> On Fri, Apr 23, 2021 at 12:03 PM Boris Behrens  wrote:
> >
> >
> >
> > On Fri, Apr 23, 2021 at 11:52 AM Ilya Dryomov <idryo...@gmail.com> wrote:
> >>
> >>
> >> This snippet confirms my suspicion.  Unfortunately without a verbose
> >> log from that VM from three days ago (i.e. when it got into this state)
> >> it's hard to tell what exactly went wrong.
> >>
> >> The problem is that the VM doesn't consider itself to be the rightful
> >> owner of the lock and so when "rbd snap create" requests the lock from
> >> it in order to make a snapshot, the VM just ignores the request because
> >> even though it owns the lock, its record appears to be out of sync.
> >>
> >> I'd suggest kicking it by restarting osd36.  If the VM is active, it
> >> should reacquire the lock and hopefully update its internal record as
> >> expected.  If "rbd snap create" still hangs after that, it would mean
> >> that we have a reproducer and can gather logs on the VM side.
> >>
> >> What version of qemu/librbd and ceph is in use (both on the VM side and
> >> on the side you are running "rbd snap create")?
> >>
> > I just stopped the OSD, waited some seconds and started it again.
> > I still can't create snapshots.
> >
> > Ceph version is 14.2.18 across the board
> > qemu is 4.1.0-1
> > as we use krbd, the kernel version is 5.2.9-arch1-1-ARCH
> >
> > How can I gather more logs to debug it?
>
> Are you saying that this image is mapped and the lock is held by the
> kernel client?  It doesn't look that way from the logs you shared.
>
> Thanks,
>
> Ilya
>

We use krbd instead of librbd (at least this is what I think I know), but
qemu is doing the kvm/rbd stuff.
-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.


[ceph-users] Re: rbd snap create now working and just hangs forever

2021-04-23 Thread Ilya Dryomov
On Fri, Apr 23, 2021 at 12:03 PM Boris Behrens  wrote:
>
>
>
> On Fri, Apr 23, 2021 at 11:52 AM Ilya Dryomov wrote:
>>
>>
>> This snippet confirms my suspicion.  Unfortunately without a verbose
>> log from that VM from three days ago (i.e. when it got into this state)
>> it's hard to tell what exactly went wrong.
>>
>> The problem is that the VM doesn't consider itself to be the rightful
>> owner of the lock and so when "rbd snap create" requests the lock from
>> it in order to make a snapshot, the VM just ignores the request because
>> even though it owns the lock, its record appears to be out of sync.
>>
>> I'd suggest kicking it by restarting osd36.  If the VM is active, it
>> should reacquire the lock and hopefully update its internal record as
>> expected.  If "rbd snap create" still hangs after that, it would mean
>> that we have a reproducer and can gather logs on the VM side.
>>
>> What version of qemu/librbd and ceph is in use (both on the VM side and
>> on the side you are running "rbd snap create")?
>>
> I just stopped the OSD, waited some seconds and started it again.
> I still can't create snapshots.
>
> Ceph version is 14.2.18 across the board
> qemu is 4.1.0-1
> as we use krbd, the kernel version is 5.2.9-arch1-1-ARCH
>
> How can I gather more logs to debug it?

Are you saying that this image is mapped and the lock is held by the
kernel client?  It doesn't look that way from the logs you shared.

Thanks,

Ilya


[ceph-users] Re: rbd snap create now working and just hangs forever

2021-04-23 Thread Boris Behrens
On Fri, Apr 23, 2021 at 11:52 AM Ilya Dryomov wrote:

>
> This snippet confirms my suspicion.  Unfortunately without a verbose
> log from that VM from three days ago (i.e. when it got into this state)
> it's hard to tell what exactly went wrong.
>
> The problem is that the VM doesn't consider itself to be the rightful
> owner of the lock and so when "rbd snap create" requests the lock from
> it in order to make a snapshot, the VM just ignores the request because
> even though it owns the lock, its record appears to be out of sync.
>
> I'd suggest kicking it by restarting osd36.  If the VM is active, it
> should reacquire the lock and hopefully update its internal record as
> expected.  If "rbd snap create" still hangs after that, it would mean
> that we have a reproducer and can gather logs on the VM side.
>
> What version of qemu/librbd and ceph is in use (both on the VM side and
> on the side you are running "rbd snap create")?
>

I just stopped the OSD, waited some seconds and started it again.
I still can't create snapshots.

Ceph version is 14.2.18 across the board
qemu is 4.1.0-1
as we use krbd, the kernel version is 5.2.9-arch1-1-ARCH

How can I gather more logs to debug it?

-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.


[ceph-users] Re: rbd snap create now working and just hangs forever

2021-04-23 Thread Ilya Dryomov
On Fri, Apr 23, 2021 at 9:16 AM Boris Behrens  wrote:
>
>
>
> On Thu, Apr 22, 2021 at 8:59 PM Ilya Dryomov wrote:
>>
>> On Thu, Apr 22, 2021 at 7:33 PM Boris Behrens  wrote:
>> >
>> >
>> >
>> > On Thu, Apr 22, 2021 at 6:30 PM Ilya Dryomov wrote:
>> >>
>> >> On Thu, Apr 22, 2021 at 6:00 PM Boris Behrens  wrote:
>> >> >
>> >> >
>> >> >
>> >> > On Thu, Apr 22, 2021 at 5:27 PM Ilya Dryomov wrote:
>> >> >>
>> >> >> On Thu, Apr 22, 2021 at 5:08 PM Boris Behrens  wrote:
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On Thu, Apr 22, 2021 at 4:43 PM Ilya Dryomov wrote:
>> >> >> >>
>> >> >> >> On Thu, Apr 22, 2021 at 4:20 PM Boris Behrens  wrote:
>> >> >> >> >
>> >> >> >> > Hi,
>> >> >> >> >
>> >> >> >> > I have a customer VM that is running fine, but I can not make
>> >> >> >> > snapshots anymore.
>> >> >> >> > rbd snap create rbd/IMAGE@test-bb-1
>> >> >> >> > just hangs forever.
>> >> >> >>
>> >> >> >> Hi Boris,
>> >> >> >>
>> >> >> >> Run
>> >> >> >>
>> >> >> >> $ rbd snap create rbd/IMAGE@test-bb-1 --debug-ms=1 --debug-rbd=20
>> >> >> >>
>> >> >> >> let it hang for a few minutes and attach the output.
>> >> >> >
>> >> >> >
>> >> >> > I just pasted a short snip here: https://pastebin.com/B3Xgpbzd
>> >> >> > If you need more I can give it to you, but the output is very large.
>> >> >>
>> >> >> Paste the first couple thousand lines (i.e. from the very beginning),
>> >> >> that should be enough.
>> >> >>
>> >> > sure: https://pastebin.com/GsKpLbqG
>> >> >
>> >> > good luck :)
>> >>
>> >> What is the output of "rbd status"?  I know you said it shows one
>> >> watcher, but I need to see it.
>> >>
>> >>
>> > sure
>> > # rbd status rbd/IMAGE
>> > Watchers:
>> > watcher=[fd00:2380:2:43::11]:0/3919389201 client.136378749 cookie=139968010125312
>>
>
> Hi Ilya,
> thank you a lot for your support.
>
> This might be another hanging snapshot scheduler that got removed afterwards.
> Sorry for that.
>
> https://pastebin.com/TBZs7Mvb
>
> I just created a new paste and added status and lock ls at the top and at the
> bottom.
> The 2nd watcher disappears after a minute or so.
> All commands are done within one minute.

This snippet confirms my suspicion.  Unfortunately without a verbose
log from that VM from three days ago (i.e. when it got into this state)
it's hard to tell what exactly went wrong.

The problem is that the VM doesn't consider itself to be the rightful
owner of the lock and so when "rbd snap create" requests the lock from
it in order to make a snapshot, the VM just ignores the request because
even though it owns the lock, its record appears to be out of sync.

I'd suggest kicking it by restarting osd36.  If the VM is active, it
should reacquire the lock and hopefully update its internal record as
expected.  If "rbd snap create" still hangs after that, it would mean
that we have a reproducer and can gather logs on the VM side.

What version of qemu/librbd and ceph is in use (both on the VM side and
on the side you are running "rbd snap create")?

Thanks,

Ilya


[ceph-users] Re: rbd snap create now working and just hangs forever

2021-04-22 Thread Ilya Dryomov
On Thu, Apr 22, 2021 at 6:00 PM Boris Behrens  wrote:
>
>
>
> On Thu, Apr 22, 2021 at 5:27 PM Ilya Dryomov wrote:
>>
>> On Thu, Apr 22, 2021 at 5:08 PM Boris Behrens  wrote:
>> >
>> >
>> >
>> > On Thu, Apr 22, 2021 at 4:43 PM Ilya Dryomov wrote:
>> >>
>> >> On Thu, Apr 22, 2021 at 4:20 PM Boris Behrens  wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > I have a customer VM that is running fine, but I can not make snapshots
>> >> > anymore.
>> >> > rbd snap create rbd/IMAGE@test-bb-1
>> >> > just hangs forever.
>> >>
>> >> Hi Boris,
>> >>
>> >> Run
>> >>
>> >> $ rbd snap create rbd/IMAGE@test-bb-1 --debug-ms=1 --debug-rbd=20
>> >>
>> >> let it hang for a few minutes and attach the output.
>> >
>> >
>> > I just pasted a short snip here: https://pastebin.com/B3Xgpbzd
>> > If you need more I can give it to you, but the output is very large.
>>
>> Paste the first couple thousand lines (i.e. from the very beginning),
>> that should be enough.
>>
> sure: https://pastebin.com/GsKpLbqG
>
> good luck :)

What is the output of "rbd status"?  I know you said it shows one
watcher, but I need to see it.

Thanks,

Ilya


[ceph-users] Re: rbd snap create now working and just hangs forever

2021-04-22 Thread Boris Behrens
On Thu, Apr 22, 2021 at 5:27 PM Ilya Dryomov wrote:

> On Thu, Apr 22, 2021 at 5:08 PM Boris Behrens  wrote:
> >
> >
> >
> > On Thu, Apr 22, 2021 at 4:43 PM Ilya Dryomov <idryo...@gmail.com> wrote:
> >>
> >> On Thu, Apr 22, 2021 at 4:20 PM Boris Behrens  wrote:
> >> >
> >> > Hi,
> >> >
> >> > I have a customer VM that is running fine, but I can not make
> >> > snapshots anymore.
> >> > rbd snap create rbd/IMAGE@test-bb-1
> >> > just hangs forever.
> >>
> >> Hi Boris,
> >>
> >> Run
> >>
> >> $ rbd snap create rbd/IMAGE@test-bb-1 --debug-ms=1 --debug-rbd=20
> >>
> >> let it hang for a few minutes and attach the output.
> >
> >
> > I just pasted a short snip here: https://pastebin.com/B3Xgpbzd
> > If you need more I can give it to you, but the output is very large.
>
> Paste the first couple thousand lines (i.e. from the very beginning),
> that should be enough.
>

sure: https://pastebin.com/GsKpLbqG

good luck :)

-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.


[ceph-users] Re: rbd snap create now working and just hangs forever

2021-04-22 Thread Ilya Dryomov
On Thu, Apr 22, 2021 at 5:08 PM Boris Behrens  wrote:
>
>
>
> On Thu, Apr 22, 2021 at 4:43 PM Ilya Dryomov wrote:
>>
>> On Thu, Apr 22, 2021 at 4:20 PM Boris Behrens  wrote:
>> >
>> > Hi,
>> >
>> > I have a customer VM that is running fine, but I can not make snapshots
>> > anymore.
>> > rbd snap create rbd/IMAGE@test-bb-1
>> > just hangs forever.
>>
>> Hi Boris,
>>
>> Run
>>
>> $ rbd snap create rbd/IMAGE@test-bb-1 --debug-ms=1 --debug-rbd=20
>>
>> let it hang for a few minutes and attach the output.
>
>
> I just pasted a short snip here: https://pastebin.com/B3Xgpbzd
> If you need more I can give it to you, but the output is very large.

Paste the first couple thousand lines (i.e. from the very beginning),
that should be enough.
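
If the full output was saved to a file, the beginning can be cut out with
head. A small sketch follows; the function name and the 2000-line default
are assumptions made for illustration.

```shell
#!/bin/sh
# trim_log INFILE OUTFILE [LINES]
# Keep only the beginning of a saved debug log. The default of 2000 lines is
# an assumed reading of "the first couple thousand lines".
trim_log() {
    head -n "${3:-2000}" "$1" >"$2"
}
```

For example: "trim_log snap-debug.log snap-debug-head.log".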

Thanks,

Ilya


[ceph-users] Re: rbd snap create now working and just hangs forever

2021-04-22 Thread Boris Behrens
On Thu, Apr 22, 2021 at 4:43 PM Ilya Dryomov wrote:

> On Thu, Apr 22, 2021 at 4:20 PM Boris Behrens  wrote:
> >
> > Hi,
> >
> > I have a customer VM that is running fine, but I can not make snapshots
> > anymore.
> > rbd snap create rbd/IMAGE@test-bb-1
> > just hangs forever.
>
> Hi Boris,
>
> Run
>
> $ rbd snap create rbd/IMAGE@test-bb-1 --debug-ms=1 --debug-rbd=20
>
> let it hang for a few minutes and attach the output.
>

I just pasted a short snip here: https://pastebin.com/B3Xgpbzd
If you need more I can give it to you, but the output is very large.

>
> >
> > When I checked the status with
> > rbd status rbd/IMAGE
> > it shows one watcher, the cpu node where the VM is running.
> >
> > What can I do to investigate further, without restarting the VM?
> > This is the only affected VM and it stopped working three days ago.
>
> Can you think of any event related to the cluster, that VM or the
> VM fleet in general that occurred three days ago?
>

We had an incident where the cpu nodes connected to the wrong cluster, but
this VM was not affected IIRC.

Cheers
 Boris

-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.


[ceph-users] Re: rbd snap create now working and just hangs forever

2021-04-22 Thread Ilya Dryomov
On Thu, Apr 22, 2021 at 4:20 PM Boris Behrens  wrote:
>
> Hi,
>
> I have a customer VM that is running fine, but I can not make snapshots
> anymore.
> rbd snap create rbd/IMAGE@test-bb-1
> just hangs forever.

Hi Boris,

Run

$ rbd snap create rbd/IMAGE@test-bb-1 --debug-ms=1 --debug-rbd=20

let it hang for a few minutes and attach the output.

>
> When I checked the status with
> rbd status rbd/IMAGE
> it shows one watcher, the cpu node where the VM is running.
>
> What can I do to investigate further, without restarting the VM?
> This is the only affected VM and it stopped working three days ago.

Can you think of any event related to the cluster, that VM or the
VM fleet in general that occurred three days ago?

Thanks,

Ilya