[ceph-users] Re: Cephadm cannot aquire lock

2021-09-02 Thread fcid

Hi David,

It looks like we are affected by the same bug, thanks for the hint.

We're running Pacific 16.2.0, and I'm looking forward to upgrading to 
the latest Pacific release, but the last upgrade I tried was not 
successful. In hindsight, the same bug was causing that problem.


Now, my (naïve) upgrade strategy would be to launch the upgrade process 
with the orchestrator and kill the stuck cephadm process whenever it 
shows up. Assuming there are no changes to the devices, I 
think it's going to work.
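
A minimal sketch of the detection half of that strategy, assuming a Linux `ps` from procps that supports the `etimes` (elapsed seconds) column. The names `find_stuck_cephadm` and `scan_host` and the one-hour threshold are hypothetical, chosen only for illustration. The sketch reports candidates and leaves the decision to kill them to the operator.

```python
import subprocess


def find_stuck_cephadm(ps_text, min_seconds=3600):
    """Return (pid, elapsed_seconds, command) for long-running cephadm ceph-volume calls.

    ps_text is expected to be the output of
    `ps -e -o pid= -o etimes= -o args=` (one process per line, no header).
    """
    stuck = []
    for line in ps_text.splitlines():
        parts = line.split(None, 2)  # pid, elapsed seconds, full command line
        if len(parts) < 3:
            continue
        pid, etimes, args = parts
        if not (pid.isdigit() and etimes.isdigit()):
            continue
        is_candidate = "cephadm" in args and "ceph-volume" in args
        if is_candidate and int(etimes) >= min_seconds:
            stuck.append((int(pid), int(etimes), args))
    return stuck


def scan_host(min_seconds=3600):
    """Run ps on the local host and return the suspicious processes."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "etimes=", "-o", "args="],
        capture_output=True, text=True, check=True,
    ).stdout
    return find_stuck_cephadm(out, min_seconds)
```

Calling `scan_host()` on each node returns the processes worth inspecting before anything is killed.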


Thanks again, kind regards.

On 02/09/2021 15:03, David Orman wrote:

It may be this:

https://tracker.ceph.com/issues/50526
https://github.com/alfredodeza/remoto/issues/62

Which we resolved with: https://github.com/alfredodeza/remoto/pull/63

What version of ceph are you running, and is it impacted by the above?

David

On Thu, Sep 2, 2021 at 9:53 AM fcid  wrote:

Hi Sebastian,

Following your suggestion, I've found this process:

/usr/bin/python3
/var/lib/ceph//cephadm.f77d9d71514a634758d4ad41ab6eef36d25386c99d8b365310ad41f9b74d5ce6
--image
ceph/ceph@sha256:9b04c0f15704c49591640a37c7adfd40ffad0a4b42fecb950c3407687cb4f29a
ceph-volume --fsid  -- lvm list --format json

That process had been running for more than 12 hours, so I killed it,
and then cephadm could acquire the lock. Shortly afterwards the process
started again, and I can see that it is running on all the nodes (we have 3
nodes). I tried running the same command on all the nodes from the
command line, and it works fine; here is the output:
https://pastebin.com/v58Nyxdx

What could be causing this process to get stuck when it is launched by the
orchestrator, given that launching it from the command line works fine?

Thank you, kind regards.

On 02/09/2021 05:19, Sebastian Wagner wrote:

On 31.08.21 at 04:05, fcid wrote:

Hi ceph community,

I'm having some trouble trying to delete an OSD.

I've been using cephadm in one of our clusters and it has worked fine,
but lately, after an OSD failure, I cannot delete the OSD using the
orchestrator. Since the orchestrator is not working (for some unknown
reason), I tried to delete the OSD manually using the following command:

ceph osd purge  --yes-i-really-mean-it

This command removed the OSD from the CRUSH map, but then the warning
CEPHADM_FAILED_DAEMON appeared. So the next step was to delete the daemon
on the server that used to host the failed OSD. The command I used
here was the following:

cephadm rm-daemon --name osd. --fsid 

But this command does not work because, according to the log, cephadm
cannot acquire the lock:

2021-08-30 21:50:09,712 DEBUG Lock 139899822730784 not acquired on
/run/cephadm/$FSID.lock, waiting 0.05 seconds ...
2021-08-30 21:50:09,762 DEBUG Acquiring lock 139899822730784 on
/run/cephadm/$FSID.lock
2021-08-30 21:50:09,763 DEBUG Lock 139899822730784 not acquired on
/run/cephadm/$FSID.lock, waiting 0.05 seconds ...
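
The repeating "not acquired ... waiting 0.05 seconds" lines are what a non-blocking lock attempt inside a polling loop looks like. A minimal sketch of that pattern, assuming an flock-style advisory lock on the lock file; this is not cephadm's actual code, and the function name `acquire_with_retry` and its parameters are hypothetical:

```python
import fcntl
import os
import time


def acquire_with_retry(path, timeout=10.0, poll=0.05):
    """Try to take an exclusive flock on path, polling until timeout.

    Returns the open file descriptor that holds the lock; the caller
    releases the lock by closing it.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    deadline = time.monotonic() + timeout
    while True:
        try:
            # LOCK_NB makes flock fail immediately instead of blocking.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            # Another open file description holds the lock.
            if time.monotonic() >= deadline:
                os.close(fd)
                raise TimeoutError(f"lock on {path} not acquired within {timeout}s")
            time.sleep(poll)  # corresponds to the "waiting 0.05 seconds" step
```

As long as some other process keeps the lock, a loop like this never gets past the retry step, which would match a stuck cephadm process holding the lock.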

The file /run/cephadm/$FSID.lock does exist. Can I safely remove it?
What should I check before doing so?

Yes, as long as you're sure that no other cephadm process is stuck
(check with `ps`).


I'd really appreciate any hint you can give regarding this matter.

Thanks! regards.


--
AltaVoz 
Fernando Cid
Operations Engineer
www.altavoz.net 
AltaVoz location
Viña del Mar: 2 Poniente 355 of 53
 | +56 32 276 8060

Santiago: Antonio Bellet 292 of 701
 | +56 2 2585 4264


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



[ceph-users] Re: Cephadm cannot aquire lock

2021-09-02 Thread David Orman
It may be this:

https://tracker.ceph.com/issues/50526
https://github.com/alfredodeza/remoto/issues/62

Which we resolved with: https://github.com/alfredodeza/remoto/pull/63

What version of ceph are you running, and is it impacted by the above?

David



[ceph-users] Re: Cephadm cannot aquire lock

2021-09-02 Thread fcid

Hi Sebastian,

Following your suggestion, I've found this process:

/usr/bin/python3 
/var/lib/ceph//cephadm.f77d9d71514a634758d4ad41ab6eef36d25386c99d8b365310ad41f9b74d5ce6 
--image 
ceph/ceph@sha256:9b04c0f15704c49591640a37c7adfd40ffad0a4b42fecb950c3407687cb4f29a 
ceph-volume --fsid  -- lvm list --format json


That process had been running for more than 12 hours, so I killed it, 
and then cephadm could acquire the lock. Shortly afterwards the process 
started again, and I can see that it is running on all the nodes (we have 3 
nodes). I tried running the same command on all the nodes from the 
command line, and it works fine; here is the output: 
https://pastebin.com/v58Nyxdx


What could be causing this process to get stuck when it is launched by the 
orchestrator, given that launching it from the command line works fine?


Thank you, kind regards.



[ceph-users] Re: Cephadm cannot aquire lock

2021-09-02 Thread Sebastian Wagner


On 31.08.21 at 04:05, fcid wrote:

Hi ceph community,

I'm having some trouble trying to delete an OSD.

I've been using cephadm in one of our clusters and it has worked fine, 
but lately, after an OSD failure, I cannot delete the OSD using the 
orchestrator. Since the orchestrator is not working (for some unknown 
reason), I tried to delete the OSD manually using the following command:


ceph osd purge  --yes-i-really-mean-it

This command removed the OSD from the CRUSH map, but then the warning 
CEPHADM_FAILED_DAEMON appeared. So the next step was to delete the daemon 
on the server that used to host the failed OSD. The command I used here 
was the following:


cephadm rm-daemon --name osd. --fsid 

But this command does not work because, according to the log, cephadm 
cannot acquire the lock:


2021-08-30 21:50:09,712 DEBUG Lock 139899822730784 not acquired on 
/run/cephadm/$FSID.lock, waiting 0.05 seconds ...
2021-08-30 21:50:09,762 DEBUG Acquiring lock 139899822730784 on 
/run/cephadm/$FSID.lock
2021-08-30 21:50:09,763 DEBUG Lock 139899822730784 not acquired on 
/run/cephadm/$FSID.lock, waiting 0.05 seconds ...


The file /run/cephadm/$FSID.lock does exist. Can I safely remove it? 
What should I check before doing so?


Yes, as long as you're sure that no other cephadm process is stuck 
(check with `ps`).
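
One way to back up the `ps` check is to probe the lock file itself. A minimal sketch, assuming the lock is an flock-style advisory lock (an assumption about cephadm's locking, not something confirmed in this thread); `lock_is_held` is a hypothetical helper name:

```python
import errno
import fcntl
import os


def lock_is_held(path):
    """Return True if some other open file description holds an flock on path."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except FileNotFoundError:
        return False  # no lock file, so nothing can be holding it
    try:
        # Non-blocking attempt: fails at once if the lock is taken.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        if exc.errno in (errno.EAGAIN, errno.EACCES):
            return True
        raise
    else:
        # The probe got the lock, so nobody held it; release it right away.
        fcntl.flock(fd, fcntl.LOCK_UN)
        return False
    finally:
        os.close(fd)
```

If this returns True for /run/cephadm/$FSID.lock, some process still holds the lock, which would suggest finding and dealing with that process before removing the file.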




I'd really appreciate any hint you can give regarding this matter.

Thanks! regards.


