On Sep 3, 2021, at 4:28 AM, Xiubo Li 
<xiu...@redhat.com<mailto:xiu...@redhat.com>> wrote:

And TCMU runner shows 3 hosts up:

  services:
    mon:         5 daemons, quorum 
cxcto-c240-j27-01.cisco.com<http://cxcto-c240-j27-01.cisco.com/>,cxcto-c240-j27-06,cxcto-c240-j27-10,cxcto-c240-j27-08,cxcto-c240-j27-12
 (age 16m)
    mgr:         cxcto-c240-j27-01.cisco.com.edgcpk(active, since 16m), 
standbys: cxcto-c240-j27-02.llzeit
    osd:         329 osds: 326 up (since 4m), 326 in (since 6d)
    tcmu-runner: 28 portals active (3 hosts)

Could you check on all the gateway nodes whether the tcmu-runner service is still 
alive on each of them?

That status is reported by the tcmu-runner service itself, not by ceph-iscsi.
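One way to check this (a hedged sketch; the daemon and service names below are placeholders, and the exact layout depends on how cephadm deployed the iscsi service) is to ask the orchestrator what it thinks is running, then verify on each gateway node directly:

```
# On a node with the admin keyring: list the iscsi daemons cephadm
# believes exist and their current state.
ceph orch ps --daemon-type iscsi

# On each gateway node: list what cephadm actually deployed there
# (the tcmu-runner container runs alongside rbd-target-api for an
# iscsi service).
cephadm ls | grep -Ei 'iscsi|tcmu'

# Check the systemd unit directly; <fsid> is the cluster fsid and
# <daemon-name> comes from the 'cephadm ls' output above.
systemctl status 'ceph-<fsid>@iscsi.<daemon-name>'
```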


That’s the issue I’m having now - I can’t get the iscsi services (both the API 
gateway and tcmu-runner) to start on one of the 4 servers for some reason. 
Since I’m using cephadm to orchestrate the enabling/disabling of services on 
nodes, I first used cephadm to add all 4 gateways back. They were all running, 
and gwcli allowed me to make a change to try to remove one portal from one 
target; however, gwcli locked up when I did this. It looks like the 
configuration change did take place, but after that event cephadm no longer 
appears able to properly orchestrate the addition/removal of iscsi gateways. 
I’m in a state where it’s trying to run on 3 of the servers (02, 03, 05) no 
matter what I do. If I set cephadm to run iscsi only on node 03, for example, 
it keeps running on 02 and 05 as well. If I set cephadm to run on all 4 
servers, it still only runs on 02, 03, and 05. It won’t start on 04 anymore. 
I’m not really sure how to see whether it’s even trying, as I don’t know how 
cephadm orchestrates the deployment of the containers.
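For what it’s worth, cephadm does expose a few places to look when placement isn’t converging. A sketch (the exact service name is whatever `ceph orch ls` reports for your iscsi service; treat these as starting points, not a guaranteed diagnosis):

```
# Show the iscsi service spec, including its placement, and how many
# daemons are running versus desired.
ceph orch ls --service-type iscsi --export

# Show which hosts the daemons actually landed on.
ceph orch ps --daemon-type iscsi

# Failed daemons usually surface as a health warning.
ceph health detail

# The cephadm mgr module logs its scheduling and deploy attempts to the
# cluster log; failures to start a daemon on a host show up here.
ceph log last cephadm

# On the problem host (04): check for leftover daemon state from a
# failed deploy.
cephadm ls
```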



Things seem to have gone from bad to worse now. I can’t get back to a clean 
state where I had 2 gateways running properly: I was able to delete a gateway 
from one of the targets, but I can’t add it back again, because I can’t get all 
4 gateways back up, and that appears to be the only way gwcli will work (sort 
of).

If you have any suggestions on how to get out of this mess I’d appreciate it.

Since ceph-iscsi couldn't connect to the stale gateways, it forbids you from 
changing anything. Could you check whether the rbd-target-api service is 
alive?
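A hedged sketch of checking this on each gateway node (the port, URL, and credentials below are the ceph-iscsi defaults and are assumptions - adjust to whatever your gateway settings specify):

```
# Is the rbd-target-api container up according to cephadm?
cephadm ls | grep -i iscsi
systemctl status 'ceph-<fsid>@iscsi.<daemon-name>'

# rbd-target-api listens on port 5000 by default (api_port); a
# TCP-level check avoids guessing credentials:
ss -tlnp | grep 5000

# If the API user/password are known, probe the REST endpoint directly
# (admin:admin and https are the defaults, assumed here):
curl --insecure --user admin:admin https://localhost:5000/api/config
```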

Then you can try editing 'gateway.conf' directly to fix the configuration.
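gateway.conf is a JSON object that rbd-target-api/gwcli stores as a RADOS object, in the rbd pool by default. A hedged sketch of pulling it out, editing it, and writing it back - take a backup first, and only do this with all gateway services stopped (the pool name is the default and may differ in your deployment):

```
# Fetch the current config object and keep a backup copy.
rados -p rbd get gateway.conf /tmp/gateway.conf
cp /tmp/gateway.conf /tmp/gateway.conf.bak

# It is JSON; inspect the "gateways" map and the per-target "portals"
# entries for the hosts being removed.
python3 -m json.tool /tmp/gateway.conf | less

# After editing, write it back, then restart the gateway services so
# they pick up the new config.
rados -p rbd put gateway.conf /tmp/gateway.conf
```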

Let’s say that 2 of the servers were dead for some reason and there was no way 
to get them back online. In that case, is modifying gateway.conf the only way 
to resolve it? I’m a little nervous about doing this, given your last email 
said not to mess with the file, but I was able to download it and it looks 
like modifying it would be relatively straightforward. Who is responsible for 
creating that file? I’m thinking what I should probably do is:

- Shut down the ESXi cluster so there are no iSCSI accesses
- Tell cephadm to undeploy all iscsi gateways. If this doesn’t work (which it 
probably won’t), just stop the tcmu-runner and iscsi containers on all 
servers so they’re not running.
- Modify gateway.conf to remove all gateways except for two
- Use cephadm to re-deploy on the two servers
- Bring the ESXi hosts back up.
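Under cephadm, the steps above roughly map to commands like the following (a sketch only - the service name, fsid, pool, and host names are placeholders, and the apply arguments should be checked against `ceph orch ls --export` for the existing spec before running anything):

```
# 1. Quiesce initiators (shut down the ESXi cluster) - manual step.

# 2. Ask cephadm to remove the iscsi service; if that stalls, stop the
#    units by hand on each gateway node.
ceph orch rm iscsi.<service-name>
# fallback, run per gateway node:
systemctl stop 'ceph-<fsid>@iscsi.<daemon-name>'

# 3. With everything stopped, edit gateway.conf (rados get/put against
#    the rbd pool) to drop the dead gateways.

# 4. Re-apply the service on just the two good hosts.
ceph orch apply iscsi <pool> <api_user> <api_password> \
    --placement="host02 host03"

# 5. Bring the ESXi hosts back up and rescan the iSCSI adapters.
```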

Does this sound like a reasonable plan? I’m not sure if there is anything else 
to look at on the cephadm side to understand why services are no longer being 
added/removed properly.

-Paul


_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
