On Sep 3, 2021, at 4:28 AM, Xiubo Li 
<xiu...@redhat.com<mailto:xiu...@redhat.com>> wrote:

And TCMU runner shows 3 hosts up:

  services:
    mon:         5 daemons, quorum 
cxcto-c240-j27-01.cisco.com<http://cxcto-c240-j27-01.cisco.com/>,cxcto-c240-j27-06,cxcto-c240-j27-10,cxcto-c240-j27-08,cxcto-c240-j27-12
 (age 16m)
    mgr:         cxcto-c240-j27-01.cisco.com.edgcpk(active, since 16m), 
standbys: cxcto-c240-j27-02.llzeit
    osd:         329 osds: 326 up (since 4m), 326 in (since 6d)
    tcmu-runner: 28 portals active (3 hosts)

Could you check on all the gateway nodes whether the tcmu-runner service is still 
alive on each of them?

That status is reported by the tcmu-runner service itself, not by ceph-iscsi.
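One way to check this (a hedged sketch; the daemon and service names below are placeholders, and the exact layout depends on how cephadm deployed the iscsi service) is to ask the orchestrator what it thinks is running, then verify on each gateway node directly:

```
# On a node with the admin keyring: list the iscsi daemons cephadm
# believes exist and their current state.
ceph orch ps --daemon-type iscsi

# On each gateway node: list what cephadm actually deployed there
# (the tcmu-runner container runs alongside rbd-target-api for an
# iscsi service).
cephadm ls | grep -Ei 'iscsi|tcmu'

# Check the systemd unit directly; <fsid> is the cluster fsid and
# <daemon-name> comes from the 'cephadm ls' output above.
systemctl status 'ceph-<fsid>@iscsi.<daemon-name>'
```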


That’s the issue I’m having now - I can’t get the iscsi services (both the API 
gateway and tcmu-runner) to start on one of the 4 servers for some reason. 
Since I’m using cephadm to orchestrate the enabling/disabling of services on 
nodes, I first used cephadm to add all 4 gateways back. They were all running, 
and gwcli allowed me to make a change to try to remove one portal from one 
target; however, gwcli locked up when I did this. It looks like the 
configuration change did take place, but after that event cephadm no longer 
appears able to properly orchestrate the addition/removal of iscsi gateways. 
I’m in a state where it’s trying to run on 3 of the servers (02, 03, 05) no 
matter what I do. If I set cephadm to run iscsi only on node 03, for example, 
it keeps running on 02 and 05 as well. If I set cephadm to run on all 4 
servers, it still only runs on 02, 03, and 05. It won’t start on 04 anymore. 
I’m not really sure how to see whether it’s even trying, as I don’t know how 
cephadm orchestrates the deployment of the containers.
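For what it’s worth, cephadm does expose a few places to look when placement isn’t converging. A sketch (the exact service name is whatever `ceph orch ls` reports for your iscsi service; treat these as starting points, not a guaranteed diagnosis):

```
# Show the iscsi service spec, including its placement, and how many
# daemons are running versus desired.
ceph orch ls --service-type iscsi --export

# Show which hosts the daemons actually landed on.
ceph orch ps --daemon-type iscsi

# Failed daemons usually surface as a health warning.
ceph health detail

# The cephadm mgr module logs its scheduling and deploy attempts to the
# cluster log; failures to start a daemon on a host show up here.
ceph log last cephadm

# On the problem host (04): check for leftover daemon state from a
# failed deploy.
cephadm ls
```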



Things seem to have gone from bad to worse now. I can’t get back to a clean 
state where I had 2 gateways running properly: I was able to delete a gateway 
from one of the targets, but I can’t add it back again, because I can’t get all 
4 gateways back up, and that appears to be the only way gwcli will work (sort 
of).

If you have any suggestions on how to get out of this mess I’d appreciate it.

Since ceph-iscsi couldn't connect to the stale gateways, it forbids you from 
changing anything. Could you check whether the rbd-target-api service is 
alive?
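A hedged sketch of checking this on each gateway node (the port, URL, and credentials below are the ceph-iscsi defaults and are assumptions - adjust to whatever your gateway settings specify):

```
# Is the rbd-target-api container up according to cephadm?
cephadm ls | grep -i iscsi
systemctl status 'ceph-<fsid>@iscsi.<daemon-name>'

# rbd-target-api listens on port 5000 by default (api_port); a
# TCP-level check avoids guessing credentials:
ss -tlnp | grep 5000

# If the API user/password are known, probe the REST endpoint directly
# (admin:admin and https are the defaults, assumed here):
curl --insecure --user admin:admin https://localhost:5000/api/config
```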

Then you can try editing 'gateway.conf' directly to fix the configuration.
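gateway.conf is a JSON object that rbd-target-api/gwcli stores as a RADOS object, in the rbd pool by default. A hedged sketch of pulling it out, editing it, and writing it back - take a backup first, and only do this with all gateway services stopped (the pool name is the default and may differ in your deployment):

```
# Fetch the current config object and keep a backup copy.
rados -p rbd get gateway.conf /tmp/gateway.conf
cp /tmp/gateway.conf /tmp/gateway.conf.bak

# It is JSON; inspect the "gateways" map and the per-target "portals"
# entries for the hosts being removed.
python3 -m json.tool /tmp/gateway.conf | less

# After editing, write it back, then restart the gateway services so
# they pick up the new config.
rados -p rbd put gateway.conf /tmp/gateway.conf
```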

Let’s say that 2 of the servers were dead for some reason and there was no way 
to get them back online. In that case, is modifying gateway.conf the only way 
to resolve it? I’m a little nervous about doing this, given your last email 
said not to mess with the file, but I was able to download it and it looks 
like modifying it would be relatively straightforward. Who is responsible for 
creating that file? I’m thinking what I should probably do is:

- Shut down the ESXi cluster so there are no iSCSI accesses
- Tell cephadm to undeploy all iscsi gateways. If this doesn’t work (which it 
probably won’t), just stop the tcmu-runner and iscsi containers on all 
servers so they’re not running.
- Modify gateway.conf to remove all gateways except for two
- Use cephadm to re-deploy on the two servers
- Bring the ESXi hosts back up.
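Under cephadm, the steps above roughly map to commands like the following (a sketch only - the service name, fsid, pool, and host names are placeholders, and the apply arguments should be checked against `ceph orch ls --export` for the existing spec before running anything):

```
# 1. Quiesce initiators (shut down the ESXi cluster) - manual step.

# 2. Ask cephadm to remove the iscsi service; if that stalls, stop the
#    units by hand on each gateway node.
ceph orch rm iscsi.<service-name>
# fallback, run per gateway node:
systemctl stop 'ceph-<fsid>@iscsi.<daemon-name>'

# 3. With everything stopped, edit gateway.conf (rados get/put against
#    the rbd pool) to drop the dead gateways.

# 4. Re-apply the service on just the two good hosts.
ceph orch apply iscsi <pool> <api_user> <api_password> \
    --placement="host02 host03"

# 5. Bring the ESXi hosts back up and rescan the iSCSI adapters.
```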

Does this sound like a reasonable plan? I’m not sure if there is anything else 
to look at on the cephadm side to understand why services are no longer being 
added/removed properly.

-Paul


_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
