[ceph-users] Re: CEPHADM_STRAY_DAEMON with iSCSI service

2021-12-08 Thread Paul Giralt (pgiralt)
https://tracker.ceph.com/issues/5
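
Until that fix lands, one interim option (assuming the stray tcmu-runner daemons really do belong to your iSCSI service) is to mute the warning rather than act on it, for example:

ceph health detail
ceph health mute CEPHADM_STRAY_DAEMON 1w   # TTL is illustrative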

-Paul

Sent from my iPhone

On Dec 8, 2021, at 8:00 AM, Robert Sander  wrote:

Hi,

i just upgraded to 16.2.7 and deployed an iSCSI service.

Now I get for each configured target three stray daemons
(tcmu-runner) that are not managed by cephadm:

HEALTH_WARN 6 stray daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_DAEMON: 6 stray daemon(s) not managed by cephadm
    stray daemon tcmu-runner.cephtest20:rbd/iscsi01 on host cephtest20 not managed by cephadm
    stray daemon tcmu-runner.cephtest20:rbd/iscsi02 on host cephtest20 not managed by cephadm
    stray daemon tcmu-runner.cephtest30:rbd/iscsi01 on host cephtest30 not managed by cephadm
    stray daemon tcmu-runner.cephtest30:rbd/iscsi02 on host cephtest30 not managed by cephadm
    stray daemon tcmu-runner.cephtest40:rbd/iscsi01 on host cephtest40 not managed by cephadm
    stray daemon tcmu-runner.cephtest40:rbd/iscsi02 on host cephtest40 not managed by cephadm

How can this be resolved?

Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-09-08 Thread Paul Giralt (pgiralt)
Thank you Xiubo. confirm=true worked and I was able to update via gwcli and 
then get everything reset back to normal again. I’m stable for now but still 
hoping that this fix can get in soon to make sure the crash doesn’t happen 
again.

Appreciate all your help on this.

-Paul


On Sep 6, 2021, at 7:29 AM, Xiubo Li <xiu...@redhat.com> wrote:



On 9/3/21 11:32 PM, Paul Giralt (pgiralt) wrote:


On Sep 3, 2021, at 4:28 AM, Xiubo Li <xiu...@redhat.com> wrote:

And TCMU runner shows 3 hosts up:

  services:
    mon: 5 daemons, quorum cxcto-c240-j27-01.cisco.com,cxcto-c240-j27-06,cxcto-c240-j27-10,cxcto-c240-j27-08,cxcto-c240-j27-12 (age 16m)
    mgr: cxcto-c240-j27-01.cisco.com.edgcpk(active, since 16m), standbys: cxcto-c240-j27-02.llzeit
    osd: 329 osds: 326 up (since 4m), 326 in (since 6d)
    tcmu-runner: 28 portals active (3 hosts)

Could you check on all the gateway nodes whether the tcmu-runner service is still
alive on each of them?

The status will be reported by the tcmu-runner service, not by ceph-iscsi.
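
A quick way to check that on each gateway host (a sketch, assuming the docker-based containers used elsewhere in this thread; container names will differ):

docker ps --filter name=tcmu             # is the tcmu-runner container still running?
docker logs --tail 50 <tcmu-container>   # <tcmu-container> is a placeholder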


That’s the issue I’m having now - I can’t get the iscsi services (both the api 
gateway and tcmu runner) to start on one of the 4 servers for some reason. 
Since I’m using cephadm to orchestrate the enabling / disabling of services on 
nodes, I first used cephadm to add all 4 gateways back. They all were running 
and gwcli allowed me to make a change to try and remove one portal from one 
target, however gwcli locked up when I did this. It looks like the 
configuration change took place, however after that event, now cephadm does not 
appear to be able to properly orchestrate the addition / removal of iscsi 
gateways. I’m in a state where it’s trying to run on 3 of the servers (02, 03, 
05) no matter what I do. If I set cephadm to run iscsi only on node 03, for 
example, it keeps running on 02 and 05 as well. If I set cephadm to run on all 
4 servers, it still only runs on 02, 03, and 05. It won’t start on 04 anymore. 
I’m not really sure how to see if it’s even trying, as I’m not sure how cephadm 
orchestrates the deployment of the containers.
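
A few commands that can show what the orchestrator thinks it is doing (a sketch; the iscsi.iscsi service name is taken from the 'ceph orch ls' output elsewhere in this thread):

ceph orch ls iscsi            # desired vs. running daemon counts for the service
ceph orch ps | grep iscsi     # where the iscsi daemons actually landed
ceph -W cephadm               # stream the cephadm/orchestrator log from the active mgr
ceph log last cephadm         # recent cephadm log events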


Have you tried "confirm=true" when deleting those two stale gateways?

For example in my setups, I powered off the node02:


$ gwcli

...

| o- gateways ......................................... [Up: 1/2, Portals: 2]
| | o- node01 ......................................... [172.16.219.128 (UP)]
| | o- node02 ......................................... [172.16.219.138 (UNKNOWN)]

...

/> iscsi-targets/iqn.2003-01.com.redhat.iscsi-gw:ceph-igw/gateways/ delete 
gateway_name=node02
Deleting gateway, node02

Could not contact node02. If the gateway is permanently down. Use confirm=true 
to force removal. WARNING: Forcing removal of a gateway that can still be 
reached by an initiator may result in data corruption.
/>
/> iscsi-targets/iqn.2003-01.com.redhat.iscsi-gw:ceph-igw/gateways/ delete 
gateway_name=node02 confirm=true
Deleting gateway, node02
/>


I could remove it without bringing the gateway node02 up.



Things seem to have gone from bad to worse. I can’t get back to a clean state
where I had 2 gateways running properly: I was able to delete a gateway from one
of the targets, but I can’t add it back because I can’t get all 4 gateways back
up, and that appears to be the only way that gwcli will work (sort of).

If you have any suggestions on how to get out of this mess I’d appreciate it.

Since ceph-iscsi can't connect to the stale gateways, it forbids you from
changing anything. Could you check whether the rbd-target-api service is alive?

Then you can try changing the 'gateway.conf' object to modify the configuration.

Let’s say that 2 of the servers were dead for some reason and there was no way
to get them back online. Is the only way to resolve it in that case to modify
gateway.conf? I’m a little nervous about doing this based on your last email 
saying to not mess with the file, but I was able to download it and it looks 
like modifying it would be relatively straightforward. Who is responsible for 
creating that file? I’m thinking what I should probably do is:

- Shut down ESXi cluster so there are no iSCSI accesses
- Tell cephadm to undeploy all iscsi gateways. If this doesn’t work (which it
probably won’t) then just stop the tcmu-runner and iscsi containers on all
servers so they’re not running.
- Modify gateway.conf to remove the gateways except for two
- Try to use cephadm to re-deploy on the two servers
- Bring back up the ESXi hosts.
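
For the gateway.conf step in that plan, a rough sketch of editing the object out of band (the 'rbd' pool follows Xiubo's rados example elsewhere in this thread; substitute whatever pool your deployment actually stores gateway.conf in, and treat this as risky, as already warned):

rados -p rbd get gateway.conf gateway.conf.bak      # pull a backup copy first
cp gateway.conf.bak gateway.conf.edited
# hand-edit gateway.conf.edited to drop the dead gateways, then write it back:
rados -p rbd put gateway.conf gateway.conf.edited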

Does this sound like a reasonable plan? I’m not sure if there is anything else 
to look at on the cephadm side to understand why the services are no longer 
being added/removed anymore.

-Paul



___

[ceph-users] Re: Ceph dashboard pointing to the wrong grafana server address in iframe

2021-09-08 Thread Paul Giralt (pgiralt)
Thanks Ernesto.

ceph dashboard set-grafana-api-url fixed the problem. I’m not sure how it got
set to the wrong server (I am using cephadm and I’m the only administrator), but
at least it’s fixed now, so I appreciate the help.

-Paul


On Sep 8, 2021, at 1:45 PM, Ernesto Puerta <epuer...@redhat.com> wrote:

Hi Paul,

You can check the currently set value with: [1]

$ ceph dashboard get-grafana-api-url

In some set-ups (multi-homed, proxied, ...), you might also need to set up the 
user-facing IP: [2]

$ ceph dashboard set-grafana-frontend-api-url <URL>

If you're running a Cephadm-deployed cluster, Cephadm takes care of that one 
(you may check Ceph audit logs to find whether someone else modified that 
setting). [3]

[1] 
https://docs.ceph.com/en/latest/mgr/dashboard/?highlight=dashboard#configuring-dashboard
[2] 
https://docs.ceph.com/en/latest/mgr/dashboard/?highlight=dashboard#alternative-url-for-browsers
[3] https://docs.ceph.com/en/latest/cephadm/monitoring/#networks-and-ports
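
As a concrete sketch of the commands above (the URL is purely illustrative, reusing the .196 manager from Paul's description):

$ ceph dashboard get-grafana-api-url
$ ceph dashboard set-grafana-api-url https://10.122.242.196:3000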

Kind Regards,
Ernesto


On Wed, Sep 8, 2021 at 6:59 PM Paul Giralt (pgiralt) <pgir...@cisco.com> wrote:
For some reason, the grafana dashboards in the dashboard are all pointing to a 
node that does not and has never run the grafana / Prometheus services. I’m not 
sure where this value is kept and how to change it back.

My two manager nodes are 10.122.242.196 and 10.122.242.198. For some reason, 
the HTML served by the dashboard running on either of those two nodes points to 
10.122.242.197. If I inspect the HTML and change the IP address for the iframe 
to the .196 address, it all works fine, so the issue is just with the dashboard 
for some reason thinking that it needs to point to .197. This was all working 
up until a few days ago.

Any idea where this value is stored and how to fix it?

-Paul


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph dashboard pointing to the wrong grafana server address in iframe

2021-09-08 Thread Paul Giralt (pgiralt)
For some reason, the grafana dashboards in the dashboard are all pointing to a 
node that does not and has never run the grafana / Prometheus services. I’m not 
sure where this value is kept and how to change it back.

My two manager nodes are 10.122.242.196 and 10.122.242.198. For some reason, 
the HTML served by the dashboard running on either of those two nodes points to 
10.122.242.197. If I inspect the HTML and change the IP address for the iframe 
to the .196 address, it all works fine, so the issue is just with the dashboard 
for some reason thinking that it needs to point to .197. This was all working 
up until a few days ago. 

Any idea where this value is stored and how to fix it? 

-Paul

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cephadm not properly adding / removing iscsi services anymore

2021-09-08 Thread Paul Giralt (pgiralt)
Thanks for the tip. I’ve just been using ‘docker exec -it <container-id>
/bin/bash’ to get into the containers, but those commands sound useful. I think
I’ll install cephadm on all nodes just for this. 
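
For reference, a sketch of those cephadm commands run on the gateway host itself (the daemon name is a placeholder):

cephadm ls                                  # daemons cephadm knows about on this host
cephadm enter --name iscsi.<host>.<suffix>  # shell inside that daemon's container
cephadm logs --name iscsi.<host>.<suffix>   # journald logs for that daemon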

Thanks again, 
-Paul


> On Sep 8, 2021, at 10:11 AM, Eugen Block  wrote:
> 
> Okay, I'm glad it worked!
> 
> 
>> At first I tried cephadm rm-daemon on the bootstrap node that I usually do 
>> all management from and it indicated that it could not remove the daemon:
>> 
>> [root@cxcto-c240-j27-01 ~]# cephadm rm-daemon --name 
>> iscsi.cxcto-c240-j27-04.lgqtxo --fsid 4a29e724-c4a6-11eb-b14a-5c838f8013a5
>> ERROR: Daemon not found: iscsi.cxcto-c240-j27-04.lgqtxo. See `cephadm ls`
>> 
>> When I would do ‘cephadm ls’ I only saw services running locally on that 
>> server, not the whole cluster. I’m not sure if this is expected or not.
> 
> As far as I can tell this is expected, yes. I have only a lab environment 
> with containers (we're still hesitating to upgrade to Octopus) but all 
> virtual nodes have cephadm installed, I thought that was a requirement, I may 
> be wrong though. But it definitely helps you to debug, for example with 
> 'cephadm enter --name <daemon>' you get a shell for that container or 
> 'cephadm logs --name <daemon>' you can inspect specific logs.
> 
> 
> Zitat von "Paul Giralt (pgiralt)" :
> 
>> Thanks Eugen.
>> 
>> At first I tried cephadm rm-daemon on the bootstrap node that I usually do 
>> all management from and it indicated that it could not remove the daemon:
>> 
>> [root@cxcto-c240-j27-01 ~]# cephadm rm-daemon --name 
>> iscsi.cxcto-c240-j27-04.lgqtxo --fsid 4a29e724-c4a6-11eb-b14a-5c838f8013a5
>> ERROR: Daemon not found: iscsi.cxcto-c240-j27-04.lgqtxo. See `cephadm ls`
>> 
>> When I would do ‘cephadm ls’ I only saw services running locally on that 
>> server, not the whole cluster. I’m not sure if this is expected or not. I 
>> installed cephadm on the cxcto-c240-j27-04 server and issued the command and 
>> it worked. It looks like when I did this, suddenly the containers on the 
>> other two servers that were not supposed to be running the iscsi gateway 
>> were removed and everything appeared to be back to normal. I then added back 
>> one server to the yaml file and applied it on the original bootstrap node 
>> and it got deployed properly, so it appears that everything is working 
>> again. Somehow deleting that daemon on the 04 server got everything working 
>> again.
>> 
>> Still not exactly sure why that fixed it, but at least it’s working again. 
>> Thanks for the suggestion.
>> 
>> -Paul
>> 
>> 
>>> On Sep 8, 2021, at 4:12 AM, Eugen Block  wrote:
>>> 
>>> If you only configured 1 iscsi gw but you see 3 running, have you tried to 
>>> destroy them with 'cephadm rm-daemon --name ...'? On the active MGR host 
>>> run 'journalctl -f' and you'll see plenty of information, it should also 
>>> contain information about the iscsi deployment. Or run 'cephadm logs --name 
>>> <daemon>'.
>>> 
>>> 
>>> Zitat von "Paul Giralt (pgiralt)" :
>>> 
>>>> This was working until recently and now seems to have stopped working. 
>>>> Running Pacific 16.2.5. When I modify the deployment YAML file for my 
>>>> iscsi gateways, the services are not being added or removed as requested. 
>>>> It’s as if the state is “stuck”.
>>>> 
>>>> At one point I had 4 iSCSI gateways: 02, 03, 04 and 05. Through some back 
>>>> and forth of deploying and undeploying, I ended up in a state where the 
>>>> services are running on servers 02, 03, and 05 no matter what I tell 
>>>> cephadm to do. For example, right now I have the following configuration:
>>>> 
>>>> service_type: iscsi
>>>> service_id: iscsi
>>>> placement:
>>>> hosts:
>>>>   - cxcto-c240-j27-03.cisco.com
>>>> spec:
>>>> pool: iscsi-config
>>>> … removed the rest of this file ….
>>>> 
>>>> However ceph orch ls shows this:
>>>> 
>>>> [root@cxcto-c240-j27-01 ~]# ceph orch ls
>>>> NAME   PORTSRUNNING  REFRESHED  AGE  
>>>> PLACEMENT
>>>> alertmanager   ?:9093,9094  1/1  9m ago 3M   
>>>> count:1
>>>> crash 15/15  10m ago3M   *
>>>> grafana?:3000   1/1  9m ago 

[ceph-users] Re: Cephadm not properly adding / removing iscsi services anymore

2021-09-08 Thread Paul Giralt (pgiralt)
Thanks Eugen. 

At first I tried cephadm rm-daemon on the bootstrap node that I usually do all 
management from and it indicated that it could not remove the daemon: 

[root@cxcto-c240-j27-01 ~]# cephadm rm-daemon --name 
iscsi.cxcto-c240-j27-04.lgqtxo --fsid 4a29e724-c4a6-11eb-b14a-5c838f8013a5
ERROR: Daemon not found: iscsi.cxcto-c240-j27-04.lgqtxo. See `cephadm ls`

When I would do ‘cephadm ls’ I only saw services running locally on that 
server, not the whole cluster. I’m not sure if this is expected or not. I 
installed cephadm on the cxcto-c240-j27-04 server and issued the command and it 
worked. It looks like when I did this, suddenly the containers on the other two 
servers that were not supposed to be running the iscsi gateway were removed and 
everything appeared to be back to normal. I then added back one server to the 
yaml file and applied it on the original bootstrap node and it got deployed 
properly, so it appears that everything is working again. Somehow deleting that 
daemon on the 04 server got everything working again. 

Still not exactly sure why that fixed it, but at least it’s working again. 
Thanks for the suggestion. 

-Paul


> On Sep 8, 2021, at 4:12 AM, Eugen Block  wrote:
> 
> If you only configured 1 iscsi gw but you see 3 running, have you tried to 
> destroy them with 'cephadm rm-daemon --name ...'? On the active MGR host run 
> 'journalctl -f' and you'll see plenty of information, it should also contain 
> information about the iscsi deployment. Or run 'cephadm logs --name 
> <daemon>'.
> 
> 
> Zitat von "Paul Giralt (pgiralt)" :
> 
>> This was working until recently and now seems to have stopped working. 
>> Running Pacific 16.2.5. When I modify the deployment YAML file for my iscsi 
>> gateways, the services are not being added or removed as requested. It’s as 
>> if the state is “stuck”.
>> 
>> At one point I had 4 iSCSI gateways: 02, 03, 04 and 05. Through some back 
>> and forth of deploying and undeploying, I ended up in a state where the 
>> services are running on servers 02, 03, and 05 no matter what I tell cephadm 
>> to do. For example, right now I have the following configuration:
>> 
>> service_type: iscsi
>> service_id: iscsi
>> placement:
>>  hosts:
>>- cxcto-c240-j27-03.cisco.com
>> spec:
>>  pool: iscsi-config
>> … removed the rest of this file ….
>> 
>> However ceph orch ls shows this:
>> 
>> [root@cxcto-c240-j27-01 ~]# ceph orch ls
>> NAME   PORTSRUNNING  REFRESHED  AGE  
>> PLACEMENT
>> alertmanager   ?:9093,9094  1/1  9m ago 3M   
>> count:1
>> crash 15/15  10m ago3M   *
>> grafana?:3000   1/1  9m ago 3M   
>> count:1
>> iscsi.iscsi 3/1  10m ago11m  
>> cxcto-c240-j27-03.cisco.com
>> mgr 2/2  9m ago 3M   
>> count:2
>> mon 5/5  9m ago 12d  
>> cxcto-c240-j27-01.cisco.com;cxcto-c240-j27-06.cisco.com;cxcto-c240-j27-08.cisco.com;cxcto-c240-j27-10.cisco.com;cxcto-c240-j27-12.cisco.com
>> node-exporter  ?:9100 15/15  10m ago3M   *
>> osd.dashboard-admin-1622750977792  0/15  -  3M   *
>> osd.dashboard-admin-1622751032319   326/341  10m ago3M   *
>> prometheus ?:9095   1/1  9m ago 3M   
>> count:1
>> 
>> Notice it shows 3/1 because the service is still running on 3 servers even 
>> though I’ve told it to only run on one. If I configure all 4 servers and 
>> apply (ceph orch apply) then I end up with 3/4 because server 04 never 
>> deploys. It’s as if something is “stuck”.
>> 
>> Any ideas where to look / log files that might help figure out what’s 
>> happening?
>> 
>> -Paul
>> 
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Cephadm not properly adding / removing iscsi services anymore

2021-09-07 Thread Paul Giralt (pgiralt)
This was working until recently and now seems to have stopped working. Running 
Pacific 16.2.5. When I modify the deployment YAML file for my iscsi gateways, 
the services are not being added or removed as requested. It’s as if the state 
is “stuck”. 

At one point I had 4 iSCSI gateways: 02, 03, 04 and 05. Through some back and 
forth of deploying and undeploying, I ended up in a state where the services 
are running on servers 02, 03, and 05 no matter what I tell cephadm to do. For 
example, right now I have the following configuration: 

service_type: iscsi
service_id: iscsi
placement:
  hosts:
- cxcto-c240-j27-03.cisco.com
spec:
  pool: iscsi-config
… removed the rest of this file …. 
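
For context, a spec like this gets (re)applied with something along these lines (file name is illustrative):

ceph orch apply -i iscsi.yaml
ceph orch ls iscsi --export     # confirm what the orchestrator actually recorded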

However ceph orch ls shows this: 

[root@cxcto-c240-j27-01 ~]# ceph orch ls
NAME   PORTSRUNNING  REFRESHED  AGE  
PLACEMENT
alertmanager   ?:9093,9094  1/1  9m ago 3M   count:1
crash 15/15  10m ago3M   *
grafana?:3000   1/1  9m ago 3M   count:1
iscsi.iscsi 3/1  10m ago11m  
cxcto-c240-j27-03.cisco.com
mgr 2/2  9m ago 3M   count:2
mon 5/5  9m ago 12d  
cxcto-c240-j27-01.cisco.com;cxcto-c240-j27-06.cisco.com;cxcto-c240-j27-08.cisco.com;cxcto-c240-j27-10.cisco.com;cxcto-c240-j27-12.cisco.com
node-exporter  ?:9100 15/15  10m ago3M   *
osd.dashboard-admin-1622750977792  0/15  -  3M   *
osd.dashboard-admin-1622751032319   326/341  10m ago3M   *
prometheus ?:9095   1/1  9m ago 3M   count:1

Notice it shows 3/1 because the service is still running on 3 servers even 
though I’ve told it to only run on one. If I configure all 4 servers and apply 
(ceph orch apply) then I end up with 3/4 because server 04 never deploys. It’s 
as if something is “stuck”. 

Any ideas where to look / log files that might help figure out what’s 
happening? 

-Paul

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-09-03 Thread Paul Giralt (pgiralt)


On Sep 3, 2021, at 4:28 AM, Xiubo Li <xiu...@redhat.com> wrote:

And TCMU runner shows 3 hosts up:

  services:
mon: 5 daemons, quorum 
cxcto-c240-j27-01.cisco.com,cxcto-c240-j27-06,cxcto-c240-j27-10,cxcto-c240-j27-08,cxcto-c240-j27-12
 (age 16m)
mgr: cxcto-c240-j27-01.cisco.com.edgcpk(active, since 16m), 
standbys: cxcto-c240-j27-02.llzeit
osd: 329 osds: 326 up (since 4m), 326 in (since 6d)
tcmu-runner: 28 portals active (3 hosts)

Could you check on all the gateway nodes whether the tcmu-runner service is still
alive on each of them?

The status will be reported by the tcmu-runner service, not by ceph-iscsi.


That’s the issue I’m having now - I can’t get the iscsi services (both the api 
gateway and tcmu runner) to start on one of the 4 servers for some reason. 
Since I’m using cephadm to orchestrate the enabling / disabling of services on 
nodes, I first used cephadm to add all 4 gateways back. They all were running 
and gwcli allowed me to make a change to try and remove one portal from one 
target, however gwcli locked up when I did this. It looks like the 
configuration change took place, however after that event, now cephadm does not 
appear to be able to properly orchestrate the addition / removal of iscsi 
gateways. I’m in a state where it’s trying to run on 3 of the servers (02, 03, 
05) no matter what I do. If I set cephadm to run iscsi only on node 03, for 
example, it keeps running on 02 and 05 as well. If I set cephadm to run on all 
4 servers, it still only runs on 02, 03, and 05. It won’t start on 04 anymore. 
I’m not really sure how to see if it’s even trying, as I’m not sure how cephadm 
orchestrates the deployment of the containers.



Things seem to have gone from bad to worse. I can’t get back to a clean state
where I had 2 gateways running properly: I was able to delete a gateway from one
of the targets, but I can’t add it back because I can’t get all 4 gateways back
up, and that appears to be the only way that gwcli will work (sort of).

If you have any suggestions on how to get out of this mess I’d appreciate it.

Since ceph-iscsi can't connect to the stale gateways, it forbids you from
changing anything. Could you check whether the rbd-target-api service is alive?

Then you can try changing the 'gateway.conf' object to modify the configuration.

Let’s say that 2 of the servers were dead for some reason and there was no way
to get them back online. Is the only way to resolve it in that case to modify
gateway.conf? I’m a little nervous about doing this based on your last email 
saying to not mess with the file, but I was able to download it and it looks 
like modifying it would be relatively straightforward. Who is responsible for 
creating that file? I’m thinking what I should probably do is:

- Shut down ESXi cluster so there are no iSCSI accesses
- Tell cephadm to undeploy all iscsi gateways. If this doesn’t work (which it
probably won’t) then just stop the tcmu-runner and iscsi containers on all
servers so they’re not running.
- Modify gateway.conf to remove the gateways except for two
- Try to use cephadm to re-deploy on the two servers
- Bring back up the ESXi hosts.

Does this sound like a reasonable plan? I’m not sure if there is anything else 
to look at on the cephadm side to understand why the services are no longer 
being added/removed anymore.

-Paul


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-09-01 Thread Paul Giralt (pgiralt)
Thanks. The problem is that when I start gwcli, I get this:

[root@cxcto-c240-j27-02 /]# gwcli
Warning: Could not load preferences file /root/.gwcli/prefs.bin.

2 gateways are inaccessible - updates will be disabled

2 gateways are inaccessible - updates will be disabled

2 gateways are inaccessible - updates will be disabled

I think because it thinks the two gateways are down, it doesn’t let you remove 
them. It’s a bit of a catch-22.

I tried re-adding the two missing gateways via cephadm so that they come back 
up and then tried deleting the gateways from gwcli, but that just locked up 
gwcli even though it actually does seem to have removed it from the 
configuration but now I’m in a really strange state. I can’t seem to get all 
the gateways up now and it looks like applying the configuration via cephadm is 
not actually changing the deployment of iscsi services. If I try to deploy to 
all 4 servers, I end up with servers 02, 03, and 05 deployed, but 04 never 
deploys. If I try to change the configuration to only deploy to 03, it still 
stays deployed on 02, 03, and 05. It’s like it’s stuck somewhere, but I’m not 
sure where to look.

I currently have the configuration to only enable one gateway, but in ‘ceph 
orch ls’ I can see that there are 3/1 running (so 3 running even though there 
should only be 1):

[root@cxcto-c240-j27-01 ~]# ceph orch ls
NAME   PORTSRUNNING  REFRESHED  AGE  
PLACEMENT
alertmanager   ?:9093,9094  1/1  2m ago 3M   count:1
crash 15/15  4m ago 3M   *
grafana?:3000   1/1  2m ago 3M   count:1
iscsi.iscsi 3/1  4m ago 9m   
cxcto-c240-j27-03.cisco.com

And TCMU runner shows 3 hosts up:

  services:
    mon: 5 daemons, quorum cxcto-c240-j27-01.cisco.com,cxcto-c240-j27-06,cxcto-c240-j27-10,cxcto-c240-j27-08,cxcto-c240-j27-12 (age 16m)
    mgr: cxcto-c240-j27-01.cisco.com.edgcpk(active, since 16m), standbys: cxcto-c240-j27-02.llzeit
    osd: 329 osds: 326 up (since 4m), 326 in (since 6d)
    tcmu-runner: 28 portals active (3 hosts)

Things seem to have gone from bad to worse. I can’t get back to a clean state
where I had 2 gateways running properly: I was able to delete a gateway from one
of the targets, but I can’t add it back because I can’t get all 4 gateways back
up, and that appears to be the only way that gwcli will work (sort of).

If you have any suggestions on how to get out of this mess I’d appreciate it.

-Paul




On Sep 1, 2021, at 9:17 PM, Xiubo Li <xiu...@redhat.com> wrote:


On 9/1/21 12:32 PM, Paul Giralt (pgiralt) wrote:


However, the gwcli command is still showing the other two gateways which are no 
longer enabled anymore. Where does this list of gateways get stored?

All of this configuration is stored in the "gateway.conf" object in the "rbd" pool.


How do I access this object? Is it a file or some kind of object store?



Just use the normal rados command:


# rados -p rbd ls
rbd_object_map.137335b78a72
rbd_header.137335b78a72
gateway.conf
rbd_directory
rbd_header.13750fee0be9ae
rbd_id.block2
rbd_object_map.13750fee0be9ae
rbd_object_map.1378bf4c6ef770
rbd_header.1378bf4c6ef770
rbd_id.block4
rbd_id.block3
# rados -p rbd get gateway.conf a.txt

But you'd better not touch this object manually here; it's risky. If you want
to change it, you'd better do that by using the REST API or the gwcli command.


 It appears that the two gateways that are no longer part of the cluster still 
appear as the owners of some of the LUNs:

/iscsi-targets> ls
o- iscsi-targets 
.
 [DiscoveryAuth: CHAP, Targets: 3]
  o- iqn.2001-07.com.ceph:1622752075720 
.. [Auth: CHAP, 
Gateways: 4]
  | o- disks 

 [Disks: 5]
  | | o- iscsi-pool-0001/iscsi-p0001-img-01 ... [Owner: cxcto-c240-j27-02.cisco.com, Lun: 0]
  | | o- iscsi-pool-0001/iscsi-p0001-img-02 ... [Owner: cxcto-c240-j27-04.cisco.com, Lun: 3]
  | | o- iscsi-pool-0003/iscsi-p0003-img-01 ... [Owner: cxcto-c240-j27-03.cisco.com, Lun: 1]
  | | o- iscsi-pool-0003/iscsi-p0003-img-02 ... [Owner: cxcto-c240-j27-05.cisco.com, Lun: 4]
  | | o- iscsi-pool-0005/iscsi-p0005-img-01 
.

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-31 Thread Paul Giralt (pgiralt)


However, the gwcli command is still showing the other two gateways which are no 
longer enabled anymore. Where does this list of gateways get stored?

All of this configuration is stored in the "gateway.conf" object in the "rbd" pool.


How do I access this object? Is it a file or some kind of object store?



 It appears that the two gateways that are no longer part of the cluster still 
appear as the owners of some of the LUNs:

/iscsi-targets> ls
o- iscsi-targets 
.
 [DiscoveryAuth: CHAP, Targets: 3]
  o- iqn.2001-07.com.ceph:1622752075720 
.. [Auth: CHAP, 
Gateways: 4]
  | o- disks 

 [Disks: 5]
  | | o- iscsi-pool-0001/iscsi-p0001-img-01 
... [Owner: 
cxcto-c240-j27-02.cisco.com, Lun: 0]
  | | o- iscsi-pool-0001/iscsi-p0001-img-02 
... [Owner: 
cxcto-c240-j27-04.cisco.com, Lun: 3]
  | | o- iscsi-pool-0003/iscsi-p0003-img-01 
... [Owner: 
cxcto-c240-j27-03.cisco.com, Lun: 1]
  | | o- iscsi-pool-0003/iscsi-p0003-img-02 
... [Owner: 
cxcto-c240-j27-05.cisco.com, Lun: 4]
  | | o- iscsi-pool-0005/iscsi-p0005-img-01 
... [Owner: 
cxcto-c240-j27-02.cisco.com, Lun: 2]
  | o- gateways 
..
 [Up: 2/4, Portals: 4]
  | | o- cxcto-c240-j27-02.cisco.com 
. 
[10.122.242.197 (UP)]
  | | o- cxcto-c240-j27-03.cisco.com 
. 
[10.122.242.198 (UP)]
  | | o- cxcto-c240-j27-04.cisco.com 
 
[10.122.242.199 (UNKNOWN)]
  | | o- cxcto-c240-j27-05.cisco.com 
 
[10.122.242.200 (UNKNOWN)]
  | o- host-groups 

 [Groups : 0]
  | o- hosts 

 [Auth: ACL_DISABLED, Hosts: 0]
  o- iqn.2001-07.com.ceph:1622752147345 
.. [Auth: CHAP, 
Gateways: 4]
  | o- disks 

 [Disks: 5]
  | | o- iscsi-pool-0002/iscsi-p0002-img-01 
... [Owner: 
cxcto-c240-j27-04.cisco.com, Lun: 0]
  | | o- iscsi-pool-0002/iscsi-p0002-img-02 
... [Owner: 
cxcto-c240-j27-02.cisco.com, Lun: 3]
  | | o- iscsi-pool-0004/iscsi-p0004-img-01 
... [Owner: 
cxcto-c240-j27-05.cisco.com, Lun: 1]
  | | o- iscsi-pool-0004/iscsi-p0004-img-02 
... [Owner: 
cxcto-c240-j27-03.cisco.com, Lun: 4]
  | | o- iscsi-pool-0006/iscsi-p0006-img-01 
... [Owner: 
cxcto-c240-j27-03.cisco.com, Lun: 2]
  | o- gateways 
..
 [Up: 2/4, Portals: 4]
  | | o- cxcto-c240-j27-02.cisco.com 
. 
[10.122.242.197 (UP)]
  | | o- cxcto-c240-j27-03.cisco.com 
. 
[10.122.242.198 (UP)]
  | | o- cxcto-c240-j27-04.cisco.com 
 
[10.122.242.199 (UNKNOWN)]
  | | o- cxcto-c240-j27-05.cisco.com 
 
[10.122.242.200 (UNKNOWN)]
  | o- host-groups 

 [Groups : 0]
  | o- hosts 

 [Auth: ACL_DISAB

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-31 Thread Paul Giralt (pgiralt)
Thank you. This is exactly what I was looking for. 

If I’m understanding correctly, what gets listed as the “owner” is what gets 
advertised via ALUA as the primary path, but the lock owner indicates which 
gateway currently owns the lock for that image and is allowed to pass traffic 
for that LUN, correct? 

BTW - it appears there is some other kind of bug. I’m using cephadm for 
bringing the iscsi gateways up/down. Right now I only have two that are 
configured and ‘ceph orch ls’ only shows two as expected: 

[root@cxcto-c240-j27-01 ~]# ceph orch ls
NAME   PORTSRUNNING  REFRESHED  AGE  
PLACEMENT
alertmanager   ?:9093,9094  1/1  8m ago 2M   count:1
crash 15/15  8m ago 2M   *
grafana?:3000   1/1  8m ago 2M   count:1
iscsi.iscsi 2/2  8m ago 81m  
cxcto-c240-j27-02.cisco.com;cxcto-c240-j27-03.cisco.com
mgr 2/2  8m ago 2M   count:2
mon 5/5  8m ago 5d   
cxcto-c240-j27-01.cisco.com;cxcto-c240-j27-06.cisco.com;cxcto-c240-j27-08.cisco.com;cxcto-c240-j27-10.cisco.com;cxcto-c240-j27-12.cisco.com
node-exporter  ?:9100 15/15  8m ago 2M   *
osd.dashboard-admin-1622750977792  0/15  -  2M   *
osd.dashboard-admin-1622751032319   326/341  8m ago 2M   *
prometheus ?:9095   1/1  8m ago 2M   count:1

However, the gwcli command is still showing the other two gateways which are no 
longer enabled anymore. Where does this list of gateways get stored? It appears 
that the two gateways that are no longer part of the cluster still appear as 
the owners of some of the LUNs: 

/iscsi-targets> ls
o- iscsi-targets 
.
 [DiscoveryAuth: CHAP, Targets: 3]
  o- iqn.2001-07.com.ceph:1622752075720 
.. [Auth: CHAP, 
Gateways: 4]
  | o- disks 

 [Disks: 5]
  | | o- iscsi-pool-0001/iscsi-p0001-img-01 
... [Owner: 
cxcto-c240-j27-02.cisco.com, Lun: 0]
  | | o- iscsi-pool-0001/iscsi-p0001-img-02 
... [Owner: 
cxcto-c240-j27-04.cisco.com, Lun: 3]
  | | o- iscsi-pool-0003/iscsi-p0003-img-01 
... [Owner: 
cxcto-c240-j27-03.cisco.com, Lun: 1]
  | | o- iscsi-pool-0003/iscsi-p0003-img-02 
... [Owner: 
cxcto-c240-j27-05.cisco.com, Lun: 4]
  | | o- iscsi-pool-0005/iscsi-p0005-img-01 
... [Owner: 
cxcto-c240-j27-02.cisco.com, Lun: 2]
  | o- gateways 
..
 [Up: 2/4, Portals: 4]
  | | o- cxcto-c240-j27-02.cisco.com 
. 
[10.122.242.197 (UP)]
  | | o- cxcto-c240-j27-03.cisco.com 
. 
[10.122.242.198 (UP)]
  | | o- cxcto-c240-j27-04.cisco.com 
 
[10.122.242.199 (UNKNOWN)]
  | | o- cxcto-c240-j27-05.cisco.com 
 
[10.122.242.200 (UNKNOWN)]
  | o- host-groups 

 [Groups : 0]
  | o- hosts 

 [Auth: ACL_DISABLED, Hosts: 0]
  o- iqn.2001-07.com.ceph:1622752147345 
.. [Auth: CHAP, 
Gateways: 4]
  | o- disks 

 [Disks: 5]
  | | o- iscsi-pool-0002/iscsi-p0002-img-01 
... [Owner: 
cxcto-c240-j27-04.cisco.com, Lun: 0]
  | | o- iscsi-pool-0002/iscsi-p0002-img-02 
... [Owner: 
cxcto-c240-j27-02.cisco.com, Lun: 3]
  | | o- iscsi-pool-0004/iscsi-p0004-img-01 
... [Owner: 
cxcto-c240-j27-05.cisco.com, Lun: 1]
  | | o- iscsi-pool-0004/iscsi-p0004-img-02 
... [Owner: 
cxcto-c240-j27-03.cisco.com, Lun: 4]
  | | o- iscsi-pool-0006/iscsi-p0006-img-01 
... [Owner: 
cxcto-c240-j27-03.cisco.com, Lun: 2]
  | o- gateways 
..
 [Up: 2/4, P

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-31 Thread Paul Giralt (pgiralt)
Xiubo,

Thank you for all the help so far. I was finally able to figure out what the 
trigger for the issue was and how to make sure it doesn’t happen - at least not 
in a steady state. There is still the possibility of running into the bug in a 
failover scenario of some kind, but at least for now I think I’m stable.

I now have two iSCSI gateways running now and I’m not seeing the locks flapping 
back and forth between the two after making a change on the ESXi cluster that 
I’ll describe below.

I have 50 ESXi hosts communicating with the Ceph cluster. What happened was 
that for some reason, some of the hosts did not see the full list of paths to 
all the iSCSI gateways. In my case, each host should have seen a total of 44 
paths for all the LUNs but some were only seeing 32 or 37 (or some other 
number). This meant that if one of the paths it wasn’t seeing happened to be 
the primary path, it was not using it and using another path instead. This 
appear to be what was causing the images to flap back and forth between the two 
gateways. Once I went through each host and manually rescanned the adapter to 
discover all the available paths after adding the second iSCSI gateway, 
everything stabilized. If even one host in the environment doesn’t see all the 
paths, this flapping occurs.

Am I right to assume that the iSCSI gateways automatically determine which LUN 
they will advertise being primary for? Is there a command that lets me view 
which gateway is primary for which LUN? I’m guessing when another gateway gets 
added, the calculation of who is primary for each LUN gets re-calculated and 
advertised out to the clients?

-Paul




I did a quick test where I re-enabled a second iSCSI gateway to take a closer 
look at the paths on the ESXi hosts and I definitely see that when the second 
path becomes available, different hosts are pointing to different gateways for 
the Active I/O Path.

I was reading on how ALUA works and as far as I can tell, isn’t CEPH supposed 
to indicate to the ESXi hosts which iSCSI gateway “owns” a given LUN at any 
point so that the hosts know which path to make active?

Yeah, the ceph-iscsi/tcmu-runner services will do that. It will report this to 
the clients.


Could there be something wrong where more than one iSCSI gateway is advertising 
that it owns the LUN to the ESXi hosts?


This has been tested and working well in Linux in production, and the logic has
not changed for several years.

I am not very sure how ESXi will handle this internally, but it should be in
compliance with the iSCSI protocol; in Linux, multipath can successfully detect
which path is active and will choose it.
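
On a Linux initiator that can be checked with something like (a sketch, not from the original exchange):

multipath -ll    # the path group currently in use shows status=active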

-Paul


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-30 Thread Paul Giralt (pgiralt)


On Aug 30, 2021, at 7:14 PM, Xiubo Li <xiu...@redhat.com> wrote:


We are using “Most Recently Used” - however there are 50 ESXi hosts all trying 
to access the same data stores, so it’s very possible that one host is choosing 
iSCSI gateway 1 and another host is choosing iSCSI gateway 2.

If so, you need to fix this.


I did a quick test where I re-enabled a second iSCSI gateway to take a closer 
look at the paths on the ESXi hosts and I definitely see that when the second 
path becomes available, different hosts are pointing to different gateways for 
the Active I/O Path.

I was reading on how ALUA works and as far as I can tell, isn’t CEPH supposed 
to indicate to the ESXi hosts which iSCSI gateway “owns” a given LUN at any 
point so that the hosts know which path to make active? Could there be 
something wrong where more than one iSCSI gateway is advertising that it owns 
the LUN to the ESXi hosts?

-Paul

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-30 Thread Paul Giralt (pgiralt)
Inline…


Usually there shouldn't be so many entries, and we have added some patches to
fix this.

When the exclusive lock is broken by a new gateway, the previous one is added to
the blocklist by Ceph. When the previous gateway's tcmu-runner detects that it
has been added to the blocklist, it will try to reopen the image, and the stale
blocklist entry is removed during that reopen.

Locally I tried and couldn't reproduce this issue.

Could you send me some of the blacklist entries you saw? What do they look like?


Sure - here is a short snippet of an example from last week that I logged:

[root@cxcto-c240-j27-01 ~]# ceph osd blacklist ls
10.122.242.199:0/4226293933 2021-08-27T20:39:44.061508+
10.122.242.198:0/2282309529 2021-08-27T20:39:44.192061+
10.122.242.197:0/3952967406 2021-08-27T20:39:43.240488+
10.122.242.197:0/2355855561 2021-08-27T20:39:43.272794+
10.122.242.197:0/1182316932 2021-08-27T20:39:43.029873+
10.122.242.199:0/2839589212 2021-08-27T20:39:43.086062+
10.122.242.197:0/1768780841 2021-08-27T20:39:43.068138+
10.122.242.197:0/3136295259 2021-08-27T20:39:42.175505+
10.122.242.197:0/2370363728 2021-08-27T20:39:42.051609+
10.122.242.200:0/3544501318 2021-08-27T20:39:42.270797+
10.122.242.199:0/2049723123 2021-08-27T20:39:39.951792+
10.122.242.197:0/244698804 2021-08-27T20:39:40.236649+
10.122.242.197:0/4246474017 2021-08-27T20:39:39.736140+
10.122.242.199:0/180742279 2021-08-27T20:39:39.071984+
10.122.242.197:0/301384989 2021-08-27T20:39:37.949623+
10.122.242.198:0/1054843518 2021-08-27T20:39:36.859075+
10.122.242.197:0/3257267535 2021-08-27T20:39:37.219640+
10.122.242.197:0/856045413 2021-08-27T20:39:35.812634+
10.122.242.197:0/3399533573 2021-08-27T20:39:35.815300+
10.122.242.199:0/2236736112 2021-08-27T20:39:34.643961+
10.122.242.200:0/1419077698 2021-08-27T20:39:34.674641+
10.122.242.197:0/3192807538 2021-08-27T20:39:34.234574+
10.122.242.197:0/3535150177 2021-08-27T20:39:34.222314+
10.122.242.197:0/214335566 2021-08-27T20:39:34.253096+
10.122.242.197:0/3969335486 2021-08-27T20:39:33.255463+
10.122.242.199:0/2070390422 2021-08-27T20:39:33.873459+
10.122.242.197:0/2291391951 2021-08-27T20:39:32.087530+
10.122.242.197:0/3306349369 2021-08-27T20:39:32.746714+
10.122.242.199:0/3740796140 2021-08-27T20:39:31.658943+
10.122.242.197:0/4099129478 2021-08-27T20:39:31.249495+
10.122.242.197:0/2514846855 2021-08-27T20:39:29.735692+
10.122.242.200:0/1580971106 2021-08-27T20:39:30.224904+

… snip ….

10.122.242.197:0/2755867789 2021-08-27T20:26:10.204159+
10.122.242.198:0/3288532389 2021-08-27T20:26:26.269538+
10.122.242.200:0/4167199222 2021-08-27T20:26:26.970351+
10.122.242.197:0/2795494436 2021-08-27T20:26:27.348841+
10.122.242.197:0/643616937 2021-08-27T20:26:27.452835+
10.122.242.197:0/549853104 2021-08-27T20:26:27.361418+
listed 14528 entries





The initiators accessing the iSCSI volumes are all VMWare ESXi hosts. Do you 
think it’s expected to see so much path switching in this kind of environment 
or perhaps I need to look at some parameters on the ESXi side to make it not 
switch so often.

What's the Path Selection Policy you are using ?

It can be either “Fixed”, “Most Recently Used” or “Round Robin” and it seems 
you are using the last one ?


We are using “Most Recently Used” - however there are 50 ESXi hosts all trying 
to access the same data stores, so it’s very possible that one host is choosing 
iSCSI gateway 1 and another host is choosing iSCSI gateway 2. Is only one 
gateway allowed to access an image at any given time? If so perhaps I need to 
hardcode the paths on the ESXi hosts so they all prefer one gateway and then 
only use the other to fail over.



Now we don’t have redundancy, but at least things are stable while we wait for 
a fix. Any chance this fix will make it into the 16.2.6 release?

Not sure, I am still waiting for someone to help me review them.

Ilya, would you be able to help?



- Xiubo



-Paul


On Aug 29, 2021, at 8:48 PM, Xiubo Li <xiu...@redhat.com> wrote:



On 8/27/21 11:10 PM, Paul Giralt (pgiralt) wrote:
Ok - thanks Xiubo. Not sure I feel comfortable doing that without breaking 
something else, so will wait for a new release that incorporates the fix. In 
the meantime I’m trying to figure out what might be triggering the issue, since 
this has been running fine for months and just recently started happening. Now 
it happens fairly regularly.

I noticed that in the tcmu logs, I see the following:

2021-08-27 15:06:40.158 8:ework-thread [ERROR] 
tcmu_rbd_service_status_update:140 rbd/iscsi-pool-0001.iscsi-p0001-img-01: 
Could not update service status. (Err -107)
2021-08-27 15:06:40.158 8:ework-thread [ERROR] __tcmu_report_event:173 
rbd/iscsi-pool-0001.iscsi-p0001-img-01: Could not report events. Error -107.
2021-08-27 15:06:41.131 8:io_context_pool [WARN] tcmu_notify_loc

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-29 Thread Paul Giralt (pgiralt)
Thanks Xiubo,

I actually had the same idea on Friday and I reduced the number of iSCSI 
gateways to 1 and the problem appears to have disappeared for now. I’m guessing 
there is still some chance it could happen, but it would be much rarer.

I did notice the blacklist was growing very large (over 14,000 entries) and I 
found 1503692 which appears to explain why those entries are growing so high, 
but like you said that doesn’t appear to be a problem in and of itself.

The initiators accessing the iSCSI volumes are all VMWare ESXi hosts. Do you 
think it’s expected to see so much path switching in this kind of environment 
or perhaps I need to look at some parameters on the ESXi side to make it not 
switch so often.

Now we don’t have redundancy, but at least things are stable while we wait for 
a fix. Any chance this fix will make it into the 16.2.6 release?

-Paul


On Aug 29, 2021, at 8:48 PM, Xiubo Li <xiu...@redhat.com> wrote:



On 8/27/21 11:10 PM, Paul Giralt (pgiralt) wrote:
Ok - thanks Xiubo. Not sure I feel comfortable doing that without breaking 
something else, so will wait for a new release that incorporates the fix. In 
the meantime I’m trying to figure out what might be triggering the issue, since 
this has been running fine for months and just recently started happening. Now 
it happens fairly regularly.

I noticed that in the tcmu logs, I see the following:

2021-08-27 15:06:40.158 8:ework-thread [ERROR] 
tcmu_rbd_service_status_update:140 rbd/iscsi-pool-0001.iscsi-p0001-img-01: 
Could not update service status. (Err -107)
2021-08-27 15:06:40.158 8:ework-thread [ERROR] __tcmu_report_event:173 
rbd/iscsi-pool-0001.iscsi-p0001-img-01: Could not report events. Error -107.
2021-08-27 15:06:41.131 8:io_context_pool [WARN] tcmu_notify_lock_lost:271 
rbd/iscsi-pool-0002.iscsi-p0002-img-02: Async lock drop. Old state 5
2021-08-27 15:06:41.147 8:cmdproc-uio9 [INFO] alua_implicit_transition:592 
rbd/iscsi-pool-0002.iscsi-p0002-img-02: Starting write lock acquisition 
operation.
2021-08-27 15:06:42.132 8:ework-thread [ERROR] 
tcmu_rbd_service_status_update:140 rbd/iscsi-pool-0002.iscsi-p0002-img-02: 
Could not update service status. (Err -107)
2021-08-27 15:06:42.132 8:ework-thread [ERROR] __tcmu_report_event:173 
rbd/iscsi-pool-0002.iscsi-p0002-img-02: Could not report events. Error -107.
2021-08-27 15:06:42.216 8:ework-thread [INFO] 
tcmu_rbd_rm_stale_entries_from_blacklist:340 
rbd/iscsi-pool-0001.iscsi-p0001-img-01: removing addrs: 
{10.122.242.197:0/2251669337}
2021-08-27 15:06:42.217 8:ework-thread [ERROR] 
tcmu_rbd_rm_stale_entry_from_blacklist:322 
rbd/iscsi-pool-0001.iscsi-p0001-img-01: Could not rm blacklist entry '�(~'. 
(Err -13)
2021-08-27 15:06:42.217 8:ework-thread [INFO] 
tcmu_rbd_rm_stale_entries_from_blacklist:340 
rbd/iscsi-pool-0001.iscsi-p0001-img-01: removing addrs: 
{10.122.242.197:0/3276725458}
2021-08-27 15:06:42.218 8:ework-thread [ERROR] 
tcmu_rbd_rm_stale_entry_from_blacklist:322 
rbd/iscsi-pool-0001.iscsi-p0001-img-01: Could not rm blacklist entry ''. (Err 
-13)
2021-08-27 15:06:42.443 8:io_context_pool [WARN] tcmu_notify_lock_lost:271 
rbd/iscsi-pool-0005.iscsi-p0005-img-01: Async lock drop. Old state 5
2021-08-27 15:06:42.459 8:cmdproc-uio0 [INFO] alua_implicit_transition:592 
rbd/iscsi-pool-0005.iscsi-p0005-img-01: Starting write lock acquisition 
operation.
2021-08-27 15:06:42.488 8:ework-thread [INFO] 
tcmu_rbd_rm_stale_entries_from_blacklist:340 
rbd/iscsi-pool-0005.iscsi-p0005-img-01: removing addrs: 
{10.122.242.197:0/2189482708}
2021-08-27 15:06:42.489 8:ework-thread [ERROR] 
tcmu_rbd_rm_stale_entry_from_blacklist:322 
rbd/iscsi-pool-0005.iscsi-p0005-img-01: Could not rm blacklist entry '`"�'. 
(Err -13)

The tcmu_rbd_service_status_update is showing up in there which is the code 
that is affected by this bug. Any idea what the error -107 means? Maybe if I 
fix what is causing some of these errors, it might work around the problem. 
Also if you have thoughts on the other blacklist entry errors and what might be 
causing them, that would be greatly appreciated as well.


There is one way to improve this, which is to make HA=1, but that won't avoid it
100%. I found that your case is triggered when the active paths are switching
between different gateways, which breaks and re-acquires the exclusive lock
frequently. The Error -107 means the image has been closed by tcmu-runner but
another thread is trying to use the freed connection to report the status. The
blocklist error should be okay; it won't affect anything, it's just a warning.


- Xiubo

-Paul


On Aug 26, 2021, at 8:37 PM, Xiubo Li <xiu...@redhat.com> wrote:

On 8/27/21 12:06 AM, Paul Giralt (pgiralt) wrote:
This is great. Is there a way to test the fix in my environment?


It seems you could restart the tcmu-runner service from the container.

Since this change not only in the handler_rbd.so but also th

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-27 Thread Paul Giralt (pgiralt)
Ok - thanks Xiubo. Not sure I feel comfortable doing that without breaking 
something else, so will wait for a new release that incorporates the fix. In 
the meantime I’m trying to figure out what might be triggering the issue, since 
this has been running fine for months and just recently started happening. Now 
it happens fairly regularly.

I noticed that in the tcmu logs, I see the following:

2021-08-27 15:06:40.158 8:ework-thread [ERROR] 
tcmu_rbd_service_status_update:140 rbd/iscsi-pool-0001.iscsi-p0001-img-01: 
Could not update service status. (Err -107)
2021-08-27 15:06:40.158 8:ework-thread [ERROR] __tcmu_report_event:173 
rbd/iscsi-pool-0001.iscsi-p0001-img-01: Could not report events. Error -107.
2021-08-27 15:06:41.131 8:io_context_pool [WARN] tcmu_notify_lock_lost:271 
rbd/iscsi-pool-0002.iscsi-p0002-img-02: Async lock drop. Old state 5
2021-08-27 15:06:41.147 8:cmdproc-uio9 [INFO] alua_implicit_transition:592 
rbd/iscsi-pool-0002.iscsi-p0002-img-02: Starting write lock acquisition 
operation.
2021-08-27 15:06:42.132 8:ework-thread [ERROR] 
tcmu_rbd_service_status_update:140 rbd/iscsi-pool-0002.iscsi-p0002-img-02: 
Could not update service status. (Err -107)
2021-08-27 15:06:42.132 8:ework-thread [ERROR] __tcmu_report_event:173 
rbd/iscsi-pool-0002.iscsi-p0002-img-02: Could not report events. Error -107.
2021-08-27 15:06:42.216 8:ework-thread [INFO] 
tcmu_rbd_rm_stale_entries_from_blacklist:340 
rbd/iscsi-pool-0001.iscsi-p0001-img-01: removing addrs: 
{10.122.242.197:0/2251669337}
2021-08-27 15:06:42.217 8:ework-thread [ERROR] 
tcmu_rbd_rm_stale_entry_from_blacklist:322 
rbd/iscsi-pool-0001.iscsi-p0001-img-01: Could not rm blacklist entry '�(~'. 
(Err -13)
2021-08-27 15:06:42.217 8:ework-thread [INFO] 
tcmu_rbd_rm_stale_entries_from_blacklist:340 
rbd/iscsi-pool-0001.iscsi-p0001-img-01: removing addrs: 
{10.122.242.197:0/3276725458}
2021-08-27 15:06:42.218 8:ework-thread [ERROR] 
tcmu_rbd_rm_stale_entry_from_blacklist:322 
rbd/iscsi-pool-0001.iscsi-p0001-img-01: Could not rm blacklist entry ''. (Err 
-13)
2021-08-27 15:06:42.443 8:io_context_pool [WARN] tcmu_notify_lock_lost:271 
rbd/iscsi-pool-0005.iscsi-p0005-img-01: Async lock drop. Old state 5
2021-08-27 15:06:42.459 8:cmdproc-uio0 [INFO] alua_implicit_transition:592 
rbd/iscsi-pool-0005.iscsi-p0005-img-01: Starting write lock acquisition 
operation.
2021-08-27 15:06:42.488 8:ework-thread [INFO] 
tcmu_rbd_rm_stale_entries_from_blacklist:340 
rbd/iscsi-pool-0005.iscsi-p0005-img-01: removing addrs: 
{10.122.242.197:0/2189482708}
2021-08-27 15:06:42.489 8:ework-thread [ERROR] 
tcmu_rbd_rm_stale_entry_from_blacklist:322 
rbd/iscsi-pool-0005.iscsi-p0005-img-01: Could not rm blacklist entry '`"�'. 
(Err -13)

The tcmu_rbd_service_status_update is showing up in there which is the code 
that is affected by this bug. Any idea what the error -107 means? Maybe if I 
fix what is causing some of these errors, it might work around the problem. 
Also if you have thoughts on the other blacklist entry errors and what might be 
causing them, that would be greatly appreciated as well.

-Paul


On Aug 26, 2021, at 8:37 PM, Xiubo Li <xiu...@redhat.com> wrote:

On 8/27/21 12:06 AM, Paul Giralt (pgiralt) wrote:
This is great. Is there a way to test the fix in my environment?


It seems you could restart the tcmu-runner service from the container.

Since this change is not only in handler_rbd.so but also in libtcmu.so and the
tcmu-runner binary, the whole of tcmu-runner needs to be built.

That means I am afraid you have to build and install it from source on the host 
and then restart the tcmu container.



-Paul


On Aug 26, 2021, at 11:05 AM, Xiubo Li <xiu...@redhat.com> wrote:


Hi Paul, Ilya,

I have fixed it in [1], please help review.

Thanks

[1] https://github.com/open-iscsi/tcmu-runner/pull/667


On 8/26/21 7:34 PM, Paul Giralt (pgiralt) wrote:
Thank you for the analysis. Can you think of a workaround for the issue?

-Paul

Sent from my iPhone

On Aug 26, 2021, at 5:17 AM, Xiubo Li <xiu...@redhat.com> wrote:



Hi Paul,

There is one racy case between updating the state to the Ceph cluster and
reopening the image (which closes and re-opens it). The crash should happen just
after the image was closed and the resources were released: if the work queue
then tries to update the state to the Ceph cluster, it triggers a use-after-free
bug.

I will try to fix it.

Thanks


On 8/26/21 10:40 AM, Paul Giralt (pgiralt) wrote:
I will send a unicast email with the link and details.

-Paul


On Aug 25, 2021, at 10:37 PM, Xiubo Li <xiu...@redhat.com> wrote:


Hi Paul,

Please send me the exact versions of the tcmu-runner and ceph-iscsi packages 
you are using.

Thanks


On 8/26/21 10:21 AM, Paul Giralt (pgiralt) wrote:
Thank you. I did find some coredump files. Is there a way I can send these to 
you to analyze?

[root@cxcto-c240-j27-02 coredump]# ls -asl
total 71292

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-26 Thread Paul Giralt (pgiralt)
This is great. Is there a way to test the fix in my environment?

-Paul


On Aug 26, 2021, at 11:05 AM, Xiubo Li <xiu...@redhat.com> wrote:


Hi Paul, Ilya,

I have fixed it in [1], please help review.

Thanks

[1] https://github.com/open-iscsi/tcmu-runner/pull/667


On 8/26/21 7:34 PM, Paul Giralt (pgiralt) wrote:
Thank you for the analysis. Can you think of a workaround for the issue?

-Paul

Sent from my iPhone

On Aug 26, 2021, at 5:17 AM, Xiubo Li <xiu...@redhat.com> wrote:



Hi Paul,

There is one racy case between updating the state to the Ceph cluster and
reopening the image (which closes and re-opens it). The crash should happen just
after the image was closed and the resources were released: if the work queue
then tries to update the state to the Ceph cluster, it triggers a use-after-free
bug.

I will try to fix it.

Thanks


On 8/26/21 10:40 AM, Paul Giralt (pgiralt) wrote:
I will send a unicast email with the link and details.

-Paul


On Aug 25, 2021, at 10:37 PM, Xiubo Li <xiu...@redhat.com> wrote:


Hi Paul,

Please send me the exact versions of the tcmu-runner and ceph-iscsi packages 
you are using.

Thanks


On 8/26/21 10:21 AM, Paul Giralt (pgiralt) wrote:
Thank you. I did find some coredump files. Is there a way I can send these to 
you to analyze?

[root@cxcto-c240-j27-02 coredump]# ls -asl
total 71292
0 drwxr-xr-x. 2 root root  176 Aug 25 18:31 .
0 drwxr-xr-x. 5 root root   70 Aug 10 11:31 ..
34496 -rw-r-. 1 root root 35316215 Aug 25 18:31 
core.tcmu-runner.0.3083bbc32b6a43acb768b85818414867.4523.162993068100.lz4
36796 -rw-r-. 1 root root 37671322 Aug 24 09:17 
core.tcmu-runner.0.baf25867590c40da87305e67d5b97751.4521.162981102200.lz4

[root@cxcto-c240-j27-03 coredump]# ls -asl
total 161188
4 drwxr-xr-x. 2 root root 4096 Aug 25 19:29 .
0 drwxr-xr-x. 5 root root   70 Aug 10 11:31 ..
45084 -rw-r-. 1 root root 46159860 Aug 25 19:29 
core.tcmu-runner.0.a276a2f5ee5a4d279917fd8c335c9b93.5281.162993419000.lz4
33468 -rw-r-. 1 root root 34263834 Aug 24 16:08 
core.tcmu-runner.0.a9df4a27b1ea43d09c6c254bb1e3447a.4209.162983573000.lz4
34212 -rw-r-. 1 root root 35027795 Aug 25 03:43 
core.tcmu-runner.0.cce93af5693444108993f0d48371197d.5564.162987741600.lz4
48420 -rw-r-. 1 root root 49574566 Aug 24 10:03 
core.tcmu-runner.0.e4f4ed6e35154c95b43f87b069380fbe.4091.162981383200.lz4

[root@cxcto-c240-j27-04 coredump]# ls -asl
total 359240
 4 drwxr-xr-x. 2 root root  4096 Aug 25 19:20 .
 0 drwxr-xr-x. 5 root root70 Aug 10 11:31 ..
 31960 -rw-r-. 1 root root  32720639 Aug 25 00:36 
core.tcmu-runner.0.115ba6ee7acb42b8acfe2a1a958b5367.34161.162986618200.lz4
 38516 -rw-r-. 1 root root  39435484 Aug 25 19:20 
core.tcmu-runner.0.4d43dd5cde9c4d44a96b2c744a9b43f4.4295.162993361500.lz4
 81012 -rw-r-. 1 root root  82951773 Aug 25 14:38 
core.tcmu-runner.0.6998ff9717cf4e96932349eacd1d81bc.4274.162991672000.lz4
 95872 -rw-r-. 1 root root  98165539 Aug 23 17:02 
core.tcmu-runner.0.9a28e301d6604d1a8eafbe12ae896c2f.4269.162975254700.lz4
111876 -rw-r-. 1 root root 114554583 Aug 24 11:41 
core.tcmu-runner.0.f9ea1331105b44f2b2f28dc0c1a7e653.5059.162981970500.lz4

[root@cxcto-c240-j27-05 coredump]# ls -asl
total 115720
0 drwxr-xr-x. 2 root root  261 Aug 25 16:47 .
0 drwxr-xr-x. 5 root root   70 Aug 10 11:31 ..
44720 -rw-r-. 1 root root 45786023 Aug 24 09:46 
core.tcmu-runner.0.530b308c30534b9aa4e7619ff1ab869c.4145.162981278700.lz4
35032 -rw-r-. 1 root root 35867165 Aug 24 17:52 
core.tcmu-runner.0.5afb87334bd741699c6fd44ceb031128.5672.162984193900.lz4
35968 -rw-r-. 1 root root 36826770 Aug 25 16:47 
core.tcmu-runner.0.da66f3f24a624426a75cbe20758be879.5339.162992443500.lz4
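
One way to pull a backtrace out of one of these (a sketch; the cores come from the tcmu containers, so the matching tcmu-runner binary and debug symbols should be taken from that container image, and /usr/bin/tcmu-runner is only the usual location):

lz4 -d core.tcmu-runner.0.da66f3f24a624426a75cbe20758be879.5339.162992443500.lz4 core.tcmu
gdb /usr/bin/tcmu-runner core.tcmu
(gdb) bt full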



-Paul


On Aug 25, 2021, at 10:14 PM, Xiubo Li <xiu...@redhat.com> wrote:



On 8/26/21 10:08 AM, Paul Giralt (pgiralt) wrote:
Thanks Xiubo. I will try this. How do I set the log level to 4?


It's in /etc/tcmu/tcmu.cfg in the tcmu container. No need to restart the
tcmu-runner service; the changes will be loaded by the tcmu-runner daemon after
tcmu.cfg is closed.
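
A minimal sketch of bumping the log level, assuming the same docker-based access used elsewhere in this thread (the container name is a placeholder and the log path is tcmu-runner's default, not confirmed here):

docker exec -it <tcmu-container> /bin/bash
# inside the container, set: log_level = 4   (5 for an experimental setup)
vi /etc/tcmu/tcmu.cfg
tail -f /var/log/tcmu-runner.log   # default location unless log_dir was changed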



-Paul


On Aug 25, 2021, at 9:30 PM, Xiubo Li <xiu...@redhat.com> wrote:

It's buggy; we need a way to export the tcmu-runner log to the host.

Could you see any crash coredump on the host?

Without that, could you keep running something like '$ tail -f
XYZ/tcmu-runner.log' in a console inside the tcmu containers, so we can see
whether we get any useful logs? At the same time, please set the log_level to 4.
If it's an experimental setup, you can just set the log_level to 5.

I am not confident we can get anything as useful as a coredump from
tcmu-runner.log, but at least we may get something else that gives us a clue.

- Xiubo






___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-26 Thread Paul Giralt (pgiralt)
Thank you for the analysis. Can you think of a workaround for the issue?

-Paul

Sent from my iPhone

On Aug 26, 2021, at 5:17 AM, Xiubo Li  wrote:



Hi Paul,

There is one racy case: while reopening the image (which closes and then reopens it), 
the state is also being updated to the Ceph cluster. The crash should happen just after 
the image is closed and its resources are released; if the work queue then tries to 
update the state to the Ceph cluster, it triggers a use-after-free bug.

I will try to fix it.

Thanks


On 8/26/21 10:40 AM, Paul Giralt (pgiralt) wrote:
I will send a unicast email with the link and details.

-Paul


On Aug 25, 2021, at 10:37 PM, Xiubo Li <xiu...@redhat.com> wrote:


Hi Paul,

Please send me the exact versions of the tcmu-runner and ceph-iscsi packages 
you are using.

Thanks


On 8/26/21 10:21 AM, Paul Giralt (pgiralt) wrote:
Thank you. I did find some coredump files. Is there a way I can send these to 
you to analyze?

[root@cxcto-c240-j27-02 coredump]# ls -asl
total 71292
0 drwxr-xr-x. 2 root root  176 Aug 25 18:31 .
0 drwxr-xr-x. 5 root root   70 Aug 10 11:31 ..
34496 -rw-r-. 1 root root 35316215 Aug 25 18:31 
core.tcmu-runner.0.3083bbc32b6a43acb768b85818414867.4523.162993068100.lz4
36796 -rw-r-. 1 root root 37671322 Aug 24 09:17 
core.tcmu-runner.0.baf25867590c40da87305e67d5b97751.4521.162981102200.lz4

[root@cxcto-c240-j27-03 coredump]# ls -asl
total 161188
4 drwxr-xr-x. 2 root root 4096 Aug 25 19:29 .
0 drwxr-xr-x. 5 root root   70 Aug 10 11:31 ..
45084 -rw-r-. 1 root root 46159860 Aug 25 19:29 
core.tcmu-runner.0.a276a2f5ee5a4d279917fd8c335c9b93.5281.162993419000.lz4
33468 -rw-r-. 1 root root 34263834 Aug 24 16:08 
core.tcmu-runner.0.a9df4a27b1ea43d09c6c254bb1e3447a.4209.162983573000.lz4
34212 -rw-r-. 1 root root 35027795 Aug 25 03:43 
core.tcmu-runner.0.cce93af5693444108993f0d48371197d.5564.162987741600.lz4
48420 -rw-r-. 1 root root 49574566 Aug 24 10:03 
core.tcmu-runner.0.e4f4ed6e35154c95b43f87b069380fbe.4091.162981383200.lz4

[root@cxcto-c240-j27-04 coredump]# ls -asl
total 359240
 4 drwxr-xr-x. 2 root root  4096 Aug 25 19:20 .
 0 drwxr-xr-x. 5 root root70 Aug 10 11:31 ..
 31960 -rw-r-. 1 root root  32720639 Aug 25 00:36 
core.tcmu-runner.0.115ba6ee7acb42b8acfe2a1a958b5367.34161.162986618200.lz4
 38516 -rw-r-. 1 root root  39435484 Aug 25 19:20 
core.tcmu-runner.0.4d43dd5cde9c4d44a96b2c744a9b43f4.4295.162993361500.lz4
 81012 -rw-r-. 1 root root  82951773 Aug 25 14:38 
core.tcmu-runner.0.6998ff9717cf4e96932349eacd1d81bc.4274.162991672000.lz4
 95872 -rw-r-. 1 root root  98165539 Aug 23 17:02 
core.tcmu-runner.0.9a28e301d6604d1a8eafbe12ae896c2f.4269.162975254700.lz4
111876 -rw-r-. 1 root root 114554583 Aug 24 11:41 
core.tcmu-runner.0.f9ea1331105b44f2b2f28dc0c1a7e653.5059.162981970500.lz4

[root@cxcto-c240-j27-05 coredump]# ls -asl
total 115720
0 drwxr-xr-x. 2 root root  261 Aug 25 16:47 .
0 drwxr-xr-x. 5 root root   70 Aug 10 11:31 ..
44720 -rw-r-. 1 root root 45786023 Aug 24 09:46 
core.tcmu-runner.0.530b308c30534b9aa4e7619ff1ab869c.4145.162981278700.lz4
35032 -rw-r-. 1 root root 35867165 Aug 24 17:52 
core.tcmu-runner.0.5afb87334bd741699c6fd44ceb031128.5672.162984193900.lz4
35968 -rw-r-. 1 root root 36826770 Aug 25 16:47 
core.tcmu-runner.0.da66f3f24a624426a75cbe20758be879.5339.162992443500.lz4



-Paul


On Aug 25, 2021, at 10:14 PM, Xiubo Li <xiu...@redhat.com> wrote:



On 8/26/21 10:08 AM, Paul Giralt (pgiralt) wrote:
Thanks Xiubo. I will try this. How do I set the log level to 4?


It's in /etc/tcmu/tcmu.cfg inside the tcmu container. There is no need to restart the 
tcmu-runner service; the changes will be picked up by the tcmu-runner daemon once the 
tcmu.cfg file is closed.



-Paul


On Aug 25, 2021, at 9:30 PM, Xiubo Li <xiu...@redhat.com> wrote:

It's buggy; we need a way to export the tcmu-runner log to the host.

Could you check whether there is any crash coredump on the host?

If not, could you keep running something like '$ tail -f 
XYZ/tcmu-runner.log' in a console inside the tcmu containers, so we can see whether 
we get any useful logs? At the same time, please set the log_level to 4. If it's 
an experimental setup, you can set the log_level to 5 instead.

I am not confident we will get a coredump out of tcmu-runner.log, but at least 
we may get something else that gives us a clue.

- Xiubo





___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-25 Thread Paul Giralt (pgiralt)
I will send a unicast email with the link and details.

-Paul


On Aug 25, 2021, at 10:37 PM, Xiubo Li <xiu...@redhat.com> wrote:


Hi Paul,

Please send me the exact versions of the tcmu-runner and ceph-iscsi packages 
you are using.

Thanks


On 8/26/21 10:21 AM, Paul Giralt (pgiralt) wrote:
Thank you. I did find some coredump files. Is there a way I can send these to 
you to analyze?

[root@cxcto-c240-j27-02 coredump]# ls -asl
total 71292
0 drwxr-xr-x. 2 root root  176 Aug 25 18:31 .
0 drwxr-xr-x. 5 root root   70 Aug 10 11:31 ..
34496 -rw-r-. 1 root root 35316215 Aug 25 18:31 
core.tcmu-runner.0.3083bbc32b6a43acb768b85818414867.4523.162993068100.lz4
36796 -rw-r-. 1 root root 37671322 Aug 24 09:17 
core.tcmu-runner.0.baf25867590c40da87305e67d5b97751.4521.162981102200.lz4

[root@cxcto-c240-j27-03 coredump]# ls -asl
total 161188
4 drwxr-xr-x. 2 root root 4096 Aug 25 19:29 .
0 drwxr-xr-x. 5 root root   70 Aug 10 11:31 ..
45084 -rw-r-. 1 root root 46159860 Aug 25 19:29 
core.tcmu-runner.0.a276a2f5ee5a4d279917fd8c335c9b93.5281.162993419000.lz4
33468 -rw-r-. 1 root root 34263834 Aug 24 16:08 
core.tcmu-runner.0.a9df4a27b1ea43d09c6c254bb1e3447a.4209.162983573000.lz4
34212 -rw-r-. 1 root root 35027795 Aug 25 03:43 
core.tcmu-runner.0.cce93af5693444108993f0d48371197d.5564.162987741600.lz4
48420 -rw-r-. 1 root root 49574566 Aug 24 10:03 
core.tcmu-runner.0.e4f4ed6e35154c95b43f87b069380fbe.4091.162981383200.lz4

[root@cxcto-c240-j27-04 coredump]# ls -asl
total 359240
 4 drwxr-xr-x. 2 root root  4096 Aug 25 19:20 .
 0 drwxr-xr-x. 5 root root70 Aug 10 11:31 ..
 31960 -rw-r-. 1 root root  32720639 Aug 25 00:36 
core.tcmu-runner.0.115ba6ee7acb42b8acfe2a1a958b5367.34161.162986618200.lz4
 38516 -rw-r-. 1 root root  39435484 Aug 25 19:20 
core.tcmu-runner.0.4d43dd5cde9c4d44a96b2c744a9b43f4.4295.162993361500.lz4
 81012 -rw-r-. 1 root root  82951773 Aug 25 14:38 
core.tcmu-runner.0.6998ff9717cf4e96932349eacd1d81bc.4274.162991672000.lz4
 95872 -rw-r-. 1 root root  98165539 Aug 23 17:02 
core.tcmu-runner.0.9a28e301d6604d1a8eafbe12ae896c2f.4269.162975254700.lz4
111876 -rw-r-. 1 root root 114554583 Aug 24 11:41 
core.tcmu-runner.0.f9ea1331105b44f2b2f28dc0c1a7e653.5059.162981970500.lz4

[root@cxcto-c240-j27-05 coredump]# ls -asl
total 115720
0 drwxr-xr-x. 2 root root  261 Aug 25 16:47 .
0 drwxr-xr-x. 5 root root   70 Aug 10 11:31 ..
44720 -rw-r-. 1 root root 45786023 Aug 24 09:46 
core.tcmu-runner.0.530b308c30534b9aa4e7619ff1ab869c.4145.162981278700.lz4
35032 -rw-r-. 1 root root 35867165 Aug 24 17:52 
core.tcmu-runner.0.5afb87334bd741699c6fd44ceb031128.5672.162984193900.lz4
35968 -rw-r-. 1 root root 36826770 Aug 25 16:47 
core.tcmu-runner.0.da66f3f24a624426a75cbe20758be879.5339.162992443500.lz4



-Paul


On Aug 25, 2021, at 10:14 PM, Xiubo Li <xiu...@redhat.com> wrote:



On 8/26/21 10:08 AM, Paul Giralt (pgiralt) wrote:
Thanks Xiubo. I will try this. How do I set the log level to 4?


It's in /etc/tcmu/tcmu.cfg inside the tcmu container. There is no need to restart the 
tcmu-runner service; the changes will be picked up by the tcmu-runner daemon once the 
tcmu.cfg file is closed.



-Paul


On Aug 25, 2021, at 9:30 PM, Xiubo Li <xiu...@redhat.com> wrote:

It's buggy; we need a way to export the tcmu-runner log to the host.

Could you check whether there is any crash coredump on the host?

If not, could you keep running something like '$ tail -f 
XYZ/tcmu-runner.log' in a console inside the tcmu containers, so we can see whether 
we get any useful logs? At the same time, please set the log_level to 4. If it's 
an experimental setup, you can set the log_level to 5 instead.

I am not confident we will get a coredump out of tcmu-runner.log, but at least 
we may get something else that gives us a clue.

- Xiubo





___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-25 Thread Paul Giralt (pgiralt)
Thank you. I did find some coredump files. Is there a way I can send these to 
you to analyze?

[root@cxcto-c240-j27-02 coredump]# ls -asl
total 71292
0 drwxr-xr-x. 2 root root  176 Aug 25 18:31 .
0 drwxr-xr-x. 5 root root   70 Aug 10 11:31 ..
34496 -rw-r-. 1 root root 35316215 Aug 25 18:31 
core.tcmu-runner.0.3083bbc32b6a43acb768b85818414867.4523.162993068100.lz4
36796 -rw-r-. 1 root root 37671322 Aug 24 09:17 
core.tcmu-runner.0.baf25867590c40da87305e67d5b97751.4521.162981102200.lz4

[root@cxcto-c240-j27-03 coredump]# ls -asl
total 161188
4 drwxr-xr-x. 2 root root 4096 Aug 25 19:29 .
0 drwxr-xr-x. 5 root root   70 Aug 10 11:31 ..
45084 -rw-r-. 1 root root 46159860 Aug 25 19:29 
core.tcmu-runner.0.a276a2f5ee5a4d279917fd8c335c9b93.5281.162993419000.lz4
33468 -rw-r-. 1 root root 34263834 Aug 24 16:08 
core.tcmu-runner.0.a9df4a27b1ea43d09c6c254bb1e3447a.4209.162983573000.lz4
34212 -rw-r-. 1 root root 35027795 Aug 25 03:43 
core.tcmu-runner.0.cce93af5693444108993f0d48371197d.5564.162987741600.lz4
48420 -rw-r-. 1 root root 49574566 Aug 24 10:03 
core.tcmu-runner.0.e4f4ed6e35154c95b43f87b069380fbe.4091.162981383200.lz4

[root@cxcto-c240-j27-04 coredump]# ls -asl
total 359240
 4 drwxr-xr-x. 2 root root  4096 Aug 25 19:20 .
 0 drwxr-xr-x. 5 root root70 Aug 10 11:31 ..
 31960 -rw-r-. 1 root root  32720639 Aug 25 00:36 
core.tcmu-runner.0.115ba6ee7acb42b8acfe2a1a958b5367.34161.162986618200.lz4
 38516 -rw-r-. 1 root root  39435484 Aug 25 19:20 
core.tcmu-runner.0.4d43dd5cde9c4d44a96b2c744a9b43f4.4295.162993361500.lz4
 81012 -rw-r-. 1 root root  82951773 Aug 25 14:38 
core.tcmu-runner.0.6998ff9717cf4e96932349eacd1d81bc.4274.162991672000.lz4
 95872 -rw-r-. 1 root root  98165539 Aug 23 17:02 
core.tcmu-runner.0.9a28e301d6604d1a8eafbe12ae896c2f.4269.162975254700.lz4
111876 -rw-r-. 1 root root 114554583 Aug 24 11:41 
core.tcmu-runner.0.f9ea1331105b44f2b2f28dc0c1a7e653.5059.162981970500.lz4

[root@cxcto-c240-j27-05 coredump]# ls -asl
total 115720
0 drwxr-xr-x. 2 root root  261 Aug 25 16:47 .
0 drwxr-xr-x. 5 root root   70 Aug 10 11:31 ..
44720 -rw-r-. 1 root root 45786023 Aug 24 09:46 
core.tcmu-runner.0.530b308c30534b9aa4e7619ff1ab869c.4145.162981278700.lz4
35032 -rw-r-. 1 root root 35867165 Aug 24 17:52 
core.tcmu-runner.0.5afb87334bd741699c6fd44ceb031128.5672.162984193900.lz4
35968 -rw-r-. 1 root root 36826770 Aug 25 16:47 
core.tcmu-runner.0.da66f3f24a624426a75cbe20758be879.5339.162992443500.lz4



-Paul


On Aug 25, 2021, at 10:14 PM, Xiubo Li <xiu...@redhat.com> wrote:



On 8/26/21 10:08 AM, Paul Giralt (pgiralt) wrote:
Thanks Xiubo. I will try this. How do I set the log level to 4?


It's in /etc/tcmu/tcmu.cfg inside the tcmu container. There is no need to restart the 
tcmu-runner service; the changes will be picked up by the tcmu-runner daemon once the 
tcmu.cfg file is closed.



-Paul


On Aug 25, 2021, at 9:30 PM, Xiubo Li <xiu...@redhat.com> wrote:

It's buggy; we need a way to export the tcmu-runner log to the host.

Could you check whether there is any crash coredump on the host?

If not, could you keep running something like '$ tail -f 
XYZ/tcmu-runner.log' in a console inside the tcmu containers, so we can see whether 
we get any useful logs? At the same time, please set the log_level to 4. If it's 
an experimental setup, you can set the log_level to 5 instead.

I am not confident we will get a coredump out of tcmu-runner.log, but at least 
we may get something else that gives us a clue.

- Xiubo




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-25 Thread Paul Giralt (pgiralt)
Thanks Xiubo. I will try this. How do I set the log level to 4?

-Paul


On Aug 25, 2021, at 9:30 PM, Xiubo Li <xiu...@redhat.com> wrote:

It's buggy; we need a way to export the tcmu-runner log to the host.

Could you check whether there is any crash coredump on the host?

If not, could you keep running something like '$ tail -f 
XYZ/tcmu-runner.log' in a console inside the tcmu containers, so we can see whether 
we get any useful logs? At the same time, please set the log_level to 4. If it's 
an experimental setup, you can set the log_level to 5 instead.

I am not confident we will get a coredump out of tcmu-runner.log, but at least 
we may get something else that gives us a clue.

- Xiubo
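
For reference, one way to follow the tcmu-runner log from the host is roughly the following (the container name filter and the log path are assumptions; tcmu-runner writes to /var/log/tcmu-runner.log by default, but adjust to your deployment):

# find the tcmu-runner container on the gateway host
docker ps | grep tcmu
# tail its log from the host
docker exec -it <tcmu-container-id> tail -f /var/log/tcmu-runner.log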



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-25 Thread Paul Giralt (pgiralt)
If the tcmu-runner daemon has died, the above logs are expected. So we need to 
know what caused the tcmu-runner service to crash.

Xiubo


Thanks for the response Xiubo. How can I go about figuring out why the 
tcmu-runner daemon has died? Are there any logs I can pull that will give 
insight into why it's happening?

-Paul




-Paul


On Aug 25, 2021, at 2:44 PM, Paul Giralt (pgiralt) <pgir...@cisco.com> wrote:

Ilya / Xiubo,

The problem just re-occurred on one server and I ran the systemctl status 
command. You can see there are no tcmu-runner processes listed:

[root@cxcto-c240-j27-04 ~]# systemctl status
● cxcto-c240-j27-04.cisco.com
   State: running
Jobs: 0 queued
  Failed: 0 units
   Since: Wed 2021-08-25 01:26:00 EDT; 13h ago
  CGroup: /
  ├─docker
  │ ├─1c794e4dc591d5cf33318364c27d59dc9106418ca20d484d61cffc9f7168d691
  │ │ ├─6200 /sbin/docker-init -- /usr/bin/ceph-osd -n osd.32 -f 
--setuser ceph --setgroup ceph --default-log-to-file=false 
--default-log-to-stderr=true --default-log-stderr-prefix=debug
  │ │ └─6305 /usr/bin/ceph-osd -n osd.32 -f --setuser ceph --setgroup 
ceph --default-log-to-file=false --default-log-to-stderr=true 
--default-log-stderr-prefix=debug
  │ ├─3193fec6cab38d1276667d5ddd9c07365bc0c124841cececf7238b59beefb959
  │ │ ├─6080 /sbin/docker-init -- /usr/bin/ceph-osd -n osd.142 -f 
--setuser ceph --setgroup ceph --default-log-to-file=false 
--default-log-to-stderr=true --default-log-stderr-prefix=debug
  │ │ └─6259 /usr/bin/ceph-osd -n osd.142 -f --setuser ceph --setgroup 
ceph --default-log-to-file=false --default-log-to-stderr=true 
--default-log-stderr-prefix=debug
  │ ├─5043c4101899e723565dfe7e3bb3f869e518d1c19f83de64fceed3852c841da6
  │ │ ├─4204 /sbin/docker-init -- /usr/bin/rbd-target-api
  │ │ └─4331 /usr/bin/python3.6 -s /usr/bin/rbd-target-api
  │ ├─f872a81eade4f937e068fe0317681f477ed5c32d6fa1727f5f1558b3e784bcdb
  │ │ ├─6148 /sbin/docker-init -- /usr/bin/ceph-osd -n osd.162 -f 
--setuser ceph --setgroup ceph --default-log-to-file=false 
--default-log-to-stderr=true --default-log-stderr-prefix=debug
  │ │ └─6272 /usr/bin/ceph-osd -n osd.162 -f --setuser ceph --setgroup 
ceph --default-log-to-file=false --default-log-to-stderr=true 
--default-log-stderr-prefix=debug
  │ ├─8283765ab59f1aa6f8e55937c222bdd551dc9fc80d0ce7721120dd41c73ae5ba
  │ │ ├─6217 /sbin/docker-init -- /usr/bin/ceph-osd -n osd.100 -f 
--setuser ceph --setgroup ceph --default-log-to-file=false 
--default-log-to-stderr=true --default-log-stderr-prefix=debug
  │ │ └─6336 /usr/bin/ceph-osd -n osd.100 -f --setuser ceph --setgroup 
ceph --default-log-to-file=false --default-log-to-stderr=true 
--default-log-stderr-prefix=debug
  │ ├─ea560d46401771e5a3cb2d1934f932dc2b5d96cc23c42556205abbf9bd719b84
  │ │ ├─7236 /sbin/docker-init -- /usr/bin/ceph-osd -n osd.212 -f 
--setuser ceph --setgroup ceph --default-log-to-file=false 
--default-log-to-stderr=true --default-log-stderr-prefix=debug
  │ │ └─7422 /usr/bin/ceph-osd -n osd.212 -f --setuser ceph --setgroup 
ceph --default-log-to-file=false --default-log-to-stderr=true 
--default-log-stderr-prefix=debug
  │ ├─061d8b9e71bfda52f5b0a3627bd9da7d4cf0d3950fc29339841d31db2f91a84e
  │ │ ├─7254 /sbin/docker-init -- /usr/bin/ceph-osd -n osd.91 -f 
--setuser ceph --setgroup ceph --default-log-to-file=false 
--default-log-to-stderr=true --default-log-stderr-prefix=debug
  │ │ └─7421 /usr/bin/ceph-osd -n osd.91 -f --setuser ceph --setgroup 
ceph --default-log-to-file=false --default-log-to-stderr=true 
--default-log-stderr-prefix=debug
  │ ├─847855f0b5a60758e0e396b7f4d1884474712526fe1be33f05f2dd798269242c
  │ │ ├─7286 /sbin/docker-init -- /usr/bin/ceph-osd -n osd.110 -f 
--setuser ceph --setgroup ceph --default-log-to-file=false 
--default-log-to-stderr=true --default-log-stderr-prefix=debug
  │ │ └─7425 /usr/bin/ceph-osd -n osd.110 -f --setuser ceph --setgroup 
ceph --default-log-to-file=false --default-log-to-stderr=true 
--default-log-stderr-prefix=debug
  │ ├─13f8e85206f1fe1da978d72052b00eb5386ec27cfa8137609e40f48474420422
  │ │ ├─6227 /sbin/docker-init -- /usr/bin/ceph-osd -n osd.21 -f 
--setuser ceph --setgroup ceph --default-log-to-file=false 
--default-log-to-stderr=true --default-log-stderr-prefix=debug
  │ │ └─6335 /usr/bin/ceph-osd -n osd.21 -f --setuser ceph --setgroup 
ceph --default-log-to-file=false --default-log-to-stderr=true 
--default-log-stderr-prefix=debug
  │ ├─ab34680a337eaa3014bfa4495b716f42eb98481fb27eb625626ca76965ad8ee1
  │ │ ├─7233 /sbin/docker-init -- /usr/bin/ceph-osd -n osd.53 -f 
--setuser ceph --setgroup ceph --default-log-to-file=false 
--default-log-to-stderr=true --default-log-stderr-prefix=debug
  │ │ └─7420 /usr/bin/ceph

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-25 Thread Paul Giralt (pgiralt)
Ilya / Xiubo, 

The problem just re-occurred on one server and I ran the systemctl status 
command. You can see there are no tcmu-runner processes listed: 

[root@cxcto-c240-j27-04 ~]# systemctl status
● cxcto-c240-j27-04.cisco.com
State: running
 Jobs: 0 queued
   Failed: 0 units
Since: Wed 2021-08-25 01:26:00 EDT; 13h ago
   CGroup: /
   ├─docker
   │ ├─1c794e4dc591d5cf33318364c27d59dc9106418ca20d484d61cffc9f7168d691
   │ │ ├─6200 /sbin/docker-init -- /usr/bin/ceph-osd -n osd.32 -f 
--setuser ceph --setgroup ceph --default-log-to-file=false 
--default-log-to-stderr=true --default-log-stderr-prefix=debug
   │ │ └─6305 /usr/bin/ceph-osd -n osd.32 -f --setuser ceph --setgroup 
ceph --default-log-to-file=false --default-log-to-stderr=true 
--default-log-stderr-prefix=debug
   │ ├─3193fec6cab38d1276667d5ddd9c07365bc0c124841cececf7238b59beefb959
   │ │ ├─6080 /sbin/docker-init -- /usr/bin/ceph-osd -n osd.142 -f 
--setuser ceph --setgroup ceph --default-log-to-file=false 
--default-log-to-stderr=true --default-log-stderr-prefix=debug
   │ │ └─6259 /usr/bin/ceph-osd -n osd.142 -f --setuser ceph --setgroup 
ceph --default-log-to-file=false --default-log-to-stderr=true 
--default-log-stderr-prefix=debug
   │ ├─5043c4101899e723565dfe7e3bb3f869e518d1c19f83de64fceed3852c841da6
   │ │ ├─4204 /sbin/docker-init -- /usr/bin/rbd-target-api
   │ │ └─4331 /usr/bin/python3.6 -s /usr/bin/rbd-target-api
   │ ├─f872a81eade4f937e068fe0317681f477ed5c32d6fa1727f5f1558b3e784bcdb
   │ │ ├─6148 /sbin/docker-init -- /usr/bin/ceph-osd -n osd.162 -f 
--setuser ceph --setgroup ceph --default-log-to-file=false 
--default-log-to-stderr=true --default-log-stderr-prefix=debug
   │ │ └─6272 /usr/bin/ceph-osd -n osd.162 -f --setuser ceph --setgroup 
ceph --default-log-to-file=false --default-log-to-stderr=true 
--default-log-stderr-prefix=debug
   │ ├─8283765ab59f1aa6f8e55937c222bdd551dc9fc80d0ce7721120dd41c73ae5ba
   │ │ ├─6217 /sbin/docker-init -- /usr/bin/ceph-osd -n osd.100 -f 
--setuser ceph --setgroup ceph --default-log-to-file=false 
--default-log-to-stderr=true --default-log-stderr-prefix=debug
   │ │ └─6336 /usr/bin/ceph-osd -n osd.100 -f --setuser ceph --setgroup 
ceph --default-log-to-file=false --default-log-to-stderr=true 
--default-log-stderr-prefix=debug
   │ ├─ea560d46401771e5a3cb2d1934f932dc2b5d96cc23c42556205abbf9bd719b84
   │ │ ├─7236 /sbin/docker-init -- /usr/bin/ceph-osd -n osd.212 -f 
--setuser ceph --setgroup ceph --default-log-to-file=false 
--default-log-to-stderr=true --default-log-stderr-prefix=debug
   │ │ └─7422 /usr/bin/ceph-osd -n osd.212 -f --setuser ceph --setgroup 
ceph --default-log-to-file=false --default-log-to-stderr=true 
--default-log-stderr-prefix=debug
   │ ├─061d8b9e71bfda52f5b0a3627bd9da7d4cf0d3950fc29339841d31db2f91a84e
   │ │ ├─7254 /sbin/docker-init -- /usr/bin/ceph-osd -n osd.91 -f 
--setuser ceph --setgroup ceph --default-log-to-file=false 
--default-log-to-stderr=true --default-log-stderr-prefix=debug
   │ │ └─7421 /usr/bin/ceph-osd -n osd.91 -f --setuser ceph --setgroup 
ceph --default-log-to-file=false --default-log-to-stderr=true 
--default-log-stderr-prefix=debug
   │ ├─847855f0b5a60758e0e396b7f4d1884474712526fe1be33f05f2dd798269242c
   │ │ ├─7286 /sbin/docker-init -- /usr/bin/ceph-osd -n osd.110 -f 
--setuser ceph --setgroup ceph --default-log-to-file=false 
--default-log-to-stderr=true --default-log-stderr-prefix=debug
   │ │ └─7425 /usr/bin/ceph-osd -n osd.110 -f --setuser ceph --setgroup 
ceph --default-log-to-file=false --default-log-to-stderr=true 
--default-log-stderr-prefix=debug
   │ ├─13f8e85206f1fe1da978d72052b00eb5386ec27cfa8137609e40f48474420422
   │ │ ├─6227 /sbin/docker-init -- /usr/bin/ceph-osd -n osd.21 -f 
--setuser ceph --setgroup ceph --default-log-to-file=false 
--default-log-to-stderr=true --default-log-stderr-prefix=debug
   │ │ └─6335 /usr/bin/ceph-osd -n osd.21 -f --setuser ceph --setgroup 
ceph --default-log-to-file=false --default-log-to-stderr=true 
--default-log-stderr-prefix=debug
   │ ├─ab34680a337eaa3014bfa4495b716f42eb98481fb27eb625626ca76965ad8ee1
   │ │ ├─7233 /sbin/docker-init -- /usr/bin/ceph-osd -n osd.53 -f 
--setuser ceph --setgroup ceph --default-log-to-file=false 
--default-log-to-stderr=true --default-log-stderr-prefix=debug
   │ │ └─7420 /usr/bin/ceph-osd -n osd.53 -f --setuser ceph --setgroup 
ceph --default-log-to-file=false --default-log-to-stderr=true 
--default-log-stderr-prefix=debug
   │ ├─8a4bc8a41f3f3424ca95bd5710492022484ef4f00e97c1ee9d013577c5752f66
   │ │ ├─7217 /sbin/docker-init -- /usr/bin/ceph-osd -n osd.121 -f 
--setuser ceph --setgroup ceph --default-log-to-file=false 
--default-log-to-stderr=true --default-log-stderr-prefix=debug
   │ │ └─7345 /usr/bi

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-25 Thread Paul Giralt (pgiralt)

> 
> Does the node hang while shutting down or does it lock up so that you
> can't even issue the reboot command?
> 

It hangs when shutting down. I can SSH in and issue commands just fine and it 
takes the shutdown command and kicks me out, but it appears to never shut down 
as I can still ping the server until I power-cycle it. 


> The first place to look at is dmesg and "systemctl status".  cephadm
> wraps the services into systemd units so there should be a record of
> it terminating there.  If tcmu-runner is indeed crashing, Xiubo (CCed)
> might be able to help with debugging.

Thank you for the pointer. I’ll look at this next time it happens and send what 
I see. 

-Paul
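
For reference, a rough sketch of what that looks like on a cephadm host (the unit name pattern is an assumption based on how cephadm names its systemd units; substitute your cluster fsid and daemon name):

# kernel messages around the time of the crash
dmesg -T | tail -n 200
# cephadm wraps each daemon in a unit named ceph-<fsid>@<daemon>.service
systemctl list-units 'ceph-*iscsi*'
journalctl -u 'ceph-<fsid>@iscsi.iscsi.<host>.<suffix>.service' --since "1 hour ago"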

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] tcmu-runner crashing on 16.2.5

2021-08-24 Thread Paul Giralt (pgiralt)
I upgraded to Pacific 16.2.5 about a month ago and everything was working fine. 
Suddenly for the past few days I’ve started having the tcmu-runner container on 
my iSCSI gateways just disappear. I’m assuming this is because they have 
crashed. I deployed the services using cephadm / ceph orch in Docker 
containers. 

It appears that when the service crashes, the container just disappears and it 
doesn’t look like tcmu-runner is exporting logs anywhere, so I can’t figure out 
any way to determine the root cause of these failures. When this happens, it 
appears to cause issues where I can’t reboot the machine (Running CentOS 8) and 
I need to power-cycle the server to recover. 

I’m really not sure where to look to figure out why it’s suddenly failing. The 
failure is happening randomly on all 4 of the iSCSI gateways. Any pointers 
would be greatly appreciated. 

-Paul

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph status shows 'updating'

2021-08-20 Thread Paul Giralt (pgiralt)
Doesn’t look like it: 

[root@cxcto-c240-j27-01 ~]# ceph orch upgrade status
{
"target_image": null,
"in_progress": false,
"services_complete": [],
"progress": null,
"message": ""
}

I did an upgrade from 16.2.4 to 16.2.5 about 4 weeks ago when the 16.2.5 
release came out, but that seemed to have gone fine. 

The values even look weird in that output (the -1’s) but I’m really not sure 
what “-1 -> 14” is even trying to tell me. There are 15 servers in the cluster 
so maybe it’s trying to tell me something about servers that need to be 
completed - not sure. 

-Paul




> On Aug 20, 2021, at 12:16 PM, Eugen Block  wrote:
> 
> What is the output of 'ceph orch upgrade status'? Did you (maybe 
> accidentally) start an update? You can stop it with 'ceph orch upgrade stop'.
> 
> 
> Zitat von "Paul Giralt (pgiralt)" :
> 
>> The output of my ’ceph status’ shows the following:
>> 
>>  progress:
>>Updating node-exporter deployment (-1 -> 14) (0s)
>>  []
>>Updating crash deployment (-1 -> 14) (0s)
>>  []
>>Updating crash deployment (-1 -> 14) (0s)
>>  []
>>Updating node-exporter deployment (-1 -> 14) (0s)
>>  []
>>Updating node-exporter deployment (-1 -> 14) (0s)
>>  []
>>Updating crash deployment (-1 -> 14) (0s)
>>  []
>>Updating node-exporter deployment (-1 -> 14) (0s)
>>  []
>>Updating crash deployment (+1 -1 -> 14) (1s)
>>  [==..] (remaining: 1s)
>> 
>> This started showing up after I rebooted one of my nodes where the iscsi 
>> gateway was in a strange state. Everything seems to be up and running fine 
>> now, but I’m not sure what this output in ceph status is telling me. It 
>> seems to be stuck in in this state.
>> 
>> Any idea what it means and how to get it unstuck?
>> 
>> -Paul
>> 
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph status shows 'updating'

2021-08-20 Thread Paul Giralt (pgiralt)
The output of my ’ceph status’ shows the following: 

  progress:
Updating node-exporter deployment (-1 -> 14) (0s)
  []
Updating crash deployment (-1 -> 14) (0s)
  []
Updating crash deployment (-1 -> 14) (0s)
  []
Updating node-exporter deployment (-1 -> 14) (0s)
  []
Updating node-exporter deployment (-1 -> 14) (0s)
  []
Updating crash deployment (-1 -> 14) (0s)
  []
Updating node-exporter deployment (-1 -> 14) (0s)
  []
Updating crash deployment (+1 -1 -> 14) (1s)
  [==..] (remaining: 1s)

This started showing up after I rebooted one of my nodes where the iscsi 
gateway was in a strange state. Everything seems to be up and running fine now, 
but I’m not sure what this output in ceph status is telling me. It seems to be 
stuck in in this state. 

Any idea what it means and how to get it unstuck? 

-Paul

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MTU mismatch error in Ceph dashboard

2021-08-06 Thread Paul Giralt (pgiralt)
Thank you Ernesto.

Yes - so I see that all the eno1, eno2, and docker0 interfaces show up with an 
MTU of 1500 which is correct, but since these interfaces are not being used at 
all, they shouldn’t be flagged as a problem. I’ll just ignore the errors for 
now, but would be good to have a way to indicate that these interfaces are not 
being used.

-Paul


On Aug 6, 2021, at 12:45 PM, Ernesto Puerta <epuer...@redhat.com> wrote:

Hi Paul,

The Prometheus web UI is available at port 9095. It doesn't need any 
credentials to log in and you simply type the name of the metric 
("node_network_mtu_bytes") in the text box and you'll get the latest values:



As suggested, if you want to mute those alerts you can do that from the Cluster 
> Monitoring menu:



Kind Regards,
Ernesto



On Wed, Aug 4, 2021 at 10:07 PM Paul Giralt (pgiralt) <pgir...@cisco.com> wrote:
I’m seeing the same issue. I’m not familiar with where to access the 
“Prometheus UI”. Can you point me to some instructions on how to do this and 
I’ll gladly collect the output of that command.

FWIW, here are the interfaces on my machine:

1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group 
default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
   valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
   valid_lft forever preferred_lft forever
2: enp6s0:  mtu 9000 qdisc mq master 
bond0 state UP group default qlen 1000
link/ether 58:97:bd:8f:76:d8 brd ff:ff:ff:ff:ff:ff
3: enp7s0:  mtu 9000 qdisc mq master 
bond0 state UP group default qlen 1000
link/ether 58:97:bd:8f:76:d8 brd ff:ff:ff:ff:ff:ff
4: enp17s0:  mtu 9000 qdisc mq master 
bond1 state UP group default qlen 1000
link/ether 5c:83:8f:80:13:a4 brd ff:ff:ff:ff:ff:ff
5: enp18s0:  mtu 9000 qdisc mq master 
bond1 state UP group default qlen 1000
link/ether 5c:83:8f:80:13:a4 brd ff:ff:ff:ff:ff:ff
6: eno1:  mtu 1500 qdisc mq state UP group 
default qlen 1000
link/ether ec:bd:1d:08:87:8e brd ff:ff:ff:ff:ff:ff
7: eno2:  mtu 1500 qdisc mq state DOWN group 
default qlen 1000
link/ether ec:bd:1d:08:87:8f brd ff:ff:ff:ff:ff:ff
8: bond1:  mtu 9000 qdisc noqueue state 
UP group default qlen 1000
link/ether 5c:83:8f:80:13:a4 brd ff:ff:ff:ff:ff:ff
inet 10.9.192.196/24 brd 10.9.192.255 scope global noprefixroute bond1
   valid_lft forever preferred_lft forever
inet6 fe80::5e83:8fff:fe80:13a4/64 scope link noprefixroute
   valid_lft forever preferred_lft forever
9: bond0:  mtu 9000 qdisc noqueue state 
UP group default qlen 1000
link/ether 58:97:bd:8f:76:d8 brd ff:ff:ff:ff:ff:ff
inet 10.122.242.196/24 brd 10.122.242.255 scope global noprefixroute bond0
   valid_lft forever preferred_lft forever
inet6 fe80::5a97:bdff:fe8f:76d8/64 scope link noprefixroute
   valid_lft forever preferred_lft forever
10: docker0:  mtu 1500 qdisc noqueue state 
DOWN group default
link/ether 02:42:d2:19:1a:28 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
   valid_lft forever preferred_lft forever

I did notice that docker0 has an MTU of 1500 as do the eno1 and eno2 interfaces 
which I’m not using. I’m not sure if that’s related to the error. I’ve been 
meaning to try changing the MTU on the eno interfaces just to see if that makes 
a difference but haven’t gotten around to it.
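
For reference, raising the MTU on those unused interfaces would look roughly like this (a sketch; the nmcli connection names are assumptions, and only the nmcli change survives a reboot on a NetworkManager-managed host):

# one-off change
ip link set dev eno1 mtu 9000
ip link set dev eno2 mtu 9000
# persistent change with NetworkManager
nmcli connection modify eno1 802-3-ethernet.mtu 9000
nmcli connection up eno1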

-Paul


> On Aug 4, 2021, at 2:31 PM, Ernesto Puerta <epuer...@redhat.com> wrote:
>
> Hi J-P,
>
> Could you please go to the Prometheus UI and share the output of the
> following query "node_network_mtu_bytes"? That'd be useful to understand
> the issue. If you can open a tracker issue here:
> https://tracker.ceph.com/projects/dashboard/issues/new ?
>
> In the meantime you should be able to mute the alert (Cluster > Monitoring
>> Silences).
>
> Kind Regards,
> Ernesto
>
>
> On Wed, Aug 4, 2021 at 5:49 PM J-P Methot <jp.met...@planethoster.info> wrote:
>
>> Hi,
>>
>> We're running Ceph 16.2.5 Pacific and, in the ceph dashboard, we keep
>> getting a MTU mismatch alert. However, all our hosts have the same
>> network configuration:
>>
>> => bond0: mtu 9000 qdisc noqueue state UP group default qlen 1000
>> => vlan.24@bond0: mtu 9000 qdisc noqueue state UP group default qlen 1000
>>
>> Physical interfaces, bond and vlans are all set to 9000.
>>
>> The alert's message looks like this :
>>
>> Node node20 has a different MTU size (9000) than the median value on
>> device vlan.24.
>>
>> Is this a known Pacific bug? None of our other Ceph clusters does this
>>

[ceph-users] Re: MTU mismatch error in Ceph dashboard

2021-08-04 Thread Paul Giralt (pgiralt)
Actually, it's also complaining about docker0. I'm not sure how to change the MTU 
on that one though; it's not even up. 
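
For what it's worth, the docker0 bridge takes its MTU from the Docker daemon configuration rather than from the interface itself, so a rough sketch would be the following (note that restarting dockerd will bounce the Ceph containers on that host, so treat this as maintenance-window work):

# /etc/docker/daemon.json
{
  "mtu": 9000
}

systemctl restart docker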

-Paul


> On Aug 4, 2021, at 5:24 PM, Paul Giralt  wrote:
> 
> Yes - you’re right. It’s complaining about eno1 and eno2 which I’m not using. 
> I’ll change those and it will probably make the error go away. I’m guessing 
> something changed between 16.2.4 and 16.2.5 because I didn’t start seeing 
> this error until after the upgrade. 
> 
> -Paul
> 
> 
>> On Aug 4, 2021, at 5:09 PM, Kai Stian Olstad  wrote:
>> 
>> On 04.08.2021 22:06, Paul Giralt (pgiralt) wrote:
>>> I did notice that docker0 has an MTU of 1500 as do the eno1 and eno2
>>> interfaces which I’m not using. I’m not sure if that’s related to the
>>> error. I’ve been meaning to try changing the MTU on the eno interfaces
>>> just to see if that makes a difference but haven’t gotten around to
>>> it.
>> 
>> If you look at the message it says which interface it is.
>> 
>> It does check and report on all the interfaces, even those that are in DOWN
>> state, which it shouldn't.
>> 
>> 
>> -- 
>> Kai Stian Olstad
> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MTU mismatch error in Ceph dashboard

2021-08-04 Thread Paul Giralt (pgiralt)
Yes - you’re right. It’s complaining about eno1 and eno2 which I’m not using. 
I’ll change those and it will probably make the error go away. I’m guessing 
something changed between 16.2.4 and 16.2.5 because I didn’t start seeing this 
error until after the upgrade. 

-Paul


> On Aug 4, 2021, at 5:09 PM, Kai Stian Olstad  wrote:
> 
> On 04.08.2021 22:06, Paul Giralt (pgiralt) wrote:
>> I did notice that docker0 has an MTU of 1500 as do the eno1 and eno2
>> interfaces which I’m not using. I’m not sure if that’s related to the
>> error. I’ve been meaning to try changing the MTU on the eno interfaces
>> just to see if that makes a difference but haven’t gotten around to
>> it.
> 
> If you look at the message it says which interface it is.
> 
> It does check and report on all the interfaces, even those that are in DOWN
> state, which it shouldn't.
> 
> 
> -- 
> Kai Stian Olstad

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MTU mismatch error in Ceph dashboard

2021-08-04 Thread Paul Giralt (pgiralt)
I’m seeing the same issue. I’m not familiar with where to access the 
“Prometheus UI”. Can you point me to some instructions on how to do this and 
I’ll gladly collect the output of that command. 

FWIW, here are the interfaces on my machine: 

1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group 
default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
   valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
   valid_lft forever preferred_lft forever
2: enp6s0:  mtu 9000 qdisc mq master 
bond0 state UP group default qlen 1000
link/ether 58:97:bd:8f:76:d8 brd ff:ff:ff:ff:ff:ff
3: enp7s0:  mtu 9000 qdisc mq master 
bond0 state UP group default qlen 1000
link/ether 58:97:bd:8f:76:d8 brd ff:ff:ff:ff:ff:ff
4: enp17s0:  mtu 9000 qdisc mq master 
bond1 state UP group default qlen 1000
link/ether 5c:83:8f:80:13:a4 brd ff:ff:ff:ff:ff:ff
5: enp18s0:  mtu 9000 qdisc mq master 
bond1 state UP group default qlen 1000
link/ether 5c:83:8f:80:13:a4 brd ff:ff:ff:ff:ff:ff
6: eno1:  mtu 1500 qdisc mq state UP group 
default qlen 1000
link/ether ec:bd:1d:08:87:8e brd ff:ff:ff:ff:ff:ff
7: eno2:  mtu 1500 qdisc mq state DOWN group 
default qlen 1000
link/ether ec:bd:1d:08:87:8f brd ff:ff:ff:ff:ff:ff
8: bond1:  mtu 9000 qdisc noqueue state 
UP group default qlen 1000
link/ether 5c:83:8f:80:13:a4 brd ff:ff:ff:ff:ff:ff
inet 10.9.192.196/24 brd 10.9.192.255 scope global noprefixroute bond1
   valid_lft forever preferred_lft forever
inet6 fe80::5e83:8fff:fe80:13a4/64 scope link noprefixroute
   valid_lft forever preferred_lft forever
9: bond0:  mtu 9000 qdisc noqueue state 
UP group default qlen 1000
link/ether 58:97:bd:8f:76:d8 brd ff:ff:ff:ff:ff:ff
inet 10.122.242.196/24 brd 10.122.242.255 scope global noprefixroute bond0
   valid_lft forever preferred_lft forever
inet6 fe80::5a97:bdff:fe8f:76d8/64 scope link noprefixroute
   valid_lft forever preferred_lft forever
10: docker0:  mtu 1500 qdisc noqueue state 
DOWN group default
link/ether 02:42:d2:19:1a:28 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
   valid_lft forever preferred_lft forever

I did notice that docker0 has an MTU of 1500 as do the eno1 and eno2 interfaces 
which I’m not using. I’m not sure if that’s related to the error. I’ve been 
meaning to try changing the MTU on the eno interfaces just to see if that makes 
a difference but haven’t gotten around to it. 

-Paul


> On Aug 4, 2021, at 2:31 PM, Ernesto Puerta  wrote:
> 
> Hi J-P,
> 
> Could you please go to the Prometheus UI and share the output of the
> following query "node_network_mtu_bytes"? That'd be useful to understand
> the issue. If you can open a tracker issue here:
> https://tracker.ceph.com/projects/dashboard/issues/new ?
> 
> In the meantime you should be able to mute the alert (Cluster > Monitoring
>> Silences).
> 
> Kind Regards,
> Ernesto
> 
> 
> On Wed, Aug 4, 2021 at 5:49 PM J-P Methot 
> wrote:
> 
>> Hi,
>> 
>> We're running Ceph 16.2.5 Pacific and, in the ceph dashboard, we keep
>> getting a MTU mismatch alert. However, all our hosts have the same
>> network configuration:
>> 
>> => bond0: mtu 9000 qdisc noqueue state UP group default qlen 1000
>> => vlan.24@bond0: mtu 9000 qdisc noqueue state UP group default qlen 1000
>>
>> Physical interfaces, bond and vlans are all set to 9000.
>> 
>> The alert's message looks like this :
>> 
>> Node node20 has a different MTU size (9000) than the median value on
>> device vlan.24.
>> 
>> Is this a known Pacific bug? None of our other Ceph clusters does this
>> (they are running on Octopus/Nautilus).
>> 
>> --
>> 
>> Jean-Philippe Méthot
>> Senior Openstack system administrator
>> Administrateur système Openstack sénior
>> PlanetHoster inc.
>> 
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Redeploy iSCSI Gateway fail - 167 returned from docker run

2021-06-01 Thread Paul Giralt (pgiralt)
Ok - so looks like that 167 was a red herring. That was actually a valid 
result. The issue is the container was starting up then dying. 

Looks like this all went downhill as a result of me changing the name of an 
image (I had named it iscsi-img-005 instead of iscsi-img-0005 to match all my 
other image names). Looks like iSCSI gateway does not like that. It worked fine 
after the rename, but when I redeployed the gateways, they never came back up. 
I saw an error in the logs that indicated it was having trouble with that name. 

I managed to get things back up and running on one gateway after a lot of 
messing around getting it to the point where I could delete the configuration 
that it didn’t like. Right now I’m running on one gateway fine. I tried to 
scale back up to 4 servers and the other three all have different issues. One 
comes up, but when I try to provision a target to use it, I get an out of index 
error. On the 3rd the docker containers come up, but the gateway never shows 
up. The 4th the containers never come up because it fails with this message: 

subprocess.CalledProcessError: Command 'ceph -n 
client.iscsi.iscsi.cxcto-c240-j27-05.noraaw --conf /etc/ceph/ceph.conf osd 
blacklist rm 10.122.242.200:6977/1317769556' returned non-zero exit status 13.
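
Exit status 13 from the ceph CLI generally maps to EACCES (permission denied), so it is worth checking the caps on the key named in that error; roughly (the exact caps the iscsi daemons need are not spelled out here, so treat the comment as an assumption):

ceph auth get client.iscsi.iscsi.cxcto-c240-j27-05.noraaw
# the key has to be allowed to run "osd blacklist" commands against the mons,
# otherwise rbd-target-api fails exactly like this at startup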

I feel like it would probably be best to just wipe all the iscsi gateway 
configuration and start from scratch with the iSCSI configuration piece - 
however even when I remove the service (ceph orch rm iscsi.iscsi) the 
configuration appears to still be maintained. 

Where is all this configuration stored? Is there a way to completely remove it 
to start the iscsi gateways on a clean slate? 
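
In case it helps someone else hitting this: the ceph-iscsi gateways keep their shared state in a RADOS object called gateway.conf, stored in the pool named in the iscsi service spec (rbd by default), which is why it survives 'ceph orch rm'. A rough sketch, assuming the default pool:

# look at the stored gateway configuration
rados -p rbd ls | grep gateway.conf
rados -p rbd get gateway.conf - | python3 -m json.tool
# deleting it wipes ALL target/client/disk definitions; only do this to start completely clean
rados -p rbd rm gateway.conf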

-Paul


> On Jun 1, 2021, at 8:05 PM, Paul Giralt (pgiralt)  wrote:
> 
> CEPH 16.2.4. I was having an issue where I put a server into maintenance mode 
> and after doing so, the containers for the iSCSI gateway were not running, so 
> I decided to do a redeploy of the service. This caused all the servers 
> running iSCSI to get in a state where it looks like ceph orch was trying to 
> delete the container, but it was stuck. My only recourse was to reboot the 
> servers. I ended up doing a ‘ceph orch rm iscsi.iscsi’ to just remove the 
> services and then tried to redeploy. When I do this, I’m seeing the following 
> in the cephadm logs on the servers where the iscsi gateway is being deployed: 
> 
> 2021-06-01 19:48:15,110 INFO Deploy daemon 
> iscsi.iscsi.cxcto-c240-j27-02.zeypah ...
> 2021-06-01 19:48:15,111 DEBUG Running command: /bin/docker run --rm 
> --ipc=host --net=host --entrypoint stat --init -e 
> CONTAINER_IMAGE=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949
>  -e NODE_NAME=cxcto-c240-j27-02.cisco.com -e CEPH_USE_RANDOM_NONCE=1 
> docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949
>  -c %u %g /var/lib/ceph
> 2021-06-01 19:48:15,529 DEBUG stat: 167 167
> 
> Later in the logs I see: 
> 
> 2021-06-01 19:48:25,933 DEBUG Running command: /bin/docker inspect --format 
> {{.Id}},{{.Config.Image}},{{.Image}},{{.Created}},{{index .Config.Labels 
> "io.ceph.version"}} 
> ceph-a67d529e-ba7f-11eb-940b-5c838f8013a5-iscsi.iscsi.cxcto-c240-j27-02.zeypah
> 2021-06-01 19:48:25,984 DEBUG /bin/docker:
> 2021-06-01 19:48:25,984 DEBUG /bin/docker: Error: No such object: 
> ceph-a67d529e-ba7f-11eb-940b-5c838f8013a5-iscsi.iscsi.cxcto-c240-j27-02.zeypah
> 
> Obviously no such object because the container creation failed. 
> 
> If I try to run that command that is in the logs manually, I get: 
> 
> [root@cxcto-c240-j27-02 ceph]# /bin/docker run --rm --ipc=host --net=host 
> --entrypoint stat --init -e 
> CONTAINER_IMAGE=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949
>  -e NODE_NAME=cxcto-c240-j27-02.cisco.com -e CEPH_USE_RANDOM_NONCE=1 
> docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949
>  -c %u %g /var/lib/ceph
> stat: cannot stat '%g': No such file or directory
> 167
> 
> So the 167 seems to line up with what’s showing up in the script. I’m not 
> clear on what the deal is with the %g. What is supposed to be in that 
> placeholder? Any thoughts on why this is failing? 
> 
> Right now all my iSCSI gateways are down and basically my whole environment 
> is down as a result 🙁 
> 
> -Paul
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Redeploy iSCSI Gateway fail - 167 returned from docker run

2021-06-01 Thread Paul Giralt (pgiralt)
CEPH 16.2.4. I was having an issue where I put a server into maintenance mode 
and after doing so, the containers for the iSCSI gateway were not running, so I 
decided to do a redeploy of the service. This caused all the servers running 
iSCSI to get in a state where it looks like ceph orch was trying to delete the 
container, but it was stuck. My only recourse was to reboot the servers. I 
ended up doing a ‘ceph orch rm iscsi.iscsi’ to just remove the services and 
then tried to redeploy. When I do this, I’m seeing the following in the cephadm 
logs on the servers where the iscsi gateway is being deployed: 

2021-06-01 19:48:15,110 INFO Deploy daemon iscsi.iscsi.cxcto-c240-j27-02.zeypah 
...
2021-06-01 19:48:15,111 DEBUG Running command: /bin/docker run --rm --ipc=host 
--net=host --entrypoint stat --init -e 
CONTAINER_IMAGE=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949
 -e NODE_NAME=cxcto-c240-j27-02.cisco.com -e CEPH_USE_RANDOM_NONCE=1 
docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949
 -c %u %g /var/lib/ceph
2021-06-01 19:48:15,529 DEBUG stat: 167 167

Later in the logs I see: 

2021-06-01 19:48:25,933 DEBUG Running command: /bin/docker inspect --format 
{{.Id}},{{.Config.Image}},{{.Image}},{{.Created}},{{index .Config.Labels 
"io.ceph.version"}} 
ceph-a67d529e-ba7f-11eb-940b-5c838f8013a5-iscsi.iscsi.cxcto-c240-j27-02.zeypah
2021-06-01 19:48:25,984 DEBUG /bin/docker:
2021-06-01 19:48:25,984 DEBUG /bin/docker: Error: No such object: 
ceph-a67d529e-ba7f-11eb-940b-5c838f8013a5-iscsi.iscsi.cxcto-c240-j27-02.zeypah

Obviously no such object because the container creation failed. 

If I try to run that command that is in the logs manually, I get: 

[root@cxcto-c240-j27-02 ceph]# /bin/docker run --rm --ipc=host --net=host 
--entrypoint stat --init -e 
CONTAINER_IMAGE=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949
 -e NODE_NAME=cxcto-c240-j27-02.cisco.com -e CEPH_USE_RANDOM_NONCE=1 
docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949
 -c %u %g /var/lib/ceph
stat: cannot stat '%g': No such file or directory
167

So the 167 seems to line up with what’s showing up in the script. I’m not clear 
on what the deal is with the %g. What is supposed to be in that placeholder? 
Any thoughts on why this is failing? 
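
For what it's worth, '%u %g' are just stat format specifiers (numeric UID and GID); cephadm passes them to stat as a single quoted argument, and pasting the docker command back into a shell loses that quoting, which is why stat treats '%g' as a file name. With the quoting intact it looks roughly like this (167 is the uid/gid of the ceph user inside the container image):

stat -c '%u %g' /var/lib/ceph
# inside the ceph container this prints: 167 167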

Right now all my iSCSI gateways are down and basically my whole environment is 
down as a result 🙁 

-Paul

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Unable to delete disk from iSCSI target

2021-06-01 Thread Paul Giralt (pgiralt)
I’m trying to delete a disk from an iSCSI target so that I can remove the 
image, but running into an issue. If I try to delete it from the CEPH 
dashboard, I just get an error saying that the DELETE timed out after 45 
seconds. 

If I try to do it from gwcli, the command never returns: 

/iscsi-target...4075058/disks> delete iscsi-pool-001/iscsi-img-0003

After I enter that command, it just hangs indefinitely. 

I’ve tried to remove any dependencies on the disk by making sure it’s unmounted 
and detached from the ESXi hosts that are using the disk. 

I’m guessing it’s failing because there is some dependency on the disk that is 
causing the deletion to fail, but not sure what else to look at. Any thoughts? 
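
One more thing worth ruling out: gwcli will usually block on the delete while any client still has the image exported as a LUN, so it may help to dump the tree and detach it from the client first. A rough sketch (the client-side delete syntax is from memory and may differ between ceph-iscsi versions):

# dump the whole gateway configuration tree and check which clients still map the image
gwcli ls
# if a client still lists the LUN, detach it there before deleting the disk, e.g.
# /iscsi-targets/<target-iqn>/hosts/<client-iqn>> disk delete iscsi-pool-001/iscsi-img-0003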

-Paul

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io