[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-25 Thread Ilya Dryomov
On Wed, Aug 25, 2021 at 7:02 AM Paul Giralt (pgiralt) wrote: > > I upgraded to Pacific 16.2.5 about a month ago and everything was working > fine. Suddenly for the past few days I’ve started having the tcmu-runner > container on my iSCSI gateways just disappear. I’m assuming this is because > t

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-25 Thread Paul Giralt (pgiralt)
> > Does the node hang while shutting down or does it lock up so that you > can't even issue the reboot command? > It hangs when shutting down. I can SSH in and issue commands just fine and it takes the shutdown command and kicks me out, but it appears to never shut down as I can still ping t

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-25 Thread Paul Giralt (pgiralt)
Ilya / Xiubo, The problem just re-occurred on one server and I ran the systemctl status command. You can see there are no tcmu-runner processes listed: [root@cxcto-c240-j27-04 ~]# systemctl status ● cxcto-c240-j27-04.cisco.com State: running Jobs: 0 queued Failed: 0 units Since

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-25 Thread Paul Giralt (pgiralt)
If the tcmu-runner daemon has died, the above logs are expected. So we need to know what has caused the tcmu-runner service's crash. Xiubo Thanks for the response Xiubo. How can I go about figuring out why the tcmu-runner daemon has died? Are there any logs I can pull that will give insight in

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-25 Thread Paul Giralt (pgiralt)
Thanks Xiubo. I will try this. How do I set the log level to 4? -Paul On Aug 25, 2021, at 9:30 PM, Xiubo Li wrote: It's buggy, we need one way to export the tcmu-runner log to the host. Could you see any crash coredump from the host? Without that could you keep ru

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-25 Thread Paul Giralt (pgiralt)
Thank you. I did find some coredump files. Is there a way I can send these to you to analyze? [root@cxcto-c240-j27-02 coredump]# ls -asl total 71292 0 drwxr-xr-x. 2 root root 176 Aug 25 18:31 . 0 drwxr-xr-x. 5 root root 70 Aug 10 11:31 .. 34496 -rw-r-. 1 root root 35316215

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-25 Thread Paul Giralt (pgiralt)
I will send a unicast email with the link and details. -Paul On Aug 25, 2021, at 10:37 PM, Xiubo Li wrote: Hi Paul, Please send me the detailed versions of the tcmu-runner and ceph-iscsi packages you are using. Thanks On 8/26/21 10:21 AM, Paul Giralt (pgiralt) wr

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-26 Thread Paul Giralt (pgiralt)
Thank you for the analysis. Can you think of a workaround for the issue? -Paul On Aug 26, 2021, at 5:17 AM, Xiubo Li wrote: Hi Paul, There is one racy case when updating the state to the ceph cluster while reopening the image, which will close and open the image, the

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-26 Thread Paul Giralt (pgiralt)
This is great. Is there a way to test the fix in my environment? -Paul On Aug 26, 2021, at 11:05 AM, Xiubo Li wrote: Hi Paul, Ilya, I have fixed it in [1], please help review. Thanks [1] https://github.com/open-iscsi/tcmu-runner/pull/667 On 8/26/21 7:34 PM, Pau

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-27 Thread Paul Giralt (pgiralt)
Ok - thanks Xiubo. Not sure I feel comfortable doing that without breaking something else, so will wait for a new release that incorporates the fix. In the meantime I’m trying to figure out what might be triggering the issue, since this has been running fine for months and just recently started

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-29 Thread Paul Giralt (pgiralt)
Thanks Xiubo, I actually had the same idea on Friday and I reduced the number of iSCSI gateways to 1 and the problem appears to have disappeared for now. I’m guessing there is still some chance it could happen, but it would occur much more rarely. I did notice the blacklist was growing ver

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-30 Thread Paul Giralt (pgiralt)
Inline… Usually it shouldn't be so many entries, and we have added some patches to fix this. When the exclusive lock is broken by a new gateway, the previous one will be added to the blocklist by ceph. And in tcmu-runner when the previous gateway detects that it has been added to the blocklist
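To see how fast the blocklist described above is growing, and which gateway address the entries come from, something like the following could be used. This is only a rough sketch: it assumes each entry line printed by `ceph osd blocklist ls` (the Pacific name of the command; older releases call it `blacklist`) starts with an address of the form `IP:port/nonce`, and the helper names `fetch_blocklist` and `parse_blocklist` are made up for illustration.

```python
import subprocess
from collections import Counter


def fetch_blocklist():
    """Return the raw text printed by `ceph osd blocklist ls`.

    Assumes the `ceph` CLI is on PATH and has a usable keyring
    (for example, when run from inside `cephadm shell`).
    """
    result = subprocess.run(
        ["ceph", "osd", "blocklist", "ls"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_blocklist(text):
    """Count blocklist entries per client IP.

    Assumption: an entry line starts with an address such as
    10.0.0.1:0/12345; any other line (e.g. a summary line) is skipped.
    """
    counts = Counter()
    for line in text.splitlines():
        parts = line.split()
        if not parts or ":" not in parts[0] or "/" not in parts[0]:
            continue
        # Drop the ":port/nonce" suffix and keep only the IP part.
        ip = parts[0].rsplit(":", 1)[0]
        counts[ip] += 1
    return counts
```

Calling `parse_blocklist(fetch_blocklist())` on a node with cluster access and repeating it every few minutes would show whether one gateway's address keeps piling up, which is what you would expect if two gateways keep breaking each other's exclusive lock.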

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-30 Thread Paul Giralt (pgiralt)
On Aug 30, 2021, at 7:14 PM, Xiubo Li wrote: We are using “Most Recently Used” - however there are 50 ESXi hosts all trying to access the same data stores, so it’s very possible that one host is choosing iSCSI gateway 1 and another host is choosing iSCSI gateway 2.

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-31 Thread Paul Giralt (pgiralt)
Xiubo, Thank you for all the help so far. I was finally able to figure out what the trigger for the issue was and how to make sure it doesn’t happen - at least not in a steady state. There is still the possibility of running into the bug in a failover scenario of some kind, but at least for now

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-31 Thread Paul Giralt (pgiralt)
Thank you. This is exactly what I was looking for. If I’m understanding correctly, what gets listed as the “owner” is what gets advertised via ALUA as the primary path, but the lock owner indicates which gateway currently owns the lock for that image and is allowed to pass traffic for that LUN

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-31 Thread Paul Giralt (pgiralt)
However, the gwcli command is still showing the other two gateways which are no longer enabled. Where does this list of gateways get stored? All these configurations are stored in the "gateway.conf" object in the "rbd" pool. How do I access this object? Is it a file or some kind of object
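For reference, the "gateway.conf" object mentioned above is a RADOS object rather than a file on disk, so it can be read with the `rados` CLI. The sketch below is a hedged, read-only example: it assumes the object holds JSON text (as ceph-iscsi normally writes it), that `rados` is on PATH with a working keyring, and the helper names `fetch_gateway_conf` and `summarize` are hypothetical. It does not modify the object.

```python
import json
import subprocess


def fetch_gateway_conf(pool="rbd", obj="gateway.conf"):
    """Read the object with `rados -p <pool> get <obj> -` ("-" means stdout)."""
    result = subprocess.run(
        ["rados", "-p", pool, "get", obj, "-"],
        capture_output=True, check=True,
    )
    return result.stdout.decode("utf-8")


def summarize(raw):
    """Map each top-level key to the type name of its value.

    Assumption: the object body is a single JSON dictionary. No key
    names are assumed; whatever is present is simply listed.
    """
    data = json.loads(raw)
    return {key: type(value).__name__ for key, value in sorted(data.items())}
```

Running `summarize(fetch_gateway_conf())` gives a quick overview of what is stored before digging into the full JSON. Editing the object by hand is risky; changes are better made through gwcli where possible.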

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-09-01 Thread Paul Giralt (pgiralt)
Thanks. The problem is that when I start gwcli, I get this: [root@cxcto-c240-j27-02 /]# gwcli Warning: Could not load preferences file /root/.gwcli/prefs.bin. 2 gateways are inaccessible - updates will be disabled 2 gateways are inaccessible - updates will be disabled 2 gateways are inaccessibl

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-09-03 Thread Paul Giralt (pgiralt)
On Sep 3, 2021, at 4:28 AM, Xiubo Li wrote: And TCMU runner shows 3 hosts up: services: mon: 5 daemons, quorum cxcto-c240-j27-01.cisco.com,cxcto-c240-j27-06,cxcto-c240-j27-10,cxcto-c240-j27-08,cxcto-c240-j27-12 (ag

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-09-08 Thread Paul Giralt (pgiralt)
Thank you Xiubo. confirm=true worked and I was able to update via gwcli and then get everything reset back to normal again. I’m stable for now but still hoping that this fix can get in soon to make sure the crash doesn’t happen again. Appreciate all your help on this. -Paul On Sep 6, 2021, a