Re: [ceph-users] TCMU Runner: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown
Just to give some short feedback - everything is fine now:

- via ceph-ansible we got some tcmu-runner / ceph-iscsi development versions
- our iSCSI ALUA setup was a mess (it was a mixture of explicit and implicit ALUA, while only implicit ALUA is supported at the moment)
- our multipath devices showed the same priorities for all of our paths (instead of 50 / 10 / 10 / 10 priorities)

Fix:

- shut down all iSCSI traffic
- iscsiadm logout / multipath -F (removes all devices)
- update ceph-iscsi & tcmu-runner to stable versions
- reinitialize the iSCSI devices: login & multipath

Now it looks like it should, with only implicit ALUA mode and the correct priorities on our multipath devices ;)

Thanks @Mike for your help!

From: ceph-users on behalf of Kilian Ries
Sent: Tuesday, October 22, 2019 23:38:51
To: Mike Christie; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] TCMU Runner: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown

- Each LUN is exported to multiple clients (at the same time)
- yes, IO is done to the LUNs (read and write); oVirt runs VMs on each of the LUNs

Ok, I'll update this tomorrow with the logs you asked for ...

From: Mike Christie
Sent: Tuesday, October 22, 2019 19:43:40
To: Kilian Ries; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] TCMU Runner: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown

On 10/22/2019 03:20 AM, Kilian Ries wrote:
> Hi,
>
> I'm running a ceph cluster with 4x iSCSI exporter nodes and oVirt on the
> client side. In the tcmu-runner logs I see the following happening every
> few seconds:
>

Are you exporting a LUN to one client or multiple clients at the same time?

> tcmu-runner-1.4.0-106.gd17d24e.el7.x86_64

Are you doing any IO to the iSCSI LUN? If not, then we normally saw this with an older version. It would start at dm-multipath initialization and then just continue forever. Your package looks like it has the fix:

commit dd7dd51c6cafa8bbcd3ca0eef31fb378b27ff499
Author: Mike Christie
Date:   Mon Jan 14 17:06:27 2019 -0600

    Allow some commands to run while taking lock

so we should not be seeing it.

Could you turn on tcmu-runner debugging? Open the file:

/etc/tcmu/tcmu.conf

and set:

log_level = 5

Do this while you are hitting this bug. I only need a couple of seconds so I can see what commands are being sent.
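For reference, the client-side part of the fix above corresponds roughly to the following sequence - a minimal sketch, assuming open-iscsi and dm-multipath on the initiators; the package update itself happens on the gateway nodes:

###

# Stop or migrate the VMs using these LUNs first so no IO is in flight.

# 1. Log out of all iSCSI sessions on each client:
iscsiadm -m node --logoutall=all

# 2. Flush all multipath device maps:
multipath -F

# 3. (On the gateway nodes) update to the stable packages, e.g.:
#      yum update ceph-iscsi tcmu-runner

# 4. Log back in and rebuild the multipath devices:
iscsiadm -m node --loginall=all
multipath -r

# 5. Verify: with implicit ALUA only, one path group should be
#    active/optimized (prio 50) and the others standby (prio 10):
multipath -ll

###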
Re: [ceph-users] TCMU Runner: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown
- Each LUN is exported to multiple clients (at the same time)
- yes, IO is done to the LUNs (read and write); oVirt runs VMs on each of the LUNs

Ok, I'll update this tomorrow with the logs you asked for ...

From: Mike Christie
Sent: Tuesday, October 22, 2019 19:43:40
To: Kilian Ries; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] TCMU Runner: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown

On 10/22/2019 03:20 AM, Kilian Ries wrote:
> Hi,
>
> I'm running a ceph cluster with 4x iSCSI exporter nodes and oVirt on the
> client side. In the tcmu-runner logs I see the following happening every
> few seconds:
>

Are you exporting a LUN to one client or multiple clients at the same time?

> tcmu-runner-1.4.0-106.gd17d24e.el7.x86_64

Are you doing any IO to the iSCSI LUN? If not, then we normally saw this with an older version. It would start at dm-multipath initialization and then just continue forever. Your package looks like it has the fix:

commit dd7dd51c6cafa8bbcd3ca0eef31fb378b27ff499
Author: Mike Christie
Date:   Mon Jan 14 17:06:27 2019 -0600

    Allow some commands to run while taking lock

so we should not be seeing it.

Could you turn on tcmu-runner debugging? Open the file:

/etc/tcmu/tcmu.conf

and set:

log_level = 5

Do this while you are hitting this bug. I only need a couple of seconds so I can see what commands are being sent.
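The debug setting Mike asks for is a one-line change in the daemon's config file - a minimal sketch; whether tcmu-runner picks the change up on the fly or needs a daemon restart depends on the build, so treat that as an assumption to verify:

###

# /etc/tcmu/tcmu.conf
# Most verbose level; logs the SCSI commands being handled, which is what
# is needed to "see what commands are being sent". Remember to lower it
# again afterwards - level 5 is very chatty.
log_level = 5

###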
[ceph-users] TCMU Runner: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown
Hi,

I'm running a ceph cluster with 4x iSCSI exporter nodes and oVirt on the client side. In the tcmu-runner logs I see the following happening every few seconds:

###

2019-10-22 10:11:11.231 1710 [WARN] tcmu_rbd_lock:762 rbd/image.lun0: Acquired exclusive lock.
2019-10-22 10:11:11.395 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun2: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2019-10-22 10:11:12.346 1710 [WARN] tcmu_notify_lock_lost:222 rbd/image.lun0: Async lock drop. Old state 1
2019-10-22 10:11:12.353 1710 [INFO] alua_implicit_transition:566 rbd/image.lun0: Starting lock acquisition operation.
2019-10-22 10:11:13.325 1710 [INFO] alua_implicit_transition:566 rbd/image.lun0: Starting lock acquisition operation.
2019-10-22 10:11:13.852 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun2: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2019-10-22 10:11:13.854 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun1: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2019-10-22 10:11:13.863 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun1: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2019-10-22 10:11:14.202 1710 [INFO] alua_implicit_transition:566 rbd/image.lun0: Starting lock acquisition operation.
2019-10-22 10:11:14.285 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun2: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2019-10-22 10:11:15.217 1710 [WARN] tcmu_rbd_lock:762 rbd/image.lun0: Acquired exclusive lock.
2019-10-22 10:11:15.873 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun2: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2019-10-22 10:11:16.696 1710 [WARN] tcmu_notify_lock_lost:222 rbd/image.lun0: Async lock drop. Old state 1
2019-10-22 10:11:16.696 1710 [INFO] alua_implicit_transition:566 rbd/image.lun0: Starting lock acquisition operation.
2019-10-22 10:11:16.696 1710 [WARN] tcmu_notify_lock_lost:222 rbd/image.lun0: Async lock drop. Old state 2
2019-10-22 10:11:16.992 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun2: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.

###

This happens on all of my four iSCSI exporter nodes. The blacklist gives me the following (the number of blacklisted entries does not really shrink):

###

ceph osd blacklist ls
listed 10579 entries

###

On the client side I configured multipath like this:

###

device {
    vendor "LIO-ORG"
    hardware_handler "1 alua"
    path_grouping_policy "failover"
    path_selector "queue-length 0"
    failback 60
    path_checker tur
    prio alua
    prio_args exclusive_pref_bit
    fast_io_fail_tmo 25
    no_path_retry queue
}

###

And multipath -ll shows me all four paths as "active ready" without errors.

For me this looks like tcmu-runner cannot acquire the exclusive lock and it is flapping between nodes. In addition, in the ceph GUI / dashboard I can see that the LUNs' "active / optimized" state is flapping between nodes ...
I have installed the following versions (CentOS 7.7, Ceph 13.2.6):

###

rpm -qa | egrep "ceph|iscsi|tcmu|rst|kernel"

python-cephfs-13.2.6-0.el7.x86_64
ceph-selinux-13.2.6-0.el7.x86_64
kernel-3.10.0-957.5.1.el7.x86_64
kernel-3.10.0-957.1.3.el7.x86_64
kernel-tools-libs-3.10.0-1062.1.2.el7.x86_64
libcephfs2-13.2.6-0.el7.x86_64
libtcmu-1.4.0-106.gd17d24e.el7.x86_64
ceph-common-13.2.6-0.el7.x86_64
ceph-osd-13.2.6-0.el7.x86_64
tcmu-runner-1.4.0-106.gd17d24e.el7.x86_64
kernel-3.10.0-1062.1.2.el7.x86_64
ceph-iscsi-3.3-1.el7.noarch
kernel-headers-3.10.0-1062.1.2.el7.x86_64
kernel-3.10.0-862.14.4.el7.x86_64
ceph-base-13.2.6-0.el7.x86_64
kernel-tools-3.10.0-1062.1.2.el7.x86_64

###

Greets,
Kilian
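One way to confirm the suspected lock flapping from the Ceph side is to watch which client currently holds the exclusive lock on a backing image and whether the blacklist keeps growing - a minimal sketch, using the pool/image names from the log excerpt above:

###

# Show the current exclusive-lock holder of a backing image; with flapping,
# the locker address changes between gateway nodes on repeated runs:
rbd lock ls rbd/image.lun0

# Each lock takeover blacklists the previous holder, so a steadily growing
# entry count is consistent with the flapping seen in the dashboard:
watch -n 5 "ceph osd blacklist ls 2>&1 | tail -n 1"

###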
Re: [ceph-users] tcmu-runner: mismatched sizes for rbd image size
@Mike Did you have the chance to update the download.ceph.com repositories with the new version?

I just tested the packages from shaman in our DEV environment and they seem to fix the issue - after updating the packages I was not able to reproduce the error again, and tcmu-runner starts up without any errors ;)

From: Mike Christie
Sent: Thursday, October 3, 2019 00:20:51
To: Kilian Ries; dilla...@redhat.com
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] tcmu-runner: mismatched sizes for rbd image size

On 10/02/2019 02:15 PM, Kilian Ries wrote:
> Ok, I just compared my local python files and the git commit you sent me
> - it really looks like I have the old files installed. All the changes
> are missing in my local files.
>
> Where can I get a new ceph-iscsi-config package that has the fix
> included? I have installed version:

They are on shaman only right now:

https://4.chacra.ceph.com/r/ceph-iscsi-config/master/24deeb206ed2354d44b0f33d7d26d475e1014f76/centos/7/flavors/default/noarch/
https://4.chacra.ceph.com/r/ceph-iscsi-cli/master/4802654a6963df6bf5f4a968782cfabfae835067/centos/7/flavors/default/noarch/

The shaman rpms above have one bug we just fixed in ceph-iscsi-config where, if DNS is not set up correctly, gwcli commands can take minutes.

I am going to try and get download.ceph.com updated.

> ceph-iscsi-config-2.6-2.6.el7.noarch
>
> From: ceph-users on behalf of Kilian Ries
> Sent: Wednesday, October 2, 2019 21:04:45
> To: dilla...@redhat.com
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] tcmu-runner: mismatched sizes for rbd image size
>
> Yes, I created all four LUNs with these sizes:
>
> lun0 - 5120G
> lun1 - 5121G
> lun2 - 5122G
> lun3 - 5123G
>
> It's always one GB more per LUN... Is there any newer ceph-iscsi-config
> package than I have installed?
>
> ceph-iscsi-config-2.6-2.6.el7.noarch
>
> Then I could try to update the package and see if the error is fixed ...
>
> ----
> From: Jason Dillaman
> Sent: Wednesday, October 2, 2019 16:00:03
> To: Kilian Ries
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] tcmu-runner: mismatched sizes for rbd image size
>
> On Wed, Oct 2, 2019 at 9:50 AM Kilian Ries wrote:
>>
>> Hi,
>>
>> I'm running a ceph mimic cluster with 4x iSCSI gateway nodes. The cluster
>> was set up via ceph-ansible v3.2-stable. I just checked my nodes and saw
>> that only two of the four configured iSCSI gw nodes are working correctly.
>> I first noticed it via gwcli:
>>
>> ###
>>
>> $ gwcli -d ls
>>
>> Traceback (most recent call last):
>>   File "/usr/bin/gwcli", line 191, in <module>
>>     main()
>>   File "/usr/bin/gwcli", line 103, in main
>>     root_node.refresh()
>>   File "/usr/lib/python2.7/site-packages/gwcli/gateway.py", line 87, in refresh
>>     raise GatewayError
>> gwcli.utils.GatewayError
>>
>> ###
>>
>> I investigated and noticed that both "rbd-target-api" and "rbd-target-gw"
>> are not running. I was not able to restart them via systemd. I then found
>> that even tcmu-runner is not running, and it exits with the following error:
>>
>> ###
>>
>> tcmu_rbd_check_image_size:827 rbd/production.lun1: Mismatched sizes. RBD
>> image size 5498631880704. Requested new size 5497558138880.
>>
>> ###
>>
>> Now I have the situation that two nodes are running correctly and two can't
>> start tcmu-runner. I don't know where the image size mismatches are coming
>> from - I haven't configured or resized any of the images.
>>
>> Is there any chance to get my two iscsi gw nodes back working?
>
> It sounds like you are potentially hitting [1]. The ceph-iscsi-config
> library thinks your image size is 5TiB but you actually have a 5121GiB
> (~5.001TiB) RBD image. Any clue how your RBD image got to be 1GiB
> larger than an even 5TiB?
>
>> The following packages are installed:
>>
>> rpm -qa | egrep "ceph|iscsi|tcmu|rst|kernel"
>>
>> libtcmu-1.4.0-106.gd17d24e.el7.x86_64
>> ceph-iscsi-cli-2.7-2.7.el7.noarch
>> kernel-3.10.0-957.5.1.el7.x86_64
>> ceph-base-13.2.5-0.el7.x86_64
>> ...
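For completeness: installing one of those shaman builds on a gateway node could look like the sketch below. The rpm filename here is hypothetical - the real one has to be taken from the chacra directory listing linked above:

###

# Hypothetical filename - take the actual one from the chacra directory
# listing linked above.
BASE=https://4.chacra.ceph.com/r/ceph-iscsi-config/master/24deeb206ed2354d44b0f33d7d26d475e1014f76/centos/7/flavors/default/noarch
RPM=ceph-iscsi-config-2.6-XY.el7.noarch.rpm   # hypothetical version string
yum install "$BASE/$RPM"

# Restart the gateway services so the updated library is picked up:
systemctl restart rbd-target-api rbd-target-gw

###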
Re: [ceph-users] tcmu-runner: mismatched sizes for rbd image size
Ok, I just compared my local python files and the git commit you sent me - it really looks like I have the old files installed. All the changes are missing in my local files.

Where can I get a new ceph-iscsi-config package that has the fix included? I have installed version:

ceph-iscsi-config-2.6-2.6.el7.noarch

From: ceph-users on behalf of Kilian Ries
Sent: Wednesday, October 2, 2019 21:04:45
To: dilla...@redhat.com
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] tcmu-runner: mismatched sizes for rbd image size

Yes, I created all four LUNs with these sizes:

lun0 - 5120G
lun1 - 5121G
lun2 - 5122G
lun3 - 5123G

It's always one GB more per LUN... Is there any newer ceph-iscsi-config package than I have installed?

ceph-iscsi-config-2.6-2.6.el7.noarch

Then I could try to update the package and see if the error is fixed ...

From: Jason Dillaman
Sent: Wednesday, October 2, 2019 16:00:03
To: Kilian Ries
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] tcmu-runner: mismatched sizes for rbd image size

On Wed, Oct 2, 2019 at 9:50 AM Kilian Ries wrote:
>
> Hi,
>
> I'm running a ceph mimic cluster with 4x iSCSI gateway nodes. The cluster
> was set up via ceph-ansible v3.2-stable. I just checked my nodes and saw
> that only two of the four configured iSCSI gw nodes are working correctly.
> I first noticed it via gwcli:
>
> ###
>
> $ gwcli -d ls
>
> Traceback (most recent call last):
>   File "/usr/bin/gwcli", line 191, in <module>
>     main()
>   File "/usr/bin/gwcli", line 103, in main
>     root_node.refresh()
>   File "/usr/lib/python2.7/site-packages/gwcli/gateway.py", line 87, in refresh
>     raise GatewayError
> gwcli.utils.GatewayError
>
> ###
>
> I investigated and noticed that both "rbd-target-api" and "rbd-target-gw"
> are not running. I was not able to restart them via systemd. I then found
> that even tcmu-runner is not running, and it exits with the following error:
>
> ###
>
> tcmu_rbd_check_image_size:827 rbd/production.lun1: Mismatched sizes. RBD
> image size 5498631880704. Requested new size 5497558138880.
>
> ###
>
> Now I have the situation that two nodes are running correctly and two can't
> start tcmu-runner. I don't know where the image size mismatches are coming
> from - I haven't configured or resized any of the images.
>
> Is there any chance to get my two iscsi gw nodes back working?

It sounds like you are potentially hitting [1]. The ceph-iscsi-config library thinks your image size is 5TiB but you actually have a 5121GiB (~5.001TiB) RBD image. Any clue how your RBD image got to be 1GiB larger than an even 5TiB?
> The following packages are installed:
>
> rpm -qa | egrep "ceph|iscsi|tcmu|rst|kernel"
>
> libtcmu-1.4.0-106.gd17d24e.el7.x86_64
> ceph-iscsi-cli-2.7-2.7.el7.noarch
> kernel-3.10.0-957.5.1.el7.x86_64
> ceph-base-13.2.5-0.el7.x86_64
> ceph-iscsi-config-2.6-2.6.el7.noarch
> ceph-common-13.2.5-0.el7.x86_64
> ceph-selinux-13.2.5-0.el7.x86_64
> kernel-tools-libs-3.10.0-957.5.1.el7.x86_64
> python-cephfs-13.2.5-0.el7.x86_64
> ceph-osd-13.2.5-0.el7.x86_64
> kernel-headers-3.10.0-957.5.1.el7.x86_64
> kernel-tools-3.10.0-957.5.1.el7.x86_64
> kernel-3.10.0-957.1.3.el7.x86_64
> libcephfs2-13.2.5-0.el7.x86_64
> kernel-3.10.0-862.14.4.el7.x86_64
> tcmu-runner-1.4.0-106.gd17d24e.el7.x86_64
>
> Thanks,
> Greets
>
> Kilian

[1] https://github.com/ceph/ceph-iscsi-config/pull/68

--
Jason
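Jason's 5TiB-vs-5121GiB observation lines up exactly with the byte counts in the tcmu-runner error; a quick shell-arithmetic check (no assumptions beyond the figures quoted above):

###

# "RBD image size 5498631880704. Requested new size 5497558138880."
echo $(( 5121 * 1024 * 1024 * 1024 ))   # 5498631880704 = 5121 GiB, the actual image (lun1 was created with 5121G)
echo $(( 5120 * 1024 * 1024 * 1024 ))   # 5497558138880 = 5120 GiB = 5 TiB, the size ceph-iscsi-config requested

###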
Re: [ceph-users] tcmu-runner: mismatched sizes for rbd image size
Yes, I created all four LUNs with these sizes:

lun0 - 5120G
lun1 - 5121G
lun2 - 5122G
lun3 - 5123G

It's always one GB more per LUN... Is there any newer ceph-iscsi-config package than I have installed?

ceph-iscsi-config-2.6-2.6.el7.noarch

Then I could try to update the package and see if the error is fixed ...

From: Jason Dillaman
Sent: Wednesday, October 2, 2019 16:00:03
To: Kilian Ries
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] tcmu-runner: mismatched sizes for rbd image size

On Wed, Oct 2, 2019 at 9:50 AM Kilian Ries wrote:
>
> Hi,
>
> I'm running a ceph mimic cluster with 4x iSCSI gateway nodes. The cluster
> was set up via ceph-ansible v3.2-stable. I just checked my nodes and saw
> that only two of the four configured iSCSI gw nodes are working correctly.
> I first noticed it via gwcli:
>
> ###
>
> $ gwcli -d ls
>
> Traceback (most recent call last):
>   File "/usr/bin/gwcli", line 191, in <module>
>     main()
>   File "/usr/bin/gwcli", line 103, in main
>     root_node.refresh()
>   File "/usr/lib/python2.7/site-packages/gwcli/gateway.py", line 87, in refresh
>     raise GatewayError
> gwcli.utils.GatewayError
>
> ###
>
> I investigated and noticed that both "rbd-target-api" and "rbd-target-gw"
> are not running. I was not able to restart them via systemd. I then found
> that even tcmu-runner is not running, and it exits with the following error:
>
> ###
>
> tcmu_rbd_check_image_size:827 rbd/production.lun1: Mismatched sizes. RBD
> image size 5498631880704. Requested new size 5497558138880.
>
> ###
>
> Now I have the situation that two nodes are running correctly and two can't
> start tcmu-runner. I don't know where the image size mismatches are coming
> from - I haven't configured or resized any of the images.
>
> Is there any chance to get my two iscsi gw nodes back working?

It sounds like you are potentially hitting [1]. The ceph-iscsi-config library thinks your image size is 5TiB but you actually have a 5121GiB (~5.001TiB) RBD image. Any clue how your RBD image got to be 1GiB larger than an even 5TiB?

> The following packages are installed:
>
> rpm -qa | egrep "ceph|iscsi|tcmu|rst|kernel"
>
> libtcmu-1.4.0-106.gd17d24e.el7.x86_64
> ceph-iscsi-cli-2.7-2.7.el7.noarch
> kernel-3.10.0-957.5.1.el7.x86_64
> ceph-base-13.2.5-0.el7.x86_64
> ceph-iscsi-config-2.6-2.6.el7.noarch
> ceph-common-13.2.5-0.el7.x86_64
> ceph-selinux-13.2.5-0.el7.x86_64
> kernel-tools-libs-3.10.0-957.5.1.el7.x86_64
> python-cephfs-13.2.5-0.el7.x86_64
> ceph-osd-13.2.5-0.el7.x86_64
> kernel-headers-3.10.0-957.5.1.el7.x86_64
> kernel-tools-3.10.0-957.5.1.el7.x86_64
> kernel-3.10.0-957.1.3.el7.x86_64
> libcephfs2-13.2.5-0.el7.x86_64
> kernel-3.10.0-862.14.4.el7.x86_64
> tcmu-runner-1.4.0-106.gd17d24e.el7.x86_64
>
> Thanks,
> Greets
>
> Kilian

[1] https://github.com/ceph/ceph-iscsi-config/pull/68

--
Jason
[ceph-users] tcmu-runner: mismatched sizes for rbd image size
Hi,

I'm running a ceph mimic cluster with 4x iSCSI gateway nodes. The cluster was set up via ceph-ansible v3.2-stable. I just checked my nodes and saw that only two of the four configured iSCSI gw nodes are working correctly. I first noticed it via gwcli:

###

$ gwcli -d ls

Traceback (most recent call last):
  File "/usr/bin/gwcli", line 191, in <module>
    main()
  File "/usr/bin/gwcli", line 103, in main
    root_node.refresh()
  File "/usr/lib/python2.7/site-packages/gwcli/gateway.py", line 87, in refresh
    raise GatewayError
gwcli.utils.GatewayError

###

I investigated and noticed that both "rbd-target-api" and "rbd-target-gw" are not running. I was not able to restart them via systemd. I then found that even tcmu-runner is not running, and it exits with the following error:

###

tcmu_rbd_check_image_size:827 rbd/production.lun1: Mismatched sizes. RBD image size 5498631880704. Requested new size 5497558138880.

###

Now I have the situation that two nodes are running correctly and two can't start tcmu-runner. I don't know where the image size mismatches are coming from - I haven't configured or resized any of the images.

Is there any chance to get my two iscsi gw nodes back working?

The following packages are installed:

rpm -qa | egrep "ceph|iscsi|tcmu|rst|kernel"

libtcmu-1.4.0-106.gd17d24e.el7.x86_64
ceph-iscsi-cli-2.7-2.7.el7.noarch
kernel-3.10.0-957.5.1.el7.x86_64
ceph-base-13.2.5-0.el7.x86_64
ceph-iscsi-config-2.6-2.6.el7.noarch
ceph-common-13.2.5-0.el7.x86_64
ceph-selinux-13.2.5-0.el7.x86_64
kernel-tools-libs-3.10.0-957.5.1.el7.x86_64
python-cephfs-13.2.5-0.el7.x86_64
ceph-osd-13.2.5-0.el7.x86_64
kernel-headers-3.10.0-957.5.1.el7.x86_64
kernel-tools-3.10.0-957.5.1.el7.x86_64
kernel-3.10.0-957.1.3.el7.x86_64
libcephfs2-13.2.5-0.el7.x86_64
kernel-3.10.0-862.14.4.el7.x86_64
tcmu-runner-1.4.0-106.gd17d24e.el7.x86_64

Thanks,
Greets

Kilian
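A way to narrow down which side is wrong is to compare the actual size of the backing image with the size recorded in the gateway configuration - a minimal sketch, assuming the default config location (ceph-iscsi persists its state in a rados object named gateway.conf in the rbd pool):

###

# Actual size of the backing image as stored in the cluster:
rbd info rbd/production.lun1 | grep size

# Size recorded by ceph-iscsi-config: dump the JSON config object and
# inspect the per-LUN entries:
rados -p rbd get gateway.conf - | python -m json.tool | less

###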