Hi, I have a strange issue where LIO-T-based iSCSI targets and LUNs simply don't work most of the time. They either don't start, or bounce around until no more nodes are left to try. The less-than-useful information in the logs looks like this:
Aug 21 22:49:06 [10531] storage-1-prod pengine: warning: check_migration_threshold: Forcing iscsi0-target away from storage-1-prod after 1000000 failures (max=1000000)
Aug 21 22:54:47 storage-1-prod crmd[2757]: notice: Result of start operation for ip-iscsi0-vlan40 on storage-1-prod: 0 (ok)
Aug 21 22:54:47 storage-1-prod iSCSITarget(iscsi0-target)[5427]: WARNING: Configuration parameter "tid" is not supported by the iSCSI implementation and will be ignored.
Aug 21 22:54:48 storage-1-prod iSCSITarget(iscsi0-target)[5427]: INFO: Parameter auto_add_default_portal is now 'false'.
Aug 21 22:54:48 storage-1-prod iSCSITarget(iscsi0-target)[5427]: INFO: Created target iqn.2017-08.access.net:prod-1-ha. Created TPG 1.
Aug 21 22:54:48 storage-1-prod iSCSITarget(iscsi0-target)[5427]: ERROR: This Target already exists in configFS
Aug 21 22:54:48 storage-1-prod crmd[2757]: notice: Result of start operation for iscsi0-target on storage-1-prod: 1 (unknown error)
Aug 21 22:54:49 storage-1-prod iSCSITarget(iscsi0-target)[5536]: INFO: Deleted Target iqn.2017-08.access.net:prod-1-ha.
Aug 21 22:54:49 storage-1-prod crmd[2757]: notice: Result of stop operation for iscsi0-target on storage-1-prod: 0 (ok)

Now, the "unknown error" actually seems to be a targetcli-style error: "This Target already exists in configFS". Checking with targetcli shows zero configured items on either node. Manually starting the target gives:

john@storage-1-prod:~$ sudo pcs resource debug-start iscsi0-target
Error performing operation: Operation not permitted
Operation start for iscsi0-target (ocf:heartbeat:iSCSITarget) returned 1
 > stderr: WARNING: Configuration parameter "tid" is not supported by the iSCSI implementation and will be ignored.
 > stderr: INFO: Parameter auto_add_default_portal is now 'false'.
 > stderr: INFO: Created target iqn.2017-08.access.net:prod-1-ha. Created TPG 1.
 > stderr: ERROR: This Target already exists in configFS

but now targetcli shows at least the target.
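Since targetcli can apparently disagree with what the kernel still holds, I have been poking at configFS directly. A small sketch of what I run (the helper name is mine; /sys/kernel/config/target/iscsi is where the LIO iSCSI fabric keeps its state, with one directory per registered target IQN):

```shell
#!/bin/sh
# List what is still registered in LIO's configFS, independent of what
# targetcli reports. Leftover target IQNs show up as directories here.
list_configfs_targets() {
    cfs=${1:-/sys/kernel/config/target/iscsi}
    if [ -d "$cfs" ]; then
        ls -1 "$cfs"   # entries include any still-registered target IQNs
    else
        echo "no iscsi fabric registered in configFS"
    fi
}

list_configfs_targets
```

If an IQN still shows up here right after a supposedly clean stop, that would explain the "This Target already exists in configFS" error on the next start attempt.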
Checking with crm status still shows the target as stopped. Manually starting the LUNs gives:

john@storage-1-prod:~$ sudo pcs resource debug-start iscsi0-lun0
Operation start for iscsi0-lun0 (ocf:heartbeat:iSCSILogicalUnit) returned 0
 > stderr: INFO: Created block storage object iscsi0-lun0 using /dev/zvol/iscsipool0/iscsi/net.access.prod-1-ha-root.
 > stderr: INFO: Created LUN 0.
 > stderr: DEBUG: iscsi0-lun0 start : 0

john@storage-1-prod:~$ sudo pcs resource debug-start iscsi0-lun1
Operation start for iscsi0-lun1 (ocf:heartbeat:iSCSILogicalUnit) returned 0
 > stderr: INFO: Created block storage object iscsi0-lun1 using /dev/zvol/iscsipool0/iscsi/net.access.prod-1-ha-swap.
 > stderr: /usr/lib/ocf/resource.d/heartbeat/iSCSILogicalUnit: line 378: /sys/kernel/config/target/core/iblock_0/iscsi0-lun1/wwn/vpd_unit_serial: No such file or directory
 > stderr: INFO: Created LUN 1.
 > stderr: DEBUG: iscsi0-lun1 start : 0

So the second LUN seems to get some bad parameters from the iSCSILogicalUnit script. Checking with targetcli, however, shows both LUNs and the target up and running, yet crm status (and pcs status) still shows all three resources as stopped. Since the LUNs are colocated with the target and the target still has fail counts, I clear them with:

sudo pcs resource cleanup iscsi0-target

Now the LUNs and the target are all active in crm status / pcs status. But it's quite a manual process to get this working! I'm thinking either my configuration is bad or there is a bug somewhere in targetcli / LIO or the heartbeat iSCSI resource agents. On top of all the manual work, it still breaks on any action: a move, failover, reboot, etc. instantly breaks it. Everything else (the underlying ZFS pool, the DRBD device, the IPv4 addresses, etc.) moves just fine; it's only the iSCSI part that is problematic.

Concrete questions:
- Is my config bad?
- Is there a known issue with iSCSI?
(I have only found old references about ordering.)

I have attached the output of crm configure show as cib.txt, and the status after a fresh boot of both nodes is:

Current DC: storage-2-prod (version 1.1.16-94ff4df) - partition with quorum
Last updated: Mon Aug 21 22:55:05 2017
Last change: Mon Aug 21 22:36:23 2017 by root via cibadmin on storage-1-prod

2 nodes configured
21 resources configured

Online: [ storage-1-prod storage-2-prod ]

Full list of resources:

 ip-iscsi0-vlan10 (ocf::heartbeat:IPaddr2): Started storage-1-prod
 ip-iscsi0-vlan20 (ocf::heartbeat:IPaddr2): Started storage-1-prod
 ip-iscsi0-vlan30 (ocf::heartbeat:IPaddr2): Started storage-1-prod
 ip-iscsi0-vlan40 (ocf::heartbeat:IPaddr2): Started storage-1-prod
 Master/Slave Set: drbd_master_slave0 [drbd_disk0]
     Masters: [ storage-1-prod ]
     Slaves: [ storage-2-prod ]
 Master/Slave Set: drbd_master_slave1 [drbd_disk1]
     Masters: [ storage-2-prod ]
     Slaves: [ storage-1-prod ]
 ip-iscsi1-vlan10 (ocf::heartbeat:IPaddr2): Started storage-2-prod
 ip-iscsi1-vlan20 (ocf::heartbeat:IPaddr2): Started storage-2-prod
 ip-iscsi1-vlan30 (ocf::heartbeat:IPaddr2): Started storage-2-prod
 ip-iscsi1-vlan40 (ocf::heartbeat:IPaddr2): Started storage-2-prod
 st-storage-1-prod (stonith:meatware): Started storage-2-prod
 st-storage-2-prod (stonith:meatware): Started storage-1-prod
 zfs-iscsipool0 (ocf::heartbeat:ZFS): Started storage-1-prod
 zfs-iscsipool1 (ocf::heartbeat:ZFS): Started storage-2-prod
 iscsi0-lun0 (ocf::heartbeat:iSCSILogicalUnit): Stopped
 iscsi0-lun1 (ocf::heartbeat:iSCSILogicalUnit): Stopped
 iscsi0-target (ocf::heartbeat:iSCSITarget): Stopped
 Clone Set: dlm-clone [dlm]
     Started: [ storage-1-prod storage-2-prod ]

Failed Actions:
* iscsi0-target_start_0 on storage-2-prod 'unknown error' (1): call=99, status=complete, exitreason='none', last-rc-change='Mon Aug 21 22:54:49 2017', queued=0ms, exec=954ms
* iscsi0-target_start_0 on storage-1-prod 'unknown error' (1): call=98, status=complete, exitreason='none', last-rc-change='Mon Aug 21 22:54:47 2017', queued=0ms, exec=1062ms

Regards,
John
node 180945669: storage-1-prod
node 180945670: storage-2-prod \
        attributes
primitive dlm ocf:pacemaker:controld \
        op start interval=0s timeout=90 \
        op stop interval=0s timeout=100 \
        op monitor interval=60s
primitive drbd_disk0 ocf:linbit:drbd \
        params drbd_resource=disk0 \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave
primitive drbd_disk1 ocf:linbit:drbd \
        params drbd_resource=disk1 \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave
primitive ip-iscsi0-vlan10 IPaddr2 \
        params ip=10.201.0.25 nic=eno4 cidr_netmask=24 \
        meta migration-threshold=2 target-role=Started \
        op monitor interval=20 on-fail=restart timeout=60
primitive ip-iscsi0-vlan20 IPaddr2 \
        params ip=10.201.1.25 nic=eno3 cidr_netmask=24 \
        meta migration-threshold=2 target-role=Started \
        op monitor interval=20 on-fail=restart timeout=60
primitive ip-iscsi0-vlan30 IPaddr2 \
        params ip=10.201.2.25 nic=eno2 cidr_netmask=24 \
        meta migration-threshold=2 target-role=Started \
        op monitor interval=20 on-fail=restart timeout=60
primitive ip-iscsi0-vlan40 IPaddr2 \
        params ip=10.201.3.25 nic=eno1 cidr_netmask=24 \
        meta migration-threshold=2 target-role=Started \
        op monitor interval=20 on-fail=restart timeout=60
primitive ip-iscsi1-vlan10 IPaddr2 \
        params ip=10.201.0.26 nic=eno4 cidr_netmask=24 \
        meta migration-threshold=2 \
        op monitor interval=20 on-fail=restart timeout=60
primitive ip-iscsi1-vlan20 IPaddr2 \
        params ip=10.201.1.26 nic=eno3 cidr_netmask=24 \
        meta migration-threshold=2 \
        op monitor interval=20 on-fail=restart timeout=60
primitive ip-iscsi1-vlan30 IPaddr2 \
        params ip=10.201.2.26 nic=eno2 cidr_netmask=24 \
        meta migration-threshold=2 \
        op monitor interval=20 on-fail=restart timeout=60
primitive ip-iscsi1-vlan40 IPaddr2 \
        params ip=10.201.3.26 nic=eno1 cidr_netmask=24 \
        meta migration-threshold=2 \
        op monitor interval=20 on-fail=restart timeout=60
primitive iscsi0-lun0 iSCSILogicalUnit \
        params implementation=lio-t target_iqn="iqn.2017-08.access.net:prod-1-ha" lun=0 path="/dev/zvol/iscsipool0/iscsi/net.access.prod-1-ha-root" \
        meta target-role=Started
primitive iscsi0-lun1 iSCSILogicalUnit \
        params implementation=lio-t target_iqn="iqn.2017-08.access.net:prod-1-ha" lun=1 path="/dev/zvol/iscsipool0/iscsi/net.access.prod-1-ha-swap" \
        meta target-role=Started
primitive iscsi0-target iSCSITarget \
        params implementation=lio-t iqn="iqn.2017-08.access.net:prod-1-ha" tid=1 \
        op monitor interval=30s \
        meta target-role=Started
primitive st-storage-1-prod stonith:meatware \
        params hostlist=storage-1-prod \
        meta target-role=Started
primitive st-storage-2-prod stonith:meatware \
        params hostlist=storage-2-prod \
        meta target-role=Started
primitive zfs-iscsipool0 ZFS \
        params pool=iscsipool0 \
        op start timeout=90 interval=0 \
        op stop timeout=90 interval=0 \
        meta target-role=Started
primitive zfs-iscsipool1 ZFS \
        params pool=iscsipool1 \
        op start timeout=90 interval=0 \
        op stop timeout=90 interval=0 \
        meta target-role=Started
ms drbd_master_slave0 drbd_disk0 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
ms drbd_master_slave1 drbd_disk1 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
clone dlm-clone dlm \
        meta clone-max=2 clone-node-max=1 target-role=Started
location cli-prefer-drbd_master_slave0 drbd_master_slave0 role=Master inf: storage-1-prod
location cli-prefer-drbd_master_slave1 drbd_master_slave1 role=Started inf: storage-2-prod
location cli-prefer-zfs-iscsipool0 zfs-iscsipool0 role=Started inf: storage-1-prod
location cli-prefer-zfs-iscsipool1 zfs-iscsipool1 role=Started inf: storage-2-prod
order ip0-after-drbd0 inf: drbd_master_slave0:promote zfs-iscsipool0 ip-iscsi1-vlan10 ip-iscsi1-vlan20 ip-iscsi1-vlan30 ip-iscsi1-vlan40 iscsi0-target iscsi0-lun0 iscsi0-lun1
order ip1-after-drbd1 inf: drbd_master_slave1:promote zfs-iscsipool1 ip-iscsi0-vlan10 ip-iscsi0-vlan20 ip-iscsi0-vlan30 ip-iscsi0-vlan40
location l-st-storage-1-prod st-storage-1-prod -inf: storage-1-prod
location l-st-storage-2-prod st-storage-2-prod -inf: storage-2-prod
location lun0-prefer-iscsipool0 iscsi0-target role=Started inf: storage-1-prod
location lun1-prefer-iscsipool0 iscsi0-lun1 role=Started inf: storage-1-prod
location storage-0 { drbd_master_slave0 zfs-iscsipool0 ip-iscsi0-vlan10 ip-iscsi0-vlan20 ip-iscsi0-vlan30 ip-iscsi0-vlan40 iscsi0-target iscsi0-lun0 iscsi0-lun1 } 100: storage-1-prod
location storage-1 { drbd_master_slave1 zfs-iscsipool1 ip-iscsi1-vlan10 ip-iscsi1-vlan20 ip-iscsi1-vlan30 ip-iscsi1-vlan40 } 100: storage-2-prod
colocation storage-target0 inf: ip-iscsi0-vlan10 ip-iscsi0-vlan20 ip-iscsi0-vlan30 ip-iscsi0-vlan40 zfs-iscsipool0 drbd_master_slave0:Master
colocation storage-target1 inf: ip-iscsi1-vlan10 ip-iscsi1-vlan20 ip-iscsi1-vlan30 ip-iscsi1-vlan40 zfs-iscsipool1 drbd_master_slave1:Master
location target0-prefer-iscsipool0 iscsi0-target role=Started inf: storage-1-prod
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.16-94ff4df \
        cluster-infrastructure=corosync \
        cluster-name=access_storage \
        stonith-enabled=true \
        no-quorum-policy=ignore \
        default-resource-stickiness=100
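P.S. Since I suspect the scattered location/order rules myself, this is the shape I would expect the iSCSI pieces to take if they were tied together explicitly: a sketch in crm configure syntax, using my existing resource names (the group name g-iscsi0 and the constraint ids are made up). I would be grateful if someone could confirm or correct this direction:

```
group g-iscsi0 iscsi0-target iscsi0-lun0 iscsi0-lun1
colocation iscsi0-with-drbd0 inf: g-iscsi0 drbd_master_slave0:Master
order iscsi0-after-pool0 inf: drbd_master_slave0:promote zfs-iscsipool0 g-iscsi0
```

The idea being that a group both orders the target before its LUNs and keeps them on the same node, instead of relying on separate location preferences.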
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org