Hi, I have a strange issue where LIO-T-based iSCSI targets and LUNs simply don't work most of the time. They either don't start, or bounce around until no more nodes are left to try. The less-than-useful information in the logs looks like this:
Aug 21 22:49:06 [10531] storage-1-prod pengine: warning: check_migration_threshold: Forcing iscsi0-target away from storage-1-prod after 1000000 failures (max=1000000)
Aug 21 22:54:47 storage-1-prod crmd[2757]: notice: Result of start operation for ip-iscsi0-vlan40 on storage-1-prod: 0 (ok)
Aug 21 22:54:47 storage-1-prod iSCSITarget(iscsi0-target)[5427]: WARNING: Configuration parameter "tid" is not supported by the iSCSI implementation and will be ignored.
Aug 21 22:54:48 storage-1-prod iSCSITarget(iscsi0-target)[5427]: INFO: Parameter auto_add_default_portal is now 'false'.
Aug 21 22:54:48 storage-1-prod iSCSITarget(iscsi0-target)[5427]: INFO: Created target iqn.2017-08.access.net:prod-1-ha. Created TPG 1.
Aug 21 22:54:48 storage-1-prod iSCSITarget(iscsi0-target)[5427]: ERROR: This Target already exists in configFS
Aug 21 22:54:48 storage-1-prod crmd[2757]: notice: Result of start operation for iscsi0-target on storage-1-prod: 1 (unknown error)
Aug 21 22:54:49 storage-1-prod iSCSITarget(iscsi0-target)[5536]: INFO: Deleted Target iqn.2017-08.access.net:prod-1-ha.
Aug 21 22:54:49 storage-1-prod crmd[2757]: notice: Result of stop operation for iscsi0-target on storage-1-prod: 0 (ok)

Now, the "unknown error" actually seems to be a targetcli-style error: "This Target already exists in configFS". Checking with targetcli shows zero configured items on either node. Manually starting the target gives:

john@storage-1-prod:~$ sudo pcs resource debug-start iscsi0-target
Error performing operation: Operation not permitted
Operation start for iscsi0-target (ocf:heartbeat:iSCSITarget) returned 1
 > stderr: WARNING: Configuration parameter "tid" is not supported by the iSCSI implementation and will be ignored.
 > stderr: INFO: Parameter auto_add_default_portal is now 'false'.
 > stderr: INFO: Created target iqn.2017-08.access.net:prod-1-ha. Created TPG 1.
 > stderr: ERROR: This Target already exists in configFS

but now targetcli shows at least the target.
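Since targetcli can apparently disagree with what the kernel still holds, I have been poking at configFS directly. A small sketch of what I run (the helper name is mine; /sys/kernel/config/target/iscsi is where the LIO iSCSI fabric keeps its state, with one directory per registered target IQN):

```shell
#!/bin/sh
# List what is still registered in LIO's configFS, independent of what
# targetcli reports. Leftover target IQNs show up as directories here.
list_configfs_targets() {
    cfs=${1:-/sys/kernel/config/target/iscsi}
    if [ -d "$cfs" ]; then
        ls -1 "$cfs"   # entries include any still-registered target IQNs
    else
        echo "no iscsi fabric registered in configFS"
    fi
}

list_configfs_targets
```

If an IQN still shows up here right after a supposedly clean stop, that would explain the "This Target already exists in configFS" error on the next start attempt.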
Checking with crm status still shows the target as stopped. Manually starting the LUNs gives:

john@storage-1-prod:~$ sudo pcs resource debug-start iscsi0-lun0
Operation start for iscsi0-lun0 (ocf:heartbeat:iSCSILogicalUnit) returned 0
 > stderr: INFO: Created block storage object iscsi0-lun0 using /dev/zvol/iscsipool0/iscsi/net.access.prod-1-ha-root.
 > stderr: INFO: Created LUN 0.
 > stderr: DEBUG: iscsi0-lun0 start : 0

john@storage-1-prod:~$ sudo pcs resource debug-start iscsi0-lun1
Operation start for iscsi0-lun1 (ocf:heartbeat:iSCSILogicalUnit) returned 0
 > stderr: INFO: Created block storage object iscsi0-lun1 using /dev/zvol/iscsipool0/iscsi/net.access.prod-1-ha-swap.
 > stderr: /usr/lib/ocf/resource.d/heartbeat/iSCSILogicalUnit: line 378: /sys/kernel/config/target/core/iblock_0/iscsi0-lun1/wwn/vpd_unit_serial: No such file or directory
 > stderr: INFO: Created LUN 1.
 > stderr: DEBUG: iscsi0-lun1 start : 0

So the second LUN seems to get some bad parameters from the iSCSILogicalUnit script. Checking with targetcli, however, shows both LUNs and the target up and running, yet crm status (and pcs status) still shows all three resources as stopped. Since the LUNs are colocated with the target and the target still has fail counts, I clear them with:

sudo pcs resource cleanup iscsi0-target

Now the LUNs and the target are all active in crm status / pcs status. But it's quite a manual process to get this working! I'm thinking either my configuration is bad or there is a bug somewhere in targetcli / LIO or the heartbeat iSCSI resource agents. On top of all the manual work, it still breaks on any action: a move, failover, reboot, etc. instantly breaks it. Everything else (the underlying ZFS pool, the DRBD device, the IPv4 addresses, etc.) moves just fine; it's only the iSCSI part that is problematic.

Concrete questions:
- Is my config bad?
- Is there a known issue with iSCSI?
(I have only found old references about ordering.)

I have attached the output of crm configure show as cib.txt, and the status after a fresh boot of both nodes is:

Current DC: storage-2-prod (version 1.1.16-94ff4df) - partition with quorum
Last updated: Mon Aug 21 22:55:05 2017
Last change: Mon Aug 21 22:36:23 2017 by root via cibadmin on storage-1-prod

2 nodes configured
21 resources configured

Online: [ storage-1-prod storage-2-prod ]

Full list of resources:

 ip-iscsi0-vlan10 (ocf::heartbeat:IPaddr2): Started storage-1-prod
 ip-iscsi0-vlan20 (ocf::heartbeat:IPaddr2): Started storage-1-prod
 ip-iscsi0-vlan30 (ocf::heartbeat:IPaddr2): Started storage-1-prod
 ip-iscsi0-vlan40 (ocf::heartbeat:IPaddr2): Started storage-1-prod
 Master/Slave Set: drbd_master_slave0 [drbd_disk0]
     Masters: [ storage-1-prod ]
     Slaves: [ storage-2-prod ]
 Master/Slave Set: drbd_master_slave1 [drbd_disk1]
     Masters: [ storage-2-prod ]
     Slaves: [ storage-1-prod ]
 ip-iscsi1-vlan10 (ocf::heartbeat:IPaddr2): Started storage-2-prod
 ip-iscsi1-vlan20 (ocf::heartbeat:IPaddr2): Started storage-2-prod
 ip-iscsi1-vlan30 (ocf::heartbeat:IPaddr2): Started storage-2-prod
 ip-iscsi1-vlan40 (ocf::heartbeat:IPaddr2): Started storage-2-prod
 st-storage-1-prod (stonith:meatware): Started storage-2-prod
 st-storage-2-prod (stonith:meatware): Started storage-1-prod
 zfs-iscsipool0 (ocf::heartbeat:ZFS): Started storage-1-prod
 zfs-iscsipool1 (ocf::heartbeat:ZFS): Started storage-2-prod
 iscsi0-lun0 (ocf::heartbeat:iSCSILogicalUnit): Stopped
 iscsi0-lun1 (ocf::heartbeat:iSCSILogicalUnit): Stopped
 iscsi0-target (ocf::heartbeat:iSCSITarget): Stopped
 Clone Set: dlm-clone [dlm]
     Started: [ storage-1-prod storage-2-prod ]

Failed Actions:
* iscsi0-target_start_0 on storage-2-prod 'unknown error' (1): call=99, status=complete, exitreason='none', last-rc-change='Mon Aug 21 22:54:49 2017', queued=0ms, exec=954ms
* iscsi0-target_start_0 on storage-1-prod 'unknown error' (1): call=98, status=complete, exitreason='none', last-rc-change='Mon Aug 21 22:54:47 2017', queued=0ms, exec=1062ms

Regards,
John
node 180945669: storage-1-prod
node 180945670: storage-2-prod \
        attributes
primitive dlm ocf:pacemaker:controld \
        op start interval=0s timeout=90 \
        op stop interval=0s timeout=100 \
        op monitor interval=60s
primitive drbd_disk0 ocf:linbit:drbd \
        params drbd_resource=disk0 \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave
primitive drbd_disk1 ocf:linbit:drbd \
        params drbd_resource=disk1 \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave
primitive ip-iscsi0-vlan10 IPaddr2 \
        params ip=10.201.0.25 nic=eno4 cidr_netmask=24 \
        meta migration-threshold=2 target-role=Started \
        op monitor interval=20 on-fail=restart timeout=60
primitive ip-iscsi0-vlan20 IPaddr2 \
        params ip=10.201.1.25 nic=eno3 cidr_netmask=24 \
        meta migration-threshold=2 target-role=Started \
        op monitor interval=20 on-fail=restart timeout=60
primitive ip-iscsi0-vlan30 IPaddr2 \
        params ip=10.201.2.25 nic=eno2 cidr_netmask=24 \
        meta migration-threshold=2 target-role=Started \
        op monitor interval=20 on-fail=restart timeout=60
primitive ip-iscsi0-vlan40 IPaddr2 \
        params ip=10.201.3.25 nic=eno1 cidr_netmask=24 \
        meta migration-threshold=2 target-role=Started \
        op monitor interval=20 on-fail=restart timeout=60
primitive ip-iscsi1-vlan10 IPaddr2 \
        params ip=10.201.0.26 nic=eno4 cidr_netmask=24 \
        meta migration-threshold=2 \
        op monitor interval=20 on-fail=restart timeout=60
primitive ip-iscsi1-vlan20 IPaddr2 \
        params ip=10.201.1.26 nic=eno3 cidr_netmask=24 \
        meta migration-threshold=2 \
        op monitor interval=20 on-fail=restart timeout=60
primitive ip-iscsi1-vlan30 IPaddr2 \
        params ip=10.201.2.26 nic=eno2 cidr_netmask=24 \
        meta migration-threshold=2 \
        op monitor interval=20 on-fail=restart timeout=60
primitive ip-iscsi1-vlan40 IPaddr2 \
        params ip=10.201.3.26 nic=eno1 cidr_netmask=24 \
        meta migration-threshold=2 \
        op monitor interval=20 on-fail=restart timeout=60
primitive iscsi0-lun0 iSCSILogicalUnit \
        params implementation=lio-t target_iqn="iqn.2017-08.access.net:prod-1-ha" lun=0 path="/dev/zvol/iscsipool0/iscsi/net.access.prod-1-ha-root" \
        meta target-role=Started
primitive iscsi0-lun1 iSCSILogicalUnit \
        params implementation=lio-t target_iqn="iqn.2017-08.access.net:prod-1-ha" lun=1 path="/dev/zvol/iscsipool0/iscsi/net.access.prod-1-ha-swap" \
        meta target-role=Started
primitive iscsi0-target iSCSITarget \
        params implementation=lio-t iqn="iqn.2017-08.access.net:prod-1-ha" tid=1 \
        op monitor interval=30s \
        meta target-role=Started
primitive st-storage-1-prod stonith:meatware \
        params hostlist=storage-1-prod \
        meta target-role=Started
primitive st-storage-2-prod stonith:meatware \
        params hostlist=storage-2-prod \
        meta target-role=Started
primitive zfs-iscsipool0 ZFS \
        params pool=iscsipool0 \
        op start timeout=90 interval=0 \
        op stop timeout=90 interval=0 \
        meta target-role=Started
primitive zfs-iscsipool1 ZFS \
        params pool=iscsipool1 \
        op start timeout=90 interval=0 \
        op stop timeout=90 interval=0 \
        meta target-role=Started
ms drbd_master_slave0 drbd_disk0 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
ms drbd_master_slave1 drbd_disk1 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
clone dlm-clone dlm \
        meta clone-max=2 clone-node-max=1 target-role=Started
location cli-prefer-drbd_master_slave0 drbd_master_slave0 role=Master inf: storage-1-prod
location cli-prefer-drbd_master_slave1 drbd_master_slave1 role=Started inf: storage-2-prod
location cli-prefer-zfs-iscsipool0 zfs-iscsipool0 role=Started inf: storage-1-prod
location cli-prefer-zfs-iscsipool1 zfs-iscsipool1 role=Started inf: storage-2-prod
order ip0-after-drbd0 inf: drbd_master_slave0:promote zfs-iscsipool0 ip-iscsi1-vlan10 ip-iscsi1-vlan20 ip-iscsi1-vlan30 ip-iscsi1-vlan40 iscsi0-target iscsi0-lun0 iscsi0-lun1
order ip1-after-drbd1 inf: drbd_master_slave1:promote zfs-iscsipool1 ip-iscsi0-vlan10 ip-iscsi0-vlan20 ip-iscsi0-vlan30 ip-iscsi0-vlan40
location l-st-storage-1-prod st-storage-1-prod -inf: storage-1-prod
location l-st-storage-2-prod st-storage-2-prod -inf: storage-2-prod
location lun0-prefer-iscsipool0 iscsi0-target role=Started inf: storage-1-prod
location lun1-prefer-iscsipool0 iscsi0-lun1 role=Started inf: storage-1-prod
location storage-0 { drbd_master_slave0 zfs-iscsipool0 ip-iscsi0-vlan10 ip-iscsi0-vlan20 ip-iscsi0-vlan30 ip-iscsi0-vlan40 iscsi0-target iscsi0-lun0 iscsi0-lun1 } 100: storage-1-prod
location storage-1 { drbd_master_slave1 zfs-iscsipool1 ip-iscsi1-vlan10 ip-iscsi1-vlan20 ip-iscsi1-vlan30 ip-iscsi1-vlan40 } 100: storage-2-prod
colocation storage-target0 inf: ip-iscsi0-vlan10 ip-iscsi0-vlan20 ip-iscsi0-vlan30 ip-iscsi0-vlan40 zfs-iscsipool0 drbd_master_slave0:Master
colocation storage-target1 inf: ip-iscsi1-vlan10 ip-iscsi1-vlan20 ip-iscsi1-vlan30 ip-iscsi1-vlan40 zfs-iscsipool1 drbd_master_slave1:Master
location target0-prefer-iscsipool0 iscsi0-target role=Started inf: storage-1-prod
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.16-94ff4df \
        cluster-infrastructure=corosync \
        cluster-name=access_storage \
        stonith-enabled=true \
        no-quorum-policy=ignore \
        default-resource-stickiness=100
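P.S. Since I suspect the scattered location/order rules myself, this is the shape I would expect the iSCSI pieces to take if they were tied together explicitly: a sketch in crm configure syntax, using my existing resource names (the group name g-iscsi0 and the constraint ids are made up). I would be grateful if someone could confirm or correct this direction:

```
group g-iscsi0 iscsi0-target iscsi0-lun0 iscsi0-lun1
colocation iscsi0-with-drbd0 inf: g-iscsi0 drbd_master_slave0:Master
order iscsi0-after-pool0 inf: drbd_master_slave0:promote zfs-iscsipool0 g-iscsi0
```

The idea being that a group both orders the target before its LUNs and keeps them on the same node, instead of relying on separate location preferences.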
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org