[Linux-HA] Sending ping on other port

2013-11-12 Thread Ahmed Munir
Hi All,

I would like to know: is it possible to change the ping port to 5060 using
ocf::pingd? That is to say, I need to send the ping on port 5060.

If it is possible, please advise which parameters I need to set to enable it.

-- 
Regards,

Ahmed Munir Chohan
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Sending ping on other port

2013-11-12 Thread Takehiro Matsushima
Hi,

Generally, ping uses ICMP, which has no concept of a port.
ocf:pacemaker:pingd, and the ocf:pacemaker:ping agent it calls
internally, are implemented using fping.

BTW, I googled for a way to knock on a TCP/UDP port the way ping does;
hping looks usable. Of course, modifying or writing a resource agent
would be required for that. :)
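For illustration, a TCP "ping" of that sort with hping3 might look like
the following (the host name is only an example; check your hping
version's options):

  # send 3 TCP SYN probes to port 5060 and report whether the host answers
  hping3 -S -p 5060 -c 3 sip-server.example.com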


regards,
Takehiro Matsushima


2013/11/13 Ahmed Munir ahmedmunir...@gmail.com:
 Hi All,

 I would like to know: is it possible to change the ping port to 5060 using
 ocf::pingd? That is to say, I need to send the ping on port 5060.

 If it is possible, please advise which parameters I need to set to enable it.

 --
 Regards,

 Ahmed Munir Chohan
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)

2013-11-12 Thread Jefferson Ogata

Greetings.

I'm working on a high-availability iSCSI target (tgt) cluster using 
CentOS 6.4, to support a blade VM infrastructure. I've encountered a 
number of problems i haven't found documented elsewhere (not for lack of 
looking), and i want to run some solutions past the list to see what 
sticks. I have one outstanding problem i haven't been able to work out, 
which i will present in another thread to follow.


I'm going to forgo posting detailed configs initially here, as i think 
the problems are abstract enough that it won't be necessary. If it turns 
out to be necessary, we can do that. I hope also that this list is an 
acceptable place for this; i found a lot of pointers here so it seemed 
appropriate.


The cluster is two Dell boxes with a bunch of directly attached Dell SAS 
storage. On each box, disks are organized into two RAID10 volumes and 
four RAID6 volumes, to serve different profiles of IOPS and storage 
volume needed by various applications. Each of these volumes is synced 
to the other box using DRBD 8.4 over an LACP bond of two crossover 10 
GbE links. The pacemaker/CMAN/corosync stack also talks over the 
crossover. Each box is connected to the network using another 10 GbE 
link, with a 1 GbE link bonded in active/backup mode and attached to a 
separate switch so i can reboot the primary switch if necessary without 
losing connectivity. (The 10 GbE transceivers for my switch 
infrastructure are expensive, and multi-switch LACP is way too bleeding 
edge for me.) Write cache is disabled on all tgt LUNs (using mode_page), 
and all the RAIDs are battery backed.


One of my requirements here is to run multiple tgt targets. Each DRBD 
volume is the sole physical volume of a distinct LVM volume group, and 
each VG is assigned to a unique target. pacemaker naturally balances the 
six targets into three on each box. This configuration has the advantage 
that when most initiators are readers, outgoing bandwidth from each box 
can theoretically hit 10 Gb/s, resulting in 20 Gb/s read bandwidth. (In 
practice i'm maxing out around 12 Gb/s, but i think that's because of 
limitations in my switch infrastructure.) Another advantage of multiple 
targets is that when something screwy happens and a target goes down, 
this only takes down the LUNs on that target, rather than the whole kit 
and kaboodle.
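A per-volume stack of this sort can be sketched roughly as follows in crm
shell syntax (a sketch only, with made-up resource names; the
iSCSILogicalUnit primitives and the real parameters are omitted):

  primitive p_drbd_r0 ocf:linbit:drbd params drbd_resource=r0 \
      op monitor interval=29s role=Master op monitor interval=31s role=Slave
  ms ms_drbd_r0 p_drbd_r0 meta master-max=1 clone-max=2 notify=true
  primitive p_vg_r0 ocf:heartbeat:LVM params volgrpname=vg_r0
  primitive p_target_r0 ocf:heartbeat:iSCSITarget \
      params implementation=tgt iqn=iqn.2013-11.net.example:r0 tid=1
  group g_r0 p_vg_r0 p_target_r0
  colocation co_r0 inf: g_r0 ms_drbd_r0:Master
  order o_r0 inf: ms_drbd_r0:promote g_r0:start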


Another requirement i have is password authentication to the targets.

So, first the problems, then the workarounds:

The existing iSCSITarget RA is not designed to support multiple targets 
with a single user account in the tgt implementation. But getting the 
initiators (libvirt nodes, also using CentOS 6.4) to support a different 
account for each target is non-trivial, since there's only one slot in 
/etc/iscsi/iscsid.conf for authentication info. There are workarounds for 
initiators but they're pretty ugly, especially if you want to manage 
your iSCSI targets using libvirt pools.
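For reference, the single slot in question is the global session CHAP
settings in the initiator configuration (the values below are placeholders):

  # /etc/iscsi/iscsid.conf
  node.session.auth.authmethod = CHAP
  node.session.auth.username = someuser
  node.session.auth.password = somesecret

(Per-target overrides of these records via iscsiadm are possible, which is
roughly the kind of ugly workaround alluded to above.)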


In practice i ran into failover problems under load almost immediately. 
Under load, when i would initiate a failover, there was a race 
condition: the iSCSILogicalUnit RA will take down the LUNs one at a 
time, waiting for each connection to terminate, and if the initiators 
reconnect quickly enough, they get pissed off at finding that the target 
still exists but the LUN they were using no longer does, which is often 
the case during this transient takedown process. On the initiator, it 
looks something like this, and it's fatal (here LUN 4 has gone away but 
the target is still alive, maybe working on disconnecting LUN 3):


Nov  7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Sense Key : Illegal 
Request [current]
Nov  7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Add. Sense: Logical unit 
not supported
Nov  7 07:39:29 s01c kernel: Buffer I/O error on device sde, logical 
block 16542656


One solution to this is using the portblock RA to block all initiator 
traffic during failover, but this creates another problem: tgtd doesn't 
allow established connections to expire as long as there's outstanding 
data in the Send-Q for the TCP connection; if tgtd has already queued a 
bunch of traffic to an initiator when a failover starts, and portblock 
starts blocking ACK packets from the initiator, the Send-Q never drains, 
and the tgtd connection hangs permanently. This stops failover from 
completing, and eventually everyone is unhappy, especially me.
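To make that concrete: a portblock-style block amounts to dropping
everything the initiators send to the portal, roughly like the first rule
below (address and port are placeholders), which also drops their ACKs;
the reversal described later in the thread rejects the target's own
outbound traffic instead:

  # block traffic from the initiators to the portal (stock portblock behaviour)
  iptables -I INPUT -p tcp -d 192.0.2.10 --dport 3260 -j DROP

  # the reversed variant: reset tgtd's outbound packets so connections drop at once
  iptables -I OUTPUT -p tcp -s 192.0.2.10 --sport 3260 -j REJECT --reject-with tcp-reset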


I was having another problem with portblock: start and stop operations 
would frequently but unpredictably fail when multiple targets were 
simultaneously failing over. The error reported by pacemaker was 
'insufficient privileges' (rc=4). This was pretty mysterious since 
everything was running as root, and root has no problem executing 
iptables. This would blow failover sequences up, and the target resource 
would go down.


So, here's how i've worked around these problems. Comments about how 
stupid i was not to have done X are, of course, welcome. I'd rather not 
hear You must 

[Linux-HA] iSCSI corruption during interconnect failure with pacemaker+tgt+drbd+protocol C

2013-11-12 Thread Jefferson Ogata
Here's a problem i don't understand, and i'd like a solution to if 
possible, or at least i'd like to understand why it's a problem, because 
i'm clearly not getting something.


I have an iSCSI target cluster using CentOS 6.4 with stock 
pacemaker/CMAN/corosync and tgt, and DRBD 8.4 which i've built from source.


Both DRBD and cluster comms use a dedicated crossover link.

The target storage is battery-backed RAID.

DRBD resources all use protocol C.

stonith is configured and working.

tgtd write cache is disabled using mode_page in additional_params. This 
is correctly reported using sdparm --get WCE on initiators.
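For instance, with /dev/sdX standing in for a mapped LUN on an initiator:

  sdparm --get WCE /dev/sdX    # WCE 0 means the write cache is reported as off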


Here's the question: if i am writing from an iSCSI initiator, and i take 
down the crossover link between the nodes of my cluster, i end up with 
corrupt data on the target disk.


I know this isn't the formal way to test pacemaker failover. 
Everything's fine if i fence a node or do a manual migration or 
shutdown. But i don't understand why taking the crossover down results 
in corrupted write operations.


In greater detail, assuming the initiator sends a write request for some 
block, here's the normal sequence as i understand it:


- tgtd receives it and queues it straight for the device backing the LUN 
(write cache is disabled).
- drbd receives it, commits it to disk, sends it to the other node, and 
waits for an acknowledgement (protocol C).
- the remote node receives it, commits it to disk, and sends an 
acknowledgement.
- the initial node receives the drbd acknowledgement, and acknowledges 
the write to tgtd.

- tgtd acknowledges the write to the initiator.

Now, suppose an initiator is writing when i take the crossover link 
down, and pacemaker reacts to the loss in comms by fencing the node with 
the currently active target. It then brings up the target on the 
surviving, formerly inactive, node. This results in a drbd split brain, 
since some writes have been queued on the fenced node but never made it 
to the surviving node, and must be retransmitted by the initiator; once 
the surviving node becomes active it starts committing these writes to 
its copy of the mirror. I'm fine with a split brain; i can resolve it by 
discarding outstanding data on the fenced node.
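For concreteness, discarding the fenced node's data is the usual DRBD 8.4
split-brain recovery, roughly along these lines (r0 stands in for the
affected resource):

  # on the node whose data is to be discarded (the fenced node)
  drbdadm disconnect r0
  drbdadm secondary r0
  drbdadm connect --discard-my-data r0
  # on the surviving node, if it has also dropped the connection
  drbdadm connect r0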


But in practice, the actual written data is lost, and i don't understand 
why. AFAICS, none of the outstanding writes should have been 
acknowledged by tgtd on the fenced node, so when the surviving node 
becomes active, the initiator should simply re-send all of them. But 
this isn't what happens; instead most of the outstanding writes are 
lost. No i/o error is reported on the initiator; stuff just vanishes.


I'm writing directly to a block device for these tests, so the lost data 
isn't the result of filesystem corruption; it simply never gets written 
to the target disk on the survivor.


What am i missing?
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] iSCSI corruption during interconnect failure with pacemaker+tgt+drbd+protocol C

2013-11-12 Thread Andrew Beekhof

On 13 Nov 2013, at 2:10 pm, Jefferson Ogata linux...@antibozo.net wrote:

 Here's a problem i don't understand, and i'd like a solution to if possible, 
 or at least i'd like to understand why it's a problem, because i'm clearly 
 not getting something.
 
 I have an iSCSI target cluster using CentOS 6.4 with stock 
 pacemaker/CMAN/corosync and tgt, and DRBD 8.4 which i've built from source.
 
 Both DRBD and cluster comms use a dedicated crossover link.
 
 The target storage is battery-backed RAID.
 
 DRBD resources all use protocol C.
 
 stonith is configured and working.
 
 tgtd write cache is disabled using mode_page in additional_params. This is 
 correctly reported using sdparm --get WCE on initiators.
 
 Here's the question: if i am writing from an iSCSI initiator, and i take down 
 the crossover link between the nodes of my cluster, i end up with corrupt 
 data on the target disk.
 
 I know this isn't the formal way to test pacemaker failover. Everything's 
 fine if i fence a node or do a manual migration or shutdown. But i don't 
 understand why taking the crossover down results in corrupted write 
 operations.
 
 In greater detail, assuming the initiator sends a write request for some 
 block, here's the normal sequence as i understand it:
 
 - tgtd receives it and queues it straight for the device backing the LUN 
 (write cache is disabled).
 - drbd receives it, commits it to disk, sends it to the other node, and waits 
 for an acknowledgement (protocol C).
 - the remote node receives it, commits it to disk, and sends an 
 acknowledgement.
 - the initial node receives the drbd acknowledgement, and acknowledges the 
 write to tgtd.
 - tgtd acknowledges the write to the initiator.
 
 Now, suppose an initiator is writing when i take the crossover link down, and 
 pacemaker reacts to the loss in comms by fencing the node with the currently 
 active target. It then brings up the target on the surviving, formerly 
 inactive, node. This results in a drbd split brain, since some writes have 
 been queued on the fenced node but never made it to the surviving node, and 
 must be retransmitted by the initiator; once the surviving node becomes 
 active it starts committing these writes to its copy of the mirror. I'm fine 
 with a split brain; i can resolve it by discarding outstanding data on the 
 fenced node.
 
 But in practice, the actual written data is lost, and i don't understand why. 
 AFAICS, none of the outstanding writes should have been acknowledged by tgtd 
 on the fenced node, so when the surviving node becomes active, the initiator 
 should simply re-send all of them. But this isn't what happens; instead most 
 of the outstanding writes are lost. No i/o error is reported on the 
 initiator; stuff just vanishes.
 
 I'm writing directly to a block device for these tests, so the lost data 
 isn't the result of filesystem corruption; it simply never gets written to 
 the target disk on the survivor.
 
 What am i missing?

iSCSI, drbd, etc are not really my area of expertise, but it may be worth 
taking the cluster out of the loop and manually performing the equivalent 
actions.
If the underlying drbd and iSCSI setups have a problem, then the cluster isn't 
going to do much about it.
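For example (a rough sketch only; resource and target names are
placeholders, and the details depend on the actual setup), the manual
equivalent of a failover might be:

  # with the cluster stopped on both nodes:
  # node A: tear down its target by hand, then demote its copy of the volume
  drbdadm secondary r0
  # node B: promote and export the volume by hand, then repeat the write test
  drbdadm primary r0
  tgtadm --lld iscsi --mode target --op new --tid 1 --targetname iqn.2013-11.net.example:r0
  tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 --backing-store /dev/vg_r0/lv0
  tgtadm --lld iscsi --mode target --op bind --tid 1 --initiator-address ALL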



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] iSCSI corruption during interconnect failure with pacemaker+tgt+drbd+protocol C

2013-11-12 Thread Vladislav Bogdanov
13.11.2013 06:10, Jefferson Ogata wrote:
 Here's a problem i don't understand, and i'd like a solution to if
 possible, or at least i'd like to understand why it's a problem, because
 i'm clearly not getting something.
 
 I have an iSCSI target cluster using CentOS 6.4 with stock
 pacemaker/CMAN/corosync and tgt, and DRBD 8.4 which i've built from source.
 
 Both DRBD and cluster comms use a dedicated crossover link.
 
 The target storage is battery-backed RAID.
 
 DRBD resources all use protocol C.
 
 stonith is configured and working.
 
 tgtd write cache is disabled using mode_page in additional_params. This
 is correctly reported using sdparm --get WCE on initiators.
 
 Here's the question: if i am writing from an iSCSI initiator, and i take
 down the crossover link between the nodes of my cluster, i end up with
 corrupt data on the target disk.
 
 I know this isn't the formal way to test pacemaker failover.
 Everything's fine if i fence a node or do a manual migration or
 shutdown. But i don't understand why taking the crossover down results
 in corrupted write operations.
 
 In greater detail, assuming the initiator sends a write request for some
 block, here's the normal sequence as i understand it:
 
 - tgtd receives it and queues it straight for the device backing the LUN
 (write cache is disabled).
 - drbd receives it, commits it to disk, sends it to the other node, and
 waits for an acknowledgement (protocol C).
 - the remote node receives it, commits it to disk, and sends an
 acknowledgement.
 - the initial node receives the drbd acknowledgement, and acknowledges
 the write to tgtd.
 - tgtd acknowledges the write to the initiator.
 
 Now, suppose an initiator is writing when i take the crossover link
 down, and pacemaker reacts to the loss in comms by fencing the node with
 the currently active target. It then brings up the target on the
 surviving, formerly inactive, node. This results in a drbd split brain,
 since some writes have been queued on the fenced node but never made it
 to the surviving node, and must be retransmitted by the initiator; once
 the surviving node becomes active it starts committing these writes to
 its copy of the mirror. I'm fine with a split brain; i can resolve it by
 discarding outstanding data on the fenced node.
 
 But in practice, the actual written data is lost, and i don't understand
 why. AFAICS, none of the outstanding writes should have been
 acknowledged by tgtd on the fenced node, so when the surviving node
 becomes active, the initiator should simply re-send all of them. But
 this isn't what happens; instead most of the outstanding writes are
 lost. No i/o error is reported on the initiator; stuff just vanishes.
 
 I'm writing directly to a block device for these tests, so the lost data
 isn't the result of filesystem corruption; it simply never gets written
 to the target disk on the survivor.
 
 What am i missing?

Do you have handlers (fence-peer /usr/lib/drbd/crm-fence-peer.sh;
after-resync-target /usr/lib/drbd/crm-unfence-peer.sh;) configured in
drbd.conf?
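That is, roughly this in the resource definition (the handler paths are
the standard ones shipped with DRBD; in 8.4 the fencing policy itself
should go in the disk section, paired with these handlers):

  resource r0 {
    disk {
      fencing resource-only;
    }
    handlers {
      fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
      after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
    # ... net, volume and on <host> sections as before ...
  }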

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)

2013-11-12 Thread Vladislav Bogdanov
13.11.2013 04:46, Jefferson Ogata wrote:
...
 
 In practice i ran into failover problems under load almost immediately.
 Under load, when i would initiate a failover, there was a race
 condition: the iSCSILogicalUnit RA will take down the LUNs one at a
 time, waiting for each connection to terminate, and if the initiators
 reconnect quickly enough, they get pissed off at finding that the target
 still exists but the LUN they were using no longer does, which is often
 the case during this transient takedown process. On the initiator, it
 looks something like this, and it's fatal (here LUN 4 has gone away but
 the target is still alive, maybe working on disconnecting LUN 3):
 
 Nov  7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Sense Key : Illegal
 Request [current]
 Nov  7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Add. Sense: Logical unit
 not supported
 Nov  7 07:39:29 s01c kernel: Buffer I/O error on device sde, logical
 block 16542656
 
 One solution to this is using the portblock RA to block all initiator

In addition, I force the use of multipath on the initiators, with no_path_retry=queue.
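For example, the relevant fragment of /etc/multipath.conf (set globally
here; it can also be set per device):

  defaults {
      # queue I/O instead of failing it while all paths are down
      no_path_retry    queue
  }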

...

 
 1. Lack of support for multiple targets using the same tgt account. This
 is a problem because the iSCSITarget RA defines the user and the target
 at the same time. If it allowed multiple targets to use the same user,
 it wouldn't know when it is safe to delete the user in a stop operation,
 because some other target might still be using it.
 
 To solve this i did two things: first i wrote a new RA that manages a
 tgt user; this is instantiated as a clone so it runs along with the tgtd
 clone. Second i tweaked the iSCSITarget RA so that on start, if
 incoming_username is defined but incoming_password is not, the RA skips
 the account creation step and simply binds the new target to
 incoming_username. On stop, it similarly no longer deletes the account
 if incoming_password is unset. I also had to relax the uniqueness
 constraint on incoming_username in the RA metadata.
 
 2. Disappearing LUNs during failover cause initiators to blow chunks.
 For this i used portblock, but had to modify it because the TCP Send-Q
 would never drain.
 
 3. portblock preventing TCP Send-Q from draining, causing tgtd
 connections to hang. I modified portblock to reverse the sense of the
 iptables rules it was adding: instead of blocking traffic from the
 initiator on the INPUT chain, it now blocks traffic from the target on
 the OUTPUT chain with a tcp-reset response. With this setup, as soon as
 portblock goes active, the next packet tgtd attempts to send to a given
 initiator will get a TCP RST response, causing tgtd to hang up the
 connection immediately. This configuration allows the connections to
 terminate promptly under load.
 
 I'm not totally satisfied with this workaround. It means
 acknowledgements of operations tgtd has actually completed never make it
 back to the initiator. I suspect this could cause problems in some
 scenarios. I don't think it causes a problem the way i'm using it, with
 each LUN as backing store for a distinct VM--when the LUN is back up on
 the other node, the outstanding operations are re-sent by the initiator.
 Maybe with a clustered filesystem this would cause problems; it
 certainly would cause problems if the target device were, for example, a
 tape drive.
 
 4. Insufficient privileges faults in the portblock RA. This was
 another race condition that occurred because i was using multiple
 targets, meaning that without a mutex, multiple portblock invocations
 would be running in parallel during a failover. If you try to run
 iptables while another iptables is running, you get Resource not
 available and this was coming back to pacemaker as insufficient
 privileges. This is simply a bug in the portblock RA; it should have a
 mutex to prevent parallel iptables invocations. I fixed this by adding
 an ocf_release_lock_on_exit at the top, and adding an ocf_take_lock for
 start, stop, monitor, and status operations.
 
 I'm not sure why more people haven't run into these problems before. I
 hope it's not that i'm doing things wrong, but rather that few others
 have earnestly tried to build anything quite like this setup. If
 anyone out there has set up a similar cluster and *not* had these
 problems, i'd like to know about it. Meanwhile, if others *have* had
 these problems, i'd also like to know, especially if they've found
 alternate solutions.

Can't say about 1; I use IET, and it doesn't seem to have that limitation.
2 - I use an alternative home-brew ms RA which blocks (DROP) both input and
output for a specified VIP on demote (targets are configured to be bound
to those VIPs). I also export one big LUN per target and then set up a clvm
VG on top of it (all the initiators are in the same, separate cluster).
3 - Can't say either; IET is probably not affected.
4 - That is true; iptables doesn't have atomic rule management, so you
definitely need a mutex or a dispatcher like firewalld (I didn't try it though).
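For reference, the serialization described in 4 would look roughly like
this inside the portblock RA, using the lock helpers already provided by
ocf-shellfuncs (the lock file path is arbitrary):

  # near the top of the RA, after the usual includes
  : ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
  . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

  LOCKFILE="${HA_RSCTMP}/portblock.iptables.lock"

  # before dispatching start/stop/monitor/status:
  ocf_release_lock_on_exit "$LOCKFILE"   # drop the lock whenever the RA exits
  ocf_take_lock "$LOCKFILE"              # serialize concurrent iptables invocations
  # ... then run the existing iptables logic ...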