Re: [Linux-HA] FW cluster fails at 4am

2013-12-28 Thread Jefferson Ogata

On 2013-12-28 06:13, Tracy Reed wrote:

On Fri, Dec 27, 2013 at 08:54:17PM PST, Jefferson Ogata spake thusly:

Log rotation tends to run around that time on Red Hat. Check your logrotate
configuration. Maybe something is rotating corosync logs and using the wrong
signal to start a new log file.


That was actually the first thing I looked at! I found
/etc/logrotate.d/shorewall and removed it. But that seems to have had no effect
on the problem. That file has been gone for 3 weeks, the machines rebooted (not
that it should matter), and the problem has happened several times since then.

I've searched all over and can't find anything. And it doesn't even happen
every morning, just every week or two. Hard to nail down a real pattern other
than usually (not always) 4am.


Is it possible that it's a coincidence of log rotation after patching? 
In certain circumstances i've had library replacement or subsequent 
prelink activity on libraries lead to a crash of some services during 
log rotation. This hasn't happened to me with pacemaker/cman/corosync, 
but it might conceivably explain why it only happens to you once in a while.
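If you want to check the prelink angle, a quick sketch (assuming the stock
prelink package and its default log location):

grep -iE 'corosync|pacemaker|libqb' /var/log/prelink/prelink.log
cat /etc/sysconfig/prelink      # PRELINKING=yes means the cron.daily prelink job is active
rpm -qa --last | head -20       # did anything get patched shortly before the crash?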


You might take a look at the pacct data in /var/account/ for the time of 
the crash; it should indicate exit status for the dying process as well 
as what other processes were started around the same time.
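A minimal sketch of that, assuming the psacct package is installed and
logging to /var/account/pacct:

lastcomm --file /var/account/pacct corosync    # per-process records: flags, user, run time
dump-acct /var/account/pacct | tail -n 200     # raw records near the end; newer acct formats include the exit code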



Or, if not that, could it be some other cronned task?


These firewall machines are standard CentOS boxes. The stock crons (logrotate
etc) and a 5 minute nagios passive check are the only things on them as far as
I can tell, although I haven't quite figured out what causes logrotate to run
at 4am. I know the job is /etc/cron.daily/logrotate, but what runs that at
4am? Is 4am some special hard-coded time in crond?

I just noticed that there is an /etc/logrotate.d/cman which rotates
/var/log/cluster/*log Could this somehow be an issue? I'm running pacemaker and
corosync but I'm not running cman:

# /etc/init.d/cman status
cman is not running

Should I be? I don't think it is necessary for this particular kind of
cluster... But since it isn't running it shouldn't matter.


Yes, you're supposed to switch to cman. Not sure if it's related to your 
problem, tho.
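For reference, a quick way to see which membership layer pacemaker is
actually riding on (a sketch; these commands come with the stock CentOS 6
corosync/cman packages):

cman_tool status 2>/dev/null && echo cman is providing membership
corosync-objctl 2>/dev/null | grep -i pcmk   # plugin-based corosync setup shows pacemaker service keys
ps -e | egrep 'corosync|cman|pacemakerd'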

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] FW cluster fails at 4am

2013-12-27 Thread Jefferson Ogata

On 2013-12-28 04:34, Tracy Reed wrote:

First, thanks in advance for any help anyone may provide. I've been battling
this problem off and on for months and it is driving me mad:

Once every week or two my cluster fails. For reasons unknown it seems to
initiate a failover and then the shorewall service (lsb) does not get started
(or is stopped). The majority of the time it happens just after 4am, although
it has happened at other times, much less frequently. Tonight I am
going to have to be up at 4am to poke around on the cluster and observe what is
happening, if anything.

[snip]

Log rotation tends to run around that time on Red Hat. Check your 
logrotate configuration. Maybe something is rotating corosync logs and 
using the wrong signal to start a new log file.
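If it does turn out to be a rotation/signal problem, a conservative
workaround is copytruncate, which never signals anything; a sketch only,
with the filename and schedule illustrative:

/var/log/cluster/*.log {
    weekly
    missingok
    compress
    copytruncate
}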


Or, if not that, could it be some other cronned task?


Re: [Linux-HA] FYI: resource-agents-3.9.2-40.el6.x86_64 kills heartbeat-3.0.4

2013-12-04 Thread Jefferson Ogata

On 2013-12-02 00:03, Andrew Beekhof wrote:

On Wed, Nov 27, 2013, at 06:15 PM, Jefferson Ogata wrote:

On 2013-11-28 01:55, Andrew Beekhof wrote:

On 28 Nov 2013, at 11:29 am, Jefferson Ogata linux...@antibozo.net wrote:

On 2013-11-28 00:12, Dimitri Maziuk wrote:

Just so you know:

RedHat's (centos, actually) latest build of resource-agents sets $HA_BIN
to /usr/libexec/heartbeat. The daemon in heartbeat-3.0.4 RPM is
/usr/lib64/heartbeat/heartbeat, so the $HA_BIN/heartbeat binary does not exist.

(And please hold the upgrade to pacemaker comments: I'm hoping if I
wait just a little bit longer I can upgrade to ceph and openstack -- or
retire, whichever comes first ;)


Hey, upgrading to pacemaker wouldn't necessarily help. Red Hat broke that last month by 
dropping most of the resource agents they'd initially shipped. (Don't you love Technology 
Previews?)


That's the whole point behind the tech preview label... it means the software 
is not yet in a form that Red Hat will support and is subject to changes _exactly_ like 
the one made to resource-agents.


Um, yes, i know. That's why i mentioned it.


Ok, sorry, I wasn't sure.


It's nicer, however, when Red Hat takes a conservative position with the
Tech Preview. They could have shipped a minimal set of resource agents
in the first place,


3 years ago we didn't know if pacemaker would _ever_ be supported in
RHEL-6, so stripping out agents wasn't on our radar.
I'm sure the only reason it and the rest of pacemaker shipped at all was
to humor the guy they'd just hired.

It was only at the point that supporting pacemaker in 6.5 became likely
that someone took a look at the full list and had a heart-attack.


so people would have a better idea what they had to
provide on their own end, instead of pulling the rug out with nary a
mention of what they were doing.


Yes, that was not good.
One of the challenges I find at Red Hat is the gaps between when a
decision is made, when we're allowed to talk about it and when customers
find out about it.  As a developer, it's the things we spent
significant time on that first come to mind when writing release notes,
not the 3s it took to remove some files from the spec file - even though
the latter is going to have a bigger effect :-(

We can only say that lessons have been learned and that we will do
better if there is a similar situation next time.


+1 Insightful/Informative/Interesting.


Re: [Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)

2013-12-04 Thread Jefferson Ogata

On 2013-11-21 16:34, Jefferson Ogata wrote:

On 2013-11-20 08:35, Jefferson Ogata wrote:

Indeed, using iptables with REJECT and tcp-reset, this seems to piss off
the initiators, creating immediate i/o errors. But one can use DROP on
incoming SYN packets and let established connections drain. I've been
trying to get this to work but am finding that it takes so long for some
connections to drain that something times out. I haven't given up on
this approach, tho. Testing this stuff can be tricky because if i make
one mistake, stonith kicks in and i end up having to wait 5-10 minutes
for the machine to reboot and resync its DRBD devices.


Follow-up on this: the original race condition i reported still occurs
with this strategy: if existing TCP connections are allowed to drain by
passing packets from established initiator connections (by blocking only
SYN packets), then the initiator can also send new requests to the
target during the takedown process; the takedown removes LUNs from the
live target and the initiator generates an i/o error if it happens to
try to access a LUN that has been removed before the connection is removed.

This happens because the configuration looks something like this (crm):

group foo portblock vip iSCSITarget:target iSCSILogicalUnit:lun1
iSCSILogicalUnit:lun2 iSCSILogicalUnit:lun3 portunblock

On takedown, if portblock is tweaked to pass packets for existing
connections so they can drain, there's a window while LUNs lun3, lun2,
lun1 are being removed from the target where this race condition occurs.
The connection isn't removed until iSCSITarget runs to stop the target.

A way to handle this that should actually work is to write a new RA that
deletes the connections from the target *before* the LUNs are removed
during takedown. The config would look something like this, then:

group foo portblock vip iSCSITarget:target iSCSILogicalUnit:lun1
iSCSILogicalUnit:lun2 iSCSILogicalUnit:lun3 tgtConnections portunblock

On takedown, then, portunblock will block new incoming connections,
tgtConnections will shut down existing connections and wait for them to
drain, then the LUNs can be safely removed before the target is taken down.

I'll write this RA today and see how that works.


So, this strategy worked. The final RA is attached. The config (crm) 
then looks like this, using the tweaked portblock RA that blocks SYN 
only, the tgtUser RA that adds a tgtd user, and the tweaked iSCSITarget 
RA that doesn't add a user if no password is provided (see previous 
discussion for the latter two RAs). This is a two-node cluster using 
DRBD-backed LVM volume groups and multiple targets. The names have been 
changed to protect the innocent, and the config is simplified to a 
single target for brevity, but it should be clear how to extend it to 
multiple DRBDs/VGs/targets. I've left out the stonith config here also.



primitive tgtd lsb:tgtd op monitor interval=10s
clone clone.tgtd tgtd
primitive user.username ocf:local:tgtUser params username=username 
password=password

clone clone.user.username user.username
order clone.tgtd_before_clone.user.username inf: clone.tgtd:start 
clone.user.username:start


primitive drbd.pv1 ocf:linbit:drbd params drbd_resource=pv1 op monitor 
role=Slave interval=29s timeout=600s op monitor role=Master 
interval=31s timeout=600s op start timeout=240s op stop timeout=240s
ms ms.drbd.pv1 drbd.pv1 meta master-max=1 master-node-max=1 
clone-max=2 clone-node-max=1 notify=true
primitive lvm.vg1 ocf:heartbeat:LVM params volgrpname=vg1 op monitor 
interval=30s timeout=30s op start timeout=30s op stop timeout=30s

order ms.drbd.pv1_before_lvm.vg1 inf: ms.drbd.pv1:promote lvm.vg1:start
colocation ms.drbd.pv1_with_lvm.vg1 inf: ms.drbd.pv1:Master lvm.vg1

primitive target.1 ocf:local:iSCSITarget params iqn=iqnt1 tid=1 
incoming_username=username implementation=tgt portals="" op monitor 
interval=30s op start timeout=30s op stop timeout=120s
primitive lun.1.1 ocf:heartbeat:iSCSILogicalUnit params 
target_iqn=iqnt1 lun=1 path=/dev/vg1/lv1 
additional_parameters=scsi_id=vg1/lv1 
mode_page=8:0:18:0x10:0:0xff:0xff:0:0:0xff:0xff:0xff:0xff:0x80:0x14:0:0:0:0:0:0 
implementation=tgt op monitor interval=30s op start timeout=30s op 
stop timeout=120s
primitive ip.192.168.1.244 ocf:heartbeat:IPaddr params 
ip=192.168.1.244 cidr_netmask=24 nic=bond0
primitive portblock.ip.192.168.1.244 ocf:local:portblock params 
ip=192.168.1.244 action=block protocol=tcp portno=3260 
syn_only=true op monitor interval=10s timeout=10s depth=0
primitive tgtfinal.1 ocf:local:tgtFinal params tid=1 op monitor 
interval=30s timeout=30s op stop timeout=60s
primitive portunblock.ip.192.168.1.244 ocf:local:portblock params 
ip=192.168.1.244 action=unblock protocol=tcp portno=3260 
syn_only=true op monitor interval=10s timeout=10s depth=0


group group.target.1 lvm.vg1 portblock.ip.192.168.1.244 
ip.192.168.1.244 target.1 lun.1.1 tgtfinal.1 portunblock.ip.192.168.1.244


order clone.tgtd_before_group.target.1 inf: clone.tgtd:start group.target.1:start

Re: [Linux-HA] FYI: resource-agents-3.9.2-40.el6.x86_64 kills heartbeat-3.0.4

2013-11-27 Thread Jefferson Ogata

On 2013-11-28 00:12, Dimitri Maziuk wrote:

Just so you know:

RedHat's (centos, actually) latest build of resource-agents sets $HA_BIN
to /usr/libexec/heartbeat. The daemon in heartbeat-3.0.4 RPM is
/usr/lib64/heartbeat/heartbeat, so the $HA_BIN/heartbeat binary does not exist.

(And please hold the upgrade to pacemaker comments: I'm hoping if I
wait just a little bit longer I can upgrade to ceph and openstack -- or
retire, whichever comes first ;)


Hey, upgrading to pacemaker wouldn't necessarily help. Red Hat broke 
that last month by dropping most of the resource agents they'd initially 
shipped. (Don't you love Technology Previews?)



Re: [Linux-HA] FYI: resource-agents-3.9.2-40.el6.x86_64 kills heartbeat-3.0.4

2013-11-27 Thread Jefferson Ogata

On 2013-11-28 01:55, Andrew Beekhof wrote:

On 28 Nov 2013, at 11:29 am, Jefferson Ogata linux...@antibozo.net wrote:

On 2013-11-28 00:12, Dimitri Maziuk wrote:

Just so you know:

RedHat's (centos, actually) latest build of resource-agents sets $HA_BIN
to /usr/libexec/heartbeat. The daemon in heartbeat-3.0.4 RPM is
/usr/lib64/heartbeat/heartbeat, so the $HA_BIN/heartbeat binary does not exist.

(And please hold the upgrade to pacemaker comments: I'm hoping if I
wait just a little bit longer I can upgrade to ceph and openstack -- or
retire, whichever comes first ;)


Hey, upgrading to pacemaker wouldn't necessarily help. Red Hat broke that last month by 
dropping most of the resource agents they'd initially shipped. (Don't you love Technology 
Previews?)


That's the whole point behind the tech preview label... it means the software 
is not yet in a form that Red Hat will support and is subject to changes _exactly_ like 
the one made to resource-agents.


Um, yes, i know. That's why i mentioned it.

It's nicer, however, when Red Hat takes a conservative position with the 
Tech Preview. They could have shipped a minimal set of resource agents 
in the first place, so people would have a better idea what they had to 
provide on their own end, instead of pulling the rug out with nary a 
mention of what they were doing.



Re: [Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)

2013-11-21 Thread Jefferson Ogata

On 2013-11-20 08:35, Jefferson Ogata wrote:

Indeed, using iptables with REJECT and tcp-reset, this seems to piss off
the initiators, creating immediate i/o errors. But one can use DROP on
incoming SYN packets and let established connections drain. I've been
trying to get this to work but am finding that it takes so long for some
connections to drain that something times out. I haven't given up on
this approach, tho. Testing this stuff can be tricky because if i make
one mistake, stonith kicks in and i end up having to wait 5-10 minutes
for the machine to reboot and resync its DRBD devices.


Follow-up on this: the original race condition i reported still occurs 
with this strategy: if existing TCP connections are allowed to drain by 
passing packets from established initiator connections (by blocking only 
SYN packets), then the initiator can also send new requests to the 
target during the takedown process; the takedown removes LUNs from the 
live target and the initiator generates an i/o error if it happens to 
try to access a LUN that has been removed before the connection is removed.


This happens because the configuration looks something like this (crm):

group foo portblock vip iSCSITarget:target iSCSILogicalUnit:lun1 
iSCSILogicalUnit:lun2 iSCSILogicalUnit:lun3 portunblock


On takedown, if portblock is tweaked to pass packets for existing 
connections so they can drain, there's a window while LUNs lun3, lun2, 
lun1 are being removed from the target where this race condition occurs. 
The connection isn't removed until iSCSITarget runs to stop the target.


A way to handle this that should actually work is to write a new RA that 
deletes the connections from the target *before* the LUNs are removed 
during takedown. The config would look something like this, then:


group foo portblock vip iSCSITarget:target iSCSILogicalUnit:lun1 
iSCSILogicalUnit:lun2 iSCSILogicalUnit:lun3 tgtConnections portunblock


On takedown, then, portunblock will block new incoming connections, 
tgtConnections will shut down existing connections and wait for them to 
drain, then the LUNs can be safely removed before the target is taken down.


I'll write this RA today and see how that works.
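For the tgt implementation, the core of that RA's stop action can be
sketched roughly like this (illustrative only; it assumes tgtadm's
conn-mode output lists Session:/Connection: pairs, and tid=1):

tid=1
tgtadm --lld iscsi --mode conn --op show --tid "$tid" |
while read key value; do
    case "$key" in
        Session:)    sid=$value ;;
        Connection:) tgtadm --lld iscsi --mode conn --op delete \
                            --tid "$tid" --sid "$sid" --cid "$value" ;;
    esac
done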


Re: [Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)

2013-11-20 Thread Jefferson Ogata

On 2013-11-20 07:04, Vladislav Bogdanov wrote:

19.11.2013 13:48, Lars Ellenberg wrote:

On Wed, Nov 13, 2013 at 09:02:47AM +0300, Vladislav Bogdanov wrote:

13.11.2013 04:46, Jefferson Ogata wrote:
...

3. portblock preventing TCP Send-Q from draining, causing tgtd
connections to hang. I modified portblock to reverse the sense of the
iptables rules it was adding: instead of blocking traffic from the
initiator on the INPUT chain, it now blocks traffic from the target on
the OUTPUT chain with a tcp-reset response. With this setup, as soon as
portblock goes active, the next packet tgtd attempts to send to a given
initiator will get a TCP RST response, causing tgtd to hang up the
connection immediately. This configuration allows the connections to
terminate promptly under load.

I'm not totally satisfied with this workaround. It means
acknowledgements of operations tgtd has actually completed never make it
back to the initiator. I suspect this could cause problems in some
scenarios. I don't think it causes a problem the way i'm using it, with
each LUN as backing store for a distinct VM--when the LUN is back up on
the other node, the outstanding operations are re-sent by the initiator.
Maybe with a clustered filesystem this would cause problems; it
certainly would cause problems if the target device were, for example, a
tape drive.


Maybe only block new incoming connection attempts?


That may cause issues on an initiator side in some circumstances (IIRC):
* connection is established
* pacemaker fires target move
* target is destroyed, connection breaks (TCP RST is sent to initiator)
* initiator connects again
* target is not available on iSCSI level (but portals answer either on
old or on new node) or portals are not available
* initiator *returns error* to an upper layer - this one is important
* target is configured on other node then

I was hit by this, but that was several years ago, so I may miss some
details.


Indeed, using iptables with REJECT and tcp-reset, this seems to piss off 
the initiators, creating immediate i/o errors. But one can use DROP on 
incoming SYN packets and let established connections drain. I've been 
trying to get this to work but am finding that it takes so long for some 
connections to drain that something times out. I haven't given up on 
this approach, tho. Testing this stuff can be tricky because if i make 
one mistake, stonith kicks in and i end up having to wait 5-10 minutes 
for the machine to reboot and resync its DRBD devices.
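The SYN-only variant is basically a single rule of this shape (VIP and
port illustrative):

iptables -I INPUT -p tcp -d 192.168.1.244 --dport 3260 --syn -j DROP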



My experience with IET and LIO shows it is better (safer) to block all
iSCSI traffic to target's portals, both directions.
* connection is established
* pacemaker fires target move
* both directions are blocked (DROP) on both target nodes
* target is destroyed, connection stays established on initiator side,
just TCP packets timeout
* target is configured on other node (VIPs are moved too)
* firewall rules are removed
* initiator (re)sends request
* target sends RST (?) back - it doesn't have that connection
* initiator reconnects and continues to use target


As already noted, this approach doesn't work with TGT because it refuses 
to tear down its config until it has drained all its connections, and 
they can't drain if ACK packets can't come in. The only reliable 
solution i've found so far is to send RSTs to tgtd (but leave the 
initiators in the dark).
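Concretely the reversed rule is of this shape (VIP and port
illustrative) -- reset tgtd's outbound iSCSI traffic rather than
dropping the initiator's inbound packets:

iptables -I OUTPUT -p tcp -s 192.168.1.244 --sport 3260 -j REJECT --reject-with tcp-reset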


I'm also using VIPs. They don't have to be bound to a specific target in 
a tgt configuration; you just have each initiator connect to a given 
target only using its unique VIP.



Re: [Linux-HA] iSCSI corruption during interconnect failure with pacemaker+tgt+drbd+protocol C

2013-11-19 Thread Jefferson Ogata

On 2013-11-19 09:32, Lars Ellenberg wrote:

On Wed, Nov 13, 2013 at 03:10:07AM +, Jefferson Ogata wrote:

Here's a problem i don't understand, and i'd like a solution to if
possible, or at least i'd like to understand why it's a problem,
because i'm clearly not getting something.

I have an iSCSI target cluster using CentOS 6.4 with stock
pacemaker/CMAN/corosync and tgt, and DRBD 8.4 which i've built from
source.

Both DRBD and cluster comms use a dedicated crossover link.

The target storage is battery-backed RAID.

DRBD resources all use protocol C.

stonith is configured and working.


What about DRBD fencing.

You have to use fencing resource-and-stonith;,
and a suitable fencing handler.


Currently i have fencing resource-only; and

fence-peer /usr/lib/drbd/crm-fence-peer.sh;
after-resync-target /usr/lib/drbd/crm-unfence-peer.sh;

in my DRBD config. stonith is configured in pacemaker. This was the best 
i could come up with from what documentation i was able to find.


I will try using fencing resource-and-stonith; but i'm unclear on 
whether that requires some sort of additional stonith configuration in 
DRBD, which i didn't think would be necessary.
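As i understand it, that amounts to something like this in the DRBD
config (a sketch, using the same handler scripts i already have
configured):

disk {
    fencing resource-and-stonith;
}
handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}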



Because:


tgtd write cache is disabled using mode_page in additional_params.
This is correctly reported using sdparm --get WCE on initiators.

Here's the question: if i am writing from an iSCSI initiator, and i
take down the crossover link between the nodes of my cluster, i end
up with corrupt data on the target disk.

I know this isn't the formal way to test pacemaker failover.
Everything's fine if i fence a node or do a manual migration or
shutdown. But i don't understand why taking the crossover down
results in corrupted write operations.

In greater detail, assuming the initiator sends a write request for
some block, here's the normal sequence as i understand it:

- tgtd receives it and queues it straight for the device backing the
LUN (write cache is disabled).
- drbd receives it, commits it to disk, sends it to the other node,
and waits for an acknowledgement (protocol C).
- the remote node receives it, commits it to disk, and sends an
acknowledgement.
- the initial node receives the drbd acknowledgement, and
acknowledges the write to tgtd.
- tgtd acknowledges the write to the initiator.

Now, suppose an initiator is writing when i take the crossover link
down, and pacemaker reacts to the loss in comms by fencing the node
with the currently active target. It then brings up the target on
the surviving, formerly inactive, node. This results in a drbd split
brain, since some writes have been queued on the fenced node but
never made it to the surviving node,


But have been acknowledged as written to the initiator,
which is why the initiator won't retransmit them.


This is the crux of what i'm not understanding: why, if i'm using 
protocol C, would DRBD acknowledge a write before it's been committed to 
the remote replica? If so, then i really don't understand the point of 
protocol C.


Or is tgtd acknowledging writes before they've been committed by the 
underlying backing store? I thought disabling the write cache would 
prevent that.


What i *thought* should happen was that writes received by the target 
after the crossover link fails would not be acknowledged under protocol 
C, and would be retransmitted after fencing completed and the backup 
node becomes primary. These writes overlap with writes that were 
committed on the fenced node's replica but hadn't been transmitted to 
the other replica, so this results in split brain that is reliably 
resolvable by discarding data from the fenced node's replica and resyncing.



With the DRBD fencing policy fencing resource-and-stonith;,
DRBD will *block* further IO (and not acknowledge anything
that did not make it to the peer) until the fence-peer handler
returns that it would be safe to resume IO again.

This avoids the data divergence (aka split-brain, because
usually data-divergence is the result of split-brain).


and must be retransmitted by
the initiator; once the surviving node becomes active it starts
committing these writes to its copy of the mirror. I'm fine with a
split brain;


You should not be.
DRBD reporting split-brain detected is usually a sign of a bad setup.


Well, by "fine" i meant that i felt i had a clear understanding of how 
to resolve the split brain without ending up with corruption. But if 
DRBD is acknowledging writes that haven't been committed on both 
replicas even with protocol C, i may have been incorrect in this.



i can resolve it by discarding outstanding data on the fenced node.

But in practice, the actual written data is lost, and i don't
understand why. AFAICS, none of the outstanding writes should have
been acknowledged by tgtd on the fenced node, so when the surviving
node becomes active, the initiator should simply re-send all of
them. But this isn't what happens; instead most of the outstanding
writes are lost. No i/o error is reported on the initiator; stuff just vanishes.

Re: [Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)

2013-11-19 Thread Jefferson Ogata

On 2013-11-19 10:48, Lars Ellenberg wrote:

On Wed, Nov 13, 2013 at 09:02:47AM +0300, Vladislav Bogdanov wrote:

13.11.2013 04:46, Jefferson Ogata wrote:
...

In practice i ran into failover problems under load almost immediately.
Under load, when i would initiate a failover, there was a race
condition: the iSCSILogicalUnit RA will take down the LUNs one at a
time, waiting for each connection to terminate, and if the initiators
reconnect quickly enough, they get pissed off at finding that the target
still exists but the LUN they were using no longer does, which is often
the case during this transient takedown process. On the initiator, it
looks something like this, and it's fatal (here LUN 4 has gone away but
the target is still alive, maybe working on disconnecting LUN 3):

Nov  7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Sense Key : Illegal
Request [current]
Nov  7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Add. Sense: Logical unit
not supported
Nov  7 07:39:29 s01c kernel: Buffer I/O error on device sde, logical
block 16542656

One solution to this is using the portblock RA to block all initiator


In addition I force use of multipath on initiators with no_path_retry=queue

...



1. Lack of support for multiple targets using the same tgt account. This
is a problem because the iSCSITarget RA defines the user and the target
at the same time. If it allowed multiple targets to use the same user,
it wouldn't know when it is safe to delete the user in a stop operation,
because some other target might still be using it.

To solve this i did two things: first i wrote a new RA that manages a


Did I miss it, or did you post it somewhere?
Fork on Github and push there, so we can have a look?


Not set up with git right now; i've attached it here. It's short.


tgt user; this is instantiated as a clone so it runs along with the tgtd
clone. Second i tweaked the iSCSITarget RA so that on start, if
incoming_username is defined but incoming_password is not, the RA skips
the account creation step and simply binds the new target to
incoming_username. On stop, it similarly no longer deletes the account
if incoming_password is unset. I also had to relax the uniqueness
constraint on incoming_username in the RA metadata.

2. Disappearing LUNs during failover cause initiators to blow chunks.
For this i used portblock, but had to modify it because the TCP Send-Q
would never drain.

3. portblock preventing TCP Send-Q from draining, causing tgtd
connections to hang. I modified portblock to reverse the sense of the
iptables rules it was adding: instead of blocking traffic from the
initiator on the INPUT chain, it now blocks traffic from the target on
the OUTPUT chain with a tcp-reset response. With this setup, as soon as
portblock goes active, the next packet tgtd attempts to send to a given
initiator will get a TCP RST response, causing tgtd to hang up the
connection immediately. This configuration allows the connections to
terminate promptly under load.

I'm not totally satisfied with this workaround. It means
acknowledgements of operations tgtd has actually completed never make it
back to the initiator. I suspect this could cause problems in some
scenarios. I don't think it causes a problem the way i'm using it, with
each LUN as backing store for a distinct VM--when the LUN is back up on
the other node, the outstanding operations are re-sent by the initiator.
Maybe with a clustered filesystem this would cause problems; it
certainly would cause problems if the target device were, for example, a
tape drive.


Maybe only block new incoming connection attempts?


That's a good idea. Theoretically that should allow the existing 
connections to drain. I'm worried it can lead to pacemaker timeouts 
firing if there's a lot of queued data in the send queues, but i'll test it.


Thanks for the suggestion.
#!/bin/sh
#
# Resource script for managing tgt users
#
# Description:  Manages a tgt user as an OCF resource in 
#   a High Availability setup.
#
# Author: Jefferson Ogata jefferson.og...@noaa.gov
# License: GNU General Public License (GPL) 
#
#
#   usage: $0 {start|stop|status|monitor|validate-all|meta-data}
#
#   The start arg adds the user.
#
#   The stop arg deletes it.
#
# OCF parameters:
# OCF_RESKEY_username
# OCF_RESKEY_password
#
# Example crm configuration:
#
# primitive tgtd lsb:tgtd op monitor interval=10s 
# clone clone.tgtd tgtd
# primitive user.foo ocf:heartbeat:tgtUser params username=foo 
password=secret  
# clone clone.user.foo user.foo
# order clone.tgtd_before_clone.user.foo inf: clone.tgtd:start 
clone.user.foo:start

#
##
# Initialization:

: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

USAGE="Usage: $0 {start|stop|status|monitor|validate-all|meta-data}";

##

usage() 
{
echo "$USAGE" >&2
}

meta_data

Re: [Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)

2013-11-19 Thread Jefferson Ogata

On 2013-11-13 06:02, Vladislav Bogdanov wrote:

13.11.2013 04:46, Jefferson Ogata wrote:

[snip]

4. Insufficient privileges faults in the portblock RA. This was
another race condition that occurred because i was using multiple
targets, meaning that without a mutex, multiple portblock invocations
would be running in parallel during a failover. If you try to run
iptables while another iptables is running, you get Resource not
available and this was coming back to pacemaker as insufficient
privileges. This is simply a bug in the portblock RA; it should have a
mutex to prevent parallel iptables invocations. I fixed this by adding
an ocf_release_lock_on_exit at the top, and adding an ocf_take_lock for
start, stop, monitor, and status operations.

I'm not sure why more people haven't run into these problems before. I
hope it's not that i'm doing things wrong, but rather that few others
haven't earnestly tried to build anything quite like this setup. If
anyone out there has set up a similar cluster and *not* had these
problems, i'd like to know about it. Meanwhile, if others *have* had
these problems, i'd also like to know, especially if they've found
alternate solutions.


Can't say about 1, I use IET, it doesn't seem to have that limitation.
2 - I use alternative home-brew ms RA which blocks (DROP) both input and
output for a specified VIP on demote (targets are configured to be bound
to that VIPs). I also export one big LUN per target and then set up clvm
VG on top of it (all initiators are in the same another cluster).
3 - can't say as well, IET is probably not affected.
4 - That is true, iptables doesn't have atomic rules management, so you
definitely need mutex or dispatcher like firewalld (didn't try it though).


The issue here is not really the lack of atomic rules management (which 
you can hack in iptables by defining a new chain for the rules you want 
to add and adding a jump to it in the INPUT or OUTPUT chain). The issue 
i was encountering is that the iptables process takes a lock; if a 
second iptables process attempts to run concurrently, it can't take the 
lock, and it fails. The RA was feeding this back to pacemaker as a 
resource start/stop failure, when really the RA should have been 
preventing concurrent execution of iptables in the first place.
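The fix i applied amounts to roughly this near the top of portblock (the
lock file path is illustrative; ocf_take_lock and ocf_release_lock_on_exit
come from ocf-shellfuncs):

LOCKFILE="${HA_RSCTMP:-/var/run/resource-agents}/portblock.lock"
case "$1" in
    start|stop|monitor|status)
        ocf_take_lock "$LOCKFILE"
        ocf_release_lock_on_exit "$LOCKFILE"
        ;;
esac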


Thanks for the feedback.


[Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)

2013-11-12 Thread Jefferson Ogata

Greetings.

I'm working on a high-availability iSCSI target (tgt) cluster using 
CentOS 6.4, to support a blade VM infrastructure. I've encountered a 
number of problems i haven't found documented elsewhere (not for lack of 
looking), and i want to run some solutions past the list to see what 
sticks. I have one outstanding problem i haven't been able to work out, 
which i will present in another thread to follow.


I'm going to forgo posting detailed configs initially here, as i think 
the problems are abstract enough that it won't be necessary. If it turns 
out to be necessary, we can do that. I hope also that this list is an 
acceptable place for this; i found a lot of pointers here so it seemed 
appropriate.


The cluster is two Dell boxes with a bunch of directly attached Dell SAS 
storage. On each box, disks are organized into two RAID10 volumes and 
four RAID6 volumes, to serve different profiles of IOPS and storage 
volume needed by various applications. Each of these volumes is synced 
to the other box using DRBD 8.4 over an LACP bond of two crossover 10 
GbE links. The pacemaker/CMAN/corosync stack also talks over the 
crossover. Each box is connected to the network using another 10 GbE 
link, with a 1 GbE link bonded in active/backup mode and attached to a 
separate switch so i can reboot the primary switch if necessary without 
losing connectivity. (The 10 GbE transceivers for my switch 
infrastructure are expensive, and multi-switch LACP is way too bleeding 
edge for me.) Write cache is disabled on all tgt LUNs (using mode_page), 
and all the RAIDs are battery backed.


One of my requirements here is to run multiple tgt targets. Each DRBD 
volume is the sole physical volume of a distinct LVM volume group, and 
each VG is assigned to a unique target. pacemaker naturally balances the 
six targets into three on each box. This configuration has the advantage 
that when most initiators are readers, outgoing bandwidth from each box 
can theoretically hit 10 Gb/s, resulting in 20 Gb/s read bandwidth. (In 
practice i'm maxing out around 12 Gb/s, but i think that's because of 
limitations in my switch infrastructure.) Another advantage of multiple 
targets is that when something screwy happens and a target goes down, 
this only takes down the LUNs on that target, rather than the whole kit 
and kaboodle.


Another requirement i have is password authentication to the targets.

So, first the problems, then the workarounds:

The existing iSCSITarget RA is not designed to support multiple targets 
with a single user account in the tgt implementation. But getting the 
initiators (libvirt nodes, also using CentOS 6.4) to support a different 
account for each target is non-trivial, since there's only one slot in 
/etc/iscsi/iscsid.conf for authentication info. There are workarounds for 
initiators but they're pretty ugly, especially if you want to manage 
your iSCSI targets using libvirt pools.
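The single slot i mean is the global CHAP section of the initiator
config; placeholder values:

node.session.auth.authmethod = CHAP
node.session.auth.username = someuser
node.session.auth.password = somepassword

You can override these per node record with iscsiadm -o update, but then
you're maintaining credentials outside of libvirt, which is the ugly part.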


In practice i ran into failover problems under load almost immediately. 
Under load, when i would initiate a failover, there was a race 
condition: the iSCSILogicalUnit RA will take down the LUNs one at a 
time, waiting for each connection to terminate, and if the initiators 
reconnect quickly enough, they get pissed off at finding that the target 
still exists but the LUN they were using no longer does, which is often 
the case during this transient takedown process. On the initiator, it 
looks something like this, and it's fatal (here LUN 4 has gone away but 
the target is still alive, maybe working on disconnecting LUN 3):


Nov  7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Sense Key : Illegal 
Request [current]
Nov  7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Add. Sense: Logical unit 
not supported
Nov  7 07:39:29 s01c kernel: Buffer I/O error on device sde, logical 
block 16542656


One solution to this is using the portblock RA to block all initiator 
traffic during failover, but this creates another problem: tgtd doesn't 
allow established connections to expire as long as there's outstanding 
data in the Send-Q for the TCP connection; if tgtd has already queued a 
bunch of traffic to an initiator when a failover starts, and portblock 
starts blocking ACK packets from the initiator, the Send-Q never drains, 
and the tgtd connection hangs permanently. This stops failover from 
completing, and eventually everyone is unhappy, especially me.
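You can watch the stuck queue directly; the Send-Q column for the
target's port 3260 connections just sits there once the ACKs are being
dropped (port number assumed):

watch -n1 "netstat -tn | grep ':3260 '"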


I was having another problem with portblock: start and stop operations 
would frequently but unpredictably fail when multiple targets were 
simultaneously failing over. The error reported by pacemaker was 
'insufficient privileges' (rc=4). This was pretty mysterious since 
everything was running as root, and root has no problem executing 
iptables. This would blow failover sequences up, and the target resource 
would go down.


So, here's how i've worked around these problems. Comments about how 
stupid i was not to have done X are, of course, welcome. I'd rather not 
hear You must 

[Linux-HA] iSCSI corruption during interconnect failure with pacemaker+tgt+drbd+protocol C

2013-11-12 Thread Jefferson Ogata
Here's a problem i don't understand, and i'd like a solution to if 
possible, or at least i'd like to understand why it's a problem, because 
i'm clearly not getting something.


I have an iSCSI target cluster using CentOS 6.4 with stock 
pacemaker/CMAN/corosync and tgt, and DRBD 8.4 which i've built from source.


Both DRBD and cluster comms use a dedicated crossover link.

The target storage is battery-backed RAID.

DRBD resources all use protocol C.

stonith is configured and working.

tgtd write cache is disabled using mode_page in additional_params. This 
is correctly reported using sdparm --get WCE on initiators.


Here's the question: if i am writing from an iSCSI initiator, and i take 
down the crossover link between the nodes of my cluster, i end up with 
corrupt data on the target disk.


I know this isn't the formal way to test pacemaker failover. 
Everything's fine if i fence a node or do a manual migration or 
shutdown. But i don't understand why taking the crossover down results 
in corrupted write operations.


In greater detail, assuming the initiator sends a write request for some 
block, here's the normal sequence as i understand it:


- tgtd receives it and queues it straight for the device backing the LUN 
(write cache is disabled).
- drbd receives it, commits it to disk, sends it to the other node, and 
waits for an acknowledgement (protocol C).
- the remote node receives it, commits it to disk, and sends an 
acknowledgement.
- the initial node receives the drbd acknowledgement, and acknowledges 
the write to tgtd.

- tgtd acknowledges the write to the initiator.

Now, suppose an initiator is writing when i take the crossover link 
down, and pacemaker reacts to the loss in comms by fencing the node with 
the currently active target. It then brings up the target on the 
surviving, formerly inactive, node. This results in a drbd split brain, 
since some writes have been queued on the fenced node but never made it 
to the surviving node, and must be retransmitted by the initiator; once 
the surviving node becomes active it starts committing these writes to 
its copy of the mirror. I'm fine with a split brain; i can resolve it by 
discarding outstanding data on the fenced node.


But in practice, the actual written data is lost, and i don't understand 
why. AFAICS, none of the outstanding writes should have been 
acknowledged by tgtd on the fenced node, so when the surviving node 
becomes active, the initiator should simply re-send all of them. But 
this isn't what happens; instead most of the outstanding writes are 
lost. No i/o error is reported on the initiator; stuff just vanishes.


I'm writing directly to a block device for these tests, so the lost data 
isn't the result of filesystem corruption; it simply never gets written 
to the target disk on the survivor.
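The test itself is nothing fancy; roughly this, with the device name and
size illustrative:

dd if=/dev/urandom of=/tmp/pattern bs=1M count=512
dd if=/tmp/pattern of=/dev/sdX bs=1M oflag=direct conv=fsync
# take the crossover down mid-write, let fencing/failover finish, then:
dd if=/dev/sdX of=/tmp/readback bs=1M count=512 iflag=direct
cmp /tmp/pattern /tmp/readback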


What am i missing?