Re: [Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)

2013-11-19 Thread Vladislav Bogdanov
19.11.2013 13:48, Lars Ellenberg wrote:
> On Wed, Nov 13, 2013 at 09:02:47AM +0300, Vladislav Bogdanov wrote:
>> 13.11.2013 04:46, Jefferson Ogata wrote:
>> ...
>>>
>>> In practice i ran into failover problems under load almost immediately.
>>> Under load, when i would initiate a failover, there was a race
>>> condition: the iSCSILogicalUnit RA will take down the LUNs one at a
>>> time, waiting for each connection to terminate, and if the initiators
>>> reconnect quickly enough, they get pissed off at finding that the target
>>> still exists but the LUN they were using no longer does, which is often
>>> the case during this transient takedown process. On the initiator, it
>>> looks something like this, and it's fatal (here LUN 4 has gone away but
>>> the target is still alive, maybe working on disconnecting LUN 3):
>>>
>>> Nov  7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Sense Key : Illegal
>>> Request [current]
>>> Nov  7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Add. Sense: Logical unit
>>> not supported
>>> Nov  7 07:39:29 s01c kernel: Buffer I/O error on device sde, logical
>>> block 16542656
>>>
>>> One solution to this is using the portblock RA to block all initiator
>>
>> In addition I force use of multipath on initiators with no_path_retry=queue
>>
>> ...
>>
>>>
>>> 1. Lack of support for multiple targets using the same tgt account. This
>>> is a problem because the iSCSITarget RA defines the user and the target
>>> at the same time. If it allowed multiple targets to use the same user,
>>> it wouldn't know when it is safe to delete the user in a stop operation,
>>> because some other target might still be using it.
>>>
>>> To solve this i did two things: first i wrote a new RA that manages a
> 
> Did I miss it, or did you post it somewhere?
> Fork on Github and push there, so we can have a look?
> 
>>> tgt user; this is instantiated as a clone so it runs along with the tgtd
>>> clone. Second i tweaked the iSCSITarget RA so that on start, if
>>> incoming_username is defined but incoming_password is not, the RA skips
>>> the account creation step and simply binds the new target to
>>> incoming_username. On stop, it similarly no longer deletes the account
>>> if incoming_password is unset. I also had to relax the uniqueness
>>> constraint on incoming_username in the RA metadata.
>>>
>>> 2. Disappearing LUNs during failover cause initiators to blow chunks.
>>> For this i used portblock, but had to modify it because the TCP Send-Q
>>> would never drain.
>>>
>>> 3. portblock preventing TCP Send-Q from draining, causing tgtd
>>> connections to hang. I modified portblock to reverse the sense of the
>>> iptables rules it was adding: instead of blocking traffic from the
>>> initiator on the INPUT chain, it now blocks traffic from the target on
>>> the OUTPUT chain with a tcp-reset response. With this setup, as soon as
>>> portblock goes active, the next packet tgtd attempts to send to a given
>>> initiator will get a TCP RST response, causing tgtd to hang up the
>>> connection immediately. This configuration allows the connections to
>>> terminate promptly under load.
>>>
>>> I'm not totally satisfied with this workaround. It means
>>> acknowledgements of operations tgtd has actually completed never make it
>>> back to the initiator. I suspect this could cause problems in some
>>> scenarios. I don't think it causes a problem the way i'm using it, with
>>> each LUN as backing store for a distinct VM--when the LUN is back up on
>>> the other node, the outstanding operations are re-sent by the initiator.
>>> Maybe with a clustered filesystem this would cause problems; it
>>> certainly would cause problems if the target device were, for example, a
>>> tape drive.
> 
> Maybe only block "new" incoming connection attempts?
> 

That may cause issues on the initiator side in some circumstances (IIRC):
* connection is established
* pacemaker fires target move
* target is destroyed, connection breaks (TCP RST is sent to initiator)
* initiator connects again
* target is not available at the iSCSI level (but portals answer either on
the old or on the new node) or portals are not available
* initiator *returns error* to an upper layer <- this one is important
* only then is the target configured on the other node

I was hit by this, but that was several years ago, so I may be missing
some details.

My experience with IET and LIO shows it is better (safer) to block all
iSCSI traffic to the target's portals, in both directions:
* connection is established
* pacemaker fires target move
* both directions are blocked (DROP) on both target nodes
* target is destroyed; the connection stays "established" on the initiator
side, TCP packets just time out
* target is configured on the other node (VIPs are moved too)
* firewall rules are removed
* initiator (re)sends request
* target sends RST (?) back - it doesn't have that connection
* initiator reconnects and continues to use the target
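
For illustration, the blocking step boils down to something like this (a
rough sketch, not the actual agent; 192.0.2.10 and 3260 stand in for the
portal VIP and the iSCSI port):

iptables -I INPUT -p tcp -d 192.0.2.10 --dport 3260 -j DROP
iptables -I OUTPUT -p tcp -s 192.0.2.10 --sport 3260 -j DROP

Once the target is up on the other node, the same rules are deleted again
with -D instead of -I.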


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)

2013-11-19 Thread Jefferson Ogata

On 2013-11-13 06:02, Vladislav Bogdanov wrote:

13.11.2013 04:46, Jefferson Ogata wrote:

[snip]

4. "Insufficient privileges" faults in the portblock RA. This was
another race condition that occurred because i was using multiple
targets, meaning that without a mutex, multiple portblock invocations
would be running in parallel during a failover. If you try to run
iptables while another iptables is running, you get "Resource not
available" and this was coming back to pacemaker as "insufficient
privileges". This is simply a bug in the portblock RA; it should have a
mutex to prevent parallel iptables invocations. I fixed this by adding
an ocf_release_lock_on_exit at the top, and adding an ocf_take_lock for
start, stop, monitor, and status operations.

I'm not sure why more people haven't run into these problems before. I
hope it's not that i'm doing things wrong, but rather that few others
have earnestly tried to build anything quite like this setup. If
anyone out there has set up a similar cluster and *not* had these
problems, i'd like to know about it. Meanwhile, if others *have* had
these problems, i'd also like to know, especially if they've found
alternate solutions.


Can't say about 1, I use IET, it doesn't seem to have that limitation.
2 - I use an alternative home-brew ms RA which blocks (DROP) both input and
output for a specified VIP on demote (targets are configured to be bound
to those VIPs). I also export one big LUN per target and then set up a clvm
VG on top of it (all initiators are in the same, separate cluster).
3 - can't say either; IET is probably not affected.
4 - That is true, iptables doesn't have atomic rules management, so you
definitely need a mutex or a dispatcher like firewalld (I didn't try it though).


The issue here is not really the lack of atomic rules management (which
you can hack in iptables by defining a new chain for the rules you want
to add and adding a jump to it in the INPUT or OUTPUT chain). The issue
i was encountering is that the iptables process takes a lock; if a second
iptables process attempts to run concurrently, it can't acquire the lock,
and it fails. The RA was feeding this back to pacemaker as a resource
start/stop failure, when really the RA should have been preventing
concurrent execution of iptables in the first place.
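
What i did amounts to roughly the following near the top of portblock (a
sketch from memory, not a verbatim diff; the lock file path is arbitrary):

LOCKFILE="${HA_RSCTMP}/portblock.lock"
ocf_release_lock_on_exit "$LOCKFILE"

and then, just before dispatching the start/stop/monitor/status cases:

ocf_take_lock "$LOCKFILE"

That serializes iptables invocations across all the portblock resources, so
concurrent runs no longer collide.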


Thanks for the feedback.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)

2013-11-19 Thread Jefferson Ogata

On 2013-11-19 10:48, Lars Ellenberg wrote:

On Wed, Nov 13, 2013 at 09:02:47AM +0300, Vladislav Bogdanov wrote:

13.11.2013 04:46, Jefferson Ogata wrote:
...

In practice i ran into failover problems under load almost immediately.
Under load, when i would initiate a failover, there was a race
condition: the iSCSILogicalUnit RA will take down the LUNs one at a
time, waiting for each connection to terminate, and if the initiators
reconnect quickly enough, they get pissed off at finding that the target
still exists but the LUN they were using no longer does, which is often
the case during this transient takedown process. On the initiator, it
looks something like this, and it's fatal (here LUN 4 has gone away but
the target is still alive, maybe working on disconnecting LUN 3):

Nov  7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Sense Key : Illegal
Request [current]
Nov  7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Add. Sense: Logical unit
not supported
Nov  7 07:39:29 s01c kernel: Buffer I/O error on device sde, logical
block 16542656

One solution to this is using the portblock RA to block all initiator


In addition I force use of multipath on initiators with no_path_retry=queue

...



1. Lack of support for multiple targets using the same tgt account. This
is a problem because the iSCSITarget RA defines the user and the target
at the same time. If it allowed multiple targets to use the same user,
it wouldn't know when it is safe to delete the user in a stop operation,
because some other target might still be using it.

To solve this i did two things: first i wrote a new RA that manages a


Did I miss it, or did you post it somewhere?
Fork on Github and push there, so we can have a look?


Not set up with git right now; i've attached it here. It's short.
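
In short, start/stop are thin wrappers around tgtadm account operations,
roughly like this (paraphrased, not copied verbatim from the attachment):

tgtadm --lld iscsi --op new --mode account --user "$OCF_RESKEY_username" --password "$OCF_RESKEY_password"   # start
tgtadm --lld iscsi --op delete --mode account --user "$OCF_RESKEY_username"                                  # stop
tgtadm --lld iscsi --op show --mode account                                                                  # monitor: check the user is listed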


tgt user; this is instantiated as a clone so it runs along with the tgtd
clone. Second i tweaked the iSCSITarget RA so that on start, if
incoming_username is defined but incoming_password is not, the RA skips
the account creation step and simply binds the new target to
incoming_username. On stop, it similarly no longer deletes the account
if incoming_password is unset. I also had to relax the uniqueness
constraint on incoming_username in the RA metadata.

2. Disappearing LUNs during failover cause initiators to blow chunks.
For this i used portblock, but had to modify it because the TCP Send-Q
would never drain.

3. portblock preventing TCP Send-Q from draining, causing tgtd
connections to hang. I modified portblock to reverse the sense of the
iptables rules it was adding: instead of blocking traffic from the
initiator on the INPUT chain, it now blocks traffic from the target on
the OUTPUT chain with a tcp-reset response. With this setup, as soon as
portblock goes active, the next packet tgtd attempts to send to a given
initiator will get a TCP RST response, causing tgtd to hang up the
connection immediately. This configuration allows the connections to
terminate promptly under load.

I'm not totally satisfied with this workaround. It means
acknowledgements of operations tgtd has actually completed never make it
back to the initiator. I suspect this could cause problems in some
scenarios. I don't think it causes a problem the way i'm using it, with
each LUN as backing store for a distinct VM--when the LUN is back up on
the other node, the outstanding operations are re-sent by the initiator.
Maybe with a clustered filesystem this would cause problems; it
certainly would cause problems if the target device were, for example, a
tape drive.


Maybe only block "new" incoming connection attempts?


That's a good idea. Theoretically that should allow the existing 
connections to drain. I'm worried it can lead to pacemaker timeouts 
firing if there's a lot of queued data in the send queues, but i'll test it.
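
For comparison, what my modified portblock adds right now is essentially
(paraphrasing; the real rule is built from the RA's ip/portno parameters):

iptables -I OUTPUT -p tcp -s $ip --sport $portno -j REJECT --reject-with tcp-reset

whereas blocking only new connection attempts would look more like:

iptables -I INPUT -p tcp -d $ip --dport $portno --syn -j DROP

which leaves established sessions alone so their send queues can drain
before the target is taken down.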


Thanks for the suggestion.
#!/bin/sh
#
# Resource script for managing tgt users
#
# Description:  Manages a tgt user as an OCF resource in
#   a High Availability setup.
#
# Author: Jefferson Ogata 
# License: GNU General Public License (GPL) 
#
#
#   usage: $0 {start|stop|status|monitor|validate-all|meta-data}
#
#   The "start" arg adds the user.
#
#   The "stop" arg deletes it.
#
# OCF parameters:
# OCF_RESKEY_username
# OCF_RESKEY_password
#
# Example crm configuration:
#
# primitive tgtd lsb:tgtd op monitor interval="10s" 
# clone clone.tgtd tgtd
# primitive user.foo ocf:heartbeat:tgtUser params username="foo" password="secret"
# clone clone.user.foo user.foo
# order clone.tgtd_before_clone.user.foo inf: clone.tgtd:start clone.user.foo:start

#
##
# Initialization:

: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

USAGE="Usage: $0 {start|stop|status|monitor|validate-all|meta-data}";

##

usage() 
{
echo $USAGE >&2
}

meta_data() 
{
cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="tgtUser">
<version>1.0</version>

<longdesc lang="en">
Manages a tgt user as an OCF resource, so that iSCSI targets can bind an
account that is created and deleted independently of any single target.
</longdesc>
<shortdesc lang="en">Manages a tgt user</shortdesc>

<parameters>
<parameter name="username" unique="0" required="1">
<longdesc lang="en">The tgt username to create on start and delete on stop.</longdesc>
<shortdesc lang="en">tgt username</shortdesc>
<content type="string" default="" />
</parameter>

<parameter name="password" unique="0" required="1">
<longdesc lang="en">The password for the tgt user.</longdesc>
<shortdesc lang="en">tgt password</shortdesc>
<content type="string" default="" />
</parameter>
</parameters>

<actions>
<action name="start" timeout="20s" />
<action name="stop" timeout="20s" />
<action name="status" timeout="20s" />
<action name="monitor" timeout="20s" interval="10s" />
<action name="validate-all" timeout="5s" />
<action name="meta-data" timeout="5s" />
</actions>
</resource-agent>
END
}

Re: [Linux-HA] iSCSI corruption during interconnect failure with pacemaker+tgt+drbd+protocol C

2013-11-19 Thread Jefferson Ogata

On 2013-11-19 09:32, Lars Ellenberg wrote:

On Wed, Nov 13, 2013 at 03:10:07AM +, Jefferson Ogata wrote:

Here's a problem i don't understand, and i'd like a solution to if
possible, or at least i'd like to understand why it's a problem,
because i'm clearly not getting something.

I have an iSCSI target cluster using CentOS 6.4 with stock
pacemaker/CMAN/corosync and tgt, and DRBD 8.4 which i've built from
source.

Both DRBD and cluster comms use a dedicated crossover link.

The target storage is battery-backed RAID.

DRBD resources all use protocol C.

stonith is configured and working.


What about DRBD fencing.

You have to use "fencing resource-and-stonith;",
and a suitable fencing handler.


Currently i have "fencing resource-only;" and

fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
after-resync-target /usr/lib/drbd/crm-unfence-peer.sh;

in my DRBD config. stonith is configured in pacemaker. This was the best 
i could come up with from what documentation i was able to find.


I will try using "fencing resource-and-stonith;" but i'm unclear on 
whether that requires some sort of additional stonith configuration in 
DRBD, which i didn't think would be necessary.
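
If it really is just a matter of swapping the policy in place and keeping
the handlers, what i intend to try looks something like this (untested
sketch, not a verified config):

fencing resource-and-stonith;

handlers {
fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}

with pacemaker's existing stonith resources doing the actual node fencing.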



Because:


tgtd write cache is disabled using mode_page in additional_params.
This is correctly reported using sdparm --get WCE on initiators.

Here's the question: if i am writing from an iSCSI initiator, and i
take down the crossover link between the nodes of my cluster, i end
up with corrupt data on the target disk.

I know this isn't the formal way to test pacemaker failover.
Everything's fine if i fence a node or do a manual migration or
shutdown. But i don't understand why taking the crossover down
results in corrupted write operations.

In greater detail, assuming the initiator sends a write request for
some block, here's the normal sequence as i understand it:

- tgtd receives it and queues it straight for the device backing the
LUN (write cache is disabled).
- drbd receives it, commits it to disk, sends it to the other node,
and waits for an acknowledgement (protocol C).
- the remote node receives it, commits it to disk, and sends an
acknowledgement.
- the initial node receives the drbd acknowledgement, and
acknowledges the write to tgtd.
- tgtd acknowledges the write to the initiator.

Now, suppose an initiator is writing when i take the crossover link
down, and pacemaker reacts to the loss in comms by fencing the node
with the currently active target. It then brings up the target on
the surviving, formerly inactive, node. This results in a drbd split
brain, since some writes have been queued on the fenced node but
never made it to the surviving node,


But have been acknowledged as written to the initiator,
which is why the initiator won't retransmit them.


This is the crux of what i'm not understanding: why, if i'm using 
protocol C, would DRBD acknowledge a write before it's been committed to 
the remote replica? If so, then i really don't understand the point of 
protocol C.


Or is tgtd acknowledging writes before they've been committed by the 
underlying backing store? I thought disabling the write cache would 
prevent that.
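
(For reference, the check i'm relying on for that is the one i mentioned:
on the initiator,

sdparm --get WCE /dev/sde

should report WCE as 0, and on the target node something like

tgtadm --lld iscsi --op show --mode target

should show the parameters tgtd is actually using for the LUN. The device
name here is just an example.)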


What i *thought* should happen was that writes received by the target 
after the crossover link fails would not be acknowledged under protocol 
C, and would be retransmitted after fencing completed and the backup 
node becomes primary. These writes overlap with writes that were 
committed on the fenced node's replica but hadn't been transmitted to 
the other replica, so this results in split brain that is reliably 
resolvable by discarding data from the fenced node's replica and resyncing.



With the DRBD fencing policy "fencing resource-and-stonith;",
DRBD will *block* further IO (and not acknowledge anything
that did not make it to the peer) until the fence-peer handler
returns that it would be safe to resume IO again.

This avoids the data divergence (aka "split-brain", because
usually data-divergence is the result of split-brain).


and must be retransmitted by
the initiator; once the surviving node becomes active it starts
committing these writes to its copy of the mirror. I'm fine with a
split brain;


You should not be.
DRBD reporting "split-brain detected" is usually a sign of a bad setup.


Well, by "fine" i meant that i felt i had a clear understanding of how 
to resolve the split brain without ending up with corruption. But if 
DRBD is acknowledging writes that haven't been committed on both 
replicas even with protocol C, i may have been incorrect in this.



i can resolve it by discarding outstanding data on the fenced node.

But in practice, the actual written data is lost, and i don't
understand why. AFAICS, none of the outstanding writes should have
been acknowledged by tgtd on the fenced node, so when the surviving
node becomes active, the initiator should simply re-send all of
them. But this isn't what happens; instead most of the outstanding
writes are lost. No i/o error is reported on the initiator; stuff just vanishes.

Re: [Linux-HA] Ping domain

2013-11-19 Thread Brent Clark

Good day

Thank you for replying.

Brent

On 19/11/2013 13:01, Lars Ellenberg wrote:

On Tue, Nov 19, 2013 at 10:02:45AM +0200, Brent Clark wrote:

Good day.

I would like to ask, if you can supply a domain for ping in ha.cf.

If anyone can assist, it would be appreciated.

I'm not exactly sure what you are asking here,
but maybe this helps:
http://moin.linux-ha.org/PingDirective

Or consider switching to Pacemaker...



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Ping domain

2013-11-19 Thread Lars Ellenberg
On Tue, Nov 19, 2013 at 10:02:45AM +0200, Brent Clark wrote:
> Good day.
> 
> I would like to ask, if you can supply a domain for ping in ha.cf.
> 
> If anyone can assist, it would be appreciated.

I'm not exactly sure what you are asking here,
but maybe this helps:
http://moin.linux-ha.org/PingDirective

Or consider switching to Pacemaker...

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Heartbeat errors related to Gmain_timeout_dispatch at low traffic

2013-11-19 Thread Lars Ellenberg
On Thu, Nov 14, 2013 at 04:46:16PM +0530, Savita Kulkarni wrote:
> Hi,
> 
> Recently we are seeing lots of heartbeat errors related to
> Gmain_timeout_dispatch
> on our system.
> I checked on mailing list archives if other people have faced this issue.
> There are few email threads regarding this but people are seeing this issue
> in case of high load.
> 
> On our system there is very low or no load present.
> 
> We are running heartbeat on guest VMs, using VMWARE ESXi 5.0.
> We have heartbeat -2.1.3-4
> It is working fine without any issues on other setups; the issue is
> occurring only on this setup.
> 
> Following types of errors are present in /var/log/messages
> 
> Nov 12 09:58:43  heartbeat: [23036]: WARN: Gmain_timeout_dispatch:
> Dispatch function for send local status was delayed 15270 ms (> 1010
> ms) before being called (GSource: 0x138926b8)
> Nov 12 09:59:00  heartbeat: [23036]: info: Gmain_timeout_dispatch:
> started at 583294569 should have started at 583293042
> Nov 12 09:59:00 heartbeat: [23036]: WARN: Gmain_timeout_dispatch:
> Dispatch function for update msgfree count was delayed 33960 ms (>
> 1 ms) before being called (GSource: 0x13892f58)
> 
> Can anyone tell me what can be the issue?
> 
> Can it be a hardware issue?

Could be many things, even that, yes.

Could be that upgrading to a recent heartbeat 3 helps.

Could be that there is too little load, and your virtualization just
stops scheduling the VM itself, because it thinks it is underutilized...

Does it recover if you kill/restart heartbeat?

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)

2013-11-19 Thread Lars Ellenberg
On Wed, Nov 13, 2013 at 09:02:47AM +0300, Vladislav Bogdanov wrote:
> 13.11.2013 04:46, Jefferson Ogata wrote:
> ...
> > 
> > In practice i ran into failover problems under load almost immediately.
> > Under load, when i would initiate a failover, there was a race
> > condition: the iSCSILogicalUnit RA will take down the LUNs one at a
> > time, waiting for each connection to terminate, and if the initiators
> > reconnect quickly enough, they get pissed off at finding that the target
> > still exists but the LUN they were using no longer does, which is often
> > the case during this transient takedown process. On the initiator, it
> > looks something like this, and it's fatal (here LUN 4 has gone away but
> > the target is still alive, maybe working on disconnecting LUN 3):
> > 
> > Nov  7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Sense Key : Illegal
> > Request [current]
> > Nov  7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Add. Sense: Logical unit
> > not supported
> > Nov  7 07:39:29 s01c kernel: Buffer I/O error on device sde, logical
> > block 16542656
> > 
> > One solution to this is using the portblock RA to block all initiator
> 
> In addition I force use of multipath on initiators with no_path_retry=queue
> 
> ...
> 
> > 
> > 1. Lack of support for multiple targets using the same tgt account. This
> > is a problem because the iSCSITarget RA defines the user and the target
> > at the same time. If it allowed multiple targets to use the same user,
> > it wouldn't know when it is safe to delete the user in a stop operation,
> > because some other target might still be using it.
> > 
> > To solve this i did two things: first i wrote a new RA that manages a

Did I miss it, or did you post it somewhere?
Fork on Github and push there, so we can have a look?

> > tgt user; this is instantiated as a clone so it runs along with the tgtd
> > clone. Second i tweaked the iSCSITarget RA so that on start, if
> > incoming_username is defined but incoming_password is not, the RA skips
> > the account creation step and simply binds the new target to
> > incoming_username. On stop, it similarly no longer deletes the account
> > if incoming_password is unset. I also had to relax the uniqueness
> > constraint on incoming_username in the RA metadata.
> > 
> > 2. Disappearing LUNs during failover cause initiators to blow chunks.
> > For this i used portblock, but had to modify it because the TCP Send-Q
> > would never drain.
> > 
> > 3. portblock preventing TCP Send-Q from draining, causing tgtd
> > connections to hang. I modified portblock to reverse the sense of the
> > iptables rules it was adding: instead of blocking traffic from the
> > initiator on the INPUT chain, it now blocks traffic from the target on
> > the OUTPUT chain with a tcp-reset response. With this setup, as soon as
> > portblock goes active, the next packet tgtd attempts to send to a given
> > initiator will get a TCP RST response, causing tgtd to hang up the
> > connection immediately. This configuration allows the connections to
> > terminate promptly under load.
> > 
> > I'm not totally satisfied with this workaround. It means
> > acknowledgements of operations tgtd has actually completed never make it
> > back to the initiator. I suspect this could cause problems in some
> > scenarios. I don't think it causes a problem the way i'm using it, with
> > each LUN as backing store for a distinct VM--when the LUN is back up on
> > the other node, the outstanding operations are re-sent by the initiator.
> > Maybe with a clustered filesystem this would cause problems; it
> > certainly would cause problems if the target device were, for example, a
> > tape drive.

Maybe only block "new" incoming connection attempts?

> > 4. "Insufficient privileges" faults in the portblock RA. This was
> > another race condition that occurred because i was using multiple
> > targets, meaning that without a mutex, multiple portblock invocations
> > would be running in parallel during a failover. If you try to run
> > iptables while another iptables is running, you get "Resource not
> > available" and this was coming back to pacemaker as "insufficient
> > privileges". This is simply a bug in the portblock RA; it should have a
> > mutex to prevent parallel iptables invocations. I fixed this by adding
> > an ocf_release_lock_on_exit at the top, and adding an ocf_take_lock for
> > start, stop, monitor, and status operations.
> >
> > I'm not sure why more people haven't run into these problems before. I
> > hope it's not that i'm doing things wrong, but rather that few others
> > have earnestly tried to build anything quite like this setup. If
> > anyone out there has set up a similar cluster and *not* had these
> > problems, i'd like to know about it. Meanwhile, if others *have* had
> > these problems, i'd also like to know, especially if they've found
> > alternate solutions.
> 
> Can't say about 1, I use IET, it doesn't seem to have that limitation.
> 2 - I use an alternative home-brew ms RA which blocks (DROP) both input and
> output for a specified VIP on demote (targets are configured to be bound
> to those VIPs).

Re: [Linux-HA] iSCSI corruption during interconnect failure with pacemaker+tgt+drbd+protocol C

2013-11-19 Thread Lars Ellenberg
On Wed, Nov 13, 2013 at 03:10:07AM +, Jefferson Ogata wrote:
> Here's a problem i don't understand, and i'd like a solution to if
> possible, or at least i'd like to understand why it's a problem,
> because i'm clearly not getting something.
> 
> I have an iSCSI target cluster using CentOS 6.4 with stock
> pacemaker/CMAN/corosync and tgt, and DRBD 8.4 which i've built from
> source.
> 
> Both DRBD and cluster comms use a dedicated crossover link.
> 
> The target storage is battery-backed RAID.
> 
> DRBD resources all use protocol C.
> 
> stonith is configured and working.

What about DRBD fencing.

You have to use "fencing resource-and-stonith;",
and a suitable fencing handler.

Because:

> tgtd write cache is disabled using mode_page in additional_params.
> This is correctly reported using sdparm --get WCE on initiators.
> 
> Here's the question: if i am writing from an iSCSI initiator, and i
> take down the crossover link between the nodes of my cluster, i end
> up with corrupt data on the target disk.
> 
> I know this isn't the formal way to test pacemaker failover.
> Everything's fine if i fence a node or do a manual migration or
> shutdown. But i don't understand why taking the crossover down
> results in corrupted write operations.
> 
> In greater detail, assuming the initiator sends a write request for
> some block, here's the normal sequence as i understand it:
> 
> - tgtd receives it and queues it straight for the device backing the
> LUN (write cache is disabled).
> - drbd receives it, commits it to disk, sends it to the other node,
> and waits for an acknowledgement (protocol C).
> - the remote node receives it, commits it to disk, and sends an
> acknowledgement.
> - the initial node receives the drbd acknowledgement, and
> acknowledges the write to tgtd.
> - tgtd acknowledges the write to the initiator.
> 
> Now, suppose an initiator is writing when i take the crossover link
> down, and pacemaker reacts to the loss in comms by fencing the node
> with the currently active target. It then brings up the target on
> the surviving, formerly inactive, node. This results in a drbd split
> brain, since some writes have been queued on the fenced node but
> never made it to the surviving node,

But have been acknowledged as written to the initiator,
which is why the initiator won't retransmit them.

With the DRBD fencing policy "fencing resource-and-stonith;",
DRBD will *block* further IO (and not acknowledge anything
that did not make it to the peer) until the fence-peer handler
returns that it would be safe to resume IO again.

This avoids the data divergence (aka "split-brain", because
usually data-divergence is the result of split-brain).

> and must be retransmitted by
> the initiator; once the surviving node becomes active it starts
> committing these writes to its copy of the mirror. I'm fine with a
> split brain;

You should not be.
DRBD reporting "split-brain detected" is usually a sign of a bad setup.

> i can resolve it by discarding outstanding data on the fenced node.
> 
> But in practice, the actual written data is lost, and i don't
> understand why. AFAICS, none of the outstanding writes should have
> been acknowledged by tgtd on the fenced node, so when the surviving
> node becomes active, the initiator should simply re-send all of
> them. But this isn't what happens; instead most of the outstanding
> writes are lost. No i/o error is reported on the initiator; stuff
> just vanishes.
> 
> I'm writing directly to a block device for these tests, so the lost
> data isn't the result of filesystem corruption; it simply never gets
> written to the target disk on the survivor.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems