Re: [Linux-HA] iSCSI corruption during interconnect failure with pacemaker+tgt+drbd+protocol C

2013-11-19 Thread Lars Ellenberg
On Wed, Nov 13, 2013 at 03:10:07AM +0000, Jefferson Ogata wrote:
 Here's a problem i don't understand, and i'd like a solution to if
 possible, or at least i'd like to understand why it's a problem,
 because i'm clearly not getting something.
 
 I have an iSCSI target cluster using CentOS 6.4 with stock
 pacemaker/CMAN/corosync and tgt, and DRBD 8.4 which i've built from
 source.
 
 Both DRBD and cluster comms use a dedicated crossover link.
 
 The target storage is battery-backed RAID.
 
 DRBD resources all use protocol C.
 
 stonith is configured and working.

What about DRBD fencing.

You have to use "fencing resource-and-stonith;"
and a suitable fence-peer handler.
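
As a rough sketch (DRBD 8.4 syntax; the resource name is a placeholder
and the handler scripts are the ones shipped with DRBD, so adapt to your
setup):

    resource r0 {
      disk {
        fencing resource-and-stonith;
      }
      handlers {
        # place a constraint in the pacemaker CIB so the peer cannot be
        # promoted while it is out of date ...
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        # ... and remove that constraint again once the resync finished
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
      }
    }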

Because:

 tgtd write cache is disabled using mode_page in additional_params.
 This is correctly reported using sdparm --get WCE on initiators.
 
 Here's the question: if i am writing from an iSCSI initiator, and i
 take down the crossover link between the nodes of my cluster, i end
 up with corrupt data on the target disk.
 
 I know this isn't the formal way to test pacemaker failover.
 Everything's fine if i fence a node or do a manual migration or
 shutdown. But i don't understand why taking the crossover down
 results in corrupted write operations.
 
 In greater detail, assuming the initiator sends a write request for
 some block, here's the normal sequence as i understand it:
 
 - tgtd receives it and queues it straight for the device backing the
 LUN (write cache is disabled).
 - drbd receives it, commits it to disk, sends it to the other node,
 and waits for an acknowledgement (protocol C).
 - the remote node receives it, commits it to disk, and sends an
 acknowledgement.
 - the initial node receives the drbd acknowledgement, and
 acknowledges the write to tgtd.
 - tgtd acknowledges the write to the initiator.
 
 Now, suppose an initiator is writing when i take the crossover link
 down, and pacemaker reacts to the loss in comms by fencing the node
 with the currently active target. It then brings up the target on
 the surviving, formerly inactive, node. This results in a drbd split
 brain, since some writes have been queued on the fenced node but
 never made it to the surviving node,

But have been acknowledged as written to the initiator,
which is why the initiator won't retransmit them.

With the DRBD fencing policy "fencing resource-and-stonith;",
DRBD will *block* further IO (and not acknowledge anything
that did not make it to the peer) until the fence-peer handler
returns that it would be safe to resume IO again.

This avoids the data divergence (aka split-brain, because
usually data-divergence is the result of split-brain).

 and must be retransmitted by
 the initiator; once the surviving node becomes active it starts
 committing these writes to its copy of the mirror. I'm fine with a
 split brain;

You should not be.
DRBD reporting "split-brain detected" is usually a sign of a bad setup.

 i can resolve it by discarding outstanding data on the fenced node.
 
 But in practice, the actual written data is lost, and i don't
 understand why. AFAICS, none of the outstanding writes should have
 been acknowledged by tgtd on the fenced node, so when the surviving
 node becomes active, the initiator should simply re-send all of
 them. But this isn't what happens; instead most of the outstanding
 writes are lost. No i/o error is reported on the initiator; stuff
 just vanishes.
 
 I'm writing directly to a block device for these tests, so the lost
 data isn't the result of filesystem corruption; it simply never gets
 written to the target disk on the survivor.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


Re: [Linux-HA] iSCSI corruption during interconnect failure with pacemaker+tgt+drbd+protocol C

2013-11-19 Thread Jefferson Ogata

On 2013-11-19 09:32, Lars Ellenberg wrote:

On Wed, Nov 13, 2013 at 03:10:07AM +0000, Jefferson Ogata wrote:

Here's a problem i don't understand, and i'd like a solution to if
possible, or at least i'd like to understand why it's a problem,
because i'm clearly not getting something.

I have an iSCSI target cluster using CentOS 6.4 with stock
pacemaker/CMAN/corosync and tgt, and DRBD 8.4 which i've built from
source.

Both DRBD and cluster comms use a dedicated crossover link.

The target storage is battery-backed RAID.

DRBD resources all use protocol C.

stonith is configured and working.


What about DRBD fencing.

You have to use "fencing resource-and-stonith;"
and a suitable fence-peer handler.


Currently i have "fencing resource-only;" and

fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";

in my DRBD config. stonith is configured in pacemaker. This was the best 
i could come up with from what documentation i was able to find.


I will try using "fencing resource-and-stonith;" but i'm unclear on
whether that requires some sort of additional stonith configuration in
DRBD, which i didn't think would be necessary.
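
If i'm reading this right, i'm guessing it's just a matter of changing the
fencing policy in my disk section, i.e. something like (untested guess on
my part):

    disk {
      # was: fencing resource-only;
      fencing resource-and-stonith;
    }

with the existing fence-peer/after-resync-target handlers left as they are?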



Because:


tgtd write cache is disabled using mode_page in additional_params.
This is correctly reported using sdparm --get WCE on initiators.

Here's the question: if i am writing from an iSCSI initiator, and i
take down the crossover link between the nodes of my cluster, i end
up with corrupt data on the target disk.

I know this isn't the formal way to test pacemaker failover.
Everything's fine if i fence a node or do a manual migration or
shutdown. But i don't understand why taking the crossover down
results in corrupted write operations.

In greater detail, assuming the initiator sends a write request for
some block, here's the normal sequence as i understand it:

- tgtd receives it and queues it straight for the device backing the
LUN (write cache is disabled).
- drbd receives it, commits it to disk, sends it to the other node,
and waits for an acknowledgement (protocol C).
- the remote node receives it, commits it to disk, and sends an
acknowledgement.
- the initial node receives the drbd acknowledgement, and
acknowledges the write to tgtd.
- tgtd acknowledges the write to the initiator.

Now, suppose an initiator is writing when i take the crossover link
down, and pacemaker reacts to the loss in comms by fencing the node
with the currently active target. It then brings up the target on
the surviving, formerly inactive, node. This results in a drbd split
brain, since some writes have been queued on the fenced node but
never made it to the surviving node,


But have been acknowledged as written to the initiator,
which is why the initiator won't retransmit them.


This is the crux of what i'm not understanding: why, if i'm using
protocol C, would DRBD acknowledge a write before it's been committed to
the remote replica? If that's what's happening, then i really don't
understand the point of protocol C.


Or is tgtd acknowledging writes before they've been committed by the 
underlying backing store? I thought disabling the write cache would 
prevent that.


What i *thought* should happen was that writes received by the target 
after the crossover link fails would not be acknowledged under protocol 
C, and would be retransmitted after fencing completed and the backup 
node becomes primary. These writes overlap with writes that were 
committed on the fenced node's replica but hadn't been transmitted to 
the other replica, so this results in split brain that is reliably 
resolvable by discarding data from the fenced node's replica and resyncing.



With the DRBD fencing policy "fencing resource-and-stonith;",
DRBD will *block* further IO (and not acknowledge anything
that did not make it to the peer) until the fence-peer handler
returns that it would be safe to resume IO again.

This avoids the data divergence (aka split-brain, because
usually data-divergence is the result of split-brain).


and must be retransmitted by
the initiator; once the surviving node becomes active it starts
committing these writes to its copy of the mirror. I'm fine with a
split brain;


You should not be.
DRBD reporting "split-brain detected" is usually a sign of a bad setup.


Well, by "fine" i meant that i felt i had a clear understanding of how 
to resolve the split brain without ending up with corruption. But if 
DRBD is acknowledging writes that haven't been committed on both 
replicas even with protocol C, i may have been incorrect in this.



i can resolve it by discarding outstanding data on the fenced node.

But in practice, the actual written data is lost, and i don't
understand why. AFAICS, none of the outstanding writes should have
been acknowledged by tgtd on the fenced node, so when the surviving
node becomes active, the initiator should simply re-send all of
them. But this isn't what happens; instead most of the outstanding
writes are lost. No i/o error is reported on the initiator; stuff just vanishes.

[Linux-HA] iSCSI corruption during interconnect failure with pacemaker+tgt+drbd+protocol C

2013-11-12 Thread Jefferson Ogata
Here's a problem i don't understand, and i'd like a solution to if 
possible, or at least i'd like to understand why it's a problem, because 
i'm clearly not getting something.


I have an iSCSI target cluster using CentOS 6.4 with stock 
pacemaker/CMAN/corosync and tgt, and DRBD 8.4 which i've built from source.


Both DRBD and cluster comms use a dedicated crossover link.

The target storage is battery-backed RAID.

DRBD resources all use protocol C.

stonith is configured and working.

tgtd write cache is disabled using mode_page in additional_params. This 
is correctly reported using sdparm --get WCE on initiators.


Here's the question: if i am writing from an iSCSI initiator, and i take 
down the crossover link between the nodes of my cluster, i end up with 
corrupt data on the target disk.


I know this isn't the formal way to test pacemaker failover. 
Everything's fine if i fence a node or do a manual migration or 
shutdown. But i don't understand why taking the crossover down results 
in corrupted write operations.


In greater detail, assuming the initiator sends a write request for some 
block, here's the normal sequence as i understand it:


- tgtd receives it and queues it straight for the device backing the LUN 
(write cache is disabled).
- drbd receives it, commits it to disk, sends it to the other node, and 
waits for an acknowledgement (protocol C).
- the remote node receives it, commits it to disk, and sends an 
acknowledgement.
- the initial node receives the drbd acknowledgement, and acknowledges 
the write to tgtd.

- tgtd acknowledges the write to the initiator.

Now, suppose an initiator is writing when i take the crossover link 
down, and pacemaker reacts to the loss in comms by fencing the node with 
the currently active target. It then brings up the target on the 
surviving, formerly inactive, node. This results in a drbd split brain, 
since some writes have been queued on the fenced node but never made it 
to the surviving node, and must be retransmitted by the initiator; once 
the surviving node becomes active it starts committing these writes to 
its copy of the mirror. I'm fine with a split brain; i can resolve it by 
discarding outstanding data on the fenced node.


But in practice, the actual written data is lost, and i don't understand 
why. AFAICS, none of the outstanding writes should have been 
acknowledged by tgtd on the fenced node, so when the surviving node 
becomes active, the initiator should simply re-send all of them. But 
this isn't what happens; instead most of the outstanding writes are 
lost. No i/o error is reported on the initiator; stuff just vanishes.


I'm writing directly to a block device for these tests, so the lost data 
isn't the result of filesystem corruption; it simply never gets written 
to the target disk on the survivor.


What am i missing?


Re: [Linux-HA] iSCSI corruption during interconnect failure with pacemaker+tgt+drbd+protocol C

2013-11-12 Thread Andrew Beekhof

On 13 Nov 2013, at 2:10 pm, Jefferson Ogata linux...@antibozo.net wrote:

 Here's a problem i don't understand, and i'd like a solution to if possible, 
 or at least i'd like to understand why it's a problem, because i'm clearly 
 not getting something.
 
 I have an iSCSI target cluster using CentOS 6.4 with stock 
 pacemaker/CMAN/corosync and tgt, and DRBD 8.4 which i've built from source.
 
 Both DRBD and cluster comms use a dedicated crossover link.
 
 The target storage is battery-backed RAID.
 
 DRBD resources all use protocol C.
 
 stonith is configured and working.
 
 tgtd write cache is disabled using mode_page in additional_params. This is 
 correctly reported using sdparm --get WCE on initiators.
 
 Here's the question: if i am writing from an iSCSI initiator, and i take down 
 the crossover link between the nodes of my cluster, i end up with corrupt 
 data on the target disk.
 
 I know this isn't the formal way to test pacemaker failover. Everything's 
 fine if i fence a node or do a manual migration or shutdown. But i don't 
 understand why taking the crossover down results in corrupted write 
 operations.
 
 In greater detail, assuming the initiator sends a write request for some 
 block, here's the normal sequence as i understand it:
 
 - tgtd receives it and queues it straight for the device backing the LUN 
 (write cache is disabled).
 - drbd receives it, commits it to disk, sends it to the other node, and waits 
 for an acknowledgement (protocol C).
 - the remote node receives it, commits it to disk, and sends an 
 acknowledgement.
 - the initial node receives the drbd acknowledgement, and acknowledges the 
 write to tgtd.
 - tgtd acknowledges the write to the initiator.
 
 Now, suppose an initiator is writing when i take the crossover link down, and 
 pacemaker reacts to the loss in comms by fencing the node with the currently 
 active target. It then brings up the target on the surviving, formerly 
 inactive, node. This results in a drbd split brain, since some writes have 
 been queued on the fenced node but never made it to the surviving node, and 
 must be retransmitted by the initiator; once the surviving node becomes 
 active it starts committing these writes to its copy of the mirror. I'm fine 
 with a split brain; i can resolve it by discarding outstanding data on the 
 fenced node.
 
 But in practice, the actual written data is lost, and i don't understand why. 
 AFAICS, none of the outstanding writes should have been acknowledged by tgtd 
 on the fenced node, so when the surviving node becomes active, the initiator 
 should simply re-send all of them. But this isn't what happens; instead most 
 of the outstanding writes are lost. No i/o error is reported on the 
 initiator; stuff just vanishes.
 
 I'm writing directly to a block device for these tests, so the lost data 
 isn't the result of filesystem corruption; it simply never gets written to 
 the target disk on the survivor.
 
 What am i missing?

iSCSI, drbd, etc. are not really my area of expertise, but it may be worth 
taking the cluster out of the loop and manually performing the equivalent 
actions.
If the underlying drbd and iSCSI setups have a problem, then the cluster isn't 
going to do much about it.




Re: [Linux-HA] iSCSI corruption during interconnect failure with pacemaker+tgt+drbd+protocol C

2013-11-12 Thread Vladislav Bogdanov
13.11.2013 06:10, Jefferson Ogata wrote:
 Here's a problem i don't understand, and i'd like a solution to if
 possible, or at least i'd like to understand why it's a problem, because
 i'm clearly not getting something.
 
 I have an iSCSI target cluster using CentOS 6.4 with stock
 pacemaker/CMAN/corosync and tgt, and DRBD 8.4 which i've built from source.
 
 Both DRBD and cluster comms use a dedicated crossover link.
 
 The target storage is battery-backed RAID.
 
 DRBD resources all use protocol C.
 
 stonith is configured and working.
 
 tgtd write cache is disabled using mode_page in additional_params. This
 is correctly reported using sdparm --get WCE on initiators.
 
 Here's the question: if i am writing from an iSCSI initiator, and i take
 down the crossover link between the nodes of my cluster, i end up with
 corrupt data on the target disk.
 
 I know this isn't the formal way to test pacemaker failover.
 Everything's fine if i fence a node or do a manual migration or
 shutdown. But i don't understand why taking the crossover down results
 in corrupted write operations.
 
 In greater detail, assuming the initiator sends a write request for some
 block, here's the normal sequence as i understand it:
 
 - tgtd receives it and queues it straight for the device backing the LUN
 (write cache is disabled).
 - drbd receives it, commits it to disk, sends it to the other node, and
 waits for an acknowledgement (protocol C).
 - the remote node receives it, commits it to disk, and sends an
 acknowledgement.
 - the initial node receives the drbd acknowledgement, and acknowledges
 the write to tgtd.
 - tgtd acknowledges the write to the initiator.
 
 Now, suppose an initiator is writing when i take the crossover link
 down, and pacemaker reacts to the loss in comms by fencing the node with
 the currently active target. It then brings up the target on the
 surviving, formerly inactive, node. This results in a drbd split brain,
 since some writes have been queued on the fenced node but never made it
 to the surviving node, and must be retransmitted by the initiator; once
 the surviving node becomes active it starts committing these writes to
 its copy of the mirror. I'm fine with a split brain; i can resolve it by
 discarding outstanding data on the fenced node.
 
 But in practice, the actual written data is lost, and i don't understand
 why. AFAICS, none of the outstanding writes should have been
 acknowledged by tgtd on the fenced node, so when the surviving node
 becomes active, the initiator should simply re-send all of them. But
 this isn't what happens; instead most of the outstanding writes are
 lost. No i/o error is reported on the initiator; stuff just vanishes.
 
 I'm writing directly to a block device for these tests, so the lost data
 isn't the result of filesystem corruption; it simply never gets written
 to the target disk on the survivor.
 
 What am i missing?

Do you have handlers (fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";) configured in
drbd.conf?
