Re: [Linux-HA] iSCSI corruption during interconnect failure with pacemaker+tgt+drbd+protocol C
On 2013-11-19 09:32, Lars Ellenberg wrote:
> On Wed, Nov 13, 2013 at 03:10:07AM +, Jefferson Ogata wrote:
>> Here's a problem i don't understand, and i'd like a solution to if possible, or at least i'd like to understand why it's a problem, because i'm clearly not getting something.
>>
>> I have an iSCSI target cluster using CentOS 6.4 with stock pacemaker/CMAN/corosync and tgt, and DRBD 8.4 which i've built from source.
>>
>> Both DRBD and cluster comms use a dedicated crossover link.
>>
>> The target storage is battery-backed RAID.
>>
>> DRBD resources all use protocol C.
>>
>> stonith is configured and working.
>
> What about DRBD fencing.
> You have to use "fencing resource-and-stonith;", and a suitable fencing handler.

Currently i have

  fencing resource-only;
  fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
  after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";

in my DRBD config. stonith is configured in pacemaker. This was the best i could come up with from what documentation i was able to find. I will try using "fencing resource-and-stonith;", but i'm unclear on whether that requires some sort of additional stonith configuration in DRBD, which i didn't think would be necessary.

> Because:
>> tgtd write cache is disabled using mode_page in additional_params. This is correctly reported using sdparm --get WCE on initiators.
>>
>> Here's the question: if i am writing from an iSCSI initiator, and i take down the crossover link between the nodes of my cluster, i end up with corrupt data on the target disk.
>>
>> I know this isn't the formal way to test pacemaker failover. Everything's fine if i fence a node or do a manual migration or shutdown. But i don't understand why taking the crossover down results in corrupted write operations.
>>
>> In greater detail, assuming the initiator sends a write request for some block, here's the normal sequence as i understand it:
>>
>> - tgtd receives it and queues it straight for the device backing the LUN (write cache is disabled).
>> - drbd receives it, commits it to disk, sends it to the other node, and waits for an acknowledgement (protocol C).
>> - the remote node receives it, commits it to disk, and sends an acknowledgement.
>> - the initial node receives the drbd acknowledgement, and acknowledges the write to tgtd.
>> - tgtd acknowledges the write to the initiator.
>>
>> Now, suppose an initiator is writing when i take the crossover link down, and pacemaker reacts to the loss in comms by fencing the node with the currently active target. It then brings up the target on the surviving, formerly inactive, node. This results in a drbd split brain, since some writes have been queued on the fenced node but never made it to the surviving node,
>
> But have been acknowledged as written to the initiator, which is why the initiator won't retransmit them.

This is the crux of what i'm not understanding: why, if i'm using protocol C, would DRBD acknowledge a write before it's been committed to the remote replica? If so, then i really don't understand the point of protocol C. Or is tgtd acknowledging writes before they've been committed by the underlying backing store? I thought disabling the write cache would prevent that.

What i *thought* should happen was that writes received by the target after the crossover link fails would not be acknowledged under protocol C, and would be retransmitted after fencing completed and the backup node becomes primary. These writes overlap with writes that were committed on the fenced node's replica but hadn't been transmitted to the other replica, so this results in a split brain that is reliably resolvable by discarding data from the fenced node's replica and resyncing.

> With the DRBD fencing policy "fencing resource-and-stonith;", DRBD will *block* further IO (and not acknowledge anything that did not make it to the peer) until the fence-peer handler returns that it would be safe to resume IO again.
> This avoids the data divergence (aka "split-brain", because usually data-divergence is the result of split-brain).
>
>> and must be retransmitted by the initiator; once the surviving node becomes active it starts committing these writes to its copy of the mirror. I'm fine with a split brain;
>
> You should not be.
> DRBD reporting "split-brain detected" is usually a sign of a bad setup.

Well, by "fine" i meant that i felt i had a clear understanding of how to resolve the split brain without ending up with corruption. But if DRBD is acknowledging writes that haven't been committed on both replicas even with protocol C, i may have been incorrect in this.

>> i can resolve it by discarding outstanding data on the fenced node.
>>
>> But in practice, the actual written data is lost, and i don't understand why. AFAICS, none of the outstanding writes should have been acknowledged by tgtd on the fenced node, so when the surviving node becomes active, the initiator should simply re-send all of them. But this isn't what happens; instead most of the outstanding writes are lost. No i/o error is reported on the initiator; stuff just vanishes.
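For concreteness, the change Lars is recommending would look roughly like the following drbd.conf fragment. This is illustrative only: the resource name r0 is invented, and the exact section in which "fencing" belongs varies between DRBD versions, so check the drbd.conf man page for the version in use. The handler paths are the ones already quoted in the thread.

```
resource r0 {
  disk {
    # block IO and fence the peer on replication-link loss,
    # instead of acknowledging unreplicated writes (resource-only)
    fencing resource-and-stonith;
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```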
Re: [Linux-HA] iSCSI corruption during interconnect failure with pacemaker+tgt+drbd+protocol C
On Wed, Nov 13, 2013 at 03:10:07AM +, Jefferson Ogata wrote:
> Here's a problem i don't understand, and i'd like a solution to if possible, or at least i'd like to understand why it's a problem, because i'm clearly not getting something.
>
> I have an iSCSI target cluster using CentOS 6.4 with stock pacemaker/CMAN/corosync and tgt, and DRBD 8.4 which i've built from source.
>
> Both DRBD and cluster comms use a dedicated crossover link.
>
> The target storage is battery-backed RAID.
>
> DRBD resources all use protocol C.
>
> stonith is configured and working.

What about DRBD fencing.
You have to use "fencing resource-and-stonith;", and a suitable fencing handler.

Because:
> tgtd write cache is disabled using mode_page in additional_params. This is correctly reported using sdparm --get WCE on initiators.
>
> Here's the question: if i am writing from an iSCSI initiator, and i take down the crossover link between the nodes of my cluster, i end up with corrupt data on the target disk.
>
> I know this isn't the formal way to test pacemaker failover. Everything's fine if i fence a node or do a manual migration or shutdown. But i don't understand why taking the crossover down results in corrupted write operations.
>
> In greater detail, assuming the initiator sends a write request for some block, here's the normal sequence as i understand it:
>
> - tgtd receives it and queues it straight for the device backing the LUN (write cache is disabled).
> - drbd receives it, commits it to disk, sends it to the other node, and waits for an acknowledgement (protocol C).
> - the remote node receives it, commits it to disk, and sends an acknowledgement.
> - the initial node receives the drbd acknowledgement, and acknowledges the write to tgtd.
> - tgtd acknowledges the write to the initiator.
>
> Now, suppose an initiator is writing when i take the crossover link down, and pacemaker reacts to the loss in comms by fencing the node with the currently active target. It then brings up the target on the surviving, formerly inactive, node. This results in a drbd split brain, since some writes have been queued on the fenced node but never made it to the surviving node,

But have been acknowledged as written to the initiator, which is why the initiator won't retransmit them.

With the DRBD fencing policy "fencing resource-and-stonith;", DRBD will *block* further IO (and not acknowledge anything that did not make it to the peer) until the fence-peer handler returns that it would be safe to resume IO again.

This avoids the data divergence (aka "split-brain", because usually data-divergence is the result of split-brain).

> and must be retransmitted by the initiator; once the surviving node becomes active it starts committing these writes to its copy of the mirror. I'm fine with a split brain;

You should not be.
DRBD reporting "split-brain detected" is usually a sign of a bad setup.

> i can resolve it by discarding outstanding data on the fenced node.
>
> But in practice, the actual written data is lost, and i don't understand why. AFAICS, none of the outstanding writes should have been acknowledged by tgtd on the fenced node, so when the surviving node becomes active, the initiator should simply re-send all of them. But this isn't what happens; instead most of the outstanding writes are lost. No i/o error is reported on the initiator; stuff just vanishes.
>
> I'm writing directly to a block device for these tests, so the lost data isn't the result of filesystem corruption; it simply never gets written to the target disk on the survivor.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
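The distinction Lars draws between acknowledging and blocking can be sketched as a toy model. This is invented illustrative code, not DRBD internals; every name in it is made up:

```python
# Toy model of DRBD protocol C write semantics, contrasting
# "resource-and-stonith" (block IO, never ack unreplicated writes)
# with "resource-only" (ack anyway, allowing data divergence).

class Peer:
    def __init__(self):
        self.disk = {}
        self.reachable = True

    def replicate(self, block, data):
        if not self.reachable:
            raise ConnectionError("replication link down")
        self.disk[block] = data          # peer commits before acking back

class ProtocolC:
    def __init__(self, peer, resource_and_stonith=True):
        self.disk = {}
        self.peer = peer
        self.resource_and_stonith = resource_and_stonith
        self.suspended = False

    def write(self, block, data):
        if self.suspended:
            return "BLOCKED"             # IO frozen until fencing resolves
        self.disk[block] = data          # local commit
        try:
            self.peer.replicate(block, data)
        except ConnectionError:
            if self.resource_and_stonith:
                self.suspended = True    # freeze IO, do NOT ack
                return "BLOCKED"
            return "ACK"                 # resource-only: ack -> divergence
        return "ACK"                     # acked only after both commits

peer = Peer()
drbd = ProtocolC(peer)
assert drbd.write(0, b"x") == "ACK"      # normal path: both replicas have it
peer.reachable = False                   # interconnect fails
assert drbd.write(1, b"y") == "BLOCKED"  # no ack: initiator will retransmit
assert 1 not in peer.disk                # the write never reached the peer
```

The key property is that a "BLOCKED" write is never acknowledged to the layer above, so the initiator still owns it and will retransmit after failover.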
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] iSCSI corruption during interconnect failure with pacemaker+tgt+drbd+protocol C
13.11.2013 06:10, Jefferson Ogata wrote:
> Here's a problem i don't understand, and i'd like a solution to if possible, or at least i'd like to understand why it's a problem, because i'm clearly not getting something.
>
> I have an iSCSI target cluster using CentOS 6.4 with stock pacemaker/CMAN/corosync and tgt, and DRBD 8.4 which i've built from source.
>
> Both DRBD and cluster comms use a dedicated crossover link.
>
> The target storage is battery-backed RAID.
>
> DRBD resources all use protocol C.
>
> stonith is configured and working.
>
> tgtd write cache is disabled using mode_page in additional_params. This is correctly reported using sdparm --get WCE on initiators.
>
> Here's the question: if i am writing from an iSCSI initiator, and i take down the crossover link between the nodes of my cluster, i end up with corrupt data on the target disk.
>
> I know this isn't the formal way to test pacemaker failover. Everything's fine if i fence a node or do a manual migration or shutdown. But i don't understand why taking the crossover down results in corrupted write operations.
>
> In greater detail, assuming the initiator sends a write request for some block, here's the normal sequence as i understand it:
>
> - tgtd receives it and queues it straight for the device backing the LUN (write cache is disabled).
> - drbd receives it, commits it to disk, sends it to the other node, and waits for an acknowledgement (protocol C).
> - the remote node receives it, commits it to disk, and sends an acknowledgement.
> - the initial node receives the drbd acknowledgement, and acknowledges the write to tgtd.
> - tgtd acknowledges the write to the initiator.
>
> Now, suppose an initiator is writing when i take the crossover link down, and pacemaker reacts to the loss in comms by fencing the node with the currently active target. It then brings up the target on the surviving, formerly inactive, node. This results in a drbd split brain, since some writes have been queued on the fenced node but never made it to the surviving node, and must be retransmitted by the initiator; once the surviving node becomes active it starts committing these writes to its copy of the mirror. I'm fine with a split brain; i can resolve it by discarding outstanding data on the fenced node.
>
> But in practice, the actual written data is lost, and i don't understand why. AFAICS, none of the outstanding writes should have been acknowledged by tgtd on the fenced node, so when the surviving node becomes active, the initiator should simply re-send all of them. But this isn't what happens; instead most of the outstanding writes are lost. No i/o error is reported on the initiator; stuff just vanishes.
>
> I'm writing directly to a block device for these tests, so the lost data isn't the result of filesystem corruption; it simply never gets written to the target disk on the survivor.
>
> What am i missing?

Do you have handlers (fence-peer "/usr/lib/drbd/crm-fence-peer.sh"; after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";) configured in drbd.conf?
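The data-loss mechanism the thread keeps circling can be made concrete with a second toy model (invented code, not iSCSI or DRBD internals): after failover, the initiator retransmits only writes it never saw acknowledged, so any write acknowledged before replication is silently missing on the survivor.

```python
# Toy failover model: which writes survive on the backup node?
# The only variable is whether writes are acked before replication.

def lost_after_failover(ack_before_replication):
    primary, backup = {}, {}
    acked, pending = set(), set()
    link_up = True
    for block in range(10):
        if block == 5:
            link_up = False           # interconnect fails mid-stream
        primary[block] = b"data"      # local commit on the active node
        if link_up:
            backup[block] = b"data"   # protocol C replication + ack
            acked.add(block)
        elif ack_before_replication:
            acked.add(block)          # early ack: initiator forgets the write
        else:
            pending.add(block)        # no ack: initiator keeps it queued
    # primary is fenced; initiator reconnects to the backup and
    # retransmits everything it never saw acknowledged
    for block in pending:
        backup[block] = b"data"
    return sorted(set(range(10)) - set(backup))

assert lost_after_failover(False) == []               # nothing lost
assert lost_after_failover(True) == [5, 6, 7, 8, 9]   # silent loss
```

This matches the symptom described: no I/O error anywhere, yet the acked-but-unreplicated blocks never reach the surviving node's disk.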
Re: [Linux-HA] iSCSI corruption during interconnect failure with pacemaker+tgt+drbd+protocol C
On 13 Nov 2013, at 2:10 pm, Jefferson Ogata wrote:
> Here's a problem i don't understand, and i'd like a solution to if possible, or at least i'd like to understand why it's a problem, because i'm clearly not getting something.
>
> I have an iSCSI target cluster using CentOS 6.4 with stock pacemaker/CMAN/corosync and tgt, and DRBD 8.4 which i've built from source.
>
> Both DRBD and cluster comms use a dedicated crossover link.
>
> The target storage is battery-backed RAID.
>
> DRBD resources all use protocol C.
>
> stonith is configured and working.
>
> tgtd write cache is disabled using mode_page in additional_params. This is correctly reported using sdparm --get WCE on initiators.
>
> Here's the question: if i am writing from an iSCSI initiator, and i take down the crossover link between the nodes of my cluster, i end up with corrupt data on the target disk.
>
> I know this isn't the formal way to test pacemaker failover. Everything's fine if i fence a node or do a manual migration or shutdown. But i don't understand why taking the crossover down results in corrupted write operations.
>
> In greater detail, assuming the initiator sends a write request for some block, here's the normal sequence as i understand it:
>
> - tgtd receives it and queues it straight for the device backing the LUN (write cache is disabled).
> - drbd receives it, commits it to disk, sends it to the other node, and waits for an acknowledgement (protocol C).
> - the remote node receives it, commits it to disk, and sends an acknowledgement.
> - the initial node receives the drbd acknowledgement, and acknowledges the write to tgtd.
> - tgtd acknowledges the write to the initiator.
>
> Now, suppose an initiator is writing when i take the crossover link down, and pacemaker reacts to the loss in comms by fencing the node with the currently active target. It then brings up the target on the surviving, formerly inactive, node. This results in a drbd split brain, since some writes have been queued on the fenced node but never made it to the surviving node, and must be retransmitted by the initiator; once the surviving node becomes active it starts committing these writes to its copy of the mirror. I'm fine with a split brain; i can resolve it by discarding outstanding data on the fenced node.
>
> But in practice, the actual written data is lost, and i don't understand why. AFAICS, none of the outstanding writes should have been acknowledged by tgtd on the fenced node, so when the surviving node becomes active, the initiator should simply re-send all of them. But this isn't what happens; instead most of the outstanding writes are lost. No i/o error is reported on the initiator; stuff just vanishes.
>
> I'm writing directly to a block device for these tests, so the lost data isn't the result of filesystem corruption; it simply never gets written to the target disk on the survivor.
>
> What am i missing?

iSCSI, drbd, etc. are not really my area of expertise, but it may be worth taking the cluster out of the loop and manually performing the equivalent actions. If the underlying drbd and iSCSI setups have a problem, then the cluster isn't going to do much about it.

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
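Taking the cluster out of the loop pairs naturally with a repeatable integrity check: write a seeded pseudo-random pattern before inducing the failure, then verify it afterwards and see exactly which blocks vanished. A minimal sketch (the path below is a placeholder file; a real test would point it at the iSCSI block device on the initiator):

```python
# Write a deterministic pseudo-random pattern, then verify it block
# by block. Mismatched block numbers show exactly what was lost.
import os
import random

DEVICE = "/tmp/fake-blockdev"   # placeholder for e.g. /dev/sdX on an initiator
BLOCK, COUNT, SEED = 4096, 256, 42

def _stream():
    rng = random.Random(SEED)
    while True:
        yield bytes(rng.getrandbits(8) for _ in range(BLOCK))

def write_pattern(path):
    gen = _stream()
    with open(path, "wb") as f:
        for _ in range(COUNT):
            f.write(next(gen))

def verify_pattern(path):
    gen = _stream()
    bad = []
    with open(path, "rb") as f:
        for i in range(COUNT):
            if f.read(BLOCK) != next(gen):
                bad.append(i)
    return bad                  # block numbers that don't match

write_pattern(DEVICE)
assert verify_pattern(DEVICE) == []   # clean run: no corruption
os.remove(DEVICE)
```

Run `write_pattern` from the initiator, pull the interconnect mid-write or after, then run `verify_pattern` once the survivor is serving the LUN; the returned list localizes the damage instead of leaving it to a filesystem to notice.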