Re: [Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)
19.11.2013 13:48, Lars Ellenberg wrote:
> On Wed, Nov 13, 2013 at 09:02:47AM +0300, Vladislav Bogdanov wrote:
>> 13.11.2013 04:46, Jefferson Ogata wrote:
>> ...
>>>
>>> In practice i ran into failover problems under load almost immediately. Under load, when i would initiate a failover, there was a race condition: the iSCSILogicalUnit RA will take down the LUNs one at a time, waiting for each connection to terminate, and if the initiators reconnect quickly enough, they get pissed off at finding that the target still exists but the LUN they were using no longer does, which is often the case during this transient takedown process. On the initiator, it looks something like this, and it's fatal (here LUN 4 has gone away but the target is still alive, maybe working on disconnecting LUN 3):
>>>
>>> Nov 7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Sense Key : Illegal Request [current]
>>> Nov 7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Add. Sense: Logical unit not supported
>>> Nov 7 07:39:29 s01c kernel: Buffer I/O error on device sde, logical block 16542656
>>>
>>> One solution to this is using the portblock RA to block all initiator
>>
>> In addition I force use of multipath on initiators with no_path_retry=queue
>>
>> ...
>>
>>> 1. Lack of support for multiple targets using the same tgt account. This is a problem because the iSCSITarget RA defines the user and the target at the same time. If it allowed multiple targets to use the same user, it wouldn't know when it is safe to delete the user in a stop operation, because some other target might still be using it.
>>>
>>> To solve this i did two things: first i wrote a new RA that manages a
>
> Did I miss it, or did you post it somewhere?
> Fork on Github and push there, so we can have a look?
>
>>> tgt user; this is instantiated as a clone so it runs along with the tgtd clone. Second i tweaked the iSCSITarget RA so that on start, if incoming_username is defined but incoming_password is not, the RA skips the account creation step and simply binds the new target to incoming_username. On stop, it similarly no longer deletes the account if incoming_password is unset. I also had to relax the uniqueness constraint on incoming_username in the RA metadata.
>>>
>>> 2. Disappearing LUNs during failover cause initiators to blow chunks. For this i used portblock, but had to modify it because the TCP Send-Q would never drain.
>>>
>>> 3. portblock preventing TCP Send-Q from draining, causing tgtd connections to hang. I modified portblock to reverse the sense of the iptables rules it was adding: instead of blocking traffic from the initiator on the INPUT chain, it now blocks traffic from the target on the OUTPUT chain with a tcp-reset response. With this setup, as soon as portblock goes active, the next packet tgtd attempts to send to a given initiator will get a TCP RST response, causing tgtd to hang up the connection immediately. This configuration allows the connections to terminate promptly under load.
>>>
>>> I'm not totally satisfied with this workaround. It means acknowledgements of operations tgtd has actually completed never make it back to the initiator. I suspect this could cause problems in some scenarios. I don't think it causes a problem the way i'm using it, with each LUN as backing store for a distinct VM--when the LUN is back up on the other node, the outstanding operations are re-sent by the initiator. Maybe with a clustered filesystem this would cause problems; it certainly would cause problems if the target device were, for example, a tape drive.
>
> Maybe only block "new" incoming connection attempts?

That may cause issues on an initiator side in some circumstances (IIRC):

* connection is established
* pacemaker fires target move
* target is destroyed, connection breaks (TCP RST is sent to initiator)
* initiator connects again
* target is not available on iSCSI level (but portals answer either on old or on new node) or portals are not available
* initiator *returns error* to an upper layer <- this one is important
* target is configured on other node then

I was hit by this, but that was several years ago, so I may miss some details.

My experience with IET and LIO shows it is better (safer) to block all iSCSI traffic to the target's portals, in both directions:

* connection is established
* pacemaker fires target move
* both directions are blocked (DROP) on both target nodes
* target is destroyed, connection stays "established" on initiator side, just TCP packets time out
* target is configured on other node (VIPs are moved too)
* firewall rules are removed
* initiator (re)sends request
* target sends RST (?) back - it doesn't have that connection
* initiator reconnects and continues to use target
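For illustration only, blocking both directions with DROP as described in the list above might boil down to something like the following on each target node. This is a sketch, not the actual resource agent: the VIP 192.0.2.10 and the default iSCSI port 3260 are placeholders.

# Sketch: block all iSCSI traffic for the portal VIP in both directions
# while the target is being torn down (placeholder VIP, default iSCSI port).
VIP=192.0.2.10
iptables -I INPUT  -d "$VIP" -p tcp --dport 3260 -j DROP
iptables -I OUTPUT -s "$VIP" -p tcp --sport 3260 -j DROP

# ... tear down the target here, move the VIP to the other node ...

# Once the target and VIP are up on the new node, remove the rules again.
iptables -D INPUT  -d "$VIP" -p tcp --dport 3260 -j DROP
iptables -D OUTPUT -s "$VIP" -p tcp --sport 3260 -j DROP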
Re: [Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)
On 2013-11-13 06:02, Vladislav Bogdanov wrote:
> 13.11.2013 04:46, Jefferson Ogata wrote:
[snip]
>> 4. "Insufficient privileges" faults in the portblock RA. This was another race condition that occurred because i was using multiple targets, meaning that without a mutex, multiple portblock invocations would be running in parallel during a failover. If you try to run iptables while another iptables is running, you get "Resource not available", and this was coming back to pacemaker as "insufficient privileges". This is simply a bug in the portblock RA; it should have a mutex to prevent parallel iptables invocations. I fixed this by adding an ocf_release_lock_on_exit at the top, and adding an ocf_take_lock for start, stop, monitor, and status operations.
>>
>> I'm not sure why more people haven't run into these problems before. I hope it's not that i'm doing things wrong, but rather that few others have earnestly tried to build anything quite like this setup. If anyone out there has set up a similar cluster and *not* had these problems, i'd like to know about it. Meanwhile, if others *have* had these problems, i'd also like to know, especially if they've found alternate solutions.
>
> Can't say about 1, I use IET, it doesn't seem to have that limitation.
>
> 2 - I use an alternative home-brew ms RA which blocks (DROP) both input and output for a specified VIP on demote (targets are configured to be bound to those VIPs). I also export one big LUN per target and then set up a clvm VG on top of it (all initiators are in the same separate cluster).
>
> 3 - can't say as well, IET is probably not affected.
>
> 4 - That is true, iptables doesn't have atomic rules management, so you definitely need a mutex or a dispatcher like firewalld (didn't try it though).

The issue here is not really the lack of atomic rules management (which you can hack around in iptables by defining a new chain for the rules you want to add and adding a jump to it in the INPUT or OUTPUT chain). The issue i was encountering is that a running iptables process holds a lock; if a second iptables process attempts to run concurrently, it can't take the lock, and it fails. The RA was feeding this back to pacemaker as a resource start/stop failure, when really the RA should have been preventing concurrent execution of iptables in the first place.

Thanks for the feedback.
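As a rough sketch of the serialization described above (not the actual patched portblock), the ocf_take_lock / ocf_release_lock_on_exit helpers from ocf-shellfuncs can be used like this; the lock file name is arbitrary:

#!/bin/sh
# Sketch only: serialize iptables invocations inside an OCF RA using the
# lock helpers from ocf-shellfuncs.
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

# Arbitrary lock file shared by all portblock-style instances on this node.
iptables_lock="${HA_RSCTMP}/portblock.iptables.lock"

case "$1" in
    start|stop|monitor|status)
        # Make sure the lock is dropped when the RA exits, even on error,
        # then wait until no other instance is running iptables.
        ocf_release_lock_on_exit "$iptables_lock"
        ocf_take_lock "$iptables_lock"
        ;;
esac

# ... iptables calls go here, now guaranteed not to run concurrently ...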
Re: [Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)
On 2013-11-19 10:48, Lars Ellenberg wrote:
> On Wed, Nov 13, 2013 at 09:02:47AM +0300, Vladislav Bogdanov wrote:
>> 13.11.2013 04:46, Jefferson Ogata wrote:
>> ...
>>>
>>> In practice i ran into failover problems under load almost immediately. Under load, when i would initiate a failover, there was a race condition: the iSCSILogicalUnit RA will take down the LUNs one at a time, waiting for each connection to terminate, and if the initiators reconnect quickly enough, they get pissed off at finding that the target still exists but the LUN they were using no longer does, which is often the case during this transient takedown process. On the initiator, it looks something like this, and it's fatal (here LUN 4 has gone away but the target is still alive, maybe working on disconnecting LUN 3):
>>>
>>> Nov 7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Sense Key : Illegal Request [current]
>>> Nov 7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Add. Sense: Logical unit not supported
>>> Nov 7 07:39:29 s01c kernel: Buffer I/O error on device sde, logical block 16542656
>>>
>>> One solution to this is using the portblock RA to block all initiator
>>
>> In addition I force use of multipath on initiators with no_path_retry=queue
>>
>> ...
>>
>>> 1. Lack of support for multiple targets using the same tgt account. This is a problem because the iSCSITarget RA defines the user and the target at the same time. If it allowed multiple targets to use the same user, it wouldn't know when it is safe to delete the user in a stop operation, because some other target might still be using it.
>>>
>>> To solve this i did two things: first i wrote a new RA that manages a
>
> Did I miss it, or did you post it somewhere?
> Fork on Github and push there, so we can have a look?

Not set up with git right now; i've attached it here. It's short.

>>> tgt user; this is instantiated as a clone so it runs along with the tgtd clone. Second i tweaked the iSCSITarget RA so that on start, if incoming_username is defined but incoming_password is not, the RA skips the account creation step and simply binds the new target to incoming_username. On stop, it similarly no longer deletes the account if incoming_password is unset. I also had to relax the uniqueness constraint on incoming_username in the RA metadata.
>>>
>>> 2. Disappearing LUNs during failover cause initiators to blow chunks. For this i used portblock, but had to modify it because the TCP Send-Q would never drain.
>>>
>>> 3. portblock preventing TCP Send-Q from draining, causing tgtd connections to hang. I modified portblock to reverse the sense of the iptables rules it was adding: instead of blocking traffic from the initiator on the INPUT chain, it now blocks traffic from the target on the OUTPUT chain with a tcp-reset response. With this setup, as soon as portblock goes active, the next packet tgtd attempts to send to a given initiator will get a TCP RST response, causing tgtd to hang up the connection immediately. This configuration allows the connections to terminate promptly under load.
>>>
>>> I'm not totally satisfied with this workaround. It means acknowledgements of operations tgtd has actually completed never make it back to the initiator. I suspect this could cause problems in some scenarios. I don't think it causes a problem the way i'm using it, with each LUN as backing store for a distinct VM--when the LUN is back up on the other node, the outstanding operations are re-sent by the initiator. Maybe with a clustered filesystem this would cause problems; it certainly would cause problems if the target device were, for example, a tape drive.
>
> Maybe only block "new" incoming connection attempts?

That's a good idea. Theoretically that should allow the existing connections to drain. I'm worried it can lead to pacemaker timeouts firing if there's a lot of queued data in the send queues, but i'll test it. Thanks for the suggestion.

#!/bin/sh
#
# Resource script for managing tgt users
#
# Description: Manages a tgt user as an OCF resource in
#              an High Availability setup.
#
# Author: Jefferson Ogata
# License: GNU General Public License (GPL)
#
#
# usage: $0 {start|stop|status|monitor|validate-all|meta-data}
#
# The "start" arg adds the user.
#
# The "stop" arg deletes it.
#
# OCF parameters:
#  OCF_RESKEY_username
#  OCF_RESKEY_password
#
# Example crm configuration:
#
#  primitive tgtd lsb:tgtd op monitor interval="10s"
#  clone clone.tgtd tgtd
#  primitive user.foo ocf:heartbeat:tgtUser params username="foo" password="secret"
#  clone clone.user.foo user.foo
#  order clone.tgtd_before_clone.user.foo inf: clone.tgtd:start clone.user.foo:start
#
##
# Initialization:
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

USAGE="Usage: $0 {start|stop|status|monitor|validate-all|meta-data}";

##

usage() {
    echo $USAGE >&2
}

meta_data() {
    cat < 1
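For readers who don't use tgt: the account handling such an RA wraps is essentially tgtadm's account mode, roughly as follows. This is a sketch; the user "foo", password "secret" and tid 1 are placeholders, not values from the attached script.

# start: create the account on the local tgtd
tgtadm --lld iscsi --mode account --op new --user foo --password secret

# monitor/status: check that the account exists
tgtadm --lld iscsi --mode account --op show | grep -qw foo

# stop: delete the account again
tgtadm --lld iscsi --mode account --op delete --user foo

# binding an existing account to a target (what iSCSITarget does when
# incoming_username is set and the account already exists):
tgtadm --lld iscsi --mode account --op bind --tid 1 --user foo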
Re: [Linux-HA] iSCSI corruption during interconnect failure with pacemaker+tgt+drbd+protocol C
On 2013-11-19 09:32, Lars Ellenberg wrote:
> On Wed, Nov 13, 2013 at 03:10:07AM +0000, Jefferson Ogata wrote:
>> Here's a problem i don't understand, and i'd like a solution to if possible, or at least i'd like to understand why it's a problem, because i'm clearly not getting something.
>>
>> I have an iSCSI target cluster using CentOS 6.4 with stock pacemaker/CMAN/corosync and tgt, and DRBD 8.4 which i've built from source.
>>
>> Both DRBD and cluster comms use a dedicated crossover link.
>>
>> The target storage is battery-backed RAID.
>>
>> DRBD resources all use protocol C.
>>
>> stonith is configured and working.
>
> What about DRBD fencing. You have to use "fencing resource-and-stonith;", and a suitable fencing handler.

Currently i have "fencing resource-only;" and fence-peer "/usr/lib/drbd/crm-fence-peer.sh"; after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh"; in my DRBD config. stonith is configured in pacemaker. This was the best i could come up with from what documentation i was able to find. I will try using "fencing resource-and-stonith;" but i'm unclear on whether that requires some sort of additional stonith configuration in DRBD, which i didn't think would be necessary.

> Because:
>
>> tgtd write cache is disabled using mode_page in additional_params. This is correctly reported using sdparm --get WCE on initiators.
>>
>> Here's the question: if i am writing from an iSCSI initiator, and i take down the crossover link between the nodes of my cluster, i end up with corrupt data on the target disk.
>>
>> I know this isn't the formal way to test pacemaker failover. Everything's fine if i fence a node or do a manual migration or shutdown. But i don't understand why taking the crossover down results in corrupted write operations.
>>
>> In greater detail, assuming the initiator sends a write request for some block, here's the normal sequence as i understand it:
>>
>> - tgtd receives it and queues it straight for the device backing the LUN (write cache is disabled).
>> - drbd receives it, commits it to disk, sends it to the other node, and waits for an acknowledgement (protocol C).
>> - the remote node receives it, commits it to disk, and sends an acknowledgement.
>> - the initial node receives the drbd acknowledgement, and acknowledges the write to tgtd.
>> - tgtd acknowledges the write to the initiator.
>>
>> Now, suppose an initiator is writing when i take the crossover link down, and pacemaker reacts to the loss in comms by fencing the node with the currently active target. It then brings up the target on the surviving, formerly inactive, node. This results in a drbd split brain, since some writes have been queued on the fenced node but never made it to the surviving node,
>
> But have been acknowledged as written to the initiator, which is why the initiator won't retransmit them.

This is the crux of what i'm not understanding: why, if i'm using protocol C, would DRBD acknowledge a write before it's been committed to the remote replica? If so, then i really don't understand the point of protocol C. Or is tgtd acknowledging writes before they've been committed by the underlying backing store? I thought disabling the write cache would prevent that.

What i *thought* should happen was that writes received by the target after the crossover link fails would not be acknowledged under protocol C, and would be retransmitted after fencing completed and the backup node becomes primary. These writes overlap with writes that were committed on the fenced node's replica but hadn't been transmitted to the other replica, so this results in a split brain that is reliably resolvable by discarding data from the fenced node's replica and resyncing.

> With the DRBD fencing policy "fencing resource-and-stonith;", DRBD will *block* further IO (and not acknowledge anything that did not make it to the peer) until the fence-peer handler returns that it would be safe to resume IO again. This avoids the data divergence (aka "split-brain", because usually data-divergence is the result of split-brain).
>
>> and must be retransmitted by the initiator; once the surviving node becomes active it starts committing these writes to its copy of the mirror. I'm fine with a split brain;
>
> You should not be. DRBD reporting "split-brain detected" is usually a sign of a bad setup.

Well, by "fine" i meant that i felt i had a clear understanding of how to resolve the split brain without ending up with corruption. But if DRBD is acknowledging writes that haven't been committed on both replicas even with protocol C, i may have been incorrect in this.

>> i can resolve it by discarding outstanding data on the fenced node.
>>
>> But in practice, the actual written data is lost, and i don't understand why. AFAICS, none of the outstanding writes should have been acknowledged by tgtd on the fenced node, so when the surviving node becomes active, the initiator should simply re-send all of them. But this isn't what happens; instead most of the outstanding writes are lost. No i/o error is reported on the initiator; stuff just vanishes.
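For reference, the fencing setup being recommended here would look roughly like this in the DRBD resource configuration. The resource name r0 is a placeholder; the handler paths are the ones already quoted above. A sketch only, not a complete resource definition:

resource r0 {
    net {
        protocol C;
    }
    disk {
        fencing resource-and-stonith;    # instead of resource-only
    }
    handlers {
        fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
    # ... devices, disks and addresses unchanged ...
}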
Re: [Linux-HA] Ping domain
Good day. Thank you for replying.

Brent

On 19/11/2013 13:01, Lars Ellenberg wrote:
> On Tue, Nov 19, 2013 at 10:02:45AM +0200, Brent Clark wrote:
>> Good day.
>>
>> I would like to ask, if you can supply a domain for ping in ha.cf.
>>
>> If anyone can assist, it would be appreciated.
>
> I'm not exactly sure what you are asking here, but maybe this helps:
> http://moin.linux-ha.org/PingDirective
>
> Or consider switching to Pacemaker...
Re: [Linux-HA] Ping domain
On Tue, Nov 19, 2013 at 10:02:45AM +0200, Brent Clark wrote:
> Good day.
>
> I would like to ask, if you can supply a domain for ping in ha.cf.
>
> If anyone can assist, it would be appreciated.

I'm not exactly sure what you are asking here, but maybe this helps:
http://moin.linux-ha.org/PingDirective

Or consider switching to Pacemaker...

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
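For what it's worth, the PingDirective page boils down to something like the following lines in ha.cf. This is a sketch: 192.0.2.1 is a placeholder (ping nodes are normally given as the IP address of something that is always up, such as a router), and the ipfail path can differ between distributions.

# ha.cf sketch: a ping node plus ipfail (placeholder address)
ping 192.0.2.1
respawn hacluster /usr/lib/heartbeat/ipfail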
Re: [Linux-HA] Heartbeat errors related to Gmain_timeout_dispatch at low traffic
On Thu, Nov 14, 2013 at 04:46:16PM +0530, Savita Kulkarni wrote:
> Hi,
>
> Recently we have been seeing lots of heartbeat errors related to Gmain_timeout_dispatch on our system. I checked the mailing list archives to see if other people have faced this issue. There are a few email threads about it, but those people are seeing the issue under high load.
>
> On our system there is very low or no load.
>
> We are running heartbeat on guest VMs, using VMware ESXi 5.0. We have heartbeat-2.1.3-4. It is working fine without any issues on other setups; the issue is coming up only on this setup.
>
> The following types of errors are present in /var/log/messages:
>
> Nov 12 09:58:43 heartbeat: [23036]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 15270 ms (> 1010 ms) before being called (GSource: 0x138926b8)
> Nov 12 09:59:00 heartbeat: [23036]: info: Gmain_timeout_dispatch: started at 583294569 should have started at 583293042
> Nov 12 09:59:00 heartbeat: [23036]: WARN: Gmain_timeout_dispatch: Dispatch function for update msgfree count was delayed 33960 ms (> 1 ms) before being called (GSource: 0x13892f58)
>
> Can anyone tell me what can be the issue?
>
> Can it be a hardware issue?

Could be many things, even that, yes.

Could be that upgrading to recent heartbeat 3 helps.

Could be that there is too little load, and your virtualization just stops scheduling the VM itself, because it thinks it is underutilized...

Does it recover if you kill/restart heartbeat?

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
Re: [Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)
On Wed, Nov 13, 2013 at 09:02:47AM +0300, Vladislav Bogdanov wrote:
> 13.11.2013 04:46, Jefferson Ogata wrote:
> ...
> >
> > In practice i ran into failover problems under load almost immediately. Under load, when i would initiate a failover, there was a race condition: the iSCSILogicalUnit RA will take down the LUNs one at a time, waiting for each connection to terminate, and if the initiators reconnect quickly enough, they get pissed off at finding that the target still exists but the LUN they were using no longer does, which is often the case during this transient takedown process. On the initiator, it looks something like this, and it's fatal (here LUN 4 has gone away but the target is still alive, maybe working on disconnecting LUN 3):
> >
> > Nov 7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Sense Key : Illegal Request [current]
> > Nov 7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Add. Sense: Logical unit not supported
> > Nov 7 07:39:29 s01c kernel: Buffer I/O error on device sde, logical block 16542656
> >
> > One solution to this is using the portblock RA to block all initiator
>
> In addition I force use of multipath on initiators with no_path_retry=queue
>
> ...
>
> > 1. Lack of support for multiple targets using the same tgt account. This is a problem because the iSCSITarget RA defines the user and the target at the same time. If it allowed multiple targets to use the same user, it wouldn't know when it is safe to delete the user in a stop operation, because some other target might still be using it.
> >
> > To solve this i did two things: first i wrote a new RA that manages a

Did I miss it, or did you post it somewhere?
Fork on Github and push there, so we can have a look?

> > tgt user; this is instantiated as a clone so it runs along with the tgtd clone. Second i tweaked the iSCSITarget RA so that on start, if incoming_username is defined but incoming_password is not, the RA skips the account creation step and simply binds the new target to incoming_username. On stop, it similarly no longer deletes the account if incoming_password is unset. I also had to relax the uniqueness constraint on incoming_username in the RA metadata.
> >
> > 2. Disappearing LUNs during failover cause initiators to blow chunks. For this i used portblock, but had to modify it because the TCP Send-Q would never drain.
> >
> > 3. portblock preventing TCP Send-Q from draining, causing tgtd connections to hang. I modified portblock to reverse the sense of the iptables rules it was adding: instead of blocking traffic from the initiator on the INPUT chain, it now blocks traffic from the target on the OUTPUT chain with a tcp-reset response. With this setup, as soon as portblock goes active, the next packet tgtd attempts to send to a given initiator will get a TCP RST response, causing tgtd to hang up the connection immediately. This configuration allows the connections to terminate promptly under load.
> >
> > I'm not totally satisfied with this workaround. It means acknowledgements of operations tgtd has actually completed never make it back to the initiator. I suspect this could cause problems in some scenarios. I don't think it causes a problem the way i'm using it, with each LUN as backing store for a distinct VM--when the LUN is back up on the other node, the outstanding operations are re-sent by the initiator. Maybe with a clustered filesystem this would cause problems; it certainly would cause problems if the target device were, for example, a tape drive.

Maybe only block "new" incoming connection attempts?

> > 4. "Insufficient privileges" faults in the portblock RA. This was another race condition that occurred because i was using multiple targets, meaning that without a mutex, multiple portblock invocations would be running in parallel during a failover. If you try to run iptables while another iptables is running, you get "Resource not available", and this was coming back to pacemaker as "insufficient privileges". This is simply a bug in the portblock RA; it should have a mutex to prevent parallel iptables invocations. I fixed this by adding an ocf_release_lock_on_exit at the top, and adding an ocf_take_lock for start, stop, monitor, and status operations.
> >
> > I'm not sure why more people haven't run into these problems before. I hope it's not that i'm doing things wrong, but rather that few others have earnestly tried to build anything quite like this setup. If anyone out there has set up a similar cluster and *not* had these problems, i'd like to know about it. Meanwhile, if others *have* had these problems, i'd also like to know, especially if they've found alternate solutions.
>
> Can't say about 1, I use IET, it doesn't seem to have that limitation.
> 2 - I use alternative home
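In iptables terms, the two variants discussed in this thread look roughly like this. A sketch only, not the patched portblock: port 3260 and the initiator address 192.0.2.20 are placeholders.

# Reversed portblock as described above: reset the target's outbound
# traffic so tgtd drops the connection immediately.
iptables -I OUTPUT -p tcp --sport 3260 -d 192.0.2.20 -j REJECT --reject-with tcp-reset

# The alternative suggested here: leave established connections alone
# and only refuse new connection attempts to the portal.
iptables -I INPUT -p tcp --dport 3260 -m state --state NEW -j DROP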
Re: [Linux-HA] iSCSI corruption during interconnect failure with pacemaker+tgt+drbd+protocol C
On Wed, Nov 13, 2013 at 03:10:07AM +0000, Jefferson Ogata wrote:
> Here's a problem i don't understand, and i'd like a solution to if possible, or at least i'd like to understand why it's a problem, because i'm clearly not getting something.
>
> I have an iSCSI target cluster using CentOS 6.4 with stock pacemaker/CMAN/corosync and tgt, and DRBD 8.4 which i've built from source.
>
> Both DRBD and cluster comms use a dedicated crossover link.
>
> The target storage is battery-backed RAID.
>
> DRBD resources all use protocol C.
>
> stonith is configured and working.

What about DRBD fencing. You have to use "fencing resource-and-stonith;", and a suitable fencing handler.

Because:

> tgtd write cache is disabled using mode_page in additional_params. This is correctly reported using sdparm --get WCE on initiators.
>
> Here's the question: if i am writing from an iSCSI initiator, and i take down the crossover link between the nodes of my cluster, i end up with corrupt data on the target disk.
>
> I know this isn't the formal way to test pacemaker failover. Everything's fine if i fence a node or do a manual migration or shutdown. But i don't understand why taking the crossover down results in corrupted write operations.
>
> In greater detail, assuming the initiator sends a write request for some block, here's the normal sequence as i understand it:
>
> - tgtd receives it and queues it straight for the device backing the LUN (write cache is disabled).
> - drbd receives it, commits it to disk, sends it to the other node, and waits for an acknowledgement (protocol C).
> - the remote node receives it, commits it to disk, and sends an acknowledgement.
> - the initial node receives the drbd acknowledgement, and acknowledges the write to tgtd.
> - tgtd acknowledges the write to the initiator.
>
> Now, suppose an initiator is writing when i take the crossover link down, and pacemaker reacts to the loss in comms by fencing the node with the currently active target. It then brings up the target on the surviving, formerly inactive, node. This results in a drbd split brain, since some writes have been queued on the fenced node but never made it to the surviving node,

But have been acknowledged as written to the initiator, which is why the initiator won't retransmit them.

With the DRBD fencing policy "fencing resource-and-stonith;", DRBD will *block* further IO (and not acknowledge anything that did not make it to the peer) until the fence-peer handler returns that it would be safe to resume IO again. This avoids the data divergence (aka "split-brain", because usually data-divergence is the result of split-brain).

> and must be retransmitted by the initiator; once the surviving node becomes active it starts committing these writes to its copy of the mirror. I'm fine with a split brain;

You should not be. DRBD reporting "split-brain detected" is usually a sign of a bad setup.

> i can resolve it by discarding outstanding data on the fenced node.
>
> But in practice, the actual written data is lost, and i don't understand why. AFAICS, none of the outstanding writes should have been acknowledged by tgtd on the fenced node, so when the surviving node becomes active, the initiator should simply re-send all of them. But this isn't what happens; instead most of the outstanding writes are lost. No i/o error is reported on the initiator; stuff just vanishes.
>
> I'm writing directly to a block device for these tests, so the lost data isn't the result of filesystem corruption; it simply never gets written to the target disk on the survivor.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
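The write-cache check mentioned above can be repeated from any initiator roughly like this; /dev/sde is a placeholder for the iSCSI disk, and WCE should report 0 when the write cache is off.

# Show just the Write Cache Enable bit...
sdparm --get WCE /dev/sde
# ...or the whole caching mode page.
sdparm --page ca /dev/sde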