Thanks Dan - I missed that those messages were in dmesg. Starting from two connected nodes (Secondary/Secondary), we set one to primary ("drbdadm primary drbd-sr1"). From my perspective, my SSH connection drops and the machine locks up for about 5 minutes.
Same behavior on both nodes: either one freezes for about 5 minutes when set to primary, so it doesn't appear to be a hardware issue specific to one of them. Below is what I'm seeing in dmesg. Note - the two nodes in question are connected by a cross-over gigabit cable.

Very weird behavior. After 5 minutes of freezing up, the node came up again and everything seems to be OK. Anyone have any ideas?

block drbd1: role( Secondary -> Primary )
d-con drbd-sr1: asender terminated
d-con drbd-sr1: Terminating asender thread
d-con drbd-sr1: Connection closed
block drbd1: new current UUID 5A99C51D68CDB447:188E44BA42FFFCF4:2460EA01C7EA7F96:245FEA01C7EA7F96
d-con drbd-sr1: conn( BrokenPipe -> Unconnected )
d-con drbd-sr1: receiver terminated
d-con drbd-sr1: Restarting receiver thread
d-con drbd-sr1: receiver (re)started
d-con drbd-sr1: conn( Unconnected -> WFConnection )
d-con drbd-sr1: initial packet S crossed
d-con drbd-sr1: Handshake successful: Agreed network protocol version 101
d-con drbd-sr1: conn( WFConnection -> WFReportParams )
d-con drbd-sr1: Starting asender thread (from drbd_r_drbd-sr1 [26469])
block drbd1: drbd_sync_handshake:
block drbd1: self 5A99C51D68CDB447:188E44BA42FFFCF4:2460EA01C7EA7F96:245FEA01C7EA7F96 bits:0 flags:0
block drbd1: peer 188E44BA42FFFCF4:0000000000000000:2460EA01C7EA7F96:245FEA01C7EA7F96 bits:0 flags:0
block drbd1: uuid_compare()=1 by rule 70
block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1
block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 0 (0x0)
block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
block drbd1: Began resync as SyncSource (will sync 0 KB [0 bits set]).
block drbd1: updated sync UUID 5A99C51D68CDB447:188F44BA42FFFCF4:188E44BA42FFFCF4:2460EA01C7EA7F96
block drbd1: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
block drbd1: updated UUIDs 5A99C51D68CDB447:0000000000000000:188F44BA42FFFCF4:188E44BA42FFFCF4
block drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )

On Fri, Oct 5, 2012 at 6:39 PM, Dan Barker <dbar...@visioncomm.net> wrote:

> dmesg | grep sr1 should show you all you need to know.
>
> Dan (there’s that word “should” again<g>)
>
> From: drbd-user-boun...@lists.linbit.com [mailto:drbd-user-boun...@lists.linbit.com] On Behalf Of Andrew Eross
> Sent: Friday, October 05, 2012 2:17 PM
> To: drbd-user@lists.linbit.com
> Subject: [DRBD-user] IO Error Logging
>
> Hi guys,
>
> I'm trying to debug a SSD drive that's the backing device for my secondary
> node.
>
> The primary/secondary are sync'd (protocol C) and everything goes fine
> until I get to testing fail-over, e.g. on the primary "drbdadm secondary
> drbd-sr1", and on the secondary "drbdadm primary drbd-sr1".
>
> When I do this the secondary locks up for about 5 minutes (SSH session
> drops) then it starts responding again and I see drbd has now dropped into
> diskless mode.
>
> I'm thinking there might be IO errors occurring with the underlying disk
> and perhaps drbd is automatically detaching it.
>
> Right now I'm running badblocks on the backing device and seeing if it can
> find any problems.
>
> In the meantime I've been trying to figure out how to get more information
> about IO errors from drbd.
>
> My devices are configured with "detach" as recommended
> (http://www.drbd.org/users-guide/s-configure-io-error-behavior.html),
> however, I'm not sure how to find out more information about when this
> event occurs.
>
> Are there any debugging options I can enable that would help me see IO
> error details that caused a detach?
>
> Thanks!
> Andrew
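
P.S. In case it helps anyone reading a dump like the dmesg output above, here is a rough Python sketch for pulling out just the state transitions (the "role( Secondary -> Primary )" style entries). It assumes the log lines look like the ones pasted in this message. The function name and the regex are my own for illustration, not anything that ships with DRBD.

```python
import re

# Matches entries such as "conn( WFConnection -> WFReportParams )".
# The pattern is inferred from the dmesg lines in this message only;
# it is an assumption, not a documented DRBD log format.
TRANSITION = re.compile(r"\b(\w+)\( (\w+) -> (\w+) \)")


def extract_transitions(log_text):
    """Return (field, old_state, new_state) tuples in order of appearance."""
    return [match.groups() for match in TRANSITION.finditer(log_text)]


if __name__ == "__main__":
    # Three lines copied from the dmesg output in this message.
    sample = (
        "block drbd1: role( Secondary -> Primary )\n"
        "d-con drbd-sr1: conn( BrokenPipe -> Unconnected )\n"
        "block drbd1: peer( Unknown -> Secondary ) "
        "conn( WFReportParams -> WFBitMapS ) "
        "pdsk( DUnknown -> Consistent )\n"
    )
    for field, old, new in extract_transitions(sample):
        print(f"{field}: {old} -> {new}")
```

Feeding it the output of "dmesg | grep sr1" (or "grep drbd1") should give a compact timeline of the role, conn, peer and pdsk changes without the UUID and bitmap noise.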
_______________________________________________ drbd-user mailing list drbd-user@lists.linbit.com http://lists.linbit.com/mailman/listinfo/drbd-user