On 3/4/11 4:58 PM, William Seligman wrote:
> On 3/4/11 12:38 PM, William Seligman wrote:
>> I've RTFM'ed and google'd on this problem. Now I ask the experts.
> 
> Now that I've joined this list, I looked at the archives directly. I see that
> Cory Coager reported the same problem:
> 
> http://lists.linbit.com/pipermail/drbd-user/2011-March/015735.html
> 
> Lars Ellenberg suggested that the problem was due to a bad NIC. Maybe... but
> what are the odds that two different systems have a bad NIC?

I just did a little archeology. I haven't always experienced these regular DRBD
sync error messages. They began when I made two changes to my configuration file:

- I switched from "Protocol C" to "Protocol A".
- I added "net { ping-timeout 100; }"

Is either of these changes likely to cause problems?

My next step would normally be to reverse those changes, but these are
production systems and it's hard for me to perform tests.

>> Setup: Two systems; hypatia is primary, orestes is secondary. OS is Scientific
>> Linux 5.5: kernel 2.6.18-194.26.1.el5xen; DRBD version drbd-8.3.8.1-30.el5.
>>
>> Each has two partitions that are used for separate DRBD devices: /dev/md0
>> (software RAID1) and /dev/sdd2. On both systems:
>>
>> partition /dev/md0 => device drbd1
>> partition /dev/sdd2 => device drbd2
>>
>> The DRBD traffic goes over a single Ethernet cable that connects the two 
>> systems.
>>
>> For drbd1, the control hierarchy is Corosync->DRBD->LVM->Xen.
>> For drbd2, the control is Corosync->DRBD->Just mount the thing.
>>
>> The complicated one is drbd1, but it seems to work just fine. The problem
>> appears to be with drbd2, which doesn't do much of anything; it's a work/backup
>> directory which I use to take infrequent (~two months) snapshots of the virtual
>> machines on drbd1.
>>
>> Every ten seconds, the error messages at the end of this post appear in the log
>> of the primary system and there are similar lines on the secondary system. It
>> seems that drbd2 is losing its connection, re-establishing, and doing a re-sync.
>>
>> Everything works, most of the time. But once every few weeks there's enough of a
>> delay that Corosync takes notice and STONITHs one of the systems, which is a big
>> pain.
>>
>> I've tried:
>> - switching from Protocol C to Protocol A
>> - setting "net {ping-timeout 100;}"
>> - throttling the connection by "syncer {rate 10M;}" (used to be 100M)
>>
>> Any ideas?
>>
>> Mar  4 12:26:25 hypatia kernel: block drbd2: meta connection shut down by peer.
>> Mar  4 12:26:25 hypatia kernel: block drbd2: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
>> Mar  4 12:26:25 hypatia kernel: block drbd2: asender terminated
>> Mar  4 12:26:25 hypatia kernel: block drbd2: Terminating asender thread
>> Mar  4 12:26:25 hypatia kernel: block drbd2: sock was shut down by peer
>> Mar  4 12:26:25 hypatia kernel: block drbd2: short read expecting header on sock: r=0
>> Mar  4 12:26:25 hypatia kernel: block drbd2: Creating new current UUID
>> Mar  4 12:26:25 hypatia kernel: block drbd2: Connection closed
>> Mar  4 12:26:25 hypatia kernel: block drbd2: conn( NetworkFailure -> Unconnected )
>> Mar  4 12:26:25 hypatia kernel: block drbd2: receiver terminated
>> Mar  4 12:26:25 hypatia kernel: block drbd2: Restarting receiver thread
>> Mar  4 12:26:25 hypatia kernel: block drbd2: receiver (re)started
>> Mar  4 12:26:25 hypatia kernel: block drbd2: conn( Unconnected -> WFConnection )
>> Mar  4 12:26:25 hypatia kernel: block drbd2: Handshake successful: Agreed network protocol version 94
>> Mar  4 12:26:25 hypatia kernel: block drbd2: conn( WFConnection -> WFReportParams )
>> Mar  4 12:26:25 hypatia kernel: block drbd2: Starting asender thread (from drbd2_receiver [7920])
>> Mar  4 12:26:25 hypatia kernel: block drbd2: data-integrity-alg: <not-used>
>> Mar  4 12:26:25 hypatia kernel: block drbd2: drbd_sync_handshake:
>> Mar  4 12:26:25 hypatia kernel: block drbd2: self C4884637D2C418DF:922772A0478F5E1F:2DE51139CD7C3DF7:EB27F748FC21DC65 bits:0 flags:0
>> Mar  4 12:26:25 hypatia kernel: block drbd2: peer 922772A0478F5E1E:0000000000000000:2DE51139CD7C3DF6:EB27F748FC21DC65 bits:0 flags:0
>> Mar  4 12:26:25 hypatia kernel: block drbd2: uuid_compare()=1 by rule 70
>> Mar  4 12:26:25 hypatia kernel: block drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
>> Mar  4 12:26:25 hypatia kernel: block drbd2: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
>> Mar  4 12:26:25 hypatia kernel: block drbd2: Began resync as SyncSource (will sync 0 KB [0 bits set]).
>> Mar  4 12:26:25 hypatia kernel: block drbd2: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
>> Mar  4 12:26:25 hypatia kernel: block drbd2: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
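
To check whether the drops really recur every ten seconds, the log could be
scanned for the "meta connection shut down" line and the gaps between successive
occurrences measured. The following is only a rough sketch, not something taken
from the DRBD tools: the function name drop_intervals, the regular expression,
and the default year are my own assumptions (syslog timestamps carry no year),
and it assumes English month abbreviations.

```python
import re
from datetime import datetime

# Syslog line emitted when the peer drops the DRBD meta connection.
# The pattern is an assumption based on the log excerpt quoted above.
DROP_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"block (?P<dev>drbd\d+): meta connection shut down"
)


def drop_intervals(lines, device="drbd2", year=2011):
    """Return the seconds between successive connection drops for one device.

    `year` is assumed, because syslog timestamps do not include one.
    """
    times = []
    for line in lines:
        m = DROP_RE.match(line)
        if m and m.group("dev") == device:
            # Collapse the padded day field ("Mar  4" -> "Mar 4") before parsing.
            stamp = " ".join(m.group("stamp").split())
            times.append(
                datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
            )
    # Pair each drop with the next one and take the difference.
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]
```

Passing an open log file (for example /var/log/messages, if that is where kernel
messages end up on the system) returns a list of gaps in seconds; a list made up
almost entirely of 10.0 would confirm the ten-second pattern.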

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://[email protected]
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
