Hello,
I was wondering if someone could look over the logs further below, from
a resource that failed completely over the course of yesterday and
today, and tell me whether this looks like a "normal" failure of the
underlying storage or whether anything about it is strange.
I ask because there are 2 things that are odd:
1. On the primary node (node 1), drbd reports that the underlying
storage failed for this resource (1 out of 27, the rest all fine and
healthy), yet there are NO reports of failure from the underlying
storage itself, which is a block device also used by other (still
healthy) resources (the drbd backing devices are all logical volumes).
The resource goes Diskless on the primary, but service continues
because the secondary is still fine.
2. 11 hours later, the same happens on the secondary node (different
machine, different physical storage): drbd reports a read failure from
local storage there (also logical volumes over a block device, the
other resources also fine), yet again there are no reports of failure
from the underlying storage. This is of course the nail in the coffin
for the resource, as both sides are now Diskless. Again, all other
resources that share the same block device are still fine and 100%
healthy, with no signs of any other issues on either node.
Both nodes are drbd-9.0.9-1 from drbd.org on vanilla kernel.org kernel
4.9.58.
The failed resource had existed without any problems for many weeks,
but was originally created with drbd-9.0.8-1 on vanilla kernel 4.4.77.
Both nodes were upgraded to drbd-9.0.9-1/4.9.58 a few days ago. I don't
know if this is significant in any way.
Lastly, the failed resource is still there, with both sides in Diskless
state. Is there anything I can poke, maybe in /sys/kernel/debug, that
might give further info about what happened?
Thanks,
Eddie
Here is the log from the primary node when the first failure happened:
drbd RES7H10E/0 drbd42: local READ IO error sector 6192640+16 on dm-43
drbd RES7H10E/0 drbd42: disk( UpToDate -> Failed )
drbd RES7H10E/0 drbd42: Local IO failed in __req_mod. Detaching...
drbd RES7H10E/0 drbd42: sending new current UUID: 2A150CB88CD794F6
drbd RES7H10E/0 drbd42: disk( Failed -> Diskless )
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1486856, 4096), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1486888, 20480), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1486928, 57344), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1478264, 4096), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1486704, 77824), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1486864, 12288), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1487048, 286720), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1487608, 262144), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 41556976, 61440), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1479376, 4096), but my Disk seems to have failed :(
And the log from the secondary node failure exactly 11 hours later:
drbd RES7H10E/0 drbd42: read: error=10 s=29090424s
drbd RES7H10E/0 drbd42: disk( UpToDate -> Failed )
drbd RES7H10E/0 drbd42: Local IO failed in drbd_endio_read_sec_final. Detaching...
drbd RES7H10E/0 drbd42 node1.mydomain: Sending NegDReply. sector=29090424s.
drbd RES7H10E/0 drbd42: disk( Failed -> Diskless )
drbd RES7H10E node1.mydomain: Wrong magic value 0x0090d574 in protocol version 112
drbd RES7H10E node1.mydomain: conn( Connected -> ProtocolError ) peer( Primary -> Unknown )
drbd RES7H10E/0 drbd42 node1.mydomain: pdsk( Diskless -> DUnknown ) repl( Established -> Off )
drbd RES7H10E node1.mydomain: ack_receiver terminated
drbd RES7H10E node1.mydomain: Terminating ack_recv thread
drbd RES7H10E node1.mydomain: Connection closed
drbd RES7H10E node1.mydomain: conn( ProtocolError -> Unconnected )
drbd RES7H10E node1.mydomain: Restarting receiver thread
drbd RES7H10E node1.mydomain: conn( Unconnected -> Connecting )
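
In case it helps anyone cross-check the failing ranges against the
backing devices, here is a minimal sketch that pulls the sector numbers
out of the two read-error lines above and converts them to byte
offsets. It assumes 512-byte sectors, and it does not settle whether a
logged sector is relative to the drbd device or to the backing device.
The function name parse_io_error and the dictionary keys are made up
for illustration; only the two message formats shown in the logs are
handled.

```python
import re

# Assumption: the sector numbers in these messages are 512-byte units.
SECTOR_BYTES = 512

# Matches e.g. "local READ IO error sector 6192640+16 on dm-43"
LOCAL_ERR_RE = re.compile(
    r"local (READ|WRITE) IO error sector (\d+)\+(\d+) on (\S+)"
)
# Matches e.g. "read: error=10 s=29090424s"
ENDIO_ERR_RE = re.compile(r"(read|write): error=(-?\d+) s=(\d+)s")


def parse_io_error(line):
    """Return a dict describing the failed IO, or None if no match."""
    m = LOCAL_ERR_RE.search(line)
    if m:
        op, start, count, dev = m.groups()
        return {
            "op": op.lower(),
            "sector": int(start),
            "sectors": int(count),      # length of the request in sectors
            "device": dev,
            "byte_offset": int(start) * SECTOR_BYTES,
        }
    m = ENDIO_ERR_RE.search(line)
    if m:
        op, _error, start = m.groups()
        return {
            "op": op,
            "sector": int(start),
            "sectors": None,            # this message carries no length
            "device": None,             # nor a backing device name
            "byte_offset": int(start) * SECTOR_BYTES,
        }
    return None


if __name__ == "__main__":
    samples = [
        "drbd RES7H10E/0 drbd42: local READ IO error sector 6192640+16 on dm-43",
        "drbd RES7H10E/0 drbd42: read: error=10 s=29090424s",
    ]
    for line in samples:
        print(parse_io_error(line))
```

The resulting byte offsets could then be used for a read-only test of
the same region on the logical volume, to see whether the error
reproduces outside of drbd.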