[DRBD-user] drbd 9 resource failure due to apparent local io failure, but odd

Eddie Chapman Wed, 25 Oct 2017 04:52:44 -0700

Hello,

I was wondering if someone could eyeball the logs further below from a resource that has completely failed over yesterday and today and tell me if it looks like a "normal" failure from underlying storage, or if there is anything strange?


I ask because there are 2 things that are odd:

1. On the primary node drbd reports that the underlying storage fails for the resource (1 out of 27, the rest all fine and healthy) on node 1, yet there are NO reports of failure from the underlying storage, which happens to be a block device used by other (still healthy) resources (the drbd backing devices are all logical volumes). The resource goes Diskless on the primary but service continues because of the secondary which is still fine.

2. 11 hours later, the same happens on the secondary node (different machine, different physical storage): drbd reports read failure from local storage there (also lvms over block device, the other resources also fine), yet no reports of failure from underlying storage. This is of course the nail in the coffin for the resource as both sides are now Diskless. Again, all other resources that share the same block device are still fine and 100% healthy, no signs of any other issues on either node.

Both nodes are drbd-9.0.9-1 from drbd.org on vanilla kernel.org kernel 4.9.58.

The failed resource has existed without any problems for many weeks, but was originally created with drbd-9.0.8-1 on vanilla kernel 4.4.77. Both nodes were upgraded to drbd-9.0.9-1/4.9.58 a few days ago. I don't know if this is significant in any way.

Lastly, the failed resource is still there, both sides in Diskless state. Is there anything I can poke, maybe in /sys/kernel/debug, that might give further info about what happened?


Thanks,
Eddie

Here is the log from the primary node when the first failure happened:

drbd RES7H10E/0 drbd42: local READ IO error sector 6192640+16 on dm-43
drbd RES7H10E/0 drbd42: disk( UpToDate -> Failed )
drbd RES7H10E/0 drbd42: Local IO failed in __req_mod. Detaching...
drbd RES7H10E/0 drbd42: sending new current UUID: 2A150CB88CD794F6
drbd RES7H10E/0 drbd42: disk( Failed -> Diskless )

drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(,1486856, 4096), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(,1486888, 20480), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(,1486928, 57344), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(,1478264, 4096), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(,1486704, 77824), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(,1486864, 12288), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(,1487048, 286720), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(,1487608, 262144), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(,41556976, 61440), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(,1479376, 4096), but my Disk seems to have failed :(



And the log from the secondary node failure exactly 11 hours later:

drbd RES7H10E/0 drbd42: read: error=10 s=29090424s
drbd RES7H10E/0 drbd42: disk( UpToDate -> Failed )

drbd RES7H10E/0 drbd42: Local IO failed in drbd_endio_read_sec_final. Detaching...

drbd RES7H10E/0 drbd42 node1.mydomain: Sending NegDReply. sector=29090424s.
drbd RES7H10E/0 drbd42: disk( Failed -> Diskless )

drbd RES7H10E node1.mydomain: Wrong magic value 0x0090d574 in protocol version 112
drbd RES7H10E node1.mydomain: conn( Connected -> ProtocolError ) peer( Primary -> Unknown )
drbd RES7H10E/0 drbd42 node1.mydomain: pdsk( Diskless -> DUnknown ) repl( Established -> Off )

drbd RES7H10E node1.mydomain: ack_receiver terminated
drbd RES7H10E node1.mydomain: Terminating ack_recv thread
drbd RES7H10E node1.mydomain: Connection closed
drbd RES7H10E node1.mydomain: conn( ProtocolError -> Unconnected )
drbd RES7H10E node1.mydomain: Restarting receiver thread
drbd RES7H10E node1.mydomain: conn( Unconnected -> Connecting )
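
One way to cross-check whether the backing device really returns an error for the reported range is to turn the logged sector into a byte offset and read that range directly. Below is a minimal Python sketch of that idea, not a definitive tool. It assumes the sector numbers in the log are 512-byte units and that the "+16" in "sector 6192640+16" is a length in sectors; the function names parse_read_error and read_range are made up for illustration, and the device path passed to read_range (for example the LV backing drbd42) has to be filled in by hand.

```python
import re

# Assumption: sectors in the DRBD log lines are 512-byte units.
SECTOR_BYTES = 512

# Matches lines like:
#   "drbd RES7H10E/0 drbd42: local READ IO error sector 6192640+16 on dm-43"
READ_ERROR = re.compile(r"local READ IO error sector (\d+)\+(\d+) on (\S+)")


def parse_read_error(line):
    """Return device name, byte offset and byte length for a read-error line.

    Returns None if the line is not a local READ IO error message.
    """
    m = READ_ERROR.search(line)
    if m is None:
        return None
    sector, count, device = int(m.group(1)), int(m.group(2)), m.group(3)
    return {
        "device": device,
        "offset": sector * SECTOR_BYTES,
        # Assumption: the number after "+" is a sector count.
        "length": count * SECTOR_BYTES,
    }


def read_range(path, offset, length):
    """Read a byte range from a block device; an I/O error raises OSError.

    This goes through the page cache, so a success here does not prove
    the physical medium is healthy.
    """
    with open(path, "rb", buffering=0) as dev:
        dev.seek(offset)
        return dev.read(length)
```

For the primary's log line above, parse_read_error gives offset 3170631680 and length 8192 on dm-43. Running read_range against the matching backing LV (as root) and seeing whether it raises an OSError would show whether the lower layer itself reports the failure.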
_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
