Hello,

I was wondering if someone could look over the logs further below, from a resource that failed completely over the course of yesterday and today, and tell me whether this looks like a "normal" failure of the underlying storage, or whether anything looks strange?

I ask because there are 2 things that are odd:

1. On the primary node (node 1), drbd reports that the underlying storage has failed for this one resource (1 out of 27; the rest are all fine and healthy), yet there are NO reports of failure from the underlying storage itself, which is a block device also used by other (still healthy) resources (the drbd backing devices are all logical volumes). The resource goes Diskless on the primary, but service continues because the secondary is still fine.

2. 11 hours later, the same thing happens on the secondary node (different machine, different physical storage): drbd reports a read failure from local storage there (also logical volumes on a block device, with the other resources fine), yet again there are no reports of failure from the underlying storage. This is of course the nail in the coffin for the resource, as both sides are now Diskless. Again, all other resources that share the same block device are still fine and 100% healthy, with no signs of any other issues on either node.

Both nodes are drbd-9.0.9-1 from drbd.org on vanilla kernel.org kernel 4.9.58.

The failed resource had existed without any problems for many weeks, but was originally created with drbd-9.0.8-1 on vanilla kernel 4.4.77. Both nodes were upgraded to drbd-9.0.9-1/4.9.58 a few days ago. I don't know if this is significant in any way.

Lastly, the failed resource is still there, with both sides in Diskless state. Is there anything I can poke, maybe in /sys/kernel/debug, that might give further info about what happened?

Thanks,
Eddie

Here is the log from the primary node when the first failure happened:

drbd RES7H10E/0 drbd42: local READ IO error sector 6192640+16 on dm-43
drbd RES7H10E/0 drbd42: disk( UpToDate -> Failed )
drbd RES7H10E/0 drbd42: Local IO failed in __req_mod. Detaching...
drbd RES7H10E/0 drbd42: sending new current UUID: 2A150CB88CD794F6
drbd RES7H10E/0 drbd42: disk( Failed -> Diskless )
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1486856, 4096), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1486888, 20480), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1486928, 57344), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1478264, 4096), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1486704, 77824), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1486864, 12288), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1487048, 286720), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1487608, 262144), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 41556976, 61440), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1479376, 4096), but my Disk seems to have failed :(


And the log from the secondary node failure exactly 11 hours later:

drbd RES7H10E/0 drbd42: read: error=10 s=29090424s
drbd RES7H10E/0 drbd42: disk( UpToDate -> Failed )
drbd RES7H10E/0 drbd42: Local IO failed in drbd_endio_read_sec_final. Detaching...
drbd RES7H10E/0 drbd42 node1.mydomain: Sending NegDReply. sector=29090424s.
drbd RES7H10E/0 drbd42: disk( Failed -> Diskless )
drbd RES7H10E node1.mydomain: Wrong magic value 0x0090d574 in protocol version 112
drbd RES7H10E node1.mydomain: conn( Connected -> ProtocolError ) peer( Primary -> Unknown )
drbd RES7H10E/0 drbd42 node1.mydomain: pdsk( Diskless -> DUnknown ) repl( Established -> Off )
drbd RES7H10E node1.mydomain: ack_receiver terminated
drbd RES7H10E node1.mydomain: Terminating ack_recv thread
drbd RES7H10E node1.mydomain: Connection closed
drbd RES7H10E node1.mydomain: conn( ProtocolError -> Unconnected )
drbd RES7H10E node1.mydomain: Restarting receiver thread
drbd RES7H10E node1.mydomain: conn( Unconnected -> Connecting )
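
In case it helps anyone who wants to re-read the affected areas on the backing LVs directly, here is a minimal sketch that pulls the failing sectors out of the two error lines above and converts them to byte offsets. It assumes the sector numbers are in 512-byte units and that the "+16" in the primary's message is a length in sectors; both are my reading of the log format, not something I have confirmed. The names SECTOR_BYTES, LOG_LINES and failed_reads are made up for this illustration.

```python
import re

SECTOR_BYTES = 512  # assumption: sectors in the drbd messages are 512-byte units

# The two error lines, copied verbatim from the logs above.
LOG_LINES = [
    "drbd RES7H10E/0 drbd42: local READ IO error sector 6192640+16 on dm-43",
    "drbd RES7H10E/0 drbd42: read: error=10 s=29090424s",
]

# Primary-side format: "... IO error sector <start>+<length> on <device>"
PRIMARY_RE = re.compile(r"IO error sector (\d+)\+(\d+) on (\S+)")
# Secondary-side format: "read: error=<code> s=<start>s"
SECONDARY_RE = re.compile(r"read: error=(-?\d+) s=(\d+)s")


def failed_reads(lines):
    """Return one dict per recognised error line, with the byte offset added."""
    results = []
    for line in lines:
        m = PRIMARY_RE.search(line)
        if m:
            sector = int(m.group(1))
            results.append({
                "sector": sector,
                "sectors": int(m.group(2)),  # assumed to be a length in sectors
                "device": m.group(3),
                "byte_offset": sector * SECTOR_BYTES,
            })
            continue
        m = SECONDARY_RE.search(line)
        if m:
            sector = int(m.group(2))
            results.append({
                "sector": sector,
                "error": int(m.group(1)),
                "byte_offset": sector * SECTOR_BYTES,
            })
    return results


if __name__ == "__main__":
    for entry in failed_reads(LOG_LINES):
        print(entry)
```

Whether these offsets map one-to-one onto the backing LV depends on the metadata layout, so treat them as a starting point for a direct read test rather than as exact locations.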