Peter,

First: happy new year!

I've been doing some more tests to track down the cause of this bug. Since it looks like a kernel bug, I tried reproducing this with kernel 3.5.0, version 3.5.0-21.32~precise1. I could reproduce the faulty paths that multipathd was unable to remove; however, there were no hanging processes this time and thus no kernel crash, which is an improvement.

During the test I did see this happening:

LUN-DATABASE (36006016061e02e003cf1aca4ae07e211) dm-1 DGC,VRAID
size=200G features='1 queue_if_no_path' hwhandler='1 emc' wp=rw
|-+- policy='round-robin 0' prio=130 status=active
| |- 4:0:1:1 sdi 8:128 active ready running
| `- #:#:#:# - #:# active faulty running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 4:0:0:1 sdg 8:96 active ready running
  `- #:#:#:# - #:# active faulty running
LUN-LOGGING (36006016061e02e000286c1adae07e211) dm-2 ,
size=20G features='1 queue_if_no_path' hwhandler='1 emc' wp=rw
|-+- policy='round-robin 0' prio=130 status=active
| |- #:#:#:# - #:# failed faulty running
| `- 4:0:0:0 sde 8:64 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- #:#:#:# - #:# active faulty running
  `- 4:0:1:0 sdh 8:112 active ready running

As you can see, multipathd again fails to remove the 'faulty' paths from the device mapping. However, for some reason this didn't lead to processes stuck in 'D' state this time. Meanwhile, the following messages were logged repeatedly:

Jan 3 10:24:14 ealxs00161 multipathd: sdd: failed to get sysfs information
Jan 3 10:24:14 ealxs00161 multipathd: sdd: unusable path

So multipathd was retrying the removal, but it failed every time.

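For anyone who wants to watch for this state during further up/down runs, here is a minimal sketch (not part of multipath-tools) that scans saved `multipath -ll` text for the placeholder entries shown above, i.e. path lines whose H:C:T:L is printed as #:#:#:#. The function name `find_stale_paths`, the field names in the regex and the `SAMPLE` constant are my own and purely illustrative; the sample text is copied from the output above. It assumes the path-line layout seen in this report and may need adjusting for other multipath-tools versions.

```python
import re

# Sample copied from the failing run above; in practice you would feed in
# the captured text of `multipath -ll` instead.
SAMPLE = """\
LUN-DATABASE (36006016061e02e003cf1aca4ae07e211) dm-1 DGC,VRAID
size=200G features='1 queue_if_no_path' hwhandler='1 emc' wp=rw
|-+- policy='round-robin 0' prio=130 status=active
| |- 4:0:1:1 sdi 8:128 active ready running
| `- #:#:#:# - #:# active faulty running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 4:0:0:1 sdg 8:96 active ready running
  `- #:#:#:# - #:# active faulty running
LUN-LOGGING (36006016061e02e000286c1adae07e211) dm-2 ,
size=20G features='1 queue_if_no_path' hwhandler='1 emc' wp=rw
|-+- policy='round-robin 0' prio=130 status=active
| |- #:#:#:# - #:# failed faulty running
| `- 4:0:0:0 sde 8:64 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- #:#:#:# - #:# active faulty running
  `- 4:0:1:0 sdh 8:112 active ready running
"""

# Map header line: "<alias> (<wwid>) dm-N ..."
MAP_RE = re.compile(r"^(?P<alias>\S+) \((?P<wwid>[0-9a-f]+)\) (?P<dm>dm-\d+)")

# Path line: tree marker, then six whitespace-separated fields.
# The field names are illustrative labels, not official multipath names.
PATH_RE = re.compile(
    r"[|`]-\s+(?P<hctl>\S+)\s+(?P<dev>\S+)\s+(?P<majmin>\S+)"
    r"\s+(?P<dm_state>\S+)\s+(?P<checker_state>\S+)\s+(?P<online>\S+)\s*$"
)


def find_stale_paths(text):
    """Return {map alias: [(dm_state, checker_state), ...]} for every
    path line whose H:C:T:L is the '#:#:#:#' placeholder."""
    stale = {}
    current = None
    for line in text.splitlines():
        header = MAP_RE.match(line)
        if header:
            current = header.group("alias")
            continue
        path = PATH_RE.search(line)
        if path and current and path.group("hctl") == "#:#:#:#":
            stale.setdefault(current, []).append(
                (path.group("dm_state"), path.group("checker_state"))
            )
    return stale


if __name__ == "__main__":
    for alias, entries in find_stale_paths(SAMPLE).items():
        print(alias, entries)
```

Run against the output taken after the paths were restored, this should return an empty dict, since no path line there has the #:#:#:# placeholder.
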
After bringing the path back up, it was restored OK and everything was fine again:

LUN-DATABASE (36006016061e02e003cf1aca4ae07e211) dm-1 DGC,VRAID
size=200G features='1 queue_if_no_path' hwhandler='1 emc' wp=rw
|-+- policy='round-robin 0' prio=130 status=active
| |- 4:0:1:1 sdi 8:128 active ready running
| `- 3:0:0:1 sdc 8:32 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 4:0:0:1 sdg 8:96 active ready running
  `- 3:0:1:1 sdf 8:80 active ready running
LUN-LOGGING (36006016061e02e000286c1adae07e211) dm-2 DGC,VRAID
size=20G features='1 queue_if_no_path' hwhandler='1 emc' wp=rw
|-+- policy='round-robin 0' prio=130 status=active
| |- 3:0:1:0 sdd 8:48 active ready running
| `- 4:0:0:0 sde 8:64 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 3:0:0:0 sdb 8:16 active ready running
  `- 4:0:1:0 sdh 8:112 active ready running

After this, failing over again worked just fine: the paths that could not be removed the last time were now removed without problems. Both machines survived about 10 up/down test runs.

I'll attach the syslog of this run shortly.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1032550

Title:
  [multipath] failed to get sysfs information

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/multipath-tools/+bug/1032550/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs