On Tue, 2021-01-26 at 19:14 +0800, lixiaokeng wrote:
> > >
> > > Hi,
> > > Unfortunately the verify_path() called before *and* after domap()
> > > in coalesce_paths can't solve this problem. I think it is another
> > > way to lead multipath with wrong path, but now I can't find the
> > > way from log.
> >
> > Can you provide multipathd -v3 logs, and kernel logs? Maybe I'll see
> > something.

This is not a -v3 log, right? We can't see much of what multipathd is
doing. Anyway, I understand now that verify_paths() won't help. It only
looks for paths that have been removed (i.e. don't exist any more in
sysfs) since the last path detection. But when the error occurs, it
seems that sdf has been removed *and re-added*, so the check whether
the path still exists succeeds. The uevents were also missed because
the uevent handler didn't get the lock.

>
> (1)multipath -r: The sdf is found as a path of
> 36001405b7679bd96b094bccbf971bc90
> (iscsi node is 4:0:0:2)
>
> (2)iscsi logout: The sdf is removed in iscsi in system time
> [1202538.467014].
>
> (3)iscsi login: The sdf appears in iscsi in system time
> [1202538.825745].
> It is a path of 3600140584e11eb1818c4afab12c17800 (iscsi node
> 2:0:0:0)
>
> Here I have a doubt. When I stop in domap using gdb and iscsi log
> out/in, the sdf will not be used again because the disk refcount is
> not zero. I add a print if the disk refcount is zero in
> put_disk_and_module (for example lxk ref put after: name sdi; count 0),
> but there is not this print about sdf.

Yes, this is a very good point, and it's indeed strange. multipathd
should have opened a file descriptor to /dev/sdf in pathinfo(), and as
long as that file is open, the use count shouldn't drop to 0, the disk
devices (block device and scsi_disk device) shouldn't be released, and
the major/minor number shouldn't be reused.

Unless I'm missing something essential, that is.

> Jan 25 12:37:48 client1 kernel: [1202538.467014] sd 4:0:0:2: [sdf] Synchronizing SCSI cache
> Jan 25 12:37:48 client1 kernel: [1202538.568195] scsi 4:0:0:2: alua: Detached
> Jan 25 12:37:48 client1 kernel: [1202538.630507] sd 2:0:0:0: [sdf] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)

Less than 0.1s between the disappearance of 4:0:0:2 as sdf and
reappearance of 2:0:0:0, without any sign of multipathd having noticed
this change, is indeed quite strange.
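
The size of that window can be read directly off the three kernel
timestamps quoted above. The following is only a small sketch of the
arithmetic: the log text is copied from the excerpt (messages
shortened), while the helper name timestamps() is made up for
illustration.

```python
from decimal import Decimal
import re

# Kernel log lines quoted above (message text shortened).
LOG = """\
[1202538.467014] sd 4:0:0:2: [sdf] Synchronizing SCSI cache
[1202538.568195] scsi 4:0:0:2: alua: Detached
[1202538.630507] sd 2:0:0:0: [sdf] 20971520 512-byte logical blocks
"""

def timestamps(log):
    """Return the bracketed kernel timestamps as exact Decimals.

    Decimal is used instead of float so that microsecond differences
    between large timestamps are not blurred by rounding.
    """
    return [Decimal(m) for m in re.findall(r"^\[(\d+\.\d+)\]", log, re.MULTILINE)]

sync_cache, detached, reappeared = timestamps(LOG)
print("alua detached -> new sdf:", reappeared - detached, "s")
print("sync cache    -> new sdf:", reappeared - sync_cache, "s")
```

Measured from the "alua: Detached" line the gap is about 0.06s, and
measured from "Synchronizing SCSI cache" it is about 0.16s.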

So we can only conclude that (if there's no kernel refcounting bug,
which I doubt) either orphan_path()->uninitialize_path() had been
called (closing the fd), or that opening the sd device had failed in
the first place (in which case the path WWID should have been nulled in
pathinfo()). In both cases it makes little sense that the path should
still be part of a struct multipath.

Please increase the log level of the "Couldn't open device node"
message in pathinfo(), and see whether the corresponding errors are
logged.

Can you verify in the debugger if multipathd still has the fd to the
disk device open?

Perhaps you could trace scsi_disk_release() in the kernel?

Martin

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
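
As a possible alternative to the debugger for the fd question, the
open descriptors of a process can be inspected through /proc/<pid>/fd
on Linux. The sketch below is not part of multipath-tools; the
function name fds_on_device() and its parameters are made up for
illustration, and it assumes the caller may read the target process's
/proc entries (typically root for multipathd).

```python
import os
import stat

def fds_on_device(pid, major, minor, is_kind=stat.S_ISBLK):
    """Return (fd, link target) pairs from /proc/<pid>/fd that refer to
    the device node with the given major:minor number.

    is_kind selects the node type; the default matches block devices
    such as sdf. Pass stat.S_ISCHR to match character devices instead.
    """
    fd_dir = f"/proc/{pid}/fd"
    hits = []
    for name in os.listdir(fd_dir):
        path = os.path.join(fd_dir, name)
        try:
            st = os.stat(path)            # follows the link to the open file
            target = os.readlink(path)    # e.g. "/dev/sdf"
        except OSError:
            continue                      # fd was closed while scanning
        if is_kind(st.st_mode) and \
           (os.major(st.st_rdev), os.minor(st.st_rdev)) == (major, minor):
            hits.append((int(name), target))
    return hits

# Hypothetical usage: with the multipathd PID and the major:minor pair
# read from /sys/block/sdf/dev, an empty result would suggest that the
# fd is no longer open.
#   print(fds_on_device(multipathd_pid, major, minor))
```

Comparing device numbers rather than the link target avoids relying on
the name sdf, which is exactly what gets reused in this scenario.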