On Tue, 31 Jul 2012, David Boyes wrote:
> Florian's original post.
> Corroborating posts from other users (Mark Post, etc)
> My data (on average 3 out of 100 tests fail)
>
> I'd be happy to send you more examples. Are you looking for something 
> specific? The script recently posted here (by you, I think) can generate as 
> much failure data as you like.
>
> To be clear, dasdfmt doesn't complain about other users, it fails because 
> there's no device for it to operate on (yet). Inserting a wait of a few 
> (variable between 1 and 30 seconds, depending on load) seconds reduces, but 
> does not eliminate, the failures. Introducing a 60-90 second wait produces a 
> fairly reliable operation, but still not 100%. Given the need for a reliable 
> test for use in automation and/or the number of devices that commonly need to 
> be processed to create large LVM collections, a minute and a half wait just 
> because we can't reliably depend on chccwdev to be atomic isn't acceptable.

Hm, ok. I think we are dealing with 3 different types of failures here:

* missing error handling in scripts

* failing to ensure exclusive usage
  * most of the tools needed to activate a device require exclusive usage
    of the device
  * most of the tools needed to activate a device trigger additional
    uevents which would lead udev to check this device out

so instead of:
chccwdev -e
dasdfmt
fdasd
mkswap
chccwdev -d

you need to do:
chccwdev -e
udevadm settle
dasdfmt
udevadm settle
fdasd
udevadm settle
mkswap
udevadm settle
chccwdev -d

Note that udevadm settle's --exit-if-exists option is not enough here - you
really need udev to finish using the device.
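
A minimal sketch of the sequence above with error handling after every
step, which also covers the first failure class. The function names, the
DRY_RUN switch, the example arguments, and the dasdfmt/fdasd options are
assumptions for illustration only; option syntax differs between s390-tools
versions, so check the man pages on your system.

```shell
#!/bin/sh
# Sketch only: run_step, prepare_swap and DRY_RUN are hypothetical names,
# and the dasdfmt/fdasd options are assumptions, not taken from this thread.

# Run one command; report and return non-zero on failure.
# With DRY_RUN=1 the command is only printed, not executed.
run_step() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
        return 0
    fi
    "$@" || { echo "step failed: $*" >&2; return 1; }
}

# prepare_swap BUSID DEVNODE
#   BUSID   device bus ID, e.g. 0.0.0200 (hypothetical)
#   DEVNODE device node, e.g. /dev/dasdx (hypothetical)
# Each tool is followed by "udevadm settle" so udev has finished with the
# device before the next tool needs exclusive access to it.
prepare_swap() {
    busid=$1
    node=$2
    run_step chccwdev -e "$busid"        || return 1
    run_step udevadm settle              || return 1
    run_step dasdfmt -b 4096 -y "$node"  || return 1
    run_step udevadm settle              || return 1
    run_step fdasd -a "$node"            || return 1
    run_step udevadm settle              || return 1
    run_step mkswap "${node}1"           || return 1
    run_step udevadm settle              || return 1
    run_step chccwdev -d "$busid"
}
```

Calling it as "DRY_RUN=1 prepare_swap 0.0.0200 /dev/dasdx" only prints the
commands, which is a cheap way to check the ordering before touching a real
device.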

* cases where udevadm settle is not enough
  * after udevadm settle no device node is created
  * after udevadm settle udev is still using the device
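
To tell the first of these sub-cases apart from plain script errors, a test
script could check for the node right after settle returns. A rough sketch;
the function name and the SETTLE_CMD override (there only so the check can
be exercised without udev) are hypothetical:

```shell
#!/bin/sh
# check_node_after_settle DEVNODE
# Returns 0 if DEVNODE is a block device once settle has returned,
# 1 if the node is missing, 2 if the settle command itself failed.
# SETTLE_CMD is a hypothetical override; the default is "udevadm settle".
check_node_after_settle() {
    node=$1
    ${SETTLE_CMD:-udevadm settle} || return 2
    if [ -b "$node" ]; then
        return 0
    fi
    echo "no device node $node after udevadm settle" >&2
    return 1
}
```

A return code of 1 from this check, logged together with the system load at
the time, would be the kind of data that points at this failure class.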

Since this thread is about the last class of failures, I've run a _lot_ of
tests over the last couple of days under various system loads to trigger
this specific error. I could not find a single case where udevadm settle
did not do its job.
However, I found 2 possibly related bugs: one in CIO where a device is left
in an unusable state, and one in DASD which could lead to udev using the
device after settle returns (but I could not trigger this one).

Once I'm done fixing these bugs, I'll look into the distros to find out
whether the fixes are applicable there and to look for other bugs lurking
there.

So I suspect that most of the things you observed are results of the 2nd
error class (but again, I've not looked into the distros yet; maybe the
situation is different there).

Regards,
Sebastian

----------------------------------------------------------------------
For LINUX-390 subscribe / signoff / archive access instructions,
send email to lists...@vm.marist.edu with the message: INFO LINUX-390 or visit
http://www.marist.edu/htbin/wlvindex?LINUX-390
----------------------------------------------------------------------
For more information on Linux on System z, visit
http://wiki.linuxvm.org/
