On Wed, Oct 16, 2013 at 02:59:47PM +0200, Christian Borntraeger wrote: > Folks, > > from time to time I update valgrind or qemu to work reasonably well > with KVM. > > Now, newer KVMs have the ability to create subdevices of a KVM guest (e.g. an > in kernel > kvm interrupt controller) with the following ioctl: > > #define KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct > kvm_create_device) > > qemu can then work on these devices with the ioctls > > /* ioctls for fds returned by KVM_CREATE_DEVICE */ > #define KVM_SET_DEVICE_ATTR _IOW(KVMIO, 0xe1, struct kvm_device_attr) > #define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct kvm_device_attr) > #define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct kvm_device_attr) > > struct kvm_device_attr { > __u32 flags; /* no flags currently defined */ > __u32 group; /* device-defined */ > __u64 attr; /* group-defined */ > __u64 addr; /* userspace address of attr data */ > }; > > to communicate with the in-kvm devices to exchange data. There is an > interesting > problem here: Properly defined ioctls allow to deduct the written or read area > of a ioctl. This is useful, e.g. for valgrind. For unknown ioctls, valgrind > will > decode the ioctl number according to the _IOxx prefixes and deducts the memory > area that the kernels reads/writes. This is very helpful for definedness > checkins. > And is not very helpful if you need to change structure size or it changes by itself because of 32bit->64bit move.
> For specific or more complex ioctls valgrind has special handling. The > problem is > now, that looking at the current implementations of kvm device attr, all use > 0,1,2, > etc for group/attr. So by inspecting the ioctl, valgrind cannot find out what > device > it is talking to without tracking the original create ioctl. > > So it seems that by using a multiplexing ioctls we actually make the interface > less well defined and more error prone than individual ioctls. I do not see why it is more error prone. Yes, it is harder for valgrind to track it, but this by itself does not make it error prone. > > Another hint to look at are system calls. Older archs (like i386) have an IPC > system > call that multiplexes several things. But time has shown that having a system > for each > and every subfunctions is better that a multiplexer. (so newer archs like > amd64 dont > have an ipc system call they have semop, semget etc. This also makes tools > like strace > easier to implement. > The difference is in amount of those subfunctions. We can define separate ioctl for each cpu/device register of each architecture and it will be nicely straceable, but how much new ioctls this will create? It will easily reach 1000s quickly. So at some point you start multiplex. ONE_REG interface is multiplexer, DEVICE_ATTR is same. > Whats even more puzzling to me: The main complaint about ioctls is the > possibility to > create arbitrary interfaces into the kernel. Now with the IORW family, we > actually > had some limited form of control. With an additional uncoordinated group/attr > parameter we dropped all of that. How this control is actually used in the kernel? > > > So how to process from here? > a) leave it as is and accept false positives from valgrind and undecoded > straces strace still cannot decode kvm ioctls after all this years. I am not concerned about straces at all, it is less then useless to debug KVM. > b) We could enhance the device_addr group/attr values with size macros, e.g > the same as _IOWR (do we need direction?) for future users I am fine with that. But even in KVM this is not the first interface that has this drawback. See KVM_GET_DIRTY_LOG for instance. > c) Avoid the device api for new code and go back to individual ioctls in KVM Individual ioctls were a mess. People are adding devices without even knowing beforehand how final interface will look like. To accommodate such "design" strategy devices need to have as fine grained interfaces as possible otherwise ioctls will be deprecated faster than added (that was justification for ONE_REG too BTW). If you want fine grained interface you need to multiplex at some point or add hundreds of ioctls, if you need to multiplex anyway multiplexing at device level is logical thing to do. I understand valgrind sentiment, but it should not be a central point of interface design. > > The reason for KVM_CREATE_DEVICE seems to be that in qemu we want to model > everything > as a device. Having an in-kernel KVM device seemed logical. The problem is, > that we > should really have a 2nd look at the Linux system call ABI. Does it have to > follow the > device model? especially if the interface has obvious draw backs. > Try to add new Linux system call. It's very hard and for a good reason, you do not want to have 1000s of them where most of then is one-trick pony and mostly deprecated (but you cannot remove them or reuse them anyway). If we apply the same constrains to kvm ioctl additions as Linux has for system calls, KVM development will stall and people will be less then happy. -- Gleb.