Avi Kivity wrote:
> Is this the host file descriptor? If so, we want to use something more
> abstract (if the host side is in kernel, there will be no fd, or if the
> device is implemented using >1 files (or <1 files)).

This is indeed the host file descriptor. Host userland uses sys_open to
retrieve it. I see the beauty of having the remote side in the kernel,
but I fail to see why we would want to reinvent the wheel: asynchronous
IO with O_DIRECT (to avoid host caching) does just what we want.
Avoiding system call latency is the one argument in favor of the
in-kernel approach here.
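For illustration, the host-side submission boils down to something like
this (a minimal sketch with libaio, not our actual code -- the image
path and request size are made up and error handling is omitted):

#define _GNU_SOURCE             /* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLK 4096

int main(void)
{
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;
        void *buf;

        int fd = open("/tmp/disk.img", O_RDWR | O_DIRECT);
        posix_memalign(&buf, BLK, BLK);      /* O_DIRECT wants aligned buffers */

        io_setup(1, &ctx);                   /* one in-flight request for the demo */
        io_prep_pread(&cb, fd, buf, BLK, 0); /* read the first block */
        io_submit(ctx, 1, cbs);
        io_getevents(ctx, 1, 1, &ev, NULL);  /* wait for completion */

        io_destroy(ctx);
        close(fd);
        return 0;
}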
> We'll want scatter/gather here.

If you want scatter/gather, you have to do request merging in the guest
and use the do_request function of the block queue, because in
make_request you only have a single bio at hand. With do_request, you
would do that request merging twice and pay the block device plug
latency twice for nothing. The host is the better place to do IO
scheduling, because it can optimize over the IO of all guest machines.

>
>> +};
>> +
>> +struct vdisk_iocb_container {
>> +        struct iocb iocb;
>> +        struct bio *bio;
>> +        struct vdisk_device *dev;
>> +        int ctx_index;
>> +        unsigned long context;
>> +        struct list_head list;
>> +};
>> +
>> +// from aio_abi.h
>> +typedef enum io_iocb_cmd {
>> +        IO_CMD_PREAD = 0,
>> +        IO_CMD_PWRITE = 1,
>> +
>> +        IO_CMD_FSYNC = 2,
>> +        IO_CMD_FDSYNC = 3,
>> +
>> +        IO_CMD_POLL = 5,
>> +        IO_CMD_NOOP = 6,
>> +} io_iocb_cmd_t;
>>
>
> Our own commands, please. We need READV, WRITEV, and a barrier for
> journalling filesystems. FDSYNC should work as a barrier, but is
> wasteful. The FSYNC/FDSYNC distinction is meaningless. POLL/NOOP are
> irrelevant.

This matches the API of libaio. If userland translates this into struct
iocb, this makes sense.

The barrier, however, is a general problem with this approach: today,
the asynchronous IO userspace API does not allow submitting a barrier.
Therefore, our make_request function in the guest returns -ENOTSUPP,
which forces the file system to wait for IO completion. This does
sacrifice some performance. The right thing to do would be to add a way
to submit a barrier through the kernel AIO interface.

> We want to amortize the hypercall over multiple bios (but maybe you're
> doing that -- I'm not 100% up to speed on the block layer)

We don't. We do one hypercall per bio (see the P.S. below for a rough
sketch of the guest path), and I agree that this is a major disadvantage
of this approach. Since IO is slow (compared to vmenter/vmexit), the
cost is paid back by better IO scheduling. On our platform, this
approach outperforms the scatter/gather do_request one.

> Any reason not to perform the work directly?

I owe you an answer to this one; I have to revisit our CVS logs to find
out. We used to call directly from make_request without a workqueue,
and I cannot remember why we changed that.

so long,
Carsten
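P.S.: roughly what the guest-side make_request path looks like in this
approach -- a minimal sketch against a 2.6.2x-era block API, not the
actual patch code. vdisk_submit_bio() is made up here and stands in for
our one-iocb-per-bio hypercall submission.

#include <linux/blkdev.h>
#include <linux/bio.h>

struct vdisk_device;    /* as in the hunk quoted above */

/* stand-in for the real submission path: builds one iocb for the bio
 * and hands it to the host via a hypercall */
void vdisk_submit_bio(struct vdisk_device *dev, struct bio *bio);

static int vdisk_make_request(struct request_queue *q, struct bio *bio)
{
        struct vdisk_device *dev = q->queuedata;

        if (bio_barrier(bio)) {
                /* the userspace aio API cannot express a barrier, so
                 * refuse it; the file system then falls back to waiting
                 * for completion of outstanding IO */
                bio_endio(bio, bio->bi_size, -ENOTSUPP);
                return 0;
        }

        /* no merging in the guest: one hypercall per bio, IO scheduling
         * is left to the host */
        vdisk_submit_bio(dev, bio);
        return 0;
}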