Carsten Otte wrote:
> Avi Kivity wrote:
>> Is this the host file descriptor?  If so, we want to use something
>> more abstract (if the host side is in kernel, there will be no fd, or
>> if the device is implemented using >1 files (or <1 files)).
> This is indeed the host file descriptor.  Host userland uses sys_open
> to retrieve it.  I see the beauty of having the remote side in the
> kernel, however I fail to see why we would want to reinvent the wheel:
> asynchronous IO with O_DIRECT (to avoid host caching) does just what
> we want.
I don't see an immediate need to put the host-side driver in the
kernel, but I don't want to embed the host fd (which is an
implementation detail) into the host/guest ABI.  There may not even be
a host fd.

> System call latency adds to the in-kernel approach here.

I don't understand this.

>> We'll want scatter/gather here.
> If you want scatter/gather, you have to do request merging in the
> guest and use the do_request function of the block queue.  That is
> because in make_request you only have a single chunk at hand.  With
> do_request, you would do that request merging twice and get twice the
> block device plug latency for nothing.  The host is the better place
> to do IO scheduling, because it can optimize over IO from all guest
> machines.

The bio layer already has scatter/gather (basically, a biovec), but the
aio api (which you copy) doesn't.  The basic request should be a bio,
not a bio page.  I don't think the guest driver needs to do its own
merging.

>>> +};
>>> +
>>> +struct vdisk_iocb_container {
>>> +	struct iocb iocb;
>>> +	struct bio *bio;
>>> +	struct vdisk_device *dev;
>>> +	int ctx_index;
>>> +	unsigned long context;
>>> +	struct list_head list;
>>> +};
>>> +
>>> +// from aio_abi.h
>>> +typedef enum io_iocb_cmd {
>>> +	IO_CMD_PREAD = 0,
>>> +	IO_CMD_PWRITE = 1,
>>> +
>>> +	IO_CMD_FSYNC = 2,
>>> +	IO_CMD_FDSYNC = 3,
>>> +
>>> +	IO_CMD_POLL = 5,
>>> +	IO_CMD_NOOP = 6,
>>> +} io_iocb_cmd_t;
>>
>> Our own commands, please.  We need READV, WRITEV, and a barrier for
>> journalling filesystems.  FDSYNC should work as a barrier, but is
>> wasteful.  The FSYNC/FDSYNC distinction is meaningless.  POLL/NOOP
>> are irrelevant.
> This matches the api of libaio.  If userland translates this into
> struct iocb, this makes sense.  The barrier however is a general
> problem with this approach: today, the asynchronous IO userspace api
> does not allow submitting a barrier.  Therefore, our make_request
> function in the guest returns -ENOTSUPP, which forces the file system
> to wait for IO completion.  This does sacrifice some performance.
> The right thing to do would be to add the possibility of submitting
> a barrier to the kernel aio interface.

Right.  But the ABI needs to support barriers regardless of host kernel
support.  When the host cannot provide one, a barrier can be emulated
by waiting for the request queue to drain.  If we do implement the host
side in the kernel, then real barriers become available.

>> We want to amortize the hypercall over multiple bios (but maybe
>> you're doing that -- I'm not 100% up to speed on the block layer)
> We don't.  We do one hypercall per bio, and I agree that this is a
> major disadvantage of this approach.  Since IO is slow (compared to
> vmenter/vmexit), the cost is paid back by better IO scheduling.  On
> our platform, this approach outperforms the scatter/gather do_request
> one.

I/O may be slow, but you can have a lot more disks than cpus.  For
example, if an I/O takes 1ms and you have 100 disks, you can issue 100K
IOPS.  With one hypercall per request, that's ~50% of a cpu (at about
5us per hypercall that goes all the way to userspace), and that's not
counting the overhead of calling io_submit().
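For reference, the arithmetic behind that ~50% figure, assuming the 5us
per userspace-bound hypercall estimate above:

  100 disks * 1000 requests/sec/disk     = 100,000 requests/sec
  100,000 requests/sec * 5 us/hypercall  = 0.5 cpu-seconds per second
                                         ~ 50% of one cpu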
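To make the "own commands plus scatter/gather" suggestion concrete,
here is a minimal sketch of what a per-bio request in the guest/host
ABI could look like.  All names here (vdisk_cmd, vdisk_request, and so
on) are hypothetical, not taken from the patch:

#include <linux/types.h>

/* Hypothetical command set: vectored transfers plus an explicit
 * barrier, instead of mirroring the libaio opcodes. */
enum vdisk_cmd {
	VDISK_CMD_READV		= 0,
	VDISK_CMD_WRITEV	= 1,
	VDISK_CMD_BARRIER	= 2,	/* ordering point for journalling fs */
};

/* One scatter/gather element, i.e. one biovec segment. */
struct vdisk_sg {
	__u64	guest_phys_addr;
	__u32	len;
	__u32	pad;
};

/* One request per bio: an abstract device handle, a sector offset,
 * and the bio's segments. */
struct vdisk_request {
	__u32	cmd;			/* enum vdisk_cmd */
	__u32	devno;			/* abstract handle, not a host fd */
	__u64	sector;
	__u32	nr_segments;
	__u32	pad;
	struct vdisk_sg	sg[];		/* one entry per biovec element */
};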
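Similarly, a rough sketch of the barrier emulation mentioned above, for
a userspace host backend sitting on top of libaio (which has no barrier
primitive): reap everything in flight before acknowledging the barrier
and before submitting anything queued behind it.  The vdisk_backend
structure is made up for the example:

#include <libaio.h>

/* Hypothetical per-disk backend state. */
struct vdisk_backend {
	io_context_t	aio_ctx;
	int		in_flight;	/* submitted but not yet completed */
};

static void vdisk_emulate_barrier(struct vdisk_backend *be)
{
	struct io_event events[64];
	int n;

	/* The barrier may complete only after everything submitted
	 * before it has completed, so drain the in-flight requests. */
	while (be->in_flight > 0) {
		n = io_getevents(be->aio_ctx, 1, 64, events, NULL);
		if (n <= 0)
			break;		/* error handling elided */
		be->in_flight -= n;
		/* ...complete the corresponding guest bios here... */
	}

	/* Requests queued behind the barrier can be submitted now. */
}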
--
error compiling committee.c: too many arguments to function