On 08/25/2014 03:50 PM, Chris Friesen wrote:

I think I might have a glimmering of what's going on.  Someone please
correct me if I get something wrong.

I think that VIRTIO_PCI_QUEUE_MAX doesn't actually limit the number of
inflight operations, and neither does virtio-blk calling
virtio_add_queue() with a queue size of 128.

I think what's happening is that virtio_blk_handle_output() spins,
pulling requests off the 128-entry queue and calling
virtio_blk_handle_request() for each one.  At that point the queue entry
can be reused, so the queue size doesn't bound the number of requests in
flight.

In virtio_blk_handle_write() we add the request to a MultiReqBuffer, and
every 32 writes we call virtio_submit_multiwrite(), which calls down
into bdrv_aio_multiwrite().  That tries to merge requests, and for each
resulting request calls bdrv_aio_writev(), which ends up calling
qemu_rbd_aio_writev(), which in turn calls rbd_start_aio().
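
Roughly, the batching works like this (hypothetical names; the real logic
lives in virtio_blk_handle_write() and virtio_submit_multiwrite()):

    /* Sketch of the 32-entry write batching (hypothetical names).  The
     * 32 only bounds how many writes are merged per submission, not how
     * many submissions are outstanding at once. */
    #define MAX_MERGE 32

    struct write_req { int id; };

    struct multireq {
        struct write_req *reqs[MAX_MERGE];
        int num;
    };

    static void submit_multiwrite(struct multireq *mrb)
    {
        /* merge adjacent requests, then issue one async write per
         * merged request; completions arrive later via callbacks */
        mrb->num = 0;
    }

    static void handle_write(struct multireq *mrb, struct write_req *req)
    {
        if (mrb->num == MAX_MERGE) {
            submit_multiwrite(mrb);
        }
        mrb->reqs[mrb->num++] = req;
    }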

rbd_start_aio() allocates a single contiguous buffer and copies the
iovec contents into it.  That buffer stays allocated until the request
is acked, which is where the bulk of the memory overhead with rbd is
coming from.  (Has anyone considered adding iovec support to rbd to
avoid this extra copy?)
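
The copy itself is just the usual flatten-an-iovec pattern, i.e. something
like this (generic sketch, not the actual rbd_start_aio() code):

    /* Generic iovec-flattening sketch (not the actual rbd_start_aio()
     * code).  Every write ends up with a malloc of the full request
     * size that lives until the rbd completion fires. */
    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>

    static char *flatten_iovec(const struct iovec *iov, int niov,
                               size_t *total)
    {
        size_t len = 0, off = 0;
        char *buf;
        int i;

        for (i = 0; i < niov; i++) {
            len += iov[i].iov_len;
        }
        buf = malloc(len);
        if (!buf) {
            return NULL;
        }
        for (i = 0; i < niov; i++) {
            memcpy(buf + off, iov[i].iov_base, iov[i].iov_len);
            off += iov[i].iov_len;
        }
        *total = len;
        return buf;             /* freed only when the request is acked */
    }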

The only limit I see in the whole call chain from
virtio_blk_handle_request() on down is the call to
bdrv_io_limits_intercept() in bdrv_co_do_writev().  However, that
doesn't provide any limit on the absolute number of inflight operations,
only on operations/sec.  If the Ceph server cluster can't keep up with
the aggregate load, then the number of inflight operations can still
grow indefinitely.
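
To illustrate the distinction (hypothetical sketch, not what
bdrv_io_limits_intercept() actually does): a rate limit only gates how
fast we submit, while a cap on an inflight counter would bound how much
is outstanding at once.  If we submit at 1000 ops/sec but the cluster
completes only 800 ops/sec, inflight grows by roughly 200 requests per
second no matter what the rate limit says.

    /* Hypothetical comparison, not QEMU code.  'inflight' would be
     * incremented on submit and decremented in the completion callback. */
    static long inflight;

    /* what the throttling code gives us today: a submission-rate check */
    static int can_submit_rate_limited(long submitted_this_sec,
                                       long ops_per_sec)
    {
        return submitted_this_sec < ops_per_sec;
    }

    /* what would actually bound memory: an absolute inflight cap */
    static int can_submit_capped(long max_inflight)
    {
        return inflight < max_inflight;
    }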

Chris

I was a bit concerned that I'd need to extend the I/O throttling code to support a limit on total inflight bytes, but it doesn't look like that will be necessary.

It seems that using mallopt() to set the malloc trim and mmap thresholds to 128K is enough to minimize the growth in RSS and to drop it back down after an I/O burst. For now this looks like it should be sufficient for our purposes.
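
For reference, that's just the two glibc knobs set once at startup:

    /* glibc malloc tuning.  M_TRIM_THRESHOLD controls when free memory
     * at the top of the heap is returned to the kernel; M_MMAP_THRESHOLD
     * makes allocations at or above the threshold use mmap() so they are
     * unmapped on free.  128K is the value that worked here. */
    #include <malloc.h>

    static void tune_malloc(void)
    {
        mallopt(M_TRIM_THRESHOLD, 128 * 1024);
        mallopt(M_MMAP_THRESHOLD, 128 * 1024);
    }

The same thing can also be done without a code change via the
MALLOC_TRIM_THRESHOLD_ and MALLOC_MMAP_THRESHOLD_ environment variables.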

I'm actually a bit surprised I didn't have to go lower, but it seems to work for both the "dd" and dbench test cases, so we'll give it a try.

Chris
