Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback

2015-06-30 Thread Marcus Granado

On 13/05/15 11:29, Bob Liu wrote:


On 04/28/2015 03:46 PM, Arianna Avanzini wrote:

Hello Christoph,

On 28/04/2015 09:36, Christoph Hellwig wrote:

What happened to this patchset?



It was passed on to Bob Liu, who published a follow-up patchset here: 
https://lkml.org/lkml/2015/2/15/46



Right, and then I was interrupted by another xen-block feature: the 'multi-page'
ring.
I will get back to this patchset soon. Thank you!

-Bob



Hi,

Our measurements for the multiqueue patch indicate a clear improvement 
in iops when more queues are used.


The measurements were obtained under the following conditions:

- using blkback as the dom0 backend with the multiqueue patch applied to
a 4.0 dom0 kernel running on 8 vcpus.


- using a recent Ubuntu 15.04 kernel 3.19 with the multiqueue frontend patch
applied, used as a guest with 4 vcpus.


- using a Micron RealSSD P320h as the underlying local storage on a Dell
PowerEdge R720 with 2 Xeon E5-2643 v2 CPUs.


- fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest.
We used direct I/O to skip caching in the guest and ran fio for 60s,
reading a range of block sizes from 512 bytes to 4MiB. A queue depth of
32 for each queue was used to saturate individual vcpus in the guest.


We were interested in observing storage iops for different block sizes.
Our expectation was that iops would improve when increasing the number
of queues, because both the guest and dom0 would be able to make use of
more vcpus to handle these requests.


These are the results (as aggregate iops for all the fio threads) that 
we got for the conditions above with sequential reads:


fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops
     8          32         512         158K          264K
     8          32          1K         157K          260K
     8          32          2K         157K          258K
     8          32          4K         148K          257K
     8          32          8K         124K          207K
     8          32         16K          84K          105K
     8          32         32K          50K           54K
     8          32         64K          24K           27K
     8          32        128K          11K           13K

8-queue iops were better than single-queue iops for all block sizes.
There were very good improvements as well for sequential writes with
block size 4K (from 80K iops with a single queue to 230K iops with 8
queues), and no regressions were visible in any measurement performed.


Marcus


Re: [PATCH 3/3] xen/block: add multi-page ring support

2015-06-23 Thread Marcus Granado

On 22/06/15 02:20, Bob Liu wrote:


On 06/09/2015 10:07 PM, Roger Pau Monné wrote:

On 09/06/15 at 15.39, Konrad Rzeszutek Wilk wrote:

...

Roger, I put them (patches) on devel/for-jens-4.2 on

git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git

I think these two patches:
drivers: xen-blkback: delay pending_req allocation to connect_ring
xen/block: add multi-page ring support

are the only ones that haven't been Acked by you (or maybe they
have and I missed the Ack?)


Hello,

I was waiting to Ack those because the XenServer storage performance
folks found out that these patches cause a performance regression on
some of their tests. I'm adding them to the conversation so they can
provide more details about the issues they found, and whether we should
hold off on pushing these patches or not.



Hey,

Are there any updates? What's the performance regression problem?



Hi,

We spent the last two weeks finishing measurements on the multi-page
ring v5 patches under a range of diverse conditions.


The measurements were obtained under the following conditions:

- using blkback as the dom0 backend with the multi-page ring v5 patches
back-ported to our 3.10 dom0 kernel.


- using a recent Ubuntu 15.04 kernel 3.19 with the v5 frontend patches
applied, used as the guest.


- using a Micron RealSSD P320h as the underlying local storage on a Dell
PowerEdge R720 with 2 Xeon E5-2643 v2 CPUs.


- fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest.
We used direct I/O to skip caching in the guest and ran fio for 60s for
a number of block sizes ranging from 512 bytes to 4MiB. We also tried
pure random and pure sequential reads. Random reads were used to
counteract read-ahead prefetching at the underlying storage.


We noticed that using large (>16) queue depths in fio would saturate
individual vcpus in the guest, so to better utilise the CPU resources in
the guest we chose to (a) fix the queue depth at 4 for each fio thread,
(b) increase the guest vcpus to 16, and (c) vary the number of fio
threads from 1 to 64.
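
A single data point of that sweep corresponds to an fio command roughly
like the sketch below (again not our exact command line: the job name,
libaio ioengine and /dev/xvdb target are assumptions, and --numjobs was
swept over 1, 4, 8, 16, 32 and 64 across runs):

  # random reads, queue depth 4 per job, direct I/O, 60s per run
  fio --name=randread --filename=/dev/xvdb --ioengine=libaio --direct=1 \
      --rw=randread --bs=4k --iodepth=4 --numjobs=16 \
      --runtime=60 --time_based --group_reporting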


We were interested in observing storage iops and throughput for 
different values of in-flight requests (= io depth * fio threads) 
generated by the guest. Our expectation was that iops and throughput 
with single-page and multi-page rings would be the same up to 32 
in-flight requests (the number of requests that fit in a single-page 
ring), and then the single-page ring case would flat-line with >32 
in-flight requests, whereas the multi-page ring case would continue to 
show improvements until hitting some other bottleneck. The effect should 
be more visible when using requests with smaller block sizes because the 
measurements are less likely to be affected by memory copy delays or 
large data transfer delays to storage.
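
As a rough back-of-the-envelope check of where the 32-request limit comes
from (the byte counts below are from the 64-bit blkif ABI as we recall
it, so treat them as approximate; the generic Xen ring macros round the
slot count down to a power of two):

  slots per page ~= (PAGE_SIZE - ring header) / sizeof(blkif request)
                 ~= (4096 - 64) / 112  ~= 36  -> rounded down to 32
  8-page ring    -> 8 * 32 = 256 slots

which matches the in_flight values where the 1-page and 8-page results
start to diverge in the table below.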


These are the results we got for the conditions above with 4KiB blocks 
and random reads:


fio_threads  io_depth  in_flight  1-page_IOPS  8-page_IOPS
     1           4          4          19K          19K
     4           4         16          89K          89K
     8           4         32         149K         149K
    16           4         64         131K         198K
    32           4        128         127K         208K
    64           4        256         132K         209K

We believe that this data shows a clear improvement when using
multi-page rings once there are more than 32 in-flight requests. We
observed similar improvements when writing, and across all small block
sizes. For block sizes >=16KiB, the results were similar between single-
and multi-page rings, and we attribute that to bottlenecks when
transferring large amounts of data that are not present with smaller
block sizes.


Another reason for using random reads in the synthetic fio tests above
is that we noticed some anomalies when sequential reads were used, which
we believe would affect a fair comparison:


(A)- in some situations with sequential reads, we observed a decreasing
number of merges in the guest (according to 'iostat -x -m 1') for small
block sizes <=4KiB when increasing the number of ring pages. There were
no merges whenever in_flight < ring_pages * 32. With larger numbers of
in-flight requests (>=128) -- visible with both 8 fio_threads x 32
io_depth and 32 fio_threads x 8 io_depth -- storage throughput with 1
page was around 25% better than with 8 pages. This is the regression
that Roger was talking about previously in this discussion. It seems
related to request merges occurring much more frequently with 1 page
than with 8 pages. During the measurements, the average request queue
size reported by iostat always had a value similar to the number of
requests in the ring. I would appreciate potential explanations of why
the guest kernel behaves like that. We believe that this regression is a
corner case that would be difficult to spot in a real-world load, where
random reads are interspersed with sequential reads of many different
block sizes.
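
For reference, the merge and queue-size figures mentioned above were
read from the extended iostat output inside the guest, e.g.:

  # rrqm/s   = read requests merged per second
  # avgqu-sz = average request queue length (aqu-sz in newer sysstat)
  iostat -x -m 1 xvdb

(the xvdb device name is an assumption, and the column names vary
slightly between sysstat versions).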