Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback

2015-08-18 Thread Rafal Mielniczuk
On 14/08/15 13:30, Rafal Mielniczuk wrote:
 On 14/08/15 09:31, Bob Liu wrote:
 On 08/13/2015 12:46 AM, Rafal Mielniczuk wrote:
 On 12/08/15 11:17, Bob Liu wrote:
 On 08/12/2015 01:32 AM, Jens Axboe wrote:
 On 08/11/2015 03:45 AM, Rafal Mielniczuk wrote:
 On 11/08/15 07:08, Bob Liu wrote:
 On 08/10/2015 11:52 PM, Jens Axboe wrote:
 On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:
 ...
 Hello,

 We reran the tests for sequential reads with identical settings, but with
 Bob Liu's multiqueue patches reverted from the dom0 and guest kernels.
 The results we obtained were *better* than the results we got with the
 multiqueue patches applied:

 fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops  *no-mq-patches_iops*
           8        32         512          158K          264K                  321K
           8        32          1K          157K          260K                  328K
           8        32          2K          157K          258K                  336K
           8        32          4K          148K          257K                  308K
           8        32          8K          124K          207K                  188K
           8        32         16K           84K          105K                   82K
           8        32         32K           50K           54K                   36K
           8        32         64K           24K           27K                   16K
           8        32        128K           11K           13K                   11K

 We noticed that requests are not merged by the guest when the multiqueue
 patches are applied, which results in a regression for small block sizes
 (the RealSSD P320h's optimal block size is around 32-64KB).

 We observed a similar regression for the Dell MZ-5EA1000-0D3 100 GB
 2.5-inch internal SSD.

 As I understand it, the blk-mq layer bypasses the I/O scheduler, which also
 effectively disables merges.
 Could you explain why it is difficult to enable merging in the blk-mq layer?
 That could help close the performance gap we observed.

 Otherwise, the tests show that the multiqueue patches do not improve
 performance, at least when it comes to sequential read/write operations.
 blk-mq still provides merging, there should be no difference there. Do the
 xen patches set BLK_MQ_F_SHOULD_MERGE?

 Yes.
 Is it possible that the xen-blkfront driver dequeues requests too fast once
 we have multiple hardware queues? New requests then don't get a chance to
 merge with old requests which were already dequeued and issued.
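
 A quick way to confirm that the frontend sets the flag is to grep the driver
 source (a sketch; the path assumes a kernel tree with the blk-mq conversion
 patch applied):

 $ grep -n BLK_MQ_F_SHOULD_MERGE drivers/block/xen-blkfront.c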

 For some reason we don't see merges even when we set multiqueue to 1.
 Below are some stats from the guest system when doing sequential 4KB 
 reads:

 $ fio --name=test --ioengine=libaio --direct=1 --rw=read --numjobs=8
--iodepth=32 --time_based=1 --runtime=300 --bs=4KB
 --filename=/dev/xvdb

 $ iostat -xt 5 /dev/xvdb
 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
 0.500.002.73   85.142.009.63

 Device: rrqm/s   wrqm/s   r/s w/s rkB/swkB/s
 avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
 xvdb  0.00 0.00 156926.000.00 627704.00 0.00
 8.0030.060.190.190.00   0.01 100.48

 $ cat /sys/block/xvdb/queue/scheduler
 none

 $ cat /sys/block/xvdb/queue/nomerges
 0
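
 A direct way to watch merges during the run, independent of iostat, is to
 sample the per-device counters; per Documentation/block/stat.txt the second
 and sixth fields of /sys/block/<dev>/stat are reads merged and writes merged
 (a sketch, using the same device as above):

 $ awk '{print "read_merges=" $2, "write_merges=" $6}' /sys/block/xvdb/stat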

 Relevant bits from the xenstore configuration on the dom0:

 /local/domain/0/backend/vbd/2/51728/dev = xvdb
 /local/domain/0/backend/vbd/2/51728/backend-kind = vbd
 /local/domain/0/backend/vbd/2/51728/type = phy
 /local/domain/0/backend/vbd/2/51728/multi-queue-max-queues = 1

 /local/domain/2/device/vbd/51728/multi-queue-num-queues = 1
 /local/domain/2/device/vbd/51728/ring-ref = 9
 /local/domain/2/device/vbd/51728/event-channel = 60
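
 The same keys can be dumped from dom0 with the xenstore client tools, for
 example (a sketch; the paths are the ones listed above):

 $ xenstore-ls /local/domain/0/backend/vbd/2/51728
 $ xenstore-read /local/domain/2/device/vbd/51728/multi-queue-num-queues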
 What if you add --iodepth_batch=16 to that fio command line? Both mq and
 non-mq rely on plugging to get batching in the use case above; otherwise IO
 is dispatched immediately, and O_DIRECT is immediate.
 I'd be more interested in seeing a test case with buffered IO of a file
 system on top of the xvdb device; if we're missing merging for that case,
 then that's a much bigger issue.
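
 A buffered file-system run along those lines might look like this (a sketch;
 the mkfs choice, mount point, file size and job name are assumptions, the
 remaining options mirror the direct-IO command above):

 $ mkfs.ext4 /dev/xvdb && mount /dev/xvdb /mnt/test
 $ fio --name=buffered-test --ioengine=libaio --direct=0 --rw=read \
   --numjobs=8 --iodepth=32 --iodepth_batch=16 --time_based=1 \
   --runtime=300 --bs=4k --size=4G --directory=/mnt/test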

  
 I was using the null block driver for xen blk-mq test.

 There were no merges happening any more, even after the patch:
 https://lkml.org/lkml/2015/7/13/185
 (which just converted the xen block driver to use the blk-mq APIs).

 Will try a file system soon.
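
 For reference, a null_blk backend comparable to the one mentioned above can
 be set up roughly as follows (a sketch; the parameter values are assumptions,
 queue_mode=2 selects blk-mq and submit_queues sets the number of hardware
 queues, so the scheduler file should read "none" just as for xvdb):

 $ modprobe null_blk queue_mode=2 submit_queues=8 bs=512
 $ cat /sys/block/nullb0/queue/scheduler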

 I have more results for the guest with and without the patch
 https://lkml.org/lkml/2015/7/13/185
 applied to the latest stable kernel (4.1.5).

 Thank you.

 Command line used was:
 fio --name=test --ioengine=libaio --rw=read --numjobs=8 \
 --iodepth=32 --time_based=1 --runtime=300 --bs=4KB \
 --filename=/dev/xvdb --direct=(0 and 1) --iodepth_batch=16

 without patch (--direct=1):
   xvdb: ios=18696304/0, merge=75763177/0, ticks=11323872/0, 
 in_queue=11344352, util=100.00%

 with patch (--direct=1):
   xvdb: ios=43709976/0, merge=97/0, ticks=8851972/0, in_queue=8902928, 
 util=100.00%

 So request merging can still happen, it is just more difficult to trigger.
 How about the iops of both cases?

Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback

2015-08-14 Thread Bob Liu

On 08/13/2015 12:46 AM, Rafal Mielniczuk wrote:
 On 12/08/15 11:17, Bob Liu wrote:
 On 08/12/2015 01:32 AM, Jens Axboe wrote:
 On 08/11/2015 03:45 AM, Rafal Mielniczuk wrote:
 On 11/08/15 07:08, Bob Liu wrote:
 On 08/10/2015 11:52 PM, Jens Axboe wrote:
 On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:
 ...
 Hello,

 We reran the tests for sequential reads with identical settings, but with
 Bob Liu's multiqueue patches reverted from the dom0 and guest kernels.
 The results we obtained were *better* than the results we got with the
 multiqueue patches applied:

 fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops  *no-mq-patches_iops*
           8        32         512          158K          264K                  321K
           8        32          1K          157K          260K                  328K
           8        32          2K          157K          258K                  336K
           8        32          4K          148K          257K                  308K
           8        32          8K          124K          207K                  188K
           8        32         16K           84K          105K                   82K
           8        32         32K           50K           54K                   36K
           8        32         64K           24K           27K                   16K
           8        32        128K           11K           13K                   11K

 We noticed that requests are not merged by the guest when the multiqueue
 patches are applied, which results in a regression for small block sizes
 (the RealSSD P320h's optimal block size is around 32-64KB).

 We observed a similar regression for the Dell MZ-5EA1000-0D3 100 GB
 2.5-inch internal SSD.

 As I understand it, the blk-mq layer bypasses the I/O scheduler, which also
 effectively disables merges.
 Could you explain why it is difficult to enable merging in the blk-mq layer?
 That could help close the performance gap we observed.

 Otherwise, the tests show that the multiqueue patches do not improve
 performance, at least when it comes to sequential read/write operations.
 blk-mq still provides merging, there should be no difference there. Do the
 xen patches set BLK_MQ_F_SHOULD_MERGE?

 Yes.
 Is it possible that the xen-blkfront driver dequeues requests too fast once
 we have multiple hardware queues? New requests then don't get a chance to
 merge with old requests which were already dequeued and issued.

 For some reason we don't see merges even when we set multiqueue to 1.
 Below are some stats from the guest system when doing sequential 4KB reads:

 $ fio --name=test --ioengine=libaio --direct=1 --rw=read --numjobs=8
--iodepth=32 --time_based=1 --runtime=300 --bs=4KB
 --filename=/dev/xvdb

 $ iostat -xt 5 /dev/xvdb
 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
 0.500.002.73   85.142.009.63

 Device: rrqm/s   wrqm/s   r/s w/s rkB/swkB/s
 avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
 xvdb  0.00 0.00 156926.000.00 627704.00 0.00
 8.0030.060.190.190.00   0.01 100.48

 $ cat /sys/block/xvdb/queue/scheduler
 none

 $ cat /sys/block/xvdb/queue/nomerges
 0

 Relevant bits from the xenstore configuration on the dom0:

 /local/domain/0/backend/vbd/2/51728/dev = xvdb
 /local/domain/0/backend/vbd/2/51728/backend-kind = vbd
 /local/domain/0/backend/vbd/2/51728/type = phy
 /local/domain/0/backend/vbd/2/51728/multi-queue-max-queues = 1

 /local/domain/2/device/vbd/51728/multi-queue-num-queues = 1
 /local/domain/2/device/vbd/51728/ring-ref = 9
 /local/domain/2/device/vbd/51728/event-channel = 60
 What if you add --iodepth_batch=16 to that fio command line? Both mq and
 non-mq rely on plugging to get batching in the use case above; otherwise IO
 is dispatched immediately, and O_DIRECT is immediate.
 I'd be more interested in seeing a test case with buffered IO of a file
 system on top of the xvdb device; if we're missing merging for that case,
 then that's a much bigger issue.

  
 I was using the null block driver for xen blk-mq test.

 There were no merges happening any more, even after the patch:
 https://lkml.org/lkml/2015/7/13/185
 (which just converted the xen block driver to use the blk-mq APIs).

 Will try a file system soon.

 I have more results for the guest with and without the patch
 https://lkml.org/lkml/2015/7/13/185
 applied to the latest stable kernel (4.1.5).
 

Thank you.

 Command line used was:
 fio --name=test --ioengine=libaio --rw=read --numjobs=8 \
 --iodepth=32 --time_based=1 --runtime=300 --bs=4KB \
 --filename=/dev/xvdb --direct=(0 and 1) --iodepth_batch=16
 
 without patch (--direct=1):
   xvdb: ios=18696304/0, merge=75763177/0, ticks=11323872/0, 
 in_queue=11344352, util=100.00%
 
 with patch (--direct=1):
   xvdb: ios=43709976/0, merge=97/0, ticks=8851972/0, in_queue=8902928, 
 util=100.00%
 

So request merging can still happen, it is just more difficult to trigger.
How about the iops of both cases?

 without patch buffered (--direct=0):
   xvdb: ios=1079051/0, merge=76/0, ticks=749364/0, in_queue=748840, util=94.60%

Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback

2015-08-14 Thread Rafal Mielniczuk
On 14/08/15 09:31, Bob Liu wrote:
 On 08/13/2015 12:46 AM, Rafal Mielniczuk wrote:
 On 12/08/15 11:17, Bob Liu wrote:
 On 08/12/2015 01:32 AM, Jens Axboe wrote:
 On 08/11/2015 03:45 AM, Rafal Mielniczuk wrote:
 On 11/08/15 07:08, Bob Liu wrote:
 On 08/10/2015 11:52 PM, Jens Axboe wrote:
 On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:
 ...
 Hello,

 We reran the tests for sequential reads with identical settings, but with
 Bob Liu's multiqueue patches reverted from the dom0 and guest kernels.
 The results we obtained were *better* than the results we got with the
 multiqueue patches applied:

 fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops  *no-mq-patches_iops*
           8        32         512          158K          264K                  321K
           8        32          1K          157K          260K                  328K
           8        32          2K          157K          258K                  336K
           8        32          4K          148K          257K                  308K
           8        32          8K          124K          207K                  188K
           8        32         16K           84K          105K                   82K
           8        32         32K           50K           54K                   36K
           8        32         64K           24K           27K                   16K
           8        32        128K           11K           13K                   11K

 We noticed that requests are not merged by the guest when the multiqueue
 patches are applied, which results in a regression for small block sizes
 (the RealSSD P320h's optimal block size is around 32-64KB).

 We observed a similar regression for the Dell MZ-5EA1000-0D3 100 GB
 2.5-inch internal SSD.

 As I understand it, the blk-mq layer bypasses the I/O scheduler, which also
 effectively disables merges.
 Could you explain why it is difficult to enable merging in the blk-mq layer?
 That could help close the performance gap we observed.

 Otherwise, the tests show that the multiqueue patches do not improve
 performance, at least when it comes to sequential read/write operations.
 blk-mq still provides merging, there should be no difference there. Do the
 xen patches set BLK_MQ_F_SHOULD_MERGE?

 Yes.
 Is it possible that the xen-blkfront driver dequeues requests too fast once
 we have multiple hardware queues? New requests then don't get a chance to
 merge with old requests which were already dequeued and issued.

 For some reason we don't see merges even when we set multiqueue to 1.
 Below are some stats from the guest system when doing sequential 4KB 
 reads:

 $ fio --name=test --ioengine=libaio --direct=1 --rw=read --numjobs=8
--iodepth=32 --time_based=1 --runtime=300 --bs=4KB
 --filename=/dev/xvdb

 $ iostat -xt 5 /dev/xvdb
 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
 0.500.002.73   85.142.009.63

 Device: rrqm/s   wrqm/s   r/s w/s rkB/swkB/s
 avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
 xvdb  0.00 0.00 156926.000.00 627704.00 0.00
 8.0030.060.190.190.00   0.01 100.48

 $ cat /sys/block/xvdb/queue/scheduler
 none

 $ cat /sys/block/xvdb/queue/nomerges
 0

 Relevant bits from the xenstore configuration on the dom0:

 /local/domain/0/backend/vbd/2/51728/dev = xvdb
 /local/domain/0/backend/vbd/2/51728/backend-kind = vbd
 /local/domain/0/backend/vbd/2/51728/type = phy
 /local/domain/0/backend/vbd/2/51728/multi-queue-max-queues = 1

 /local/domain/2/device/vbd/51728/multi-queue-num-queues = 1
 /local/domain/2/device/vbd/51728/ring-ref = 9
 /local/domain/2/device/vbd/51728/event-channel = 60
 What if you add --iodepth_batch=16 to that fio command line? Both mq and
 non-mq rely on plugging to get batching in the use case above; otherwise IO
 is dispatched immediately, and O_DIRECT is immediate.
 I'd be more interested in seeing a test case with buffered IO of a file
 system on top of the xvdb device; if we're missing merging for that case,
 then that's a much bigger issue.

  
 I was using the null block driver for xen blk-mq test.

 There were no merges happening any more, even after the patch:
 https://lkml.org/lkml/2015/7/13/185
 (which just converted the xen block driver to use the blk-mq APIs).

 Will try a file system soon.

 I have more results for the guest with and without the patch
 https://lkml.org/lkml/2015/7/13/185
 applied to the latest stable kernel (4.1.5).

 Thank you.

 Command line used was:
 fio --name=test --ioengine=libaio --rw=read --numjobs=8 \
 --iodepth=32 --time_based=1 --runtime=300 --bs=4KB \
 --filename=/dev/xvdb --direct=(0 and 1) --iodepth_batch=16

 without patch (--direct=1):
   xvdb: ios=18696304/0, merge=75763177/0, ticks=11323872/0, 
 in_queue=11344352, util=100.00%

 with patch (--direct=1):
   xvdb: ios=43709976/0, merge=97/0, ticks=8851972/0, in_queue=8902928, 
 util=100.00%

 So request merging can still happen, it is just more difficult to trigger.
 How about the iops of both cases?

Without the patch it is 318K iops, with 

Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback

2015-08-12 Thread Rafal Mielniczuk
On 12/08/15 11:17, Bob Liu wrote:
 On 08/12/2015 01:32 AM, Jens Axboe wrote:
 On 08/11/2015 03:45 AM, Rafal Mielniczuk wrote:
 On 11/08/15 07:08, Bob Liu wrote:
 On 08/10/2015 11:52 PM, Jens Axboe wrote:
 On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:
 ...
 Hello,

 We reran the tests for sequential reads with identical settings, but with
 Bob Liu's multiqueue patches reverted from the dom0 and guest kernels.
 The results we obtained were *better* than the results we got with the
 multiqueue patches applied:

 fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops  *no-mq-patches_iops*
           8        32         512          158K          264K                  321K
           8        32          1K          157K          260K                  328K
           8        32          2K          157K          258K                  336K
           8        32          4K          148K          257K                  308K
           8        32          8K          124K          207K                  188K
           8        32         16K           84K          105K                   82K
           8        32         32K           50K           54K                   36K
           8        32         64K           24K           27K                   16K
           8        32        128K           11K           13K                   11K

 We noticed that requests are not merged by the guest when the multiqueue
 patches are applied, which results in a regression for small block sizes
 (the RealSSD P320h's optimal block size is around 32-64KB).

 We observed a similar regression for the Dell MZ-5EA1000-0D3 100 GB
 2.5-inch internal SSD.

 As I understand it, the blk-mq layer bypasses the I/O scheduler, which also
 effectively disables merges.
 Could you explain why it is difficult to enable merging in the blk-mq layer?
 That could help close the performance gap we observed.

 Otherwise, the tests show that the multiqueue patches do not improve
 performance, at least when it comes to sequential read/write operations.
 blk-mq still provides merging, there should be no difference there. Do the
 xen patches set BLK_MQ_F_SHOULD_MERGE?

 Yes.
 Is it possible that the xen-blkfront driver dequeues requests too fast once
 we have multiple hardware queues? New requests then don't get a chance to
 merge with old requests which were already dequeued and issued.

 For some reason we don't see merges even when we set multiqueue to 1.
 Below are some stats from the guest system when doing sequential 4KB reads:

 $ fio --name=test --ioengine=libaio --direct=1 --rw=read --numjobs=8
--iodepth=32 --time_based=1 --runtime=300 --bs=4KB
 --filename=/dev/xvdb

 $ iostat -xt 5 /dev/xvdb
 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
 0.500.002.73   85.142.009.63

 Device: rrqm/s   wrqm/s   r/s w/s rkB/swkB/s
 avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
 xvdb  0.00 0.00 156926.000.00 627704.00 0.00
 8.0030.060.190.190.00   0.01 100.48

 $ cat /sys/block/xvdb/queue/scheduler
 none

 $ cat /sys/block/xvdb/queue/nomerges
 0

 Relevant bits from the xenstore configuration on the dom0:

 /local/domain/0/backend/vbd/2/51728/dev = xvdb
 /local/domain/0/backend/vbd/2/51728/backend-kind = vbd
 /local/domain/0/backend/vbd/2/51728/type = phy
 /local/domain/0/backend/vbd/2/51728/multi-queue-max-queues = 1

 /local/domain/2/device/vbd/51728/multi-queue-num-queues = 1
 /local/domain/2/device/vbd/51728/ring-ref = 9
 /local/domain/2/device/vbd/51728/event-channel = 60
 What if you add --iodepth_batch=16 to that fio command line? Both mq and
 non-mq rely on plugging to get batching in the use case above; otherwise IO
 is dispatched immediately, and O_DIRECT is immediate.
 I'd be more interested in seeing a test case with buffered IO of a file
 system on top of the xvdb device; if we're missing merging for that case,
 then that's a much bigger issue.

  
 I was using the null block driver for xen blk-mq test.

 There were no merges happening any more, even after the patch:
 https://lkml.org/lkml/2015/7/13/185
 (which just converted the xen block driver to use the blk-mq APIs).

 Will try a file system soon.

I have more results for the guest with and without the patch
https://lkml.org/lkml/2015/7/13/185
applied to the latest stable kernel (4.1.5).

Command line used was:
fio --name=test --ioengine=libaio --rw=read --numjobs=8 \
--iodepth=32 --time_based=1 --runtime=300 --bs=4KB \
--filename=/dev/xvdb --direct=(0 and 1) --iodepth_batch=16

without patch (--direct=1):
  xvdb: ios=18696304/0, merge=75763177/0, ticks=11323872/0, in_queue=11344352, 
util=100.00%

with patch (--direct=1):
  xvdb: ios=43709976/0, merge=97/0, ticks=8851972/0, in_queue=8902928, 
util=100.00%

without patch buffered (--direct=0):
  xvdb: ios=1079051/0, merge=76/0, ticks=749364/0, in_queue=748840, util=94.60%

with patch buffered (--direct=0):
  xvdb: ios=1132932/0, merge=0/0, ticks=689108/0, in_queue=688488, util=93.32%



Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback

2015-08-11 Thread Bob Liu

On 08/10/2015 11:52 PM, Jens Axboe wrote:
 On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:
 On 01/07/15 04:03, Jens Axboe wrote:
 On 06/30/2015 08:21 AM, Marcus Granado wrote:
 Hi,

 Our measurements for the multiqueue patch indicate a clear improvement
 in iops when more queues are used.

 The measurements were obtained under the following conditions:

 - using blkback as the dom0 backend with the multiqueue patch applied to
 a dom0 kernel 4.0 on 8 vcpus.

 - using a recent Ubuntu 15.04 kernel 3.19 with multiqueue frontend
 applied to be used as a guest on 4 vcpus

 - using a micron RealSSD P320h as the underlying local storage on a Dell
 PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.

 - fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest.
 We used direct_io to skip caching in the guest and ran fio for 60s
 reading a number of block sizes ranging from 512 bytes to 4MiB. Queue
 depth of 32 for each queue was used to saturate individual vcpus in the
 guest.
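
 The job described above could be reproduced with something like the
 following (a sketch; the exact fio invocation and job name are assumptions
 based on the description, restricted to the block sizes shown in the
 results table below):

 $ for bs in 512 1k 2k 4k 8k 16k 32k 64k 128k; do \
     fio --name=seqread --ioengine=libaio --direct=1 --rw=read \
         --numjobs=8 --iodepth=32 --time_based=1 --runtime=60 \
         --bs=$bs --filename=/dev/xvdb; \
   done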

 We were interested in observing storage iops for different values of
 block sizes. Our expectation was that iops would improve when increasing
 the number of queues, because both the guest and dom0 would be able to
 make use of more vcpus to handle these requests.

 These are the results (as aggregate iops for all the fio threads) that
 we got for the conditions above with sequential reads:

 fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops
           8        32         512          158K          264K
           8        32          1K          157K          260K
           8        32          2K          157K          258K
           8        32          4K          148K          257K
           8        32          8K          124K          207K
           8        32         16K           84K          105K
           8        32         32K           50K           54K
           8        32         64K           24K           27K
           8        32        128K           11K           13K

 8-queue iops was better than single queue iops for all the block sizes.
 There were very good improvements as well for sequential writes with
 block size 4K (from 80K iops with single queue to 230K iops with 8
 queues), and no regressions were visible in any measurement performed.
 Great results! And I don't know why this code has lingered for so long,
 so thanks for helping get some attention to this again.

 Personally I'd be really interested in the results for the same set of
 tests, but without the blk-mq patches. Do you have them, or could you
 potentially run them?

 Hello,

 We reran the tests for sequential reads with identical settings, but with
 Bob Liu's multiqueue patches reverted from the dom0 and guest kernels.
 The results we obtained were *better* than the results we got with the
 multiqueue patches applied:

 fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops  *no-mq-patches_iops*
           8        32         512          158K          264K                  321K
           8        32          1K          157K          260K                  328K
           8        32          2K          157K          258K                  336K
           8        32          4K          148K          257K                  308K
           8        32          8K          124K          207K                  188K
           8        32         16K           84K          105K                   82K
           8        32         32K           50K           54K                   36K
           8        32         64K           24K           27K                   16K
           8        32        128K           11K           13K                   11K

 We noticed that requests are not merged by the guest when the multiqueue
 patches are applied, which results in a regression for small block sizes
 (the RealSSD P320h's optimal block size is around 32-64KB).

 We observed a similar regression for the Dell MZ-5EA1000-0D3 100 GB
 2.5-inch internal SSD.

 As I understand it, the blk-mq layer bypasses the I/O scheduler, which also
 effectively disables merges.
 Could you explain why it is difficult to enable merging in the blk-mq layer?
 That could help close the performance gap we observed.

 Otherwise, the tests show that the multiqueue patches do not improve
 performance, at least when it comes to sequential read/write operations.

 blk-mq still provides merging, there should be no difference there. Do the
 xen patches set BLK_MQ_F_SHOULD_MERGE?


Yes.
Is it possible that the xen-blkfront driver dequeues requests too fast once
we have multiple hardware queues? New requests then don't get a chance to
merge with old requests which were already dequeued and issued.

-- 
Regards,
-Bob

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback

2015-08-11 Thread Rafal Mielniczuk
On 11/08/15 07:08, Bob Liu wrote:
 On 08/10/2015 11:52 PM, Jens Axboe wrote:
 On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:
 On 01/07/15 04:03, Jens Axboe wrote:
 On 06/30/2015 08:21 AM, Marcus Granado wrote:
 Hi,

 Our measurements for the multiqueue patch indicate a clear improvement
 in iops when more queues are used.

 The measurements were obtained under the following conditions:

 - using blkback as the dom0 backend with the multiqueue patch applied to
 a dom0 kernel 4.0 on 8 vcpus.

 - using a recent Ubuntu 15.04 kernel 3.19 with multiqueue frontend
 applied to be used as a guest on 4 vcpus

 - using a micron RealSSD P320h as the underlying local storage on a Dell
 PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.

 - fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest.
 We used direct_io to skip caching in the guest and ran fio for 60s
 reading a number of block sizes ranging from 512 bytes to 4MiB. Queue
 depth of 32 for each queue was used to saturate individual vcpus in the
 guest.

 We were interested in observing storage iops for different values of
 block sizes. Our expectation was that iops would improve when increasing
 the number of queues, because both the guest and dom0 would be able to
 make use of more vcpus to handle these requests.

 These are the results (as aggregate iops for all the fio threads) that
 we got for the conditions above with sequential reads:

 fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops
           8        32         512          158K          264K
           8        32          1K          157K          260K
           8        32          2K          157K          258K
           8        32          4K          148K          257K
           8        32          8K          124K          207K
           8        32         16K           84K          105K
           8        32         32K           50K           54K
           8        32         64K           24K           27K
           8        32        128K           11K           13K

 8-queue iops was better than single queue iops for all the block sizes.
 There were very good improvements as well for sequential writes with
 block size 4K (from 80K iops with single queue to 230K iops with 8
 queues), and no regressions were visible in any measurement performed.
 Great results! And I don't know why this code has lingered for so long,
 so thanks for helping get some attention to this again.

 Personally I'd be really interested in the results for the same set of
 tests, but without the blk-mq patches. Do you have them, or could you
 potentially run them?

 Hello,

 We reran the tests for sequential reads with identical settings, but with
 Bob Liu's multiqueue patches reverted from the dom0 and guest kernels.
 The results we obtained were *better* than the results we got with the
 multiqueue patches applied:

 fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops  *no-mq-patches_iops*
           8        32         512          158K          264K                  321K
           8        32          1K          157K          260K                  328K
           8        32          2K          157K          258K                  336K
           8        32          4K          148K          257K                  308K
           8        32          8K          124K          207K                  188K
           8        32         16K           84K          105K                   82K
           8        32         32K           50K           54K                   36K
           8        32         64K           24K           27K                   16K
           8        32        128K           11K           13K                   11K

 We noticed that requests are not merged by the guest when the multiqueue
 patches are applied, which results in a regression for small block sizes
 (the RealSSD P320h's optimal block size is around 32-64KB).

 We observed a similar regression for the Dell MZ-5EA1000-0D3 100 GB
 2.5-inch internal SSD.

 As I understand it, the blk-mq layer bypasses the I/O scheduler, which also
 effectively disables merges.
 Could you explain why it is difficult to enable merging in the blk-mq layer?
 That could help close the performance gap we observed.

 Otherwise, the tests show that the multiqueue patches do not improve
 performance, at least when it comes to sequential read/write operations.
 blk-mq still provides merging, there should be no difference there. Do the
 xen patches set BLK_MQ_F_SHOULD_MERGE?

 Yes.
 Is it possible that the xen-blkfront driver dequeues requests too fast once
 we have multiple hardware queues? New requests then don't get a chance to
 merge with old requests which were already dequeued and issued.


For some reason we don't see merges even when we set multiqueue to 1.
Below are some stats from the guest system when doing sequential 4KB reads:

$ fio --name=test --ioengine=libaio --direct=1 --rw=read --numjobs=8
  --iodepth=32 --time_based=1 --runtime=300 --bs=4KB
--filename=/dev/xvdb

$ iostat -xt 5 /dev/xvdb
avg-cpu:  %user   %nice %system %iowait  

Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback

2015-08-11 Thread Jens Axboe

On 08/11/2015 03:45 AM, Rafal Mielniczuk wrote:

On 11/08/15 07:08, Bob Liu wrote:

On 08/10/2015 11:52 PM, Jens Axboe wrote:

On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:

On 01/07/15 04:03, Jens Axboe wrote:

On 06/30/2015 08:21 AM, Marcus Granado wrote:

Hi,

Our measurements for the multiqueue patch indicate a clear improvement
in iops when more queues are used.

The measurements were obtained under the following conditions:

- using blkback as the dom0 backend with the multiqueue patch applied to
a dom0 kernel 4.0 on 8 vcpus.

- using a recent Ubuntu 15.04 kernel 3.19 with multiqueue frontend
applied to be used as a guest on 4 vcpus

- using a micron RealSSD P320h as the underlying local storage on a Dell
PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.

- fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest.
We used direct_io to skip caching in the guest and ran fio for 60s
reading a number of block sizes ranging from 512 bytes to 4MiB. Queue
depth of 32 for each queue was used to saturate individual vcpus in the
guest.

We were interested in observing storage iops for different values of
block sizes. Our expectation was that iops would improve when increasing
the number of queues, because both the guest and dom0 would be able to
make use of more vcpus to handle these requests.

These are the results (as aggregate iops for all the fio threads) that
we got for the conditions above with sequential reads:

 fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops
           8        32         512          158K          264K
           8        32          1K          157K          260K
           8        32          2K          157K          258K
           8        32          4K          148K          257K
           8        32          8K          124K          207K
           8        32         16K           84K          105K
           8        32         32K           50K           54K
           8        32         64K           24K           27K
           8        32        128K           11K           13K

8-queue iops was better than single queue iops for all the block sizes.
There were very good improvements as well for sequential writes with
block size 4K (from 80K iops with single queue to 230K iops with 8
queues), and no regressions were visible in any measurement performed.

Great results! And I don't know why this code has lingered for so long,
so thanks for helping get some attention to this again.

Personally I'd be really interested in the results for the same set of
tests, but without the blk-mq patches. Do you have them, or could you
potentially run them?


Hello,

We reran the tests for sequential reads with identical settings, but with
Bob Liu's multiqueue patches reverted from the dom0 and guest kernels.
The results we obtained were *better* than the results we got with the
multiqueue patches applied:

 fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops  *no-mq-patches_iops*
           8        32         512          158K          264K                  321K
           8        32          1K          157K          260K                  328K
           8        32          2K          157K          258K                  336K
           8        32          4K          148K          257K                  308K
           8        32          8K          124K          207K                  188K
           8        32         16K           84K          105K                   82K
           8        32         32K           50K           54K                   36K
           8        32         64K           24K           27K                   16K
           8        32        128K           11K           13K                   11K

We noticed that requests are not merged by the guest when the multiqueue
patches are applied, which results in a regression for small block sizes
(the RealSSD P320h's optimal block size is around 32-64KB).

We observed a similar regression for the Dell MZ-5EA1000-0D3 100 GB
2.5-inch internal SSD.

As I understand it, the blk-mq layer bypasses the I/O scheduler, which also
effectively disables merges.
Could you explain why it is difficult to enable merging in the blk-mq layer?
That could help close the performance gap we observed.

Otherwise, the tests show that the multiqueue patches do not improve
performance, at least when it comes to sequential read/write operations.

blk-mq still provides merging, there should be no difference there. Do the
xen patches set BLK_MQ_F_SHOULD_MERGE?


Yes.
Is it possible that the xen-blkfront driver dequeues requests too fast once
we have multiple hardware queues? New requests then don't get a chance to
merge with old requests which were already dequeued and issued.



For some reason we don't see merges even when we set multiqueue to 1.
Below are some stats from the guest system when doing sequential 4KB reads:

$ fio --name=test --ioengine=libaio --direct=1 --rw=read --numjobs=8
   --iodepth=32 --time_based=1 --runtime=300 --bs=4KB
--filename=/dev/xvdb

$ iostat -xt 5 /dev/xvdb
avg-cpu:  %user   %nice %system %iowait  

Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback

2015-08-10 Thread Jens Axboe

On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:

On 01/07/15 04:03, Jens Axboe wrote:

On 06/30/2015 08:21 AM, Marcus Granado wrote:

Hi,

Our measurements for the multiqueue patch indicate a clear improvement
in iops when more queues are used.

The measurements were obtained under the following conditions:

- using blkback as the dom0 backend with the multiqueue patch applied to
a dom0 kernel 4.0 on 8 vcpus.

- using a recent Ubuntu 15.04 kernel 3.19 with multiqueue frontend
applied to be used as a guest on 4 vcpus

- using a micron RealSSD P320h as the underlying local storage on a Dell
PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.

- fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest.
We used direct_io to skip caching in the guest and ran fio for 60s
reading a number of block sizes ranging from 512 bytes to 4MiB. Queue
depth of 32 for each queue was used to saturate individual vcpus in the
guest.

We were interested in observing storage iops for different values of
block sizes. Our expectation was that iops would improve when increasing
the number of queues, because both the guest and dom0 would be able to
make use of more vcpus to handle these requests.

These are the results (as aggregate iops for all the fio threads) that
we got for the conditions above with sequential reads:

 fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops
           8        32         512          158K          264K
           8        32          1K          157K          260K
           8        32          2K          157K          258K
           8        32          4K          148K          257K
           8        32          8K          124K          207K
           8        32         16K           84K          105K
           8        32         32K           50K           54K
           8        32         64K           24K           27K
           8        32        128K           11K           13K

8-queue iops was better than single queue iops for all the block sizes.
There were very good improvements as well for sequential writes with
block size 4K (from 80K iops with single queue to 230K iops with 8
queues), and no regressions were visible in any measurement performed.

Great results! And I don't know why this code has lingered for so long,
so thanks for helping get some attention to this again.

Personally I'd be really interested in the results for the same set of
tests, but without the blk-mq patches. Do you have them, or could you
potentially run them?


Hello,

We reran the tests for sequential reads with identical settings, but with
Bob Liu's multiqueue patches reverted from the dom0 and guest kernels.
The results we obtained were *better* than the results we got with the
multiqueue patches applied:

 fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops  *no-mq-patches_iops*
           8        32         512          158K          264K                  321K
           8        32          1K          157K          260K                  328K
           8        32          2K          157K          258K                  336K
           8        32          4K          148K          257K                  308K
           8        32          8K          124K          207K                  188K
           8        32         16K           84K          105K                   82K
           8        32         32K           50K           54K                   36K
           8        32         64K           24K           27K                   16K
           8        32        128K           11K           13K                   11K

We noticed that requests are not merged by the guest when the multiqueue
patches are applied, which results in a regression for small block sizes
(the RealSSD P320h's optimal block size is around 32-64KB).

We observed a similar regression for the Dell MZ-5EA1000-0D3 100 GB
2.5-inch internal SSD.

As I understand it, the blk-mq layer bypasses the I/O scheduler, which also
effectively disables merges.
Could you explain why it is difficult to enable merging in the blk-mq layer?
That could help close the performance gap we observed.

Otherwise, the tests show that the multiqueue patches do not improve
performance, at least when it comes to sequential read/write operations.


blk-mq still provides merging, there should be no difference there. Do the
xen patches set BLK_MQ_F_SHOULD_MERGE?


--
Jens Axboe


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback

2015-08-10 Thread Bob Liu

On 08/10/2015 07:03 PM, Rafal Mielniczuk wrote:
 On 01/07/15 04:03, Jens Axboe wrote:
 On 06/30/2015 08:21 AM, Marcus Granado wrote:
 Hi,

 Our measurements for the multiqueue patch indicate a clear improvement
 in iops when more queues are used.

 The measurements were obtained under the following conditions:

 - using blkback as the dom0 backend with the multiqueue patch applied to
 a dom0 kernel 4.0 on 8 vcpus.

 - using a recent Ubuntu 15.04 kernel 3.19 with multiqueue frontend
 applied to be used as a guest on 4 vcpus

 - using a micron RealSSD P320h as the underlying local storage on a Dell
 PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.

 - fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest.
 We used direct_io to skip caching in the guest and ran fio for 60s
 reading a number of block sizes ranging from 512 bytes to 4MiB. Queue
 depth of 32 for each queue was used to saturate individual vcpus in the
 guest.

 We were interested in observing storage iops for different values of
 block sizes. Our expectation was that iops would improve when increasing
 the number of queues, because both the guest and dom0 would be able to
 make use of more vcpus to handle these requests.

 These are the results (as aggregate iops for all the fio threads) that
 we got for the conditions above with sequential reads:

 fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops
           8        32         512          158K          264K
           8        32          1K          157K          260K
           8        32          2K          157K          258K
           8        32          4K          148K          257K
           8        32          8K          124K          207K
           8        32         16K           84K          105K
           8        32         32K           50K           54K
           8        32         64K           24K           27K
           8        32        128K           11K           13K

 8-queue iops was better than single queue iops for all the block sizes.
 There were very good improvements as well for sequential writes with
 block size 4K (from 80K iops with single queue to 230K iops with 8
 queues), and no regressions were visible in any measurement performed.
 Great results! And I don't know why this code has lingered for so long, 
 so thanks for helping get some attention to this again.

 Personally I'd be really interested in the results for the same set of 
 tests, but without the blk-mq patches. Do you have them, or could you 
 potentially run them?

 Hello,
 
 We reran the tests for sequential reads with identical settings, but with
 Bob Liu's multiqueue patches reverted from the dom0 and guest kernels.
 The results we obtained were *better* than the results we got with the
 multiqueue patches applied:

 fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops  *no-mq-patches_iops*
           8        32         512          158K          264K                  321K
           8        32          1K          157K          260K                  328K
           8        32          2K          157K          258K                  336K
           8        32          4K          148K          257K                  308K
           8        32          8K          124K          207K                  188K
           8        32         16K           84K          105K                   82K
           8        32         32K           50K           54K                   36K
           8        32         64K           24K           27K                   16K
           8        32        128K           11K           13K                   11K

 We noticed that requests are not merged by the guest when the multiqueue
 patches are applied, which results in a regression for small block sizes
 (the RealSSD P320h's optimal block size is around 32-64KB).

 We observed a similar regression for the Dell MZ-5EA1000-0D3 100 GB
 2.5-inch internal SSD.
 

Which block scheduler was used in domU? Please check with
cat /sys/block/sdxxx/queue/scheduler.
What is the result when using the noop scheduler?
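
A quick way to check and switch the scheduler in the guest (a sketch; the
device name is an assumption, and on a blk-mq device the scheduler file will
simply show "none"):

$ cat /sys/block/xvdb/queue/scheduler
$ echo noop > /sys/block/xvdb/queue/scheduler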

Thanks,
Bob Liu

 As I understand it, the blk-mq layer bypasses the I/O scheduler, which also
 effectively disables merges.
 Could you explain why it is difficult to enable merging in the blk-mq layer?
 That could help close the performance gap we observed.

 Otherwise, the tests show that the multiqueue patches do not improve
 performance, at least when it comes to sequential read/write operations.
 
 Rafal
 

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback

2015-08-10 Thread Rafal Mielniczuk
On 01/07/15 04:03, Jens Axboe wrote:
 On 06/30/2015 08:21 AM, Marcus Granado wrote:
 Hi,

 Our measurements for the multiqueue patch indicate a clear improvement
 in iops when more queues are used.

 The measurements were obtained under the following conditions:

 - using blkback as the dom0 backend with the multiqueue patch applied to
 a dom0 kernel 4.0 on 8 vcpus.

 - using a recent Ubuntu 15.04 kernel 3.19 with multiqueue frontend
 applied to be used as a guest on 4 vcpus

 - using a micron RealSSD P320h as the underlying local storage on a Dell
 PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.

 - fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest.
 We used direct_io to skip caching in the guest and ran fio for 60s
 reading a number of block sizes ranging from 512 bytes to 4MiB. Queue
 depth of 32 for each queue was used to saturate individual vcpus in the
 guest.

 We were interested in observing storage iops for different values of
 block sizes. Our expectation was that iops would improve when increasing
 the number of queues, because both the guest and dom0 would be able to
 make use of more vcpus to handle these requests.

 These are the results (as aggregate iops for all the fio threads) that
 we got for the conditions above with sequential reads:

 fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops
           8        32         512          158K          264K
           8        32          1K          157K          260K
           8        32          2K          157K          258K
           8        32          4K          148K          257K
           8        32          8K          124K          207K
           8        32         16K           84K          105K
           8        32         32K           50K           54K
           8        32         64K           24K           27K
           8        32        128K           11K           13K

 8-queue iops was better than single queue iops for all the block sizes.
 There were very good improvements as well for sequential writes with
 block size 4K (from 80K iops with single queue to 230K iops with 8
 queues), and no regressions were visible in any measurement performed.
 Great results! And I don't know why this code has lingered for so long, 
 so thanks for helping get some attention to this again.

 Personally I'd be really interested in the results for the same set of 
 tests, but without the blk-mq patches. Do you have them, or could you 
 potentially run them?

Hello,

We reran the tests for sequential reads with identical settings, but with
Bob Liu's multiqueue patches reverted from the dom0 and guest kernels.
The results we obtained were *better* than the results we got with the
multiqueue patches applied:

 fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops  *no-mq-patches_iops*
           8        32         512          158K          264K                  321K
           8        32          1K          157K          260K                  328K
           8        32          2K          157K          258K                  336K
           8        32          4K          148K          257K                  308K
           8        32          8K          124K          207K                  188K
           8        32         16K           84K          105K                   82K
           8        32         32K           50K           54K                   36K
           8        32         64K           24K           27K                   16K
           8        32        128K           11K           13K                   11K

We noticed that requests are not merged by the guest when the multiqueue
patches are applied, which results in a regression for small block sizes
(the RealSSD P320h's optimal block size is around 32-64KB).

We observed a similar regression for the Dell MZ-5EA1000-0D3 100 GB
2.5-inch internal SSD.

As I understand it, the blk-mq layer bypasses the I/O scheduler, which also
effectively disables merges.
Could you explain why it is difficult to enable merging in the blk-mq layer?
That could help close the performance gap we observed.

Otherwise, the tests show that the multiqueue patches do not improve
performance, at least when it comes to sequential read/write operations.

Rafal



___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback

2015-06-30 Thread Jens Axboe

On 06/30/2015 08:21 AM, Marcus Granado wrote:

On 13/05/15 11:29, Bob Liu wrote:


On 04/28/2015 03:46 PM, Arianna Avanzini wrote:

Hello Christoph,

Il 28/04/2015 09:36, Christoph Hellwig ha scritto:

What happened to this patchset?



It was passed on to Bob Liu, who published a follow-up patchset here:
https://lkml.org/lkml/2015/2/15/46



Right, and then I was interrupted by another xen-block feature:
'multi-page' ring.
I will be back on this patchset soon. Thank you!

-Bob



Hi,

Our measurements for the multiqueue patch indicate a clear improvement
in iops when more queues are used.

The measurements were obtained under the following conditions:

- using blkback as the dom0 backend with the multiqueue patch applied to
a dom0 kernel 4.0 on 8 vcpus.

- using a recent Ubuntu 15.04 kernel 3.19 with multiqueue frontend
applied to be used as a guest on 4 vcpus

- using a micron RealSSD P320h as the underlying local storage on a Dell
PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.

- fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest.
We used direct_io to skip caching in the guest and ran fio for 60s
reading a number of block sizes ranging from 512 bytes to 4MiB. Queue
depth of 32 for each queue was used to saturate individual vcpus in the
guest.

We were interested in observing storage iops for different values of
block sizes. Our expectation was that iops would improve when increasing
the number of queues, because both the guest and dom0 would be able to
make use of more vcpus to handle these requests.

These are the results (as aggregate iops for all the fio threads) that
we got for the conditions above with sequential reads:

 fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops
           8        32         512          158K          264K
           8        32          1K          157K          260K
           8        32          2K          157K          258K
           8        32          4K          148K          257K
           8        32          8K          124K          207K
           8        32         16K           84K          105K
           8        32         32K           50K           54K
           8        32         64K           24K           27K
           8        32        128K           11K           13K

8-queue iops was better than single queue iops for all the block sizes.
There were very good improvements as well for sequential writes with
block size 4K (from 80K iops with single queue to 230K iops with 8
queues), and no regressions were visible in any measurement performed.


Great results! And I don't know why this code has lingered for so long, 
so thanks for helping get some attention to this again.


Personally I'd be really interested in the results for the same set of 
tests, but without the blk-mq patches. Do you have them, or could you 
potentially run them?


--
Jens Axboe


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback

2015-06-30 Thread Bob Liu

On 06/30/2015 10:21 PM, Marcus Granado wrote:
 On 13/05/15 11:29, Bob Liu wrote:

 On 04/28/2015 03:46 PM, Arianna Avanzini wrote:
 Hello Christoph,

 Il 28/04/2015 09:36, Christoph Hellwig ha scritto:
 What happened to this patchset?


 It was passed on to Bob Liu, who published a follow-up patchset here: 
 https://lkml.org/lkml/2015/2/15/46


 Right, and then I was interrupted by another xen-block feature: 'multi-page' 
 ring.
 I will be back on this patchset soon. Thank you!

 -Bob

 
 Hi,
 
 Our measurements for the multiqueue patch indicate a clear improvement in 
 iops when more queues are used.
 
 The measurements were obtained under the following conditions:
 
 - using blkback as the dom0 backend with the multiqueue patch applied to a 
 dom0 kernel 4.0 on 8 vcpus.
 
 - using a recent Ubuntu 15.04 kernel 3.19 with multiqueue frontend applied to 
 be used as a guest on 4 vcpus
 
 - using a micron RealSSD P320h as the underlying local storage on a Dell 
 PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.
 
 - fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest. We 
 used direct_io to skip caching in the guest and ran fio for 60s reading a 
 number of block sizes ranging from 512 bytes to 4MiB. Queue depth of 32 for 
 each queue was used to saturate individual vcpus in the guest.
 
 We were interested in observing storage iops for different values of block 
 sizes. Our expectation was that iops would improve when increasing the number 
 of queues, because both the guest and dom0 would be able to make use of more 
 vcpus to handle these requests.
 
 These are the results (as aggregate iops for all the fio threads) that we got 
 for the conditions above with sequential reads:
 
 fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops
           8        32         512          158K          264K
           8        32          1K          157K          260K
           8        32          2K          157K          258K
           8        32          4K          148K          257K
           8        32          8K          124K          207K
           8        32         16K           84K          105K
           8        32         32K           50K           54K
           8        32         64K           24K           27K
           8        32        128K           11K           13K
 
 8-queue iops was better than single queue iops for all the block sizes. There 
 were very good improvements as well for sequential writes with block size 4K 
 (from 80K iops with single queue to 230K iops with 8 queues), and no 
 regressions were visible in any measurement performed.
 

Great! Thank you very much for the test.

I'm trying to rebase these patches onto the latest kernel version (v4.1) and
will send them out in the following days.

-- 
Regards,
-Bob

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback

2015-06-30 Thread Marcus Granado

On 13/05/15 11:29, Bob Liu wrote:


On 04/28/2015 03:46 PM, Arianna Avanzini wrote:

Hello Christoph,

Il 28/04/2015 09:36, Christoph Hellwig ha scritto:

What happened to this patchset?



It was passed on to Bob Liu, who published a follow-up patchset here: 
https://lkml.org/lkml/2015/2/15/46



Right, and then I was interrupted by another xen-block feature: 'multi-page' 
ring.
I will be back on this patchset soon. Thank you!

-Bob



Hi,

Our measurements for the multiqueue patch indicate a clear improvement 
in iops when more queues are used.


The measurements were obtained under the following conditions:

- using blkback as the dom0 backend with the multiqueue patch applied to 
a dom0 kernel 4.0 on 8 vcpus.


- using a recent Ubuntu 15.04 kernel 3.19 with multiqueue frontend 
applied to be used as a guest on 4 vcpus


- using a micron RealSSD P320h as the underlying local storage on a Dell 
PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.


- fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest. 
We used direct_io to skip caching in the guest and ran fio for 60s 
reading a number of block sizes ranging from 512 bytes to 4MiB. Queue 
depth of 32 for each queue was used to saturate individual vcpus in the 
guest.


We were interested in observing storage iops for different values of 
block sizes. Our expectation was that iops would improve when increasing 
the number of queues, because both the guest and dom0 would be able to 
make use of more vcpus to handle these requests.


These are the results (as aggregate iops for all the fio threads) that 
we got for the conditions above with sequential reads:


fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops
          8        32         512          158K          264K
          8        32          1K          157K          260K
          8        32          2K          157K          258K
          8        32          4K          148K          257K
          8        32          8K          124K          207K
          8        32         16K           84K          105K
          8        32         32K           50K           54K
          8        32         64K           24K           27K
          8        32        128K           11K           13K

8-queue iops was better than single queue iops for all the block sizes. 
There were very good improvements as well for sequential writes with 
block size 4K (from 80K iops with single queue to 230K iops with 8 
queues), and no regressions were visible in any measurement performed.


Marcus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback

2015-05-13 Thread Bob Liu

On 04/28/2015 03:46 PM, Arianna Avanzini wrote:
 Hello Christoph,
 
 Il 28/04/2015 09:36, Christoph Hellwig ha scritto:
 What happened to this patchset?

 
 It was passed on to Bob Liu, who published a follow-up patchset here: 
 https://lkml.org/lkml/2015/2/15/46
 

Right, and then I was interrupted by another xen-block feature: 'multi-page' 
ring.
I will be back on this patchset soon. Thank you!

-Bob

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback

2015-04-28 Thread Christoph Hellwig
What happened to this patchset?

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback

2015-04-28 Thread Arianna Avanzini

Hello Christoph,

Il 28/04/2015 09:36, Christoph Hellwig ha scritto:

What happened to this patchset?



It was passed on to Bob Liu, who published a follow-up patchset here: 
https://lkml.org/lkml/2015/2/15/46


Thanks,
Arianna


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel