Re: Performance degradation in IO writes vs. reads (was scsi-mq V2)

2014-08-24 Thread Sagi Grimberg

On 8/21/2014 5:02 PM, Sagi Grimberg wrote:

On 8/21/2014 4:03 PM, Christoph Hellwig wrote:

On Thu, Aug 21, 2014 at 03:32:09PM +0300, Sagi Grimberg wrote:

So I just got back to checking this issue of *extremely low* IO write
performance I got in 3.16-rc2.


Please test with 3.16 final.  There was one issue each in aio and dio
that caused bad I/O performance regressions that were only fixed last
minute in the 3.16 cycle.



I'll do that now.



Indeed this issue is resolved in 3.16. Sorry for the false alarm...
Thanks Christoph.

Sagi.


Performance degradation in IO writes vs. reads (was scsi-mq V2)

2014-08-21 Thread Sagi Grimberg

On 7/14/2014 12:13 PM, Sagi Grimberg wrote:

SNIP


I'd like to share some benchmarks I took on this patch set using the iSER
initiator (+2 pre-submitted performance improvements) vs. the LIO iSER target.
I ran workloads I think are interesting use-cases (single LUN with 1, 2, or 4
IO threads, up to a fully occupied system doing IO to multiple LUNs).
Overall (except for 2 strange anomalies) it seems that the scsi-mq patches
(use_blk_mq=N) roughly sustain traditional scsi performance.
On the other hand, the scsi-mq code path (use_blk_mq=Y) on its own clearly
shows better performance (tables below).

At first I too hit the aio issues discussed in this thread and converted
to scsi-mq.3-no-rebase for testing (thanks Doug & Rob for raising it).
I must say that for some reason I get very low numbers for writes vs.
reads (write performance is stuck at ~20K IOPs per thread); this happens
on 3.16-rc2 even before the scsi-mq patches. Did anyone step on this as well,
or is it just a weird problem I'm having in my setup?
Anyway, this is why my benchmarks show only a randread IO pattern (which gives
familiar numbers). I need to figure out what's wrong
with IO writes - I'll start bisecting on this.



Hi,

So I just got back to checking this issue of *extremely low* IO write
performance I got in 3.16-rc2.

Reminder:
I used iSER to benchmark the performance of Christoph's scsi-mq patches and
noticed that direct-IO writes are stuck at 20-50K IOPs instead of the 350K
IOPs I was used to seeing for a single device. This issue also existed when
I removed the scsi-mq patches, so I started bisecting to see what broke
things.

Finally I got to the completely unbisectable bulk of commits that seems to
cause the issue:

/* IO write poor performance*/
2b777c9 ceph_sync_read: stop poking into iov_iter guts
f0d1bec new helper: copy_page_from_iter()
84c3d55 fuse: switch to ->write_iter()
b30ac0f btrfs: switch to ->write_iter()
3ef045c ocfs2: switch to ->write_iter()
bf97f3b xfs: switch to ->write_iter()
50b5551 afs: switch to ->write_iter()
da56e45 gfs2: switch to ->write_iter()
edaf436 nfs: switch to ->write_iter()
f5674c3 ubifs: switch to ->write_iter()
a8f3550 bury __generic_file_aio_write()
3dae875 cifs: switch to ->write_iter()
d4637bc udf: switch to ->write_iter()
9b88416 convert ext4 to ->write_iter()
a832475 Merge ext4 changes in ext4_file_write() into for-next
1456c0a blkdev_aio_write() - turn into blkdev_write_iter()
8174202 write_iter variants of {__,}generic_file_aio_write()
3644424 ceph: switch to ->read_iter()
3aa2d19 nfs: switch to ->read_iter()
a886038 fs/block_dev.c: switch to ->read_iter()
2ba5bbe shmem: switch to ->read_iter()
fb9096a pipe: switch to ->read_iter()
e6a7bcb cifs: switch to ->read_iter()
37c20f1 fuse_file_aio_read(): convert to ->read_iter()
3cd9ad5 ocfs2: switch to ->read_iter()
0279782 ecryptfs: switch to ->read_iter()
b4f5d2c xfs: switch to ->read_iter()
aad4f8b switch simple generic_file_aio_read() users to ->read_iter()
293bc98 new methods: ->read_iter() and ->write_iter()
7f7f25e replace checking for ->read/->aio_read presence with check in ->f_mode
b318891 xfs: trim the argument lists of xfs_file_{dio,buffered}_aio_write()
37938463 blkdev_aio_read(): switch to generic_file_read_iter(), get rid of iov_shorten()
0c94933 iov_iter_truncate()
28060d5 btrfs: switch check_direct_IO() to iov_iter
91f79c4 new helper: iov_iter_get_pages_alloc()
f67da30 new helper: iov_iter_npages()
5b46f25 f2fs: switch to iov_iter_alignment()
c9c37e2 fuse: switch to iov_iter_get_pages()
d22a943 fuse: pull iov_iter initializations up
7b2c99d new helper: iov_iter_get_pages()
3320c60 dio: take updating ->result into do_direct_IO()
71d8e53 start adding the tag to iov_iter
ed978a8 new helper: generic_file_read_iter()
23faa7b fuse_file_aio_write(): merge initializations of iov_iter
05bb2e0 ceph_aio_read(): keep iov_iter across retries
886a391 new primitive: iov_iter_alignment()
26978b8 give ->direct_IO() a copy of iov_iter
31b1403 switch {__,}blockdev_direct_IO() to iov_iter
a6cbcd4 get rid of pointless iov_length() in ->direct_IO()
16b1f05 ext4: switch the guts of ->direct_IO() to iov_iter
619d30b convert the guts of nfs_direct_IO() to iov_iter
d8d3d94 pass iov_iter to ->direct_IO()
cb66a7a kill generic_segment_checks()
0ae5e4d __btrfs_direct_write(): switch to iov_iter
f8579f8 generic_file_direct_write(): switch to iov_iter
e7c2460 kill iov_iter_copy_from_user()
f6c0a19 fs/file.c: don't open-code kvfree()
/* IO write performance is OK*/

I tried to isolate the issue by running fio on a null_blk
device and also saw performance degradation, although it wasn't
as severe as with iSER/iSCSI. IO write performance decreased
from 360K IOPs to 280K IOPs.

So at the moment I can't pinpoint the problem, but I figured
I'd raise the issue in case anyone else has stepped on this one
(hard to imagine that no one saw this...).

I'll run perf comparison to see if I get anything interesting.

CC'ing Al Viro, the author of all the above commits.

Cheers,
Sagi.

Re: Performance degradation in IO writes vs. reads (was scsi-mq V2)

2014-08-21 Thread Christoph Hellwig
On Thu, Aug 21, 2014 at 03:32:09PM +0300, Sagi Grimberg wrote:
 So I just got back to checking this issue of *extremely low* IO write
 performance I got in 3.16-rc2.

Please test with 3.16 final.  There was one issue each in aio and dio
that caused bad I/O performance regressions that were only fixed last
minute in the 3.16 cycle.



Re: Performance degradation in IO writes vs. reads (was scsi-mq V2)

2014-08-21 Thread Sagi Grimberg

On 8/21/2014 4:03 PM, Christoph Hellwig wrote:

On Thu, Aug 21, 2014 at 03:32:09PM +0300, Sagi Grimberg wrote:

So I just got back to checking this issue of *extremely low* IO write
performance I got in 3.16-rc2.


Please test with 3.16 final.  There was one issue each in aio and dio
that caused bad I/O performance regressions that were only fixed last
minute in the 3.16 cycle.



I'll do that now.

Thanks,
Sagi.


Re: scsi-mq V2

2014-07-14 Thread Sagi Grimberg

On 7/8/2014 5:48 PM, Christoph Hellwig wrote:
SNIP

I've pushed out a new scsi-mq.3 branch, which has been rebased on the
latest core-for-3.17 tree + the RFC: clean up command setup series
from June 29th.  Robert Elliot found a problem with not fully zeroed
out UNMAP CDBs, which is fixed by the saner discard handling in that
series.

There is a new patch to factor the code from the above series for
blk-mq use, which I've attached below.  Besides that the only changes
are minor merge fixups in the main blk-mq usage patch.


Hey Christoph & Co,

I'd like to share some benchmarks I took on this patch set using the iSER
initiator (+2 pre-submitted performance improvements) vs. the LIO iSER target.
I ran workloads I think are interesting use-cases (single LUN with 1, 2, or 4
IO threads, up to a fully occupied system doing IO to multiple LUNs).
Overall (except for 2 strange anomalies) it seems that the scsi-mq patches
(use_blk_mq=N) roughly sustain traditional scsi performance.
On the other hand, the scsi-mq code path (use_blk_mq=Y) on its own clearly
shows better performance (tables below).


At first I too hit the aio issues discussed in this thread and converted
to scsi-mq.3-no-rebase for testing (thanks Doug & Rob for raising it).
I must say that for some reason I get very low numbers for writes vs.
reads (write performance is stuck at ~20K IOPs per thread); this happens
on 3.16-rc2 even before the scsi-mq patches. Did anyone step on this as well,
or is it just a weird problem I'm having in my setup?
Anyway, this is why my benchmarks show only a randread IO pattern (which gives
familiar numbers). I need to figure out what's wrong
with IO writes - I'll start bisecting on this.

I also reviewed the patch set and at this point, I don't have any 
comments. So you can add to the series:
Reviewed-by: Sagi Grimberg 'sa...@dev.mellanox.co.il' (or Tested-by - 
whatever you choose).


I want to state that I tested a traditional iSER initiator - no scsi-mq 
adoption at all.
I started looking into adopting scsi-mq for iSCSI/iSER recently and I
must say that the scsi-mq adoption is not so
trivial due to iSCSI session-wide CmdSN/StatSN ordering constraints
(we can't just use more RDMA channels per connection...).
I'll be on vacation for the next couple of weeks, so I'll start a 
separate thread to get the community input on this matter.



Results: table entries are KIOPS(CPU%)
3.16-rc2 (scsi-mq patches reverted)
       Threads/LUN       1             2             4
#LUNs
  1                  231(6.5%)     355(18.5%)    337(31.1%)
  2                  446(13.6%)    673(37.2%)    654(49.8%)
  4                  594(25%)      960(49.41%)  1165(99.3%)
  8                 1018(50.3%)   1563(99.6%)   1696(99.9%)
 16                 1660(86.5%)   1731(99.6%)   1710(100%)


3.16-rc2 (scsi-mq included, use_blk_mq=N)
       Threads/LUN       1             2             4
#LUNs
  1                  231(6.5%)     351(18.5%)    337(31.4%)
  2                  446(13.6%)    660(37.3%)    647(50%)
  4                  591(25%)      967(49.7%)   1136(98.1%)
  8                 1014(52.1%)   1296(100%)    1470(100%)
 16                 1741(100%)    1761(100%)    1853(100%)


3.16-rc2 (scsi-mq included, use_blk_mq=Y)
       Threads/LUN       1             2             4
#LUNs
  1                  265(6.4%)     465(13.4%)    572(27.9%)
  2                  507(13.4%)    902(27.8%)   1034(45.9%)
  4                  697(25%)     1197(49.5%)   1477(98.6%)
  8                 1257(53.6%)   1856(98.7%)   1906(100%)
 16                 1991(100%)    2021(100%)    2020(100%)

Notes:
- IOPs measurements are the average of 60-second runs.
- The CPU measurement is the total usage across all CPUs; to get
  per-CPU utilization the value should be normalized to the 16 cores
  (e.g., 50% total corresponds to roughly 8 of the 16 cores being
  fully busy).
- scsi-mq (use_blk_mq=N) has roughly the same performance as the
  traditional scsi IO path, but I see an anomaly in the test cases
  {8 LUNs, 2/4 threads per LUN}. This may result from NUMA
  misalignment of threads/interrupts - it requires further
  investigation.
- The iSER initiator has no multi-queue awareness.

Testing environment:
- Initiator and target systems with 16 (8x2) cores (Hyperthreading
  disabled).
- CPU model: Intel(R) Xeon(R) @ 2.60GHz
- Block Layer settings:
- scheduler=noop
- rq_affinity=1
- add_random=0
- nomerges=1
- Single FDR link between the target and initiator.
- Device model: Mellanox ConnectIB (similar numbers were also seen
  with Mellanox ConnectX-3).
- MSIX interrupt vectors were spread across system cores.
- irqbalancer was disabled.
- scsi_host settings:
- cmd_per_lun=32 (default)
- can_queue=113 (default)
- In the multi-LUN test cases, each LUN is exposed via a different
  scsi_host (iSCSI session).

Software:
- fio version: 2.0.13
- LIO iSER target (target-pending for-next)
- Null backing devices (NULLIO)
- Upstream based iSER initiator + internal pre-submitted
  performance enhancements.

fio configuration:
rw=randread
bs=1k
iodepth=128
loops=1
ioengine=libaio
direct=1
invalidate=1
fsync_on_close=1
randrepeat=1
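
(The listing above covers only the workload options; the per-LUN device and
the thread count are not shown. A complete job file along these lines might
look like the sketch below - the [global]/[lun0] layout, the filename, and
the numjobs value are illustrative assumptions, since the actual device names
and per-test thread counts are not given here.)

[global]
rw=randread
bs=1k
iodepth=128
loops=1
ioengine=libaio
direct=1
invalidate=1
fsync_on_close=1
randrepeat=1

; hypothetical per-LUN job; numjobs is the threads-per-LUN value (1, 2 or 4)
[lun0]
filename=/dev/sdb
numjobs=4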

Re: scsi-mq V2

2014-07-14 Thread Benjamin LaHaise
Hi Robert,

On Sun, Jul 13, 2014 at 05:15:15PM +, Elliott, Robert (Server Storage) 
wrote:
   I will see if that solves the problem with the scsi-mq-3 tree, or
   at least some of the bisect trees leading up to it.
  
  scsi-mq-3 is still going after 45 minutes.  I'll leave it running
  overnight.
  
 
 That has been going strong for 18 hours, so I think that's the patch
 we need.

Thanks for taking the time to narrow this down.  I've applied the fix to 
my aio-fixes tree at git://git.kvack.org/~bcrl/aio-fixes.git and forwarded
it on to Linus as well.

-ben
-- 
Thought is the essence of where you are now.


RE: scsi-mq V2

2014-07-13 Thread Elliott, Robert (Server Storage)


 -Original Message-
 From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
 ow...@vger.kernel.org] On Behalf Of Elliott, Robert (Server Storage)
 Sent: Saturday, July 12, 2014 6:20 PM
 To: Benjamin LaHaise
 Cc: Christoph Hellwig; Jeff Moyer; Jens Axboe; dgilb...@interlog.com;
 James Bottomley; Bart Van Assche; linux-scsi@vger.kernel.org; linux-
 ker...@vger.kernel.org
 Subject: RE: scsi-mq V2
 
  I will see if that solves the problem with the scsi-mq-3 tree, or
  at least some of the bisect trees leading up to it.
 
 scsi-mq-3 is still going after 45 minutes.  I'll leave it running
 overnight.
 

That has been going strong for 18 hours, so I think that's the patch
we need.





RE: scsi-mq V2

2014-07-12 Thread Elliott, Robert (Server Storage)


 -Original Message-
 From: Benjamin LaHaise [mailto:b...@kvack.org]
 Sent: Friday, 11 July, 2014 9:55 AM
 To: Elliott, Robert (Server Storage)
 Cc: Christoph Hellwig; Jeff Moyer; Jens Axboe; dgilb...@interlog.com; James
 Bottomley; Bart Van Assche; linux-scsi@vger.kernel.org; linux-
 ker...@vger.kernel.org
 Subject: Re: scsi-mq V2
...
 Can you try the below totally untested patch instead?  It looks like
 put_reqs_available() is not irq-safe.
 

With that addition alone, fio still runs into the same problem.

I added the same fix to get_reqs_available, which also accesses
kcpu->reqs_available, and the test has run for 35 minutes with
no problem.

Patch applied:

diff --git a/fs/aio.c b/fs/aio.c
index e59bba8..8e85e26 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -830,16 +830,20 @@ void exit_aio(struct mm_struct *mm)
 static void put_reqs_available(struct kioctx *ctx, unsigned nr)
 {
        struct kioctx_cpu *kcpu;
+       unsigned long flags;
 
        preempt_disable();
        kcpu = this_cpu_ptr(ctx->cpu);
 
+       local_irq_save(flags);
        kcpu->reqs_available += nr;
+
        while (kcpu->reqs_available >= ctx->req_batch * 2) {
                kcpu->reqs_available -= ctx->req_batch;
                atomic_add(ctx->req_batch, &ctx->reqs_available);
        }
 
+       local_irq_restore(flags);
        preempt_enable();
 }
 
@@ -847,10 +851,12 @@ static bool get_reqs_available(struct kioctx *ctx)
 {
        struct kioctx_cpu *kcpu;
        bool ret = false;
+       unsigned long flags;
 
        preempt_disable();
        kcpu = this_cpu_ptr(ctx->cpu);
 
+       local_irq_save(flags);
        if (!kcpu->reqs_available) {
                int old, avail = atomic_read(&ctx->reqs_available);
 
@@ -869,6 +875,7 @@ static bool get_reqs_available(struct kioctx *ctx)
        ret = true;
        kcpu->reqs_available--;
 out:
+       local_irq_restore(flags);
        preempt_enable();
        return ret;
 }

--
I will see if that solves the problem with the scsi-mq-3 tree, or 
at least some of the bisect trees leading up to it.

A few other comments:

1. Those changes boost _raw_spin_lock_irqsave into first place
in perf top:

  6.59%  [kernel][k] _raw_spin_lock_irqsave
  4.37%  [kernel][k] put_compound_page
  2.87%  [scsi_debug][k] sdebug_q_cmd_hrt_complete
  2.74%  [kernel][k] _raw_spin_lock
  2.73%  [kernel][k] apic_timer_interrupt
  2.41%  [kernel][k] do_blockdev_direct_IO
  2.24%  [kernel][k] __get_page_tail
  1.97%  [kernel][k] _raw_spin_unlock_irqrestore
  1.87%  [kernel][k] scsi_queue_rq
  1.76%  [scsi_debug][k] schedule_resp

Maybe (later) kcpu->reqs_available should be converted to an atomic,
like ctx->reqs_available, to reduce that overhead?

2. After the f8567a3 patch, aio_complete has one early return that 
bypasses the call to put_reqs_available.  Is that OK, or does
that mean that sync iocbs will now eat up reqs_available?

/*
 * Special case handling for sync iocbs:
 *  - events go directly into the iocb for fast handling
 *  - the sync task with the iocb in its stack holds the single iocb
 *ref, no other paths have a way to get another ref
 *  - the sync task helpfully left a reference to itself in the iocb
 */
if (is_sync_kiocb(iocb)) {
        iocb->ki_user_data = res;
        smp_wmb();
        iocb->ki_ctx = ERR_PTR(-EXDEV);
        wake_up_process(iocb->ki_obj.tsk);
        return;
}


3. The f8567a3 patch renders this comment in aio.c out of date - it's 
no longer incremented when pulled off the ringbuffer, but is now 
incremented when aio_complete is called.

struct {
/*
 * This counts the number of available slots in the ringbuffer,
 * so we avoid overflowing it: it's decremented (if positive)
 * when allocating a kiocb and incremented when the resulting
 * io_event is pulled off the ringbuffer.
 *
 * We batch accesses to it with a percpu version.
 */
        atomic_t        reqs_available;
} cacheline_aligned_in_smp;
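
For what it's worth, a hedged sketch of how that comment might be brought up
to date after f8567a3 (illustrative wording only, not a patch from this
thread):

        /*
         * This counts the number of available slots in the ringbuffer,
         * so we avoid overflowing it: it's decremented (if positive)
         * when allocating a kiocb and incremented when the resulting
         * io_event is completed (in aio_complete()) rather than when
         * it is pulled off the ringbuffer.
         *
         * We batch accesses to it with a percpu version.
         */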


---
Rob Elliott    HP Server Storage





RE: scsi-mq V2

2014-07-12 Thread Elliott, Robert (Server Storage)
 I will see if that solves the problem with the scsi-mq-3 tree, or
 at least some of the bisect trees leading up to it.

scsi-mq-3 is still going after 45 minutes.  I'll leave it running
overnight.




RE: scsi-mq V2

2014-07-11 Thread Elliott, Robert (Server Storage)


 -Original Message-
 From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
 ow...@vger.kernel.org] On Behalf Of Elliott, Robert (Server Storage)
 
 I added some prints in aio_setup_ring and  ioctx_alloc and
 rebooted.  This time it took much longer to hit the problem.  It
 survived dozens of ^Cs.  Running a few minutes, though, IOPS
 eventually dropped.  So, sometimes it happens immediately,
 sometimes it takes time to develop.
 
 I will rerun bisect-1 -2 and -3 for longer times to increase
 confidence that they didn't just appear good.

Allowing longer run times before declaring success, the problem 
does appear in all of the bisect trees.  I just let fio
continue to run for many minutes - no ^Cs necessary.

no-rebase: good for > 45 minutes (I will leave that running for
  8 more hours)
bisect-1: bad
bisect-2: bad
bisect-3: bad
bisect-4: bad




Re: scsi-mq V2

2014-07-11 Thread Christoph Hellwig
On Fri, Jul 11, 2014 at 06:02:03AM +, Elliott, Robert (Server Storage) 
wrote:
 Allowing longer run times before declaring success, the problem 
 does appear in all of the bisect trees.  I just let fio
 continue to run for many minutes - no ^Cs necessary.
 
 no-rebase: good for > 45 minutes (I will leave that running for
   8 more hours)

Ok, thanks.  If it's still running tomorrow morning let's look into the
aio reverts again.



RE: scsi-mq V2

2014-07-11 Thread Elliott, Robert (Server Storage)


 -Original Message-
 From: Christoph Hellwig [mailto:h...@infradead.org]
 Sent: Friday, 11 July, 2014 1:15 AM
 To: Elliott, Robert (Server Storage)
 Cc: Jeff Moyer; Christoph Hellwig; Jens Axboe; dgilb...@interlog.com; James
 Bottomley; Bart Van Assche; Benjamin LaHaise; linux-scsi@vger.kernel.org;
 linux-ker...@vger.kernel.org
 Subject: Re: scsi-mq V2
 
 On Fri, Jul 11, 2014 at 06:02:03AM +, Elliott, Robert (Server Storage)
 wrote:
  Allowing longer run times before declaring success, the problem
  does appear in all of the bisect trees.  I just let fio
  continue to run for many minutes - no ^Cs necessary.
 
  no-rebase: good for > 45 minutes (I will leave that running for
    8 more hours)
 
 Ok, thanks.  If it's still running tomorrow morning let's look into the
 aio reverts again.

That ran 9 total hours with no problem.

Rather than revert in the bisect trees, I added just this single additional
patch to the no-rebase tree, and the problem appeared:


48a2e94154177286b3bcbed25ea802232527fa7c
aio: fix aio request leak when events are reaped by userspace

diff --git a/fs/aio.c b/fs/aio.c
index 4f078c0..e59bba8 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1021,6 +1021,7 @@ void aio_complete(struct kiocb *iocb, long res, long res2)

/* everything turned out well, dispose of the aiocb. */
kiocb_free(iocb);
+   put_reqs_available(ctx, 1); /* added by patch f8567 */

/*
 * We have to order our ring_info tail store above and test
@@ -1101,7 +1102,7 @@ static long aio_read_events_ring(struct kioctx *ctx,

        pr_debug("%li  h%u t%u\n", ret, head, tail);

-   put_reqs_available(ctx, ret);
+   /* put_reqs_available(ctx, ret); removed by patch f8567 */
 out:
        mutex_unlock(&ctx->ring_lock);


---
Rob Elliott    HP Server Storage





Re: scsi-mq V2

2014-07-11 Thread Benjamin LaHaise
On Fri, Jul 11, 2014 at 02:33:12PM +, Elliott, Robert (Server Storage) 
wrote:
 That ran 9 total hours with no problem.
 
 Rather than revert in the bisect trees, I added just this single additional
 patch to the no-rebase tree, and the problem appeared:

Can you try the below totally untested patch instead?  It looks like
put_reqs_available() is not irq-safe.
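
For context: aio_complete() can run from interrupt context (e.g. a storage
completion path), and after the f8567 change it also calls
put_reqs_available(), while preempt_disable() only prevents migration to
another CPU and does not mask local interrupts. A rough sketch of the
lost-update window this opens (illustrative only, not taken from the thread):

        /*
         * task: tmp = kcpu->reqs_available;          read
         * irq:  aio_complete() -> put_reqs_available() bumps the same
         *       percpu counter
         * task: kcpu->reqs_available = tmp + nr;     write - the irq's
         *       update is lost and available requests slowly leak away
         *
         * local_irq_save()/local_irq_restore() in the patch below closes
         * that window.
         */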

-ben
-- 
Thought is the essence of where you are now.


diff --git a/fs/aio.c b/fs/aio.c
index 955947e..4b97180 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -830,16 +830,20 @@ void exit_aio(struct mm_struct *mm)
 static void put_reqs_available(struct kioctx *ctx, unsigned nr)
 {
        struct kioctx_cpu *kcpu;
+       unsigned long flags;
 
        preempt_disable();
        kcpu = this_cpu_ptr(ctx->cpu);
 
+       local_irq_save(flags);
        kcpu->reqs_available += nr;
+
        while (kcpu->reqs_available >= ctx->req_batch * 2) {
                kcpu->reqs_available -= ctx->req_batch;
                atomic_add(ctx->req_batch, &ctx->reqs_available);
        }
 
+       local_irq_restore(flags);
        preempt_enable();
 }
 


Re: scsi-mq V2

2014-07-10 Thread Christoph Hellwig
On Thu, Jul 10, 2014 at 12:53:36AM +, Elliott, Robert (Server Storage) 
wrote:
 the problem still occurs - fio results in low or 0 IOPS, with perf top 
 reporting unusual amounts of time spent in do_io_submit and io_submit.

The diff between the two version doesn't show too much other possible
interesting commits, the most interesting being some minor block
updates.

I guess we'll have to do a manual bisect, I've pushed out a
scsi-mq.3-bisect-1 branch that is rebased to just before the merge of
the block tree, and a scsi-mq.3-bisect-2 branch that is just after
the merge of the block tree to get started.



Re: scsi-mq V2

2014-07-10 Thread Benjamin LaHaise
On Wed, Jul 09, 2014 at 11:20:40PM -0700, Christoph Hellwig wrote:
 On Thu, Jul 10, 2014 at 12:53:36AM +, Elliott, Robert (Server Storage) 
 wrote:
  the problem still occurs - fio results in low or 0 IOPS, with perf top 
  reporting unusual amounts of time spent in do_io_submit and io_submit.
 
 The diff between the two version doesn't show too much other possible
 interesting commits, the most interesting being some minor block
 updates.
 
 I guess we'll have to do a manual bisect, I've pushed out a
 scsi-mq.3-bisect-1 branch that is rebased to just before the merge of
 the block tree, and a scsi-mq.3-bisect-2 branch that is just after
 the merge of the block tree to get started.

There is one possible concern that could be exacerbated by other changes in 
the system: if the application is running close to the bare minimum number 
of requests allocated in io_setup(), the per cpu reference counters will 
have a hard time batching things.  It might be worth testing with an 
increased number of requests being allocated if this is the case.
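
To make that concrete at the libaio level (a sketch only - the helper name
and the 4x factor are illustrative, not values suggested here, and fio drives
this through its own libaio engine rather than code like this):

#include <libaio.h>

/*
 * Sketch: set up the AIO context with more events than the bare
 * submission depth so the kernel's percpu reqs_available batching
 * has headroom.  The 4x factor is an arbitrary illustration.
 */
static int setup_aio_ctx(unsigned int iodepth, io_context_t *ctx)
{
        /* io_queue_init() zeroes *ctx and calls the io_setup(2) syscall */
        return io_queue_init(4 * iodepth, ctx);
}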

-ben
-- 
Thought is the essence of where you are now.


Re: scsi-mq V2

2014-07-10 Thread Jens Axboe

On 2014-07-10 15:36, Benjamin LaHaise wrote:

On Wed, Jul 09, 2014 at 11:20:40PM -0700, Christoph Hellwig wrote:

On Thu, Jul 10, 2014 at 12:53:36AM +, Elliott, Robert (Server Storage) 
wrote:

the problem still occurs - fio results in low or 0 IOPS, with perf top
reporting unusual amounts of time spent in do_io_submit and io_submit.


The diff between the two version doesn't show too much other possible
interesting commits, the most interesting being some minor block
updates.

I guess we'll have to do a manual bisect, I've pushed out a
scsi-mq.3-bisect-1 branch that is rebased to just before the merge of
the block tree, and a scsi-mq.3-bisect-2 branch that is just after
the merge of the block tree to get started.


There is one possible concern that could be exacerbated by other changes in
the system: if the application is running close to the bare minimum number
of requests allocated in io_setup(), the per cpu reference counters will
have a hard time batching things.  It might be worth testing with an
increased number of requests being allocated if this is the case.


That's how fio always runs, it sets up the context with the exact queue 
depth that it needs. Do we have a good enough understanding of other aio 
use cases to say that this isn't the norm? I would expect it to be, it's 
the way that the API would most obviously be used.


--
Jens Axboe



Re: scsi-mq V2

2014-07-10 Thread Benjamin LaHaise
On Thu, Jul 10, 2014 at 03:39:57PM +0200, Jens Axboe wrote:
 That's how fio always runs, it sets up the context with the exact queue 
 depth that it needs. Do we have a good enough understanding of other aio 
 use cases to say that this isn't the norm? I would expect it to be, it's 
 the way that the API would most obviously be used.

The problem with this approach is that it works very poorly with per cpu 
reference counting's batching of references, which is pretty much a 
requirement now that many core systems are the norm.  Allocating the bare 
minimum is not the right thing to do today.  That said, the default limits 
on the number of requests probably needs to be raised.

-ben
-- 
Thought is the essence of where you are now.


Re: scsi-mq V2

2014-07-10 Thread Jens Axboe

On 2014-07-10 15:44, Benjamin LaHaise wrote:

On Thu, Jul 10, 2014 at 03:39:57PM +0200, Jens Axboe wrote:

That's how fio always runs, it sets up the context with the exact queue
depth that it needs. Do we have a good enough understanding of other aio
use cases to say that this isn't the norm? I would expect it to be, it's
the way that the API would most obviously be used.


The problem with this approach is that it works very poorly with per cpu
reference counting's batching of references, which is pretty much a
requirement now that many core systems are the norm.  Allocating the bare
minimum is not the right thing to do today.  That said, the default limits
on the number of requests probably needs to be raised.


Sorry, that's a complete cop-out. Then you handle this internally, 
allocate a bigger pool and cap the limit if you need to. Look at the 
API. You pass in the number of requests you will use. Do you expect 
anyone to double up, just in case? Will never happen.


But all of this is side stepping the point that there's a real bug 
reported here. The above could potentially explain the it's using X 
more CPU, or it's Y slower. The above is a softlock, it never completes.


--
Jens Axboe



Re: scsi-mq V2

2014-07-10 Thread Benjamin LaHaise
On Thu, Jul 10, 2014 at 03:48:10PM +0200, Jens Axboe wrote:
 On 2014-07-10 15:44, Benjamin LaHaise wrote:
 On Thu, Jul 10, 2014 at 03:39:57PM +0200, Jens Axboe wrote:
 That's how fio always runs, it sets up the context with the exact queue
 depth that it needs. Do we have a good enough understanding of other aio
 use cases to say that this isn't the norm? I would expect it to be, it's
 the way that the API would most obviously be used.
 
 The problem with this approach is that it works very poorly with per cpu
 reference counting's batching of references, which is pretty much a
 requirement now that many core systems are the norm.  Allocating the bare
 minimum is not the right thing to do today.  That said, the default limits
 on the number of requests probably needs to be raised.
 
 Sorry, that's a complete cop-out. Then you handle this internally, 
 allocate a bigger pool and cap the limit if you need to. Look at the 
 API. You pass in the number of requests you will use. Do you expect 
 anyone to double up, just in case? Will never happen.
 
 But all of this is side stepping the point that there's a real bug 
 reported here. The above could potentially explain the it's using X 
 more CPU, or it's Y slower. The above is a softlock, it never completes.

I'm not trying to cop out on this -- I'm asking for a data point to see 
if changing the request limits has any effect.

-ben

 -- 
 Jens Axboe

-- 
Thought is the essence of where you are now.


Re: scsi-mq V2

2014-07-10 Thread Christoph Hellwig
On Thu, Jul 10, 2014 at 09:36:09AM -0400, Benjamin LaHaise wrote:
 There is one possible concern that could be exacerbated by other changes in 
 the system: if the application is running close to the bare minimum number 
 of requests allocated in io_setup(), the per cpu reference counters will 
 have a hard time batching things.  It might be worth testing with an 
 increased number of requests being allocated if this is the case.

Well, Robert said reverting the two aio commits didn't help.  Either he
didn't manage to boot into the right kernel, or we need to look
elsewhere for the culprit.



Re: scsi-mq V2

2014-07-10 Thread Jens Axboe

On 2014-07-10 15:50, Christoph Hellwig wrote:

On Thu, Jul 10, 2014 at 09:36:09AM -0400, Benjamin LaHaise wrote:

There is one possible concern that could be exacerbated by other changes in
the system: if the application is running close to the bare minimum number
of requests allocated in io_setup(), the per cpu reference counters will
have a hard time batching things.  It might be worth testing with an
increased number of requests being allocated if this is the case.


Well, Robert said reverting the two aio commits didn't help.  Either he
didn't manage to boot into the right kernel, or we need to look
elsewhere for the culprit.


Rob, let me know what scsi_debug setup you use, and I can try and 
reproduce it here as well.


--
Jens Axboe



Re: scsi-mq V2

2014-07-10 Thread Jens Axboe

On 2014-07-10 15:50, Benjamin LaHaise wrote:

On Thu, Jul 10, 2014 at 03:48:10PM +0200, Jens Axboe wrote:

On 2014-07-10 15:44, Benjamin LaHaise wrote:

On Thu, Jul 10, 2014 at 03:39:57PM +0200, Jens Axboe wrote:

That's how fio always runs, it sets up the context with the exact queue
depth that it needs. Do we have a good enough understanding of other aio
use cases to say that this isn't the norm? I would expect it to be, it's
the way that the API would most obviously be used.


The problem with this approach is that it works very poorly with per cpu
reference counting's batching of references, which is pretty much a
requirement now that many core systems are the norm.  Allocating the bare
minimum is not the right thing to do today.  That said, the default limits
on the number of requests probably needs to be raised.


Sorry, that's a complete cop-out. Then you handle this internally,
allocate a bigger pool and cap the limit if you need to. Look at the
API. You pass in the number of requests you will use. Do you expect
anyone to double up, just in case? Will never happen.

But all of this is side stepping the point that there's a real bug
reported here. The above could potentially explain the it's using X
more CPU, or it's Y slower. The above is a softlock, it never completes.


I'm not trying to cop out on this -- I'm asking for a data point to see
if changing the request limits has any effect.


Fair enough, if the question is does it solve the regression, then 
it's a valid data point. Rob/Doug, for fio, you can just double the 
iodepth passed in in engines/libaio:fio_libaio_init() and test with that 
and see if it makes a difference.


--
Jens Axboe



RE: scsi-mq V2

2014-07-10 Thread Elliott, Robert (Server Storage)


 -Original Message-
 From: Jens Axboe [mailto:ax...@kernel.dk]
 Sent: Thursday, 10 July, 2014 8:53 AM
 To: Christoph Hellwig; Benjamin LaHaise
 Cc: Elliott, Robert (Server Storage); dgilb...@interlog.com; James Bottomley;
 Bart Van Assche; linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org
 Subject: Re: scsi-mq V2
 
 On 2014-07-10 15:50, Christoph Hellwig wrote:
  On Thu, Jul 10, 2014 at 09:36:09AM -0400, Benjamin LaHaise wrote:
  There is one possible concern that could be exacerbated by other changes
 in
  the system: if the application is running close to the bare minimum number
  of requests allocated in io_setup(), the per cpu reference counters will
  have a hard time batching things.  It might be worth testing with an
  increased number of requests being allocated if this is the case.
 
  Well, Robert said reverting the two aio commits didn't help.  Either he
  didn't manage to boot into the right kernel, or we need to look
  elsewhere for the culprit.
 
 Rob, let me know what scsi_debug setup you use, and I can try and
 reproduce it here as well.
 
 --
 Jens Axboe

This system has 6 online CPUs and 64 possible CPUs.

Printing avail and req_batch in that loop results in many of these:
** 3813 printk messages dropped ** [10643.503772] ctx 88042d8d4cc0 avail=0 
req_batch=2

Adding CFLAGS_aio.o := -DDEBUG to the Makefile to enable
those pr_debug prints results in nothing extra printing,
so it's not hitting an error.

Printing nr_events and aio_max_nr at the top of ioctx_alloc results in
these as fio starts:

[  186.339064] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339065] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339067] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339068] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339069] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339070] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.339071] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.339071] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.339074] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.339076] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339076] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.359772] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.359971] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.359972] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.359985] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.359986] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.359987] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.359995] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.359995] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.359998] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.359998] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.362529] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.362529] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.363510] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.363513] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.363520] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.363521] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.398113] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.398115] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.398121] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.398122] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.398124] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.398124] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.398130] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.398131] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.398164] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.398165] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.398499] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.400489] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.401478] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.401491] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.434522] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.434523] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.434526] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.434533] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.435370] hrtimer: interrupt took 6868 ns
[  186.435491] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.435492] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.447864] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.449896] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.449900] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.449901] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.449909] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.449932] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.449933] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.461147] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.461176] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.461177] ioctx_alloc

Re: scsi-mq V2

2014-07-10 Thread Benjamin LaHaise
On Thu, Jul 10, 2014 at 02:36:40PM +, Elliott, Robert (Server Storage) 
wrote:
 
 
  -Original Message-
  From: Jens Axboe [mailto:ax...@kernel.dk]
  Sent: Thursday, 10 July, 2014 8:53 AM
  To: Christoph Hellwig; Benjamin LaHaise
  Cc: Elliott, Robert (Server Storage); dgilb...@interlog.com; James 
  Bottomley;
  Bart Van Assche; linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org
  Subject: Re: scsi-mq V2
  
  On 2014-07-10 15:50, Christoph Hellwig wrote:
   On Thu, Jul 10, 2014 at 09:36:09AM -0400, Benjamin LaHaise wrote:
   There is one possible concern that could be exacerbated by other changes
  in
   the system: if the application is running close to the bare minimum 
   number
   of requests allocated in io_setup(), the per cpu reference counters will
   have a hard time batching things.  It might be worth testing with an
   increased number of requests being allocated if this is the case.
  
   Well, Robert said reverting the two aio commits didn't help.  Either he
   didn't manage to boot into the right kernel, or we need to look
   elsewhere for the culprit.
  
  Rob, let me know what scsi_debug setup you use, and I can try and
  reproduce it here as well.
  
  --
  Jens Axboe
 
 This system has 6 online CPUs and 64 possible CPUs.
 
 Printing avail and req_batch in that loop results in many of these:
 ** 3813 printk messages dropped ** [10643.503772] ctx 88042d8d4cc0 
 avail=0 req_batch=2
 
 Adding CFLAGS_aio.o := -DDEBUG to the Makefile to enable
 those pr_debug prints results in nothing extra printing,
 so it's not hitting an error.
 
 Printing nr_events and aio_max_nr at the top of ioctx_alloc results in
 these as fio starts:
 
 [  186.339064] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.339065] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.339067] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.339068] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.339069] ioctx_alloc: nr_events=-2 aio_max_nr=65536

Something is horribly wrong here.  There is no way that value for nr_events 
should be passed in to ioctx_alloc().  This implies that userland is calling 
io_setup() with an impossibly large value for nr_events.  Can you post the 
actual diff for your fs/aio.c relative to linus' tree?

-ben


 [  186.339070] ioctx_alloc: nr_events=512 aio_max_nr=65536
 [  186.339071] ioctx_alloc: nr_events=512 aio_max_nr=65536
 [  186.339071] ioctx_alloc: nr_events=512 aio_max_nr=65536
 [  186.339074] ioctx_alloc: nr_events=512 aio_max_nr=65536
 [  186.339076] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.339076] ioctx_alloc: nr_events=512 aio_max_nr=65536
 [  186.359772] ioctx_alloc: nr_events=512 aio_max_nr=65536
 [  186.359971] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.359972] ioctx_alloc: nr_events=512 aio_max_nr=65536
 [  186.359985] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
 [  186.359986] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.359987] ioctx_alloc: nr_events=512 aio_max_nr=65536
 [  186.359995] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.359995] ioctx_alloc: nr_events=512 aio_max_nr=65536
 [  186.359998] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.359998] ioctx_alloc: nr_events=512 aio_max_nr=65536
 [  186.362529] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.362529] ioctx_alloc: nr_events=512 aio_max_nr=65536
 [  186.363510] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
 [  186.363513] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
 [  186.363520] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
 [  186.363521] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
 [  186.398113] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
 [  186.398115] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
 [  186.398121] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.398122] ioctx_alloc: nr_events=512 aio_max_nr=65536
 [  186.398124] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.398124] ioctx_alloc: nr_events=512 aio_max_nr=65536
 [  186.398130] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.398131] ioctx_alloc: nr_events=512 aio_max_nr=65536
 [  186.398164] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.398165] ioctx_alloc: nr_events=512 aio_max_nr=65536
 [  186.398499] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
 [  186.400489] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
 [  186.401478] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
 [  186.401491] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
 [  186.434522] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.434523] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
 [  186.434526] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
 [  186.434533] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
 [  186.435370] hrtimer: interrupt took 6868 ns
 [  186.435491] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.435492] ioctx_alloc: nr_events=512 aio_max_nr=65536
 [  186.447864] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261

Re: scsi-mq V2

2014-07-10 Thread Jeff Moyer
Benjamin LaHaise b...@kvack.org writes:

 
 [  186.339064] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.339065] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.339067] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.339068] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.339069] ioctx_alloc: nr_events=-2 aio_max_nr=65536

 Something is horribly wrong here.  There is no way that value for nr_events 
 should be passed in to ioctx_alloc().  This implies that userland is calling 
 io_setup() with an impossibly large value for nr_events.  Can you post the 
 actual diff for your fs/aio.c relative to linus' tree?


fio does exactly this!  it passes INT_MAX.
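
For reference, at the libaio level that call amounts to something like the
sketch below (illustrative only, not fio's actual engine code); the
nr_events=-2 in the debug output is consistent with INT_MAX overflowing a
signed int once the kernel doubles it internally:

#include <limits.h>
#include <libaio.h>

/* Sketch: setting up an AIO context the way fio's libaio engine does,
 * passing INT_MAX as the number of events. */
static int setup_like_fio(io_context_t *ctx)
{
        /* io_queue_init() zeroes *ctx and invokes the io_setup(2) syscall */
        return io_queue_init(INT_MAX, ctx);
}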

Cheers,
Jeff


RE: scsi-mq V2

2014-07-10 Thread Elliott, Robert (Server Storage)


 -Original Message-
 From: Christoph Hellwig [mailto:h...@infradead.org]
 Sent: Thursday, 10 July, 2014 1:21 AM
 To: Elliott, Robert (Server Storage)
 Cc: Jens Axboe; dgilb...@interlog.com; Christoph Hellwig; James Bottomley;
 Bart Van Assche; Benjamin LaHaise; linux-scsi@vger.kernel.org; linux-
 ker...@vger.kernel.org
 Subject: Re: scsi-mq V2
 
 On Thu, Jul 10, 2014 at 12:53:36AM +, Elliott, Robert (Server Storage)
 wrote:
  the problem still occurs - fio results in low or 0 IOPS, with perf top
  reporting unusual amounts of time spent in do_io_submit and io_submit.
 
 The diff between the two version doesn't show too much other possible
 interesting commits, the most interesting being some minor block
 updates.
 
  I guess we'll have to do a manual bisect, I've pushed out a

 scsi-mq.3-bisect-1 branch that is rebased to just before the merge of
 the block tree

good.

 and a scsi-mq.3-bisect-2 branch that is just after the merge of the 
 block tree to get started.

good.




Re: scsi-mq V2

2014-07-10 Thread Christoph Hellwig
On Thu, Jul 10, 2014 at 03:51:44PM +, Elliott, Robert (Server Storage) 
wrote:
  scsi-mq.3-bisect-1 branch that is rebased to just before the merge of
  the block tree
 
 good.
 
  and a scsi-mq.3-bisect-2 branch that is just after the merge of the 
  block tree to get started.
 
 good.

It's starting to look weird.  I'll prepare another two bisect branches
around some MM changes, which seems the only other possible candidate.


Re: scsi-mq V2

2014-07-10 Thread Christoph Hellwig
On Thu, Jul 10, 2014 at 09:04:22AM -0700, Christoph Hellwig wrote:
 It's starting to look weird.  I'll prepare another two bisect branches
 around some MM changes, which seems the only other possible candidate.

I've pushed out scsi-mq.3-bisect-3 and scsi-mq.3-bisect-4 for you.


RE: scsi-mq V2

2014-07-10 Thread Elliott, Robert (Server Storage)


 -Original Message-
 From: Christoph Hellwig [mailto:h...@infradead.org]
 Sent: Thursday, 10 July, 2014 11:15 AM
 To: Elliott, Robert (Server Storage)
 Cc: Jens Axboe; dgilb...@interlog.com; James Bottomley; Bart Van Assche;
 Benjamin LaHaise; linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org
 Subject: Re: scsi-mq V2
 
 On Thu, Jul 10, 2014 at 09:04:22AM -0700, Christoph Hellwig wrote:
  It's starting to look weird.  I'll prepare another two bisect branches
  around some MM changes, which seems the only other possible candidate.
 
 I've pushed out scsi-mq.3-bisect-3 

Good.

 and scsi-mq.3-bisect-4 for you.

Bad.

Note: I had to apply the vdso2c.h patch to build this -rc3 based kernel:
diff --git a/arch/x86/vdso/vdso2c.h b/arch/x86/vdso/vdso2c.h
index df95a2f..11b65d4 100644
--- a/arch/x86/vdso/vdso2c.h
+++ b/arch/x86/vdso/vdso2c.h
@@ -93,6 +93,9 @@ static void BITSFUNC(copy_section)(struct BITSFUNC(fake_sections) *out,
 	uint64_t flags = GET_LE(&in->sh_flags);
 
 	bool copy = flags & SHF_ALLOC &&
+		(GET_LE(&in->sh_size) ||
+		 (GET_LE(&in->sh_type) != SHT_RELA &&
+		  GET_LE(&in->sh_type) != SHT_REL)) &&
 		strcmp(name, ".altinstructions") &&
 		strcmp(name, ".altinstr_replacement");

Results: fio started OK, getting 900K IOPS, but ^C led to 0 IOPS and
an fio hang, with one CPU (CPU 0) stuck in io_submit loops.

perf top shows lookup_ioctx function alongside io_submit and
do_io_submit this time:
 14.96%  [kernel] [k] lookup_ioctx
 14.71%  libaio.so.1.0.1  [.] io_submit
 13.78%  [kernel] [k] system_call
 10.79%  [kernel] [k] system_call_after_swapgs
 10.17%  [kernel] [k] do_io_submit
  8.91%  [kernel] [k] copy_user_generic_string
  4.24%  [kernel] [k] io_submit_one
  3.93%  [kernel] [k] blk_flush_plug_list
  3.32%  fio  [.] fio_libaio_commit
  2.84%  [kernel] [k] sysret_check
  2.06%  [kernel] [k] blk_finish_plug
  1.89%  [kernel] [k] SyS_io_submit
  1.48%  [kernel] [k] blk_start_plug
  1.04%  fio  [.] io_submit@plt
  0.84%  [kernel] [k] __get_user_4
  0.74%  [kernel] [k] system_call_fastpath
  0.60%  [kernel] [k] _copy_from_user
  0.51%  diff [.] 0x7abb

ftrace on CPU 0 shows similar repetition to before:
 fio-4107  [000]    389.992300: lookup_ioctx <-do_io_submit
 fio-4107  [000]    389.992300: blk_start_plug <-do_io_submit
 fio-4107  [000]    389.992300: io_submit_one <-do_io_submit
 fio-4107  [000]    389.992300: blk_finish_plug <-do_io_submit
 fio-4107  [000]    389.992300: blk_flush_plug_list <-blk_finish_plug
 fio-4107  [000]    389.992301: SyS_io_submit <-system_call_fastpath
 fio-4107  [000]    389.992301: do_io_submit <-SyS_io_submit
 fio-4107  [000]    389.992301: lookup_ioctx <-do_io_submit
 fio-4107  [000]    389.992301: blk_start_plug <-do_io_submit
 fio-4107  [000]    389.992301: io_submit_one <-do_io_submit
 fio-4107  [000]    389.992301: blk_finish_plug <-do_io_submit
 fio-4107  [000]    389.992301: blk_flush_plug_list <-blk_finish_plug
 fio-4107  [000]    389.992301: SyS_io_submit <-system_call_fastpath
 fio-4107  [000]    389.992302: do_io_submit <-SyS_io_submit
 fio-4107  [000]    389.992302: lookup_ioctx <-do_io_submit
 fio-4107  [000]    389.992302: blk_start_plug <-do_io_submit
 fio-4107  [000]    389.992302: io_submit_one <-do_io_submit
 fio-4107  [000]    389.992302: blk_finish_plug <-do_io_submit
 fio-4107  [000]    389.992302: blk_flush_plug_list <-blk_finish_plug





Re: scsi-mq V2

2014-07-10 Thread Jeff Moyer
Elliott, Robert (Server Storage) elli...@hp.com writes:

 -Original Message-
 From: Christoph Hellwig [mailto:h...@infradead.org]
 Sent: Thursday, 10 July, 2014 11:15 AM
 To: Elliott, Robert (Server Storage)
 Cc: Jens Axboe; dgilb...@interlog.com; James Bottomley; Bart Van Assche;
 Benjamin LaHaise; linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org
 Subject: Re: scsi-mq V2
 
 On Thu, Jul 10, 2014 at 09:04:22AM -0700, Christoph Hellwig wrote:
  It's starting to look weird.  I'll prepare another two bisect branches
  around some MM changes, which seems the only other possible candidate.
 
 I've pushed out scsi-mq.3-bisect-3 

 Good.

 and scsi-mq.3-bisect-4 for you.

 Bad.

 Note: I had to apply the vdso2c.h patch to build this -rc3 based kernel:
 diff --git a/arch/x86/vdso/vdso2c.h b/arch/x86/vdso/vdso2c.h
 index df95a2f..11b65d4 100644
 --- a/arch/x86/vdso/vdso2c.h
 +++ b/arch/x86/vdso/vdso2c.h
 @@ -93,6 +93,9 @@ static void BITSFUNC(copy_section)(struct BITSFUNC(fake_sections) *out,
  	uint64_t flags = GET_LE(&in->sh_flags);
 
  	bool copy = flags & SHF_ALLOC &&
 +		(GET_LE(&in->sh_size) ||
 +		 (GET_LE(&in->sh_type) != SHT_RELA &&
 +		  GET_LE(&in->sh_type) != SHT_REL)) &&
  		strcmp(name, ".altinstructions") &&
  		strcmp(name, ".altinstr_replacement");

 Results: fio started OK, getting 900K IOPS, but ^C led to 0 IOPS and
 an fio hang, with one CPU (CPU 0) stuck in io_submit loops.

Hi, Rob,

Can you get sysrq-t output for me?  I don't know how/why we'd continue
to get io_submits for an exiting process.

Thanks,
Jeff


Re: scsi-mq V2

2014-07-10 Thread Jeff Moyer
Jeff Moyer jmo...@redhat.com writes:

 Hi, Rob,

 Can you get sysrq-t output for me?  I don't know how/why we'd continue
 to get io_submits for an exiting process.

Also, do you know what sys_io_submit is returning?


Re: scsi-mq V2

2014-07-10 Thread Jens Axboe

On 2014-07-10 17:11, Jeff Moyer wrote:

Benjamin LaHaise b...@kvack.org writes:



[  186.339064] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339065] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339067] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339068] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339069] ioctx_alloc: nr_events=-2 aio_max_nr=65536


Something is horribly wrong here.  There is no way that value for nr_events
should be passed in to ioctx_alloc().  This implies that userland is calling
io_setup() with an impossibly large value for nr_events.  Can you post the
actual diff for your fs/aio.c relative to linus' tree?



fio does exactly this!  it passes INT_MAX.


That's correct, I had actually forgotten about this. It was a change 
made a few years back, in correlation with the aio optimizations posted 
then, basically telling aio to ignore that silly (and broken) user ring.


--
Jens Axboe



Re: scsi-mq V2

2014-07-10 Thread Jeff Moyer
Jens Axboe ax...@kernel.dk writes:

 On 2014-07-10 17:11, Jeff Moyer wrote:
 Benjamin LaHaise b...@kvack.org writes:


 [  186.339064] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.339065] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.339067] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.339068] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.339069] ioctx_alloc: nr_events=-2 aio_max_nr=65536

 Something is horribly wrong here.  There is no way that value for nr_events
 should be passed in to ioctx_alloc().  This implies that userland is calling
 io_setup() with an impossibly large value for nr_events.  Can you post the
 actual diff for your fs/aio.c relative to linus' tree?


 fio does exactly this!  it passes INT_MAX.

 That's correct, I had actually forgotten about this. It was a change
 made a few years back, in correlation with the aio optimizations
 posted then, basically telling aio to ignore that silly (and broken)
 user ring.

I still don't see how you accomplish that.  Making it bigger doesn't get
rid of it.  ;-)

Cheers,
Jeff


Re: scsi-mq V2

2014-07-10 Thread Jens Axboe

On 2014-07-10 22:05, Jeff Moyer wrote:

Jens Axboe ax...@kernel.dk writes:


On 2014-07-10 17:11, Jeff Moyer wrote:

Benjamin LaHaise b...@kvack.org writes:



[  186.339064] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339065] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339067] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339068] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339069] ioctx_alloc: nr_events=-2 aio_max_nr=65536


Something is horribly wrong here.  There is no way that value for nr_events
should be passed in to ioctx_alloc().  This implies that userland is calling
io_setup() with an impossibly large value for nr_events.  Can you post the
actual diff for your fs/aio.c relative to linus' tree?



fio does exactly this!  it passes INT_MAX.


That's correct, I had actually forgotten about this. It was a change
made a few years back, in correlation with the aio optimizations
posted then, basically telling aio to ignore that silly (and broken)
user ring.


I still don't see how you accomplish that.  Making it bigger doesn't get
rid of it.  ;-)


See the patches from back then - INT_MAX basically just meant the same 
as 0, but 0 could not be used because of the (silly) setup with the 
wrappers around the syscalls. So INT_MAX was overloaded to mean "no ring 
events, I don't care".
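
A sketch of that call pattern, assuming it mirrors what fio's libaio engine is said to do above (using the raw syscall so the value reaches ioctx_alloc() unmodified); on a kernel without those old patches the call is expected to fail rather than allocate an enormous ring:

#include <linux/aio_abi.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <limits.h>
#include <errno.h>
#include <stdio.h>

int main(void)
{
	aio_context_t ctx = 0;
	long ret = syscall(SYS_io_setup, INT_MAX, &ctx);

	/* The old fio/aio patches Jens mentions treated INT_MAX as "don't
	 * allocate a user ring"; a stock kernel is expected to reject it. */
	printf("io_setup(INT_MAX) = %ld (errno=%d)\n", ret, ret < 0 ? errno : 0);

	if (ret == 0)
		syscall(SYS_io_destroy, ctx);
	return 0;
}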


--
Jens Axboe



RE: scsi-mq V2

2014-07-10 Thread Elliott, Robert (Server Storage)


 -Original Message-
 From: Jeff Moyer [mailto:jmo...@redhat.com]
 Sent: Thursday, 10 July, 2014 2:14 PM
 To: Elliott, Robert (Server Storage)
 Cc: Christoph Hellwig; Jens Axboe; dgilb...@interlog.com; James Bottomley;
 Bart Van Assche; Benjamin LaHaise; linux-scsi@vger.kernel.org; linux-
 ker...@vger.kernel.org
 Subject: Re: scsi-mq V2
 
 Elliott, Robert (Server Storage) elli...@hp.com writes:
 
  -Original Message-
  From: Christoph Hellwig [mailto:h...@infradead.org]
  Sent: Thursday, 10 July, 2014 11:15 AM
  To: Elliott, Robert (Server Storage)
  Cc: Jens Axboe; dgilb...@interlog.com; James Bottomley; Bart Van Assche;
  Benjamin LaHaise; linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org
  Subject: Re: scsi-mq V2
 
  On Thu, Jul 10, 2014 at 09:04:22AM -0700, Christoph Hellwig wrote:
   It's starting to look weird.  I'll prepare another two bisect branches
   around some MM changes, which seems the only other possible candidate.
 
  I've pushed out scsi-mq.3-bisect-3
 
  Good.
 
  and scsi-mq.3-bisect-4 for you.
 
  Bad.
 
  Note: I had to apply the vdso2c.h patch to build this -rc3 based kernel:
  diff --git a/arch/x86/vdso/vdso2c.h b/arch/x86/vdso/vdso2c.h
  index df95a2f..11b65d4 100644
  --- a/arch/x86/vdso/vdso2c.h
  +++ b/arch/x86/vdso/vdso2c.h
  @@ -93,6 +93,9 @@ static void BITSFUNC(copy_section)(struct BITSFUNC(fake_sections) *out,
  uint64_t flags = GET_LE(&in->sh_flags);
 
  bool copy = flags & SHF_ALLOC &&
  +   (GET_LE(&in->sh_size) ||
  +   (GET_LE(&in->sh_type) != SHT_RELA &&
  +   GET_LE(&in->sh_type) != SHT_REL)) &&
  strcmp(name, ".altinstructions") &&
  strcmp(name, ".altinstr_replacement");
 
  Results: fio started OK, getting 900K IOPS, but ^C led to 0 IOPS and
  an fio hang, with one CPU (CPU 0) stuck in io_submit loops.
 

I added some prints in aio_setup_ring and  ioctx_alloc and
rebooted.  This time it took much longer to hit the problem.  It 
survived dozens of ^Cs.  Running a few minutes, though, IOPS 
eventually dropped.  So, sometimes it happens immediately,
sometimes it takes time to develop.

I will rerun bisect-1 -2 and -3 for longer times to increase
confidence that they didn't just appear good.

On this bisect-4 run, as IOPS started to drop from 900K to 40K, 
I ran perf top when it was at 700K.  You can see io_submit times
creeping up.

  4.30%  [kernel][k] do_io_submit
  4.29%  [kernel][k] _raw_spin_lock_irqsave
  3.88%  libaio.so.1.0.1 [.] io_submit
  3.55%  [kernel][k] system_call
  3.34%  [kernel][k] put_compound_page
  3.11%  [kernel][k] io_submit_one
  3.06%  [kernel][k] system_call_after_swapgs
  2.89%  [kernel][k] copy_user_generic_string
  2.45%  [kernel][k] lookup_ioctx
  2.16%  [kernel][k] apic_timer_interrupt
  2.00%  [kernel][k] _raw_spin_lock
  1.97%  [scsi_debug][k] sdebug_q_cmd_hrt_complete
  1.84%  [kernel][k] __get_page_tail
  1.74%  [kernel][k] do_blockdev_direct_IO
  1.68%  [kernel][k] blk_flush_plug_list
  1.41%  [kernel][k] _raw_spin_unlock_irqrestore
  1.24%  [scsi_debug][k] schedule_resp

finally settling like before:
 14.15%  [kernel][k] do_io_submit
 13.61%  libaio.so.1.0.1 [.] io_submit
 11.81%  [kernel][k] system_call
 10.11%  [kernel][k] system_call_after_swapgs
  8.59%  [kernel][k] io_submit_one
  8.56%  [kernel][k] copy_user_generic_string
  7.96%  [kernel][k] lookup_ioctx
  5.33%  [kernel][k] blk_flush_plug_list
  3.11%  [kernel][k] blk_finish_plug
  2.84%  [kernel][k] sysret_check
  2.63%  fio [.] fio_libaio_commit
  2.27%  [kernel][k] blk_start_plug
  1.17%  [kernel][k] SyS_io_submit



Re: scsi-mq V2

2014-07-09 Thread Douglas Gilbert

On 14-07-08 10:48 AM, Christoph Hellwig wrote:

On Wed, Jun 25, 2014 at 06:51:47PM +0200, Christoph Hellwig wrote:

Changes from V1:
  - rebased on top of the core-for-3.17 branch, most notably the
scsi logging changes
  - fixed handling of cmd_list to prevent crashes for some heavy
workloads
  - fixed incorrect handling of !target->can_queue
  - avoid scheduling a workqueue on I/O completions when no queues
are congested

In addition to the patches in this thread there also is a git available at:

git://git.infradead.org/users/hch/scsi.git scsi-mq.2



I've pushed out a new scsi-mq.3 branch, which has been rebased on the
latest core-for-3.17 tree + the "RFC: clean up command setup" series
from June 29th.  Robert Elliott found a problem with not fully zeroed
out UNMAP CDBs, which is fixed by the saner discard handling in that
series.

There is a new patch to factor the code from the above series for
blk-mq use, which I've attached below.  Besides that the only changes
are minor merge fixups in the main blk-mq usage patch.


Be warned: both Rob Elliott and I can easily break
the scsi-mq.3 branch. It seems as though a regression
has slipped in. I notice that Christoph has added a
new branch called scsi-mq.3-no-rebase.

For those interested, watch this space.

Doug Gilbert



Re: scsi-mq V2

2014-07-09 Thread Jens Axboe

On 2014-07-09 18:39, Douglas Gilbert wrote:

On 14-07-08 10:48 AM, Christoph Hellwig wrote:

On Wed, Jun 25, 2014 at 06:51:47PM +0200, Christoph Hellwig wrote:

Changes from V1:
  - rebased on top of the core-for-3.17 branch, most notably the
scsi logging changes
  - fixed handling of cmd_list to prevent crashes for some heavy
workloads
  - fixed incorrect handling of !target->can_queue
  - avoid scheduling a workqueue on I/O completions when no queues
are congested

In addition to the patches in this thread there also is a git
available at:

git://git.infradead.org/users/hch/scsi.git scsi-mq.2



I've pushed out a new scsi-mq.3 branch, which has been rebased on the
latest core-for-3.17 tree + the "RFC: clean up command setup" series
from June 29th.  Robert Elliott found a problem with not fully zeroed
out UNMAP CDBs, which is fixed by the saner discard handling in that
series.

There is a new patch to factor the code from the above series for
blk-mq use, which I've attached below.  Besides that the only changes
are minor merge fixups in the main blk-mq usage patch.


Be warned: both Rob Elliott and I can easily break
the scsi-mq.3 branch. It seems as though a regression
has slipped in. I notice that Christoph has added a
new branch called scsi-mq.3-no-rebase.


Rob/Doug, those issues look very much like problems in the aio code. Can 
either/both of you try with:


f8567a3845ac05bb28f3c1b478ef752762bd39ef
edfbbf388f293d70bf4b7c0bc38774d05e6f711a

reverted (in that order) and see if that changes anything.


--
Jens Axboe



RE: scsi-mq V2

2014-07-09 Thread Elliott, Robert (Server Storage)


 -Original Message-
 From: Jens Axboe [mailto:ax...@kernel.dk]
 Sent: Wednesday, 09 July, 2014 2:38 PM
 To: dgilb...@interlog.com; Christoph Hellwig; James Bottomley; Bart Van
 Assche; Elliott, Robert (Server Storage); linux-scsi@vger.kernel.org; linux-
 ker...@vger.kernel.org
 Subject: Re: scsi-mq V2
 
 On 2014-07-09 18:39, Douglas Gilbert wrote:
  On 14-07-08 10:48 AM, Christoph Hellwig wrote:
  On Wed, Jun 25, 2014 at 06:51:47PM +0200, Christoph Hellwig wrote:
  Changes from V1:
- rebased on top of the core-for-3.17 branch, most notably the
  scsi logging changes
- fixed handling of cmd_list to prevent crashes for some heavy
  workloads
- fixed incorrect handling of !target->can_queue
- avoid scheduling a workqueue on I/O completions when no queues
  are congested
 
  In addition to the patches in this thread there also is a git
  available at:
 
  git://git.infradead.org/users/hch/scsi.git scsi-mq.2
 
 
  I've pushed out a new scsi-mq.3 branch, which has been rebased on the
  latest core-for-3.17 tree + the "RFC: clean up command setup" series
  from June 29th.  Robert Elliott found a problem with not fully zeroed
  out UNMAP CDBs, which is fixed by the saner discard handling in that
  series.
 
  There is a new patch to factor the code from the above series for
  blk-mq use, which I've attached below.  Besides that the only changes
  are minor merge fixups in the main blk-mq usage patch.
 
  Be warned: both Rob Elliott and I can easily break
  the scsi-mq.3 branch. It seems as though a regression
  has slipped in. I notice that Christoph has added a
  new branch called scsi-mq.3-no-rebase.
 
 Rob/Doug, those issues look very much like problems in the aio code. Can
 either/both of you try with:
 
 f8567a3845ac05bb28f3c1b478ef752762bd39ef
 edfbbf388f293d70bf4b7c0bc38774d05e6f711a
 
 reverted (in that order) and see if that changes anything.
 
 
 --
 Jens Axboe

scsi-mq.3-no-rebase, which has all the scsi updates from scsi-mq.3
but is based on 3.16.0-rc2 rather than 3.16.0-rc4, works fine:
* ^C exits fio cleanly with scsi_debug devices
* ^C exits fio cleanly with mpt3sas devices
* fio hits 1M IOPS with 16 hpsa devices
* fio hits 700K IOPS with 6 mpt3sas devices
* 38 device test to mpt3sas, hpsa, and scsi_debug devices runs OK


With:
* scsi-mq.3, which is based on 3.16.0-rc4
* [PATCH] x86-64: fix vDSO build from https://lkml.org/lkml/2014/7/3/738
* those two aio patches reverted

the problem still occurs - fio results in low or 0 IOPS, with perf top 
reporting unusual amounts of time spent in do_io_submit and io_submit.

perf top:
 14.38%  [kernel]  [k] do_io_submit
 13.71%  libaio.so.1.0.1   [.] io_submit
 13.32%  [kernel]  [k] system_call
 11.60%  [kernel]  [k] system_call_after_swapgs
  8.88%  [kernel]  [k] lookup_ioctx
  8.78%  [kernel]  [k] copy_user_generic_string
  7.78%  [kernel]  [k] io_submit_one
  5.97%  [kernel]  [k] blk_flush_plug_list
  2.73%  fio   [.] fio_libaio_commit
  2.70%  [kernel]  [k] sysret_check
  2.68%  [kernel]  [k] blk_finish_plug
  1.98%  [kernel]  [k] blk_start_plug
  1.17%  [kernel]  [k] SyS_io_submit
  1.17%  [kernel]  [k] __get_user_4
  0.99%  fio   [.] io_submit@plt
  0.85%  [kernel]  [k] _copy_from_user
  0.79%  [kernel]  [k] system_call_fastpath

Repeating some of last night's investigation details for the lists:

ftrace of one of the CPUs for all functions shows these 
are repeatedly being called:
 
   <...>-34508 [004]   6360.790714: io_submit_one <-do_io_submit
   <...>-34508 [004]   6360.790714: blk_finish_plug <-do_io_submit
   <...>-34508 [004]   6360.790714: blk_flush_plug_list <-blk_finish_plug
   <...>-34508 [004]   6360.790714: SyS_io_submit <-system_call_fastpath
   <...>-34508 [004]   6360.790715: do_io_submit <-SyS_io_submit
   <...>-34508 [004]   6360.790715: lookup_ioctx <-do_io_submit
   <...>-34508 [004]   6360.790715: blk_start_plug <-do_io_submit
   <...>-34508 [004]   6360.790715: io_submit_one <-do_io_submit
   <...>-34508 [004]   6360.790715: blk_finish_plug <-do_io_submit
   <...>-34508 [004]   6360.790715: blk_flush_plug_list <-blk_finish_plug
   <...>-34508 [004]   6360.790715: SyS_io_submit <-system_call_fastpath
   <...>-34508 [004]   6360.790715: do_io_submit <-SyS_io_submit
   <...>-34508 [004]   6360.790715: lookup_ioctx <-do_io_submit
   <...>-34508 [004]   6360.790716: blk_start_plug <-do_io_submit
   <...>-34508 [004]   6360.790716: io_submit_one <-do_io_submit
   <...>-34508 [004]   6360.790716: blk_finish_plug <-do_io_submit
   <...>-34508 [004]   6360.790716: blk_flush_plug_list <-blk_finish_plug
   <...>-34508 [004]   6360.790716

Re: scsi-mq V2

2014-07-08 Thread Christoph Hellwig
On Wed, Jun 25, 2014 at 06:51:47PM +0200, Christoph Hellwig wrote:
 Changes from V1:
  - rebased on top of the core-for-3.17 branch, most notably the
scsi logging changes
  - fixed handling of cmd_list to prevent crashes for some heavy
workloads
  - fixed incorrect handling of !target->can_queue
  - avoid scheduling a workqueue on I/O completions when no queues
are congested
 
 In addition to the patches in this thread there also is a git available at:
 
   git://git.infradead.org/users/hch/scsi.git scsi-mq.2


I've pushed out a new scsi-mq.3 branch, which has been rebased on the
latest core-for-3.17 tree + the "RFC: clean up command setup" series
from June 29th.  Robert Elliott found a problem with not fully zeroed
out UNMAP CDBs, which is fixed by the saner discard handling in that
series.

There is a new patch to factor the code from the above series for
blk-mq use, which I've attached below.  Besides that the only changes
are minor merge fixups in the main blk-mq usage patch.

---
From f925c317c74849666d599926d8ad8f34ef99d5cf Mon Sep 17 00:00:00 2001
From: Christoph Hellwig h...@lst.de
Date: Tue, 8 Jul 2014 13:16:17 +0200
Subject: scsi: add scsi_setup_cmnd helper

Factor out command setup code that will be shared with the blk-mq code path.

Signed-off-by: Christoph Hellwig h...@lst.de
---
 drivers/scsi/scsi_lib.c |   40 ++--
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 116f541..61afae8 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1116,6 +1116,27 @@ static int scsi_setup_fs_cmnd(struct scsi_device *sdev, struct request *req)
	return scsi_cmd_to_driver(cmd)->init_command(cmd);
 }
 
+static int scsi_setup_cmnd(struct scsi_device *sdev, struct request *req)
+{
+   struct scsi_cmnd *cmd = req->special;
+
+   if (!blk_rq_bytes(req))
+   cmd->sc_data_direction = DMA_NONE;
+   else if (rq_data_dir(req) == WRITE)
+   cmd->sc_data_direction = DMA_TO_DEVICE;
+   else
+   cmd->sc_data_direction = DMA_FROM_DEVICE;
+
+   switch (req->cmd_type) {
+   case REQ_TYPE_FS:
+   return scsi_setup_fs_cmnd(sdev, req);
+   case REQ_TYPE_BLOCK_PC:
+   return scsi_setup_blk_pc_cmnd(sdev, req);
+   default:
+   return BLKPREP_KILL;
+   }
+}
+
 static int
 scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
 {
@@ -1219,24 +1240,7 @@ static int scsi_prep_fn(struct request_queue *q, struct request *req)
goto out;
}
 
-   if (!blk_rq_bytes(req))
-   cmd->sc_data_direction = DMA_NONE;
-   else if (rq_data_dir(req) == WRITE)
-   cmd->sc_data_direction = DMA_TO_DEVICE;
-   else
-   cmd->sc_data_direction = DMA_FROM_DEVICE;
-
-   switch (req->cmd_type) {
-   case REQ_TYPE_FS:
-   ret = scsi_setup_fs_cmnd(sdev, req);
-   break;
-   case REQ_TYPE_BLOCK_PC:
-   ret = scsi_setup_blk_pc_cmnd(sdev, req);
-   break;
-   default:
-   ret = BLKPREP_KILL;
-   }
-
+   ret = scsi_setup_cmnd(sdev, req);
 out:
return scsi_prep_return(q, req, ret);
 }
-- 
1.7.10.4



Re: scsi-mq V2

2014-06-30 Thread Jens Axboe
On 06/25/2014 10:50 PM, Jens Axboe wrote:
 On 2014-06-25 10:51, Christoph Hellwig wrote:
 This is the second post of the scsi-mq series.

 At this point the code is ready for merging and use by developers and
 early
 adopters.  The core blk-mq code isn't that suitable for slow devices
 yet, mostly due to the lack of an I/O scheduler, but Jens is working
 on it.
 Similarly there is no dm-multipath support for drivers using blk-mq yet,
 but I'm working on it.  It should also be noted that the code doesn't
 actually support multiple hardware queues or fine grained tuning of the
 blk-mq parameters yet.  All these could be added fairly easily as soon
 as low-level drivers want to make use of them.

 The amount of changes to the existing code is fairly small, and mostly
 speedups or cleanups that also apply to the old path as well.  Because
 of this I also haven't bothered to put it under a config option, just
 like the blk-mq core.

 The usage of blk-mq dramatically decreases CPU usage under all
 workloads going
 down from 100% CPU usage that the old setup can hit easily to usually
 less
 than 20% for maxing out storage subsystems with 512byte reads and writes,
 and it allows to easily achieve millions of IOPS.  Bart and Robert have
 helped with some very detailed measurements that they might be able to
 send
 in reply to this, although these usually involve significantly
 reworked low
 level drivers to avoid other bottlenecks.

 One major objection to previous iterations of this code was the simple
 replacement of the host_lock with atomic counters for the host and busy
 counters.  The host_lock avoidance on its own already improves
 performance,
 and with the patch to avoid maintaining the per-target busy counter
 unless
 needed we now replace a lock round trip on the host_lock with just a
 single
 atomic increment in the submission path, and a single atomic decrement in
 completion path, which should provide benefits even for the oddest RISC
 architecture.  Longer term I'd still love to get rid of these entirely
 and use the counters in blk-mq, but due to the difference in how they
 are maintained this doesn't seem feasible as long as we still need to
 support the legacy request code path.

 Changes from V1:
   - rebased on top of the core-for-3.17 branch, most notably the
 scsi logging changes
   - fixed handling of cmd_list to prevent crashes for some heavy
 workloads
   - fixed incorrect handling of !target->can_queue
   - avoid scheduling a workqueue on I/O completions when no queues
 are congested

 In addition to the patches in this thread there also is a git
 available at:

 git://git.infradead.org/users/hch/scsi.git scsi-mq.2
 
 You can add my acked/reviewed-by to the series.

Ran stress testing from Friday to now, 65h of beating up on it and no
problems observed. 47TB read and 20TB written for a total of 17.7
billion IOs issued and completed. Latencies look good. I officially
declare this code bug free.

Bug-free-by: Jens Axboe ax...@fb.com

Now lets get this queued up for inclusion, pretty please.

-- 
Jens Axboe



Re: scsi-mq V2

2014-06-30 Thread Christoph Hellwig
On Mon, Jun 30, 2014 at 09:20:51AM -0600, Jens Axboe wrote:
 Ran stress testing from Friday to now, 65h of beating up on it and no
 problems observed. 47TB read and 20TB written for a total of 17.7
 billion IOs issued and completed. Latencies look good. I officially
 declare this code bug free.
 
 Bug-free-by: Jens Axboe ax...@fb.com
 
 Now lets get this queued up for inclusion, pretty please.

I'm still looking for one (or better, two) persons familiar with the
SCSI and/or block code to go over it and do a real detailed review.



Re: scsi-mq V2

2014-06-30 Thread Martin K. Petersen
 Christoph == Christoph Hellwig h...@infradead.org writes:

Christoph I'm still looking for one (or better, two) persons familiar
Christoph with the SCSI and/or block code to go over it and do a real
Christoph detailed review.

I'm on vacation for a couple of days. Will review Wednesday.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: scsi-mq V2

2014-06-27 Thread Bart Van Assche
On 06/27/14 00:07, Elliott, Robert (Server Storage) wrote:
 -Original Message-
 From: Jens Axboe [mailto:ax...@kernel.dk]
 Sent: Wednesday, 25 June, 2014 11:51 PM
 To: Christoph Hellwig; James Bottomley
 Cc: Bart Van Assche; Elliott, Robert (Server Storage); linux-
 s...@vger.kernel.org; linux-ker...@vger.kernel.org
 Subject: Re: scsi-mq V2

 On 2014-06-25 10:51, Christoph Hellwig wrote:
 This is the second post of the scsi-mq series.

 ...

 Changes from V1:
   - rebased on top of the core-for-3.17 branch, most notably the
 scsi logging changes
   - fixed handling of cmd_list to prevent crashes for some heavy
 workloads
   - fixed incorrect handling of !target->can_queue
   - avoid scheduling a workqueue on I/O completions when no queues
 are congested

 In addition to the patches in this thread there also is a git available at:

 git://git.infradead.org/users/hch/scsi.git scsi-mq.2

 You can add my acked/reviewed-by to the series.
 
 Since March 20th (circa LSF-MM 2014) we've run many hours of tests
 with hpsa and the scsi-mq tree.  We've also done a little bit of 
 testing with mpt3sas and, in the last few days, scsi_debug.
 
 Although there are certainly more problems to find and improvements
 to be made, it's become quite stable.  It's even been used on the
 boot drives of our test servers.
 
 For the patches in scsi-mq.2 you may add:
 Tested-by: Robert Elliott elli...@hp.com

Performance of scsi-mq-v2 looks even better than that of scsi-mq-v1. The
slight single-LUN regression is gone, peak IOPS with use_blk_mq=Y on my
test setup is now 3x the performance of use_blk_mq=N and latency has
been reduced further. I think this means reducing the number of context
switches did really help :-) Detailed measurement results can be found
on https://drive.google.com/file/d/0B1YQOreL3_FxWmZfbl8xSzRfdGM/.

If you want you may add to the scsi-mq-v2 patch series:

Tested-by: Bart Van Assche bvanass...@acm.org

Bart.


RE: scsi-mq V2

2014-06-26 Thread Elliott, Robert (Server Storage)


 -Original Message-
 From: Jens Axboe [mailto:ax...@kernel.dk]
 Sent: Wednesday, 25 June, 2014 11:51 PM
 To: Christoph Hellwig; James Bottomley
 Cc: Bart Van Assche; Elliott, Robert (Server Storage); linux-
 s...@vger.kernel.org; linux-ker...@vger.kernel.org
 Subject: Re: scsi-mq V2
 
 On 2014-06-25 10:51, Christoph Hellwig wrote:
  This is the second post of the scsi-mq series.
 
...
 
  Changes from V1:
- rebased on top of the core-for-3.17 branch, most notably the
  scsi logging changes
- fixed handling of cmd_list to prevent crashes for some heavy
  workloads
- fixed incorrect handling of !target->can_queue
- avoid scheduling a workqueue on I/O completions when no queues
  are congested
 
  In addition to the patches in this thread there also is a git available at:
 
  git://git.infradead.org/users/hch/scsi.git scsi-mq.2
 
 You can add my acked/reviewed-by to the series.
 
 --
 Jens Axboe

Since March 20th (circa LSF-MM 2014) we've run many hours of tests
with hpsa and the scsi-mq tree.  We've also done a little bit of 
testing with mpt3sas and, in the last few days, scsi_debug.

Although there are certainly more problems to find and improvements
to be made, it's become quite stable.  It's even been used on the
boot drives of our test servers.

For the patches in scsi-mq.2 you may add:
Tested-by: Robert Elliott elli...@hp.com


---
Rob Elliott    HP Server Storage





scsi-mq V2

2014-06-25 Thread Christoph Hellwig
This is the second post of the scsi-mq series.

At this point the code is ready for merging and use by developers and early
adopters.  The core blk-mq code isn't that suitable for slow devices
yet, mostly due to the lack of an I/O scheduler, but Jens is working on it.
Similarly there is no dm-multipath support for drivers using blk-mq yet,
but I'm working on it.  It should also be noted that the code doesn't
actually support multiple hardware queues or fine grained tuning of the
blk-mq parameters yet.  All these could be added fairly easily as soon
as low-level drivers want to make use of them.

The amount of changes to the existing code is fairly small, and mostly
speedups or cleanups that also apply to the old path as well.  Because
of this I also haven't bothered to put it under a config option, just
like the blk-mq core.

The usage of blk-mq dramatically decreases CPU usage under all workloads going
down from 100% CPU usage that the old setup can hit easily to usually less
than 20% for maxing out storage subsystems with 512byte reads and writes,
and it allows to easily achieve millions of IOPS.  Bart and Robert have
helped with some very detailed measurements that they might be able to send
in reply to this, although these usually involve significantly reworked low
level drivers to avoid other bottlenecks.

One major objection to previous iterations of this code was the simple
replacement of the host_lock with atomic counters for the host and busy
counters.  The host_lock avoidance on its own already improves performance,
and with the patch to avoid maintaining the per-target busy counter unless
needed we now replace a lock round trip on the host_lock with just a single
atomic increment in the submission path, and a single atomic decrement in
completion path, which should provide benefits even for the oddest RISC
architecture.  Longer term I'd still love to get rid of these entirely
and use the counters in blk-mq, but due to the difference in how they
are maintained this doesn't seem feasible as long as we still need to
support the legacy request code path.
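
Stripped of the SCSI specifics, the pattern being described is roughly the following; this is a schematic userspace analogue using C11 atomics, not the actual kernel code, with host_busy and can_queue used here only as illustrative names:

#include <stdatomic.h>
#include <stdbool.h>

struct host {
	atomic_uint  host_busy;	/* commands in flight */
	unsigned int can_queue;	/* queue depth limit */
};

/* submission path: one atomic increment to claim a slot, backed off
 * again if the host is already saturated */
static bool host_get_budget(struct host *h)
{
	unsigned int busy = atomic_fetch_add(&h->host_busy, 1) + 1;

	if (busy > h->can_queue) {
		atomic_fetch_sub(&h->host_busy, 1);	/* undo; caller requeues */
		return false;
	}
	return true;
}

/* completion path: one atomic decrement to release the slot */
static void host_put_budget(struct host *h)
{
	atomic_fetch_sub(&h->host_busy, 1);
}

int main(void)
{
	struct host h = { .host_busy = 0, .can_queue = 2 };

	/* two submissions fit; the third bounces until one completes */
	host_get_budget(&h);
	host_get_budget(&h);
	bool rejected = !host_get_budget(&h);
	host_put_budget(&h);
	host_put_budget(&h);
	return rejected ? 0 : 1;
}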

Changes from V1:
 - rebased on top of the core-for-3.17 branch, most notably the
   scsi logging changes
 - fixed handling of cmd_list to prevent crashes for some heavy
   workloads
 - fixed incorrect handling of !target->can_queue
 - avoid scheduling a workqueue on I/O completions when no queues
   are congested

In addition to the patches in this thread there also is a git available at:

git://git.infradead.org/users/hch/scsi.git scsi-mq.2

This work was sponsored by the ION division of Fusion IO.



Re: scsi-mq V2

2014-06-25 Thread Jens Axboe

On 2014-06-25 10:51, Christoph Hellwig wrote:

This is the second post of the scsi-mq series.

At this point the code is ready for merging and use by developers and early
adopters.  The core blk-mq code isn't that suitable for slow devices
yet, mostly due to the lack of an I/O scheduler, but Jens is working on it.
Similarly there is no dm-multipath support for drivers using blk-mq yet,
but I'm working on it.  It should also be noted that the code doesn't
actually support multiple hardware queues or fine grained tuning of the
blk-mq parameters yet.  All these could be added fairly easily as soon
as low-level drivers want to make use of them.

The amount of changes to the existing code is fairly small, and mostly
speedups or cleanups that also apply to the old path as well.  Because
of this I also haven't bothered to put it under a config option, just
like the blk-mq core.

The usage of blk-mq dramatically decreases CPU usage under all workloads going
down from 100% CPU usage that the old setup can hit easily to usually less
than 20% for maxing out storage subsystems with 512byte reads and writes,
and it allows to easily achieve millions of IOPS.  Bart and Robert have
helped with some very detailed measurements that they might be able to send
in reply to this, although these usually involve significantly reworked low
level drivers to avoid other bottlenecks.

One major objection to previous iterations of this code was the simple
replacement of the host_lock with atomic counters for the host and busy
counters.  The host_lock avoidance on its own already improves performance,
and with the patch to avoid maintaining the per-target busy counter unless
needed we now replace a lock round trip on the host_lock with just a single
atomic increment in the submission path, and a single atomic decrement in
completion path, which should provide benefits even for the oddest RISC
architecture.  Longer term I'd still love to get rid of these entirely
and use the counters in blk-mq, but due to the difference in how they
are maintained this doesn't seem feasible as long as we still need to
support the legacy request code path.

Changes from V1:
  - rebased on top of the core-for-3.17 branch, most notably the
scsi logging changes
  - fixed handling of cmd_list to prevent crashes for some heavy
workloads
  - fixed incorrect handling of !target->can_queue
  - avoid scheduling a workqueue on I/O completions when no queues
are congested

In addition to the patches in this thread there also is a git available at:

git://git.infradead.org/users/hch/scsi.git scsi-mq.2


You can add my acked/reviewed-by to the series.

--
Jens Axboe
