On Fri, 2012-07-06 at 09:43 +0400, James Bottomley wrote:
> On Thu, 2012-07-05 at 20:01 -0700, Nicholas A. Bellinger wrote:
> 
> > So I'm pretty sure this discrepancy is attributed to the small block
> > random I/O bottleneck currently present for all Linux/SCSI core LLDs
> > regardless of physical or virtual storage fabric.
> > 
> > The SCSI wide host-lock less conversion that happened in .38 code back
> > in 2010, and subsequently having LLDs like virtio-scsi convert to run in
> > host-lock-less mode have helped to some extent..  But it's still not
> > enough..
> > 
> > Another example where we've been able to prove this bottleneck recently
> > is with the following target setup:
> > 
> > *) Intel Romley production machines with 128 GB of DDR-3 memory
> > *) 4x FusionIO ioDrive 2 (1.5 TB @ PCI-e Gen2 x2)
> > *) Mellanox PCI-express Gen3 HCA running at 56 Gb/sec
> > *) Infiniband SRP Target backported to RHEL 6.2 + latest OFED
> > 
> > In this setup using ib_srpt + IBLOCK w/ emulate_write_cache=1 +
> > iomemory_vsl export we end up avoiding SCSI core bottleneck on the
> > target machine, just as with the tcm_vhost example here for host kernel
> > side processing with vhost.
> > 
> > Using Linux IB SRP initiator + Windows Server 2008 R2 SCSI-miniport SRP
> > (OFED) Initiator connected to four ib_srpt LUNs, we've observed that
> > MSFT SCSI is currently outperforming RHEL 6.2 on the order of ~285K vs.
> > ~215K with heavy random 4k WRITE iometer / fio tests.  Note this is with
> > an optimized queue_depth ib_srp client w/ noop I/O scheduling, but it is
> > still lacking the host_lock-less patches on RHEL 6.2 OFED..
> > 
> > This bottleneck has been mentioned by various people (including myself)
> > on linux-scsi over the last 18 months, and I've proposed that it be
> > discussed at KS-2012 so we can start making some forward progress:
> 
> Well, no, it hasn't.  You randomly drop things like this into unrelated
> email (I suppose that is a mention in strict English construction) but
> it's not really enough to get anyone to pay attention since they mostly
> stopped reading at the top, if they got that far: most people just go by
> subject when wading through threads initially.
> 

It most certainly has been made clear to me, numerous times and by many
people in the Linux/SCSI community, that there is a bottleneck for small
block random I/O in SCSI core vs. raw Linux/Block, as well as vs.
non-Linux-based SCSI subsystems.

My apologies if mentioning this issue to you privately at LC 2011 last
year did not strike a sufficiently serious tone, or if proposing a topic
for LSF-2012 this year was not a clear enough indication of a problem
with SCSI small block random I/O performance.

> But even if anyone noticed, a statement that RHEL6.2 (on a 2.6.32
> kernel, which is now nearly three years old) is 25% slower than W2k8R2
> on infiniband isn't really going to get anyone excited either
> (particularly when you mention OFED, which usually means a stack
> replacement on Linux anyway).
> 

The specific issue was first raised for .38, where we were able to get
most of the interesting high-performance LLDs converted to using
internal locking methods so that host_lock did not have to be obtained
during each ->queuecommand() I/O dispatch, right..?

This has helped a good deal for large multi-LUN scsi_host configs that
are now running in host_lock-less mode, but there is still a large
discrepancy between single LUN access and raw struct block_device
access, even with LLD host_lock-less mode enabled.

Now I think the virtio-blk client performance is demonstrating this
issue pretty vividly, along with this week's tcm_vhost IBLOCK raw block
flash benchmarks, which point to some other yet-to-be-determined
limitations of virtio-scsi-raw vs. tcm_vhost for this particular fio
randrw workload.

> What people might pay attention to is evidence that there's a problem in
> 3.5-rc6 (without any OFED crap).  If you're not going to bother
> investigating, it has to be in an environment they can reproduce (so
> ordinary hardware, not infiniband) otherwise it gets ignored as an
> esoteric hardware issue.
> 

It's really quite simple for anyone to demonstrate the bottleneck
locally on any machine using tcm_loop with raw block flash: take a
struct block_device backend (like a FusionIO /dev/fio*), wrap it in an
IBLOCK backstore, and export locally accessible SCSI LUNs via tcm_loop.
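
For reference, here is a rough sketch of the configfs plumbing involved,
assuming the standard LIO layout.  lio-utils / targetcli normally drive
these steps for you, and the attribute names, WWNs and device paths
below are only illustrative:

  modprobe target_core_mod
  modprobe tcm_loop

  # 1) Register the raw flash block device as an IBLOCK backstore.
  core=/sys/kernel/config/target/core/iblock_0/fio_test
  mkdir -p $core
  echo "udev_path=/dev/fioa" > $core/control
  echo 1 > $core/enable

  # 2) Create a tcm_loop virtual SCSI target port, a local I_T nexus,
  #    and attach the backstore as LUN 0.
  tpg=/sys/kernel/config/target/loopback/naa.6001405000000001/tpgt_1
  mkdir -p $tpg/lun/lun_0
  echo naa.6001405000000002 > $tpg/nexus
  ln -s $core $tpg/lun/lun_0/virtual_scsi_port

A new SCSI disk (e.g. /dev/sdX) should then show up through the
loopback fabric, backed by the same flash device that can also be
accessed directly.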

Using fio, there is a significant drop in randrw 4k performance between
tcm_loop <-> IBLOCK and the raw struct block_device backend.  And no,
it's not some type of target IBLOCK or tcm_loop bottleneck; it's a
per-SCSI-LUN limitation for small block random I/Os, on the order of
~75K IOPS for each SCSI LUN.
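
A minimal fio sketch of that comparison, with placeholder device names
(/dev/fioa for the raw flash device, /dev/sdX for the tcm_loop SCSI LUN
layered on top of it) and only illustrative job parameters:

  # Run the same 4k random read/write workload against the raw block
  # device and against the SCSI LUN exported over tcm_loop.
  for dev in /dev/fioa /dev/sdX; do
      fio --name=randrw-4k --filename=$dev \
          --rw=randrw --bs=4k --direct=1 \
          --ioengine=libaio --iodepth=32 --numjobs=4 \
          --runtime=60 --time_based --group_reporting
  done

Comparing the IOPS reported by the two runs shows the per-LUN gap
directly, since both runs end up hitting the same physical flash.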

If anyone has actually gone faster than this with any single SCSI LUN
on any storage fabric, I would be interested in hearing about your
setup.

Thanks,

--nab
