Re: [Ksummit-2012-discuss] SCSI Performance regression [was Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6]
On Fri, 6 Jul 2012, James Bottomley wrote:
> What people might pay attention to is evidence that there's a problem in 3.5-rc6 (without any OFED crap). If you're not going to bother investigating, it has to be in an environment they can reproduce (so ordinary hardware, not Infiniband), otherwise it gets ignored as an esoteric hardware issue.

The OFED stuff in the meantime is part of 3.5-rc6. Infiniband has been supported for a long time, and it's a very important technology given the problematic nature of Ethernet at high network speeds. OFED crap exists for those running RHEL5/6. The new enterprise distros are based on the 3.2 kernel, which has pretty good Infiniband support out of the box.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: SCSI Performance regression [was Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6]
On Fri, 2012-07-06 at 09:43 +0400, James Bottomley wrote:
> On Thu, 2012-07-05 at 20:01 -0700, Nicholas A. Bellinger wrote:
>> So I'm pretty sure this discrepancy is attributable to the small block random I/O bottleneck currently present for all Linux/SCSI core LLDs, regardless of physical or virtual storage fabric.
>>
>> The SCSI-wide host-lock-less conversion that happened in .38 code back in 2010, and subsequently having LLDs like virtio-scsi convert to run in host-lock-less mode, has helped to some extent. But it's still not enough.
>>
>> Another example where we've been able to prove this bottleneck recently is with the following target setup:
>>
>> *) Intel Romley production machines with 128 GB of DDR-3 memory
>> *) 4x FusionIO ioDrive 2 (1.5 TB @ PCI-e Gen2 x2)
>> *) Mellanox PCI-express Gen3 HCA running at 56 Gb/sec
>> *) Infiniband SRP target backported to RHEL 6.2 + latest OFED
>>
>> In this setup, using ib_srpt + IBLOCK w/ emulate_write_cache=1 + iomemory_vsl export, we end up avoiding the SCSI core bottleneck on the target machine, just as with the tcm_vhost example here for host kernel side processing with vhost.
>>
>> Using the Linux IB SRP initiator + Windows Server 2008 R2 SCSI-miniport SRP (OFED) initiator connected to four ib_srpt LUNs, we've observed that MSFT SCSI is currently outperforming RHEL 6.2 on the order of ~285K vs. ~215K IOPS with heavy random 4k WRITE iometer / fio tests. Note this is with an optimized queue_depth ib_srp client w/ noop I/O scheduling, but still lacking the host_lock-less patches on RHEL 6.2 OFED.
>>
>> This bottleneck has been mentioned by various people (including myself) on linux-scsi over the last 18 months, and I've proposed that it be discussed at KS-2012 so we can start making some forward progress:
>
> Well, no, it hasn't.
> You randomly drop things like this into unrelated email (I suppose that is a mention in strict English construction) but it's not really enough to get anyone to pay attention, since they mostly stopped reading at the top, if they got that far: most people just go by subject when wading through threads initially.

It most certainly has been made clear to me, numerous times from many people in the Linux/SCSI community, that there is a bottleneck for small block random I/O in SCSI core vs. raw Linux/Block, as well as vs. non-Linux-based SCSI subsystems.

My apologies if mentioning this issue last year at LC 2011 to you privately did not take a serious enough tone, or if proposing a topic for LSF-2012 this year was not a clear enough indication of a problem with SCSI small block random I/O performance.

> But even if anyone noticed, a statement that RHEL 6.2 (on a 2.6.32 kernel, which is now nearly three years old) is 25% slower than W2k8R2 on Infiniband isn't really going to get anyone excited either (particularly when you mention OFED, which usually means a stack replacement on Linux anyway).

The specific issue was first raised for .38, where we were able to get most of the interesting high performance LLDs converted to using internal locking methods so that host_lock did not have to be obtained during each ->queuecommand() I/O dispatch, right..?

This has helped a good deal for large multi-LUN scsi_host configs that are now running in host-lock-less mode, but there is still a large discrepancy between single LUN and raw struct block_device access, even with LLD host-lock-less mode enabled.

Now I think the virtio-blk client performance is demonstrating this issue pretty vividly, along with this week's tcm_vhost IBLOCK raw block flash benchmarks that demonstrate some other yet-to-be-determined limitations for virtio-scsi-raw vs. tcm_vhost for this particular fio randrw workload.
> What people might pay attention to is evidence that there's a problem in 3.5-rc6 (without any OFED crap). If you're not going to bother investigating, it has to be in an environment they can reproduce (so ordinary hardware, not Infiniband), otherwise it gets ignored as an esoteric hardware issue.

It's really quite simple for anyone to demonstrate the bottleneck locally on any machine using tcm_loop with raw block flash. Take a struct block_device backend (like a Fusion IO /dev/fio*), use IBLOCK, and export locally accessible SCSI LUNs via tcm_loop.

Using fio, there is a significant drop in randrw 4k performance between tcm_loop + IBLOCK vs. raw struct block device backends. And no, it's not some type of target IBLOCK or tcm_loop bottleneck; it's a per-SCSI-LUN limitation for small block random I/Os on the order of ~75K IOPS for each SCSI LUN.

If anyone has actually gone faster than this with any single SCSI LUN on any storage fabric, I would be interested in hearing about your setup.

Thanks,

--nab
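The local repro described above can be sketched as a short command sequence. This is a hedged sketch, not a tested recipe: it needs root plus the target_core_mod/tcm_loop modules, the device names (/dev/fioa, /dev/sdd) and the naa WWN placeholder are illustrative, and exact targetcli syntax varies by targetcli/rtslib version. The fio parameters mirror the 4k randrw workload described in the thread.

```
# Export a raw flash block device back to the local host via tcm_loop
# (paths and WWN are examples, not from the original mail).
targetcli /backstores/iblock create name=fio_lun dev=/dev/fioa
targetcli /loopback create                       # allocates a naa.* WWN
targetcli /loopback/naa.<wwn>/luns create /backstores/iblock/fio_lun

# Baseline: 4k random read/write against the raw block device.
fio --name=raw --filename=/dev/fioa --rw=randrw --bs=4k \
    --ioengine=libaio --iodepth=32 --direct=1 --runtime=60 --time_based

# Same workload against the tcm_loop-exported SCSI LUN (appears as a
# new /dev/sdX); compare the reported IOPS with the baseline run.
fio --name=loop --filename=/dev/sdd --rw=randrw --bs=4k \
    --ioengine=libaio --iodepth=32 --direct=1 --runtime=60 --time_based
```

Per the thread, the second run should top out around ~75K IOPS per LUN while the raw backend goes substantially higher.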
Re: SCSI Performance regression [was Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6]
On Fri, 2012-07-06 at 17:49 +0400, James Bottomley wrote:
> On Fri, 2012-07-06 at 02:13 -0700, Nicholas A. Bellinger wrote:
>> On Fri, 2012-07-06 at 09:43 +0400, James Bottomley wrote:
>>> On Thu, 2012-07-05 at 20:01 -0700, Nicholas A. Bellinger wrote:
>>>>
>>>> <SNIP>
>>>>
>>>> This bottleneck has been mentioned by various people (including myself) on linux-scsi over the last 18 months, and I've proposed that it be discussed at KS-2012 so we can start making some forward progress:
>>>
>>> Well, no, it hasn't. You randomly drop things like this into unrelated email (I suppose that is a mention in strict English construction) but it's not really enough to get anyone to pay attention, since they mostly stopped reading at the top, if they got that far: most people just go by subject when wading through threads initially.
>>
>> <SNIP>
>>
>> It's really quite simple for anyone to demonstrate the bottleneck locally on any machine using tcm_loop with raw block flash. Take a struct block_device backend (like a Fusion IO /dev/fio*), use IBLOCK, and export locally accessible SCSI LUNs via tcm_loop.
>>
>> Using fio, there is a significant drop in randrw 4k performance between tcm_loop + IBLOCK vs. raw struct block device backends. And no, it's not some type of target IBLOCK or tcm_loop bottleneck; it's a per-SCSI-LUN limitation for small block random I/Os on the order of ~75K for each SCSI LUN.
>
> Here, you're saying that the end-to-end SCSI stack tops out at around 75K IOPS, which is reasonably respectable if you don't employ any mitigation like queue steering and interrupt polling ... what were the mitigation techniques in the test you employed, by the way?

~75K per SCSI LUN in a multi-LUN per host setup is being optimistic, btw. On the other side of the coin, the same pure block device can easily go ~200K per backend.

For the simplest case with tcm_loop, a struct scsi_cmnd is queued via cmwq to execute in process context and submit the backend I/O. Once completed from IBLOCK, the I/O is run through a target completion wq and completed back to SCSI. There is no fancy queue steering or interrupt polling going on (at least not in tcm_loop) because it's a simple virtual SCSI LLD similar to scsi_debug.

> But previously, you ascribed a performance drop of around 75% on virtio-scsi (topping out around 15-20K IOPS) to this same problem ... that doesn't really seem likely.

No. I ascribed the performance difference between virtio-scsi+tcm_vhost vs. bare-metal raw block flash to this bottleneck in Linux/SCSI. It's obvious that virtio-scsi-raw going through QEMU SCSI / block is having some other shortcomings.

> Here's the rough ranges of concern:
>
> 10K IOPS: standard arrays
> 100K IOPS: modern expensive fast flash drives on 6Gb links
> 1M IOPS: PCIe NVM Express-like devices
>
> SCSI should do arrays with no problem at all, so I'd be really concerned that it can't make 15-20K IOPS.

If you push the system and fine tune it, SCSI can just about
Re: [Ksummit-2012-discuss] SCSI Performance regression [was Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6]
On Fri, 2012-07-06 at 15:30 -0500, Christoph Lameter wrote:
> On Fri, 6 Jul 2012, James Bottomley wrote:
>> What people might pay attention to is evidence that there's a problem in 3.5-rc6 (without any OFED crap). If you're not going to bother investigating, it has to be in an environment they can reproduce (so ordinary hardware, not Infiniband), otherwise it gets ignored as an esoteric hardware issue.
>
> The OFED stuff in the meantime is part of 3.5-rc6. Infiniband has been supported for a long time, and it's a very important technology given the problematic nature of Ethernet at high network speeds. OFED crap exists for those running RHEL5/6. The new enterprise distros are based on the 3.2 kernel, which has pretty good Infiniband support out of the box.

So I don't think the HCAs or Infiniband fabric were the limiting factor for small block random I/O in the RHEL 6.2 w/ OFED vs. Windows Server 2008 R2 w/ OFED setup mentioned earlier. I've seen both FC and iSCSI fabrics demonstrate the same type of random small block I/O performance anomalies with Linux/SCSI clients too.

The v3.x Linux/SCSI clients are certainly better in the multi-LUN per host small block random I/O case, but single LUN performance is (still) lacking compared to everything else. Also, RHEL 6.2 does have the SCSI host-lock-less bits in place now, but it's been more a matter of converting OFED ib_srp code to run in host-lock-less mode to realize extra gains for multi-LUN per host.
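As a back-of-the-envelope sanity check on the figures quoted in this thread (my own arithmetic, not from the original mails): if each SCSI LUN tops out around ~75K random 4k IOPS, four ib_srpt LUNs would ideally aggregate to ~300K, which is in the same ballpark as the ~285K measured with the Windows SRP initiator, while the ~200K a raw block backend reaches shows what each single LUN leaves on the table.

```python
# Sanity-check the per-LUN bottleneck numbers quoted in the thread.
PER_LUN_IOPS_CAP = 75_000   # observed per-SCSI-LUN ceiling (~75K IOPS)
RAW_BLOCK_IOPS = 200_000    # same backend via raw struct block_device

def aggregate_iops(num_luns: int, per_lun_cap: int = PER_LUN_IOPS_CAP) -> int:
    """Ideal aggregate if each LUN independently hits its per-LUN cap."""
    return num_luns * per_lun_cap

if __name__ == "__main__":
    # Four LUNs under a ~75K cap: ~300K ideal, vs. ~285K measured.
    print(aggregate_iops(4))                    # 300000
    # Headroom lost per LUN relative to the raw block path.
    print(RAW_BLOCK_IOPS - PER_LUN_IOPS_CAP)    # 125000
```

The closeness of the ideal 300K to the measured ~285K is consistent with the claim that the limit is per-LUN in SCSI core rather than in the fabric or the backend.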