On Fri, Jan 8, 2010 at 3:17 PM, David Dillow <d...@thedillows.org> wrote:
> On Fri, 2010-01-08 at 14:40 -0700, Chris Worley wrote:
>> On Wed, Jan 6, 2010 at 6:57 PM, David Dillow <d...@thedillows.org> wrote:
>> > On Wed, 2010-01-06 at 17:16 -0700, Chris Worley wrote:
>> >> 1) I'm seeing small block random writes (32KB and smaller) get better
>> >> performance over SRP than they do as a local drive. I'm guessing this
>> >> is async behavior: once the written data is on the wire, it's deemed
>> >> complete, and setting a sync flag would disable this. Is this
>> >> correct? If not, any ideas why SRP random writes would be faster than
>> >> the same writes locally?
>> >
>> > I would guess deeper queue depths and more cache available on the
>> > target, especially if you are using a Linux-based SRP target.
>>
>> I do set the ib_srp initiator "srp_sg_tablesize" to its maximum of 58.
>
> The max is 255, which will guarantee you can send up to a 1020 KB I/O
> without breaking it into two SCSI commands. In practice, you're likely
> to be able to send larger requests, as you will often have some
> contiguous runs in the data pages.
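For reference, this is roughly how I'm trying to push it higher on the
initiator (just a sketch; the file name under /etc/modprobe.d/ is
arbitrary, and reloading ib_srp drops any active SRP sessions):

  # request the larger S/G table at module load, then check what stuck
  echo "options ib_srp srp_sg_tablesize=255" > /etc/modprobe.d/ib_srp.conf
  modprobe -r ib_srp && modprobe ib_srp
  cat /sys/module/ib_srp/parameters/srp_sg_tablesize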
I've tried a larger max... 58 is all I can get. Maybe getting more is
dependent on some other setting.

> This is probably not hurting you at smaller request sizes.
>
>> >> 2) I'm seeing very poor sequential vs. random I/O performance (both
>> >> read and write) at small block sizes (random performs well,
>> >> sequential performance is poor). I'm using direct I/O and the noop
>> >> scheduler on the initiator, so there should be no coalescing.
>> >> Coalescing on these drives is not a good thing to do, as they are
>> >> ultra low latency, and much faster if the OS doesn't try to
>> >> coalesce. Could anything in the IB/SRP/SCST stack be trying to
>> >> coalesce sequential data?
>> >
>> > Yes, if you have more requests outstanding than available queue
>> > depth -- ie queue backpressure/congestion -- even noop will merge
>> > sequential requests in the queue. You could avoid this by setting
>> > max_sectors_kb to the maximum IO size you wish the drive to see.
>>
>> I thought if the device was opened with the O_DIRECT flag, then the
>> scheduler should have nothing to coalesce.
>
> Depends on how many I/Os your application has in flight at once,
> assuming it is using AIO or threads. If you have more requests in
> flight than can be queued, the block layer will coalesce if possible.

I do use AIO, always 64 threads, each with 64 outstanding I/Os. Local or
iSER initiator based, I never see any coalescing. Only with SRP.

>> > Though, I'd be surprised if there was no benefit at all to the OS
>> > coalescing under congestion.

Benefit isn't the issue. It needs to be benchmarked without artificial
aids that cloud the results. I'm not really fond of sequential I/O, as
it seldom exists in real applications (except for logging apps), but if
I'm going to test it, I need valid numbers. I could do like the SAN/FC
vendors do, and just take the throughput for 1MB blocks and divide the
TPS by 2M and call that the 512 byte block IOPS ;)

>> For sequential I/O benchmarking, I need to see the real results for
>> that size packet. Direct I/O works for me everywhere except SRP.
>
> Hmm, that seems a bit odd, but there is nothing in the SRP initiator
> that would cause the behavior you are seeing -- it just hands over the
> requests the SCSI and block layers give it. Are you observing this via
> diskstats at the initiator or the target side of the SRP connection?

Diskstats on the initiator side. There is the scst_vdisk "Direct I/O"
option that's been commented out of the code, as it's not supposed to
work... maybe direct I/O doesn't work... but that would be the target
side.
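Concretely, this is the sort of thing I'm reading on the initiator (a
sketch; sdb stands in for the SRP disk, and the 32 KB clamp at the end is
just your earlier max_sectors_kb suggestion applied to my benchmark block
size):

  # fields 2 and 6 of /sys/block/<dev>/stat are read merges and write
  # merges; growing values mean the block layer is coalescing
  awk '{ print "read_merges:", $2, "write_merges:", $6 }' /sys/block/sdb/stat

  # cap the largest request the queue will build (example: 32 KB)
  echo 32 > /sys/block/sdb/queue/max_sectors_kb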
> You could also try using sgp_dd from lustre-iokit, but I've seen some
> oddities from it -- it couldn't drive the hardware I was testing at
> full speed, where XDD and some custom tools I wrote did.
>
> You may have mentioned this, but are you using the raw device, or a
> filesystem over top of it?

It depends: this #2 issue, sequential vs. random, is atop the raw block
device. The third issue was atop MD. As some of this thread has been
snipped, I'm not completely sure which issue we're discussing.

> Also, I've seen some interesting things like device mapper reporting a
> 4 KB read as 8 512-byte sectors, even though it was handed to DM as a
> 4 KB request, so there could be gremlins there as well. I don't know
> how the MD device driver reports this.
>
> What does the output of 'cd /sys/block/sda/queue && head *' look like,
> where sda should be replaced with the SRP disk. It would also be
> interesting to see that for iSCSI, and in
> /sys/class/scsi_disk/0:0:0:0/device for both connection types to see
> if there is a difference.

Initiator or target? The target side isn't a SCSI device, it's a block
device. I guess I could use scst_local to make it look SCSI-ish.

>> > Have you tried using the function tracer or perf tools found in
>> > recent kernels to follow the data path and find the hotspots?
>>
>> I have not. I parse the data from diskstats. A pointer to these tools
>> would be appreciated.
>
> You can find information on them in the kernel source, under
> Documentation/trace/ftrace.txt and tools/perf/Documentation
>
> You can also try blktrace.

Thanks,

Chris

> Dave
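P.S. For the archive, the blktrace run I plan to start with looks roughly
like this (a sketch; sdb again stands in for the SRP disk on the
initiator). Merge activity should show up as M (back merge) and F (front
merge) events in the blkparse output:

  # trace the SRP disk on the initiator and decode the events on the fly
  blktrace -d /dev/sdb -o - | blkparse -i -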