On Fri, 2010-01-08 at 14:40 -0700, Chris Worley wrote:
> On Wed, Jan 6, 2010 at 6:57 PM, David Dillow <d...@thedillows.org> wrote:
> > On Wed, 2010-01-06 at 17:16 -0700, Chris Worley wrote:
> >> 1) I'm seeing small block random writes (32KB and smaller) get better
> >> performance over SRP than they do as a local drive.  I'm guessing this
> >> is async behavior: once the written data is on the wire, it's deemed
> >> complete, and setting a sync flag would disable this.  Is this
> >> correct?

> >> If not, any ideas why SRP random writes would be faster than
> >> the same writes locally?
> >
> > I would guess deeper queue depths and more cache available on the
> > target, especially if you are using a Linux-based SRP target.
> 
> I do set the ib_srp initiator "srp_sg_tablesize" to its maximum of 58.

The max is 255, which will guarantee you can send up to a 1020 KB I/O
without breaking it into two SCSI commands. In practice, you're likely
to be able to send larger requests, as you will often have some
contiguous runs in the data pages.

This is probably not hurting you at smaller request sizes.
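
For reference, bumping that is just a module option on the initiator; the
file name under /etc/modprobe.d/ is only an example, so adjust it to
whatever your distro expects:

  # /etc/modprobe.d/ib_srp.conf -- example only; reload ib_srp to pick it up
  options ib_srp srp_sg_tablesize=255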

> >> 2) I'm seeing very poor sequential vs. random I/O performance (both
> >> read and write) at small block sizes (random performs well, sequential
> >> performance is poor).  I'm using direct I/O and the noop scheduler on
> >> the initiator, so there should be no coalescing.  Coalescing on these
> >> drives is not a good thing to do, as they are ultra low latency, and
> >> much faster if the OS doesn't try to coalesce.  Could anything in the
> >> IB/SRP/SCST stack be trying to coalesce sequential data?
> >
> > Yes, if you have more requests outstanding than available queue depth --
> > ie queue backpressure/congestion -- even noop will merge sequential
> > requests in the queue. You could avoid this by setting max_sectors_kb to
> > the maximum IO size you wish the drive to see.
> 
> I thought if the device was opened with the O_DIRECT flag, then the
> scheduler should have nothing to coalesce.

Depends on how many I/Os your application has in flight at once,
assuming it is using AIO or threads. If you have more requests in flight
than can be queued, the block layer will coalesce if possible.
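
If you want to take merging out of the picture, something like this should
do it -- sdX is a placeholder for the SRP disk, the 32 KB cap just matches
your largest test size, and nomerges is an extra knob to try if your kernel
has it:

  echo 32 > /sys/block/sdX/queue/max_sectors_kb   # cap requests at 32 KB
  echo 1  > /sys/block/sdX/queue/nomerges         # ask the queue not to attempt merges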

> > Though, I'd be surprised if there was no benefit at all to the OS
> > coalescing under congestion.
> 
> For sequential I/O benchmarking, I need to see the real results for
> that size packet.  Direct I/O works for me everywhere except SRP.

Hmm, that seems a bit odd, but there is nothing in the SRP initiator
that would cause the behavior you are seeing -- it just hands over the
requests that the SCSI and block layers give it. Are you observing this via
diskstats on the initiator side or on the target side of the SRP connection?
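
Since you're already parsing diskstats, the merge counters and average
request sizes there will tell you directly whether anything upstream is
gluing requests together. Something along these lines -- sdX is a
placeholder, field layout per Documentation/iostats.txt, and it assumes
the disk has seen both reads and writes:

  awk '$3=="sdX" { printf "avg rd KB %.1f  avg wr KB %.1f  merges %d/%d\n", $6/$4/2, $10/$8/2, $5, $9 }' /proc/diskstats

Run the same thing against the backing device on the target and compare.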

You could also try using sgp_dd from lustre-iokit, though I've seen some
oddities from it -- it couldn't drive the hardware I was testing at full
speed, whereas XDD and some custom tools I wrote could.
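
Something like this is the sort of run I'd do with it -- /dev/sdX is a
placeholder for the SRP disk, bpt=64 gives 32 KB per I/O at 512-byte
blocks, and thr=4 keeps a few commands in flight:

  sgp_dd if=/dev/sdX of=/dev/null bs=512 bpt=64 thr=4 count=1048576 time=1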

You may have mentioned this, but are you using the raw device, or a
filesystem over top of it?

Also, I've seen some interesting things, like device mapper reporting a
4 KB read as eight 512-byte sectors even though it was handed to DM as a
4 KB request, so there could be gremlins there as well. I don't know how
the MD device driver reports this.

What does the output of 'cd /sys/block/sda/queue && head *' look like,
with sda replaced by the SRP disk? It would also be interesting to see
that for iSCSI, and the contents of /sys/class/scsi_disk/0:0:0:0/device
for both connection types, to see if there is a difference.
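
If the full 'head *' dump is too noisy, these are the files I'd look at
first (sdX again being a placeholder for the SRP disk):

  cd /sys/block/sdX/queue
  head max_sectors_kb max_hw_sectors_kb nr_requests scheduler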

> > Have you tried using the function tracer or perf tools found in recent
> > kernels to follow the data path and find the hotspots?
> 
> I have not.  I parse the data from diskstats.  A pointer to these
> tools would be appreciated.

You can find information on them in the kernel source, under
Documentation/trace/ftrace.txt and tools/perf/Documentation.

You can also try blktrace.
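
Roughly, and only as a starting point -- sdX is a placeholder, and the
blkparse output will show merges as M (back) or F (front) actions:

  perf record -a -g sleep 10 && perf report     # system-wide samples with call graphs
  blktrace -d /dev/sdX -o - | blkparse -i -     # per-request trace of the SRP disk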

Dave


