On Mon, Aug 28, 2017 at 06:31:33PM +0000, Bart Van Assche wrote:
> On Mon, 2017-08-28 at 10:10 +0200, Christoph Hellwig wrote:
> > I still disagree that we should have an indirect function call like this
> > in the fast path.
> > 
> > All that can be done by clearing or setting a flag on the first call to
> > ->queue_rq or ->queuecommand instead.  In NVMe we use RQF_DONTPREP for
> > that, but for SCSI we probably can't use that given that it has more
> > meaning for the old request path.  But how about just adding a new
> > RQD_DRV_INITIALIZED or similar flag?
> 
> Hello Christoph,
> 
> Sorry, but I'm not enthusiastic about the proposal to introduce an
> RQD_DRV_INITIALIZED or similar flag. That approach introduces an annoying
> behavior difference, namely that .initialize_rq_fn() would be called from
> inside blk_get_request() for pass-through requests but from inside the prep
> function for filesystem requests. Another disadvantage of that approach is
> that the block layer core never clears request.atomic_flags completely but
> only sets and clears individual flags. The SCSI core would have to follow
> that model, and hence code for clearing RQD_DRV_INITIALIZED would have to be
> added to all request completion paths in the SCSI core.
> 
> Have you noticed that Ming Lei's patch series introduces several new atomic
> operations in the hot path? I'm referring here to the BLK_MQ_S_DISPATCH_BUSY
> manipulations. Have you noticed that for SCSI drivers these patches introduce
> an overhead between 0.1 and 1.0 microseconds per I/O request in the hot path?
> I have derived these numbers from the random write SRP performance numbers
> as follows:
> 
>     1/142460 - 1/142990 = (142990 - 142460) / (142460 * 142990)
>                         ≈ 26 nanoseconds = 0.026 microseconds per request
> 
> That number has to be multiplied by the number of I/O requests processed in
> parallel. The number of jobs in Ming Lei's test was 64, but that's probably
> way higher than the actual I/O parallelism.
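
For concreteness, the flag-based scheme Christoph suggests above might look
roughly like this in the SCSI ->queue_rq path (a minimal sketch only:
RQD_DRV_INITIALIZED is just the proposed name, and scsi_initialize_rq() here
stands in for whatever one-time per-request setup the driver needs):

	static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
					  const struct blk_mq_queue_data *bd)
	{
		struct request *req = bd->rq;

		/*
		 * Initialize the request lazily on its first dispatch
		 * instead of via an indirect ->initialize_rq_fn call
		 * in blk_get_request().
		 */
		if (!(req->rq_flags & RQD_DRV_INITIALIZED)) {
			scsi_initialize_rq(req);
			req->rq_flags |= RQD_DRV_INITIALIZED;
		}

		/* ... normal dispatch path continues here ... */
		return BLK_STS_OK;
	}

	/*
	 * Per the objection above, every completion path would then also
	 * have to clear the flag before the request can be reused:
	 *
	 *	req->rq_flags &= ~RQD_DRV_INITIALIZED;
	 */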

Hi Bart,

Did you see a performance regression on SRP with fewer jobs after applying
my patchset V3?

I just ran the test with 16 jobs (the system has 16 CPU cores) instead of
64, and I do not see a performance regression on SRP when comparing
v4.13-rc6+blk-next (2nd column) against v4.13-rc6+blk-next+patch V3 (3rd
column):


---------------------------------------
 IOPS(K)  |  blk-next   |  +patch V3
---------------------------------------
read      |      475.83 |      485.88
---------------------------------------
randread  |      142.86 |      141.96
---------------------------------------
write     |       483.9 |      492.39
---------------------------------------
randwrite |      124.83 |      124.53
---------------------------------------



---------------------------------------
 CPU(%)   |  blk-next   |  +patch V3
---------------------------------------
read      |       15.43 |       15.81
---------------------------------------
randread  |        9.87 |        9.75
---------------------------------------
write     |       17.67 |       18.34
---------------------------------------
randwrite |       10.96 |       10.56
---------------------------------------



---------------------------------------
 LAT(us)  |  blk-next   |  +patch V3
---------------------------------------
read      |        2.15 |        2.11
---------------------------------------
randread  |        7.17 |        7.21
---------------------------------------
write     |        2.11 |        2.08
---------------------------------------
randwrite |         8.2 |        8.22
---------------------------------------



---------------------------------------
MERGE(K)  |  blk-next   |  +patch V3
---------------------------------------
read      |     8798.59 |     9064.09
---------------------------------------
randread  |        0.02 |        0.19
---------------------------------------
write     |     8847.73 |     9102.18
---------------------------------------
randwrite |        0.03 |        0.13
---------------------------------------
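
For reference, results like the above can be collected with fio invocations
along these lines (the device path, block size, queue depth and runtime here
are assumptions rather than the exact job file used; only numjobs=16 and the
rw patterns come from the test described above):

	fio --name=srp-test --filename=/dev/sdX --direct=1 --ioengine=libaio \
	    --iodepth=64 --bs=4k --numjobs=16 --rw=randwrite \
	    --runtime=60 --time_based --group_reporting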



-- 
Ming
